Traduki is an open source machine translation program, developed with the Lua programming language and released under the GNU General Public License. It is a tool being developed to give free speech and translation to everyone. Traduki means "to translate" in Esperanto.
Development was suspended in mid-2002 but restarted in 2003.
Machine translation is a complex task. The following are preliminary ideas.
Input is the reading of the original English text. It can come from a simple console, GUI, or web interface, or from more complex sources such as OCR, handwriting recognition, or speech recognition.
Tokenization is the division of the text into sentences, and of sentences into words and punctuation. The text can be divided into sentences using "!", "?" and "." as separators. However, "." is also used in numbers (e.g. 10.233), abbreviations (e.g. Dr.) and initials (e.g. A. C. Doyle), so not every period ends a sentence. The punctuation marks ",", ";", ":", quotation marks (" " and » «), "()" and "[]" can also be used to separate semi-independent sentences.
The article "What is a word, What is a sentence? Problems of Tokenization" is a good discussion of tokenization problems.
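The sentence and word splitting described above can be sketched in Python. This is only a minimal illustration, not Traduki's implementation: the function names and the tiny abbreviation list are assumptions made for the example, and a real tokenizer would need a much larger list and more rules.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc."}


def split_sentences(text):
    """Split text on '.', '!' and '?', skipping periods that do not end a sentence."""
    sentences = []
    current = []
    for chunk in text.split():
        current.append(chunk)
        if not chunk.endswith((".", "!", "?")):
            continue  # covers numbers such as "10.233", which do not end in "."
        if chunk.lower() in ABBREVIATIONS:
            continue  # "Dr." does not end a sentence
        if re.fullmatch(r"[A-Z]\.", chunk):
            continue  # a single-letter initial such as "A." or "C."
        sentences.append(" ".join(current))
        current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def split_words(sentence):
    """Split a sentence into words, numbers (keeping '10.233' whole) and punctuation."""
    return re.findall(r"\d+(?:\.\d+)*|\w+(?:'\w+)?|[^\w\s]", sentence)


if __name__ == "__main__":
    text = "Dr. Watson met A. C. Doyle. The bill was 10.233 dollars! Really?"
    for sentence in split_sentences(text):
        print(split_words(sentence))
```

The initial rule is a heuristic: a sentence that really ends in a single capital letter followed by a period (for example "It was I.") would be merged with the next one, which is the kind of problem the article above discusses.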
Each word must be analyzed to identify derived words. Dictionaries used in machine translation generally list only root words, not the words derived from them, so the program itself must identify derived words. Verb forms and plurals are the most common derived words.
The Natural Language Toolkit project[1] has some Python code that could be reused in Traduki. However, the Natural Language Toolkit is released under the IBM Common Public License 0.5. Can we use the code?
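One simple way to identify derived words is to strip common suffixes and check whether the result is a known root. The sketch below assumes this suffix-stripping approach, which the text does not prescribe; the root dictionary, the rule table and the feature names are all hypothetical and cover only a handful of regular English forms.

```python
# Hypothetical root dictionary: only root words are stored, never derived forms.
ROOTS = {
    "walk": "verb",
    "translate": "verb",
    "try": "verb",
    "cat": "noun",
    "box": "noun",
}

# (suffix, replacement, kind) rules, tried in order. An illustrative subset only.
SUFFIX_RULES = [
    ("ies", "y", "s"), ("es", "", "s"), ("s", "", "s"),
    ("ied", "y", "past"), ("ed", "", "past"), ("d", "", "past"),
    ("ing", "", "progressive"), ("ing", "e", "progressive"),
]

# Which feature a suffix kind expresses for a given syntactic class.
FEATURES = {
    ("noun", "s"): "plural",
    ("verb", "s"): "third-person singular",
    ("verb", "past"): "past",
    ("verb", "progressive"): "progressive",
}


def analyze(word):
    """Return (root, syntactic class, feature), or None if no root is found."""
    word = word.lower()
    if word in ROOTS:
        return (word, ROOTS[word], "base")
    for suffix, replacement, kind in SUFFIX_RULES:
        if not word.endswith(suffix):
            continue
        candidate = word[: -len(suffix)] + replacement
        feature = FEATURES.get((ROOTS.get(candidate), kind))
        if feature:
            return (candidate, ROOTS[candidate], feature)
    return None


if __name__ == "__main__":
    for form in ["cats", "tries", "translated", "translating", "walk"]:
        print(form, "->", analyze(form))
```

Irregular forms such as "ate" or "mice" cannot be found this way and would need an explicit exception list.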
Syntactical analysis is the determination of the syntactic function of the words. The program should discover if a word is a "verb" or a "noun". A dictionary with the syntactic classification of all root words must be used. WordNet[1] is a good source of data to build a good English dictionary.
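A minimal sketch of such a dictionary lookup follows. The entries are invented for illustration and are not taken from WordNet; each root maps to a set of syntactic classes, which leaves room for roots that belong to more than one class.

```python
# Hypothetical root-word dictionary: each root maps to its possible syntactic classes.
SYNTACTIC_CLASSES = {
    "the": {"article"},
    "boy": {"noun"},
    "eat": {"verb"},
    "hamburger": {"noun"},
    "fat": {"adjective", "noun"},
}


def classify(roots):
    """Return (root, classes) pairs; roots missing from the dictionary get an empty set."""
    return [(root, SYNTACTIC_CLASSES.get(root, set())) for root in roots]


def ambiguous(tagged):
    """List the roots that still have more than one candidate class."""
    return [root for root, classes in tagged if len(classes) > 1]


if __name__ == "__main__":
    tagged = classify(["the", "fat", "boy", "eat", "hamburger"])
    print(tagged)
    print("still ambiguous:", ambiguous(tagged))
```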
A word can have more than one syntactic function. For example, "fat" can be an adjective ("The fat boy eats hamburgers") and can be a noun ("Hamburgers have lots of fat"). So, how do we know that "fat" in the sentence "Hamburgers have lots of fat" is a noun? There are two methods:
Sometimes some ambiguity remains after applying the methods described above. Semantic information may be used to solve the problem. That is why a good dictionary must have some semantic information. For example, words related to music should be marked as such.
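One way to use such marks is to choose the sense whose domain is shared by the most words in the surrounding sentence. The sketch below assumes this counting heuristic, which is not specified by the text; the sense inventory, the domain marks and the example word "bass" are hypothetical.

```python
from collections import Counter

# Hypothetical sense inventory: each sense of an ambiguous word carries a domain mark.
SENSES = {
    "bass": [("bass/instrument", "music"), ("bass/fish", "fishing")],
}

# Hypothetical domain marks on other dictionary words.
DOMAINS = {
    "guitar": "music",
    "concert": "music",
    "play": "music",
    "river": "fishing",
    "catch": "fishing",
}


def pick_sense(word, context):
    """Pick the sense of `word` whose domain occurs most often among the context words.

    Returns None when no context word shares a domain with any sense.
    On a tie, the sense listed first in SENSES wins.
    """
    counts = Counter(DOMAINS[w] for w in context if w in DOMAINS)
    best_sense, best_domain = max(SENSES[word], key=lambda sense: counts[sense[1]])
    if counts[best_domain] == 0:
        return None
    return best_sense


if __name__ == "__main__":
    print(pick_sense("bass", ["he", "play", "bass", "at", "the", "concert"]))
    print(pick_sense("bass", ["they", "catch", "bass", "in", "the", "river"]))
```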
All the syntactic, morphological and semantic information should be encoded in an interlanguage. All source-language root words should be translated to interlanguage root words. Esperanto is often used as an intermediate language (including in Traduki) because 99% of Esperanto words have only one sense and because Esperanto is already somewhat of an interlanguage.
Ergane, a free-to-use multilingual dictionary that uses Esperanto as an interlanguage, can be useful for Traduki.
The synthesis of the destination language from the interlanguage is an easy step. There are, however, some problems:
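As a rough sketch of this step, each analyzed English word can be turned into an interlanguage unit that keeps the Esperanto word together with its syntactic class and morphological feature. The `Unit` record, the feature names and the small lookup table are assumptions made for the example; they do not reflect Traduki's or Ergane's actual data format. The table stores Esperanto dictionary forms rather than bare roots to keep the example readable.

```python
from dataclasses import dataclass

# Hypothetical English-to-Esperanto table keyed by (English root, syntactic class).
# Keying on the class lets "fat" map to different Esperanto words.
EN_TO_EO = {
    ("boy", "noun"): "knabo",
    ("eat", "verb"): "manĝi",
    ("hamburger", "noun"): "hamburgero",
    ("fat", "adjective"): "dika",
    ("fat", "noun"): "graso",
}


@dataclass(frozen=True)
class Unit:
    """One interlanguage unit: Esperanto word plus the information gathered earlier."""
    word: str
    word_class: str
    feature: str


def to_interlanguage(analyzed):
    """Convert (English root, syntactic class, feature) triples into interlanguage units."""
    units = []
    for root, word_class, feature in analyzed:
        eo_word = EN_TO_EO.get((root, word_class))
        if eo_word is None:
            raise KeyError(f"no interlanguage entry for {root!r} as {word_class}")
        units.append(Unit(eo_word, word_class, feature))
    return units


if __name__ == "__main__":
    sentence = [("fat", "adjective", "base"), ("boy", "noun", "base"),
                ("eat", "verb", "third-person singular"), ("hamburger", "noun", "plural")]
    for unit in to_interlanguage(sentence):
        print(unit)
```

Because the lookup happens after the syntactic class is known, the adjective "fat" and the noun "fat" end up as different interlanguage words.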
Input
Tokenization
Morphological analysis
Syntactical analysis
Disambiguation
Semantic Disambiguation
Translation to an interlanguage
Destination language synthesis
See also
External links and references
Useful resources for the Traduki project
Online articles
Books