Data structures and mechanisms for Babylon project

.........This article was written as part of the Babylon project.
.........The central idea behind the schema in this proposal can be stated like this: get as much information as possible from contributors without bothering them with tedious work.
.........Before I describe the mechanisms in detail, I will try to explain the ideas which led to this schema.
.........It is very important that users can contribute as easily as possible. Anything that bothers people will deter many potential developers. There are two convenient ways to contribute: through a Web interface or by e-mail. For the first stage we decided to rely exclusively on e-mail for delivery. If we used the Web, we would have to provide some mechanism for checking that signed entries really came from the cited source. E-mail solves the authentication problem: the header of an e-mail contains information about the sender, and this information can be easily extracted.
.........The system should scale well with the number of contributors. Everything should be automated; any part requiring the personal attendance of a human supervisor could become a bottleneck for the project.
.........We can imagine the following ideal solution. A developer sends an e-mail to a dedicated e-mail address. A program saves the message to disk and checks whether everything in the message is correct. It then either sends back a warning that it cannot understand everything, or sends a confirmation of a successful contribution and proceeds further. It will extract the e-mail address of the sender and look it up in the Zvon members directory. If the search is successful, it will replace the e-mail address with the ZvonID. It will attach this identification to each entry in the saved file, and the contents of the file will be added to the appropriate places.
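.........The sender lookup described above could be sketched as follows. This is only an illustration, not the project's actual code: the MEMBERS table, the ID "member-0001" and the function name identify_sender are made up for the example, and the real Zvon members directory may work quite differently.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical stand-in for the Zvon members directory: e-mail -> ZvonID.
# The ID below is invented for illustration only.
MEMBERS = {"author@vscht.cz": "member-0001"}


def identify_sender(raw_message, members):
    """Return the ZvonID of the sender if known, otherwise the bare e-mail address."""
    msg = message_from_string(raw_message)
    # parseaddr splits "Some Name <user@host>" into (name, address).
    _, address = parseaddr(msg.get("From", ""))
    address = address.lower()
    return members.get(address, address)
```

.........A sender who is not in the directory simply keeps the e-mail address as identification, which matches the fallback described in the paragraph above.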
.........We do not want to feed the data straight to some on-line service. We should write everything in a suitable form to a file which anyone can download. Then people can use whatever they need. We will use these files in the same way as anybody else and provide a service based on them. We may even not use them at all if somebody else uses them in a way we could not match. Our primary goal is to collect data others can play with.
.........And now about the data. We think that input should follow natural thinking. In normal speech you would say that "and" is "und" in German and "a" in Czech. This gives a lot of information. It says how to translate "and" to Czech and German, "und" to English and Czech, and "a" to English and German. You can even say that "house" or "building" means "dum" in Czech. But this sentence gives not only a translation from English to Czech. It also says that the words "house" and "building" can be synonymous in English. So at the same time you are building not only a dictionary, but a thesaurus as well. And a thesaurus is useful not only for translators, but in everyday writing as well. Information can be heavily leveraged if all sources are free.
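.........To make the amount of information in one statement concrete, here is a small sketch which expands a group of words into all the translations and synonyms it implies. The representation of a group as a list of (language, word) pairs and the name expand_group are assumptions made for this example.

```python
from itertools import permutations


def expand_group(entries):
    """Split one group of (lang, word) entries into translations and synonyms."""
    translations, synonyms = [], []
    # Every ordered pair of entries is either a translation (different
    # languages) or a synonym pair (same language).
    for (lang_a, word_a), (lang_b, word_b) in permutations(entries, 2):
        if lang_a == lang_b:
            synonyms.append((lang_a, word_a, word_b))
        else:
            translations.append((lang_a, word_a, lang_b, word_b))
    return translations, synonyms


translations, synonyms = expand_group(
    [("en", "building"), ("en", "house"), ("cs", "dum")]
)
```

.........For this group the sketch yields four translations (two in each direction between English and Czech) and the synonym pair "building"/"house" in both orders. The three words "and", "a" and "und" give six translations and no synonyms.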
.........We propose the following syntax for an e-mail message:
lang: word
lang:translation
lang:translation
.........So the examples mentioned above would come from an e-mail message:
en:and
cs:a
de:und

en:building
en:house
cs:dum
.........This syntax is understandable for everyone. We should probably provide an XML syntax as well, for experienced users who could make use of XML editors and validation.
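.........A parser for the plain-text syntax might look like the sketch below. It assumes that a blank line separates groups of words, that a space after the colon is optional, and that a line which does not fit the pattern should be reported back to the sender. The function name parse_message is invented for the example.

```python
def parse_message(body):
    """Parse 'lang:word' lines into groups separated by blank lines."""
    groups, current = [], []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current group, if there is one.
            if current:
                groups.append(current)
                current = []
            continue
        lang, sep, word = line.partition(":")
        if not sep or not lang.strip() or not word.strip():
            raise ValueError(f"cannot understand line: {line!r}")
        current.append((lang.strip().lower(), word.strip()))
    if current:
        groups.append(current)
    return groups


groups = parse_message("en:and\ncs:a\nde:und\n\nen:building\nen:house\ncs:dum\n")
```

.........The ValueError corresponds to the warning that the software "cannot understand everything"; a real implementation would probably collect all bad lines instead of stopping at the first one.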
.........The final XML repository would look like this. Each language would have its own file, so the above input would generate the files en.xml, cs.xml and de.xml with the following contents:

en.xml:
<orig word='and'> 
<trans word='a' lang='cs' creator='author@vscht.cz'/>
<trans word='und' lang='de' creator='author@vscht.cz'/>
</orig> 

<orig word='building'> 
<trans word='dum' lang='cs' creator='author@vscht.cz'/>
</orig>

<orig word='house'> 
<trans word='dum' lang='cs' creator='author@vscht.cz'/>
</orig>

cs.xml:
<orig word='a'> 
<trans word='and' lang='en' creator='author@vscht.cz'/>
<trans word='und' lang='de' creator='author@vscht.cz'/>
</orig>

<orig word='dum'> 
<trans word='building' lang='en' creator='author@vscht.cz'/>
<trans word='house' lang='en' creator='author@vscht.cz'/>
</orig>

de.xml:
<orig word='und'> 
<trans word='and' lang='en' creator='author@vscht.cz'/>
<trans word='a' lang='cs' creator='author@vscht.cz'/>
</orig>
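.........The per-language entries shown above could be generated along these lines. This is a sketch under assumptions, not the Perl script mentioned in this article: the function name build_entries and the in-memory layout (a dictionary of languages, each holding a dictionary of words) are invented for the example. Following the listings above, only words from other languages become trans elements.

```python
import xml.etree.ElementTree as ET


def build_entries(groups, creator):
    """Return {lang: {word: <orig> element}} for the given groups."""
    files = {}
    for group in groups:
        for lang, word in group:
            origs = files.setdefault(lang, {})
            orig = origs.get(word)
            if orig is None:
                orig = origs[word] = ET.Element("orig", word=word)
            # Every word of the group in another language is a translation.
            for other_lang, other_word in group:
                if other_lang != lang:
                    ET.SubElement(orig, "trans", word=other_word,
                                  lang=other_lang, creator=creator)
    return files


files = build_entries(
    [
        [("en", "and"), ("cs", "a"), ("de", "und")],
        [("en", "building"), ("en", "house"), ("cs", "dum")],
    ],
    "author@vscht.cz",
)
# Serialize one entry; ElementTree writes attributes with double quotes.
print(ET.tostring(files["cs"]["dum"], encoding="unicode"))
```

.........The sketch does not yet detect a translation which is already present, so sending the same group twice would produce duplicate trans elements.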

.........Especially with common words, the same entry can be entered independently by several developers. We should make use of this fact. If somebody else enters the same translation, then we have an independent check, and that is very important in a distributed project. So if somebody else sends an e-mail from the address creator@berlin.de with the text:
de: und
en: and
.........de.xml will have an entry:
 
 <orig word='und'> 
<trans word='and' lang='en' creator='author@vscht.cz|creator@berlin.de'/>
<trans word='a' lang='cs' creator='author@vscht.cz'/>
</orig>
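.........Merging a repeated translation into the creator attribute can be sketched like this. The helper names add_creator and confirmations are invented for the example; only the '|' separator comes from the entry above.

```python
def add_creator(creator_attr, new_creator):
    """Append new_creator to a '|'-separated creator list unless already present."""
    creators = creator_attr.split("|") if creator_attr else []
    if new_creator not in creators:
        creators.append(new_creator)
    return "|".join(creators)


def confirmations(creator_attr):
    """Number of independent contributors who entered this translation."""
    return len(creator_attr.split("|")) if creator_attr else 0


merged = add_creator("author@vscht.cz", "creator@berlin.de")
```

.........Because a contributor who is already listed is not added again, sending the same entry twice from one address does not count as an additional independent check.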

 
.........We plan to offer several views of the XML files. One of them would offer a complete view of all data, including e-mail addresses. We also plan to add more features, such as the possibility to send a comment about a given word, to send an audio file with the proper pronunciation, and so on.
.........We need several programs which would help to implement these ideas. Fortunately, they are rather easy to write. You can download a Perl script which takes XML input from several files and produces the aforementioned files. The testing files are self-explanatory and contain duplicate entries and so on. You can test it, and if you like these ideas, please help with code or contribute new words.

Editor: Miloslav Nic [MNaaaa] Created: 16.7.1999 Last change: 16.7.1999

Zvon supports the free exchange of information. You can also become a member. The homepage http://zvon.vscht.cz gives further details.