NetNews Usenet Archive 1993 #3

Internet Message Format | 1993-01-24 | 1.5 KB

Path: sparky!uunet!spool.mu.edu!agate!usenet.ins.cwru.edu!po.CWRU.Edu!mag6
From: mag6@po.CWRU.Edu (Martin A. Gulaian)
Newsgroups: sci.crypt
Subject: Re: Automatic lang. determination of titles/subj. lines?
Date: 24 Jan 1993 03:17:33 GMT
Organization: Case Western Reserve University, Cleveland, OH (USA)
Lines: 23
Message-ID: <1jt1odINNhqe@usenet.INS.CWRU.Edu>
References: <1993Jan20.163448.17017@daimi.aau.dk>
Reply-To: mag6@po.CWRU.Edu (Martin A. Gulaian)
NNTP-Posting-Host: slc5.ins.cwru.edu

In a previous article, lhp@daimi.aau.dk (Lasse Hiller|e Petersen) says:

>> Rather than reinventing the wheel, I'd like to know whether someone knows
>> of a program for the automatic determination of the language of short
>> sentences, titles or subject lines.
>>
>> Given a short piece of text, the program should output a (list of)
>> language(s) which the text is most likely to be in.
>>
>> I think such a program could be based on the frequencies of syllables
>> or letter-pairs/triplets. Are there any good data collections of such
>> frequencies, at least for the European languages?

I can vouch for the approach - I wrote just such a program for a class
six or seven years ago. I fed it text in English/French/Spanish/German
and let it calculate the pair/triplet/whatever (it was selectable, I think
triplets and pairs both worked well) frequencies. It worked very well -
pretty much 100% accuracy on sentence-length samples of unknown language,
unless I deliberately tried to fool it.

I don't know where the source code ended up; it was in Prolog anyway.

-Marty
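
For readers who want to try the idea, here is a minimal sketch of letter-triplet language guessing in Python. It is not the Prolog program mentioned above, whose source is not available. The function names (ngrams, train, rank_languages), the add-one smoothing, and the tiny SAMPLES training sentences are all assumptions made for illustration; a usable version would need far more training text per language.

```python
import math
from collections import Counter

# Illustrative training text only (an assumption, not data from the post).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the children "
               "are playing in the garden of the house with their friends",
    "french": "le renard brun saute par dessus le chien et les enfants "
              "jouent dans le jardin de la maison avec leurs amis",
    "spanish": "el zorro salta sobre el perro y los amigos juegan en el "
               "patio de la casa con sus hermanos por la tarde",
    "german": "der schnelle braune fuchs springt und die kinder spielen "
              "im garten hinter dem haus mit ihren freunden",
}


def ngrams(text, n=3):
    """Return overlapping character n-grams of text, lowercased and space-padded."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Build one n-gram count table per language from {language: text}."""
    return {lang: Counter(ngrams(text, n)) for lang, text in samples.items()}


def rank_languages(text, profiles, n=3):
    """Rank languages by the smoothed log-likelihood of the text's n-grams.

    Returns a list of (language, score) pairs, most likely language first.
    """
    # Number of distinct n-grams seen across all languages, used for smoothing.
    vocab = len(set().union(*profiles.values()))
    grams = ngrams(text, n)
    scores = []
    for lang, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing so an unseen n-gram does not zero out a language.
        score = sum(
            math.log((counts[g] + 1) / (total + vocab + 1)) for g in grams
        )
        scores.append((lang, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


profiles = train(SAMPLES)
for lang, score in rank_languages("die kinder spielen im garten", profiles):
    print(f"{lang:8s} {score:8.2f}")
```

Passing n=2 to both train and rank_languages switches from triplets to letter pairs, which mirrors the "selectable" n-gram size described in the post. The ranked list also matches the original request for "a (list of) language(s)" rather than a single answer.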