- Path: sparky!uunet!spool.mu.edu!agate!usenet.ins.cwru.edu!po.CWRU.Edu!mag6
- From: mag6@po.CWRU.Edu (Martin A. Gulaian)
- Newsgroups: sci.crypt
- Subject: Re: Automatic lang. determination of titles/subj. lines?
- Date: 24 Jan 1993 03:17:33 GMT
- Organization: Case Western Reserve University, Cleveland, OH (USA)
- Lines: 23
- Message-ID: <1jt1odINNhqe@usenet.INS.CWRU.Edu>
- References: <1993Jan20.163448.17017@daimi.aau.dk>
- Reply-To: mag6@po.CWRU.Edu (Martin A. Gulaian)
- NNTP-Posting-Host: slc5.ins.cwru.edu
-
-
- In a previous article, lhp@daimi.aau.dk (Lasse Hiller|e Petersen) says:
- >> Rather than reinventing the wheel, I'd like to know whether someone knows
- >> of a program for the automatic determination of the language of short
- >> sentences, titles or subject lines.
- >>
- >> Given a short piece of text, the program should output a (list of)
- >> language(s) which the text is most likely to be in.
- >>
- >> I think such a program could be based on the frequencies of syllables
- >> or letter-pairs/triplets. Are there any good data collections of such
- >> frequencies, at least for the European languages?
-
- I can vouch for the approach: I wrote just such a program for a class
- six or seven years ago. I trained it on English, French, Spanish, and
- German text and let it calculate letter-group frequencies (the group
- length was selectable; I think pairs and triplets both worked well).
- It was pretty much 100% accurate on sentence-length samples of unknown
- language, unless I deliberately tried to fool it.
-
- I don't know where the source code ended up; it was in Prolog anyway.
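
For anyone who wants to try the idea, here is a minimal sketch of the
letter-triplet approach in Python. It is not the original Prolog program:
the function names, the tiny unaccented training sentences, and the
add-one smoothing are all assumptions made for illustration. A real
version would need far more training text per language.

```python
import math
from collections import Counter

# Toy training text (hypothetical, accents stripped); real use needs much more.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then it runs "
               "through the forest to find something to eat with the others",
    "french": "le renard brun saute par dessus le chien paresseux et ensuite il "
              "court dans la foret pour trouver quelque chose a manger avec les autres",
    "spanish": "el zorro marron salta sobre el perro perezoso y luego corre por "
               "el bosque para encontrar algo de comer con los otros",
    "german": "der schnelle braune fuchs springt ueber den faulen hund und dann "
              "rennt er durch den wald um etwas zu essen zu finden mit den anderen",
}


def ngrams(text, n):
    """Return the overlapping character n-grams of text, padded with spaces."""
    cleaned = " ".join(text.lower().split())
    padded = " " + cleaned + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """Count n-gram frequencies separately for each language."""
    return {lang: Counter(ngrams(text, n)) for lang, text in samples.items()}


def score(counts, text, n, vocab_size):
    """Log-likelihood of text under one language's n-gram counts.

    Add-one smoothing keeps unseen n-grams from producing log(0).
    """
    total = sum(counts.values())
    return sum(
        math.log((counts[gram] + 1) / (total + vocab_size))
        for gram in ngrams(text, n)
    )


def rank_languages(models, text, n=3):
    """Return all trained languages, most likely first."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # +1 stands in for any unseen n-gram
    scores = {
        lang: score(counts, text, n, vocab_size)
        for lang, counts in models.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


models = train(SAMPLES, n=3)
print(rank_languages(models, "the lazy dog", n=3))
```

Passing n=2 to both train and rank_languages switches from triplets to
pairs. The ranked list matches the "(list of) language(s)" output asked
for in the quoted article.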
-
- -Marty
-