NetNews Usenet Archive 1993 #3

Text File | 1993-01-23 | 3.4 KB | 84 lines

Newsgroups: sci.lang
Path: sparky!uunet!grebyn!daily!lojbab
From: lojbab@grebyn.com (Logical Language Group)
Subject: Lexical Texts Analysis
Message-ID: <1993Jan23.070340.28270@grebyn.com>
Organization: The Logical Language Group, Inc./Institute for Russian Language
Date: Sat, 23 Jan 1993 07:03:40 GMT
Lines: 74

The following is being posted at the request of Mikhael Maron, a linguist associated with the Institute for Russian Language in Moscow. He is interested in contacting others who are interested in his work or are doing similar work. He has some limits on email access, and shares his account with many others. Please address any queries on this research to him at:

irl@glas.apc.org

with attention to his name in the subject line: "ATTN: M. Maron - IRL"

******************* Forwarded message: --------------------------

Subject: Lexical Texts Analysis

Large text (ASCII) files such as whole books (fiction, humanities, science) in electronic form are considered.

- A *total list* of all word occurrences in the text is generated.

- Having some general idea of what key words we are interested in, we select them from this list, making *partial word lists* (PWL): names, words from specific problem areas, etc.; from several up to about 200-250 words can be included in one PWL.

- For a given PWL we are to build:
  (1) a set of all contexts (paragraphs/lines) where the PWL member words occurred;
  (2) a word index, telling on which book pages these occurrences took place.

The problem is to perform these activities efficiently: the concordance word crunchers I know need up to 10 to 30 times the volume of the analysed text for some service indexes, which makes the search process impractical for real books.

- The solution is to introduce markup into the text searched: the words of interest are supplied with special markers, which works well with the help of some context-replacement routines I have developed.

- Having marked-up files, we may extract only the marker-containing lines, which form the needed set of contexts.
  This extraction may be done with the help of the GREP routine, for instance, and with turbo efficiency too.

- The index is generated from the marked-up files or the set of contexts with the help of a routine I produced.

This technique was used to analyse the text of Dostoyevsky's novel "The Possessed" with respect to the lexicology of possession: to have, to possess, to acquire, etc. (imet', vladet', priobretat'...)

The idea was inspired by the concept of possession in Fromm's "To Have or To Be?". According to Fromm's semantics, to possess means (a bit roughly) to possess property/goods, NOT abstract properties as in logic/computer science, for instance.

The "possession" PWL for the novel was built, as well as a set of contexts and an index for about 200 occurrences of the words in this list.

For each word occurrence its usage model was built. For example, "to have" in the context "I had a terrible headache after discussion with Ivanov" is modelled as "to have headache".

The complete set of such usage models for the words in the given PWL in the given text gives us an understanding of the semantics of these words with respect to the text.

As for the novel, the semantics of "possession" in it turned out to be very interesting and seems to give considerable insight into Dostoyevsky's train of thought. It seems to be compatible with the logic/computer approach, and quite incompatible with Fromm's!

_____
EOF
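
The markup-and-extract procedure described in the forwarded message can be sketched roughly as below. This is only an illustration under stated assumptions, not Maron's actual routines, which are not shown in the post. The marker string "@@", the function names, and the fixed number of lines per page used to derive page numbers are all hypothetical choices made for the example.

```python
import re
from collections import Counter, defaultdict

MARK = "@@"  # hypothetical marker; the post does not say what markers were used


def total_word_list(text):
    """Count every word occurrence in an ASCII text (the 'total list')."""
    return Counter(re.findall(r"[A-Za-z']+", text.lower()))


def mark_up(text, pwl):
    """Put MARK in front of every occurrence of a word from the partial word list."""
    if not pwl:
        return text  # nothing to mark; avoids building an empty pattern
    alternatives = "|".join(re.escape(w) for w in sorted(pwl))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: MARK + m.group(0), text)


def extract_contexts(marked_text):
    """Keep only the lines that contain a marker, as a grep for MARK would."""
    return [
        (number, line)
        for number, line in enumerate(marked_text.splitlines(), start=1)
        if MARK in line
    ]


def build_index(marked_text, lines_per_page=40):
    """Map each marked word to the pages on which it occurs.

    Assumption: a page is a fixed number of lines. A real book file would
    need its own page-break convention.
    """
    marked_word = re.compile(re.escape(MARK) + r"([A-Za-z']+)")
    index = defaultdict(set)
    for number, line in extract_contexts(marked_text):
        page = (number - 1) // lines_per_page + 1
        for match in marked_word.finditer(line):
            index[match.group(1).lower()].add(page)
    return {word: sorted(pages) for word, pages in index.items()}


if __name__ == "__main__":
    sample = (
        "He had a house.\n"
        "She walked away.\n"
        "They possess nothing and have no debts.\n"
    )
    marked = mark_up(sample, {"had", "have", "possess"})
    for number, line in extract_contexts(marked):
        print(number, line)
    print(build_index(marked, lines_per_page=2))
```

Once the text is marked up, the context set comes from a single linear pass over the file, with no auxiliary index several times the size of the text. The word index is then built from those extracted lines alone.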