- Newsgroups: sci.lang
- Path: sparky!uunet!grebyn!daily!lojbab
- From: lojbab@grebyn.com (Logical Language Group)
- Subject: Lexical Texts Analysis
- Message-ID: <1993Jan23.070340.28270@grebyn.com>
- Organization: The Logical Language Group, Inc./Institute for Russian Language
- Date: Sat, 23 Jan 1993 07:03:40 GMT
- Lines: 74
-
- The following is being posted upon the request of Mikhael Maron, a
- linguist associated with the Institute for Russian Language in Moscow.
- He is interested in contacting others who are interested in his work or
- who are doing similar work. His email access is limited, and he shares
- his account with many others. Please address any queries on this research
- to him at:
-
- irl@glas.apc.org
-
- with attention to his name in the subject line: "ATTN: M. Maron - IRL"
- *******************
-
- Forwarded message:
- --------------------------
- Subject: Lexical Texts Analysis
-
- Large text (ASCII) files, such as whole books (fiction, humanities,
- science) in electronic form, are considered.
-
- - A *total list* of all word occurrences in the text is generated.
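-
- A minimal sketch of this step in Python (the tokenization rule, the
- lower-casing and the file name are illustrative assumptions, not the
- original routine):
-
-   import re
-   from collections import Counter
-
-   def total_word_list(path):
-       # Count every word occurrence in the ASCII text file at `path`.
-       with open(path, encoding="ascii", errors="replace") as f:
-           words = re.findall(r"[A-Za-z']+", f.read().lower())
-       return Counter(words)
-
-   counts = total_word_list("possessed.txt")     # hypothetical file name
-   for word, n in counts.most_common():
-       print(n, word)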
-
- - Having some general idea of what key words we are interested in, we
- select them from this list, making *partial word lists* (PWL): names,
- words from specific problem areas, etc.; from a few up to about 200-250
- words can be included in one PWL.
-
- - For a given PWL we then build: (1) a set of all contexts
- (paragraphs/lines) in which the PWL member words occur;
-
- (2) a word index, telling on which book pages these occurrences took place.
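-
- A direct sketch of these two target outputs, assuming page numbers are
- approximated by a fixed number of lines per page (the real book layout,
- the file name and the PWL used here are not given in the post):
-
-   import re
-
-   def contexts_and_index(path, pwl, lines_per_page=40):
-       # Collect lines containing PWL words and record an approximate
-       # page number for each hit.
-       pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, pwl)),
-                            re.IGNORECASE)
-       contexts, index = [], {w: [] for w in pwl}
-       with open(path, encoding="ascii", errors="replace") as f:
-           for lineno, line in enumerate(f, start=1):
-               hits = {m.group(1).lower() for m in pattern.finditer(line)}
-               if hits:
-                   contexts.append(line.rstrip())
-                   page = (lineno - 1) // lines_per_page + 1
-                   for w in hits:
-                       index[w].append(page)
-       return contexts, index
-
-   contexts, index = contexts_and_index("possessed.txt",
-                                         ["have", "possess", "acquire"])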
-
- The problem is to perform these activities efficiently: the concordance
- word-crunchers I know of need service indexes of up to 10-30 times the
- analysed text volume, which makes the search process impractical for
- real books.
-
- - The solution is to introduce markup into the text being searched: the
- words of interest are supplied with special markers, which is done with
- the help of some context-replacement routines I have developed.
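-
- A minimal sketch of such a markup pass (the <<...>> marker format and the
- file names are illustrative assumptions; the original context-replacement
- routines are not described in the post):
-
-   import re
-
-   def mark_up(text, pwl, left="<<", right=">>"):
-       # Wrap every occurrence of a PWL word in marker strings.
-       pattern = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, pwl)),
-                            re.IGNORECASE)
-       return pattern.sub(lambda m: left + m.group(1) + right, text)
-
-   with open("possessed.txt", encoding="ascii", errors="replace") as f:
-       marked = mark_up(f.read(), ["have", "possess", "acquire"])
-   with open("possessed.marked.txt", "w", encoding="ascii") as f:
-       f.write(marked)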
-
- - Having marked-up files, we can extract only the marker-containing
- lines, which form the needed set of contexts. This extraction can be
- done with a grep routine, for instance, and very efficiently.
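-
- For example, with the <<...>> markers assumed above (file names are again
- hypothetical), the set of contexts is simply:
-
-   grep -n '<<' possessed.marked.txt > contexts.txt
-
- where -n keeps the line number of each context for the index step below.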
-
- - The index is generated from the marked-up files/set of contexts with
- the help of a routine I have written.
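-
- A sketch of what such an index routine might do, working from the grep -n
- output assumed above (the marker format, file name and fixed
- lines-per-page figure are all assumptions):
-
-   import re
-
-   LINES_PER_PAGE = 40           # assumed page size, not the real layout
-
-   index = {}                    # word -> list of page numbers
-   with open("contexts.txt", encoding="ascii", errors="replace") as f:
-       for entry in f:
-           lineno, text = entry.split(":", 1)    # e.g. "1234:... <<have>> ..."
-           page = (int(lineno) - 1) // LINES_PER_PAGE + 1
-           for word in re.findall(r"<<(.+?)>>", text):
-               index.setdefault(word.lower(), []).append(page)
-
-   for word, pages in sorted(index.items()):
-       print(word, sorted(set(pages)))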
-
- This technique was used to analyse the text of Dostoyevsky's novel "The
- Possessed" with respect to the lexicon of possession: to have, to
- possess, to acquire, etc. (imet', vladet', priobretat'...)
-
- The idea was inspired by the concept of possession in Fromm's "To Have
- or To Be?". According to Fromm's semantics, to possess means (a bit
- roughly) to own property/goods, NOT to have abstract properties as in
- logic/computer science, for instance.
-
- The "possession" PWL for the novel was built, as well as a set of
- contexts and an index for about 200 occurrences of the words in this
- list.
-
- For each word occurrence, its usage model was built. For example, "to
- have" in the context "I had a terrible headache after discussion with
- Ivanov" is modelled as "to have headache".
-
- The complete set of such usage models for the words in the given PWL in
- the given text gives us an understanding of the semantics of these
- words with respect to the text.
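-
- The post does not say how the usage models were derived (most likely by
- hand), so the following is only a sketch of how such hand-assigned models
- might be recorded and summarized; the records shown are hypothetical:
-
-   from collections import Counter
-
-   # (page, context snippet, usage model assigned by the analyst)
-   occurrences = [
-       (12, "I had a terrible headache after discussion with Ivanov",
-        "to have headache"),
-       (57, "...", "to acquire estate"),
-   ]
-
-   models = Counter(model for _page, _ctx, model in occurrences)
-   for model, n in models.most_common():
-       print(n, model)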
-
- As for the novel, the semantics of "possession" in it appeared to be
- very interesting and seems to give considerable insight into
- Dostoyevsky's train of thought. It seems to be compatible with
- logic/computer approach - and quite incompatible with Fromm's!
- _____
- EOF
-