-
- Mark V. Shaney V1.3
- a probabilistic text generator
- Copyright (c) 1991 Stefan Strack
- Internet: stracks@ctrvax.vanderbilt.edu
- Bitnet: stracks@vuctrvax
- released 11/01/91
-
-
- You may copy this program freely, provided you leave the program and this
- documentation unchanged. You may not charge more than the cost of the media
- for its distribution. I disclaim all liabilities for any damages or loss of
- data incurred by using this program. Comments and suggestions to the above
- email addresses are appreciated.
-
-
- --------------------
- Origin
-
- Mark V. Shaney was featured in the "Computer Recreations" column by
- A. K. Dewdney in Scientific American. The original program (for a mainframe,
- I believe) was written by Bruce Ellis based on an idea by Don P. Mitchell.
- Dewdney tells the amusing story of a riot on net.singles when Mark V.
- Shaney's ramblings were unleashed.
-
-
- --------------------
- Who is Mark V. Shaney?
-
- Mark V. Shaney produces a confused imitation of the style and content of a
- piece of writing. Mark reads the original text and builds a "word probability
- table" that reflects the probability of a word following a sequence of words.
- In output mode, Mark will generate random text weighted by the probabilities
- in this table (a so-called Markov chain, hence Mark's name). Since Mark
- considers punctuation as part of the word, he is likely to produce
- grammatical sentences, albeit a caricature of the original text. Mark V.
- Shaney is in the same league as the famous ELIZA and RACTER programs, and
- shows that you don't need AI for "almost human" writing. This implementation
- of Mark V. Shaney allows you to vary the "randomness" of the text output and
- supports huge probability tables in expanded memory or on disk. Version 1.0
- of this program was posted on comp.binaries.ibm.pc in June 1991.
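-
- As an illustration of the idea, here is a minimal Python sketch (not the
- original PDC Prolog code; "input.txt" stands for any text file): count how
- often each word follows a pair of words, then walk the table with weighted
- random choices.
-
- # Minimal sketch of the Mark V. Shaney idea with order 2: count how often
- # each word follows a word pair, then generate by weighted random choice.
- import random
- from collections import defaultdict
-
- def build_table(words, order=2):
-     # map each order-word sequence to the words observed after it
-     table = defaultdict(list)
-     for i in range(len(words) - order):
-         key = tuple(words[i:i + order])
-         table[key].append(words[i + order])    # duplicates encode frequency
-     return table
-
- def generate(table, start, max_words=100):
-     # walk the table; stop at a sequence that never had a successor
-     key, output = start, list(start)
-     for _ in range(max_words):
-         choices = table.get(key)
-         if not choices:                        # end of an input text reached
-             break
-         word = random.choice(choices)          # weighted by stored frequency
-         output.append(word)
-         key = key[1:] + (word,)
-     return " ".join(output)
-
- words = open("input.txt").read().split()       # whitespace-delimited words
- table = build_table(words)
- print(generate(table, tuple(words[:2])))       # start with the first word pair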
-
-
- --------------------
- New in Version 1.3
-
- * Text reading and generating are slightly faster than in V1.0. A
- "percent done" counter displays the progress of text reading.
-
- * "Keywords" can be specified to focus Mark V. Shaney's ramblings on a
- particular topic.
-
- * A "maximum word length" command line parameter /L# was added. /L1 for
- example tells Mark to create a frequency table for single characters.
- Small values for L create interesting "foreign language" versions of
- the original text.
-
-
- --------------------
- What the program runs on
-
- A 640K AT system with a hard disk is recommended as the minimum
- configuration, although it is possible to run Mark V. Shaney on a lower-end
- machine. If you have memory above 640K and an expanded memory driver
- installed, Mark V. Shaney will try to use it.
-
-
- --------------------
- Using the program
-
- Type MARKV to start Mark V. Shaney. The menu on the bottom line of the screen
- gives you the following choices invoked by pressing their first letter:
-
- Read text
-
- The program will prompt you for a text file to read, building a word
- probability table as it goes along. Text files should not contain
- control or formatting characters other than tabs and carriage
- return/line feeds. Reading is very slow and can take several minutes
- for large files. Speed can be improved by increasing the number of
- DOS buffers in CONFIG.SYS. The bottleneck seems to be the CPU, which
- means you won't gain much from reading from a RAM drive. You can
- repeat this command to build a table from several text files. Reading
- will stop if you press any key.
-
-
- Generate text
-
- After you have created a working probability table by either reading text
- files or loading a saved table, "Generate text" will start the
- probabilistic text generator. You are prompted for a file to append
- the text output to. If you press <Return>, output will go to the
- screen. You are then given the option to start the Markov chain with
- a word other than the default, the first word of the input text.
- Mark's ramblings will word-wrap automatically; no other formatting is
- done. The output will stop when (a) Mark comes across a word that is
- the last word in one of the input texts, or (b) you press <Escape>.
- You can pause and continue output by pressing any key.
-
-
- Load table
-
- Once a probability table has been built, it can be saved to a file
- (see Save table). "Load table" prompts you for a table to be
- retrieved from disk. This is somewhat faster than re-creating the
- table from scratch by reading the original text files again. The
- downside is that saved tables are huge.
-
-
- Save table
-
- Saves the probability table to disk.
-
-
- Keywords
-
- Keywords are used to control the flow of Mark V. Shaney's ramblings.
- By specifying a keyword and an associated integer emphasis-factor,
- you can increase or decrease the probability with which this word
- will occur in the output. An emphasis-factor of 3, for example, will
- triple the chances that a keyword appears in the generated text, while
- an emphasis-factor of 0 suppresses the word entirely. Keywords often
- have little impact on the output text, because they will affect
- Mark's chaining only if they occur as one of two or more choices.
- Keywords will be more effective for longer texts and lower order
- (/O#) and length (/L#) values (see below). After selecting "Keywords"
- from the menu, you are given the option of clearing all existing
- keywords. This is followed by a prompt for a file name from which to
- read the keyword list. Hitting <Enter> lets you input keywords
- interactively. Keyword lists consist of alternating lines of the
- keyword itself and its emphasis-factor. Specifying keywords does not
- actually modify the frequency table; rather, weighted probabilities
- are calculated "on the fly" during output generation. Mark needs to
- have read some text before you can define keywords.
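-
- The text above does not spell out the exact weighting formula; the Python
- sketch below shows one plausible reading (an assumption, not the program's
- source): each candidate word's frequency is multiplied by its
- emphasis-factor, 1 by default, before the random choice is made.
-
- # Hypothetical sketch of keyword weighting, applied "on the fly" during
- # output generation; the frequency table itself is left unchanged.
- import random
- from collections import Counter
-
- def weighted_choice(successors, keywords):
-     # scale each candidate's frequency by its emphasis-factor
-     # (1 if the word is not a keyword; 0 removes the word)
-     counts = Counter(successors)              # raw successor frequencies
-     words, weights = [], []
-     for word, count in counts.items():
-         factor = keywords.get(word, 1)
-         if factor > 0:
-             words.append(word)
-             weights.append(count * factor)
-     if not words:                             # everything was de-emphasized
-         return random.choice(successors)
-     return random.choices(words, weights=weights)[0]
-
- # A keyword list alternates lines of keyword and emphasis-factor, e.g.
- #   computer
- #   3
- #   boring
- #   0
- keywords = {"computer": 3, "boring": 0}
- print(weighted_choice(["computer", "boring", "cat", "cat"], keywords))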
-
-
- Quit
-
- Exits the program.
-
-
- >
-
- Hitting the greater-than symbol starts a DOS shell. Type 'EXIT' at
- the DOS prompt to return to Mark V. Shaney.
-
-
- --------------------
- Command line options
-
- MARKV [-e|m|d[d:path]] [-o#] [-l#] [-s#]
-
- -e use EMS (default)
- -m use conventional memory
- -d[d:path] use disk
- -o# set order to # (default 2)
- -l# set max. word length to # (default 255)
- -s# set random number seed to #
-
- These options are explained in the next sections. Mark will also
- accept the slash (/) character instead of the hyphen, or simply the
- option letter by itself.
-
-
- --------------------
- The Markov order parameter
-
- The order (O) and maximum word length (L) options allow changing the
- randomness of Mark V. Shaney's text output. The order parameter specifies the
- degree of "grammatical correctness", whereas the length option controls
- Mark's "orthography". Interesting effects can be obtained by varying both
- parameters together.
-
- In the original description of Mark V. Shaney's algorithm, the Markov order
- is 2. This means that the program breaks down an input text into pairs of
- words, and calculates the probability of a third word following a given pair
- of words. The probability table then looks like this:
-
- order 2
-
- \Next Word
- Word seq \ A B C D ..
- ---------------------------------------------------
- AB | 0.0 0.02 0.08 0.15 ..
- BC | 0.0 0.0 0.0 0.02 ..
- CD | 0.02 0.22 0.0 0.0 ..
- DE | 0.1 0.06 0.03 0.0 ..
- .. | .. .. .. .. ..
-
- Mark allows you to choose the order, i.e. the number of words that make up a
- word sequence, with the command line option -o#.
-
- E.g.: MARKV -O3
-
- After reading a text, this results in the following probability table:
-
- order 3
-
- \Next Word
- Word seq \ A B C D ..
- --------------------------------------------------
- ABC | 0.01 0.0 0.0 0.02 ..
- BCD | 0.0 0.0 0.0 0.0 ..
- CDE | 0.04 0.01 0.0 0.0 ..
- DEF | 0.02 0.0 0.08 0.0 ..
- .. | .. .. .. .. ..
-
- Similarly, -o1 breaks down an input text into word sequences of length 1. The
- order can be any number greater than or equal to 1. Word sequences longer
- than 3 are not very interesting, since Mark V. Shaney will essentially
- re-create the original text when in output mode.
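-
- As a small illustration (a Python sketch, not the program's actual data
- layout), here is how a short sentence decomposes into keys and successor
- words for order 3:
-
- # Sketch: list the (key, next word) pairs for a given Markov order.
- words = "the cat sat on the mat".split()
- order = 3                                  # as with MARKV -O3
- for i in range(len(words) - order):
-     key = tuple(words[i:i + order])
-     print(key, "->", words[i + order])
- # ('the', 'cat', 'sat') -> on
- # ('cat', 'sat', 'on') -> the
- # ('sat', 'on', 'the') -> mat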
-
- In general, large order values produce low variability in the output text.
- Also, large input texts tend to result in more diverse output. You can create
- a sufficiently variable output from a short text sample with order=1.
-
- Table creation speed is largely independent of the order value, whereas table
- size increases with order value.
-
-
- --------------------
- The maximum word length option
-
- By default, Mark V. Shaney parses input text into words, i.e.
- variable-length strings delimited by white-space characters (space, tab and
- newline). Mark will produce non-English output if you set the maximum word
- length (L) parameter to a value shorter than the average length of words in
- the input text. E.g. if Mark is invoked with
-
- MARKV -L2
-
- and the word "again" appears in the input, Mark breaks the word up into
- strings of length 2 or shorter, i.e. "ag", "ai" and "n". The frequency table
- is then updated according to the current setting for the Markov order
- parameter with these short strings as entries. Similarly, -L1 builds a table
- of character frequencies. Tables built with low values for L (< 3) give rise
- to foreign-sounding output with often ludicrous word concoctions.
- Specifying keywords has a large impact on output from low L-value tables; by
- emphasizing or de-emphasizing certain vowels or consonants or combinations
- thereof you can change the "sound" of Mark's ramblings dramatically.
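-
- A Python sketch of the chunking step as described above (my reading of the
- /L# option, not the program's source): each whitespace-delimited word is
- split into pieces of at most L characters before the table is updated.
-
- # Sketch: split each whitespace-delimited word into chunks of at most
- # max_len characters, as described for the /L# option.
- def chunk(word, max_len):
-     return [word[i:i + max_len] for i in range(0, len(word), max_len)]
-
- def tokenize(text, max_len=255):
-     tokens = []
-     for word in text.split():
-         tokens.extend(chunk(word, max_len))
-     return tokens
-
- print(chunk("again", 2))               # ['ag', 'ai', 'n']
- print(tokenize("again and again", 2))  # ['ag', 'ai', 'n', 'an', 'd', ...]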
-
- Text reading is slowest for small L values. The size of the frequency table
- generally increases with increasing L values. Text playback speed depends on
- the structure of the current probability table, but is independent of the
- order and length parameters with which the program was invoked.
-
-
- --------------------
- Storage options
-
- MARKV.EXE is able to use EMS memory or a disk for storing huge probability
- tables. The location of the working probability table can be specified by one
- of the following options:
-
- MARKV -E
-
- This tells Mark to use expanded memory if available; this is also the
- default. If expanded memory is not available, the table will reside in
- conventional memory. 1 MB of EMS allows you to process about 110K of text
- input.
-
- MARKV -M
-
- Mark will keep the table in conventional memory along with the program code.
- On a 640K machine, this gives you room for processing approx. 40K of original
- text.
-
- MARKV -D[Drive:Path]
-
- Mark uses the disk to store the working probability table. The optional
- Drive:Path specifies where the temporary file is to be stored. This is the
- slowest option, but you can speed up processing greatly by specifying a RAM
- drive. The amount of input text that can be processed is limited only by the
- available disk space. Example: MARKV -DE: will set up temporary disk storage
- on drive E:.
-
-
- --------------------
- The random number seed option
-
- Mark will initialize the random number generator by reading the system time
- at program start. This makes sure that the random output is different every
- time you use the same probability table. If some particularly interesting
- output scrolls off the screen, you may wish to restart the program under
- identical conditions. For this purpose, Mark displays the number that
- initialized the random number generator when you exit the program. You can
- then provide this seed on the command line as in
-
- MARKV -S12345
-
- to re-create a previous run.
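-
- The same idea in a short, illustrative Python sketch (not the program's own
- random number generator): seeding with a known value makes a run repeatable.
-
- # Sketch: derive a seed from the system time, report it, and show that
- # re-seeding with the same value reproduces the random sequence.
- import random, time
-
- seed = int(time.time())
- random.seed(seed)
- first_run = [random.random() for _ in range(3)]
- print("seed was:", seed)           # note this down to repeat the run
-
- random.seed(seed)                  # as with MARKV -S<seed>
- assert [random.random() for _ in range(3)] == first_run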
-
-
- --------------------
- Sources of text
-
- The larger the input text, the more diversified Mark V. Shaney's ramblings
- will be. With a Markov order of 2, a 15 to 30K text will produce a varied,
- but still comprehensible output. Good choices for input text are: personal
- letters, textbooks, children's books, simple narratives, poems (anything
- made up of simple, short sentences). Poor choices are texts containing
- incomplete or complicated sentences with special terminology (technical
- writing, program documentation (this one's especially poor :-) )).
-
-
- --------------------
- Implementation
-
- The source for Mark V. Shaney is ca. 500 lines of code written in PDC Prolog
- V3.21. Word probability tables are stored in external databases. The
- dictionary (list of unique words) and frequency table are hashed to speed up
- table generation, and cross-referenced for text output in constant time. I
- designed this program for flexibility (i.e. adjustable order and word length)
- and speedy output generation. The trade-offs are relatively slow text
- scanning and large memory/disk space requirements.
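-
- The document does not show the data layout; as a rough sketch of the idea in
- Python (an assumption, not the PDC Prolog implementation), each unique word
- can be interned once so that the frequency table holds only small integer
- indices, and output generation becomes a series of constant-time lookups.
-
- # Rough sketch of a hashed dictionary cross-referenced with the table:
- # store each unique word once, keep only integer indices in the table.
- from collections import defaultdict
-
- dictionary = {}                    # word -> index (the hashed dictionary)
- words_by_index = []                # index -> word (the cross-reference)
-
- def intern(word):
-     if word not in dictionary:     # hashed lookup, O(1) on average
-         dictionary[word] = len(words_by_index)
-         words_by_index.append(word)
-     return dictionary[word]
-
- table = defaultdict(list)          # (index, index) -> successor indices
- ids = [intern(w) for w in "the cat sat on the mat".split()]
- for i in range(len(ids) - 2):
-     table[(ids[i], ids[i + 1])].append(ids[i + 2])
-
- # Generation only handles integers; words are resolved at output time.
- print(words_by_index[table[(intern("the"), intern("cat"))][0]])   # sat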
-