- UNPOST
-
- Name:
-
- unpost - Extract binary files from multi-segment uuencoded USENET
- postings or Email.
-
- Synopsis:
-
- unpost -f[-] -d[-] -c config -e errors -t text -i incompletes [file]
-
- Description:
-
- UNPOST is a tool designed primarily to extract binaries from USENET
- binaries postings such as those made to alt.binaries.pictures.misc
- and comp.binaries.ibm.pc. UNPOST can also extract binaries from
- multi-segment uuencoded mailings. To simplify this documentation,
- only USENET article postings are discussed; the principles are the
- same for multi-segment mailings.
-
- UNPOST assumes that the source file that is given to it will have the
- following format:
-
- SEGMENT begin line
- ...
- HEADER ID line
- ...
- BODY ID line
- ...
- UUENCODED line
-
- The lines are:
-
- SEGMENT begin line - Is the line that identifies the beginning of a
- segment.
- HEADER ID line - One or more lines that contain segment number,
- total number of segments or the ID string in
- the article or mail header.
- BODY ID line - One or more lines that contain segment number,
- total number of segments or the ID string in
- the article or mail message body.
- UUENCODED line - Is the first uuencoded line in the file.
- UUencoded lines include the begin and end lines.
- ... - Indicates zero or more lines that can contain
- any information so long as they CANNOT be
- misidentified as SEGMENT begin, ID or UUENCODED
- lines.
-
- Notice that the ID information can be spread across multiple lines. A
- segment is assumed to end at the beginning of the next segment, or at
- the end of the source file. An UNPOST source file contains one or more
- segments.
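-
- As a rough illustration of the layout above, the following Python
- sketch splits a source file into segments. The SEGMENT_BEGIN pattern
- and the function name are assumptions made for this example only; the
- patterns UNPOST actually uses come from its configuration.
-
```python
import re

# Hypothetical SEGMENT begin pattern; the real one depends on the
# configuration and on how the articles were saved.
SEGMENT_BEGIN = re.compile(r"^Article[: ]")

def split_segments(lines):
    """Group lines into segments, each starting at a SEGMENT begin line."""
    segments = []
    current = None
    for line in lines:
        if SEGMENT_BEGIN.match(line):
            current = [line]          # a new segment ends the previous one
            segments.append(current)
        elif current is not None:
            current.append(line)
        # lines before the first SEGMENT begin line are ignored
    return segments
```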
-
- UNPOST has three different modes: interpretation mode, concatenation
- mode and UU decoder mode. In all three modes, UNPOST can accept one
- or more input files.
-
- In the first mode, interpretation mode, UNPOST looks at article header
- and body lines before the first UU encoded line, and attempts to extract
- three pieces of information from them: segment number, total number
- of segments that the binary was split into, and an ID string that is
- common to all segments. If UNPOST finds something that it considers
- to be an ID string, and a uuencoded line in the article, but it does
- not find a segment number and number of segments, UNPOST assumes that
- the article is a single segment binary posting (part 1 of 1).
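-
- The extraction described above can be sketched in Python as follows.
- Both patterns and the interpret() function are hypothetical and only
- handle a Subject line of the form "name (Part 3/7)" or "name (3/7)";
- UNPOST's own patterns are more general and are set by configuration.
-
```python
import re

# Hypothetical patterns, written in Python's re syntax (not UNPOST's).
PART = re.compile(r"^Subject:\s*(.*?)\s*\((?:[Pp]art\s*)?(\d+)/(\d+)\)")
ID_ONLY = re.compile(r"^Subject:\s*(.+?)\s*$")

def interpret(subject_line, has_uu_data):
    """Return (id_string, segment, total), or None if nothing usable is found."""
    m = PART.match(subject_line)
    if m:
        return m.group(1), int(m.group(2)), int(m.group(3))
    m = ID_ONLY.match(subject_line)
    if m and has_uu_data:
        # ID string and uuencoded data but no numbering: assume part 1 of 1.
        return m.group(1), 1, 1
    return None
```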
-
- To aid in finding out what happened, in interpretation mode UNPOST
- will write a list of all the different ID strings and their respective
- segment lists to standard error or the file specified as the error
- file (see Standards section for details of what an ID string is).
- Any errors or warnings detected during processing will also be
- written to standard error or error file.
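-
- A minimal Python sketch of such a report, assuming each segment has
- already been reduced to an (ID string, segment number, total) triple.
- The function name and the report layout are illustrative only and do
- not reproduce UNPOST's actual output format.
-
```python
from collections import defaultdict

def segment_report(found):
    """found: iterable of (id_string, segment_number, total_segments)."""
    by_id = defaultdict(dict)
    for id_string, seg, total in found:
        by_id[id_string].setdefault("total", total)
        by_id[id_string].setdefault("segs", set()).add(seg)
    report = {}
    for id_string, info in by_id.items():
        expected = set(range(1, info["total"] + 1))
        # list of missing segment numbers; empty when the binary is complete
        report[id_string] = sorted(expected - info["segs"])
    return report
```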
-
- In interpretation mode three other files can optionally be created.
- All three of these files will contain segments copied out of the source
- file, and none of these files will be created unless they are turned
- on and named by a command line switch.
-
- The first optional file that UNPOST can create for the user in
- interpretation mode is the text file (-t switch). This file will have
- copied to it all segments from the source file that do not contain
- uuencoded data.
-
- The second optional file is the description file (-d switch).
- Segments of the part 0/# type that do not contain uuencoded data
- will NOT be copied to the text file. They are considered to be
- description segments, and they will be copied to the description
- file only if the -d switch is turned on. Also, for every binary
- posting that has all of its segments present, the header and body
- of segment #1 (up to and including the uuencode begin line) will be
- copied into the description file.
-
- The third optional file that can be created in interpretation mode is
- the incompletes file, which holds incomplete or unused uuencoded
- segments. This file contains all segments that have uuencoded data
- but were not used in a successful uudecoding. This file will only be
- created if the -i switch is present.
-
- The incompletes file allows the user to hand decode those binaries
- which could not be interpreted or decoded by UNPOST. Often, a binary
- will have all of its parts, but UNPOST will not be able to put them
- together because of differences in the ID string between segments, or
- problems with the part numbering information. The simplest way to
- solve these problems is to collect the incompletes, edit the ID
- lines to correct the problem, and rerun UNPOST on the incompletes
- file.
-
- In the second mode, concatenation mode (the -s switch), UNPOST assumes
- that all of the segments in the source file between a uuencode begin
- and a uuencode end line are part of one binary posting and that the
- segments are in order. UNPOST scans from the beginning of the file
- until it finds a uuencode begin line, and decodes from there (skipping
- over non-uuencoded lines such as article header lines and signatures)
- until it finds a uuencode end line.
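-
- The scan can be sketched in Python as below. This is an illustration
- under assumptions, not UNPOST's implementation: the function names are
- invented, and the line decoder follows the common uuencode layout (a
- length character followed by groups of four characters in the range
- ' ' to '`').
-
```python
def decode_uu_line(line):
    """Decode one uuencoded line; return None if it is not strictly valid."""
    if not line or any(not (' ' <= c <= '`') for c in line):
        return None
    n = (ord(line[0]) - 32) & 63            # number of data bytes on the line
    body = line[1:]
    if len(body) != 4 * ((n + 2) // 3):     # wrong length: not a uu line
        return None
    out = bytearray()
    for i in range(0, len(body), 4):
        a, b, c, d = ((ord(ch) - 32) & 63 for ch in body[i:i + 4])
        out += bytes((a << 2 | b >> 4, (b & 15) << 4 | c >> 2, (c & 3) << 6 | d))
    return bytes(out[:n])

def decode_in_order(lines):
    """Decode the first begin/end pair, skipping lines that are not uuencoded."""
    name = None
    data = bytearray()
    for line in lines:
        if name is None:
            parts = line.split(None, 2)     # "begin <mode> <name>"
            if len(parts) == 3 and parts[0] == "begin":
                name = parts[2]
            continue
        if line == "end":
            return name, bytes(data)
        decoded = decode_uu_line(line)
        if decoded is not None:             # headers and signatures are skipped
            data += decoded
    raise ValueError("no complete begin/end pair found")
```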
-
- In the last mode, UU decoder mode, UNPOST assumes that the source
- file contains one or more UU encoded files. Only UU encoded lines
- are allowed between the uuencode begin line and the uuencode end line
- of any single uuencoded file.
-
- Options:
-
- -c <file> To read and use a different configuration than the
- default configuration. The default configuration is
- stored in a file called def.cfg.
-
- -d Turns on description capturing and writes descriptions
- to a file that has the same name as the output but with
- a .inf extension. This defaults to off.
-
- -e <file> Redirects error and information output from standard
- error to <file>.
-
- -f[-] Modify file names to be MS-DOS compatible. Use of -f
- turns file name modification on if the default is off,
- and -f- turns file name modification off if the default
- is on. File name modification is currently the default.
-
- -h Turns on full interpretation mode. This is the default.
-
- -i <file> Turns on incomplete binaries capturing and writes the
- segments to file <file>.
-
- -s Switch to ordered segment mode. This mode ignores article
- headers, and assumes that the segments are in order.
-
- -t <file> Turns on text only segment capturing and writes the segments
- to <file>.
-
- -u Switch to uudecoder mode. Assume only uuencoded data
- between begin and end lines. Multiple uuencoded files
- are allowed.
-
- -? Show a summary of the command line switches.
-
- It is important to realize that UNPOST
-
- Standards:
-
- In all modes, UNPOST recognizes and decodes only uuencoded data.
-
- In interpretation mode UNPOST requires that:
-
- 1) The uuencoded lines be true uuencoded lines. This means
- that if trailing spaces are truncated by a mailer, editor
- or news node, UNPOST will not consider those lines to
- be uuencoded lines. Also, the uuencode character set
- recognized by UNPOST is ' ' - '`', with no other characters
- being legal.
-
- 2) That all segments of the same binary file posting have
- the same, recognizable ID string.
-
- 3) Segments have a recognizable SEGMENT begin line as the
- first line in the segment (denoting the beginning of a
- segment).
-
- 4) That all ID lines follow the SEGMENT begin line in the
- segment.
-
- 5) That the first UUencoded line of the segment follows the
- last ID line.
-
- 6) That the first uuencode line in the first segment be a
- begin line.
-
- 7) That the last segment contain a uuencode end line.
-
- In concatenation (ordered segment) mode, UNPOST requires that:
-
- 1) The uuencoded lines be true uuencoded lines. This means
- that if trailing spaces are truncated by a mailer, editor
- or news node, UNPOST will not consider those lines to
- be uuencoded lines. Also, the uuencode character set
- recognized by UNPOST is ' ' - '`', with no other characters
- being legal.
-
- 2) That the segments be stored in the file in order.
-
- 3) That the first uuencode line in the first segment be a
- begin line.
-
- 4) That the last segment contain a uuencode end line.
-
- In uudecoder mode, UNPOST requires that:
-
- 1) There be only uuencoded lines between a uuencode begin and
- a uuencode end line. In this mode, UNPOST will recognize
- and attempt to repair lines that had trailing spaces
- truncated.
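-
- To make the "true uuencoded line" requirement concrete, here is a
- hedged Python sketch. It assumes the common uuencode layout, in which
- the first character gives the number of data bytes and so fixes the
- length of the line. The function names are invented for this example,
- and the checks UNPOST really performs may differ in detail.
-
```python
def expected_length(line):
    """Total length a uuencoded line should have, judged by its first character."""
    n = (ord(line[0]) - 32) & 63        # data bytes encoded on this line
    return 1 + 4 * ((n + 2) // 3)       # length character plus 4 chars per 3 bytes

def legal_characters(line):
    return bool(line) and all(' ' <= c <= '`' for c in line)

def is_true_uu_line(line):
    """Strict check, as in interpretation and ordered segment modes."""
    return legal_characters(line) and len(line) == expected_length(line)

def repair_truncated(line):
    """Pad a line that lost trailing spaces, as uudecoder mode attempts to do.

    Returns None if the line cannot be a uuencoded line at all.
    """
    if not legal_characters(line):
        return None
    missing = expected_length(line) - len(line)
    return line + " " * missing if missing >= 0 else None
```
-
- For instance, the three bytes "Ca@" encode to the line "#0V% ", which
- ends in a space. If a mailer strips that space, the strict check fails
- on "#0V%" while the repair function restores the original line.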
-
- Examples:
-
- To extract a single binary that had all of its segments saved in order
- to a single file:
-
- unpost -s binary.uue
-
- To extract all binaries that have had all of their segments saved
- to a single file:
-
- unpost multiple.uue 2> errors
- Or
- unpost -e errors multiple.uue
-
- The file errors will contain a list of all the ID strings that UNPOST
- found and thought could have been binary files, and any errors
- that occurred during processing.
-
- To capture the incomplete or unused segments that have uuencoded
- data in them:
-
- unpost -e errors -i multiple.inc multiple.uue
-
- To capture descriptions and text only segments as well:
-
- unpost -d -e errors -t text -i multiple.inc multiple.uue
-
- To process two different files, one in uudecoder mode, one in
- interpretation mode:
-
- unpost -e errors -u uuencode.uue -h multiple.uue
-
- To process a file that requires a different configuration:
-
- unpost -c config -e errors multiple.uue
-
- Notes:
-
- To use this program to collect all of the binaries posted to, say,
- the alt.binaries.misc group on a daily basis, start up rn, go to
- the alt.binaries.misc newsgroup, and save all of the unread articles
- by using this command:
-
- .-$smisc.uue:j
-
- This will save all articles from the current number to the last to
- the file misc.uue, then junk them. After exiting rn, run UNPOST
- on the file misc.uue in interpretation mode (default mode):
-
- unpost -e errors -i misc.1 misc.uue
-
- Make sure to check the errors and/or misc.1 file for segments
- that UNPOST couldn't extract.
-
- Diagnostics:
-
- Error - file 'filename' already exists.
-
- UNPOST will not overwrite an existing file. Delete the file or
- rename it and try again.
-
- Error - missing begin line.
-
- UNPOST expected to find a uuencode begin line in this segment,
- but did not.
-
- Error - Could not open description file 'filename' for writing.
-
- UNPOST could not open a file of that name for some reason.
- Possibly a permission problem, or the file exists and is not
- writeable.
-
- Error - Bad write to binary file.
-
- A file write failed for some unknown reason. Possibly a full
- disk?
-
- Error - missing segment #
- Binary ID: 'binaryID'
-
- In attempting to decode a file whose ID string is binaryID,
- one or more segments are missing.
-
- Error - Missing UU end line.
-
- As this is the last segment, it ought to have a uuencode end
- line in it, but UNPOST did not find one.
-
- Warning - Early uuencode end line.
-
- UNPOST found a uuencode end line, but this was not the last
- segment, so we found it early. Did the poster screw up and
- misnumber his segments?
-
- Error - Unexpected UU begin line.
-
- We found an unexpected (read: this is not the first line of the
- first segment, so what is this doing here?) UU begin line.
-
- Error - cannot identify string '' in line #
-
- In reading in a configuration file, the configuration file
- lexical analyzer could not recognize this string.
-
-
- Error - Out of memory.
-
- Yup. Out of memory. Split the source file into smaller
- pieces and try again.
-
- Error - Could not modify file name to be MS-DOS conformant.
-
- File name modification is turned on, and the name of one of the
- files cannot be made conformant (probably due to having too
- many numbers in it).
-
- Warning - Unexpected end of file in segment:
- Segment: 'segment line'
-
- File name modification is turned on, and UNPOST is attempting to
- identify the file type (so it can use the proper extension
- when modifying the file name) but the UU begin line was the
- last line in the file.
-
- Warning - No UU line after begin.
- Segment: 'segment line'
-
- File name modification is turned on, and UNPOST is attempting to
- identify the file type (so it can use the proper extension
- when modifying the file name) but the UU begin line was not
- followed by a line of UU encoded binary data.
-
- Error - Got number of segments but not segment number.
- Error - Got segment number but not number of segments.
-
- UNPOST must have all three pieces of relevant data, but if
- UNPOST has at least an ID string, UNPOST will attempt to
- assume a one part binary.
-
- Error - Could not get ID string.
-
- Fatal error, with no ID string, there is no way to collect
- the pieces together.
-
- Error - No begin line in first segment:
- Segment: 'segment line'
-
- UNPOST did not find a UU begin line in the first segment.
-
- Error - missing '}' in regular expression.
-
- In a regular expression of the type abc{1, 2}, the closing curly
- brace is missing.
-
- Error - To many sub-expressions.
-
- UNPOST has a limit on the number of sub-expressions it
- allows. This is a compile time option that can be changed
- by modifying the value of MAX_SUB_EXPRS in regexp.h.
-
- Error - missing ')' in regular expression.
-
- Mismatched parentheses.
-
- Error - badly formed regular expression.
- Unexpected character 'c'
-
- I give up! What is this character doing at this point in
- a regular expression?
-
- Error, can not enumerate a sub expression.
-
- Regular expressions of the type: (...)* are not allowed.
-
- Error - illegal regular expression node type.
-
- Whoops, we have an internal programmer's error here. Let
- me know if you see this.
-
- Error - Sub expression # extraction failed.
-
- Another internal error that needs to be brought to my attention.
-
- Error - could not open file 'filename' for reading.
-
- UNPOST could not open file 'filename' for processing. Did you
- spell it right?
-
- Error - Unexpected end of file.
-
- Error - Unexpected UU begin line.
-
- Error - Segment number # greater than number of segments in:
- Segment: 'segment line'
-
- Either UNPOST got screwed up somehow or the poster posted
- something like (Part 10/9).
-
- Warning - duplicate segment # in:
- Binary ID: 'binaryID'
-
- UNPOST found two segments with the same binary ID and the
- same segment number.
-
- Error - reading source file.
-
- Could not read a line from the source file.
-
- Error - Could not open file 'filename' for output.
-
- Could not open one of the text, incomplete or error files
- for writing.
-
- Configuration:
-
- Ok, here's how to configure UNPOST to work for you. UNPOST relies
- heavily on regular expressions. These regular expressions may
- not be correct for your news reader, or system.
-
- There are five classes of regular expressions:
-
- 1) The SEGMENT begin line regular expression.
-
- 2) The ID line prefix regular expression.
-
- 3) The ID line with part description regular expression list.
-
- 4) The begin line regular expression.
-
- 5) The end line regular expression.
-
- Of these five, I don't expect you to have to modify the regular
- expressions for handling begin and end lines, because they should
- be correct for all uuencoders that follow the standard format.
-
- Be aware that UNPOST has a hierarchy of regular expressions.
- Each SEGMENT begin line regular expression has underneath it two
- lists of regular expressions that recognize ID line prefixes,
- and each element in the list of ID line prefix regular expressions
- has a list under it that attempts to parse the ID line.
-
- The two lists are for 1) the header and 2) the body.
-
- The ID line prefix regular expression exists for the sake of
- efficiency. It is used to find an ID line before we attempt
- to parse it. Modify or add one of these if you wish to change
- whether or not a line is recognized by UNPOST as being an ID line.
- If you modify this, you must modify the list of segment description
- regular expressions to match.
-
- The SEGMENT begin line regular expressions are used to find the beginning
- of a SEGMENT, or the end of a previous segment. Modify these to change
- the line or lines that UNPOST recognizes as the beginning of a segment.
-
- If you get an error message that indicates that the Subject line
- has no identifiable part description, and you see that some bright
- boy/girl has come up with a brand new part description format, then
- you have two choices: modify the source and hope they don't post
- again, or add a new ID line regular expression to the list of
- ID line regular expressions in the segment.c source file.
-
- Be aware that the lists of regular expressions are searched in order
- from top to bottom to find a match. This means that less specific
- regular expressions should be placed later in the list. For example:
- the regular expression '\(([0-9]+)/([0-9]+)\)' should come before the
- regular expression '([0-9]+) ([0-9]+)' in the part syntax parsing regular
- expression list. This reduces the number of misparses that occur.
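-
- The effect of list order can be tried out with a short Python sketch.
- The two patterns are Python re renderings of the ones mentioned above,
- not UNPOST syntax, and parse_part() is a name invented for this
- example.
-
```python
import re

# Most specific pattern first, as recommended above.
PART_PATTERNS = [
    re.compile(r"\((\d+)/(\d+)\)"),   # "(3/7)"
    re.compile(r"(\d+) (\d+)"),       # "3 7"
]

def parse_part(line, patterns=PART_PATTERNS):
    """Return (segment, total) from the first pattern in the list that matches."""
    for pattern in patterns:
        m = pattern.search(line)
        if m:
            return int(m.group(1)), int(m.group(2))
    return None
```
-
- With the list in this order, the line "pic 640 480 (3/7)" is parsed as
- part 3 of 7. With the order reversed, the looser pattern matches first
- and the picture size "640 480" is misparsed as the part description.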
-
- Remember that C uses the backslash (\) as an escape character in
- strings, so to put a backslash into a regular expression you
- need to put two into the C source string.
-
- All regular expressions can be found at the top of the parse.c source
- file. Before you modify the actual source code and recompile, I
- strongly suggest that you compile the regular expression test harness
- and test your new regular expression. Then, when you are sure that
- it is correct, copy the def.cfg file to a new name, make your changes
- there and use that configuration file for a while. If after all this,
- you are sure that it works, go for it.
-
- Before you add or modify a regular expression, you have to know the
- syntax of the regular expressions used in this program. The syntax
- is very similar to that used by UN*X style regular expressions,
- but is not exactly the same. See the section titled Regular
- Expressions before attempting to configure UNPOST.
-
- Regular Expressions:
-
- Operands
- --------
-
- UNPOST regular expressions have three types of operands: character
- strings (one or more characters), character sets and the match any
- single character operand. A character string is any series of
- adjacent characters that are not meta-characters (special
- characters). A character set is a string of characters enclosed in
- square brackets with an optional caret (^) as the first character
- following the open square bracket. The match any character operand
- matches any single character except the end of line character.
-
- A character string in a regular expression matches the exact string
- in the source, including case.
-
- Example of character strings:
-
- AirPlane - Matches the string 'AirPlane', but not the strings
- 'airPlane' or 'Airplane'.
-
- A character set will match any single character in the source if
- that character is a member of the set. If the first character
- of the set is the caret, the character set will match any
- character that is NOT a member of the set (including control
- characters!) except for NUL and LF.
-
- A character set can be described using ranges.
-
- Examples of character sets:
-
- [abcd] - Matches either a, b, c or d.
-
- [0-9] - Matches any decimal digit.
-
- [^a-z] - Matches any character that is not a lower
- case alphabetic.
-
- The match any character operand does just that, it matches any
- character. But it does not match the case of no character, NUL
- or LF.
-
- Example of match any character:
-
- . - Matches any character.
-
- Operators
- ---------
-
- UNPOST regular expressions also contain operators. The operators that
- UNPOST recognizes are the alternation operator, the span operators, the
- concatenation operator and the enumeration operators.
-
- The alternation operator has the lowest precedence of all the operators
- and its action is to attempt to match one of two alternatives.
-
- Example of alternation:
-
- Airplane|dirigible - Matches either the string Airplane or the string
- dirigible.
-
- The next higher precedence operator is the catenation operator. The
- catenation operator specifies that both the left and right hand
- regular expressions must match. The catenation operator does not
- have a special character, it is assumed to exist between two
- different operands that have no other operator between them.
-
- Example of catenation:
-
- [Aa]irplane - Matches either an 'A' or an 'a' followed by the string
- irplane. This is a catenation of the two regular
- expressions [Aa] and irplane.
-
- The next higher precedence operator is the enumeration operator.
- The enumeration operator specifies how many instances of a regular
- expression must be matched.
-
- Examples of Enumeration:
-
- abc* - Matches zero or more occurrences of the string abc.
- [A-Z]+ - Matches one or more occurrences of an upper case
- alphabetic character.
- [ ]? - Matches zero or one occurrence of the space character.
- very{1} - Matches one or more occurrences of the string very.
- b{1,3} - Matches a minimum of one to a maximum of three occurrences
- of the string b.
-
- An enumeration operator attempts to match the largest source sub-
- string possible, except in the case of the . (match any character)
- followed by an enumeration operator. In this case, the smallest
- possible sub-string is matched.
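-
- Python's re module can be used to see the difference between the two
- behaviours, with the caveat that it is only an analogy: in Python the
- shortest match has to be requested explicitly with .*?, whereas the
- text above says UNPOST applies it to an enumerated . automatically.
- The sample line is made up for this example.
-
```python
import re

line = "Subject: a.gif (1/2) b.gif (2/2)"

# An enumerated character set takes the longest run it can.
greedy = re.search(r"[a-z]+", "abc123").group(0)

# Shortest match, analogous to an enumerated . in UNPOST.
minimal = re.search(r"Subject: (.*?) \(", line).group(1)

# Longest match, which is what Python's plain .* does.
longest = re.search(r"Subject: (.*) \(", line).group(1)
```
-
- Here minimal stops at the first " (" and yields "a.gif", while longest
- runs on to the last " (" and yields "a.gif (1/2) b.gif".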
-
- The precedence of the operators can be modified with the use of
- parentheses. Parentheses have another meaning as well, described
- below.
-
- Example of parenthesis use:
-
- Death( defying|wish) - Will match either the string 'Death defying'
- or the string 'Deathwish'. Without the
- parentheses, the regular expression would
- match either the string 'Death defying'
- or the string 'wish'.
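-
- The same precedence rule holds in Python's re module, so the example
- can be checked with a small sketch. The matches() helper is a name
- chosen for this illustration.
-
```python
import re

grouped = re.compile(r"Death( defying|wish)")
ungrouped = re.compile(r"Death defying|wish")

def matches(pattern, text):
    """True if the whole of text is matched by pattern."""
    return pattern.fullmatch(text) is not None

results = {
    "grouped/Deathwish": matches(grouped, "Deathwish"),       # True
    "grouped/wish": matches(grouped, "wish"),                 # False
    "ungrouped/Deathwish": matches(ungrouped, "Deathwish"),   # False
    "ungrouped/wish": matches(ungrouped, "wish"),             # True
}
```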
-
- Sub Expressions
- ---------------
-
- UNPOST regular expressions are used primarily for identifying a
- particular line and extracting substrings from that line. To
- this end, UNPOST regular expressions support sub-expression
- marking. Subexpressions are marked by parentheses.
-
- To determine the sub-expression number of a sub-expression, scan
- the regular expression from left to right, counting left
- parentheses and starting the count at one. The count reached at a
- sub-expression's left parenthesis is its sub-expression number.
-
- Example:
-
- .*((abcd)(([0-9]+)/([0-9]+)))
-
- Sub-expression ((abcd)(([0-9]+)/([0-9]+))) is sub-expression #1.
- Sub-expression (abcd) is #2. Sub-expression (([0-9]+)/([0-9]+)) is #3.
- The first sub-expression ([0-9]+) is #4 and the second is #5.
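-
- The counting rule can be written down as a short Python sketch. The
- pattern is a Python rendering of the example, with the digit runs
- written as ([0-9]+), and the function name is an assumption. Python's
- re module numbers its groups by the same left parenthesis rule, which
- makes it a convenient cross-check.
-
```python
import re

def subexpression_numbers(pattern):
    """Map each sub-expression number to the offset of its '(' in the pattern.

    Simplification: a '(' inside a character set would also be counted.
    """
    numbers = {}
    count = 0
    i = 0
    while i < len(pattern):
        if pattern[i] == "\\":
            i += 2                 # skip an escaped character such as \(
            continue
        if pattern[i] == "(":
            count += 1
            numbers[count] = i
        i += 1
    return numbers

pattern = r".*((abcd)(([0-9]+)/([0-9]+)))"
numbers = subexpression_numbers(pattern)

# Cross-check against Python's own group numbering.
groups = re.match(pattern, "xxabcd12/34").groups()
```
-
- For the sample string "xxabcd12/34", groups holds "abcd12/34", "abcd",
- "12/34", "12" and "34", in sub-expression order #1 to #5.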
-
- Anchoring
- ---------
-
- Normally, a regular expression will match a sub-string anywhere in
- the source string. If you want to specify that the matching sub-string
- must start at the beginning of the source string, you may use a caret
- character as the first character of the regular expression. This
- anchors the regular expression match to the start of the line.
-
- To anchor a regular expression to the end of a line, use the dollar
- sign character. This effectively matches the end of line or end
- of string character.
-
- Anchor operators have a higher precedence than alternation, but lower
- than catenation.
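-
- Anchors behave the same way in Python's re module, which gives a quick
- way to experiment. The sample strings below are made up for this
- illustration.
-
```python
import re

# Without an anchor the pattern may match anywhere in the line.
unanchored = re.search(r"begin", "to begin with")

# With a leading caret the match must start at the start of the line.
at_start = re.search(r"^begin", "to begin with")

# With a trailing dollar sign the match must end at the end of the line.
at_end = re.search(r"end$", "end")
not_at_end = re.search(r"end$", "end of story")
```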
-
- Bugs:
-
- This program has been pretty extensively tested in interpretation mode,
- and it appears to be both robust and flexible.
-
- Unfortunately, about once a week, somebody comes up with a new and
- unusual way to encode the parts description on the Subject line.
-
- Author:
-
- John W. M. Stevens - jstevens@csn.org
-