Text File | 1993-02-07 | 24.7 KB | 650 lines

UNPOST

Name:

    unpost - Extract binary files from multi-segment uuencoded USENET
    postings or Email.

Synopsis:

    unpost -f[-] -d[-] -c config -e errors -t text -i incompletes [file]

Description:

    UNPOST is a tool designed primarily to extract binaries from USENET
    binaries postings such as those made to alt.binaries.pictures.misc
    and comp.binaries.ibm.pc.

    UNPOST can also extract binaries from multi-segment uuencoded
    mailings. To simplify this documentation, only USENET article
    postings are discussed; the principles are the same for
    multi-segment mailings.

    UNPOST assumes that the source file that is given to it will have
    the following format:

        SEGMENT begin line
        ...
        HEADER ID line
        ...
        BODY ID line
        ...
        UUENCODED line

    The lines are:

    SEGMENT begin line - The line that identifies the beginning of a
                         segment.

    HEADER ID line     - One or more lines that contain the segment
                         number, the total number of segments or the
                         ID string in the article or mail header.

    BODY ID line       - One or more lines that contain the segment
                         number, the total number of segments or the
                         ID string in the article or mail message
                         body.

    UUENCODED line     - The first uuencoded line in the file.
                         UUencoded lines include the begin and end
                         lines.

    ...                - Zero or more lines that can contain any
                         information, so long as they CANNOT be
                         misidentified as SEGMENT begin, ID or
                         UUENCODED lines.

    Notice that the ID information can be spread across multiple lines.

    A segment is assumed to end at the beginning of the next segment, or
    at the end of the source file. An UNPOST source file contains one or
    more segments.

    UNPOST has three different modes: interpretation mode, concatenation
    mode and UU decoder mode. In all three modes, UNPOST can accept one
    or more input files.
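The segment layout described above can be illustrated with a short Python sketch. This is not UNPOST's code. The two patterns below (an rn-style "Article N of" line or a mailbox "From " line as the SEGMENT begin line, and a conventional uuencode begin line) are assumptions made for the example, and the function names are hypothetical; the real patterns come from UNPOST's configuration.

```python
import re

# Assumed, illustrative patterns; UNPOST's real ones live in its configuration.
SEGMENT_BEGIN = re.compile(r"^(Article \d+ of |From )")
UU_BEGIN = re.compile(r"^begin [0-7]{3,4} \S+")


def split_segments(lines):
    """Group lines into segments; a segment ends where the next one begins."""
    segments = []
    for line in lines:
        if SEGMENT_BEGIN.match(line):
            segments.append([line])
        elif segments:
            segments[-1].append(line)
        # Lines before the first SEGMENT begin line belong to no segment.
    return segments


def find_begin(segment):
    """Return the index of the uuencode begin line in a segment, or None."""
    for index, line in enumerate(segment):
        if UU_BEGIN.match(line):
            return index
    return None


sample = [
    "Article 101 of alt.binaries.pictures.misc:",
    "Subject: picture.gif (Part 1/2)",
    "",
    "begin 644 picture.gif",
    "Article 102 of alt.binaries.pictures.misc:",
    "Subject: picture.gif (Part 2/2)",
    "",
    "end",
]
segments = split_segments(sample)
```

Here the sample splits into two segments, and only the first contains a uuencode begin line, which matches what the format above expects of a two-part posting.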
    In the first mode, interpretation mode, UNPOST looks at article
    header and body lines before the first UU encoded line, and attempts
    to extract three pieces of information from them: the segment
    number, the total number of segments that the binary was split into,
    and an ID string that is common to all segments.

    If UNPOST finds something that it considers to be an ID string, and
    a uuencoded line in the article, but it does not find a segment
    number and number of segments, UNPOST assumes that the article is a
    single segment binary posting (part 1 of 1).

    To aid in finding out what happened, in interpretation mode UNPOST
    writes a list of all the different ID strings and their respective
    segment lists to standard error or to the file specified as the
    error file (see the Standards section for details of what an ID
    string is). Any errors or warnings detected during processing are
    also written to standard error or the error file.

    In interpretation mode three other files can optionally be created.
    All three of these files will contain segments copied out of the
    source file, and none of these files will be created unless they are
    turned on and named by a command line switch.

    The first optional file is the text file (-t switch). All segments
    from the source file that do not contain uuencoded data are copied
    to it, with one exception: part 0/# type segments that do not
    contain uuencoded data are NOT copied to the text file.

    The second optional file is the description file (-d switch). Part
    0/# type segments are considered to be description segments, and
    they are copied to the description file only if the -d switch is
    turned on. Also, every binary posting that has all of its segments
    present has the segment header and body of segment #1 (up to and
    including the uuencode begin line) copied into the description file.

    The third optional file that can be created in interpretation mode
    is the incomplete or unused uuencode data segments file.
    This file contains all segments that have uuencoded data that were
    not used in a successful uudecoding. This file will only be created
    if the -i switch is present.

    The incompletes file allows the user to hand decode those binaries
    which could not be interpreted or decoded by UNPOST. Often, a binary
    will have all of its parts, but UNPOST will not be able to put them
    together because of differences in the ID string between segments,
    or problems with the part numbering information. The simplest way to
    solve these problems is to collect the incompletes, edit the ID
    lines to correct the problem, and rerun UNPOST on the incompletes
    file.

    In the second mode, concatenation mode, UNPOST assumes that all of
    the segments in the source file between a uuencode begin and a
    uuencode end line are part of one binary posting and that the
    segments are in order. UNPOST scans from the beginning of the file
    until it finds a uuencode begin line, and decodes from there
    (skipping over non-uuencoded lines such as article header lines and
    signatures) until it finds a uuencode end line.

    In the last mode, UU decoder mode, UNPOST assumes that the source
    file contains one or more UU encoded files. Only UU encoded lines
    are allowed between the uuencode begin line and the uuencode end
    line of any single uuencoded file.

Options:

    -c <file>   Read and use a different configuration than the default
                configuration. The default configuration is stored in a
                file called def.cfg.

    -d          Turns on description capturing and writes descriptions
                to a file that has the same name as the output but with
                a .inf extension. This defaults to off.

    -e <file>   Redirects error and information output from standard
                error to <file>.

    -f[-]       Modify file names to be MS-DOS compatible. Use of -f
                turns file name modification on if the default is off,
                and -f- turns file name modification off if the default
                is on. File name modification is currently the default.

    -h          Turns on full interpretation mode. This is the default.
    -i <file>   Turns on incomplete binaries capturing and writes the
                segments to <file>.

    -s          Switch to ordered segment mode. This mode ignores
                article headers, and assumes that the segments are in
                order.

    -t <file>   Turns on text only segment capturing and writes the
                segments to <file>.

    -u          Switch to uudecoder mode. Assume only uuencoded data
                between begin and end lines. Multiple uuencoded files
                are allowed.

    -?          Show a summary of the command line switches.

    It is important to realize that UNPOST

Standards:

    In all modes, UNPOST recognizes and decodes only uuencoded data.

    In interpretation mode UNPOST requires that:

    1)  The uuencoded lines be true uuencoded lines. This means that if
        trailing spaces are truncated by a mailer, editor or news node,
        UNPOST will not consider those lines to be uuencoded lines.
        Also, the uuencode character set recognized by UNPOST is
        ' ' - '`', with no other characters being legal.

    2)  All segments of the same binary file posting have the same,
        recognizable ID string.

    3)  Segments have a recognizable SEGMENT begin line as the first
        line in the segment (denoting the beginning of a segment).

    4)  All ID lines follow the SEGMENT begin line in the segment.

    5)  The first uuencoded line of the segment follows the last ID
        line.

    6)  The first uuencode line in the first segment be a begin line.

    7)  The last segment contain a uuencode end line.

    In sorted segment mode, UNPOST requires that:

    1)  The uuencoded lines be true uuencoded lines. This means that if
        trailing spaces are truncated by a mailer, editor or news node,
        UNPOST will not consider those lines to be uuencoded lines.
        Also, the uuencode character set recognized by UNPOST is
        ' ' - '`', with no other characters being legal.

    2)  The segments be stored in the file in order.

    3)  The first uuencode line in the first segment be a begin line.

    4)  The last segment contain a uuencode end line.
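Requirement 1 in the lists above can be made concrete with a small Python sketch. It is an illustration and not UNPOST's implementation. It assumes the conventional uuencode line layout, where the first character encodes the number of data bytes n and is followed by four characters for every three bytes, and it ignores begin and end lines. The function names are hypothetical.

```python
def expected_length(line):
    """Length a data line should have, derived from its leading count character."""
    n = (ord(line[0]) - 32) & 0x3F      # number of encoded data bytes
    return 1 + 4 * ((n + 2) // 3)       # count character + 4 chars per 3 bytes


def in_uu_charset(line):
    """True if every character lies in the range ' ' through '`'."""
    return all(" " <= ch <= "`" for ch in line)


def is_true_uu_line(line):
    """Strict check: legal characters and exactly the expected length."""
    return bool(line) and in_uu_charset(line) and len(line) == expected_length(line)


def repair_truncated(line):
    """Pad a line whose trailing spaces were stripped; None if it cannot be one."""
    if not line or not in_uu_charset(line) or len(line) > expected_length(line):
        return None
    return line.ljust(expected_length(line))


full = "!00  "       # the single byte 'A', encoded with trailing spaces
truncated = "!00"    # the same line after a mailer stripped the spaces
```

With these definitions, `is_true_uu_line(full)` holds and `is_true_uu_line(truncated)` does not, which is the strict behaviour described above. `repair_truncated(truncated)` restores the original line; padding of this kind is one plausible form of repair for a mode that tolerates truncated lines.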
    In uudecoder mode, UNPOST requires that:

    1)  There be only uuencoded lines between a uuencode begin and a
        uuencode end line. In this mode, UNPOST will recognize and
        attempt to repair lines that had trailing spaces truncated.

Examples:

    To extract a single binary that had all of its segments saved in
    order to a single file:

        unpost -s binary.uue

    To extract all binaries that have had all of their segments saved to
    a single file:

        unpost multiple.uue 2> errors

    Or

        unpost -e errors multiple.uue

    The file errors will contain a list of all the ID strings that
    UNPOST found and thought could have been binary files, and any
    errors that occurred during processing.

    To capture the incomplete or unused segments that have uuencoded
    data in them:

        unpost -e errors -i multiple.inc multiple.uue

    To capture descriptions and text only segments as well:

        unpost -d -e errors -t text -i multiple.inc multiple.uue

    To process two different files, one in uudecoder mode, one in
    interpretation mode:

        unpost -e errors -u uuencode.uue -h multiple.uue

    To process a file that requires a different configuration (where
    <file> is the name of the configuration file):

        unpost -c <file> -e errors multiple.uue

Notes:

    To use this program to collect all of the binaries posted to, say,
    the alt.binaries.misc group on a daily basis, start up rn, go to the
    alt.binaries.misc newsgroup, and save all of the unread articles by
    using this command:

        .-$smisc.uue:j

    This will save all articles from the current number to the last to
    the file misc.uue, then junk them.

    After exiting rn, run UNPOST on the file misc.uue in interpretation
    mode (the default mode):

        unpost -e errors -i misc.1 misc.uue

    Make sure to check the errors and/or misc.1 file for segments that
    UNPOST couldn't extract.

Diagnostics:

    Error - file 'filename' already exists.

        UNPOST will not overwrite an existing file. Delete the file or
        rename it and try again.

    Error - missing begin line.

        UNPOST expected to find a uuencode begin line in this segment,
        but did not.

    Error - Could not open description file 'filename' for writing.

        UNPOST could not open a file of that name for some reason.
        Possibly a permission problem, or the file exists and is not
        writeable.

    Error - Bad write to binary file.

        A file write failed for some unknown reason. Possibly a full
        disk?

    Error - missing segment #
    Binary ID: 'binaryID'

        In attempting to decode a file whose ID string is binaryID, one
        or more segments are missing.

    Error - Missing UU end line.

        As this is the last segment, it ought to have a uuencode end
        line in it, but UNPOST did not find one.

    Warning - Early uuencode end line.

        UNPOST found a uuencode end line, but this was not the last
        segment, so we found it early. Did the poster screw up and
        misnumber his segments?

    Error - Unexpected UU begin line.

        We found an unexpected (read: this is not the first line of the
        first segment, so what is this doing here?) UU begin line.

    Error - cannot identify string '' in line #

        In reading in a configuration file, the configuration file
        lexical analyzer could not recognize this string.

    Error - Out of memory.

        Yup. Out of memory. Split the source file into smaller pieces
        and try again.

    Error - Could not modify file name to be MS-DOS conformant.

        File name munging is turned on, and the name of one of the
        files cannot be made conformant (probably due to having too
        many numbers in it).

    Warning - Unexpected end of file in segment:
    Segment: 'segment line'

        File name munging is turned on, and UNPOST is attempting to
        identify the file type (so it can use the proper extension when
        modifying the file name) but the UU begin line was the last
        line in the file.

    Warning - No UU line after begin.
    Segment: 'segment line'

        File name munging is turned on, and UNPOST is attempting to
        identify the file type (so it can use the proper extension when
        modifying the file name) but the UU begin line was not followed
        by a line of UU encoded binary data.

    Error - Got number of segments but not segment number.
    Error - Got segment number but not number of segments.

        UNPOST must have all three pieces of relevant data, but if
        UNPOST has at least an ID string, UNPOST will attempt to assume
        a one part binary.

    Error - Could not get ID string.

        Fatal error. With no ID string, there is no way to collect the
        pieces together.

    Error - No begin line in first segment:
    Segment: 'segment line'

        UNPOST did not find a UU begin line in the first segment.

    Error - missing '}' in regular expression.

        In a regular expression of the type abc{1, 2}, the closing
        curly brace is missing.

    Error - To many sub-expressions.

        UNPOST has a limit on the number of sub-expressions it allows.
        This is a compile time option that can be changed by modifying
        the value of MAX_SUB_EXPRS in regexp.h.

    Error - missing ')' in regular expression.

        Mismatched parentheses.

    Error - badly formed regular expression.
    Unexpected character 'c'

        I give up! What is this character doing at this point in a
        regular expression?

    Error, can not enumerate a sub expression.

        Regular expressions of the type (...)* are not allowed.

    Error - illegal regular expression node type.

        Whoops, we have an internal programmer's error here. Let me
        know if you see this.

    Error - Sub expression # extraction failed.

        Another internal error that needs to be brought to my
        attention.

    Error - could not open file 'filename' for reading.

        UNPOST could not open file 'filename' for processing. Did you
        spell it right?

    Error - Unexpected end of file.

    Error - Unexpected UU begin line.

    Error - Segment number # greater than number of segments in:
    Segment: 'segment line'

        Either UNPOST got screwed up somehow or the poster posted
        something like (Part 10/9).

    Warning - duplicate segment # in:
    Binary ID: 'binaryID'

        UNPOST found two segments with the same binary ID and the same
        segment number.

    Error - reading source file.

        Could not read a line from the source file.

    Error - Could not open file 'filename' for output.

        Could not open one of the text, incomplete or error files for
        writing.
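The segment numbering diagnostics in this section (missing segment, duplicate segment, segment number greater than the number of segments) can be related to each other with a short Python sketch. This is a hypothetical reconstruction of the bookkeeping and not UNPOST's code: the function name, the exact message wording and the decision to skip part 0 (which the Description section treats as a description segment) are assumptions made for the example.

```python
def check_segments(binary_id, seen, total):
    """Return diagnostic messages for one binary.

    seen  - segment numbers in the order they were found in the source file
    total - total number of segments announced for this binary
    """
    messages = []
    counted = set()
    for number in seen:
        if number == 0:
            continue  # part 0/# is treated as a description segment here
        if number > total:
            messages.append(
                f"Error - Segment number {number} greater than number of segments")
        elif number in counted:
            messages.append(
                f"Warning - duplicate segment {number} in: Binary ID: '{binary_id}'")
        else:
            counted.add(number)
    for number in range(1, total + 1):
        if number not in counted:
            messages.append(
                f"Error - missing segment {number} Binary ID: '{binary_id}'")
    return messages


report = check_segments("pic.gif", [1, 3, 3, 6], 4)
```

In this example the report contains one duplicate warning for segment 3, one out-of-range error for segment 6, and missing segment errors for segments 2 and 4. A complete posting such as `check_segments("pic.gif", [2, 1], 2)` produces no messages, regardless of the order in which the segments were found.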
Configuration:

    Ok, here's how to configure UNPOST to work for you.

    UNPOST relies heavily on regular expressions. These regular
    expressions may not be correct for your news reader or system.
    There are five classes of regular expressions:

    1)  The SEGMENT begin line regular expression.

    2)  The ID line prefix regular expression.

    3)  The ID line with part description regular expression list.

    4)  The begin line regular expression.

    5)  The end line regular expression.

    Of these five, I don't expect you to have to modify the regular
    expressions for handling begin and end lines, because they should be
    correct for all uuencoders that follow the standard format.

    Be aware that UNPOST has a hierarchy of regular expressions. Each
    SEGMENT begin line regular expression has underneath it two lists of
    regular expressions that recognize ID line prefixes, and each
    element in the list of ID line prefix regular expressions has a list
    under it that attempts to parse the ID line. The two lists are for
    1) the header and 2) the body.

    The ID line prefix regular expression exists for the sake of
    efficiency. It is used to find an ID line before we attempt to parse
    it. Modify or add one of these if you wish to change whether or not
    a line is recognized by UNPOST as being an ID line. If you modify
    this, you must modify the list of segment description regular
    expressions to match.

    The SEGMENT begin line regular expressions are used to find the
    beginning of a SEGMENT, or the end of a previous segment. Modify
    these to change the line or lines that UNPOST recognizes as the
    beginning of a segment.

    If you get an error message that indicates that the Subject line has
    no identifiable part description, and you see that some bright
    boy/girl has come up with a brand new part description format, then
    you have two choices: modify the source and hope they don't post
    again, or add a new ID line regular expression to the list of ID
    line regular expressions in the segment.c source file.
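The hierarchy described above (a SEGMENT begin pattern, with header and body lists of ID line prefix patterns underneath it, each prefix carrying its own list of part description patterns) can be pictured with the following Python sketch. Everything in it is hypothetical: the patterns, the dictionary layout and the function name are invented for illustration, and the patterns use Python `re` syntax, which differs from UNPOST's own regular expression syntax.

```python
import re

# Hypothetical configuration mirroring the hierarchy; not taken from def.cfg.
CONFIG = {
    "segment_begin": re.compile(r"^Article \d+ of "),
    "header": [
        # (ID line prefix pattern, [part description patterns, tried in order])
        (re.compile(r"^Subject:"), [
            re.compile(r"^Subject: (?P<id>.+?) \(?[Pp]art "
                       r"(?P<seg>\d+)/(?P<total>\d+)\)?"),
            re.compile(r"^Subject: (?P<id>.+?) (?P<seg>\d+) of (?P<total>\d+)"),
        ]),
    ],
    "body": [],
}


def parse_id_line(line, prefix_list):
    """Return (ID string, segment number, total segments) or None."""
    for prefix, parsers in prefix_list:
        if not prefix.search(line):
            continue  # cheap rejection before the costlier parse attempts
        for parser in parsers:
            match = parser.search(line)
            if match:
                return (match.group("id"),
                        int(match.group("seg")),
                        int(match.group("total")))
    return None


first = parse_id_line("Subject: picture.gif (Part 2/5)", CONFIG["header"])
second = parse_id_line("Subject: picture.gif 3 of 7", CONFIG["header"])
```

The prefix test plays the efficiency role the text assigns to the ID line prefix regular expression: a line that does not look like an ID line is rejected before any of the part description patterns are tried.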
    Be aware that the lists of regular expressions are searched in order
    from top to bottom to find a match. This means that less specific
    regular expressions should be placed later in the list. For example,
    the regular expression '$(0-9)+/(0-9)+$' should come before the
    regular expression '(0-9)+ (0-9)+' in the part syntax parsing
    regular expression list. This reduces the number of misparses that
    occur.

    Remember that C uses the backslash (\) as an escape character in
    strings, so to put a backslash into a regular expression you need to
    put two into the C source string.

    All regular expressions can be found at the top of the parse.c
    source file.

    Before you modify the actual source code and recompile, I strongly
    suggest that you compile the regular expression test harness and
    test your new regular expression. Then, when you are sure that it is
    correct, copy the def.cfg file to a new name, make your changes
    there and use that configuration file for a while. If after all
    this you are sure that it works, go for it.

    Before you add or modify a regular expression, you have to know the
    syntax of the regular expressions used in this program. The syntax
    is very similar to that used by UN*X style regular expressions, but
    is not exactly the same. See the section titled Regular Expressions
    before attempting to configure UNPOST.

Regular Expressions:

    Operands
    --------

    UNPOST regular expressions have three types of operands: character
    strings (one or more characters), character sets and match any
    single character.

    A character string is any series of adjacent characters that are
    not meta-characters (special characters).

    A character set is a string of characters enclosed in square
    brackets with an optional caret (^) as the first character following
    the open square bracket.

    The match any character operand matches any single character except
    the end of line character.

    A character string in a regular expression matches the exact string
    in the source, including case.

    Example of character strings:

        AirPlane  - Matches the string 'AirPlane', but not the strings
                    'airPlane' or 'Airplane'.

    A character set will match any single character in the source if
    that character is a member of the set. If the first character of
    the set is the caret, the character set will match any character
    that is NOT a member of the set (including control characters!)
    except for NUL and LF. A character set can be described using
    ranges.

    Examples of character sets:

        [abcd]    - Matches either a, b, c or d.

        [0-9]     - Matches any decimal digit.

        [^a-z]    - Matches any character that is not a lower case
                    alphabetic.

    The match any character operand does just that: it matches any
    character. It does not match the case of no character, NUL or LF.

    Example of match any character:

        .         - Matches any character.

    Operators
    ---------

    UNPOST regular expressions also contain operators. The operators
    that UNPOST recognizes are the alternation operator, the span
    operators, the concatenation operator and the enumeration
    operators.

    The alternation operator has the lowest precedence of all the
    operators and its action is to attempt to match one of two
    alternatives.

    Example of alternation:

        Airplane|dirigible  - Matches either the string Airplane or
                              the string dirigible.

    The next higher precedence operator is the catenation operator. The
    catenation operator specifies that both the left and right hand
    regular expressions must match. The catenation operator does not
    have a special character; it is assumed to exist between two
    different operands that have no other operator between them.

    Example of catenation:

        [Aa]irplane  - Matches either an 'A' or an 'a' followed by the
                       string irplane. This is a catenation of the two
                       regular expressions [Aa] and irplane.

    The next higher precedence operator is the enumeration operator.
    The enumeration operator specifies how many instances of a regular
    expression must be matched.

    Examples of enumeration:

        abc*      - Matches zero or more occurrences of the string abc.

        [A-Z]+    - Matches one or more occurrences of an upper case
                    alphabetic character.

        [ ]?      - Matches zero or one occurrences of the space
                    character.

        very{1}   - Matches one or more occurrences of the string very.

        b{1,3}    - Matches a minimum of one to a maximum of three
                    occurrences of the string b.

    An enumeration operator attempts to match the largest source
    sub-string possible, except in the case of the . (match any
    character) followed by an enumeration operator. In this case, the
    smallest possible sub-string is matched.

    The precedence of the operators can be modified with the use of
    parentheses. Parentheses have another meaning as well, described
    below.

    Example of parenthesis use:

        Death( defying|wish)  - Will match either the string
                                'Death defying' or the string
                                'Deathwish'.

    Without the parentheses, the regular expression would match either
    the string 'Death defying' or the string 'wish'.

    Sub Expressions
    ---------------

    UNPOST regular expressions are used primarily for identifying a
    particular line and extracting sub-strings from that line. To this
    end, UNPOST regular expressions support sub-expression marking.
    Sub-expressions are marked by parentheses.

    To determine the sub-expression number of a sub-expression, scan the
    regular expression from left to right, counting the left
    parentheses and starting with one. The count reached at the left
    parenthesis that opens a sub-expression is its sub-expression
    number.

    Example:

        .*((abcd)((0-9)+/(0-9)+))

        Sub-expression ((abcd)((0-9)+/(0-9)+)) is sub-expression #1.
        Sub-expression (abcd) is #2.
        Sub-expression ((0-9)+/(0-9)+) is #3.
        Sub-expression (0-9)+ is #4.
        Sub-expression (0-9)+ is #5.

    Anchoring
    ---------

    Normally, a regular expression will match a sub-string anywhere in
    the source string. If you want to specify that the matching
    sub-string must start at the beginning of the source string, you
    may use a caret character as the first character of the regular
    expression. This anchors the regular expression match to the start
    of the line.

    To anchor a regular expression to the end of a line, use the dollar
    sign character. This effectively matches the end of line or end of
    string character.

    Anchor operators have a higher precedence than alternation, but
    lower than catenation.

Bugs:

    This program has been pretty extensively tested in interpretation
    mode, and it appears to be both robust and flexible. Unfortunately,
    about once a week, somebody comes up with a new and unusual way to
    encode the parts description on the Subject line.

Author:

    John W. M. Stevens - jstevens@csn.org