- Path: sparky!uunet!dtix!darwin.sura.net!jvnc.net!nuscc!iti.gov.sg!news
- From: kaykit@iti.gov.sg (Chan Kay Kit (KSL))
- Newsgroups: comp.ai
- Subject: A pattern recognition problem
- Message-ID: <1992Nov19.111439.25445@iti.gov.sg>
- Date: 19 Nov 92 11:14:39 GMT
- Sender: news@iti.gov.sg (News Admin)
- Reply-To: kaykit@iti.gov.sg
- Organization: Information Technology Institute, National Computer Board, S'pore
- Lines: 75
-
-
- Hello,
-
- I have an interesting problem on hand that I feel should have
- been solved before, but I have been unable to find any literature on it. Hence
- I hope to seek some pointers by airing the problem in this newsgroup.
-
- Say there are a few types of printed documents each with its own
- different format and layout. These documents are simple enough to be
- segmented into rectangles containing pure text (assuming the documents are
- clean and clearly printed). All rectangles so segmented are either upright
- or lie flat, i.e. the text is not tilted.
-
- The formats have been restricted such that the spatial
- relationships between all the rectangles are sufficient to distinguish
- the different types of documents. The size of individual rectangles is
- of secondary importance, as the width or height of some of them can vary even
- within a single format. For a particular format, it is known in advance
- which rectangles are of variable height/width.
-
- Spatial relationships refer to the type of overlap between any
- two rectangles in both the horizontal and the vertical direction, e.g. for
- the horizontal direction, some possible overlap types are
-
-
- _________________
- _______
-
-
- and
-
- ________________
- _______
-
- and
-
- _____________
- ____________
-
-
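The overlap types described above can be made concrete with a short sketch. It assumes (this is not stated in the post) that the overlap types are labelled with the thirteen relations of Allen's interval algebra, applied once to the horizontal extents and once to the vertical extents of each rectangle pair. The names `interval_relation`, `rect_relation`, `pairwise_relations`, the `(x0, y0, x1, y1)` tuple layout and the `tol` parameter are all hypothetical choices made for illustration.

```python
def interval_relation(a, b, tol=0):
    """Classify how interval a = (a0, a1) lies relative to b = (b0, b1).

    Returns one of the 13 Allen interval relations as a string.
    Coordinates closer than `tol` are treated as equal, which gives a
    little slack for scanning and segmentation noise (an assumption).
    """
    a0, a1 = a
    b0, b1 = b

    def eq(x, y):
        return abs(x - y) <= tol

    def lt(x, y):
        return x < y - tol

    if lt(a1, b0):
        return "before"
    if lt(b1, a0):
        return "after"
    if eq(a1, b0):
        return "meets"
    if eq(b1, a0):
        return "met_by"
    starts_equal = eq(a0, b0)
    ends_equal = eq(a1, b1)
    if starts_equal and ends_equal:
        return "equals"
    if starts_equal:
        return "starts" if lt(a1, b1) else "started_by"
    if ends_equal:
        return "finishes" if lt(b0, a0) else "finished_by"
    if lt(b0, a0) and lt(a1, b1):
        return "during"
    if lt(a0, b0) and lt(b1, a1):
        return "contains"
    return "overlaps" if lt(a0, b0) else "overlapped_by"


def rect_relation(r, s, tol=0):
    """Return the (horizontal, vertical) relation pair for two rectangles.

    Rectangles are (x0, y0, x1, y1) tuples with x0 < x1 and y0 < y1.
    """
    horizontal = interval_relation((r[0], r[2]), (s[0], s[2]), tol)
    vertical = interval_relation((r[1], r[3]), (s[1], s[3]), tol)
    return horizontal, vertical


def pairwise_relations(rects, tol=0):
    """Relation pair for every ordered pair (i, j), i != j, of rectangles."""
    return {
        (i, j): rect_relation(rects[i], rects[j], tol)
        for i in range(len(rects))
        for j in range(len(rects))
        if i != j
    }
```

For a variable-size rectangle, one could store a set of admissible relations per pair instead of a single label, so that a change in width or height does not count as a mismatch.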
-
- The effect of variable size rectangles is that, for some rectangle-pairs, the
- overlap types can vary from one document to the next for a single document
- type.
-
- For each type of document, I have scanned in a typical sample,
- segmented it into rectangles and computed the spatial relationships between
- all pairs of rectangles.
-
- THE PROBLEM IS THIS: given a new document, how do I match its pattern of
- rectangles to the database of known formats to determine its type? The algorithm
- must be robust enough to handle noise in this new document which can corrupt the
- pattern of rectangles. It should also be able to output a similarity measure for
- the document type(s) that it has chosen so that one can reject the document as
- being foreign if one deems the similarity to be too low.
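
One naive baseline for the matching problem stated above, offered only as a sketch and not as a known solution: try every assignment of template rectangles to document rectangles and score each by the fraction of template pairs whose observed relation is admissible. Everything here is an assumption made for illustration: the names `best_match` and `classify`, the opaque string labels for relations, the per-pair sets of admissible labels (to absorb variable-size rectangles), and the 0.8 threshold. The search is factorial in the number of rectangles, so it is only practical for small layouts.

```python
from itertools import permutations


def best_match(template, n_slots, observed, n_rects):
    """Find the assignment of template slots to document rectangles
    that satisfies the most template constraints.

    template: dict mapping a slot pair (p, q) to a set of admissible labels
    observed: dict mapping an ordered rectangle pair (i, j) to its label
    Returns (score, assignment), where score is in [0, 1] and
    assignment[p] is the document rectangle chosen for slot p.
    """
    if n_rects < n_slots or not template:
        # Crude handling: a document with missing rectangles scores zero.
        return 0.0, None
    best_score, best_assign = -1.0, None
    for assign in permutations(range(n_rects), n_slots):
        hits = sum(
            1
            for (p, q), allowed in template.items()
            if observed.get((assign[p], assign[q])) in allowed
        )
        score = hits / len(template)
        if score > best_score:
            best_score, best_assign = score, assign
    return best_score, best_assign


def classify(templates, observed, n_rects, threshold=0.8):
    """Return (type_name, score), or (None, score) if the best score
    falls below the rejection threshold.

    templates: dict mapping a type name to (n_slots, template)
    """
    best_name, best_score = None, 0.0
    for name, (n_slots, template) in templates.items():
        score, _ = best_match(template, n_slots, observed, n_rects)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score
```

The score doubles as the similarity measure: a document whose best score stays below the threshold is reported as foreign. Extra rectangles produced by noise are tolerated because only `n_slots` of them are assigned, but missing rectangles are not, which a more careful version would need to address.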
-
- At first glance, I thought syntactic pattern recognition or case-based
- reasoning might be a good way to solve the problem. Are there any pattern
- recognition gurus out there who can offer suggestions/ftpable code/references
- for an accurate and efficient solution? If so, please email me at kaykit@iti.gov.sg.
- If there is sufficient interest, I will summarise all replies to the net.
-
- Thanks a million!
-
- (Sorry for occupying so much bandwidth)
-
-
-
-
-
- ---
- Kay-Kit CHAN | Internet: kaykit@iti.gov.sg
- Knowledge Systems Lab | Bitnet: kaykit@itivax
- Information Technology Institute | Tel: (65) 772-0920
- National Computer Board of Singapore | Fax: (65) 770-3043
-