Replace Pattern & Pattern Insert Filters

Overview

Pattern Design

Pattern Search Examples

The Editor

Replace Pattern vs. Pattern Insert

Real World Examples

Advanced Applications

Current Limitations

Overview

This chapter explains how to design new filters based on the Replace Pattern and Pattern Insert filter types. These are pattern matching filters, and they are currently the most flexible filters available within TextSpresso. Although not quite as flexible as regular expressions (which we hope to add soon), these pattern filters are very fast and are suitable for the majority of pattern matching tasks you are likely to encounter.

This chapter assumes that you already know how to create new filters and open filter editors, and that you know how to use the elements common to all filter editors. If not, please read Editing Filters before proceeding.

Pattern Design

A Pattern filter doesn't look for a simple fixed-width string within a text document. Instead it looks for a pattern within the text: a repetition or sequence of characters which fits a series of parameters expected by the filter. You define the pattern of a Pattern filter by defining one or more pattern parts. Each part specifies a small piece of the pattern being searched for, and the parts are listed in the order in which they must occur. When a Pattern filter searches for a pattern, it looks for the first part. Once it finds a match, it checks whether each of the following parts occurs, in order, in the text after the first part. If they do, it has found a match, and the selection is the range of text which matches all of the parts. If not, it starts looking again for the first part.

Each part in a pattern is composed of a character table, a match boolean, a minimum number, and a maximum number. The character table lists the characters which are to be matched or not matched. The match boolean specifies whether the character table is to be matched or not matched. The minimum number specifies the minimum number of characters in a row which need to match to satisfy the part, and the maximum number specifies the maximum which can match to satisfy the part.

In the editors, parts are listed in the following form: ( table match? min max ). This is also how parts will be listed in this manual.
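As a rough illustration of the four fields, a part could be modeled as a small record. The sketch below is written in Python, which is not part of TextSpresso; the class name `Part` and its field names are hypothetical and were chosen only to mirror the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    """Hypothetical model of one pattern part (not TextSpresso's own data type)."""
    table: str      # the character table
    match: bool     # True: character must be in the table; False: must not be
    minimum: int    # minimum run length; 0 makes the part optional
    maximum: int    # maximum run length; 0 means no upper limit

    def __str__(self):
        # Render the part the way this manual lists it: ( table match? min max )
        flag = "TRUE" if self.match else "FALSE"
        return f'( "{self.table}" {flag} {self.minimum} {self.maximum} )'


print(Part("Cc", True, 1, 1))
```

Printing a part this way gives the same listing form used in the examples that follow.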

To better understand how patterns work let's look at a few examples on paper. Then we'll go over how to actually enter and edit patterns using the Replace Pattern and Pattern Insert editors.

Pattern Search Examples

Let's pretend for a moment that you want to find the string 'cat' in some text using a Replace Pattern filter. To understand how to design the filter you must first break your find string down into a series of parts, a pattern. In this case the pattern is very simple, a 'c' followed by an 'a' followed by a 't': c | a | t

Now that you have broken the find string into parts, you can create a Replace Pattern part which corresponds to each part in your find string. The first thing you're looking for is a 'c', so your character table consists of the letter 'c'. You are looking for a 'c', so the match boolean would be true. You're looking for exactly one 'c', so you would enter 1 for both the minimum and maximum numbers. If you repeat this process for each part of your find string, you end up with this pattern:

( "c" TRUE 1 1 )
( "a" TRUE 1 1 )
( "t" TRUE 1 1 )

This pattern will first look for one 'c'. When it finds it, it will check to see if one 'a' occurs after the 'c'. If it finds that, it will look for one 't' after the 'a', and if that's present then the word "cat" has been found.
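The search just described can be sketched in a few lines of Python. This is only an illustration, not TextSpresso's implementation: it assumes that each part consumes as many characters as it can (up to its maximum) and that the filter never backs up to retry a shorter run. The function names `match_at` and `find_all` are made up for this sketch, and each part is written as a (table, match, min, max) tuple.

```python
def match_at(text, start, parts):
    """Return the end position if every part matches in order from start, else None."""
    pos = start
    for table, match, minimum, maximum in parts:
        # A maximum of 0 means "no upper limit".
        limit = len(text) if maximum == 0 else min(len(text), pos + maximum)
        end = pos
        while end < limit and (text[end] in table) == match:
            end += 1
        if end - pos < minimum:
            return None          # this part is not satisfied here
        pos = end
    return pos


def find_all(text, parts):
    """Collect every matched range, scanning left to right."""
    found, i = [], 0
    while i < len(text):
        end = match_at(text, i, parts)
        if end is not None and end > i:
            found.append(text[i:end])
            i = end              # continue after the match
        else:
            i += 1               # start looking again for the first part
    return found


CAT = [("c", True, 1, 1), ("a", True, 1, 1), ("t", True, 1, 1)]
print(find_all("A cat, a Cat, and a cart.", CAT))
```

With this text the sketch finds only the lowercase 'cat': 'Cat' fails on the first part and 'cart' fails on the third.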

Now let's pretend that you want to find both "cat" and "Cat". How would you modify the above pattern to find that? Break down the word again into parts. Now you're looking for a 'c' or a 'C', followed by an 'a', followed by a 't'. Cc | a | t

( "Cc" TRUE 1 1 )
( "a" TRUE 1 1 )
( "t" TRUE 1 1 )

The above pattern will look for either one 'c' or one 'C', and upon finding either will look for the remaining parts. As you can see, if a character matches any character in the table, it is considered a match for that part.

Now let's pretend that the 'a' key has a tendency to stick on your keyboard, so you want to find all misspellings of the word 'cat' where you accidentally typed more than one 'a'. Before, you were looking for one 'a' in the middle of your pattern; now you're looking for more than one. You would specify that by entering 2 for the minimum number and, in this case, 0 for the maximum number.

Note: Entering 0 for the maximum number is equivalent to "infinity".

( "Cc" TRUE 1 1 )
( "a" TRUE 2 0 )
( "t" TRUE 1 1 )

The above pattern will skip over 'cat' because it only has one a, but will find 'caat', 'caaat', and 'caaaaaaat'.

Finally, let's pretend that the 't' key on your keyboard doesn't work well. So your misspellings may include one t or no t's. How do you find that? Well, currently the pattern is looking for 1 't', so you need to modify it to look for 0 or 1 't'. You would enter 0 for the minimum number, and 1 for the maximum.

Note: Entering 0 for the minimum number is equivalent to saying "this part is optional".

( "Cc" TRUE 1 1 )
( "a" TRUE 2 0 )
( "t" TRUE 0 1 )

Now your pattern will match 'caat', 'caaat', 'caa', 'caaa', etc.
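To make the two conventions for 0 concrete, here is a small Python sketch of how a single part might consume a run of characters. It is an assumption about the behavior described above, not TextSpresso's code, and the function name `consume` is hypothetical.

```python
def consume(text, pos, table, match, minimum, maximum):
    """Return the position after this part's run, or None if the run is too short."""
    # maximum == 0 is treated as "infinity": the run may extend to the end of the text.
    limit = len(text) if maximum == 0 else min(len(text), pos + maximum)
    end = pos
    while end < limit and (text[end] in table) == match:
        end += 1
    # minimum == 0 makes the part optional: an empty run still satisfies it.
    return end if end - pos >= minimum else None


# ( "a" TRUE 2 0 ) applied just after the 'c':
print(consume("caaat", 1, "a", True, 2, 0))   # run of three 'a' characters
print(consume("cat", 1, "a", True, 2, 0))     # only one 'a', so the part fails

# ( "t" TRUE 0 1 ) applied after the run of 'a' characters:
print(consume("caat", 3, "t", True, 0, 1))    # the 't' is present
print(consume("caa", 3, "t", True, 0, 1))     # no 't', but the part is optional
```

A result of None means the part was not satisfied; any other result is the position where the next part would begin.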

These are very simple examples, but they illustrate how the pattern parts match up to real text. There's one important feature of the part you haven't explored yet, the match boolean. So far it has always been set to TRUE. This tells the Replace Pattern filter to take the character it's looking at and see if it's in the table, and if it is in the table, it's a match.

But what if you want to look for every occurrence of a string * | a | t where '*' is any character but 'c' and 'C'? You could type out the entire Mac ASCII set except 'c' and 'C' for the character table, but that would take too long. Instead, set the match boolean to FALSE. This tells the Replace Pattern filter to take the character it's looking at and see if it's in the table, and if it's not in the table, it's a match.

( "Cc" FALSE 1 1 )
( "a" TRUE 2 0 )
( "t" TRUE 0 1 )

This will find 'baat', 'zaa', 'maat', '!aaaat', etc., but not 'caat', 'Caa', 'caaaat', etc.
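The effect of the match boolean on a single character comes down to one comparison. The Python sketch below is only an illustration; the function name `char_matches` is hypothetical.

```python
def char_matches(ch, table, match):
    """TRUE parts accept characters in the table; FALSE parts accept everything else."""
    return (ch in table) == match


# First character of each example word, tested against ( "Cc" FALSE 1 1 ):
for word in ["baat", "zaa", "maat", "!aaaat", "caat", "Caa"]:
    print(word, char_matches(word[0], "Cc", False))
```

The first four words pass the first part because their first character is not in the table; 'caat' and 'Caa' are rejected.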

Now that you've seen how Replace Pattern patterns work on paper, let's look at how you enter them into the editor.

The Editor

The Pattern editors have three fields for editing the individual parts of the pattern. They are named Table, Min, and Max, and you can guess what they correspond to. There's also a check box titled Match?

In a Replace Pattern editor there is a field for entering your replace string, and it's exactly the same as the field in the Replace Text editor. Leave it empty to delete found occurrences.

In a Pattern Insert editor there are two fields for entering the text to insert before and after the pattern. If you only want to insert one, leave the other blank.

Below the replace string field is a scrolling list which lists the pattern parts in the order in which they must occur. To the left of this list are three buttons: Add, Save, and Delete.

Adding A New Part: To enter a new pattern part:

  1. Enter the character table, minimum number, and maximum number in the appropriate fields and check or uncheck the match boolean check box.
  2. Click the Add button.

Editing Parts: To edit an existing pattern part:

  1. Optional: Double-click the part in the scroll list. It will be entered into the part fields.
  2. Edit the part fields as necessary.
  3. Make sure the part to be edited is selected.
  4. Click the Store button.

Deleting Parts: To delete an existing part or parts:

  1. Select the part or parts to delete.
  2. Click the Delete button.

Changing The Part Order: To rearrange the parts into a new order, simply drag and drop the parts into their new positions in the scroll list.

Replace Pattern vs. Pattern Insert

The Replace Pattern and Pattern Insert filters search in an identical manner, but they differ in what they do after a search. The Replace Pattern filter replaces the found text with the text in the Replace field, with optional wild cards (a wild card in the replace string leaves the character at the same position in the found text unchanged). The Pattern Insert filter, however, inserts text before and/or after the found text without changing the found text in any way. You enter the text to be inserted in the Before and After fields.
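The difference between the two filter types can be sketched in Python as two functions that act on the same search results. This is an illustration only: the function names are hypothetical, the found ranges are supplied by hand as (start, end) positions rather than produced by a real search, and wild cards in the replace string are not modeled.

```python
def replace_pattern(text, spans, replacement):
    """Swap each found range for the replacement string."""
    out, last = [], 0
    for start, end in spans:
        out += [text[last:start], replacement]
        last = end
    return "".join(out) + text[last:]


def pattern_insert(text, spans, before, after):
    """Leave each found range untouched and wrap it with the before/after text."""
    out, last = [], 0
    for start, end in spans:
        out += [text[last:start], before, text[start:end], after]
        last = end
    return "".join(out) + text[last:]


text = "a caat and a caaat"
spans = [(2, 6), (13, 18)]          # the ranges covering 'caat' and 'caaat'

print(replace_pattern(text, spans, "cat"))      # a cat and a cat
print(pattern_insert(text, spans, "[", "]"))    # a [caat] and a [caaat]
```

Passing an empty replacement string to the first function deletes the found text, and passing an empty string for either argument of the second inserts on one side only.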

Real World Examples

(Under Construction)

Let's consider some real world examples of the Replace Pattern filter in action.

Say that you wanted to create a filter to strip HTML tags from a document. (This filter is already included with TextSpresso, but let's build it for the sake of instruction.) How would you do that?

First you need to break down what you're looking for. An HTML tag always starts with a '<' character, then follows with any number of various characters, and then always ends with a '>' character. The parts are: < | (some text) | >

Well you know how to write the first and last parts of the pattern, but what about the middle? It's a variable range of text which can contain a large number of variable characters. But what it cannot contain is the '>' character. That indicates the end of the tag. So the parts in the Replace Pattern filter would correspond to the parts of the pattern like this:

( "<" TRUE 1 1 )
( ">" FALSE 0 0 )
( ">" TRUE 1 1 )

First the filter looks for one '<' and adds it to the found selection. Then it checks the characters after the '<' and as long as they are not '>' they are added to the selection. Since the number of characters in the tag may vary from 0 to infinity, that's what is entered for the minimum and maximum numbers. The last part adds the '>', which the second part ran into, to the selection.
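The tag-stripping pattern can be tried out with a short Python sketch. As before, this is an approximation under stated assumptions rather than TextSpresso's own code: each part takes the longest run it can, there is no backtracking, and the names `match_at` and `replace_pattern` are hypothetical. An empty replace string stands for deleting the found text.

```python
def match_at(text, start, parts):
    """Return the end position if every part matches in order from start, else None."""
    pos = start
    for table, match, minimum, maximum in parts:
        limit = len(text) if maximum == 0 else min(len(text), pos + maximum)
        end = pos
        while end < limit and (text[end] in table) == match:
            end += 1
        if end - pos < minimum:
            return None
        pos = end
    return pos


def replace_pattern(text, parts, replacement=""):
    """Replace every match of the pattern; an empty replacement deletes it."""
    out, i = [], 0
    while i < len(text):
        end = match_at(text, i, parts)
        if end is not None and end > i:
            out.append(replacement)
            i = end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


STRIP_TAGS = [("<", True, 1, 1), (">", False, 0, 0), (">", True, 1, 1)]
print(replace_pattern("<p>Hello, <b>world</b>!</p>", STRIP_TAGS))  # Hello, world!
```

In this sketch a '<' that is never followed by a '>' is left in place, because the last part cannot be satisfied.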

Advanced Applications

(Under Construction)

Current Limitations

The vast majority of patterns which can be expressed using a regular expression can be expressed using a Replace Pattern filter or a combination of Replace Pattern filters and Insert Text filters (for inserting CR's at the beginning and/or end of the text for BOF and EOF matches). However, currently the Replace Pattern filter type is limited in its replace capability. Where regular expressions can replace individual parts of a pattern and include variables, Replace Pattern filters are limited to replacing the entire pattern with a simple replace string which may, at the most, include wild card characters. Pattern Insert filters insert text around a pattern match.
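For readers who already know regular expressions, the correspondence for the search side can be sketched in Python. This is only a rough mapping: the helper names are hypothetical, the table is assumed to be non-empty, and regular expression engines backtrack, so results may differ from the filter in some cases.

```python
import re


def part_to_regex(table, match, minimum, maximum):
    """Translate one ( table match? min max ) part into a character class and quantifier."""
    negate = "" if match else "^"
    upper = "" if maximum == 0 else str(maximum)     # 0 means no upper limit
    return f"[{negate}{re.escape(table)}]{{{minimum},{upper}}}"


def pattern_to_regex(parts):
    """Join the translated parts in order."""
    return "".join(part_to_regex(*part) for part in parts)


STICKY_A = [("Cc", True, 1, 1), ("a", True, 2, 0), ("t", True, 0, 1)]
regex = pattern_to_regex(STICKY_A)
print(regex)                                         # [Cc]{1,1}[a]{2,}[t]{0,1}
print(re.findall(regex, "cat caat caa Caaat"))       # ['caat', 'caa', 'Caaat']
```

The mapping covers only searching; it says nothing about replacing individual parts of a match, which is the limitation described above.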

We are currently working on an advanced Replace Pattern filter type which will be more flexible than regular expressions and support regular expression syntax for users who prefer that language.