Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Special Edition Using Java, ISBN: 0-7897-0604-0. The electronic version of this material has not been through the final proof reading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 27 - Content Handlers

by Eric Blossom

Documents found on the Web are of many different types. A document may contain text, an image, sound, etc. HotJava and other Web browsers are able to display a variety of different documents.

However, new types of content come on-line all the time. HotJava and other browsers have software built-in to handle the types they know about. But what about new types? Many browsers support the notion of helper applications. You can associate a particular content type with an application. For example, Microsoft defined a format for document interchange called Rich Text Format (RTF). You can associate the RTF content type with Microsoft Word. Now when you download a document containing text in RTF, your browser will launch Microsoft Word with your document loaded. This is pretty convenient. But, it pops open a whole new application in a different window. There is another way.

With HotJava, you can define new classes to handle new types of content. These new classes are called content handlers. HotJava was designed to accept new content handlers allowing it to handle new types of content directly, without any helper application. The content handlers effectively become a part of HotJava.

With a content handler instead of a helper application, the document will be displayed in the browser's window. No helper application needs to be launched.

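In this chapter you will learn how MIME content types work, create a content handler step by step, and test it with a trivial browser.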

Understanding MIME Content Types

When a Web browser gets a document, it tries to figure out what the document contains. Part of the Hypertext Transfer Protocol (HTTP) tells the browser what the incoming document contains. For other protocols, the browser will try to deduce the content type from the file name extension.

A standard way of specifying content types comes from MIME (Multipurpose Internet Mail Extensions). The Web and the Hypertext Transfer Protocol specify content type in exactly the same way. This is why content types are often referred to as MIME types.

ftp://ftp.internic.net/rfc/rfc1521.txt

In mail messages the content type is specified in a header field labeled "Content-type:". Its value is of the form <type>/<subtype>. The content type of most Web pages is "text/html". HotJava knows how to handle this kind of content and many others. Still you can define others and teach HotJava how to handle them. You will write a handler for the standard content-type of text/enriched.

ftp://ftp.internic.net/rfc/rfc1526.txt
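As a small illustration of the <type>/<subtype> form, the following sketch splits a content type value into its two parts. The class and method names (ContentTypeParts, split) are invented for this example and are not part of the JDK. It also assumes that anything after a semicolon, such as a charset parameter, can simply be dropped.

```java
import java.util.Locale;

// Hypothetical helper, not part of the JDK: splits a MIME content type
// such as "text/html" into its type and subtype.
class ContentTypeParts {

    // Returns a two-element array: { type, subtype }, both lower-cased.
    static String[] split( String contentType ) {
        String value = contentType;
        int semi = value.indexOf( ';' );   // assumption: drop parameters such as "; charset=..."
        if ( 0 <= semi ) {
            value = value.substring( 0, semi );
        }
        value = value.trim().toLowerCase( Locale.ROOT );
        int slash = value.indexOf( '/' );
        if ( slash < 0 ) {
            throw new IllegalArgumentException( "no subtype in: " + contentType );
        }
        return new String[] { value.substring( 0, slash ), value.substring( slash + 1 ) };
    }

    public static void main( String[] args ) {
        String[] parts = split( "text/enriched" );
        System.out.println( "type=" + parts[0] + " subtype=" + parts[1] );
    }
}
```

The type names the package a handler lives in and the subtype names the handler class, which is why the two parts are worth separating.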


Enriched Text

The people who invented the MIME standard also came up with a simple format for encoding text a little richer than plain ASCII. It uses tags similar to HTML. The tags come in pairs of <TAG> and </TAG>:

<bold> bold text </bold>

<italic> italic text </italic>

<fixed> text in a mono-spaced font </fixed>

<smaller> text in a smaller font </smaller>

<bigger> text in a larger font </bigger>

<underline> underlined text </underline>

<center> centered text </center>

<flushleft> text aligned on the left margin </flushleft>

<flushright> text aligned on the right margin </flushright>

<flushboth> text spread out to align on both margins </flushboth>

<nofill> verbatim text with new lines honored </nofill>

<indent> indented text </indent>

<indentright> text with a wider right margin </indentright>

<excerpt> text excerpted from another source </excerpt>

<param> specialized commands </param>

A literal "<" is represented by doubling it "<<". Sequences of consecutive newline characters in the input are decreased by one. Isolated newline characters are converted to a single space. This effectively makes newline characters into paragraph separators.
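The newline rule can be tried out on its own. The sketch below applies only that rule to a string: a lone newline becomes a space, and a run of n newlines becomes n-1 newlines. It ignores tags, the "<<" escape, and <nofill> sections, and the class and method names are invented for this illustration.

```java
// Hypothetical illustration of the enriched-text newline rule only.
class NewlineRule {

    static String apply( String in ) {
        StringBuffer out = new StringBuffer();
        int run = 0;                                  // length of the current newline run
        for ( int i = 0; i < in.length(); i++ ) {
            char c = in.charAt( i );
            if ( c == '\n' ) {
                run++;
                if ( 1 < run ) out.append( '\n' );    // emit every newline but the first
            } else {
                if ( 1 == run ) out.append( ' ' );    // a lone newline becomes a space
                run = 0;
                out.append( c );
            }
        }
        return out.toString();
    }

    public static void main( String[] args ) {
        // "one" and "two" end up on one line; "three" starts a new one.
        System.out.println( apply( "one\ntwo\n\nthree" ) );
    }
}
```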


Separating Lines of Text

Plain ASCII text is not so plain when it comes to separating lines of text. Some systems (like DOS) use two characters, a carriage return (CR) and a line feed (LF), to signal a new line. Some (like UNIX) use a single LF character. Some (like Macintosh) use a single CR character. Standard C language functions for reading and writing text use the notion of a newline character. The newline character is encoded with '\n'. The standard C functions handle conversion to and from whatever the local convention is.

In Java, newline is also coded as '\n'. However, content handlers use a data stream that bypasses the standard C language IO functions. Text from the Internet generally arrives with lines separated by pairs of CR and LF characters. Your content handler will see the CR as '\r' and the LF as '\n'.

Generally you will want to convert whatever comes over the net as a newline to a single '\n' character.

The author noticed when using the text/plain content handler with Alpha3 HotJava on Windows, that the CR characters showed up in the viewing panel as little rectangles. Modifying the text/plain content handler to strip out the '\r' characters fixed this.
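A minimal sketch of that conversion follows. It turns CR LF pairs and lone CR characters into a single '\n' and leaves existing '\n' characters alone. The class and method names are made up for this example; this is not the code of the text/plain handler mentioned above.

```java
// Hypothetical helper: normalizes DOS (CR LF), Macintosh (CR) and
// UNIX (LF) line endings to a single '\n'.
class LineEndings {

    static String normalize( String in ) {
        StringBuffer out = new StringBuffer();
        for ( int i = 0; i < in.length(); i++ ) {
            char c = in.charAt( i );
            if ( c == '\r' ) {
                // Skip the LF of a CR LF pair so the line break is not counted twice.
                if ( i + 1 < in.length() && in.charAt( i + 1 ) == '\n' ) {
                    i++;
                }
                out.append( '\n' );
            } else {
                out.append( c );
            }
        }
        return out.toString();
    }

    public static void main( String[] args ) {
        String mixed = "DOS\r\nMac\rUNIX\n";
        System.out.println( normalize( mixed ).equals( "DOS\nMac\nUNIX\n" ) );
    }
}
```

A stream-based handler could apply the same idea one character at a time by remembering whether the previous character was a CR.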

Creating a Content Handler Step by Step

Your content handler will be a minimal one. It will strip the tags from the enriched text and honor only the <nofill> and </nofill> tags. It will also strip out anything between <param> and </param> tags.

Declaring the Package

Content handlers must be part of a package named for the type. The package will contain a class for each subtype. So for the type text we will need a text package. In that package there should be classes for plain, html, etc. URL objects look for these packages with the prefix sun.net.www.content. So sun.net.www.content.text is the text package. This package already exists in the JDK. The text/plain handler is there. The first line in our source code declares this to be part of that package.

 package sun.net.www.content.text;
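The naming convention described above can be written down as a one-line mapping from a content type to a class name. This is only a sketch of the convention as the text states it, not the JDK's actual lookup code, and the class and method names are invented here.

```java
// Hypothetical illustration of the naming convention: the MIME type
// becomes a package and the subtype becomes a class under a fixed prefix.
class HandlerName {

    static final String PREFIX = "sun.net.www.content";

    static String forType( String mimeType ) {
        return PREFIX + "." + mimeType.trim().replace( '/', '.' );
    }

    public static void main( String[] args ) {
        System.out.println( forType( "text/enriched" ) );
    }
}
```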

Importing Standard Classes

The next thing you need to do is import the standard classes that you will use or extend in your handler.

 import java.net.ContentHandler;
 import java.net.URLConnection;
 import java.io.InputStream;
 import java.io.IOException;

Your handler will be a subclass of the java.net.ContentHandler class. It provides a getContent() method useful to browsers. You will be overriding that method.

A URLConnection gives you control of a connection to a server on the Internet. You will use its getInputStream() method to provide an InputStream from the server that you can read.

An InputStream is a stream of bytes that can be read. This handler treats each byte as one character.

It is actually an abstract class. Any object returned as an InputStream will actually be an instance of some subclass of InputStream. This will not matter to you. All that matters is that you can use the read() method to get the next byte. Since the abstract class InputStream has such a method, all its subclasses must also have that method.

An IOException is what can get thrown back to you from some of the methods you will be using.

You, in turn, will throw the IOException back to whatever is using your method. This exception will get thrown if there is trouble getting characters from the server.

Declaring the Class

You declare the class as extending the java.net.ContentHandler class. It is the only public class you will need. Its name must match the name of the MIME subtype. In your case that is "enriched." Here is an empty version showing how to declare the class. Later you will fill it in.

 public class enriched extends ContentHandler {
 }

Defining the getContent() Method

You override the getContent() method of the ContentHandler class. It must be declared to be the same as the one in the ContentHandler class. Its single parameter is the URLConnection from which the data are coming. It can throw an IOException back to the caller.

Notice that the method returns an Object. Any kind of object can be returned. You will be returning a String. However, you could return a graphic widget containing the text rendered in varying fonts, complete with a scrolling mechanism to view it with. A graphic browser would be able to take such a widget and display it in one of its panels. It might also be able to accept a String and put it in a text box of some sort for display.

Your trivial browser will just write a string to the standard output. So a String will be just fine for your purposes. Here is the enriched class again. This time with an empty getContent() method. This shows how to define the method. You will fill it in later.

 public class enriched extends ContentHandler {
      public Object getContent( URLConnection urlc ) throws IOException {
           return null; // placeholder so the skeleton compiles; filled in later
      }
 }

Reading the InputStream

Now you put in a loop to read data from the URLConnection and build a StringBuffer with them. The StringBuffer is then returned to the caller as a String. This is pretty much just what the text/plain content handler does. It gets an InputStream from the URLConnection passed to it. Then it reads characters one at a time from the InputStream and appends them to a StringBuffer. The last line returns the characters converted to a String object.

Here is your getContent() method filled in enough to do this.

      public Object getContent( URLConnection urlc ) throws IOException {
           StringBuffer out = new StringBuffer();
           int c;

           InputStream in = urlc.getInputStream();
           c = in.read();                 // read() returns -1 at the end of the stream
           while( 0 <= c ) {
                out.append( (char) c );   // treat each byte as one character
                c = in.read();
           }
           in.close();
           return out.toString();
      }
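One way to try this loop without a server is to feed it from memory. The sketch below defines a stand-in URLConnection that serves bytes from a String, plus a handler containing the same read loop. The names ReadLoopDemo, CannedConnection and CopyHandler are invented for this illustration, and passing a null URL to the URLConnection constructor is an assumption that suits a stub with no real resource behind it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.ContentHandler;
import java.net.URL;
import java.net.URLConnection;

class ReadLoopDemo {

    // Hypothetical stand-in for a real connection: serves bytes from memory.
    static class CannedConnection extends URLConnection {
        private final byte[] data;

        CannedConnection( String text ) {
            super( (URL) null );          // assumption: no real URL behind this stub
            data = text.getBytes();       // fine for the plain ASCII used here
        }

        public void connect() { }         // nothing to connect to

        public InputStream getInputStream() {
            return new ByteArrayInputStream( data );
        }
    }

    // Contains the same read loop as the getContent() method above.
    static class CopyHandler extends ContentHandler {
        public Object getContent( URLConnection urlc ) throws IOException {
            StringBuffer out = new StringBuffer();
            InputStream in = urlc.getInputStream();
            int c = in.read();
            while ( 0 <= c ) {
                out.append( (char) c );
                c = in.read();
            }
            in.close();
            return out.toString();
        }
    }

    public static void main( String[] args ) throws IOException {
        Object o = new CopyHandler().getContent( new CannedConnection( "hello, handler" ) );
        System.out.println( o );
    }
}
```

The same stub can be handed to any other content handler, which makes it easy to check a handler's output before involving a browser.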

Transforming the Content

At last you get to parsing the enriched text. The full content handler is given in listing 27.1 below. The basic loop is the same. However, within that loop you check for tags and convert newline sequences appropriately. This code may look a lot like C. This is because it is a fairly direct translation from the C code given in RFC 1526.

The main loop is just as it was except for a large if statement inside it. That statement is checking for a "<" which may be starting a tag.

You also have three context variables. The first, newlinect, keeps track of how long the current sequence of newline characters is. The second, paramct, shows how deeply nested you are in <param>, </param> tags. The third, nofill, shows how deeply nested you are between <nofill> and </nofill> tags.

If there are no tags or "<" characters in the text, only the code in lines 50 to 58 will get exercised within the if, else statement starting at line 26.

Line 50 checks to see if you are between any <param>, </param> pairs. If you are, just ignore the current character.

In line 52, you check to see if your character is a newline. If it is and you are not in a nofill section, the newline is counted. If the sequence is now more than one newline long (i.e. the previous character was also a newline) then the newline is emitted. Otherwise it is just counted.

At line 54, you know that the character is not a newline. If the newline sequence was exactly one character long, then a space is emitted in line 55. In line 56 you set your newline count to zero, since you know that the current character is not a newline. Line 57 emits the current character.

If 0 < nofill, then the newline count will not be incremented. In this case, line 57 will just emit the newline characters as they come.

Now let's look at what happens when you encounter a "<". Line 27 will emit a space if the previous character was a single newline and line 29 gets the next character. If the next character is also a "<", then line 31 emits it. Otherwise you probably have a tag.

You read in the rest of the tag in lines 33 to 36 and convert it to a String in line 37. Lines 39 to 47 adjust the appropriate context variable based on the tag just encountered.

Listing 27.1 Enriched Text Content Handler
  1   // Enriched Text Content Handler.
  2   
  3   package sun.net.www.content.text;
  4   
  5   import java.net.ContentHandler;
  6   import java.net.URLConnection;
  7   import java.io.InputStream;
  8   import java.io.IOException;
  9   
 10   public class enriched extends ContentHandler {
 11   
 12      public Object getContent( URLConnection urlc ) throws IOException {
 13         StringBuffer out = new StringBuffer();
 14         StringBuffer ts = new StringBuffer();
 15         String token = new String();
 16         InputStream in;
 17         int c;
 18         int newlinect = 0;
 19         int paramct = 0;
 20         int nofill = 0;
 21   
 22         in = urlc.getInputStream();
 23         try {
 24            c = in.read();
 25            while( -1 < c ) {
 26               if ( '<' == c ) { // We may have a command.
 27                  if ( 1 == newlinect ) { out.append( ' ' ); }
 28                  newlinect = 0;
 29                  c = in.read();
 30                  if ( '<' == c ) { // We have a quoted <
 31                     if ( paramct < 1 ) out.append( (char)c );
 32                  } else { // We have a command.
 33                     while ( -1 < c && '>' != c ) {
 34                        ts.append( (char)c );
 35                        c = in.read();
 36                     }
 37                     token = ts.toString();
 38                     ts.setLength( 0 );
 39                     if ( token.equalsIgnoreCase( "param" ) ) {
 40                        paramct++;
 41                     } else if ( token.equalsIgnoreCase( "nofill" ) ) {
 42                        nofill++;
 43                     } else if ( token.equalsIgnoreCase( "/param" ) ) {
 44                        paramct--;
 45                     } else if ( token.equalsIgnoreCase( "/nofill" ) ) {
 46                        nofill--;
 47                     }
 48                  }
 49               } else { // It's text.
 50                  if ( 0 < paramct ) 
 51                     ; // ignore params
 52                  else if ( c == '\n' && nofill < 1 ) {
 53                     if ( 1 < ++newlinect ) out.append( (char)c );
 54                  } else {
 55                     if ( 1 == newlinect ) out.append( ' ' );
 56                     newlinect = 0;
 57                     out.append( (char)c );
 58                  }
 59               }
 60               c = in.read();
 61            }
 62         } finally {
 63            in.close();
 64         }
 65         out.append( '\n' );
 66         return out.toString();
 67      }
 68   }

Testing Our Content Handler

The trivial browser you used for testing in chapter 26 needs to be extended to test your content handler. Listing 27.2 shows the new version of GoGet.

The big change here is the addition of a ContentHandlerFactory. The URLConnection will use the ContentHandlerFactory's createContentHandler() method to create the correct content handler for a MIME type. A ContentHandlerFactory is not a class, but an interface. This means that you need to define every method in the interface. The only method in the interface is the createContentHandler method.

Having defined your ContentHandlerFactory, you need to let the URLConnection know you have done so. That is the purpose of line 40. Of course, to set the URLConnection's content handler factory, you need the URLConnection. That is the purpose of line 39, which opens the connection to the server. In the previous version that happened behind the scenes when getContent() was used. In this case getContent() realizes that the connection is already open.

Your factory in lines 13 to 22 gives back your enriched handler for enriched text and the plain handler for other subtypes of type text and for objects of unknown content. You do not have handlers for any other content types, so you should return a null pointer in those cases.

Listing 27.2 A Trivial Browser: GoGet Revisited
  1   import java.net.URL;
  2   import java.net.URLConnection;
  3   import java.net.MalformedURLException;
  4   
  5   import java.net.ContentHandler;
  6   import java.net.ContentHandlerFactory;
  7   
  8   
  9   class ourContentHandlerFactory implements ContentHandlerFactory {
 10   
 11      public ContentHandler createContentHandler( String mimetype ) {
 12   
 13         if ( mimetype.equalsIgnoreCase( "text/enriched" ) ) {
 14            return new sun.net.www.content.text.enriched();
 15   
 16         } else if ( mimetype.startsWith( "text/" ) ) {
 17            return new sun.net.www.content.text.plain();
 18   
 19         } else if ( mimetype.equalsIgnoreCase( "content/unknown" ) ) {
 20            return new sun.net.www.content.text.plain();
 21   
 22         }
 23   
 24         return null;
 25      }
 26   
 27   }
 28   
 29   
 30   public class GoGet {
 31   
 32      public static void main( String args[] ) {
 33   
 34         URL url = null;
 35         Object o = null;
 36   
 37         try {
 38            url = new URL( args[0] );
 39            URLConnection c = url.openConnection();
 40            c.setContentHandlerFactory( new ourContentHandlerFactory() );
 41            o = c.getContent();
 42            System.out.println( o.toString() );
 43   
 44         } catch (ArrayIndexOutOfBoundsException e) {
 45            System.err.println( "usage: java GoGet URL" );
 46   
 47         } catch (MalformedURLException e) {
 48            System.err.println( "Malformed URL: " + args[0] );
 49            System.err.println( e.toString() );
 50   
 51         } catch (Exception e) {
 52            System.err.println( e.toString() );
 53   
 54         }
 55   
 56      }
 57   
 58   
 59   }

Creating Test Data

You need to create a file containing some enriched text to see if you can GoGet it.

 Here's a bit of <bold>enriched</bold> text.
 This should be the second sentence in the first paragraph.
 It should all be one long line.

 Did this start a second line
 with no blank line before it?


 Did this start a second paragraph
 with a single blank line before it?
 <nofill>
 This sentence should
 be on two lines.
 </nofill>

Associating MIME Types with File Name Suffixes

Mosaic and Netscape use a file named mime.types to associate MIME types with file name suffixes. Each line of the file consists of a MIME type and one or more suffixes separated by spaces.
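To make the file format concrete, here is a sketch of a table that reads lines in that format and looks up a type by suffix. It is not how any particular browser or the JDK does it. The class name, the "enriched" suffix used in main, and the treatment of lines starting with "#" as comments are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical suffix table built from mime.types-style lines.
class MimeTypesTable {

    private final Map<String, String> bySuffix = new HashMap<String, String>();

    // Adds one line such as "text/plain txt text".
    void addLine( String line ) {
        StringTokenizer st = new StringTokenizer( line );
        if ( !st.hasMoreTokens() ) return;            // blank line
        String type = st.nextToken();
        if ( type.startsWith( "#" ) ) return;         // assumption: '#' starts a comment
        while ( st.hasMoreTokens() ) {
            bySuffix.put( st.nextToken().toLowerCase( Locale.ROOT ), type );
        }
    }

    // Returns the MIME type for a file name, or null if the suffix is unknown.
    String typeFor( String fileName ) {
        int dot = fileName.lastIndexOf( '.' );
        if ( dot < 0 ) return null;
        return bySuffix.get( fileName.substring( dot + 1 ).toLowerCase( Locale.ROOT ) );
    }

    public static void main( String[] args ) {
        MimeTypesTable table = new MimeTypesTable();
        table.addLine( "text/plain txt text" );
        table.addLine( "text/enriched enriched" );
        System.out.println( table.typeFor( "test.enriched" ) );
    }
}
```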

Unfortunately, the final version of the JDK appears to consult an internal table to make these associations. This will make it a bit harder to test our code. The problem is letting the URLConnection object know what the content type is.

One More Adjustment

One way to get around the problem mentioned above is to change line 17 in listing 27.2 to read the same as line 14. This will treat all text files as enriched text. Then rename the test file to test.txt so that it will be associated with the content type text/plain.

For the more adventurous, another way is to modify the URLConnection.java file and recompile it. To do this, you need to find a series of lines of the form:

 setSuffix(".text", "text/plain");

Then you need to add one that reads:

setSuffix(".enriched", "text/enriched");

Then you can recompile and replace the URLConnection.class in the classes.zip file. Don't forget to make a backup of classes.zip first!

Running the Test

At last we can try it. This shows what happened when the author ran the test:

 bash$ java GoGet file:test.txt
 Here's a bit of enriched text. This should be the second sentence in the first 
 paragraph. It should all be one long line.
 Did this start a second line with no blank line before it?

 Did this start a second paragraph with a single blank line before it?
 This sentence should
 be on two lines.

Open Questions

Installing content handlers is about as risky as installing protocol handlers. See chapter 26 for a discussion of those risks.

Sun's HotJava White Paper describes content handlers being downloaded on the fly if HotJava encounters a content type it doesn't have a handler for. However, neither the alpha nor the beta APIs seem to support this. Will future versions of the API support it?

This would be very convenient. It would also be a security risk. Sun (and other browser makers who wanted to use content handlers) would need to decide if the risk is worth it. They may decide to allow it, but with security restrictions similar to those imposed on applets.

One of the great benefits of content handlers is that one can create a new type of content and a handler to go with it. The MIME standard says that when you do such a thing you should give it a subtype starting with "X-". HotJava then needs you to define a class whose name starts with "X-". However, the Java compiler will reject such a name, because class names cannot contain a hyphen. Javac will report a syntax error. How will Sun deal with this problem?

Perhaps Sun could modify HotJava. When confronted with such a type or subtype name, HotJava could seek a class named the same except for an underscore character in place of the hyphen. For example, for a content type of text/x-mystuff, HotJava would look for a class named x_mystuff in the text package. If you don't want to wait for Sun to make this change, you could use a prefix of "X_" rather than the standard "X-". You would not be quite standard, but your new name would probably not collide with any new standard names either.
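The suggested substitution is easy to sketch. The code below replaces every character that cannot appear in a Java identifier with an underscore. It only illustrates the proposal above and is not a description of how HotJava behaves. The class and method names are invented for this example.

```java
// Hypothetical illustration of the proposed workaround: map a MIME
// subtype such as "x-mystuff" to a legal Java class name.
class SubtypeClassName {

    static String toClassName( String subtype ) {
        StringBuffer out = new StringBuffer();
        for ( int i = 0; i < subtype.length(); i++ ) {
            char c = subtype.charAt( i );
            // Hyphens, dots, plus signs and the like all become '_'.
            out.append( Character.isJavaIdentifierPart( c ) ? c : '_' );
        }
        // Note: a subtype that starts with a digit would still need extra handling.
        return out.toString();
    }

    public static void main( String[] args ) {
        System.out.println( toClassName( "x-mystuff" ) );
    }
}
```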
