Developer's Daily Java Education
 
How to break a String into tokens with the StringTokenizer class
 

Introduction

When working with any general-purpose programming language, it's often necessary to break a large string into smaller components.  Whether you're working with Unix system files, older Windows ".ini" files, or flat files in a text database, you'll often read in a record of information and then break that record up into smaller chunks.  More recently, I've used this method to programmatically interpret the contents of HTML pages for web "robots".

In this article we'll demonstrate how to break Java Strings into smaller components, called tokens.  We'll begin by breaking a simple, well-known sentence into words, and then we'll demonstrate how to use the same technique to work with a flat-file database.
 

Breaking a sentence into words

One of the most famous sentences in American history begins with the words "Four score and seven years ago".  For our purposes, suppose those opening words are stored in a String variable named speech, so that speech holds the text "Four score and seven years ago".

A simple snippet of code can break that sentence into individual words using Java's StringTokenizer class.  The variable speech is passed into the StringTokenizer constructor, and a while loop then prints one token each time through the loop.

Because StringTokenizer is not given a field separator value as an input parameter, it uses its default field separators, and assumes that fields within the string are separated by whitespace characters (spaces, tabs, newlines, carriage returns, and form feeds).  Therefore, each time through the while loop a word is printed on a separate line, and the resulting output is the six words Four, score, and, seven, years, and ago, one per line.

The while loop test (typically a call to hasMoreTokens()) checks to see if there are any tokens left in the st object.  As long as there are, the println statement is executed.  Once there are no tokens remaining, the println statement is skipped and the while loop is exited.
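The snippet described above can be sketched roughly as follows.  The class name SentenceTokens and the helper method tokenize are invented for this illustration; only StringTokenizer and the variable names speech and st come from the article.  The sketch gathers the words in a list so they can be reused, but the loop is the same one described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch; the class and method names are hypothetical.
public class SentenceTokens {

    // Splits text on StringTokenizer's default delimiters (whitespace).
    public static List<String> tokenize(String text) {
        List<String> words = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text);
        // Keep going for as long as the tokenizer has tokens left.
        while (st.hasMoreTokens()) {
            words.add(st.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        String speech = "Four score and seven years ago";
        // Print each word on its own line.
        for (String word : tokenize(speech)) {
            System.out.println(word);
        }
    }
}
```

Running main should print each of the six words on a separate line, in the order they appear in the sentence.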

One quick note before continuing: if you try this example, you'll need to import the java.util package by placing the statement import java.util.*; at the top of your source file.


An example using a text database file

In the example just shown, a text string was broken down into separate tokens.  Of course in the English language we call these tokens "words", and these words are usually separated by whitespace.  In our next example, we'll show how to break a database record (separated by colon characters) into tokens typically called "fields".

Consider two records from a hypothetical customer database file named customer.db.  Each record contains information about a customer, including their first name, last name, and the city and state of their address.  Within a record, each field is separated by a colon character.

Assuming that the first record (the Homer Simpson record) was read into a String variable named dbRecord, the record can be broken up into four fields and printed.  Here we assume that the variable dbRecord already contains the entire first record of information from our database.

Because we know that the fields of each record are separated by the colon character, we specify that the colon character should be the field delimiter (or field separator) when we call the StringTokenizer constructor: new StringTokenizer(dbRecord, ":").

After that, it's a simple matter to break the record into its four fields using the nextToken() method of the StringTokenizer class.
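As a minimal sketch of the steps just described, the code below splits a single record on the colon character.  The class name RecordFields, the helper splitRecord, and the sample record are assumptions made for illustration; in particular, the city and state values are placeholders and are not taken from the actual customer.db file.

```java
import java.util.StringTokenizer;

// Illustrative sketch; the class and method names are hypothetical.
public class RecordFields {

    // Breaks one colon-delimited record into its fields.
    public static String[] splitRecord(String dbRecord) {
        StringTokenizer st = new StringTokenizer(dbRecord, ":");
        // countTokens() reports how many tokens remain to be read.
        String[] fields = new String[st.countTokens()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = st.nextToken();
        }
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder record: the city and state are made-up sample values.
        String dbRecord = "Homer:Simpson:Springfield:XX";
        String[] fields = splitRecord(dbRecord);
        if (fields.length == 4) {
            System.out.println("First Name:  " + fields[0]);
            System.out.println("Last Name:   " + fields[1]);
            System.out.println("City:        " + fields[2]);
            System.out.println("State:       " + fields[3]);
        } else {
            System.out.println("Unexpected number of fields: " + fields.length);
        }
    }
}
```

One design point to keep in mind: StringTokenizer treats consecutive delimiters as a single separator, so an empty field (for example "a::b") is skipped rather than returned as an empty string.  Checking the field count before calling nextToken() four times avoids a NoSuchElementException on short records.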

This technique is demonstrated completely in Listing 1, where the entire customer.db text database is read (record-by-record), and printed.
 
 

// source code developed by DevDaily Interactive, Inc.

  import java.io.*;
  import java.util.*;

  class TokenTest { 

     public static void main (String[] args) {
        TokenTest tt = new TokenTest();
        tt.dbTest();
     }
 

     void dbTest() { 

        DataInputStream dis = null;
        String dbRecord = null;

        try { 

           File f = new File("customer.db");
           FileInputStream fis = new FileInputStream(f); 
           BufferedInputStream bis = new BufferedInputStream(fis); 
           dis = new DataInputStream(bis);

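           // read each record of the database, one line at a time
           // (DataInputStream.readLine() is deprecated; BufferedReader.readLine() is its replacement)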
           while ( (dbRecord = dis.readLine()) != null) {

              StringTokenizer st = new StringTokenizer(dbRecord, ":");
              String fname = st.nextToken();
              String lname = st.nextToken();
              String city  = st.nextToken();
              String state = st.nextToken();

              System.out.println("First Name:  " + fname);
              System.out.println("Last Name:   " + lname);
              System.out.println("City:        " + city);
              System.out.println("State:       " + state + "\n");
           }

        } catch (IOException e) { 
           // catch io errors from FileInputStream or readLine() 
           System.out.println("Uh oh, got an IOException error: " + e.getMessage()); 

        } finally { 
           // if the file opened okay, make sure we close it 
           if (dis != null) {
              try {
                 dis.close();
              } catch (IOException ioe) {
                 System.out.println("IOException error trying to close the file: " +
                                    ioe.getMessage()); 
              }

           } // end if

        } // end finally

     } // end dbTest

  } // end class
 

 
Listing 1:  The TokenTest.java program demonstrates a method to read every record in a text database file (customer.db), and break the data records into tokens (i.e., fields) using the StringTokenizer class.
 
 
Although the code shown in Listing 1 is not a good example of object-oriented programming style, it does demonstrate our technique of reading a file and breaking its records into fields using the StringTokenizer class.
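For readers on a newer JDK, the same technique can be sketched with BufferedReader, whose readLine() method replaces the deprecated DataInputStream.readLine().  This is one possible variation under stated assumptions, not the article's own listing: the class name RecordReader and the method readRecords are hypothetical, and the choice to skip records that do not contain exactly four fields is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Illustrative sketch; the class and method names are hypothetical.
public class RecordReader {

    // Reads colon-delimited records from any Reader.
    // Records without exactly four fields are skipped (an assumption of this sketch).
    public static List<String[]> readRecords(Reader source) throws IOException {
        List<String[]> records = new ArrayList<String[]>();
        BufferedReader reader = new BufferedReader(source);
        String dbRecord;
        while ((dbRecord = reader.readLine()) != null) {
            StringTokenizer st = new StringTokenizer(dbRecord, ":");
            if (st.countTokens() != 4) {
                continue; // malformed record
            }
            String[] fields = {
                st.nextToken(), st.nextToken(), st.nextToken(), st.nextToken()
            };
            records.add(fields);
        }
        return records;
    }

    public static void main(String[] args) {
        // try-with-resources closes the file even if reading fails.
        try (Reader file = new FileReader("customer.db")) {
            for (String[] r : readRecords(file)) {
                System.out.println("First Name:  " + r[0]);
                System.out.println("Last Name:   " + r[1]);
                System.out.println("City:        " + r[2]);
                System.out.println("State:       " + r[3] + "\n");
            }
        } catch (IOException e) {
            System.out.println("Uh oh, got an IOException error: " + e.getMessage());
        }
    }
}
```

Accepting a Reader rather than a file name keeps the parsing logic separate from the file handling, so the same method can be exercised with a StringReader when no customer.db file is at hand.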
 


Final thoughts and source code

The simple method we've shown here is a convenient way of breaking a String into tokens.  If you need a more powerful tokenizer, you might look at the StreamTokenizer class (in the java.io package) instead.  The StreamTokenizer class can recognize various comment styles of programming languages, and offers a number of control flags that can be set to various states.
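To give a rough idea of how StreamTokenizer differs, here is a minimal sketch that tokenizes a short line of text while ignoring C-style and C++-style comments.  The class name StreamTokens, the method describeTokens, and the sample input are all made up for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; the class and method names are hypothetical.
public class StreamTokens {

    // Returns a short description of each token found in the source text.
    public static List<String> describeTokens(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        st.slashSlashComments(true); // skip "// ..." comments
        st.slashStarComments(true);  // skip "/* ... */" comments

        List<String> tokens = new ArrayList<String>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:
                    tokens.add("WORD " + st.sval);
                    break;
                case StreamTokenizer.TT_NUMBER:
                    tokens.add("NUMBER " + st.nval); // numbers are parsed as doubles
                    break;
                default:
                    // Any other token is reported as the single character in ttype.
                    tokens.add("CHAR " + (char) st.ttype);
            }
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        for (String token : describeTokens("total = 42 /* units */ // the answer")) {
            System.out.println(token);
        }
    }
}
```

Unlike StringTokenizer, StreamTokenizer classifies each token (word, number, or single character) and exposes the result through its ttype, sval, and nval fields, which is what makes it suitable for simple parsers.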

If you'd like to download the source code shown in Listing 1, just click here, and the Java code will be displayed in your browser.  Then just use the "File | Save As" option of your browser to save the source code to your system.

To download the customer.db database file, just click here, and follow the same procedure to save the file to your local filesystem.

 



Copyright 1998-2008 DevDaily Interactive, Inc.
All Rights Reserved.