Developer's Daily Perl Education
  front page | java | perl | unix | dev directory | web log
   
Front Page
Perl
Education
Perl Education Articles
   

Column-oriented data extraction
with Perl

Introduction

Perl is a terrific language for text processing, but several readers have written wondering about how to extract columns of data from text files with Perl.

For instance, when you have a text database that looks like this:

what's the best way to extract the last name from each database record?

When every record is separated by a delimiter (such as a ',', ':', or '|' symbol) it's easy to use the split function. But in a case like this, where data is stored in fixed-width fields, and no field delimiter is used, how do you extract your data?
 

It's still not too hard

Although you can't jump right in and easily use the split function, the solution is almost as easy, though maybe not as obvious. To solve the "fixed-width field" problem, just use Perl's substr function.

Using our three line database above as an example, it appears that the fields of the database are defined like this:

Given these definitions, here's how I'd grab the data from each record from a file named database:

The substr() function

As you can see from this code fragment, the substr function is an important part of the solution. The substr function lets you extract fields of information from strings. All you have to do is define three items: In the example above, the name of the string we're extracting information from is $_, the default string when you're reading from a data file like this. This string holds the value of each record that we're reading from the database file.

The offset is really just the starting point of each field. In our sample database, the last_name field begins in column 1, the first_name field begins in column 10, and the city field begins in column 19. Given these values, the offset for each field is 0, 9, and 18, respectively. (Remember that Perl, like C, C++, Java, and other languages, begin counting at 0. Therefore, if a field starts in column 10, it's really going to be the 9th element of a string; columns 1-9 of the database will be stored in the string elements numbered 0 through 8.)

The length of each substring is also defined by the database. In our example, the lengths of the three fields are the same as the field widths, 9, 9, and 11, respectively.
 

Conclusion

There are many other examples we could show here, but hopefully this is enough to get you started down the right path. Stated simply, the solution can be phrased like this:
    1. Read in each database record.
    2. Extract the fields you want from the record by using the substr function.
    3. Use the fields as desired.
If you're interested in using the sample program and database shown above for your own experimentation, you can download those two files right here: Jot us a quick note if you have any other questions about column-oriented data extraction, and we'll see what we can do for you.
 


Read the BLOG

Copyright © 1998 DevDaily Interactive, Inc.
All Rights Reserved.