| Developer's Daily
|
Perl Education |
Column-oriented data extraction
with Perl
Introduction
Perl is a terrific language for text processing, but several readers have
written wondering about how to extract columns of data from text files
with Perl.
For instance, when you have a text database that looks like this:
simpson bart springfield
flinstonefred bedrock
rubble barney bedrock
what's the best way to extract the last name from each database record?
When every record is separated by a delimiter (such as a ',',
':', or '|' symbol) it's easy to use the split
function. But in a case like this, where data is stored in fixed-width
fields, and no field delimiter is used, how do you extract your data?
It's still not too hard
Although you can't jump right in and easily use the split function,
the solution is almost as easy, though maybe not as obvious. To solve the
"fixed-width field" problem, just use Perl's substr
function.
Using our three line database above as an example, it appears that the
fields of the database are defined like this:
| Field Number |
Field Description |
Field Width |
| 1 |
Last name |
9 |
| 2 |
First name |
9 |
| 3 |
City |
11 |
Given these definitions, here's how I'd grab the data from each record
from a file named database:
open (IN, 'database');
while (<IN>) {
chomp;
$last_name = substr $_,
0, 9; # extract the last name field
$first_name = substr $_,
9, 9; # extract the first name field
$city = substr $_, 18,
11; # extract the city field
# do something useful here
with each value ...
print "$first_name $last_name
lives in $city\n";
}
close (IN);
The substr() function
As you can see from this code fragment, the substr function is
an important part of the solution. The substr function lets you
extract fields of information from strings. All you have to do is define
three items:
-
The name of the string
-
The offset, or starting point
-
The length of the substring to extract
In the example above, the name of the string we're extracting information
from is $_, the default string when you're reading from a data
file like this. This string holds the value of each record that we're reading
from the database file.
The offset is really just the starting point of each field. In
our sample database, the last_name field begins in column 1, the first_name
field begins in column 10, and the city field begins in column 19. Given
these values, the offset for each field is 0, 9, and 18, respectively.
(Remember that Perl, like C, C++, Java, and other languages, begin counting
at 0. Therefore, if a field starts in column 10, it's really going to be
the 9th element of a string; columns 1-9 of the database will be stored
in the string elements numbered 0 through 8.)
The length of each substring is also defined by the database.
In our example, the lengths of the three fields are the same as the field
widths, 9, 9, and 11, respectively.
Conclusion
There are many other examples we could show here, but hopefully this is
enough to get you started down the right path. Stated simply, the solution
can be phrased like this:
-
Read in each database record.
-
Extract the fields you want from the record by using the substr function.
-
Use the fields as desired.
If you're interested in using the sample program and database shown above
for your own experimentation, you can download those two files right here:
Jot us a quick note if you have
any other questions about column-oriented data extraction, and we'll see
what we can do for you.

Read the BLOG
Copyright © 1998 DevDaily Interactive, Inc.
All Rights Reserved.