Example usage of coll2unix/unix2coll

This example shows how the programs xmlpipe, unix2coll and coll2unix work.

1. From a Unix database to a Web Collection

Assume a Unix-style database of the following form:

tcpmux	1/tcp
echo	7/tcp
echo	7/udp
discard	9/tcp	sink null
discard	9/udp	sink null
systat	11/tcp	users
daytime	13/tcp
daytime	13/udp
netstat	15/tcp
chargen	19/tcp	ttytst source
chargen	19/udp	ttytst source
ftp-data	20/tcp
ftp	21/tcp
telnet	23/tcp
smtp	25/tcp	mail
time	37/tcp	timserver
time	37/udp	timserver
name	42/udp	nameserver
whois	43/tcp	nicname	# usually to sri-nic
domain	53/udp
domain	53/tcp
hostnames	101/tcp	hostname	# usually to sri-nic
sunrpc	111/udp	rpcbind
sunrpc	111/tcp	rpcbind

Every line represents one record, and the columns are separated by tabs. Many Unix files have this structure, or come close enough that a simple command can convert them to it. (The example above was created from the /etc/services file by removing comment lines and replacing multiple tabs by single tabs.)
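
The clean-up mentioned above could be done with a short shell function. This is only a sketch: the name clean_services is made up for this example, and it assumes that comment lines start with `#' (possibly after white space) and that the columns are already separated by tabs. Comments at the end of a data line are kept, as in the listing above.

```shell
# Hypothetical helper, not part of the tools described here.
# Usage: clean_services < /etc/services > file
clean_services() {
  # Drop comment-only and blank lines, then squeeze runs of tabs into one tab.
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' | tr -s '\t'
}
```

Files that separate columns with spaces rather than tabs would need an extra conversion step first.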

There are three columns in this database (ignoring the trailing comment column), called service-name, port/protocol and aliases. The command line that converts it into an XML-style Web Collection is as follows:

cat file |
 unix2coll -v FS="\t" -v PROFILE=services service-name \
  port/protocol aliases

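The option -v FS="\t" sets the field separator to a tab. The option -v PROFILE=services defines the profile (which must syntactically look like a URL), and the remaining arguments give the names of the fields, in column order. A column that is empty produces no field element, and the comment column, for which no name is given, is dropped. The result looks like this: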

<webcollection>

<record profile="services">
<field name="service-name" value="tcpmux"/>
<field name="port/protocol" value="1/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="echo"/>
<field name="port/protocol" value="7/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="echo"/>
<field name="port/protocol" value="7/udp"/>
</record>

<record profile="services">
<field name="service-name" value="discard"/>
<field name="port/protocol" value="9/tcp"/>
<field name="aliases" value="sink null"/>
</record>

<record profile="services">
<field name="service-name" value="discard"/>
<field name="port/protocol" value="9/udp"/>
<field name="aliases" value="sink null"/>
</record>

<record profile="services">
<field name="service-name" value="systat"/>
<field name="port/protocol" value="11/tcp"/>
<field name="aliases" value="users"/>
</record>

<record profile="services">
<field name="service-name" value="daytime"/>
<field name="port/protocol" value="13/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="daytime"/>
<field name="port/protocol" value="13/udp"/>
</record>

<record profile="services">
<field name="service-name" value="netstat"/>
<field name="port/protocol" value="15/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="chargen"/>
<field name="port/protocol" value="19/tcp"/>
<field name="aliases" value="ttytst source"/>
</record>

<record profile="services">
<field name="service-name" value="chargen"/>
<field name="port/protocol" value="19/udp"/>
<field name="aliases" value="ttytst source"/>
</record>

<record profile="services">
<field name="service-name" value="ftp-data"/>
<field name="port/protocol" value="20/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="ftp"/>
<field name="port/protocol" value="21/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="telnet"/>
<field name="port/protocol" value="23/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="smtp"/>
<field name="port/protocol" value="25/tcp"/>
<field name="aliases" value="mail"/>
</record>

<record profile="services">
<field name="service-name" value="time"/>
<field name="port/protocol" value="37/tcp"/>
<field name="aliases" value="timserver"/>
</record>

<record profile="services">
<field name="service-name" value="time"/>
<field name="port/protocol" value="37/udp"/>
<field name="aliases" value="timserver"/>
</record>

<record profile="services">
<field name="service-name" value="name"/>
<field name="port/protocol" value="42/udp"/>
<field name="aliases" value="nameserver"/>
</record>

<record profile="services">
<field name="service-name" value="whois"/>
<field name="port/protocol" value="43/tcp"/>
<field name="aliases" value="nicname"/>
</record>

<record profile="services">
<field name="service-name" value="domain"/>
<field name="port/protocol" value="53/udp"/>
</record>

<record profile="services">
<field name="service-name" value="domain"/>
<field name="port/protocol" value="53/tcp"/>
</record>

<record profile="services">
<field name="service-name" value="hostnames"/>
<field name="port/protocol" value="101/tcp"/>
<field name="aliases" value="hostname"/>
</record>

<record profile="services">
<field name="service-name" value="sunrpc"/>
<field name="port/protocol" value="111/udp"/>
<field name="aliases" value="rpcbind"/>
</record>

<record profile="services">
<field name="service-name" value="sunrpc"/>
<field name="port/protocol" value="111/tcp"/>
<field name="aliases" value="rpcbind"/>
</record>

</webcollection>
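
For readers curious about what such a conversion involves, here is a minimal sketch in shell and AWK that produces output of the shape shown above. It is not the source of unix2coll: the function name to_collection and its calling convention (profile first, then the field names, data on standard input) are assumptions made for this example, and it escapes only &, < and " in attribute values.

```shell
# Hypothetical stand-in for unix2coll; not the original script.
# Usage: to_collection PROFILE FIELDNAME... < tab-separated-file
to_collection() {
  awk -F '\t' '
    # Escape the characters that are special inside an XML attribute value.
    function esc(s) {
      gsub(/&/, "\\&amp;", s)
      gsub(/</, "\\&lt;", s)
      gsub(/"/, "\\&quot;", s)
      return s
    }
    BEGIN {
      profile = ARGV[1]
      for (i = 2; i < ARGC; i++) name[i - 1] = ARGV[i]
      nnames = ARGC - 2
      ARGC = 1                      # no file operands left: read stdin
      print "<webcollection>\n"
    }
    {
      printf "<record profile=\"%s\">\n", esc(profile)
      # Columns beyond the named ones are ignored; empty columns are skipped.
      for (i = 1; i <= nnames; i++)
        if ($i != "")
          printf "<field name=\"%s\" value=\"%s\"/>\n", esc(name[i]), esc($i)
      print "</record>\n"
    }
    END { print "</webcollection>" }
  ' "$@"
}
```

With the database above stored in `file', the call `to_collection services service-name port/protocol aliases < file' should give records like the ones listed in this section.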

2. From a Web Collection to a Unix database

To convert the Web Collection to a Unix-style database, we need two steps. The first is a Java program called `xmlpipe' that parses the XML and outputs the structure of the file in a line-oriented form that is well suited to AWK-like scripts. (Readers who know (n)sgmls will find the output similar.) The second step is another small AWK program, `coll2unix', that selects the right records and fields and outputs them in a tabular format.

The command line looks like this:

java xmlpipe collfile |
 coll2unix services service-name port/protocol aliases

`collfile' is the Web Collections file shown above. The first argument to `coll2unix' is the profile that we are looking for, since a Web Collection may contain records with different profiles. The program will skip all records with a different profile.

The other arguments are the field names we are looking for. The corresponding fields from the selected records are output in the indicated order, with tabs in between the fields. The result is not surprising:

tcpmux	1/tcp	
echo	7/tcp	
echo	7/udp	
discard	9/tcp	sink null
discard	9/udp	sink null
systat	11/tcp	users
daytime	13/tcp	
daytime	13/udp	
netstat	15/tcp	
chargen	19/tcp	ttytst source
chargen	19/udp	ttytst source
ftp-data	20/tcp	
ftp	21/tcp	
telnet	23/tcp	
smtp	25/tcp	mail
time	37/tcp	timserver
time	37/udp	timserver
name	42/udp	nameserver
whois	43/tcp	nicname
domain	53/udp	
domain	53/tcp	
hostnames	101/tcp	hostname
sunrpc	111/udp	rpcbind
sunrpc	111/tcp	rpcbind

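The result is the same as the file we started with, except for the comments, which we didn't put in the Web Collection, and for a trailing tab on the records that have no aliases, because the empty third field is still output.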

Of course, we could have added the comments as well, but that is left as an exercise for the reader...
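
The selection step described in this section can be sketched in shell and AWK as follows. This is not the source of coll2unix, and the exact output format of xmlpipe is not given here: the sketch assumes input lines in the style of nsgmls (an `Aname TYPE value' line for each attribute, placed before the `(element' line it belongs to, and a `)element' line at the end of each element). The function name from_collection is made up for this example, and escape sequences in attribute values are not decoded.

```shell
# Hypothetical stand-in for coll2unix; assumes nsgmls-style input lines:
#   Aname CDATA value   attribute of the element that starts next
#   (element            element start
#   )element            element end
# Usage: from_collection PROFILE FIELDNAME... < parser-output
from_collection() {
  awk '
    BEGIN {
      profile = ARGV[1]
      for (i = 2; i < ARGC; i++) want[i - 1] = ARGV[i]
      nwant = ARGC - 2
      ARGC = 1                         # read the parser output from stdin
    }
    # Attribute line: remember it until the element start that follows.
    /^A/ {
      attname = substr($1, 2)
      val = $0
      sub(/^[^ ]+ [^ ]+ ?/, "", val)   # drop "Aname TYPE "
      att[attname] = val
      next
    }
    # Start of a record: decide whether its profile is the wanted one.
    $0 == "(record" { keep = (att["profile"] == profile); split("", rec) }
    # Start of a field: store its value under its name.
    $0 == "(field"  { if (keep) rec[att["name"]] = att["value"] }
    # Attributes belong to one element only, so forget them at each start.
    /^\(/           { split("", att) }
    # End of a wanted record: print the requested fields, tab-separated.
    $0 == ")record" && keep {
      line = ""
      for (i = 1; i <= nwant; i++)
        line = line (i > 1 ? "\t" : "") rec[want[i]]
      print line
    }
  ' "$@"
}
```

A field that is missing from a record comes out as an empty column, which is why records without aliases end in a tab.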


Bert Bos
Last modified: Fri May 2 13:58:45 MET DST