Software Archaeology: Re-generating the CoNLL 2000 Chunking Data

I've been using the data from the CoNLL 2000 shared task on syntactic chunking for some ongoing work, but the original dataset is tiny by modern standards. The train set is sections 15-18 of the Penn Treebank, and the test set is section 20---there is no development split.

Since my specific application doesn't need to be comparable to past work and models on the task, I set about re-generating the data from a larger portion of the Penn Treebank. This was more involved than anticipated, maybe because the data and task are so old---I had to do a bit of software archaeology, and the steps are detailed below.

If you ever do a career in science with computers, you'll be doing software archeology more often than you might think: rewamping old code/simulations/analysis to work in new environments. pic.twitter.com/4ZI2qOnwEe
— Gael Varoquaux (@GaelVaroquaux) January 15, 2018

Step 1: Source the script used to generate the data

The CoNLL 2000 shared task site helpfully notes:

http://ilk.uvt.nl/team/sabine/homepage/software.html
The Perl script that was used for generating these training and test data sets from the Penn Treebank. It has been written by Sabine Buchholz from Tilburg University.

However, following the link and trying to download the script leads to a dead link. That's perhaps to be expected, since it's been almost 20 years.

By searching for the filename on GitHub (a great tool for finding old software and scripts), I stumbled upon this repo from Matt Gormley that has a modified version of the chunklink Perl script. Here's a gist of the script for posterity: https://gist.github.com/nelson-liu/4a1872d7062868cbc1affb545710b836

Step 2: Run the script used to generate the data

Perl was before my time, but I managed to run the script with the perl on my MacBook. Here's the output of perl -v:

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

To run the script, I downloaded the Penn Treebank and wrote a quick bash script that invokes chunklink on each Penn Treebank section in turn, redirecting the output for each section to its own file.

The files generated by chunklink_2-2-2000_for_conll.pl are not in the CoNLL 2000 format, so I wrote a separate Python script called convert_to_conll2000_format.py to massage the output into the proper space-separated CoNLL chunking format. You can download that script here: https://gist.github.com/nelson-liu/4faaf5ccc67636939b299b289720ea94 . It should be compatible with both Python 2.x and 3.x.
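For reference, the CoNLL 2000 chunking format puts one token per line as a word, a POS tag, and a chunk tag separated by spaces, with a blank line between sentences. Here is a minimal sketch of a reader for that format. The function name read_conll2000 and the sample sentences are my own illustration and are not part of the conversion script in the gist.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)


def read_conll2000(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of (word, pos, chunk) tuples."""
    sentence: List[Token] = []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Each token line has exactly three whitespace-separated fields;
        # a malformed line raises ValueError here.
        word, pos, chunk = line.split()
        sentence.append((word, pos, chunk))
    if sentence:
        # The final sentence may not be followed by a blank line.
        yield sentence


# Illustrative sample only, not taken from the generated files.
sample = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
deficit NN I-NP
. . O

Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
"""

sentences = list(read_conll2000(sample.splitlines()))
print(len(sentences))
print(sentences[0][0])
```

A reader like this is a handy sanity check on the converted files, since any line that does not have exactly three fields fails loudly.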

#! /usr/bin/env bash
set -e

# Untar the raw PTB data
echo "Unzipping raw PTB data"
tar -xf treebank_3_LDC99T42.tgz

# Make chunking data for each PTB section
mkdir -p chunklink_generated_data
mkdir -p conll2000_data
# Note: zero-padded brace expansion ({00..24}) requires bash 4 or newer
for section_num in {00..24}
do
    echo "Creating chunking data for section ${section_num}"
    cat treebank_3/parsed/mrg/wsj/${section_num}/*.mrg | perl chunklink_2-2-2000_for_conll.pl -N -ns > chunklink_generated_data/${section_num}.chunklink
    python convert_to_conll2000_format.py chunklink_generated_data/${section_num}.chunklink > conll2000_data/${section_num}.conll
done

This produces a folder named conll2000_data containing 00.conll, 01.conll, and so on, with the CoNLL 2000-formatted data for each Penn Treebank section. You can use cat to combine sections and create whatever train, dev, and test splits you might want.
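If you'd rather script the splits than type out cat commands, something like the following sketch should work. It does the same plain concatenation as cat. The section assignment in SPLITS is purely a hypothetical example (it is not the CoNLL 2000 split), and the helper names build_split and section_range are mine. It also assumes each section file ends with a blank line, which I haven't verified; if it doesn't, the last sentence of one section would run into the first sentence of the next.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def section_range(start: int, end: int) -> List[str]:
    """Zero-padded section names from start to end, inclusive."""
    return [f"{n:02d}" for n in range(start, end + 1)]


def build_split(section_dir: Path, sections: Iterable[str], out_path: Path) -> None:
    """Concatenate the per-section .conll files, in order, into one split file."""
    with out_path.open("wb") as out_file:
        for section in sections:
            with (section_dir / f"{section}.conll").open("rb") as in_file:
                shutil.copyfileobj(in_file, out_file)


# Hypothetical split, for illustration only; pick whatever suits your task.
SPLITS = {
    "train": section_range(2, 21),
    "dev": ["22"],
    "test": ["23"],
}
```

For example, build_split(Path("conll2000_data"), SPLITS["train"], Path("train.txt")) would write the concatenation of sections 02 through 21 to train.txt.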

Happy chunking!

Nelson Liu's Blog
