Code for flattening the Gigaword corpus and associated usage instructions are at nelson-liu/flatten_gigaword
The English Gigaword Corpus is a massive collection of newswire text; the unzipped corpus is ~26 gigabytes and contains ~4 billion tokens. It's a commonly used corpus for language modeling and other NLP tasks that require large amounts of monolingual English data.
Despite its relative ubiquity, I couldn't find anything online to do something very simple --- extract the text from all the files in the corpus into one large text file. My motivation for doing this was to train an n-gram language model, but there are a variety of other uses for the flattened data as well.
Decompressing
The Gigaword corpus comes with seven directories of data compressed in gzip format. The first step, naturally, is to unzip all of it. To recursively unzip all the data in these directories, use the -r flag of gunzip:
gunzip -r /gigaword_path/data/
If your gunzip doesn't have this flag, piping the results of find to gunzip (via xargs) should do the trick.
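For example, something along these lines should work (a sketch, not a command from the original instructions; it assumes the compressed files end in .gz, and xargs passes the file list found by find on to gunzip):
find /gigaword_path/data/ -name '*.gz' | xargs gunzip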
Parsing and tokenizing an individual data file
In each of the directories, there are a variable number of files. Each of these data files is in SGML format. To parse a single file, I used the BeautifulSoup library. Extracting the raw text was as simple as finding all the words between <p> tags.
However, after looking at the data, I quickly realized that it includes the original line breaks as found inside the newswire text. Thus, one sentence can often have multiple newlines within it --- this confuses many tokenizers. To deal with this, I replace all consecutive newlines with spaces, and then tokenize each paragraph (block of text in a <p> tag) with SpaCy.
Thus, to parse a file, I:
- Iterate through all the paragraphs in the SGML file,
- Extract the text and tokenize it, and
- Write a new flattened file with one paragraph per line.
Each line in the output flattened file is thus a paragraph, and the tokens (as delimited by SpaCy) are space-separated. These files are perfectly compatible with language modeling toolkits like KenLM.
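As a rough illustration of this procedure (a minimal sketch, not the repository's actual script --- the function name, the html.parser choice, and the blank spaCy pipeline are my own assumptions), parsing and flattening one file might look like this:

import re

import spacy
from bs4 import BeautifulSoup

# Only the tokenizer of a blank English pipeline is needed here.
nlp = spacy.blank("en")

def flatten_one_file(input_path, output_path):
    with open(input_path, encoding="utf-8") as sgml_file:
        # Gigaword SGML is not strict XML, so use a lenient parser.
        soup = BeautifulSoup(sgml_file.read(), "html.parser")
    with open(output_path, "w", encoding="utf-8") as out_file:
        for paragraph in soup.find_all("p"):
            # Collapse the newswire line breaks within a paragraph into spaces.
            text = re.sub(r"\s*\n\s*", " ", paragraph.get_text()).strip()
            if not text:
                continue
            tokens = [token.text for token in nlp(text)]
            # One paragraph per line, tokens separated by spaces.
            out_file.write(" ".join(tokens) + "\n")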
The script to parse a single file is at flatten_one_gigaword.py
Making it fast with parallel processing
Parsing one file can take quite a while (up to around 3 minutes). Combined with the fact that the Gigaword corpus has 1010 files, it's easy to see how processing the whole dataset could be quite slow.
However, the task is embarrassingly parallel, so let's use multiple cores to flatten files simultaneously and merge them all at the end! This was pretty easily accomplished with GNU parallel, like so:
find ${GIGAWORDDIR}/data/*/* | parallel --gnu --progress -j ${NUMJOBS} \
python flatten_one_gigaword.py \
--gigaword-path \{\} \
--output-dir ${OUTPUTDIR}
This command finds all the data files in the Gigaword directory on disk, and then runs flatten_one_gigaword.py on each of them. The output directory is where the flattened version of each data file is written, and we can simply cat them together at the end to get our desired output. The final output is a file named flattened_gigaword.txt, with one paragraph per line and tokens delimited by spaces.
cat ${OUTPUTDIR}/*.flat > ${OUTPUTDIR}/flattened_gigaword.txt
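As an optional sanity check (my own suggestion, not part of the original scripts), counting the lines of the merged file gives the number of paragraphs in the flattened corpus:
wc -l ${OUTPUTDIR}/flattened_gigaword.txt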
The script to parse the entire dataset in parallel is at flatten_all_gigaword.sh