# TreebankPreprocessing **Repository Path**: mirrors_hankcs/TreebankPreprocessing ## Basic Information - **Project Name**: TreebankPreprocessing - **Description**: Python scripts preprocessing Penn Treebank and Chinese Treebank - **Primary Language**: Unknown - **License**: GPL-3.0 - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2020-08-08 - **Last Updated**: 2026-05-23 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # TreebankPreprocessing Python scripts preprocessing [Penn Treebank (PTB)](https://catalog.ldc.upenn.edu/ldc99t42) and [Chinese Treebank 5.1 (CTB)](https://catalog.ldc.upenn.edu/LDC2005T01). They can convert treebanks to: | Corpus | Format | Description | | --- | --- | --- | | constituency parse tree | `.txt` | one line for one sentence | | dependency parse tree | `.conllx` | [Basic Stanford Dependencies (SD)](https://nlp.stanford.edu/software/stanford-dependencies.shtml) | | word segmentation corpus | `.tsv` | first column for characters, second column for BMES tags, sentences separated by a blank line | | part-of-speech tagging corpus | `.tsv` | first column for words, second column for tags, sentences separated by a blank line | When designing a tagger or parser, preprocessing treebanks is a troublesome problem. We need to: - Split dataset into train/dev/test, following conventional splits. - Remove xml tags inside CTB. - Combine the multiline bracketed files into one file, one line for one sentence. I wondered why there were no open-source tools handling these tedious works. Finally I decide to write one myself. Hopefully it will save you some time. ### Required software - Python3 - NLTK - Optional stanford-parser for converting to dependency parse trees ## Overview What kind of task can we perform on treebanks? ### Chinese Word Segmentation For CTB, segmentation corpus are split as per Jiang et al. (2009): - **CTB** Training: 001–270, 400–1151. Development: 301–325. Test: 271-300. ### Part-of-Speech Tagging - **PTB** Training: 0-18. Development: 19-21. Test: 22-24. As per Collins (2002) and Choi (2016). - **CTB** The same with Chinese Word Segmentation. ### Phrase Structure Parsing These scripts can also convert treebanks into the conventional data setup from Chen and Manning (2014), Dyer et al. (2015). The detailed splits are: - **PTB** Training: 02-21. Development: 22. Test: 23. - **CTB** Training: 001–815, 1001–1136. Development: 886–931, 1148–1151. Test: 816–885, 1137–1147. ### Dependency Parsing You will need Stanford Parser for converting phrase structure trees to dependency parse trees. Please download the [Stanford Parser Version 3.3.0](https://nlp.stanford.edu/software/stanford-parser-full-2013-11-12.zip) and place them in this folder: ``` TreebankPreprocessing ├── ... ├── stanford-parser-3.3.0-models.jar └── stanford-parser.jar ``` OK, let's do it on the fly. ## PTB ### 1. Import PTB into NLTK Bracketed files parsing relies on NLTK. Please follow [NLTK instruction](http://www.nltk.org/howto/corpus.html#parsed-corpora), put `BROWN` and `WSJ` into `nltk_data/corpora/ptb`, e.g. ``` ptb ├── BROWN └── WSJ ``` ### 2. Run `ptb.py` This script does all the work for you, only requires a path to store output. ```text $ python3 ptb.py --help usage: ptb.py [-h] --output OUTPUT [--task TASK] Combine Penn Treebank WSJ MRG files into train/dev/test set optional arguments: -h, --help show this help message and exit --output OUTPUT The folder where to store the output train/dev/test files --task TASK Which task (par, pos)? Use par for phrase structure parsing, pos for part-of-speech tagging ``` * You will get 3 `.txt` files corresponding to train/dev/test set. * If you want part-of-speech tagging corpora, simply append `--task pos`. This time, you get 3 `.tsv` files. * `.txt` files can be converted to `.conllx` files by `tb_to_stanford.py`: ``` $ python3 tb_to_stanford.py --help usage: tb_to_stanford.py [-h] --input INPUT --lang LANG --output OUTPUT Convert combined Penn Treebank files (.txt) to Stanford Dependency format (.conllx) optional arguments: -h, --help show this help message and exit --input INPUT The folder containing train.txt/dev.txt/test.txt in bracketed format --lang LANG Which language? Use en for English, cn for Chinese --output OUTPUT The folder where to store the output train.conllx/dev.conllx/test.conllx in Stanford Dependency format ``` ## CTB The CTB is a little messy, it contains extra xml tags in every gold tree, and is not natively supported by NLTK. You need to specify the CTB root path (the folder containing index.html). ``` $ python3 ctb.py --help usage: ctb.py [-h] --ctb CTB --output OUTPUT [--task TASK] Combine Chinese Treebank 5.1 fid files into train/dev/test set optional arguments: -h, --help show this help message and exit --ctb CTB The root path to Chinese Treebank 5.1 --output OUTPUT The folder where to store the output train.txt/dev.txt/test.txt --task TASK Which task (seg, pos, par)? Use seg for word segmentation, pos for part-of-speech tagging, par for phrase structure parsing ``` - Tagging and dependency parsing corpora can be obtained similar to PTB. Then you can start your research, enjoy it!