I’m doing some natural language processing on (Mandarin) Chinese text right now, using Stanford’s NLP tools, and I’m documenting the steps here. I’m just calling the tools from the command line, in a Unix environment, so if your use case is different from that, this probably won’t help you.
The tools we’ll be using are:
- the Stanford Word Segmenter
- the Stanford Parser
Step 1: Segmenting Chinese text
Mandarin Chinese is written without spaces between words, for example:
世界就是一个疯子的囚笼
“The world is a den of crazies.”
That’s a sentence from the Tatoeba sentence corpus, which is what I’m working on parsing, by the way.
Unsurprisingly, all natural language processing on Chinese text starts with word segmentation – we won’t get far by trying to interpret that whole string as a single element. There are lots of segmenters out there, including jieba in Python, which I like, but they may have different conventions for how they split things up. So if we’re going to use the output of the segmentation in another Stanford tool downstream, it’s best to stick to the Stanford Word Segmenter, whose usage is simple enough with the script provided.
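Assuming you’re running the bundled segment.sh script from the segmenter’s directory, the call looks something like this (the file paths are placeholders):

./segment.sh pku path/to/unsegmented.file UTF-8 0 > path/to/segmented.file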
The first argument can be either pku (for Beijing (Peking) University) or ctb (for Chinese Treebank). According to the docs, pku results in “smaller vocabulary sizes and OOV rates on test data than CTB models”, so I went with that. The “0” at the end indicates that we want the single best guess at the segmentation, without printing its associated probability.
If you’re curious, the output of the segmenter on the sentence above is:
世界 就 是 一个 疯子 的 囚笼
which is an eminently sensible segmentation.
The load times on the segmenter are pretty horrible, so it’s worth it to stuff all your text into a single file and segment everything at one go.
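For example, something like this (file names are placeholders):

cat path/to/texts/*.txt > all_text.txt
./segment.sh pku all_text.txt UTF-8 0 > all_text.segmented.txt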
Step 2: Parsing
The Stanford parser gives two different kinds of output. The first is a constituency parse, which shows the syntactic structure of the sentence:
(ROOT
(IP
(NP (NN 世界))
(VP
(ADVP (AD 就))
(VP (VC 是)
(NP
(DNP
(NP
(QP (CD 一)
(CLP (M 个)))
(NP (NN 疯子)))
(DEG 的))
(NP (NN 囚笼)))))))
And a dependency parse, which shows, broadly speaking, the grammatical relations the words have to each other:
nsubj(囚笼-8, 世界-1)
advmod(囚笼-8, 就-2)
cop(囚笼-8, 是-3)
nummod(个-5, 一-4)
clf(疯子-6, 个-5)
assmod(囚笼-8, 疯子-6)
case(疯子-6, 的-7)
root(ROOT-0, 囚笼-8)
There are specialized dependency parsers out there, but the Stanford parser first does a constituency parse and converts it to a dependency parse. This approach seems to work better in general.
There are five Chinese parsing models supplied with the software, which you can see by less-ing the stanford-parser-3.5.2-models.jar file:
- edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz
- edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz
- edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz
- edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz
- edu/stanford/nlp/models/lexparser/xinhuaPCFG.ser.gz
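If your less isn’t set up to peek inside jar archives, listing the contents with jar tf does the job too:

jar tf stanford-parser-3.5.2-models.jar | grep lexparser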
The FAQ says that the PCFG grammars are the fastest, while the factored grammars are the most accurate. So choosing either xinhuaFactored or chineseFactored is the way to go. The xinhua models are trained on newswire data, while the chinese models include more varied types of text, including some from other regions, so select the model that best fits your data.
In addition, there is a xinhuaFactoredSegmenting model, which works on unsegmented text, allowing us to bypass the segmentation procedure in Step 1. However, this isn’t recommended, as it doesn’t perform as well as the standalone segmenter.
Now that we’ve chosen our model, it’s time to actually do the parsing.
There is a lexparser-lang.sh helper script, but it assumes you’re using GB18030 encoding for your Chinese text. It’s simple to edit the script to include an -encoding utf-8 flag, but it’s not that much more difficult to just construct the Java call yourself.
Here’s how to get the constituency parse:
java -mx500m \
  -cp stanford-parser.jar:stanford-parser-3.5.2-models.jar \
  edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -encoding utf-8 \
  edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz \
  path/to/segmented.file > path/to/constituency.parsed.file
To get the dependency parse, just add an -outputFormat flag and specify typedDependencies:
java -mx500m \
  -cp stanford-parser.jar:stanford-parser-3.5.2-models.jar \
  edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -encoding utf-8 \
  -outputFormat typedDependencies \
  edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz \
  path/to/segmented.file > path/to/dependency.parsed.file
Incidentally, the parse that was chosen for this sentence is not the intended reading – it interprets the sentence as “The world is the den of a single (unspecified) crazy person”. Which seems scarily close to the truth. You might therefore want to consider multiple parses.
To get multiple parses, we need to use one of the PCFG parsers (not the factored parsers) and add the flag -printPCFGkBest n, where n is 2 or more.
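Something like this, for a 5-best run (swapping in the chinesePCFG model; paths are placeholders as before):

java -mx500m \
  -cp stanford-parser.jar:stanford-parser-3.5.2-models.jar \
  edu.stanford.nlp.parser.lexparser.LexicalizedParser \
  -encoding utf-8 \
  -printPCFGkBest 5 \
  edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz \
  path/to/segmented.file > path/to/kbest.parsed.file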
Troubleshooting
The two errors I got while trying to do the parsing step had to do with getting the appropriate Java version running and supplying the correct classpath.
Version 3.5.2 requires Java 8. If you don’t have it, the parser will turn up the error Unsupported major.minor version 52.0. If you get this error, make sure that (a) you have Java 8 installed, and (b) java invokes Java 8.
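One way to do the latter, assuming a Debian/Ubuntu-style system (your setup may differ), is to run:

sudo update-alternatives --config java

and select Java 8 from the menu that appears.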
The second error you may come across, if you follow the commands supplied in the docs, is Unable to resolve "edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz" as either class path, filename or URL. If you get this, check the classpath (-cp) argument you’re passing to Java. It should have two parts: the parser .jar and the models .jar, separated by a colon (a semicolon on Windows).
Conclusion
I’m really grateful that Stanford makes all this great software available, and particularly for non-English languages. I hope this guide saves someone some time in getting the Chinese parser working. If all goes well, I’ll be sharing what I’ve been using it for soon.