How to read a corpus of parsed sentences using NLTK in Python?


I am working with the BLLIP 1987-89 WSJ Corpus Release 1 (https://catalog.ldc.upenn.edu/ldc2000t43).

I am trying to use NLTK's SyntaxCorpusReader class to read in the parsed sentences. I'm trying to work through a simple example with just one file. Here is my code:

from nltk.corpus.reader import SyntaxCorpusReader
path = '/corpus/wsj'
filename = 'wsj1'
reader = SyntaxCorpusReader('/corpus/wsj', 'wsj1')

I am able to see the raw text of the file. It is returned as a single string of parsed sentences.

reader.raw()
u'(S1 (S (PP-LOC (IN in)\n\t(NP (NP (DT a) (NN move))\n\t (SBAR (WHNP#0 (WDT that))\n\t  (S (NP-SBJ (-NONE- *T*-0))\n\t   (VP (MD would)\n\t    (VP (VB represent)\n\t     (NP (NP (DT a) (JJ major) (NN break))\n\t      (PP (IN with) (NP (NN tradition))))\n\t     (PP-LOC (IN in)\n\t      (NP#1004 (DT the) (JJ legal) (NN profession)))))))))\n     (, ,)\n     (NP-SBJ#1005 (NP (NN law) (NNS firms))\n      (PP-LOC (IN in) (NP#1006 (DT this) (NN city))))\n     (VP (MD may)\n      (VP (VB become)\n       (NP (NP (DT the) (JJ first))\n\t(PP-LOC (IN in) (NP (DT the) (NN nation)))\n\t(SBAR (WHNP#1 (-NONE- 0))\n\t (S (NP-SBJ (-NONE- *T*-1))\n\t  (VP (TO to)\n\t   (VP (VB reward)\n\t    (NP#1009 (NNS non-lawyers))\n\t    (PP-MNR-CLR (IN with)\n\t     (NP#1010 (NP (DT the) (VBN cherished) (NN title))\n\t      (PP (IN of) (NP (NN partner))))))))))))\n     (. .)))\n...'

But when I try to get the parsed sentences, I receive an error.

reader.parsed_sents()
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/nltk/compat.py", line 487, in wrapper
    return method(self).encode('ascii', 'backslashreplace')
  File "/usr/lib/python2.7/dist-packages/nltk/util.py", line 664, in __repr__
    for elt in self:
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 430, in _read_parsed_sent_block
    return list(filter(None, [self._parse(t) for t in self._read_block(stream)]))
  File "/usr/lib/python2.7/dist-packages/nltk/corpus/reader/api.py", line 378, in _read_block
    raise NotImplementedError()
NotImplementedError

I'm not sure what the issue is. My goal is to read in the parsed sentences and use NLTK's Tree class to extract the text of the sentences, and perhaps to navigate the tree structure.

Hah, you had me going for a while there. The NotImplementedError is not a bug; it's NLTK's way of telling you that you're using an incomplete class. SyntaxCorpusReader is an "abstract class", intended as the basis for readers of corpora with a specific or complex syntax. In your case, you need to use BracketParseCorpusReader instead:

from nltk.corpus.reader import BracketParseCorpusReader
reader = BracketParseCorpusReader('/corpus/wsj', 'wsj1')
print(reader.parsed_sents()[0])
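To make the "abstract class" point concrete, here is a minimal sketch of the pattern. It is not NLTK's actual code: AbstractReader and LineReader are hypothetical names invented for this illustration. The base class holds the shared logic and raises NotImplementedError in the format-specific hooks, just as the traceback shows _read_block doing.

```python
class AbstractReader(object):
    """Hypothetical stand-in for an abstract corpus reader base class."""

    def __init__(self, text):
        self._text = text

    def parsed_sents(self):
        # Shared logic lives in the base class ...
        return [self._parse(block) for block in self._read_block()]

    def _read_block(self):
        # ... but the format-specific hooks are left to subclasses.
        raise NotImplementedError()

    def _parse(self, block):
        raise NotImplementedError()


class LineReader(AbstractReader):
    """Hypothetical concrete subclass: one 'sentence' per non-empty line."""

    def _read_block(self):
        return [line for line in self._text.splitlines() if line.strip()]

    def _parse(self, block):
        return block.split()


text = "law firms\nmay become\n"

# Calling the base class directly fails, like SyntaxCorpusReader does.
try:
    AbstractReader(text).parsed_sents()
    base_works = True
except NotImplementedError:
    base_works = False

# The subclass fills in the hooks, so the shared method now works.
parsed = LineReader(text).parsed_sents()
print(base_works)
print(parsed)
```

The same division of labour is presumably why the error only appears once parsed_sents() is evaluated: constructing the reader and calling raw() never touch the unimplemented hooks.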

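As for the stated goal of extracting the sentence text and walking the structure: parsed_sents() returns nltk.tree.Tree objects, so something along the following lines should work. This is only a sketch. The bracketed string is a hand-shortened parse written in the style of the corpus rather than a line read from the file, and it assumes NLTK 3, where Tree.fromstring is available.

```python
from nltk.tree import Tree

# Hand-shortened bracketed parse in the style of the corpus (illustrative only).
bracketed = (
    "(S1 (S (NP-SBJ (NN law) (NNS firms))"
    " (VP (MD may) (VP (VB become)"
    " (NP (NP (DT the) (JJ first))"
    " (SBAR (WHNP#1 (-NONE- 0))"
    " (S (NP-SBJ (-NONE- *T*-1))"
    " (VP (TO to) (VP (VB reward) (NP (NNS non-lawyers)))))))))"
    " (. .)))"
)
tree = Tree.fromstring(bracketed)

# tree.pos() yields (word, tag) pairs; skip empty elements tagged -NONE-.
words = [word for word, tag in tree.pos() if tag != "-NONE-"]
sentence = " ".join(words)

# Navigate by index: tree[0] is the S node, tree[0][0] is its first child.
subject = tree[0][0]

print(tree.label())
print(subject.label())
print(subject.leaves())
print(sentence)
```

Note that tree.leaves() on its own would also return the empty elements ("0" and "*T*-1" here), which is why the sketch filters on the -NONE- tag before joining the words.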