python - How to make POS n-grams more effective?
I am doing text classification with an SVM, using POS n-grams as features. It takes me 2 hours just to finish the POS unigrams. I have 5000 texts, and each text contains 300 words. Here is my code:
    import nltk

    def posngrams(s, n):
        '''Calculate POS n-grams and return a dictionary.'''
        text = nltk.word_tokenize(s)
        text_tags = nltk.pos_tag(text)
        taglist = []
        output = {}
        # Keep only the tag from each (word, tag) pair.
        for item in text_tags:
            taglist.append(item[1])
        # Slide a window of size n over the tags and count each n-gram.
        for i in range(len(taglist) - n + 1):
            g = ' '.join(taglist[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output

I tried the same method with character n-grams, and it took only several minutes. Could you give me an idea of how to make my POS n-grams faster?
I am using a server with these specs, from inxi -C:
    CPU(s): 2 Hexa core Intel Xeon CPU E5-2430 v2s (-HT-MCP-SMP-) cache: 30720 KB flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx)
    Clock Speeds: 1: 2500.036 MHz

Normally, the canonical answer would be to use batch tagging with pos_tag_sents, but it doesn't seem to be any faster.
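For reference, here is a minimal sketch of what batch tagging could look like for the feature extraction in the question. The helper names `ngram_counts` and `pos_ngrams_batch` are hypothetical, and the `tokenize` and `tag_sents` parameters exist only so the sketch can be tried with stand-in functions; by default it assumes NLTK's `word_tokenize` and `pos_tag_sents`. Like the question's code, it treats each text as a single token sequence.

```python
from collections import Counter


def ngram_counts(tags, n):
    """Count n-grams over a list of POS tags (hypothetical helper)."""
    return Counter(
        " ".join(tags[i:i + n]) for i in range(len(tags) - n + 1)
    )


def pos_ngrams_batch(texts, n, tokenize=None, tag_sents=None):
    """Tag all texts in one batch call, then count POS n-grams per text.

    tokenize and tag_sents default to NLTK's word_tokenize and
    pos_tag_sents; they are parameters only so that stand-ins can be used.
    """
    if tokenize is None or tag_sents is None:
        import nltk  # assumes the tokenizer and tagger models are installed
        tokenize = tokenize or nltk.word_tokenize
        tag_sents = tag_sents or nltk.pos_tag_sents
    tokenized = [tokenize(text) for text in texts]
    tagged = tag_sents(tokenized)  # one call for the whole collection
    return [ngram_counts([tag for _, tag in sent], n) for sent in tagged]
```

Because the tags are computed once per text, the same tagged output can be reused for unigrams, bigrams and trigrams instead of re-tagging for every value of n.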
Let's try to profile some of the steps before the POS tagging (using only 1 core):
    import time

    from nltk.corpus import brown
    from nltk import sent_tokenize, word_tokenize, pos_tag
    from nltk import pos_tag_sents

    # Load the Brown corpus.
    start = time.time()
    brown_corpus = brown.raw()
    loading_time = time.time() - start
    print("loading brown corpus took %s" % loading_time)

    # Sentence tokenize the corpus.
    start = time.time()
    brown_sents = sent_tokenize(brown_corpus)
    sent_time = time.time() - start
    print("sentence tokenizing corpus took %s" % sent_time)

    # Word tokenize the corpus.
    start = time.time()
    brown_words = [word_tokenize(i) for i in brown_sents]
    word_time = time.time() - start
    print("word tokenizing corpus took %s" % word_time)

    # Loading, sent_tokenize and word_tokenize all together.
    start = time.time()
    brown_words = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
    tokenize_time = time.time() - start
    print("loading and tokenizing corpus took %s" % tokenize_time)

    # POS tagging one sentence at a time.
    start = time.time()
    brown_tagged = [pos_tag(word_tokenize(s)) for s in sent_tokenize(brown.raw())]
    tagging_time = time.time() - start
    print("tagging sentence by sentence took %s" % tagging_time)

    # Using pos_tag_sents (batch tagging).
    start = time.time()
    brown_tagged = pos_tag_sents([word_tokenize(s) for s in sent_tokenize(brown.raw())])
    tagging_time = time.time() - start
    print("tagging sentences by batch took %s" % tagging_time)

[out]:
    loading brown corpus took 0.154870033264
    sentence tokenizing corpus took 3.77206301689
    word tokenizing corpus took 13.982845068
    loading and tokenizing corpus took 17.8847839832
    tagging sentence by sentence took 1114.65085101
    tagging sentences by batch took 1104.63432097

Note: pos_tag_sents was called batch_pos_tag in versions before NLTK 3.0.
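To put those timings in proportion, the short calculation below estimates how much of the run is spent in the tagger itself. It assumes that loading and tokenizing cost the same inside the tagging runs as when they were measured on their own, which is only approximately true.

```python
# Timings in seconds, copied from the output above.
tokenize_time = 17.8847839832     # loading + sentence + word tokenizing
tag_by_sentence = 1114.65085101   # loading + tokenizing + pos_tag per sentence
tag_by_batch = 1104.63432097      # loading + tokenizing + pos_tag_sents

# Assumption: the tokenizing cost inside the tagging run equals tokenize_time.
tagging_only = tag_by_sentence - tokenize_time
tagging_share = tagging_only / tag_by_sentence

# Relative saving from batch tagging compared with sentence-by-sentence tagging.
batch_saving = (tag_by_sentence - tag_by_batch) / tag_by_sentence

print("tagging alone: %.1f s (%.1f%% of the run)" % (tagging_only, 100 * tagging_share))
print("saving from batch tagging: %.1f%%" % (100 * batch_saving))
```

Under that assumption, tagging accounts for roughly 98% of the total time, and batch tagging saves less than 1%, so tokenizing is not the bottleneck.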
In conclusion, I think you would need to consider some other POS tagger to preprocess your data, or you would have to use threading to handle the POS tagging.
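As a rough illustration of the second option, the sketch below spreads the tagging over several worker processes. Note that it swaps in `multiprocessing` rather than threads, because threads in CPython generally do not speed up CPU-bound Python code. The names `chunks`, `tag_chunk` and `tag_parallel`, and the defaults `workers=4` and `chunk_size=100`, are hypothetical choices for this sketch and not something measured in the answer.

```python
from multiprocessing import Pool


def chunks(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def tag_chunk(texts):
    """Tokenize and tag one chunk of texts; runs inside a worker process."""
    import nltk  # assumes the tokenizer and tagger models are installed
    return nltk.pos_tag_sents([nltk.word_tokenize(t) for t in texts])


def tag_parallel(texts, workers=4, chunk_size=100):
    """Tag texts across several processes, preserving the input order."""
    pool = Pool(workers)
    try:
        # pool.map returns results in the same order as the input chunks.
        tagged_chunks = pool.map(tag_chunk, chunks(texts, chunk_size))
    finally:
        pool.close()
        pool.join()
    # Flatten the per-chunk results back into one list, one entry per text.
    return [tagged for chunk in tagged_chunks for tagged in chunk]
```

On platforms that start workers by spawning a fresh interpreter, `tag_parallel` should be called from under an `if __name__ == "__main__":` guard. Whether this gives a speed-up close to the number of cores would have to be measured; each worker loads its own copy of the tagger model.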
python nlp svm