python - How to make POS n-grams more effective?




I am doing text classification with an SVM, using POS n-grams as features. It takes me 2 hours to finish the POS unigrams alone. I have 5000 texts, and each text has 300 words. Here is my code:

    import nltk

    def posngrams(s, n):
        '''Calculate POS n-grams and return a dictionary.'''
        text = nltk.word_tokenize(s)
        text_tags = nltk.pos_tag(text)
        taglist = []
        output = {}
        # Keep only the tag from each (word, tag) pair.
        for item in text_tags:
            taglist.append(item[1])
        # Slide a window of n tags over the tag sequence and count each n-gram.
        for i in range(len(taglist) - n + 1):
            g = ' '.join(taglist[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output

I tried the same method with character n-grams, and it took only several minutes. Could you give me an idea of how to make my POS n-grams faster?

Using a server with these specs (from inxi -C):

    CPU(s): 2 Hexa core Intel Xeon CPU E5-2430 v2s (-HT-MCP-SMP-) cache: 30720 KB
    flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx)
    Clock Speeds: 1: 2500.036 MHz

Normally, the canonical answer would be to use batch tagging with pos_tag_sents, but it doesn't seem to be any faster.

Let's try to profile each of the steps leading up to the POS tags (using 1 core):

    import time
    from nltk.corpus import brown
    from nltk import sent_tokenize, word_tokenize, pos_tag
    from nltk import pos_tag_sents

    # Load the Brown corpus.
    start = time.time()
    brown_corpus = brown.raw()
    loading_time = time.time() - start
    print("Loading brown corpus took", loading_time)

    # Sentence tokenizing the corpus.
    start = time.time()
    brown_sents = sent_tokenize(brown_corpus)
    sent_time = time.time() - start
    print("Sentence tokenizing corpus took", sent_time)

    # Word tokenizing the corpus.
    start = time.time()
    brown_words = [word_tokenize(i) for i in brown_sents]
    word_time = time.time() - start
    print("Word tokenizing corpus took", word_time)

    # Loading, sent_tokenize and word_tokenize all together.
    start = time.time()
    brown_words = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
    tokenize_time = time.time() - start
    print("Loading and tokenizing corpus took", tokenize_time)

    # POS tagging one sentence at a time.
    start = time.time()
    brown_tagged = [pos_tag(word_tokenize(s)) for s in sent_tokenize(brown.raw())]
    tagging_time = time.time() - start
    print("Tagging sentence by sentence took", tagging_time)

    # Using pos_tag_sents (batch tagging).
    start = time.time()
    brown_tagged = pos_tag_sents([word_tokenize(s) for s in sent_tokenize(brown.raw())])
    tagging_time = time.time() - start
    print("Tagging sentences by batch took", tagging_time)

[out]:

    Loading brown corpus took 0.154870033264
    Sentence tokenizing corpus took 3.77206301689
    Word tokenizing corpus took 13.982845068
    Loading and tokenizing corpus took 17.8847839832
    Tagging sentence by sentence took 1114.65085101
    Tagging sentences by batch took 1104.63432097
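To put those timings in proportion, here is a small piece of arithmetic on the figures reported in the output. Note that both tagging runs also reload and re-tokenize the corpus, so their totals include the tokenizing time. The variable names are only illustrative.

```python
# Timings in seconds, copied from the profiling output.
tokenize_time = 17.8847839832   # loading + sentence + word tokenizing
per_sentence = 1114.65085101    # pos_tag, one sentence at a time
batch = 1104.63432097           # pos_tag_sents over all sentences

# Fraction of the sentence-by-sentence run that is not tagging at all.
tokenize_share = tokenize_time / per_sentence

# Relative saving from batch tagging compared with sentence-by-sentence tagging.
batch_saving = (per_sentence - batch) / per_sentence

print("Tokenizing share of the tagging run: %.1f%%" % (100 * tokenize_share))
print("Time saved by batch tagging: %.1f%%" % (100 * batch_saving))
```

Tokenizing comes to roughly 1.6% of the run and batching saves under 1%, so nearly all of the time is spent inside the tagger itself.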

Note: pos_tag_sents was called batch_pos_tag in versions before NLTK 3.0.
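Since tagging is the expensive step, one thing worth checking is that each text is tagged only once, even if you want features for several values of n. The following is a minimal sketch of that idea, not a drop-in replacement: count_ngrams and posngrams_batch are hypothetical names, each text is treated as a single token list as in the question, and the tokenizer and tagger are passed in as parameters (defaulting to nltk.word_tokenize and nltk.pos_tag_sents) so the counting logic can be tried without the NLTK models installed.

```python
from collections import Counter


def count_ngrams(tags, n):
    """Count n-grams over a list of POS tags, keyed by space-joined tags."""
    return Counter(' '.join(tags[i:i + n]) for i in range(len(tags) - n + 1))


def posngrams_batch(texts, ns=(1, 2, 3), tokenize=None, tag_sents=None):
    """Tag every text once, then derive n-gram counts for each n in `ns`.

    Returns one dict per text, mapping n to a Counter of POS n-grams.
    """
    if tokenize is None or tag_sents is None:
        # Imported lazily so the counting logic can be tried without NLTK.
        import nltk
        tokenize = tokenize or nltk.word_tokenize
        tag_sents = tag_sents or nltk.pos_tag_sents

    # The expensive step: one tagging pass over all texts.
    tagged = tag_sents([tokenize(t) for t in texts])

    features = []
    for sent in tagged:
        tags = [tag for _, tag in sent]
        # The cheap step: reuse the same tags for every n.
        features.append({n: count_ngrams(tags, n) for n in ns})
    return features
```

With this layout, going from unigrams to bigrams or trigrams costs only the counting loop, not another pass through the tagger.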

In conclusion, I think you need to consider another POS tagger to preprocess your data, or you will have to use threading to handle the POS tagging.
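As a rough sketch of the second option, the tagging can be spread over the server's cores. This version uses worker processes rather than threads, on the assumption that the tagger is CPU-bound Python code, in which case threads would mostly wait on the interpreter lock. The names chunked, tag_chunk and tag_in_parallel are hypothetical, and the defaults (12 workers for the 2 hexa-core CPUs, chunks of 100 texts) are guesses to tune, not measured values.

```python
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def tag_chunk(token_lists):
    """Worker: batch-tag one chunk of tokenized texts.

    NLTK is imported here so that each worker process loads it itself.
    """
    from nltk import pos_tag_sents
    return pos_tag_sents(token_lists)


def tag_in_parallel(token_lists, workers=12, chunk_size=100):
    """Tag tokenized texts across worker processes, preserving input order."""
    chunks = chunked(token_lists, chunk_size)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input chunks.
        tagged_chunks = pool.map(tag_chunk, chunks)
        return [sent for chunk in tagged_chunks for sent in chunk]
```

Call tag_in_parallel from under an if __name__ == "__main__": guard, because process pools re-import the main module on some platforms. Whether this actually helps depends on how long each worker takes to load the tagger model, so it is worth timing on a small sample first.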

python nlp svm
