python - How to make POS n-grams more effective?

I am doing text classification with an SVM, using POS n-grams as features. It takes me 2 hours to finish the POS unigrams alone. I have 5000 texts, and each text has 300 words. Here is my code:

    import nltk

    def posngrams(s, n):
        '''Calculate POS n-grams and return a dictionary.'''
        text = nltk.word_tokenize(s)
        text_tags = nltk.pos_tag(text)
        taglist = []
        output = {}
        # Keep only the tag from each (word, tag) pair.
        for item in text_tags:
            taglist.append(item[1])
        # Slide a window of n tags over the text and count each n-gram.
        for i in range(len(taglist) - n + 1):
            g = ' '.join(taglist[i:i+n])
            output.setdefault(g, 0)
            output[g] += 1
        return output

I tried the same method with character n-grams, and it took only several minutes. Could you give me an idea of how to make my POS n-grams faster?
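One thing worth checking in the function above is that it tokenizes and tags the text on every call, so computing unigrams, bigrams and trigrams for the same text would tag it three times. A minimal sketch, assuming the tag sequence is computed once and passed in (the helper name `count_ngrams` and the sample tags are made up for illustration):

```python
from collections import Counter


def count_ngrams(tags, n):
    """Count the n-grams in an already-computed sequence of POS tags."""
    # Zipping n staggered slices of the list yields each window of length n.
    windows = zip(*(tags[i:] for i in range(n)))
    return Counter(' '.join(w) for w in windows)


# Hypothetical tag sequence; in practice it would come from nltk.pos_tag,
# called once per text and reused for every value of n.
tags = ['DT', 'NN', 'VBZ', 'DT', 'NN']
unigrams = count_ngrams(tags, 1)
bigrams = count_ngrams(tags, 2)
```

This only avoids repeated tagging of the same text; it does not make a single tagging pass any cheaper.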

Using a server with these specs, from inxi -C:

    CPU(s): 2 Hexa core Intel Xeon CPU E5-2430 v2s (-HT-MCP-SMP-) cache: 30720 KB flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) Clock Speeds: 1: 2500.036 MHz

Normally, the canonical answer would be to use batch tagging with pos_tag_sents, but it doesn't seem to be any faster.

Let's try to profile the steps leading up to the POS tags, and then the tagging itself (using only 1 core):

    import time

    from nltk.corpus import brown
    from nltk import sent_tokenize, word_tokenize, pos_tag
    from nltk import pos_tag_sents

    # Load the Brown corpus.
    start = time.time()
    brown_corpus = brown.raw()
    loading_time = time.time() - start
    print("Loading brown corpus took %s" % loading_time)

    # Sentence tokenizing the corpus.
    start = time.time()
    brown_sents = sent_tokenize(brown_corpus)
    sent_time = time.time() - start
    print("Sentence tokenizing corpus took %s" % sent_time)

    # Word tokenizing the corpus.
    start = time.time()
    brown_words = [word_tokenize(i) for i in brown_sents]
    word_time = time.time() - start
    print("Word tokenizing corpus took %s" % word_time)

    # Loading, sent_tokenize and word_tokenize all together.
    start = time.time()
    brown_words = [word_tokenize(s) for s in sent_tokenize(brown.raw())]
    tokenize_time = time.time() - start
    print("Loading and tokenizing corpus took %s" % tokenize_time)

    # POS tagging one sentence at a time.
    start = time.time()
    brown_tagged = [pos_tag(word_tokenize(s)) for s in sent_tokenize(brown.raw())]
    tagging_time = time.time() - start
    print("Tagging sentence by sentence took %s" % tagging_time)

    # POS tagging in batch with pos_tag_sents (formerly batch_pos_tag).
    start = time.time()
    brown_tagged = pos_tag_sents([word_tokenize(s) for s in sent_tokenize(brown.raw())])
    tagging_time = time.time() - start
    print("Tagging sentences by batch took %s" % tagging_time)

[out]:

    Loading brown corpus took 0.154870033264
    Sentence tokenizing corpus took 3.77206301689
    Word tokenizing corpus took 13.982845068
    Loading and tokenizing corpus took 17.8847839832
    Tagging sentence by sentence took 1114.65085101
    Tagging sentences by batch took 1104.63432097
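To put these numbers in proportion, here is a small calculation on the timings printed above. Both tagging runs also repeat the loading and tokenizing work, so subtracting the tokenizing time gives only a rough estimate of the tagging cost on its own (the variable names are mine):

```python
# Timings in seconds, copied from the output above.
tokenize_time = 17.8847839832   # loading + sentence and word tokenizing
tagging_total = 1114.65085101   # tokenizing + tagging, one sentence at a time
batch_total = 1104.63432097     # tokenizing + tagging, in batch

# Rough tagging-only cost: the tagging run repeats the tokenizing work.
tagging_only = tagging_total - tokenize_time
tagging_share = tagging_only / tagging_total

# Relative saving from switching to batch tagging.
batch_saving = (tagging_total - batch_total) / tagging_total

print("Tagging share of the run: %.1f%%" % (100 * tagging_share))
print("Saving from batch tagging: %.1f%%" % (100 * batch_saving))
```

On these figures, tagging accounts for roughly 98% of the run, and batching saves less than 1%.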

Note: pos_tag_sents was called batch_pos_tag in versions before NLTK 3.0.

In conclusion, I think you would need to consider some other POS tagger to preprocess your data, or you will have to use threading to handle the POS tagging.
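As a rough sketch of the second option: POS tagging is CPU-bound, so in CPython separate processes are more likely to help than threads. The example below therefore swaps in a process pool from the standard library's `concurrent.futures` rather than threading. The function names `tag_text` and `tag_all` are invented for illustration, the chunk size of 50 is an arbitrary guess, and `tag_text` assumes NLTK is installed along with its tokenizer and tagger data:

```python
from concurrent.futures import ProcessPoolExecutor


def tag_text(text):
    """Return the list of POS tags for one text."""
    # Imported inside the function, so NLTK is loaded on the first call
    # in whichever process runs it.
    import nltk
    return [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]


def tag_all(texts, workers=4, tag_fn=tag_text, executor_cls=ProcessPoolExecutor):
    """Tag every text with a pool of workers, preserving input order."""
    with executor_cls(max_workers=workers) as pool:
        # chunksize groups several texts into one task, which reduces the
        # overhead of passing work between processes.
        return list(pool.map(tag_fn, texts, chunksize=50))
```

A call such as `tag_all(texts, workers=12)` would spread the 5000 texts over the server's 12 cores. On platforms that start workers by spawning a fresh interpreter, the call needs to sit under an `if __name__ == "__main__":` guard. How much time this actually saves was not measured here.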

python nlp svm
