python - Latent Semantic Analysis (LSA) Tutorial


I am trying to work through the LSA tutorial at this link (edit: July 2017, removed dead link).

Here is the code from the tutorial:

    # Imports the snippet relies on (assumed; the post does not show them)
    from math import log
    from numpy import zeros, asarray, sum
    from numpy.linalg import svd

    titles = [doc1, doc2]  # doc1 and doc2 are the two document strings (not shown)
    stopwords = ['and','edition','for','in','little','of','the','to']
    ignorechars = ''',:'!'''

    class lsa(object):
        def __init__(self, stopwords, ignorechars):
            # Note: the stop words are read from a file as one string,
            # so the stopwords argument is not actually used.
            self.stopwords = open('stop words.txt', 'r').read()
            self.ignorechars = ignorechars
            self.wdict = {}   # word -> list of document indices it occurs in
            self.dcount = 0   # number of documents parsed so far

        def parse(self, doc):
            words = doc.split()
            for w in words:
                w = w.lower()
                if w in self.stopwords:
                    continue
                elif w in self.wdict:
                    self.wdict[w].append(self.dcount)
                else:
                    self.wdict[w] = [self.dcount]
            self.dcount += 1

        def build(self):
            # Keep only words that occur more than once, then fill the
            # term-document count matrix (rows = words, columns = documents).
            self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
            self.keys.sort()
            self.a = zeros([len(self.keys), self.dcount])
            for i, k in enumerate(self.keys):
                for d in self.wdict[k]:
                    self.a[i,d] += 1

        def calc(self):
            self.u, self.s, self.vt = svd(self.a)

        def tfidf(self):
            wordsperdoc = sum(self.a, axis=0)
            docsperword = sum(asarray(self.a > 0, 'i'), axis=1)
            rows, cols = self.a.shape
            for i in range(rows):
                for j in range(cols):
                    self.a[i,j] = (self.a[i,j] / wordsperdoc[j]) * log(float(cols) / docsperword[i])

        def printa(self):
            print('here count matrix')
            print(self.a)

        def printsvd(self):
            print('here singular values')
            print(self.s)
            print('here first 3 columns of u matrix')
            print(-1*self.u[:, 0:3])
            print('here first 3 rows of vt matrix')
            print(-1*self.vt[0:3, :])

    mylsa = lsa(stopwords, ignorechars)
    for t in titles:
        mylsa.parse(t)
    mylsa.build()
    mylsa.printa()
    mylsa.calc()
    mylsa.printsvd()

I have read it and read it again, but there is something I cannot figure out. If I execute the code, the results are the following:

    here singular values
    [  4.28485706e+01   3.36652135e-14]
    here first 3 columns of u matrix
    [[  3.30049181e-02  -9.99311821e-01   7.14336493e-04]
     [  6.60098362e-02   1.43697129e-03   6.53394384e-02]
     [  6.60098362e-02   1.43697129e-03  -9.95952378e-01]
     ...,
     [  3.30049181e-02   7.18485644e-04   2.02381089e-03]
     [  9.90147543e-02   6.81929920e-03   6.35728804e-03]
     [  3.30049181e-02   7.18485644e-04   2.02381089e-03]]
    here first 3 rows of vt matrix
    array([[ 0.5015178 ,  0.86514732],
           [-0.86514732,  0.5015178 ]])

How can I figure out the similarity of doc1 and doc2 from these matrices? In the tf-idf algorithm I wrote myself, the result was a simple float number, and here I have 3 matrices. Any advice?

One option is to run cosine similarity between the 2 matrices. I think you will find useful information in this question, which was posted some time ago. I posted an answer to that question, and I see that others have given great answers as well.

python: tf-idf-cosine: find document similarity
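
To make the cosine-similarity suggestion concrete, here is a minimal sketch of one common way to get a single float out of the SVD. It is an assumption about what is wanted, not the tutorial's own method. Each column of diag(s) * vt is treated as a document's coordinates in the LSA space, and two documents are compared by the cosine of the angle between their columns. The function names, the parameter k (number of dimensions kept) and the small matrix are all hypothetical and only for illustration.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def lsa_doc_similarity(a, i, j, k=2):
    """Similarity of documents i and j (columns of the term-document
    matrix a) using the first k dimensions of the LSA space."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    # Scale each kept row of vt by its singular value; column d of the
    # result holds the coordinates of document d.
    docs = s[:k, np.newaxis] * vt[:k, :]
    return cosine_similarity(docs[:, i], docs[:, j])

# Hypothetical 4-term x 3-document count matrix (not the asker's data).
a = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 2.0]])

sim_same = lsa_doc_similarity(a, 0, 1)   # documents 0 and 1 use the same terms
sim_diff = lsa_doc_similarity(a, 0, 2)   # documents 0 and 2 share no terms
print(sim_same, sim_diff)
```

With the class from the question, the same idea would apply to mylsa.s and mylsa.vt after calling mylsa.calc(). Note that with only two documents there are at most two singular values, so there is very little room to reduce dimensions; the approach is more informative on a larger collection.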

