python - Latent Semantic Analysis (LSA) Tutorial
I am trying to work through an LSA tutorial (edit, July 2017: dead link removed).
Here is the code from the tutorial:
    from numpy import asarray, log, sum, zeros
    from numpy.linalg import svd

    # doc1 and doc2 are the two documents (strings) to compare
    titles = [doc1, doc2]
    stopwords = ['and','edition','for','in','little','of','the','to']
    ignorechars = ''',:'!'''

    class lsa(object):
        def __init__(self, stopwords, ignorechars):
            # the stop words are read from a file here, so the stopwords argument is unused
            self.stopwords = open('stop words.txt', 'r').read()
            self.ignorechars = ignorechars
            self.wdict = {}   # word -> list of document indices, one entry per occurrence
            self.dcount = 0   # number of documents parsed so far

        def parse(self, doc):
            words = doc.split()
            for w in words:
                w = w.lower()
                if w in self.stopwords:
                    continue
                elif w in self.wdict:
                    self.wdict[w].append(self.dcount)
                else:
                    self.wdict[w] = [self.dcount]
            self.dcount += 1

        def build(self):
            # keep only the words that occur more than once
            self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
            self.keys.sort()
            # count matrix: one row per word, one column per document
            self.a = zeros([len(self.keys), self.dcount])
            for i, k in enumerate(self.keys):
                for d in self.wdict[k]:
                    self.a[i,d] += 1

        def calc(self):
            self.u, self.s, self.vt = svd(self.a)

        def tfidf(self):
            wordsperdoc = sum(self.a, axis=0)
            docsperword = sum(asarray(self.a > 0, 'i'), axis=1)
            rows, cols = self.a.shape
            for i in range(rows):
                for j in range(cols):
                    self.a[i,j] = (self.a[i,j] / wordsperdoc[j]) * log(float(cols) / docsperword[i])

        def printa(self):
            print('here count matrix')
            print(self.a)

        def printsvd(self):
            print('here singular values')
            print(self.s)
            print('here first 3 columns of u matrix')
            print(-1*self.u[:, 0:3])
            print('here first 3 rows of vt matrix')
            print(-1*self.vt[0:3, :])

    mylsa = lsa(stopwords, ignorechars)
    for t in titles:
        mylsa.parse(t)
    mylsa.build()
    mylsa.printa()
    mylsa.calc()
    mylsa.printsvd()
I have read it again and again, but there is something I cannot figure out. If I execute the code, the results are the following:
    here singular values
    [  4.28485706e+01   3.36652135e-14]
    here first 3 columns of u matrix
    [[  3.30049181e-02  -9.99311821e-01   7.14336493e-04]
     [  6.60098362e-02   1.43697129e-03   6.53394384e-02]
     [  6.60098362e-02   1.43697129e-03  -9.95952378e-01]
     ...,
     [  3.30049181e-02   7.18485644e-04   2.02381089e-03]
     [  9.90147543e-02   6.81929920e-03   6.35728804e-03]
     [  3.30049181e-02   7.18485644e-04   2.02381089e-03]]
    here first 3 rows of vt matrix
    array([[ 0.5015178 ,  0.86514732],
           [-0.86514732,  0.5015178 ]])
How can I figure out the similarity of doc1 and doc2 from these matrices? In the tf-idf algorithm I wrote myself, the result is a simple float number, but here I have 3 matrices. Any advice?
One option is to run a cosine similarity between the 2 matrices. I think you can find more information in this question, which was posted some time ago. I posted an answer to that question, and you will see that others have given great answers as well.
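A minimal sketch of the cosine idea, assuming each document is represented by its column of diag(s) * vt (one column per document) and that Python 3 with NumPy is available. The small count matrix a and the helper cosine_similarity are made up for illustration; they are not part of the tutorial code.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical count matrix: rows are words, columns are documents.
a = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])

u, s, vt = np.linalg.svd(a, full_matrices=False)

# Number of LSA dimensions to keep (at most the number of singular values).
k = 2

# Each column of diag(s) @ vt holds one document's coordinates in the
# LSA space; transpose so that each row is one document.
doc_vectors = (np.diag(s[:k]) @ vt[:k, :]).T

# A single float, comparable to what a plain tf-idf cosine would give.
similarity = cosine_similarity(doc_vectors[0], doc_vectors[1])
print(similarity)
```

When k keeps every singular value, this gives the same number as the cosine between the raw columns of a; a smaller k compares the documents in the reduced space instead. In the output you posted the second singular value is about 3.4e-14, so, if I read it right, the two document vectors are almost collinear and this measure would come out very close to 1.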