NLP/Machine Learning text comparison


I'm in the process of developing a program capable of comparing a small text (say 250 characters) to a collection of similar texts (around 1000-2000 texts).

The purpose is to evaluate whether a text A is similar to one or more texts in the collection and, if so, the matching text in the collection has to be retrievable by its ID. Each text has a unique ID.

There are two ways I'd like the output to be:

Option 1: Text A matched text B with 90% similarity, text C with 70% similarity, and so on.

Option 2: Text A matched text D with the highest similarity.

I have read some machine learning in school, but I'm not sure which algorithm suits my problem best, or whether I should consider using NLP (I'm not familiar with the subject).

Does anyone have a suggestion of which algorithm to use, or where I can find the necessary literature to solve my problem?

Thanks for any contribution!

It does not seem to be a machine learning problem; you are simply looking for a text similarity measure. Once you select one, you just sort your data according to the achieved "scores".
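As a minimal sketch of the "pick a measure, then sort by score" idea, the snippet below ranks a small collection against a query text. It uses `difflib.SequenceMatcher` from the Python standard library only as a stand-in measure (it is not the asker's chosen method, and any other similarity function could be swapped in). The IDs, texts, and the 0.5 threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def rank_by_similarity(query, collection):
    """Return (text_id, score) pairs sorted from most to least similar.

    `collection` maps a unique ID to its text. The score is
    SequenceMatcher.ratio(), a float in [0, 1]; any other similarity
    measure could be substituted here.
    """
    scores = [
        (text_id, SequenceMatcher(None, query, text).ratio())
        for text_id, text in collection.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection: unique ID -> text.
collection = {
    "B": "the quick brown fox jumps over the lazy dog",
    "C": "a quick brown dog jumps over the lazy fox",
    "D": "machine learning is a subfield of computer science",
}
query = "the quick brown fox jumped over the lazy dog"

ranking = rank_by_similarity(query, collection)

# Option 1: every text whose similarity reaches a chosen threshold.
threshold = 0.5  # assumed cut-off, tune for your data
matches = [(text_id, score) for text_id, score in ranking if score >= threshold]

# Option 2: only the single best match.
best_id, best_score = ranking[0]
```

With 1000-2000 short texts, a brute-force scan like this is usually fast enough; the ranking gives option 1 directly and its first element gives option 2.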

Depending on your texts, you can use one of the following metrics (list from the wiki) or define your own:

  • Hamming distance
  • Levenshtein distance and Damerau–Levenshtein distance
  • Needleman–Wunsch distance or Sellers' algorithm
  • Smith–Waterman distance
  • Gotoh distance or Smith-Waterman-Gotoh distance
  • Monge Elkan distance
  • Block distance or L1 distance or city block distance
  • Jaro–Winkler distance
  • Soundex distance metric
  • Simple matching coefficient (SMC)
  • Dice's coefficient
  • Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
  • Tversky index
  • Overlap coefficient
  • Euclidean distance or L2 distance
  • Cosine similarity
  • Variational distance
  • Hellinger distance or Bhattacharyya distance
  • Information radius (Jensen–Shannon divergence)
  • Skew divergence
  • Confusion probability
  • Tau metric, an approximation of the Kullback–Leibler divergence
  • Fellegi and Sunters metric (SFS)
  • Maximal matches
  • Lee distance
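To make one entry of the list concrete, here is a rough sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The `similarity` helper, which divides by the length of the longer string to get a score between 0 and 1, is just one common normalization chosen here for illustration, not the only option.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalize the distance to a score in [0, 1]; 1.0 means identical.

    Dividing by the longer length is an assumed convention.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Multiplying `similarity(a, b)` by 100 gives a percentage of the kind asked for in the question. For 250-character texts the quadratic cost per pair is modest.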

Some of the above (e.g. cosine similarity) require transforming your data into a vectorized format. This process can also be achieved in many ways, the simplest possible being bag of words/TF-IDF techniques.
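A hedged, standard-library-only sketch of that vectorization step follows: each text becomes a sparse TF-IDF vector, and two vectors are compared with cosine similarity. The whitespace tokenizer and the smoothed IDF formula are simplifying assumptions made here; a real project would more likely use an existing library's vectorizer.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer; an assumption made for brevity.
    return text.lower().split()


def tfidf_vectors(texts):
    """Turn each text into a sparse {term: weight} TF-IDF vector."""
    token_lists = [tokenize(text) for text in texts]
    n_docs = len(token_lists)
    # Document frequency: in how many texts each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, so terms present in every text keep a non-zero weight.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (in [0, 1] here,
    because all weights are non-negative)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical example texts.
texts = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
vectors = tfidf_vectors(texts)
```

In this sketch the query text has to be vectorized together with the collection (for example, appended to `texts`) so that all vectors share the same document frequencies.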

The list itself is far from being complete; it is just a draft of such methods. In particular, there are many string kernels, which are also suited for measuring text similarity. For example, the WordNet kernel can measure semantic similarity based on one of the most complete semantic databases of the English language.

