Using Apache Mahout for Document Similarity.
The steps below run Mahout on a collection of text files.
- sh mahout seqdirectory -c UTF-8 -i /Users/xxxx/myfiles/ -o seqfiles
- sh mahout seq2sparse -i seqfiles/ -o vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv
- sh mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
- sh mahout rowsimilarity -i matrix/matrix -o similarity --similarityClassname SIMILARITY_COSINE -m 10 -ess
- sh mahout seqdumper -i similarity > similarity.txt
- sh mahout seqdumper -i matrix/docIndex > docIndex.txt
Apache Mahout performs the following steps:
- Tokenize and Transform
- Generate word vectors and weights
- Find document similarity from the TF-IDF (Term Frequency - Inverse Document Frequency) vectors using cosine similarity (SIMILARITY_COSINE)
Tokenization
The first step is to convert each text document into a sequence of tokens. Content is split into single-word tokens, and every token is then converted to lower case so that the content is standardized.
Create n-grams: An n-gram is a series of n consecutive tokens. The n-grams of a document are all such series of length n found in its token sequence. The -ng 3 option in the seq2sparse command above sets the maximum n-gram size to 3.
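To make the tokenizing and n-gram steps concrete, here is a minimal Python sketch. It is not Mahout's code: the function names tokenize and ngrams are made up for illustration, and splitting on whitespace is an assumption that is much simpler than the analyzer Mahout actually applies.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (simplified tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick brown Fox")
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A document with fewer than n tokens simply yields no n-grams of that size.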
Generate Vectors and Weights
Each document is converted to a word vector of n-grams, with a weight assigned to each n-gram. The weight is based on TF-IDF (Term Frequency - Inverse Document Frequency), a measure of how important a term is in a document. It grows with the frequency of the term in the document but is offset by how many documents in the whole document set contain the term. This balances the weighting, because terms that occur in many documents are less significant when determining similarity.
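The weighting idea can be sketched in a few lines of Python. This uses the textbook form, tf * log(N / df), which is an assumption made for illustration; the exact formula Mahout applies may differ (for example in smoothing and normalization). The function name tfidf and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term of each document with tf * log(N / df).

    docs is a list of token (or n-gram) lists; returns one dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


docs = [
    ["apache", "mahout", "vectors"],
    ["apache", "lucene"],
    ["apache", "mahout"],
]
weights = tfidf(docs)
```

In this toy collection "apache" occurs in every document, so its weight drops to zero, while "vectors" and "lucene", which each occur in a single document, get the highest weights.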
Document Similarity
Each document’s word vector is compared with every other document’s vector. Document similarity is the cosine similarity between the word vectors: two vectors with the same orientation have a cosine similarity of 1, and two vectors at 90 degrees have a similarity of 0. The result is a similarity matrix holding a score for each document against other documents; with -m 10, at most 10 similarities are kept per row.
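As a rough illustration of the measure, the sketch below computes cosine similarity between two sparse vectors stored as Python dicts (term to weight). The function name cosine and the sample vectors are made up for this example; it is not how Mahout implements the distributed job.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


a = {"apache mahout": 0.5, "document similarity": 1.0}
b = {"apache mahout": 1.0, "document similarity": 2.0}  # same direction as a
c = {"hadoop cluster": 3.0}                             # shares no term with a

same_direction = cosine(a, b)
orthogonal = cosine(a, c)
```

Because the arithmetic is floating point, a document compared with itself may come out marginally above or below 1, which is why values such as 1.0000000000000002 and 0.9999999999999999 appear in the output.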
Sample docIndex.txt
Key: 0: Value: /File558
Key: 1: Value: /File4340
Key: 2: Value: /File4208
Key: 3: Value: /File471
Sample similarity.txt
Key: 0: Value: {0:1.0000000000000002,3865:0.15434639725421775,318:0.16924612516623572,558:0.24384373237418783,471:0.16826200999114294,4340:0.16651713654958472,7:0.164357811310303,1841:0.15904628827648598,4208:0.16296041411613846,14:0.14043468960009342}
Key: 1: Value: {1615:0.3312716794782159,2181:0.2840451034186393,2126:0.32415666313248037,3188:0.1628119850871482,1496:0.24775558568026784,1:1.0,1575:0.13525396149776772,1269:0.13286526354605824,28:0.45703740702783774,1866:0.3262754564949865}
Key: 2: Value: {2:1.0,4350:0.13571272853930183,348:0.12600225826696973,3210:0.13949921190207168,560:0.15234464042319912,3294:0.2889578044491356,802:0.17942407070282945,1633:0.1964965769704117,3355:0.1298340236494648,495:0.12627029072308343}
Key: 3: Value: {1990:0.17193706865160252,3700:0.1302978723523794,2082:0.16196813164388732,2227:0.12561545966019144,665:0.15584753719122243,1163:0.19345501767136697,6:0.22582692114456704,3:1.0,1555:0.1742692199362734,4:0.1818170186646791}
Key: 4: Value: {1990:0.13213250264185705,3:0.1818170186646791,2082:0.1349661297563062,1163:0.15679941310702006,1555:0.18261523201994426,6:0.1981917938129962,738:0.29646182670818444,1684:0.21050749439902763,4:0.9999999999999999,3150:0.1397676584929176}
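The output above is the similarity matrix: each row lists document keys paired with their similarity scores,
e.g. [ 558:0.24384373237418783,471:0.16826200999114294 ]
In the row for key 0, docIndex key 0 maps to the entry 0:1.0000000000000002 in the similarity.txt output. This is the document compared with itself; the score differs from 1 only by floating-point rounding.
Similarly, docIndex key 558 maps to the entry 558:0.24384373237418783 in the similarity.txt output.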
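The lookup described above can be automated. The following Python sketch assumes that the dumped files contain lines of the form shown in the samples ("Key: N: Value: ...") and skips every other line; the helper names are hypothetical, and the similarity row used here is an abbreviated copy of the row for key 4.

```python
import re

# Matches lines such as "Key: 3: Value: /File471"
LINE = re.compile(r"^Key: (\d+): Value: (.*)$")


def parse_doc_index(lines):
    """Map each integer docIndex key to its original file name."""
    index = {}
    for line in lines:
        m = LINE.match(line.strip())
        if m:
            index[int(m.group(1))] = m.group(2)
    return index


def parse_similarity(lines):
    """Map each row key to a {column key: score} dict."""
    rows = {}
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not a "Key: ... Value: ..." line
        row = {}
        for pair in m.group(2).strip("{}").split(","):
            if pair:
                col, score = pair.split(":")
                row[int(col)] = float(score)
        rows[int(m.group(1))] = row
    return rows


def top_matches(rows, index, key, k=3):
    """Return the k most similar documents for key, excluding the document itself."""
    others = [(c, s) for c, s in rows[key].items() if c != key]
    others.sort(key=lambda cs: cs[1], reverse=True)
    # Fall back to the numeric key when docIndex has no entry for it.
    return [(index.get(c, "<key %d>" % c), s) for c, s in others[:k]]


doc_index = parse_doc_index([
    "Key: 0: Value: /File558",
    "Key: 3: Value: /File471",
])
rows = parse_similarity([
    "Key: 4: Value: {3:0.1818170186646791,6:0.1981917938129962,4:0.9999999999999999}",
])
matches = top_matches(rows, doc_index, 4, k=2)
```

With real output, the lines would be read from docIndex.txt and similarity.txt (for example with open("docIndex.txt")) instead of the inline lists.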
--- Sample: dumping the generated vectors for inspection
sh mahout seqdumper -i vectors/tfidf-vectors > tfidf-vectors.txt
sh mahout seqdumper -i vectors/tf-vectors > tf-vectors.txt