Samar's Blog: Apache Mahout for Document Similarity.

Using Apache Mahout for Document Similarity.

Below are the steps to run on a collection of text files.

sh mahout seqdirectory -c UTF-8 -i /Users/xxxx/myfiles/ -o seqfiles
sh mahout seq2sparse -i seqfiles/ -o vectors/ -ow -chunk 100 -x 90 -seq -ml 50 -n 2 -s 5 -md 5 -ng 3 -nv
sh mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
sh mahout rowsimilarity -i matrix/matrix -o similarity --similarityClassname SIMILARITY_COSINE -m 10 -ess
sh mahout seqdumper -i similarity > similarity.txt
sh mahout seqdumper -i matrix/docIndex > docIndex.txt

Apache Mahout performs the following steps:

Tokenize and Transform
Generate word vectors and weights
Find document similarity based on TF-IDF (Term Frequency - Inverse Document Frequency) weights using cosine similarity (SIMILARITY_COSINE)

Tokenization

The first step is to convert each text document into a sequence of tokens. Content is tokenized into single words. All tokens are then converted to lowercase so that the content is standardized.

Create n-grams: An n-gram is a series of n consecutive tokens. The n-grams of a document are all such series of consecutive tokens of length n. The -ng 3 option in the seq2sparse command above sets the maximum n-gram size to 3.
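To make tokenization and n-grams concrete, here is a minimal Python sketch. It is only an illustration: the regular expression is a simplifying assumption, Mahout runs its own analyzer, and the function names tokenize and ngrams are made up for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into single-word tokens.

    Assumption: a token is any run of letters or digits. Mahout's own
    analyzer applies more elaborate rules than this.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return every series of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Apache Mahout computes Document Similarity")
print(tokens)
# ['apache', 'mahout', 'computes', 'document', 'similarity']
print(ngrams(tokens, 2))
# ['apache mahout', 'mahout computes', 'computes document', 'document similarity']
```

A document with fewer than n tokens simply yields no n-grams of that size.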

Generate Vectors and Weights

Each document is converted to a word vector of n-grams, with a weight assigned to each n-gram. The weight is based on TF-IDF (Term Frequency - Inverse Document Frequency), a measure of the importance of a term in a document. It grows with the frequency of the term in the document but is offset by how common the term is across the whole document set. This balances the weighting, since some terms occur more commonly than others and are less significant in arriving at similarity.
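The weighting idea can be sketched in a few lines of Python. This uses the textbook form tf * log(N / df), which is an assumption made for illustration; the exact formula Mahout applies may differ in detail. The function name tfidf and the tiny document set are made up for this example.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document.

    Textbook weighting (an assumption, not necessarily Mahout's formula):
    weight = term count in the document * log(number of documents / documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["mahout", "vector"],
    ["mahout", "cluster"],
    ["mahout", "vector", "vector"],
]
for w in tfidf(docs):
    print(w)
```

Here "mahout" appears in every document, so its weight is 0 everywhere, while "cluster", which appears in only one document, gets the highest per-occurrence weight.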

Document Similarity

Each document's word vector is compared with every other document's vector. Document similarity is the cosine similarity between the word vectors: two vectors with the same orientation have a cosine similarity of 1, and two vectors at 90° have a similarity of 0. The result is a similarity matrix holding the score of each document against every other document. With -m 10 in the rowsimilarity command above, at most 10 similarities are kept per document.
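As a rough illustration of the measure, here is cosine similarity over sparse vectors stored as Python dicts. The dict representation and the function name cosine_similarity are assumptions for this sketch, not how Mahout stores or compares vectors internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no orientation
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity({"x": 1.0, "y": 1.0}, {"x": 2.0, "y": 2.0})
orthogonal = cosine_similarity({"x": 1.0}, {"y": 1.0})
print(same_direction)  # approximately 1.0
print(orthogonal)      # 0.0
```

Because the computation is done in floating point, a document compared with itself can come out marginally above or below 1, which is most likely why values such as 1.0000000000000002 and 0.9999999999999999 show up in the sample output.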

Sample docIndex.txt:

Key: 0: Value: /File558

Key: 1: Value: /File4340

Key: 2: Value: /File4208

Key: 3: Value: /File471

Sample similarity.txt:

Key: 0: Value: {0:1.0000000000000002,3865:0.15434639725421775,318:0.16924612516623572,558:0.24384373237418783,471:0.16826200999114294,4340:0.16651713654958472,7:0.164357811310303,1841:0.15904628827648598,4208:0.16296041411613846,14:0.14043468960009342}

Key: 1: Value: {1615:0.3312716794782159,2181:0.2840451034186393,2126:0.32415666313248037,3188:0.1628119850871482,1496:0.24775558568026784,1:1.0,1575:0.13525396149776772,1269:0.13286526354605824,28:0.45703740702783774,1866:0.3262754564949865}

Key: 2: Value: {2:1.0,4350:0.13571272853930183,348:0.12600225826696973,3210:0.13949921190207168,560:0.15234464042319912,3294:0.2889578044491356,802:0.17942407070282945,1633:0.1964965769704117,3355:0.1298340236494648,495:0.12627029072308343}

Key: 3: Value: {1990:0.17193706865160252,3700:0.1302978723523794,2082:0.16196813164388732,2227:0.12561545966019144,665:0.15584753719122243,1163:0.19345501767136697,6:0.22582692114456704,3:1.0,1555:0.1742692199362734,4:0.1818170186646791}

Key: 4: Value: {1990:0.13213250264185705,3:0.1818170186646791,2082:0.1349661297563062,1163:0.15679941310702006,1555:0.18261523201994426,6:0.1981917938129962,738:0.29646182670818444,1684:0.21050749439902763,4:0.9999999999999999,3150:0.1397676584929176}

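The output above is the similarity matrix with scores. Each Key is the id of a document, and each entry in its Value is the id of another document followed by the similarity score between the two,

e.g. [ 558:0.24384373237418783,471:0.16826200999114294 ]

The ids are the keys from docIndex.txt. In the row for key 0, the entry 0:1.0000000000000002 refers to docIndex key 0, that is, the document compared with itself.

Similarly, the entry 558:0.24384373237418783 refers to the document with docIndex key 558.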
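The two dumps can be joined with a short script so that scores show file names instead of numeric ids. This is a sketch that assumes every record line has the "Key: N: Value: ..." shape shown in the samples and skips any other line; the function names are made up, and the sample lines are abbreviated from the output above.

```python
import re

# Matches record lines such as "Key: 3: Value: /File471"; other lines are skipped.
RECORD = re.compile(r"^Key: (\d+): Value: (.*)$")

def parse_doc_index(lines):
    """Return {doc id: file name} from docIndex.txt lines."""
    index = {}
    for line in lines:
        m = RECORD.match(line.strip())
        if m:
            index[int(m.group(1))] = m.group(2)
    return index

def parse_similarity(lines):
    """Return {doc id: {other doc id: score}} from similarity.txt lines."""
    rows = {}
    for line in lines:
        m = RECORD.match(line.strip())
        if not m:
            continue
        row = {}
        for pair in m.group(2).strip("{}").split(","):
            if pair:
                doc_id, score = pair.split(":")
                row[int(doc_id)] = float(score)
        rows[int(m.group(1))] = row
    return rows

def most_similar(rows, index, key):
    """Other documents in the row for `key`, best score first, with file names."""
    ranked = sorted(rows[key].items(), key=lambda kv: kv[1], reverse=True)
    return [(index.get(d, "<unknown id %d>" % d), s) for d, s in ranked if d != key]

# Abbreviated lines taken from the sample output above.
doc_index_lines = ["Key: 0: Value: /File558", "Key: 3: Value: /File471"]
similarity_lines = [
    "Key: 4: Value: {3:0.1818170186646791,738:0.29646182670818444,4:0.9999999999999999}",
]

index = parse_doc_index(doc_index_lines)
rows = parse_similarity(similarity_lines)
print(most_similar(rows, index, 4))
```

For real use, pass the lines of docIndex.txt and similarity.txt read from disk instead of the inline samples.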

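Sample commands to dump the generated vectors for inspection (each dump goes to its own file so the second does not overwrite the first):

sh mahout seqdumper -i vectors/tfidf-vectors > tfidf-vectors.txt
sh mahout seqdumper -i vectors/tf-vectors > tf-vectors.txt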

Wednesday, April 16, 2014
