Wednesday, April 16, 2014

MongoDB 2.6 adds a $redact aggregation stage that restricts the contents of documents based on information stored in the documents themselves.

$redact Evaluates Access at Every Document/Sub-Document Level

The test collection contains documents of the following form, where the tags field lists the access values for that document/subdocument level; e.g. a value of [ "PUBLIC", "INTERNAL" ] specifies that users with either "PUBLIC" or "INTERNAL" access can view the data:

Sample JSON document below:
{
    "_id" : 1,
    "title" : "Document Formatting",
    "tags" : [
        "PUBLIC",
        "SERVICE_PROVIDER",
        "INTERNAL"
    ],
    "year" : 2014,
    "subsections" : [
        {
            "subtitle" : "Section 1: Overview",
            "tags" : [
                "PUBLIC",
                "SERVICE_PROVIDER"
            ],
            "content" : "Section 1: This is PUBLIC and SERVICE PROVIDER content of section 1."
        },
        {
            "subtitle" : "Section 2: SuperScript",
            "tags" : [
                "SERVICE_PROVIDER"
            ],
            "content" : "Section 2: This is SERVICE PROVIDER content of section 2."
        },
        {
            "subtitle" : "Section 3: SubScript",
            "tags" : [
                "PUBLIC"
            ],
            "content" : {
                "text" : "Section 3: This is INTERNAL content of section 3.",
                "tags" : [
                    "INTERNAL"
                ]
            }
        }
    ]
}

A user has access to view information tagged with either "PUBLIC" or "INTERNAL". To run a query on all documents from year 2014 for this user, include a $redact stage as in the following:


var userAccess = ["INTERNAL","PUBLIC"];
db.test.aggregate(
   [
     { $match: { year: 2014 } },
     { $redact:
         {
            $cond:
               {
                 // keep this level if its tags overlap the user's access list
                 if: { $gt: [ { $size: { $setIntersection: [ "$tags", userAccess ] } }, 0 ] },
                 then: "$$DESCEND",  // keep this level and evaluate its embedded documents too
                 else: "$$PRUNE"     // drop this level and everything beneath it
               }
         }
     }
   ]
)

The aggregation operation returns the following "redacted" document, which excludes Section 2 because that subsection is tagged SERVICE_PROVIDER only. The nested content of Section 3 survives because the user's access list includes INTERNAL.

{
    "result" : [
        {
            "_id" : 1,
            "title" : "Document Formatting",
            "tags" : [
                "PUBLIC",
                "SERVICE_PROVIDER",
                "INTERNAL"
            ],
            "year" : 2014,
            "subsections" : [
                {
                    "subtitle" : "Section 1: Overview",
                    "tags" : [
                        "PUBLIC",
                        "SERVICE_PROVIDER"
                    ],
                    "content" : "Section 1: This is PUBLIC and SERVICE PROVIDER content of section 1."
                },
                {
                    "subtitle" : "Section 3: SubScript",
                    "tags" : [
                        "PUBLIC"
                    ],
                    "content" : {
                        "text" : "Section 3: This is INTERNAL content of section 3.",
                        "tags" : [
                            "INTERNAL"
                        ]
                    }
                }
            ]
        }
    ],
    "ok" : 1
}
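
As a sketch of the same pipeline under a different access list (a hypothetical user, same test collection as above), a PUBLIC-only user would additionally lose the nested content of Section 3, because that embedded document is tagged INTERNAL:

// Sketch: the same $redact pipeline for a hypothetical PUBLIC-only user.
var publicAccess = ["PUBLIC"];
db.test.aggregate(
   [
     { $match: { year: 2014 } },
     { $redact:
         {
            $cond:
               {
                 if: { $gt: [ { $size: { $setIntersection: [ "$tags", publicAccess ] } }, 0 ] },
                 then: "$$DESCEND",
                 else: "$$PRUNE"
               }
         }
     }
   ]
)
// Section 2 is pruned as before, and Section 3's embedded "content"
// document (tagged INTERNAL only) is now pruned as well.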

Using Apache Mahout for Document Similarity

Below are the steps to run Mahout over a collection of text files:


  • sh mahout seqdirectory -c UTF-8 -i /Users/xxxx/myfiles/ -o seqfiles
  • sh mahout seq2sparse -i seqfiles/ -o vectors/  -ow -chunk 100  -x 90  -seq  -ml 50  -n 2  -s 5 -md 5  -ng 3  -nv
  • sh mahout rowid -i vectors/tfidf-vectors/part-r-00000 -o matrix
  • sh mahout rowsimilarity -i matrix/matrix -o similarity  --similarityClassname SIMILARITY_COSINE -m 10 -ess
  • sh mahout seqdumper -i similarity > similarity.txt
  • sh mahout seqdumper -i matrix/docIndex > docIndex.txt
Apache Mahout performs the following steps, each detailed below:

  • Tokenize and transform
  • Generate word vectors and weights
  • Find document similarity based on TF-IDF (Term Frequency - Inverse Document Frequency) using COSINE_SIMILARITY

Tokenization

The first step is to convert each text document into a sequence of tokens. Content is tokenized on single words, and all tokens are lowercased so that the content is standardized.
Create n-grams: a term n-gram is a series of consecutive tokens of length n; the n-grams of a document are all such series it contains.
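
As an illustration of the idea only (Mahout's seq2sparse actually uses a Lucene analyzer), lowercasing, splitting on single words, and emitting 3-grams to match the -ng 3 option above might look like this sketch:

// Sketch of tokenization and n-gram generation (not Mahout's actual analyzer).
function tokenize(text) {
    // lowercase, then split on anything that is not a word character
    return text.toLowerCase().split(/\W+/).filter(function (t) { return t.length > 0; });
}

function nGrams(tokens, n) {
    // all series of consecutive tokens of length n
    var grams = [];
    for (var i = 0; i + n <= tokens.length; i++) {
        grams.push(tokens.slice(i, i + n).join(" "));
    }
    return grams;
}

console.log(nGrams(tokenize("Content is tokenized based on single words"), 3));
// [ 'content is tokenized', 'is tokenized based', ... ]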

Generate Vectors and Weights

Each document is converted to a word vector of n-grams, with a weight assigned to each n-gram. The weight is based on TF-IDF (Term Frequency - Inverse Document Frequency), a measure of a term's importance in a document: it grows with the frequency of the term in the document but is offset by the frequency of the term across the whole document set. This balances the weighting, since some terms occur more commonly than others and are therefore less significant when judging similarity.
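
Below is a minimal sketch of that weighting. The tfidf function and its formula, weight = tf × log(N / df), are illustrative only; Mahout's seq2sparse applies its own smoothing and (with -n 2) normalization:

// Minimal TF-IDF sketch: weight = tf(term, doc) * log(N / df(term)).
function tfidf(docs) {                       // docs: array of token arrays
    var N = docs.length;
    var df = {};                             // document frequency per term
    docs.forEach(function (doc) {
        new Set(doc).forEach(function (t) { df[t] = (df[t] || 0) + 1; });
    });
    return docs.map(function (doc) {
        var tf = {};                         // term frequency within this doc
        doc.forEach(function (t) { tf[t] = (tf[t] || 0) + 1; });
        var weights = {};
        Object.keys(tf).forEach(function (t) {
            weights[t] = tf[t] * Math.log(N / df[t]);  // common terms weigh less
        });
        return weights;
    });
}

var vectors = tfidf([["mongo", "redact", "mongo"], ["mahout", "redact"]]);
console.log(vectors[0]);
// "mongo" gets a positive weight; "redact", appearing in every document,
// weighs 0 under this toy formula.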

Document Similarity

Each document's word vector is compared to every other document's vector. Document similarity is the cosine similarity between the word vectors: two vectors with the same orientation have a cosine similarity of 1, and two vectors at 90° have a similarity of 0. The result is a matrix of similarities between each document and every other document.
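
A minimal sketch of the cosine measure between two sparse word vectors, represented here as plain { term: weight } objects:

// Cosine similarity between two sparse vectors:
// cos(a, b) = dot(a, b) / (|a| * |b|); 1.0 = same orientation, 0.0 = orthogonal.
function cosine(a, b) {
    var dot = 0, normA = 0, normB = 0;
    Object.keys(a).forEach(function (t) {
        normA += a[t] * a[t];
        if (t in b) dot += a[t] * b[t];      // only shared terms contribute
    });
    Object.keys(b).forEach(function (t) { normB += b[t] * b[t]; });
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosine({ mahout: 1.4, vector: 0.7 }, { mahout: 1.4, vector: 0.7 })); // 1 (same orientation)
console.log(cosine({ mahout: 1.4 }, { mongo: 2.1 }));                            // 0 (no shared terms)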

Sample docIndex.txt Below

Key: 0: Value: /File558
Key: 1: Value: /File4340
Key: 2: Value: /File4208
Key: 3: Value: /File471

Sample similarity.txt Below

Key: 0: Value: {0:1.0000000000000002,3865:0.15434639725421775,318:0.16924612516623572,558:0.24384373237418783,471:0.16826200999114294,4340:0.16651713654958472,7:0.164357811310303,1841:0.15904628827648598,4208:0.16296041411613846,14:0.14043468960009342}

Key: 1: Value: {1615:0.3312716794782159,2181:0.2840451034186393,2126:0.32415666313248037,3188:0.1628119850871482,1496:0.24775558568026784,1:1.0,1575:0.13525396149776772,1269:0.13286526354605824,28:0.45703740702783774,1866:0.3262754564949865}

Key: 2: Value: {2:1.0,4350:0.13571272853930183,348:0.12600225826696973,3210:0.13949921190207168,560:0.15234464042319912,3294:0.2889578044491356,802:0.17942407070282945,1633:0.1964965769704117,3355:0.1298340236494648,495:0.12627029072308343}

Key: 3: Value: {1990:0.17193706865160252,3700:0.1302978723523794,2082:0.16196813164388732,2227:0.12561545966019144,665:0.15584753719122243,1163:0.19345501767136697,6:0.22582692114456704,3:1.0,1555:0.1742692199362734,4:0.1818170186646791}

Key: 4: Value: {1990:0.13213250264185705,3:0.1818170186646791,2082:0.1349661297563062,1163:0.15679941310702006,1555:0.18261523201994426,6:0.1981917938129962,738:0.29646182670818444,1684:0.21050749439902763,4:0.9999999999999999,3150:0.1397676584929176}


You can see the similarity matrix above, with scores such as
[ 558:0.24384373237418783, 471:0.16826200999114294 ].

Each key inside a similarity value map is a docIndex key (row id): in the Key: 0 row, the entry 0:1.0000000000000002 is document 0's similarity with itself, and the entry 558:0.24384373237418783 is its similarity with the document whose docIndex key is 558.
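
As a hedged sketch (assuming Node.js for file I/O and the exact "Key: ...: Value: ..." line format shown above), the two dumps can be joined to print file-to-file scores:

// Sketch: join docIndex.txt and similarity.txt to print file-to-file scores.
var fs = require("fs");

function parseDocIndex(path) {
    var index = {};                                    // { rowId: fileName }
    fs.readFileSync(path, "utf8").split("\n").forEach(function (line) {
        var m = line.match(/^Key: (\d+): Value: (\S+)/);
        if (m) index[m[1]] = m[2];
    });
    return index;
}

var docIndex = parseDocIndex("docIndex.txt");
fs.readFileSync("similarity.txt", "utf8").split("\n").forEach(function (line) {
    var m = line.match(/^Key: (\d+): Value: \{(.*)\}/);
    if (!m) return;
    m[2].split(",").forEach(function (pair) {
        var kv = pair.split(":");                      // [rowId, score]
        if (docIndex[kv[0]] !== undefined) {
            console.log(docIndex[m[1]] + " ~ " + docIndex[kv[0]] + " : " + kv[1]);
        }
    });
});
// e.g. /File558 ~ /File558 : 1.0000000000000002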


--- Sample

To inspect the weighted and raw term vectors, dump them with seqdumper (note the distinct output files, so the second dump does not overwrite the first):

sh mahout seqdumper -i vectors/tfidf-vectors > tfidf-vectors.txt
sh mahout seqdumper -i vectors/tf-vectors > tf-vectors.txt

Create ElasticSearch cluster on single machine

I wanted to figure out how to create a multi-node ElasticSearch cluster on a single machine, so I followed these instructions. First I did...