MongoDB word count using map reduce

Question

I have a problem with counting words. I want to count the words in projects.log.subject, e.g. count [A], [B], [C]. I searched for how to use map reduce, but I don't understand how to use it to get the result I want.

{
"_id": ObjectID("569f3a3e9d2540764d8bde59"),
"A": "book",
"server": "us",
"projects": [
    {
        "domainArray": [
            {
                ~~~~
            }
        ],
        "log": [
            {
                ~~~~~,
                "subject": "[A][B]I WANT THIS"
            }
        ],
        "before": "234234234"
    },
    {
        "domainArray": [
            {
                ~~~~
            }
        ],
        "log": [
            {
                ~~~~~,
                "subject": "[B][C]I WANT THIS"
            }
        ],
        "before": "234234234"
    },....
] //end of projects
}//end of document

So [A], [B] and [C] represent words you want to look for, and ultimately you want the count of how many times each word appears across all documents. Correct? Have you done some basic research on mapReduce, and do you understand how mapper and reducer functions work? Is this always within the same "log" field within the "projects" array?
– Blakes Seven, Commented Jan 29, 2016 at 2:17
@BlakesSeven 1. Yes, I want the count of how many times each word appears in specific documents (selected with something like {'$match':{'date':TODAY}}). 2. I understand how the map and reduce functions work. 3. Yes, all documents always have the same structure: [projects][log][subject].
– Acool5, Commented Jan 29, 2016 at 2:23

Blakes Seven · Accepted Answer · 2016-01-29 03:02:53Z

The basic approach is to test each search term against the source string with a regular expression and emit the number of matches found. In mapReduce terms, your "mapper" function can emit multiple values per document: one for each "term" (used as the key) that is found in each array element.

So you need a source list of terms to turn into regular expressions (likely just a word list), and you iterate over that list while also iterating over each array member.

Basically something like this:

db.collection.mapReduce(
    function() {
        var list = ["the", "quick", "brown" ];  // words you want to count

        this.projects.forEach(function(project) {
            project.log.forEach(function(log) {
                list.forEach(function(word) {
                    var res = log.subject.match(new RegExp("\\b" + word + "\\b","ig"));
                    if ( res != null )
                        emit(word,res.length);  // returns number of matches for word
                });
            });
        });
    },
    function(key,values) {
        return Array.sum(values);
    },
    { "out": { "inline": 1 } }
)

So the loop processes the array elements in the document and then tests each word in the list against the subject with a regular expression. The .match() method returns an array of matches in the string, or null if none was found. Note the i and g options for the regex, which make the search case insensitive and continue beyond the first match. The m (multi-line) option only changes how ^ and $ anchor, so it should not be needed here even if your text includes line break characters.
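To see what .match() returns in both cases, here is a small standalone sketch in plain JavaScript. The sample string and the variable names are made up for illustration and are not part of the original data.

```javascript
// Sample subject string, made up for illustration.
var subject = "The quick brown fox saw the other fox";

// With "g", match() returns every match; with "i", case is ignored.
// "other" contains "the" but is not matched because of the \b boundaries.
var hits = subject.match(new RegExp("\\bthe\\b", "ig"));
// hits is ["The", "the"]

// When nothing matches, match() returns null rather than an empty array,
// which is why the mapper checks for null before reading .length.
var misses = subject.match(new RegExp("\\bslow\\b", "ig"));

var hitCount = hits === null ? 0 : hits.length;       // 2
var missCount = misses === null ? 0 : misses.length;  // 0
```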

If null is not returned, then we emit the current word as the "key" and the count as the length of the matched array.

The reducer then takes all output values from those emit calls in the mapper and simply adds up the emitted counts.

The result will be one document per "word/term" provided, keyed by that term, with the total count of occurrences in the inspected field across the collection. For more fields, just add more logic to sum up the results, or similarly just keep "emitting" in the mapper and let the reducer do the work.
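The flow of emit calls and the reducer can be traced without a database. The following is only a rough simulation in plain JavaScript, assuming two made-up documents shaped like the one in the question; the emit stand-in, the docs array, and the counts object are hypothetical and exist only for this sketch. Array.sum is a mongo shell helper, so a plain reduce() call is used in its place.

```javascript
// Two made-up documents with the same projects/log/subject shape.
var docs = [
    { projects: [
        { log: [ { subject: "the quick brown fox" } ] },
        { log: [ { subject: "The fox and the dog" } ] }
    ] },
    { projects: [
        { log: [ { subject: "brown dog, then brown fox" } ] }
    ] }
];

var list = ["the", "quick", "brown"];  // words to count
var emitted = {};                      // key -> array of emitted values

// Stand-in for MongoDB's emit(): collect the values per key.
function emit(key, value) {
    (emitted[key] = emitted[key] || []).push(value);
}

// Same logic as the mapper above, with the document passed in
// as an argument instead of being bound to "this".
function map(doc) {
    doc.projects.forEach(function(project) {
        project.log.forEach(function(log) {
            list.forEach(function(word) {
                var res = log.subject.match(new RegExp("\\b" + word + "\\b", "ig"));
                if (res !== null)
                    emit(word, res.length);
            });
        });
    });
}

// Same idea as the reducer above: add up the emitted counts.
function reduce(key, values) {
    return values.reduce(function(a, b) { return a + b; }, 0);
}

docs.forEach(map);

var counts = {};
Object.keys(emitted).forEach(function(key) {
    counts[key] = reduce(key, emitted[key]);
});
// counts is { the: 3, quick: 1, brown: 3 }
```

Note that "then" in the last subject does not add to the count for "the", and that "brown" is emitted once with the value 2 for that same subject rather than twice with the value 1.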

Note the "\\b" represents the word boundary expression \b wrapped around each term, with the backslash itself escaped by \ because the expression is constructed from strings. You need these to discriminate "the" from "then" for example, by specifying where the word/term ends.
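A short sketch of both points, the effect of the word boundary and the reason for the doubled backslash, in plain JavaScript. The sample text is made up for illustration.

```javascript
var text = "then the theme";  // made-up sample text

// "\\b" in a string literal becomes the two characters \b,
// which the regex engine reads as a word boundary.
var withBoundary = text.match(new RegExp("\\bthe\\b", "ig"));
// matches only the standalone "the"

// Without boundaries, "the" is also found inside "then" and "theme".
var withoutBoundary = text.match(new RegExp("the", "ig"));

// A single backslash is consumed by the string literal itself:
// "\b" is one backspace character (code 8), not a word boundary.
var backspace = "\b";
var wrong = text.match(new RegExp("\bthe\b", "ig"));
// wrong is null, since the text contains no backspace characters
```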

Also note that in regular expressions, characters like [] are reserved, so if you actually were looking for strings like that, then you similarly escape them, i.e:

"\\[A\\]"

But if you were actually doing that, then remove the word boundary characters:

new RegExp( "\\[A\\]", "ig" )

As that is enough of a complete match in itself.
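For the bracketed tags in the question, a minimal sketch in plain JavaScript could look like the following. The escapeRegExp helper is hypothetical (it is not a built-in, just a common way to escape regex metacharacters), and the two subject strings are the ones shown in the question's document.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal string,
// so "[A]" becomes the pattern \[A\].
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

var subjects = ["[A][B]I WANT THIS", "[B][C]I WANT THIS"];  // from the question
var tags = ["[A]", "[B]", "[C]"];                           // terms to count

var counts = {};
subjects.forEach(function(subject) {
    tags.forEach(function(tag) {
        var res = subject.match(new RegExp(escapeRegExp(tag), "ig"));
        if (res !== null)
            counts[tag] = (counts[tag] || 0) + res.length;
    });
});
// counts is { "[A]": 1, "[B]": 2, "[C]": 1 }

// Keeping the word boundaries would break the match here, because there
// is no word character next to "[" at the start of the string.
var withBoundary = subjects[0].match(new RegExp("\\b\\[A\\]\\b", "ig"));
// withBoundary is null
```

Inside the mapper, the same escaping would be applied to each entry of the list before building the RegExp.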

Thanks, it works! Thank you very much. But you mentioned that [] are reserved, right? So I tried new RegExp("["+word+"]","ig"), but it doesn't work.
@Acool5 I did already mention "escaping" the reserved characters and also gave a literal example. But in direct translation this is RegExp( "\\[" + word + "\\]", "ig") if you are indeed looking for a "word" in the variable that is always wrapped in brackets [].
