
I have a MovieRatings database with columns userId, movieId, movie-categoryId, reviewId, movieRating and reviewDate.

In my mapper I want to extract userId -> (movieId, movieRating)

And then in the reducer I want to group all (movieId, movieRating) pairs by user.

Here is my attempt:

Map function:

var map = function() {
    var values={movieId : this.movieId, movieRating : this.movieRating};
    emit(this.userId, values);}

Reduce function:

var reduce = function(key,values) {
    var ratings = [];
    values.forEach(function(V){
        var temp = {movieId : V.movieId, movieRating : V.movieRating};
        Array.prototype.push.apply(ratings, temp);
        });
    return {userId : key, ratings : ratings };
}

Run MapReduce:

db.ratings.mapReduce(map, reduce, { out: "map_reduce_step1" })

Output: db.map_reduce_step1.find()

{ "_id" : 1, "value" : { "userId" : 1, "ratings" : [ ] } } 
{ "_id" : 2, "value" : { "userId" : 2, "ratings" : [ ] } } 
{ "_id" : 3, "value" : { "userId" : 3, "ratings" : [ ] } } 
{ "_id" : 4, "value" : { "userId" : 4, "ratings" : [ ] } } 
{ "_id" : 5, "value" : { "userId" : 5, "ratings" : [ ] } } 
{ "_id" : 6, "value" : { "userId" : 6, "ratings" : [ ] } } 
{ "_id" : 7, "value" : { "userId" : 7, "ratings" : [ ] } } 
{ "_id" : 8, "value" : { "userId" : 8, "ratings" : [ ] } } 
{ "_id" : 9, "value" : { "userId" : 9, "ratings" : [ ] } } 
{ "_id" : 10, "value" : { "userId" : 10, "ratings" : [ ] } } 
{ "_id" : 11, "value" : { "userId" : 11, "ratings" : [ ] } } 
{ "_id" : 12, "value" : { "userId" : 12, "ratings" : [ ] } } 
{ "_id" : 13, "value" : { "userId" : 13, "ratings" : [ ] } } 
{ "_id" : 14, "value" : { "userId" : 14, "ratings" : [ ] } } 
{ "_id" : 15, "value" : { "movieId" : 1, "movieRating" : 3 } } 
{ "_id" : 16, "value" : { "userId" : 16, "ratings" : [ ] } }

I am not getting the expected output. In fact, this output makes no sense to me!

Here is the Python equivalent of what I am trying to do in the reducer (in case the purpose of the reducer wasn't clear above):

def reducer_ratings_by_user(self, user_id, itemRatings):
        #Group (item, rating) pairs by userID
        ratings = []
        for movieID, rating in itemRatings:
            ratings.append((movieID, rating))
        yield user_id, ratings

Edit 1 @chridam

Here is an outline of what I really want to do here :

Movies.csv file looks like :

userId,movieId,movie-categoryId,reviewId,movieRating,reviewDate
1,1,1,1,5,7/12/2000
2,1,1,2,5,7/12/2000
3,1,1,3,5,7/12/2000
4,1,1,4,4,7/12/2000
5,1,1,5,4,7/12/2000
6,1,1,6,5,7/15/2000
1,2,1,7,4,7/25/2000
8,1,1,8,4,7/28/2000
9,1,1,9,3,8/3/2000
...
...

I import this into mongoDB :

mongoimport --db SomeName --collection ratings --type csv --headerline --file Movies.csv 

Then I am trying to apply the map-reduce function as defined above. After that I will export it back to a CSV by doing something like:

mongoexport --db SomeName --collection map_reduce_step1 --csv --out movie_ratings_out.csv --fields ...

This movie_ratings_out.csv file should be like :

userId, movieId1, rating1, movieId2, rating2 ,...
1,1,5,2,4
...
...

So each row contains all the (movie, rating) pairs for one user.

Edit 2

Sample :

db.ratings.find().pretty()
{
    "_id" : ObjectId("57f4a0dd9cb74fc4d344a40f"),
    "userId" : 4,
    "movieId" : 1,
    "movie-categoryId" : 1,
    "reviewId" : 4,
    "movieRating" : 4,
    "reviewDate" : "7/12/2000"
}
{
    "_id" : ObjectId("57f4a0dd9cb74fc4d344a410"),
    "userId" : 5,
    "movieId" : 1,
    "movie-categoryId" : 1,
    "reviewId" : 5,
    "movieRating" : 4,
    "reviewDate" : "7/12/2000"
}
{
    "_id" : ObjectId("57f4a0dd9cb74fc4d344a411"),
    "userId" : 4,
    "movieId" : 2,
    "movie-categoryId" : 1,
    "reviewId" : 6,
    "movieRating" : 5,
    "reviewDate" : "7/15/2000"
}
{
    "_id" : ObjectId("57f4a0dd9cb74fc4d344a412"),
    "userId" : 4,
    "movieId" : 3,
    "movie-categoryId" : 1,
    "reviewId" : 2,
    "movieRating" : 5,
    "reviewDate" : "7/12/2000"
}
...

Then after MapReduce expected output json is :

{
    "_id" : ....,
    "userId" : 4,
    "movieList" : [ {
           "movieId" : 2,
           "movieRating" : 5
         },
         {
           "movieId" : 1,
           "movieRating" : 4
         }
         ...
        ]
   }
   {
    "_id" : ....,
    "userId" : 5,
    "movieList" : ...
   }
   ...
  • Can you update your question to include some sample documents and your expected output? I'm pretty sure the aggregation framework can handle this much better and more efficiently. Commented Oct 5, 2016 at 7:44
  • @chridam check edit! Commented Oct 5, 2016 at 14:40
  • I meant documents from the collection i.e. when you do a query db.ratings.find() pick perhaps 5 documents to make the sample and show us your expected JSON output of the aggregation operation from the sample. Otherwise it's a futile effort to try reproduce the problem with the info above. Can you update your question with the sample documents and expected JSON output? Commented Oct 5, 2016 at 15:08
  • @chridam check edit 2! Commented Oct 5, 2016 at 15:29
  • Hey @chridam! Thanks for the answer. Can you also help me in pointing out how to do this using map reduce functions? Just for practice. Commented Oct 10, 2016 at 16:50

1 Answer

You just need to run an aggregation pipeline with a $group stage, which groups input documents by a specified identifier expression and applies the accumulator expression(s) to each group. The $group stage is similar to SQL's GROUP BY clause: just as GROUP BY is normally paired with aggregate functions, $group is normally paired with accumulator operators. You can read more about the aggregation functions here.

The accumulator operator you would need to create the movieList array is $push.
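
As a rough mental model only (plain JavaScript, not how the server implements it), grouping by userId with $push behaves like the loop below. The function name groupAndPush and the sample rows are illustrative assumptions, not part of MongoDB.

```javascript
// Hypothetical helper: mimics {"$group": {"_id": "$userId", "movieList": {"$push": {...}}}}
function groupAndPush(docs) {
    var byUser = new Map(); // userId -> array of {movieId, movieRating}
    docs.forEach(function (doc) {
        if (!byUser.has(doc.userId)) {
            byUser.set(doc.userId, []);
        }
        // $push appends one element per input document in the group.
        byUser.get(doc.userId).push({ movieId: doc.movieId, movieRating: doc.movieRating });
    });
    // One output document per distinct userId.
    return Array.from(byUser, function (entry) {
        return { _id: entry[0], movieList: entry[1] };
    });
}

// Illustrative rows shaped like the documents in the question.
var rows = [
    { userId: 4, movieId: 1, movieRating: 4 },
    { userId: 5, movieId: 1, movieRating: 4 },
    { userId: 4, movieId: 2, movieRating: 5 }
];
var grouped = groupAndPush(rows);
console.log(JSON.stringify(grouped));
```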

The $group stage is followed by a $project stage, which selects or reshapes each document in the stream: it can include, exclude or rename fields, inject computed fields, and create sub-document fields using mathematical, date, string and/or logical (comparison, boolean, control) expressions, similar to what you would do with the SQL SELECT clause.

The last step is the $out stage, which writes the resulting documents of the aggregation pipeline to a collection. It must be the last stage in the pipeline.

So as a result, you can run the following aggregate operation:

db.ratings.aggregate([
    {
        "$group": {
            "_id": "$userId",
            "movieList": {
                "$push": {
                    "movieId": "$movieId",
                    "movieRating": "$movieRating",
                }
            }
        }
    },
    {
        "$project": {
            "_id": 0, "userId": "$_id", "movieList": 1
        }
    },
    { "$out": "movie_ratings_out" }
])

Using the sample documents above, querying db.getCollection('movie_ratings_out').find({}) would yield:

/* 1 */
{
    "_id" : ObjectId("57f52636b9c3ea346ab1d399"),
    "movieList" : [ 
        {
            "movieId" : 1.0,
            "movieRating" : 4.0
        }
    ],
    "userId" : 5.0
}

/* 2 */
{
    "_id" : ObjectId("57f52636b9c3ea346ab1d39a"),
    "movieList" : [ 
        {
            "movieId" : 1.0,
            "movieRating" : 4.0
        }, 
        {
            "movieId" : 2.0,
            "movieRating" : 5.0
        }, 
        {
            "movieId" : 3.0,
            "movieRating" : 5.0
        }
    ],
    "userId" : 4.0
}
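
For the map-reduce version asked about in the comments, the empty ratings arrays most likely come from Array.prototype.push.apply(ratings, temp): apply expects an array-like second argument, and a plain object with no length contributes no elements, so nothing is pushed. Two further points matter. MongoDB does not call reduce for a key that has only one emitted value, which would explain the differently shaped document for _id 15. And reduce may be invoked again on its own output, so the value emitted by map and the value returned by reduce should have the same shape. A minimal sketch under those assumptions follows; the runMapReduce harness, the emit placeholder and the sample rows are hypothetical stand-ins so the functions can be tried outside the shell, and are not part of MongoDB.

```javascript
// Placeholder for the local harness only; in MongoDB, emit is provided by the server.
var emit;

// Map: emit a one-element ratings array so map output and reduce output share a shape.
var map = function () {
    emit(this.userId, { ratings: [{ movieId: this.movieId, movieRating: this.movieRating }] });
};

// Reduce: concatenate the ratings arrays; still correct if fed its own output.
var reduce = function (key, values) {
    var ratings = [];
    values.forEach(function (v) {
        ratings = ratings.concat(v.ratings);
    });
    return { ratings: ratings };
};

// Hypothetical local harness (NOT a MongoDB API): runs map over docs, then reduce per key.
function runMapReduce(docs, mapFn, reduceFn) {
    var emitted = {}; // key -> list of emitted values
    emit = function (key, value) {
        (emitted[key] = emitted[key] || []).push(value);
    };
    docs.forEach(function (doc) {
        mapFn.call(doc); // `this` is the current document, as in MongoDB
    });
    return Object.keys(emitted).map(function (key) {
        var values = emitted[key];
        // Like MongoDB, skip reduce when a key has a single value.
        var value = values.length === 1 ? values[0] : reduceFn(key, values);
        return { _id: Number(key), value: value };
    });
}

// Illustrative rows shaped like the documents in the question.
var sample = [
    { userId: 4, movieId: 1, movieRating: 4 },
    { userId: 5, movieId: 1, movieRating: 4 },
    { userId: 4, movieId: 2, movieRating: 5 },
    { userId: 4, movieId: 3, movieRating: 5 }
];
var result = runMapReduce(sample, map, reduce);
console.log(JSON.stringify(result, null, 2));
```

In the shell, the same map and reduce should work with the call from the question, db.ratings.mapReduce(map, reduce, { out: "map_reduce_step1" }). The grouped array ends up nested under value.ratings, with the userId in _id, so flattening it into CSV columns would still need a separate step.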