Part 2 of n: Creating Map-Reduce

From our last post we populated our database.

Now let’s look into applying map-reduce to our database.

From our last post, we have a huge collection with more than 14 million records. And our goal is to show heat map visualizations for top dropoff and pickup locations in NYC for each borough.

How to achieve this?
An SQL query for the above would roughly look like this:

SELECT
	N.IsPckUp,
	N.LgLt[0],
	N.LgLt[1],
	COUNT(*)
FROM
	NYCTaxis N
GROUP BY
	N.IsPckUp,
	N.LgLt[0],
	N.LgLt[1]
ORDER BY
        4;

Note: I have not added condition for boroughs, will explain later.

This requires aggregation and MongoDB provides us two ways to do that-
– Map-reduce
– Aggregation framework
The following link explains them nicely: http://docs.mongodb.org/master/MongoDB-aggregation-guide.pdf

I chose to go with Map Reduce for the following reasons:
– Performs incremental map reduce
– Flexibility in writing logic.

Disadvantages:
– It is slow

Writing incremental map reduce is very helpful as I can write a cron job to keep adding data to my collection. As I have loaded only one month of data, I can use map reduce to add for the rest of the months.

Even after map-reduce’s JavaScript engine was switched from SpiderMonkey to V8, it still lags behind the Aggregation framework which runs on C++.
Maybe using Hadoop for map-reduce can help fasten things up. Need to do research on this.

Let’s look at the map-reduce code for our aggregation:
Code: https://github.com/tarun11ks/NYCTaxi/blob/master/misc/MapReduceAggLocations.js

// Map function to create key-value pair.
// Key represents the group by fields, Value represents the field to apply aggregation function
var mapFunction = function() {
	// Emit both pickup and dropoff locations.
	this.Loc.forEach(function(loc) {
		var lg = parseFloat(loc.LgLt[0].toString().substr(0, 7));
		var lt = parseFloat(loc.LgLt[1].toString().substr(0, 6));

		var key = {
			pu: loc.IsPckUp,
			lglt: [lg, lt]
		};

		emit(key, {
			cnt: 1
		});
	});
};

// Reduce function reduces to a single object all the values associated with a particular key
// Note that the return type should match the Map function's value type
var reduceFunction = function(key, values) {
	var totalCount = 0;
	values.forEach(function(obj) {
		totalCount += obj.cnt;
	});

	return {
		cnt: totalCount
	};
};

// Apply map reduce to NYCTaxis collection and output it to "aggLocations" collection
db.NYCTaxis.mapReduce(
	mapFunction,
	reduceFunction, {
		out: {
			reduce: "aggLocations"
		},
		sort: {
			"_id": 1
		},
		verbose: true
	}
);

The sort in the map reduce command helped speed up a lot. It sorts the incoming documents, would help if the sort field is indexed. When I first ran the above map reduce without sort, it took forever to complete. After reading the below links, I applied the sort and it ran in 45 minutes.
More here: http://edgystuff.tumblr.com/post/54709368492/how-to-speed-up-mongodb-map-reduce-by-20x
and http://docs.mongodb.org/manual/reference/command/mapReduce/#dbcmd.mapReduce

We have pretty much converted(except sort) the SQL query to its equivalent map-reduce.
Here is a diagram which nicely explains the conversion from sql to map-reduce. It helped me a lot!
Link: http://rickosborne.org/download/SQL-to-MongoDB.pdf

While running this map reduce, keep checking the status in the command window. It tells the progress made so far.

Below is how the aggLocations collection looks like:

> db.aggLocations.stats()
{
        "ns" : "NYCTaxiDB.aggLocations",
        "count" : 104045,
        "size" : 11653040,
        "avgObjSize" : 112,
        "storageSize" : 22507520,
        "numExtents" : 7,
        "nindexes" : 1,
        "lastExtentSize" : 11325440,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 7889840,
        "indexSizes" : {
                "_id_" : 7889840
        },
        "ok" : 1
}
> db.aggLocations.find().limit(1).sort({"value.cnt":-1})
{
    "_id": {
        "pu": true,
        "lglt": [-73.991, 40.75]
    },
    "value": {
        "cnt": 63086
    }
}

For our next post, we will look into using this collection to get top pickup and dropoff locations for each borough.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s