Part 5 of n: Using Backbone JS to create application

In this article we will create a single-page application using Backbone.js.

Backbone.js is a simple, unopinionated JavaScript library. I chose it for its simplicity and its ease of development.

Below is the heat-map visualization for Manhattan from the application 🙂

Manhattan-Heat map

Please have a look at how the application is structured here:

This structure is what makes the application modular. I learned it from this excellent tutorial: It nicely explains the use of require.js, which is what makes the application modular. I recommend looking at the tutorial’s boilerplate code here: The article is a little outdated, but the concept remains the same.
Googling “Backbone js boilerplate” will give you more examples.

Another great article here:

The modular concept revolves around “require.js”. Together with AMD (Asynchronous Module Definition), it provides better dependency management by loading scripts asynchronously.

This post won’t explain these concepts, as the linked websites provide a great deal of information.

Libraries used (not the obvious ones):
Hyperlapse.js -> Shows an animated Google Street View. It requires two other libraries as its dependencies: Three.JS and GSVPano.JS.
Backbone.viewcache -> Keeps a cache of views so they can be reused. Currently used for Hyperlapse.

I will briefly explain the components.
The script tag in the header loads main.js, which starts our application.

The body has a navigation menu (using Bootstrap) and a div container into which the views are loaded.


], function(Backbone) {
	var taxiAgg = Backbone.Model.extend({

	});

	return taxiAgg;
});

A simple model that represents a MongoDB document of the “aggLocations” collection.


], function(Backbone, taxiModel) {
	console.log('inside Collection');
	var taxis = Backbone.Collection.extend({
		model: taxiModel,
		url: function() {
			// Use fragment to construct url.
			var state = Backbone.history.fragment;
			if (state)
				return 'http://localhost:2387/' + state;
			// No fragment: default to Manhattan.
			return 'http://localhost:2387/manhattan';
		}
	});

	return taxis;
});

A simple Backbone collection with two properties: model and url. The model being referenced is TaxiLocAgg.js.
The url property is crucial. It uses Backbone’s history fragment to determine which borough’s records need to be fetched.
These fragments are defined in the routes of the router.js file. With this technique we construct the URL dynamically, without needing a separate collection for each borough.
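
The fragment-to-URL mapping can be tried outside Backbone with a small stand-alone sketch. The helper name buildUrl and the two constants are hypothetical; in the collection itself the fragment comes from Backbone.history.fragment.

```javascript
// Hypothetical helper mirroring the collection's url() logic.
// BASE_URL is the server address used throughout this series.
const BASE_URL = 'http://localhost:2387/';
const DEFAULT_BOROUGH = 'manhattan';

// fragment: the current route fragment, e.g. 'queens' or 'bronx'.
function buildUrl(fragment) {
  // An empty or undefined fragment falls back to the default borough.
  return BASE_URL + (fragment || DEFAULT_BOROUGH);
}

console.log(buildUrl('queens')); // http://localhost:2387/queens
console.log(buildUrl(''));       // http://localhost:2387/manhattan
```

Because the borough is just a path segment, adding a new borough only needs a new route on the client and a matching route on the server.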

– GoogleMapsApiLoader.js

var google_maps_loaded_def = null;

define(['jquery'], function($) {
	if (!google_maps_loaded_def) {
		// Create a deferred object. Sets status to pending.
		google_maps_loaded_def = $.Deferred();

		// Executes after API authentication is successful.
		window.google_maps_loaded = function() {

			// Resolve the deferred object. Executes done callback.
			google_maps_loaded_def.resolve(google.maps);
		};

		require([''], function() {}, function(err) {
			throw err;
		});
	}

	// Returns deferred's promise object.
	// Provides ability to only attach handlers or determine the state.
	return google_maps_loaded_def.promise();
});

I was searching online for how to use Google Maps with our modular application and finally found this gem, which uses jQuery’s Deferred object. The same mechanism is used in jQuery’s ajax calls. Using jQuery’s Deferred together with its promise object helps eliminate callback hell. You must get your own API key from here.
When the GoogleMapsApiLoader module is called, it first creates the deferred object (which starts in the pending state) and requests the API with the key. When the key is validated, the API calls the callback google_maps_loaded, which in turn calls resolve on the deferred object to signal that the google.maps object is instantiated and ready to be used. The promise object is a restricted view of the deferred: handlers can be attached to it and its state can be read, but the deferred cannot be resolved or rejected through it. Once the deferred is resolved, any done callbacks attached to the promise get called. If the API validation fails, the deferred’s reject is called, which invokes any fail callbacks attached to the promise. We will see this in more detail in the MainView.js section below.
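
The pending, resolved and rejected flow described above can be sketched without jQuery or a browser. This is only an illustration: it swaps in the native Promise for $.Deferred(), and createDeferred, loadMapsApi, scope and requestScript are hypothetical names (scope stands in for window, requestScript for whatever loads the Maps API script).

```javascript
// Minimal deferred built on the native Promise (a stand-in for $.Deferred()).
function createDeferred() {
  let state = 'pending';
  let resolvePromise;
  let rejectPromise;
  const promise = new Promise(function (resolve, reject) {
    resolvePromise = resolve;
    rejectPromise = reject;
  });
  return {
    resolve: function (value) {
      if (state !== 'pending') return;
      state = 'resolved';
      resolvePromise(value);
    },
    reject: function (err) {
      if (state !== 'pending') return;
      state = 'rejected';
      rejectPromise(err);
    },
    state: function () { return state; },
    promise: function () { return promise; }
  };
}

// Hypothetical loader: registers the global callback, then asks for the script.
function loadMapsApi(scope, requestScript) {
  const deferred = createDeferred();
  // The API is expected to invoke this callback once it has loaded.
  scope.google_maps_loaded = function () {
    deferred.resolve(scope.google.maps);
  };
  // The second argument is the error handler for a failed load.
  requestScript('google_maps_loaded', function (err) {
    deferred.reject(err);
  });
  return deferred;
}

// Usage with a fake scope and a fake script loader that succeeds immediately.
const fakeWindow = { google: { maps: { version: 'fake' } } };
const loader = loadMapsApi(fakeWindow, function (callbackName) {
  fakeWindow[callbackName]();
});
loader.promise().then(function (maps) {
  console.log('maps ready:', maps.version);
});
```

The caller only ever needs the promise, which is why the module returns promise() rather than the deferred itself.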

This Backbone view is responsible for showing the Google heat map visualization.
The view receives the GoogleMapsApiLoader module’s promise as the GoogleMapsLoader object and attaches “done” and “fail” callbacks to it. When the deferred in the module is resolved, the promise invokes the done callback, which calls “fetch” on the collection to retrieve the records using the url. Once that succeeds, we create an array of latitude and longitude points that google.maps understands, and finally the heat map visualization is constructed!
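
The conversion step can be sketched as a pure function. The names toHeatmapPoints and makeLatLng are hypothetical, and passing the count as a weight is my assumption rather than something the view is confirmed to do. In the browser, makeLatLng could wrap new google.maps.LatLng(lat, lng); here a plain object is used so the sketch runs anywhere.

```javascript
// docs: documents shaped like those in the aggLocations collection.
// makeLatLng: factory that builds a map coordinate from (lat, lng).
function toHeatmapPoints(docs, makeLatLng) {
  return docs.map(function (doc) {
    // The collection stores coordinates as [longitude, latitude],
    // while map coordinates are usually built as (latitude, longitude).
    const lng = doc._id.lglt[0];
    const lat = doc._id.lglt[1];
    return { location: makeLatLng(lat, lng), weight: doc.value.cnt };
  });
}

// Sample document taken from the aggLocations output shown in Part 2.
const sample = [{ _id: { pu: true, lglt: [-73.991, 40.75] }, value: { cnt: 63086 } }];
const points = toHeatmapPoints(sample, function (lat, lng) {
  return { lat: lat, lng: lng };
});
console.log(points[0]);
```

The easy mistake here is the coordinate order, since the stored array is longitude first.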


Part 4 of n: Using Node.JS to provide a REST interface

In our last post we constructed geospatial queries. The data they return is crucial for the visualizations in our web application. Instead of querying MongoDB directly from the application, we will create a REST interface using Node.js, which can then be consumed by multiple clients.

As the website states, Node.js is a platform built on Chrome’s JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.
Node.js runs JavaScript on a single thread, which lets it avoid the costly context switching seen in multi-threaded environments.

The Node.js application that I created is a simple one. Its purpose is to provide a REST API for our web application. Down the road, if I decide to write a mobile application, it could use the same API, so the server can serve multiple clients.

Let’s get started.
First, please download Node.js and install it on your machine. After that, open the Node.js command prompt and install nodemon. It is a great utility that watches for changes in your Node.js application and restarts it for you. It has good documentation on how to start your server.

I am using express.js to create the API. I also use the cors package to enable CORS (cross-origin resource sharing) so that our web application can connect to the server. Lastly, I use mongoskin, a wrapper around the native MongoDB Node.js driver that eases development. All of these come as packages that are easy to install; the links I have provided explain how. After installation, you will notice a folder named “node_modules” in the directory where your server code is. That is where our packages reside.

Let’s have a look at the code:


var express = require('express'),
	app = express(),
	cors = require('cors'),
	mongo = require('mongoskin'),
	url = 'mongodb://tarun:tarun@localhost:27017/NYCTaxiDB',
	db = mongo.db(url, {
		native_parser: true
	});

// Logs content with great readability
var inspect = require('eyes').inspector({
	maxLength: false
});

// Allows requests from any origin
app.use(cors());

// A middleware to hook db object to all the requests
app.use(function(req, res, next) {
	req.db = db;
	// Find the matching route
	next();
});

// Returns a random integer between min (included) and max (excluded)
function getRandomInt(min, max) {
	return Math.floor(Math.random() * (max - min)) + min;
}

// Returns a random taxi record
app.get('/randomtaxi', function(req, res) {
	var rnd = getRandomInt(1, 1000);
	var db = req.db;
	var coll = db.collection('NYCTaxis');
	coll.find().limit(1).skip(rnd).toArray(function(err, results) {
		if (err)
			throw err;
		res.json(results);
	});
});

// Get top 2000 locations(pickup and drop) for Staten Island
app.get('/staten', function(req, res) {
	var coordinates = [
		[-74.071116, 40.651530],
		[-74.071116, 40.624696],
		[-74.040989, 40.595049],
		[-74.244580, 40.484162],
		[-74.258999, 40.509486],
		[-74.236340, 40.557759],
		[-74.219174, 40.557498],
		[-74.200291, 40.598177],
		[-74.202008, 40.631275],
		[-74.185957, 40.645800],
		[-74.071116, 40.651530]
	];
	sendResult(req, res, coordinates, 2000);
});

// Get top 3000 locations(pickup and drop) for Queens
app.get('/queens', function(req, res) {
	var coordinates = [
		[-73.779696, 40.809849],
		[-73.702106, 40.752136],
		[-73.761844, 40.551572],
		[-73.952044, 40.528612],
		[-73.961658, 40.562006],
		[-73.833255, 40.607895],
		[-73.868960, 40.695414],
		[-73.897113, 40.684220],
		[-73.928870, 40.727814],
		[-73.961658, 40.740562],
		[-73.911618, 40.795101],
		[-73.779696, 40.809849]
	];
	sendResult(req, res, coordinates, 3000);
});

// Get top 2000 locations(pickup and drop) for Manhattan
app.get('/manhattan', function(req, res) {
	var coordinates = [
		[-74.034240, 40.686697],
		[-74.019992, 40.680709],
		[-73.995495, 40.704948],
		[-73.971463, 40.709893],
		[-73.961764, 40.743814],
		[-73.911724, 40.794679],
		[-73.927174, 40.802346],
		[-73.933354, 40.835214],
		[-73.907433, 40.873646],
		[-73.933699, 40.882083],
		[-74.013984, 40.756951],
		[-74.034240, 40.686697]
	];
	sendResult(req, res, coordinates, 2000);
});

// Get top 3000 locations(pickup and drop) for Brooklyn
app.get('/brooklyn', function(req, res) {
	var coordinates = [
		[-73.962421, 40.737982],
		[-73.929806, 40.727706],
		[-73.896675, 40.683071],
		[-73.869381, 40.694916],
		[-73.854360, 40.643420],
		[-73.881483, 40.574612],
		[-74.035635, 40.562876],
		[-74.055891, 40.652017],
		[-74.033231, 40.686130],
		[-74.020443, 40.680077],
		[-73.994865, 40.704546],
		[-73.972206, 40.708910],
		[-73.962421, 40.737982]
	];
	sendResult(req, res, coordinates, 3000);
});

// Get top 2000 locations(pickup and drop) for the Bronx
app.get('/bronx', function(req, res) {
	var coordinates = [
		[-73.912002, 40.915643],
		[-73.748580, 40.871751],
		[-73.790809, 40.803112],
		[-73.861877, 40.799214],
		[-73.873550, 40.784821],
		[-73.932773, 40.807628],
		[-73.933803, 40.834681],
		[-73.908054, 40.873243],
		[-73.925906, 40.879538],
		[-73.912002, 40.915643]
	];
	sendResult(req, res, coordinates, 2000);
});

// Filters by location from the aggLocations collection
// and sends back result to the clients.
function sendResult(req, res, coordinates, limit) {
	var db = req.db;
	var coll = db.collection('aggLocations');
	coll.aggregate([{
			$match: {
				"_id.lglt": {
					$geoWithin: {
						$polygon: coordinates
					}
				}
			}
		}, {
			$sort: {
				"value.cnt": -1
			}
		}, {
			$limit: limit
		}], {
			allowDiskUse: true
		},
		function(err, result) {
			if (err)
				throw err;
			res.json(result);
		});
}


// listen port 2387 on localhost. For ex: http://localhost:2387/manhattan
app.listen(2387);

– The require statements tell Node.js to load the named packages. Since no path is given, Node.js looks in the “node_modules” folder, which is where our packages are kept.
– I have also added the connection string for our MongoDB database. 27017 is the default port; you will see more information in the shell after you connect to it.
– A middleware attaches our db object to each request made by our web application, where it is then used to fetch records.
– Next is a series of GET routes, one for each borough of NYC, each with its own geographical coordinates represented as a polygon.
– A simple aggregation query then fetches the records for the requested borough, sorts them in descending order of count, and finally limits the number of documents returned.
– Finally, the server listens for connections on localhost on port 2387.
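
To see what each borough route actually hands to MongoDB, the pipeline can be assembled in isolation. This is a minimal sketch: buildPipeline is a hypothetical helper that only builds the stage array and does not talk to the database, and the triangle coordinates are made up for illustration.

```javascript
// Builds the three aggregation stages used for every borough:
// filter by polygon, sort by count (highest first), cap the result size.
function buildPipeline(coordinates, limit) {
  return [
    { $match: { '_id.lglt': { $geoWithin: { $polygon: coordinates } } } },
    { $sort: { 'value.cnt': -1 } },
    { $limit: limit }
  ];
}

// Made-up polygon; like the borough polygons, it repeats its first point at the end.
const triangle = [[-74.0, 40.7], [-73.9, 40.7], [-73.9, 40.8], [-74.0, 40.7]];
const pipeline = buildPipeline(triangle, 2000);
console.log(JSON.stringify(pipeline, null, 2));
```

Keeping the polygon and the limit as the only parameters is what lets all five routes share one sendResult function.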

Start the server using nodemon, open your favorite browser and go to http://localhost:2387/manhattan. You will see the records being fetched. Please let me know if you face any issues.

Part 2 of n: Creating Map-Reduce

In our last post we populated our database.

Now let’s look into applying map-reduce to it.

We have a huge collection with more than 14 million records, and our goal is to show heat map visualizations of the top dropoff and pickup locations in each borough of NYC.

How to achieve this?
An SQL query for the above would roughly look like this:

	NYCTaxis N

Note: I have not added condition for boroughs, will explain later.

This requires aggregation, and MongoDB provides two ways to do that:
– Map-reduce
– Aggregation framework
The following link explains them nicely:

I chose to go with map-reduce for the following reasons:
– It supports incremental map-reduce.
– It gives flexibility in writing the logic.

It also has a drawback:
– It is slow.

Incremental map-reduce is very helpful because I can write a cron job that keeps adding data to my collection. As I have loaded only one month of data, I can use map-reduce to fold in the remaining months later.
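
The idea behind the incremental run can be shown with plain objects. In this sketch, which is an in-memory illustration and not MongoDB itself, a key that already exists in the output is combined with the new value by running the reduce function over both. mergeIncremental and the string keys are hypothetical.

```javascript
// Same shape as the reduce function used in this post: sums the cnt fields.
function reduceCounts(key, values) {
  let total = 0;
  values.forEach(function (v) { total += v.cnt; });
  return { cnt: total };
}

// existing: Map of key -> value already in the output collection.
// fresh:    Map of key -> value produced from the newly loaded data.
function mergeIncremental(existing, fresh, reduceFn) {
  const merged = new Map(existing);
  fresh.forEach(function (value, key) {
    merged.set(key, merged.has(key) ? reduceFn(key, [merged.get(key), value]) : value);
  });
  return merged;
}

const january = new Map([['pickup@-73.991,40.75', { cnt: 10 }]]);
const february = new Map([
  ['pickup@-73.991,40.75', { cnt: 4 }],
  ['dropoff@-73.98,40.76', { cnt: 2 }]
]);
const total = mergeIncremental(january, february, reduceCounts);
console.log(total.get('pickup@-73.991,40.75')); // { cnt: 14 }
```

This only works because the reduce output has the same shape as its inputs, so an old result can be fed back in as just another value.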

Even after map-reduce’s JavaScript engine was switched from SpiderMonkey to V8, it still lags behind the aggregation framework, which is implemented in C++.
Maybe using Hadoop for map-reduce could speed things up. I need to do more research on this.

Let’s look at the map-reduce code for our aggregation:

// Map function to create key-value pair.
// Key represents the group by fields, Value represents the field to apply aggregation function
var mapFunction = function() {
	// Emit both pickup and dropoff locations.
	this.Loc.forEach(function(loc) {
		var lg = parseFloat(loc.LgLt[0].toString().substr(0, 7));
		var lt = parseFloat(loc.LgLt[1].toString().substr(0, 6));

		var key = {
			pu: loc.IsPckUp,
			lglt: [lg, lt]
		};

		emit(key, {
			cnt: 1
		});
	});
};

// Reduce function reduces to a single object all the values associated with a particular key
// Note that the return type should match the Map function's value type
var reduceFunction = function(key, values) {
	var totalCount = 0;
	values.forEach(function(obj) {
		totalCount += obj.cnt;
	});

	return {
		cnt: totalCount
	};
};

// Apply map reduce to NYCTaxis collection and output it to "aggLocations" collection
db.NYCTaxis.mapReduce(mapFunction,
	reduceFunction, {
		out: {
			reduce: "aggLocations"
		},
		sort: {
			"_id": 1
		},
		verbose: true
	});

The sort in the map-reduce command sped things up a lot. It sorts the incoming documents, and it helps if the sort field is indexed. When I first ran the above map-reduce without sort, it took forever to complete. After reading the links below, I applied the sort and it ran in 45 minutes.
More here:

We have pretty much converted the SQL query (except for the sort) to its equivalent map-reduce.
Here is a diagram which nicely explains the conversion from SQL to map-reduce. It helped me a lot!
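
The map, group and reduce steps can also be traced on a couple of made-up trips in plain JavaScript. This is only an in-memory sketch of the flow, not how MongoDB executes it; truncateCoord, mapTrip and countLocations are hypothetical names, and the coordinate truncation mirrors the one in the map function.

```javascript
// Keep the first `chars` characters of the coordinate's string form,
// so nearby points collapse onto the same key.
function truncateCoord(value, chars) {
  return parseFloat(value.toString().slice(0, chars));
}

// Stand-in for the map step: returns the emitted [key, value] pairs for one trip.
function mapTrip(trip) {
  return trip.Loc.map(function (loc) {
    const key = {
      pu: loc.IsPckUp,
      lglt: [truncateCoord(loc.LgLt[0], 7), truncateCoord(loc.LgLt[1], 6)]
    };
    return [JSON.stringify(key), { cnt: 1 }];
  });
}

// Group the emitted values by key, then reduce each group to a single count.
function countLocations(trips) {
  const groups = new Map();
  trips.forEach(function (trip) {
    mapTrip(trip).forEach(function (pair) {
      const values = groups.get(pair[0]) || [];
      values.push(pair[1]);
      groups.set(pair[0], values);
    });
  });
  const result = new Map();
  groups.forEach(function (values, key) {
    result.set(key, values.reduce(function (sum, v) { return sum + v.cnt; }, 0));
  });
  return result;
}

// Two made-up trips whose pickups fall into the same truncated cell.
const trips = [
  { Loc: [{ IsPckUp: true, LgLt: [-73.991234, 40.750111] },
          { IsPckUp: false, LgLt: [-73.978165, 40.757977] }] },
  { Loc: [{ IsPckUp: true, LgLt: [-73.991987, 40.750999] },
          { IsPckUp: false, LgLt: [-73.95, 40.78] }] }
];
const counts = countLocations(trips);
console.log(counts);
```

The key plays the role of the GROUP BY columns and the summed cnt plays the role of COUNT(*).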

While this map-reduce is running, keep checking its status in the command window. It reports the progress made so far.

This is what the aggLocations collection looks like:

> db.aggLocations.stats()
{
        "ns" : "NYCTaxiDB.aggLocations",
        "count" : 104045,
        "size" : 11653040,
        "avgObjSize" : 112,
        "storageSize" : 22507520,
        "numExtents" : 7,
        "nindexes" : 1,
        "lastExtentSize" : 11325440,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 7889840,
        "indexSizes" : {
                "_id_" : 7889840
        },
        "ok" : 1
}
> db.aggLocations.find().limit(1).sort({"value.cnt":-1})
{
    "_id": {
        "pu": true,
        "lglt": [-73.991, 40.75]
    },
    "value": {
        "cnt": 63086
    }
}

For our next post, we will look into using this collection to get top pickup and dropoff locations for each borough.

Part 1 of n: Preparing the dataset using MongoDB

This is the first part of the series related to Analyzing NYC 2013 taxi data.

MongoDB is a NoSQL, document-based database which stores data in BSON format. It stores data in collections, which are analogous to tables. I wanted to be a part of the NoSQL movement and thought of giving MongoDB a try. It also has great support for geospatial queries, which is ideal for our application. We will look into that in the next post.

Let’s get started.
First, go ahead and download the .csv taxi data from here.

While the files are downloading, let’s install MongoDB. Their site does a great job of explaining how to get started.
Install MongoDB on your machine using the installation guidelines on their website.
After that, we can start creating our database. Please go through the getting started link, which gives an understanding of the mongo shell.
The next step is to set up credentials for connecting to the database. You may skip this if you want.

use NYCTaxiDB
db.createUser({
    user: "username",
    pwd: "password",
    roles: [{
        role: "userAdmin",
        db: "NYCTaxiDB"
    }]
})

The next step is to parse the data in the csv files and insert it into our database.
I have written C# code for that using the MongoDB C# driver.
Visual Studio project:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using MongoDB.Driver;
using PopulateNYCTaxiDB.Model;
using System.IO;

namespace PopulateNYCTaxiDB
{
    class Program
    {
        static void Main(string[] args)
        {
            // Specify the credentials to connect database.
            Console.WriteLine("Please enter user name");
            string userName = Console.ReadLine();
            Console.WriteLine("Please enter password");
            string password = Console.ReadLine();
            var credential = MongoCredential.CreateMongoCRCredential("NYCTaxiDB", userName, password);
            var settings = new MongoClientSettings { Credentials = new[] { credential } };

            var client = new MongoClient(settings);
            var server = client.GetServer();
            var nycDB = server.GetDatabase("NYCTaxiDB");

            var collection = nycDB.GetCollection<NYCTaxiData>("NYCTaxis");
            Console.WriteLine("Do you want to remove data from collection? (y/n)");
            string answer = Console.ReadLine();
            if (answer.ToLower() == "y")
            {
                collection.RemoveAll();
            }

            Console.WriteLine("Please enter location of fare file (trip_fare_1.csv)");
            string tripFarePath = Console.ReadLine();
            Console.WriteLine("Please enter location of data file (trip_data_1.csv)");
            string tripDataPath = Console.ReadLine();

            // Read both trip and fare line by line. One-to-One relation between trip and fare.
            using (var fareReader = new StreamReader(tripFarePath))
            using (var dataReader = new StreamReader(tripDataPath))
            {
                var tripFareList = new List<NYCTaxiData>();

                // Do not read the first line. They are column headers.
                fareReader.ReadLine();
                dataReader.ReadLine();

                string[] fareData, tripData;
                string medallion, hackLicense, paymentType;
                DateTime pickupDateTime, dropDateTime;
                int passCount, tripTime;
                double tripDistance, pickupLong, pickupLat, dropLong, dropLat;
                decimal fareAmount, surcharge, mtaTax, tipAmount, tollAmount, totalAmount;
                int looper = 0;
                int count = 0;

                // Start reading.
                while (!fareReader.EndOfStream)
                {
                    tripData = dataReader.ReadLine().Split(',');
                    fareData = fareReader.ReadLine().Split(',');
                    pickupLong = ParseDouble(tripData[10]);
                    pickupLat = ParseDouble(tripData[11]);
                    dropLong = ParseDouble(tripData[12]);
                    dropLat = ParseDouble(tripData[13]);

                    // Skip erroneous records
                    if (pickupLat < -180 || pickupLat > 180 || pickupLong < -180 || pickupLong > 180 ||
                        dropLat < -180 || dropLat > 180 || dropLong < -180 || dropLong > 180 ||
                        pickupLat == 0 || pickupLong == 0 || dropLat == 0 || dropLong == 0)
                    {
                        continue;
                    }

                    medallion = fareData[0];
                    hackLicense = fareData[1];
                    pickupDateTime = DateTime.Parse(fareData[3]);
                    dropDateTime = DateTime.Parse(tripData[6]);
                    passCount = ParseInt(tripData[7]);
                    tripTime = ParseInt(tripData[8]);
                    tripDistance = ParseDouble(tripData[9]);
                    paymentType = fareData[4];
                    fareAmount = ParseDecimal(fareData[5]);
                    surcharge = ParseDecimal(fareData[6]);
                    mtaTax = ParseDecimal(fareData[7]);
                    tipAmount = ParseDecimal(fareData[8]);
                    tollAmount = ParseDecimal(fareData[9]);
                    totalAmount = ParseDecimal(fareData[10]);

                    Loc pLoc = new Loc
                    {
                        IsPckUp = true,
                        LgLt = new[] { pickupLong, pickupLat }
                    };

                    Loc dLoc = new Loc
                    {
                        IsPckUp = false,
                        LgLt = new[] { dropLong, dropLat }
                    };

                    tripFareList.Add(new NYCTaxiData
                    {
                        Mdlln = medallion,
                        Hlicense = hackLicense,
                        Pdate = pickupDateTime,
                        Ddate = dropDateTime,
                        Pcount = passCount,
                        Ttime = tripTime,
                        Tdist = tripDistance,
                        Loc = new[] { pLoc, dLoc },
                        Ptype = paymentType,
                        Famnt = fareAmount,
                        Srchrge = surcharge,
                        Mtax = mtaTax,
                        Tamnt = tipAmount,
                        TOamnt = tollAmount,
                        TOTamnt = totalAmount
                    });
                    looper++;
                    count++;

                    // Insert records in form of batches
                    if (looper == 500000)
                    {
                        collection.InsertBatch(tripFareList);
                        tripFareList.Clear();
                        looper = 0;
                    }
                } // End of loop

                // Insert rest of data
                if (looper > 0)
                {
                    collection.InsertBatch(tripFareList);
                }

                Console.WriteLine("Total count:{0}", count);
            }

            Console.WriteLine("Operation complete");
            Console.WriteLine("Total count reported by collection: {0}", collection.Count());
            Console.WriteLine("Hit enter to exit");
            Console.ReadLine();
        }

        private static int ParseInt(string data)
        {
            int value = 0;
            int.TryParse(data, out value);
            return value;
        }

        private static double ParseDouble(string data)
        {
            double value = 0;
            double.TryParse(data, out value);
            return value;
        }

        private static decimal ParseDecimal(string data)
        {
            decimal value = Decimal.Zero;
            decimal.TryParse(data, out value);
            return value;
        }
    }
}

Please go ahead and load this project if you have Visual Studio installed. Or you can execute PopulateNYCTaxiDB.exe in the following path: It opens a console application that prompts for the required information and then takes care of populating the database.
Time permitting, I will create a Node.js version of the above.
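
As a rough idea of what the row-level logic could look like in JavaScript, here is a minimal sketch, not the promised Node.js version. It only builds a few fields, uses the same column positions and coordinate check as the C# code, and does not read files or touch MongoDB. The function names and the two sample lines are hypothetical.

```javascript
// Mimics the C# Parse* helpers: unparseable text becomes 0.
function parseNumber(text) {
  const value = Number(text);
  return Number.isFinite(value) ? value : 0;
}

// Same sanity check as the C# loader: non-zero and within [-180, 180].
function isUsableCoord(value) {
  return value !== 0 && value >= -180 && value <= 180;
}

// Returns a document shaped like the ones in NYCTaxis, or null for a bad row.
function toTaxiDoc(tripLine, fareLine) {
  const trip = tripLine.split(',');
  const fare = fareLine.split(',');
  // Columns 10-13 of the trip file: pickup lng, pickup lat, drop lng, drop lat.
  const coords = [10, 11, 12, 13].map(function (i) { return parseNumber(trip[i]); });
  if (!coords.every(isUsableCoord)) return null;
  return {
    Mdlln: fare[0],
    Hlicense: fare[1],
    Pcount: parseNumber(trip[7]),
    Ttime: parseNumber(trip[8]),
    Tdist: parseNumber(trip[9]),
    Loc: [
      { IsPckUp: true, LgLt: [coords[0], coords[1]] },
      { IsPckUp: false, LgLt: [coords[2], coords[3]] }
    ],
    Ptype: fare[4]
  };
}

// Made-up sample rows in the column order the C# code expects.
const tripLine = 'M1,H1,CMT,1,N,2013-01-01 15:11:48,2013-01-01 15:18:10,4,382,1.00,-73.978165,40.757977,-73.989838,40.751171';
const fareLine = 'M1,H1,CMT,2013-01-01 15:11:48,CSH,6.5,0,0.5,0,0,7';
const taxiDoc = toTaxiDoc(tripLine, fareLine);
console.log(taxiDoc);
```

A full version would still need to stream both files in step and insert the documents in batches, as the C# program does.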

To see the stats for the collection, type the following in your mongo shell:

db.NYCTaxis.stats()

This gives you information on the record count, storage, indexes, etc.

Let’s have a look at a record in our collection and see how it is structured. Type in:


You will see the following:

        "_id" : ObjectId("549d14cecf21851240871f34"),
        "Mdlln" : "89D227B655E5C82AECF13C3F540D4CF4",
        "Hlicense" : "BA96DE419E711691B9445D6A6307C170",
        "Pdate" : ISODate("2013-01-01T21:11:48Z"),
        "Ddate" : ISODate("2013-01-01T21:18:10Z"),
        "Pcount" : 4,
        "Ttime" : 382,
        "Tdist" : 1,
        "Loc" : [
                        "IsPckUp" : true,
                        "LgLt" : [
                        "IsPckUp" : false,
                        "LgLt" : [
        "Ptype" : "CSH",
        "Famnt" : "6.5",
        "Srchrge" : "0",
        "Mtax" : "0.5",
        "Tamnt" : "0",
        "TOamnt" : "0",
        "TOTamnt" : "7"

By default, MongoDB creates an index on the “_id” field of every collection.
Execute the code below to see list of indexes on the NYCTaxis collection.


You will notice the “_id” index already created by MongoDB.

Indexing in MongoDB works much like indexing in other databases. Just as relational databases create indexes on tables, MongoDB creates them at the collection level.
MongoDB gives the best performance when the working set (total size of the indexes + data) fits into the system’s RAM, which acts as the cache. If it doesn’t, MongoDB has to read documents from disk, with a performance penalty because disk access is slow.

There are many other index types available.

Currently I have loaded my database with only one month (January) of data.
Below are the stats of the NYCTaxis collection:

> db.NYCTaxis.stats()
{
        "ns" : "NYCTaxiDB.NYCTaxis",
        "count" : 14490472,
        "size" : 7187274112,
        "avgObjSize" : 496,
        "storageSize" : 9305935856,
        "numExtents" : 25,
        "nindexes" : 2,
        "lastExtentSize" : 2146426864,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 1225549696,
        "indexSizes" : {
                "_id_" : 470438864,
                "Loc.LgLt_2d" : 755110832
        },
        "ok" : 1
}

Figures are in bytes.

The same stats in MB:

> db.NYCTaxis.stats(1048576)
{
        "ns" : "NYCTaxiDB.NYCTaxis",
        "count" : 14490472,
        "size" : 6854,
        "avgObjSize" : 496,
        "storageSize" : 8874,
        "numExtents" : 25,
        "nindexes" : 2,
        "lastExtentSize" : 2046,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 1168,
        "indexSizes" : {
                "_id_" : 448,
                "Loc.LgLt_2d" : 720
        },
        "ok" : 1
}

(I will explain the second index, “Loc.LgLt_2d”, in another blog post.)
There are more than 14 million records in our database, and that is just for one month!
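
To relate these figures to the earlier remark about the working set, the byte counts can be converted and added up. This is a rough sketch: toWholeMB is a hypothetical helper, and dropping the fraction is an assumption that happens to reproduce the scaled figures shown above.

```javascript
const BYTES_PER_MB = 1048576; // the scale passed to stats() above

// Whole megabytes, dropping the fraction.
function toWholeMB(bytes) {
  return Math.floor(bytes / BYTES_PER_MB);
}

const dataMB = toWholeMB(7187274112);  // "size"
const indexMB = toWholeMB(1225549696); // "totalIndexSize"
console.log(dataMB, indexMB, dataMB + indexMB); // 6854 1168 8022
```

So one month of data plus its indexes comes to roughly 8 GB, which is the amount of RAM needed for everything to stay in memory.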
So this is how I have populated my database.

In the next post, we will look into applying map-reduce on our database.