Part 1 of n: Preparing the dataset using MongoDB

This is the first part of the series related to Analyzing NYC 2013 taxi data.

MongoDB is a NoSQL document based database which stores data in BSON format. It stores data in collections, just like tables. I wanted to be a part of the NoSQL movement and thought of giving MongoDB a try. It also has great support for geospatial queries, ideal for our application. We will look into in the next post.

Let’s get started.
Firstly go ahead and download the .csv taxi data from here.

While the files are being downloaded, let’s start installing MongoDB. Their site does a great job in explaining how to get started.
Install MongoDB on your machine using the installation guidelines from their website itself.
After that, we can start creating our database. Please go through the getting started link which provides understanding of the mongo shell.
Next step is providing credentials for connecting to the database. You may skip this if you want.
Code: https://github.com/tarun11ks/NYCTaxi/blob/master/misc/CredNYCTaxiDB.js

use NYCTaxiDB
db.createUser(
  {
    user: "username",
    pwd: "password",
    roles:
    [
      {
        role: "userAdmin",
        db: "NYCTaxiDB"
      }
    ]
  }
)

Next step would be to parse the data in csv files and insert into our database.
I have written a C# code for that using the MongoDB C# driver.
Visual studio project: https://github.com/tarun11ks/NYCTaxi/tree/master/csharp/PopulateNYCTaxiDB
Code: https://github.com/tarun11ks/NYCTaxi/blob/master/csharp/PopulateNYCTaxiDB/PopulateNYCTaxiDB/Program.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using MongoDB.Driver;
using PopulateNYCTaxiDB.Model;
using System.IO;

namespace PopulateNYCTaxiDB
{
    class Program
    {
        static void Main(string[] args)
        {
            // Specify the credentials to connect database.
            Console.WriteLine("Please enter user name");
            string userName = Console.ReadLine();
            Console.WriteLine("Please enter password");
            string password = Console.ReadLine();
            var credential = MongoCredential.CreateMongoCRCredential("NYCTaxiDB", userName, password);
            var settings = new MongoClientSettings { Credentials = new[] { credential } };

            var client = new MongoClient(settings);
            var server = client.GetServer();
            var nycDB = server.GetDatabase("NYCTaxiDB");

            var collection = nycDB.GetCollection<NYCTaxiData>("NYCTaxis");
            Console.WriteLine("Do you want to remove data from collection? (y/n)");
            string answer = Console.ReadLine();
            if (answer.ToLower() == "y")
                collection.Drop();
            Console.WriteLine("Please enter location of fare file (trip_fare_1.csv)");
            string tripFarePath = Console.ReadLine();
            Console.WriteLine("Please enter location of data file (trip_data_1.csv)");
            string tripDataPath = Console.ReadLine();

            // Read both trip and fare line by line. One-to-One relation between trip and fare.
            using (var fareReader = new StreamReader(tripFarePath))
            using (var dataReader = new StreamReader(tripDataPath))
            {
                var tripFareList = new List<NYCTaxiData>();

                // Do not read the first line. They are column headers.
                fareReader.ReadLine();
                dataReader.ReadLine();
                string[] fareData, tripData;
                string medallion, hackLicense, paymentType;
                DateTime pickupDateTime, dropDateTime;
                int passCount, tripTime;
                double tripDistance, pickupLong, pickupLat, dropLong, dropLat;
                decimal fareAmount, surcharge, mtaTax, tipAmount, tollAmount, totalAmount;
                int looper = 0;
                int count = 0;

                // Start reading.
                while (!fareReader.EndOfStream)
                {
                    tripData = dataReader.ReadLine().Split(',');
                    fareData = fareReader.ReadLine().Split(',');
                    pickupLong = ParseDouble(tripData[10]);
                    pickupLat = ParseDouble(tripData[11]);
                    dropLong = ParseDouble(tripData[12]);
                    dropLat = ParseDouble(tripData[13]);

                    // Skip erroneous records
                    if (pickupLat < -180 || pickupLat > 180 || pickupLong < -180 || pickupLong > 180 ||
                        dropLat < -180 || dropLat > 180 || dropLong < -180 || dropLong > 180 ||
                        pickupLat == 0 || pickupLong == 0 || dropLat == 0 || dropLong == 0)
                        continue;

                    medallion = fareData[0];
                    hackLicense = fareData[1];
                    pickupDateTime = DateTime.Parse(fareData[3]);
                    dropDateTime = DateTime.Parse(tripData[6]);
                    passCount = ParseInt(tripData[7]);
                    tripTime = ParseInt(tripData[8]);
                    tripDistance = ParseDouble(tripData[9]);
                    paymentType = fareData[4];
                    fareAmount = ParseDecimal(fareData[5]);
                    surcharge = ParseDecimal(fareData[6]);
                    mtaTax = ParseDecimal(fareData[7]);
                    tipAmount = ParseDecimal(fareData[8]);
                    tollAmount = ParseDecimal(fareData[9]);
                    totalAmount = ParseDecimal(fareData[10]);

                    Loc pLoc = new Loc
                    {
                        IsPckUp = true,
                        LgLt = new[] { pickupLong, pickupLat }
                    };

                    Loc dLoc = new Loc
                    {
                        IsPckUp = false,
                        LgLt = new[] { dropLong, dropLat }
                    };

                    tripFareList.Add(new NYCTaxiData
                    {
                        Mdlln = medallion,
                        Hlicense = hackLicense,
                        Pdate = pickupDateTime,
                        Ddate = dropDateTime,
                        Pcount = passCount,
                        Ttime = tripTime,
                        Tdist = tripDistance,
                        Loc = new[] { pLoc, dLoc },
                        Ptype = paymentType,
                        Famnt = fareAmount,
                        Srchrge = surcharge,
                        Mtax = mtaTax,
                        Tamnt = tipAmount,
                        TOamnt = tollAmount,
                        TOTamnt = totalAmount
                    });

                    count++;
                    looper++;
                    // Insert records in form of batches
                    if (looper == 500000)
                    {
                        collection.InsertBatch(tripFareList);

                        looper = 0;
                        tripFareList.Clear();
                    }
                } // End of loop

                // Insert rest of data
                if (looper > 0)
                {
                    collection.InsertBatch(tripFareList);
                }

                Console.WriteLine("Total count:{0}", count);
            }

            Console.WriteLine("Operation complete");
            Console.WriteLine("Total count reported by collection", collection.Count());
            Console.WriteLine("Hit enter to exit");
            Console.ReadLine();
        }

        private static int ParseInt(string data)
        {
            int value = 0;
            int.TryParse(data, out value);
            return value;
        }

        private static double ParseDouble(string data)
        {
            double value = 0;
            double.TryParse(data, out value);
            return value;
        }

        private static decimal ParseDecimal(string data)
        {
            decimal value = Decimal.Zero;
            decimal.TryParse(data, out value);
            return value;
        }
    }
}

Please go ahead and load this project if you have visual studio installed. Or you can execute the PopulateNYCTaxiDB.exe in the following path: https://github.com/tarun11ks/NYCTaxi/tree/master/csharp/PopulateNYCTaxiDB/PopulateNYCTaxiDB/bin/Release. It will open a console application with prompts to enter required information and rest it will take care of populating the database.
Time permitting, I will create a node.js version of the above.

In your mongo shell, to see your stats on the collection, type in

 db.NYCTaxis.stats() 

which will give you information on count of records, storage, indexes etc.

Let’s have a look at a record in our collection and see how it is structured. Type in:

db.NYCTaxis.findOne()

You will see the following:

{
        "_id" : ObjectId("549d14cecf21851240871f34"),
        "Mdlln" : "89D227B655E5C82AECF13C3F540D4CF4",
        "Hlicense" : "BA96DE419E711691B9445D6A6307C170",
        "Pdate" : ISODate("2013-01-01T21:11:48Z"),
        "Ddate" : ISODate("2013-01-01T21:18:10Z"),
        "Pcount" : 4,
        "Ttime" : 382,
        "Tdist" : 1,
        "Loc" : [
                {
                        "IsPckUp" : true,
                        "LgLt" : [
                                -73.978165,
                                40.757977
                        ]
                },
                {
                        "IsPckUp" : false,
                        "LgLt" : [
                                -73.989838,
                                40.751171
                        ]
                }
        ],
        "Ptype" : "CSH",
        "Famnt" : "6.5",
        "Srchrge" : "0",
        "Mtax" : "0.5",
        "Tamnt" : "0",
        "TOamnt" : "0",
        "TOTamnt" : "7"
}

By default, MongoDB creates an index on the “_id” field of every collection.
Execute the code below to see list of indexes on the NYCTaxis collection.

db.NYCTaxis.getIndexes()

You will notice the “_id” index already created by MongoDB.

Indexing in MongoDB is just like indexing on other databases. Just like relational databases create indexes on tables, MongoDB does it at collection level.
MongoDB will give you best performance if the working set(total size of the indexes + data) fit into system’s RAM, which becomes the cache. If it doesn’t, MongoDB will swap documents to the disk with a little performance penalty as disk access is slow.

There are many other index types available.

Currently I have loaded my database with only 1 month(January) of data.
Below are the stats of my NYCTaxisDB:

> db.NYCTaxis.stats()
{
        "ns" : "NYCTaxiDB.NYCTaxis",
        "count" : 14490472,
        "size" : 7187274112,
        "avgObjSize" : 496,
        "storageSize" : 9305935856,
        "numExtents" : 25,
        "nindexes" : 2,
        "lastExtentSize" : 2146426864,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 1225549696,
        "indexSizes" : {
                "_id_" : 470438864,
                "Loc.LgLt_2d" : 755110832
        },
        "ok" : 1
}

Figures are in bytes.

Below is in MB’s:

> db.NYCTaxis.stats(1048576)
{
        "ns" : "NYCTaxiDB.NYCTaxis",
        "count" : 14490472,
        "size" : 6854,
        "avgObjSize" : 496,
        "storageSize" : 8874,
        "numExtents" : 25,
        "nindexes" : 2,
        "lastExtentSize" : 2046,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 1168,
        "indexSizes" : {
                "_id_" : 448,
                "Loc.LgLt_2d" : 720
        },
        "ok" : 1
}

(I will later explain the second index “Loc.LgLt_2d” in another blog post).
There are 14 million records in our database and that is just for one month!
So this is how I have populated my database.

In the next post, we will look into applying map-reduce on our database.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s