Mongo Sharding + GridFS

By Josh Fermin

What is Sharding?

Storing data across multiple nodes/machines.

A single machine may run out of storage or have bottlenecked reads and writes

Sharding is horizontally scalable - Add machines as demand increases.

Database Issues

High throughput and large amounts data put a lot of stress on a single server.

Set sizes larger than RAM stress I/O capacity

High query rates can exhuast CPU capacity of a single server.

Two Solutions

  • Vertical Scaling:
    • Add more CPU and storage resources to a single server
    • Limitations: Expensive.

  • Sharding (Horizontal Scaling):
    • Distribute data set over multiple servers or shards
    • Each shard is an independent db and all together they make up a single logical db.

Sharded Collection

Sharding in MongoDB

Data Partitioning

Sharding happens on a collection level.

Sharding divides a collection's data by a shard key

Two types of partitioning: range based and hash based

Ranged Based Paritioning

Hash Based Paritioning

Mongo Sharding Tutorial

Taken from Mongo Docs.

Fire Up Config Server

Download Mongo here
mkdir /data/configdb
mongod --configsvr --dbpath /data/configdb
  • data/configdb is where server metadata will be stored
  • In production, run configsvr on three different hosts

Setup Mongos - Routing Service

mongos --configdb localhost:27019 --port 27017
  • Lightweight, do not require data directories
  • Usually have two of these in production

Create DBs for Shards

mkdir /data/shard1
mongod --dbpath /data/shard1 --port 27018 --shardsvr
  • This starts a standalone mongod instance

Create another Shard

mkdir /data/shard2
mongod --dbpath /data/shard2 --port 27020 --shardsvr
Note different port
  • Data will be partitioned to these two shards, based on shard key

Fire up mongos shell

mongo --port 27017
mongos> sh.addShard("localhost:27018"){ "shardAdded": "shard0001", "ok" : 1 }
mongos> sh.addShard("localhost:27020"){ "shardAdded": "shard0002", "ok" : 1 }

Sharding on DB/Collection Level

First create a database and a collection:

mongos> use testDBmongos> db.createCollection(testCollection)

Then enable sharding on each:

mongos> sh.enableSharding(“testDB”)mongos> sh.shardCollection(“testDB.testCollection”, { “md5”:”hashed” })
Second param is choosing the shard key as md5 with hash partitioning. For range based do: "md5":1

What we just did

So What is this GridFS business?

GridFS is an extension on top of mongo

Allows you to store/retrieve files that exceed BSON doc limit of 16mb

How is that possible?

Instead of one doc per file, GridFS breaks it into 255k chunks.

Each chunk becomes a separate document.

When you query for a file, the driver or client will reassemble it as needed.

GridFS Collections

GridFS uses two collections to store files.

  • Chunks: store binary chunks
  • Files: store file metadata

Implementation

Two ways:

Example - C# Driver

using MongoDB.Driver;
using MongoDB.Driver.GridFS;

Connect to mongos and get info

MongoClient client = new MongoClient("mongodb://localhost:27017");
MongoServer server = client.GetServer();
MongoDatabase database = server.GetDatabase("testDB");
MongoCollection testCollection = database.GetCollection("testCollection");

GridFS Driver

MongoGridFS gfs = new MongoGridFS(database);
gfs.Upload(@"C:\test.txt");
After uploading, the chunks and files collections will be created.

Conclusion

  • If everything was set up correctly you can now read/write any files you want (movies, music, etc) into mongo using GridFS
  • You can shard on the chunks collection if you'd like giving you horizontal scaling.
  • Learned how to use MongoDB Sharding and GridFS API

Resources