Mongo Replication and Replica Sets

This is a write-up of Dwight Merriman's talk on replication and the new replica sets functionality that is coming up in MongoDB. Dwight is the CEO of 10gen, the company behind MongoDB.

If you want to try any of the examples in this section you'll need two copies of mongod running. From Mathias's talk on MongoDB administration we can borrow these two commands:

$ mongod --master --oplogSize 500
$ mongod --slave --source localhost:27017 --port 3000 --dbpath /data/slave

Replication

Replication in MongoDB is asynchronous master---slave replication. When you first start a slave it does an initial sync, pulling all the databases off the master. Once the slave finishes its initial sync it keeps up to date by reading new data from the oplog.

The --oplogSize option for mongod is important as it controls the number of writes that can be buffered in replication. It defaults to 5% of disk space. Bigger is generally better; there are no performance penalties involved in having a large oplog.

If your oplog file isn't big enough your slave could miss some data. Rather than being stored in a file on disk, the oplog is stored in the local database (which means that it won't be replicated).

You can connect to your slave from the a mongo console connected to the master with connect():

slave = connect('localhost:3000/mydatabase')
slave.db.mycollection.find()

From MongoDB 1.5 onwards, you can check how many slaves are connected to your master server like this:

> use local
> db.local.$main.findOne()  # returns events
> db.local.slaves.find()

On the slave you can run this to see information about the master server (works in 1.4 onwards):

> use local
> db.sources.find()

You can find out whether your oplog is big enough by running this command on your master:

> db.printReplicationInfo()

You can see how long it was since the slave last synced by running this command on the slave:

> db.printSlaveReplicationInfo()

Replication works fine over unreliable connections (e.g. the Internet).

So how do you configure your masters and slaves? However you like. One slave server can run multiple copies of mongodb in order to copy data from multiple live servers. One master server can be backed up by multiple slaves; it's up to you.

Replica sets

Replica sets are new, and will be coming out next month. They're similar to replica pairs, only much better.

A replica set is a cluster of multiple servers. All writes go to the "primary" server. Reads can be to a primary or a secondary (if you don't mind if a read doesn't reflect your most recent write). The members of a replica set are essentially peers; any node can serve as the primary.

The primary server is chosen through an election. If the primary server goes down the members of the set will automatically elect a new one.

Replica set design concepts

  1. A write is only truly committed once it has replicated to the majority of servers in the set. We can wait for confirmation of this with getlasterror.
  2. Writes which are committed at the master of the set may be visible before the true cluster-wide commit has cocurred.
  3. On failover of the primary, any data which hasn't been replicated from the primary is dropped (hence 1).

If the primary dies and later comes back on line it may have received data just before it died that wasn't replicated to any other members of the set. This data is never merged back; merging conflicts is complicated, often involves manual intervention. Typically, we just don't want to have to deal with it.

If this would be a problem for your application you can make a call to getlasterror() to specify how many servers your data must be written to before your application is allowed to proceed (see the MongoDB replication documentation for more details).

So what happens when the old primary comes back online? A new primary will already have been elected while it was down (the member with the freshest data will have won the election). The old primary will go into a "recovering" state, when it will roll back to the last known good state and then replicate changes that have occurred since it went down.

The configuration for a replica set is a JSON object that specifies which hosts are members of the set (you can find example configuration in the docs, or in Dwight's slides). Pass the config to the server like this:

> use admin
> db.runCommand({replSetInitiate: cfg})

When you start mongod you can seed the list of hostnames on the command line:

$ mongod --replSet setname/host1,host2

Check the status of your set with the {replSetGetStatus: 1} command, or check http://localhost:28017/replSetGetStatus?text via HTTP.

Connecting to a replica set

When an application tries to talk to a primary member that has gone down it will get an error (e.g. a socket error), at which point it will then need to try and find out which member it should talk to. Language drivers will be updated to handle this, so application developers shouldn't need to worry about it. mongos will also handle it transparently.

Applications don't need to know the names of all the machines in your replica set; they only needs to know which server is the primary. The language drivers should learn the other members of the set on your behalf, and automatically switch to other members as required. If you need to guarantee consistent reads then you should pass an argument through to the language binding/client driver.

Summary

It was a very interesting introduction to the upcoming replica sets functionality. These notes don't really come close to capturing the full details; if you want to give Replica Sets a try see the MongoDB Replica Sets documentation.

Dwight's slides are online. He's @dmerr on Twitter.