My experience with MongoDB

I’ve just finished switching a project from MongoDB to PostgreSQL and I’m 100% certain I made the correct decision in doing so. Running at basically the same load, PostgreSQL returns queries much faster and uses much less CPU and RAM. Despite its popularity a few years ago, Mongo really strikes me as a bit of a mess.

Just some background information to show you what I mean: PostgreSQL is a traditional Relational Database Management System (RDBMS). It stores data in tables made up of ‘rows’, each split into defined ‘columns’. It is built on set theory, is ACID compliant and uses SQL as its query language. MongoDB, on the other hand, is a NoSQL document database, which stores collections of entries as extended JSON documents (the extended format is called BSON). One of the chief features of MongoDB is that it’s distributed and uses a concept known as ‘eventual consistency’ to theoretically enable faster write operations on clusters than is achievable with an RDBMS.

MongoDB supposedly has two primary advantages. The first is that since it is a schemaless, NoSQL solution, it makes it much simpler to get a database up and running. You don’t have to spend time upfront designing a schema and you don’t have to correct a schema if it’s broken. You just start inserting new records with the fields you want them to have and have your application handle the variations. The second benefit is that because it is distributed, it should be much easier to scale. It can be installed on hundreds of machines and changes to one will propagate through the system. There need be no single point of failure. While both of these seem like strong advantages, I’m not sure that they pan out in reality.
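To make the "have your application handle the variations" part concrete, here is a minimal sketch in Python. Plain dicts stand in for documents, and the field names (`email`, `emails`) and the helper `primary_email` are made up for illustration. Nothing here uses the MongoDB driver.

```python
# Plain dicts stand in for documents in a schemaless collection.
# All field names below are hypothetical, chosen only for illustration.
users = [
    {"name": "Ann", "email": "ann@example.com"},        # original shape
    {"name": "Bob", "emails": ["bob@example.com"]},     # shape after a later change
    {"name": "Cy"},                                      # field missing entirely
]


def primary_email(doc):
    """Return one email address for a document, whichever shape it has."""
    if "email" in doc:
        return doc["email"]
    emails = doc.get("emails") or []
    return emails[0] if emails else None


resolved = [primary_email(u) for u in users]
print(resolved)
```

Every reader of the collection needs some version of this branching, which is the cost that comes with skipping the schema.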

For starters, while a schemaless solution makes early tinkering more frictionless, sometimes you want the checks and data protection that a schema can provide. Typed columns also allow relational databases to make optimizations that MongoDB can’t. It seems to me that the difference between schema and schemaless databases is almost exactly like the difference between static and dynamic programming languages. On the one hand, you have constraints which help to catch some bugs early and enforce reliability while at the same time giving the compiler/db more options to optimize speed and memory usage; on the other hand, you have greater flexibility and less bureaucracy. Both have their place, but it’s worth noticing that parts of code that need to be either performant or reliable are often rewritten in a static language while less critical parts are written in a dynamic language. The thing is, while no schema is nice when you’re still figuring out your application’s logic, it’s nice to have things more structured by the time you deploy. You might still need to make changes, but since you’ll need to spend time thinking about them anyway, having to make explicit changes to a schema and run migrations on data will no longer be as big a problem.
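As a rough illustration of the kind of check a schema gives you, here is a small sketch using Python’s built-in `sqlite3` module. SQLite is only a convenient stand-in for an RDBMS here (it is more lenient about column types than PostgreSQL), and the table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite is used only as a stand-in for an RDBMS;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE users (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        age  INTEGER CHECK (age >= 0)
    )
    """
)

# A well-formed row is accepted.
conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", ("Ann", 34))

# Rows that break the declared constraints are refused at write time.
rejected = []
for row in [(None, 20), ("Bob", -5)]:
    try:
        conn.execute("INSERT INTO users (name, age) VALUES (?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row)

stored = conn.execute("SELECT name, age FROM users").fetchall()
print(stored, rejected)
```

The bad rows never reach the table, so code that reads `users` later does not have to guard against a missing name or a negative age.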

The other issue I had with MongoDB was that, for my purposes, it wasn’t really performant. Mongo is really designed to work at scale. The docs suggest running it on at least three dedicated servers with ample resources each. This was a bit much for me so I ran it on a single server which it shared with the application. As a result my application was slow and the server crashed periodically. Now you could criticize me for not following the recommended procedure, and you’d be right, but understand that when I switched to PostgreSQL, without increasing the hardware capacity at all, all of my performance and stability problems went away. MongoDB demanded too much for less performance and essentially the same queries. I could have thrown more hardware at Mongo, but I could also throw the same amount of hardware at Postgres and still end up with a more performant system.

The reason MongoDB’s distributed design exists at all is the notion that scaling out is cheaper in the long run than scaling up, but I don’t think that scaling up is actually that hard at all. If you need more disk space, throwing a SAN onto a server isn’t too much of a problem. 15TB on a SAN is pretty standard, and above that you’re really moving into Big Data territory where specialized tools are necessary anyway. Sharding helps to ‘distribute’ the workload across disk arrays so you’ll even get part of the performance benefit of using multiple servers. Adding faster network access and more CPU to a system isn’t hard either. High-end servers are designed to make this easy. Besides, unless your system is particularly write-heavy, relational databases can use replication to scale out anyhow. MongoDB’s model really isn’t an advantage unless you are solving a write-heavy, Big Data problem.¹ Until you reach that scale, it’s actually slower than the alternative.

Part of the reason MongoDB seems less performant, I think, goes right back to its schemaless design. Because fields have no types, Mongo is limited in the kinds of indexes it can create on data. There are performance enhancements that can be made with a lookup table with only integers or only strings as keys that can’t be made when you have a mix of the two. Those enhancements can make things much faster and less resource intensive, which will improve the stability of the system overall. So while a JSON (BSON) document store is a neat idea, I think it fails some basic utility tests. Better to use a simple distributed hash with serialized objects.
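Here is a toy sketch in Python of why a lookup table with uniform keys is simpler than one with mixed keys. It says nothing about how MongoDB or PostgreSQL actually implement their indexes; the keys and the helpers `find_int` and `type_aware` are invented for the example.

```python
import bisect

# Uniform integer keys: a sorted list and plain comparisons are enough.
int_keys = sorted([42, 7, 19, 3])


def find_int(key):
    """Binary search over keys that are all known to be integers."""
    i = bisect.bisect_left(int_keys, key)
    return i < len(int_keys) and int_keys[i] == key


# Mixed keys: integers and strings have no natural ordering between them.
mixed = [42, "7", 19, "abc"]
try:
    sorted(mixed)
    mixed_sortable = True
except TypeError:
    mixed_sortable = False


def type_aware(key):
    """Order by type name first, then by value within the same type."""
    return (type(key).__name__, key)


# Sorting only works once every comparison also looks at the type.
mixed_keys = sorted(mixed, key=type_aware)
print(find_int(19), mixed_sortable, mixed_keys)
```

The extra type check on every comparison is small, but it is work that a store with typed columns can skip because it knows the key type ahead of time.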

One final criticism that I feel says a lot about MongoDB is that I was surprised at one point to find that it was endianness dependent. It makes optimizations based on the byte order of the system and cannot be run on systems with the wrong byte order. That means that big-endian systems such as SPARC and IBM POWER cannot run it. This seems like premature optimization to me, trying to make the system faster at the expense of portability.
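For anyone who hasn’t run into endianness before, this short Python sketch shows what goes wrong when bytes written in one order are read in the other. The BSON format stores its integers little-endian, so a little-endian machine can use those bytes as they are while a big-endian one would have to swap them. The snippet only illustrates the byte order issue in general and is not taken from MongoDB’s code.

```python
import struct
import sys

value = 1

little = struct.pack("<i", value)  # 32-bit int, least significant byte first
big = struct.pack(">i", value)     # same int, most significant byte first

# Reading little-endian bytes as if they were big-endian gives a different number.
misread = struct.unpack(">i", little)[0]

print(sys.byteorder)  # byte order of the machine running this script
print(little.hex(), big.hex(), misread)
```

Portable software has to pick one order for its on-disk and on-wire data and convert on machines that disagree; skipping that conversion is faster but ties the code to one kind of hardware.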

So, in the end I don’t think I’ll be using MongoDB for production systems in the future. I’ll probably still use it for prototyping though. Sometimes when you’re still designing an application and you’re writing a prototype, you don’t know what the final shape of the data will be. I think in this one case, the flexibility of a schemaless solution will outweigh the advantages of an RDBMS. I’ll still swap out the backend before the application hits production, but until then Mongo is fine.

¹ Or just any very write-heavy problem.