Running RethinkDB with Python C++ Protobuf Drivers on OSX

Published 2013-09-28

I ran across RethinkDB a few days ago on O'Reilly Radar and decided to give it a shot. I'm a big fan of MongoDB and Riak, and RethinkDB seems to take the best features of both (slick JSON data structures with easy administration) and mash them together.

I figured moving from Flask and Mongo to Flask and RethinkDB would be pretty easy, so I ran

brew install rethinkdb
pip install rethinkdb
rethinkdb

and was off to the races. I was having a great time with the repl and RethinkDB's awesome built-in admin interface and stuffing things into the DB left and right.

The totally awesome-looking rethinkdb admin interface

However, query performance in my python app kinda sucked. I wanted to fix that.

You want protoc, not Python

I did some reading and learned that RethinkDB's python driver uses protobuf. If you haven't heard of it before, protobuf, or "protocol buffers", is a binary serialization format and library from Google, with its core implementation written in C++. In simplistic terms you can think of it as binary JSON with a predefined schema, which makes it space efficient when sent over the network.
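To make the binary JSON comparison concrete, here is a minimal sketch of how protobuf's wire format packs a single integer field, next to the same value as JSON text. The helper names (encode_varint, encode_int_field) are made up for illustration and are not part of the protobuf library; real code would use message classes generated by protoc. The sketch assumes Python 3.

```python
import json


def encode_varint(n):
    """Encode a non-negative int as a base-128 varint, low 7 bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number, value):
    """Encode one integer field: a key (field number + wire type 0), then the value."""
    key = (field_number << 3) | 0  # wire type 0 means "varint"
    return encode_varint(key) + encode_varint(value)


binary = encode_int_field(1, 150)
text = json.dumps({"a": 150}).encode("utf-8")

print(len(binary), "bytes as protobuf-style binary")
print(len(text), "bytes as JSON text")
```

The field name never travels over the wire, only its number, which is a large part of why the binary form is smaller.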

Riak also uses protobuf. When I started using Riak, Google's C/C++ version of the protobuf python module didn't support Python 3, so performance there was abysmal and you had to use Python 2.7 if you wanted the best performance. (Google's protobuf still doesn't support Python 3 afaik, but you can try one from OpenX if you're adventurous.)

In any case, RethinkDB performs best with the C/C++ version of the python protobuf module. By default -- and you'll notice this in your console spam when you run pip install rethinkdb if you pay close attention -- the C/C++ library is missing so RethinkDB uses a python implementation as a fallback.

If you're curious, you can find out which one RethinkDB is using via:

python
import rethinkdb as r
r.protobuf_implementation

This will output either python or cpp to indicate whether you're running native code (you want it to say cpp for the best performance). Mine said python. So of course, the next step was to fix that and install the C/C++ version of the protobuf module.
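The same check can be scripted, for instance to print a warning when your app starts. This is a minimal sketch that assumes the driver exposes r.protobuf_implementation as in the session above; describe_backend is a hypothetical helper name, not part of the driver.

```python
def describe_backend(module):
    """Describe the protobuf backend that a driver module reports."""
    impl = getattr(module, "protobuf_implementation", None)
    if impl == "cpp":
        return "cpp: native protobuf backend in use"
    if impl == "python":
        return "python: pure python fallback, expect slower queries"
    return "unknown: this module does not report a protobuf implementation"


try:
    import rethinkdb as r
except ImportError:
    print("rethinkdb is not installed")
else:
    print(describe_backend(r))
```

Using getattr with a default means the script still runs against a driver build that doesn't have the attribute.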

Wait, what's this C/C++ thing mean?

Native python extension modules are written in C. Protobuf is written in C++. So first you have to compile protobuf, and then compile a C python module that wraps the C++ library. All of this is included in the protobuf tarball, and aside from wiring it all together there's not much involved.

Protobuf Version 2.4.1

At the time of writing, the latest protobuf library is 2.5.0. This is what gets installed via pip. The version of RethinkDB I installed via homebrew is 1.9.0 (1.10.0 actually came out two days ago but it's not in the brew repo yet).

After trying and failing I learned that protobuf 2.5.0 is apparently not compatible with RethinkDB 1.9.0 on OSX. From some of the GitHub issues I was able to infer that RethinkDB is using 2.4.1, apparently dictated by what's in Ubuntu's repositories right now. I don't know if the build-from-source docs make note of this, but the python drivers page certainly doesn't.

No problem. I can download 2.4.1 instead.

Compiling Protobuf

Now's the easy part. We'll compile and install protobuf 2.4.1 from source.

tar -xjf ~/Downloads/protobuf-2.4.1.tar.bz2
cd protobuf-2.4.1
./configure
make
make install

Normally I'd say run make test, but protobuf doesn't have that target. Its test suite runs with make check instead, which you can run between make and make install.

At this point you should be able to run protoc --version and get libprotoc 2.4.1 in response.
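If you'd rather check this from a script than by eye, something like the following would do. It is a rough sketch: parse_protoc_version and installed_protoc_version are hypothetical helper names, and the only thing assumed about protoc is the libprotoc 2.4.1 output format quoted above.

```python
import re
import subprocess

EXPECTED = (2, 4, 1)


def parse_protoc_version(output):
    """Turn 'libprotoc 2.4.1' into (2, 4, 1); return None if it doesn't match."""
    match = re.search(r"libprotoc\s+(\d+(?:\.\d+)*)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_protoc_version():
    """Run `protoc --version`; return None if protoc is missing or fails."""
    try:
        output = subprocess.check_output(["protoc", "--version"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_protoc_version(output.decode("utf-8", "replace"))


version = installed_protoc_version()
if version is None:
    print("could not determine the protoc version (missing or unexpected output)")
elif version == EXPECTED:
    print("protoc matches 2.4.1")
else:
    print("protoc is %s, expected 2.4.1" % ".".join(str(n) for n in version))
```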

Note: If you already ran make install with the wrong version of protobuf from source like I did, you can remove it with make uninstall. At one point I was also getting segfaults from python, so I ran rm /usr/local/lib/libproto* and then ran make install again from the 2.4.1 directory.

Building the Native Python Module

Next, the python module. But first, we need to set an environment variable.

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=cpp

This indicates that we want to use the native C/C++ version of the protobuf python module, which is faster than the pure python version. If you don't have this environment variable set, setup.py will build the pure python version, which is slow and not what you want.
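To illustrate the idea, here is a simplified sketch of the kind of check a build script can make against that variable. It is not a copy of protobuf's setup.py; requested_implementation is a hypothetical name, and treating every value other than cpp as python is an assumption made for the example.

```python
import os

ENV_VAR = "PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"


def requested_implementation(environ=None):
    """Return 'cpp' only if the variable is set to exactly 'cpp', else 'python'."""
    if environ is None:
        environ = os.environ
    value = environ.get(ENV_VAR, "python")  # unset means the pure python default
    return "cpp" if value == "cpp" else "python"


print("build would target:", requested_implementation())
```

Running this in the same shell where you ran the export is a quick way to confirm the variable is actually visible to python before you kick off the build.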

Since the RethinkDB python module had already installed a more recent version of the protobuf module, I took this opportunity to remove it.

pip uninstall protobuf -y

Next you should cd into protobuf-2.4.1/python. If you're following directly from the example above, you're already in protobuf-2.4.1, so you can just cd into python from where you are.

cd python
python setup.py build
python setup.py test
python setup.py install --prefix /usr/local

I added --prefix /usr/local because I'm using python 2.7.5 from homebrew, not system python on OSX. If you try to install the module to the wrong place, setuptools will complain so you probably can't break this step accidentally. At this point, pip list | grep protobuf should show you protobuf (2.4.1).
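The version check can also be done from python itself. This sketch assumes a newer interpreter than the 2.7.5 used in this post, since importlib.metadata only exists in Python 3.8 and later; installed_version is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string for a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("protobuf")
if found is None:
    print("protobuf is not installed")
elif found == "2.4.1":
    print("protobuf 2.4.1 is installed")
else:
    print("protobuf %s is installed, expected 2.4.1" % found)
```

On older interpreters, pip list remains the simplest way to check.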

Reinstall the RethinkDB Module

Finally, we need to reinstall the RethinkDB python module so it can be made aware of the C/C++ python module that we've installed. I don't know exactly why this is necessary, but it seems to do some linking step when you pip install, so just having the proper protobuf module seems to be insufficient.

pip uninstall rethinkdb -y
pip install rethinkdb

You can tell if it worked correctly because you won't see a message indicating that it's using the pure python fallback. But to be more scientific we can use the same trick from earlier:

python
import rethinkdb as r
r.protobuf_implementation

It should now say 'cpp'. And you're done! Have fun adding RethinkDB to your python app.

