Career Fairs, Part 2: How Can Startups Get Noticed?

I wrote the other day about what I think Comp Sci majors are doing wrong at career fairs and how they should be distinguishing themselves from their peers. There’s a fun debate in the comments about whether I gave the right advice. Regardless, here’s a followup question I need answered from CS undergrads:

If you’ve been to a career fair, what did startups do wrong? How can we get you to notice us?

When I consider how we at 10gen set ourselves up at Big Ivy University’s career fair, I don’t think we did any better than the students did. We displayed our logo and the name of our product, MongoDB, and … that’s it. I can’t blame the hundred kids who came up to our table and said, “What do you do?” We should say why an internship with us will be awesome, e.g.:

  • The NoSQL movement is one of the most innovative areas in software these days, and we dominate it.
  • We’re small, so if you’re smart you can make a big relative contribution.
  • We’re run by and for coders: Our CEO codes, I code, our customers code, everyone codes.
  • We’re on the kind of growth trajectory that eventually makes household-name companies.

I don’t know how to say these things convincingly, especially not on a poster, so that a smart undergrad who’s never heard of us will stop at our table. Suggestions?

So You’re Coming to a Career Fair

I went to a career fair at Big Ivy University recently, and talked to fifty or so computer science undergrads who were looking for internships or full-time jobs with my employer, 10gen. I’m sure some of them were very smart, but they had not learned how to distinguish themselves from each other. One after another, these students came with identical resumes, identical suits, and identical pitches about why they should get a gig with us.

CS students, I want to tell you how to stand out when you’re introducing yourself at a career fair. If you’re an extraordinary hacker, you need to tell us that you are, and you need to show that you are on your resume. Otherwise we can’t find you.

What You Learned In School Is Not Enough

The first student I met at BIU handed me her resume, and I saw that she knew Haskell, and she’d done a machine-learning project. I thought, “cool,” and put the resume in the “call this candidate” pile. The third time I saw Haskell and machine learning, I realized that’s just what they teach at Big Ivy.

If you’re competing with students from other schools, then your coursework may be an advantage or a disadvantage. But if you’re coming to a career fair, you’re competing with kids who took the same courses as you. So I’m not impressed that you learned Haskell as a freshman—you’d have been kicked out of the program if you hadn’t.

One possibility too terrifying to contemplate is that the students I met at BIU thought their GPA mattered. If so, they’re in for a rude surprise. I know they all listed their GPAs on their resumes, but I forgot to look, and I think most employers will forget to look at GPA, as well.

Charisma Matters

It’s a shame, but it’s true: a firm handshake, eye contact, and a calm, friendly, enthusiastic manner make a big difference, even for nerds. I will spend more time with you, even though there are five kids in line behind you, and I will answer your questions better and ask you more questions. It’s not just that I’m biased towards charismatic people. Your social skills are part of what my company wants to hire. In the long run, if you work for us, you’ll be making friends with your coworkers, talking to customers, and presenting our products at conferences. We need you to be engaging.

Individual Projects, Unusual Languages, Unusual Courses

Look, if you’re graduating with a CS major, you will get a job. Relax. The market’s great. But if you actually care about software and want to work somewhere that excites you, you’ll need to put some effort into your resume and how you introduce yourself. Here’s what I want to see:

Individual Projects: 100 bonus points each

If you had an idea for a software project and you implemented it, then you should put that at the top of your resume. Above your name. And tell me about that project as soon as you shake my hand at the career fair. The project doesn’t have to be totally unique, or profitable, or complete—just make something. Then I’ll know you have cool ideas for things to build, and that you love coding, which is highly correlated with being great at coding. You’re in the “call back” pile.

If you haven’t built an individual project, start. Let your 4.0 GPA slip a little. It’s worth it to make time for this project. Don’t worry about getting college credit for the time you spend, just build it. Put the GitHub URL on your resume so I can check it out.

Extra Languages: 25 bonus points each

If the only programming languages on your resume are the required ones, then you’re showing me you do your homework. It’s not enough. Learn an extra language. It doesn’t have to be anything exotic like Erlang, just something all your peers didn’t learn in class. Put this at the top of your resume, under the individual project. Tell me about how you taught yourself C++ over summer break because you want to do 3D graphics for a living. It doesn’t matter if I’m not looking for a C++ programmer, you’re showing me you love learning about computers. But be aware that I may know this language, too, so if you claim you’re an “expert,” you better be for real.

Unusual Courses: 10 bonus points each

I know Big Ivy offers a computer graphics class, but it seems like only one student took it. All the rest just listed the same boring courses on their resumes: Operating Systems, Networking, blah blah blah. I know you took those courses; otherwise you wouldn’t be graduating. If you want me to notice you, take lots of electives. Again, your GPA doesn’t matter, so don’t worry about getting a little overloaded.

Longshots

Contributing to Open Source

I don’t recommend that undergrads go on GitHub seeking an open source project to contribute to. It’ll be different once you’ve been working for a few years, but right now you probably don’t have any itches that aren’t well-scratched by an existing project. Even if you do, I doubt you’re ready to write a patch that’s high-quality enough to be accepted. It’s much easier to start a new project on your own. For one thing, when you work on your own project, no one has to approve your patch.

Possible exceptions to this rule: Porting a package to Python 3 if no one else has started it; porting a package from a popular language to an exotic one if there’s no analogous library in the target language.

Freelancing

Your internships for other software companies are great, but I don’t recommend freelance work. It would probably be along the lines of setting up a WordPress site for your friend’s mother’s law firm. The level of sophistication required for your first real gig is going to stomp all over whatever summer job you get, so unless you really need the money, put your time into an individual project instead.

Third Normal Form and Ultimate Truth

I have an opinion: most people learned about relational databases as if RDBMSes were designed to store the ultimate truth about some data. They figured that once the schema had been properly diagrammed and normalized, then they could load all their data into it, and finally, start doing some queries.

To pick on an easy target, look at Wikipedia’s article on schema design. It summarizes the two steps a designer must take:

  1. Determine the relationships between the different data elements.
  2. Superimpose a logical structure upon the data on the basis of these relationships.

Do you see a step that’s missing? If you’ve deployed and maintained a large-scale application you’ll probably see what the Wikipedia authors omitted. In fact, it’s the first step: Figure out what one question your database must answer. Then, design your schema to answer that question as fast as possible. And now you’re done. Come to think of it, you never had to do steps 1 and 2 at all.

There’s a total disconnect between the approaches of introductory SQL courses and real-world application development, and I think this disconnect is slowing down adoption of NoSQL.

Consider Facebook Messages. After a (now rather well-publicized) evaluation process, Facebook chose HBase, a NoSQL data store, as the main database for their message system. I haven’t talked to anyone there, but I figure they chose it based on this criterion:

How fast will our database answer the question, “What are this user’s most recent 10 messages?”

They chose the database system that could answer that question the fastest, and they designed the best schema they could think of to answer that question. Anything else they need to ask HBase may be slow, or difficult, but that doesn’t matter, because “What are this user’s most recent 10 messages?” probably accounts for 99% of the load on their system.
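
To make the idea concrete, here is a toy sketch in Python 3 using only the standard library. It is not how Facebook or HBase stores messages; the Inbox class and its method names are hypothetical, invented for illustration. The point is only that the structure is shaped around the one hot question, and every other question is left slow.

```python
from collections import defaultdict, deque


class Inbox(object):
    """Hypothetical store shaped around one query: a user's newest messages."""

    def __init__(self, keep=10):
        # One bounded, newest-first queue per user. Older messages fall off
        # the far end, so the hot query never scans or sorts anything.
        self._recent = defaultdict(lambda: deque(maxlen=keep))

    def deliver(self, user, message):
        self._recent[user].appendleft(message)

    def most_recent(self, user):
        # The one question this schema answers quickly.
        return list(self._recent[user])

    def search(self, text):
        # Any other question is a full scan: slow, and that is the trade-off.
        return [m for q in self._recent.values() for m in q if text in m]


inbox = Inbox(keep=3)
for n in range(5):
    inbox.deliver('mabel', 'message %d' % n)
print(inbox.most_recent('mabel'))
```

With keep=3, only the three newest messages survive, newest first. The search() method still works, but it touches every queue, which is exactly the kind of cost this design accepts.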

If you learned about databases in college, following some textbook, I expect you were guided through a long process of modeling real-world data using rows and columns, to express some profound truth about the data. Then, you were introduced to SQL, with which you could query the data. At the end of the course, maybe there was a brief discussion of database performance. Probably not.

Data at the scale that the largest websites handle doesn’t work that way. Large applications design their schemas to answer one question as quickly as possible, and no other considerations are significant.

The next time you read about a NoSQL database you might wonder, “What about foreign keys, or normalization? What about transactions? Why can’t I define secondary indexes? Why are range queries prohibited?” (I’m just picking some limitations at random—each system is different.) Consider who built these new database systems, and what their experience has been. The ideas behind NoSQL databases mostly originated at places like Google, Amazon, and Yahoo. They build huge systems, and huge systems’ loads are usually dominated by a handful of queries. Companies build their database systems from the ground up to optimize the performance of these queries. NoSQL databases encourage you to figure out ahead of time, “What one question do I need to answer?” Figure that out, and choose your database software and your schema based on that. Nothing else really matters.

Philly MongoDB User Group: Python, MongoDB, and Asynchronous Web Frameworks

Philadelphia Panorama From Camden
Photo (C) Parent5446

I’ll be recapping last week’s talk on Python, MongoDB, and Asynchronous Web Frameworks this Thursday at 7pm, in Philadelphia, at the Philly MongoDB User Group’s inaugural meetup. We’ll be at the Devnuts office, at 908 North 3rd Street. We’ll have pizza, naturally.

First Philly MongoDB User Group

This Thursday: a talk on Python, MongoDB, and asynchronous web frameworks

This Thursday in NYC I’m talking about Python, MongoDB, and asynchronous web frameworks at a meetup called For the Love of Python: Wine tasting, Red velvet cupcakes, and Tech Talks. The talk is a work in progress. To be strictly accurate, I have not yet started working on the talk, because the code I’ll be talking about is itself a work in progress. But come anyway, because I’ve been thinking a lot on this subject for the last few months, and I intend to present:

  • A high-level discussion of what an async web framework is and when you need it, or don’t. I think there’s a lot of sloppiness on this subject, and I want to work with the audience on tightening up our thinking.
  • A review of pymongo, pthreads, Tornado, asyncmongo, and gevent. You won’t be disappointed.
  • For the first time ever, I will present an exclusive sneak peek at my own experimental Python driver for MongoDB and Tornado, built on top of the official pymongo driver. It’s pretty snazzy, it uses greenlets, and it’s an example of a general pattern for asynchronizing synchronous database drivers that might inspire you to write your own database driver in Python. Buckle your seatbelts, we’re going deep.

How To Do An Isolated Install of Brubeck

I wanted to install James Dennis’s Brubeck web framework, but lately I’ve become fanatical about installing nothing, nothing, in the system-wide directories. A simple rm -rf brubeck/ should make it like nothing ever happened.

So that I remember this for next time, here’s how I did an isolated install of Brubeck and all its dependencies on Mac OS X Lion.

Install virtualenv and virtualenvwrapper (but of course you’ve already done this, because you’re elite like me).

Make a virtualenv

mkvirtualenv brubeck; cdvirtualenv

ZeroMQ

wget http://download.zeromq.org/historic/zeromq-2.1.9.tar.gz
tar zxf zeromq-2.1.9.tar.gz
cd zeromq-2.1.9
./autogen.sh
./configure --prefix="$VIRTUAL_ENV" # Don't install system-wide, just in your virtualenv's directory
make
make install
cd ..

Mongrel2

git clone https://github.com/zedshaw/mongrel2.git
cd mongrel2
emacs Makefile

Add a line like this to the top of the Makefile, so the compiler can find where you’ve installed ZeroMQ’s header and lib files:

OPTFLAGS += -I/Users/emptysquare/.virtualenvs/brubeck/include -L/Users/emptysquare/.virtualenvs/brubeck/lib

and replace PREFIX?=/usr/local with something like: PREFIX?=/Users/emptysquare/.virtualenvs/brubeck

(If you can get this to work with relative instead of absolute paths, please tell me in the comments!)

make
make install
cd ..

Brubeck

git clone https://github.com/j2labs/brubeck.git
cd brubeck

I plan to do a little hacking on Brubeck itself soon, so rather than running python setup.py install here, I’m simply including my copy of Brubeck’s source code on my PYTHONPATH.

Python Packages

Now we need our isolated include/ and lib/ directories available on the path when we install Brubeck’s Python package dependencies. Specifically, the gevent_zeromq package has some C code that needs to find zmq.h and libzmq in order to compile. We’ll do that by setting the LIBRARY_PATH and C_INCLUDE_PATH environment variables. Run these from inside the brubeck directory you just cloned, since the requirements files live in its envs/ directory:

export LIBRARY_PATH=/Users/emptysquare/.virtualenvs/brubeck/lib
export C_INCLUDE_PATH=/Users/emptysquare/.virtualenvs/brubeck/include
pip install -I -r ./envs/brubeck.reqs
pip install -I -r ./envs/gevent.reqs

How nice is that?

Next

Once you’re here, you have a completely isolated install of ZeroMQ, Mongrel2, Brubeck, and all its package dependencies. Continue with James’s Brubeck installation instructions at the “A Demo” portion.

Tornado Unittesting: Eventually Correct

Time was, time is ...

Photo: Tim Green

I’m a fan of Tornado, one of the major async web frameworks for Python, but unittesting async code is a total pain. I’m going to review what the problem is, look at some klutzy solutions, and propose a better way. If you don’t care what I have to say and you just want to steal my code, get it on GitHub.

The problem

Let’s say you’re working on some profoundly complex library that performs a time-consuming calculation, and you want to test its output:

# test_sync.py
import time
import unittest

def calculate():
    # Do something profoundly complex
    time.sleep(1)
    return 42

class SyncTest(unittest.TestCase):
    def test_find(self):
        result = calculate()
        self.assertEqual(42, result)

if __name__ == '__main__':
    unittest.main()

See? You do an operation, then you check that you got the expected result. No sweat.

But what about testing an asynchronous calculation? You’re going to have some troubles. Let’s write an asynchronous calculator and test it:

# test_async.py
import time
import unittest
from tornado import ioloop

def async_calculate(callback):
    """
    @param callback:    A function taking params (result, error)
    """
    # Do something profoundly complex requiring non-blocking I/O, which
    # will complete in one second
    ioloop.IOLoop.instance().add_timeout(
        time.time() + 1,
        lambda: callback(42, None)
    )

class AsyncTest(unittest.TestCase):
    def test_find(self):
        def callback(result, error):
            print 'Got result', result
            self.assertEqual(42, result)

        async_calculate(callback)
        ioloop.IOLoop.instance().start()

if __name__ == '__main__':
    unittest.main()

Huh. If you run python test_async.py, you see the expected result is printed to the console:

Got result 42

… and then the program hangs forever. The problem is that ioloop.IOLoop.instance().start() starts an infinite loop. You have to stop it explicitly before the call to start() will return.

A Klutzy Solution

Let’s stop the loop in the callback:

        def callback(result, error):
            ioloop.IOLoop.instance().stop()
            print 'Got result', result
            self.assertEqual(42, result)

Now if you run python test_async.py everything’s copacetic:

$ python test_async.py
Got result 42
.
----------------------------------------------------------------------
Ran 1 test in 1.001s

OK

Let’s see if our test will actually catch a bug. Change the async_calculate() function to produce the number 17 instead of 42:

def async_calculate(callback):
    """
    @param callback:    A function taking params (result, error)
    """
    # Do something profoundly complex requiring non-blocking I/O, which
    # will complete in one second
    ioloop.IOLoop.instance().add_timeout(
        time.time() + 1,
        lambda: callback(17, None)
    )

And run the test:

$ python test_async.py
Got result 17
ERROR:root:Exception in callback
Traceback (most recent call last):
  File "/Users/emptysquare/.virtualenvs/blog/lib/python2.7/site-packages/tornado/ioloop.py", line 396, in _run_callback
    callback()
  File "test_async.py", line 14, in <lambda>
    lambda: callback(17, None)
  File "test_async.py", line 22, in callback
    self.assertEqual(42, result)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 494, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/unittest/case.py", line 487, in _baseAssertEqual
    raise self.failureException(msg)
AssertionError: 42 != 17
.
----------------------------------------------------------------------
Ran 1 test in 1.002s

OK

An AssertionError is raised, but the test still passes! Alas, Tornado’s IOLoop catches any exception raised inside a callback and merely logs it. The exception is printed to the console, but it never propagates to the test method, so the unittest framework thinks the test has passed.
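
The effect is easy to reproduce without Tornado. The sketch below (Python 3, standard library only) uses a hypothetical ToyLoop class as a stand-in for an event loop; it is not Tornado’s implementation, it just catches and logs whatever a callback raises. The assertion inside the callback fails, but the exception never reaches the test method, so unittest reports success.

```python
import logging
import unittest


class ToyLoop(object):
    """Hypothetical stand-in for an event loop that logs callback errors."""

    def __init__(self):
        self.callbacks = []
        self.errors = []

    def add_callback(self, fn):
        self.callbacks.append(fn)

    def start(self):
        # Run queued callbacks in order until none are left
        while self.callbacks:
            fn = self.callbacks.pop(0)
            try:
                fn()
            except Exception as e:
                # Swallow the error and only log it
                self.errors.append(e)
                logging.debug('Exception in callback: %r', e)


class SwallowedTest(unittest.TestCase):
    def test_find(self):
        loop = ToyLoop()
        # This assertion fails, but only inside the loop's try/except
        loop.add_callback(lambda: self.assertEqual(42, 17))
        loop.start()


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SwallowedTest)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())
```

The test is reported as successful even though 42 != 17, because AssertionError is an ordinary Exception and the loop’s except clause eats it.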

A Better Way

We’re going to perform some minor surgery on Tornado to fix this up, by creating and installing our own IOLoop which re-raises all exceptions in callbacks. Luckily, Tornado makes this easy. Add import sys to the top of test_async.py, and paste in the following:

class PuritanicalIOLoop(ioloop.IOLoop):
    """
    A loop that quits when it encounters an Exception.
    """
    def handle_callback_exception(self, callback):
        exc_type, exc_value, tb = sys.exc_info()
        raise exc_value

Now add a setUp() method to AsyncTest which will install our puritanical loop:

    def setUp(self):
        super(AsyncTest, self).setUp()

        # So any function that calls IOLoop.instance() gets the
        # PuritanicalIOLoop instead of the default loop.
        if not ioloop.IOLoop.initialized():
            loop = PuritanicalIOLoop()
            loop.install()
        else:
            loop = ioloop.IOLoop.instance()
            self.assert_(
                isinstance(loop, PuritanicalIOLoop),
                "Couldn't install PuritanicalIOLoop"
            )

This is a bit over-complicated for our simple case—a call to PuritanicalIOLoop().install() would suffice—but this will all come in handy later. In our simple test suite, setUp() is only run once, so the check for IOLoop.initialized() is unnecessary, but you’ll need it if you run multiple tests. The call to super() will be necessary if we inherit from a TestCase with a setUp() method, which is exactly what we’re going to do below. For now, just run python test_async.py and observe that we get a proper failure:

$ python test_async.py
Got result 17
F
======================================================================
FAIL: test_find (__main__.AsyncTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_async.py", line 49, in test_find
    ioloop.IOLoop.instance().start()
  File "/Users/emptysquare/.virtualenvs/blog/lib/python2.7/site-packages/tornado/ioloop.py", line 263, in start
    self._run_callback(timeout.callback)
  File "/Users/emptysquare/.virtualenvs/blog/lib/python2.7/site-packages/tornado/ioloop.py", line 398, in _run_callback
    self.handle_callback_exception(callback)
  File "test_async.py", line 25, in handle_callback_exception
    raise exc_value
AssertionError: 42 != 17

----------------------------------------------------------------------
Ran 1 test in 1.002s

FAILED (failures=1)

Lovely. Change async_calculate() back to the correct version that produces 42.

An Even Better Way

So we’ve verified that our test catches bugs in the calculation. But what if we have a bug that prevents our callback from ever being called? Add a return statement at the top of async_calculate() so we don’t execute the callback:

def async_calculate(callback):
    """
    @param callback:    A function taking params (result, error)
    """
    # Do something profoundly complex requiring non-blocking I/O, which
    # will complete in one second
    return
    ioloop.IOLoop.instance().add_timeout(
        time.time() + 1,
        lambda: callback(42, None)
    )

Now if we run the test, it hangs forever, because IOLoop.stop() is never called. How can we write a test that asserts that the callback is eventually executed? Never fear, I’ve written some code:

class AssertEventuallyTest(unittest.TestCase):
    def setUp(self):
        super(AssertEventuallyTest, self).setUp()

        # Callbacks registered with assertEventuallyEqual()
        self.assert_callbacks = set()

    def assertEventuallyEqual(
        self, expected, fn, msg=None, timeout_sec=None
    ):
        if timeout_sec is None:
            timeout_sec = 5
        timeout_sec = max(timeout_sec, int(os.environ.get('TIMEOUT_SEC', 0)))
        start = time.time()
        loop = ioloop.IOLoop.instance()

        def callback():
            try:
                self.assertEqual(expected, fn(), msg)
                # Passed
                self.assert_callbacks.remove(callback)
                if not self.assert_callbacks:
                    # All asserts have passed
                    loop.stop()
            except AssertionError:
                # Failed -- keep waiting?
                if time.time() - start < timeout_sec:
                    # Try again in about 0.1 seconds
                    loop.add_timeout(time.time() + 0.1, callback)
                else:
                    # Timeout expired without passing test
                    loop.stop()
                    raise

        self.assert_callbacks.add(callback)

        # Run this callback on the next I/O loop iteration
        loop.add_callback(callback)

This class lets us register any number of functions, each of which is called periodically until its return value equals the expected value or it times out. The last function that succeeds or times out stops the IOLoop, so your test definitely finishes. The timeout is configurable, either as an argument to assertEventuallyEqual() or as an environment variable TIMEOUT_SEC. Setting a very large timeout value in your environment is useful for debugging a misbehaving unittest: set it to a million seconds so you don’t time out while you’re stepping through the code.

(My code’s inspired by the Scala world’s “eventually” test, which Brendan W. McAdams showed me.)
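
Stripped of the IOLoop, the core idea is just polling with a deadline. Here is a minimal blocking sketch in Python 3 using only the standard library. The eventually_equal function is a hypothetical helper, not part of unittest or Tornado, and unlike assertEventuallyEqual() it sleeps instead of yielding to an event loop, so it only suits code whose result is produced by another thread.

```python
import threading
import time


def eventually_equal(expected, fn, timeout_sec=5, interval_sec=0.1):
    """Poll fn() until it returns expected, or raise after timeout_sec."""
    deadline = time.time() + timeout_sec
    while True:
        actual = fn()
        if actual == expected:
            return
        if time.time() >= deadline:
            raise AssertionError('%r != %r after %s seconds' % (
                expected, actual, timeout_sec))
        # Not there yet: wait a little and ask again
        time.sleep(interval_sec)


results = []
# Produce the answer from another thread, about 0.2 seconds from now
threading.Timer(0.2, results.append, args=[42]).start()
eventually_equal(42, lambda: results and results[0], timeout_sec=2)
print('Got result', results[0])
```

If the value never arrives, the helper raises AssertionError once the deadline passes, so a test built on it fails instead of hanging.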

Paste AssertEventuallyTest into test_async.py, add import os to the top of the file (the class reads TIMEOUT_SEC from os.environ), and fix up your test case to inherit from it:

class AsyncTest(AssertEventuallyTest):
    def setUp(self):
        < ... snip ... >

    def test_find(self):
        results = []
        def callback(result, error):
            print 'Got result', result
            results.append(result)

        async_calculate(callback)

        self.assertEventuallyEqual(
            42,
            lambda: results and results[0]
        )

        ioloop.IOLoop.instance().start()

The call to IOLoop.stop() is gone from the callback, and we’ve added a call to assertEventuallyEqual() just before starting the IOLoop.

There are two details to note about this code:

Detail the First: assertEventuallyEqual()’s first argument is the expected value, and its second argument is a function whose return value should eventually equal the expected value. Hence the lambda.

Detail the Second: callback() needs a place to store its result so that the lambda can find it, but here we run into a nasty peculiarity of Python. Python functions can assign to variables in their own scope, or the global scope (with the global keyword), but inner functions can’t rebind variables in outer functions’ scope. Python 3 introduces a nonlocal keyword to solve this, but meanwhile we can hack around the problem by creating a results list in the outer function and appending to it in the inner function. This is a common idiom that you’ll use a lot when you write callbacks in asynchronous unittests.
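
Here is the difference in miniature, as a sketch for Python 3, where all three forms run (in Python 2 the nonlocal version is a syntax error). The function names are hypothetical, chosen only for this illustration.

```python
def collect_with_list():
    results = []

    def callback(result, error):
        # Mutating the outer list is fine: we never rebind the name
        results.append(result)

    callback(42, None)
    return results[0]


def collect_with_nonlocal():
    result_seen = None

    def callback(result, error):
        # Python 3 only: rebind the variable in the enclosing scope
        nonlocal result_seen
        result_seen = result

    callback(42, None)
    return result_seen


def collect_broken():
    result_seen = None

    def callback(result, error):
        # Plain assignment creates a new local; the outer name is untouched
        result_seen = result

    callback(42, None)
    return result_seen


print(collect_with_list(), collect_with_nonlocal(), collect_broken())
```

The first two functions return 42. The third returns None, because the assignment inside callback() never touches the outer variable.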

Conclusion

I’ve packed up PuritanicalIOLoop and AssertEventuallyTest on GitHub; go grab the code. Your test cases can choose to inherit from PuritanicalTornadoTest, AssertEventuallyTest, or both. Just make sure your setUp methods call super(MyTestCaseClass, self).setUp(). Go forth and test!

Save the Monkey: Reliably Writing to MongoDB

Photo: Kevin Jones

MongoDB replica sets claim “automatic failover” when a primary server goes down, and they live up to the claim, but handling failover in your application code takes some care. I’ll walk you through writing a failover-resistant application in Python using a new feature in PyMongo 2.1: the ReplicaSetConnection.

Setting the Scene

Mabel the Swimming Wonder Monkey is participating in your cutting-edge research on simian scuba diving. To keep her alive underwater, your application must measure how much oxygen she consumes each second and pipe the same amount of oxygen to her scuba gear. In this post, I’ll only cover writing reliably to Mongo. I’ll get to reading later.

MongoDB Setup

Since Mabel’s life is in your hands, you want a robust Mongo deployment. Set up a 3-node replica set. We’ll do this on your local machine using three TCP ports, but of course in production you’ll have each node on a separate machine:

$ mkdir db0 db1 db2
$ mongod --dbpath db0 --logpath db0/log --pidfilepath db0/pid --port 27017 --replSet foo --fork
$ mongod --dbpath db1 --logpath db1/log --pidfilepath db1/pid --port 27018 --replSet foo --fork
$ mongod --dbpath db2 --logpath db2/log --pidfilepath db2/pid --port 27019 --replSet foo --fork

(Make sure you don’t have any mongod processes running on those ports first.)

Now connect up the nodes in your replica set. My machine’s hostname is ‘emptysquare.local’; replace it with yours when you run the example:

$ hostname
emptysquare.local
$ mongo
> rs.initiate({
    _id: 'foo',
    members: [
        {_id: 0, host:'emptysquare.local:27017'},
        {_id: 1, host:'emptysquare.local:27018'},
        {_id: 2, host:'emptysquare.local:27019'}
    ]
})

The first _id, ‘foo’, must match the name you passed with --replSet on the command line, otherwise Mongo will complain. If everything’s correct, Mongo replies with, “Config now saved locally. Should come online in about a minute.” Run rs.status() a few times until you see that the replica set has come online: the first member’s stateStr will be “PRIMARY” and the other two members’ stateStrs will be “SECONDARY”. On my laptop this takes about 30 seconds.

Voilà: a bulletproof 3-node replica set! Let’s start the Mabel experiment.

Definitely Writing

Install PyMongo 2.1 and create a Python script called mabel.py with the following:

import datetime, random, time
import pymongo

mabel_db = pymongo.ReplicaSetConnection(
    'localhost:27017,localhost:27018,localhost:27019',
    replicaSet='foo'
).mabel

while True:
    time.sleep(1)
    mabel_db.breaths.insert({
        'time': datetime.datetime.utcnow(),
        'oxygen': random.random()
    }, safe=True)
    print 'wrote'

mabel.py will record the amount of oxygen Mabel consumes (or, in our test, a random amount) and insert it into Mongo once per second. Run it:

$ python mabel.py
wrote
wrote
wrote

Now, what happens when our good-for-nothing sysadmin unplugs the primary server? Let’s simulate that in a separate terminal window by grabbing the primary’s process id and killing it:

$ kill `cat db0/pid`

Switching back to the first window, all is not well with our Python script:

Traceback (most recent call last):
  File "mabel.py", line 10, in <module>
    'oxygen': random.random()
  File "/Users/emptysquare/.virtualenvs/pymongo/mongo-python-driver/pymongo/collection.py", line 310, in insert
    continue_on_error, self.__uuid_subtype), safe)
  File "/Users/emptysquare/.virtualenvs/pymongo/mongo-python-driver/pymongo/replica_set_connection.py", line 738, in _send_message
    raise AutoReconnect(str(why))
pymongo.errors.AutoReconnect: [Errno 61] Connection refused

This is terrible. WTF happened to “automatic failover”? And why does PyMongo raise an AutoReconnect error rather than actually automatically reconnecting?

Well, automatic failover does work, in the sense that one of the secondaries will quickly take over as a new primary. Do rs.status() in the mongo shell to confirm that:

$ mongo --port 27018 # connect to one of the surviving mongod's
PRIMARY> rs.status()
// edited for readability ...
{
    "set" : "foo",
    "members" : [ {
            "_id" : 0,
            "name" : "emptysquare.local:27017",
            "stateStr" : "(not reachable/healthy)",
            "errmsg" : "socket exception"
        }, {
            "_id" : 1,
            "name" : "emptysquare.local:27018",
            "stateStr" : "PRIMARY"
        }, {
            "_id" : 2,
            "name" : "emptysquare.local:27019",
            "stateStr" : "SECONDARY"
        }
    ]
}

Depending on which mongod took over as the primary, your output could be a little different. Regardless, there is a new primary, so why did our write fail? The answer is that PyMongo doesn’t try repeatedly to insert your document—it just tells you that the first attempt failed. It’s your application’s job to decide what to do about that. To explain why, let us indulge in a brief digression.

Brief Digression: Monkeys vs. Kittens

If what you’re inserting is voluminous but no single document is very important, like pictures of kittens or web analytics, then in the extremely rare event of a failover you might prefer to discard a few documents, rather than blocking your application while it waits for the new primary. Throwing an exception if the primary dies is often the right thing to do: You can notify your user that he should try uploading his kitten picture again in a few seconds once a new primary has been elected.

But if your updates are infrequent and tremendously valuable, like Mabel’s oxygen data, then your application should try very hard to write them. Only you know what’s best for your data, so PyMongo lets you decide. Let’s return from this digression and implement that.

Trying Hard to Write

Let’s bring up the mongod we just killed:

$ mongod --dbpath db0 --logpath db0/log --pidfilepath db0/pid --port 27017 --replSet foo --fork

And update mabel.py with the following armor-plated loop:

while True:
    time.sleep(1)
    data = {
        'time': datetime.datetime.utcnow(),
        'oxygen': random.random()
    }

    # Try for five minutes to recover from a failed primary
    for i in range(60):
        try:
            mabel_db.breaths.insert(data, safe=True)
            print 'wrote'
            break # Exit the retry loop
        except pymongo.errors.AutoReconnect, e:
            print 'Warning', e
            time.sleep(5)

Now run python mabel.py, and again kill the primary. Do either “kill `cat db1/pid`” or “kill `cat db2/pid`”, depending on which mongod is the primary right now. mabel.py’s output will look like:

wrote
Warning [Errno 61] Connection refused
Warning emptysquare.local:27017: [Errno 61] Connection refused, emptysquare.local:27019: [Errno 61] Connection refused, emptysquare.local:27018: [Errno 61] Connection refused
Warning emptysquare.local:27017: not primary, emptysquare.local:27019: [Errno 61] Connection refused, emptysquare.local:27018: not primary
wrote
wrote
.
.
.

mabel.py goes through a few stages of grief when the primary dies, but in a few seconds it finds a new primary, inserts its data, and continues happily.
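If you need this pattern in more than one place, the retry loop can be pulled out into a helper. The following is only a sketch under some assumptions: TransientError is a hypothetical stand-in for pymongo.errors.AutoReconnect so the example runs without a replica set, and retry and flaky_insert are illustrative names. Unlike the loop in mabel.py, which quietly gives up after its last attempt, this version re-raises the final error so the caller can decide what to do.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for pymongo.errors.AutoReconnect."""


def retry(operation, attempts=60, delay=5, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Returns the operation's result, or re-raises the last TransientError.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: let the caller decide
            sleep(delay)


# Demo: fail twice, as if an election were in progress, then succeed.
calls = []


def flaky_insert():
    calls.append(1)
    if len(calls) < 3:
        raise TransientError('not primary')
    return 'wrote'


result = retry(flaky_insert, attempts=5, delay=0)
print(result)
```

In mabel.py the operation would be a small function wrapping the insert call, with the default 60 attempts and 5-second delay giving the same five-minute window.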

What About Duplicates?

Leaving monkeys and kittens aside, another reason PyMongo doesn’t automatically retry your inserts is the risk of duplication: If the first attempt caused an error, PyMongo can’t know if the error happened before Mongo wrote the data, or after. What if we end up writing Mabel’s oxygen data twice? Well, there’s a trick you can use to prevent this: generate the document id on the client.

Whenever you insert a document, Mongo checks if it has an “_id” field and if not, it generates an ObjectId for it. But you’re free to choose the new document’s id before you insert it, as long as the id is unique within the collection. You can use an ObjectId or any other type of data. In mabel.py you could use the timestamp as the document id, but I’ll show you the more generally applicable ObjectId approach:

from pymongo.objectid import ObjectId

while True:
    time.sleep(1)
    data = {
        '_id': ObjectId(),
        'time': datetime.datetime.utcnow(),
        'oxygen': random.random()
    }
    # Try for five minutes to recover from a failed primary
    for i in range(60):
        try:
            mabel_db.breaths.insert(data, safe=True)
            print 'wrote'
            break # Exit the retry loop
        except pymongo.errors.AutoReconnect, e:
            print 'Warning', e
            time.sleep(5)
        except pymongo.errors.DuplicateKeyError:
            # It worked the first time, so stop retrying
            break

We set the document’s id to a newly generated ObjectId in our Python code, before entering the retry loop. If our insert succeeds just before the primary dies and we catch the AutoReconnect exception, the next attempt to insert the document raises a DuplicateKeyError, and we know for sure that the insert succeeded. You can use this technique for safe, reliable writes in general.
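To watch the trick work without killing any servers, here is a toy simulation. Everything in it is an assumption for illustration: FakeCollection is an in-memory stand-in for a real collection, ConnectionLost and DuplicateKey stand in for pymongo.errors.AutoReconnect and pymongo.errors.DuplicateKeyError, and a uuid stands in for an ObjectId. The fake collection stores the first insert and then reports a failure, which is exactly the ambiguous case described above.

```python
import uuid


class ConnectionLost(Exception):
    """Hypothetical stand-in for pymongo.errors.AutoReconnect."""


class DuplicateKey(Exception):
    """Hypothetical stand-in for pymongo.errors.DuplicateKeyError."""


class FakeCollection(object):
    """In-memory toy: the first insert is stored, then reported as failed."""

    def __init__(self):
        self.docs = {}
        self.fail_next = True

    def insert(self, doc):
        if doc['_id'] in self.docs:
            raise DuplicateKey(doc['_id'])
        self.docs[doc['_id']] = doc
        if self.fail_next:
            self.fail_next = False
            # The write landed, but the client never hears about it.
            raise ConnectionLost('primary died after writing')


def insert_once(collection, doc, attempts=3):
    """Retry an insert; a duplicate-key error means an earlier try worked."""
    for _ in range(attempts):
        try:
            collection.insert(doc)
            return True
        except ConnectionLost:
            continue  # Real code would sleep before retrying
        except DuplicateKey:
            return True  # It worked the first time
    return False


breaths = FakeCollection()
doc = {'_id': uuid.uuid4().hex, 'oxygen': 0.5}  # Id chosen before the loop
stored = insert_once(breaths, doc)
print('stored=%s, documents=%d' % (stored, len(breaths.docs)))
```

Because the id is fixed before the first attempt, the retry cannot create a second copy: the collection ends up with exactly one document.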


Bibliography

Apocryphal story of Mabel, the Swimming Wonder Monkey

More likely true, very brutal story of 3 monkeys killed by a computer error

NYC Python Meetup recap

I went to the NYC Python Meetup tonight at an East Village bar. We drank, we ate pizza, we fended off recruiters (they knew they couldn’t recruit at the meetup proper, but one ambushed me as I left!), and heard two quirky presentations:

· Roy Smith of songza.com talked about Songza’s complex tech stack, and discussed some nice techniques for dealing with the complexity. In particular, they’ve hacked up their HAProxy front-end load balancer to add an X-Unique-Id header to every incoming HTTP request. All the software at all the tiers of their application logs the unique id along with whatever else it’s logging, so in retrospect it’s easy to track the steps it took to handle a request — or fail to handle it — as the work bubbled from tier to tier. They’ve even integrated with Get Satisfaction so they know the request id a customer is complaining about.

· Aaron Watters showed us Gadfly, a Python library that implements the SQL language for querying in-memory data or flat files. He’s updated Gadfly to talk with Cassandra (a NoSQL contender from Facebook) using SQL. It seems to be at the clever-hack stage right now, but could lead the way to integrating NoSQL databases with legacy systems that expect SQL databases?
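The request-id idea from Roy Smith's talk is easy to try in a Python tier. This is my own rough sketch, not Songza's code: it assumes an upstream proxy has already set the X-Unique-Id header, and logger_for_request and the 'app' logger name are made up for the example. It uses the standard library's logging.LoggerAdapter to stamp every log line with the id.

```python
import logging
import uuid


def logger_for_request(headers):
    """Return a logger that tags every message with the request's id.

    Assumes a proxy set X-Unique-Id; generates an id if it is missing.
    """
    request_id = headers.get('X-Unique-Id') or uuid.uuid4().hex
    base = logging.getLogger('app')
    return logging.LoggerAdapter(base, {'request_id': request_id})


logging.basicConfig(
    format='%(request_id)s %(levelname)s %(message)s',
    level=logging.INFO)

log = logger_for_request({'X-Unique-Id': 'abc123'})
log.info('fetching playlist')
```

If every tier logs through an adapter like this, searching the logs for one id shows each step a request took.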