Graceful shutdown in node.js

According to Wikipedia's article on Unix signals:

Signals are a limited form of inter-process communication used in Unix, Unix-like, and other POSIX-compliant operating systems. A signal is an asynchronous notification sent to a process or to a specific thread within the same process in order to notify it of an event that occurred.

There are a bunch of generic signals, but I will focus on two:

  • SIGTERM is used to cause program termination. It is a way to politely ask a program to terminate. The program can either handle this signal, clean up resources and then exit, or it can ignore the signal (see the sketch below).
  • SIGKILL is used to cause immediate termination. Unlike SIGTERM, it can't be handled or ignored by the process.
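
As a minimal illustration (not from the original post, just a sketch) of what "handling" SIGTERM means in node:

process.on('SIGTERM', function () {
  console.log('got SIGTERM, cleaning up before exit...');
  // release resources here (close connections, flush logs, etc.)
  process.exit(0);
});

// SIGKILL, by contrast, cannot be caught: the operating system
// terminates the process unconditionally.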

Wherever and however you deploy your node.js application, it is very likely that the system in charge of running your app uses these two signals:

  • Upstart: When stopping a service, by default it sends SIGTERM and waits 5 seconds; if the process is still running, it sends SIGKILL.
  • supervisord: When stopping a service, by default it sends SIGTERM and waits 10 seconds; if the process is still running, it sends SIGKILL.
  • runit: When stopping a service, by default it sends SIGTERM and waits 10 seconds; if the process is still running, it sends SIGKILL.
  • Heroku dynos shutdown: as described in this link, Heroku sends SIGTERM, waits up to 10 seconds for the process to exit, and if the process is still running it sends SIGKILL.
  • Docker: if you run your node app in a Docker container, when you run the docker stop command the main process inside the container receives SIGTERM and, after a grace period (10 seconds by default), SIGKILL.

So, let's try with this simple node application:

var http = require('http');

var server = http.createServer(function (req, res) {
  setTimeout(function () { //simulate a long request
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
  }, 4000);
}).listen(9090, function (err) {
  console.log('listening http://localhost:9090/');
  console.log('pid is ' + process.pid);
});

As you can see, responses are delayed 4 seconds. The node documentation here says:

SIGTERM and SIGINT have default handlers on non-Windows platforms that reset the terminal mode before exiting with code 128 + signal number. If one of these signals has a listener installed, its default behaviour will be removed (node will no longer exit).

It is not clear from this what the default behavior is. If I send SIGTERM in the middle of a request, the request fails, as you can see here:

» curl http://localhost:9090 &
» kill 23703
[2] 23832
curl: (52) Empty reply from server

Fortunately, the http server has a close method that stops the server from accepting new connections and calls its callback once it has finished handling all in-flight requests. This method comes from the net module, so it is handy for any type of TCP server.

Now, if I modify the example to something like this:

var http = require('http');

var server = http.createServer(function (req, res) {
  setTimeout(function () { //simulate a long request
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
  }, 4000);
}).listen(9090, function (err) {
  console.log('listening http://localhost:9090/');
  console.log('pid is ' + process.pid);
});

process.on('SIGTERM', function () {
  server.close(function () {
    process.exit(0);
  });
});

And then I use the same commands as above:

» curl http://localhost:9090 &
» kill 23703
Hello World
[1]  + 24730 done       curl http://localhost:9090

You will notice that the program doesn't exit until it has finished processing and serving the last request. More interesting is the fact that after the SIGTERM signal it doesn't accept new connections:

» curl http://localhost:9090 &
[1] 25072

» kill 25070

» curl http://localhost:9090 &
[2] 25097

curl: (7) Failed connect to localhost:9090; Connection refused
[2]  + 25097 exit 7     curl http://localhost:9090

» Hello World
[1]  + 25072 done       curl http://localhost:9090

Some examples in blogs and on Stack Overflow use a timeout on SIGTERM in case server.close takes longer than expected. As mentioned above, this is unnecessary because every process manager will send SIGKILL if the SIGTERM handler takes too long.
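
For reference, the pattern those examples use looks roughly like this (a sketch, not from the original post; the 30-second value is arbitrary):

process.on('SIGTERM', function () {
  server.close(function () {
    process.exit(0);
  });

  // force exit if close() has not finished after an arbitrary 30 seconds
  setTimeout(function () {
    process.exit(1);
  }, 30000);
});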


A frequent case of double callbacks in node.js

In JavaScript jargon, a double callback is a callback that we expect to be called once but that, for some reason, is called two or more times.

Sometimes it is easy to discover, as in this example:

function doSomething(callback) {
  doAnotherThing(function (err) {
    if (err) callback(err);
    callback(null, result);
  });
}

The obvious error here is that when doAnotherThing fails, the callback is called once with the error and then again with the result, because the return before callback(err) is missing.
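
A corrected version (a sketch; result is assumed to be defined elsewhere, as in the original snippet) simply returns after reporting the error:

function doSomething(callback) {
  doAnotherThing(function (err) {
    if (err) return callback(err); // return so the callback cannot fire twice
    callback(null, result);
  });
}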

However, there is one special case that is very hard to reproduce and discover; moreover, it has happened to me several times.

Yesterday, my friend and co-worker Alberto asked me this:

"Why does this test hangs on the assertion line?" expect(foo).to.be.equal('123');

The test looks like this:

it('test something', function (done) {
  function_under_test(function (err, output) {
    expect(foo).to.be.equal('123');
    done();
  });
});

After some debugging, I found out that it didn't hang only on expect: it hung whenever we threw any error inside the callback.

A few calls earlier in the stack, there was a little function with a bug like this:

function (callback) {
  another_function(function (err, some_data) {
    if (err) return callback(err);
    try {
      callback(null, JSON.parse(some_data));
    } catch(err) {
      callback(new Error(some_data + ' is not a valid JSON'));
    }
  });
}

The developer's intention with this try/catch is clear: to catch JSON.parse errors. But the problem is that it also catches errors thrown inside the callback and then calls the callback a second time with the wrong error.

The solution is trivial: keep only the JSON.parse call inside the try block and invoke the callback outside it, as follows:

function (callback) {
  another_function(function (err, some_data) {
    if (err) return callback(err);
    try {
      var parsed = JSON.parse(some_data);
    } catch(err) {
      return callback(new Error(some_data + ' is not a valid JSON'));
    }
    callback(null, parsed);
  });
}

Introducing these errors is very easy (I've done it several times) and troubleshooting them is very hard, so be careful and do not wrap callback calls in try/catch blocks.


The Architecture we use to Deploy to Public and Private Clouds

Originally posted on inside.auth0.com.

How we use Puppet, GitHub, TeamCity, Windows Azure, Amazon EC2 and Route53 to ship Auth0

Introduction

There are two ways companies deploy web applications to the cloud: PaaS and IaaS. With Platform as a Service you deploy your application to the vendor's platform. Infrastructure as a Service is the basic cloud-service model: you get servers, usually VMs.

We started Auth0 using Heroku's Platform as a Service, but soon we decided to provide a self-hosted option of Auth0 for some customers. In addition, we wanted to have the same deployment mechanism for the cloud/public version and the appliance version. So, we decided to move to IaaS.

What I am going to show here is the result of several iterations, described at a very high level. It is not a silver bullet and you probably don't need this (yet?), but it is another option to consider. Even if this is exactly what you are looking for, the best advice I can give you is "don't architect everything from the start": this came from several weeks of work and several iterations, but we never stopped shipping; everything evolved and keeps evolving, and we keep tuning the process as we go.

The big picture

What's Puppet and why is it so important when using IaaS?

Picture yourself deploying your application to a brand new VM today. What is the first thing you will do? Well, if you have a node.js application, installing node is a good start. But you will probably need to install ten other things as well, change the famous ulimit, configure logrotate, ntp and so on. Then you will copy your application somewhere on disk, configure it as a service, and so on. Where do you keep this recipe?

Puppet is a tool for configuration management. Rather than writing an install script, you describe at a high level the state of the resources on the server. When you run puppet it checks everything and then does whatever it takes to put the server in that specific state, from removing a file to installing a package.
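
As a rough illustration (a hypothetical minimal manifest, not Auth0's actual configuration), describing "ntp must be installed and running" in puppet looks like this:

# desired state, not install steps: puppet converges the server to this
package { 'ntp':
  ensure => installed,
}

service { 'ntp':
  ensure  => running,
  enable  => true,
  require => Package['ntp'],
}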

There is another tool similar to puppet called Chef. One of the Chef-related things I would like to test in the future is Amazon OpsWorks.

After you have your configuration in a language like this, deploying to a new server is very easy. Sometimes I modify the configuration via ssh to test and then I update my puppet scripts.

There is another emerging concept called ImmutableServers; it is a very interesting approach and there seem to be some companies using it.

Sources and repositories

Auth0 is very modularized: it is not a single web application but a network of fewer than ten. Every web application is a node application. Our core is a web app without UI which handles the authentication flows and provides a REST interface; the dashboard is another web application which is just an interface to our core where you can configure and test most of the settings; docs is another app full of markdown tutorials, to name a few.

We use GitHub private repositories because we already had a lot of things open-sourced there.

We use branches to develop new features and when a feature is ready we merge it to master. Master is always deployable. We took some of the concepts from a talk we saw: "How GitHub uses GitHub to build GitHub". "When is something ready?" is a tricky question, but we are a very responsible and self-organized team; we open pull requests from a branch to master when we want the approval of our peers. TeamCity automatically runs all the tests and marks the pull request as OK, which is a very useful feature of TC. But the most important thing we do at this stage is code review.

Very often we send a branch to one of our four test environments with our hubot (a bot in the chat room), where:

  • ui is our dashboard application
  • template-scripts was a branch
  • proto is the name of the environment

With that in place we can review a living instance of the branch in an environment similar to production.

Then we iterate until we finally merge.

This is what works for us now: anyone on the team can merge or push directly to master, and we consciously decide when we should do the pull-request ceremony.

Continuous integration

We used Jenkins for six months but it failed a lot, and I had to rebuild a few of the plugins we were using. Then I had a brief fantasy of building our own CI server, but we chose TeamCity since I had used it before, I knew how to set it up, and it is a good product.

Every application is a project in TeamCity. When we push to master, TeamCity does the following:

  1. pull the repository
  2. install the dependencies (in some repos node_modules is versioned)
  3. run all the tests
  4. bundle node dependencies with carlos8f/bundle-deps
  5. bump package version
  6. npm pack

Steps 1, 2 and 3 are very common even in non-node.js projects. In the 4th step what we do is move all "dependencies" to "bundleDependencies" in the package.json; by doing this, the tarball produced by npm pack will contain all the modules already preinstalled. The result of the task is the tgz generated by npm pack.
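
As an illustration, here is a minimal sketch of what that 4th step amounts to (this is not the carlos8f/bundle-deps module itself, just the idea):

// read package.json, mark every dependency as bundled, write it back,
// so that the tarball produced by `npm pack` ships node_modules inside it
var fs = require('fs');

var pkg = JSON.parse(fs.readFileSync('package.json', 'utf8'));
pkg.bundleDependencies = Object.keys(pkg.dependencies || {});
fs.writeFileSync('package.json', JSON.stringify(pkg, null, 2));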

This automatically triggers the next task, called "configuration". This task pulls our configuration repository written in puppet and all the puppet submodules, then takes the latest version of every node project and builds one ".tgz" for every "site" we have in puppet. We have several "site" implementations, for example:

  • myauth0 used to create a quick appliance
  • auth0 the cloud version you see in app.auth0.com. It is very different from the previous one; for instance, it will not install mongodb since we use MongoLab in the cloud deployment.
  • some-customer some customers need specific settings or features, so we have configurations named after those customers.

The artifact of the configuration task is a tgz with all the puppet modules, including auth0 and the site.pp. All the packages are uploaded to Azure Blob storage at this stage.

The next task in the CI pipeline, called "cloud deploy", triggers immediately after the configuration task. It updates the puppetmaster (currently on the same CI server) and then runs the puppet agent on every node of our load-balanced stack via ssh. After deploying to the first node it runs a quick test of that node; if something is wrong it stops and does not deploy to the rest of the nodes. The Azure load balancer then takes the failing node out of rotation until we fix the problem in the next push.

We have a backup environment where we continuously deploy; it is on Amazon and in a different region. It has a clone of our database (at most one hour stale). This node is used in case azure us-east has an outage or something like that; when that happens, Route53 redirects the traffic to the backup environment. We take high availability seriously, read more here. When running in backup mode, all the settings become read-only: this means that you can't change the properties of an Identity Provider, but your users will still be able to log in to your application, which is Auth0's critical mission.

Adding a new server to our infrastructure takes very few steps:

  • create the node
  • run a small script that installs and configures the puppet agent
  • approve the node in the puppetmaster

Assembling an appliance for a customer is very easy as well: we run a script that installs puppetmaster on the vm, downloads the latest config from blob storage and runs it. We use Ubuntu JeOS in this case.

Final thoughts

I had to skip a lot of details to keep this article concise. I hope you find it useful; if there is something you would like to know about this, don't hesitate to leave a question in a comment.

