Thumbor is a great project and is thoroughly battle-tested.
It serves loads of images for many different companies. At the company I work for, it serves hundreds of millions of images a month. In this post I want to show how to scale thumbor, from a VERY small service to a VERY resilient one.
This is going to be a very big post. I'm going through all the steps required to get thumbor working and scaling it.
You don't have to go all the way. Definitely stop at whatever point in the scaling game that makes sense to you.
To follow this post, I expect you to have a grasp of the following:
- How to run python scripts in your operating system;
- How to install python programs in your operating system (pip);
- How to install/configure NGinx (or another reverse proxy of your choice);
- How to install packages or build them in your operating system (mainly applies to OpenCV).
What I'm NOT going to cover
- How to use thumbor's features (we have the wiki for that);
- How to use thumbor's libraries for each programming language (again - libraries wiki page);
- How to install OpenCV in your specific operating system (OpenCV website);
- How to install and operate Supervisord.
If you have pip set-up, installing thumbor is as easy as doing:
Well, the simplest possible thumbor server that we can run is a single instance of thumbor using the python interpreter.
After thumbor is running, just head to your browser and go to picture of cat.
You should see the above very nice and resized kitten. Congratulations! You have a fully functional thumbor instance.
We can always do better, right? And we shall.
Right now you are running thumbor with all the "factory" defaults. This is an almost safe thing to do, with the exception of the
That property specifies the key that you'll need to use when generating new URLs for thumbor, but I'm guessing you know that already. For more information on thumbor itself, you can head to thumbor wiki.
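To make the role of that key a bit more concrete, here is a minimal sketch of how a signed URL could be produced. It assumes the HMAC-SHA1 signing scheme (URL-safe base64 of the HMAC of the path), which may not match every thumbor release, so check your version. The function name, the key and the image address are made up for illustration. In real projects you would normally use one of the client libraries (libthumbor for Python) instead.

```python
import base64
import hashlib
import hmac


def sign_thumbor_path(security_key, path):
    """Return '/<signature>/<path>' for an unsigned thumbor path.

    Assumption: the signature is the URL-safe base64 encoding of
    HMAC-SHA1(security_key, path). Verify this against your thumbor version.
    """
    digest = hmac.new(
        security_key.encode("utf-8"),
        path.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = base64.urlsafe_b64encode(digest).decode("ascii")
    return "/{}/{}".format(signature, path)


# Hypothetical key and image address, for illustration only.
signed = sign_thumbor_path("MY_SECURE_KEY", "300x200/smart/example.com/cat.jpg")
print(signed)
```

Anyone who does not know the key cannot produce a valid signature, which is why leaving the key at its default value is the one unsafe part of the factory settings.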
In order to run thumbor with a specific configuration file, just run the same command as before, but pass the -c flag, like this:
There are a couple things you must change from the defaults:
For more information go to thumbor configuration page.
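As an illustration of the kind of settings worth reviewing, here is a small configuration fragment. Thumbor configuration files use Python syntax. SECURITY_KEY and ALLOW_UNSAFE_URL are thumbor settings; the values shown are placeholders of mine, not recommendations from the thumbor docs.

```python
# thumbor.conf (illustrative fragment, placeholder values)

# Key used to sign URLs. Replace the placeholder with a long random secret.
SECURITY_KEY = "REPLACE_WITH_A_LONG_RANDOM_SECRET"

# Refuse /unsafe/ URLs so that only signed URLs are served.
# For local experiments you may prefer to leave this enabled.
ALLOW_UNSAFE_URL = False
```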
Now let's get thumbor to detect some faces!
From here on, we assume you have OpenCV configured and available to the same python interpreter that's running Thumbor. As I mentioned before, installing it is beyond the scope of this post.
Just to make sure you have OpenCV available to the same python interpreter that's running thumbor, try the following in your terminal:
If that error happens, then you probably have not compiled OpenCV with python support or your current virtualenv does not have the cv binaries linked. This is what you should get when importing OpenCV:
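If you prefer a check that does not crash on a missing module, something like the sketch below could be used. It assumes the bindings are exposed as either cv2 (current OpenCV releases) or cv (the legacy name); which one applies depends on your OpenCV version. The function name is mine.

```python
import importlib.util
import sys


def opencv_available(module_names=("cv2", "cv")):
    """Return the name of the first importable OpenCV binding, or None."""
    for name in module_names:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


found = opencv_available()
if found is None:
    print("OpenCV bindings are not importable from", sys.executable)
else:
    print("Found OpenCV bindings as module:", found)
```

Run it with the same interpreter (or virtualenv) that runs thumbor, otherwise the result tells you nothing about thumbor's environment.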
Thumbor comes built-in with the ability to detect faces in images and use that information to get smart crops.
In order to use the smart cropping feature, you must enable the detector in your
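A minimal sketch of what enabling the face detector might look like in the configuration file (which uses Python syntax). The module path is the face detector that ships with thumbor; double check it against your installed version.

```python
# thumbor.conf (illustrative fragment)

# Detectors run in the order listed. The face detector ships with thumbor
# and relies on OpenCV being importable by the same interpreter.
DETECTORS = [
    "thumbor.detectors.face_detector",
]
```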
So say we want to crop Christina Aguilera's picture to a vertical size (200x500) from this picture:
This is what thumbor would give us as default cropping:
Now if we ask it to consider Mrs. Aguilera's face, we would get a much better picture:
If you need to see what thumbor is seeing in your image, just use /debug before all the parameters, like this:
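To show how these URLs fit together, here is a small sketch that assembles the regular, smart and debug variants of the same request. It builds unsigned (/unsafe/) paths and places debug right after the unsafe segment, which is how I understand thumbor's URL format. The helper name and the image address are hypothetical.

```python
def thumbor_path(image_url, width, height, smart=False, debug=False):
    """Build an unsigned ('/unsafe/...') thumbor path for a resize request."""
    parts = ["unsafe"]
    if debug:
        # Debug goes before all the other parameters.
        parts.append("debug")
    parts.append("{}x{}".format(width, height))
    if smart:
        # Ask thumbor to use detected focal points when cropping.
        parts.append("smart")
    parts.append(image_url)
    return "/" + "/".join(parts)


image = "example.com/singer.jpg"  # hypothetical image address
print(thumbor_path(image, 200, 500))
print(thumbor_path(image, 200, 500, smart=True))
print(thumbor_path(image, 200, 500, smart=True, debug=True))
```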
Thumbor runs on Tornado. Tornado is an evented web framework that allows for asynchronous I/O.
Since thumbor does a lot of I/O operations (open original image, store original image, store detection data, store and load resulting crop, etc), it is a perfect match.
That also means that each Thumbor process runs a single thread and can only process one request at a time.
In order to scale Thumbor we'll definitely want to serve more than one request at a time. Fortunately it is incredibly simple to scale Thumbor that way.
You can use any of the available solutions for running several instances of Thumbor, but I'll show what a supervisor script looks like:
Please be advised that this is just an EXAMPLE configuration with many things missing. The important bits here are the command and numprocs configurations.
By leveraging supervisord's ability to spawn many instances of the same program we can support many requests at the same time.
The thing is that now thumbor will be responding on many different ports. This can be easily solved by using a reverse proxy (or load balancer) to distribute all incoming connections among the thumbor processes.
Configuring a load balancing solution (like NGinx or Apache) is beyond the scope of this post, but instructions can be easily found online.
More information on this process can be found at Thumbor's Hosting wiki page.
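To picture what the reverse proxy does with those ports, here is a toy sketch. It assumes each spawned process listens on a base port plus its process number (base port 8000 is hypothetical) and that connections are handed out round-robin. It only illustrates the idea and is not a replacement for NGinx or any real load balancer.

```python
from itertools import cycle


def backend_addresses(base_port, numprocs, host="127.0.0.1"):
    """List the address of every thumbor process, one port per process."""
    return ["{}:{}".format(host, base_port + n) for n in range(numprocs)]


def round_robin(addresses):
    """Return an endless iterator that visits each backend in turn."""
    return cycle(addresses)


backends = backend_addresses(base_port=8000, numprocs=4)
picker = round_robin(backends)
for _ in range(6):
    # Each incoming connection goes to the next process in the rotation.
    print(next(picker))
```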
Now that our thumbor install can answer many requests, let's verify how long each of our requests is taking:
As demonstrated above, the smart image operation is about 10 times slower than the orientation-based cropping. In order to have a thumbor installation that actually scales, we'll need to take care of that.
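If you want to take similar measurements on your own installation, a sketch like the following could do. It records every timing instead of an average, because repeated requests may be answered from a cache and would hide the cost of the first one. The local address in the usage comment assumes thumbor's default port (8888), and the image is hypothetical.

```python
import time
import urllib.request


def fetch(url):
    """Download url and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def time_request(url, fetcher=fetch, repeats=3):
    """Return the wall-clock time in seconds of each of `repeats` requests."""
    timings = []
    for _ in range(repeats):
        started = time.perf_counter()
        fetcher(url)
        timings.append(time.perf_counter() - started)
    return timings


# Example usage against a local instance (uncomment to run):
# base = "http://localhost:8888/unsafe/200x500"
# print("regular:", time_request(base + "/example.com/image.jpg"))
# print("smart:  ", time_request(base + "/smart/example.com/image.jpg"))
```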
Lazy Face Detection
Thumbor uses an eventual consistency approach in order to speed up the delivery of images and to avoid flooding your Thumbor servers with detection work.
This is what happens when a smart request comes in and we are using queued detection (more on that below):
Thumbor returns a cropped image as fast as possible, even if that's not THE best crop. The trick here is that it returns a very small TTL, meaning that the browser will return for this image later and will get the correct one eventually.
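The idea can be sketched in a few lines. This is a simplified model of the behaviour described above, not thumbor's actual code: the store is a plain dict, the queue is a plain list, and the TTL values and function names are hypothetical.

```python
SHORT_TTL = 15          # seconds; hypothetical TTL for temporary crops
LONG_TTL = 24 * 3600    # seconds; hypothetical TTL for final crops


def handle_smart_request(image_url, detection_store, detection_queue):
    """Answer a smart-crop request without waiting for detection.

    detection_store: dict mapping image URL -> list of focal points
    detection_queue: list standing in for the real job queue
    Returns (crop_strategy, focal_points, cache_ttl_seconds).
    """
    focal_points = detection_store.get(image_url)
    if focal_points is not None:
        # Detection already ran: serve the final crop and let clients cache it.
        return "focal-points", focal_points, LONG_TTL

    # No data yet: queue the detection once and answer right away with a
    # temporary crop that clients will request again soon.
    if image_url not in detection_queue:
        detection_queue.append(image_url)
    return "center", [], SHORT_TTL


def worker_step(detection_queue, detection_store, detect):
    """Process one queued job, the way a detection worker would."""
    if not detection_queue:
        return False
    image_url = detection_queue.pop(0)
    detection_store[image_url] = detect(image_url)
    return True
```

The first request gets the quick crop with a short TTL; once a worker has stored the focal points, the next request gets the smart crop with a long TTL.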
Now let's see how we can configure and run RemoteCV.
Because face detection (and object detection in general) is CPU-intensive, here at Globo.com we created a way of queueing the actual retrieval of CV information.
After you have those properly installed, you can install RemoteCV with:
At the time of writing this post the most current release of RemoteCV is 0.7.2.
After you have selected what backend you'll use, just run:
You'll be greeted with all the options available to RemoteCV. If you have chosen to use PyRes, we can start RemoteCV with:
That command tells RemoteCV that it should connect to a Redis instance in the local system, bound to port 7777, using password foo to authenticate.
RemoteCV uses a loader and a storage in order to load image bytes and to store them for later usage.
The only built-in loader is the HTTP loader, but you can easily create your own and pass it in using the --loader option to RemoteCV. If you don't specify a loader, the HTTP loader is used.
RemoteCV comes with two built-in storages: redis (remotecv.result_store.redis_store) and memcache (remotecv.result_store.memcache_store). RemoteCV defaults to the redis result store.
The redis host requires no additional configuration, whereas the memcache one requires that you pass the memcache servers locations using
It's worth noting that you can have as many workers as your servers can handle. They scale very well horizontally.
Now that we have RemoteCV running, we can start sending requests to its queues.
There are some built-in queued detectors for Complete Detection, Face Detection or Feature Detection. It's entirely up to you to use only one of those. We strongly advise the usage of the Complete Detection. It will detect faces, profile faces and glasses as focal points.
In order to use the queued detector in thumbor, we need to change our configuration file a little bit:
This configuration will make sure that we serve temporary images as fast as possible and when the detection information is available we start serving the correctly cropped image.
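For reference, a configuration fragment along these lines might look like the sketch below. The detector path and the REDIS_QUEUE_SERVER_* setting names are written from memory of thumbor's configuration and should be treated as assumptions to verify against your installed version. The connection values simply mirror the RemoteCV example above (local Redis on port 7777, password foo).

```python
# thumbor.conf (illustrative fragment; verify names against your version)

# Swap the in-process detector for its queued counterpart.
DETECTORS = [
    "thumbor.detectors.queued_detector.queued_complete_detector",
]

# Where thumbor enqueues detection jobs for the RemoteCV workers.
REDIS_QUEUE_SERVER_HOST = "localhost"
REDIS_QUEUE_SERVER_PORT = 7777
REDIS_QUEUE_SERVER_PASSWORD = "foo"
```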
Ok, now you have your thumbor farm up and running. As you scale it up, you'll probably want to implement your own storage, result storage and loader mechanisms.
What this means is that you get to decouple original images storage and loading, as well as the resulting crops from the applications using it.
For more information on the topics:
I hope we have demonstrated that scaling thumbor is a very easy and simple task. If there's anything you'd like to add here, just let me know in the comments and I'll update the post.