over 3 years ago

Thumbor is a great project and very battle tested.

It serves loads of images for many different companies. In the company I work for, it serves hundreds of millions of images a month. In this post I want to show how to scale thumbor, from a VERY small service to a VERY resilient one.

DISCLAIMER

This is going to be a very big post. I'm going through all the steps required to get thumbor working and scaling it.

You don't have to go all the way. Definitely stop at whatever point in the scaling game that makes sense to you.

Requirements

In order to fully comprehend this post, I expect you to have a grasp on the following:

  • How to run python scripts in your operating system;
  • How to install python programs in your operating system (pip);
  • How to install/configure NGinx (or another reverse proxy of your choice);
  • How to install packages or build them in your operating system (mainly applies to OpenCV).

What I'm NOT going to cover

  • How to use thumbor's features (we have the wiki for that);
  • How to use thumbor's libraries for each programming language (again - libraries wiki page);
  • How to install OpenCV in your specific operating system (OpenCV website);
  • How to install and operate Supervisord.

Installing thumbor

If you have pip set-up, installing thumbor is as easy as doing:

$ pip install thumbor

Getting started

Well, the simplest possible thumbor server that we can run is a single instance of thumbor using the python interpreter.

$ thumbor

After thumbor is running, just head to your browser and go to picture of cat.

You should see the above very nice and resized kitten. Congratulations! You have a fully functional thumbor instance.

We can always do better, right? And we shall.

Configuring thumbor

Right now you are running thumbor with all the "factory" defaults. This is an almost safe thing to do, with the exception of the SECURITY_KEY property.

That property specifies the key that you'll need to use when generating new URLs for thumbor, but I'm guessing you know that already. For more information on thumbor itself, you can head to thumbor wiki.

In order to run thumbor with a specific configuration file, just run the same as before but use a -c flag, like this:

$ thumbor -c thumbor.conf

There are a couple things you must change from the defaults:

thumbor.conf
SECURITY_KEY = "MY-UNIQUE-SECURITY-KEY"

ALLOW_UNSAFE_URL = False  # this is very important so attackers can't generate new images

For more information go to thumbor configuration page.
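
To make the role of SECURITY_KEY concrete, here is a minimal sketch of how a signed URL path can be produced. It assumes the HMAC-SHA1 scheme (URL-safe base64 of the HMAC of the path); older thumbor releases may sign URLs differently, so treat this as an illustration and prefer one of thumbor's client libraries in real code. The function name and the image location are made up for the example.

```python
import base64
import hashlib
import hmac


def sign_thumbor_path(security_key, image_path):
    """Return '/<signature>/<image_path>' for a thumbor request path.

    Assumption: the signature is the URL-safe base64 encoding of
    HMAC-SHA1(security_key, image_path).
    """
    digest = hmac.new(
        security_key.encode("utf-8"),
        image_path.encode("utf-8"),
        hashlib.sha1,
    ).digest()
    signature = base64.urlsafe_b64encode(digest).decode("ascii")
    return "/{}/{}".format(signature, image_path)


# Hypothetical image location, for illustration only.
signed = sign_thumbor_path(
    "MY-UNIQUE-SECURITY-KEY", "200x500/smart/example.com/cat.jpg"
)
print(signed)
```

Anyone who does not know the key cannot compute a valid signature, which is why the key has to be changed from the default and kept private.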

Now let's get thumbor to detect some faces!

Face Detectors

From here on we assume you have OpenCV configured and available for the same python interpreter that's running Thumbor. As I mentioned before, this is beyond the scope of this post.

Just to make sure you have OpenCV available to the same python interpreter that's running thumbor, try the following in your terminal:

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named cv

If that error happens, then you probably have not compiled OpenCV with python support or your current virtualenv does not have the cv binaries linked. This is what you should get when importing OpenCV:

$ python
Python 2.7.5 (default, Aug 25 2013, 00:04:04)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import cv
>>> 

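The same check can be scripted, which is handy on servers. This is only a sketch: the function name is made up, and trying `cv2` as a second candidate is an assumption for newer OpenCV builds that no longer ship the `cv` module. Whether your thumbor version works with `cv2` is a separate question.

```python
import importlib


def find_opencv(candidates=("cv", "cv2")):
    """Return the name of the first importable OpenCV binding, or None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            continue
        return name
    return None


binding = find_opencv()
if binding is None:
    print("OpenCV is not importable from this interpreter")
else:
    print("OpenCV binding available: " + binding)
```

Run it with the exact interpreter (or virtualenv) that runs thumbor, otherwise the result tells you nothing.
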
Thumbor comes built-in with the ability to detect faces in images and use that information to get smart crops.

In order to use the smart cropping feature, you must enable the detector in your thumbor.conf file:

thumbor.conf
# other detectors commented on purpose. enable all that make sense to you.

DETECTORS = [
    'thumbor.detectors.face_detector',
    #'thumbor.detectors.profile_detector',
    #'thumbor.detectors.glasses_detector',
    #'thumbor.detectors.feature_detector',
]

So say we want to crop Christina Aguilera's picture to a vertical size (200x500) from this picture:

This is what thumbor would give us as default cropping:

aguilera_no_smart.jpg

Now if we ask it to consider Mrs. Aguilera's face, we would get a much better picture:

aguilera_smart.jpg

If you need to see what thumbor is seeing in your image, just use /debug before all the parameters, like this:

aguilera_debug.jpg
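
The three variants (default crop, smart crop, debug view) differ only in a couple of path segments. A small sketch that builds them, assuming unsafe URLs are still allowed on your test instance; the helper name and the image location are hypothetical.

```python
def unsafe_path(image_url, width, height, smart=False, debug=False):
    """Build an unsigned thumbor path such as /unsafe/200x500/smart/<image_url>."""
    parts = ["unsafe"]
    if debug:
        # Per the text above: /debug goes before all the other parameters.
        parts.append("debug")
    parts.append("{}x{}".format(width, height))
    if smart:
        parts.append("smart")
    parts.append(image_url)
    return "/" + "/".join(parts)


image = "example.com/photo.jpg"  # hypothetical image location
print(unsafe_path(image, 200, 500))
print(unsafe_path(image, 200, 500, smart=True))
print(unsafe_path(image, 200, 500, smart=True, debug=True))
```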

Reverse proxying

Thumbor runs on Tornado. Tornado is an evented web framework that allows for asynchronous I/O.

Since thumbor does a lot of I/O operations (open original image, store original image, store detection data, store and load resulting crop, etc), it is a perfect match.

That also means that each Thumbor process runs a single thread, so CPU-bound work such as resizing and detection is done for only one request at a time.

In order to scale Thumbor we'll definitely want to serve more than one request at a time. Fortunately it is incredibly simple to scale Thumbor that way.

You can use any of the available solutions for running several instances of Thumbor, but I'll show what a supervisor script looks like:

supervisord.conf
[program:thumbor]
command=thumbor --port=80%(process_num)02d --conf=/etc/thumbor80%(process_num)02d.conf
process_name=thumbor80%(process_num)02d
numprocs=4
# rest of the file omitted for clarity

Please be advised that this is just an EXAMPLE configuration with many things missing. The important bits here are the command and numprocs configurations.

By leveraging supervisord's ability to spawn many instances of the same program we can support many requests at the same time.
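
If the `%(process_num)02d` part looks cryptic: supervisord expands it with Python's string formatting, so you can preview the result yourself. The sketch below copies the two templates from the example and assumes process numbering starts at 0, which is supervisord's default.

```python
# The two templates are copied from the supervisord.conf example.
PORT_TEMPLATE = "80%(process_num)02d"
CONF_TEMPLATE = "/etc/thumbor80%(process_num)02d.conf"
NUMPROCS = 4

# Assumption: process_num starts at 0 (supervisord's default numprocs_start).
instances = [
    (PORT_TEMPLATE % {"process_num": n}, CONF_TEMPLATE % {"process_num": n})
    for n in range(NUMPROCS)
]

for port, conf in instances:
    print(port, conf)
```

With numprocs=4 this gives ports 8000 through 8003, each with its own configuration file.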

The thing is that now thumbor will be responding on many different ports. This can be easily solved by using a reverse proxy (or load balancer) to distribute all incoming connections across the thumbor processes.

Configuring a load balancing solution (like NGinx or Apache) is beyond the scope of this post, but can be easily found online.

More information on this process can be found at Thumbor's Hosting wiki page.

Benchmarking

Now that our thumbor install can answer many requests, let's verify how long each of our requests is taking:

$ time curl http://localhost:8888/unsafe/200x500/http://www.christinaaguilera.com/sites/caguilera/files/imagecache/1400x859/4_1.jpg

real    0m0.031s
user    0m0.006s
sys     0m0.004s

$ time curl http://localhost:8888/unsafe/200x500/smart/http://www.christinaaguilera.com/sites/caguilera/files/imagecache/1400x859/4_1.jpg

real    0m0.314s
user    0m0.006s
sys     0m0.004s

As demonstrated above, the smart image operation is about 10 times slower than the orientation-based cropping. In order to have a thumbor installation that actually scales, we'll need to take care of that.
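
A single `time curl` run is noisy, so it can help to repeat the request a few times and look at the median. Below is a rough sketch using only the standard library; the helper names are made up. Keep in mind that thumbor stores the original image and detection data, so repeated requests for the same image are likely to be faster than the first one.

```python
import statistics
import time
import urllib.request


def median_seconds(action, repeats=5):
    """Run action() `repeats` times and return the median wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def fetch(url):
    """Download the full response body so the timing includes the transfer."""
    with urllib.request.urlopen(url) as response:
        response.read()


# Ratio of the two `real` timings reported above.
slowdown = 0.314 / 0.031
print("smart/plain slowdown: {:.1f}x".format(slowdown))
```

To time a real request, call something like median_seconds(lambda: fetch(url)) with one of the two URLs from the curl commands above.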

Lazy Face Detection

Thumbor uses an eventual consistency approach in order to speed up the delivery of images and avoid flooding your Thumbor servers.

This is what happens when a smart request comes in and we are using queued detection (more on that below):

thumbor_smart_diagram.png

Thumbor returns a cropped image as fast as possible, even if that's not THE best crop. The trick here is that it returns a very small TTL, meaning that the browser will return for this image later and will get the correct one eventually.
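
The flow can be sketched as a toy simulation. None of this is thumbor code: the names, the TTL values and the focal point are invented purely to illustrate the "answer now, improve later" idea.

```python
SHORT_TTL = 15      # seconds; hypothetical value for "detection still pending"
LONG_TTL = 86400    # seconds; hypothetical value for a final crop

detection_store = {}   # image URL -> focal points, filled in by a worker
detection_queue = []   # image URLs waiting for detection


def handle_smart_request(image_url):
    """Return (crop_kind, ttl) without ever blocking on detection."""
    focal_points = detection_store.get(image_url)
    if focal_points is not None:
        return "smart crop", LONG_TTL
    if image_url not in detection_queue:
        detection_queue.append(image_url)
    return "default crop", SHORT_TTL


def run_worker():
    """Stand-in for a detection worker draining the queue."""
    while detection_queue:
        image_url = detection_queue.pop(0)
        detection_store[image_url] = [(120, 80)]  # made-up focal point


first = handle_smart_request("example.com/photo.jpg")
run_worker()
second = handle_smart_request("example.com/photo.jpg")
print(first, second)
```

The first request gets a quick default crop with a short TTL; once the worker has stored the detection data, the next request gets the smart crop.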

Now let's see how we can configure and run RemoteCV.

RemoteCV

Because face detection (and object detection in general) is CPU-intensive, here at Globo.com we created a way of queueing the actual retrieval of CV information.

Enter RemoteCV. In order to install RemoteCV, you should first satisfy some C/C++ requirements.

After you have those properly installed, you can install RemoteCV with:

$ pip install remotecv

At the time of writing this post the most current release of RemoteCV is 0.7.2.

With RemoteCV you can choose if you want to use PyRes or Celery as your queueing backend. We use PyRes, but both should work equally well.

After you have selected what backend you'll use, just run:

$ remotecv -h

You'll be greeted with all the options available to RemoteCV. If you have chosen to use PyRes, we can start RemoteCV with:

$ remotecv --host=localhost --port=7777 --password=foo

That command tells RemoteCV that it should connect to a Redis instance in the local system, bound to port 7777 using password foo to authenticate.

RemoteCV uses a loader and a storage in order to load image bytes and to store them for later usage.

The only built-in loader is the HTTP loader, but you can easily create your own and pass it in using the --loader option to RemoteCV. If you don't specify a loader, the HTTP loader is used.

RemoteCV comes with two built-in storages: redis (remotecv.result_store.redis_store) and memcache (remotecv.result_store.memcache_store). RemoteCV defaults to the redis result store.

The redis store requires no additional configuration, whereas the memcache one requires that you pass the memcache server locations using --memcache_hosts="memcache1.server.com,memcache2.server.com".

It's worth noting that you can have as many workers as your servers can handle. They scale very well horizontally.
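
If you just want to try several workers on one machine, a small launcher script is enough. This is a sketch with made-up helper names; it only uses the --host, --port and --password flags shown above. For anything permanent, a process manager such as supervisor is the better choice.

```python
import subprocess


def remotecv_command(host, port, password):
    """Build the argument list for one RemoteCV worker."""
    return [
        "remotecv",
        "--host={}".format(host),
        "--port={}".format(port),
        "--password={}".format(password),
    ]


def start_workers(count, **options):
    """Start `count` identical workers and return their Popen handles."""
    return [subprocess.Popen(remotecv_command(**options)) for _ in range(count)]


print(" ".join(remotecv_command("localhost", 7777, "foo")))
```

Calling start_workers(4, host="localhost", port=7777, password="foo") would start four workers reading from the same Redis queue.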

Queued Detection

Now that we have RemoteCV running, we can start sending requests to its queues.

There are some built-in queued detectors for Complete Detection, Face Detection or Feature Detection. It's entirely up to you to use only one of those. We strongly advise the usage of the Complete Detection. It will detect faces, profile faces and glasses as focal points.

In order to use the queued detector in thumbor, we need to change our configuration file a little bit:

thumbor.conf
# in this sample we assume the redis backend is used in RemoteCV

DETECTORS = [
    'thumbor.detectors.queued_detector.queued_complete_detector',
]

# information of the redis server where we will dump messages for remotecv
REDIS_QUEUE_SERVER_HOST = 'redis host'
REDIS_QUEUE_SERVER_PORT = 'redis port'
REDIS_QUEUE_SERVER_DB = 'redis database'
REDIS_QUEUE_SERVER_PASSWORD = 'redis password'

This configuration will make sure that we serve temporary images as fast as possible and when the detection information is available we start serving the correctly cropped image.
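
The quoted values in the sample are placeholders. One way to fill them in without hard-coding credentials is to read them from the environment. This sketch assumes thumbor.conf is evaluated as ordinary Python (so imports work) and that the port and database are expected as numbers; the environment variable names are hypothetical, and the defaults mirror the RemoteCV command shown earlier.

```python
import os

# Hypothetical environment variable names; pick whatever fits your deployment.
REDIS_QUEUE_SERVER_HOST = os.environ.get("THUMBOR_REDIS_HOST", "localhost")
REDIS_QUEUE_SERVER_PORT = int(os.environ.get("THUMBOR_REDIS_PORT", "7777"))
REDIS_QUEUE_SERVER_DB = int(os.environ.get("THUMBOR_REDIS_DB", "0"))
# An unset or empty variable means "no password".
REDIS_QUEUE_SERVER_PASSWORD = os.environ.get("THUMBOR_REDIS_PASSWORD") or None
```

Whatever values you use, they must point at the same Redis instance your RemoteCV workers connect to.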

Advanced Stuff

OK, now you've got your thumbor farm up and running. As you scale it up, you'll probably want to implement your own storage, result storage and loader mechanisms.

What this means is that you get to decouple original images storage and loading, as well as the resulting crops from the applications using it.

For more information on the topics:

Further Reading

Yipit has detailed how they scale thumbor at their engineering blog.

Square also posted at their engineering blog about how they generate dynamic images with thumbor.

99 Designs also has some info on their architecture using thumbor and amazon at their engineering blog.

Conclusion

I hope we have demonstrated that scaling thumbor is a very easy and simple task. If there's anything you'd like to add here, just let me know in the comments and I'll update the post.
