Welcome to Wei Dong's Blog!

Algorithms Systems Big Data Deep Learning Computer Vision Audio Processing

The receptive field of a cell in a convolutional neural network (CNN) is the underlying image region that affects the computation of this cell. For a CNN to effectively recognize an object, the receptive field size of the narrowest layer of the network must be large enough to capture the characterisic structures of that object. Receptive field being too big can be problematic, too. In addition to costing excessive computation, the neural network might be forced to learn structures that surround, but are not intrinsically attached to or associate with the object.

This program automatically computes the receptive field size of the narrowest layer of a TensorFlow model. This can be used as a guide to modifying a standard network architecture to work with a simple dataset.

Owl, PicPac and XNN are three tools I wrote to make image-related model training easy.

  • Owl: a web UI for efficient image annotation.
  • PicPac: PicPac is an image database and streaming library that preprocess the images and feed them into a deep learning framework. PicPac supports Caffe (fork), MxNet, Nervana, Theano and Tensorflow.
  • XNN: a C++ wrapper that provides a unified prediction interface to all common deep learning frameworks, including Caffe, MxNet, Tensorflow, Theano and other Python-based frameworks.
  • (Caffe fork with PicPac support)

The goal is to create a model that will detect and localize a target object category within images. We will use a toy dataset for car plate recognition for illustration.


$ git clone https://github.com/aaalgo/owl
$ cd owl
$ # Download the dataset
$ wget http://www.robots.ox.ac.uk/~vgg/data/cars_markus/cars_markus.tar
$ mkdir images
$ cd images
$ tar xf ../cars_markus.tar
$ cd ..
$ # create database
$ ./manage.py migrate
$ # import images into the database
$ find images/ -name '*.jpg' | ./manage.py import --run
$ # start the annotation server

Before starting the annotation server, we need to adjust a couple of parameters in the file owl/annotate/params.py

ROWS = 2        # <-- images rows / page
COLS = 3        # <-- images / row
POLYGON = False     # set to True for polygons
VIEWED_AS_DONE = False  # see below
$ ./run.sh

The URL of the annotation UI is http://HOSTNAME:18000/annotate/.


The UI is designed to minimize hand movements and therefore maximize efficiency. The following design decisions were made:

  • A bounding box is automatically saved by AJAX when created.
  • Refreshing page loads the next batch of examples.

The annotation process finishes when all images are annotated/viewed. The VIEWED_AS_DONE parameter controls the behavior whether an image viewed should be considered annotated even when no annotation is added. Set the value to True if it is know that images without positive regions exist. If the value is set to False and no annotation is made to an image, it will be shown again when all other images are done.

After annotation is done, or sufficient number of annotations are collected, the images and annotations can be exported to a PicPac database by

$./manage.py export db

The file db then contains all the information needed for training.

PicPac Database

A PicPac database contains images and labels/annotations. The annotation produced by Owl is the same format used by Annotorious. Actually Owl uses an extended version of Annotorious. Below is a sample annotation:

{'shapes': [{u'geometry': {u'y': 0.5912162162162162, u'x': 0.6049107142857143, u'width': 0.10491071428571429, u'height': 0.08277027027027027}, u'style': {}, u'type': u'rect'}]}

PicPac provides a web server for viewing the content of a database.

$ picpac-server db
$ picpac-server db
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0901 22:52:20.280788 29210 picpac-server.cpp:146] listening at
I0901 22:52:20.281389 29210 picpac-server.cpp:148] running server with 1 threads.

And samples with annotations can be viewed with http://HOSTNAME:18888/l?annotate=json. The red bounding box is rendered on-the-fly by the server; images and annotations are stored separately in the database.


The server accepts almost all of the perturbation/augmentation parameters, so the effects on the training set can be visualized. For examples, the following can be appended to the URL &perturb=1&pert_angle=20.

Sometimes when the positive regions are too small compared to the background, it is desirable to use only local areas surrounding the postive regions as training example, so that positive pixels and negative pixels are roughly balanced. The command below can be used to do the cropping.

$ picpac-split-region --width 100 --height 50 --bg 200 --no-scale 1 db db.crop
min: 0.668153
mean: 0.743567
max: 0.819342

Using picpac-server to serve db.crop shows this.


The program picpac-split-region accepts the following parameters:

  • (--size, always 50) Scale, or sqrt(width*height), of positive region.
  • --width output image wdith.
  • --height output image height.
  • --no-scale 1. If not set, the cropped region is scaled so

positive region and negative region are of the specified size. If set, the cropped region is not scaled. Rather the size parameters are used to determine the ratio between positive and negative regions, and the output image size is determined accordingly.


XNN provides a couple of templates based on public models. For example, we can train with the above database using the following command.

xnn/train-caffe-fcn.py fcn db ws


  • fcn is the template name.
  • db is the input database.
  • ws is the working directory.

Training will start automatically after the command, and can be canceled with CTRL+C. The ws directory will contain the following:

$ ls wc
log    params.pickle  solver.prototxt       train.log       train.prototxt.tmpl
model  snapshots      solver.prototxt.tmpl  train.prototxt  train.sh

Training can be restarted with train.sh, or continued at a snapshot by supplying a snapshot name under the snapshots directory as the argument of train.sh.

While some parameter can be adjusted via arguments to train-caffe-fcn.py, it is easier to cancel the training process, edit the file train.prototxt and then restarted. The most import parameters of train.prototxt are annotated below.

layer {
  name: "data1"
  type: "PicPac"
  top: "data"
  top: "label"
  picpac_param {
    path: "path/to/db" 
    batch: 1        # batch size, has to be 1 if image sizes are different
    channels: 3     # color channels, use 1 for grayscale images
    split: 5        # randomly split db into 5 parts
    split_fold: 0   # use part 0 for validation and the rest for training

    annotate: "json"
    anno_color1: 1

    threads: 4      
    perturb: true   # enable image augmentation
    pert_color1: 10 # random perturbation range of
    pert_color2: 10 # the three color channels
    pert_color3: 10
    pert_angle: 20  # maximal angle of random rotation, in degrees
    pert_min_scale: 0.8 # min &
    pert_max_scale: 1.2 #       max ramdom scaling factor

PicPac supports a full range of flexible configurations.  See
(documentation)[http://picpac.readthedocs.io/en/latest/] for details.

PicPac with TensorFlow

PicPac has a simple python interface with the same parameters.

    config = dict(loop=True,
    stream = picpac.ImageStream('db', negate=False, perturb=True, **config)

        with tf.Session() as sess:
            for step in xrange(FLAGS.max_steps):
                images, labels, pad = stream.next()
                feed_dict = {X: images,
                             Y_: labels}
                _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)

Caffe requires the user to preload training images into a database, and the images are stored as raw pixels. The following calculation shows that this is not a very good idea.

Assume that images are pre-scaled to 256×256, so each raw image costs 256x256x3 = 192KB storage. On the other hand, 255x255x3 JPEG images compressed with default parameters cost about 48KB, about 1/4 the storage of raw pixels. Benchmark shows that a 4-core 2600K can decode jpeg images of this size at a rate of 6500/s using all cores, or 1350/s with one core. If we assume only one core can be allocated for image decoding, the processing power translates to 63MB/s input throughput of JPEG data, and 253MB/s output throughput of raw pixels. The sequential read throughput of a traditional HDD is about 100MB, which is above the input throughput of one core. So an economical design would be to store the images compressed with jpeg on a tradional HDD, and decode the image with a dedicated CPU core. The throughput of Caffe, according to the website, is about 4ms/image for learning and 1ms/image for predicting on a K40 GPU. So the throughput of the above configuration can well saturate the GPU power even for predicting. The whole system is nice and balanced, and the main stream HDD provides about 3TB of storage. This also leaves some room for future growth of GPU power and training image size (HDD/SSD grows in capacity rather than throughput).

Of course, this all relies on being able to achieve 63MB/s throughput from the disk, and achieving this on a HDD requires sequential I/O. With images stored in a database, it requires a very fast SSD to achieve such throughput. That’s why I developed the PicPoc image storage for deep learning.(Benchmarking show that sequential read with LMDB DOES achieve raw hardware throughput, whether HDD or SSD. The storage overhead of LMDB is also reasonably low, around 3% as I measured with the ILSVRC 2012 dataset.)

Here are some performance numbers I’ve been achieving with preliminary experiments.

Importing fall 2011 version of ImageNet (14million images stored on 21935 tar files, totalling about 1.2TB) into PicPoc took about 10 hours. The output is 400GB. The input on one HDD and output on another. CPU usage is 213.6%. Considering reading 1.2TB from HDD takes about 3.5 hours and CPU usage is about 50%, there’s a possibility to double the loading throughput. But that’s a one shot business so I’ll say it’s good enough for now. The ILSVRC 2012 training data, when imported, costs 28GB storage, as apposed to 173GB imported to LMDB as raw pixels as described in Caffe’s documentation. (One doesn’t have to use raw pixels with LMDB. The Caffe Datum can be used to store encoded image, and OpenCV pretty much support all popular image codecs).

On reading with decoding, the system is able to sustain 120MB/s throughput on a traditional 1TB HDD. I’ve also created a Caffe fork with PicPoc backend.

pthread是包在OS线程外的。light-weighted thread需要用到用户态线程。 微软的fiber是用户态的,但是因为是windows世界的,不怎么招人待见。 在多核上做用户态线程是一件非常非常恶心的事情,GO最主要的贡献其实 就是把这件事做成了。主要的恶心之处就是怎么处理job stealing:一个 操统线程上面的任务都跑完了,就需要去别的操统线程那儿把活弄过来。 这就涉及到各种同步,各种locking。Locking多了,性能就下来了。 这种事情以前应该不少公司内部都有人做过,能做这个的人一般也都 屌得不得了。其实稍微对比一下就能知道multi-core有多难做: node.js不支持multi-core,python折腾这么多年也还是个残疾。 如果你想更多地了解一下,可以从man makecontext 看起。每个用户态线程其实是一个context。然后底下每个操统线程负责 管一堆context。context切换主要靠cooperative scheduling,而不是操统 用的preemptive scheduling。也就是说一个context运行到某一步自己主动 把执行权让出来。Unix世界的一般没见过cooperative scheduling. Windows 3.x是cooperative scheduling,所以线程跑一会就得调用yield 让出执行权。因为不可能要求程序员写几行程序就插入一个yield,所以 其实Windows很多UI和I/O的API都内嵌了yield。那些差的程序员不知道 这回事,有时候进入一个纯计算的循环没有在中间插入yield,就会导致 系统挂起。Unix世界从一开始就是pre-emptive的,操统API没有内嵌yield 这回事。手写程序隔几行插入yield也不可行。这就是unix世界的C/C++做 用户态线程几乎不肯能的原因。这也是为啥rob pike非要搞一个新的语言的 原因:在操统API外包一层,并且嵌入yield(还有就是GC)。GO在语言层面 上其实没有任何创新,甚至比好多现有的语言都要低级。如果从出发点来 看,GO的目的其实已经达到了。相比而言,Windows有在API 里面做yield的传统,这也是为什么这么容易搞出来fiber的原因。

C++11的thread API里有个莫名其妙的this_thread::yield,其实就是为了 给non-preemptive的runtime留下余地。理论上说,如果禁止调用操统API, 全都用C++的I/O库,C++是有可能做出来用户态线程的runtime的。有的嵌入 式系统本就没有preemptive scheduling, yield就成了必须得了

Yarn does not provide a tool to profile the memory usage of an app yet, but it does save some instrumentation information to the log. Like this.

yarn-wdong-nodemanager-washtenaw.log:2015-01-06 14:56:43,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 16669 for container-id container_1420574192658_0001_01_000001: 277.3 MB of 9 GB physical memory used; 8.9 GB of 18.9 GB virtual memory used

The numbers reported are actually those based on which Yarn kills processes.

This script analyzes the log and reports maximal memory usage of each container for a particular app.

Sample output

$ yarn-memory-tracker.sh application_1421176927536_0002    # an spark app
383 containers found for app application_1421176927536_0002
container_1421176927536_0001_01_000001: 0.254785 of 16.4 GB
container_1421176927536_0001_01_000002: 16.2 of 51.4 GB
container_1421176927536_0001_01_000003: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000004: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000005: 12.5 of 51.4 GB
container_1421176927536_0002_01_000001: 0.251563 of 16.4 GB
container_1421176927536_0002_01_000002: 16.1 of 51.4 GB

© 2017 Wei Dong