The receptive field of a cell in a convolutional neural network (CNN) is the underlying image region that affects the computation of that cell. For a CNN to effectively recognize an object, the receptive field size of the narrowest layer of the network must be large enough to capture the characteristic structures of that object. A receptive field that is too big can be problematic, too: in addition to costing excessive computation, the network might be forced to learn structures that surround the object but are not intrinsically attached to or associated with it.

This program automatically computes the receptive field size of the narrowest layer of a TensorFlow model. This can be used as a guide to modifying a standard network architecture to work with a simple dataset.
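For a plain chain of convolution and pooling layers, the receptive field follows a simple recursion: a layer with kernel size k grows the receptive field by (k-1) times the cumulative stride of the layers below it. Below is a minimal sketch of this recursion in Python; the layer list in the example is illustrative, not taken from any particular model.

def receptive_field(layers):
    # layers: (kernel_size, stride) pairs, ordered from input to output
    rf, jump = 1, 1  # receptive field size and cumulative stride ("jump")
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Example: three 3x3 convolutions, each followed by 2x2 max pooling.
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]))  # 22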


Owl, PicPac and XNN are three tools I wrote to make image-related model training easy.

  • Owl: a web UI for efficient image annotation.
  • PicPac: an image database and streaming library that preprocesses images and feeds them into a deep learning framework. PicPac supports Caffe (fork), MxNet, Nervana, Theano and TensorFlow.
  • XNN: a C++ wrapper that provides a unified prediction interface to all common deep learning frameworks, including Caffe, MxNet, TensorFlow, Theano and other Python-based frameworks.
  • (Caffe fork with PicPac support)

The goal is to create a model that detects and localizes a target object category within images. For illustration, we will use a toy car plate dataset.

================

$ git clone https://github.com/aaalgo/owl
$ cd owl
$ # Download the dataset
$ wget http://www.robots.ox.ac.uk/~vgg/data/cars_markus/cars_markus.tar
$ mkdir images
$ cd images
$ tar xf ../cars_markus.tar
$ cd ..
$ # create database
$ ./manage.py migrate
$ # import images into the database
$ find images/ -name '*.jpg' | ./manage.py import --run
$ # start the annotation server

Before starting the annotation server, we need to adjust a couple of parameters in the file owl/annotate/params.py

ROWS = 2        # <-- images rows / page
COLS = 3        # <-- images / row
BATCH = ROWS * COLS
POLYGON = False     # set to True for polygons
VIEWED_AS_DONE = False  # see below
$ ./run.sh

The URL of the annotation UI is http://HOSTNAME:18000/annotate/.

[screenshot: the annotation UI]

The UI is designed to minimize hand movements and therefore maximize efficiency. The following design decisions were made:

  • A bounding box is automatically saved by AJAX when created.
  • Refreshing the page loads the next batch of examples.

The annotation process finishes when all images are annotated/viewed. The VIEWED_AS_DONE parameter controls whether an image that has been viewed should be considered annotated even when no annotation is added. Set the value to True if it is known that some images contain no positive regions. If the value is set to False and no annotation is made to an image, the image will be shown again after all other images are done.

After annotation is done, or a sufficient number of annotations have been collected, the images and annotations can be exported to a PicPac database by

$ ./manage.py export db

The file db then contains all the information needed for training.

PicPac Database

A PicPac database contains images and labels/annotations. The annotations produced by Owl use the same format as Annotorious (Owl actually uses an extended version of Annotorious). Below is a sample annotation:

{"shapes": [{"geometry": {"y": 0.5912162162162162, "x": 0.6049107142857143, "width": 0.10491071428571429, "height": 0.08277027027027027}, "style": {}, "type": "rect"}]}
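The geometry values appear to be fractions of the image width and height. Below is a minimal sketch of converting such a record to a pixel-space bounding box; the helper is my own illustration, not part of Owl or PicPac.

def to_pixel_box(geometry, image_width, image_height):
    # Convert relative Annotorious-style geometry to pixel coordinates.
    x = geometry['x'] * image_width
    y = geometry['y'] * image_height
    w = geometry['width'] * image_width
    h = geometry['height'] * image_height
    return int(round(x)), int(round(y)), int(round(w)), int(round(h))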

PicPac provides a web server for viewing the content of a database.

$ picpac-server db
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0901 22:52:20.280788 29210 picpac-server.cpp:146] listening at 0.0.0.0:18888
I0901 22:52:20.281389 29210 picpac-server.cpp:148] running server with 1 threads.

Samples with annotations can then be viewed at http://HOSTNAME:18888/l?annotate=json. The red bounding box is rendered on the fly by the server; images and annotations are stored separately in the database.

[screenshot: database samples with rendered bounding boxes]

The server accepts almost all of the perturbation/augmentation parameters, so their effects on the training set can be visualized. For example, &perturb=1&pert_angle=20 can be appended to the URL.

Sometimes, when the positive regions are too small compared to the background, it is desirable to use only the local areas surrounding the positive regions as training examples, so that positive and negative pixels are roughly balanced. The command below does the cropping.

$ picpac-split-region --width 100 --height 50 --bg 200 --no-scale 1 db db.crop
min: 0.668153
mean: 0.743567
max: 0.819342

Serving db.crop with picpac-server shows the cropped samples:

[screenshot: cropped samples]

The program picpac-split-region accepts the following parameters:

  • --size: scale, i.e. sqrt(width*height), of the positive region (default 50).
  • --width: output image width.
  • --height: output image height.
  • --no-scale: if not set, the cropped region is scaled so that the positive and negative regions are of the specified sizes. If set, the cropped region is not scaled; instead, the size parameters only determine the ratio between positive and negative regions, and the output image size is set accordingly. For example, with --width 100 --height 50 and the default --size of 50, positive pixels fill about 50*50 / (100*50) = half of each output image.

Training

XNN provides a couple of templates based on public models. For example, we can train with the above database using the following command.

xnn/train-caffe-fcn.py fcn db ws

where

  • fcn is the template name.
  • db is the input database.
  • ws is the working directory.

Training will start automatically after the command, and can be canceled with CTRL+C. The ws directory will contain the following:

$ ls ws
log    params.pickle  solver.prototxt       train.log       train.prototxt.tmpl
model  snapshots      solver.prototxt.tmpl  train.prototxt  train.sh

Training can be restarted with train.sh, or continued from a snapshot by passing the name of a snapshot under the snapshots directory as the argument to train.sh.

While some parameters can be adjusted via arguments to train-caffe-fcn.py, it is easier to cancel the training process, edit the file train.prototxt, and restart. The most important parameters of train.prototxt are annotated below.

layer {
  name: "data1"
  type: "PicPac"
  top: "data"
  top: "label"
  picpac_param {
    path: "path/to/db" 
    batch: 1        # batch size, has to be 1 if image sizes are different
    channels: 3     # color channels, use 1 for grayscale images
    split: 5        # randomly split db into 5 parts
    split_fold: 0   # use part 0 for validation and the rest for training

    annotate: "json"  # render the JSON annotations into the label image
    anno_color1: 1    # pixel value used for annotated regions in the label

    threads: 4        # number of preprocessing threads
    perturb: true   # enable image augmentation
    pert_color1: 10 # random perturbation range of
    pert_color2: 10 # the three color channels
    pert_color3: 10
    pert_angle: 20  # maximal angle of random rotation, in degrees
    pert_min_scale: 0.8 # min and
    pert_max_scale: 1.2 # max random scaling factor
  }
}

PicPac supports a full range of flexible configurations. See the documentation (http://picpac.readthedocs.io/en/latest/) for details.

PicPac with TensorFlow

PicPac has a simple Python interface that accepts the same parameters.

    config = dict(loop=True,
                shuffle=True,
                reshuffle=True,
                batch=1,
                split=1,
                split_fold=0,
                annotate='json',
                channels=FLAGS.channels,
                stratify=False,
                mixin="db0",
                mixin_group_delta=0,
                #pert_color1=10,
                #pert_angle=5,
                #pert_min_scale=0.8,
                #pert_max_scale=1.2,
                #pad=False,
                #pert_hflip=True,
                channel_first=False
                )
    stream = picpac.ImageStream('db', negate=False, perturb=True, **config)

    ...
        with tf.Session() as sess:
            sess.run(init)
            for step in xrange(FLAGS.max_steps):
                images, labels, pad = stream.next()
                feed_dict = {X: images,
                             Y_: labels}
                _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)

Caffe requires the user to preload training images into a database, with the images stored as raw pixels. The following calculation shows that this is not a very good idea.

Assume that images are pre-scaled to 256×256, so each raw image costs 256×256×3 = 192KB of storage. The same image as a JPEG compressed with default parameters costs about 48KB, roughly 1/4 the storage of raw pixels. Benchmarks show that a 4-core 2600K can decode JPEG images of this size at a rate of 6500/s using all cores, or 1350/s with one core. If we assume only one core can be allocated to image decoding, that processing power translates to 63MB/s of input throughput of JPEG data, and 253MB/s of output throughput of raw pixels. The sequential read throughput of a traditional HDD is about 100MB/s, which is above the input throughput of one core.

So an economical design is to store the images JPEG-compressed on a traditional HDD and decode them with a dedicated CPU core. The throughput of Caffe, according to its website, is about 4ms/image for learning and 1ms/image for predicting on a K40 GPU, so the above configuration can saturate the GPU even for predicting. The whole system is nice and balanced, and a mainstream HDD provides about 3TB of storage. This also leaves some room for future growth of GPU power and training image size (HDD/SSD grows in capacity rather than throughput).
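The arithmetic above, as a quick sketch (the numbers are the ones quoted in the paragraph):

# Back-of-the-envelope check of the throughput numbers above.
RAW_BYTES = 256 * 256 * 3   # 192KB per raw image
JPEG_BYTES = 48 * 1024      # ~48KB per JPEG image
DECODE_RATE = 1350          # images/s on one core of a 4-core 2600K

jpeg_in = DECODE_RATE * JPEG_BYTES / 2.0 ** 20  # ~63MB/s of JPEG input
raw_out = DECODE_RATE * RAW_BYTES / 2.0 ** 20   # ~253MB/s of raw output
gpu_rate = 1000  # images/s at 1ms/image (predicting on a K40)

print('input %.0fMB/s, output %.0fMB/s' % (jpeg_in, raw_out))
print('one decoding core keeps the GPU busy: %s' % (DECODE_RATE >= gpu_rate))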

Of course, this all relies on being able to achieve 63MB/s throughput from the disk, and achieving this on an HDD requires sequential I/O. With images stored in a typical database, a very fast SSD would be needed to achieve such throughput. That's why I developed the PicPoc image storage for deep learning. (Benchmarking shows that sequential read with LMDB DOES achieve raw hardware throughput, whether HDD or SSD. The storage overhead of LMDB is also reasonably low, around 3% as I measured with the ILSVRC 2012 dataset.)

Here are some performance numbers from preliminary experiments.

Importing the fall 2011 version of ImageNet (14 million images stored in 21935 tar files, totalling about 1.2TB) into PicPoc took about 10 hours, with the input on one HDD and the output on another. Considering that just reading 1.2TB from an HDD takes about 3.5 hours, and that CPU usage was only about 50% (213.6% of four cores), there is room to double the loading throughput. But importing is a one-shot business, so I'd say it's good enough for now. The ILSVRC 2012 training data, when imported, takes 28GB of storage, as opposed to the 173GB of an LMDB of raw pixels as described in Caffe's documentation. (One doesn't have to use raw pixels with LMDB: the Caffe Datum can store encoded images, and OpenCV supports pretty much all popular image codecs.)

When reading with decoding, the system sustains 120MB/s throughput on a traditional 1TB HDD. I've also created a Caffe fork with a PicPoc backend.


© 2017 Wei Dong