Wei Dong's Blog

Sep 01, 2016

Deep Learning with Owl, PicPac and XNN

Owl, PicPac and XNN are three tools I wrote to make image-related model training easy.

  • Owl: a web UI for efficient image annotation.
  • PicPac: an image database and streaming library that preprocesses images and feeds them into a deep learning framework. PicPac supports Caffe (fork), MxNet, Nervana, Theano and Tensorflow.
  • XNN: a C++ wrapper that provides a unified prediction interface to all common deep learning frameworks, including Caffe, MxNet, Tensorflow, Theano and other Python-based frameworks.
  • (Caffe fork with PicPac support)

The goal is to create a model that will detect and localize a target object category within images. We will use a toy dataset for car plate recognition for illustration.

================

$ git clone https://github.com/aaalgo/owl
$ cd owl
$ # Download the dataset
$ wget http://www.robots.ox.ac.uk/~vgg/data/cars_markus/cars_markus.tar
$ mkdir images
$ cd images
$ tar xf ../cars_markus.tar
$ cd ..
$ # create database
$ ./manage.py migrate
$ # import images into the database
$ find images/ -name '*.jpg' | ./manage.py import --run
$ # start the annotation server

Before starting the annotation server, we need to adjust a couple of parameters in the file owl/annotate/params.py

ROWS = 2        # <-- rows of images / page
COLS = 3        # <-- images / row
BATCH = ROWS * COLS
POLYGON = False     # set to True for polygons
VIEWED_AS_DONE = False  # see below
$ ./run.sh

The URL of the annotation UI is http://HOSTNAME:18000/annotate/.

[Screenshot: the annotation UI]

The UI is designed to minimize hand movements and therefore maximize efficiency. The following design decisions were made:

  • A bounding box is automatically saved by AJAX when created.
  • Refreshing page loads the next batch of examples.

The annotation process finishes when all images are annotated/viewed. The VIEWED_AS_DONE parameter controls whether an image that has been viewed should be considered annotated even when no annotation is added. Set the value to True if it is known that images without positive regions exist. If the value is set to False and no annotation is made to an image, it will be shown again when all other images are done.

After annotation is done, or a sufficient number of annotations has been collected, the images and annotations can be exported to a PicPac database by

$ ./manage.py export db

The file db then contains all the information needed for training.

PicPac Database

A PicPac database contains images and labels/annotations. The annotations produced by Owl are in the same format used by Annotorious (Owl actually uses an extended version of Annotorious). Below is a sample annotation:

{'shapes': [{u'geometry': {u'y': 0.5912162162162162, u'x': 0.6049107142857143, u'width': 0.10491071428571429, u'height': 0.08277027027027027}, u'style': {}, u'type': u'rect'}]}
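
For illustration, here is a minimal Python sketch (not part of Owl or PicPac) that turns such a record into a pixel bounding box, assuming, as the values above suggest, that x/y/width/height are fractions of the image width and height:

# Minimal sketch: convert one Annotorious-style rectangle (fractional
# coordinates, as in the sample above) into a pixel-space bounding box.
def to_pixel_box(shape, image_width, image_height):
    geo = shape['geometry']
    x = int(round(geo['x'] * image_width))
    y = int(round(geo['y'] * image_height))
    w = int(round(geo['width'] * image_width))
    h = int(round(geo['height'] * image_height))
    return x, y, w, h

anno = {'shapes': [{'geometry': {'y': 0.5912, 'x': 0.6049,
                                 'width': 0.1049, 'height': 0.0828},
                    'style': {}, 'type': 'rect'}]}
print(to_pixel_box(anno['shapes'][0], 640, 480))   # e.g. for a 640x480 image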

PicPac provides a web server for viewing the content of a database.

$ picpac-server db
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0901 22:52:20.280788 29210 picpac-server.cpp:146] listening at 0.0.0.0:18888
I0901 22:52:20.281389 29210 picpac-server.cpp:148] running server with 1 threads.

Samples with annotations can then be viewed at http://HOSTNAME:18888/l?annotate=json. The red bounding box is rendered on the fly by the server; images and annotations are stored separately in the database.

[Screenshot: picpac-server showing a sample with its bounding box]

The server accepts almost all of the perturbation/augmentation parameters, so their effect on the training set can be visualized. For example, append &perturb=1&pert_angle=20 to the URL.

Sometimes, when the positive regions are too small compared to the background, it is desirable to use only the local areas surrounding the positive regions as training examples, so that positive and negative pixels are roughly balanced. The command below does the cropping.

$ picpac-split-region --width 100 --height 50 --bg 200 --no-scale 1 db db.crop
min: 0.668153
mean: 0.743567
max: 0.819342

Serving db.crop with picpac-server shows the cropped samples.

[Screenshot: picpac-server showing a cropped sample]

The program picpac-split-region accepts the following parameters:

  • --size (always 50): scale, i.e. sqrt(width*height), of the positive region.
  • --width: output image width.
  • --height: output image height.
  • --no-scale 1: if not set, the cropped region is scaled so that the positive and negative regions are of the specified size. If set, the cropped region is not scaled; instead, the size parameters are used to determine the ratio between positive and negative regions, and the output image size is determined accordingly.

Training

XNN provides a couple of templates based on public models. For example, we can train with the above database using the following command.

xnn/train-caffe-fcn.py fcn db ws

where

  • fcn is the template name.
  • db is the input database.
  • ws is the working directory.

Training will start automatically after the command, and can be canceled with CTRL+C. The ws directory will contain the following:

$ ls ws
log    params.pickle  solver.prototxt       train.log       train.prototxt.tmpl
model  snapshots      solver.prototxt.tmpl  train.prototxt  train.sh

Training can be restarted with train.sh, or continued at a snapshot by supplying a snapshot name under the snapshots directory as the argument of train.sh.

While some parameters can be adjusted via arguments to train-caffe-fcn.py, it is easier to cancel the training process, edit the file train.prototxt and then restart. The most important parameters of train.prototxt are annotated below.

layer {
  name: "data1"
  type: "PicPac"
  top: "data"
  top: "label"
  picpac_param {
    path: "path/to/db" 
    batch: 1        # batch size, has to be 1 if image sizes are different
    channels: 3     # color channels, use 1 for grayscale images
    split: 5        # randomly split db into 5 parts
    split_fold: 0   # use part 0 for validation and the rest for training

    annotate: "json"
    anno_color1: 1

    threads: 4      
    perturb: true   # enable image augmentation
    pert_color1: 10 # random perturbation range of
    pert_color2: 10 # the three color channels
    pert_color3: 10
    pert_angle: 20  # maximal angle of random rotation, in degrees
    pert_min_scale: 0.8 # min &
    pert_max_scale: 1.2 #       max random scaling factor
  }
}

PicPac supports a full range of flexible configurations. See the documentation at http://picpac.readthedocs.io/en/latest/ for details.

PicPac with TensorFlow

PicPac has a simple python interface with the same parameters.

    config = dict(loop=True,
                shuffle=True,
                reshuffle=True,
                batch=1,
                split=1,
                split_fold=0,
                annotate='json',
                channels=FLAGS.channels,
                stratify=False,
                mixin="db0",
                mixin_group_delta=0,
                #pert_color1=10,
                #pert_angle=5,
                #pert_min_scale=0.8,
                #pert_max_scale=1.2,
                #pad=False,
                #pert_hflip=True,
                channel_first=False
                )
    stream = picpac.ImageStream('db', negate=False, perturb=True, **config)

    ...
        with tf.Session() as sess:
            sess.run(init)
            for step in xrange(FLAGS.max_steps):
                images, labels, pad = stream.next()
                feed_dict = {X: images,
                             Y_: labels}
                _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)

Nov 07, 2015

Image Storage For Deep Learning, Raw or JPG?

Caffe requires the user to preload training images into a database, and the images are stored as raw pixels. The following calculation shows that this is not a very good idea.

Assume that images are pre-scaled to 256×256, so each raw image costs 256x256x3 = 192KB of storage. On the other hand, a 256×256×3 JPEG image compressed with default parameters costs about 48KB, about 1/4 the storage of raw pixels. Benchmarks show that a 4-core 2600K can decode JPEG images of this size at a rate of 6500/s using all cores, or 1350/s with one core. If we assume only one core can be allocated for image decoding, that processing power translates to 63MB/s of input throughput of JPEG data and 253MB/s of output throughput of raw pixels. The sequential read throughput of a traditional HDD is about 100MB/s, which is above the input throughput of one core. So an economical design would be to store the images JPEG-compressed on a traditional HDD and decode them with a dedicated CPU core. The throughput of Caffe, according to its website, is about 4ms/image for learning and 1ms/image for predicting on a K40 GPU. So the above configuration can well saturate the GPU even for predicting. The whole system is nicely balanced, and a mainstream HDD provides about 3TB of storage. This also leaves some room for future growth of GPU power and training image size (HDD/SSD grows in capacity rather than throughput).
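
The arithmetic above can be reproduced in a few lines of Python; the decode rates are the benchmark numbers quoted in the paragraph:

# Reproduce the back-of-the-envelope numbers above.
raw_bytes  = 256 * 256 * 3            # ~192 KB per raw image
jpeg_bytes = 48 * 1024                # ~48 KB per JPEG image
decode_rate_1core = 1350              # images/s decoded on one core

jpeg_in_mb_s = decode_rate_1core * jpeg_bytes / 1024.0 / 1024.0   # ~63 MB/s
raw_out_mb_s = decode_rate_1core * raw_bytes  / 1024.0 / 1024.0   # ~253 MB/s
gpu_rate     = 1000                   # images/s for prediction at 1 ms/image

print('JPEG input:  %.0f MB/s' % jpeg_in_mb_s)
print('raw output:  %.0f MB/s' % raw_out_mb_s)
print('one decoding core keeps up with the GPU: %s' % (decode_rate_1core > gpu_rate))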

Of course, this all relies on being able to achieve 63MB/s throughput from the disk, and achieving this on an HDD requires sequential I/O. With images stored in a database, it requires a very fast SSD to achieve such throughput. That's why I developed the PicPoc image storage for deep learning. (Benchmarking shows that sequential reads with LMDB DO achieve raw hardware throughput, whether HDD or SSD. The storage overhead of LMDB is also reasonably low, around 3% as I measured with the ILSVRC 2012 dataset.)

Here are some performance numbers I’ve been achieving with preliminary experiments.

Importing the fall 2011 version of ImageNet (14 million images stored in 21935 tar files, totalling about 1.2TB) into PicPoc took about 10 hours. The output is 400GB. The input was on one HDD and the output on another. CPU usage was 213.6%. Considering that reading 1.2TB from an HDD takes about 3.5 hours and CPU usage was about 50%, there is a possibility of doubling the loading throughput. But that is a one-shot business, so I'll say it's good enough for now. The ILSVRC 2012 training data, when imported, takes 28GB of storage, as opposed to the 173GB imported into LMDB as raw pixels as described in Caffe's documentation. (One doesn't have to use raw pixels with LMDB. The Caffe Datum can be used to store encoded images, and OpenCV pretty much supports all popular image codecs.)

On reading with decoding, the system is able to sustain 120MB/s throughput on a traditional 1TB HDD. I’ve also created a Caffe fork with PicPoc backend.

Jan 13, 2015

How to Profile Yarn App/Container Memory Usage

Yarn does not yet provide a tool to profile the memory usage of an app, but it does save some instrumentation information to its log, like this:

yarn-wdong-nodemanager-washtenaw.log:2015-01-06 14:56:43,267 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 16669 for container-id container_1420574192658_0001_01_000001: 277.3 MB of 9 GB physical memory used; 8.9 GB of 18.9 GB virtual memory used

The numbers reported are actually those based on which Yarn kills processes.

This script analyzes the log and reports maximal memory usage of each container for a particular app.

Sample output

$ yarn-memory-tracker.sh application_1421176927536_0002    # a Spark app
383 containers found for app application_1421176927536_0002
container_1421176927536_0001_01_000001: 0.254785 of 16.4 GB
container_1421176927536_0001_01_000002: 16.2 of 51.4 GB
container_1421176927536_0001_01_000003: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000004: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000005: 12.5 of 51.4 GB
container_1421176927536_0002_01_000001: 0.251563 of 16.4 GB
container_1421176927536_0002_01_000002: 16.1 of 51.4 GB
......
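
The script itself is not reproduced here, but a rough Python sketch of the same idea, with the log file name pattern and message format assumed from the sample line above, could look like this:

# Rough sketch: report the maximal "physical memory used" per container
# of one app, based on ContainersMonitorImpl log lines like the sample above.
import re, sys, glob

app = sys.argv[1]                       # e.g. application_1421176927536_0002
suffix = '_' + app.split('application_')[1] + '_'
pattern = re.compile(r'container-id (container\S+): ([\d.]+) (\wB) of ([\d.]+ \wB) physical memory used')
UNITS = {'KB': 1.0/1024/1024, 'MB': 1.0/1024, 'GB': 1.0}

peak = {}     # container id -> (max physical memory used in GB, limit string)
for path in glob.glob('yarn-*-nodemanager-*.log*'):
    for line in open(path):
        m = pattern.search(line)
        if not m or suffix not in m.group(1):
            continue
        used = float(m.group(2)) * UNITS[m.group(3)]
        cid, limit = m.group(1), m.group(4)
        if used > peak.get(cid, (0, None))[0]:
            peak[cid] = (used, limit)

print('%d containers found for app %s' % (len(peak), app))
for cid in sorted(peak):
    print('%s: %g of %s' % (cid, peak[cid][0], peak[cid][1]))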

Jan 08, 2015

HDFS Demystified: How to Manually Assemble an HDFS File When Hadoop is Down

In this blog I'll explain how to manually assemble a file stored in HDFS using 1) metadata on the namenode and 2) blocks on the datanodes. The process does not require the Hadoop system to be up and running. It's an interesting exercise for gaining some knowledge about the internal mechanisms of Hadoop, and such knowledge can come in handy when it comes to data recovery.

(1) Fetch the fsimage.

Below is a tree view of what is stored in the namenode directory.

$ cd HADOOP_NAMENODE_DIR
$ tree .
.
|-- current
|   |-- VERSION
|   |-- edits_0000000000005342851-0000000000005440884
|   |-- edits_0000000000006347975-0000000000006347976
... ...
|   |-- edits_0000000000006347977-0000000000006347986
|   |-- edits_inprogress_0000000000006347987
|   |-- fsimage_0000000000006347976
|   |-- fsimage_0000000000006347976.md5
|   |-- fsimage_0000000000006347986
|   |-- fsimage_0000000000006347986.md5
|   `-- seen_txid
`-- in_use.lock

1 directory, 50 files

We are interested in the fsimage file with the largest suffix number. Copy it out as “fsimage”. If our file in HDFS was recently uploaded or modified, its full metadata might not be present in the fsimage; some of the data could still be in one of the edits_ files, and we won't be able to fully assemble the most recent version of the file. One way to force Hadoop to produce a new checkpoint is to restart it: edit logs are merged into a new fsimage upon restart.

Before proceeding to the next step, it is useful to examine the content of the VERSION file.

$ cat VERSION
#Wed Oct 01 01:25:37 CST 2014
namespaceID=1453566641
clusterID=CID-a5f06877-24b3-4892-9dcf-05fccf827889
cTime=0
storageType=NAME_NODE
blockpoolID=BP-908018994-10.10.2.27-1412043710870
layoutVersion=-56

We’ll need the blockpoolID information.

(2) Examine the content of fsimage.

Use the following command to dump the content of fsimage to an XML file.

$ hdfs oiv -i fsimage -o fsimage.xml -p XML

Let's say we are interested in recovering the file “/user/home/playtime_20140915.txt”. We can find the following relevant information in the fsimage dump.

<inode><id>16392</id><type>FILE</type><name>playtime_20140915.txt</name><replication>2</replication><mtime>1412052903661</mtime><atime>1418937665301</atime><perferredBlockSize>134217728</perferredBlockSize><permission>wdong:supergroup:rw-r--r--</permission><blocks><block><id>1073741825</id><genstamp>1001</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073741826</id><genstamp>1002</genstamp><numBytes>134217728</numBytes></block>
<block><id>1073741827</id><genstamp>1003</genstamp><numBytes>49999484</numBytes></block>
</blocks>
</inode>

We can extract the list of block IDs either by eyeballing the dump or programmatically.

1073741825
1073741826
1073741827

We can also add up the numbers of bytes (318434940). If Hadoop is up, we can verify that this matches the file size.
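
The same extraction can be scripted against the XML dump; below is a minimal Python sketch based on the inode layout shown above (the file name and field names follow the dump):

# Minimal sketch: list block IDs and total size of one file from fsimage.xml.
import xml.etree.ElementTree as ET

target = 'playtime_20140915.txt'
tree = ET.parse('fsimage.xml')
for inode in tree.getroot().iter('inode'):
    if inode.findtext('name') != target or inode.findtext('type') != 'FILE':
        continue
    blocks = inode.find('blocks')
    ids = [b.findtext('id') for b in blocks.iter('block')]
    total = sum(int(b.findtext('numBytes')) for b in blocks.iter('block'))
    print('\n'.join(ids))
    print('total bytes: %d' % total)    # 318434940 for the example above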

(3) Gather the blocks.

Hadoop does not maintain an on-disk file or database mapping block IDs to nodes. This is actually a nice stateless design. We'll need to manually search each node for the blocks we need. Here's a sample layout of a Hadoop data directory.

$ tree .
.
|-- current
|   |-- BP-908018994-10.10.2.27-1412043710870
|   |   |-- current
|   |   |   |-- VERSION
|   |   |   |-- dfsUsed
|   |   |   |-- finalized
|   |   |   |   |-- blk_1073743127
|   |   |   |   |   |-- blk_1073834270_93446.meta
|   |   |   |       |-- blk_1073752388_11564.meta
|   |   |   |       |-- blk_1073801146
......
|   |   |   |       `-- blk_1073801146_60322.meta
|   |   |   `-- rbw
|   |   |       |-- blk_1074397675
|   |   |       |-- blk_1074397675_656923.meta
|   |   |       |-- blk_1074397684
|   |   |       `-- blk_1074397684_656932.meta
|   |   |-- dncp_block_verification.log.curr
|   |   |-- dncp_block_verification.log.prev
|   |   `-- tmp
|   `-- VERSION
`-- in_use.lock

390 directories, 16252 files

Here we see the blockpoolID we noted before as a directory name. Blocks are simply named blk_ID in one of the subdirectories.

In our cluster, the datanode directories are mounted at “/data/hadoop/data*/”, so it is quite easy to launch a cluster-wide search with pdsh.

$ pdsh "find /data/hadoop/data*/ -name blk_1073741825"
klose4: /data/hadoop/data3/current/BP-908018994-10.10.2.27-1412043710870/current/finalized/subdir56/blk_1073741825
klose2: /data/hadoop/data2/current/BP-908018994-10.10.2.27-1412043710870/current/finalized/blk_1073741825

We see that the block has two replicas. We can modify the above command a little bit to copy the files over:

$ for B in 1073741825 1073741826 1073741827 ; do  pdsh "find /data/hadoop/data*/ -name blk_$B" | while read a b; do scp $a$b . ; break; done ; done

The break command is to stop us from copying more than one replica.

(4) Assemble the file

$ cat blk_1073741825 blk_1073741826 blk_1073741827 | md5sum
ad07d7ced9c9210b4a4b14d08c0d146f -
$ hdfs dfs -cat playtime_20140915.txt | md5sum # only when hadoop is up.
ad07d7ced9c9210b4a4b14d08c0d146f -

Bingo!

Dec 03, 2014

Spark on Yarn: Where Have All the Memory Gone?

Efficient processing of big data, especially with Spark, is really all about how much memory one can afford, or how efficiently one can use the limited amount of available memory. Efficient memory utilization, however, is not something one can take for granted with the default configuration shipped with Spark and Yarn. Rather, it takes very careful provisioning and tuning to get as much as possible out of the bare metal. In this post I'll demonstrate a case where not-so-careful configuration of Spark on Yarn leads to poor memory utilization for caching, explain the math that leads to all the observed numbers, and give some tips on parameter tuning to address the problem.

A little bit of background first. I'm not working at one of those big-name companies and do not have thousands of machines at my disposal; my group has fewer than ten for data crunching. Months of experimental usage of Spark in a standalone configuration showed very promising results, and we want to use those few machines for both production and development. That is, we want to be able to run two instances of Spark apps in parallel. Since we already have Hadoop, Spark on Yarn seems to be a natural choice. It is not difficult at all to set up Spark on Yarn, but I quickly found that I was not able to fire up the second instance of Spark because I was, as seen by Yarn, out of memory. I'll use the following simplified one-machine setup to demonstrate the problem I've seen.

1. Demonstration of the Problem

This demo was run on a desktop with 64G of memory. I used the following settings:

yarn.nodemanager.resource.memory-mb = 49152     # 48G
yarn.scheduler.maximum-allocation-mb = 24576    # 24G
SPARK_EXECUTOR_INSTANCES=1
SPARK_EXECUTOR_MEMORY=18G
SPARK_DRIVER_MEMORY=4G

The Yarn parameters went into yarn-site.xml, and the Spark ones in spark-env.sh. I didn’t set any other memory-related parameters.

So the total memory allocated to Yarn was 48G, with 24G maximum for one app. Spark should use 18+4 = 22G memory, which is below the 24G cap. So I should be able to run two Spark apps in parallel.

Following are the numbers I got from all the logs and Web UIs when I actually fired up one Spark app.

  • (Yarn) Memory Total: 48G
  • (Yarn) Memory Used: 25G
  • (Yarn) Container 1, TotalMemoryNeeded: 5120M
  • (Yarn) Container 2, TotalMemoryNeeded: 20480M
  • (Spark) Driver memory requirement: 4480 MB memory including 384 MB overhead (From output of Spark-Shell)
  • (Spark) Driver available memory to App: 2.1G
  • (Spark) Executor available memory to App: 9.3G

Below are the relevant screenshots.

[Screenshots: Yarn and Spark web UIs]

So here are the problems that I see with the driver:

I've configured the Spark driver to use 4G, and Spark asked Yarn for 4G plus an overhead of 384MB. What is reflected in Yarn is that the driver used 5G. What's really available in the driver's block manager is only 2.1G. One has to understand that Spark has to reserve a portion of memory for code execution and cannot give everything to the block manager (the cache), but still,

WHERE HAVE ALL THE MEMORY GONE???

2. The Math Behind

Rule 1. Yarn always rounds up memory requirements to multiples of yarn.scheduler.minimum-allocation-mb, which by default is 1024, or 1GB. That's why the driver's requirement of 4G+384M showed up as 5G in Yarn. The parameter yarn.scheduler.minimum-allocation-mb is really a "minimum-allocation-unit-mb". This can be easily verified by setting the parameter to a prime number, such as 97, and watching Yarn allocate in multiples of that number.

Rule 2. Spark adds an overhead to SPARK_EXECUTOR_MEMORY/SPARK_DRIVER_MEMORY before asking Yarn for that amount. The rule for the overhead is the same for both the executor and the driver:

//yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala
  val MEMORY_OVERHEAD_FACTOR = 0.07
  val MEMORY_OVERHEAD_MIN = 384

//yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
  protected val memoryOverhead: Int = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
    math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
......
      val totalExecutorMemory = executorMemory + memoryOverhead
      numPendingAllocate.addAndGet(missing)
      logInfo(s"Will allocate $missing executor containers, each with $totalExecutorMemory MB " +
        s"memory including $memoryOverhead MB overhead")

This overhead is necessary because when a JVM program is allowed a certain amount of memory (by -Xmx), the overall memory usage of the JVM can be more than that, and Yarn literally kills programs that use more memory than allowed (with complicated rules). One can only adjust the two magic numbers by modifying the source.

The above two rules determine how the configured SPARK_XXX_MEMORY finally shows up in Yarn.

Rule 3. How much memory does the driver/executor see?

One limits the maximal heap memory of the JVM with the option "-Xmx". Part of the specified memory gets used by the Scala runtime and other system components, so what a Scala program sees is less than the specified amount. This can be illustrated with the following example.

$ scala -J-Xmx4g
Welcome to Scala version 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_51).
Type in expressions to have them evaluated.
Type :help for more information.

scala> Runtime.getRuntime.maxMemory
res0: Long = 3817865216
scala>

The runtime eats about 455M. (The above process has an RSS of 140.3M in Linux, so a big portion of the 455M is more like being reserved than actually being used.)

The Spark driver is allocated the configured 4G by JVM options. This can be verified by running the following from inside the Spark shell.

scala>

scala> import java.lang.management.ManagementFactory
import java.lang.management.ManagementFactory

scala> ManagementFactory.getRuntimeMXBean.getInputArguments
res0: java.util.List[String] = [-XX:MaxPermSize=128m, -Djava.library.path=/home/hadoop/hadoop-2.4.1/lib/native, -Xms4G, -Xmx4G]

scala> Runtime.getRuntime.maxMemory
res1: Long = 4116709376

scala>

Rule 4. How Spark determines maximal usable memory

//core/src/main/scala/org/apache/spark/storage/BlockManager.scala
/** Return the total amount of storage memory available. */
private def getMaxMemory(conf: SparkConf): Long = {
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
}

We have 4116709376 * 0.6 * 0.9 = 2.07G. That is where the 2.1G value comes from. The maximal available memory of the executor is derived in the same way.

Overall, the following two formulas guide memory allocation:

What's seen by Yarn: (SPECIFIED_MEMORY + OVERHEAD), rounded up to a multiple of minimum-allocation-mb, with OVERHEAD = max(SPECIFIED_MEMORY * 0.07, 384M)

What's usable for cache: (SPECIFIED_MEMORY - MEMORY_USED_BY_RUNTIME) * spark.storage.memoryFraction * spark.storage.safetyFraction
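
A small Python sketch that plugs the numbers from this post into the two formulas (the 4116709376 figure is the Runtime.maxMemory measured in Rule 3; everything else uses the defaults above):

# Sketch of the two formulas above, using the numbers from this post.
def yarn_visible_mb(spec_mb, min_alloc_mb=1024):
    overhead = max(int(0.07 * spec_mb), 384)                    # Rule 2
    asked = spec_mb + overhead
    return ((asked + min_alloc_mb - 1) // min_alloc_mb) * min_alloc_mb  # Rule 1

def usable_for_cache_gb(jvm_max_memory_bytes,                   # Runtime.getRuntime.maxMemory
                        memory_fraction=0.6, safety_fraction=0.9):  # Rule 4 defaults
    return jvm_max_memory_bytes * memory_fraction * safety_fraction / 1024.0**3

print(yarn_visible_mb(4096))            # driver:   5120 MB, the 5G seen by Yarn
print(yarn_visible_mb(18432))           # executor: 20480 MB, the 20G seen by Yarn
print(usable_for_cache_gb(4116709376))  # driver:   ~2.07 GB, the 2.1G in the block manager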

3. Tuning Suggestions

We see that the root of memory under-utilization is the over-provisioning done at almost every step. Even if a process really reaches the configured memory cap, it's unlikely to keep using that amount of memory all the time. Because Yarn actually kills a process when it exceeds the memory cap, we have to keep SPARK_XXX_MEMORY big enough. It is also very difficult to determine the actual amount of memory used by Spark code execution, so dealing with spark.storage.memoryFraction is tricky. But if one is sure that the overall memory consumption of the parallel Spark apps is unlikely to exceed the physical memory, the easiest way to improve memory utilization is to counter the over-provisioning with overcommitment. That is, set the Yarn parameter yarn.nodemanager.resource.memory-mb to MORE THAN THE AVAILABLE PHYSICAL MEMORY (luckily Yarn does not check that). It also helps a little bit to set yarn.scheduler.minimum-allocation-mb to a small value like 100M, so an app does not get much more than what it asks for.

Nov 27, 2014

The Dark Truth Behind the Power of Monads (and why it’s OK you cannot master it)

Years ago when I was in graduate school, there was a period when I became very obsessed with functional programming and the Haskell programming language. (At the dawn of the era of multi-core computation, it was believed that functional programming was one of the most promising technologies that would save the world.) Unlike many earlier functional languages, which allow ad hoc imperative constructions when it comes to IO-related tasks, Haskell is elegant and purely functional. When IO must be done, one programs in Haskell with the mindset that one is composing a combination of IO (and other) instructions. The IO instructions, as instructions rather than realizations of them, are purely mathematical and do not interact with the real world. The computation of a Haskell program produces something called an IO monad, and all side effects only take place when the runtime executes the IO monad, which happens outside the functional programming realm. One can argue that this is only a perspective, because a piece of, say, C++ code, as code per se, does not have any side effects either. Side effects only take place when a binary is executed, and the binary is not C++; it's outside the realm of C++ programming. Well, yes, functional programming IS only a perspective, but imperative programmers just do not think in that way.

Now let's come back to the mysterious monad. It turns out that monads are much, much more powerful than merely an abstraction of IO operations. A whole lot of apparently unrelated programming constructions can be realized with monads, and the resulting code is usually very simple and beautiful. There are monads for old-school pessimistic error handling (the Maybe monad), exceptions, lists, state machines, parsers, even software transactional memory (STM) and structured query language (SQL). You name it! I felt my horizons broaden, and was fascinated by the prospect that I would become a very powerful programmer if I mastered the unqualified, generic MONAD.

So I dug out all the articles and tutorials about monads I could find on the Internet (well, except for textbooks on category theory; I was trying to find a shortcut). And after a long time of struggling, I found myself defeated by the fact that I could not comprehend a programming construction as simple as three laws, at least two of which, if not all three, are in my opinion trivial! I didn't have any problem with any of the specialized monads, and I understood every letter of the monad laws. But the generic monad, I just couldn't get the hang of it.

Life goes on without one becoming a master monadic programmer. I came back to my bread-earning systems and machine learning programming, and was glad to find that I could still do whatever I needed to do with good old imperative programming. And with the introduction of lambdas (be cautious: the lambdas in C++ and other imperative languages are anything but functional!) and other nice features in C++11, life after all is becoming better. But deep down inside, this monad thing kept bothering me, and has been, as I now see, fermenting.

Then there came a day when, while fighting a religious programming-language war in a web forum (always a good pastime), the enlightenment suddenly came to me. And here I'm sharing with you the dark truth behind the power of monads that I finally came to understand.

Behind all the powers of monads, there is only one true power: the parsing power. Monads can be used in a parser to create internal data structures representing the parsing results of a context-free language (strictly speaking a little more than that, but it doesn't matter here), to which almost all programming languages belong. The only property common to all those specialized monads is that they can all be written in a context-free language which can be parsed. The process of monadic programming is nothing more than mentally parsing a piece of imaginary imperative (or whatever) code and writing down the "internal" representation in monads. It naturally follows that one can mimic with monads any (context-free) programming construction that can ever be invented. And I'm pretty sure the process can be automated if one doesn't have to preserve all the syntactic sugar. My hope of becoming a master programmer by learning to use monads, put in layman's terms, is no more realistic than the hope of becoming a serial inventor by learning the art of creating text files!

Monads and their equivalents are still nice and powerful as a language for invention. They make it possible to (re-)invent programming constructions without having to create new languages. But very likely they don't make invention itself any easier. One piece of evidence is that all the monads we have seen so far have preexisting counterparts in other languages and libraries. At least now I don't feel so bad that I cannot become a master monadic programmer.

And speaking of syntactic sugar, people have been feeding on it for a while, and I don’t mind having more of it.

Oct 03, 2014

Building Portable C/C++ Programs and Libraries for the Linux World

(Companion Vagrant box is available under the name “wdong/cbox”, or the box can be downloaded at http://anna.wdong.org/cbox.box.)

I have to admit that pulling source code from github has become the mainstream development mode, and virtualization has become the mainstream deployment mode, but many times it is still desirable to deliver a piece of software in the form of a binary executable or library. There are many reasons one might want to do that: one might not want to give away the source code; the client wants the source code but doesn't have the capability to build it, or simply does not want to spend the human labor to build it. With tools like maven, software versioning is kind of under control in the Java world. But the C/C++ world is not as lucky, and with all the github code that usually does not even have a version number, building a C/C++ code repository with dependencies is not for everybody.

Virtualization helps contain all the dependencies, and it is a good solution for bigger software components like web services. One just delivers a VM image and everything is taken care of. But the use cases for C/C++ are usually small performance-critical components that must be tightly integrated with code written in other languages, and the overhead of virtualization is usually too high.

But the Linux world is notoriously heterogeneous. We are living in a world with Linux kernel 3.x, Ubuntu 14.x and RHEL 7.x, but almost all the companies and university labs whose machines I have had a chance to log into are still using CentOS 5.x for production and research (RHEL 5 was first released in 2007): once you get a cluster set up, it's virtually impossible to upgrade the operating system. On the other hand, you also want your software to run on the newest systems available to today's startup companies, on everything in between, and hopefully on future systems as well.

Now, generic portability across Linux versions and distributions cannot be achieved with C/C++; that's why Java was invented. But if one just wants to deliver a single program/library file that contains all the functionality, then, thanks to the backward compatibility of the Linux kernel, this is usually achievable by linking almost all the libraries statically into the program.

The library case is more interesting. We want everything to be contained in the library, including all the libraries we depend on. But we cannot provide a static library, because that way we would have to expose all the dependencies, and it would cause version conflicts for sure when the client tries to link against the library. So the solution is to deliver an (almost) statically linked shared library, and maybe a very small piece of interfacing code. The KGraph library for similarity search is provided in this form.

Static linking is not the common practice in the Linux world. All software packages are distributed with shared libraries, and if one chooses to build something from source code, shared libraries are produced by default. Fortunately, most software packages use the automake system, and static linking has always been an option which can easily be enabled by adding “--enable-static --disable-shared” when running the configure script. The “--disable-shared” part is important because without it, shared libraries will also be produced, and the default behavior of gcc is to link against shared libraries. One can force gcc to link statically by adding “-static”, but some system APIs won't work as expected (getaddrinfo will lose the ability to resolve hostnames). Now, with “--enable-static” alone, the build system will assume the library is to be used statically and will produce non-relocatable machine code, and we won't be able to use that to produce a shared library. The solution is to export “CFLAGS=-fPIC” and “CXXFLAGS=-fPIC” before running the configure script. These two easy fixes work for most packages, and the remaining ones have to be worked on case by case.

It would be misleading to end this blog leaving a novice reader believing that static linking is the way to go for everybody. Actually, there are strong arguments against static linking (basically, one can gain more with dynamic linking). But there are languages like golang which favor static linking. And for people out there on their own, like me, who do not have a lot of human labor at their disposal and would rather spend time on algorithms than on software packaging, static linking does come in handy.

About the companion box:

This box contains a development environment that is geared towards computation-intensive data processing applications without a GUI, like machine learning, image/audio processing and such. It is based on CentOS 5.6 with devtools 2.1 (gcc 4.8). I've also installed many libraries using the above method, including Boost, Poco, OpenCV, libav and many others. OpenCV and libav have been tailored to remove GUI- and device-related stuff (including media playback), because such functionality relies on components that are hard to make portable.

May 28, 2014

Bcache for Ubuntu 14.04 Root Filesystem

It's been a while since bcache made its way into the Linux kernel, but the installers of most distributions have not yet caught up to allow users to install to a bcache-backed volume, and the user-land tools necessary to make use of bcache are not yet installed by default. There has been a tutorial on how to convert the root of an existing installation to bcache with a tool named blocks, but that code base and the code bases it depends on are either too old or too new to be directly usable with the default Python 3.4 setup of Ubuntu 14.04. After a frustrating process of fixing all the compatibility issues, I was able to make the code run, but I don't trust my own patch (and the stability of forked github code) enough to apply it to my real data. I ended up with this not-so-drastic, but cleaner and safer, way to get a bcache-backed root filesystem with minimal external dependencies.

The idea is to (1) install Ubuntu into a normal partition (>= 5GB) which will later be converted to swap space, (2) set up bcache, and (3) migrate / to the bcache device.

In my case (Lenovo U430P), /dev/sda is a 16G SSD, and /dev/sdb is a 1T HDD.

1. Initial Installation

Install Ubuntu 14.04 using the following disk partitioning scheme:

  • 64MB EFI partition /dev/sda1, fat32, mounted at /boot/efi. This is not necessary if the machine is booted in the traditional BIOS way.
  • 200MB ext4 partition /dev/sda2 to be mounted on /boot. It is necessary to keep /boot on a separate partition. I put this on the SSD to make it faster for the kernel to load (haven't compared, but I guess the speedup over HDD, if not a slowdown, won't be that obvious, as the kernel is a multi-megabyte file).
  • 16GB ext4 partition /dev/sdb1 to be mounted as / during installation. We'll later convert it to a swap partition.
  • An empty big partition /dev/sdb2, later to be used as the backing device of the bcache root. Create this partition, but do not use it for now.
  • An empty partition /dev/sda3 on the SSD, later to be used as the cache. Create this partition, but do not use it for now.

The installer will complain about not having a swap space.  Ignore that.

2. Setting Up Bcache

After installation, boot into the newly installed system, install bcache-tools (from a PPA) and set up the system:

$ sudo bash
# add-apt-repository ppa:g2p/storage
# apt-get update
# apt-get install bcache-tools
# make-bcache -C /dev/sda3 -B /dev/sdb2
# mkfs.ext4 /dev/bcache0

3. Migrating Root Filesystem

Keep working in the newly installed system.

$ sudo bash
# mkdir OLD NEW
# mount /dev/sdb1 OLD    # the old root
# mount /dev/bcache0 NEW # this would be our new root
# rsync -a OLD/ NEW/     # now NEW contains the root
# ### mount a series of directories in preparation for grub-install
# mount /dev/sda2 NEW/boot
# mount /dev/sda1 NEW/boot/efi
# mount -o bind /dev NEW/dev
# mount -t proc none NEW/proc
# mount -t sysfs none NEW/sys
# chroot NEW
# #### find out the UUID of /dev/bcache0 and /dev/sdb1
# ls -l /dev/disk/by-uuid/ | grep bcache0
lrwxrwxrwx 1 root root 13 May 29 21:49 4c492013-e8a3-40b5-b5cd-9220ed2e0195 -> ../../bcache0
# ls -l /dev/disk/by-uuid/ | grep sdb1
lrwxrwxrwx 1 root root 10 May 29 21:49 765d6fc0-9ff4-4cf4-95f9-17a6e76ae80c -> ../../sdb1
# vi NEW/etc/fstab NEW/boot/grub.cfg #### edit the files NEW/etc/fstab and NEW/boot/grub.cfg, replacing all occurrences of the UUID of sdb1 with that of bcache0.
# grub-install /dev/sda

4. Final Configurations in New System

Reboot into the newly installed system. Now the root is on /dev/bcache0. The old data on /dev/sdb1 is no longer used, and /dev/sdb1 can be converted to swap space.

$ sudo bash
# mkswap /dev/sdb1
Setting up swapspace version 1, size = 15624188 KiB
no label, UUID=e35bc636-9944-4dd5-ab3d-6c371b0cb7a8
# swapon /dev/sdb1
##### make sure to change the UUID of the command below
# echo "UUID=e35bc636-9944-4dd5-ab3d-6c371b0cb7a8 none swap defaults 0 0" >> /etc/fstab

Now I have my Ubuntu running happily on bcache, and I hope it's not going to cause any data loss.

May 07, 2014

A General Outline of Machine Learning Fundamentals

I believe the single root principle behind all of machine learning is Occam's razor: among models that fit the data similarly well, the simpler model has the stronger predictive power. Different schools arise from different definitions of "simple". For example:

1. The lower the feature dimensionality, the simpler the model (see the curse of dimensionality). From this starting point come the various dimensionality-reduction algorithms, such as PCA and LDA (there are two completely different LDAs, but both are essentially dimensionality reduction). Multi-layer neural networks can also be viewed as a kind of dimensionality reduction. The title of Hinton's autoencoder paper in Science is literally about "reducing the dimensionality" of data, which says something about how important dimensionality reduction is. Generalize dimensionality reduction and you get data compression; there is even a view that data compression taken to the extreme is artificial intelligence.

2. The fewer non-zero feature dimensions, the simpler the model. So-called sparse coding is a very important research topic in computer vision and related fields. Before neural networks took off in 2012, almost every image-recognition algorithm used some form of sparse coding. As we know, a point in a D-dimensional space expressed in a basis of D vectors is a D-dimensional vector, and those D coordinates are almost never zero. There are two routes to sparse coding: 1. allow some error; 2. enlarge the basis. The crudest sparse coding is k-means clustering (also called vector quantization). A more general approach is to add an L1 regularization term when training the model to achieve sparsity, which leads to the next family of methods.

3. Regularization. Even when a vector has many dimensions, shrinking the range of values each dimension can take is another form of simplicity. In general, a common way to shrink the value range is to optimize the norm of the vector (or of the model). From the statistical point of view, if the data are assumed to follow a normal distribution, maximum-likelihood estimation is essentially equivalent to L2 regularization. Learning algorithms of this school usually solve an optimization problem of the following form:

        min_M  |f(x;M) - y| + a|M|
               training error    regularization

The mainstream way of solving this kind of problem today is gradient descent (SGD). Because adding a regularization term looks easy on paper, it is often abused: no matter what exponential or logarithmic functions appear in the first term, a regularization term is simply tacked on at the end. As a physicist would put it, not even the units match, let alone the statistical meaning. The magical thing is that it still tends to work to some extent.
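
As a concrete illustration (a made-up toy example, not from any particular library), here is a minimal NumPy sketch of SGD on an L2-regularized squared loss for a linear model f(x;M) = M·x:

# Minimal sketch: SGD on  min_M  |f(x;M) - y|^2 + a*|M|^2  for a linear model.
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)                   # toy data, made up for illustration
true_M = rng.randn(20)
y = X.dot(true_M) + 0.1 * rng.randn(1000)

M = np.zeros(20)
a = 0.01                                  # regularization strength
lr = 0.01                                 # learning rate
for epoch in range(10):
    for i in rng.permutation(len(X)):
        err = X[i].dot(M) - y[i]          # training error on one sample
        grad = err * X[i] + a * M         # gradient of squared loss + L2 term
        M -= lr * grad

print('parameter error:', np.linalg.norm(M - true_M))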

4. Reduce the intake of training data. What if the features are points in an abstract space, with no notion of dimensionality and no notion of norm? One approach is to pick the small set of most representative/critical samples from all the training data and build the model from those. From this angle come algorithms like SVM and boosting. In an SVM, these key samples are called support vectors, and the number of support vectors is effectively the dimensionality of the model. When training an SVM with SGD, a sample that is already predicted correctly is simply skipped; the model is only updated on mistakes. Generalize this a little, weighting the data/model according to how well they fit, and you basically get boosting. (Boosting was not derived this way, but there is no harm in understanding it this way.) Reducing the intake of training data does not mean shrinking the original training set; it means distilling it. For example, if the distilled set always ends up at one million samples, enlarging the original training set, say from ten million to a hundred million samples, will still improve prediction accuracy.

These are some lessons from my seven or eight years of studying ML, and I hope they are helpful to newcomers. When one day you come across a new method that works very well but you cannot figure out why it works, or you cannot make sense of where it came from, try tracing it back to the one principle behind everything, Occam's razor, and it may all suddenly click.