Wei Dong's Bloghttp://www.wdong.org/2017-05-10T00:00:00-04:00Equivalence of Subpixel Convolution and Transposed Convolution2017-05-10T00:00:00-04:002017-05-10T00:00:00-04:00Wei Dongtag:www.wdong.org,2017-05-10:/equivalence-of-subpixel-convolution-and-transposed-convolution.html<p>The idea of subpixel convolution was originally proposed by <a href="https://arxiv.org/abs/1609.05158">W. Shi et al</a>
from <a href="https://techcrunch.com/2016/06/20/twitter-is-buying-magic-pony-technology-which-uses-neural-networks-to-improve-images/">Magic
Pony</a>,
and well explained in <a href="http://www.inference.vc/holiday-special-deriving-the-subpixel-cnn-from-first-principles/">this
post</a>.
The figure below illustrates, in the 1-dimensional case, that subpixel
convolution is really just transposed convolution (or deconvolution).</p>
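<p>To make the claim concrete, here is a minimal numpy sketch (the input and 4-tap kernel are toy values of my own choosing, not from the paper): splitting the kernel into its even and odd phases, running one ordinary convolution per phase, and interleaving the two outputs (the periodic shuffle) reproduces a stride-2 transposed convolution exactly.</p>

```python
import numpy as np

def transposed_conv1d(x, k, stride=2):
    # Stride-2 transposed convolution: scatter-add x[i] * k at offset 2*i.
    y = np.zeros(stride * (len(x) - 1) + len(k))
    for i, v in enumerate(x):
        y[stride * i:stride * i + len(k)] += v * k
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])        # toy input
k = np.array([0.5, -0.25, 0.125, 0.75])   # toy 4-tap kernel

# Subpixel view: one ordinary convolution per kernel phase,
# then interleave the outputs (the "pixel shuffle").
even = np.convolve(x, k[0::2])  # taps k[0], k[2]
odd = np.convolve(x, k[1::2])   # taps k[1], k[3]
shuffled = np.empty(len(even) + len(odd))
shuffled[0::2] = even
shuffled[1::2] = odd

assert np.allclose(shuffled, transposed_conv1d(x, k))
```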
<p><img alt="subpixel" src="http://www.wdong.org/subpixel.png"></p>2017 Spring2017-04-17T00:00:00-04:002017-04-17T00:00:00-04:00Wei Dongtag:www.wdong.org,2017-04-17:/2017-spring.html<p><img alt="plum" src="http://www.wdong.org/plum.jpg">
<img alt="apple" src="http://www.wdong.org/apple.jpg">
<img alt="pear" src="http://www.wdong.org/pear.jpg">
<img alt="flower" src="http://www.wdong.org/flower.jpg"></p>Receptive Field Size of Convolutional Network2017-01-19T00:00:00-05:002017-01-19T00:00:00-05:00Wei Dongtag:www.wdong.org,2017-01-19:/receptive-field-size-of-convolutional-network.html<p>The receptive field of a cell in a convolutional neural network (CNN)
is the underlying image region that affects the computation
of this cell. For a CNN to effectively
recognize an object, the receptive field size of the
narrowest layer of the network must be large enough to
capture the characteristic structures of that object.
A receptive field that is too big can be problematic, too.
In addition to costing excessive computation, the
neural network might be forced to learn structures
that surround, but are not intrinsically attached to or associated with,
the object.</p>
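<p>As a sketch of the underlying arithmetic (not the linked tool itself): for a stack of convolution/pooling layers, each given as a (kernel size, stride) pair, every layer widens the receptive field by (k - 1) times the product of all strides before it.</p>

```python
def receptive_field(layers):
    """Receptive field of stacked conv/pool layers given as (kernel, stride)."""
    r, jump = 1, 1           # field size and cumulative stride ("jump")
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s
    return r

# e.g. three 3x3 convs with a stride-2, 2x2 max-pool after the first two:
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # -> 10
```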
<p><a href="https://github.com/aaalgo/tfgraph">This program</a> automatically
computes the receptive field size of the narrowest layer of
a TensorFlow model. This can be used as a guide to modifying a
standard network architecture to work with a simple dataset.</p>Deep Learning with Owl, PicPac and XNN2016-09-01T00:00:00-04:002016-09-01T00:00:00-04:00Wei Dongtag:www.wdong.org,2016-09-01:/deep-learning-with-owl-picpac-and-xnn.html<p>Owl, PicPac and XNN are three
tools I wrote to make image-related
model training easy.</p>
<ul>
<li><a href="https://github.com/aaalgo/owl">Owl</a>: a web UI for efficient image annotation.</li>
<li><a href="https://github.com/aaalgo/picpac">PicPac</a>: an image database and streaming library
that preprocesses images and feeds them into a
deep learning framework. PicPac supports Caffe (fork), MxNet, Nervana, Theano and TensorFlow.</li>
<li><a href="https://github.com/aaalgo/xnn">XNN</a>: a C++ wrapper that provides a unified
prediction interface
to all common deep learning frameworks, including
Caffe, MxNet, TensorFlow, Theano and other Python-based frameworks.</li>
<li>(<a href="https://github.com/aaalgo/caffe">Caffe fork with PicPac support</a>)</li>
</ul>
<p>The goal is to
create a model that will detect and localize
a target object category within images.
We will use a toy dataset for car plate recognition for illustration.</p>
<h1>Annotation with Owl</h1>
<div class="highlight"><pre><span></span>$ git clone https://github.com/aaalgo/owl
$ <span class="nb">cd</span> owl
$ <span class="c1"># Download the dataset</span>
$ wget http://www.robots.ox.ac.uk/~vgg/data/cars_markus/cars_markus.tar
$ mkdir images
$ <span class="nb">cd</span> images
$ tar xf ../cars_markus.tar
$ <span class="nb">cd</span> ..
$ <span class="c1"># create database</span>
$ ./manage.py migrate
$ <span class="c1"># import images into the database</span>
$ find images/ -name <span class="s1">'*.jpg'</span> <span class="p">|</span> ./manage.py import --run
$ <span class="c1"># start the annotation server</span>
</pre></div>
<p>Before starting the annotation server, we need to adjust a couple of
parameters in the file <code>owl/annotate/params.py</code></p>
<div class="highlight"><pre><span></span><span class="n">ROWS</span> <span class="o">=</span> <span class="mi">2</span> <span class="c1"># <-- images rows / page</span>
<span class="n">COLS</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1"># <-- images / row</span>
<span class="n">BATCH</span> <span class="o">=</span> <span class="n">ROWS</span> <span class="o">*</span> <span class="n">COLS</span>
<span class="n">POLYGON</span> <span class="o">=</span> <span class="bp">False</span> <span class="c1"># set to True for polygons</span>
<span class="n">VIEWED_AS_DONE</span> <span class="o">=</span> <span class="bp">False</span> <span class="c1"># see below</span>
</pre></div>
<div class="highlight"><pre><span></span>$ ./run.sh
</pre></div>
<p>The URL of the annotation UI is <code>http://HOSTNAME:18000/annotate/</code>.</p>
<p><img alt="ui" src="http://www.wdong.org/anno.jpg"></p>
<p>The UI is designed to minimize hand movements and therefore maximize
efficiency. The following design decisions were made:</p>
<ul>
<li>A bounding box is automatically saved by AJAX when created.</li>
<li>Refreshing page loads the next batch of examples.</li>
</ul>
<p>The annotation process finishes when all images are annotated/viewed.
The <code>VIEWED_AS_DONE</code> parameter controls whether
an image that has been viewed should be considered annotated even when no annotation
is added. Set the value to <code>True</code> if it is known that images without
positive regions exist. If the value is set to <code>False</code> and no annotation
is made to an image, it will be shown again when all other images are done.</p>
<p>After annotation is done, or a sufficient number of annotations have been collected,
the images and annotations can be exported to a PicPac database by</p>
<div class="highlight"><pre><span></span>$ ./manage.py <span class="nb">export</span> db
</pre></div>
<p>The file <code>db</code> then contains all the information needed for training.</p>
<h1>PicPac Database</h1>
<p>A PicPac database contains images and labels/annotations.
The annotations produced by Owl are in the same format used
by <a href="http://annotorious.github.io/">Annotorious</a>; Owl
actually uses an extended version of Annotorious. Below is
a sample annotation:</p>
<div class="highlight"><pre><span></span><span class="p">{</span><span class="s1">'shapes'</span><span class="o">:</span> <span class="p">[{</span><span class="nx">u</span><span class="s1">'geometry'</span><span class="o">:</span> <span class="p">{</span><span class="nx">u</span><span class="s1">'y'</span><span class="o">:</span> <span class="mf">0.5912162162162162</span><span class="p">,</span> <span class="nx">u</span><span class="s1">'x'</span><span class="o">:</span> <span class="mf">0.6049107142857143</span><span class="p">,</span> <span class="nx">u</span><span class="s1">'width'</span><span class="o">:</span> <span class="mf">0.10491071428571429</span><span class="p">,</span> <span class="nx">u</span><span class="s1">'height'</span><span class="o">:</span> <span class="mf">0.08277027027027027</span><span class="p">},</span> <span class="nx">u</span><span class="s1">'style'</span><span class="o">:</span> <span class="p">{},</span> <span class="nx">u</span><span class="s1">'type'</span><span class="o">:</span> <span class="nx">u</span><span class="s1">'rect'</span><span class="p">}]}</span>
</pre></div>
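<p>Note that the geometry fields in the sample are fractions of the image size. Converting a shape to a pixel bounding box is therefore just a matter of scaling (a sketch; the 640x480 image dimensions below are hypothetical):</p>

```python
def to_pixel_box(geometry, img_w, img_h):
    # Annotorious-style geometry is relative (0..1); scale by image size.
    x = int(round(geometry['x'] * img_w))
    y = int(round(geometry['y'] * img_h))
    w = int(round(geometry['width'] * img_w))
    h = int(round(geometry['height'] * img_h))
    return x, y, w, h

geo = {'y': 0.5912, 'x': 0.6049, 'width': 0.1049, 'height': 0.0828}
print(to_pixel_box(geo, 640, 480))  # (x, y, width, height) in pixels
```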
<p>PicPac provides a web server for viewing the content of a database.</p>
<div class="highlight"><pre><span></span>$ picpac-server db
$ picpac-server db
WARNING: Logging before InitGoogleLogging<span class="o">()</span> is written to STDERR
I0901 22:52:20.280788 <span class="m">29210</span> picpac-server.cpp:146<span class="o">]</span> listening at 0.0.0.0:18888
I0901 22:52:20.281389 <span class="m">29210</span> picpac-server.cpp:148<span class="o">]</span> running server with <span class="m">1</span> threads.
</pre></div>
<p>And samples with annotations can be viewed with <code>http://HOSTNAME:18888/l?annotate=json</code>. The red bounding box is rendered on-the-fly by the server; images and annotations are stored separately in the database.</p>
<p><img alt="ui" src="http://www.wdong.org/picpac.jpg"></p>
<p>The server accepts almost all of the perturbation/augmentation parameters,
so the effects on the training set can be visualized. For example,
the following can be appended to the URL: <code>&perturb=1&pert_angle=20</code>.</p>
<p>Sometimes when the positive regions are too small compared to the background, it is desirable to use only the local areas surrounding the positive regions as training examples, so that positive pixels and negative pixels are roughly balanced. The command below can be used to do the cropping.</p>
<div class="highlight"><pre><span></span>$ picpac-split-region --width <span class="m">100</span> --height <span class="m">50</span> --bg <span class="m">200</span> --no-scale <span class="m">1</span> db db.crop
min: 0.668153
mean: 0.743567
max: 0.819342
</pre></div>
<p>Using <code>picpac-server</code> to serve <code>db.crop</code> shows this.</p>
<p><img alt="ui" src="http://www.wdong.org/picpac-crop.jpg"></p>
<p>The program <code>picpac-split-region</code> accepts the following parameters:</p>
<ul>
<li><code>--size</code> (always 50): scale, or sqrt(width*height), of the positive region.</li>
<li><code>--width</code>: output image width.</li>
<li><code>--height</code>: output image height.</li>
<li><code>--no-scale 1</code>: if not set, the cropped region is scaled so
that the positive region and negative region are of the specified size. If
set, the cropped region is not scaled; instead, the size parameters
are used to determine the ratio between positive and negative regions,
and the output image size is determined accordingly.</li>
</ul>
<h1>Training</h1>
<p>XNN provides a couple of templates based on public models. For example, we can train
with the above database using the following command.</p>
<div class="highlight"><pre><span></span>xnn/train-caffe-fcn.py fcn db ws
</pre></div>
<p>where</p>
<ul>
<li><code>fcn</code> is the template name.</li>
<li><code>db</code> is the input database.</li>
<li><code>ws</code> is the working directory.</li>
</ul>
<p>Training will start automatically after the command, and can be
canceled with CTRL+C. The <code>ws</code> directory will contain the
following:</p>
<div class="highlight"><pre><span></span>$ ls ws
log params.pickle solver.prototxt train.log train.prototxt.tmpl
model snapshots solver.prototxt.tmpl train.prototxt train.sh
</pre></div>
<p>Training can be restarted with <code>train.sh</code>, or continued at a snapshot by supplying a snapshot name under the <code>snapshots</code> directory as the argument of <code>train.sh</code>. </p>
<p>While some parameters can be adjusted via arguments to <code>train-caffe-fcn.py</code>,
it is easier to cancel the training process, edit the file <code>train.prototxt</code> and then restart. The most important parameters of <code>train.prototxt</code> are
annotated below.</p>
<div class="highlight"><pre><span></span>layer {
name: "data1"
type: "PicPac"
top: "data"
top: "label"
picpac_param {
path: "path/to/db"
batch: 1 # batch size, has to be 1 if image sizes are different
channels: 3 # color channels, use 1 for grayscale images
split: 5 # randomly split db into 5 parts
split_fold: 0 # use part 0 for validation and the rest for training
annotate: "json"
anno_color1: 1
threads: 4
perturb: true # enable image augmentation
pert_color1: 10 # random perturbation range of
pert_color2: 10 # the three color channels
pert_color3: 10
pert_angle: 20 # maximal angle of random rotation, in degrees
pert_min_scale: 0.8 # min &
pert_max_scale: 1.2 # max random scaling factor
}
}
</pre></div>
<p>PicPac supports a full range of flexible configurations. See the
<a href="http://picpac.readthedocs.io/en/latest/">documentation</a> for details.</p>
<h1>PicPac with TensorFlow</h1>
<p>PicPac has a simple Python interface that accepts the same parameters.</p>
<div class="highlight"><pre><span></span> <span class="n">config</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">loop</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">reshuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">batch</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">split</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">split_fold</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">annotate</span><span class="o">=</span><span class="s1">'json'</span><span class="p">,</span>
<span class="n">channels</span><span class="o">=</span><span class="n">FLAGS</span><span class="o">.</span><span class="n">channels</span><span class="p">,</span>
<span class="n">stratify</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">mixin</span><span class="o">=</span><span class="s2">"db0"</span><span class="p">,</span>
<span class="n">mixin_group_delta</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="c1">#pert_color1=10,</span>
<span class="c1">#pert_angle=5,</span>
<span class="c1">#pert_min_scale=0.8,</span>
<span class="c1">#pert_max_scale=1.2,</span>
<span class="c1">#pad=False,</span>
<span class="c1">#pert_hflip=True,</span>
<span class="n">channel_first</span><span class="o">=</span><span class="bp">False</span>
<span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">picpac</span><span class="o">.</span><span class="n">ImageStream</span><span class="p">(</span><span class="s1">'db'</span><span class="p">,</span> <span class="n">negate</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">perturb</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">)</span>
<span class="o">...</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">init</span><span class="p">)</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">FLAGS</span><span class="o">.</span><span class="n">max_steps</span><span class="p">):</span>
<span class="n">images</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">pad</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">next</span><span class="p">()</span>
<span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">X</span><span class="p">:</span> <span class="n">images</span><span class="p">,</span>
<span class="n">Y_</span><span class="p">:</span> <span class="n">labels</span><span class="p">}</span>
<span class="n">_</span><span class="p">,</span> <span class="n">loss_value</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">train_op</span><span class="p">,</span> <span class="n">loss</span><span class="p">],</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">feed_dict</span><span class="p">)</span>
</pre></div>Image Storage For Deep Learning, Raw or JPG?2015-11-07T00:00:00-05:002015-11-07T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-11-07:/image-storage-for-deep-learning-raw-or-jpg.html<p>Caffe requires the user to preload training images into a database, and the
images are stored as raw pixels. The following calculation shows that this is
not a very good idea.</p>
<p>Assume that images are pre-scaled to 256×256, so each raw image costs 256x256x3
= 192KB of storage. On the other hand, 256x256x3 JPEG images compressed with
default parameters cost about 48KB, about 1/4 the storage of raw pixels.
Benchmark shows that a 4-core 2600K can decode jpeg images of this size at a
rate of 6500/s using all cores, or 1350/s with one core. If we assume only one
core can be allocated for image decoding, the processing power translates to
63MB/s input throughput of JPEG data, and 253MB/s output throughput of raw
pixels. The sequential read throughput of a traditional HDD is about 100MB/s,
which is above the input throughput of one core. So an economical design would
be to store the images compressed with JPEG on a traditional HDD, and decode the
image with a dedicated CPU core. The throughput of Caffe, according to the
website, is about 4ms/image for learning and 1ms/image for predicting on a K40
GPU. So the throughput of the above configuration can well saturate the GPU
power even for predicting. The whole system is nice and balanced, and a
mainstream HDD provides about 3TB of storage. This also leaves some room for future
growth of GPU power and training image size (HDD/SSD grows in capacity rather
than throughput).</p>
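<p>The arithmetic above, collected in one place (all numbers taken from the text):</p>

```python
jpeg_kb, raw_kb = 48, 192   # per-image storage: JPEG vs raw pixels
decode_rate = 1350          # images/s decoded on one core

print(decode_rate * jpeg_kb / 1024.0)  # ~63 MB/s of JPEG read in
print(decode_rate * raw_kb / 1024.0)   # ~253 MB/s of raw pixels out
# A K40 predicting at 1 ms/image consumes at most
print(1000 * jpeg_kb / 1024.0)         # ~47 MB/s of JPEG,
# which one decoding core and one HDD (~100 MB/s sequential) can sustain.
```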
<p>Of course, this all relies on being able to achieve 63MB/s throughput from the
disk, and achieving this on a HDD requires sequential I/O. With images stored
in a database, it requires a very fast SSD to achieve such throughput. That’s
why I developed the PicPoc image storage for deep learning. (Benchmarks show
that sequential reads with LMDB DO achieve raw hardware throughput, whether on
HDD or SSD. The storage overhead of LMDB is also reasonably low, around 3% as I
measured with the ILSVRC 2012 dataset.)</p>
<p>Here are some performance numbers I’ve been achieving with preliminary
experiments.</p>
<p>Importing the fall 2011 version of ImageNet (14 million images stored in 21935 tar
files, totalling about 1.2TB) into PicPoc took about 10 hours. The output is
400GB. The input was on one HDD and the output on another. CPU usage was 213.6%.
Considering that reading 1.2TB from an HDD takes about 3.5 hours and CPU usage was only about
50%, there’s a possibility of doubling the loading throughput. But that’s a one-shot
business, so I’ll say it’s good enough for now. The ILSVRC 2012 training
data, when imported, costs 28GB of storage, as opposed to 173GB when imported into LMDB
as raw pixels as described in Caffe’s documentation. (One doesn’t have to use
raw pixels with LMDB: the Caffe Datum can be used to store encoded images, and
OpenCV supports pretty much all popular image codecs.)</p>
<p>On reading with decoding, the system is able to sustain 120MB/s throughput on a
traditional 1TB HDD. I’ve also created a <a href="https://github.com/aaalgo/caffe-picpoc">Caffe fork with PicPoc backend</a>.</p>How Did Go Come About?2015-10-15T00:00:00-04:002015-10-15T00:00:00-04:00Wei Dongtag:www.wdong.org,2015-10-15:/goshi-zen-yao-lai-de.html<p>pthread is a wrapper around OS threads. Lightweight threads require user-space threads.
Microsoft’s fibers are user-space, but being a Windows-world thing, they never gained much traction.
Implementing user-space threads on multi-core machines is an extremely nasty job, and
Go’s main contribution is actually getting it done. The nastiest part is
handling job stealing: when all the tasks on one OS thread have finished,
it has to go grab work from the other OS threads.
That involves all kinds of synchronization and locking, and with too much locking, performance goes down.
Quite a few companies have probably had people build this internally before, and
the people who can are usually formidably good. A quick comparison shows how hard multi-core really is:
node.js does not support multi-core, and Python, after all these years of struggle, is still crippled.
If you want to learn more, start from man makecontext.
Each user-space thread is really a context, and underneath, each OS thread
manages a pile of contexts. Context switching relies mainly on cooperative scheduling,
rather than the preemptive scheduling used by the OS. That is, at some point a running context
voluntarily gives up execution. People in the Unix world have generally never seen cooperative scheduling.
Windows 3.x used cooperative scheduling, so a thread had to call yield every so often
to give up control. Since programmers can’t be expected to insert a yield every few lines,
many Windows UI and I/O APIs had yield built in. Bad programmers who didn’t know
this would sometimes enter a pure-computation loop with no yield in the middle and
hang the system. The Unix world has been preemptive from the very beginning, and its OS APIs have no
built-in yield; hand-inserting a yield every few lines isn’t feasible either. That is why user-space
threads are nearly impossible in C/C++ in the Unix world, and it is also why Rob Pike insisted on
creating a new language: to wrap a layer around the OS APIs and embed yield in it (plus GC). At the
language level, Go actually innovates nothing, and is even lower-level than many existing languages.
Judged by its starting goal, though, Go has achieved its purpose. By contrast, Windows has a tradition
of yielding inside its APIs, which is why fibers came about there so easily.</p>
<p>C++11’s thread API has a puzzling this_thread::yield, which is there precisely to
leave room for non-preemptive runtimes. In theory, if calling OS APIs directly were forbidden and
all I/O went through the C++ libraries, a user-space-thread runtime could be built for C++. Some
embedded systems have no preemptive scheduling to begin with, and there yield becomes a necessity.</p>How to Profile Yarn App/Container Memory Usage2015-01-13T00:00:00-05:002015-01-13T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-01-13:/how-to-profile-yarn-appcontainer-memory-usage.html<p>Yarn does not provide a tool to profile the memory usage of an app yet, but it does save some instrumentation information to the log. Like this.</p>
<div class="highlight"><pre><span></span><span class="nt">yarn-wdong-nodemanager-washtenaw</span><span class="nc">.log</span><span class="nd">:2015-01-06</span> <span class="nt">14</span><span class="nd">:56:43</span><span class="o">,</span><span class="nt">267</span> <span class="nt">INFO</span> <span class="nt">org</span><span class="nc">.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl</span><span class="o">:</span> <span class="nt">Memory</span> <span class="nt">usage</span> <span class="nt">of</span> <span class="nt">ProcessTree</span> <span class="nt">16669</span> <span class="nt">for</span> <span class="nt">container-id</span> <span class="nt">container_1420574192658_0001_01_000001</span><span class="o">:</span> <span class="nt">277</span><span class="nc">.3</span> <span class="nt">MB</span> <span class="nt">of</span> <span class="nt">9</span> <span class="nt">GB</span> <span class="nt">physical</span> <span class="nt">memory</span> <span class="nt">used</span><span class="o">;</span> <span class="nt">8</span><span class="nc">.9</span> <span class="nt">GB</span> <span class="nt">of</span> <span class="nt">18</span><span class="nc">.9</span> <span class="nt">GB</span> <span class="nt">virtual</span> <span class="nt">memory</span> <span class="nt">used</span>
</pre></div>
<p>The numbers reported are actually the ones based on which Yarn decides to kill processes.</p>
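<p>If you want to pull these numbers out yourself, a minimal sketch (the regular expression is inferred from the log line above; it is not necessarily how the linked script works):</p>

```python
import re

# One ContainersMonitorImpl line, as in the sample above.
LINE = ("Memory usage of ProcessTree 16669 for container-id "
        "container_1420574192658_0001_01_000001: 277.3 MB of 9 GB "
        "physical memory used; 8.9 GB of 18.9 GB virtual memory used")

PAT = re.compile(r"container-id (\S+): ([\d.]+) (MB|GB) of ([\d.]+) GB "
                 r"physical memory used")

m = PAT.search(LINE)
cid = m.group(1)
used_gb = float(m.group(2)) / (1024.0 if m.group(3) == "MB" else 1.0)
limit_gb = float(m.group(4))
print(cid, used_gb, limit_gb)
```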
<p><a href="https://github.com/aaalgo/yarn-memory-tracker">This script</a> analyzes the log and reports maximal memory usage of each container for a particular app.</p>
<p>Sample output</p>
<div class="highlight"><pre><span></span>$ yarn-memory-tracker.sh application_1421176927536_0002 <span class="c1"># a Spark app</span>
<span class="m">383</span> containers found <span class="k">for</span> app application_1421176927536_0002
container_1421176927536_0001_01_000001: 0.254785 of 16.4 GB
container_1421176927536_0001_01_000002: 16.2 of 51.4 GB
container_1421176927536_0001_01_000003: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000004: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000005: 12.5 of 51.4 GB
container_1421176927536_0002_01_000001: 0.251563 of 16.4 GB
container_1421176927536_0002_01_000002: 16.1 of 51.4 GB
......
</pre></div>HDFS Demystified: How to Manually Assemble an HDFS File When Hadoop is Down2015-01-08T00:00:00-05:002015-01-08T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-01-08:/hdfs-demistified-how-to-manually-assemble-an-hdfs-file-when-hadoop-is-down.html<p>In this blog I’ll explain how to manually assemble a file stored in HDFS using 1) metadata on the namenode and 2) blocks on the datanodes. The process does not require the Hadoop system to be up and running. It’s an interesting exercise to gain some knowledge about the internal mechanisms of Hadoop, and such knowledge can be handy when it comes to data recovery.</p>
<p>(1) Fetch the fsimage.</p>
<p>Below is a tree view of what are stored in the namenode directory.</p>
<div class="highlight"><pre><span></span>$ <span class="nb">cd</span> HADOOP_NAMENODE_DIR
$ tree .
.
<span class="p">|</span>-- current
<span class="p">|</span> <span class="p">|</span>-- VERSION
<span class="p">|</span> <span class="p">|</span>-- edits_0000000000005342851-0000000000005440884
<span class="p">|</span> <span class="p">|</span>-- edits_0000000000006347975-0000000000006347976
... ...
<span class="p">|</span> <span class="p">|</span>-- edits_0000000000006347977-0000000000006347986
<span class="p">|</span> <span class="p">|</span>-- edits_inprogress_0000000000006347987
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347976
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347976.md5
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347986
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347986.md5
<span class="p">|</span> <span class="sb">`</span>-- seen_txid
<span class="sb">`</span>-- in_use.lock
<span class="m">1</span> directory, <span class="m">50</span> files
</pre></div>
<p>We are interested in the fsimage file with the largest suffix number. Copy it out as “fsimage”. If our file in HDFS was recently uploaded or modified, its full metadata might not be present in the fsimage; some of the data could still be in one of the edits_ files, and we won’t be able to fully assemble the most recent version of the file. One way to force Hadoop to produce a new checkpoint is to restart Hadoop: edit logs are merged into a new fsimage upon restart.</p>
<p>Before proceeding to the next step, it is useful to examine the content of the VERSION file.</p>
<div class="highlight"><pre><span></span>$ cat VERSION
<span class="c1">#Wed Oct 01 01:25:37 CST 2014</span>
<span class="nv">namespaceID</span><span class="o">=</span>1453566641
<span class="nv">clusterID</span><span class="o">=</span>CID-a5f06877-24b3-4892-9dcf-05fccf827889
<span class="nv">cTime</span><span class="o">=</span>0
<span class="nv">storageType</span><span class="o">=</span>NAME_NODE
<span class="nv">blockpoolID</span><span class="o">=</span>BP-908018994-10.10.2.27-1412043710870
<span class="nv">layoutVersion</span><span class="o">=</span>-56
</pre></div>
<p>We’ll need the blockpoolID information.</p>
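<p>The VERSION file is a simple key=value properties file, so grabbing the blockpoolID programmatically is straightforward (a sketch):</p>

```python
def read_version(path="VERSION"):
    # Parse a Hadoop VERSION file: key=value lines, '#' comment lines.
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                props[key] = value
    return props

# e.g. read_version()["blockpoolID"]
# -> "BP-908018994-10.10.2.27-1412043710870" for the sample above
```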
<p>(2) Examine the content of fsimage.</p>
<p>Use the following command to dump the content of fsimage to an XML file.</p>
<div class="highlight"><pre><span></span>$ hdfs oiv -i fsimage -o fsimage.xml -p XML
</pre></div>
<p>Let’s say we are interested in recovering the file “/user/home/playtime_20140915.txt”. We can find the following relevant information in the fsimage dump.</p>
<div class="highlight"><pre><span></span><span class="nt"><inode><id></span>16392<span class="nt"></id><type></span>FILE<span class="nt"></type><name></span>playtime_20140915.txt<span class="nt"></name><replication></span>2<span class="nt"></replication><mtime></span>1412052903661<span class="nt"></mtime><atime></span>1418937665301<span class="nt"></atime><perferredBlockSize></span>134217728<span class="nt"></perferredBlockSize><permission></span>wdong:supergroup:rw-r--r--<span class="nt"></permission><blocks><block><id></span>1073741825<span class="nt"></id><genstamp></span>1001<span class="nt"></genstamp><numBytes></span>134217728<span class="nt"></numBytes></block></span>
<span class="nt"><block><id></span>1073741826<span class="nt"></id><genstamp></span>1002<span class="nt"></genstamp><numBytes></span>134217728<span class="nt"></numBytes></block></span>
<span class="nt"><block><id></span>1073741827<span class="nt"></id><genstamp></span>1003<span class="nt"></genstamp><numBytes></span>49999484<span class="nt"></numBytes></block></span>
<span class="nt"></blocks></span>
<span class="nt"></inode></span>
</pre></div>
<p>We can extract the list of block IDs by either eyeballing or programming.</p>
<div class="highlight"><pre><span></span>1073741825
1073741826
1073741827
</pre></div>
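<p>The “programming” route can be as simple as walking the XML dump with the standard library (a sketch using the element names seen in the inode above):</p>

```python
import xml.etree.ElementTree as ET

def blocks_of(fsimage_xml, filename):
    # Return (block id, byte count) pairs for the named file,
    # matching inodes by their <name> element.
    root = ET.parse(fsimage_xml).getroot()
    for inode in root.iter("inode"):
        if inode.findtext("name") == filename:
            return [(int(b.findtext("id")), int(b.findtext("numBytes")))
                    for b in inode.iter("block")]
    return []
```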
<p>We can also add up the number of bytes (318434940). If Hadoop is up, we can verify if the file size is correct.</p>
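<p>Once the blocks have been located and copied somewhere (step 3 below), assembling the file is just concatenating the blk_ files in the order their IDs appear in the inode. A sketch, with a hypothetical directory of collected blocks:</p>

```python
import shutil

def assemble(block_ids, block_dir, out_path):
    # Concatenate blk_<id> files in inode order to rebuild the HDFS file.
    with open(out_path, "wb") as out:
        for bid in block_ids:
            with open("%s/blk_%d" % (block_dir, bid), "rb") as blk:
                shutil.copyfileobj(blk, out)

# e.g. assemble([1073741825, 1073741826, 1073741827], "blocks",
#               "playtime_20140915.txt")
```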
<p>(3). Gather the blocks.</p>
<p>Hadoop does not maintain an on-disk file or database mapping block IDs to nodes. This is actually a nice stateless design. We’ll need to manually enumerate each node to find the blocks we need. Here’s a sample layout of a Hadoop data directory.</p>
<div class="highlight"><pre><span></span>$ tree .
.
<span class="p">|</span>-- current
<span class="p">|</span> <span class="p">|</span>-- BP-908018994-10.10.2.27-1412043710870
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- current
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- VERSION
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- dfsUsed
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- finalized
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073743127
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073834270_93446.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073752388_11564.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073801146
......
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- blk_1073801146_60322.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- rbw
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1074397675
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1074397675_656923.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1074397684
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- blk_1074397684_656932.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- dncp_block_verification.log.curr
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- dncp_block_verification.log.prev
<span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- tmp
<span class="p">|</span> <span class="sb">`</span>-- VERSION
<span class="sb">`</span>-- in_use.lock
<span class="m">390</span> directories, <span class="m">16252</span> files
</pre></div>
<p>Here we see the blockpoolId we noted before used as a directory name. Blocks are simply named blk_ID inside one of the subdirectories.</p>
<p>In our cluster, the data directories are mounted as “/data/hadoop/data*/”, so it is quite easy to launch a cluster-wide search with pdsh.</p>
<div class="highlight"><pre><span></span>$ pdsh <span class="s2">"find /data/hadoop/data*/ -name blk_1073741825"</span>
klose4: /data/hadoop/data3/current/BP-908018994-10.10.2.27-1412043710870/current/finalized/subdir56/blk_1073741825
klose2: /data/hadoop/data2/current/BP-908018994-10.10.2.27-1412043710870/current/finalized/blk_1073741825
</pre></div>
<p>We see that the block has two replicas. We can modify the above command a little to copy the files over:</p>
<div class="highlight"><pre><span></span>$ <span class="k">for</span> B in <span class="m">1073741825</span> <span class="m">1073741826</span> <span class="m">1073741827</span> <span class="p">;</span> <span class="k">do</span> pdsh <span class="s2">"find /data/hadoop/data*/ -name blk_</span><span class="nv">$B</span><span class="s2">"</span> <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> a b<span class="p">;</span> <span class="k">do</span> scp <span class="nv">$a$b</span> . <span class="p">;</span> break<span class="p">;</span> <span class="k">done</span> <span class="p">;</span> <span class="k">done</span>
</pre></div>
<p>The break command stops us from copying more than one replica of each block. (pdsh prefixes each output line with “host: ”, so read splits it into a=“host:” and b=path, and $a$b is exactly the host:path form scp expects.)</p>
<p>(4). Assemble the file.</p>
<div class="highlight"><pre><span></span>$ cat blk_1073741825 blk_1073741826 blk_1073741827 <span class="p">|</span> md5sum
ad07d7ced9c9210b4a4b14d08c0d146f -
$ hdfs dfs -cat playtime_20140915.txt <span class="p">|</span> md5sum <span class="c1"># only when hadoop is up.</span>
ad07d7ced9c9210b4a4b14d08c0d146f -
</pre></div>
<p>Bingo!</p>Spark on Yarn: Where Have All the Memory Gone?2015-01-08T00:00:00-05:002015-01-08T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-01-08:/spark-on-yarn-where-have-all-the-memory-gone.html<p>Efficient processing of big data, especially with Spark, is really all about how much memory one can afford, or how efficiently one can use the limited amount of available memory. Efficient memory utilization, however, is not something one can take for granted with the default configuration shipped with Spark and Yarn. Rather, it takes very careful provisioning and tuning to get as much as possible out of the bare metal. In this post I’ll demonstrate a case where not-so-careful configuration of Spark on Yarn leads to poor memory utilization for caching, explain the math behind all the observed numbers, and give some tips on parameter tuning to address the problem.</p>
<p>A little bit of background first. I’m not working at one of those big-name companies, and do not have thousands of machines at my disposal; my group has fewer than ten for data crunching. Months of experimental usage of Spark in a standalone configuration showed very promising results, and we want to use those few machines for both production and development. That is, we want to be able to run two Spark apps in parallel. Since we already have Hadoop, Spark on Yarn seems a natural choice. It is not difficult at all to set up Spark on Yarn, but I quickly found that I was not able to fire up the second instance of Spark because, as seen by Yarn, I was out of memory. I’ll use the following simplified one-machine setup to demonstrate the problem.</p>
<h1>1. Demonstration of the Problem</h1>
<p>This demo was run on a desktop with 64g memory. I used the following setting:</p>
<div class="highlight"><pre><span></span>yarn.nodemanager.resource.memory-mb = 49152 # 48G
yarn.scheduler.maximum-allocation-mb = 24576 # 24G
SPARK_EXECUTOR_INSTANCES=1
SPARK_EXECUTOR_MEMORY=18G
SPARK_DRIVER_MEMORY=4G
</pre></div>
<p>The Yarn parameters went into yarn-site.xml, and the Spark ones in spark-env.sh. I didn’t set any other memory-related parameters.</p>
<p>So the total memory allocated to Yarn was 48G, with a 24G maximum for one app. Spark should use 18+4 = 22G of memory, which is below the 24G cap, so I should have been able to run two Spark apps in parallel.</p>
<p>Following are the numbers I got from all the logs and Web UIs when I actually fired up one Spark app.</p>
<ul>
<li>(Yarn) Memory Total: 48G</li>
<li>(Yarn) Memory Used: 25G</li>
<li>(Yarn) Container 1, TotalMemoryNeeded: 5120M</li>
<li>(Yarn) Container 2, TotalMemoryNeeded: 20480M</li>
<li>(Spark) Driver memory requirement: 4480 MB including 384 MB overhead (from the output of spark-shell)</li>
<li>(Spark) Driver available memory to App: 2.1G</li>
<li>(Spark) Executor available memory to App: 9.3G</li>
</ul>
<p>Below are the relevant screenshots.</p>
<p><img alt="ui" src="http://www.wdong.org/memory-yarn.png">
<img alt="ui" src="http://www.wdong.org/memory-spark.png"></p>
<p>So here are the problems that I see with the driver:</p>
<ul>
<li>I’ve configured the Spark driver to use 4G, and Spark asked Yarn for 4G plus an overhead of 384MB.</li>
<li>What is reflected in Yarn is that the driver has used 5G.</li>
<li>What’s really available in the driver’s block manager is only 2.1G.</li>
</ul>
<p>One has to understand that Spark has to reserve a portion of memory for code execution and cannot give everything to the block manager (the cache), but still,</p>
<p>WHERE HAVE ALL THE MEMORY GONE???</p>
<h1>2. The Math Behind</h1>
<p>Rule 1. Yarn always rounds memory requests up to a multiple of yarn.scheduler.minimum-allocation-mb, which defaults to 1024 (1GB). That’s why the driver’s request of 4G+384M showed up as 5G in Yarn. The parameter yarn.scheduler.minimum-allocation-mb is really a “minimum-allocation-unit-mb”. This can be easily verified by setting the parameter to a prime number, such as 97, and observing that Yarn allocates in multiples of that number.</p>
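Rule 1 can be sketched in a couple of lines of shell, using the driver request from the logs above:

```shell
# Round a request up to the next multiple of yarn.scheduler.minimum-allocation-mb
unit=1024                                      # default minimum-allocation-mb
request=$(( 4096 + 384 ))                      # driver: 4G plus 384M overhead
granted=$(( (request + unit - 1) / unit * unit ))
echo "$granted"    # 5120 -- the 5G that Yarn reports
```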
<p>Rule 2. Spark adds an overhead to SPARK_EXECUTOR_MEMORY/SPARK_DRIVER_MEMORY before asking Yarn for that amount. The overhead rule is the same for both executor and driver:</p>
<div class="highlight"><pre><span></span>//yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala
val MEMORY_OVERHEAD_FACTOR = 0.07
val MEMORY_OVERHEAD_MIN = 384
//yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
protected val memoryOverhead: Int = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
......
val totalExecutorMemory = executorMemory + memoryOverhead
numPendingAllocate.addAndGet(missing)
logInfo(s"Will allocate $missing executor containers, each with $totalExecutorMemory MB " +
  s"memory including $memoryOverhead MB overhead")
</pre></div>
<p>This overhead is necessary because when a JVM program is allowed a certain amount of heap (via -Xmx), the overall memory usage of the JVM process can exceed that amount, and Yarn literally kills programs that use more memory than allowed (following complicated rules). One can only adjust the two magic numbers by modifying the source.</p>
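The overhead rule boils down to max(0.07 × memory, 384) MB. A small shell sketch of it reproduces both numbers from the demo:

```shell
# Overhead per YarnAllocator: max(7% of the requested MB, 384), integer-truncated
overhead () {
    o=$(( $1 * 7 / 100 ))
    [ "$o" -lt 384 ] && o=384
    echo "$o"
}
echo "$(overhead 18432)"   # executor 18G -> 1290
echo "$(overhead 4096)"    # driver   4G  -> 384 (7% is only 286, below the floor)
```

Combined with Rule 1, the executor’s 18432+1290 = 19722MB rounds up to 20480MB and the driver’s 4096+384 = 4480MB rounds up to 5120MB, exactly the two container sizes Yarn reported.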
<p>The above two rules determine how the configured SPARK_XXX_MEMORY finally show up in Yarn.</p>
<p>Rule 3. How much memory the driver/executor sees.</p>
<p>One limits the maximal heap memory of the JVM with the option “-Xmx”. Part of the specified memory gets used by the Scala runtime and other system components, so what a Scala program sees is less than the specified amount. This can be illustrated with the following example.</p>
<div class="highlight"><pre><span></span>$ scala -J-Xmx4g
Welcome to Scala version 2.10.3 <span class="o">(</span>OpenJDK 64-Bit Server VM, Java 1.7.0_51<span class="o">)</span>.
Type in expressions to have them evaluated.
Type :help <span class="k">for</span> more information.
scala> Runtime.getRuntime.maxMemory
res0: <span class="nv">Long</span> <span class="o">=</span> 3817865216
scala>
</pre></div>
<p>The runtime eats about 455M. (The above process has an RSS of 140.3M in Linux, so a big portion of the 455M is reserved rather than actually used.)</p>
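The 455M figure is just the 4G of -Xmx minus the maxMemory reported above:

```shell
# 4G heap minus the maxMemory seen by the Scala REPL, converted to MiB
reserved=$(awk 'BEGIN { print (4 * 1024 * 1024 * 1024 - 3817865216) / (1024 * 1024) }')
echo "$reserved MiB"    # 455 MiB
```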
<p>The Spark driver is allocated the configured 4G by JVM options. This can be verified by running the following from inside the Spark shell.</p>
<div class="highlight"><pre><span></span><span class="n">scala</span><span class="o">></span>
<span class="n">scala</span><span class="o">></span> <span class="kn">import</span> <span class="nn">java.lang.management.ManagementFactory</span>
<span class="kn">import</span> <span class="nn">java.lang.management.ManagementFactory</span>
<span class="n">scala</span><span class="o">></span> <span class="n">ManagementFactory</span><span class="o">.</span><span class="n">getRuntimeMXBean</span><span class="o">.</span><span class="n">getInputArguments</span>
<span class="n">res0</span><span class="p">:</span> <span class="n">java</span><span class="o">.</span><span class="n">util</span><span class="o">.</span><span class="n">List</span><span class="p">[</span><span class="n">String</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="n">XX</span><span class="p">:</span><span class="n">MaxPermSize</span><span class="o">=</span><span class="mi">128</span><span class="n">m</span><span class="p">,</span> <span class="o">-</span><span class="n">Djava</span><span class="o">.</span><span class="n">library</span><span class="o">.</span><span class="n">path</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">hadoop</span><span class="o">/</span><span class="n">hadoop</span><span class="o">-</span><span class="mf">2.4</span><span class="o">.</span><span class="mi">1</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">native</span><span class="p">,</span> <span class="o">-</span><span class="n">Xms4G</span><span class="p">,</span> <span class="o">-</span><span class="n">Xmx4G</span><span class="p">]</span>
<span class="n">scala</span><span class="o">></span> <span class="n">Runtime</span><span class="o">.</span><span class="n">getRuntime</span><span class="o">.</span><span class="n">maxMemory</span>
<span class="n">res1</span><span class="p">:</span> <span class="n">Long</span> <span class="o">=</span> <span class="mi">4116709376</span>
<span class="n">scala</span><span class="o">></span>
</pre></div>
<p>Rule 4. How Spark determines the maximal usable memory</p>
<div class="highlight"><pre><span></span>//core/src/main/scala/org/apache/spark/storage/BlockManager.scala
/** Return the total amount of storage memory available. */
private def getMaxMemory(conf: SparkConf): Long = {
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
}
</pre></div>
<p>We have 4116709376 * 0.6 * 0.9 = 2.07G; that is where the 2.1G value comes from. The maximal available memory of the executor is derived the same way.</p>
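Plugging the driver’s maxMemory into Rule 4:

```shell
# maxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction
usable=$(awk 'BEGIN { printf "%.0f", 4116709376 * 0.6 * 0.9 }')
echo "$usable"    # 2223023063 bytes, about 2.07 GiB -- the 2.1G shown in the UI
```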
<p>Overall, the following two formulas guide memory allocation:</p>
<ul>
<li>What’s seen by Yarn: (SPECIFIED_MEMORY + OVERHEAD), rounded up to a multiple of minimum-allocation-mb, with OVERHEAD = max(SPECIFIED_MEMORY * 0.07, 384M)</li>
<li>What’s usable for cache: (SPECIFIED_MEMORY - MEMORY_USED_BY_RUNTIME) * spark.storage.memoryFraction * spark.storage.safetyFraction</li>
</ul>
<h1>3. Tuning Suggestions</h1>
<p>We see that the root of the memory underutilization is the over-provisioning done at almost every step. Even if a process really reaches its configured memory cap, it is unlikely to keep using that much memory all the time. Because Yarn actually kills a process when it exceeds the memory cap, we have to keep SPARK_XXX_MEMORY big enough. It is also very difficult to determine how much memory Spark code execution actually uses, so tuning spark.storage.memoryFraction is tricky. But if one is sure that the overall memory consumption of the parallel Spark apps is unlikely to exceed the physical memory, the easiest way to improve memory utilization is to counter the over-provisioning with overcommitment. That is, set the Yarn parameter yarn.nodemanager.resource.memory-mb to MORE THAN THE AVAILABLE PHYSICAL MEMORY (luckily Yarn does not check that). It also helps a little to set yarn.scheduler.minimum-allocation-mb to a small value like 100M, so an app does not get much more than what it asks for.</p>C++11 Multithreading Guide2014-12-03T00:00:00-05:002014-12-03T00:00:00-05:00Wei Dongtag:www.wdong.org,2014-12-03:/c11duo-xian-cheng-gong-lue.html<p><a href="http://www.wdong.org/thread-tutorial.cpp">Tutorial in one program (contains Chinese).</a></p>
<div class="highlight"><pre><span></span><span class="cp">#include</span> <span class="cpf"><vector></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><thread></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><chrono></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><mutex></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><future></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><stdexcept></span><span class="cp"></span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">cerr</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">;</span>
<span class="k">namespace</span> <span class="n">chrono</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="p">;</span>
<span class="k">namespace</span> <span class="n">this_thread</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="p">;</span>
<span class="c1">// Implementing threads explicitly with std::thread</span>
<span class="c1">// Constructing a thread from a callback function</span>
<span class="c1">// Note how the std::thread constructor forwards an arbitrary set of arguments to the callback.</span>
<span class="c1">// This is a new C++11 feature. Below is the prototype of the std::thread constructor:</span>
<span class="c1">//</span>
<span class="c1">// template< class Function, class... Args ></span>
<span class="c1">// explicit thread::thread( Function&& f, Args&&... args );</span>
<span class="c1">//</span>
<span class="c1">// Note that the first parameter is also a template type, so you can pass in a function pointer,</span>
<span class="c1">// a std::function object, a lambda, or any class that overloads the () operator.</span>
<span class="c1">// The function below implements one thread's computation; it will be used as the callback to create threads</span>
<span class="kt">void</span> <span class="nf">fun</span> <span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"th "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="s">": "</span> <span class="o"><<</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// Sleep for a while</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"wait 1 second."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">seconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"wait 10 milliseconds."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="mi">10</span><span class="p">));</span>
<span class="c1">// Another way to sleep</span>
<span class="k">auto</span> <span class="n">now</span> <span class="o">=</span> <span class="n">chrono</span><span class="o">::</span><span class="n">system_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"wait another 10 milliseconds."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_until</span><span class="p">(</span><span class="n">now</span> <span class="o">+</span> <span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="mi">10</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">fun_sync</span> <span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">demo_thread</span> <span class="p">()</span> <span class="p">{</span>
<span class="c1">// Note A: a thread object must have join or detach called on it before it is destructed,</span>
<span class="c1">// otherwise the program aborts</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">th</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="c1">// Wait for the thread to finish</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">th</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="c1">// Detach from the thread and let it run to completion on its own</span>
<span class="n">th</span><span class="p">.</span><span class="n">detach</span><span class="p">();</span>
<span class="c1">// After detach, the thread can no longer be controlled through th</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="c1">// The code below would abort</span>
<span class="cm">/*</span>
<span class="cm"> std::thread th(fun, 1, 2);</span>
<span class="cm"> */</span>
<span class="p">}</span>
<span class="c1">// Start a thread with a lambda</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">th</span><span class="p">(</span> <span class="p">[](</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"thread with lambda "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="s">": "</span> <span class="o"><<</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">},</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// std::thread and the = operator</span>
<span class="c1">// A thread object is in one of two states:</span>
<span class="c1">// 1. associated with a thread</span>
<span class="c1">// 2. not associated with any thread</span>
<span class="c1">// Note A above only applies to thread objects in state 1</span>
<span class="c1">//</span>
<span class="c1">// A thread object constructed without arguments is in state 2</span>
<span class="c1">// and can be destructed directly</span>
<span class="c1">//</span>
<span class="c1">// Assignment with = on std::thread performs a move, not a copy.</span>
<span class="c1">// That is, after</span>
<span class="c1">// A = B;</span>
<span class="c1">// if B was associated with a thread, then once the operation completes</span>
<span class="c1">// - the associated thread is transferred to A</span>
<span class="c1">// - B is no longer associated with the original thread</span>
<span class="c1">//</span>
<span class="c1">// Question: what if A was already associated with some thread before the =?</span>
<span class="c1">// I don't know the answer either :)</span>
<span class="c1">// The safest approach is to make sure the thread object is empty before =,</span>
<span class="c1">// or simply avoid = altogether, unless...</span>
<span class="p">{</span>
<span class="c1">// Manage threads with a vector</span>
<span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">></span> <span class="n">threads</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="c1">// Start 10 threads</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">threads</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// note the move assignment</span>
<span class="p">}</span>
<span class="c1">// Wait for all of them to finish</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">th</span><span class="p">:</span> <span class="n">threads</span><span class="p">)</span> <span class="p">{</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Without synchronization, the output above sometimes gets garbled,</span>
<span class="c1">// so writes to cout/cerr from inside a thread need to be synchronized.</span>
<span class="c1">// Let's do it again with fun_sync</span>
<span class="p">{</span>
<span class="c1">// Manage threads with a vector</span>
<span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">></span> <span class="n">threads</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="c1">// Start 10 threads</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">threads</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="p">(</span><span class="n">fun_sync</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// note the move assignment</span>
<span class="p">}</span>
<span class="c1">// Wait for all of them to finish</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">th</span><span class="p">:</span> <span class="n">threads</span><span class="p">)</span> <span class="p">{</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Thread synchronization</span>
<span class="kt">void</span> <span class="nf">fun_sync</span> <span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">mutex</span><span class="p">;</span> <span class="c1">// shared by all threads</span>
<span class="c1">// Crude lock/unlock -- do NOT write C++ code like this!!!!!!!!!!</span>
<span class="n">mutex</span><span class="p">.</span><span class="n">lock</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"random sleep"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">mutex</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">%</span> <span class="mi">100</span><span class="p">));</span>
<span class="p">{</span> <span class="c1">// The right way: RAII-style protection with lock_guard</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">guard</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"th "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="s">": "</span> <span class="o"><<</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Question 1. In C++, paired operations like open/close, lock/unlock, malloc/free, new/delete</span>
<span class="c1">// are best wrapped RAII-style in an object.</span>
<span class="c1">// Why??? (answer at the end)</span>
<span class="c1">// Question 2. What if you really must create the object dynamically?</span>
<span class="p">}</span>
<span class="c1">// A plain mutex covers the vast majority of needs; fancier mutexes can be found here</span>
<span class="c1">// http://en.cppreference.com/w/cpp/header/mutex</span>
<span class="kt">void</span> <span class="nf">demo_async</span> <span class="p">()</span> <span class="p">{</span>
<span class="c1">// 3. Asynchronous computation with async</span>
<span class="c1">// A traditional function call starts the computation and fetches the return value on one line, e.g.</span>
<span class="k">auto</span> <span class="n">plus</span> <span class="o">=</span> <span class="p">[](</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;};</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">plus</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"1 + 2 = "</span> <span class="o"><<</span> <span class="n">c</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// The point of asynchronous computation is to separate starting the computation</span>
<span class="c1">// from fetching the result, while guaranteeing that the computation has finished</span>
<span class="c1">// by the time the result is fetched. This separation gives great freedom over</span>
<span class="c1">// when and where the computation runs, enabling all kinds of optimizations.</span>
<span class="c1">//</span>
<span class="c1">// Asynchronous computation in C++ uses the following features:</span>
<span class="c1">// - std::async, for launching an asynchronous computation</span>
<span class="c1">// - std::future, for retrieving the result</span>
<span class="p">{</span> <span class="c1">// Redo the above with asynchronous computation</span>
<span class="n">std</span><span class="o">::</span><span class="n">future</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">c</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">plus</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="c1">// Do all kinds of other work here</span>
<span class="c1">// Then fetch the result</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"async: 1 + 2 = "</span> <span class="o"><<</span> <span class="n">c</span><span class="p">.</span><span class="n">get</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// C++ guarantees the asynchronous computation has completed by the time future::get() returns</span>
<span class="c1">//</span>
<span class="p">}</span>
<span class="c1">// The compiler can deduce the type of the future, so this works too:</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">future</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">plus</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">future</span><span class="p">.</span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// If a thread needs no management after launch (i.e. it would be detached right after creation),</span>
<span class="c1">// std::async can start it without explicitly creating a std::thread object</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"DEMO ASYNC."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"1st async done."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// !!!!!!!!TODO Watch the output carefully.</span>
<span class="c1">// (Very likely) you will find that fun was never run!</span>
<span class="c1">// Why????????</span>
<span class="c1">//</span>
<span class="c1">// async is not necessarily parallel. C++'s async implements two modes: parallel execution and lazy evaluation.</span>
<span class="c1">// (Note the difference between "asynchronous" and "parallel".)</span>
<span class="c1">// When calling async you can pass std::launch::async or std::launch::deferred to choose. If neither is</span>
<span class="c1">// passed, the default is up to the implementation. With g++ it is deferred, i.e. lazy evaluation:</span>
<span class="c1">// the return value is computed only when it is actually needed. Like this:</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"DEMO ASYNC, lazy evaluation."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"global thread id: "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">future</span><span class="o"><</span><span class="kt">void</span><span class="o">></span> <span class="n">future</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">launch</span><span class="o">::</span><span class="n">deferred</span><span class="p">,</span> <span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">future</span><span class="p">.</span><span class="n">get</span><span class="p">();</span> <span class="c1">// even if the function returns void, get() is needed to guarantee it runs</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"You should see output from fun now."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// Note: the thread ID printed inside fun is the same as the global one, i.e. async did not</span>
<span class="c1">// actually create a thread; it merely ran fun at the time of future.get()</span>
<span class="p">}</span>
<span class="c1">// Below we pass std::launch::async to force async to actually create a thread</span>
<span class="c1">// A task launched with launch::async completes even if future::get is never called</span>
<span class="c1">// (The naming here is a bit confusing; I think "std::launch::parallel" would better convey parallel execution.)</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"global thread id: "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"run with launch::async"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">launch</span><span class="o">::</span><span class="n">async</span><span class="p">,</span> <span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"You should see output from fun now."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Asynchronous exception handling</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">thrower</span> <span class="o">=</span> <span class="p">[](){</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"we are going to throw."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">"hello, world!"</span><span class="p">);};</span>
<span class="k">auto</span> <span class="n">future</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">launch</span><span class="o">::</span><span class="n">async</span><span class="p">,</span> <span class="n">thrower</span><span class="p">);</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">seconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"exception was already thrown in another thread."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">try</span> <span class="p">{</span> <span class="c1">// catch the exception asynchronously</span>
<span class="n">future</span><span class="p">.</span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">catch</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span> <span class="k">const</span> <span class="o">&</span><span class="n">e</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cerr</span> <span class="o"><<</span> <span class="n">e</span><span class="p">.</span><span class="n">what</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span> <span class="p">()</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Hardware concurrency: "</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">::</span><span class="n">hardware_concurrency</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">demo_thread</span><span class="p">();</span>
<span class="n">demo_async</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Answers</span>
<span class="c1">// Question 1:</span>
<span class="c1">// If an exception is thrown between open and close, close is skipped, causing errors.</span>
<span class="c1">//</span>
<span class="c1">// Question 2:</span>
<span class="c1">// Avoid new/delete in C++ whenever possible. When you must allocate dynamically, the best</span>
<span class="c1">// approach is to wrap the result of new in a unique_ptr or shared_ptr immediately. In fact</span>
<span class="c1">// the smart-pointer factory functions can construct the object without spelling out new, e.g.</span>
<span class="c1">// auto p_th = std::make_shared<std::thread>(fun, 1, 2);</span>
<span class="c1">//</span>
<span class="c1">// Or simply use a std::vector of length 1.</span>
<span class="c1">//</span>
<span class="c1">// Unless a catch (...) between new and delete is guaranteed to catch every exception,</span>
<span class="c1">// the delete may be skipped.</span>
</pre></div>The Dark Truth Behind the Power of Monads (and why it’s OK you cannot master it)2014-11-27T00:00:00-05:002014-11-27T00:00:00-05:00Wei Dongtag:www.wdong.org,2014-11-27:/the-dark-truth-behind-the-power-of-monads-and-why-its-ok-you-cannot-master-it.html<p>Years ago when I was in graduate school, there was a period when I became very obsessed with functional programming and the Haskell programming language. (At the dawn of the era of multi-core computation, it was believed that functional programming was one of the most promising technologies that would save the world.) Unlike many other earlier functional languages which allow ad hoc imperative constructions when it comes to IO-related tasks, Haskell is elegant and purely functional. When IO must be done, one programs in Haskell with a mindset that he/she is composing a combination of IO (and other) instructions. The IO instructions, as instructions rather than the realization of them, are purely mathematical and do not interact with the real world. The computation of a Haskell program produces something called an IO monad, and all side effects only take place when the runtime executes the IO monad, which happens outside the functional programming realm. One can argue that this is only a perspective, because a piece of, say, C++ code, as code per se, does not have any side effects either. Side effects only take place when a binary is executed, and the binary is not C++; it’s outside the realm of C++ programming. Well, yes, functional programming IS only a perspective, but imperative programmers just do not think in that way.</p>
<p>Now let's come back to the mysterious monad. It turns out that monads are much, much more powerful than merely an abstraction of IO operations. A whole lot of apparently unrelated programming constructions can be realized with monads, and the resulting code is usually very simple and beautiful. There are monads for old-school pessimistic error handling (the Maybe monad), exceptions, lists, state machines, parsers, even software transactional memory (STM) and structured query language (SQL). You name it! I felt my horizons broaden, and was fascinated by the prospect that I'd become a very powerful programmer if I mastered the unqualified, generic MONAD.</p>
<p>So I dug out all the articles and tutorials about monads I could find on the Internet (well, except for textbooks on category theory; I was trying to find a shortcut). And after a long struggle, I found myself defeated by the fact that I could not comprehend this programming construction, as simple as three laws, among which at least two, if not all three, are in my opinion trivial! I didn’t have any problem with any one of the specialized monads, and I understood every letter of the monad laws. But the generic monad, I just couldn’t get the hang of it.</p>
<p>Life goes on without one becoming a master monadic programmer. I came back to my bread-earning systems and machine learning programming, and was glad to find that I could still do whatever I needed to do with the good old imperative programming. And with the introduction of lambdas (be cautious: the lambdas in C++ and other imperative languages are anything but functional!) and other nice features to C++11, life after all is becoming better. But deep down inside, this monad thing kept bothering me, and has been, as now I see, fermenting.</p>
<p>Then there came a day, when I was fighting a religious programming-language war in a web forum (always a good pastime), that the enlightenment suddenly came to me. And here I’m sharing with you the dark truth behind the power of monads I finally came to understand.</p>
<p>Behind all powers of monads, there is only one true power: the parsing power. Monads can be used in a parser to create internal data structures representing the parsing results of a context-free language (strictly speaking a little more than that, but it doesn’t matter here), to which almost all programming languages belong. The only property that is common to all those specialized monads is that they can all be written in a context-free language which can be parsed. The process of monadic programming is nothing more than mentally parsing a piece of imaginary imperative or whatever code and writing down the “internal” representation in monads. It naturally follows that one can mimic with monads any (context-free) programming constructions that can ever be invented. And I’m pretty sure the process can be automated if one doesn’t have to preserve all the syntactic sugar. My hope to become a master programmer by learning to use monads, put in layman’s terms, is not any more realistic than the hope to become a serial inventor by learning the art of creating text files!</p>
<p>Monads and their equivalents are still nice and powerful as a language for invention. They make it possible to (re-)invent programming constructions without having to create new languages. But very likely they don’t make invention itself any easier. One piece of evidence is that all the monads we have seen so far have preexisting counterparts in other languages and libraries. At least now I don’t feel so bad that I cannot become a master monadic programmer.</p>
<p>And speaking of syntactic sugar, people have been feeding on it for a while, and I don’t mind having more of it.</p>Building Portable C/C++ Programs and Libraries for the Linux World2014-10-03T00:00:00-04:002014-10-03T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-10-03:/buliding-portable-cc-programs-and-libraries-for-the-linux-world.html<p>I have to admit that pulling source code from github has become the mainstream development mode, and virtualization has become the mainstream deployment mode, but many times it is still desirable to have a piece of software delivered in the form of a binary executable or library. There are many reasons one wants to do that: one might not want to give away the source code; the client wants the source code but doesn’t have the capability to build it, or simply does not want to spend the human labor to build it. With tools like maven, software versioning is kind of under control in the Java world. But the C/C++ world is not as lucky, and with all the github code that usually does not even have a version number, building a C/C++ code repository with dependencies is not for everybody.</p>
<p>Virtualization helps to contain all the dependencies, and it is a good solution for bigger software components like web services. One just delivers a VM image and everything is taken care of. But the use cases for C/C++ are usually small performance-critical components that must be tightly integrated with code written in other languages, and the overhead of virtualization is usually too high.</p>
<p>But the Linux world is notoriously heterogeneous. We are living in a world with Linux kernel 3.x, Ubuntu 14.x and RHEL 7.x, but almost all the companies and university labs whose machines I got a chance to log into are still using CentOS 5.x for production and research (RHEL 5 was first released in 2007) -- once you’ve got a cluster set up, it’s virtually impossible to upgrade the operating system version. On the other hand, you also want your software to run on the newest systems available to today’s startup companies, on everything in between, and hopefully on future systems.</p>
<p>Now generic portability between Linux versions and distributions cannot be achieved by C/C++. That’s why Java was invented. But if one just wants to deliver a single program/library file that contains all the functionality -- thanks to the backward compatibility of the Linux kernel -- this is usually achievable by linking almost all the libraries statically into the program.</p>
<p>The library case is more interesting. We want everything to be contained in the library, including all the libraries we depend on. But we cannot provide a static library, because that way we’d have to expose all the dependencies, and it would cause version conflicts for sure when the client tries to link against the library. So the solution is to develop an (almost) statically linked shared library, plus maybe a very small piece of interfacing code. The KGraph library for similarity search is provided in this form.</p>
<p>Static linking is not the common practice in the Linux world. All software packages are distributed with shared libraries, and if one chooses to build something from source code, shared libraries are produced by default. But fortunately, most software packages use the automake system, and static linking has always been an option which can be easily enabled by adding “--enable-static --disable-shared” to the configure invocation. The “--disable-shared” part is important because without it, a shared library will also be produced, and the default behavior of gcc is to link against the shared library. One can force gcc to link statically by adding “-static”, but some system APIs won’t work as expected (<a href="http://stackoverflow.com/questions/2725255/create-statically-linked-binary-that-uses-getaddrinfo">getaddrinfo will lose the ability to resolve hostnames</a>; update: solved with <a href="https://c-ares.haxx.se/">c-ares</a>). Now with “--enable-static” alone, the build system will assume the library is to be used statically and will produce non-relocatable machine code, which we cannot use to produce a shared library. The solution is to export “CFLAGS=-fPIC” and “CXXFLAGS=-fPIC” before running the configure script. These two easy fixes work for most packages, and the rest have to be handled case by case.</p>
<p>It would be misleading to end this blog leaving a novice reader believing static linking is the way to go for everybody. Actually there are <a href="http://www.akkadia.org/drepper/no_static_linking.html">strong arguments against static linking</a> (basically one can gain more with dynamic linking). But there are languages like golang which favor static linking. And for people out there on their own, like me, who do not have a lot of human labor at their disposal and would rather spend time on algorithms than on software packaging, static linking does come in handy.</p>
<p>About the companion box:</p>
<p>This box contains a development environment that is geared towards computation-intensive data processing applications without GUI, like machine learning, image/audio processing and such. It is based on CentOS 5.6 with devtools 2.1 (gcc-4.8). I’ve also installed many libraries using the above method, including Boost, Poco, OpenCV, libav and many others. OpenCV and libav have been tailored to remove GUI and device-related stuff (including media playback), because such functionality relies on components that are hard to make portable.</p>Equivalence between entropy regularization and the softmax function2014-08-08T00:00:00-04:002014-08-08T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-08-08:/equivalence-between-entropy-regularization-and-the-softmax-function.html<p>Never realized the relationship between the two until we came up with a fancy convex programming formulation with entropy regularization to solve a recommendation problem and obtained the well-known simple formula with softmax. Google'd the keywords and didn't see a clear explanation within the first few pages of search results, so I guess it might be helpful to write down a summary of the connection.</p>
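For readers without the image below, here is a sketch of the connection as I understand it (a standard derivation, which may differ in details from the figure): maximizing a linear score plus an entropy regularizer over the probability simplex yields exactly the softmax.

```latex
% Entropy-regularized linear objective over the simplex:
\max_{p \ge 0}\ \sum_i p_i u_i + \lambda H(p),
\qquad H(p) = -\sum_i p_i \log p_i,
\qquad \text{s.t.}\ \sum_i p_i = 1.
% Stationarity of the Lagrangian gives u_i - \lambda(\log p_i + 1) + \mu = 0, hence
p_i = \frac{\exp(u_i/\lambda)}{\sum_j \exp(u_j/\lambda)},
% i.e. the softmax with temperature \lambda.
```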
<p><img alt="ui" src="http://www.wdong.org/entropy.png"></p>Bcache for Ubuntu 14.04 Root Filesystem2014-05-28T00:00:00-04:002014-05-28T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-05-28:/bcache-for-ubuntu-1404-root-filesystem.html<p>It's been a while since bcache made its way into the Linux kernel, but the
installers of most distributions have not yet caught up to allow users to
install to a bcache-backed volume, and user-land tools necessary to make use of
bcache are not yet installed by default. There has been a tutorial on <a
href="https://github.com/g2p/blocks">how to convert the root of an existing
installation into bcache with a tool named blocks</a>, but the code base and
the depending code bases are either too old or too new to be directly usable
with the default python 3.4 setup of Ubuntu 14.04. After a frustrating process
of fixing all compatibility issues, I was able to make the code run, but I
don't trust my own patch (and the stability of github forked code) enough to
apply that on my real data. I ended up with this not-so-drastic, but cleaner and
safer way to get a bcache-backed root filesystem with minimal external
dependencies.</p>
<p>The idea is to (1) install Ubuntu into a normal partition (>= 5GB) which would later be converted to the swap space, (2) setup the bcache, and (3) migrate / to the bcache device.</p>
<p>In my case (lenovo U430P), /dev/sda is a 16G SSD, and /dev/sdb is a 1T HDD.</p>
<h1>1. Initial Installation</h1>
<p>Install Ubuntu 14.04 using the following disk partitioning scheme:
- 64MB EFI partition /dev/sda1, fat32, mounted at /boot/efi. This is not necessary if the machine is booted in the traditional BIOS way.
- 200MB ext4 partition /dev/sda2 to be mounted on /boot. It is necessary to put /boot on a separate partition. I made this on SSD so the kernel loads faster (I haven't compared, but I guess the speedup over HDD -- if not a slowdown -- won't be that obvious, as the kernel is a multi-megabyte file).
- 16GB ext4 partition /dev/sdb1 to be mounted as / during installation. We'll later convert it to a swap partition.
- A big empty partition /dev/sdb2 later to be used as root. Create this partition, but do not use it for now.
- An empty partition /dev/sda3 on SSD later to be used as the cache. Create this partition, but do not use it for now.</p>
<p>The installer will complain about not having a swap space. Ignore that.</p>
<h1>2. Setting Up Bcache</h1>
<p>After installation, boot into the newly installed system, install bcache-tools (from a PPA) and set up the system:</p>
<div class="highlight"><pre><span></span>$ sudo bash
<span class="c1"># add-apt-repository ppa:g2p/storage</span>
<span class="c1"># apt-get update</span>
<span class="c1"># apt-get install bcache-tools</span>
<span class="c1"># make-bcache -C /dev/sda3 -B /dev/sdb2</span>
<span class="c1"># mkfs.ext4 /dev/bcache0</span>
</pre></div>
<h1>3. Migrating Root Filesystem</h1>
<p>Keep working in the newly installed system.</p>
<div class="highlight"><pre><span></span>$ sudo bash
$ mkdir OLD NEW
<span class="c1"># mount /dev/sdb1 OLD # the old root</span>
<span class="c1"># mount /dev/bcache0 NEW # this would be our new root</span>
<span class="c1"># rsync -a OLD/ NEW/ # now NEW contains the root</span>
<span class="c1"># ### mount a series of directories in preparation for grub-install</span>
<span class="c1"># mount /dev/sda2 NEW/boot</span>
<span class="c1"># mount /dev/sda1 NEW/boot/efi</span>
<span class="c1"># mount -o bind /dev NEW/dev</span>
<span class="c1"># mount -t proc none NEW/proc</span>
<span class="c1"># mount -t sysfs none NEW/sys</span>
<span class="c1"># chroot NEW</span>
<span class="c1"># #### find out the UUID of /dev/bcache0 and /dev/sdb1</span>
<span class="c1"># ls -l /dev/disk/by-uuid/ | grep bcache0</span>
lrwxrwxrwx <span class="m">1</span> root root <span class="m">13</span> May <span class="m">29</span> 21:49 4c492013-e8a3-40b5-b5cd-9220ed2e0195 -> ../../bcache0
<span class="c1"># ls -l /dev/disk/by-uuid/ | grep sdb1</span>
lrwxrwxrwx <span class="m">1</span> root root <span class="m">10</span> May <span class="m">29</span> 21:49 765d6fc0-9ff4-4cf4-95f9-17a6e76ae80c -> ../../sdb1
<span class="c1"># vi NEW/etc/fstab NEW/boot/grub.cfg #### edit NEW/etc/fstab and NEW/boot/grub.cfg, replacing every occurrence of the sdb1 UUID with that of bcache0.</span>
<span class="c1"># grub-install /dev/sda</span>
</pre></div>
<h1>4. Final Configurations in New System</h1>
<p>Reboot into the newly installed system. Now the root is on /dev/bcache0. The old data on /dev/sdb1 is not used, and /dev/sdb1 can be converted to the swap space.</p>
<div class="highlight"><pre><span></span>$ sudo bash
<span class="c1"># mkswap /dev/sdb1</span>
Setting up swapspace version 1, <span class="nv">size</span> <span class="o">=</span> <span class="m">15624188</span> KiB
no label, <span class="nv">UUID</span><span class="o">=</span>e35bc636-9944-4dd5-ab3d-6c371b0cb7a8
<span class="c1"># swapon /dev/sdb1</span>
<span class="c1">##### make sure to change the UUID of the command below</span>
<span class="nb">echo</span> <span class="s2">"UUID=e35bc636-9944-4dd5-ab3d-6c371b0cb7a8 none swap defaults 0 0"</span> >> /etc/fstab
</pre></div>
<p>Now I'm having my Ubuntu running happily on bcache, and I hope it's not going to cause any data loss.</p>The Inner Art of Machine Learning: A General Outline2014-05-07T00:00:00-04:002014-05-07T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-05-07:/ji-qi-xue-xi-nei-gong-zong-gang.html<div class="highlight"><pre><span></span>I think the single unifying principle of machine learning is Occam's razor: when the
fit is comparable, the simpler model predicts better. Different definitions of
"simple" give rise to different schools. For example:
1. Fewer feature dimensions is simpler (see the curse of dimensionality). This view
gave rise to the various dimension-reduction algorithms, like PCA and LDA (there are
two completely different LDAs, but both are essentially dimension reduction).
Multi-layer neural networks can also be seen as a dimension-reduction algorithm.
The title of Hinton's autoencoder paper in Science is literally about "reducing the
dimensionality" of data, which shows how important dimension reduction is.
Generalizing dimension reduction leads to data compression; there is even a view
that data compression taken to the extreme is artificial intelligence.
2. Fewer non-zero feature dimensions is simpler. Sparse coding is a very important
research topic in computer vision and related fields. Before neural networks took
off in 2012, almost every image-recognition algorithm used some form of sparse
coding. If a point in D-dimensional space is expressed in a basis of D vectors,
the result is a D-dimensional vector whose entries are almost never zero. Sparse
coding is achieved in two ways: 1. allow some error; 2. enlarge the basis. The
crudest sparse coding is k-means clustering (also called vector quantization).
The more general approach is to add an L1 regularization term when training the
model, which brings us to the next family of methods.
3. Regularization. Even when a vector has many dimensions, shrinking the range of
every dimension is also a form of simplicity. The usual way to shrink the range is
to minimize the norm of the vector (or model). From a statistical viewpoint, if
the data are assumed to be normally distributed, maximum-likelihood estimation is
essentially equivalent to L2 regularization. Learning algorithms of this school
typically solve an optimization problem of the form
    min_M  |f(x;M) - y|  +  a|M|
          (training error)  (regularization)
The mainstream way of solving such problems is gradient descent (SGD). Because
tacking on a regularization term looks so easy, it is often abused: no matter what
exponentials or logarithms appear up front, people just append a regularization
term. As a physicist would put it, not even the units match, let alone the
statistical meaning. Amazingly, it still tends to work to some degree.
4. Reduce the intake of training data. What if the features are points in an
abstract space, with no notion of dimension or norm? One approach is to select the
most representative/critical small subset of the training samples to produce the
model. This perspective gives rise to algorithms like SVM and boosting. In SVM the
key samples are called support vectors, and the number of support vectors is
effectively the dimension of the model. When training an SVM with SGD, a correctly
predicted sample is simply skipped; the model is updated only on mispredicted ones.
Generalizing from this angle, weighting the data/model by goodness of fit
essentially yields boosting. (That is not how boosting was invented, but it is a
fine way to understand it.) Reducing data intake does not mean shrinking the
original training set, but distilling it: if either way you end up distilling down
to a million samples, enlarging the original set from, say, ten million to a
hundred million will still improve prediction accuracy.
These are some insights from my seven or eight years of studying ML, and I hope
they help newcomers. The day you run into a new method that works well but you
cannot figure out why, or cannot make sense of where it came from, try tracing it
back to Occam's razor -- things may suddenly fall into place.
</pre></div>Cost of Padding for Convolution2013-09-03T00:00:00-04:002013-09-03T00:00:00-04:00Wei Dongtag:www.wdong.org,2013-09-03:/cost-of-padding-for-convolution.html<p>Padding an image for convolution sounds like a pain. <a href="https://code.google.com/p/cuda-convnet/">Cuda-convnet</a> has a clever way of implementing it, but I found it would be over-engineering to do a similar optimization for my CPU-based implementation. The following operf profiling result shows that even if I simply do padding by copying to a larger matrix, the overhead is negligible. The last two lines are the cost of the padding node: the percentages of time spent on update and predict are 0.0686% and 8.2e-04%, respectively.</p>
<p>operf output:</p>
<div class="highlight"><pre><span></span>CPU: Intel Sandy Bridge microarchitecture, speed 3.201e+06 MHz <span class="o">(</span>estimated<span class="o">)</span>
Counted CPU_CLK_UNHALTED events <span class="o">(</span>Clock cycles when not halted<span class="o">)</span> with a unit mask of 0x00 <span class="o">(</span>No unit mask<span class="o">)</span> count 100000
samples % image name symbol name
<span class="m">1108937</span> 18.1060 libgomp.so.1.0.0 gomp_barrier_wait_end
<span class="m">1082524</span> 17.6747 libgomp.so.1.0.0 gomp_team_barrier_wait_end
<span class="m">1078260</span> 17.6051 cifar.ptblas MNLOOP
<span class="m">475213</span> 7.7589 cifar.ptblas MNLOOP
<span class="m">374029</span> 6.1069 no-vmlinux /no-vmlinux
<span class="m">316764</span> 5.1719 libc-2.15.so __memmove_ssse3_back
<span class="m">275784</span> 4.5028 cifar.ptblas _ZN6hiperfit6neural10WindowNode6updateEi._omp_fn.17
<span class="m">272491</span> 4.4490 cifar.ptblas ATL_gemoveT_aX
<span class="m">143035</span> 2.3354 cifar.ptblas ATL_scol2blk_a1
<span class="m">110492</span> 1.8040 cifar.ptblas _ZN6hiperfit6neural8PoolNodeINS0_4pool3maxEE7predictEi._omp_fn.3
<span class="m">102437</span> 1.6725 cifar.ptblas MNLOOP
<span class="m">83093</span> 1.3567 cifar.ptblas _ZN6hiperfit5ArrayIfE5applyIZNS_6neural12FunctionNodeINS3_8function4reluEE6updateEiEUlRffffE_EEvRKS1_SB_SB_RKT_._omp_fn.10
<span class="m">79453</span> 1.2973 cifar.ptblas hiperfit::neural::ArrayNode::preupdate<span class="o">(</span>int<span class="o">)</span>
<span class="m">74185</span> 1.2112 cifar.ptblas _ZN5cifar7DataSetC2ERKSsbj.constprop.352
<span class="m">68051</span> 1.1111 cifar.ptblas ATL_sJIK0x0x72TN72x72x0_a1_bX
<span class="m">67752</span> 1.1062 cifar.ptblas _ZN6hiperfit6neural8PoolNodeINS0_4pool3maxEE6updateEi._omp_fn.2
<span class="m">65919</span> 1.0763 cifar.ptblas ATL_sJIK0x0x0TN0x0x0_a1_bX
<span class="m">60774</span> 0.9923 cifar.ptblas _ZN6hiperfit5ArrayIfE5applyIZNS_6neural12FunctionNodeINS3_8function4reluEE7predictEiEUlRffE_EEvRKS1_RKT_._omp_fn.11
......
<span class="m">4201</span> 0.0686 cifar.ptblas hiperfit::neural::PadNode::update<span class="o">(</span>int<span class="o">)</span>
......
<span class="m">50</span> 8.2e-04 cifar.ptblas hiperfit::neural::PadNode::predict<span class="o">(</span>int<span class="o">)</span>
......
</pre></div>Android WebKit Rendering Pipeline and Instrumentation2013-07-08T00:00:00-04:002013-07-08T00:00:00-04:00Wei Dongtag:www.wdong.org,2013-07-08:/android-webkit-rendering-pipeline-and-instrumentation.html<p>The Android WebKit rendering pipeline is summarized in <a href="http://www.wdong.org/webkit-draw1.pdf">this document</a>. The final section provides a patch to the Android source code that will log all the rendering activity. A sample browsing session is traced with this code, with all image-rendering activities logged and offline processed and rendered with a standalone …</p><p>The Android WebKit rendering pipeline is summarized in <a href="http://www.wdong.org/webkit-draw1.pdf">this document</a>. The final section provides a patch to the Android source code that will log all the rendering activity. A sample browsing session is traced with this code, with all image-rendering activities logged, offline processed, and rendered with a standalone installation of Skia. Each time WebKit updates the screen, a frame is produced from the log, and concatenating these frames with ffmpeg makes <a href="http://www.wdong.org/trace1.mpg">this video</a>.
In this way, I'm able to log all images that are actually drawn on the screen during a browsing session.</p>
<p>I wrote this code to study opportunities for bandwidth saving -- an image in a webpage won't have to be actually downloaded if it is not shown at all, and only a low-resolution version of the image, given multiple resolutions are available, is needed if only a zoomed thumbnail of the image is shown.</p>Geological Distribution of the Alex Top 500 Websites2013-06-08T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-06-08:/geological-distribution-of-the-alex-top-500-websites.html<p><a href="http://www.wdong.org/top500.tar.bz2">Top 500</a></p>
<p>A one-hour exercise of displaying a bunch of points on the map using <a href="http://openlayers.org/">OpenLayers</a> and jQuery. The geological locations of the IP addresses are obtained with geoiplookup.</p><p><a href="http://www.wdong.org/top500.tar.bz2">Top 500</a></p>
<p>A one-hour exercise of displaying a bunch of points on the map using <a href="http://openlayers.org/">OpenLayers</a> and jQuery. The geological locations of the IP addresses are obtained with geoiplookup.</p>Notes on LDA and RBM2013-04-01T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-04-01:/notes-on-lda-and-rbm.html<ul>
<li><a href="http://www.wdong.org/lda.pdf">LDA</a></li>
<li><a href="http://www.wdong.org/rbm.pdf">RBM</a></li>
</ul><ul>
<li><a href="http://www.wdong.org/lda.pdf">LDA</a></li>
<li><a href="http://www.wdong.org/rbm.pdf">RBM</a></li>
</ul>Tips for Cross Compiling Libraries for Android2013-03-19T00:00:00-04:002013-03-19T00:00:00-04:00Wei Dongtag:www.wdong.org,2013-03-19:/tips-for-cross-compiling-libraries-for-android.html<p>Assume we want to install the libraries to "/opt/arm-tools".</p>
<ol>
<li>Boost</li>
</ol>
<p>Edit the file "tools/build/v2/user-config.jam" under the boost source directory and add the following line:</p>
<div class="highlight"><pre><span></span>using gcc : arm : arm-none-linux-gnueabi-g++ <span class="p">;</span>
</pre></div>
<p>Then build and install boost with the following command</p>
<div class="highlight"><pre><span></span>./b2 <span class="nv">toolset</span><span class="o">=</span>gcc-arm target-os<span class="o">=</span>linux <span class="nv">threading</span><span class="o">=</span>multi <span class="nv">link …</span></pre></div><p>Assume we want to install the libraries to "/opt/arm-tools".</p>
<ol>
<li>Boost</li>
</ol>
<p>Edit the file "tools/build/v2/user-config.jam" under the boost source directory and add the following line:</p>
<div class="highlight"><pre><span></span>using gcc : arm : arm-none-linux-gnueabi-g++ <span class="p">;</span>
</pre></div>
<p>Then build and install boost with the following command</p>
<div class="highlight"><pre><span></span>./b2 <span class="nv">toolset</span><span class="o">=</span>gcc-arm target-os<span class="o">=</span>linux <span class="nv">threading</span><span class="o">=</span>multi <span class="nv">link</span><span class="o">=</span>static runtime-link<span class="o">=</span>static <span class="nv">variant</span><span class="o">=</span>release <span class="nv">optimization</span><span class="o">=</span>space --prefix<span class="o">=</span>/opt/arm-tools
</pre></div>
<ol>
<li>Packages with Autotools</li>
</ol>
<p>Run ./configure with the following script</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="nb">export</span> <span class="nv">CC</span><span class="o">=</span>arm-none-linux-gnueabi-gcc
<span class="nb">export</span> <span class="nv">CXX</span><span class="o">=</span>arm-none-linux-gnueabi-g++
<span class="nb">export</span> <span class="nv">AR</span><span class="o">=</span>arm-none-linux-gnueabi-ar
<span class="nb">export</span> <span class="nv">CFLAGS</span><span class="o">=</span>-I/opt/arm-tools/include
<span class="nb">export</span> <span class="nv">CXXFLAGS</span><span class="o">=</span>-I/opt/arm-tools/include
<span class="nb">export</span> <span class="nv">LDFLAGS</span><span class="o">=</span>-L/opt/arm-tools/lib
./configure --host<span class="o">=</span>x86_64-unknown-linux-gnu --target<span class="o">=</span>arm-none-linux-gnueabi --enable-static --disable-shared --prefix<span class="o">=</span>/opt/arm-tools
</pre></div>Pushing Code to Data: A MapR Exerciese2013-03-18T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-03-18:/pushing-code-to-data-a-mapr-exerciese.html<p><span style="color:red;font-size:24px;">In this blog I'll demonstrate how to push code to data stored on a MapR (or Hadoop) cluster and achieve an order of magnitude speedup with simple bash and C++ coding and without any of the MapReduce and Java stuff.</span></p>
<p>I just had a MapR cluster set up; and in …</p><p><span style="color:red;font-size:24px;">In this blog I'll demonstrate how to push code to data stored on a MapR (or Hadoop) cluster and achieve an order of magnitude speedup with simple bash and C++ coding and without any of the MapReduce and Java stuff.</span></p>
<p>I just had a MapR cluster set up; and in this exercise, I'm going to test the data-processing speedup I can gain by pushing the code to run where the data is. The goal here is to gain some knowledge on the overhead of various system components (e.g. the hadoop commandline) so I can design an in-house platform for distributed data processing.</p>
<h2>Setup</h2>
<p>The data I have are mostly files of roughly fixed size, about 30MB each. I need to store a large number of such files in the MapR filesystem, and need to frequently run some processing command on each file. The output generated from each file is much smaller than the input and is negligible. In this experiment, I'll use 100 such files as input, and use "md5sum" as the operation. The files are already stored on the MapR filesystem, with the text file "list" containing a list of the paths to the 100 files.</p>
<p>The hardware setup is as follows. The cluster has 8 nodes, with 7 forming a MapR cluster. All the commands and operations are done on the remaining node with the hostname "washtenaw". </p>
<h2>Pulling Data From MapR FS: The Naive Approach</h2>
<p>The simplest approach is to run all the processing locally on washtenaw, pulling all the data needed from the MapR FS. Following is the script "run-local.sh":</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1"># run-local.sh</span>
cat list <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> name
<span class="k">do</span>
<span class="nb">printf</span> <span class="s1">'%s\t'</span> <span class="nv">$name</span>
hadoop fs -cat <span class="nv">$name</span> <span class="p">|</span> md5sum
<span class="k">done</span>
</pre></div>
<p>And the performance is</p>
<div class="highlight"><pre><span></span>wdong@washtenaw $ time ./run-local.sh > md5.hadoop
real 1m14.910s
user 1m10.304s
sys 0m19.573s
</pre></div>
<p>Roughly we spend 0.75s (74.9s / 100 files) on each file.</p>
<h2>Pulling Data From MapR FS without Java</h2>
<p>We already know that MapR FS is able to achieve a throughput of 80MB/s from the <a href="/mapr-file-copy-throughput.html">previous post</a>. So it should take only 30/80 = 0.375s to retrieve each file from the cluster (without the name node lookup and all other latencies). This is only about half of the 0.75s we spent per file in our first setting. An obvious overhead is the cost to start the hadoop command line. So in this setting, I'll test directly fetching the data using the C API. No Java code is involved in this setting, but the data is still loaded remotely from the cluster.</p>
<p>Here's the C++ source code:</p>
<div class="highlight"><pre><span></span><span class="c1">// run-c++.cpp</span>
<span class="cp">#include</span> <span class="cpf">"hdfs.h" </span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp"></span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
<span class="k">static</span> <span class="kt">size_t</span> <span class="n">BUFFER_SIZE</span> <span class="o">=</span> <span class="mi">64</span> <span class="o">*</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">;</span>
<span class="n">string</span> <span class="n">buffer</span><span class="p">(</span><span class="n">BUFFER_SIZE</span><span class="p">,</span> <span class="sc">'\0'</span><span class="p">);</span>
<span class="n">hdfsFS</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">hdfsConnect</span><span class="p">(</span><span class="s">"default"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">string</span> <span class="n">path</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">path</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">path</span> <span class="o"><<</span> <span class="sc">' '</span><span class="p">;</span>
<span class="n">hdfsFile</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hdfsOpenFile</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">O_RDONLY</span><span class="p">,</span> <span class="n">BUFFER_SIZE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">cout</span><span class="p">.</span><span class="n">flush</span><span class="p">();</span>
<span class="kt">FILE</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">popen</span><span class="p">(</span><span class="s">"md5sum"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
<span class="kt">size_t</span> <span class="n">sz</span> <span class="o">=</span> <span class="n">hdfsRead</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">h</span><span class="p">,</span> <span class="o">&</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">buffer</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="n">fwrite</span><span class="p">(</span><span class="o">&</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sz</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">cmd</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sz</span> <span class="o">!=</span> <span class="n">buffer</span><span class="p">.</span><span class="n">size</span><span class="p">())</span> <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">pclose</span><span class="p">(</span><span class="n">cmd</span><span class="p">);</span>
<span class="n">hdfsCloseFile</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">hdfsDisconnect</span><span class="p">(</span><span class="n">fs</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>And here's the performance:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ make run-c++
g++ -std<span class="o">=</span>c++11 -Wall -O3 -I/opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs -L/opt/mapr/lib -Wl,-allow-shlib-undefined run-c++.cpp -lMapRClient -o run-c++
wdong@washtenaw$ <span class="nb">time</span> ./run-c++ < list > md5.c++
real 0m42.258s
user 0m6.880s
sys 0m16.077s
</pre></div>
<p>We've reduced per-file processing time from 0.75s to about 0.42s. The overhead of running the hadoop command line for each file is therefore about 0.33s -- that is pretty big.</p>
<h2>Pushing Code to Data</h2>
<p>Here comes the real stuff. I'll detect on which node a file is stored, and push our code to where the data is. I use the following C++ program to query the location of a file, assuming that each file is contained in one filesystem block with no replication, which is the case in my system setup.</p>
<div class="highlight"><pre><span></span><span class="c1">// hdfs-lookup.cpp</span>
<span class="cp">#include</span> <span class="cpf">"hdfs.h" </span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><string></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><boost/assert.hpp></span><span class="cp"></span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
<span class="n">hdfsFS</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">hdfsConnect</span><span class="p">(</span><span class="s">"default"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">string</span> <span class="n">path</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">path</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// get the block of the 1st byte</span>
<span class="kt">char</span> <span class="o">***</span><span class="n">hosts</span> <span class="o">=</span> <span class="n">hdfsGetHosts</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">BOOST_VERIFY</span><span class="p">(</span><span class="n">hosts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&&</span> <span class="n">hosts</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">path</span> <span class="o"><<</span> <span class="sc">' '</span> <span class="o"><<</span> <span class="n">hosts</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">hdfsFreeHosts</span><span class="p">(</span><span class="n">hosts</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">hdfsDisconnect</span><span class="p">(</span><span class="n">fs</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>Running the program produces something like the following: each line containing the path followed by the hostname where the data is:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ make hdfs-lookup
g++ -std<span class="o">=</span>c++11 -Wall -O3 -I/opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs -L/opt/mapr/lib -Wl,-allow-shlib-undefined hdfs-lookup.cpp -lMapRClient -o hdfs-lookup
wdong@washtenaw$ ./hdfs-lookup < list
test/data1 fuller
test/data2 ford
test/data3 huron
test/data4 huron
test/data5 plymouth
...
</pre></div>
<p>I then use the following script to drive the computation:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1"># run-remote.sh</span>
mkdir -p input<span class="p">;</span> rm -f input/*
mkdir -p output<span class="p">;</span> rm -f output/*
cat list <span class="p">|</span> ./hdfs-lookup <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> path host
<span class="k">do</span>
<span class="nb">echo</span> <span class="nv">$path</span> >> input/<span class="nv">$host</span>
<span class="k">done</span>
<span class="k">for</span> host in <span class="sb">`</span>ls input<span class="sb">`</span>
<span class="k">do</span>
ssh <span class="nv">$host</span> <span class="s2">"cd </span><span class="nv">$PWD</span><span class="s2">; cat input/</span><span class="nv">$host</span><span class="s2"> | ./run-c++ > output/</span><span class="nv">$host</span><span class="s2">"</span>
<span class="k">done</span>
cat output/*
</pre></div>
<p>And here's the performance:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ <span class="nb">time</span> ./run-remote.sh > md5.remote
real 0m13.529s
user 0m0.116s
sys 0m0.040s
</pre></div>
<p>Substantial speedup! The time we spent on each file is 0.135s.</p>
<h2>Pushing Code to Data with Parallelization</h2>
<p>Finally, I go ahead to parallelize the above driving script:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1"># run-parallel.sh</span>
mkdir -p input<span class="p">;</span> rm -f input/*
mkdir -p output<span class="p">;</span> rm -f output/*
cat list <span class="p">|</span> ./hdfs-lookup <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> path host
<span class="k">do</span>
<span class="nb">echo</span> <span class="nv">$path</span> >> input/<span class="nv">$host</span>
<span class="k">done</span>
<span class="k">for</span> host in <span class="sb">`</span>ls input<span class="sb">`</span>
<span class="k">do</span>
ssh <span class="nv">$host</span> <span class="s2">"cd </span><span class="nv">$PWD</span><span class="s2">; cat input/</span><span class="nv">$host</span><span class="s2"> | ./run-c++ > output/</span><span class="nv">$host</span><span class="s2">"</span> <span class="p">&</span>
<span class="k">done</span>
<span class="nb">wait</span>
cat output/*
</pre></div>
<p>And here's the performance:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ wc -l input/* <span class="c1"># just to show the data distribution among the nodes.</span>
<span class="m">12</span> input/ford
<span class="m">10</span> input/fuller
<span class="m">21</span> input/geddes
<span class="m">19</span> input/huron
<span class="m">9</span> input/maple
<span class="m">10</span> input/plymouth
<span class="m">19</span> input/wagner
<span class="m">100</span> total
wdong@washtenaw$ <span class="nb">time</span> ./run-parallel.sh > md5.parallel
real 0m2.987s
user 0m0.120s
sys 0m0.068s
wdong@washtenaw$ <span class="k">for</span> i in md5.* <span class="p">;</span> <span class="k">do</span> sort <span class="nv">$i</span> <span class="p">|</span> md5sum<span class="p">;</span> <span class="k">done</span> <span class="c1"># just to check that all outputs are the same</span>
a3889305ff721ec32ced48a3066ea059 -
a3889305ff721ec32ced48a3066ea059 -
a3889305ff721ec32ced48a3066ea059 -
a3889305ff721ec32ced48a3066ea059 -
</pre></div>
<p>That is a speedup of 25x over my initial naive approach, but at this point there are really no surprises. More speedup could be achieved by parallelizing run-c++.cpp itself, but that would not mean much in this setting: for a much larger dataset the bottleneck will be the disks, and I only have one disk attached to each node.</p>
<p><span style="color:red;font-size:24px;">
A note for Hadoop users: when building the C++ programs, link against libhdfs.so and libjvm.so instead of libMapRClient.so; you'll also have to set up the Java environment properly.</span></p>fuse_dfs on MapR2013-03-17T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-03-17:/fuse_dfs-on-mapr.html<p>Here's how to get fuse_dfs to work with MapR.</p>
<p><strong>Download the <a href="http://www.wdong.org/fuse-dfs.tar.bz2">patched source code</a>.</strong></p>
<h2>The fuse_dfs code is at hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src.</h2>
<h2>Patch the source code a little bit, including the following:</h2>
<ul>
<li>in fuse_connect.c, add "#define __USE_GNU 1" before the line "#include <search.h>".</li>
<li>in fuse_trash.c …</li></ul><p>Here's how to get fuse_dfs to work with MapR.</p>
<p><strong>Download the <a href="http://www.wdong.org/fuse-dfs.tar.bz2">patched source code</a>.</strong></p>
<h2>The fuse_dfs code is at hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src.</h2>
<h2>Patch the source code a little bit, including the following:</h2>
<ul>
<li>in fuse_connect.c, add "#define __USE_GNU 1" before the line "#include <search.h>".</li>
<li>in fuse_trash.c, search for "hdfsDelete" and remove the third parameter "1" to the function.</li>
<li>in fuse_dfs.c, remove the check for "options.port == 0". That is because we are going to use exactly the port number 0.</li>
</ul>
<h2>Use the following Makefile and make fuse_dfs.</h2>
<div class="highlight"><pre><span></span><span class="nv">HADOOP_PREFIX</span><span class="o">=</span>/opt/mapr/hadoop/hadoop-0.20.2
<span class="nv">PACKAGE_VERSION</span><span class="o">=</span>0.20.2
<span class="nv">FUSE_HOME</span><span class="o">=</span>/usr
<span class="nv">PERMS</span><span class="o">=</span>
<span class="nv">PROTECTED_PATHS</span><span class="o">=</span>
<span class="nv">bin_PROGRAMS</span> <span class="o">=</span> fuse_dfs
<span class="nv">fuse_dfs_SOURCES</span> <span class="o">=</span> fuse_dfs.o fuse_options.o fuse_trash.o fuse_stat_struct.o fuse_users.o fuse_init.o fuse_connect.o fuse_impls_access.o fuse_impls_chmod.o fuse_impls_chown.o fuse_impls_create.o fuse_impls_flush.o fuse_impls_getattr.o fuse_impls_mkdir.o fuse_impls_mknod.o fuse_impls_open.o fuse_impls_read.o fuse_impls_release.o fuse_impls_readdir.o fuse_impls_rename.o fuse_impls_rmdir.o fuse_impls_statfs.o fuse_impls_symlink.o fuse_impls_truncate.o fuse_impls_utimens.o fuse_impls_unlink.o fuse_impls_write.o
<span class="nf">fuse_dfs</span><span class="o">:</span> <span class="k">$(</span><span class="nv">fuse_dfs_SOURCES</span><span class="k">)</span>
<span class="nv">CFLAGS</span><span class="o">=</span> -Wall -g -DPERMS<span class="o">=</span><span class="k">$(</span>PERMS<span class="k">)</span> -D_FILE_OFFSET_BITS<span class="o">=</span><span class="m">64</span> -I<span class="k">$(</span>HADOOP_PREFIX<span class="k">)</span>/src/c++/libhdfs -D_FUSE_DFS_VERSION<span class="o">=</span><span class="se">\"</span><span class="k">$(</span>PACKAGE_VERSION<span class="k">)</span><span class="se">\"</span> -DPROTECTED_PATHS<span class="o">=</span><span class="se">\"</span><span class="k">$(</span>PROTECTED_PATHS<span class="k">)</span><span class="se">\"</span> -I<span class="k">$(</span>FUSE_HOME<span class="k">)</span>/include
<span class="nv">LDFLAGS</span><span class="o">=</span> -L/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib -L/opt/mapr/lib -L<span class="k">$(</span>FUSE_HOME<span class="k">)</span>/lib -Wl,-allow-shlib-undefined
<span class="nv">LDLIBS</span> <span class="o">=</span> -lMapRClient -lfuse
<span class="nf">all</span><span class="o">:</span> <span class="n">fuse_dfs</span>
<span class="nf">clean</span><span class="o">:</span>
rm *.o fuse_dfs
</pre></div>
<h2>Mount with the following command</h2>
<div class="highlight"><pre><span></span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span>/opt/mapr/lib ./fuse_dfs -oserver<span class="o">=</span>default -oport<span class="o">=</span><span class="m">0</span> <mount point>
</pre></div>
<p>We use the MapR client library instead of libhdfs, and there's no Java code involved between fuse and the disks. (Java is probably more of a psychological issue than a performance issue here.)</p>MapR File Copy Throughput2013-03-17T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-03-17:/mapr-file-copy-throughput.html<div class="highlight"><pre><span></span>$ du -sh /data/local/wdong/data <span class="c1"># the directory contains a bunch of 30MB files.</span>
15G /data/local/wdong/data
$ <span class="nb">time</span> cp -R /data/local/wdong/data . <span class="c1"># copy data via fuse</span>
real 3m9.192s
user 0m0.148s
sys 0m19.581s
$ <span class="nb">time</span> hadoop fs -put /data/local/wdong/data test/data1
real …</pre></div><div class="highlight"><pre><span></span>$ du -sh /data/local/wdong/data <span class="c1"># the directory contains a bunch of 30MB files.</span>
15G /data/local/wdong/data
$ <span class="nb">time</span> cp -R /data/local/wdong/data . <span class="c1"># copy data via fuse</span>
real 3m9.192s
user 0m0.148s
sys 0m19.581s
$ <span class="nb">time</span> hadoop fs -put /data/local/wdong/data test/data1
real 2m56.955s
user 0m16.225s
sys 0m30.286s
</pre></div>
<p>So whether via fuse or the hadoop commandline, the write throughput of MapR is about 80MB/s, with the hadoop commandline being slightly faster. The overhead of Java is actually negative compared to that of fuse. I expect the performance of MapR's native NFS server to beat both.</p>
<p>I'm using a $20 <a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16833156309">TRENDnet 8-port Gigabit Switch</a> and there is a cluster of 7 MapR servers behind it.</p>Programming AT89S52 with Arduino2012-07-04T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2012-07-04:/programming-at89s52-with-arduino.html<p>I cannot make my usbasp programmer work on AT89S52. It makes me so frustrated that I decide to build my own with the ATMEGA328P at hand so I can learn 8051 programming (what an overkill!). The process of trying to make it work is painful, but nevertheless it finally worked …</p><p>I cannot make my usbasp programmer work on AT89S52. It makes me so frustrated that I decide to build my own with the ATMEGA328P at hand so I can learn 8051 programming (what an overkill!). The process of trying to make it work is painful, but nevertheless it finally worked (code is still buggy and I don't know why).</p>
<p>Here's the result:</p>
<p><img alt="ui" src="http://www.wdong.org/8051.jpg"></p>
<p>(OK, it's a poor man's Arduino -- a minimal breadboard system connected to PC via a USB-RS232 adaptor. The AVR chip is on the right -- the 51 chip on the left is so much bigger!)</p>
<p>Wiring:</p>
<div class="highlight"><pre><span></span>ATMEGA328P AT89S52
VCC ---- VCC ---- USB VCC
GND ---- GND ---- USB GND
Pin 15 (Arduino Pin 9) --- RST
Pin 17 MOSI (Arduino Pin 11) -- Pin 6 MOSI
Pin 18 MISO (Arduino Pin 12) -- Pin 7 MISO
Pin 19 SCK (Arduino Pin 13) -- Pin 8 SCK
ATMEGA328P
Pin 16 (Arduino Pin 10) -- LED, error signal
AT89S52
Pin 31 /EA -- VCC
Pin 3 (P1.2) -- LED, testing
</pre></div>
<p>Both chips use a 16MHz crystal oscillator -- this is hardcoded in the program.</p>
<p><a href="http://www.wdong.org/arduino-x51.cpp">The Arduino sketch</a></p>
<p><a href="http://www.wdong.org/x51.cpp">The PC host program</a>, needs Boost to compile, but it should be cross platform.</p>
<p>Following are a few commands to show the usage of the host program. Change -s parameter to the serial port used for communication.</p>
<div class="highlight"><pre><span></span>./x51 <span class="o">[</span>-s /dev/ttyUSB0<span class="o">]</span> <span class="o">[</span>--page<span class="o">]</span> foo.ihx <span class="c1"># Uploading. sdcc .ihx output of sdcc.</span>
./x51 --dump foo.ihx <span class="c1"># Dump the content of the hex file.</span>
./x51 <span class="o">[</span>-s /dev/ttyUSB0<span class="o">]</span> <span class="o">[</span>--page<span class="o">]</span> --verify foo.ihx <span class="c1"># read the file content from the chip and dump it. It needs the original .ihx file to determine the number of bytes to read.</span>
</pre></div>
<p>Both upload and verify accept an optional "--page" parameter to do page-mode I/O. The serial communication code is still buggy in that the Arduino has to be reset before another upload/download can be carried out.</p>
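The .ihx files the host program consumes are in Intel HEX format. For reference, here is a minimal sketch (in Python, not taken from the actual host program) of how one record of such a file can be parsed and checksum-verified:

```python
def parse_ihex_record(line):
    """Parse one Intel HEX record, e.g. ':0B0010006164647265737320676170A7'."""
    if not line.startswith(':'):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    count = raw[0]                   # number of data bytes
    addr = (raw[1] << 8) | raw[2]    # 16-bit load address
    rectype = raw[3]                 # 0x00 = data, 0x01 = end-of-file, ...
    data = raw[4:4 + count]
    # All record bytes, including the trailing checksum, must sum to 0 mod 256.
    if sum(raw) % 256 != 0:
        raise ValueError("checksum mismatch")
    return addr, rectype, data
```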
<p>What's learned:
1. Arduino's SPI doesn't work for this purpose for an unknown reason; I haven't checked the source code.
2. Arduino's shiftOut works for programming, but shiftIn doesn't, maybe due to timing issues.
3. Arduino is slow at processing serial input. I had to lower the baud rate to 9600 and add delays to the host code to make serial communication work, and it is still not stable.</p>JPEG vs JPEG 2000 vs WebP2012-04-12T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2012-04-12:/jpeg-vs-jpeg-2000-vs-webp.html<p>Evaluation protocol:</p>
<p>A small number of web images are collected from the VPN trace. These images are first converted to lossless formats (png and pnm). The programs jpeg, jasper and webp are then used to compress the images with various quality numbers and then decompress to a lossless format. The program imgcmp is used to compare the original image and the decompressed images. Compression ratio is the size of the compressed file divided by that of the lossless png file.</p>
<p>A few error bars are also shown for jpeg with quality=30 as a reference. This configuration has noticeable yet tolerable noise. (These bars fall into two groups: photos and cartoons. The baseline PNG is good at compressing cartoons, so jpeg has a relatively low compression ratio and high error for this group.)</p>
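For reference, the two measures plotted below (RMSE and PSNR) can be computed from raw pixel values as follows. This is a minimal Python sketch assuming flat sequences of 8-bit samples, not the actual imgcmp implementation:

```python
import math

def rmse(a, b):
    # Root-mean-square error between two equal-length sequences of pixel values.
    if len(a) != len(b):
        raise ValueError("images must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio in dB; higher means less distortion.
    return 20.0 * math.log10(peak / rmse(a, b))
```

PSNR is just a log-scaled view of RMSE, which is why the two plots lead to the same conclusion.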
<p>The two performance measures lead to the same conclusion:<span style="color: #ff0000;"> webp wins when the compression ratio is below 0.12, and jpeg 2000 wins when it is above 0.12. The performance of jpeg is always comparable to the worse of the other two.</span></p>
<p><img alt="ui" src="http://www.wdong.org/codec-rmse.png">
<img alt="ui" src="http://www.wdong.org/codec-psnr.png"></p>
<p>Sample images.
<img alt="ui" src="http://www.wdong.org/montage.jpg"></p>
<p>Image Format Distribution:
- 1146 image/jpeg
- 194 image/gif
- 188 image/png</p>Making Sense Out of the Matlab FFT Results2011-10-11T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2011-10-11:/making-sense-out-of-the-matlab-fft-results.html<p>Here's a note for people who know nothing about digital signal processing (like me) to make sense out of the matlab FFT results.</p>
<p>For an input A of N real numbers, B = fft(A) contains N complex numbers. B has the following properties:</p>
<ul>
<li>B[0] (the first number, DC) is pure real.</li>
<li>If N is even, then B[N/2] (the “Nyquist” frequency) is also pure real.</li>
<li>Other than that, B[i] = B[N-i]*.</li>
</ul>
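These symmetry properties are easy to check with a naive DFT, which computes the same thing as Matlab's fft. A Python sketch, for illustration only:

```python
import cmath

def dft(a):
    # Naive O(N^2) DFT: B[k] = sum_t a[t] * exp(-2*pi*i*k*t/N).
    # For real input this matches fft(a) up to floating-point error.
    n = len(a)
    return [sum(a[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

For a real input of even length N, B[0] and B[N/2] come out (numerically) real, and B[i] equals the complex conjugate of B[N-i].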
<p>If we take the abs of the output, then the first floor(N/2)+1 elements contain all the information (the rest mirror them, since abs(B[i]) = abs(B[N-i])). What's important about this abs of the fft output is that it represents the power of the components at various frequencies. That is:</p>
<p>abs(B[n]) is the magnitude of the component, a sine wave, that completes n cycles over the span of A.</p>
<p>The FFT needs at least two samples from each cycle, so the highest-frequency component that appears in the FFT result completes floor(N/2) cycles, and corresponds to B[floor(N/2)].</p>
<p>So far we have not said anything about the sample rate. Now, assume that the sample rate of A is r; then the time length of A is N/r seconds. For B[n], A covers n cycles, so each cycle is N/(nr) seconds. That is, the corresponding frequency of B[n] is nr/N.</p>
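The bin-to-frequency mapping just derived is a one-liner. A hypothetical helper, for illustration:

```python
def bin_frequency(n, N, r):
    # Frequency (in Hz) represented by FFT bin n, for N samples taken at rate r.
    # The highest meaningful bin, floor(N/2), maps to r/2 (the Nyquist frequency).
    return n * r / N
```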
<p>So here's the role of the sample rate r and the number of samples N:</p>
<ol>
<li>The sample rate r controls the highest frequency covered by the FFT, which is r/2.</li>
<li>N controls the granularity of the quantization of the frequency range [0, r/2].</li>
</ol>About2011-01-01T00:00:00-05:002011-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2011-01-01:/about.html<p>I'm an independent computer scientist. I provide high-quality and scalable implementations of bleeding edge algorithms and consulting services on machine learning, data mining, recommendation systems, computer vision, search engines, storage systems, high-performance computing and other fields in computer science. Ann Arbor Algorithms is my sole proprietorship.</p>