Wei Dong's Bloghttp://www.wdong.org/2017-05-10T00:00:00-04:00Equivalence of Subpixel Convolution and Transposed Convolution2017-05-10T00:00:00-04:002017-05-10T00:00:00-04:00Wei Dongtag:www.wdong.org,2017-05-10:/equivalence-of-subpixel-convolution-and-transposed-convolution.html<p>The idea of subpixel convolution was originally proposed by <a href="https://arxiv.org/abs/1609.05158">W. Shi et al</a>
from <a href="https://techcrunch.com/2016/06/20/twitter-is-buying-magic-pony-technology-which-uses-neural-networks-to-improve-images/">Magic
Pony</a>,
and well explained in <a href="http://www.inference.vc/holiday-special-deriving-the-subpixel-cnn-from-first-principles/">this
post</a>.
The figure below illustrates, in the 1-dimensional case, that subpixel
convolution is really just transposed convolution (or deconvolution).</p>
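<p>To make the claim concrete, here is a minimal numpy sketch (the input and 4-tap kernel are toy values of my own choosing, not from the paper): splitting the kernel into its even and odd phases, running one ordinary convolution per phase, and interleaving the two outputs (the periodic shuffle) reproduces a stride-2 transposed convolution exactly.</p>

```python
import numpy as np

def transposed_conv1d(x, k, stride=2):
    # Stride-2 transposed convolution: scatter-add x[i] * k at offset 2*i.
    y = np.zeros(stride * (len(x) - 1) + len(k))
    for i, v in enumerate(x):
        y[stride * i:stride * i + len(k)] += v * k
    return y

x = np.array([1.0, 2.0, 3.0, 4.0])        # toy input
k = np.array([0.5, -0.25, 0.125, 0.75])   # toy 4-tap kernel

# Subpixel view: one ordinary convolution per kernel phase,
# then interleave the outputs (the "pixel shuffle").
even = np.convolve(x, k[0::2])  # taps k[0], k[2]
odd = np.convolve(x, k[1::2])   # taps k[1], k[3]
shuffled = np.empty(len(even) + len(odd))
shuffled[0::2] = even
shuffled[1::2] = odd

assert np.allclose(shuffled, transposed_conv1d(x, k))
```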
<p><img alt="subpixel" src="http://www.wdong.org/subpixel.png"></p>2017 Spring2017-04-17T00:00:00-04:002017-04-17T00:00:00-04:00Wei Dongtag:www.wdong.org,2017-04-17:/2017-spring.html<p><img alt="plum" src="http://www.wdong.org/plum.jpg">
<img alt="apple" src="http://www.wdong.org/apple.jpg">
<img alt="pear" src="http://www.wdong.org/pear.jpg">
<img alt="flower" src="http://www.wdong.org/flower.jpg"></p>Receptive Field Size of Convolutional Network2017-01-19T00:00:00-05:002017-01-19T00:00:00-05:00Wei Dongtag:www.wdong.org,2017-01-19:/receptive-field-size-of-convolutional-network.html<p>The receptive field of a cell in a convolutional neural network (CNN)
is the underlying image region that affects the computation
of this cell. For a CNN to effectively
recognize an object, the receptive field size of the
narrowest layer of the network must be large enough to
capture the characteristic structures of that object.
A receptive field that is too big can be problematic, too.
In addition to costing excessive computation, the
neural network might be forced to learn structures
that surround, but are not intrinsically attached to or associated with,
the object.</p>
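<p>As a sketch of the underlying arithmetic (not the linked tool itself): for a stack of convolution/pooling layers, each given as a (kernel size, stride) pair, every layer widens the receptive field by (k - 1) times the product of all strides before it.</p>

```python
def receptive_field(layers):
    """Receptive field of stacked conv/pool layers given as (kernel, stride)."""
    r, jump = 1, 1           # field size and cumulative stride ("jump")
    for k, s in layers:
        r += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s
    return r

# e.g. three 3x3 convs with a stride-2, 2x2 max-pool after the first two:
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # -> 10
```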
<p><a href="https://github.com/aaalgo/tfgraph">This program</a> automatically
computes the receptive field size of the narrowest layer of
a TensorFlow model. This can be used as a guide to modifying a
standard network architecture to work with a simple dataset.</p>Deep Learning with Owl, PicPac and XNN2016-09-01T00:00:00-04:002016-09-01T00:00:00-04:00Wei Dongtag:www.wdong.org,2016-09-01:/deep-learning-with-owl-picpac-and-xnn.html<p>Owl, PicPac and XNN are three
tools I wrote to make image-related
model training easy.</p>
<ul>
<li><a href="https://github.com/aaalgo/owl">Owl</a>: a web UI for efficient image annotation.</li>
<li><a href="https://github.com/aaalgo/picpac">PicPac</a>: an image database and streaming library
that preprocesses images and feeds them into a
deep learning framework. PicPac supports Caffe (fork), MxNet, Nervana, Theano and TensorFlow.</li>
<li><a href="https://github.com/aaalgo/xnn">XNN</a>: a C++ wrapper that provides a unified
prediction interface
to all common deep learning frameworks, including
Caffe, MxNet, TensorFlow, Theano and other Python-based frameworks.</li>
<li>(<a href="https://github.com/aaalgo/caffe">Caffe fork with PicPac support</a>)</li>
</ul>
<p>The goal is to
create a model that will detect and localize
a target object category within images.
We will use a toy dataset for car plate recognition for illustration.</p>
<h1>Annotation with Owl</h1>
<div class="highlight"><pre><span></span>$ git clone https://github.com/aaalgo/owl
$ <span class="nb">cd</span> owl
$ <span class="c1"># Download the dataset</span>
$ wget http://www.robots.ox.ac.uk/~vgg/data/cars_markus/cars_markus.tar
$ mkdir images
$ <span class="nb">cd</span> images
$ tar xf ../cars_markus.tar
$ <span class="nb">cd</span> ..
$ <span class="c1"># create database</span>
$ ./manage.py migrate
$ <span class="c1"># import images into the database</span>
$ find images/ -name <span class="s1">'*.jpg'</span> <span class="p">|</span> ./manage.py import --run
$ <span class="c1"># start the annotation server</span>
</pre></div>
<p>Before starting the annotation server, we need to adjust a couple of
parameters in the file <code>owl/annotate/params.py</code></p>
<div class="highlight"><pre><span></span><span class="n">ROWS</span> <span class="o">=</span> <span class="mi">2</span> <span class="c1"># <-- images rows / page</span>
<span class="n">COLS</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1"># <-- images / row</span>
<span class="n">BATCH</span> <span class="o">=</span> <span class="n">ROWS</span> <span class="o">*</span> <span class="n">COLS</span>
<span class="n">POLYGON</span> <span class="o">=</span> <span class="bp">False</span> <span class="c1"># set to True for polygons</span>
<span class="n">VIEWED_AS_DONE</span> <span class="o">=</span> <span class="bp">False</span> <span class="c1"># see below</span>
</pre></div>
<div class="highlight"><pre><span></span>$ ./run.sh
</pre></div>
<p>The URL of the annotation UI is <code>http://HOSTNAME:18000/annotate/</code>.</p>
<p><img alt="ui" src="http://www.wdong.org/anno.jpg"></p>
<p>The UI is designed to minimize hand movements and therefore maximize
efficiency. The following design decisions were made:</p>
<ul>
<li>A bounding box is automatically saved by AJAX when created.</li>
<li>Refreshing page loads the next batch of examples.</li>
</ul>
<p>The annotation process finishes when all images are annotated/viewed.
The <code>VIEWED_AS_DONE</code> parameter controls whether
an image that has been viewed should be considered annotated even when no annotation
is added. Set the value to <code>True</code> if it is known that images without
positive regions exist. If the value is set to <code>False</code> and no annotation
is made to an image, it will be shown again when all other images are done.</p>
<p>After annotation is done, or a sufficient number of annotations have been collected,
the images and annotations can be exported to a PicPac database by</p>
<div class="highlight"><pre><span></span>$ ./manage.py <span class="nb">export</span> db
</pre></div>
<p>The file <code>db</code> then contains all the information needed for training.</p>
<h1>PicPac Database</h1>
<p>A PicPac database contains images and labels/annotations.
The annotations produced by Owl are in the same format used
by <a href="http://annotorious.github.io/">Annotorious</a>; Owl
actually uses an extended version of Annotorious. Below is
a sample annotation:</p>
<div class="highlight"><pre><span></span><span class="p">{</span><span class="s1">'shapes'</span><span class="o">:</span> <span class="p">[{</span><span class="nx">u</span><span class="s1">'geometry'</span><span class="o">:</span> <span class="p">{</span><span class="nx">u</span><span class="s1">'y'</span><span class="o">:</span> <span class="mf">0.5912162162162162</span><span class="p">,</span> <span class="nx">u</span><span class="s1">'x'</span><span class="o">:</span> <span class="mf">0.6049107142857143</span><span class="p">,</span> <span class="nx">u</span><span class="s1">'width'</span><span class="o">:</span> <span class="mf">0.10491071428571429</span><span class="p">,</span> <span class="nx">u</span><span class="s1">'height'</span><span class="o">:</span> <span class="mf">0.08277027027027027</span><span class="p">},</span> <span class="nx">u</span><span class="s1">'style'</span><span class="o">:</span> <span class="p">{},</span> <span class="nx">u</span><span class="s1">'type'</span><span class="o">:</span> <span class="nx">u</span><span class="s1">'rect'</span><span class="p">}]}</span>
</pre></div>
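<p>Note that the geometry fields in the sample are fractions of the image size. Converting a shape to a pixel bounding box is therefore just a matter of scaling (a sketch; the 640x480 image dimensions below are hypothetical):</p>

```python
def to_pixel_box(geometry, img_w, img_h):
    # Annotorious-style geometry is relative (0..1); scale by image size.
    x = int(round(geometry['x'] * img_w))
    y = int(round(geometry['y'] * img_h))
    w = int(round(geometry['width'] * img_w))
    h = int(round(geometry['height'] * img_h))
    return x, y, w, h

geo = {'y': 0.5912, 'x': 0.6049, 'width': 0.1049, 'height': 0.0828}
print(to_pixel_box(geo, 640, 480))  # (x, y, width, height) in pixels
```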
<p>PicPac provides a web server for viewing the content of a database.</p>
<div class="highlight"><pre><span></span>$ picpac-server db
$ picpac-server db
WARNING: Logging before InitGoogleLogging<span class="o">()</span> is written to STDERR
I0901 22:52:20.280788 <span class="m">29210</span> picpac-server.cpp:146<span class="o">]</span> listening at 0.0.0.0:18888
I0901 22:52:20.281389 <span class="m">29210</span> picpac-server.cpp:148<span class="o">]</span> running server with <span class="m">1</span> threads.
</pre></div>
<p>And samples with annotations can be viewed with <code>http://HOSTNAME:18888/l?annotate=json</code>. The red bounding box is rendered on-the-fly by the server; images and annotations are stored separately in the database.</p>
<p><img alt="ui" src="http://www.wdong.org/picpac.jpg"></p>
<p>The server accepts almost all of the perturbation/augmentation parameters,
so the effects on the training set can be visualized. For example,
the following can be appended to the URL: <code>&perturb=1&pert_angle=20</code>.</p>
<p>Sometimes when the positive regions are too small compared to the background, it is desirable to use only the local areas surrounding the positive regions as training examples, so that positive pixels and negative pixels are roughly balanced. The command below can be used to do the cropping.</p>
<div class="highlight"><pre><span></span>$ picpac-split-region --width <span class="m">100</span> --height <span class="m">50</span> --bg <span class="m">200</span> --no-scale <span class="m">1</span> db db.crop
min: 0.668153
mean: 0.743567
max: 0.819342
</pre></div>
<p>Using <code>picpac-server</code> to serve <code>db.crop</code> shows this.</p>
<p><img alt="ui" src="http://www.wdong.org/picpac-crop.jpg"></p>
<p>The program <code>picpac-split-region</code> accepts the following parameters:</p>
<ul>
<li><code>--size</code> (always 50): scale, or sqrt(width*height), of the positive region.</li>
<li><code>--width</code>: output image width.</li>
<li><code>--height</code>: output image height.</li>
<li><code>--no-scale 1</code>: if not set, the cropped region is scaled so
that the positive region and negative region are of the specified size. If
set, the cropped region is not scaled; instead, the size parameters
are used to determine the ratio between positive and negative regions,
and the output image size is determined accordingly.</li>
</ul>
<h1>Training</h1>
<p>XNN provides a couple of templates based on public models. For example, we can train
with the above database using the following command.</p>
<div class="highlight"><pre><span></span>xnn/train-caffe-fcn.py fcn db ws
</pre></div>
<p>where</p>
<ul>
<li><code>fcn</code> is the template name.</li>
<li><code>db</code> is the input database.</li>
<li><code>ws</code> is the working directory.</li>
</ul>
<p>Training will start automatically after the command, and can be
canceled with CTRL+C. The <code>ws</code> directory will contain the
following:</p>
<div class="highlight"><pre><span></span>$ ls ws
log params.pickle solver.prototxt train.log train.prototxt.tmpl
model snapshots solver.prototxt.tmpl train.prototxt train.sh
</pre></div>
<p>Training can be restarted with <code>train.sh</code>, or continued at a snapshot by supplying a snapshot name under the <code>snapshots</code> directory as the argument of <code>train.sh</code>. </p>
<p>While some parameters can be adjusted via arguments to <code>train-caffe-fcn.py</code>,
it is easier to cancel the training process, edit the file <code>train.prototxt</code> and then restart. The most important parameters of <code>train.prototxt</code> are
annotated below.</p>
<div class="highlight"><pre><span></span>layer {
name: "data1"
type: "PicPac"
top: "data"
top: "label"
picpac_param {
path: "path/to/db"
batch: 1 # batch size, has to be 1 if image sizes are different
channels: 3 # color channels, use 1 for grayscale images
split: 5 # randomly split db into 5 parts
split_fold: 0 # use part 0 for validation and the rest for training
annotate: "json"
anno_color1: 1
threads: 4
perturb: true # enable image augmentation
pert_color1: 10 # random perturbation range of
pert_color2: 10 # the three color channels
pert_color3: 10
pert_angle: 20 # maximal angle of random rotation, in degrees
pert_min_scale: 0.8 # min &
pert_max_scale: 1.2 # max random scaling factor
}
}
</pre></div>
<p>PicPac supports a full range of flexible configurations. See the
<a href="http://picpac.readthedocs.io/en/latest/">documentation</a> for details.</p>
<h1>PicPac with TensorFlow</h1>
<p>PicPac has a simple Python interface that accepts the same parameters.</p>
<div class="highlight"><pre><span></span> <span class="n">config</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">loop</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">reshuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
<span class="n">batch</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">split</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">split_fold</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="n">annotate</span><span class="o">=</span><span class="s1">'json'</span><span class="p">,</span>
<span class="n">channels</span><span class="o">=</span><span class="n">FLAGS</span><span class="o">.</span><span class="n">channels</span><span class="p">,</span>
<span class="n">stratify</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="n">mixin</span><span class="o">=</span><span class="s2">"db0"</span><span class="p">,</span>
<span class="n">mixin_group_delta</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
<span class="c1">#pert_color1=10,</span>
<span class="c1">#pert_angle=5,</span>
<span class="c1">#pert_min_scale=0.8,</span>
<span class="c1">#pert_max_scale=1.2,</span>
<span class="c1">#pad=False,</span>
<span class="c1">#pert_hflip=True,</span>
<span class="n">channel_first</span><span class="o">=</span><span class="bp">False</span>
<span class="p">)</span>
<span class="n">stream</span> <span class="o">=</span> <span class="n">picpac</span><span class="o">.</span><span class="n">ImageStream</span><span class="p">(</span><span class="s1">'db'</span><span class="p">,</span> <span class="n">negate</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">perturb</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="o">**</span><span class="n">config</span><span class="p">)</span>
<span class="o">...</span>
<span class="k">with</span> <span class="n">tf</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">sess</span><span class="p">:</span>
<span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">(</span><span class="n">init</span><span class="p">)</span>
<span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">FLAGS</span><span class="o">.</span><span class="n">max_steps</span><span class="p">):</span>
<span class="n">images</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">pad</span> <span class="o">=</span> <span class="n">stream</span><span class="o">.</span><span class="n">next</span><span class="p">()</span>
<span class="n">feed_dict</span> <span class="o">=</span> <span class="p">{</span><span class="n">X</span><span class="p">:</span> <span class="n">images</span><span class="p">,</span>
<span class="n">Y_</span><span class="p">:</span> <span class="n">labels</span><span class="p">}</span>
<span class="n">_</span><span class="p">,</span> <span class="n">loss_value</span> <span class="o">=</span> <span class="n">sess</span><span class="o">.</span><span class="n">run</span><span class="p">([</span><span class="n">train_op</span><span class="p">,</span> <span class="n">loss</span><span class="p">],</span> <span class="n">feed_dict</span><span class="o">=</span><span class="n">feed_dict</span><span class="p">)</span>
</pre></div>Image Storage For Deep Learning, Raw or JPG?2015-11-07T00:00:00-05:002015-11-07T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-11-07:/image-storage-for-deep-learning-raw-or-jpg.html<p>Caffe requires the user to preload training images into a database, and the
images are stored as raw pixels. The following calculation shows that this is
not a very good idea.</p>
<p>Assume that images are pre-scaled to 256×256, so each raw image costs 256x256x3
= 192KB of storage. On the other hand, 256x256x3 JPEG images compressed with
default parameters cost about 48KB, about 1/4 the storage of raw pixels.
Benchmark shows that a 4-core 2600K can decode jpeg images of this size at a
rate of 6500/s using all cores, or 1350/s with one core. If we assume only one
core can be allocated for image decoding, the processing power translates to
63MB/s input throughput of JPEG data, and 253MB/s output throughput of raw
pixels. The sequential read throughput of a traditional HDD is about 100MB/s,
which is above the input throughput of one core. So an economical design would
be to store the images compressed with JPEG on a traditional HDD, and decode the
image with a dedicated CPU core. The throughput of Caffe, according to the
website, is about 4ms/image for learning and 1ms/image for predicting on a K40
GPU. So the throughput of the above configuration can well saturate the GPU
power even for predicting. The whole system is nice and balanced, and a
mainstream HDD provides about 3TB of storage. This also leaves some room for future
growth of GPU power and training image size (HDD/SSD grows in capacity rather
than throughput).</p>
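<p>The arithmetic above, collected in one place (all numbers taken from the text):</p>

```python
jpeg_kb, raw_kb = 48, 192   # per-image storage: JPEG vs raw pixels
decode_rate = 1350          # images/s decoded on one core

print(decode_rate * jpeg_kb / 1024.0)  # ~63 MB/s of JPEG read in
print(decode_rate * raw_kb / 1024.0)   # ~253 MB/s of raw pixels out
# A K40 predicting at 1 ms/image consumes at most
print(1000 * jpeg_kb / 1024.0)         # ~47 MB/s of JPEG,
# which one decoding core and one HDD (~100 MB/s sequential) can sustain.
```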
<p>Of course, this all relies on being able to achieve 63MB/s throughput from the
disk, and achieving this on a HDD requires sequential I/O. With images stored
in a database, it requires a very fast SSD to achieve such throughput. That’s
why I developed the PicPoc image storage for deep learning. (Benchmarks show
that sequential reads with LMDB DO achieve raw hardware throughput, whether on
HDD or SSD. The storage overhead of LMDB is also reasonably low, around 3% as I
measured with the ILSVRC 2012 dataset.)</p>
<p>Here are some performance numbers I’ve been achieving with preliminary
experiments.</p>
<p>Importing the fall 2011 version of ImageNet (14 million images stored in 21935 tar
files, totalling about 1.2TB) into PicPoc took about 10 hours. The output is
400GB. The input was on one HDD and the output on another. CPU usage was 213.6%.
Considering that reading 1.2TB from an HDD takes about 3.5 hours and CPU usage was only about
50%, there’s a possibility of doubling the loading throughput. But that’s a one-shot
business, so I’ll say it’s good enough for now. The ILSVRC 2012 training
data, when imported, costs 28GB of storage, as opposed to 173GB when imported into LMDB
as raw pixels as described in Caffe’s documentation. (One doesn’t have to use
raw pixels with LMDB: the Caffe Datum can be used to store encoded images, and
OpenCV supports pretty much all popular image codecs.)</p>
<p>On reading with decoding, the system is able to sustain 120MB/s throughput on a
traditional 1TB HDD. I’ve also created a <a href="https://github.com/aaalgo/caffe-picpoc">Caffe fork with PicPoc backend</a>.</p>How Did Go Come About?2015-10-15T00:00:00-04:002015-10-15T00:00:00-04:00Wei Dongtag:www.wdong.org,2015-10-15:/goshi-zen-yao-lai-de.html<p>pthread is a wrapper around OS threads. Lightweight threads require user-space threads.
Microsoft’s fibers are user-space, but being a Windows-world thing, they never gained much traction.
Implementing user-space threads on multi-core machines is an extremely nasty job, and
Go’s main contribution is actually getting it done. The nastiest part is
handling job stealing: when all the tasks on one OS thread have finished,
it has to go grab work from the other OS threads.
That involves all kinds of synchronization and locking, and with too much locking, performance goes down.
Quite a few companies have probably had people build this internally before, and
the people who can are usually formidably good. A quick comparison shows how hard multi-core really is:
node.js does not support multi-core, and Python, after all these years of struggle, is still crippled.
If you want to learn more, start from man makecontext.
Each user-space thread is really a context, and underneath, each OS thread
manages a pile of contexts. Context switching relies mainly on cooperative scheduling,
rather than the preemptive scheduling used by the OS. That is, at some point a running context
voluntarily gives up execution. People in the Unix world have generally never seen cooperative scheduling.
Windows 3.x used cooperative scheduling, so a thread had to call yield every so often
to give up control. Since programmers can’t be expected to insert a yield every few lines,
many Windows UI and I/O APIs had yield built in. Bad programmers who didn’t know
this would sometimes enter a pure-computation loop with no yield in the middle and
hang the system. The Unix world has been preemptive from the very beginning, and its OS APIs have no
built-in yield; hand-inserting a yield every few lines isn’t feasible either. That is why user-space
threads are nearly impossible in C/C++ in the Unix world, and it is also why Rob Pike insisted on
creating a new language: to wrap a layer around the OS APIs and embed yield in it (plus GC). At the
language level, Go actually innovates nothing, and is even lower-level than many existing languages.
Judged by its starting goal, though, Go has achieved its purpose. By contrast, Windows has a tradition
of yielding inside its APIs, which is why fibers came about there so easily.</p>
<p>C++11’s thread API has a puzzling this_thread::yield, which is there precisely to
leave room for non-preemptive runtimes. In theory, if calling OS APIs directly were forbidden and
all I/O went through the C++ libraries, a user-space-thread runtime could be built for C++. Some
embedded systems have no preemptive scheduling to begin with, and there yield becomes a necessity.</p>How to Profile Yarn App/Container Memory Usage2015-01-13T00:00:00-05:002015-01-13T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-01-13:/how-to-profile-yarn-appcontainer-memory-usage.html<p>Yarn does not provide a tool to profile the memory usage of an app yet, but it does save some instrumentation information to the log. Like this.</p>
<div class="highlight"><pre><span></span><span class="nt">yarn-wdong-nodemanager-washtenaw</span><span class="nc">.log</span><span class="nd">:2015-01-06</span> <span class="nt">14</span><span class="nd">:56:43</span><span class="o">,</span><span class="nt">267</span> <span class="nt">INFO</span> <span class="nt">org</span><span class="nc">.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl</span><span class="o">:</span> <span class="nt">Memory</span> <span class="nt">usage</span> <span class="nt">of</span> <span class="nt">ProcessTree</span> <span class="nt">16669</span> <span class="nt">for</span> <span class="nt">container-id</span> <span class="nt">container_1420574192658_0001_01_000001</span><span class="o">:</span> <span class="nt">277</span><span class="nc">.3</span> <span class="nt">MB</span> <span class="nt">of</span> <span class="nt">9</span> <span class="nt">GB</span> <span class="nt">physical</span> <span class="nt">memory</span> <span class="nt">used</span><span class="o">;</span> <span class="nt">8</span><span class="nc">.9</span> <span class="nt">GB</span> <span class="nt">of</span> <span class="nt">18</span><span class="nc">.9</span> <span class="nt">GB</span> <span class="nt">virtual</span> <span class="nt">memory</span> <span class="nt">used</span>
</pre></div>
<p>The numbers reported are actually the ones based on which Yarn decides to kill processes.</p>
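<p>If you want to pull these numbers out yourself, a minimal sketch (the regular expression is inferred from the log line above; it is not necessarily how the linked script works):</p>

```python
import re

# One ContainersMonitorImpl line, as in the sample above.
LINE = ("Memory usage of ProcessTree 16669 for container-id "
        "container_1420574192658_0001_01_000001: 277.3 MB of 9 GB "
        "physical memory used; 8.9 GB of 18.9 GB virtual memory used")

PAT = re.compile(r"container-id (\S+): ([\d.]+) (MB|GB) of ([\d.]+) GB "
                 r"physical memory used")

m = PAT.search(LINE)
cid = m.group(1)
used_gb = float(m.group(2)) / (1024.0 if m.group(3) == "MB" else 1.0)
limit_gb = float(m.group(4))
print(cid, used_gb, limit_gb)
```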
<p><a href="https://github.com/aaalgo/yarn-memory-tracker">This script</a> analyzes the log and reports maximal memory usage of each container for a particular app.</p>
<p>Sample output</p>
<div class="highlight"><pre><span></span>$ yarn-memory-tracker.sh application_1421176927536_0002 <span class="c1"># a Spark app</span>
<span class="m">383</span> containers found <span class="k">for</span> app application_1421176927536_0002
container_1421176927536_0001_01_000001: 0.254785 of 16.4 GB
container_1421176927536_0001_01_000002: 16.2 of 51.4 GB
container_1421176927536_0001_01_000003: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000004: 0.00107422 of 51.4 GB
container_1421176927536_0001_01_000005: 12.5 of 51.4 GB
container_1421176927536_0002_01_000001: 0.251563 of 16.4 GB
container_1421176927536_0002_01_000002: 16.1 of 51.4 GB
......
</pre></div>HDFS Demystified: How to Manually Assemble an HDFS File When Hadoop is Down2015-01-08T00:00:00-05:002015-01-08T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-01-08:/hdfs-demistified-how-to-manually-assemble-an-hdfs-file-when-hadoop-is-down.html<p>In this blog I’ll explain how to manually assemble a file stored in HDFS using 1) metadata on the namenode and 2) blocks on the datanodes. The process does not require the Hadoop system to be up and running. It’s an interesting exercise to gain some knowledge about the internal mechanisms of Hadoop, and such knowledge can be handy when it comes to data recovery.</p>
<p>(1) Fetch the fsimage.</p>
<p>Below is a tree view of what are stored in the namenode directory.</p>
<div class="highlight"><pre><span></span>$ <span class="nb">cd</span> HADOOP_NAMENODE_DIR
$ tree .
.
<span class="p">|</span>-- current
<span class="p">|</span> <span class="p">|</span>-- VERSION
<span class="p">|</span> <span class="p">|</span>-- edits_0000000000005342851-0000000000005440884
<span class="p">|</span> <span class="p">|</span>-- edits_0000000000006347975-0000000000006347976
... ...
<span class="p">|</span> <span class="p">|</span>-- edits_0000000000006347977-0000000000006347986
<span class="p">|</span> <span class="p">|</span>-- edits_inprogress_0000000000006347987
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347976
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347976.md5
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347986
<span class="p">|</span> <span class="p">|</span>-- fsimage_0000000000006347986.md5
<span class="p">|</span> <span class="sb">`</span>-- seen_txid
<span class="sb">`</span>-- in_use.lock
<span class="m">1</span> directory, <span class="m">50</span> files
</pre></div>
<p>We are interested in the fsimage file with the largest suffix number. Copy it out as “fsimage”. If our file in HDFS was recently uploaded or modified, its full metadata might not be present in the fsimage; some of the data could still be in one of the edits_ files, and we won’t be able to fully assemble the most recent version of the file. One way to force Hadoop to produce a new checkpoint is to restart Hadoop: edit logs are merged into a new fsimage upon restart.</p>
<p>Before proceeding to the next step, it is useful to examine the content of the VERSION file.</p>
<div class="highlight"><pre><span></span>$ cat VERSION
<span class="c1">#Wed Oct 01 01:25:37 CST 2014</span>
<span class="nv">namespaceID</span><span class="o">=</span>1453566641
<span class="nv">clusterID</span><span class="o">=</span>CID-a5f06877-24b3-4892-9dcf-05fccf827889
<span class="nv">cTime</span><span class="o">=</span>0
<span class="nv">storageType</span><span class="o">=</span>NAME_NODE
<span class="nv">blockpoolID</span><span class="o">=</span>BP-908018994-10.10.2.27-1412043710870
<span class="nv">layoutVersion</span><span class="o">=</span>-56
</pre></div>
<p>We’ll need the blockpoolID information.</p>
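<p>The VERSION file is a simple key=value properties file, so grabbing the blockpoolID programmatically is straightforward (a sketch):</p>

```python
def read_version(path="VERSION"):
    # Parse a Hadoop VERSION file: key=value lines, '#' comment lines.
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                props[key] = value
    return props

# e.g. read_version()["blockpoolID"]
# -> "BP-908018994-10.10.2.27-1412043710870" for the sample above
```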
<p>(2) Examine the content of fsimage.</p>
<p>Use the following command to dump the content of fsimage to an XML file.</p>
<div class="highlight"><pre><span></span>$ hdfs oiv -i fsimage -o fsimage.xml -p XML
</pre></div>
<p>Let’s say we are interested in recovering the file “/user/home/playtime_20140915.txt”. We can find the following relevant information in the fsimage dump.</p>
<div class="highlight"><pre><span></span><span class="nt"><inode><id></span>16392<span class="nt"></id><type></span>FILE<span class="nt"></type><name></span>playtime_20140915.txt<span class="nt"></name><replication></span>2<span class="nt"></replication><mtime></span>1412052903661<span class="nt"></mtime><atime></span>1418937665301<span class="nt"></atime><perferredBlockSize></span>134217728<span class="nt"></perferredBlockSize><permission></span>wdong:supergroup:rw-r--r--<span class="nt"></permission><blocks><block><id></span>1073741825<span class="nt"></id><genstamp></span>1001<span class="nt"></genstamp><numBytes></span>134217728<span class="nt"></numBytes></block></span>
<span class="nt"><block><id></span>1073741826<span class="nt"></id><genstamp></span>1002<span class="nt"></genstamp><numBytes></span>134217728<span class="nt"></numBytes></block></span>
<span class="nt"><block><id></span>1073741827<span class="nt"></id><genstamp></span>1003<span class="nt"></genstamp><numBytes></span>49999484<span class="nt"></numBytes></block></span>
<span class="nt"></blocks></span>
<span class="nt"></inode></span>
</pre></div>
<p>We can extract the list of block IDs by either eyeballing or programming.</p>
<div class="highlight"><pre><span></span>1073741825
1073741826
1073741827
</pre></div>
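<p>The “programming” route can be as simple as walking the XML dump with the standard library (a sketch using the element names seen in the inode above):</p>

```python
import xml.etree.ElementTree as ET

def blocks_of(fsimage_xml, filename):
    # Return (block id, byte count) pairs for the named file,
    # matching inodes by their <name> element.
    root = ET.parse(fsimage_xml).getroot()
    for inode in root.iter("inode"):
        if inode.findtext("name") == filename:
            return [(int(b.findtext("id")), int(b.findtext("numBytes")))
                    for b in inode.iter("block")]
    return []
```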
<p>We can also add up the number of bytes (318434940). If Hadoop is up, we can verify if the file size is correct.</p>
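<p>Once the blocks have been located and copied somewhere (step 3 below), assembling the file is just concatenating the blk_ files in the order their IDs appear in the inode. A sketch, with a hypothetical directory of collected blocks:</p>

```python
import shutil

def assemble(block_ids, block_dir, out_path):
    # Concatenate blk_<id> files in inode order to rebuild the HDFS file.
    with open(out_path, "wb") as out:
        for bid in block_ids:
            with open("%s/blk_%d" % (block_dir, bid), "rb") as blk:
                shutil.copyfileobj(blk, out)

# e.g. assemble([1073741825, 1073741826, 1073741827], "blocks",
#               "playtime_20140915.txt")
```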
<p>(3). Gather the blocks.</p>
<p>Hadoop does not maintain an on-disk file or database mapping block IDs to nodes. This is actually a nice stateless design. We’ll need to manually enumerate each node to find the blocks we need. Here’s a sample layout of a Hadoop data directory.</p>
<div class="highlight"><pre><span></span>$ tree .
.
<span class="p">|</span>-- current
<span class="p">|</span> <span class="p">|</span>-- BP-908018994-10.10.2.27-1412043710870
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- current
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- VERSION
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- dfsUsed
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- finalized
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073743127
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073834270_93446.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073752388_11564.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1073801146
......
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- blk_1073801146_60322.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- rbw
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1074397675
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1074397675_656923.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- blk_1074397684
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- blk_1074397684_656932.meta
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- dncp_block_verification.log.curr
<span class="p">|</span> <span class="p">|</span> <span class="p">|</span>-- dncp_block_verification.log.prev
<span class="p">|</span> <span class="p">|</span> <span class="sb">`</span>-- tmp
<span class="p">|</span> <span class="sb">`</span>-- VERSION
<span class="sb">`</span>-- in_use.lock
<span class="m">390</span> directories, <span class="m">16252</span> files
</pre></div>
<p>Here we see the blockpoolId we noted before used as a directory name. Blocks are simply named blk_ID inside one of the subdirectories.</p>
<p>In our cluster, the data directories are mounted as “/data/hadoop/data*/”, so it is quite easy to launch a cluster-wide search with pdsh.</p>
<div class="highlight"><pre><span></span>$ pdsh <span class="s2">"find /data/hadoop/data*/ -name blk_1073741825"</span>
klose4: /data/hadoop/data3/current/BP-908018994-10.10.2.27-1412043710870/current/finalized/subdir56/blk_1073741825
klose2: /data/hadoop/data2/current/BP-908018994-10.10.2.27-1412043710870/current/finalized/blk_1073741825
</pre></div>
<p>We see that the block has two replicas. We can modify the above command a little to copy the files over:</p>
<div class="highlight"><pre><span></span>$ <span class="k">for</span> B in <span class="m">1073741825</span> <span class="m">1073741826</span> <span class="m">1073741827</span> <span class="p">;</span> <span class="k">do</span> pdsh <span class="s2">"find /data/hadoop/data*/ -name blk_</span><span class="nv">$B</span><span class="s2">"</span> <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> a b<span class="p">;</span> <span class="k">do</span> scp <span class="nv">$a$b</span> . <span class="p">;</span> break<span class="p">;</span> <span class="k">done</span> <span class="p">;</span> <span class="k">done</span>
</pre></div>
<p>The break command stops us from copying more than one replica of each block. (pdsh prefixes each output line with “host: ”, so read splits it into a=“host:” and b=path, and $a$b is exactly the host:path form scp expects.)</p>
<p>(4). Assemble the file.</p>
<div class="highlight"><pre><span></span>$ cat blk_1073741825 blk_1073741826 blk_1073741827 <span class="p">|</span> md5sum
ad07d7ced9c9210b4a4b14d08c0d146f -
$ hdfs dfs -cat playtime_20140915.txt <span class="p">|</span> md5sum <span class="c1"># only when hadoop is up.</span>
ad07d7ced9c9210b4a4b14d08c0d146f -
</pre></div>
<p>Bingo!</p>Spark on Yarn: Where Have All the Memory Gone?2015-01-08T00:00:00-05:002015-01-08T00:00:00-05:00Wei Dongtag:www.wdong.org,2015-01-08:/spark-on-yarn-where-have-all-the-memory-gone.html<p>Efficient processing of big data, especially with Spark, is really all about how much memory one can afford, or how efficiently one can use the limited amount of available memory. Efficient memory utilization, however, is not something one can take for granted with the default configuration shipped with Spark and Yarn. Rather, it takes very careful provisioning and tuning to get as much as possible out of the bare metal. In this post I’ll demonstrate a case where not-so-careful configuration of Spark on Yarn leads to poor memory utilization for caching, explain the math behind all the observed numbers, and give some tips on parameter tuning to address the problem.</p>
<p>A little bit of background first. I’m not working at one of those big-name companies, and do not have thousands of machines at my disposal; my group has fewer than ten for data crunching. Months of experimental usage of Spark in a standalone configuration showed very promising results, and we want to use those few machines for both production and development. That is, we want to be able to run two Spark apps in parallel. Since we already have Hadoop, Spark on Yarn seems a natural choice. It is not difficult at all to set up Spark on Yarn, but I quickly found that I was not able to fire up the second instance of Spark because, as seen by Yarn, I was out of memory. I’ll use the following simplified one-machine setup to demonstrate the problem.</p>
<h1>1. Demonstration of the Problem</h1>
<p>This demo was run on a desktop with 64g memory. I used the following setting:</p>
<div class="highlight"><pre><span></span>yarn.nodemanager.resource.memory-mb = 49152 # 48G
yarn.scheduler.maximum-allocation-mb = 24576 # 24G
SPARK_EXECUTOR_INSTANCES=1
SPARK_EXECUTOR_MEMORY=18G
SPARK_DRIVER_MEMORY=4G
</pre></div>
<p>The Yarn parameters went into yarn-site.xml, and the Spark ones in spark-env.sh. I didn’t set any other memory-related parameters.</p>
<p>So the total memory allocated to Yarn was 48G, with a 24G maximum for one app. Spark should use 18+4 = 22G of memory, which is below the 24G cap, so I should have been able to run two Spark apps in parallel.</p>
<p>Following are the numbers I got from all the logs and Web UIs when I actually fired up one Spark app.</p>
<ul>
<li>(Yarn) Memory Total: 48G</li>
<li>(Yarn) Memory Used: 25G</li>
<li>(Yarn) Container 1, TotalMemoryNeeded: 5120M</li>
<li>(Yarn) Container 2, TotalMemoryNeeded: 20480M</li>
<li>(Spark) Driver memory requirement: 4480 MB including 384 MB overhead (from the output of spark-shell)</li>
<li>(Spark) Driver available memory to App: 2.1G</li>
<li>(Spark) Executor available memory to App: 9.3G</li>
</ul>
<p>Below are the relevant screenshots.</p>
<p><img alt="ui" src="http://www.wdong.org/memory-yarn.png">
<img alt="ui" src="http://www.wdong.org/memory-spark.png"></p>
<p>So here are the problems that I see with the driver:</p>
<ul>
<li>I’ve configured the Spark driver to use 4G, and Spark asked Yarn for 4G plus an overhead of 384MB.</li>
<li>What is reflected in Yarn is that the driver has used 5G.</li>
<li>What’s really available in the driver’s block manager is only 2.1G.</li>
</ul>
<p>One has to understand that Spark has to reserve a portion of memory for code execution and cannot give everything to the block manager (the cache), but still,</p>
<p>WHERE HAVE ALL THE MEMORY GONE???</p>
<h1>2. The Math Behind</h1>
<p>Rule 1. Yarn always rounds memory requests up to a multiple of yarn.scheduler.minimum-allocation-mb, which defaults to 1024 (1GB). That’s why the driver’s request of 4G+384M showed up as 5G in Yarn. The parameter yarn.scheduler.minimum-allocation-mb is really a “minimum-allocation-unit-mb”. This can be easily verified by setting the parameter to a prime number, such as 97, and observing that Yarn allocates in multiples of that number.</p>
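Rule 1 can be sketched in a couple of lines of shell, using the driver request from the logs above:

```shell
# Round a request up to the next multiple of yarn.scheduler.minimum-allocation-mb
unit=1024                                      # default minimum-allocation-mb
request=$(( 4096 + 384 ))                      # driver: 4G plus 384M overhead
granted=$(( (request + unit - 1) / unit * unit ))
echo "$granted"    # 5120 -- the 5G that Yarn reports
```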
<p>Rule 2. Spark adds an overhead to SPARK_EXECUTOR_MEMORY/SPARK_DRIVER_MEMORY before asking Yarn for that amount. The overhead rule is the same for both executor and driver:</p>
<div class="highlight"><pre><span></span>//yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala
val MEMORY_OVERHEAD_FACTOR = 0.07
val MEMORY_OVERHEAD_MIN = 384
//yarn/common/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
protected val memoryOverhead: Int = sparkConf.getInt("spark.yarn.executor.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * executorMemory).toInt, MEMORY_OVERHEAD_MIN))
......
val totalExecutorMemory = executorMemory + memoryOverhead
numPendingAllocate.addAndGet(missing)
logInfo(s"Will allocate $missing executor containers, each with $totalExecutorMemory MB " +
  s"memory including $memoryOverhead MB overhead")
</pre></div>
<p>This overhead is necessary because when a JVM program is allowed a certain amount of heap (via -Xmx), the overall memory usage of the JVM process can exceed that amount, and Yarn literally kills programs that use more memory than allowed (following complicated rules). One can only adjust the two magic numbers by modifying the source.</p>
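The overhead rule boils down to max(0.07 × memory, 384) MB. A small shell sketch of it reproduces both numbers from the demo:

```shell
# Overhead per YarnAllocator: max(7% of the requested MB, 384), integer-truncated
overhead () {
    o=$(( $1 * 7 / 100 ))
    [ "$o" -lt 384 ] && o=384
    echo "$o"
}
echo "$(overhead 18432)"   # executor 18G -> 1290
echo "$(overhead 4096)"    # driver   4G  -> 384 (7% is only 286, below the floor)
```

Combined with Rule 1, the executor’s 18432+1290 = 19722MB rounds up to 20480MB and the driver’s 4096+384 = 4480MB rounds up to 5120MB, exactly the two container sizes Yarn reported.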
<p>The above two rules determine how the configured SPARK_XXX_MEMORY finally show up in Yarn.</p>
<p>Rule 3. How much memory the driver/executor sees.</p>
<p>One limits the maximal heap memory of the JVM with the option “-Xmx”. Part of the specified memory gets used by the Scala runtime and other system components, so what a Scala program sees is less than the specified amount. This can be illustrated with the following example.</p>
<div class="highlight"><pre><span></span>$ scala -J-Xmx4g
Welcome to Scala version 2.10.3 <span class="o">(</span>OpenJDK 64-Bit Server VM, Java 1.7.0_51<span class="o">)</span>.
Type in expressions to have them evaluated.
Type :help <span class="k">for</span> more information.
scala> Runtime.getRuntime.maxMemory
res0: <span class="nv">Long</span> <span class="o">=</span> 3817865216
scala>
</pre></div>
<p>The runtime eats about 455M. (The above process has an RSS of 140.3M in Linux, so a big portion of the 455M is reserved rather than actually used.)</p>
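The 455M figure is just the 4G of -Xmx minus the maxMemory reported above:

```shell
# 4G heap minus the maxMemory seen by the Scala REPL, converted to MiB
reserved=$(awk 'BEGIN { print (4 * 1024 * 1024 * 1024 - 3817865216) / (1024 * 1024) }')
echo "$reserved MiB"    # 455 MiB
```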
<p>The Spark driver is allocated the configured 4G by JVM options. This can be verified by running the following from inside the Spark shell.</p>
<div class="highlight"><pre><span></span><span class="n">scala</span><span class="o">></span>
<span class="n">scala</span><span class="o">></span> <span class="kn">import</span> <span class="nn">java.lang.management.ManagementFactory</span>
<span class="kn">import</span> <span class="nn">java.lang.management.ManagementFactory</span>
<span class="n">scala</span><span class="o">></span> <span class="n">ManagementFactory</span><span class="o">.</span><span class="n">getRuntimeMXBean</span><span class="o">.</span><span class="n">getInputArguments</span>
<span class="n">res0</span><span class="p">:</span> <span class="n">java</span><span class="o">.</span><span class="n">util</span><span class="o">.</span><span class="n">List</span><span class="p">[</span><span class="n">String</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="n">XX</span><span class="p">:</span><span class="n">MaxPermSize</span><span class="o">=</span><span class="mi">128</span><span class="n">m</span><span class="p">,</span> <span class="o">-</span><span class="n">Djava</span><span class="o">.</span><span class="n">library</span><span class="o">.</span><span class="n">path</span><span class="o">=/</span><span class="n">home</span><span class="o">/</span><span class="n">hadoop</span><span class="o">/</span><span class="n">hadoop</span><span class="o">-</span><span class="mf">2.4</span><span class="o">.</span><span class="mi">1</span><span class="o">/</span><span class="n">lib</span><span class="o">/</span><span class="n">native</span><span class="p">,</span> <span class="o">-</span><span class="n">Xms4G</span><span class="p">,</span> <span class="o">-</span><span class="n">Xmx4G</span><span class="p">]</span>
<span class="n">scala</span><span class="o">></span> <span class="n">Runtime</span><span class="o">.</span><span class="n">getRuntime</span><span class="o">.</span><span class="n">maxMemory</span>
<span class="n">res1</span><span class="p">:</span> <span class="n">Long</span> <span class="o">=</span> <span class="mi">4116709376</span>
<span class="n">scala</span><span class="o">></span>
</pre></div>
<p>Rule 4. How Spark determines the maximal usable memory</p>
<div class="highlight"><pre><span></span>//core/src/main/scala/org/apache/spark/storage/BlockManager.scala
/** Return the total amount of storage memory available. */
private def getMaxMemory(conf: SparkConf): Long = {
  val memoryFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
  val safetyFraction = conf.getDouble("spark.storage.safetyFraction", 0.9)
  (Runtime.getRuntime.maxMemory * memoryFraction * safetyFraction).toLong
}
</pre></div>
<p>We have 4116709376 * 0.6 * 0.9 = 2.07G; that is where the 2.1G value comes from. The maximal available memory of the executor is derived the same way.</p>
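Plugging the driver’s maxMemory into Rule 4:

```shell
# maxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction
usable=$(awk 'BEGIN { printf "%.0f", 4116709376 * 0.6 * 0.9 }')
echo "$usable"    # 2223023063 bytes, about 2.07 GiB -- the 2.1G shown in the UI
```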
<p>Overall, the following two formulas guide memory allocation:</p>
<ul>
<li>What’s seen by Yarn: (SPECIFIED_MEMORY + OVERHEAD), rounded up to a multiple of minimum-allocation-mb, with OVERHEAD = max(SPECIFIED_MEMORY * 0.07, 384M)</li>
<li>What’s usable for cache: (SPECIFIED_MEMORY - MEMORY_USED_BY_RUNTIME) * spark.storage.memoryFraction * spark.storage.safetyFraction</li>
</ul>
<h1>3. Tuning Suggestions</h1>
<p>We see that the root of the memory underutilization is the over-provisioning done at almost every step. Even if a process really reaches its configured memory cap, it is unlikely to keep using that much memory all the time. Because Yarn actually kills a process when it exceeds the memory cap, we have to keep SPARK_XXX_MEMORY big enough. It is also very difficult to determine how much memory Spark code execution actually uses, so tuning spark.storage.memoryFraction is tricky. But if one is sure that the overall memory consumption of the parallel Spark apps is unlikely to exceed the physical memory, the easiest way to improve memory utilization is to counter the over-provisioning with overcommitment. That is, set the Yarn parameter yarn.nodemanager.resource.memory-mb to MORE THAN THE AVAILABLE PHYSICAL MEMORY (luckily Yarn does not check that). It also helps a little to set yarn.scheduler.minimum-allocation-mb to a small value like 100M, so an app does not get much more than what it asks for.</p>C++11 Multithreading Guide2014-12-03T00:00:00-05:002014-12-03T00:00:00-05:00Wei Dongtag:www.wdong.org,2014-12-03:/c11duo-xian-cheng-gong-lue.html<p><a href="http://www.wdong.org/thread-tutorial.cpp">Tutorial in one program (contains Chinese).</a></p>
<div class="highlight"><pre><span></span><span class="cp">#include</span> <span class="cpf"><vector></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><thread></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><chrono></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><mutex></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><future></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><stdexcept></span><span class="cp"></span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">cout</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">cerr</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">endl</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">vector</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">lock_guard</span><span class="p">;</span>
<span class="k">using</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">;</span>
<span class="k">namespace</span> <span class="n">chrono</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">chrono</span><span class="p">;</span>
<span class="k">namespace</span> <span class="n">this_thread</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">this_thread</span><span class="p">;</span>
<span class="c1">// Implementing threads explicitly with std::thread</span>
<span class="c1">// Constructing a thread from a callback function</span>
<span class="c1">// Note how the std::thread constructor forwards an arbitrary set of arguments to the callback.</span>
<span class="c1">// This is a new C++11 feature. Below is the prototype of the std::thread constructor:</span>
<span class="c1">//</span>
<span class="c1">// template< class Function, class... Args ></span>
<span class="c1">// explicit thread::thread( Function&& f, Args&&... args );</span>
<span class="c1">//</span>
<span class="c1">// Note that the first parameter is also a template type, so you can pass in a function pointer,</span>
<span class="c1">// a std::function object, a lambda, or any class that overloads the () operator.</span>
<span class="c1">// The function below implements one thread's computation; it will be used as the callback to create threads</span>
<span class="kt">void</span> <span class="nf">fun</span> <span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"th "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="s">": "</span> <span class="o"><<</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// Sleep for a while</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"wait 1 second."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">seconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"wait 10 milliseconds."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="mi">10</span><span class="p">));</span>
<span class="c1">// Another way to sleep</span>
<span class="k">auto</span> <span class="n">now</span> <span class="o">=</span> <span class="n">chrono</span><span class="o">::</span><span class="n">system_clock</span><span class="o">::</span><span class="n">now</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"wait another 10 milliseconds."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_until</span><span class="p">(</span><span class="n">now</span> <span class="o">+</span> <span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="mi">10</span><span class="p">));</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="nf">fun_sync</span> <span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">);</span>
<span class="kt">void</span> <span class="nf">demo_thread</span> <span class="p">()</span> <span class="p">{</span>
<span class="c1">// Note A: a thread object must have join or detach called on it before it is destructed,</span>
<span class="c1">// otherwise the program aborts</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">th</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="c1">// Wait for the thread to finish</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">th</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="c1">// Detach from the thread and let it run to completion on its own</span>
<span class="n">th</span><span class="p">.</span><span class="n">detach</span><span class="p">();</span>
<span class="c1">// After detach, the thread can no longer be controlled through th</span>
<span class="p">}</span>
<span class="p">{</span>
<span class="c1">// The code below would abort</span>
<span class="cm">/*</span>
<span class="cm"> std::thread th(fun, 1, 2);</span>
<span class="cm"> */</span>
<span class="p">}</span>
<span class="c1">// Start a thread with a lambda</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="kr">thread</span> <span class="n">th</span><span class="p">(</span> <span class="p">[](</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"thread with lambda "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="s">": "</span> <span class="o"><<</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">},</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// std::thread and the = operator</span>
<span class="c1">// A thread object is in one of two states:</span>
<span class="c1">// 1. associated with a thread</span>
<span class="c1">// 2. not associated with any thread</span>
<span class="c1">// Note A above only applies to thread objects in state 1</span>
<span class="c1">//</span>
<span class="c1">// A thread object constructed without arguments is in state 2</span>
<span class="c1">// and can be destructed directly</span>
<span class="c1">//</span>
<span class="c1">// Assignment with = on std::thread performs a move, not a copy.</span>
<span class="c1">// That is, after</span>
<span class="c1">// A = B;</span>
<span class="c1">// if B was associated with a thread, then once the operation completes</span>
<span class="c1">// - the associated thread is transferred to A</span>
<span class="c1">// - B is no longer associated with the original thread</span>
<span class="c1">//</span>
<span class="c1">// Question: what if A was already associated with some thread before the =?</span>
<span class="c1">// I don't know the answer either :)</span>
<span class="c1">// The safest approach is to make sure the thread object is empty before =,</span>
<span class="c1">// or simply avoid = altogether, unless...</span>
<span class="p">{</span>
<span class="c1">// Manage threads with a vector</span>
<span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">></span> <span class="n">threads</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="c1">// Start 10 threads</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">threads</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// note the move assignment</span>
<span class="p">}</span>
<span class="c1">// Wait for all of them to finish</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">th</span><span class="p">:</span> <span class="n">threads</span><span class="p">)</span> <span class="p">{</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Without synchronization, the output above sometimes gets garbled,</span>
<span class="c1">// so writes to cout/cerr from inside a thread need to be synchronized.</span>
<span class="c1">// Let's do it again with fun_sync</span>
<span class="p">{</span>
<span class="c1">// Manage threads with a vector</span>
<span class="n">vector</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">></span> <span class="n">threads</span><span class="p">(</span><span class="mi">10</span><span class="p">);</span>
<span class="c1">// Start 10 threads</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="n">threads</span><span class="p">.</span><span class="n">size</span><span class="p">();</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">threads</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="p">(</span><span class="n">fun_sync</span><span class="p">,</span> <span class="n">i</span><span class="p">,</span> <span class="n">i</span><span class="p">);</span> <span class="c1">// note the move assignment</span>
<span class="p">}</span>
<span class="c1">// Wait for all of them to finish</span>
<span class="k">for</span> <span class="p">(</span><span class="k">auto</span> <span class="o">&</span><span class="nl">th</span><span class="p">:</span> <span class="n">threads</span><span class="p">)</span> <span class="p">{</span>
<span class="n">th</span><span class="p">.</span><span class="n">join</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="c1">// Thread synchronization</span>
<span class="kt">void</span> <span class="nf">fun_sync</span> <span class="p">(</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span>
<span class="k">static</span> <span class="n">std</span><span class="o">::</span><span class="n">mutex</span> <span class="n">mutex</span><span class="p">;</span> <span class="c1">// shared by all threads</span>
<span class="c1">// Crude lock/unlock -- do NOT write C++ code like this!!!!!!!!!!</span>
<span class="n">mutex</span><span class="p">.</span><span class="n">lock</span><span class="p">();</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"random sleep"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">mutex</span><span class="p">.</span><span class="n">unlock</span><span class="p">();</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">milliseconds</span><span class="p">(</span><span class="n">rand</span><span class="p">()</span> <span class="o">%</span> <span class="mi">100</span><span class="p">));</span>
<span class="p">{</span> <span class="c1">// The right way: RAII-style protection with lock_guard</span>
<span class="n">lock_guard</span><span class="o"><</span><span class="n">std</span><span class="o">::</span><span class="n">mutex</span><span class="o">></span> <span class="n">guard</span><span class="p">(</span><span class="n">mutex</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"th "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="s">": "</span> <span class="o"><<</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Question 1. In C++, paired operations like open/close, lock/unlock, malloc/free, new/delete</span>
<span class="c1">// are best wrapped RAII-style in an object.</span>
<span class="c1">// Why??? (answer at the end)</span>
<span class="c1">// Question 2. What if you really must create the object dynamically?</span>
<span class="p">}</span>
<span class="c1">// A plain mutex covers the vast majority of needs; fancier mutexes can be found here</span>
<span class="c1">// http://en.cppreference.com/w/cpp/header/mutex</span>
<span class="kt">void</span> <span class="nf">demo_async</span> <span class="p">()</span> <span class="p">{</span>
<span class="c1">// 3. Asynchronous computation with async</span>
<span class="c1">// A traditional function call starts the computation and fetches the return value on one line, e.g.</span>
<span class="k">auto</span> <span class="n">plus</span> <span class="o">=</span> <span class="p">[](</span><span class="kt">int</span> <span class="n">a</span><span class="p">,</span> <span class="kt">int</span> <span class="n">b</span><span class="p">)</span> <span class="p">{</span> <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="p">;};</span>
<span class="p">{</span>
<span class="kt">int</span> <span class="n">c</span> <span class="o">=</span> <span class="n">plus</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"1 + 2 = "</span> <span class="o"><<</span> <span class="n">c</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// The point of asynchronous computation is to separate starting the computation</span>
<span class="c1">// from fetching the result, while guaranteeing that the computation has finished</span>
<span class="c1">// by the time the result is fetched. This separation gives great freedom over</span>
<span class="c1">// when and where the computation runs, enabling all kinds of optimizations.</span>
<span class="c1">//</span>
<span class="c1">// Asynchronous computation in C++ uses the following features:</span>
<span class="c1">// - std::async, for launching an asynchronous computation</span>
<span class="c1">// - std::future, for retrieving the result</span>
<span class="p">{</span> <span class="c1">// Redo the above with asynchronous computation</span>
<span class="n">std</span><span class="o">::</span><span class="n">future</span><span class="o"><</span><span class="kt">int</span><span class="o">></span> <span class="n">c</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">plus</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="c1">// Do all kinds of other work here</span>
<span class="c1">// Then fetch the result</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"async: 1 + 2 = "</span> <span class="o"><<</span> <span class="n">c</span><span class="p">.</span><span class="n">get</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// C++ guarantees the asynchronous computation has completed by the time future::get() returns</span>
<span class="c1">//</span>
<span class="p">}</span>
<span class="c1">// The compiler can deduce the type of the future, so this works too:</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">future</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">plus</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">future</span><span class="p">.</span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="c1">// If a thread needs no management after launch (i.e. it would be detached right after creation),</span>
<span class="c1">// std::async can start it without explicitly creating a std::thread object</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"DEMO ASYNC."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"1st async done."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// !!!!!!!!TODO Watch the output carefully.</span>
<span class="c1">// (Very likely) you will find that fun was never run!</span>
<span class="c1">// Why????????</span>
<span class="c1">//</span>
<span class="c1">// async is not necessarily parallel. C++'s async implements two modes: parallel execution and lazy evaluation.</span>
<span class="c1">// (Note the difference between "asynchronous" and "parallel".)</span>
<span class="c1">// When calling async you can pass std::launch::async or std::launch::deferred to choose. If neither is</span>
<span class="c1">// passed, the default is up to the implementation. With g++ it is deferred, i.e. lazy evaluation:</span>
<span class="c1">// the return value is computed only when it is actually needed. Like this:</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"DEMO ASYNC, lazy evaluation."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"global thread id: "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">future</span><span class="o"><</span><span class="kt">void</span><span class="o">></span> <span class="n">future</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">launch</span><span class="o">::</span><span class="n">deferred</span><span class="p">,</span> <span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">future</span><span class="p">.</span><span class="n">get</span><span class="p">();</span> <span class="c1">// even if the function returns void, get() is needed to guarantee it runs</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"You should see output from fun now."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="c1">// Note: the thread ID printed inside fun is the same as the global one, i.e. async did not</span>
<span class="c1">// actually create a thread; it merely ran fun at the time of future.get()</span>
<span class="p">}</span>
<span class="c1">// Below we pass std::launch::async to force async to actually create a thread</span>
<span class="c1">// A task launched with launch::async completes even if future::get is never called</span>
<span class="c1">// (The naming here is a bit confusing; I think "std::launch::parallel" would better convey parallel execution.)</span>
<span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"global thread id: "</span> <span class="o"><<</span> <span class="n">this_thread</span><span class="o">::</span><span class="n">get_id</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"run with launch::async"</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">launch</span><span class="o">::</span><span class="n">async</span><span class="p">,</span> <span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"You should see output from fun now."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Asynchronous exception handling</span>
<span class="p">{</span>
<span class="k">auto</span> <span class="n">thrower</span> <span class="o">=</span> <span class="p">[](){</span><span class="n">cout</span> <span class="o"><<</span> <span class="s">"we are going to throw."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span> <span class="k">throw</span> <span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span><span class="p">(</span><span class="s">"hello, world!"</span><span class="p">);};</span>
<span class="k">auto</span> <span class="n">future</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">async</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">launch</span><span class="o">::</span><span class="n">async</span><span class="p">,</span> <span class="n">thrower</span><span class="p">);</span>
<span class="n">this_thread</span><span class="o">::</span><span class="n">sleep_for</span><span class="p">(</span><span class="n">chrono</span><span class="o">::</span><span class="n">seconds</span><span class="p">(</span><span class="mi">1</span><span class="p">));</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"exception was already thrown in another thread."</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="k">try</span> <span class="p">{</span> <span class="c1">// catch the exception asynchronously</span>
<span class="n">future</span><span class="p">.</span><span class="n">get</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">catch</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">runtime_error</span> <span class="k">const</span> <span class="o">&</span><span class="n">e</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cerr</span> <span class="o"><<</span> <span class="n">e</span><span class="p">.</span><span class="n">what</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="nf">main</span> <span class="p">()</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="s">"Hardware concurrency: "</span> <span class="o"><<</span> <span class="n">std</span><span class="o">::</span><span class="kr">thread</span><span class="o">::</span><span class="n">hardware_concurrency</span><span class="p">()</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">demo_thread</span><span class="p">();</span>
<span class="n">demo_async</span><span class="p">();</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="c1">// Answers</span>
<span class="c1">// Question 1:</span>
<span class="c1">// If an exception is thrown between open and close, close is skipped, causing errors.</span>
<span class="c1">//</span>
<span class="c1">// Question 2:</span>
<span class="c1">// Avoid new/delete in C++ whenever possible. When you must allocate dynamically, the best</span>
<span class="c1">// approach is to wrap the result of new in a unique_ptr or shared_ptr immediately. In fact</span>
<span class="c1">// the smart-pointer factory functions can construct the object without spelling out new, e.g.</span>
<span class="c1">// auto p_th = std::make_shared<std::thread>(fun, 1, 2);</span>
<span class="c1">//</span>
<span class="c1">// Or simply use a std::vector of length 1.</span>
<span class="c1">//</span>
<span class="c1">// Unless a catch (...) between new and delete is guaranteed to catch every exception,</span>
<span class="c1">// the delete may be skipped.</span>
</pre></div>The Dark Truth Behind the Power of Monads (and why it’s OK you cannot master it)2014-11-27T00:00:00-05:002014-11-27T00:00:00-05:00Wei Dongtag:www.wdong.org,2014-11-27:/the-dark-truth-behind-the-power-of-monads-and-why-its-ok-you-cannot-master-it.html<p>Years ago when I was in graduate school, there was a period when I became very obsessed with functional programming and the Haskell programming language. (At the dawn of the era of multi-core computation, it was believed that functional programming was one of the most promising technologies that would save the world.) Unlike many other earlier functional languages which allow ad hoc imperative constructions when it comes to IO-related tasks, Haskell is elegant and purely functional. When IO must be done, one programs in Haskell with a mindset that he/she is composing a combination of IO (and other) instructions. The IO instructions, as instructions rather than the realization of them, are purely mathematical and do not interact with the real world. The computation of a Haskell program produces something called an IO monad, and all side effects only take place when the runtime executes the IO monad, which happens outside the functional programming realm. One can argue that this is only a perspective, because a piece of, say, C++ code, as code per se, does not have any side effects either. Side effects only take place when a binary is executed, and the binary is not C++; it’s outside the realm of C++ programming. Well, yes, functional programming IS only a perspective, but imperative programmers just do not think in that way.</p>
<p>Now let's come back to the mysterious monad. It turns out that monads are much, much more powerful than merely an abstraction of IO operations. A whole lot of apparently unrelated programming constructions can be realized with monads, and the resulting code is usually very simple and beautiful. There are monads for old-school pessimistic error handling (the Maybe monad), exceptions, lists, state machines, parsers, even software transactional memory (STM) and structured query language (SQL). You name it! I felt my horizons broaden, and was fascinated by the prospect that I'd become a very powerful programmer if I mastered the unqualified, generic MONAD.</p>
<p>So I dug out all the articles and tutorials about monads I could find on the Internet (well, except for textbooks on category theory; I was trying to find a shortcut). And after a long struggle, I found myself defeated by the fact that I could not comprehend this programming construction, as simple as three laws, among which at least two, if not all three, are in my opinion trivial! I didn’t have any problem with any one of the specialized monads, and I understood every letter of the monad laws. But the generic monad, I just couldn’t get the hang of it.</p>
<p>Life goes on without one becoming a master monadic programmer. I came back to my bread-earning systems and machine learning programming, and was glad to find that I could still do whatever I needed to do with the good old imperative programming. And with the introduction of lambdas (be cautious: the lambdas in C++ and other imperative languages are anything but functional!) and other nice features to C++11, life after all is becoming better. But deep down inside, this monad thing kept bothering me, and has been, as now I see, fermenting.</p>
<p>Then there came a day, when I was fighting a religious programming-language war in a web forum (always a good pastime), that the enlightenment suddenly came to me. And here I’m sharing with you the dark truth behind the power of monads I finally came to understand.</p>
<p>Behind all powers of monads, there is only one true power: the parsing power. Monads can be used in a parser to create internal data structures representing the parsing results of a context-free language (strictly speaking a little more than that, but it doesn’t matter here), to which almost all programming languages belong. The only property that is common to all those specialized monads is that they can all be written in a context-free language which can be parsed. The process of monadic programming is nothing more than mentally parsing a piece of imaginary imperative or whatever code and writing down the “internal” representation in monads. It naturally follows that one can mimic with monads any (context-free) programming constructions that can ever be invented. And I’m pretty sure the process can be automated if one doesn’t have to preserve all the syntactic sugar. My hope to become a master programmer by learning to use monads, put in layman’s terms, is not any more realistic than the hope to become a serial inventor by learning the art of creating text files!</p>
<p>Monads and their equivalents are still nice and powerful as a language for invention. They make it possible to (re-)invent programming constructions without having to create new languages. But very likely they don’t make invention itself any easier. One piece of evidence is that all the monads we have seen so far have preexisting counterparts in other languages and libraries. At least now I don’t feel so bad that I cannot become a master monadic programmer.</p>
<p>And speaking of syntactic sugar, people have been feeding on it for a while, and I don’t mind having more of it.</p>Building Portable C/C++ Programs and Libraries for the Linux World2014-10-03T00:00:00-04:002014-10-03T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-10-03:/buliding-portable-cc-programs-and-libraries-for-the-linux-world.html<p>I have to admit that pulling source code from github has become the mainstream development mode, and virtualization has become the mainstream deployment mode, but many times it is still desirable to have a piece of software delivered in the form of a binary executable or library. There are many reasons one wants to do that: one might not want to give away the source code; the client wants the source code but doesn’t have the capability to build it, or simply does not want to spend the human labor to build it. With tools like maven, software versioning is kind of under control in the Java world. But the C/C++ world is not as lucky, and with all the github code that usually does not even have a version number, building a C/C++ code repository with dependencies is not for everybody.</p>
<p>Virtualization helps to contain all the dependencies, and it is a good solution for bigger software components like web services. One just delivers a VM image and everything is taken care of. But the use cases for C/C++ are usually small performance-critical components that must be tightly integrated with code written in other languages, and the overhead of virtualization is usually too high.</p>
<p>But the Linux world is notoriously heterogeneous. We are living in a world with Linux kernel 3.x, Ubuntu 14.x and RHEL 7.x, but almost all the companies and university labs whose machines I got a chance to log into are still using CentOS 5.x for production and research (RHEL 5 was first released in 2007) -- once you’ve got a cluster set up, it’s virtually impossible to upgrade the operating system version. On the other hand, you also want your software to run on the newest systems available to today’s startup companies, on everything in between, and hopefully on future systems.</p>
<p>Now generic portability between Linux versions and distributions cannot be achieved by C/C++. That’s why Java was invented. But if one just wants to deliver a single program/library file that contains all the functionality -- thanks to the backward compatibility of the Linux kernel -- this is usually achievable by linking almost all the libraries statically into the program.</p>
<p>The library case is more interesting. We want everything to be contained in the library, including all the libraries we depend on. But we cannot provide a static library, because that way we’d have to expose all the dependencies, and it would cause version conflicts for sure when the client tries to link against the library. So the solution is to develop an (almost) statically linked shared library, plus maybe a very small piece of interfacing code. The KGraph library for similarity search is provided in this form.</p>
<p>Static linking is not the common practice in the Linux world. All software packages are distributed with shared libraries, and if one chooses to build something from source code, shared libraries are produced by default. But fortunately, most software packages use the automake system, and static linking has always been an option which can be easily enabled by adding “--enable-static --disable-shared” to the configure invocation. The “--disable-shared” part is important because without it, a shared library will also be produced, and the default behavior of gcc is to link against the shared library. One can force gcc to link statically by adding “-static”, but some system APIs won’t work as expected (<a href="http://stackoverflow.com/questions/2725255/create-statically-linked-binary-that-uses-getaddrinfo">getaddrinfo will lose the ability to resolve hostnames</a>; update: solved with <a href="https://c-ares.haxx.se/">c-ares</a>). Now with “--enable-static” alone, the build system will assume the library is to be used statically and will produce non-relocatable machine code, which we cannot use to produce a shared library. The solution is to export “CFLAGS=-fPIC” and “CXXFLAGS=-fPIC” before running the configure script. These two easy fixes work for most packages, and the rest have to be handled case by case.</p>
<p>It would be misleading to end this blog leaving a novice reader believing static linking is the way to go for everybody. Actually there are <a href="http://www.akkadia.org/drepper/no_static_linking.html">strong arguments against static linking</a> (basically one can gain more with dynamic linking). But there are languages like golang which favor static linking. And for people out there on their own, like me, who do not have a lot of human labor at their disposal and would rather spend time on algorithms than on software packaging, static linking does come in handy.</p>
<p>About the companion box:</p>
<p>This box contains a development environment that is geared towards computation-intensive data processing applications without GUI, like machine learning, image/audio processing and such. It is based on CentOS 5.6 with devtools 2.1 (gcc-4.8). I’ve also installed many libraries using the above method, including Boost, Poco, OpenCV, libav and many others. OpenCV and libav have been tailored to remove GUI and device-related stuff (including media playback), because such functionality relies on components that are hard to make portable.</p>Equivalence between entropy regularization and the softmax function2014-08-08T00:00:00-04:002014-08-08T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-08-08:/equivalence-between-entropy-regularization-and-the-softmax-function.html<p>Never realized the relationship between the two until we came up with a fancy convex programming formulation with entropy regularization to solve a recommendation problem and obtained the well-known simple formula with softmax. Google'd the keywords and didn't see a clear explanation within the first few pages of search results, so I guess it might be helpful to write down a summary of the connection.</p>
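For readers without the image below, here is a sketch of the connection as I understand it (a standard derivation, which may differ in details from the figure): maximizing a linear score plus an entropy regularizer over the probability simplex yields exactly the softmax.

```latex
% Entropy-regularized linear objective over the simplex:
\max_{p \ge 0}\ \sum_i p_i u_i + \lambda H(p),
\qquad H(p) = -\sum_i p_i \log p_i,
\qquad \text{s.t.}\ \sum_i p_i = 1.
% Stationarity of the Lagrangian gives u_i - \lambda(\log p_i + 1) + \mu = 0, hence
p_i = \frac{\exp(u_i/\lambda)}{\sum_j \exp(u_j/\lambda)},
% i.e. the softmax with temperature \lambda.
```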
<p><img alt="ui" src="http://www.wdong.org/entropy.png"></p>Bcache for Ubuntu 14.04 Root Filesystem2014-05-28T00:00:00-04:002014-05-28T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-05-28:/bcache-for-ubuntu-1404-root-filesystem.html<p>It's been a while since bcache made its way into the Linux kernel, but the
installers of most distributions have not yet caught up to allow users to
install to a bcache-backed volume, and user-land tools necessary to make use of
bcache are not yet installed by default. There has been a tutorial on <a
href="https://github.com/g2p/blocks">how to convert the root of an existing
installation into bcache with a tool named blocks</a>, but the code base and
the depending code bases are either too old or too new to be directly usable
with the default python 3.4 setup of Ubuntu 14.04. After a frustrating process
of fixing all compatibility issues, I was able to make the code run, but I
don't trust my own patch (and the stability of github forked code) enough to
apply that on my real data. I ended up with this not-so-drastic, but cleaner and
safer way to get a bcache-backed root filesystem with minimal external
dependencies.</p>
<p>The idea is to (1) install Ubuntu into a normal partition (>= 5GB) which would later be converted to the swap space, (2) setup the bcache, and (3) migrate / to the bcache device.</p>
<p>In my case (lenovo U430P), /dev/sda is a 16G SSD, and /dev/sdb is a 1T HDD.</p>
<h1>1. Initial Installation</h1>
<p>Install Ubuntu 14.04 using the following disk partitioning scheme:
- 64MB EFI partition /dev/sda1, fat32, mounted at /boot/efi. This is not necessary if the machine is booted in the traditional BIOS way.
- 200MB ext4 partition /dev/sda2 to be mounted on /boot. It is necessary to put /boot on a separate partition. I made this on SSD so the kernel loads faster (I haven't compared, but I guess the speedup over HDD -- if not a slowdown -- won't be that obvious, as the kernel is a multi-megabyte file).
- 16GB ext4 partition /dev/sdb1 to be mounted as / during installation. We'll later convert it to a swap partition.
- A big empty partition /dev/sdb2 later to be used as root. Create this partition, but do not use it for now.
- An empty partition /dev/sda3 on SSD later to be used as the cache. Create this partition, but do not use it for now.</p>
<p>The installer will complain about not having a swap space. Ignore that.</p>
<h1>2. Setting Up Bcache</h1>
<p>After installation, boot into the newly installed system, install bcache-tools (from a PPA) and set up the system:</p>
<div class="highlight"><pre><span></span>$ sudo bash
<span class="c1"># add-apt-repository ppa:g2p/storage</span>
<span class="c1"># apt-get update</span>
<span class="c1"># apt-get install bcache-tools</span>
<span class="c1"># make-bcache -C /dev/sda3 -B /dev/sdb2</span>
<span class="c1"># mkfs.ext4 /dev/bcache0</span>
</pre></div>
<h1>3. Migrating Root Filesystem</h1>
<p>Keep working in the newly installed system.</p>
<div class="highlight"><pre><span></span>$ sudo bash
$ mkdir OLD NEW
<span class="c1"># mount /dev/sdb1 OLD # the old root</span>
<span class="c1"># mount /dev/bcache0 NEW # this would be our new root</span>
<span class="c1"># rsync -a OLD/ NEW/ # now NEW contains the root</span>
<span class="c1"># ### mount a series of directories in preparation for grub-install</span>
<span class="c1"># mount /dev/sda2 NEW/boot</span>
<span class="c1"># mount /dev/sda1 NEW/boot/efi</span>
<span class="c1"># mount -o bind /dev NEW/dev</span>
<span class="c1"># mount -t proc none NEW/proc</span>
<span class="c1"># mount -t sysfs none NEW/sys</span>
<span class="c1"># chroot NEW</span>
<span class="c1"># #### find out the UUID of /dev/bcache0 and /dev/sdb1</span>
<span class="c1"># ls -l /dev/disk/by-uuid/ | grep bcache0</span>
lrwxrwxrwx <span class="m">1</span> root root <span class="m">13</span> May <span class="m">29</span> 21:49 4c492013-e8a3-40b5-b5cd-9220ed2e0195 -> ../../bcache0
<span class="c1"># ls -l /dev/disk/by-uuid/ | grep sdb1</span>
lrwxrwxrwx <span class="m">1</span> root root <span class="m">10</span> May <span class="m">29</span> 21:49 765d6fc0-9ff4-4cf4-95f9-17a6e76ae80c -> ../../sdb1
<span class="c1"># vi NEW/etc/fstab NEW/boot/grub.cfg #### edit NEW/etc/fstab and NEW/boot/grub.cfg, replacing every occurrence of the sdb1 UUID with that of bcache0.</span>
<span class="c1"># grub-install /dev/sda</span>
</pre></div>
<h1>4. Final Configurations in New System</h1>
<p>Reboot into the newly installed system. Now the root is on /dev/bcache0. The old data on /dev/sdb1 is not used, and /dev/sdb1 can be converted to the swap space.</p>
<div class="highlight"><pre><span></span>$ sudo bash
<span class="c1"># mkswap /dev/sdb1</span>
Setting up swapspace version 1, <span class="nv">size</span> <span class="o">=</span> <span class="m">15624188</span> KiB
no label, <span class="nv">UUID</span><span class="o">=</span>e35bc636-9944-4dd5-ab3d-6c371b0cb7a8
<span class="c1"># swapon /dev/sdb1</span>
<span class="c1">##### make sure to change the UUID of the command below</span>
<span class="nb">echo</span> <span class="s2">"UUID=e35bc636-9944-4dd5-ab3d-6c371b0cb7a8 none swap defaults 0 0"</span> >> /etc/fstab
</pre></div>
<p>Now I'm having my Ubuntu running happily on bcache, and I hope it's not going to cause any data loss.</p>The Inner Art of Machine Learning: A General Outline2014-05-07T00:00:00-04:002014-05-07T00:00:00-04:00Wei Dongtag:www.wdong.org,2014-05-07:/ji-qi-xue-xi-nei-gong-zong-gang.html<div class="highlight"><pre><span></span>I think the single unifying principle of machine learning is Occam's razor: when the
fit is comparable, the simpler model predicts better. Different definitions of
"simple" give rise to different schools. For example:
1. Fewer feature dimensions is simpler (see the curse of dimensionality). This view
gave rise to the various dimension-reduction algorithms, like PCA and LDA (there are
two completely different LDAs, but both are essentially dimension reduction).
Multi-layer neural networks can also be seen as a dimension-reduction algorithm.
The title of Hinton's autoencoder paper in Science is literally about "reducing the
dimensionality" of data, which shows how important dimension reduction is.
Generalizing dimension reduction leads to data compression; there is even a view
that data compression taken to the extreme is artificial intelligence.
2. Fewer non-zero feature dimensions is simpler. Sparse coding is a very important
research topic in computer vision and related fields. Before neural networks took
off in 2012, almost every image-recognition algorithm used some form of sparse
coding. If a point in D-dimensional space is expressed in a basis of D vectors,
the result is a D-dimensional vector whose entries are almost never zero. Sparse
coding is achieved in two ways: 1. allow some error; 2. enlarge the basis. The
crudest sparse coding is k-means clustering (also called vector quantization).
The more general approach is to add an L1 regularization term when training the
model, which brings us to the next family of methods.
3. Regularization. Even when a vector has many dimensions, shrinking the range of
every dimension is also a form of simplicity. The usual way to shrink the range is
to minimize the norm of the vector (or model). From a statistical viewpoint, if
the data are assumed to be normally distributed, maximum-likelihood estimation is
essentially equivalent to L2 regularization. Learning algorithms of this school
typically solve an optimization problem of the form
    min_M  |f(x;M) - y|  +  a|M|
          (training error)  (regularization)
The mainstream way of solving such problems is gradient descent (SGD). Because
tacking on a regularization term looks so easy, it is often abused: no matter what
exponentials or logarithms appear up front, people just append a regularization
term. As a physicist would put it, not even the units match, let alone the
statistical meaning. Amazingly, it still tends to work to some degree.
4. Reduce the intake of training data. What if the features are points in an
abstract space, with no notion of dimension or norm? One approach is to select the
most representative/critical small subset of the training samples to produce the
model. This perspective gives rise to algorithms like SVM and boosting. In SVM the
key samples are called support vectors, and the number of support vectors is
effectively the dimension of the model. When training an SVM with SGD, a correctly
predicted sample is simply skipped; the model is updated only on mispredicted ones.
Generalizing from this angle, weighting the data/model by goodness of fit
essentially yields boosting. (That is not how boosting was invented, but it is a
fine way to understand it.) Reducing data intake does not mean shrinking the
original training set, but distilling it: if either way you end up distilling down
to a million samples, enlarging the original set from, say, ten million to a
hundred million will still improve prediction accuracy.
These are some insights from my seven or eight years of studying ML, and I hope
they help newcomers. The day you run into a new method that works well but you
cannot figure out why, or cannot make sense of where it came from, try tracing it
back to Occam's razor -- things may suddenly fall into place.
</pre></div>Cost of Padding for Convolution2013-09-03T00:00:00-04:002013-09-03T00:00:00-04:00Wei Dongtag:www.wdong.org,2013-09-03:/cost-of-padding-for-convolution.html<p>Padding an image for convolution sounds like a pain. <a href="https://code.google.com/p/cuda-convnet/">Cuda-convnet</a> has a clever way of implementing it, but I found it would be over-engineering to do a similar optimization for my CPU-based implementation. The following operf profiling result shows that even if I simply do padding by copying to a larger matrix, the overhead is negligible. The last two lines are the cost of the padding node: the percentages of time spent on update and predict are 0.0686% and 8.2e-04%, respectively.</p>
<p>operf output:</p>
<div class="highlight"><pre><span></span>CPU: Intel Sandy Bridge microarchitecture, speed 3.201e+06 MHz <span class="o">(</span>estimated<span class="o">)</span>
Counted CPU_CLK_UNHALTED events <span class="o">(</span>Clock cycles when not halted<span class="o">)</span> with a unit mask of 0x00 <span class="o">(</span>No unit mask<span class="o">)</span> count 100000
samples % image name symbol name
<span class="m">1108937</span> 18.1060 libgomp.so.1.0.0 gomp_barrier_wait_end
<span class="m">1082524</span> 17.6747 libgomp.so.1.0.0 gomp_team_barrier_wait_end
<span class="m">1078260</span> 17.6051 cifar.ptblas MNLOOP
<span class="m">475213</span> 7.7589 cifar.ptblas MNLOOP
<span class="m">374029</span> 6.1069 no-vmlinux /no-vmlinux
<span class="m">316764</span> 5.1719 libc-2.15.so __memmove_ssse3_back
<span class="m">275784</span> 4.5028 cifar.ptblas _ZN6hiperfit6neural10WindowNode6updateEi._omp_fn.17
<span class="m">272491</span> 4.4490 cifar.ptblas ATL_gemoveT_aX
<span class="m">143035</span> 2.3354 cifar.ptblas ATL_scol2blk_a1
<span class="m">110492</span> 1.8040 cifar.ptblas _ZN6hiperfit6neural8PoolNodeINS0_4pool3maxEE7predictEi._omp_fn.3
<span class="m">102437</span> 1.6725 cifar.ptblas MNLOOP
<span class="m">83093</span> 1.3567 cifar.ptblas _ZN6hiperfit5ArrayIfE5applyIZNS_6neural12FunctionNodeINS3_8function4reluEE6updateEiEUlRffffE_EEvRKS1_SB_SB_RKT_._omp_fn.10
<span class="m">79453</span> 1.2973 cifar.ptblas hiperfit::neural::ArrayNode::preupdate<span class="o">(</span>int<span class="o">)</span>
<span class="m">74185</span> 1.2112 cifar.ptblas _ZN5cifar7DataSetC2ERKSsbj.constprop.352
<span class="m">68051</span> 1.1111 cifar.ptblas ATL_sJIK0x0x72TN72x72x0_a1_bX
<span class="m">67752</span> 1.1062 cifar.ptblas _ZN6hiperfit6neural8PoolNodeINS0_4pool3maxEE6updateEi._omp_fn.2
<span class="m">65919</span> 1.0763 cifar.ptblas ATL_sJIK0x0x0TN0x0x0_a1_bX
<span class="m">60774</span> 0.9923 cifar.ptblas _ZN6hiperfit5ArrayIfE5applyIZNS_6neural12FunctionNodeINS3_8function4reluEE7predictEiEUlRffE_EEvRKS1_RKT_._omp_fn.11
......
<span class="m">4201</span> 0.0686 cifar.ptblas hiperfit::neural::PadNode::update<span class="o">(</span>int<span class="o">)</span>
......
<span class="m">50</span> 8.2e-04 cifar.ptblas hiperfit::neural::PadNode::predict<span class="o">(</span>int<span class="o">)</span>
......
</pre></div>Android WebKit Rendering Pipeline and Instrumentation2013-07-08T00:00:00-04:002013-07-08T00:00:00-04:00Wei Dongtag:www.wdong.org,2013-07-08:/android-webkit-rendering-pipeline-and-instrumentation.html<p>The Android WebKit rendering pipeline is summarized in <a href="http://www.wdong.org/webkit-draw1.pdf">this document</a>. The final section provides a patch to the Android source code that will log all the rendering activity. A sample browsing session is traced with this code, with all image-rendering activities logged and offline processed and rendered with a standalone …</p><p>The Android WebKit rendering pipeline is summarized in <a href="http://www.wdong.org/webkit-draw1.pdf">this document</a>. The final section provides a patch to the Android source code that will log all the rendering activity. A sample browsing session is traced with this code, with all image-rendering activities logged, offline processed, and rendered with a standalone installation of Skia. Each time WebKit updates the screen, a frame is produced from the log, and concatenating these frames with ffmpeg makes <a href="http://www.wdong.org/trace1.mpg">this video</a>.
In this way, I'm able to log all images that are actually drawn on the screen during a browsing session.</p>
<p>I wrote this code to study opportunities for bandwidth saving -- an image in a webpage won't have to be actually downloaded if it is not shown at all, and only a low-resolution version of the image, given multiple resolutions are available, is needed if only a zoomed thumbnail of the image is shown.</p>Geological Distribution of the Alex Top 500 Websites2013-06-08T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-06-08:/geological-distribution-of-the-alex-top-500-websites.html<p><a href="http://www.wdong.org/top500.tar.bz2">Top 500</a></p>
<p>A one-hour exercise of displaying a bunch of points on the map using <a href="http://openlayers.org/">OpenLayers</a> and jQuery. The geological locations of the IP addresses are obtained with geoiplookup.</p><p><a href="http://www.wdong.org/top500.tar.bz2">Top 500</a></p>
<p>A one-hour exercise of displaying a bunch of points on the map using <a href="http://openlayers.org/">OpenLayers</a> and jQuery. The geological locations of the IP addresses are obtained with geoiplookup.</p>Notes on LDA and RBM2013-04-01T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-04-01:/notes-on-lda-and-rbm.html<ul>
<li><a href="http://www.wdong.org/lda.pdf">LDA</a></li>
<li><a href="http://www.wdong.org/rbm.pdf">RBM</a></li>
</ul><ul>
<li><a href="http://www.wdong.org/lda.pdf">LDA</a></li>
<li><a href="http://www.wdong.org/rbm.pdf">RBM</a></li>
</ul>Tips for Cross Compiling Libraries for Android2013-03-19T00:00:00-04:002013-03-19T00:00:00-04:00Wei Dongtag:www.wdong.org,2013-03-19:/tips-for-cross-compiling-libraries-for-android.html<p>Assume we want to install the libraries to "/opt/arm-tools".</p>
<ol>
<li>Boost</li>
</ol>
<p>Edit the file "tools/build/v2/user-config.jam" under the boost source directory and add the following line:</p>
<div class="highlight"><pre><span></span>using gcc : arm : arm-none-linux-gnueabi-g++ <span class="p">;</span>
</pre></div>
<p>Then build and install boost with the following command</p>
<div class="highlight"><pre><span></span>./b2 <span class="nv">toolset</span><span class="o">=</span>gcc-arm target-os<span class="o">=</span>linux <span class="nv">threading</span><span class="o">=</span>multi <span class="nv">link …</span></pre></div><p>Assume we want to install the libraries to "/opt/arm-tools".</p>
<ol>
<li>Boost</li>
</ol>
<p>Edit the file "tools/build/v2/user-config.jam" under the boost source directory and add the following line:</p>
<div class="highlight"><pre><span></span>using gcc : arm : arm-none-linux-gnueabi-g++ <span class="p">;</span>
</pre></div>
<p>Then build and install boost with the following command</p>
<div class="highlight"><pre><span></span>./b2 <span class="nv">toolset</span><span class="o">=</span>gcc-arm target-os<span class="o">=</span>linux <span class="nv">threading</span><span class="o">=</span>multi <span class="nv">link</span><span class="o">=</span>static runtime-link<span class="o">=</span>static <span class="nv">variant</span><span class="o">=</span>release <span class="nv">optimization</span><span class="o">=</span>space --prefix<span class="o">=</span>/opt/arm-tools
</pre></div>
<ol>
<li>Packages with Autotools</li>
</ol>
<p>Run ./configure with the following script</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="nb">export</span> <span class="nv">CC</span><span class="o">=</span>arm-none-linux-gnueabi-gcc
<span class="nb">export</span> <span class="nv">CXX</span><span class="o">=</span>arm-none-linux-gnueabi-g++
<span class="nb">export</span> <span class="nv">AR</span><span class="o">=</span>arm-none-linux-gnueabi-ar
<span class="nb">export</span> <span class="nv">CFLAGS</span><span class="o">=</span>-I/opt/arm-tools/include
<span class="nb">export</span> <span class="nv">CXXFLAGS</span><span class="o">=</span>-I/opt/arm-tools/include
<span class="nb">export</span> <span class="nv">LDFLAGS</span><span class="o">=</span>-L/opt/arm-tools/lib
./configure --host<span class="o">=</span>x86_64-unknown-linux-gnu --target<span class="o">=</span>arm-none-linux-gnueabi --enable-static --disable-shared --prefix<span class="o">=</span>/opt/arm-tools
</pre></div>Pushing Code to Data: A MapR Exerciese2013-03-18T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-03-18:/pushing-code-to-data-a-mapr-exerciese.html<p><span style="color:red;font-size:24px;">In this blog I'll demonstrate how to push code to data stored on a MapR (or Hadoop) cluster and achieve an order of magnitude speedup with simple bash and C++ coding and without any of the MapReduce and Java stuff.</span></p>
<p>I just had a MapR cluster set up; and in …</p><p><span style="color:red;font-size:24px;">In this blog I'll demonstrate how to push code to data stored on a MapR (or Hadoop) cluster and achieve an order of magnitude speedup with simple bash and C++ coding and without any of the MapReduce and Java stuff.</span></p>
<p>I just had a MapR cluster set up; and in this exercise, I'm going to test the data-processing speedup I can gain by pushing the code to run where the data is. The goal here is to gain some knowledge on the overhead of various system components (e.g. the hadoop commandline) so I can design an in-house platform for distributed data processing.</p>
<h2>Setup</h2>
<p>The data I have are mostly files of roughly fixed size, about 30MB each. I need to store a large number of such files in the MapR filesystem, and need to frequently run some processing command on each file. The output generated from each file is much smaller than the input and is negligible. In this experiment, I'll use 100 such files as input, and use "md5sum" as the operation. The files are already stored on the MapR filesystem, with the text file "list" containing a list of the paths to the 100 files.</p>
<p>The hardware setup is as follows. The cluster has 8 nodes, with 7 forming a MapR cluster. All the commands and operations are done on the remaining node with the hostname "washtenaw". </p>
<h2>Pulling Data From MapR FS: The Naive Approach</h2>
<p>The simplest approach is to run all the processing locally on washtenaw, pulling all the data needed from the MapR FS. Following is the script "run-local.sh":</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1"># run-local.sh</span>
cat list <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> name
<span class="k">do</span>
<span class="nb">printf</span> <span class="s1">'%s\t'</span> <span class="nv">$name</span>
hadoop fs -cat <span class="nv">$name</span> <span class="p">|</span> md5sum
<span class="k">done</span>
</pre></div>
<p>And the performance is</p>
<div class="highlight"><pre><span></span>wdong@washtenaw $ time ./run-local.sh > md5.hadoop
real 1m14.910s
user 1m10.304s
sys 0m19.573s
</pre></div>
<p>Roughly we spend 0.75s (74.9s / 100 files) on each file.</p>
<h2>Pulling Data From MapR FS without Java</h2>
<p>We already know that MapR FS is able to achieve a throughput of 80MB/s from the <a href="/mapr-file-copy-throughput.html">previous post</a>. So it should take only 30/80 = 0.375s to retrieve each file from the cluster (without the name node lookup and all other latencies). This is only about half of the 0.75s we spent per file in our first setting. An obvious overhead is the cost to start the hadoop command line. So in this setting, I'll test directly fetching the data using the C API. No Java code is involved in this setting, but the data is still loaded remotely from the cluster.</p>
<p>Here's the C++ source code:</p>
<div class="highlight"><pre><span></span><span class="c1">// run-c++.cpp</span>
<span class="cp">#include</span> <span class="cpf">"hdfs.h" </span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp"></span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
<span class="k">static</span> <span class="kt">size_t</span> <span class="n">BUFFER_SIZE</span> <span class="o">=</span> <span class="mi">64</span> <span class="o">*</span> <span class="mi">1024</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">;</span>
<span class="n">string</span> <span class="n">buffer</span><span class="p">(</span><span class="n">BUFFER_SIZE</span><span class="p">,</span> <span class="sc">'\0'</span><span class="p">);</span>
<span class="n">hdfsFS</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">hdfsConnect</span><span class="p">(</span><span class="s">"default"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">string</span> <span class="n">path</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">path</span><span class="p">)</span> <span class="p">{</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">path</span> <span class="o"><<</span> <span class="sc">' '</span><span class="p">;</span>
<span class="n">hdfsFile</span> <span class="n">h</span> <span class="o">=</span> <span class="n">hdfsOpenFile</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="n">O_RDONLY</span><span class="p">,</span> <span class="n">BUFFER_SIZE</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">cout</span><span class="p">.</span><span class="n">flush</span><span class="p">();</span>
<span class="kt">FILE</span> <span class="o">*</span><span class="n">cmd</span> <span class="o">=</span> <span class="n">popen</span><span class="p">(</span><span class="s">"md5sum"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">);</span>
<span class="k">for</span> <span class="p">(;;)</span> <span class="p">{</span>
<span class="kt">size_t</span> <span class="n">sz</span> <span class="o">=</span> <span class="n">hdfsRead</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">h</span><span class="p">,</span> <span class="o">&</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">buffer</span><span class="p">.</span><span class="n">size</span><span class="p">());</span>
<span class="n">fwrite</span><span class="p">(</span><span class="o">&</span><span class="n">buffer</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sz</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">cmd</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sz</span> <span class="o">!=</span> <span class="n">buffer</span><span class="p">.</span><span class="n">size</span><span class="p">())</span> <span class="k">break</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">pclose</span><span class="p">(</span><span class="n">cmd</span><span class="p">);</span>
<span class="n">hdfsCloseFile</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">h</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">hdfsDisconnect</span><span class="p">(</span><span class="n">fs</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>And here's the performance:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ make run-c++
g++ -std<span class="o">=</span>c++11 -Wall -O3 -I/opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs -L/opt/mapr/lib -Wl,-allow-shlib-undefined run-c++.cpp -lMapRClient -o run-c++
wdong@washtenaw$ <span class="nb">time</span> ./run-c++ < list > md5.c++
real 0m42.258s
user 0m6.880s
sys 0m16.077s
</pre></div>
<p>We've reduced per-file processing time from 0.75s to about 0.42s. The overhead of running the hadoop command line for each file is therefore about 0.33s -- that is pretty big.</p>
<h2>Pushing Code to Data</h2>
<p>Here comes the real stuff. I'll detect on which node a file is stored, and push our code to where the data is. I use the following C++ program to query the location of a file, assuming that each file is contained in one filesystem block with no replication, which is the case in my system setup.</p>
<div class="highlight"><pre><span></span><span class="c1">// hdfs-lookup.cpp</span>
<span class="cp">#include</span> <span class="cpf">"hdfs.h" </span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><string></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><iostream></span><span class="cp"></span>
<span class="cp">#include</span> <span class="cpf"><boost/assert.hpp></span><span class="cp"></span>
<span class="k">using</span> <span class="k">namespace</span> <span class="n">std</span><span class="p">;</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">**</span><span class="n">argv</span><span class="p">)</span> <span class="p">{</span>
<span class="n">hdfsFS</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">hdfsConnect</span><span class="p">(</span><span class="s">"default"</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">string</span> <span class="n">path</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="n">cin</span> <span class="o">>></span> <span class="n">path</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// get the block of the 1st byte</span>
<span class="kt">char</span> <span class="o">***</span><span class="n">hosts</span> <span class="o">=</span> <span class="n">hdfsGetHosts</span><span class="p">(</span><span class="n">fs</span><span class="p">,</span> <span class="n">path</span><span class="p">.</span><span class="n">c_str</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">);</span>
<span class="n">BOOST_VERIFY</span><span class="p">(</span><span class="n">hosts</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&&</span> <span class="n">hosts</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]);</span>
<span class="n">cout</span> <span class="o"><<</span> <span class="n">path</span> <span class="o"><<</span> <span class="sc">' '</span> <span class="o"><<</span> <span class="n">hosts</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o"><<</span> <span class="n">endl</span><span class="p">;</span>
<span class="n">hdfsFreeHosts</span><span class="p">(</span><span class="n">hosts</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">hdfsDisconnect</span><span class="p">(</span><span class="n">fs</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
<p>Running the program produces something like the following: each line containing the path followed by the hostname where the data is:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ make hdfs-lookup
g++ -std<span class="o">=</span>c++11 -Wall -O3 -I/opt/mapr/hadoop/hadoop-0.20.2/src/c++/libhdfs -L/opt/mapr/lib -Wl,-allow-shlib-undefined hdfs-lookup.cpp -lMapRClient -o hdfs-lookup
wdong@washtenaw$ ./hdfs-lookup < list
test/data1 fuller
test/data2 ford
test/data3 huron
test/data4 huron
test/data5 plymouth
...
</pre></div>
<p>I then use the following script to drive the computation:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1"># run-remote.sh</span>
mkdir -p input<span class="p">;</span> rm -f input/*
mkdir -p output<span class="p">;</span> rm -f output/*
cat list <span class="p">|</span> ./hdfs-lookup <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> path host
<span class="k">do</span>
<span class="nb">echo</span> <span class="nv">$path</span> >> input/<span class="nv">$host</span>
<span class="k">done</span>
<span class="k">for</span> host in <span class="sb">`</span>ls input<span class="sb">`</span>
<span class="k">do</span>
ssh <span class="nv">$host</span> <span class="s2">"cd </span><span class="nv">$PWD</span><span class="s2">; cat input/</span><span class="nv">$host</span><span class="s2"> | ./run-c++ > output/</span><span class="nv">$host</span><span class="s2">"</span>
<span class="k">done</span>
cat output/*
</pre></div>
<p>And here's the performance:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ <span class="nb">time</span> ./run-remote.sh > md5.remote
real 0m13.529s
user 0m0.116s
sys 0m0.040s
</pre></div>
<p>Substantial speedup! The time we spent on each file is 0.135s.</p>
<h2>Pushing Code to Data with Parallelization</h2>
<p>Finally, I go ahead to parallelize the above driving script:</p>
<div class="highlight"><pre><span></span><span class="ch">#!/bin/bash</span>
<span class="c1"># run-parallel.sh</span>
mkdir -p input<span class="p">;</span> rm -f input/*
mkdir -p output<span class="p">;</span> rm -f output/*
cat list <span class="p">|</span> ./hdfs-lookup <span class="p">|</span> <span class="k">while</span> <span class="nb">read</span> path host
<span class="k">do</span>
<span class="nb">echo</span> <span class="nv">$path</span> >> input/<span class="nv">$host</span>
<span class="k">done</span>
<span class="k">for</span> host in <span class="sb">`</span>ls input<span class="sb">`</span>
<span class="k">do</span>
ssh <span class="nv">$host</span> <span class="s2">"cd </span><span class="nv">$PWD</span><span class="s2">; cat input/</span><span class="nv">$host</span><span class="s2"> | ./run-c++ > output/</span><span class="nv">$host</span><span class="s2">"</span> <span class="p">&</span>
<span class="k">done</span>
<span class="nb">wait</span>
cat output/*
</pre></div>
<p>And here's the performance:</p>
<div class="highlight"><pre><span></span>wdong@washtenaw$ wc -l input/* <span class="c1"># just to show the data distribution among the nodes.</span>
<span class="m">12</span> input/ford
<span class="m">10</span> input/fuller
<span class="m">21</span> input/geddes
<span class="m">19</span> input/huron
<span class="m">9</span> input/maple
<span class="m">10</span> input/plymouth
<span class="m">19</span> input/wagner
<span class="m">100</span> total
wdong@washtenaw$ <span class="nb">time</span> ./run-parallel.sh > md5.parallel
real 0m2.987s
user 0m0.120s
sys 0m0.068s
wdong@washtenaw$ <span class="k">for</span> i in md5.* <span class="p">;</span> <span class="k">do</span> sort <span class="nv">$i</span> <span class="p">|</span> md5sum<span class="p">;</span> <span class="k">done</span> <span class="c1"># just to check that all outputs are the same</span>
a3889305ff721ec32ced48a3066ea059 -
a3889305ff721ec32ced48a3066ea059 -
a3889305ff721ec32ced48a3066ea059 -
a3889305ff721ec32ced48a3066ea059 -
</pre></div>
<p>That is a speedup of 25x over my initial naive approach, but at this point there are really no surprises. More speedup could be achieved by parallelizing run-c++.cpp itself, but that would not mean much in this setting: for a much larger dataset the bottleneck will be the disks, and I only have one disk attached to each node.</p>
<p><span style="color:red;font-size:24px;">
A note for Hadoop users: when building the C++ programs, link against libhdfs.so and libjvm.so instead of libMapRClient.so; you'll also have to set up the Java environment properly.</span></p>fuse_dfs on MapR2013-03-17T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-03-17:/fuse_dfs-on-mapr.html<p>Here's how to get fuse_dfs to work with MapR.</p>
<p><strong>Download the <a href="http://www.wdong.org/fuse-dfs.tar.bz2">patched source code</a>.</strong></p>
<h2>The fuse_dfs code is at hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src.</h2>
<h2>Patch the source code a little bit, including the following:</h2>
<ul>
<li>in fuse_connect.c, add "#define __USE_GNU 1" before the line "#include <search.h>".</li>
<li>in fuse_trash.c …</li></ul><p>Here's how to get fuse_dfs to work with MapR.</p>
<p><strong>Download the <a href="http://www.wdong.org/fuse-dfs.tar.bz2">patched source code</a>.</strong></p>
<h2>The fuse_dfs code is at hadoop-hdfs-project/hadoop-hdfs/src/contrib/fuse-dfs/src.</h2>
<h2>Patch the source code a little bit, including the following:</h2>
<ul>
<li>in fuse_connect.c, add "#define __USE_GNU 1" before the line "#include <search.h>".</li>
<li>in fuse_trash.c, search for "hdfsDelete" and remove the third parameter "1" to the function.</li>
<li>in fuse_dfs.c, remove the check for "options.port == 0". That is because we are going to use exactly the port number 0.</li>
</ul>
<h2>Use the following Makefile and make fuse_dfs.</h2>
<div class="highlight"><pre><span></span><span class="nv">HADOOP_PREFIX</span><span class="o">=</span>/opt/mapr/hadoop/hadoop-0.20.2
<span class="nv">PACKAGE_VERSION</span><span class="o">=</span>0.20.2
<span class="nv">FUSE_HOME</span><span class="o">=</span>/usr
<span class="nv">PERMS</span><span class="o">=</span>
<span class="nv">PROTECTED_PATHS</span><span class="o">=</span>
<span class="nv">bin_PROGRAMS</span> <span class="o">=</span> fuse_dfs
<span class="nv">fuse_dfs_SOURCES</span> <span class="o">=</span> fuse_dfs.o fuse_options.o fuse_trash.o fuse_stat_struct.o fuse_users.o fuse_init.o fuse_connect.o fuse_impls_access.o fuse_impls_chmod.o fuse_impls_chown.o fuse_impls_create.o fuse_impls_flush.o fuse_impls_getattr.o fuse_impls_mkdir.o fuse_impls_mknod.o fuse_impls_open.o fuse_impls_read.o fuse_impls_release.o fuse_impls_readdir.o fuse_impls_rename.o fuse_impls_rmdir.o fuse_impls_statfs.o fuse_impls_symlink.o fuse_impls_truncate.o fuse_impls_utimens.o fuse_impls_unlink.o fuse_impls_write.o
<span class="nf">fuse_dfs</span><span class="o">:</span> <span class="k">$(</span><span class="nv">fuse_dfs_SOURCES</span><span class="k">)</span>
<span class="nv">CFLAGS</span><span class="o">=</span> -Wall -g -DPERMS<span class="o">=</span><span class="k">$(</span>PERMS<span class="k">)</span> -D_FILE_OFFSET_BITS<span class="o">=</span><span class="m">64</span> -I<span class="k">$(</span>HADOOP_PREFIX<span class="k">)</span>/src/c++/libhdfs -D_FUSE_DFS_VERSION<span class="o">=</span><span class="se">\"</span><span class="k">$(</span>PACKAGE_VERSION<span class="k">)</span><span class="se">\"</span> -DPROTECTED_PATHS<span class="o">=</span><span class="se">\"</span><span class="k">$(</span>PROTECTED_PATHS<span class="k">)</span><span class="se">\"</span> -I<span class="k">$(</span>FUSE_HOME<span class="k">)</span>/include
<span class="nv">LDFLAGS</span><span class="o">=</span> -L/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib -L/opt/mapr/lib -L<span class="k">$(</span>FUSE_HOME<span class="k">)</span>/lib -Wl,-allow-shlib-undefined
<span class="nv">LDLIBS</span> <span class="o">=</span> -lMapRClient -lfuse
<span class="nf">all</span><span class="o">:</span> <span class="n">fuse_dfs</span>
<span class="nf">clean</span><span class="o">:</span>
rm *.o fuse_dfs
</pre></div>
<h2>Mount with the following command</h2>
<div class="highlight"><pre><span></span><span class="nv">LD_LIBRARY_PATH</span><span class="o">=</span>/opt/mapr/lib ./fuse_dfs -oserver<span class="o">=</span>default -oport<span class="o">=</span><span class="m">0</span> <mount point>
</pre></div>
<p>We use the MapR client library instead of libhdfs, and there's no Java code involved between fuse and the disks. (Java is probably more of a psychological issue than a performance issue here.)</p>MapR File Copy Throughput2013-03-17T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2013-03-17:/mapr-file-copy-throughput.html<div class="highlight"><pre><span></span>$ du -sh /data/local/wdong/data <span class="c1"># the directory contains a bunch of 30MB files.</span>
15G /data/local/wdong/data
$ <span class="nb">time</span> cp -R /data/local/wdong/data . <span class="c1"># copy data via fuse</span>
real 3m9.192s
user 0m0.148s
sys 0m19.581s
$ <span class="nb">time</span> hadoop fs -put /data/local/wdong/data test/data1
real …</pre></div><div class="highlight"><pre><span></span>$ du -sh /data/local/wdong/data <span class="c1"># the directory contains a bunch of 30MB files.</span>
15G /data/local/wdong/data
$ <span class="nb">time</span> cp -R /data/local/wdong/data . <span class="c1"># copy data via fuse</span>
real 3m9.192s
user 0m0.148s
sys 0m19.581s
$ <span class="nb">time</span> hadoop fs -put /data/local/wdong/data test/data1
real 2m56.955s
user 0m16.225s
sys 0m30.286s
</pre></div>
<p>So whether via fuse or the hadoop commandline, the write throughput of MapR is about 80MB/s, with the hadoop commandline being slightly faster. The overhead of Java is actually negative compared to that of fuse. I expect the performance of MapR's native NFS server to beat both.</p>
<p>I'm using a $20 <a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16833156309">TRENDnet 8-port Gigabit Switch</a> and there is a cluster of 7 MapR servers behind it.</p>Programming AT89S52 with Arduino2012-07-04T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2012-07-04:/programming-at89s52-with-arduino.html<p>I cannot make my usbasp programmer work on AT89S52. It makes me so frustrated that I decide to build my own with the ATMEGA328P at hand so I can learn 8051 programming (what an overkill!). The process of trying to make it work is painful, but nevertheless it finally worked …</p><p>I cannot make my usbasp programmer work on AT89S52. It makes me so frustrated that I decide to build my own with the ATMEGA328P at hand so I can learn 8051 programming (what an overkill!). The process of trying to make it work is painful, but nevertheless it finally worked (code is still buggy and I don't know why).</p>
<p>Here's the result:</p>
<p><img alt="ui" src="http://www.wdong.org/8051.jpg"></p>
<p>(OK, it's a poor man's Arduino -- a minimal breadboard system connected to PC via a USB-RS232 adaptor. The AVR chip is on the right -- the 51 chip on the left is so much bigger!)</p>
<p>Wiring:</p>
<div class="highlight"><pre><span></span>ATMEGA328P AT89S52
VCC ---- VCC ---- USB VCC
GND ---- GND ---- USB GND
Pin 15 (Arduino Pin 9) --- RST
Pin 17 MOSI (Arduino Pin 11) -- Pin 6 MOSI
Pin 18 MISO (Arduino Pin 12) -- Pin 7 MISO
Pin 19 SCK (Arduino Pin 13) -- Pin 8 SCK
ATMEGA328P
Pin 16 (Arduino Pin 10) -- LED, error signal
AT89S52
Pin 31 /EA -- VCC
Pin 3 (P1.2) -- LED, testing
</pre></div>
<p>Both chips use a 16MHz crystal oscillator -- this is hardcoded in the program.</p>
<p><a href="http://www.wdong.org/arduino-x51.cpp">The Arduino sketch</a></p>
<p><a href="http://www.wdong.org/x51.cpp">The PC host program</a>, needs Boost to compile, but it should be cross platform.</p>
<p>Following are a few commands to show the usage of the host program. Change -s parameter to the serial port used for communication.</p>
<div class="highlight"><pre><span></span>./x51 <span class="o">[</span>-s /dev/ttyUSB0<span class="o">]</span> <span class="o">[</span>--page<span class="o">]</span> foo.ihx <span class="c1"># Uploading. sdcc .ihx output of sdcc.</span>
./x51 --dump foo.ihx <span class="c1"># Dump the content of the hex file.</span>
./x51 <span class="o">[</span>-s /dev/ttyUSB0<span class="o">]</span> <span class="o">[</span>--page<span class="o">]</span> --verify foo.ihx <span class="c1"># read the file content from the chip and dump it. It needs the original .ihx file to determine the number of bytes to read.</span>
</pre></div>
<p>Both upload and verify accept an optional "--page" parameter to do page-mode I/O. The serial communication code is still buggy in that the Arduino has to be reset before another upload/download can be carried out.</p>
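The .ihx files the host program consumes are in Intel HEX format. For reference, here is a minimal sketch (in Python, not taken from the actual host program) of how one record of such a file can be parsed and checksum-verified:

```python
def parse_ihex_record(line):
    """Parse one Intel HEX record, e.g. ':0B0010006164647265737320676170A7'."""
    if not line.startswith(':'):
        raise ValueError("record must start with ':'")
    raw = bytes.fromhex(line[1:])
    count = raw[0]                   # number of data bytes
    addr = (raw[1] << 8) | raw[2]    # 16-bit load address
    rectype = raw[3]                 # 0x00 = data, 0x01 = end-of-file, ...
    data = raw[4:4 + count]
    # All record bytes, including the trailing checksum, must sum to 0 mod 256.
    if sum(raw) % 256 != 0:
        raise ValueError("checksum mismatch")
    return addr, rectype, data
```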
<p>What's learned:
1. Arduino's SPI doesn't work for this purpose for an unknown reason; I haven't checked the source code.
2. Arduino's shiftOut works for programming, but shiftIn doesn't, maybe due to timing issues.
3. Arduino is slow at processing serial input. I had to lower the baud rate to 9600 and add delays to the host code to make serial communication work, and it is still not stable.</p>JPEG vs JPEG 2000 vs WebP2012-04-12T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2012-04-12:/jpeg-vs-jpeg-2000-vs-webp.html<p>Evaluation protocol:</p>
<p>A small number of web images are collected from the VPN trace. These images are first converted to lossless formats (png and pnm). The programs jpeg, jasper and webp are then used to compress the images with various quality numbers and then decompress to a lossless format. The program imgcmp is used to compare the original image and the decompressed images. Compression ratio is the size of the compressed file divided by that of the lossless png file.</p>
<p>A few error bars are also shown for jpeg with quality=30 as a reference. This configuration has noticeable yet tolerable noise. (These bars fall into two groups: photos and cartoons. The baseline PNG is good at compressing cartoons, so jpeg has a relatively low compression ratio and high error for this group.)</p>
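For reference, the two measures plotted below (RMSE and PSNR) can be computed from raw pixel values as follows. This is a minimal Python sketch assuming flat sequences of 8-bit samples, not the actual imgcmp implementation:

```python
import math

def rmse(a, b):
    # Root-mean-square error between two equal-length sequences of pixel values.
    if len(a) != len(b):
        raise ValueError("images must have the same number of samples")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def psnr(a, b, peak=255.0):
    # Peak signal-to-noise ratio in dB; higher means less distortion.
    return 20.0 * math.log10(peak / rmse(a, b))
```

PSNR is just a log-scaled view of RMSE, which is why the two plots lead to the same conclusion.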
<p>The two performance measures lead to the same conclusion:<span style="color: #ff0000;"> webp wins when the compression ratio is below 0.12, and jpeg 2000 wins when it is above 0.12. The performance of jpeg is always comparable to the worse of the other two.</span></p>
<p><img alt="ui" src="http://www.wdong.org/codec-rmse.png">
<img alt="ui" src="http://www.wdong.org/codec-psnr.png"></p>
<p>Sample images.
<img alt="ui" src="http://www.wdong.org/montage.jpg"></p>
<p>Image Format Distribution:
- 1146 image/jpeg
- 194 image/gif
- 188 image/png</p>Making Sense Out of the Matlab FFT Results2011-10-11T00:00:00-04:002017-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2011-10-11:/making-sense-out-of-the-matlab-fft-results.html<p>Here's a note for people who know nothing about digital signal processing (like me) to make sense out of the matlab FFT results.</p>
<p>For an input A of N real numbers, B = fft(A) contains N complex numbers. B has the following properties:</p>
<ul>
<li>B[0] (the first number, DC) is pure real.</li>
<li>If N is even, then B[N/2] (the “Nyquist” frequency) is also pure real.</li>
<li>Other than that, B[i] = B[N-i]*.</li>
</ul>
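These symmetry properties are easy to check with a naive DFT, which computes the same thing as Matlab's fft. A Python sketch, for illustration only:

```python
import cmath

def dft(a):
    # Naive O(N^2) DFT: B[k] = sum_t a[t] * exp(-2*pi*i*k*t/N).
    # For real input this matches fft(a) up to floating-point error.
    n = len(a)
    return [sum(a[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]
```

For a real input of even length N, B[0] and B[N/2] come out (numerically) real, and B[i] equals the complex conjugate of B[N-i].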
<p>If we take the abs of the output, then the first floor(N/2)+1 elements contain all the information (the rest mirror them, since abs(B[i]) = abs(B[N-i])). What's important about this abs of the fft output is that it represents the power of the components at various frequencies. That is:</p>
<p>abs(B[n]) is the magnitude of the component, a sine wave, that completes n cycles over the span of A.</p>
<p>The FFT needs at least two samples from each cycle, so the highest-frequency component that appears in the FFT result completes floor(N/2) cycles, and corresponds to B[floor(N/2)].</p>
<p>So far we have not said anything about the sample rate. Now, assume that the sample rate of A is r; then the time length of A is N/r seconds. For B[n], A covers n cycles, so each cycle is N/(nr) seconds. That is, the corresponding frequency of B[n] is nr/N.</p>
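The bin-to-frequency mapping just derived is a one-liner. A hypothetical helper, for illustration:

```python
def bin_frequency(n, N, r):
    # Frequency (in Hz) represented by FFT bin n, for N samples taken at rate r.
    # The highest meaningful bin, floor(N/2), maps to r/2 (the Nyquist frequency).
    return n * r / N
```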
<p>So here's the role of the sample rate r and the number of samples N:</p>
<ol>
<li>The sample rate r controls the highest frequency covered by the FFT, which is r/2.</li>
<li>N controls the granularity of the quantization of the frequency range [0, r/2].</li>
</ol>About2011-01-01T00:00:00-05:002011-01-01T00:00:00-05:00Wei Dongtag:www.wdong.org,2011-01-01:/about.html<p>I'm an independent computer scientist. I provide high-quality and scalable implementations of bleeding edge algorithms and consulting services on machine learning, data mining, recommendation systems, computer vision, search engines, storage systems, high-performance computing and other fields in computer science. Ann Arbor Algorithms is my sole proprietorship.</p>