Kevin's Blog
http://kevinzakka.github.io/
Thu, 20 Feb 2020 18:05:35 +0000
kNN classification using Neighbourhood Components Analysis
<p><small><strong>Update (12/02/2020)</strong>: The implementation is now available as a <a href="https://pypi.org/project/torchnca/">pip package</a>. Simply run <em>pip install torchnca</em>.</small></p>
<p>While reading related work<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> for my current research project, I stumbled upon a reference to a classic paper from 2004 called <em>Neighbourhood Components Analysis</em> (NCA). After giving it a read, I was instantly charmed by its simplicity and elegance. Long story short, NCA allows you to learn a linear transformation of your data that maximizes k-nearest neighbours performance. By forcing the transformation to be low-rank, NCA will perform dimensionality reduction, leading to vastly reduced storage sizes and search times for kNN! NCA is a very useful algorithm to have in your toolkit – just like PCA – but it’s very rarely mentioned in the wild. In fact, I couldn’t find any tutorial or reference outside of academic papers. This post is an attempt to rectify this.</p>
<div class="imgcap">
<button id="animButton" onclick="toggleAnim()" class="playbutton">Play</button>
<img alt="" src="/assets/nca/banner-start.png" width="70%" id="animImage" style="border:none;" />
<div class="thecap" style="text-align:center;"><b>Figure 1:</b> Visualizing the embedding space of a synthetic dataset as NCA trains.</div>
</div>
<script type="text/javascript">
function toggleAnim() {
var path = document.getElementById("animImage").src;
if (path.split('/').pop() == "banner-start.png")
{
document.getElementById("animImage").src = "/assets/nca/banner-smaller.gif";
document.getElementById("animButton").textContent = "Reset";
}
else
{
document.getElementById("animImage").src = "/assets/nca/banner-start.png";
document.getElementById("animButton").textContent = "Play";
}
}
</script>
<p>I’ve implemented NCA in PyTorch with some added bells and whistles. It took almost 1 week to get it to work right, but I gained a lot of insight along the way. I think implementing algorithms from scratch is a great way of building intuition<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> for why things work – and by extension when and why they don’t – so I encourage the reader to do the same. There’s also a video presentation of NCA by one of the co-authors on <a href="https://youtu.be/07erva41ZoI">YouTube</a> which should serve as a good supplement to this post.</p>
<div style="text-align: center;">
<a href="https://papers.nips.cc/paper/2566-neighbourhood-components-analysis.pdf" id="linkbutton" target="_blank" style="margin-right: 10px;">Paper</a>
<a href="https://github.com/kevinzakka/nca" id="linkbutton" target="_blank" style="margin-left: 10px;">PyTorch Code</a>
</div>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#knn-issues">kNN: The Good, The Bad, The Ugly</a></li>
<li><a href="#nca-rescue">NCA to the rescue</a>
<ul>
<li><a href="#loss-func">Formulating the loss function</a></li>
<li><a href="#contrastive">NCA as a special case of the contrastive loss</a></li>
</ul>
</li>
<li><a href="#pytorch">NCA in PyTorch</a>
<ul>
<li><a href="#init">Initialization</a></li>
<li><a href="#comp-loss">Loss function</a></li>
<li><a href="#sgd">Replacing Conjugate Gradients with SGD</a></li>
<li><a href="#tricks">Stability tricks</a></li>
</ul>
</li>
<li><a href="#results">Boring… Show me what it can do!</a>
<ul>
<li><a href="#dim-reduct">Dimensionality reduction</a></li>
<li><a href="#sentiment">kNN on MNIST</a></li>
</ul>
</li>
<li><a href="#thankyou">Acknowledgements</a></li>
</ul>
<p><a name="knn-issues"></a></p>
<h2 id="knn-the-good-the-bad-the-ugly">kNN: The Good, The Bad, The Ugly</h2>
<div class="imgcap">
<img src="/assets/knn/teaser.png" width="60%" style="border:none;" />
<div class="thecap" style="text-align:center"><b>Figure 2:</b> kNN's nonlinear decision boundary <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">(source)</a>.</div>
</div>
<p>You’ve probably <a href="https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/">heard</a> of k-nearest neighbours (kNN) <em>at least</em> once in your life. It’s one of the first algorithms taught in many machine learning classes, and not without good reason. There’s lots to love about kNN! To name a few:</p>
<ul>
<li>It has an extremely simple implementation. In fact, kNN has absolutely no computational training cost.</li>
<li>Its decision boundary, controlled by <script type="math/tex">k</script>, is highly nonlinear (the lines in Figure 2 are locally linear, but their overall shape can’t be described by a single hyperplane). For low values of <script type="math/tex">k</script>, kNN has very little <a href="https://en.wikipedia.org/wiki/Inductive_bias">inductive bias</a>.</li>
<li>There’s just a single hyperparameter to tune: the number of neighbours <script type="math/tex">k</script>. You can easily find its optimal value with cross-validation.</li>
<li>It is asymptotically optimal. One can show that as the amount of data approaches infinity, kNN is guaranteed to yield an error rate no worse than twice the Bayes error rate – the lowest possible error rate for any classifier – on a binary classification task. In other words, you can expect the performance of kNN to automatically improve as the number of training examples increases.</li>
</ul>
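<p>To make the cross-validation point concrete, here is a minimal, self-contained sketch (not from the original post; the toy dataset and helper names are my own) of picking <script type="math/tex">k</script> by leave-one-out cross-validation with a from-scratch kNN:</p>

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return np.bincount(y_train[nearest]).argmax()

def loo_error(X, y, k):
    """Leave-one-out error rate of kNN for a given k."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i   # hold out point i
        errors += knn_predict(X[mask], y[mask], X[i], k) != y[i]
    return errors / len(X)

# toy dataset: two well-separated 2D Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 4.0])
y = np.array([0] * 20 + [1] * 20)
best_k = min([1, 3, 5], key=lambda k: loo_error(X, y, k))
```

Note that this brute-force loop already hints at the scaling problem discussed next: every prediction touches the entire training set.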
<p>But kNN does have some annoying drawbacks that limit its efficiency in big-data regimes. Specifically,</p>
<ul>
<li>It has to store and search through the entire training set to classify just one test point. That’s extremely unappealing from a deployment perspective, since we usually aim for high test-time efficiency and a low memory footprint.</li>
<li>In high dimensions, it suffers from the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>.</li>
<li>The choice of the distance metric can have a significant effect on its performance. What then is the optimal distance metric? How should one go about choosing it?</li>
</ul>
<p><a name="nca-rescue"></a></p>
<h2 id="nca-to-the-rescue">NCA to the Rescue</h2>
<p>Rather than having the user specify some arbitrary distance metric, NCA <em>learns</em> it by choosing a parameterized family of quadratic distance metrics, constructing a loss function of the parameters, and optimizing it with gradient descent. Furthermore, the learned distance metric can explicitly be made low-dimensional, solving test-time storage and search issues. How does NCA do this?</p>
<p>It turns out that learning a quadratic distance metric <script type="math/tex">d</script> of the input space under which the performance of kNN is maximized is equivalent to learning a linear transformation <script type="math/tex">\mathcal{A}</script> of the input space, such that in the transformed space, the performance of kNN under the plain Euclidean metric is maximized. In fact, any quadratic distance metric<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> can be represented by a positive semi-definite matrix <script type="math/tex">Q = \mathcal{A}^T \mathcal{A}</script> such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation} \label{eq1}
\begin{split}
d(x_1, x_2) &= (x_1 - x_2)^T Q (x_1- x_2) \\
& = (\mathcal{A}x_1 - \mathcal{A}x_2)^T (\mathcal{A}x_1 - \mathcal{A}x_2) \\
&= \langle y_1 - y_2, y_1 - y_2 \rangle
\end{split}
\end{equation} %]]></script>
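<p>A quick numerical sanity check of this equivalence (a standalone sketch with made-up values, not code from the post): the quadratic metric induced by <script type="math/tex">Q = \mathcal{A}^T \mathcal{A}</script> equals the squared Euclidean distance between the transformed points.</p>

```python
import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(2, 3)        # linear map from D=3 down to d=2
Q = A.T @ A                # positive semi-definite by construction
x1, x2 = rng.randn(3), rng.randn(3)

# (x1 - x2)^T Q (x1 - x2) ...
d_quadratic = (x1 - x2) @ Q @ (x1 - x2)
# ... equals ||A x1 - A x2||^2
d_euclidean = np.sum((A @ x1 - A @ x2) ** 2)
```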
<p>The goal of the learning algorithm then, is to optimize the performance of kNN on future test data. Since we don’t a priori know the test data, we can choose instead to optimize the closest thing in our toolbox: the <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Leave-one-out_cross-validation"><em>leave-one-out</em></a> (LOO) performance of the training data.</p>
<p>At this point, I’d like the reader to appreciate the elegance of NCA. We’ve transformed the problem of maximizing the classification accuracy of kNN into an optimization problem over a single matrix <script type="math/tex">\mathcal{A}</script>. What remains is specifying a loss function that’s parameterized by <script type="math/tex">\mathcal{A}</script> and that can serve as a proxy for the LOO classification accuracy.</p>
<div class="imgcap">
<img src="/assets/nca/loo-disc.png" width="50%" style="border:none;" />
<div class="thecap" style="text-align:center"><b>Figure 3:</b> The discontinuous graph of the LOO cross validation error. The red rectangle in particular illustrates how an infinitesimal change in the x-axis may change the value of the y-axis by a finite amount.</div>
</div>
<p><a name="loss-func"></a>
<strong>Formulating The Loss Function.</strong> There’s a slight bump in our road: LOO error is a highly discontinuous loss function. The reason is that it depends solely on the neighbourhood graph of each point. If the distance metric changes slightly at first, there might be no change in the neighbourhood graph and thus no change of the LOO error. But then suddenly, an infinitesimal change in the metric can alter the neighbourhood graph of many points, causing a significant jump in the LOO error. This is illustrated in the figure above.</p>
<p>Clearly, a discontinuous loss function is terrible for optimization, so we need to construct an alternative that is smooth and differentiable. The key to doing this is to replace <em>fixed</em> neighbourhood selection (i.e. what is done in LOO cross-validation) with <em>stochastic</em> neighbourhood selection. That is, each point <script type="math/tex">i</script> in the training set selects another point <script type="math/tex">j</script> as its neighbour with some probability <script type="math/tex">p_{ij}</script> that decreases as the Euclidean distance <script type="math/tex">d_{ij}</script> between them in the transformed space grows. By summing over all values of <script type="math/tex">j</script>, we can compute the probability <script type="math/tex">p_i</script> that a point <script type="math/tex">i</script> will be correctly classified, and then sum over all values of <script type="math/tex">p_i</script> to obtain the total number of points we can expect to correctly classify.</p>
<p>Denoting the set of points in the same class as <script type="math/tex">i</script> by <script type="math/tex">C_i</script>, our loss function<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> thus becomes:</p>
<script type="math/tex; mode=display">\mathcal{L}(X; \mathcal{A}) = -\sum_i p_i = - \sum_i \sum_{j \in C_i} p_{ij}</script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation} \label{eq2}
\begin{split}
p_{ij} &= \frac{e^{-d_{ij}}}{\sum_{k \neq i} e^{-d_{ik}}} \\
&= \frac{\exp{\big(-\lVert \mathcal{A}x_i - \mathcal{A}x_j \rVert^2\big)}}{\sum_{k \neq i} \exp{\big(-\lVert \mathcal{A}x_i - \mathcal{A}x_k \rVert^2\big)}}
\end{split}
\end{equation} %]]></script>
<p>The really neat thing about this stochastic assignment is that we’ve completely avoided having to specify a value of <script type="math/tex">k</script>. It gets learned implicitly through the scale of the matrix <script type="math/tex">\mathcal{A}</script>:</p>
<ul>
<li>With larger values of <script type="math/tex">\mathcal{A}</script>, the distance between points increases and as a result their probabilities decrease (think exponential of smaller and smaller values). This means kNN will consult fewer neighbours for each point.</li>
<li>With smaller values of <script type="math/tex">\mathcal{A}</script>, the distance between points decreases and as a result their probabilities increase (think exponential of larger and larger values). This means kNN will consult more neighbours for each point.</li>
</ul>
<p><a name="contrastive"></a>
<strong>NCA as a special case of the contrastive loss.</strong> If we slightly alter our loss function to sum over log probabilities <script type="math/tex">-\sum_i \log{p_i}</script>, you’ll notice it looks just like a categorical cross entropy loss. In fact, you can think of NCA as a single hidden layer feed-forward neural network that performs metric learning with a contrastive loss function. Recall that a contrastive loss takes on the form:</p>
<script type="math/tex; mode=display">\mathcal{L}_ {contr} = \alpha \mathcal{L}_ {pos} + \beta \mathcal{L}_ {neg}</script>
<p>In most papers, <script type="math/tex">\mathcal{L}_ {pos}</script> is an L2 loss, <script type="math/tex">\mathcal{L}_ {neg}</script> is a hinge loss and <script type="math/tex">\alpha = \beta = 1</script>. The NCA loss function uses a categorical cross-entropy loss for <script type="math/tex">\mathcal{L}_ {pos}</script> with <script type="math/tex">\alpha = 1</script> and <script type="math/tex">\beta = 0</script>. This insight is going to be very valuable in our implementation of NCA when we talk about tricks to stabilize the training.</p>
<p><a name="pytorch"></a></p>
<h2 id="nca-in-pytorch">NCA In PyTorch</h2>
<p>There’s currently no GPU-accelerated version of NCA. The two most common implementations at the time of this post are scikit-learn’s Python <a href="https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NeighborhoodComponentsAnalysis.html">implementation</a> and a C++ <a href="https://github.com/jhseu/nca">implementation</a>. This meant I had the perfect excuse to implement a version in PyTorch that could leverage (a) <a href="https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html">automatic differentiation</a> to compute the gradient of the loss function with respect to <script type="math/tex">\mathcal{A}</script> and (b) blazing fast GPU acceleration that would prove super useful for large datasets. While the implementation was pretty straightforward, getting it to converge consistently took quite a while. In this section, I’ll walk you through the high-level components needed to implement NCA, plus all the additional bells and whistles I added to get it to converge. The entirety of the code is available on <a href="https://github.com/kevinzakka/nca">GitHub</a>.</p>
<p><a name="init"></a>
<strong>Initialization.</strong> Since NCA is a gradient-based iterative optimization process, it requires that we specify an initialization strategy for the matrix <script type="math/tex">\mathcal{A}</script>. The two obvious ones (no, not zero init!) are identity initialization and random initialization. Recall that if <script type="math/tex">d</script> is the chosen dimension of the embedding space, and if <script type="math/tex">X \in \mathcal{R}^{N \ \times \ D}</script> is our input dataset, then <script type="math/tex">\mathcal{A} \in \mathcal{R}^{d \ \times \ D}</script>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">D</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1"># feature space dimension
</span><span class="n">d</span> <span class="o">=</span> <span class="mi">2</span> <span class="c1"># embedding space dimension
</span>
<span class="k">if</span> <span class="n">init</span> <span class="o">==</span> <span class="s">"random"</span><span class="p">:</span>
<span class="c1"># random init from a normal distribution
</span> <span class="c1"># with mean 0 and variance 0.01
</span> <span class="n">A</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">D</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.01</span><span class="p">)</span>
<span class="k">elif</span> <span class="n">init</span> <span class="o">==</span> <span class="s">"identity"</span><span class="p">:</span>
<span class="c1"># identity init
</span> <span class="n">A</span> <span class="o">=</span> <span class="n">nn</span><span class="o">.</span><span class="n">Parameter</span><span class="p">(</span><span class="n">torch</span><span class="o">.</span><span class="n">eye</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">D</span><span class="p">))</span>
</code></pre></div></div>
<p><a name="comp-loss"></a>
<strong>Loss Function.</strong> Computing the loss function requires forming a matrix of pairwise Euclidean distances in the transformed space, applying a softmax over the negative distances to compute pairwise probabilities, then summing over probabilities belonging to the same class. The trick here is to vectorize the softmax computation whilst ignoring diagonal values of the distance matrix (i.e. values where <script type="math/tex">i = j</script>) and probabilities that don’t have the same class labels.</p>
<p>To compute a pairwise Euclidean distance matrix, we make use of the following code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">pairwise_l2_sq</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="s">"""Compute pairwise squared Euclidean distances.
"""</span>
<span class="n">dot</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">double</span><span class="p">(),</span> <span class="n">torch</span><span class="o">.</span><span class="n">t</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">double</span><span class="p">()))</span>
<span class="n">norm_sq</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">diag</span><span class="p">(</span><span class="n">dot</span><span class="p">)</span>
<span class="n">dist</span> <span class="o">=</span> <span class="n">norm_sq</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:]</span> <span class="o">-</span> <span class="mi">2</span><span class="o">*</span><span class="n">dot</span> <span class="o">+</span> <span class="n">norm_sq</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span>
<span class="n">dist</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">dist</span><span class="p">,</span> <span class="nb">min</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># replace negative values with 0
</span> <span class="k">return</span> <span class="n">dist</span><span class="o">.</span><span class="nb">float</span><span class="p">()</span>
</code></pre></div></div>
<p>Note the cast to <code class="language-plaintext highlighter-rouge">double</code> to increase numerical precision in the dot product computation and the <code class="language-plaintext highlighter-rouge">clamp</code> method to replace any negative values that could have arisen from numerical imprecisions with zeros.</p>
<p>Next, we want to compute a softmax over the negative distances to obtain the pairwise probability matrix <script type="math/tex">p_{ij}</script>. Unlike a typical softmax implementation, the denominator in our equation sums over all <script type="math/tex">k \neq i</script>, i.e. it skips the diagonal entries of the pairwise distance matrix. A neat trick to achieve this without modifying the softmax function is to fill the diagonal entries with <code class="language-plaintext highlighter-rouge">np.inf</code>. That way, taking the exponential of their negative evaluates to 0 and doesn’t contribute to the normalization.</p>
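<p>Here’s what that trick looks like in isolation, as a small NumPy stand-in for the PyTorch version (the function name is mine):</p>

```python
import numpy as np

def softmax_skip_diagonal(distances):
    """Row-wise softmax over negative distances that excludes each row's
    diagonal entry from the normalization (the k != i condition)."""
    d = distances.astype(float)       # copy so the input isn't modified
    np.fill_diagonal(d, np.inf)       # exp(-inf) = 0: i never picks itself
    exp = np.exp(-d)
    return exp / exp.sum(axis=1, keepdims=True)

dist = np.array([[0.0, 1.0, 2.0],
                 [1.0, 0.0, 1.0],
                 [2.0, 1.0, 0.0]])
p = softmax_skip_diagonal(dist)
```

Each row of <code class="language-plaintext highlighter-rouge">p</code> sums to 1 while the diagonal entries are exactly 0, which is precisely the behaviour the denominator of the equation above requires.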
<p>Now for each row <script type="math/tex">i</script> in <script type="math/tex">p_{ij}</script>, we need to sum over all columns <script type="math/tex">j \in C_i</script>. We can achieve this simply by creating a pairwise boolean mask of class labels, element-wise multiplying it with <script type="math/tex">p_{ij}</script> then calling the <code class="language-plaintext highlighter-rouge">sum</code> method. The code below executes all the aforementioned computations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># compute pairwise boolean class label mask
</span><span class="n">y_mask</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">]</span> <span class="o">==</span> <span class="n">y</span><span class="p">[</span><span class="bp">None</span><span class="p">,</span> <span class="p">:])</span><span class="o">.</span><span class="nb">float</span><span class="p">()</span>
<span class="c1"># compute pairwise squared Euclidean distances
# in transformed space
</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">mm</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">torch</span><span class="o">.</span><span class="n">t</span><span class="p">(</span><span class="n">A</span><span class="p">))</span>
<span class="n">distances</span> <span class="o">=</span> <span class="n">pairwise_l2_sq</span><span class="p">(</span><span class="n">embedding</span><span class="p">)</span>
<span class="c1"># compute pairwise probability matrix p_ij defined by a
# softmax over negative squared distances in the transformed space.
# since we are dealing with negative values with the largest value
# being 0, we need not worry about numerical instabilities
# in the softmax function
</span><span class="n">p_ij</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="o">-</span><span class="n">distances</span><span class="p">)</span>
<span class="c1"># for each p_i, zero out any p_ij that is not of the same
# class label as i
</span><span class="n">p_ij_mask</span> <span class="o">=</span> <span class="n">p_ij</span> <span class="o">*</span> <span class="n">y_mask</span>
<span class="c1"># sum over js to compute p_i
</span><span class="n">p_i</span> <span class="o">=</span> <span class="n">p_ij_mask</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># compute expected number of points correctly classified by summing
# over all p_i's.
</span><span class="n">loss</span> <span class="o">=</span> <span class="o">-</span><span class="n">p_i</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
</code></pre></div></div>
<p><a name="sgd"></a>
<strong>Replacing Conjugate Gradients with SGD.</strong> The authors originally optimized NCA with conjugate gradients. I decided to stick with mini-batch stochastic gradient descent (SGD). My reasoning was two-fold. First, with very large datasets, the size of the pairwise matrix grows quadratically with the number of points, so it was essential that I use a mini-batch optimizer that could run very fast on a memory-limited GPU. Second, SGD has been shown to be a tried and true optimizer in deep learning that <a href="https://ruder.io/optimizing-gradient-descent/">tends to generalize better</a> than its counterparts.</p>
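<p>For the curious, here is a condensed, self-contained sketch of what such a mini-batch SGD loop could look like. This is my own abbreviated reconstruction, not the repository’s exact code: it already uses the log variant of the loss mentioned in the stability tricks below and omits the others.</p>

```python
import torch

def nca_loss(A, X, y):
    """Condensed NCA loss (log variant) on a mini-batch."""
    emb = X @ A.t()
    diff = emb[:, None, :] - emb[None, :, :]
    dist = (diff ** 2).sum(dim=2)           # pairwise squared distances
    dist = dist + 1e9 * torch.eye(len(X))   # huge diagonal: exclude self-pairs
    p_ij = torch.softmax(-dist, dim=1)
    same_class = (y[:, None] == y[None, :]).float()
    p_i = (p_ij * same_class).sum(dim=1)
    return -torch.log(p_i + 1e-8).sum()

# toy data: two separated 4D blobs, embedding dimension d=2
torch.manual_seed(0)
X = torch.cat([torch.randn(32, 4), torch.randn(32, 4) + 3.0])
y = torch.cat([torch.zeros(32), torch.ones(32)])
A = torch.nn.Parameter(0.01 * torch.randn(2, 4))
optimizer = torch.optim.SGD([A], lr=0.01)

loss_before = nca_loss(A, X, y).item()
for epoch in range(20):
    perm = torch.randperm(len(X))
    for start in range(0, len(X), 16):      # mini-batches of 16 points
        idx = perm[start:start + 16]
        optimizer.zero_grad()
        loss = nca_loss(A, X[idx], y[idx])
        loss.backward()
        optimizer.step()
loss_after = nca_loss(A, X, y).item()
```

Each mini-batch only approximates the full neighbourhood graph, which is exactly why batch size matters so much (see the stability tricks below).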
<p><a name="tricks"></a>
<strong>Stability Tricks.</strong> It took an intense session of debugging to get the implementation to consistently work for the various initializations and input data sizes. Here they are, in no particular order:</p>
<ul>
<li>Summing over log probabilities was more stable than the non-log variant. In other words, I ended up using a categorical cross-entropy loss.</li>
<li>Initially, the random initialization was sampled from a unit variance Gaussian. Lowering the variance to 0.01 seemed to make the optimization more stable.</li>
<li>Selecting the batch size was crucial for convergence. A small batch size leads to a very jittery loss function. This makes sense intuitively: a small batch means the pairwise matrix is only a very crude approximation of the neighbourhood graph, since it only considers a random subset of all possible neighbours. I noticed a good rule of thumb was to maximize the batch size within the GPU's memory limits.</li>
<li>Normalizing the input data (i.e. subtracting the mean and dividing by the standard deviation) helped with convergence. Note that doing this requires that we store the computed statistics and scale any test data appropriately.</li>
<li>Without L2 regularization, the final matrix <script type="math/tex">\mathcal{A}</script> tended to blow up in scale. Adding L2 regularization to the loss function helped tame the matrix and speed-up convergence.</li>
<li>Random init always converged to a collapsed projection where the points lay on a hyperplane. This is possible because there is no term in the loss function that explicitly pulls different classes apart. To combat this, I added a hinge loss component to the loss function, essentially turning the NCA loss into a contrastive loss function.</li>
</ul>
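<p>For concreteness, here is a rough sketch of how the loss-related tricks above (log probabilities, L2 regularization, the hinge term) might combine. The function name, weight, and margin value are illustrative assumptions, not the repository’s exact implementation:</p>

```python
import torch

def stabilized_nca_loss(p_i, distances, same_class, A,
                        weight_decay=0.01, margin=2.0):
    """p_i: per-point probability of correct classification,
    distances: pairwise squared distances in the embedding space,
    same_class: {0, 1} mask of matching labels, A: the learned matrix."""
    # trick: sum over log probabilities (categorical cross-entropy form)
    ce = -torch.log(p_i + 1e-8).sum()
    # trick: L2 penalty keeps the scale of A from blowing up
    l2 = weight_decay * (A ** 2).sum()
    # trick: hinge term pushes differently-labelled pairs at least
    # `margin` apart, guarding against a collapsed projection
    hinge = ((1.0 - same_class) * torch.clamp(margin - distances, min=0.0)).sum()
    return ce + l2 + hinge
```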
<p><a name="results"></a></p>
<h2 id="boring-show-me-what-it-can-do">Boring… Show Me What It Can Do!</h2>
<p>At this point, you’re probably curious to know if NCA lives up to its claims. Let’s go ahead and test the PyTorch implementation on two tasks: dimensionality reduction and kNN classification.</p>
<p>Using the NCA API is super simple. Very briefly, you first instantiate an NCA object with an embedding dimension and an initialization strategy. Then you call the <code class="language-plaintext highlighter-rouge">train</code> method on the input and ground-truth tensors, specifying a batch size and learning rate. There are other parameters you can change, all documented in the class docstring.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nca</span> <span class="o">=</span> <span class="n">NCA</span><span class="p">(</span><span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">init</span><span class="o">=</span><span class="s">"random"</span><span class="p">)</span> <span class="c1"># instantiate nca object
</span><span class="n">nca</span><span class="o">.</span><span class="n">train</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span> <span class="n">lr</span><span class="o">=</span><span class="mf">1e-4</span><span class="p">)</span> <span class="c1"># fit nca model
</span><span class="n">X_nca</span> <span class="o">=</span> <span class="n">nca</span><span class="p">(</span><span class="n">X</span><span class="p">)</span> <span class="c1"># apply the learned transformation
</span></code></pre></div></div>
<p><a name="dim-reduct"></a></p>
<h4 id="dimensionality-reduction">Dimensionality Reduction</h4>
<p>For this task, I replicated a portion of the results from section 4 of the paper. Specifically, I generated a synthetic three-dimensional dataset which consists of 5 classes, shown in different colors in Figure 4. The first two dimensions of the dataset correspond to concentric circles, while the third dimension is just Gaussian noise with high variance.</p>
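<p>A comparable dataset can be generated in a few lines (the exact radii and variances used in the post’s experiment are assumptions; this is just an illustrative reconstruction):</p>

```python
import numpy as np

rng = np.random.RandomState(0)
n_per_class, n_classes = 100, 5
X, y = [], []
for c in range(n_classes):
    radius = 1.0 + c                          # one concentric ring per class
    theta = rng.uniform(0, 2 * np.pi, n_per_class)
    ring = np.stack([radius * np.cos(theta),
                     radius * np.sin(theta)], axis=1)
    ring += 0.1 * rng.randn(n_per_class, 2)   # small in-plane jitter
    noise = 3.0 * rng.randn(n_per_class, 1)   # high-variance 3rd dimension
    X.append(np.hstack([ring, noise]))
    y.append(np.full(n_per_class, c))
X, y = np.vstack(X), np.concatenate(y)
```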
<div class="imgcap">
<img src="/assets/nca/res.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center"><b>Figure 4:</b> NCA vs. PCA vs. LDA on the synthetic dataset.</div>
</div>
<p>I then embedded the dataset into a 2D space using PCA, LDA and NCA. The results are shown in Figure 4. While NCA seems to have recovered the original concentric pattern, PCA fails to project out the noise, a direct consequence of the noise's high variance. If we lower the noise variance to <code class="language-plaintext highlighter-rouge">0.1</code> for example, PCA successfully recovers the pattern. LDA also struggles to recover the concentric pattern since the classes themselves are not linearly separable.</p>
<p><a name="sentiment"></a></p>
<h4 id="knn-on-mnist">kNN On MNIST</h4>
<p>The whole motivation for NCA was that it would vastly reduce the storage and search costs of kNN for high-dimensional datasets. To put this to the test, we compared the storage, run time and error rates of two variants of kNN on the MNIST dataset:</p>
<ul>
<li>5-NN on the raw MNIST dataset (784 dimensional)</li>
<li>5-NN on the 32 dimensional NCA projection of MNIST</li>
</ul>
<p>The results are shown in the table below.</p>
<style>
table {
font-family: arial, sans-serif;
border-collapse: collapse;
width: 100%;
}
td, th {
border: 1px solid #dddddd;
text-align: left;
padding: 8px;
}
tr:nth-child(even) {
background-color: #f0f0f0;
}
</style>
<table>
<tr>
<th style="text-align: center">Algorithm</th>
<th style="text-align: center">Raw kNN</th>
<th style="text-align: center">NCA + kNN</th>
</tr>
<tr>
<th style="text-align: center">Error (%)</th>
<td style="text-align: center">2.8</td>
<td style="text-align: center">3.3</td>
</tr>
<tr>
<th style="text-align: center">Time (s)</th>
<td style="text-align: center">155.25</td>
<td style="text-align: center">2.37</td>
</tr>
<tr>
<th style="text-align: center">Storage (MB)</th>
<td style="text-align: center">156.8</td>
<td style="text-align: center">6.40</td>
</tr>
</table>
<p>That’s a 66x speedup in query time and a 25x reduction in storage<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>!</p>
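<p>Most of that speedup falls straight out of the lower dimensionality. A toy micro-benchmark on random data (not MNIST; all sizes here are illustrative, and slicing stands in for the learned projection) shows the same effect:</p>

```python
import time
import numpy as np

rng = np.random.RandomState(0)
train_784 = rng.randn(2000, 784).astype(np.float32)   # "raw" features
train_32 = rng.randn(2000, 32).astype(np.float32)     # "projected" features
query = rng.randn(50, 784).astype(np.float32)

def knn_query_time(train, queries, k=5):
    """Time brute-force k-nearest-neighbour lookups for all queries."""
    start = time.perf_counter()
    for q in queries:
        d = np.sum((train - q) ** 2, axis=1)
        np.argpartition(d, k)[:k]     # indices of the k nearest points
    return time.perf_counter() - start

t_raw = knn_query_time(train_784, query)
t_low = knn_query_time(train_32, query[:, :32])
```

Since every brute-force query scans the whole training matrix, both search time and storage scale linearly with the embedding dimension, which is exactly the 784-to-32 gap the table reflects.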
<p><a name="thankyou"></a></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>I’d like to thank Nick Hynes, Alex Nichol and Brent Yi for their valuable feedback throughout my debugging session and blog writing. I also want to thank Chris Choy for the insight he provided on mode collapse. The javascript code for the animation was adapted from Sam Greydanus’ <a href="https://greydanus.github.io/">blog</a> – check him out, he’s got some great content.</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The paper in question is <a href="https://arxiv.org/abs/1904.07846">Temporal Cycle Consistency Learning</a> from Dwibedi et al. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>John Schulman discusses this in more depth in his latest <a href="http://joschu.net/blog/opinionated-guide-ml-research.html">blog post</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>You can convince yourself that this is a valid distance metric by checking that the non-negativity, symmetry and triangle inequality conditions are satisfied. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>We negate the expression because our goal is to maximize the expectation and we’re going to feed it to an optimizer that performs minimization. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Performance on MNIST isn’t very representative of real world performance on tougher datasets but this is still a very cool result. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Mon, 10 Feb 2020 00:00:00 +0000
http://kevinzakka.github.io/2020/02/10/nca/
machine learning, metric learning, knn, nca
Learning to Assemble and to Generalize from Self-Supervised Disassembly
<p>This is a crosspost from the official <a href="https://ai.googleblog.com/2019/10/learning-to-assemble-and-to-generalize.html">Google AI Blog</a>.</p>
<p>Our physical world is full of different shapes, and learning how they are all interconnected is a natural part of interacting with our surroundings — for example, we understand that coat hangers hook onto clothing racks, power plugs insert into wall outlets, and USB cables fit into USB sockets. This general concept of “how things fit together’’ based on their shapes is something that we acquire over time and experience, and it helps to increase the efficiency with which we perform tasks, like assembling DIY furniture kits or packing gifts into a box. If robots could also learn “how things fit together,” then perhaps they could become more adaptable to new manipulation tasks involving objects they have never seen before, like reconnecting severed pipes, or building makeshift shelters by piecing together debris during disaster response scenarios.</p>
<p>To explore this idea, we worked with researchers from Stanford and Columbia Universities to develop <a href="https://form2fit.github.io/">Form2Fit</a>, a robotic manipulation algorithm that uses deep neural networks to learn to visually recognize how objects correspond (or “fit”) to each other. To test this algorithm, we tasked a real robot to perform kit assembly, where it needed to accurately assemble objects into a blister pack or corrugated display to form a single unit. Previous systems built for this task required extensive manual tuning to assemble a single kit unit at a time. However, we demonstrate that by learning the general concept of “how things fit together,” Form2Fit enables our robot to assemble various types of kits with a 94% success rate. Furthermore, Form2Fit is one of the first systems capable of generalizing to new objects and kitting tasks not seen during training.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/teaser-white.gif" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Form2Fit learns to assemble a wide variety of kits by finding geometric correspondences between object surfaces and their target placement locations. By leveraging geometric information learned from multiple kits during training, the system generalizes to new objects and kits.</div>
</p>
</div>
<p>While often overlooked, shape analysis plays an important role in manipulation, especially for tasks like kit assembly. In fact, the shape of an object often matches the shape of its corresponding space in the packaging, and understanding this relationship is what allows people to do this task with minimal guesswork. At its core, Form2Fit aims to learn this relationship by training over numerous pairs of objects and their corresponding placing locations across multiple different kitting tasks – with the goal of acquiring a broader understanding of how shapes and surfaces fit together. Form2Fit improves itself over time with minimal human supervision, gathering its own training data by repeatedly disassembling completed kits through trial and error, then time-reversing the disassembly sequences to get assembly trajectories. After training overnight for 12 hours, our robot learns effective pick and place policies for a variety of kits, achieving 94% assembly success rates with objects and kits in varying configurations, and over 86% assembly success rates when handling completely new objects and kits.</p>
<h3 id="data-driven-shape-descriptors-for-generalizable-assembly">Data-Driven Shape Descriptors For Generalizable Assembly</h3>
<p>The core of Form2Fit is a two-stream matching network that learns to infer orientation-sensitive geometric pixel-wise descriptors for objects and their target placement locations from visual data. These descriptors can be understood as compressed 3D point representations that encode object geometry, textures, and contextual task-level knowledge. Form2Fit uses these descriptors to establish correspondences between objects and their target locations (i.e., where they should be placed). Since these descriptors are orientation-sensitive, they allow Form2Fit to infer how the picked object should be rotated before it is placed in its target location.</p>
<p>Form2Fit uses two additional networks to generate valid pick and place candidates. A suction network gets fed a 3D image of the objects and generates pixel-wise predictions of suction success. The suction probability map is visualized as a heatmap, where hotter pixels indicate better locations to grasp the object at the 3D location of the corresponding pixel. In parallel, a place network gets fed a 3D image of the target kit and outputs pixel-wise predictions of placement success. These, too, are visualized as a heatmap, where hotter pixels indicate better locations for the robot arm to approach from a top-down angle to place the object. Finally, the planner integrates the output of all three modules to produce the final pick location, place location and rotation angle.</p>
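<p>To make the planner's role concrete, here is a toy Python sketch of how the three outputs could be combined. The function name, array shapes, and the L2 descriptor matching are illustrative assumptions, not Form2Fit's actual implementation:</p>

```python
import numpy as np

def plan_pick_and_place(suction_map, place_map, obj_desc, kit_descs):
    """Toy planner in the spirit of Form2Fit: combine a suction heatmap,
    a placement heatmap, and pixel-wise descriptors into a single
    (pick, place, rotation) decision. All names/shapes are hypothetical:

    suction_map: (H, W) pick-success probabilities
    place_map:   (H, W) place-success probabilities
    obj_desc:    (H, W, D) descriptors of the object scene
    kit_descs:   (R, H, W, D) kit descriptors, one map per candidate rotation
    """
    pick = np.unravel_index(np.argmax(suction_map), suction_map.shape)
    place = np.unravel_index(np.argmax(place_map), place_map.shape)
    d_pick = obj_desc[pick]  # descriptor at the chosen pick pixel
    # Choose the rotation whose kit descriptor at the place pixel best
    # matches the picked object's descriptor (L2 distance).
    dists = np.linalg.norm(kit_descs[:, place[0], place[1], :] - d_pick, axis=1)
    rotation = int(np.argmin(dists))
    return pick, place, rotation
```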
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/overview.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Overview of Form2Fit: the suction and place networks infer candidate picking and placing locations in the scene respectively. The matching network generates pixel-wise orientation-sensitive descriptors to match picking locations to their corresponding placing locations. The planner then integrates it all to control the robot to execute the next best pick and place action.</div>
</p>
</div>
<h3 id="learning-assembly-from-disassembly">Learning Assembly from Disassembly</h3>
<p>Neural networks require large amounts of training data, which can be difficult to collect for tasks like assembly. Precisely inserting objects into tight spaces with the correct orientation (e.g., in kits) is challenging to learn through trial and error, because the chances of success from random exploration can be slim. In contrast, disassembling completed units is often easier to learn through trial and error, since there are fewer incorrect ways to remove an object than there are to correctly insert it. We leveraged this difference in order to amass training data for Form2Fit.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/trimmed.gif" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">An example of self-supervision through time-reversal: rewinding a disassembly sequence of a deodorant kit over time generates a valid assembly sequence.</div>
</p>
</div>
<p>Our key observation is that in many cases of kit assembly, a disassembly sequence – when reversed over time – becomes a valid assembly sequence. This concept, called <a href="https://arxiv.org/abs/1810.01128">time-reversed disassembly</a>, enables Form2Fit to train entirely through self-supervision by randomly picking with trial and error to disassemble a fully-assembled kit, then reversing that disassembly sequence to learn how the kit should be put together.</p>
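<p>The time-reversal trick itself is simple enough to sketch in a few lines of Python. The tuple structure below is an illustrative assumption about how pick and place poses might be logged:</p>

```python
def reverse_disassembly(disassembly):
    """Time-reversed disassembly, as described above: each disassembly step
    records where an object was grasped (inside the kit) and where it was
    dropped (outside the kit). Played backwards with pick/place swapped,
    the sequence becomes a valid assembly demonstration. `disassembly` is
    a list of (pick_pose, place_pose) tuples; the names are illustrative.
    """
    return [(place, pick) for pick, place in reversed(disassembly)]
```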
<h3 id="generalization-results">Generalization Results</h3>
<p>The results of our experiments show great potential for learning generalizable policies for assembly. For instance, when a policy is trained to assemble a kit in only one specific position and orientation, it can still robustly assemble random rotations and translations of the kit 90% of the time.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/init.gif" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Form2Fit policies are robust to a wide range of rotations and translations of the kits.</div>
</p>
</div>
<p>We also find that Form2Fit is capable of tackling novel configurations it has not been exposed to during training. For example, when training a policy on two single-object kits (floss and tape), we find that it can successfully assemble new combinations and mixtures of those kits, even though it has never seen such configurations before.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/res1.gif" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Form2Fit policies can generalize to novel kit configurations such as multiple versions of the same kit and mixtures of different kits.</div>
</p>
</div>
<p>Furthermore, when given completely novel kits on which it has not been trained, Form2Fit can generalize using its learned shape priors to assemble those kits with over 86% assembly accuracy.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/res2.gif" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Form2Fit policies can generalize to never-before-seen single and multi-object kits.</div>
</p>
</div>
<h3 id="what-have-the-descriptors-learned">What Have the Descriptors Learned?</h3>
<p>To explore what the descriptors of the matching network from Form2Fit have learned to encode, we visualize the pixel-wise descriptors of various objects in RGB colorspace using an embedding technique called <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a>.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/form2fit/tsne.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">The t-SNE embedding of the learned object descriptors. Similarly oriented objects of the same category display identical colors (e.g. A, B or F, G), while different objects (e.g. C, H) or same objects with different orientations (e.g. A, C, D or H, F) exhibit different colors.</div>
</p>
</div>
<p>We observe that the descriptors have learned to encode (a) rotation — objects oriented differently have different descriptors (A, C, D, E) and (H, F); (b) spatial correspondence — same points on the same oriented objects share similar descriptors (A, B) and (F, G); and (c) object identity — zoo animals and fruits exhibit unique descriptors (columns 3 and 4).</p>
<h3 id="limitations--future-work">Limitations & Future Work</h3>
<p>While Form2Fit’s results are promising, its limitations suggest directions for future work. In our experiments, we assume a 2D planar workspace to constrain the kit assembly task so that it can be solved by sequencing top-down picking and placing actions. This may not work for all cases of assembly – for example, when a peg needs to be precisely inserted at a 45 degree angle. It would be interesting to expand Form2Fit to more complex action representations for 3D assembly.</p>
<p>You can learn more about this work and download the code from our <a href="https://github.com/kevinzakka/form2fit">GitHub repository</a>.</p>
<p align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/exnMwDmS1QI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>This research was done by Kevin Zakka, Andy Zeng, Johnny Lee, and Shuran Song (faculty at Columbia University), with special thanks to Nick Hynes, Alex Nichol, and Ivan Krasin for fruitful technical discussions; Adrian Wong, Brandon Hurd, Julian Salazar, and Sean Snyder for hardware support; Ryan Hickman for valuable managerial support; and Chad Richards for helpful feedback on writing.</p>
Thu, 31 Oct 2019 00:00:00 +0000
http://kevinzakka.github.io/2019/10/31/form2fit/
http://kevinzakka.github.io/2019/10/31/form2fit/roboticsresearchManifesto<p>I find writing to be a very fascinating and therapeutic activity. There’s nothing quite like twiddling a bunch of words into a sequence, reading the result out loud, grimacing, and adjusting until it sounds just right. It’s the reason I started this blog, yet I find that I haven’t been able to write as much as I would like to. It sucks, but because articles on here have usually been academic and I prioritize quality over quantity, finding the time to write them has been very challenging.</p>
<p>To combat this dry spell, I’ve decided to create a new section of the blog entitled <strong>Miscellany</strong>, where I’ll post on a variety of topics such as interesting research papers, books I read, and philosophical ponderings of life. I still intend to publish on the main section, but posts there will be reserved for tutorials and research expositions primarily in machine learning. I’m aiming to write once a month and while it’s not much, it’s still better than nothing. As <a href="https://www.youtube.com/watch?v=46GwJbrMghQ&feature=youtu.be&t=172">Andy Dufresne puts it beautifully</a> in <em>The Shawshank Redemption</em>:</p>
<blockquote>
<p>Get busy living, or get busy dying<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p>
</blockquote>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>A tad bit dramatic for my case, but I couldn’t resist. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 26 May 2019 00:00:00 +0000
http://kevinzakka.github.io/2019/05/26/manifesto/
http://kevinzakka.github.io/2019/05/26/manifesto/personalDex-Net 2.0: Deep Learning to Plan Robust Grasps<p>In this blog post, we’re going to take a close look at <a href="https://arxiv.org/abs/1703.09312">Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics</a> by <em>Jeffrey Mahler</em>, <em>Jacky Liang</em>, <em>Sherdil Niyaz</em>, <em>Michael Laskey</em>, <em>Richard Doan</em>, <em>Xinyu Liu</em>, <em>Juan Aparicio Ojea</em>, and <em>Ken Goldberg</em>.</p>
<div class="imgcap">
<img src="/assets/dexnet/teaser.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Overview</div>
</div>
<p><strong>TL;DR.</strong> This paper tackles grasp planning, which is the task of finding a gripper configuration (pose and width) that maximizes a success metric subject to kinematic and collision constraints. The suggested approach is to train a Grasp Quality Convolutional Neural Network (GQ-CNN) on a large synthetic dataset of depth images with associated positive and negative grasps. Then, at test time, one can sample various grasps from a depth image, feed each through the GQ-CNN, pick the one with the highest probability of success, and execute the grasp open-loop.</p>
<h3 id="variables">Variables</h3>
<p>Let’s start by introducing the variables that appear in the paper.</p>
<ul>
<li><script type="math/tex">x = (O, T_o, T_c, \gamma)</script>: the state describing the variable properties of the camera and objects in the environment, where:
<ul>
<li><script type="math/tex">O</script>: the geometry and mass properties of the object.</li>
<li><script type="math/tex">T_o, T_c</script>: 3D poses of the object and camera respectively.</li>
<li><script type="math/tex">\gamma</script>: the coefficient of friction between the object and the gripper.</li>
</ul>
</li>
<li><script type="math/tex">u = (p, \phi)</script>: a parallel-jaw grasp in 3D space, specified by a center <script type="math/tex">p = (x, y, z)</script> relative to the camera and an angle in the table plane <script type="math/tex">\phi</script>.</li>
<li><script type="math/tex">y \in \mathbb{R}^{H \times W}</script>: a point cloud represented as a depth image with height <script type="math/tex">H</script> and width <script type="math/tex">W</script> taken by the camera with known intrinsics <script type="math/tex">K</script> and pose <script type="math/tex">T_c</script>.</li>
<li><script type="math/tex">S(u, x) \in \{0, 1\}</script>: a binary-valued grasp success metric, such as force closure.</li>
</ul>
<p>Using these random variables, we can define a joint distribution <script type="math/tex">p(S, x, u, y)</script> that models the inherent uncertainty associated with our assumptions, such as erroneous sensor readings (calibration error, noise, limiting pinhole model, etc.) and imprecise control (kinematic inaccuracies, etc.).</p>
<p><strong>Goal.</strong> Ingest a depth image <script type="math/tex">y</script> of an object in a scene with an associated grasp candidate <script type="math/tex">u</script>, and spit out the probability that <script type="math/tex">u</script> will succeed under the above uncertainties. This is equivalent to predicting the <strong>robustness</strong> <script type="math/tex">Q</script> of a grasp, defined as the expected value of <script type="math/tex">S</script> conditioned on <script type="math/tex">u</script> and <script type="math/tex">y</script>, i.e. <script type="math/tex">Q(u, y) = \mathbb{E}[S \vert u, y]</script>.</p>
<p><strong>Solution.</strong> Use a neural network with weights <script type="math/tex">\theta</script> to approximate the complex, high-dimensional function <script type="math/tex">Q</script>. Concretely,</p>
<script type="math/tex; mode=display">\hat{\theta} = \arg \min_{\theta} \ \mathbb{E}_{p(S, u, x, y)} \big[L(S, Q_{\theta}(u, y)) \big]</script>
<p>And finally, using Monte-Carlo sampling of input-output pairs from our joint distribution, we obtain:</p>
<script type="math/tex; mode=display">\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} L(S_i, Q_{\theta}(u_i, y_i))</script>
<p>where <script type="math/tex">(S_i, u_i, x_i, y_i) \sim p(S, x, u, y)</script>.</p>
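<p>As a concrete (toy) illustration of this Monte-Carlo objective, the sketch below averages a cross-entropy loss over sampled tuples, with a simple logistic model standing in for the GQ-CNN; the flattened feature vector standing in for the (grasp, depth image) pair is an assumption made for brevity:</p>

```python
import numpy as np

def empirical_loss(theta, samples):
    """Monte-Carlo estimate of E[L(S, Q_theta(u, y))] from the text,
    averaged over the drawn samples. `samples` is a list of
    (S_i, features_i) pairs, where features_i is a flat vector standing
    in for the (u_i, y_i) input; a logistic model plays the role of the
    GQ-CNN and L is binary cross-entropy.
    """
    total = 0.0
    for s, x in samples:
        q = 1.0 / (1.0 + np.exp(-np.dot(theta, x)))  # Q_theta(u_i, y_i)
        q = np.clip(q, 1e-7, 1 - 1e-7)               # numerical safety
        total += -(s * np.log(q) + (1 - s) * np.log(1 - q))
    return total / len(samples)
```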
<h3 id="generative-graphical-model">Generative Graphical Model</h3>
<p>We can think of our joint <script type="math/tex">p(S, x, u, y)</script> as a generative model of images, grasps and success metrics. The relationship between the different variables is illustrated in the graphical model below.</p>
<div class="imgcap">
<img src="/assets/dexnet/gm.png" width="45%" style="border:none;" />
<div class="thecap" style="text-align:center">Graphical Model</div>
</div>
<p>Using the <a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)">chain rule</a>, we can express the joint <script type="math/tex">p(S, x, u, y)</script> as the product of 4 terms: <script type="math/tex">p(S \vert u, y, x)</script>, <script type="math/tex">p(u \vert x, y)</script>, <script type="math/tex">p(y \vert x)</script> and <script type="math/tex">p(x)</script>. And since <script type="math/tex">S</script> and <script type="math/tex">u</script> are conditionally independent of <script type="math/tex">y</script> given the state (no arrow going from <script type="math/tex">y</script> to <script type="math/tex">S</script> or <script type="math/tex">u</script>), we can reduce the expression to</p>
<script type="math/tex; mode=display">p(S, u, y, x) = {\color{red}{p(S \vert u, x)}} \cdot {\color{orange}{p(u \vert x)}} \cdot {\color{blue}{p(y \vert x)}} \cdot {\color{green}{p(x)}}</script>
<p>where:</p>
<ul>
<li><script type="math/tex">{\color{green}{p(x)}}</script> is the state distribution.</li>
<li><script type="math/tex">{\color{blue}{p(y \vert x)}}</script> is the observation model, conditioned on the current state.</li>
<li><script type="math/tex">{\color{orange}{p(u \vert x)}}</script> is the grasp candidate model, conditioned on the current state.</li>
<li><script type="math/tex">{\color{red}{p(S \vert u, x)}}</script> is the analytic model of grasp success conditioned on the grasp candidate and current state.</li>
</ul>
<p>The state <script type="math/tex">x = (O, T_o, T_c, \gamma)</script> is represented by the blue nodes in the graphical model. Using the chain rule and independence properties, we can express its underlying distribution as the product of:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
{\color{green}{p(x)}}
&= p(\gamma \vert T_c, T_o, O) \cdot p(T_c \vert T_o, O) \cdot p(T_o \vert O) \cdot p(O) \\
&= p(\gamma) \cdot p(T_c) \cdot p(T_o \vert O) \cdot p(O)
\end{align*} %]]></script>
<p>with:</p>
<ul>
<li><script type="math/tex">p(\gamma)</script>: truncated Gaussian over friction coefficients.</li>
<li><script type="math/tex">p(O)</script>: discrete uniform distribution over 3D object models.</li>
<li><script type="math/tex">p(T_o \vert O)</script>: continuous uniform distribution over discrete set of stable object poses.</li>
<li><script type="math/tex">p(T_c)</script>: continuous uniform distribution over spherical coordinates and polar angle.</li>
</ul>
<p>The grasp candidate model <script type="math/tex">{\color{orange}{p(u \vert x)}}</script> is a uniform distribution over pairs of antipodal contact points on the object surface whose grasp axis is parallel to the table plane (we want top-down grasps), the observation model <script type="math/tex">{\color{blue}{p(y \vert x)}}</script> is a rendered depth image of the scene corrupted with multiplicative and Gaussian Process noise, and the success model <script type="math/tex">{\color{red}{p(S \vert u, x)}}</script> is a binary-valued reward function subject to 2 constraints: epsilon quality and collision freedom.</p>
<p>Now that we’ve examined the inner workings of our generative model <script type="math/tex">p</script>, let’s see how we can use it to generate the massive Dex-Net dataset.</p>
<h3 id="generating-dex-net">Generating Dex-Net</h3>
<p>To train our GQ-CNN, we need to generate i.i.d samples, consisting of depth images, grasps, and grasp robustness labels, by sampling from the generative joint <script type="math/tex">p(S, x, u, y)</script>.</p>
<div class="imgcap">
<img src="/assets/dexnet/data-gen.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Data Generation Pipeline</div>
</div>
<ol>
<li>Randomly select, from a database of 1,500 meshes, a 3D object mesh using a discrete uniform distribution.</li>
<li>Randomly select, from a set of stable poses, a pose for this object using a continuous uniform distribution.</li>
<li>Use rejection sampling to generate top-down parallel-jaw grasps covering the surface of the object.</li>
<li>Randomly sample the camera pose (also from a continuous uniform distribution) and use it to render the object and its pose w.r.t. the camera into a depth image using ray tracing.</li>
<li>Classify the robustness of each sampled grasp to obtain a set of positive and negative grasps. Robustness is estimated using force closure probability, which is a function of object pose, gripper pose, and friction coefficient uncertainty.</li>
</ol>
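<p>The state-sampling part of this pipeline (steps 1 and 2, plus the friction and camera draws from <script type="math/tex">{\color{green}{p(x)}}</script>) can be sketched in Python. All distribution parameters below are illustrative placeholders, not the values used in the paper:</p>

```python
import math
import random

def sample_state(meshes, stable_poses, rng=random):
    """Sketch of sampling a state x = (O, T_o, T_c, gamma) from p(x):
    discrete uniform over object meshes, uniform over that object's stable
    poses, truncated Gaussian over the friction coefficient, and uniform
    spherical coordinates for the camera. Bounds are made up for the demo.
    """
    obj = rng.choice(list(meshes))            # p(O)
    pose = rng.choice(stable_poses[obj])      # p(T_o | O)
    # Truncated Gaussian for friction: resample until inside [0.3, 0.7].
    gamma = rng.gauss(0.5, 0.1)
    while not 0.3 <= gamma <= 0.7:
        gamma = rng.gauss(0.5, 0.1)
    camera = (rng.uniform(0.5, 0.7),          # radius
              rng.uniform(0.0, 2 * math.pi),  # azimuth
              rng.uniform(0.0, 0.1))          # polar angle -> p(T_c)
    return obj, pose, camera, gamma
```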
<h3 id="training-the-gq-cnn">Training the GQ-CNN</h3>
<p>Once the synthetic dataset has been generated, it becomes trivial to train the network.</p>
<div class="imgcap">
<img src="/assets/dexnet/model.png" width="65%" style="border:none;" />
<div class="thecap" style="text-align:center">Overview of the Model</div>
</div>
<p>Remember how we mentioned that GQ-CNN takes as input a depth image and a grasp candidate? Well, it turns out that the authors have a very clever way of encoding the grasp information into the depth image: they take a depth image and grasp candidate and transform the depth image such that the grasp pixel location <script type="math/tex">(i, j)</script> – projected from the grasp position <script type="math/tex">(x, y)</script> – is aligned with the image center and the grasp axis <script type="math/tex">\phi</script> corresponds to the middle row of the image. Then, at every iteration of SGD, we sample the transformed depth image and the remaining grasp variable <script type="math/tex">z</script> (i.e. the gripper depth from the camera), normalize the depth image to zero mean and unit standard deviation, and feed the tuple to the 18M parameter GQ-CNN model.</p>
<p><strong>Note 1.</strong> The model is a typical deep learning architecture composed of convolutional, max-pool and fully-connected primitives.</p>
<p><strong>Note 2.</strong> The depth alignment makes it easier for the model to train since it doesn’t have to worry about any rotational invariances. As for feeding the gripper depth to the model, I would think this is useful for pruning grasps that have the correct 2D position and orientation, but are too far away from the object (i.e. either not touching or barely touching).</p>
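<p>Here is a minimal Python sketch of that grasp-centric alignment, using nearest-neighbor inverse mapping; the rotation convention and the function name are assumptions for illustration:</p>

```python
import numpy as np

def align_to_grasp(depth, center, angle):
    """Re-centers and rotates a depth image so that the grasp pixel
    `center` = (i, j) lands at the image center and the grasp axis
    (at `angle` radians from the row direction) becomes horizontal.
    Nearest-neighbor inverse mapping; pixels whose source falls outside
    the original image are set to 0. Illustrative only.
    """
    h, w = depth.shape
    ci, cj = h // 2, w // 2
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    out = np.zeros_like(depth)
    for i in range(h):
        for j in range(w):
            di, dj = i - ci, j - cj  # offset from the new image center
            # Rotate the offset by `angle`, then shift to the grasp pixel.
            si = center[0] + di * cos_a - dj * sin_a
            sj = center[1] + di * sin_a + dj * cos_a
            si, sj = int(round(si)), int(round(sj))
            if 0 <= si < h and 0 <= sj < w:
                out[i, j] = depth[si, sj]
    return out
```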
<h3 id="grasp-planning-inference-time">Grasp Planning (Inference Time)</h3>
<p>Once the model is trained, we can pair the GQ-CNN with a policy of choice. The one used in the paper is <script type="math/tex">\pi_{\theta}(y) = \arg \max_{u \in C} Q_{\theta}(u, y)</script>, which amounts to sampling a set of predefined grasps from a depth image subject to a set of constraints <script type="math/tex">C</script> (e.g. kinematic and collision constraints), scoring each grasp using the GQ-CNN, and finally executing the most robust grasp. There are two sampling strategies used to generate grasp candidates: antipodal grasp sampling and cross-entropy sampling.</p>
<p><strong>Antipodal Grasp Sampling.</strong></p>
<p>First, we perform edge detection by locating pixel areas with high gradient magnitude. This is especially useful since graspable regions usually correspond to contact points on opposite edges of an object.</p>
<div class="imgcap">
<img src="/assets/dexnet/edge-detection.png" width="100%" style="border:none;" />
</div>
<p>Then we sample pairs of pixels belonging to these areas to generate antipodal contact points on the object. We enforce the constraint that the axis between each point pair is parallel to the table plane.</p>
<div class="imgcap">
<img src="/assets/dexnet/cands.gif" width="50%" style="border:none;" />
</div>
<p>We repeat this step until we reach the desired number of grasps, potentially increasing the friction coefficient if the amount is insufficient. In the final step, 2D grasps are deprojected to 3D grasps using the camera intrinsics and extrinsics and multiple grasps are obtained from the same contact points by discretizing the height starting from the object surface to the table surface (<script type="math/tex">h = 0</script>).</p>
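<p>A toy version of this sampler might look as follows in Python; the gradient threshold and the equal-depth test standing in for the parallel-to-table constraint are illustrative simplifications:</p>

```python
import numpy as np

def sample_antipodal_pairs(depth, n_pairs, thresh=0.05, rng=None):
    """Toy antipodal sampler: find high-gradient (edge) pixels, then
    randomly pair them up, keeping only pairs whose depths are nearly
    equal so the grasp axis stays roughly parallel to the table. The
    single `thresh` used for both tests is a simplification.
    """
    rng = rng or np.random.default_rng(0)
    gi, gj = np.gradient(depth)
    edges = np.argwhere(np.hypot(gi, gj) > thresh)  # candidate contacts
    pairs, tries = [], 0
    while len(pairs) < n_pairs and len(edges) >= 2 and tries < 100 * n_pairs:
        tries += 1
        a, b = edges[rng.choice(len(edges), size=2, replace=False)]
        if abs(depth[tuple(a)] - depth[tuple(b)]) < thresh:  # same height
            pairs.append((tuple(a), tuple(b)))
    return pairs
```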
<p><strong>Cross Entropy Method.</strong></p>
<div class="imgcap">
<img src="/assets/dexnet/cem.png" width="75%" style="border:none;" />
<div class="thecap" style="text-align:center">Evolution of grasp robustness as the gripper center sweeps the depth image from top to bottom.</div>
</div>
<p>Randomly choosing a grasp from a set of candidates doesn’t work very well in cases where the grasping regions are small and require very precise gripper configurations. Taking a look at the image above, we can see that as we sweep candidate grasps from top to bottom, grasp robustness stays near zero and spikes momentarily when we reach the good, yet narrow grasping area. Thus, uniform sampling of grasp candidates is inefficient especially since we’re trying to perform real-time grasp planning.</p>
<p>This is where importance sampling – one of <a href="https://kevinzakka.github.io/2018/09/28/prioritized-learning/">my favorite</a> techniques – can help! We can modify our sampling strategy such that at every iteration, we refit the candidate distribution to the grasps with the highest predicted robustness. The algorithm used to perform this fitting is the cross-entropy method (CEM), which tries to minimize the cross-entropy between a mixture of Gaussians and the top-k percentile of grasps ranked by the GQ-CNN. The result is that at every iteration, we are more likely to sample grasps with high robustness values (grasps in the spike area) and converge to an optimal grasp candidate. This fitting process is illustrated below.</p>
<div class="imgcap">
<img src="/assets/dexnet/sampled.gif" width="50%" style="border:none;" />
</div>
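<p>A minimal sketch of CEM for this setting is shown below, simplified to a single Gaussian over 2D grasp centers (the paper fits a mixture of Gaussians) and with a generic <code>score</code> function standing in for the GQ-CNN:</p>

```python
import numpy as np

def cem_plan(score, mean, std, iters=10, n=64, elite_frac=0.25, rng=None):
    """Cross-entropy method, simplified: at each iteration, sample grasp
    candidates from a Gaussian, keep the top elite_frac ranked by `score`
    (the robustness surrogate), and refit the Gaussian to those elites.
    Returns the highest-scoring candidate from the final iteration.
    """
    rng = rng or np.random.default_rng(0)
    k = max(1, int(n * elite_frac))
    for _ in range(iters):
        cands = rng.normal(mean, std, size=(n, len(mean)))
        scores = np.array([score(c) for c in cands])
        elites = cands[np.argsort(scores)[-k:]]          # most robust grasps
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return cands[int(np.argmax(scores))]
```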
<h3 id="discussion">Discussion</h3>
<ul>
<li>The sampling of grasps is inefficient. It would be interesting to extend the GQ-CNN to a fully-convolutional architecture where robustness labels can be computed for every pixel in the depth image in a single forward pass.</li>
<li>Dex-Net is open-loop which means that once a grasp candidate has been picked, it is executed blindly with no visual feedback. This sets it up for failure when camera calibration is imprecise or the environment it is placed in is dynamic and susceptible to change.</li>
<li>If we can speed-up Dex-Net by creating a smaller, fully-convolutional GQ-CNN, we may be able to run it at a high enough frequency to incorporate visual feedback and close the loop.</li>
</ul>
Mon, 05 Nov 2018 00:00:00 +0000
http://kevinzakka.github.io/2018/11/05/dexnet/
http://kevinzakka.github.io/2018/11/05/dexnet/graspingroboticscnnLearning What to Learn and When to Learn It<p><small><b>Disclaimer</b>: This blog post describes unfinished research and should be treated as a work in progress.</small></p>
<p>Hello world! I’m coming out of hibernation after 14 months of radio silence on this blog. I have a lot of things to blog about, from my research internship at Stanford University this past summer, to wrapping up my B.Eng. in EE in July – and I’ll hopefully get to those in future blog posts – but today, I’d like to talk about some of the cool research I did in my senior year of undergrad. Unfortunately, it’s not GAN/RL related (read as fortunately) but it’s definitely an interesting aspect of the field that could use some more attention.</p>
<p>The problem we’ll be investigating today is whether we can get Deep Neural Networks (DNNs) to converge faster and learn more efficiently. In particular, we’ll try to answer the following questions:</p>
<ul>
<li>Do we <em>really</em> need all the training samples in a dataset to reach a desired accuracy?</li>
<li>Can we do better than (lazy) uniform sampling of the data in a given training epoch?</li>
</ul>
<p>It actually turns out that on MNIST, we can reliably speed up training by a factor of 2 using just 30% of the available data<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>!</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Motivation</a></li>
<li><a href="#toc2">Refresher</a>
<ul>
<li><a href="#toc3">Stochastic Gradient Descent</a></li>
<li><a href="#toc4">Importance Sampling</a></li>
</ul>
</li>
<li><a href="#toc5">Quantifying Sample Importance</a></li>
<li><a href="#toc6">Loss Patterns</a></li>
<li><a href="#toc7">SGD on Steroids</a>
<ul>
<li><a href="#toc8">Mini-Batch Resampling</a></li>
<li><a href="#toc9">Auxiliary Model</a></li>
</ul>
</li>
<li><a href="#toc10">Things I Wish I Tried</a></li>
<li><a href="#toc11">Closing Thoughts</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="motivation">Motivation</h2>
<p>Human beings acquire knowledge in a unique way, accelerating their learning by choosing where and when to focus their efforts on the available training material. For example, when practicing a new musical composition, a pianist will spend more time on the difficult measures – breaking them down into manageable pieces that can be progressively mastered – rather than wasting her efforts on the simpler, more familiar parts.</p>
<div class="imgcap">
<img src="/assets/pr-lr/music-sheet-bach.jpg" width="75%" />
<div class="thecap" style="text-align:center"><a href="https://www.thestrad.com/yehudi-menuhins-marked-up-copy-of-bachs-solo-violin-sonata-no2/6651.article">Annotated Copy of Bach’s Solo Violin Sonata No. 2</a></div>
</div>
<p>Much of the same can be said about our formal primary and secondary education: our teachers help us learn from a smart selection of examples, leveraging previously acquired concepts to help guide our learning of new tools and abstractions. Human learning thus exhibits <strong>resource</strong> and <strong>time</strong> efficiency: we become proficient at mastering new concepts by selecting first, a <em>subset</em> of what is available to us in terms of learning material, and second, the <em>sequence</em> in which to learn the selected items such that we minimize acquisition time.</p>
<p>Unfortunately, the training algorithms we use in AI, unlike human learning, are data hungry and time consuming. With vanilla stochastic gradient descent (SGD), the standard go-to optimizer, we repeatedly iterate over the training data in sequential mini-batches for a large number of epochs, where a mini-batch is constructed by uniformly sampling <script type="math/tex">b</script> training points from the dataset. On large datasets – a necessity for good generalization – the naiveté of this sampling strategy hinders convergence and bottlenecks computation.</p>
<p><a name="toc2"></a></p>
<h2 id="refresher">Refresher</h2>
<p>So how can we improve SGD? Can we replace uniform sampling with a more efficient sampling distribution? More specifically, can we somehow predict a sample’s importance such that we adaptively construct training batches that catalyze more learning-per-iteration? These are all excellent questions we’ll be tackling further in the post, so let’s begin by refreshing a few concepts.</p>
<p><a name="toc3"></a>
<strong>Stochastic Gradient Descent.</strong> Given a neural network <script type="math/tex">M</script> parameterized by a set of weights <script type="math/tex">W</script>, a dataset <script type="math/tex">\mathcal{D}</script>, and a loss function <script type="math/tex">L</script>, we can express the goal of training as finding the optimal set of weights <script type="math/tex">\hat{W}</script> such that,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
\hat{W} & = \arg \min_{W} \ L_{\mathcal{D}} \\
& = \arg \min_{W} \ \frac{1}{B} \sum_{i=1}^{B} L_i \\
& = \arg \min_{W} \ \frac{1}{B} \sum_{i=1}^{B} \frac{1}{b} \sum_{j=1}^{b} L \big( M(x_{ij}; W), y_{ij} \big) \\
\end{split}
\end{equation} %]]></script>
<p>where <script type="math/tex">B</script> corresponds to the number of batches in an epoch, <script type="math/tex">b</script> to the number of training observations in a batch, <script type="math/tex">L_i</script> to the mean loss of batch <script type="math/tex">i</script>, and <script type="math/tex">(x_{ij}, y_{ij})</script> to the <script type="math/tex">j</script>-th input-output training pair of batch <script type="math/tex">i</script>.</p>
<div class="imgcap">
<img src="/assets/pr-lr/sgd.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center"><a href="https://distill.pub/2017/momentum/">Converging to an Optimum with SGD</a></div>
</div>
<p>Without loss of generality, we can simplify the notation by considering just one training observation, a special case where the batch size is equal to 1. In that case, training our neural network <script type="math/tex">M</script> amounts to updating the weight vector <script type="math/tex">W</script> by taking a small step in the direction of the gradient of the loss with respect to <script type="math/tex">W</script> between two consecutive iterations:</p>
<script type="math/tex; mode=display">W_{t+1} = W_t - \alpha \ \mu_i \ \nabla_{W_t} L_i</script>
<p>In the above equation, <script type="math/tex">i</script> is a discrete random variable sampled from <script type="math/tex">\mathcal{D}</script> according to a probability distribution <script type="math/tex">\mathcal{P}</script> with probabilities <script type="math/tex">p_i</script> and sampling weights <script type="math/tex">\mu_i</script>. With vanilla SGD and uniform sampling, we have that <script type="math/tex">\forall i \in \mathcal{D}</script>,</p>
<script type="math/tex; mode=display">\begin{equation*}
\mu_i = 1 \\
p_i = \frac{1}{|\mathcal{D}|}
\end{equation*}</script>
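<p>To make the update rule concrete, here is a minimal sketch of SGD with an explicit sampling distribution <script type="math/tex">\mathcal{P}</script> and sampling weights <script type="math/tex">\mu_i</script> on a toy least-squares problem (the problem and names are illustrative); with uniform probabilities and unit weights it reduces to vanilla SGD:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noise-free least-squares problem: L_i = 0.5 * (w @ x_i - y_i)^2.
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

n = len(X)
p = np.full(n, 1.0 / n)   # uniform sampling: p_i = 1 / |D|
mu = np.ones(n)           # sampling weights: mu_i = 1
alpha = 0.05              # learning rate

w = np.zeros(3)
for t in range(5000):
    i = rng.choice(n, p=p)              # draw index i ~ P
    grad_i = (w @ X[i] - y[i]) * X[i]   # gradient of L_i w.r.t. w
    w -= alpha * mu[i] * grad_i         # W_{t+1} = W_t - alpha * mu_i * grad

# on this noise-free problem, w converges to w_true
```

<p>Changing <code class="language-plaintext highlighter-rouge">p</code> away from uniform (and compensating with <code class="language-plaintext highlighter-rouge">mu</code>) is exactly the knob the rest of this post is about.</p>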
<p><a name="toc4"></a>
<strong>Importance Sampling.</strong> Importance sampling is a neat little trick for reducing the variance of an integral estimation by selecting a better distribution from which to sample a random variable. The trick is to multiply the integrand by a cleverly disguised <script type="math/tex">1</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{equation}
\begin{split}
E_{x \sim p(x)} \big[\ f(x) \big] & = \int f(x)\ p(x)\ dx \\
& = \int f(x)\ p(x)\ \frac{q(x)}{q(x)}\ dx \\
& = \int \frac{p(x)}{q(x)}\cdot f(x)\ q(x)\ dx \\
& = E_{x \sim q(x)} \big[\ f(x)\cdot \frac{p(x)}{q(x)} \big] \\
\end{split}
\end{equation} %]]></script>
<p>Since many quantities of interest (probabilities, sums, integrals)<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> can be obtained by computing the mean of a function of a random variable <script type="math/tex">E[f(X)]</script>, we can greatly accelerate – and even improve – Monte-Carlo estimates by swapping out the original probability distribution for a density that minimizes the sampling of points that contribute very little to the estimate, i.e. points where the integrand is close to zero.</p>
<div class="imgcap">
<img src="/assets/pr-lr/mc-imp.jpg" width="80%" style="border:none;" />
<div class="thecap" style="text-align:center">Smaller Point Spread with Importance Sampling</div>
</div>
<p>For a tutorial on Monte-Carlo estimation and Importance Sampling, click <a href="https://github.com/kevinzakka/blog-code/blob/master/pr-lr/Monte%20Carlo%20and%20Importance%20Sampling.ipynb">here</a>.</p>
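<p>As a quick numerical sanity check (the densities and numbers below are illustrative, not taken from the linked notebook), consider estimating the tail probability <script type="math/tex">P(X > 3)</script> for <script type="math/tex">X \sim \mathcal{N}(0, 1)</script>. A naive Monte-Carlo estimate wastes almost every sample, while sampling from a proposal centered on the tail and reweighting by <script type="math/tex">p(x)/q(x)</script> recovers the answer with a far smaller variance:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

f = lambda x: (x > 3.0).astype(float)  # indicator of the tail event

# Naive Monte-Carlo: x ~ p = N(0, 1). Almost all samples contribute 0.
x_p = rng.normal(0.0, 1.0, N)
naive = f(x_p).mean()

# Importance sampling: x ~ q = N(4, 1), a proposal that covers the tail,
# with each sample reweighted by the likelihood ratio p(x) / q(x).
x_q = rng.normal(4.0, 1.0, N)
log_w = -0.5 * x_q**2 + 0.5 * (x_q - 4.0)**2   # log p(x) - log q(x)
is_est = (f(x_q) * np.exp(log_w)).mean()

# Both estimate P(X > 3), which is roughly 1.35e-3, but the importance
# sampling estimate is far more accurate at this sample size.
```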
<p><a name="toc5"></a></p>
<h2 id="quantifying-sample-importance">Quantifying Sample Importance</h2>
<p>In the previous section, we mentioned that uniform sampling assigns equal importance to all the training points in <script type="math/tex">\mathcal{D}</script>. This is obviously wasteful: while some samples are “easy” for the model and can be discarded in the initial stages with minimal impact on performance, the more “difficult” samples should be sampled more frequently throughout training since they contribute to faster learning. So can we find a way to quantify this “importance”?</p>
<p>Fortunately, the answer is yes: we can theoretically<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> show that this quantity is none other than the norm of the gradient of a sample. Intuitively, this makes sense: in the classification setting for example, we would expect misclassified examples to exhibit larger gradients than their correctly classified counterparts. Unfortunately, the norm of the gradient is pretty expensive to compute, especially in settings where we would like to avoid computing a full forward and backward pass.</p>
<p>What about the loss of a sample? We essentially get it for free in the forward pass of backprop, so if we can show some degree of correlation with the gradient norm, it would be a less accurate but way cheaper metric for importance. Let’s try and verify this with a small PyTorch experiment. We’re going to train a small convnet on MNIST and record both the loss and gradient of every image in an epoch. We’ll then sort the list containing the gradient norms and use it to index the list of losses. A scatter plot of the reindexed losses should reveal a few things:</p>
<ul>
<li>If there <em>is</em> indeed a correlation, there should be a (potentially noisy) straight line through the scatter plot.</li>
<li>If the correlation is positive – implying that a higher gradient norm corresponds to a higher loss value and vice versa – this line should be increasing.</li>
</ul>
<p><strong>EDIT (08/06/2019)</strong>: @AruniRC kindly mentioned that I can compute the <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Pearson correlation coefficient</a> to numerically quantify the degree of correlation between the gradient norm and the loss value. I’ve now added a cell in the notebook to compute it.</p>
<p>Here’s a code snippet for computing the L2 norm of the gradient of a batch of losses with respect to the parameters of the network. Since there’s a pair of weights and biases associated with every convolutional and fully-connected layer and we want to return a scalar, we can calculate and return the square root of the sum of the squared gradient norms.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">gradient_norm</span><span class="p">(</span><span class="n">losses</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
<span class="n">norms</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">losses</span><span class="p">:</span>
<span class="n">grad_params</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">autograd</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">retain_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">grad_norm</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">grad</span> <span class="ow">in</span> <span class="n">grad_params</span><span class="p">:</span>
<span class="n">grad_norm</span> <span class="o">+=</span> <span class="n">grad</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">norms</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">grad_norm</span><span class="o">.</span><span class="n">sqrt</span><span class="p">())</span>
<span class="k">return</span> <span class="n">norms</span>
</code></pre></div></div>
<p>Incorporating the above function in the training loop is pretty trivial. All we need to do is record a <code class="language-plaintext highlighter-rouge">(grad_norm, loss)</code> tuple for every image in the dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># train for 1 epoch
</span><span class="n">epoch_stats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">batch_idx</span><span class="p">,</span> <span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_loader</span><span class="p">):</span>
<span class="n">data</span><span class="p">,</span> <span class="n">target</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">),</span> <span class="n">target</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">losses</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">nll_loss</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">reduction</span><span class="o">=</span><span class="s">'none'</span><span class="p">)</span>
<span class="n">grad_norms</span> <span class="o">=</span> <span class="n">gradient_norm</span><span class="p">(</span><span class="n">losses</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span>
<span class="n">indices</span> <span class="o">=</span> <span class="p">[</span><span class="n">batch_idx</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))]</span>
<span class="n">batch_stats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">indices</span><span class="p">,</span> <span class="n">grad_norms</span><span class="p">,</span> <span class="n">losses</span><span class="p">):</span>
<span class="n">batch_stats</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">i</span><span class="p">,</span> <span class="p">[</span><span class="n">g</span><span class="p">,</span> <span class="n">l</span><span class="o">.</span><span class="n">detach</span><span class="p">()]])</span>
<span class="n">epoch_stats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_stats</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">losses</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
<p>We can compute the correlation between <code class="language-plaintext highlighter-rouge">grad_norms</code> and <code class="language-plaintext highlighter-rouge">losses</code> using the following one-liner:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">corr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">corrcoef</span><span class="p">(</span><span class="n">grad_norms</span><span class="p">,</span> <span class="n">losses</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Pearson Correlation Coeff: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">corr</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]))</span> <span class="c1"># prints ~0.83
</span></code></pre></div></div>
<p>This returns a value of 0.83, which indicates a strong positive correlation between the two variables. Next, we verify this intuition graphically by indexing our losses using the sorted gradient norms and generating the aforementioned scatter plot.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reindex the losses using the sorted gradient norms
</span><span class="n">flat</span> <span class="o">=</span> <span class="p">[</span><span class="n">val</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">epoch_stats</span> <span class="k">for</span> <span class="n">val</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span>
<span class="n">sorted_idx</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">flat</span><span class="p">)),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">flat</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">sorted_losses</span> <span class="o">=</span> <span class="p">[</span><span class="n">flat</span><span class="p">[</span><span class="n">idx</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">item</span><span class="p">()</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">sorted_idx</span><span class="p">]</span>
</code></pre></div></div>
<div class="imgcap">
<img src="/assets/pr-lr/loss_vs_grad.jpg" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Sorted Losses According to Gradient Norm</div>
</div>
<p>The above plot suggests that we <em>can</em> indeed use the loss value of a sample as a proxy for its importance. This is exciting news and opens up some interesting avenues for improving SGD.</p>
<p>If you want to reproduce the above logic, click <a href="https://github.com/kevinzakka/blog-code/blob/master/pr-lr/Loss%20vs%20Gradient%20Norm.ipynb">here</a>.</p>
<p><a name="toc6"></a></p>
<h2 id="loss-patterns">Loss Patterns</h2>
<p>In this section, we’ll try to answer the following question:</p>
<blockquote>
<p>Is a sample’s importance consistent across epochs? In other words, if a sample exhibits low loss in the early stages of training, is this still the case in later epochs?</p>
</blockquote>
<p>There is substantial benefit in providing empirical evidence for this hypothesis. The reasons are two-fold: <strong>first</strong>, by eliminating consistently low-loss images from the dataset, we reduce training time in proportion to the number of discarded images; <strong>second</strong>, by oversampling the high-loss images, we reduce the variance of the gradients and speed up the convergence to <script type="math/tex">\hat{W}</script>.</p>
<p>To explore this idea, we’re going to track every sample’s loss over a set number of epochs. We’ll bin the loss values into 10 quantiles and compare the histograms over the different epochs. Finally, we’ll repeat these steps with shuffling turned off, then turned on.</p>
<p><strong>NB:</strong> We need to be a bit careful with keeping track of a sample’s index when shuffling is turned on. The solution is to create a permutation of <code class="language-plaintext highlighter-rouge">[0, 1, 2, ..., 59,999]</code> at the beginning of every epoch and feed it to a sequential sampler <strong>with shuffling turned off</strong>. By remapping the indices to their true ordering relative to the permutations at the end of training, we would have effectively simulated random shuffling.</p>
<p>If this sounds complicated, let me show you how simple it is to achieve in PyTorch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># PermSampler takes a list of `indices` and iterates over it sequentially
</span><span class="k">class</span> <span class="nc">PermSampler</span><span class="p">(</span><span class="n">Sampler</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">indices</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">indices</span> <span class="o">=</span> <span class="n">indices</span>
<span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">iter</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">indices</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">indices</span><span class="p">)</span>
<span class="c1"># if `permutation` is None, we return a data loader with no shuffling
# if `permutation` is a list of indices, we return a data loader that iterates
# over the MNIST dataset with indices specified by `permutation`.
</span><span class="k">def</span> <span class="nf">get_data_loader</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">permutation</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="n">normalize</span> <span class="o">=</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Normalize</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="p">(</span><span class="mf">0.1307</span><span class="p">,),</span> <span class="n">std</span><span class="o">=</span><span class="p">(</span><span class="mf">0.3081</span><span class="p">,))</span>
<span class="n">transform</span> <span class="o">=</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Compose</span><span class="p">([</span><span class="n">transforms</span><span class="o">.</span><span class="n">ToTensor</span><span class="p">(),</span> <span class="n">normalize</span><span class="p">])</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">MNIST</span><span class="p">(</span><span class="n">root</span><span class="o">=</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">download</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">transform</span><span class="o">=</span><span class="n">transform</span><span class="p">)</span>
<span class="n">sampler</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">if</span> <span class="n">permutation</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">sampler</span> <span class="o">=</span> <span class="n">PermSampler</span><span class="p">(</span><span class="n">permutation</span><span class="p">)</span>
<span class="n">loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sampler</span><span class="p">)</span>
<span class="k">return</span> <span class="n">loader</span>
</code></pre></div></div>
<p>After training for 5 epochs, we collect a list containing a tuple <code class="language-plaintext highlighter-rouge">(idx, loss_idx)</code> for every image in the dataset. We can remap the indices with the following code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># remap the indices based on the permutations list
</span><span class="k">for</span> <span class="n">stat</span><span class="p">,</span> <span class="n">perm</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">stats_with_shuffling_flat</span><span class="p">,</span> <span class="n">permutations</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">stat</span><span class="p">)):</span>
<span class="n">stat</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">perm</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
</code></pre></div></div>
<p>Finally, we bin the sorted losses of every epoch into 10 bins and compute the percent match of bins across all epochs, the last 4 epochs, and the last 2 epochs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">percentage_split</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">percentages</span><span class="p">):</span>
<span class="n">cdf</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">percentages</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">np</span><span class="o">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">cdf</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="n">stops</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">cdf</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">seq</span><span class="p">)))</span>
<span class="k">return</span> <span class="p">[</span><span class="n">seq</span><span class="p">[</span><span class="n">a</span><span class="p">:</span><span class="n">b</span><span class="p">]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">([</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">stops</span><span class="p">,</span> <span class="n">stops</span><span class="p">)]</span>
<span class="k">def</span> <span class="nf">bin_losses</span><span class="p">(</span><span class="n">all_epochs</span><span class="p">,</span> <span class="n">num_quantiles</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="n">percentile_splits</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ep</span> <span class="ow">in</span> <span class="n">all_epochs</span><span class="p">:</span>
<span class="n">sorted_loss_idx</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ep</span><span class="p">)),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">ep</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">splits</span> <span class="o">=</span> <span class="n">percentage_split</span><span class="p">(</span><span class="n">sorted_loss_idx</span><span class="p">,</span> <span class="p">[</span><span class="n">num_quantiles</span><span class="o">/</span><span class="mi">100</span><span class="p">]</span><span class="o">*</span><span class="n">num_quantiles</span><span class="p">)</span>
<span class="n">percentile_splits</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">splits</span><span class="p">)</span>
<span class="k">return</span> <span class="n">percentile_splits</span>
<span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">reduce</span>
<span class="n">num_quantiles</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">percentile_splits</span> <span class="o">=</span> <span class="n">bin_losses</span><span class="p">(</span><span class="n">all_epochs</span><span class="p">,</span> <span class="n">num_quantiles</span><span class="p">)</span>
<span class="n">fr</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">all_matches</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">fr</span><span class="p">:</span>
<span class="n">percent_matches</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_quantiles</span><span class="p">):</span>
<span class="n">percentile_all</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">percentile_splits</span><span class="p">)):</span>
<span class="n">percentile_all</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">percentile_splits</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">i</span><span class="p">])</span>
<span class="n">matching</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">intersect1d</span><span class="p">,</span> <span class="n">percentile_all</span><span class="p">)</span>
<span class="n">percent</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">matching</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">percentile_all</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">percent_matches</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">percent</span><span class="p">)</span>
<span class="n">all_matches</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">percent_matches</span><span class="p">)</span>
</code></pre></div></div>
<p>It’s interesting to compute percent matches across a varying range of epochs. The reason is that the training dynamics are less stable in the early epochs, when the model weights are still random (analogous to the transient response and steady state in circuit theory). For example, we would expect higher percent matches if we eliminate the first epoch from the analysis – and this is verified in the plots below!</p>
<div class="img">
<img src="/assets/pr-lr/no_shuffling.jpg" width="100%" style="border:none;" />
<img src="/assets/pr-lr/with_shuffling.jpg" width="100%" style="border:none;" />
</div>
<p>The histograms confirm our hypothesis:</p>
<ul>
<li>~ 30% of the samples with a loss value in the top 10% consistently rank in that decile across all epochs. This number increases to ~ 60% across epochs 1 through 4 and ~ 85% across the last two epochs.</li>
<li>~ 30% of the samples with a loss value in the bottom 10% consistently rank in that decile across all epochs. This number increases to ~ 50% across epochs 1 through 4 and ~ 70% across the last two epochs.</li>
<li>Shuffling has a minimal impact on the loss evolution of the samples across epochs.</li>
</ul>
<p>If you want to reproduce the histograms, click <a href="https://github.com/kevinzakka/blog-code/blob/master/pr-lr/Loss%20Patterns.ipynb">here</a>.</p>
<p><a name="toc7"></a></p>
<h2 id="sgd-on-steroids">SGD on Steroids</h2>
<p><a name="toc8"></a>
<strong>Mini-Batch Resampling.</strong> In the first version of SGD-S, we’re going to split our training epochs into 2 stages:</p>
<ul>
<li><strong>Transient Epochs</strong>: in the transient epochs, we train our model exactly as we would in regular SGD. However, in the last transient epoch, we record and return the loss of every image in the dataset.</li>
<li><strong>Steady-State Epochs</strong>:
<ul>
<li>For every epoch in the steady-state, we sample batches using the loss as the sampling distribution.</li>
<li>At the end of every epoch in the steady-state, we eliminate 10% of the images with the lowest losses. Furthermore, we can choose to randomly reintroduce a fraction of the discarded images to combat potential catastrophic forgetting.</li>
</ul>
</li>
</ul>
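<p>As a minimal, self-contained sketch (toy tensors and made-up losses, purely illustrative), here is how <code class="language-plaintext highlighter-rouge">WeightedRandomSampler</code> turns per-sample losses into a sampling distribution for the data loader:</p>

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Toy dataset of 8 samples with made-up per-sample losses: the last four
# samples are 20x "harder" than the first four.
data = torch.arange(8).float().unsqueeze(1)
targets = torch.zeros(8, dtype=torch.long)
dataset = TensorDataset(data, targets)
losses = torch.tensor([0.1, 0.1, 0.1, 0.1, 2.0, 2.0, 2.0, 2.0])

# WeightedRandomSampler normalizes the weights internally, so the raw
# losses can serve directly as unnormalized sampling probabilities.
sampler = WeightedRandomSampler(weights=losses, num_samples=len(dataset),
                                replacement=True)
loader = DataLoader(dataset, batch_size=4, sampler=sampler)

for batch, _ in loader:
    pass  # high-loss samples (indices 4-7) dominate the drawn batches
```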
<p>Let’s illustrate how we can use the loss function to construct an importance sampling distribution for mini-batch resampling. This is achievable using PyTorch’s <code class="language-plaintext highlighter-rouge">WeightedRandomSampler</code> in conjunction with the <code class="language-plaintext highlighter-rouge">DataLoader</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># `losses` is a list of (image_index, loss) pairs recorded in the last
# transient epoch; `removed` holds the indices of previously discarded images.
# sort the indices by loss in decreasing order
sorted_loss_idx = sorted(range(len(losses)), key=lambda k: losses[k][1], reverse=True)
# house cleaning: split off the `perc_to_remove`% of images with the lowest losses
num_low = int((perc_to_remove / 100) * len(sorted_loss_idx))
to_remove = sorted_loss_idx[-num_low:]
to_keep = sorted_loss_idx[:-num_low]
# randomly reintroduce 1% of the previously discarded images
to_add = list(np.random.choice(removed, int(0.01 * len(sorted_loss_idx)), replace=False))
new_idx = to_keep + to_add
new_idx.sort()
# use each image's loss as its (unnormalized) sampling weight
weights = [losses[idx][1] for idx in new_idx]
sampler = WeightedRandomSampler(weights, len(weights), replacement=True)
</code></pre></div></div>
<p><a name="toc9"></a>
<strong>Auxiliary Model.</strong></p>
<p><a name="toc10"></a></p>
<h2 id="things-i-wish-i-tried">Things I Wish I Tried</h2>
<p><a name="toc11"></a></p>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>CIFAR results pending. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Explain how. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Add proof or point to it. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 28 Sep 2018 00:00:00 +0000
http://kevinzakka.github.io/2018/09/28/prioritized-learning/
http://kevinzakka.github.io/2018/09/28/prioritized-learning/deep learningsgdimportance samplingpytorch2018Getting Up and Running with PyTorch on Amazon Cloud<p align="center">
<img src="/assets/aws/splash.png" alt="Drawing" width="60%" />
</p>
<p>This is a succinct tutorial aimed at helping you set up an AWS GPU instance so that you can train and test your PyTorch models in the cloud. If, like me, you don’t own a GPU, this can be a great way of drastically reducing the training time of your models: while your instance is furiously crunching numbers in some faraway Amazon server, you can peacefully experiment with and prototype new architectures from the comfort of a Starbucks couch.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/aws/cpu-meter.png" width="30%" style="border:none;" />
<div class="thecap" style="text-align:center">I mean we all love a silent macbook, right?</div>
</p>
</div>
<p>The cool part is that if you’re a high school or college student, you can sign up for the GitHub Student Developer Pack, which will get you $150 worth of free AWS credits. That’s around 167 hours or 7 days of compute time<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, an amply sufficient amount for those fun weekend side projects and experiments. As usual, any code or script that appears on this page can be downloaded from my <a href="https://github.com/kevinzakka/blog-code/tree/master/aws-pytorch">Blog Repository</a>. And on that note, let’s get started!</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Configuring Your EC2 Instance</a></li>
<li><a href="#toc2">Launching & Managing Your EC2 Instance</a></li>
<li><a href="#toc3">SSH Persistence With TMUX</a></li>
<li><a href="#toc4">Conclusion</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="configuring-your-ec2-instance">Configuring Your EC2 Instance</h2>
<p>I’m assuming you’ve already created an AWS account but if you haven’t, the whole process shouldn’t take you more than 2 minutes. Note that it will require you to enter your credit card information which is necessary to charge you <em>if and when</em> you exceed your free credits. Now’s also a great time to claim your <a href="https://education.github.com/pack">GitHub Student Developer Pack</a> credits so go ahead and do that.</p>
<p><strong>Pick your Region.</strong> Ok, so the instance type we are going to use is located in <strong>US West (Oregon)</strong> so make sure the region information on the top right of the screen correctly reflects that.</p>
<p align="center">
<img src="/assets/aws/step1.png" alt="Drawing" width="80%" />
</p>
<p><strong>Limit Increase.</strong> The next thing we need to do is request a limit increase for EC2 instances. For some weird reason, Amazon automatically sets the limit to 0 upon account creation so it has to be increased by sending in a support ticket.</p>
<p>Go ahead and click <code class="language-plaintext highlighter-rouge">Support > Support Center</code> at the top right of your screen. This will direct you to a page with a blue <code class="language-plaintext highlighter-rouge">Create Case</code> button that you should click. You’ll be greeted with the following:</p>
<p align="center">
<img src="/assets/aws/step2.png" alt="Drawing" width="80%" />
</p>
<p>We want a Limit Increase for EC2 instances meaning you need to select <code class="language-plaintext highlighter-rouge">Service Limit Increase</code> in <strong>Regarding</strong> and <code class="language-plaintext highlighter-rouge">EC2 Instances</code> in <strong>Limit Type</strong>. Now fill in the <strong>Request 1</strong> box and <strong>Use Case Description</strong> as I’ve done here.</p>
<p align="center">
<img src="/assets/aws/step3.png" alt="Drawing" width="80%" />
</p>
<p>Finally, make sure to select <code class="language-plaintext highlighter-rouge">Web</code> as your <strong>Contact method</strong> and submit the request. Note that response times vary: I’ve had limit increases resolved in a matter of minutes, and others take up to a full day, so be patient. Also, feel free to change the <strong>New limit value</strong> to suit your needs. I’ve opted for 2 because the <code class="language-plaintext highlighter-rouge">p2.xlarge</code> instance type we’ll be working with has a single GPU whose memory constraints may limit the number of jobs I can run concurrently.</p>
<p><strong>Configure Instance.</strong> Ok, we’re now ready to create and configure our EC2 instance. Back on the home page console (click on the orange cube in the top left), navigate to <code class="language-plaintext highlighter-rouge">EC2</code> in the Compute services section, and then click on the blue <code class="language-plaintext highlighter-rouge">Launch Instance</code> button.</p>
<p align="center">
<img src="/assets/aws/step4.png" alt="Drawing" width="80%" />
</p>
<p>You’ll be greeted with a 7-step process like so.</p>
<p align="center">
<img src="/assets/aws/step5.png" alt="Drawing" width="80%" />
</p>
<p><strong>AMI.</strong> First select the <code class="language-plaintext highlighter-rouge">Ubuntu Server 16.04 LTS (HVM), SSD Volume Type</code> as the AMI of choice.</p>
<p><strong>Instance.</strong> Select <code class="language-plaintext highlighter-rouge">p2.xlarge</code> as your instance type. This is an instance with a single GPU which is what we asked for in our limit increase request.</p>
<p><strong>Spot Instances.</strong> At this point, you should be on the <strong>Configure Instance Details</strong> step. This is where things get interesting. In fact, Amazon gives us the ability to bid on spare Amazon EC2 computing capacity for a much cheaper price than the on-demand one.</p>
<p>Basically, what that means is that if our bid price is higher than the current market price, our instance will be launched and charged at that price. The only downside is that if that ever flips around, instances get <span style="color:red">terminated</span> instantly and with no warning<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p><span style="color:blue">TL;DR:</span> Spot instances can be ideal for non-critical experimentation like hyperparameter tuning but stay away from them if you need to train a model for a large number of epochs.</p>
<p>I’ll assume the user uses On-Demand pricing for the remainder of this post but if you do want to find out more about Spot Instances, feel free to watch this Youtube <a href="https://www.youtube.com/watch?v=_XT6McviY7w">video</a>.</p>
<p><strong>Add Storage.</strong> Next, we’ll be increasing the size of our Root Volume to accommodate large datasets such as ImageNet, which is around 48 GB. Feel free to enter any number above that.</p>
<p align="center">
<img src="/assets/aws/step6.png" alt="Drawing" width="80%" />
</p>
<p>Note that the Root Volume is EBS-backed meaning it persists on instance termination. The default behavior however is to delete it on termination. Weird right? Well, not really. With ephemeral storage, the other type of storage AWS offers, there is no persist option, whether it be on instance stop or terminate. Thus EBS with delete-on-terminate gives us the ability to keep our data on disk when the instance is stopped!</p>
<p><strong>Configure Security Group.</strong> You can skip the <strong>Add Tags</strong> section and jump to this last step. This part is important because it will allow us to monitor our training with Tensorboard and use Jupyter Notebook. We’ll be adding 4 protocols as shown in the picture below.</p>
<p align="center">
<img src="/assets/aws/step7.png" alt="Drawing" width="80%" />
</p>
<p>Once you click the launch button, a window will pop up and prompt you to create a key-pair. This little file is needed when ssh-ing into your instance, so download it and store it in a secure location you’ll remember. For this tutorial’s sake, I’ll be calling mine <code class="language-plaintext highlighter-rouge">aws-dl.pem</code> and storing it in my Downloads folder.</p>
<p align="center">
<img src="/assets/aws/step8.png" alt="Drawing" width="80%" />
</p>
<p><a name="toc2"></a></p>
<h2 id="launching--managing-your-ec2-instance">Launching & Managing Your EC2 Instance</h2>
<p>We’ve finally arrived at the point where we can ssh into our EC2 instance. To do so, you’ll need to navigate to the <code class="language-plaintext highlighter-rouge">Instances</code> page located in the navigation panel on the left of your screen. You’ll be greeted with the following:</p>
<p align="center">
<img src="/assets/aws/step9.png" alt="Drawing" width="80%" />
</p>
<p>You need to take note of 2 things:</p>
<ul>
<li><strong>Public DNS (IPv4)</strong>: <code class="language-plaintext highlighter-rouge">ec2-52-42-90-161.us-west-2.compute.amazonaws.com</code></li>
<li><strong>IPv4 Public IP</strong>: <code class="language-plaintext highlighter-rouge">52.42.90.161</code></li>
</ul>
<p>Other than that, there are just 2 ways to interact with your instance that you need to be aware of: <strong>logging in</strong> with ssh and <strong>copying</strong> a file to it with scp.</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">ssh -v -i X ubuntu@Y</code> where X represents the path to the key-pair file and Y represents the Public IP of your instance.</li>
<li><code class="language-plaintext highlighter-rouge">scp -i W -r X ubuntu@Y:Z</code> where W is the path to the key-pair file, X is the path to the local file, Y is the Public IP, and Z is the destination path on the instance.</li>
</ul>
<p>It’s important to note that if you’re using the key-pair file for the very first time, you’ll need to change its permission to read and write by running <code class="language-plaintext highlighter-rouge">chmod 600 ~/Downloads/aws-dl.pem</code>.</p>
<p>With all that being said, we can finally fire up a terminal and execute the following command:</p>
<p><code class="language-plaintext highlighter-rouge">
ssh -v -i ~/Downloads/aws-dl.pem ubuntu@52.42.90.161
</code></p>
<p align="center">
<img src="/assets/aws/term1.png" alt="Drawing" width="80%" />
</p>
<p>Enter yes, and voila! You should be successfully logged in. The instance is still not ready for use as there are a few more things that need to be done, but fear not. I’ve created a small bash script that you can execute which automates the following:</p>
<ul>
<li>It downloads and installs the required nvidia gpu drivers.</li>
<li>It updates and upgrades the distribution packages.</li>
<li>It installs python3 along with virtualenv.</li>
<li>It creates a virtualenv called <code class="language-plaintext highlighter-rouge">deepL</code> that will house all the required pip packages and PyTorch.</li>
<li>And it finally installs PyTorch v0.2.</li>
</ul>
<p>Go ahead and download <code class="language-plaintext highlighter-rouge">install.sh</code> from my <a href="https://github.com/kevinzakka/blog-code/tree/master/aws-pytorch">repo</a> and save it to your Desktop. We need to copy it to our instance, so apply the command I mentioned above:</p>
<p><code class="language-plaintext highlighter-rouge">
scp -i ~/Downloads/aws-dl.pem -r ~/Desktop/install.sh ubuntu@52.42.90.161:~/.
</code></p>
<p>Next, go back to the terminal window logged into the instance and execute the following 2 commands:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x install.sh
./install.sh
</code></pre></div></div>
<p>Once that’s done, you’ll need to reboot your instance. Enter <code class="language-plaintext highlighter-rouge">exit</code> at the command line and navigate to your browser as in the image below. Be patient and wait for a few minutes before you ssh back into the instance!</p>
<p align="center">
<img src="/assets/aws/step10.png" alt="Drawing" width="80%" />
</p>
<p>At this point, we should sanity check our installation by seeing if PyTorch loads correctly.</p>
<ul>
<li>First, activate the virtualenv by executing <code class="language-plaintext highlighter-rouge">source ~/envs/deepL/bin/activate</code>.</li>
<li>Enter <code class="language-plaintext highlighter-rouge">python</code> and inside the interpreter, <code class="language-plaintext highlighter-rouge">import torch</code> then <code class="language-plaintext highlighter-rouge">torch.__version__</code>. Fingers crossed, this should print out <code class="language-plaintext highlighter-rouge">0.2.0_1</code>.</li>
<li>Lastly, check that the GPU is visible by typing <code class="language-plaintext highlighter-rouge">torch.cuda.is_available()</code> which should print out True.</li>
</ul>
<p><span style="color:red">Once you’ve finished working on your instance, you should stop it immediately to avoid incurring additional charges.</span></p>
<p><a name="toc3"></a></p>
<h2 id="ssh-persistence-with-tmux">SSH Persistence With TMUX</h2>
<p>I would be doing you a great disservice if I didn’t mention this nifty little package called <code class="language-plaintext highlighter-rouge">tmux</code> that you can use when running your instances for long periods of time. <em>What exactly is tmux, and why should you use it</em>?</p>
<p>Well, if you’re ssh’ed into an instance, peacefully running a job, and your connection suddenly drops, the ssh session automatically gets killed. This means anything running inside it stops as well (i.e. your model will stop training). Closing your laptop to commute home from university, for example, becomes a big no-no.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/aws/term3.png" width="80%" style="border:none;" />
<div class="thecap" style="text-align:center">A TMUX session</div>
</p>
</div>
<p>This is where tmux comes in! Tmux makes it so that anything running within a session persists even if the connection drops or the terminal gets killed. To see it in action, I’d suggest you watch the following <a href="https://www.youtube.com/watch?v=BHhA_ZKjyxo">video</a>.</p>
<p>Thus, your workflow should always be as follows:</p>
<ul>
<li>SSH into your aws instance.</li>
<li>Create a new tmux session called work using the command <code class="language-plaintext highlighter-rouge">tmux new -s work</code>.</li>
<li>Do everything as you would previously.</li>
<li>Detach from the session by pressing <code class="language-plaintext highlighter-rouge">ctrl-b</code> followed by <code class="language-plaintext highlighter-rouge">d</code>.</li>
</ul>
<p>Once you’ve detached yourself from the session, you can work on anything else, even go to sleep… Subsequently, if you need to reattach to that particular tmux session to check your progress, run <code class="language-plaintext highlighter-rouge">tmux a -t work</code>.</p>
<p>That’s pretty much it. For a more complete list of tmux commands, you should refer to this lovely <a href="https://gist.github.com/MohamedAlaa/2961058">cheatsheet</a>.</p>
<p><a name="toc4"></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this tutorial, we went over the basic steps needed to create a free, GPU-powered Amazon AWS instance. We explored how to interact with our instance using the <code class="language-plaintext highlighter-rouge">ssh</code> and <code class="language-plaintext highlighter-rouge">scp</code> commands and how a bash script could be leveraged to download and install all the required packages needed to run PyTorch. Finally, we saw how we could make our ssh session persistent using a very important program called tmux.</p>
<p>Until next time!</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is for a GPU-powered p2.xlarge instance with an on-demand price of around $0.9/hr. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>A terminated instance gets deleted, meaning you lose whatever’s on there permanently. On the other hand, a stopped instance just goes offline so you don’t get charged for it and you can fire it back up again at a later time. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 13 Aug 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/08/13/aws-pytorch/
http://kevinzakka.github.io/2017/08/13/aws-pytorch/deep learningawsamazonpytorch2017Understanding Recurrent Neural Networks - Part I<p>Recurrent Neural Networks have been my Achilles’ heel for the past few months. Admittedly, I haven’t had the grit to sit down and work out their details, but I’ve figured it’s time I stop treating them like black boxes and try instead to discover what makes them tick. My intentions with this series are hence twofold: first, to combat my weakness by understanding their inner workings and coding one from scratch; and second, to write down what I learn in order to reinforce the insights I may gain along the way.</p>
<p>In this first installment, we’ll be introducing the intuition behind RNNs, motivating their use by highlighting a glaring limitation of traditional neural networks. We’ll then transition into a more technical description of their architecture which will be useful for the next installment where we’ll code one from scratch in numpy.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Human Learning</a></li>
<li><a href="#toc2">The Woes of Traditional Neural Nets</a></li>
<li><a href="#toc3">Enhancing Neural Networks with Memory</a></li>
<li><a href="#toc4">The Nitty Gritty Details</a></li>
<li><a href="#toc5">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="human-learning">Human Learning</h3>
<blockquote>
<p>We are the sum total of our experiences. None of us are the same as we were yesterday, nor will be tomorrow.</p>
<cite>B.J. Neblett</cite>
</blockquote>
<p>There is an inherent truth to the quote above. Our brain pools from past experiences and combines them in intricate ways to solve new and unseen tasks. It is hardwired to work with sequences of information that we perpetually store and call upon over the course of our lives. At its core, <em>human learning</em> can be distilled into two fundamental processes:</p>
<ul>
<li><strong>memorization</strong>: every time we gain new information, we store it for future reference.</li>
<li><strong>combination</strong>: not all tasks are the same, so we couple our analytical skills with a combination of our memorized, previous experiences to reason about the world.</li>
</ul>
<p>Consider the following pictures.</p>
<p align="center">
<img src="/assets/rnn/weird_cat.jpg" alt="Drawing" width="200px" /><img src="/assets/rnn/weird_cat2.jpg" alt="Drawing" width="200px" />
</p>
<p>Even though it’s in a very weird position, a child can instantly tell that the fur ball in front of it is a cat. The child will recognize the ears, the whiskers and the snout (memory), but the shape of it all may throw it off. Subconsciously however, the child may recall how stretching deforms a person’s shape and pose (combination), and infer that the same is happening to the cat.</p>
<p>Not all tasks require the distant past however. At times, solving a problem makes use of information that was processed only moments ago. For example, take a look at this incomplete sentence:</p>
<blockquote>
<p>I bought my usual caramel-covered popcorn with iced tea and headed to the ___.</p>
</blockquote>
<p>If I asked you to fill in the missing word, you’d probably guess “movies”. How did you know that <code class="language-plaintext highlighter-rouge">library</code> or <code class="language-plaintext highlighter-rouge">starbucks</code> were unlikely words? Well, it’s probably because you used context, i.e. information from earlier in the sentence, to infer the correct answer. Now think about the following. If I asked you to recite the lyrics of your favorite song backwards, would you be able to do it? Probably not… What about counting backwards? Yeah, piece of cake!</p>
<p align="center">
<img src="/assets/rnn/yarn.jpg" alt="Drawing" width="200px" />
</p>
<p>So what makes reciting the song backwards so excruciatingly difficult? The difference is that counting backwards is done <strong>on the fly</strong>. There is a logical relationship between consecutive numbers, and knowing the order of the 10 digits and how subtraction works means you can count backwards from, say, 1845098 even if you’ve never done it before. On the other hand, you memorized the lyrics of the song in a specific order. Your brain works by <strong>indexing</strong> from one word to the next, starting from the first word. It’s hard to index backwards for the simple reason that your brain has never done it before, so that specific sequence was never stored. Think of the memorized lyric sequence as a giant ball of yarn whose unraveled end can only be accessed with the correct first word in the forward sequence.</p>
<p>The main takeaway is that our brains are naturally talented at working with sequences and they do so by relying on a deceptively simple, yet powerful concept called <strong>information persistence</strong>.</p>
<p><a name="toc2"></a></p>
<h3 id="the-woes-of-traditional-neural-nets">The Woes of Traditional Neural Nets</h3>
<p>We live in a world that is inherently sequential. Audio, video, and language (even your DNA!) are but a few examples of data in which information at a given time step is intricately dependent on information from previous timesteps. So how is all this related to deep learning? Well, think about feeding a sequence of frames from a video into a neural network and asking it to predict what comes next… Or, back to our previous example, feeding a set of words and asking it to complete the sentence.</p>
<p>It should be obvious to you that information from the past is crucial for outputting a sane and plausible prediction. But traditional neural networks can’t do this because they operate on the fundamental assumption that inputs are independent! This is a problem because it means our output at any given time is completely and <strong>solely</strong> determined by the input at that same time. There is no previous history and our network cannot capitalize on the complex temporal dependencies that exist between the different frames or words to refine its predictions.</p>
<p>This is where <em>Recurrent Neural Networks</em> come in! RNNs allow us to deal with sequences by incorporating a mechanism that stores and leverages information from previous history, sort of like a memory. Put differently, whereas a traditional net maps <strong>one</strong> input to an output, a recurrent net maps an <strong>entire history</strong> of previous inputs to each output. If that’s still obscure to you, just think of RNNs as a traditional neural net enhanced with a loop<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, one that allows for information to persist across timesteps.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/rnn/draw2.gif" width="400" style="border:none;" />
<div class="thecap" style="text-align:center">(<a href="https://www.youtube.com/watch?v=Zt-7MI9eKEo">Video Courtesy</a>) DRAW model improving its output by iterating over the canvas rather than producing the image in one shot.</div>
</p>
</div>
<p>It is important to note that recurrent neural nets aren’t bound to inherently sequential data: many problems can be tackled by decomposing them into a series of smaller subproblems. The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step. As an example, a recurrent net<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> was used to generate handwritten digits in a sequential fashion, mimicking the way artists refine and reassess their work with brushstrokes.</p>
<blockquote>
<p>The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step.</p>
</blockquote>
<p><a name="toc3"></a></p>
<h3 id="enhancing-neural-nets-with-memory">Enhancing Neural Nets with Memory</h3>
<p>So how exactly can we endow our networks with the ability to memorize? To answer this question, let’s recall our basic hidden layer neural network, which takes as input a vector <code class="language-plaintext highlighter-rouge">X</code>, dot products it with a weight matrix <code class="language-plaintext highlighter-rouge">W</code> and applies a nonlinearity. We’ll consider the output <code class="language-plaintext highlighter-rouge">y</code> when three successive inputs are fed through the network. Note that the bias term has been eliminated so as to simplify the notation, and I’ve taken the liberty of coloring the equations to make certain patterns stand out.</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f(W_x \color{green}{X_1})</script>
<script type="math/tex; mode=display">y_2 = f(W_x \color{red}{X_2})</script>
<p>Given the simple API above, it’s pretty clear that each output is solely determined by its input, i.e. there is no trace of past inputs in the calculation of its value. So let’s alter the API by allowing our hidden layer to use a combination of both the current input and the previous input, and visualize what happens.</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f(W_x \color{green}{X_1} + W_h\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_2 = f(W_x \color{red}{X_2} + W_h\color{green}{X_1})</script>
<p>Nice! By introducing recurrence into the formula, we’ve managed to obtain a mix of 2 colors in each hidden layer. Intuitively, our network now has a memory depth of 1, equivalent to “seeing” one step backwards in time. Remember though that our goal is to be able to capture information across <strong>all</strong> previous timesteps, so this does not cut it.</p>
<p>Hmm… What if we feed in a combination of the current input and the previous hidden layer?</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)</script>
<script type="math/tex; mode=display">y_2 = f\bigg(W_x \color{red}{X_2} + W_h \ f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)\bigg)</script>
<p>Much better! Our layer at each timestep is now a blend of all the colors that have come before it, allowing our network to take into account all its past history when computing its output. This is the power of recurrence in all its glory: creating a loop where information can persist across timesteps.</p>
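<p>We can watch this persistence numerically. Here’s a toy single-unit sketch (the scalar weights are arbitrary, not learned) showing that changing only the first input changes the hidden state two steps later:</p>

```python
import numpy as np

Wx, Wh = 0.5, 0.9  # arbitrary scalar "weights" for a single-unit network

def hidden_states(xs):
    h, hs = 0.0, []
    for x in xs:
        h = np.tanh(Wx * x + Wh * h)  # blend current input with previous state
        hs.append(h)
    return hs

xs = [1.0, -0.5, 0.25]
h_last = hidden_states(xs)[-1]

xs[0] = 2.0                              # change ONLY the first input...
h_last_perturbed = hidden_states(xs)[-1]

# ...and the final hidden state changes too: the past persists in the present
print(h_last != h_last_perturbed)  # prints True
```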
<p><a name="toc4"></a></p>
<h3 id="the-nitty-gritty-details">The Nitty Gritty Details</h3>
<div class="imgcap">
<img src="/assets/rnn/rnn-1_layer-unrolled.svg" width="300px" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://kbullaughey.github.io/lstm-play/rnn/">Image Courtesy</a></div>
</div>
<p>At its core, an RNN can be represented by an internal, hidden state <code class="language-plaintext highlighter-rouge">h</code> that gets updated with every timestep and from which an output <code class="language-plaintext highlighter-rouge">y</code> can be optionally derived<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. This update behavior is governed by the following equations:</p>
<script type="math/tex; mode=display">\begin{cases}
h_t = f \big(W_{xh}x_t + W_{hh}h_{t-1}+b_1\big) \\
y_t = g \big(W_{hy}h_t + b_2\big)
\end{cases}</script>
<p>Don’t let the above notation scare you. It’s actually very simple once you dissect it.</p>
<ul>
<li><script type="math/tex">W_{xh}x_t</script> - we’re multiplying the input <script type="math/tex">x_t</script> by a weight matrix <script type="math/tex">W_{xh}</script>. You can think of this dot product as a way for the hidden layer to extract information out of the input.</li>
<li><script type="math/tex">W_{hh}h_{t-1}</script> - this dot product allows the network to extract information from an entire history of past inputs, which it uses in conjunction with the information gathered from the current input to compute its output. This is the crucial, self-defining property of RNNs.</li>
<li><script type="math/tex">f</script> and <script type="math/tex">g</script> are activation functions that squash the dot products to a specific range. The function <script type="math/tex">f</script> is usually <code class="language-plaintext highlighter-rouge">tanh</code> or <code class="language-plaintext highlighter-rouge">ReLU</code>. <script type="math/tex">g</script> can be a <code class="language-plaintext highlighter-rouge">softmax</code> when we want to output class probabilities.</li>
<li><script type="math/tex">b_1</script> and <script type="math/tex">b_2</script> are biases that help offset the outputs away from the origin (similar to the b in your typical <script type="math/tex">ax+b</script> line).</li>
</ul>
<p>As you can see, the Vanilla RNN model is quite simple. Once its architecture has been defined, training it is exactly the same as with normal neural nets, i.e. initializing the weight matrices and biases, defining a loss function and minimizing that loss function using some form of gradient descent.</p>
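<p>To make the update equations concrete, here is a minimal numpy sketch of a single vanilla RNN timestep. The dimensions and weight initialization below are arbitrary, chosen just for illustration:</p>

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b1, b2):
    # h_t = f(W_xh x_t + W_hh h_{t-1} + b1), with f = tanh
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b1)
    # y_t = g(W_hy h_t + b2), with g = softmax for class probabilities
    scores = W_hy @ h_t + b2
    y_t = np.exp(scores - scores.max())
    y_t /= y_t.sum()
    return h_t, y_t

# toy dimensions: 4-d inputs, an 8-d hidden state, 3 output classes
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(8, 4))
W_hh = rng.normal(scale=0.1, size=(8, 8))
W_hy = rng.normal(scale=0.1, size=(3, 8))
b1, b2 = np.zeros(8), np.zeros(3)

h = np.zeros(8)                       # initial hidden state
for x_t in rng.normal(size=(5, 4)):   # a length-5 input sequence
    h, y = rnn_step(x_t, h, W_xh, W_hh, W_hy, b1, b2)
```

<p>Note how the same weights are reused at every timestep, and how <code class="language-plaintext highlighter-rouge">h</code> carries information forward across the whole sequence.</p>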
<p>This concludes our first installment in the series. In next week’s blog post, we’ll be coding our very own RNN from the ground up in numpy and applying it to a language modeling task. Stay tuned until then…</p>
<p><a name="toc5"></a></p>
<h3 id="references">References</h3>
<p>There are a ton of resources that helped me better grasp the fundamentals of RNNs. I’d like to thank <a href="https://twitter.com/iamtrask">iamtrask</a> especially, for letting me use his idea of colors to explain neural memory. You can read his amazing blog post <a href="https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/">here</a>.</p>
<ul>
<li>Denny Britz’s RNN series - click <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">here</a></li>
<li>Andrej Karpathy’s Blog Post - click <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">here</a></li>
<li>Chris Olah’s Blog Post - click <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">here</a></li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>If you’re familiar with Control Theory, this should be slightly reminiscent of a feedback loop, although not quite. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I’m referring to the <a href="https://arxiv.org/abs/1502.04623">DRAW</a> model introduced by Gregor et al. at Deepmind. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>In the simplest of cases, the hidden state <script type="math/tex">h_t</script> is used as both the output <script type="math/tex">y_t</script> and input to the next hidden state <script type="math/tex">h_{t+1}</script>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 20 Jul 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/07/20/rnn/

Tags: deep learning, rnn, sequences, 2017

Deep Learning Paper Implementations: Spatial Transformer Networks - Part II

<div class="imgcap">
<img src="/assets/stn2/ai.jpg" width="45%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/">Image Courtesy</a></div>
</div>
<p>In last week’s <a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">blog post</a>, we introduced two very important concepts: <strong>affine transformations</strong> and <strong>bilinear interpolation</strong> and mentioned that they would prove crucial in understanding Spatial Transformer Networks.</p>
<p>Today, we’ll provide a detailed, section-by-section summary of the <a href="https://arxiv.org/abs/1506.02025">Spatial Transformer Networks</a> paper, a concept originally introduced by researchers <em>Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu</em> of Google Deepmind.</p>
<p>Hopefully, it’ll give you a clear understanding of the module and prove useful for next week’s blog post, where we’ll cover its implementation in Tensorflow.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Motivation</a></li>
<li><a href="#toc2">Pooling Operator</a></li>
<li><a href="#toc3">Spatial Transformer Network</a>
<ul>
<li><a href="#toc4">Localisation Network</a></li>
<li><a href="#toc5">Parametrised Sampling Grid</a></li>
<li><a href="#toc6">Differentiable Image Sampling</a></li>
</ul>
</li>
<li><a href="#toc7">Fun with STNs</a>
<ul>
<li><a href="#toc8">Distorted MNIST</a></li>
<li><a href="#toc9">GTSRB dataset</a></li>
</ul>
</li>
<li><a href="#toc10">Summary</a></li>
<li><a href="#toc11">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="motivation">Motivation</h2>
<p>When working on a classification task, it is usually desirable that our system be <strong>robust</strong> to input variations. By this, we mean to say that should an input undergo a certain “transformation” so to speak, our classification model should in theory spit out the same class label as before that transformation. A few examples of the “challenges” our image classification model may face include:</p>
<ul>
<li><strong>scale variation</strong>: variations in size both in the real world and in the image.</li>
<li><strong>viewpoint variation</strong>: different object orientation with respect to the viewer.</li>
<li><strong>deformation</strong>: non-rigid bodies can be deformed and twisted into unusual shapes.</li>
</ul>
<div class="imgcap">
<div>
<img src="/assets/stn2/var1.png" style="max-width:49%; height:350px;" />
<img src="/assets/stn2/var2.png" style="max-width:49%; height:200px;" />
</div>
<div class="thecap" style="text-align:center"><a href="http://cs231n.github.io/classification/">Image Courtesy</a></div>
</div>
<p>For illustration purposes, take a look at the images above. While the task of classifying them may seem trivial to a human being, recall that our computer algorithms only work with raw 3D arrays of brightness values so a tiny change in an input image can alter every single pixel value in the corresponding array. Hence, our ideal image classification model should in theory be able to disentangle object pose and deformation from texture and shape.</p>
<p>For a different type of intuition, let’s again take a look at the following cat images.</p>
<div class="imgcap">
<div>
<img src="/assets/stn2/cat2.jpg" style="max-width:49%; height:300px;" />
<img src="/assets/stn2/cat2_.jpg" style="max-width:49%; height:300px;" />
<img src="/assets/stn2/cat1.jpg" style="max-width:49%; height:250px;" />
<img src="/assets/stn2/cat1_.jpg" style="max-width:49%; height:250px;" />
</div>
<div class="thecap" style="text-align:center"> <b>Left:</b> Cat images which may present classification challenges. <b>Right:</b> Transformed images which yield a simplified classification pipeline.</div>
</div>
<p>Would it not be extremely desirable if our model could go from left to right using some sort of crop and scale-normalize combination so as to simplify the subsequent classification task?</p>
<p><a name="toc2"></a></p>
<h2 id="pooling-layers">Pooling Layers</h2>
<p>It turns out that the pooling layers we use in our neural network architectures actually endow our models with a certain degree of spatial invariance. Recall that the pooling operator acts as a sort of downsampling mechanism. It progressively reduces the spatial size of the feature map, operating independently on each depth slice, cutting down the number of parameters and computational cost.</p>
<hr />
<div class="fig figcenter fighighlight">
<img src="/assets/stn2/pool.jpeg" width="36%" />
<img src="/assets/stn2/maxpool.jpeg" width="59%" style="border-left: 1px solid black;" />
<div class="figcaption">
Pooling layer downsamples the volume spatially. <b>Left:</b> In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. <b>Right:</b> 2x2 max pooling. (<a href="http://cs231n.github.io/convolutional-networks/#pool">Image Courtesy</a>)
</div>
</div>
<hr />
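<p>As a concrete sketch of the downsampling above, here is a minimal numpy implementation of 2x2 max pooling with stride 2, assuming even spatial dimensions and no padding:</p>

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on an (H, W, C) volume with even H and W."""
    H, W, C = x.shape
    # group each non-overlapping 2x2 window together, then take its max
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4, 1)
out = max_pool_2x2(x)   # spatial size halved: (4, 4, 1) -> (2, 2, 1)
```

<p>Each output cell keeps only 1 of the 4 values in its window, which is exactly the 75% loss of activations discussed below.</p>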
<p><strong>How exactly does it provide invariance?</strong> Well think of it this way. The idea behind pooling is to take a complex input, split it up into cells, and “pool” the information from these complex cells to produce a set of simpler cells that describe the output. So for example, say we have 3 images of the number 7, each in a different orientation. A pool over a small grid in each image would detect the number 7 regardless of its position in that grid since we’d be capturing approximately the same information by aggregating pixel values.</p>
<p>Now there are a few downsides to pooling which make it an undesirable operator. For one, pooling is <strong>destructive</strong>. A 2x2 max pool with stride 2, for instance, discards 75% of the feature activations, meaning we are guaranteed to lose exact positional information. You may be wondering why this is bad, since we mentioned earlier that pooling endows our network with some spatial robustness. The thing is that positional information is invaluable in visual recognition tasks. Think of our cat classifier above: it may be important to know where the whiskers are relative to, say, the snout. But this is exactly the sort of information we throw away when we use max pooling.</p>
<p>Another limitation of pooling is that it is <strong>local and predefined</strong>. With a small receptive field, the effects of a pooling operator are only felt towards the deeper layers of the network, meaning intermediate feature maps may suffer from large input distortions. And remember, we can’t just increase the receptive field arbitrarily, because that would downsample our feature maps too aggressively.</p>
<p>The main takeaway is that ConvNets are not invariant to relatively large input distortions. This limitation is due to having only a restricted, pre-defined pooling mechanism for dealing with spatial variation of the data. This is where Spatial Transformer Networks come into play!</p>
<blockquote>
<p>The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. (Geoffrey Hinton, Reddit AMA)</p>
</blockquote>
<p><a name="toc3"></a></p>
<h2 id="spatial-transformer-networks-stns">Spatial Transformer Networks (STNs)</h2>
<p>The Spatial Transformer mechanism addresses the issues above by providing Convolutional Neural Networks with explicit spatial transformation capabilities. It possesses 3 defining properties that make it very appealing.</p>
<ul>
<li><strong>modular</strong>: STNs can be inserted anywhere into existing architectures with relatively small tweaking.</li>
<li><strong>differentiable</strong>: STNs can be trained with backprop allowing for end-to-end training of the models they are injected in.</li>
<li><strong>dynamic:</strong> STNs perform active spatial transformation on a feature map for each input sample as compared to the pooling layer which acted identically for all input samples.</li>
</ul>
<p>As you can see, the Spatial Transformer is superior to the Pooling operator in all regards. So this begs the following question: <strong>what exactly is a Spatial Transformer?</strong></p>
<div class="imgcap">
<img src="/assets/stn2/stn_arch.png" width="65%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a></div>
</div>
<p>The Spatial Transformer module consists of three components shown in the figure above: a <strong>localisation network</strong>, a <strong>grid generator</strong> and a <strong>sampler</strong>. Before we dive into each of their details, I’d like to briefly remind you of a 3 step pipeline we talked about last week.</p>
<div class="imgcap">
<img src="/assets/stn2/pipeline.png" width="75%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">Affine Transformation Pipeline</a></div>
</div>
<p>Recall that we can’t just blindly rush to the input image and apply our affine transformation. It’s important to first create a sampling grid, transform it, and then sample the input image using the grid. With that being said, let’s jump into the core components of the Spatial Transformer.</p>
<p><a name="toc4"></a></p>
<h3 id="localisation-network">Localisation Network</h3>
<p>The goal of the localisation network is to spit out the parameters <script type="math/tex">\theta</script> of the affine transformation that’ll be applied to the input feature map. More formally, our localisation net is defined as follows:</p>
<ul>
<li><strong>input</strong>: feature map U of shape (H, W, C)</li>
<li><strong>output</strong>: transformation matrix <script type="math/tex">\theta</script> of shape (6,)</li>
<li><strong>architecture</strong>: either a fully-connected network or a ConvNet.</li>
</ul>
<p>As we train our network, we would like our localisation net to output more and more accurate thetas. <strong>What do we mean by accurate?</strong> Think of our digit 7 rotated by 90 degrees counterclockwise. After, say, 2 epochs, our localisation net may output a transformation matrix that performs a 45 degree clockwise rotation, and after 5 epochs it may actually learn to perform the complete 90 degree clockwise rotation. The effect is that our output image looks like a standard digit 7, something our neural network has seen in the training data and can easily classify.</p>
<p>Another way to look at it is that the localisation network learns to store the knowledge of how to transform each training sample in the weights of its layers.</p>
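<p>To make this concrete, here is a minimal numpy sketch of a fully-connected localisation net. The layer sizes are arbitrary; the zero-weight, identity-bias initialization of the final layer is the common trick of starting the transformer off at the identity transform, so it initially leaves the feature map untouched:</p>

```python
import numpy as np

def loc_net(U, W1, b1, W2, b2):
    """Flatten the feature map U, pass it through one ReLU layer, regress 6 thetas."""
    hidden = np.maximum(0, U.reshape(-1) @ W1 + b1)   # ReLU hidden layer
    return hidden @ W2 + b2                            # the 6 affine parameters

H, W, C = 8, 8, 1
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.01, size=(H * W * C, 32)), np.zeros(32)
# zero final weights + identity bias => the net starts out predicting
# the identity transform before any training has happened
W2 = np.zeros((32, 6))
b2 = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])

U = rng.normal(size=(H, W, C))
theta = loc_net(U, W1, b1, W2, b2).reshape(2, 3)
```

<p>As training progresses, gradients flowing back through the sampler nudge these 6 numbers away from the identity towards whatever transform helps the classifier.</p>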
<p><a name="toc5"></a></p>
<h3 id="parametrised-sampling-grid">Parametrised Sampling Grid</h3>
<p>The grid generator’s job is to output a parametrised sampling grid, which is a set of points where the input map <strong>should</strong> be sampled to produce the desired transformed output.</p>
<p>Concretely, the grid generator first creates a normalized meshgrid of the same size as the input image U of shape (H, W), that is, a set of indices <script type="math/tex">(x^t, y^t)</script> that cover the whole input feature map (the superscript t here stands for target coordinates in the output feature map). Then, since we’re applying an affine transformation to this grid and would like to use translations, we proceed by adding a row of ones to our coordinate vector to obtain its homogeneous equivalent. This is the little trick we also talked about last week. Finally, we reshape our 6-parameter <script type="math/tex">\theta</script> to a 2x3 matrix and perform the following multiplication, which results in our desired parametrised sampling grid.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
x^{s} \\
y^{s} \\
\end{bmatrix} = \begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
%
\begin{bmatrix}
x^t \\
y^t \\
1
\end{bmatrix} %]]></script>
<p>The column vector <script type="math/tex">\begin{bmatrix}
x^s \\
y^s
\end{bmatrix}</script> consists of a set of indices that tell us where we should sample our input to obtain the desired transformed output.</p>
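<p>In code, the grid generator boils down to a few lines of numpy. This sketch handles a single grid with no batching, for clarity:</p>

```python
import numpy as np

def sampling_grid(H, W, theta):
    """Map normalized target coordinates through theta to source coordinates."""
    # normalized meshgrid of target coordinates (x_t, y_t) in [-1, 1]
    x_t, y_t = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    # homogeneous coordinates: stack a row of ones, shape (3, H*W)
    grid = np.stack([x_t.ravel(), y_t.ravel(), np.ones(H * W)])
    # (2, 3) @ (3, H*W) -> source coordinates (x_s, y_s), shape (2, H*W)
    return theta.reshape(2, 3) @ grid

identity = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
src = sampling_grid(4, 4, identity)   # identity theta: source grid == target grid
```
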
<p><strong>But wait a minute, what if those indices are fractional?</strong> Bingo! That’s why we learned about bilinear interpolation and this is exactly what we do next.</p>
<p><a name="toc6"></a></p>
<h3 id="differentiable-image-sampling">Differentiable Image Sampling</h3>
<p>Since bilinear interpolation is differentiable, it is perfectly suitable for the task at hand. Armed with the input feature map and our parametrised sampling grid, we proceed with bilinear sampling and obtain our output feature map V of shape (H’, W’, C’). Note that this implies that we can perform downsampling and upsampling by specifying the shape of our sampling grid. (take that pooling!) We definitely aren’t restricted to bilinear sampling, and there are other sampling kernels we can use, but the important takeaway is that it must be differentiable to allow the loss gradients to flow all the way back to our localisation network.</p>
<div class="imgcap">
<img src="/assets/stn2/transformation.png" width="60%" style="border:none;" />
<div class="thecap" style="text-align:justify">(<a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a>) Two examples of applying the parameterised sampling grid to an image U producing the output V. <b>(a)</b> Identity transform (i.e. U = V). <b>(b)</b> Affine transformation (i.e. rotation)</div>
</div>
<p>The above illustrates the inner workings of the Spatial Transformer. Basically it boils down to 2 crucial concepts we’ve been talking about all week: an affine transformation followed by bilinear interpolation. Take a moment and admire the elegance of such a mechanism! We’re letting our network learn the optimal affine transformation parameters that will help it ultimately succeed in the classification task <strong>all on its own</strong>.</p>
<p><a name="toc7"></a></p>
<h2 id="fun-with-spatial-transformers">Fun with Spatial Transformers</h2>
<p>As a final note, I’ll provide 2 examples that illustrate the power of Spatial Transformers. I’ve attached the references for each example at the bottom of the post, so make sure to look those up if they pique your interest.</p>
<p><a name="toc8"></a></p>
<h3 id="distorted-mnist">Distorted MNIST</h3>
<p>Here is the result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification.</p>
<div class="imgcap">
<img src="/assets/stn2/mnist.png" width="45%" style="border:none;" /><div class="thecap" style="text-align:center">(<a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a>)</div>
</div>
<p>Notice how it has learned to do exactly what we wanted our theoretical “robust” image classification model to do: by zooming in and eliminating background clutter, it has “standardized” the input to facilitate classification. If you want to view a live animation of the transformer in action, click <a href="https://drive.google.com/file/d/0B1nQa_sA3W2iN3RQLXVFRkNXN0k/view">here</a>.</p>
<p><a name="toc9"></a></p>
<h3 id="german-traffic-sign-recognition-benchmark-gtsrb-dataset">German Traffic Sign Recognition Benchmark (GTSRB) dataset</h3>
<div class="imgcap">
<div>
<img src="/assets/stn2/epoch_evolution.gif" style="max-width:49%; height:250px;" />
<img src="/assets/stn2/moving_evolution.gif" style="max-width:49%; height:250px;" />
</div>
<div class="thecap">(<a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">Image Courtesy</a>) <b>Left</b>: Behavior of the Spatial Transformer during training. Notice how it learns to focus on the traffic sign, gradually removing the background. <b>Right</b>: Output for different input images. Note how it stays approximately constant regardless of the input variability and distortion. Pretty neat!</div>
</div>
<p><a name="toc10"></a></p>
<h2 id="summary">Summary</h2>
<p>In today’s blog post, we went over Google Deepmind’s Spatial Transformer Network paper. We started by introducing the different challenges classification models face, mainly how distortions in the input images can cause our classifiers to fail. One remedy is to use pooling layers; however they possess a few glaring limitations that have made them fall into disuse. The other remedy, and the subject of this blog post, is to use Spatial Transformer Networks.</p>
<p>The STN is a differentiable module that can be inserted anywhere in a ConvNet architecture to increase its geometric invariance. It effectively endows our networks with the ability to spatially transform feature maps at no extra data or supervision cost. Finally, we saw how the whole mechanism boils down to 2 familiar operations: an affine transformation and bilinear interpolation.</p>
<p>In next week’s blog post we’ll be using what we’ve learned so far to aid us in coding this paper from scratch in Tensorflow. In the meantime, if you have any questions, feel free to post them in the comment section below.</p>
<p>Cheers and see you next week!</p>
<p><a name="toc11"></a></p>
<h2 id="references">References</h2>
<ul>
<li>The original Deepmind paper - click <a href="https://arxiv.org/abs/1506.02025">here</a></li>
<li>Kudos to the Torch blog post on STNs which really helped me during the learning process - click <a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">here</a></li>
<li>Torch Implementation also helped me grasp the inner workings of STNs - check out this <a href="https://github.com/qassemoquab/stnbhwd">repo</a></li>
<li>Stanford’s CS231n as always - click <a href="http://cs231n.github.io">here</a></li>
</ul>
Wed, 18 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/18/stn-part2/

Tags: deepmind, google, spatial transformer networks, transformations, affine, linear, bilinear interpolation

Deep Learning Paper Implementations: Spatial Transformer Networks - Part I

<div class="imgcap">
<img src="/assets/stn/ai.jpg" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/">Image Courtesy</a></div>
</div>
<p>The first three blog posts in my “Deep Learning Paper Implementations” series will cover <a href="https://arxiv.org/abs/1506.02025">Spatial Transformer Networks</a> introduced by <em>Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu</em> of Google Deepmind in 2015. The Spatial Transformer Network is a learnable module aimed at increasing the spatial invariance of Convolutional Neural Networks in a computationally and parameter efficient manner.</p>
<p>In this first installment, we’ll be introducing two very important concepts that will prove crucial in understanding the inner workings of the Spatial Transformer layer. We’ll first start by examining a subset of image transformation techniques that fall under the umbrella of <strong>affine transformations</strong>, and then dive into a procedure that commonly follows these transformations: <strong>bilinear interpolation</strong>.</p>
<p>In the second installment, we’ll be going over the Spatial Transformer Layer in detail and summarizing the paper, and then in the third and final part, we’ll be coding it from scratch in Tensorflow and applying it to the <a href="http://benchmark.ini.rub.de/?section=gtsrb&subsection=news">GTSRB dataset</a> (German Traffic Sign Recognition Benchmark).</p>
<p>For the full code that appears on this page, visit my <a href="https://github.com/kevinzakka/blog-code/tree/master/spatial_transformer">Github Repository</a>.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Image Transformations</a>
<ul>
<li><a href="#toc2">Scale</a></li>
<li><a href="#toc3">Rotate</a></li>
<li><a href="#toc4">Shear</a></li>
<li><a href="#toc5">Translate</a></li>
</ul>
</li>
<li><a href="#toc6">Bilinear Interpolation</a>
<ul>
<li><a href="#toc7">Motivation</a></li>
<li><a href="#toc8">Algorithm</a></li>
<li><a href="#toc9">Python Code</a></li>
</ul>
</li>
<li><a href="#toc10">Results</a></li>
<li><a href="#toc11">Conclusion</a></li>
<li><a href="#toc12">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="image-transformations">Image Transformations</h3>
<p>To lay the groundwork for affine transformations, we first need to talk about linear transformations. To that end, we’ll restrict ourselves to 2 dimensions and work with matrices.</p>
<p>We define the following:</p>
<ul>
<li>a point K with coordinates
<script type="math/tex">\begin{bmatrix}
x \\
y
\end{bmatrix}</script> represented as a <script type="math/tex">(2\times1)</script> column vector.</li>
<li>a matrix
<script type="math/tex">% <![CDATA[
M=
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix} %]]></script> represented as a square matrix of shape <script type="math/tex">(2\times2)</script>.</li>
</ul>
<p>and would like to examine the linear transformation <script type="math/tex">T</script> defined by the matrix product <script type="math/tex">K' = T(K) = MK</script> as we vary the parameters a, b, c and d of M.</p>
<p><strong>Warm-Up Question.</strong></p>
<p>Say we set <script type="math/tex">a = d = 1</script> and <script type="math/tex">b = c = 0</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix} %]]></script>
<p>In that case, what transform do you think we would obtain? Go ahead and give it a few moments’ thought…</p>
<p><strong>Solution.</strong></p>
<p>Let’s write it out:</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x \\
y
\end{bmatrix} = K %]]></script>
<p>We’ve actually represented the identity transform, meaning that the point K does not move in the plane. Let us now jump to more interesting transforms.</p>
<p><a name="toc2"></a></p>
<p><strong>Scaling.</strong></p>
<div class="imgcap">
<img src="/assets/stn/scale.png" width="27%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>We let <script type="math/tex">b = c = 0</script>, and <script type="math/tex">a</script> and <script type="math/tex">d</script> take on any positive value.</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
p & 0 \\
0 & q
\end{bmatrix} %]]></script>
<p>Note that there is a special case of scaling called <em>isotropic</em> scaling, in which the scaling factor for both the x and y directions is the same, say <script type="math/tex">s</script>. In that case, enlarging an image would correspond to <script type="math/tex">s > 1</script> while shrinking would correspond to <script type="math/tex">% <![CDATA[
s < 1 %]]></script>. It’s a bit non-intuitive, then, that to zoom in on an image you actually need <script type="math/tex">% <![CDATA[
s < 1 %]]></script>: the key is that we apply the transformation to a grid of sampling points rather than to the image itself, so a shrunken grid samples a smaller region of the input, which then fills the whole output.</p>
<p>Anyway, performing the matrix product, we obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
p & 0 \\
0 & q
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
px \\
qy
\end{bmatrix} %]]></script>
<p><a name="toc3"></a></p>
<p><strong>Rotation.</strong></p>
<div class="imgcap">
<img src="/assets/stn/rot.png" width="19%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>Suppose we want to rotate by an angle <script type="math/tex">\theta</script> about the origin. To do so, we set <script type="math/tex">a = d = \cos{\theta}</script>, <script type="math/tex">b = -\sin{\theta}</script> and <script type="math/tex">c = \sin{\theta}</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
\cos{\theta} & -\sin{\theta} \\
\sin{\theta} & \cos{\theta}
\end{bmatrix} %]]></script>
<p>We thus obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
\cos{\theta} & -\sin{\theta} \\
\sin{\theta} & \cos{\theta}
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x\cos{\theta}- y\sin{\theta} \\
x\sin{\theta} + y\cos{\theta}
\end{bmatrix} %]]></script>
<p><a name="toc4"></a></p>
<p><strong>Shear.</strong></p>
<div class="imgcap">
<img src="/assets/stn/shear.png" width="27%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>When we shear an image, we offset the y direction by a distance proportional to x, and the x direction by a distance proportional to y. For example, when we go from normal text to italics, we are effectively applying a shear transform (think about shearing a deck of cards if that helps).</p>
<p>To achieve shearing, we set <script type="math/tex">a = d = 1</script>, <script type="math/tex">b = m</script> and <script type="math/tex">c = n</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
1 & m \\
n & 1
\end{bmatrix} %]]></script>
<p>This yields</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & m \\
n & 1
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x + my \\
y + nx
\end{bmatrix} %]]></script>
<hr />
<p>In summary, we have defined 3 basic linear transformations:</p>
<ul>
<li><strong>scaling:</strong> scales the x and y direction by a scalar.</li>
<li><strong>shearing:</strong> offsets x by an amount proportional to y, and y by an amount proportional to x.</li>
<li><strong>rotating:</strong> rotates the points around the origin by an angle <script type="math/tex">\theta</script>.</li>
</ul>
<p>Now the nice thing about matrices is that we can collapse sequential linear transformations into a single transformation matrix. For example, say we would like to apply a shear, a scale and then a rotation to our column vector K. Given that these transformations can be represented by the matrices <script type="math/tex">H</script>, <script type="math/tex">S</script> and <script type="math/tex">R</script>, and respecting the order of transformations, we can write down this operation as</p>
<script type="math/tex; mode=display">K' = R \big[ S \big( HK \big) \big]</script>
<p>But recall that matrix multiplication is associative! So this reduces to</p>
<script type="math/tex; mode=display">\boxed{K' = MK}</script>
<p>where <script type="math/tex">M = RSH</script>. Be mindful of the order since matrix multiplication <script type="math/tex">\color{red}{\text{is not}}</script> commutative.</p>
<p>A beautiful consequence of this formula is that if we are given multiple transformations to do for a very high-dimensional vector, then we can basically carry out a single matrix multiplication rather than repeatedly manipulating the high-dimensional vector for every sequential transformation.</p>
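<p>We can verify this with a quick numpy example; the particular angle, scale and shear values below are arbitrary:</p>

```python
import numpy as np

angle = np.deg2rad(30)
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])   # rotation
S = np.diag([2.0, 0.5])                            # anisotropic scale
H = np.array([[1.0, 0.3],
              [0.0, 1.0]])                         # shear

K = np.array([3.0, 4.0])

# applying the transforms one after the other...
step_by_step = R @ (S @ (H @ K))
# ...is the same as one multiplication by the collapsed matrix M = RSH
M = R @ S @ H
collapsed = M @ K
```

<p>The two results agree, while swapping the order of the factors (e.g. <code class="language-plaintext highlighter-rouge">S @ R</code> instead of <code class="language-plaintext highlighter-rouge">R @ S</code>) generally gives a different matrix, confirming that the order matters.</p>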
<hr />
<p><a name="toc5"></a></p>
<p><strong>Translation.</strong></p>
<p>The only downside to this <script type="math/tex">2 \times 2</script> matrix representation is that we cannot represent translation, since it isn’t a linear transformation. Translation, however, is a very important and much-needed transformation, so we would like to be able to encapsulate it in our matrix representation.</p>
<p>To solve this dilemma, we represent our 2D vectors in 3D using <strong>homogeneous coordinates</strong> as follows:</p>
<ul>
<li>our point K becomes a <script type="math/tex">(3\times1)</script> column vector
<script type="math/tex">\begin{bmatrix}
x \\
y \\
1
\end{bmatrix}</script></li>
<li>our matrix M becomes a <script type="math/tex">(3\times3)</script> square matrix
<script type="math/tex">% <![CDATA[
M=
\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
0 & 0 & 1
\end{bmatrix} %]]></script></li>
</ul>
<p>To represent a translation, all we have to do is place 2 new parameters <script type="math/tex">e</script> and <script type="math/tex">f</script> in our third column like so</p>
<script type="math/tex; mode=display">% <![CDATA[
M=
\begin{bmatrix}
a & b & e \\
c & d & f \\
0 & 0 & 1
\end{bmatrix} %]]></script>
<p>and we can thus carry out translations as linear transformations in homogeneous coordinates. Note that if we require a 2D output, then all we need to do is represent M as a <script type="math/tex">2 \times 3</script> matrix and leave K untouched.</p>
<p><strong>Example.</strong></p>
<p>Translate both the x and y direction by <script type="math/tex">\Delta</script>. Result should be 2D.</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & 0 & \Delta \\
0 & 1 & \Delta
\end{bmatrix}
%
\begin{bmatrix}
x \\
y \\
1
\end{bmatrix} =
\begin{bmatrix}
x + \Delta \\
y + \Delta
\end{bmatrix} %]]></script>
<p><strong>Summary.</strong></p>
<div class="imgcap">
<img src="/assets/stn/affine.png" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>By using a little trick, we were able to add a new transformation to our repertoire of linear transformations. This transformation, called translation, is an affine transformation. Hence, we can generalize our results and represent our 4 affine transformations (all linear transformations are affine) by the 6 parameter matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
M=
\begin{bmatrix}
a & b & c \\
d & e & f
\end{bmatrix} %]]></script>
<p><a name="toc6"></a></p>
<h3 id="bilinear-interpolation">Bilinear Interpolation</h3>
<p><a name="toc7"></a></p>
<p><strong>Motivation.</strong> When an image undergoes an affine transformation such as a rotation or scaling, the pixels in the image get moved around. This can be especially problematic when a pixel location in the output does not map directly to one in the input image.</p>
<p>In the illustration below, you can clearly see that the rotation places some points at locations that are not centered in the squares. This means that they would not have a corresponding pixel value in the original image.</p>
<div class="imgcap">
<img src="/assets/stn/stickman.png" width="70%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://northstar-www.dartmouth.edu/doc/idl/html_6.2/Interpolation_Methods.html">Image Courtesy</a></div>
</div>
<p>So for example, suppose that after rotating an image, we need to find the pixel value at the location (6.7, 3.2). The problem is that pixel values are only defined at integer locations: there is no such thing as a fractional pixel.</p>
<p>To solve this problem, bilinear interpolation uses the 4 nearest pixel values which are located in diagonal directions from a given location in order to find the appropriate color intensity values of that pixel. The result is smoother and more realistic images!</p>
<p><a name="toc8"></a></p>
<p><strong>Algorithm.</strong></p>
<div class="imgcap">
<img src="/assets/stn/interpol.png" width="35%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Image Courtesy</a></div>
</div>
<p>Our goal is to find the pixel value of the point P. To do so, we calculate the pixel value of <script type="math/tex">R_1</script> and <script type="math/tex">R_2</script> using a weighted average of <script type="math/tex">(Q_{11}, Q_{21})</script> and <script type="math/tex">(Q_{12}, Q_{22})</script> respectively. Then, we use a weighted average of <script type="math/tex">R_2</script> and <script type="math/tex">R_1</script> to find the value of P.</p>
<p>Effectively, we are interpolating in the x direction and then the y direction, hence the name bilinear interpolation. You could just as well flip the order of interpolation and get the exact same value.</p>
<p>So given a point <script type="math/tex">P = (x, y)</script> and 4 corner coordinates <script type="math/tex">Q_{11} = (x_1, y_1)</script>, <script type="math/tex">Q_{21} = (x_2, y_1)</script>, <script type="math/tex">Q_{12} = (x_1, y_2)</script> and <script type="math/tex">Q_{22} = (x_2, y_2)</script>, we first interpolate in the x-direction:</p>
<script type="math/tex; mode=display">R_1 = \frac{x_2 - x}{x_2 - x_1}Q_{11} + \frac{x - x_1}{x_2 - x_1}Q_{21}</script>
<script type="math/tex; mode=display">R_2 = \frac{x_2 - x}{x_2 - x_1}Q_{12} + \frac{x - x_1}{x_2 - x_1}Q_{22}</script>
<p>and finally in the y-direction:</p>
<script type="math/tex; mode=display">\boxed{P = \frac{y_2 - y}{y_2 - y_1}R_1 + \frac{y - y_1}{y_2 - y_1}R_2}</script>
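<p>The formulas above translate almost line for line into Python. Here is a sketch for a single scalar point, with variable names mirroring the math (the batched NumPy version comes later):</p>

```python
def bilinear(x, y, x1, y1, x2, y2, Q11, Q21, Q12, Q22):
    """Interpolate the value at (x, y) from the 4 surrounding corner values."""
    # interpolate in the x-direction
    R1 = (x2 - x) / (x2 - x1) * Q11 + (x - x1) / (x2 - x1) * Q21
    R2 = (x2 - x) / (x2 - x1) * Q12 + (x - x1) / (x2 - x1) * Q22
    # then in the y-direction
    return (y2 - y) / (y2 - y1) * R1 + (y - y1) / (y2 - y1) * R2

# sanity checks: a corner reproduces its value exactly, the center is the average
print(bilinear(0.0, 0.0, 0, 0, 1, 1, 10, 20, 30, 40))  # 10.0
print(bilinear(0.5, 0.5, 0, 0, 1, 1, 10, 20, 30, 40))  # 25.0
```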
<p><a name="toc9"></a></p>
<p><strong>Python Code.</strong></p>
<p>One very very important note before we jump into the code!</p>
<hr />
<p>An image processing affine transformation usually follows the 3-step pipeline below:</p>
<ul>
<li>First, we create a sampling grid composed of <script type="math/tex">(x, y)</script> coordinates. For example, given a 400x400 grayscale image, we create a meshgrid of the same dimensions, that is, evenly spaced <script type="math/tex">x \in [0, W]</script> and <script type="math/tex">y \in [0, H]</script>.</li>
<li>We then apply the transformation matrix to the sampling grid generated in the step above.</li>
<li>Finally, we sample the resulting grid from the original image using the desired interpolation technique.</li>
</ul>
<p>As you can see, this is different from directly applying a transform to the original image.</p>
<hr />
<p>I’ve attached 2 cat images in the GitHub repository mentioned at the top of this page, which you should go ahead and download. Save them to your Desktop in a folder called <code class="language-plaintext highlighter-rouge">data/</code>, or make sure to update the path if you choose a different location.</p>
<p>I’ve also written a function <code class="language-plaintext highlighter-rouge">load_img()</code> that converts images to numpy arrays. I won’t go into its details but it’s pretty basic and you shouldn’t take long to understand what it does. Note that you’ll need both PIL and Numpy to reproduce the results below.</p>
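<p>If you don’t want to fetch the repository, a minimal stand-in for <code class="language-plaintext highlighter-rouge">load_img()</code> could look like the sketch below. This is my guess at its behavior, not the repository’s exact code: I’m assuming it resizes the image and adds a leading batch dimension so that images can be concatenated along axis 0.</p>

```python
import numpy as np
from PIL import Image

def load_img(path, dims, view=False):
    # open, convert to RGB, and resize to (width, height) = dims
    img = Image.open(path).convert('RGB').resize(dims)
    if view:
        img.show()
    # shape (1, H, W, 3): the leading axis is the batch dimension
    return np.expand_dims(np.asarray(img, dtype=np.float64), axis=0)
```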
<p>Armed with this function, let’s load both cat images and concatenate them into a single input array. We’re working with 2 images because we want to make our code as general as possible.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c1"># params
</span><span class="n">DIMS</span> <span class="o">=</span> <span class="p">(</span><span class="mi">400</span><span class="p">,</span> <span class="mi">400</span><span class="p">)</span>
<span class="n">CAT1</span> <span class="o">=</span> <span class="s">'cat1.jpg'</span>
<span class="n">CAT2</span> <span class="o">=</span> <span class="s">'cat2.jpg'</span>
<span class="c1"># load both cat images
</span><span class="n">img1</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">CAT1</span><span class="p">,</span> <span class="n">DIMS</span><span class="p">)</span>
<span class="n">img2</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">CAT2</span><span class="p">,</span> <span class="n">DIMS</span><span class="p">,</span> <span class="n">view</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c1"># concat into tensor of shape (2, 400, 400, 3)
</span><span class="n">input_img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">img1</span><span class="p">,</span> <span class="n">img2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c1"># dimension sanity check
</span><span class="k">print</span><span class="p">(</span><span class="s">"Input Img Shape: {}"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">input_img</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
</code></pre></div></div>
<p>Given that we have 2 images, our batch size is equal to 2. This means that we need an equal amount of transformation matrices M for each image in the batch.</p>
<p>Let’s go ahead and initialize 2 identity transform matrices. This is the simplest case: if we implement our bilinear sampler correctly, the output image should be almost identical to the input image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># grab shape
</span><span class="n">num_batch</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">C</span> <span class="o">=</span> <span class="n">input_img</span><span class="o">.</span><span class="n">shape</span>
<span class="c1"># initialize M to identity transform
</span><span class="n">M</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">]])</span>
<span class="c1"># repeat num_batch times
</span><span class="n">M</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
</code></pre></div></div>
<p>(Recall that our general affine transformation matrix is <script type="math/tex">2 \times 3</script> if we want to include translation.)</p>
<p>Now we need to write a function that will generate a meshgrid for us and output a sampling grid resulting from the product of this meshgrid and our transformation matrix M.</p>
<p>Let’s go ahead and generate our meshgrid. We’ll create a normalized one, that is, the values of x and y range from -1 to 1, with <code class="language-plaintext highlighter-rouge">width</code> and <code class="language-plaintext highlighter-rouge">height</code> evenly spaced values respectively. In fact, note that for images, x corresponds to the width of the image (i.e. the number of columns of the matrix) while y corresponds to the height of the image (i.e. the number of rows of the matrix).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># create normalized 2D grid
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">H</span><span class="p">)</span>
<span class="n">x_t</span><span class="p">,</span> <span class="n">y_t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>Then we need to augment the dimensions to create homogeneous coordinates.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># reshape to (xt, yt, 1)
</span><span class="n">ones</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">x_t</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
<span class="n">sampling_grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">x_t</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">y_t</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">ones</span><span class="p">])</span>
</code></pre></div></div>
<p>So we’ve created 1 grid here, but we need <code class="language-plaintext highlighter-rouge">num_batch</code> grids. Same as above, our one-liner below repeats our array <code class="language-plaintext highlighter-rouge">num_batch</code> times.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># repeat grid num_batch times
</span><span class="n">sampling_grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">sampling_grid</span><span class="p">,</span> <span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">H</span><span class="o">*</span><span class="n">W</span><span class="p">))</span>
</code></pre></div></div>
<p>Now we perform step 2 of our image transformation pipeline.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># transform the sampling grid i.e. batch multiply
</span><span class="n">batch_grids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">sampling_grid</span><span class="p">)</span>
<span class="c1"># batch grid has shape (num_batch, 2, H*W)
</span>
<span class="c1"># reshape to (num_batch, height, width, 2)
</span><span class="n">batch_grids</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">moveaxis</span><span class="p">(</span><span class="n">batch_grids</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Finally, let’s write our bilinear sampler. Given the coordinates <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> in the sampling grid, we want to interpolate the pixel value in the original image.</p>
<p>Let’s start by separating the x and y dimensions and rescaling them to the height/width interval.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x_s</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
<span class="n">y_s</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
<span class="c1"># rescale x and y to [0, W/H]
</span><span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x_s</span> <span class="o">+</span> <span class="mf">1.</span><span class="p">)</span> <span class="o">*</span> <span class="n">W</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.5</span>
<span class="n">y</span> <span class="o">=</span> <span class="p">((</span><span class="n">y_s</span> <span class="o">+</span> <span class="mf">1.</span><span class="p">)</span> <span class="o">*</span> <span class="n">H</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.5</span>
</code></pre></div></div>
<p>Now for each coordinate <script type="math/tex">(x_i, y_i)</script> we want to grab 4 corner coordinates.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># grab 4 nearest corner points for each (x_i, y_i)
</span><span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">x0</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">y0</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>
<p>(Note that we could almost use the ceiling function rather than incrementing by 1; the two only differ when the coordinate is exactly an integer.)</p>
<p>Now we must make sure that no value goes beyond the image boundaries. For example, suppose we have <script type="math/tex">x = 399</script>, then <script type="math/tex">x_0 = 399</script> and <script type="math/tex">x_1 = x_0 + 1 = 400</script>, which would result in a numpy indexing error. Thus we clip our corner coordinates in the following way:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># make sure it's inside img range [0, H] or [0, W]
</span><span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">W</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">W</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">y0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">H</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">H</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Now we use advanced numpy indexing to grab the pixel value for each corner coordinate. These correspond to <code class="language-plaintext highlighter-rouge">(x0, y0)</code>, <code class="language-plaintext highlighter-rouge">(x0, y1)</code>, <code class="language-plaintext highlighter-rouge">(x1, y0)</code> and <code class="language-plaintext highlighter-rouge">(x1, y1)</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># look up pixel values at corner coords
</span><span class="n">Ia</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x0</span><span class="p">]</span>
<span class="n">Ib</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x0</span><span class="p">]</span>
<span class="n">Ic</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span>
<span class="n">Id</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span>
</code></pre></div></div>
<p>Almost there! Now, we calculate the weight coefficients,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># calculate deltas
</span><span class="n">wa</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
<span class="n">wb</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y0</span><span class="p">)</span>
<span class="n">wc</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">x0</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
<span class="n">wd</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">x0</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y0</span><span class="p">)</span>
</code></pre></div></div>
<p>and finally, multiply and add according to the formula mentioned previously.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># add dimension for addition
</span><span class="n">wa</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wa</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wb</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wb</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wc</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wc</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wd</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wd</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="c1"># compute output
</span><span class="n">out</span> <span class="o">=</span> <span class="n">wa</span><span class="o">*</span><span class="n">Ia</span> <span class="o">+</span> <span class="n">wb</span><span class="o">*</span><span class="n">Ib</span> <span class="o">+</span> <span class="n">wc</span><span class="o">*</span><span class="n">Ic</span> <span class="o">+</span> <span class="n">wd</span><span class="o">*</span><span class="n">Id</span>
</code></pre></div></div>
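<p>For reference, the sampling steps above can be condensed into a single function. This is just a restatement of the snippets we walked through, taking the already-separated <code class="language-plaintext highlighter-rouge">x_s</code> and <code class="language-plaintext highlighter-rouge">y_s</code> grids as arguments and keeping the same boundary behavior:</p>

```python
import numpy as np

def bilinear_sampler(input_img, x_s, y_s):
    """Sample input_img (num_batch, H, W, C) at normalized coords x_s, y_s."""
    num_batch, H, W, C = input_img.shape

    # rescale from [-1, 1] to pixel coordinates
    x = ((x_s + 1.) * W) * 0.5
    y = ((y_s + 1.) * H) * 0.5

    # 4 nearest corner points for each (x_i, y_i)
    x0 = np.floor(x).astype(np.int64)
    x1 = x0 + 1
    y0 = np.floor(y).astype(np.int64)
    y1 = y0 + 1

    # clip to the image boundary
    x0, x1 = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0, y1 = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)

    # pixel values at the corners (advanced indexing over the batch)
    b = np.arange(num_batch)[:, None, None]
    Ia, Ib = input_img[b, y0, x0], input_img[b, y1, x0]
    Ic, Id = input_img[b, y0, x1], input_img[b, y1, x1]

    # interpolation weights, broadcast over the channel axis
    wa = ((x1 - x) * (y1 - y))[..., None]
    wb = ((x1 - x) * (y - y0))[..., None]
    wc = ((x - x0) * (y1 - y))[..., None]
    wd = ((x - x0) * (y - y0))[..., None]

    return wa * Ia + wb * Ib + wc * Ic + wd * Id
```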
<hr />
<p><a name="toc10"></a></p>
<h3 id="results">Results</h3>
<p>So now that we’ve gone through the whole code incrementally, let’s have some fun and experiment with different values of the transformation matrix M.</p>
<p>The first thing you need to do is copy the whole code, which has been organized into modular functions, from the GitHub repository mentioned at the top of this page. Now let’s test whether our function works correctly.</p>
<p><strong>Identity Transform.</strong></p>
<p>Add the following lines at the end of the script and execute (this assumes matplotlib has been imported as <code class="language-plaintext highlighter-rouge">plt</code>):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import matplotlib.pyplot as plt

plt.imshow(out[1])
plt.show()
</code></pre></div></div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft1.png" width="300" />
</p>
<p><strong>Translation.</strong></p>
<p>Say we want to translate the picture by <code class="language-plaintext highlighter-rouge">0.5</code> in the x direction only. Since we transform the sampling grid rather than the image itself, a positive shift samples pixels from further to the right, which shifts the image content to the left.</p>
<p>Edit the following line of your code as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M = np.array([[1., 0., 0.5], [0., 1., 0.]])
</code></pre></div></div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft2.png" width="300" />
</p>
<p><strong>Rotation.</strong></p>
<p>Finally, say we want to rotate the picture by <code class="language-plaintext highlighter-rouge">45</code> degrees. Given that <script type="math/tex">\cos{(45)} = \sin{(45)} = \frac{\sqrt{2}}{2} \approx 0.707</script>, edit just this line of your code as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M = np.array([[0.707, -0.707, 0.], [0.707, 0.707, 0.]])
</code></pre></div></div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft3.png" width="300" />
</p>
<p><a name="toc11"></a></p>
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we went over basic linear transformations such as rotation, shear and scale before generalizing to affine transformations which included translations. Then, we saw the importance of bilinear interpolation in the context of these transformations. Finally, we went over the algorithm, coded it from scratch in Python and wrote 2 methods that helped us visualize these transformations according to a 3 step image processing pipeline.</p>
<p>In the next installment of this series, we’ll go over the Spatial Transformer Network layer in detail as well as summarize the paper it is described in.</p>
<p>See you next week!</p>
<p><a name="toc12"></a></p>
<h3 id="references">References</h3>
<p>A big thank you to <a href="https://twitter.com/edersantana">Eder Santana</a> for introducing me to this paper!</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Bilinear Interpolation Wikipedia</a></li>
<li><a href="http://supercomputingblog.com/graphics/coding-bilinear-interpolation/">Bilinear Interpolation</a></li>
<li><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Matrix Transformations PDF</a></li>
<li><a href="http://stackoverflow.com/questions/12729228/simple-efficient-bilinear-interpolation-of-images-in-numpy-and-python">Bilinear Interpolation Code</a></li>
</ul>
Tue, 10 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/10/stn-part1/
Nuts and Bolts of Applying Deep Learning<div class="imgcap">
<img src="/assets/app_dl/bolts.jpg" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://nutsandbolts.mit.edu/">Image Courtesy</a></div>
</div>
<p>This weekend was very hectic (catching up on courses and studying for a statistics quiz), but I managed to squeeze in some time to watch the <a href="http://www.bayareadlschool.org/">Bay Area Deep Learning School</a> livestream on YouTube. For those of you wondering what that is, BADLS is a 2-day conference hosted at Stanford University, consisting of back-to-back presentations on a variety of topics ranging from NLP and computer vision to unsupervised learning and reinforcement learning. Additionally, top DL software libraries such as Torch, Theano and TensorFlow were presented.</p>
<p>There were some super interesting talks from leading experts in the field: <a href="http://www.dmi.usherb.ca/~larocheh/index_en.html">Hugo Larochelle</a> from Twitter, <a href="http://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a> from OpenAI, <a href="http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html">Yoshua Bengio</a> from the Université de Montréal, and <a href="http://www.andrewng.org/">Andrew Ng</a> from Baidu to name a few. Of the plethora of presentations, there was one somewhat non-technical one given by Andrew that really piqued my interest.</p>
<p>In this blog post, I’m gonna try and give an overview of the main ideas outlined in his talk. The goal is to pause a bit and examine the ongoing trends in Deep Learning thus far, as well as gain some insight into applying DL in practice.</p>
<p>By the way, if you missed out on the livestreams, you can still view them at the following: <a href="https://www.youtube.com/watch?v=eyovmAtoUx0">Day 1</a> and <a href="https://www.youtube.com/watch?v=9dXiAecyJrY">Day 2</a>.</p>
<p><strong>Table of Contents</strong>:</p>
<ul>
<li><a href="#toc1">Major Deep Learning Trends</a></li>
<li><a href="#toc2">End-to-End Deep Learning</a></li>
<li><a href="#toc3">Bias-Variance Tradeoff</a></li>
<li><a href="#toc4">Human-level Performance</a></li>
<li><a href="#toc5">Personal Advice</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="major-deep-learning-trends">Major Deep Learning Trends</h3>
<p><strong>Why do DL algorithms work so well?</strong> According to Ng, with the rise of the Internet, Mobile and IOT era, the amount of data accessible to us has greatly increased. This correlates directly to a boost in the performance of neural network models, especially the larger ones which have the capacity to absorb all this data.</p>
<p align="center">
<img src="/assets/app_dl/perf_vs_data.png" width="450" />
</p>
<p>However, in the small data regime (left-hand side of the x-axis), the relative ordering of the algorithms is not that well defined and really depends on who is more motivated to engineer their features better, or refine and tune the hyperparameters of their model.</p>
<p>Thus this trend is more prevalent in the big data realm where hand engineering effectively gets replaced by end-to-end approaches and bigger neural nets combined with a lot of data tend to outperform all other models.</p>
<p><strong>Machine Learning and HPC team.</strong> The rise of big data and the need for larger models has started to put pressure on companies to hire a Computer Systems team. This is because some of the HPC (high-performance computing) applications require highly specialized knowledge and it is difficult to find researchers and engineers with sufficient knowledge in both fields. Thus, cooperation from both teams is the key to boosting performance in AI companies.</p>
<p><strong>Categorizing DL models.</strong> Work in DL can be categorized in the following 4 buckets:</p>
<p align="center">
<img src="/assets/app_dl/bucket.svg" width="350" />
</p>
<p>Most of the value in the industry today is driven by the models in the orange blob (innovation and monetization mostly) but Andrew believes that <strong>unsupervised deep learning</strong> is a super-exciting field that has loads of potential for the future.</p>
<p><a name="toc2"></a></p>
<h3 id="the-rise-of-end-to-end-dl">The rise of End-to-End DL</h3>
<p>A major improvement in the end-to-end approach has been that outputs are becoming more and more complicated. For example, rather than just outputting a simple class score such as 0 or 1, algorithms are starting to generate richer outputs: images in the case of GANs, full captions with RNNs, and most recently audio, as in DeepMind’s WaveNet.</p>
<p><strong>So what exactly does end-to-end training mean?</strong> Essentially, it means that AI practitioners are shying away from intermediate representations and going directly from one end (raw input) to the other end (output). Here’s an example from speech recognition.</p>
<p align="center">
<img src="/assets/app_dl/end-to-end.svg" width="340" />
</p>
<p><strong>Are there any disadvantages to this approach?</strong> End-to-end approaches are data hungry, meaning they only perform well when provided with a huge dataset of labelled examples. In practice, not all applications have the luxury of large labelled datasets, so other approaches, which allow hand-engineered information and field expertise to be added into the model, have gained the upper hand. As an example, in a self-driving car setting, going directly from the raw image to the steering direction is pretty difficult. Rather, many features such as trajectory and pedestrian location are calculated first as intermediate steps.</p>
<p>The main take-away from this section is that we should always be cautious of end-to-end approaches in applications where huge data is hard to come by.</p>
<p><a name="toc3"></a></p>
<h3 id="bias-variance-tradeoff">Bias-Variance Tradeoff</h3>
<p><strong>Splitting your data.</strong> In most deep learning problems, the train and test data come from different distributions. For example, suppose you are working on implementing an AI-powered rearview mirror and have gathered 2 chunks of data: the first, larger chunk comes from many places (could be partly bought, and partly crowdsourced) and the second, much smaller chunk is actual car data.</p>
<p>In this case, splitting the data into train/dev/test can be tricky. One might be tempted to carve the dev set out of the training chunk like in the first example of the diagram below. (Note that the chunk on the left corresponds to data mined from the first distribution and the one on the right to the one from the second distribution.)</p>
<p align="center">
<img src="/assets/app_dl/split.svg" width="500" />
</p>
<p>This is bad because we usually want our dev and test sets to come from the same distribution. The reason is that part of the team will spend a lot of time tuning the model to work well on the dev set; if the test set then turns out to be very different from the dev set, pretty much all that work will have been wasted effort.</p>
<p>Hence, a smarter way of splitting the above dataset is shown in the second line of the diagram. In practice, Andrew recommends creating dev sets from both data distributions: a train-dev set drawn from the training distribution, alongside the regular dev set drawn from the test distribution. In this manner, any gap between the different errors can help you pinpoint the problem more clearly.</p>
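<p>As a concrete sketch of this splitting strategy — the function name, proportions, and train-dev fraction below are illustrative, not from the talk — carving dev and test only out of the target-distribution chunk might look like this:</p>

```python
import random

def make_splits(broad_data, car_data, dev_frac=0.5, seed=0):
    """Split two data sources so that dev and test both come from
    the target (car) distribution, not the broad crowdsourced one.

    broad_data: large chunk from the bought/crowdsourced distribution.
    car_data: small chunk from the actual deployment distribution.
    """
    rng = random.Random(seed)
    car = list(car_data)
    rng.shuffle(car)

    # Carve dev and test out of the target-distribution data only,
    # so that tuning on dev transfers to test.
    n_dev = int(len(car) * dev_frac)
    dev, test = car[:n_dev], car[n_dev:]

    # All of the broad data goes to training; hold out a small
    # "train-dev" slice to measure variance on the train distribution.
    broad = list(broad_data)
    rng.shuffle(broad)
    n_train_dev = max(1, len(broad) // 20)
    train_dev, train = broad[:n_train_dev], broad[n_train_dev:]
    return train, train_dev, dev, test
```

<p>The key property is that no matter how the proportions are chosen, every dev and test example comes from the distribution you will actually face in production.</p>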
<p align="center">
<img src="/assets/app_dl/errors.svg" width="450" />
</p>
<p><strong>Flowchart for working with a model.</strong> Given what we have described above, here’s a simplified flowchart of the actions you should take when confronted with training/tuning a DL model.</p>
<p align="center">
<img src="/assets/app_dl/flowachart.svg" width="500" />
</p>
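<p>The flowchart’s decision logic can be sketched as finding the largest gap between successive error rates — a simplification of Andrew’s actual flowchart, with illustrative remedy strings:</p>

```python
def diagnose(human_err, train_err, train_dev_err, dev_err):
    """Pick the dominant problem by comparing the gaps between
    successive error rates (all fractions, e.g. 0.08 for 8%)."""
    gaps = {
        "bias: try a bigger model or train longer": train_err - human_err,
        "variance: get more data or regularize": train_dev_err - train_err,
        "data mismatch: synthesize or collect data closer to dev": dev_err - train_dev_err,
    }
    # Return the problem whose error gap is largest.
    return max(gaps, key=gaps.get)
```

<p>For instance, <code>diagnose(0.01, 0.08, 0.085, 0.09)</code> flags the bias gap, since the distance from human-level to train error dwarfs the other two gaps.</p>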
<p><strong>The importance of data synthesis.</strong> Andrew also stressed the importance of data synthesis as part of any deep learning workflow. While it may be painful to manually engineer training examples, the relative gain in performance you obtain once the model and its parameters fit well is huge and worth your while.</p>
<p><a name="toc4"></a></p>
<h3 id="human-level-performance">Human-level Performance</h3>
<p>One of the very important concepts underlined in this lecture was that of human-level performance. In the basic setting, DL models tend to plateau once they have reached or surpassed human-level accuracy. While it is important to note that human-level performance doesn’t necessarily coincide with the Bayes error rate (the lowest error theoretically achievable), it can serve as a very reliable proxy that can be leveraged to determine your next move when training your model.</p>
<p align="center">
<img src="/assets/app_dl/perf.png" width="550" />
</p>
<p><strong>Reasons for the plateau.</strong> There could be a theoretical limit on the dataset which makes further improvement futile (e.g. a noisy subset of the data). Humans are also very good at these tasks, so trying to make progress beyond human-level performance suffers from diminishing returns.</p>
<p>Here’s an example that can help illustrate the usefulness of human-level accuracy. Suppose you are working on an image recognition task and measure the following:</p>
<ul>
<li><strong>Train error</strong>: 8%</li>
<li><strong>Dev Error</strong>: 10%</li>
</ul>
<p>If I were to tell you that human-level error for such a task is on the order of 1%, then this would be a blatant bias problem, and you could subsequently try increasing the size of your model, training longer, etc. However, if I told you that human-level error was on the order of 7.5%, then this would be more of a variance problem, and you’d focus your efforts on methods such as data synthesis or gathering data more similar to the test set.</p>
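<p>In code, these two scenarios reduce to comparing the avoidable-bias gap against the variance gap — a minimal sketch using the numbers from the example:</p>

```python
train_err, dev_err = 0.08, 0.10

def focus_area(human_err, train_err=train_err, dev_err=dev_err):
    avoidable_bias = train_err - human_err  # gap from human-level to train
    variance = dev_err - train_err          # gap from train to dev
    return "bias" if avoidable_bias > variance else "variance"

print(focus_area(0.010))  # human-level error 1%   -> bias
print(focus_area(0.075))  # human-level error 7.5% -> variance
```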
<p>By the way, there’s always room for improvement. Even if you are close to human-level accuracy overall, there could be subsets of the data where you perform poorly and working on those can boost production performance greatly.</p>
<p>Finally, one might ask what is a good way of defining human-level performance. For example, in the following image diagnosis setting, ignoring the cost of obtaining data, how should one pick the benchmark for human-level error?</p>
<ul>
<li><strong>typical human</strong>: 5%</li>
<li><strong>general doctor</strong>: 1%</li>
<li><strong>specialized doctor</strong>: 0.8%</li>
<li><strong>group of specialized doctors</strong>: 0.5%</li>
</ul>
<p>The answer is always the lowest error achievable, here 0.5%. This is because, as we mentioned earlier, human-level performance is a proxy for the Bayes optimal error rate, so the tighter the estimate of the best achievable error, the better you can strategize your next move.</p>
<p><a name="toc5"></a></p>
<h3 id="personal-advice">Personal Advice</h3>
<p>Andrew ended the presentation with two ways to improve your skills in the field of deep learning.</p>
<ul>
<li><strong>Practice, Practice, Practice</strong>: compete in Kaggle competitions and read associated blog posts and forum discussions.</li>
<li><strong>Do the Dirty Work</strong>: read a lot of papers and try to replicate the results. Soon enough, you’ll get your own ideas and build your own models.</li>
</ul>
Mon, 26 Sep 2016 00:00:00 +0000
http://kevinzakka.github.io/2016/09/26/applying-deep-learning/
Tags: deep learning, bias, variance, advice, end-to-end, machine learning