Kevin's Blog: Gray Matter Scratchpad
http://kevinzakka.github.io/
Wed, 28 Nov 2018 05:04:38 +0000 (Jekyll v3.7.4)

Dex-Net: Coupling Big Data and Physics-based Models for Robust Grasp Planning

<div class="imgcap">
<img src="/assets/gradient-pub/teaser.jpg" width="75%" style="border:none;" />
</div>
<p>Intelligent, dexterous robots open the door to a myriad of applications: from single-part bin-picking to universal warehouse picking at the industrial level, and from automated laundry folding to personal assistants at the consumer level. A critical component of dexterous systems is their ability to grasp and manipulate objects. Although trivial for humans, manipulating unknown objects in the cluttered, unstructured environments of the real world remains to this day a largely unsolved problem. This is because intelligent interaction requires not only the ability to generalize to unseen objects but also to react to potentially unstable and dynamic environments.</p>
<p>Automated laundry folding is a good example: the system must generalize to garments it has never seen before and react as the cloth deforms unpredictably during manipulation.</p>
<p>Recent research has made significant advances in robotic grasping, most recently through “learning-based” approaches that couple simulated or human-labeled datasets with deep neural network function approximators. One such approach, the Dexterity Network (Dex-Net), couples massive synthetic data with physics-based models to plan robust grasps. In this blog post, we’re going to take a close look at the Dex-Net papers, examining their inner workings and seeing how they’ve extended single-object grasping to bin picking with both suction and finger grasps.</p>
<h2 id="dex-net-20">Dex-Net 2.0</h2>
<p>In this first section, we’re going to be examining Dex-Net 2.0, which uses the 3D object models from Dex-Net 1.0 to tackle grasp planning. The suggested approach is to train a Grasp Quality Convolutional Neural Network (GQ-CNN) on a large synthetic dataset of depth images with associated positive and negative grasps. Then during test time, one can sample various grasps from a depth image, feed each through the GQ-CNN, pick the one with the highest probability of success, and execute the grasp open-loop.</p>
<div class="imgcap">
<img src="/assets/dexnet/teaser.png" width="75%" style="border:none;" />
</div>
<h3 id="problem-statement">Problem Statement</h3>
<p>Let’s start by introducing the variables that appear in the paper.</p>
<ul>
<li><script type="math/tex">x = (O, T_o, T_c, \gamma)</script>: the state describing the variable properties of the camera and objects in the environment, where:
<ul>
<li><script type="math/tex">O</script>: the geometry and mass properties of the object.</li>
<li><script type="math/tex">T_o, T_c</script>: 3D poses of the object and camera respectively.</li>
<li><script type="math/tex">\gamma</script>: the coefficient of friction between the object and the gripper.</li>
</ul>
</li>
<li><script type="math/tex">u = (p, \phi)</script>: a parallel-jaw grasp in 3D space, specified by a center <script type="math/tex">p = (x, y, z)</script> relative to the camera and an angle in the table plane <script type="math/tex">\phi</script>.</li>
<li><script type="math/tex">y \in \mathbb{R}^{H \times W}</script>: a point cloud represented as a depth image with height <script type="math/tex">H</script> and width <script type="math/tex">W</script>, taken by a camera with known intrinsics <script type="math/tex">K</script> and pose <script type="math/tex">T_c</script>.</li>
<li><script type="math/tex">S(u, x) \in \{0, 1\}</script>: a binary-valued grasp success metric, such as force closure.</li>
</ul>
<p>Using these random variables, we can define a joint distribution <script type="math/tex">p(S, x, u, y)</script> that models the inherent uncertainty in our assumptions, such as erroneous sensor readings (calibration error, noise, the limitations of the pinhole model, etc.) and imprecise control (kinematic inaccuracies, etc.).</p>
<p><strong>Goal.</strong> Ingest a depth image <script type="math/tex">y</script> of an object in a scene and an associated grasp candidate <script type="math/tex">u</script>, and output the probability that <script type="math/tex">u</script> will succeed under the above uncertainties. This is equivalent to predicting the <strong>robustness</strong> <script type="math/tex">Q</script> of a grasp, defined as the expected value of <script type="math/tex">S</script> conditioned on <script type="math/tex">u</script> and <script type="math/tex">y</script>, i.e. <script type="math/tex">Q(u, y) = \mathbb{E}[S \vert u, y]</script>.</p>
<p><strong>Solution.</strong> Use a neural network with weights <script type="math/tex">\theta</script> to approximate the complex, high-dimensional function <script type="math/tex">Q</script>. Concretely,</p>
<script type="math/tex; mode=display">\hat{\theta} = \arg \min_{\theta} \ \mathbb{E}_{p(S, u, x, y)} \big[L(S, Q_{\theta}(u, y)) \big]</script>
<p>And finally, using Monte-Carlo sampling of input-output pairs from our joint distribution, we obtain:</p>
<script type="math/tex; mode=display">\begin{equation}
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} L(S_i, Q_{\theta}(u_i, y_i))
\quad\text{with}\quad
(S_i, u_i, x_i, y_i) \sim p(S, x, u, y)
\end{equation}</script>
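<p>As a concrete (and heavily simplified) illustration of this Monte-Carlo objective, the sketch below minimizes the empirical cross-entropy loss over sampled tuples using a logistic model as a stand-in for the GQ-CNN. The data, feature dimensions, and learning rate are all made up for illustration; the real pipeline uses rendered depth images and a deep network.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for sampling (S_i, u_i, y_i) ~ p(S, x, u, y): each "grasp" is a
# small feature vector and S follows a noisy linear rule. Purely illustrative.
N, D = 512, 8
feats = rng.normal(size=(N, D))          # features extracted from (u_i, y_i)
true_w = rng.normal(size=D)
S = (feats @ true_w + 0.1 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimize sum_i L(S_i, Q_theta(u_i, y_i)) with L the cross-entropy loss and
# Q_theta a logistic model, via plain gradient descent.
theta = np.zeros(D)
for _ in range(500):
    Q = sigmoid(feats @ theta)
    grad = feats.T @ (Q - S) / N         # gradient of the mean cross-entropy
    theta -= 1.0 * grad

acc = np.mean((sigmoid(feats @ theta) > 0.5) == S.astype(bool))
```

The same recipe scales up directly: replace the logistic model with a CNN and the synthetic tuples with the rendered dataset described below.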
<h3 id="generative-graphical-model">Generative Graphical Model</h3>
<p>We can think of our joint <script type="math/tex">p(S, x, u, y)</script> as a generative model of images, grasps and success metrics. The relationship between the different variables is illustrated in the graphical model below.</p>
<div class="imgcap">
<img src="/assets/dexnet/gm.png" width="45%" style="border:none;" />
<div class="thecap" style="text-align:center">Graphical Model</div>
</div>
<p>Using the <a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)">chain rule</a>, we can express the joint <script type="math/tex">p(S, x, u, y)</script> as the product of 4 terms: <script type="math/tex">p(S \vert u, y, x)</script>, <script type="math/tex">p(u \vert x, y)</script>, <script type="math/tex">p(y \vert x)</script> and <script type="math/tex">p(x)</script>. And since <script type="math/tex">S</script> and <script type="math/tex">u</script> are conditionally independent of <script type="math/tex">y</script> given <script type="math/tex">x</script> (no arrow going from <script type="math/tex">y</script> to <script type="math/tex">S</script> or <script type="math/tex">u</script>), we can reduce the expression to</p>
<script type="math/tex; mode=display">p(S, u, y, x) = {\color{red}{p(S \vert u, x)}} \cdot {\color{orange}{p(u \vert x)}} \cdot {\color{blue}{p(y \vert x)}} \cdot {\color{green}{p(x)}}</script>
<p>where:</p>
<ul>
<li><script type="math/tex">{\color{green}{p(x)}}</script> is the state distribution.</li>
<li><script type="math/tex">{\color{blue}{p(y \vert x)}}</script> is the observation model, conditioned on the current state.</li>
<li><script type="math/tex">{\color{orange}{p(u \vert x)}}</script> is the grasp candidate model, conditioned on the current state.</li>
<li><script type="math/tex">{\color{red}{p(S \vert u, x)}}</script> is the analytic model of grasp success conditioned on the grasp candidate and current state.</li>
</ul>
<p>The state <script type="math/tex">x = (O, T_o, T_c, \gamma)</script> is represented by the blue nodes in the graphical model. Using the chain rule and independence properties, we can express its underlying distribution as the product of:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
{\color{green}{p(x)}}
&= p(\gamma \vert T_c, T_o, O) \cdot p(T_c \vert T_o, O) \cdot p(T_o \vert O) \cdot p(O) \\
&= p(\gamma) \cdot p(T_c) \cdot p(T_o \vert O) \cdot p(O)
\end{align*} %]]></script>
<p>with:</p>
<ul>
<li><script type="math/tex">p(\gamma)</script>: truncated Gaussian over friction coefficients.</li>
<li><script type="math/tex">p(O)</script>: discrete uniform distribution over 3D object models.</li>
<li><script type="math/tex">p(T_o \vert O)</script>: continuous uniform distribution over discrete set of stable object poses.</li>
<li><script type="math/tex">p(T_c)</script>: continuous uniform distribution over camera poses expressed in spherical coordinates.</li>
</ul>
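<p>Putting these four factors together, sampling a state <script type="math/tex">x</script> can be sketched as below. The object list, pose sets, and distribution parameters are hypothetical placeholders, not the values used in the paper.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the dataset contents.
object_ids = ["mug", "bottle", "box"]                  # p(O): discrete uniform
stable_poses = {o: [0.0, np.pi / 2, np.pi] for o in object_ids}

def sample_state():
    # p(O): uniform over 3D object models.
    O = object_ids[rng.integers(len(object_ids))]
    # p(T_o | O): uniform over that object's discrete set of stable poses.
    T_o = rng.choice(stable_poses[O])
    # p(T_c): uniform over camera poses in spherical coordinates.
    radius = rng.uniform(0.5, 0.7)
    azimuth = rng.uniform(0.0, 2 * np.pi)
    polar = rng.uniform(0.0, np.pi / 6)
    # p(gamma): truncated Gaussian over friction coefficients (clipped at 0).
    gamma = max(0.0, rng.normal(0.5, 0.1))
    return O, T_o, (radius, azimuth, polar), gamma

x = sample_state()
```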
<p>The grasp candidate model <script type="math/tex">{\color{orange}{p(u \vert x)}}</script> is a uniform distribution over pairs of antipodal contact points on the object surface whose grasp axis is parallel to the table plane (we want top-down grasps), the observation model <script type="math/tex">{\color{blue}{p(y \vert x)}}</script> is a rendered depth image of the scene corrupted with multiplicative and Gaussian Process noise, and the success model <script type="math/tex">{\color{red}{p(S \vert u, x)}}</script> is a binary-valued reward function subject to 2 constraints: epsilon quality and collision freedom.</p>
<p>Now that we’ve examined the inner workings of our generative model <script type="math/tex">p</script>, let’s see how we can use it to generate the massive Dex-Net dataset.</p>
<h3 id="generating-dex-net-20">Generating Dex-Net 2.0</h3>
<p>To train our GQ-CNN, we need to generate i.i.d samples, consisting of depth images, grasps, and grasp robustness labels, by sampling from the generative joint <script type="math/tex">p(S, x, u, y)</script>.</p>
<div class="imgcap">
<img src="/assets/dexnet/data-gen.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Data Generation Pipeline</div>
</div>
<ol>
<li>Randomly select, from a database of 1,500 meshes, a 3D object mesh using a discrete uniform distribution.</li>
<li>Randomly select, from a set of stable poses, a pose for this object using a continuous uniform distribution.</li>
<li>Randomly sample the camera pose (also from a continuous uniform distribution) and use it to render the object and its pose w.r.t. the camera into a depth image using ray tracing.</li>
<li>Use rejection sampling to generate top-down parallel-jaw grasps covering the surface of the object.</li>
<li>Classify the robustness of each sampled grasp to obtain a set of positive and negative grasps. Robustness is estimated as the probability of force closure, which is a function of object pose, gripper pose, and friction coefficient uncertainty.</li>
</ol>
<h3 id="antipodal-grasp-sampling">Antipodal Grasp Sampling</h3>
<p>In this section, we detail the algorithm used to sample parallel-jaw grasp candidates from a depth image during the data generation process.</p>
<p>First, we perform edge detection by locating pixel areas with high gradient magnitude. This is especially useful since graspable regions usually correspond to contact points on opposite edges of an object.</p>
<div class="imgcap">
<img src="/assets/gradient-pub/edge-detection.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">High gradient area = edges</div>
</div>
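<p>The edge-detection step amounts to thresholding the depth gradient magnitude. A minimal sketch on a toy depth image (a raised square on a flat table; all values invented for illustration):</p>

```python
import numpy as np

# Toy depth image: a raised square "object" on a flat table.
depth = np.full((64, 64), 0.70)       # table at 0.70 m from the camera
depth[24:40, 24:40] = 0.60            # object top is 10 cm closer

# Depth gradient magnitude; edges are where it is large.
gy, gx = np.gradient(depth)
grad_mag = np.sqrt(gx ** 2 + gy ** 2)

# Threshold to get candidate edge pixels for antipodal pair sampling.
edge_mask = grad_mag > 0.01
edge_pixels = np.argwhere(edge_mask)
```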
<p>Then we sample pairs of pixels belonging to these areas to generate antipodal contact points on the object. These pairs need to be parallel to the table plane. For example, in the image below, the red points are invalid because the two pixels are located at different heights and would produce an angled grasp.</p>
<div class="imgcap">
<img src="/assets/gradient-pub/cands.gif" width="60%" style="border:none;" />
<div class="thecap" style="text-align:center">Random Pairs of Antipodal Points</div>
</div>
<p>We repeat this step until we reach the desired number of grasps, potentially increasing the friction coefficient if the yield is insufficient. In the final step, 2D grasps are deprojected into 3D grasps using the camera intrinsics and extrinsics, and multiple grasps are obtained from the same contact points by discretizing the height from the object surface down to the table surface (<script type="math/tex">h = 0</script>).</p>
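<p>Deprojection is the standard pinhole inverse: multiply the homogeneous pixel coordinates by the inverse intrinsics and scale by depth. A sketch with hypothetical intrinsics (the values below are typical for a 640x480 depth camera, not taken from the paper):</p>

```python
import numpy as np

# Pinhole intrinsics (hypothetical values for a 640x480 depth camera).
fx, fy, cx, cy = 525.0, 525.0, 320.0, 240.0
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1.0]])

def deproject(i, j, z):
    """Deproject pixel (row i, col j) at depth z into the camera frame."""
    uv1 = np.array([j, i, 1.0])          # homogeneous pixel coordinates
    return z * np.linalg.inv(K) @ uv1    # 3D point (x, y, z) in camera frame

# A 2D grasp centered at the principal point with gripper depth 0.65 m maps
# to a point on the optical axis; the grasp angle phi completes the 3D grasp.
p = deproject(240, 320, 0.65)
```

A camera-to-world extrinsic transform would then carry this point into the table frame before discretizing the grasp height.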
<h3 id="gq-cnn">GQ-CNN</h3>
<p>Once the synthetic dataset has been generated, it becomes trivial to train the network.</p>
<div class="imgcap">
<img src="/assets/dexnet/model.png" width="65%" style="border:none;" />
<div class="thecap" style="text-align:center">Overview of the Model</div>
</div>
<p>Remember how we mentioned that the GQ-CNN takes as input a depth image and a grasp candidate? It turns out that the authors have a very clever way of encoding the grasp information into the depth image: given a depth image and a grasp candidate, they transform the image such that the grasp pixel location <script type="math/tex">(i, j)</script> – projected from the grasp position <script type="math/tex">(x, y)</script> – is aligned with the image center and the grasp axis <script type="math/tex">\phi</script> corresponds to the middle row of the image. Then, at every iteration of SGD, we sample a transformed depth image and the remaining grasp variable <script type="math/tex">z</script> (i.e. the gripper depth from the camera), normalize the depth image to zero mean and unit standard deviation, and feed the tuple to the 18M-parameter GQ-CNN model.</p>
<p><strong>Note 1.</strong> The model is a typical deep learning architecture composed of convolutional, max-pool and fully-connected primitives.</p>
<p><strong>Note 2.</strong> The depth alignment makes it easier for the model to train since it doesn’t have to worry about any rotational invariances. As for feeding the gripper depth to the model, I would think this is useful for pruning grasps that have the correct 2D position and orientation, but are too far away from the object (i.e. either not touching or barely touching).</p>
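<p>The grasp-centric alignment above is just a translation followed by a rotation of the depth image. A minimal nearest-neighbor sketch (the paper's implementation, interpolation scheme, and rotation convention may differ):</p>

```python
import numpy as np

def align_to_grasp(depth, center, phi):
    """Map the grasp `center` to the image center and rotate so the grasp
    axis lies along the middle row (nearest-neighbor resampling)."""
    H, W = depth.shape
    out = np.zeros_like(depth)
    ci, cj = (H - 1) / 2.0, (W - 1) / 2.0
    cos, sin = np.cos(phi), np.sin(phi)
    for i in range(H):
        for j in range(W):
            # Inverse map: output pixel -> source pixel around the grasp center.
            di, dj = i - ci, j - cj
            si = int(round(center[0] + cos * di - sin * dj))
            sj = int(round(center[1] + sin * di + cos * dj))
            if 0 <= si < H and 0 <= sj < W:
                out[i, j] = depth[si, sj]
    return out

# Toy 9x9 depth image whose value equals its row index.
depth = np.add.outer(np.arange(9.0), np.zeros(9))
aligned = align_to_grasp(depth, center=(2, 3), phi=0.0)
```

After alignment, the network only ever sees grasps in one canonical position and orientation, which is exactly why it doesn't have to learn rotational invariance.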
<h3 id="grasp-planning-inference-time">Grasp Planning (Inference Time)</h3>
<p>Once the model is trained, we can pair the GQ-CNN with a policy of choice. The one used in the paper is <script type="math/tex">\pi_{\theta}(y) = \arg \max_{u \in C} Q_{\theta}(u, y)</script>, which amounts to sampling a set of grasp candidates from a depth image subject to a set of constraints <script type="math/tex">C</script> (e.g. kinematic and collision constraints), scoring each grasp using the GQ-CNN, and finally executing the most robust grasp.</p>
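<p>In code, this argmax policy is a one-liner over the candidate set. Everything below is a hypothetical stand-in: <code>gqcnn_score</code> is a placeholder for the real network's forward pass, and the candidate tuples are invented.</p>

```python
# Hypothetical stand-ins: a scoring function and a candidate set C that
# already satisfies the kinematic/collision constraints.
def gqcnn_score(grasp, depth_image):
    # Placeholder robustness score; a real GQ-CNN forward pass goes here.
    center, phi, z = grasp
    return 1.0 - abs(z - 0.65)       # prefers grasps near a plausible depth

candidates = [
    ((120, 200), 0.3, 0.60),
    ((118, 198), 0.2, 0.65),
    ((90, 40), 1.1, 0.90),
]
depth_image = None                   # unused by the placeholder score

# pi_theta(y) = argmax over u in C of Q_theta(u, y)
best = max(candidates, key=lambda u: gqcnn_score(u, depth_image))
```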
<p><strong>Cross Entropy Method.</strong></p>
<div class="imgcap">
<img src="/assets/dexnet/cem.png" width="75%" style="border:none;" />
<div class="thecap" style="text-align:center">Evolution of grasp robustness as the gripper center sweeps the depth image from top to bottom.</div>
</div>
<p>Randomly choosing a grasp from a set of candidates doesn’t work very well in cases where the grasping regions are small and require very precise gripper configurations. Taking a look at the image above, we can see that as we sweep candidate grasps from top to bottom, grasp robustness stays near zero and spikes momentarily when we reach the good, yet narrow grasping area. Thus, uniform sampling of grasp candidates is inefficient especially since we’re trying to perform real-time grasp planning.</p>
<p>This is where importance sampling – one of <a href="https://kevinzakka.github.io/2018/09/28/prioritized-learning/">my favorite</a> techniques – can help! We can modify our sampling strategy such that at every iteration, we refit the candidate distribution to the grasps with the highest predicted robustness. The algorithm to perform this fitting is the cross-entropy method (CEM) which tries to minimize the cross-entropy between a mixture of Gaussians and the top-k percentile of grasps ranked by GQ-CNN. The result is that at every iteration, we are more likely to sample grasps with high-robustness values (grasps in the spike area) and converge to an optimal grasp candidate.</p>
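<p>A minimal 1D CEM sketch, with a single Gaussian instead of the paper's mixture and a toy robustness spike standing in for GQ-CNN scores (all numbers invented): sample candidates, keep the elite top percentile, and refit the distribution to the elites.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def robustness(u):
    """Toy 1D stand-in for GQ-CNN scores: a narrow spike of good grasps."""
    return np.exp(-((u - 0.62) ** 2) / (2 * 0.01 ** 2))

# Cross-entropy method with a single Gaussian (the paper refits a GMM).
mu, sigma = 0.5, 0.2
for _ in range(10):
    samples = rng.normal(mu, sigma, size=64)
    scores = robustness(samples)
    elites = samples[scores >= np.quantile(scores, 0.8)]  # top 20%
    mu, sigma = elites.mean(), elites.std() + 1e-6        # refit to elites
```

After a few iterations the sampling distribution collapses onto the narrow high-robustness region that uniform sampling would rarely hit.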
<div class="imgcap">
<img src="/assets/gradient-pub/sampled.gif" width="75%" style="border:none;" />
</div>
<h3 id="discussion">Discussion</h3>
<ul>
<li>The sampling of grasps is inefficient. It would be interesting to extend the GQ-CNN to a fully-convolutional architecture where robustness labels can be computed for every pixel in the depth image in a single forward pass.</li>
<li>Dex-Net is open-loop which means that once a grasp candidate has been picked, it is executed blindly with no visual feedback. This sets it up for failure when camera calibration is imprecise or the environment it is placed in is dynamic and susceptible to change.</li>
<li>If we can speed-up Dex-Net by creating a smaller, fully-convolutional GQ-CNN, we may be able to run it at a high enough frequency to incorporate visual feedback and close the loop.</li>
</ul>
Tue, 20 Nov 2018 00:00:00 +0000
http://kevinzakka.github.io/2018/11/20/gradient-dexnet/
Tags: grasping, robotics, cnn

Dex-Net 2.0: Deep Learning to Plan Robust Grasps

<p>In this blog post, we’re going to take a close look at <a href="https://arxiv.org/abs/1703.09312">Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics</a> by <em>Jeffrey Mahler</em>, <em>Jacky Liang</em>, <em>Sherdil Niyaz</em>, <em>Michael Laskey</em>, <em>Richard Doan</em>, <em>Xinyu Liu</em>, <em>Juan Aparicio Ojea</em>, and <em>Ken Goldberg</em>.</p>
<div class="imgcap">
<img src="/assets/dexnet/teaser.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Overview</div>
</div>
<p><strong>TL, DR.</strong> This paper tackles grasp planning which is the task of finding a gripper configuration (pose and width) that maximizes a success metric subject to kinematic and collision constraints. The suggested approach is to train a Grasp Quality Convolutional Neural Network (GQ-CNN) on a large synthetic dataset of depth images with associated positive and negative grasps. Then during test time, one can sample various grasps from a depth image, feed each through the GQ-CNN, pick the one with the highest probability of success, and execute the grasp open-loop.</p>
<h3 id="variables">Variables</h3>
<p>Let’s start by introducing the variables that appear in the paper.</p>
<ul>
<li><script type="math/tex">x = (O, T_o, T_c, \gamma)</script>: the state describing the variable properties of the camera and objects in the environment, where:
<ul>
<li><script type="math/tex">O</script>: the geometry and mass properties of the object.</li>
<li><script type="math/tex">T_o, T_c</script>: 3D poses of the object and camera respectively.</li>
<li><script type="math/tex">\gamma</script>: the coefficient of friction between the object and the gripper.</li>
</ul>
</li>
<li><script type="math/tex">u = (p, \phi)</script>: a parallel-jaw grasp in 3D space, specified by a center <script type="math/tex">p = (x, y, z)</script> relative to the camera and an angle in the table plane <script type="math/tex">\phi</script>.</li>
<li><script type="math/tex">y \in \mathbb{R}^{H \times W}</script>: a point cloud represented as a depth image with height <script type="math/tex">H</script> and width <script type="math/tex">W</script>, taken by a camera with known intrinsics <script type="math/tex">K</script> and pose <script type="math/tex">T_c</script>.</li>
<li><script type="math/tex">S(u, x) \in \{0, 1\}</script>: a binary-valued grasp success metric, such as force closure.</li>
</ul>
<p>Using these random variables, we can define a joint distribution <script type="math/tex">p(S, x, u, y)</script> that models the inherent uncertainty in our assumptions, such as erroneous sensor readings (calibration error, noise, the limitations of the pinhole model, etc.) and imprecise control (kinematic inaccuracies, etc.).</p>
<p><strong>Goal.</strong> Ingest a depth image <script type="math/tex">y</script> of an object in a scene and an associated grasp candidate <script type="math/tex">u</script>, and output the probability that <script type="math/tex">u</script> will succeed under the above uncertainties. This is equivalent to predicting the <strong>robustness</strong> <script type="math/tex">Q</script> of a grasp, defined as the expected value of <script type="math/tex">S</script> conditioned on <script type="math/tex">u</script> and <script type="math/tex">y</script>, i.e. <script type="math/tex">Q(u, y) = \mathbb{E}[S \vert u, y]</script>.</p>
<p><strong>Solution.</strong> Use a neural network with weights <script type="math/tex">\theta</script> to approximate the complex, high-dimensional function <script type="math/tex">Q</script>. Concretely,</p>
<script type="math/tex; mode=display">\hat{\theta} = \arg \min_{\theta} \ \mathbb{E}_{p(S, u, x, y)} \big[L(S, Q_{\theta}(u, y)) \big]</script>
<p>And finally, using Monte-Carlo sampling of input-output pairs from our joint distribution, we obtain:</p>
<script type="math/tex; mode=display">\begin{equation}
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} L(S_i, Q_{\theta}(u_i, y_i))
\quad\text{with}\quad
(S_i, u_i, x_i, y_i) \sim p(S, x, u, y)
\end{equation}</script>
<h3 id="generative-graphical-model">Generative Graphical Model</h3>
<p>We can think of our joint <script type="math/tex">p(S, x, u, y)</script> as a generative model of images, grasps and success metrics. The relationship between the different variables is illustrated in the graphical model below.</p>
<div class="imgcap">
<img src="/assets/dexnet/gm.png" width="45%" style="border:none;" />
<div class="thecap" style="text-align:center">Graphical Model</div>
</div>
<p>Using the <a href="https://en.wikipedia.org/wiki/Chain_rule_(probability)">chain rule</a>, we can express the joint <script type="math/tex">p(S, x, u, y)</script> as the product of 4 terms: <script type="math/tex">p(S \vert u, y, x)</script>, <script type="math/tex">p(u \vert x, y)</script>, <script type="math/tex">p(y \vert x)</script> and <script type="math/tex">p(x)</script>. And since <script type="math/tex">S</script> and <script type="math/tex">u</script> are conditionally independent of <script type="math/tex">y</script> given <script type="math/tex">x</script> (no arrow going from <script type="math/tex">y</script> to <script type="math/tex">S</script> or <script type="math/tex">u</script>), we can reduce the expression to</p>
<script type="math/tex; mode=display">p(S, u, y, x) = {\color{red}{p(S \vert u, x)}} \cdot {\color{orange}{p(u \vert x)}} \cdot {\color{blue}{p(y \vert x)}} \cdot {\color{green}{p(x)}}</script>
<p>where:</p>
<ul>
<li><script type="math/tex">{\color{green}{p(x)}}</script> is the state distribution.</li>
<li><script type="math/tex">{\color{blue}{p(y \vert x)}}</script> is the observation model, conditioned on the current state.</li>
<li><script type="math/tex">{\color{orange}{p(u \vert x)}}</script> is the grasp candidate model, conditioned on the current state.</li>
<li><script type="math/tex">{\color{red}{p(S \vert u, x)}}</script> is the analytic model of grasp success conditioned on the grasp candidate and current state.</li>
</ul>
<p>The state <script type="math/tex">x = (O, T_o, T_c, \gamma)</script> is represented by the blue nodes in the graphical model. Using the chain rule and independence properties, we can express its underlying distribution as the product of:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
{\color{green}{p(x)}}
&= p(\gamma \vert T_c, T_o, O) \cdot p(T_c \vert T_o, O) \cdot p(T_o \vert O) \cdot p(O) \\
&= p(\gamma) \cdot p(T_c) \cdot p(T_o \vert O) \cdot p(O)
\end{align*} %]]></script>
<p>with:</p>
<ul>
<li><script type="math/tex">p(\gamma)</script>: truncated Gaussian over friction coefficients.</li>
<li><script type="math/tex">p(O)</script>: discrete uniform distribution over 3D object models.</li>
<li><script type="math/tex">p(T_o \vert O)</script>: continuous uniform distribution over discrete set of stable object poses.</li>
<li><script type="math/tex">p(T_c)</script>: continuous uniform distribution over camera poses expressed in spherical coordinates.</li>
</ul>
<p>The grasp candidate model <script type="math/tex">{\color{orange}{p(u \vert x)}}</script> is a uniform distribution over pairs of antipodal contact points on the object surface whose grasp axis is parallel to the table plane (we want top-down grasps), the observation model <script type="math/tex">{\color{blue}{p(y \vert x)}}</script> is a rendered depth image of the scene corrupted with multiplicative and Gaussian Process noise, and the success model <script type="math/tex">{\color{red}{p(S \vert u, x)}}</script> is a binary-valued reward function subject to 2 constraints: epsilon quality and collision freedom.</p>
<p>Now that we’ve examined the inner workings of our generative model <script type="math/tex">p</script>, let’s see how we can use it to generate the massive Dex-Net dataset.</p>
<h3 id="generating-dex-net">Generating Dex-Net</h3>
<p>To train our GQ-CNN, we need to generate i.i.d samples, consisting of depth images, grasps, and grasp robustness labels, by sampling from the generative joint <script type="math/tex">p(S, x, u, y)</script>.</p>
<div class="imgcap">
<img src="/assets/dexnet/data-gen.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Data Generation Pipeline</div>
</div>
<ol>
<li>Randomly select, from a database of 1,500 meshes, a 3D object mesh using a discrete uniform distribution.</li>
<li>Randomly select, from a set of stable poses, a pose for this object using a continuous uniform distribution.</li>
<li>Use rejection sampling to generate top-down parallel-jaw grasps covering the surface of the object.</li>
<li>Randomly sample the camera pose (also from a continuous uniform distribution) and use it to render the object and its pose w.r.t. the camera into a depth image using ray tracing.</li>
<li>Classify the robustness of each sampled grasp to obtain a set of positive and negative grasps. Robustness is estimated as the probability of force closure, which is a function of object pose, gripper pose, and friction coefficient uncertainty.</li>
</ol>
<h3 id="training-the-gq-cnn">Training the GQ-CNN</h3>
<p>Once the synthetic dataset has been generated, it becomes trivial to train the network.</p>
<div class="imgcap">
<img src="/assets/dexnet/model.png" width="65%" style="border:none;" />
<div class="thecap" style="text-align:center">Overview of the Model</div>
</div>
<p>Remember how we mentioned that the GQ-CNN takes as input a depth image and a grasp candidate? It turns out that the authors have a very clever way of encoding the grasp information into the depth image: given a depth image and a grasp candidate, they transform the image such that the grasp pixel location <script type="math/tex">(i, j)</script> – projected from the grasp position <script type="math/tex">(x, y)</script> – is aligned with the image center and the grasp axis <script type="math/tex">\phi</script> corresponds to the middle row of the image. Then, at every iteration of SGD, we sample a transformed depth image and the remaining grasp variable <script type="math/tex">z</script> (i.e. the gripper depth from the camera), normalize the depth image to zero mean and unit standard deviation, and feed the tuple to the 18M-parameter GQ-CNN model.</p>
<p><strong>Note 1.</strong> The model is a typical deep learning architecture composed of convolutional, max-pool and fully-connected primitives.</p>
<p><strong>Note 2.</strong> The depth alignment makes it easier for the model to train since it doesn’t have to worry about any rotational invariances. As for feeding the gripper depth to the model, I would think this is useful for pruning grasps that have the correct 2D position and orientation, but are too far away from the object (i.e. either not touching or barely touching).</p>
<h3 id="grasp-planning-inference-time">Grasp Planning (Inference Time)</h3>
<p>Once the model is trained, we can pair the GQ-CNN with a policy of choice. The one used in the paper is <script type="math/tex">\pi_{\theta}(y) = \arg \max_{u \in C} Q_{\theta}(u, y)</script>, which amounts to sampling a set of grasp candidates from a depth image subject to a set of constraints <script type="math/tex">C</script> (e.g. kinematic and collision constraints), scoring each grasp using the GQ-CNN, and finally executing the most robust grasp. There are two sampling strategies used to generate grasp candidates: antipodal grasp sampling and cross-entropy sampling.</p>
<p><strong>Antipodal Grasp Sampling.</strong></p>
<p>First, we perform edge detection by locating pixel areas with high gradient magnitude. This is especially useful since graspable regions usually correspond to contact points on opposite edges of an object.</p>
<div class="imgcap">
<img src="/assets/dexnet/edge-detection.png" width="100%" style="border:none;" />
</div>
<p>Then we sample pairs of pixels belonging to these areas to generate antipodal contact points on the object. We enforce the constraint that each pair is parallel to the table plane.</p>
<div class="imgcap">
<img src="/assets/dexnet/cands.gif" width="50%" style="border:none;" />
</div>
<p>We repeat this step until we reach the desired number of grasps, potentially increasing the friction coefficient if the yield is insufficient. In the final step, 2D grasps are deprojected into 3D grasps using the camera intrinsics and extrinsics, and multiple grasps are obtained from the same contact points by discretizing the height from the object surface down to the table surface (<script type="math/tex">h = 0</script>).</p>
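<p>The height discretization in that final step can be sketched as follows; the surface height and step size are hypothetical values, not the paper's.</p>

```python
import numpy as np

# Hypothetical values: grasp axis sits 8 cm above the table, and we try
# gripper depths every 1 cm from the object surface down to the table (h = 0).
surface_height = 0.08
step = 0.01
heights = np.arange(surface_height, 0.0 - 1e-9, -step)

# Each height yields a distinct 3D grasp sharing the same contact points.
grasps_3d = [("contact_pair_0", h) for h in heights]
```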
<p><strong>Cross Entropy Method.</strong></p>
<div class="imgcap">
<img src="/assets/dexnet/cem.png" width="75%" style="border:none;" />
<div class="thecap" style="text-align:center">Evolution of grasp robustness as the gripper center sweeps the depth image from top to bottom.</div>
</div>
<p>Randomly choosing a grasp from a set of candidates doesn’t work very well in cases where the grasping regions are small and require very precise gripper configurations. Taking a look at the image above, we can see that as we sweep candidate grasps from top to bottom, grasp robustness stays near zero and spikes momentarily when we reach the good, yet narrow grasping area. Thus, uniform sampling of grasp candidates is inefficient especially since we’re trying to perform real-time grasp planning.</p>
<p>This is where importance sampling – one of <a href="https://kevinzakka.github.io/2018/09/28/prioritized-learning/">my favorite</a> techniques – can help! We can modify our sampling strategy such that at every iteration, we refit the candidate distribution to the grasps with the highest predicted robustness. The algorithm to perform this fitting is the cross-entropy method (CEM) which tries to minimize the cross-entropy between a mixture of Gaussians and the top-k percentile of grasps ranked by GQ-CNN. The result is that at every iteration, we are more likely to sample grasps with high-robustness values (grasps in the spike area) and converge to an optimal grasp candidate. This fitting process is illustrated below.</p>
<div class="imgcap">
<img src="/assets/dexnet/sampled.gif" width="50%" style="border:none;" />
</div>
<h3 id="discussion">Discussion</h3>
<ul>
<li>The sampling of grasps is inefficient. It would be interesting to extend the GQ-CNN to a fully-convolutional architecture where robustness labels can be computed for every pixel in the depth image in a single forward pass.</li>
<li>Dex-Net is open-loop which means that once a grasp candidate has been picked, it is executed blindly with no visual feedback. This sets it up for failure when camera calibration is imprecise or the environment it is placed in is dynamic and susceptible to change.</li>
<li>If we can speed-up Dex-Net by creating a smaller, fully-convolutional GQ-CNN, we may be able to run it at a high enough frequency to incorporate visual feedback and close the loop.</li>
</ul>
Mon, 05 Nov 2018 00:00:00 +0000
http://kevinzakka.github.io/2018/11/05/dexnet/
Tags: grasping, robotics, cnn

Learning What to Learn and When to Learn It

<p><strong>Note</strong>. This blog post is a work in progress. There are a few experiments left to run to fill out the last 2 sections.</p>
<!-- > The man who moves a mountain begins by carrying away small stones. (Confucius) -->
<div class="imgcap">
<img src="/assets/pr-lr/construction.jpg" width="75%" />
<div class="thecap" style="text-align:center"><a href="https://hiddenincatours.com/great-pyramids-of-egypt-astonishing-precision-in-plain-sight/">The Great Pyramids of Egypt</a></div>
</div>
<p>Hello world! I’m coming out of hibernation after 14 months of radio silence on this blog. I have a lot of things to blog about, from my research internship at Stanford University this past summer, to wrapping up my B.Eng. in EE in July – and I’ll hopefully get to those in future blog posts – but today, I’d like to talk about some of the cool research I did in my senior year of undergrad. Unfortunately, it’s not GAN/RL related (read as fortunately) but it’s definitely an interesting aspect of the field that could use some more attention.</p>
<p>The problem we’ll be investigating today is whether we can get Deep Neural Networks (DNNs) to converge faster and learn more efficiently. In particular, we’ll try to answer the following questions:</p>
<ul>
<li>Do we <em>really</em> need all the training samples in a dataset to reach a desired accuracy?</li>
<li>Can we do better than (lazy) uniform sampling of the data in a given training epoch?</li>
</ul>
<p>It actually turns out that on MNIST, we can reliably speed up training by a factor of 2 using just 30% of the available data<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>!</p>
<p><strong>NB:</strong> I’ll be linking to various jupyter notebooks throughout this blog post. If you want to check them out along with any code that appears on this page, visit my <a href="https://github.com/kevinzakka/blog-code/tree/master/pr-lr">Github Repository</a>.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Motivation</a></li>
<li><a href="#toc2">Refresher</a>
<ul>
<li><a href="#toc3">Stochastic Gradient Descent</a></li>
<li><a href="#toc4">Importance Sampling</a></li>
</ul>
</li>
<li><a href="#toc5">Quantifying Sample Importance</a></li>
<li><a href="#toc6">Loss Patterns</a></li>
<li><a href="#toc7">SGD on Steroids</a>
<ul>
<li><a href="#toc8">Mini-Batch Resampling</a></li>
<li><a href="#toc9">Auxiliary Model</a></li>
</ul>
</li>
<li><a href="#toc10">Things I Wish I Tried</a></li>
<li><a href="#toc11">Closing Thoughts</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="motivation">Motivation</h2>
<p>Human beings acquire knowledge in a unique way, accelerating their learning by choosing where and when to focus their efforts on the available training material. For example, when practicing a new musical composition, a pianist will spend more time on the difficult measures – breaking them down into manageable pieces that can be progressively mastered – rather than wasting her efforts on the simpler, more familiar parts.</p>
<div class="imgcap">
<img src="/assets/pr-lr/music-sheet-bach.jpg" width="75%" />
<div class="thecap" style="text-align:center"><a href="https://www.thestrad.com/yehudi-menuhins-marked-up-copy-of-bachs-solo-violin-sonata-no2/6651.article">Annotated Copy of Bach’s Solo Violin Sonata No. 2</a></div>
</div>
<p>Much of the same can be said about our formal primary and secondary education: our teachers help us learn from a smart selection of examples, leveraging previously acquired concepts to help guide our learning of new tools and abstractions. Human learning thus exhibits <strong>resource</strong> and <strong>time</strong> efficiency: we become proficient at mastering new concepts by selecting first, a <em>subset</em> of what is available to us in terms of learning material, and second, the <em>sequence</em> in which to learn the selected items such that we minimize acquisition time.</p>
<p>Unfortunately, the training algorithms we use in AI, unlike human learning, are data hungry and time consuming. With vanilla stochastic gradient descent (SGD) for example, the standard go-to optimizer, we repetitively iterate over the training data in sequential mini-batches for a large number of epochs, where a mini-batch is constructed by uniformly sampling <script type="math/tex">b</script> training points from the dataset. On large datasets – a necessity for good generalization – the naiveté of this sampling strategy hinders convergence and bottlenecks computation.</p>
<p><a name="toc2"></a></p>
<h2 id="refresher">Refresher</h2>
<p>So how can we improve SGD? Can we replace uniform sampling with a more efficient sampling distribution? More specifically, can we somehow predict a sample’s importance such that we adaptively construct training batches that catalyze more learning-per-iteration? These are all excellent questions we’ll be tackling further in the post, so let’s begin by refreshing a few concepts.</p>
<p><a name="toc3"></a>
<strong>Stochastic Gradient Descent.</strong> Given a neural network <script type="math/tex">M</script> parameterized by a set of weights <script type="math/tex">W</script>, a dataset <script type="math/tex">\mathcal{D}</script>, and a loss function <script type="math/tex">L</script>, we can express the goal of training as finding the optimal set of weights <script type="math/tex">\hat{W}</script> such that,</p>
<script type="math/tex; mode=display">\hat{W} = \arg \min_{W} \ L_{\mathcal{D}} = \arg \min_{W} \ \frac{1}{B} \sum_{i=1}^{B} L_i = \ \arg \min_{W} \frac{1}{B} \sum_{i=1}^{B} \frac{1}{b} \sum_{j=1}^{b} L_{ij} \big( M(x_j; W), y_j \big)</script>
<p>where <script type="math/tex">B</script> corresponds to the number of batches in an epoch, <script type="math/tex">b</script> to the number of training observations in a batch, and <script type="math/tex">(x_j, y_j)</script> to an input-output training pair.</p>
<div class="imgcap">
<img src="/assets/pr-lr/sgd.png" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center"><a href="https://distill.pub/2017/momentum/">Converging to an Optimum with SGD</a></div>
</div>
<p>Without loss of generality, we can simplify the notation by considering just one training observation, a special case where the batch size is equal to 1. In that case, training our neural network <script type="math/tex">M</script> amounts to updating the weight vector <script type="math/tex">W</script> by taking a small step in the direction of the gradient of the loss with respect to <script type="math/tex">W</script> between two consecutive iterations:</p>
<script type="math/tex; mode=display">W_{t+1} = W_t - \alpha \ \mu_i \ \nabla_{W_t} L_i</script>
<p>In the above equation, <script type="math/tex">i</script> is a discrete random variable sampled from <script type="math/tex">\mathcal{D}</script> according to a probability distribution <script type="math/tex">\mathcal{P}</script> with probabilities <script type="math/tex">p_i</script> and sampling weights <script type="math/tex">\mu_i</script>. With vanilla SGD and uniform sampling, we have that <script type="math/tex">\forall i \in \mathcal{D}</script>,</p>
<script type="math/tex; mode=display">\begin{equation*}
\mu_i = 1 \\
p_i = \frac{1}{|\mathcal{D}|}
\end{equation*}</script>
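<p>The role of the sampling weights <script type="math/tex">\mu_i</script> becomes clear once we deviate from uniform sampling: if we instead draw <script type="math/tex">i</script> from some non-uniform distribution with probabilities <script type="math/tex">q_i</script> and set <script type="math/tex">\mu_i = \frac{1}{|\mathcal{D}| q_i}</script>, the expected update direction is still the full gradient. A quick NumPy check on a toy least-squares problem (the setup below is purely illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
X, y = rng.normal(size=(N, 2)), rng.normal(size=N)
W = np.zeros(2)

# per-sample gradient of the squared loss L_i = (x_i . W - y_i)^2 w.r.t. W
grads = 2.0 * (X @ W - y)[:, None] * X   # shape (N, 2)
full_grad = grads.mean(axis=0)           # gradient of the average loss

# non-uniform sampling distribution, e.g. proportional to the per-sample loss
q = (X @ W - y) ** 2 + 1e-3
q /= q.sum()

# with sampling weights mu_i = 1 / (N * q_i), the expectation of the
# weighted gradient, E_q[mu_i * grad_i], recovers the full gradient
mu = 1.0 / (N * q)
expected = (q[:, None] * mu[:, None] * grads).sum(axis=0)
assert np.allclose(expected, full_grad)
```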
<p><a name="toc4"></a>
<strong>Importance Sampling.</strong> Importance sampling is a neat little trick for reducing the variance of an integral estimation by selecting a better distribution from which to sample a random variable. The trick is to multiply the integrand by a cleverly disguised 1:</p>
<script type="math/tex; mode=display">E_{x \sim p(x)} \big[\ f(x) \big] = \int f(x)\ p(x)\ dx = \int f(x)\ p(x)\ \frac{q(x)}{q(x)}\ dx = \int \frac{p(x)}{q(x)}\cdot f(x)\ q(x)\ dx = E_{x \sim q(x)} \big[\ f(x)\cdot \frac{p(x)}{q(x)} \big]</script>
<p>Since many quantities of interest (probabilities, sums, integrals)<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> can be obtained by computing the mean of a function of a random variable <script type="math/tex">E[f(X)]</script>, we can greatly accelerate – and even improve – Monte-Carlo estimates by switching out the original probability distribution with a density that minimizes the sampling of points that contribute very little to the estimate, i.e. points where the function value is at or near zero.</p>
<div class="imgcap">
<img src="/assets/pr-lr/mc-imp.jpg" width="80%" style="border:none;" />
<div class="thecap" style="text-align:center">Smaller Point Spread with Importance Sampling</div>
</div>
<p>For a tutorial on Monte-Carlo estimation and Importance Sampling, click <a href="https://github.com/kevinzakka/blog-code/blob/master/pr-lr/Monte%20Carlo%20and%20Importance%20Sampling.ipynb">here</a>.</p>
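<p>A tiny numerical illustration of the variance reduction (separate from the notebook, with an arbitrary rare-event integrand): estimating <script type="math/tex">P(3 < X < 4)</script> for a standard normal <script type="math/tex">X</script>, whose true value is about <script type="math/tex">1.3 \times 10^{-3}</script>.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
f = lambda x: ((x > 3) & (x < 4)).astype(float)  # rare-event indicator
n = 10_000

# naive Monte-Carlo: almost every sample lands where f(x) = 0
naive = f(rng.normal(0.0, 1.0, n)).mean()

# importance sampling: draw from q = N(3.5, 1), centered on the rare region,
# and reweight each sample by p(x) / q(x)
p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-(x - 3.5)**2 / 2) / np.sqrt(2 * np.pi)
xq = rng.normal(3.5, 1.0, n)
is_est = (f(xq) * p_pdf(xq) / q_pdf(xq)).mean()
```

<p>Almost every draw from the shifted density lands where the integrand is non-zero, so with the same sample budget the importance-sampling estimate clusters far more tightly around the true value than the naive one.</p>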
<p><a name="toc5"></a></p>
<h2 id="quantifying-sample-importance">Quantifying Sample Importance</h2>
<p>In the previous section, we mentioned that uniform sampling assigns equal importance to all the training points in <script type="math/tex">\mathcal{D}</script>. This is obviously wasteful: while some samples are “easy” for the model and can be discarded in the initial stages with minimal impact on performance, the more “difficult” samples should be addressed more frequently throughout the training since they contribute to faster learning. So can we find a way to quantify this “importance”?</p>
<p>Fortunately, the answer is yes: we can theoretically<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> show that this quantity is none other than the norm of the gradient of a sample. Intuitively this makes sense: in the classification setting for example, we would expect misclassified examples to exhibit larger gradients than their correctly classified counterparts. Unfortunately, the per-sample gradient norm is pretty expensive to compute, since obtaining it requires a full forward and backward pass for every single sample.</p>
<p>What about the loss of a sample? We essentially get it for free during the forward pass, so if we can show some degree of correlation with the gradient norm, it would be a less accurate but far cheaper proxy for importance. Let’s try to verify this with a small PyTorch experiment. We’re going to train a small convnet on MNIST and record both the loss and gradient norm of every image in an epoch. We’ll then sort the gradient norms and use the sorted order to index the list of losses. A scatter plot of the reindexed losses should reveal a few things:</p>
<ul>
<li>If there <em>is</em> indeed a correlation, there should be a (potentially noisy) straight line through the scatter plot.</li>
<li>If the correlation is positive – implying that a higher gradient norm corresponds to a higher loss value and vice versa – this line should be increasing.</li>
</ul>
<p>Here’s a code snippet for computing the L2 norm of the gradient of a batch of losses with respect to the parameters of the network. Since there’s a pair of weights and biases associated with every convolutional and fully-connected layer and we want to return a scalar, we can calculate and return the square root of the sum of the squared gradient norms.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">gradient_norm</span><span class="p">(</span><span class="n">losses</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
<span class="n">norms</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="n">losses</span><span class="p">:</span>
<span class="n">grad_params</span> <span class="o">=</span> <span class="n">torch</span><span class="o">.</span><span class="n">autograd</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">model</span><span class="o">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="n">retain_graph</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">grad_norm</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">for</span> <span class="n">grad</span> <span class="ow">in</span> <span class="n">grad_params</span><span class="p">:</span>
<span class="n">grad_norm</span> <span class="o">+=</span> <span class="n">grad</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="nb">pow</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="n">norms</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">grad_norm</span><span class="o">.</span><span class="n">sqrt</span><span class="p">())</span>
<span class="k">return</span> <span class="n">norms</span>
</code></pre></div></div>
<p>Incorporating the above function in the training loop is pretty trivial. All we need to do is record a <code class="highlighter-rouge">(grad_norm, loss)</code> tuple for every image in the dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># train for 1 epoch</span>
<span class="n">epoch_stats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">batch_idx</span><span class="p">,</span> <span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">target</span><span class="p">)</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">train_loader</span><span class="p">):</span>
<span class="n">data</span><span class="p">,</span> <span class="n">target</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">),</span> <span class="n">target</span><span class="o">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">zero_grad</span><span class="p">()</span>
<span class="n">output</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">losses</span> <span class="o">=</span> <span class="n">F</span><span class="o">.</span><span class="n">nll_loss</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="n">target</span><span class="p">,</span> <span class="n">reduction</span><span class="o">=</span><span class="s">'none'</span><span class="p">)</span>
<span class="n">grad_norms</span> <span class="o">=</span> <span class="n">gradient_norm</span><span class="p">(</span><span class="n">losses</span><span class="p">,</span> <span class="n">model</span><span class="p">)</span>
<span class="n">indices</span> <span class="o">=</span> <span class="p">[</span><span class="n">batch_idx</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="o">+</span> <span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">data</span><span class="p">))]</span>
<span class="n">batch_stats</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">l</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">indices</span><span class="p">,</span> <span class="n">grad_norms</span><span class="p">,</span> <span class="n">losses</span><span class="p">):</span>
<span class="n">batch_stats</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">i</span><span class="p">,</span> <span class="p">[</span><span class="n">g</span><span class="p">,</span> <span class="n">l</span><span class="p">]])</span>
<span class="n">epoch_stats</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">batch_stats</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">losses</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="n">loss</span><span class="o">.</span><span class="n">backward</span><span class="p">()</span>
<span class="n">optimizer</span><span class="o">.</span><span class="n">step</span><span class="p">()</span>
</code></pre></div></div>
<p>Finally, we index our losses using the sorted gradient norms and generate the desired scatter plot.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># reindex the losses using the sorted gradient norms</span>
<span class="n">flat</span> <span class="o">=</span> <span class="p">[</span><span class="n">val</span> <span class="k">for</span> <span class="n">sublist</span> <span class="ow">in</span> <span class="n">epoch_stats</span> <span class="k">for</span> <span class="n">val</span> <span class="ow">in</span> <span class="n">sublist</span><span class="p">]</span>
<span class="n">sorted_idx</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">flat</span><span class="p">)),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">flat</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>
<span class="n">sorted_losses</span> <span class="o">=</span> <span class="p">[</span><span class="n">flat</span><span class="p">[</span><span class="n">idx</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">item</span><span class="p">()</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">sorted_idx</span><span class="p">]</span>
</code></pre></div></div>
<div class="imgcap">
<img src="/assets/pr-lr/loss_vs_grad.jpg" width="100%" style="border:none;" />
<div class="thecap" style="text-align:center">Sorted Losses According to Gradient Norm</div>
</div>
<p>Other than the fact that the above plot is very pretty, it suggests that we <em>can</em> indeed use the loss value of a sample as a proxy for its importance. This is exciting news and opens up some interesting avenues for improving SGD.</p>
<p>If you want to reproduce the above plot, click <a href="https://github.com/kevinzakka/blog-code/blob/master/pr-lr/Loss%20vs%20Gradient%20Norm.ipynb">here</a>.</p>
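<p>As a first taste of what such an avenue might look like, here is a sketch (in NumPy for brevity, and not the exact scheme developed later in the post) of drawing a mini-batch with probability proportional to each sample’s last recorded loss:</p>

```python
import numpy as np

def loss_proportional_batch(losses, batch_size, rng):
    """Sample a mini-batch of indices with probability proportional to loss."""
    p = np.asarray(losses, dtype=np.float64)
    p = p / p.sum()
    return rng.choice(len(p), size=batch_size, replace=False, p=p)

rng = np.random.default_rng(0)
losses = np.array([0.01, 0.02, 5.0, 0.03, 4.0, 0.02])  # two "hard" samples
counts = np.zeros(6)
for _ in range(1000):
    counts[loss_proportional_batch(losses, 2, rng)] += 1
# the two high-loss samples (indices 2 and 4) dominate the sampled batches
```

<p>To keep the gradient estimate unbiased under such a scheme, each sampled gradient would be reweighted by <script type="math/tex">\mu_i = \frac{1}{|\mathcal{D}| p_i}</script>, mirroring the sampling weights introduced in the refresher above.</p>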
<p><a name="toc6"></a></p>
<h2 id="loss-patterns">Loss Patterns</h2>
<p>In this section, we’ll try to answer the following question:</p>
<blockquote>
<p>Is a sample’s importance consistent across epochs? In other words, if a sample exhibits low loss in the early stages of training, is this still the case in later epochs?</p>
</blockquote>
<p>There is substantial benefit in providing empirical evidence for this hypothesis. The reasons are two-fold: <strong>first</strong>, by eliminating consistently low-loss images from the dataset, we reduce training time in proportion to the number of discarded images; <strong>second</strong>, by oversampling the high-loss images, we reduce the variance of the gradients and speed up convergence to <script type="math/tex">\hat{W}</script>.</p>
<p>To explore this idea, we’re going to track every sample’s loss over a set number of epochs. We’ll bin the loss values into 10 quantiles and compare the histograms over the different epochs. Finally, we’ll repeat these steps with shuffling turned off, then turned on.</p>
<p><strong>NB:</strong> We need to be a bit careful with keeping track of a sample’s index when shuffling is turned on. The solution is to create a permutation of <code class="highlighter-rouge">[0, 1, 2, ..., 59,999]</code> at the beginning of every epoch and feed it to a sequential sampler <strong>with shuffling turned off</strong>. By remapping the indices to their true ordering relative to the permutations at the end of training, we would have effectively simulated random shuffling.</p>
<p>If this sounds complicated, let me show you how simple it is to achieve in PyTorch:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># PermSampler takes a list of `indices` and iterates over it sequentially</span>
<span class="k">class</span> <span class="nc">PermSampler</span><span class="p">(</span><span class="n">Sampler</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">indices</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">indices</span> <span class="o">=</span> <span class="n">indices</span>
<span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">iter</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">indices</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">__len__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">indices</span><span class="p">)</span>
<span class="c"># if `permutation` is None, we return a data loader with no shuffling</span>
<span class="c"># if `permutation` is a list of indices, we return a data loader that iterates</span>
<span class="c"># over the MNIST dataset with indices specified by `permutation`.</span>
<span class="k">def</span> <span class="nf">get_data_loader</span><span class="p">(</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">permutation</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
<span class="n">normalize</span> <span class="o">=</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Normalize</span><span class="p">(</span><span class="n">mean</span><span class="o">=</span><span class="p">(</span><span class="mf">0.1307</span><span class="p">,),</span> <span class="n">std</span><span class="o">=</span><span class="p">(</span><span class="mf">0.3081</span><span class="p">,))</span>
<span class="n">transform</span> <span class="o">=</span> <span class="n">transforms</span><span class="o">.</span><span class="n">Compose</span><span class="p">([</span><span class="n">transforms</span><span class="o">.</span><span class="n">ToTensor</span><span class="p">(),</span> <span class="n">normalize</span><span class="p">])</span>
<span class="n">dataset</span> <span class="o">=</span> <span class="n">MNIST</span><span class="p">(</span><span class="n">root</span><span class="o">=</span><span class="n">data_dir</span><span class="p">,</span> <span class="n">train</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">download</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">transform</span><span class="o">=</span><span class="n">transform</span><span class="p">)</span>
<span class="n">sampler</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">if</span> <span class="n">permutation</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
<span class="n">sampler</span> <span class="o">=</span> <span class="n">PermSampler</span><span class="p">(</span><span class="n">permutation</span><span class="p">)</span>
<span class="n">loader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">sampler</span><span class="o">=</span><span class="n">sampler</span><span class="p">,</span> <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">return</span> <span class="n">loader</span>
</code></pre></div></div>
<p>After training for 5 epochs, we collect a list containing a tuple <code class="highlighter-rouge">(idx, loss_idx)</code> for every image in the dataset. We can remap the indices with the following code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># remap the indices based on the permutations list</span>
<span class="k">for</span> <span class="n">stat</span><span class="p">,</span> <span class="n">perm</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">stats_with_shuffling_flat</span><span class="p">,</span> <span class="n">permutations</span><span class="p">):</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">stat</span><span class="p">)):</span>
<span class="n">stat</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">perm</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
</code></pre></div></div>
<p>Finally, we bin the sorted losses of every epoch into 10 bins and compute the percent match of bins across all epochs, the last 4 epochs, and the last 2 epochs.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">percentage_split</span><span class="p">(</span><span class="n">seq</span><span class="p">,</span> <span class="n">percentages</span><span class="p">):</span>
<span class="n">cdf</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">cumsum</span><span class="p">(</span><span class="n">percentages</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">np</span><span class="o">.</span><span class="n">allclose</span><span class="p">(</span><span class="n">cdf</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="mf">1.0</span><span class="p">)</span>
<span class="n">stops</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">map</span><span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="n">cdf</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">seq</span><span class="p">)))</span>
<span class="k">return</span> <span class="p">[</span><span class="n">seq</span><span class="p">[</span><span class="n">a</span><span class="p">:</span><span class="n">b</span><span class="p">]</span> <span class="k">for</span> <span class="n">a</span><span class="p">,</span> <span class="n">b</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">([</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="n">stops</span><span class="p">,</span> <span class="n">stops</span><span class="p">)]</span>
<span class="k">def</span> <span class="nf">bin_losses</span><span class="p">(</span><span class="n">all_epochs</span><span class="p">,</span> <span class="n">num_quantiles</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="n">percentile_splits</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ep</span> <span class="ow">in</span> <span class="n">all_epochs</span><span class="p">:</span>
<span class="n">sorted_loss_idx</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ep</span><span class="p">)),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">ep</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">splits</span> <span class="o">=</span> <span class="n">percentage_split</span><span class="p">(</span><span class="n">sorted_loss_idx</span><span class="p">,</span> <span class="p">[</span><span class="n">num_quantiles</span><span class="o">/</span><span class="mi">100</span><span class="p">]</span><span class="o">*</span><span class="n">num_quantiles</span><span class="p">)</span>
<span class="n">percentile_splits</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">splits</span><span class="p">)</span>
<span class="k">return</span> <span class="n">percentile_splits</span>
<span class="c"># bin each epoch's losses into 10 quantiles</span>
<span class="n">num_quantiles</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">percentile_splits</span> <span class="o">=</span> <span class="n">bin_losses</span><span class="p">(</span><span class="n">all_epochs</span><span class="p">,</span> <span class="n">num_quantiles</span><span class="p">)</span>
<span class="c"># compare matches across all epochs, the last 4, and the last 2</span>
<span class="n">fr</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="n">all_matches</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">fr</span><span class="p">:</span>
<span class="n">percent_matches</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_quantiles</span><span class="p">):</span>
<span class="n">percentile_all</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">percentile_splits</span><span class="p">)):</span>
<span class="n">percentile_all</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">percentile_splits</span><span class="p">[</span><span class="n">j</span><span class="p">][</span><span class="n">i</span><span class="p">])</span>
<span class="n">matching</span> <span class="o">=</span> <span class="nb">reduce</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">intersect1d</span><span class="p">,</span> <span class="n">percentile_all</span><span class="p">)</span>
<span class="n">percent</span> <span class="o">=</span> <span class="mi">100</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">matching</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">percentile_all</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">percent_matches</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">percent</span><span class="p">)</span>
<span class="n">all_matches</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">percent_matches</span><span class="p">)</span>
</code></pre></div></div>
<p>It’s interesting to compute percent matches across a varying range of epochs. The reason is that the training dynamics are less stable in the early epochs, when the model weights are still random (analogous to the transient response and steady state in circuit theory). For example, we would expect higher percent matches if we eliminate the first epoch from the analysis – and this is verified in the plots below!</p>
<div class="img">
<img src="/assets/pr-lr/no_shuffling.jpg" width="100%" style="border:none;" />
<img src="/assets/pr-lr/with_shuffling.jpg" width="100%" style="border:none;" />
</div>
<p>The histograms confirm our hypothesis:</p>
<ul>
<li>~ 30% of the samples with a loss value in the top 10% consistently rank in those ranges across all epochs. This number increases to ~ 60% across epochs 1 through 4 and ~ 85% across the last two epochs.</li>
<li>~ 30% of the samples with a loss value in the bottom 10% consistently rank in those ranges across all epochs. This number increases to ~ 50% across epochs 1 through 4 and ~ 70% across the last two epochs.</li>
<li>Shuffling has a minimal impact on the loss evolution of the samples across epochs.</li>
</ul>
<p>If you want to reproduce the histograms, click <a href="https://github.com/kevinzakka/blog-code/blob/master/pr-lr/Loss%20Patterns.ipynb">here</a>.</p>
<p><a name="toc7"></a></p>
<h2 id="sgd-on-steroids">SGD on Steroids</h2>
<p><a name="toc8"></a>
<strong>Mini-Batch Resampling.</strong> In the first version of SGD-S, we’re going to split our training epochs into 2 stages:</p>
<ul>
<li><strong>Transient Epochs</strong>: in the transient epochs, we train our model exactly as we would in regular SGD. However, in the last epoch, we record and return the losses of every image in the dataset.</li>
<li><strong>Steady-State Epochs</strong>:
<ul>
<li>For every epoch in the steady-state, we sample batches using the loss as the sampling distribution.</li>
<li>At the end of every epoch in the steady-state, we eliminate 10% of the images with the lowest losses. Furthermore, we can choose to randomly introduce a fraction of the discarded images to combat potential catastrophic forgetting.</li>
</ul>
</li>
</ul>
<p>Let’s illustrate how we can use the loss function to construct an importance sampling distribution for mini-batch resampling. This is achievable using PyTorch’s <code class="highlighter-rouge">WeightedRandomSampler</code> in conjunction with the <code class="highlighter-rouge">DataLoader</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># sort the loss in decreasing order</span>
<span class="n">sorted_loss_idx</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">losses</span><span class="p">)),</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">k</span><span class="p">:</span> <span class="n">losses</span><span class="p">[</span><span class="n">k</span><span class="p">][</span><span class="mi">1</span><span class="p">],</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># house cleaning</span>
<span class="n">to_remove</span> <span class="o">=</span> <span class="n">sorted_loss_idx</span><span class="p">[</span><span class="o">-</span><span class="nb">int</span><span class="p">((</span><span class="n">perc_to_remove</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">sorted_loss_idx</span><span class="p">)):]</span>
<span class="n">to_keep</span> <span class="o">=</span> <span class="n">sorted_loss_idx</span><span class="p">[:</span><span class="o">-</span><span class="nb">int</span><span class="p">((</span><span class="n">perc_to_remove</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span> <span class="o">*</span> <span class="nb">len</span><span class="p">(</span><span class="n">sorted_loss_idx</span><span class="p">))]</span>
<span class="n">to_add</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">to_remove</span><span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="o">.</span><span class="mo">01</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">sorted_loss_idx</span><span class="p">)),</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">))</span>
<span class="n">new_idx</span> <span class="o">=</span> <span class="n">to_keep</span> <span class="o">+</span> <span class="n">to_add</span>
<span class="n">new_idx</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>
<span class="n">weights</span> <span class="o">=</span> <span class="p">[</span><span class="n">losses</span><span class="p">[</span><span class="n">idx</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="n">new_idx</span><span class="p">]</span>
<span class="n">sampler</span> <span class="o">=</span> <span class="n">WeightedRandomSampler</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">weights</span><span class="p">),</span> <span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
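<p>To close the loop, the pruned index list and the sampler get handed to a <code class="highlighter-rouge">DataLoader</code>. Below is a minimal, self-contained sketch of that wiring; the toy dataset and random loss values are hypothetical stand-ins for the real ones. The key detail is that wrapping the kept indices in a <code class="highlighter-rouge">Subset</code> re-indexes the dataset, so the sampler’s positions line up with the <code class="highlighter-rouge">weights</code> list.</p>

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset, WeightedRandomSampler

# hypothetical stand-ins: 100 toy samples and one (index, loss) pair per sample
dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 10, (100,)))
losses = [(i, float(torch.rand(1))) for i in range(100)]

# sort indices by loss, decreasing, and drop the 10% easiest (lowest-loss) samples
sorted_loss_idx = sorted(range(len(losses)), key=lambda k: losses[k][1], reverse=True)
perc_to_remove = 10
n_remove = int((perc_to_remove / 100) * len(sorted_loss_idx))
to_keep = sorted_loss_idx[:-n_remove]

# sample the kept examples in proportion to their last recorded loss
weights = [losses[idx][1] for idx in to_keep]
sampler = WeightedRandomSampler(weights, len(weights), replacement=True)

# Subset re-indexes the dataset so position i of the sampler maps to weights[i]
loader = DataLoader(Subset(dataset, to_keep), batch_size=16, sampler=sampler)
x, y = next(iter(loader))  # x has shape (16, 3)
```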
<p><a name="toc9"></a>
<strong>Auxiliary Model.</strong></p>
<p><a name="toc10"></a></p>
<h2 id="things-i-wish-i-tried">Things I Wish I Tried</h2>
<p><a name="toc11"></a></p>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>CIFAR results pending. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Explain how. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Add proof or point to it. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 28 Sep 2018 00:00:00 +0000
http://kevinzakka.github.io/2018/09/28/prioritized-learning/
http://kevinzakka.github.io/2018/09/28/prioritized-learning/
deep learning, sgd, importance sampling, pytorch, 2018
Getting Up and Running with PyTorch on Amazon Cloud<p align="center">
<img src="/assets/aws/splash.png" alt="Drawing" width="60%" />
</p>
<p>This is a succinct tutorial aimed at helping you set up an AWS GPU instance so that you can train and test your PyTorch models in the cloud. If, like me, you don’t own a GPU, this can be a great way of drastically reducing the training time of your models, so while your instance is furiously crunching numbers in some faraway Amazon server, you can peacefully experiment with and prototype new architectures from the comfort of a Starbucks couch.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/aws/cpu-meter.png" width="30%" style="border:none;" />
<div class="thecap" style="text-align:center">I mean we all love a silent macbook, right?</div>
</p>
</div>
<p>The cool part is that if you’re a high school or college student, you can sign up for a Github Developer pack which will get you $150 worth of free AWS credits. That’s around 167 hours or 7 days of compute time<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, an amply sufficient amount for those fun weekend side projects and experiments. As usual, any code or script that appears on this page can be downloaded from my <a href="https://github.com/kevinzakka/blog-code/tree/master/aws-pytorch">Blog Repository</a>. And on that note, let’s get started!</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Configuring Your EC2 Instance</a></li>
<li><a href="#toc2">Launching & Managing Your EC2 Instance</a></li>
<li><a href="#toc3">SSH Persistence With TMUX</a></li>
<li><a href="#toc4">Conclusion</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="configuring-your-ec2-instance">Configuring Your EC2 Instance</h2>
<p>I’m assuming you’ve already created an AWS account but if you haven’t, the whole process shouldn’t take you more than 2 minutes. Note that it will require you to enter your credit card information which is necessary to charge you <em>if and when</em> you exceed your free credits. Now’s also a great time to claim your <a href="https://education.github.com/pack">GitHub Student Developer Pack</a> credits so go ahead and do that.</p>
<p><strong>Pick your Region.</strong> Ok, so the instance type we are going to use is located in <strong>US West (Oregon)</strong> so make sure the region information on the top right of the screen correctly reflects that.</p>
<p align="center">
<img src="/assets/aws/step1.png" alt="Drawing" width="80%" />
</p>
<p><strong>Limit Increase.</strong> The next thing we need to do is request a limit increase for EC2 instances. For some weird reason, Amazon automatically sets the limit to 0 upon account creation so it has to be increased by sending in a support ticket.</p>
<p>Go ahead and click <code class="highlighter-rouge">Support > Support Center</code> at the top right of your screen. This will direct you to a page with a blue <code class="highlighter-rouge">Create Case</code> button that you should click. You’ll be greeted with the following:</p>
<p align="center">
<img src="/assets/aws/step2.png" alt="Drawing" width="80%" />
</p>
<p>We want a Limit Increase for EC2 instances meaning you need to select <code class="highlighter-rouge">Service Limit Increase</code> in <strong>Regarding</strong> and <code class="highlighter-rouge">EC2 Instances</code> in <strong>Limit Type</strong>. Now fill in the <strong>Request 1</strong> box and <strong>Use Case Description</strong> as I’ve done here.</p>
<p align="center">
<img src="/assets/aws/step3.png" alt="Drawing" width="80%" />
</p>
<p>Finally, make sure to select <code class="highlighter-rouge">Web</code> as your <strong>Contact method</strong> and submit the request. Note that the time of response varies: I’ve had limit increases resolved in a matter of minutes and sometimes up to a full day, so be patient. Also, feel free to change the <strong>New limit value</strong> to suit your needs. I’ve opted for 2 because the <code class="highlighter-rouge">p2.xlarge</code> instance type we’ll be working with has a single GPU whose memory constraints may limit the number of jobs I can run concurrently.</p>
<p><strong>Configure Instance.</strong> Ok, we’re now ready to create and configure our EC2 instance. Back on the home page console (click on the orange cube in the top left), navigate to <code class="highlighter-rouge">EC2</code> in the Compute services section, and then click on the blue <code class="highlighter-rouge">Launch Instance</code> button.</p>
<p align="center">
<img src="/assets/aws/step4.png" alt="Drawing" width="80%" />
</p>
<p>You’ll be greeted with a 7-step process like so.</p>
<p align="center">
<img src="/assets/aws/step5.png" alt="Drawing" width="80%" />
</p>
<p><strong>AMI.</strong> First select the <code class="highlighter-rouge">Ubuntu Server 16.04 LTS (HVM), SSD Volume Type</code> as the AMI of choice.</p>
<p><strong>Instance.</strong> Select <code class="highlighter-rouge">p2.xlarge</code> as your instance type. This is an instance with a single GPU which is what we asked for in our limit increase request.</p>
<p><strong>Spot Instances.</strong> At this point, you should be on the <strong>Configure Instance Details</strong> step. This is where things get interesting. In fact, Amazon gives us the ability to bid on spare Amazon EC2 computing capacity for a much cheaper price than the on-demand one.</p>
<p>Basically, what that means is that if our bid price is higher than the current market price, our instance will be launched and charged at that price. The only downside is that if that ever flips around, instances get <span style="color:red">terminated</span> instantly and with no warning<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p><span style="color:blue">TL;DR:</span> Spot instances can be ideal for non-critical experimentation like hyperparameter tuning but stay away from them if you need to train a model for a large number of epochs.</p>
<p>I’ll assume you’re using On-Demand pricing for the remainder of this post, but if you do want to find out more about Spot Instances, feel free to watch this Youtube <a href="https://www.youtube.com/watch?v=_XT6McviY7w">video</a>.</p>
<p><strong>Add Storage.</strong> Next, we’ll be increasing the size of our Root Volume to accommodate large datasets such as ImageNet, which is around 48 GB. Feel free to enter any number above that.</p>
<p align="center">
<img src="/assets/aws/step6.png" alt="Drawing" width="80%" />
</p>
<p>Note that the Root Volume is EBS-backed meaning it persists on instance termination. The default behavior however is to delete it on termination. Weird right? Well, not really. With ephemeral storage, the other type of storage AWS offers, there is no persist option, whether it be on instance stop or terminate. Thus EBS with delete-on-terminate gives us the ability to keep our data on disk when the instance is stopped!</p>
<p><strong>Configure Security Group.</strong> You can skip the <strong>Add Tags</strong> section and jump to this last step. This part is important because it will allow us to monitor our training with Tensorboard and use Jupyter Notebook. We’ll be adding 4 protocols as shown in the picture below.</p>
<p align="center">
<img src="/assets/aws/step7.png" alt="Drawing" width="80%" />
</p>
<p>Once you click the launch button, a window will pop up and prompt you to create a key-pair. This little file is needed when ssh-ing into your instance, so download it and store it in a secure location you’ll remember. For this tutorial’s sake, I’ll be calling mine <code class="highlighter-rouge">aws-dl.pem</code> and storing it in my Downloads folder.</p>
<p align="center">
<img src="/assets/aws/step8.png" alt="Drawing" width="80%" />
</p>
<p><a name="toc2"></a></p>
<h2 id="launching--managing-your-ec2-instance">Launching & Managing Your EC2 Instance</h2>
<p>We’ve finally arrived at the point where we can ssh into our EC2 instance. To do so, you’ll need to navigate to the <code class="highlighter-rouge">Instances</code> page located in the navigation panel on the left of your screen. You’ll be greeted with the following:</p>
<p align="center">
<img src="/assets/aws/step9.png" alt="Drawing" width="80%" />
</p>
<p>You need to take note of 2 things:</p>
<ul>
<li><strong>Public DNS (IPv4)</strong>: <code class="highlighter-rouge">ec2-52-42-90-161.us-west-2.compute.amazonaws.com</code></li>
<li><strong>IPv4 Public IP</strong>: <code class="highlighter-rouge">52.42.90.161</code></li>
</ul>
<p>Other than that, there are just 2 ways of interacting with your instance that you need to be aware of: <strong>logging in</strong> with ssh and <strong>copying</strong> a file to it with scp.</p>
<ul>
<li><code class="highlighter-rouge">ssh -v -i X ubuntu@Y</code> where X represents the path to the key-pair file and Y represents the Public IP of your instance.</li>
<li><code class="highlighter-rouge">scp -i W -r X ubuntu@Y:Z</code> where W is the path to the key-pair file, X is the path to the local file, Y is the Public IP, and Z is the destination path on the instance.</li>
</ul>
<p>It’s important to note that if you’re using the key-pair file for the very first time, you’ll need to change its permission to read and write by running <code class="highlighter-rouge">chmod 600 ~/Downloads/aws-dl.pem</code>.</p>
<p>With all that being said, we can finally fire up a terminal and execute the following command:</p>
<p><code class="highlighter-rouge">
ssh -v -i ~/Downloads/aws-dl.pem ubuntu@52.42.90.161
</code></p>
<p align="center">
<img src="/assets/aws/term1.png" alt="Drawing" width="80%" />
</p>
<p>Enter yes, and voila! You should be successfully logged in. The instance is still not ready for use as there are a few more things that need to be done, but fear not. I’ve created a small bash script that you can execute which automates the following:</p>
<ul>
<li>It downloads and installs the required nvidia gpu drivers.</li>
<li>It updates and upgrades the distribution packages.</li>
<li>It installs python3 along with virtualenv.</li>
<li>It creates a virtualenv called <code class="highlighter-rouge">deepL</code> that will house all the required pip packages and PyTorch.</li>
<li>And it finally installs PyTorch v0.2.</li>
</ul>
<p>Go ahead and download <code class="highlighter-rouge">install.sh</code> from my <a href="https://github.com/kevinzakka/blog-code/tree/master/aws-pytorch">repo</a> and save it to your Desktop. We need to copy it to our instance, so apply the command I mentioned above:</p>
<p><code class="highlighter-rouge">
scp -i ~/Downloads/aws-dl.pem -r ~/Desktop/install.sh ubuntu@52.42.90.161:~/.
</code></p>
<p>Next, go back to the terminal window logged into the instance and execute the following 2 commands:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x install.sh
./install.sh
</code></pre></div></div>
<p>Once that’s done, you’ll need to reboot your instance. Enter <code class="highlighter-rouge">exit</code> at the command line and navigate to your browser as in the image below. Be patient and wait for a few minutes before you ssh back into the instance!</p>
<p align="center">
<img src="/assets/aws/step10.png" alt="Drawing" width="80%" />
</p>
<p>At this point, we should sanity check our installation by seeing if PyTorch loads correctly.</p>
<ul>
<li>First, activate the virtualenv by executing <code class="highlighter-rouge">source ~/envs/deepL/bin/activate</code>.</li>
<li>Enter <code class="highlighter-rouge">python</code> and inside the interpreter, <code class="highlighter-rouge">import torch</code> then <code class="highlighter-rouge">torch.__version__</code>. Fingers crossed, this should print out <code class="highlighter-rouge">0.2.0_1</code>.</li>
<li>Lastly, check that the GPU is visible by typing <code class="highlighter-rouge">torch.cuda.is_available()</code> which should print out True.</li>
</ul>
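<p>Equivalently, the whole sanity check fits in a tiny script (the exact version string depends on what <code class="highlighter-rouge">install.sh</code> put in the virtualenv, so treat the <code class="highlighter-rouge">0.2.0_1</code> value as specific to this setup):</p>

```python
import torch

# on the setup above this prints 0.2.0_1; newer installs will print a newer version
print(torch.__version__)

# True means the p2.xlarge's GPU is visible to PyTorch; False means the
# NVIDIA driver or CUDA install needs another look
print(torch.cuda.is_available())
```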
<p><span style="color:red">Once you’ve finished working on your instance, you should stop it immediately to avoid incurring additional charges.</span></p>
<p><a name="toc3"></a></p>
<h2 id="ssh-persistence-with-tmux">SSH Persistence With TMUX</h2>
<p>I would be doing you a great disservice if I didn’t mention this nifty little package called <code class="highlighter-rouge">tmux</code> that you can use when running your instances for long periods of time. <em>What exactly is tmux, and why should you use it</em>?</p>
<p>Well, if you’re ssh’ed into an instance, peacefully running a job, and your connection suddenly drops, your ssh session will automatically get killed. This means anything running on that instance stops as well (i.e. your model will stop training). Closing your laptop to commute from university to your house, for example, becomes a big no-no.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/aws/term3.png" width="80%" style="border:none;" />
<div class="thecap" style="text-align:center">A TMUX session</div>
</p>
</div>
<p>This is where tmux comes in! Tmux makes it so that anything running within a session persists even if the connection drops or the terminal gets killed. To see it in action, I’d suggest you watch the following <a href="https://www.youtube.com/watch?v=BHhA_ZKjyxo">video</a>.</p>
<p>Thus, your workflow should always be as follows:</p>
<ul>
<li>SSH into your aws instance.</li>
<li>Create a new tmux session called work using the command <code class="highlighter-rouge">tmux new -s work</code>.</li>
<li>Do everything as you would previously.</li>
<li>Detach from the session by pressing <code class="highlighter-rouge">ctrl-b</code> followed by <code class="highlighter-rouge">d</code>.</li>
</ul>
<p>Once you’ve detached yourself from the session, you can work on anything else, even go to sleep… Subsequently, if you need to reattach to that particular tmux session to check your progress, run <code class="highlighter-rouge">tmux a -t work</code>.</p>
<p>That’s pretty much it. For a more complete list of tmux commands, you should refer to this lovely <a href="https://gist.github.com/MohamedAlaa/2961058">cheatsheet</a>.</p>
<p><a name="toc4"></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this tutorial, we went over the basic steps needed to create a free, GPU-powered Amazon AWS instance. We explored how to interact with our instance using the <code class="highlighter-rouge">ssh</code> and <code class="highlighter-rouge">scp</code> commands and how a bash script could be leveraged to download and install all the required packages needed to run PyTorch. Finally, we saw how we could make our ssh session persistent using a very important program called tmux.</p>
<p>Until next time!</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is for a GPU-powered p2.xlarge instance with an on-demand price of around $0.9/hr. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>A terminated instance gets deleted, meaning you lose whatever’s on there permanently. On the other hand, a stopped instance just goes offline so you don’t get charged for it and you can fire it back up again at a later time. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 13 Aug 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/08/13/aws-pytorch/
http://kevinzakka.github.io/2017/08/13/aws-pytorch/
deep learning, aws, amazon, pytorch, 2017
Understanding Recurrent Neural Networks - Part I<p>Recurrent Neural Networks have been my Achilles’ heel for the past few months. Admittedly, I haven’t had the grit to sit down and work out their details, but I’ve figured it’s time I stop treating them like black boxes and try instead to discover what makes them tick. My intentions with this series are hence twofold: first, to combat my weakness by understanding their inner workings and coding one from scratch; and second, to write down what I learn in order to reinforce the insights I may gain along the way.</p>
<p>In this first installment, we’ll be introducing the intuition behind RNNs, motivating their use by highlighting a glaring limitation of traditional neural networks. We’ll then transition into a more technical description of their architecture which will be useful for the next installment where we’ll code one from scratch in numpy.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Human Learning</a></li>
<li><a href="#toc2">The Woes of Traditional Neural Nets</a></li>
<li><a href="#toc3">Enhancing Neural Networks with Memory</a></li>
<li><a href="#toc4">The Nitty Gritty Details</a></li>
<li><a href="#toc5">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="human-learning">Human Learning</h3>
<blockquote>
<p>We are the sum total of our experiences. None of us are the same as we were yesterday, nor will be tomorrow. (B.J. Neblett)</p>
</blockquote>
<p>There is an inherent truth to the quote above. Our brain pools from past experiences and combines them in intricate ways to solve new and unseen tasks. It is hardwired to work with sequences of information that we perpetually store and call upon over the course of our lives. At its core, <em>human learning</em> can be distilled into two fundamental processes:</p>
<ul>
<li><strong>memorization</strong>: every time we gain new information, we store it for future reference.</li>
<li><strong>combination</strong>: not all tasks are the same, so we couple our analytical skills with a combination of our memorized, previous experiences to reason about the world.</li>
</ul>
<p>Consider the following pictures.</p>
<p align="center">
<img src="/assets/rnn/weird_cat.jpg" alt="Drawing" width="200px" /><img src="/assets/rnn/weird_cat2.jpg" alt="Drawing" width="200px" />
</p>
<p>Even though it’s in a very weird position, a child can instantly tell that the fur ball in front of it is a cat. It’ll recognize the ears, the whiskers, and the snout (memory), but the shape of it all may throw it off. Subconsciously, however, the child may recall how stretching deforms human shape and pose (combination), and infer that the same is happening to the cat.</p>
<p>Not all tasks require the distant past however. At times, solving a problem makes use of information that was processed only moments ago. For example, take a look at this incomplete sentence:</p>
<blockquote>
<p>I bought my usual caramel-covered popcorn with iced tea and headed to the ___.</p>
</blockquote>
<p>If I asked you to fill in the missing word, you’d probably guess “movies”. How did you know that <code class="highlighter-rouge">library</code> or <code class="highlighter-rouge">starbucks</code> were invalid words? Well, it’s probably because you used context, i.e. information from earlier in the sentence, to infer the correct answer. Now think about the following. If I asked you to recite the lyrics of your favorite song backwards, would you be able to do it? Probably not… What about counting backwards? Yeah, piece of cake!</p>
<p align="center">
<img src="/assets/rnn/yarn.jpg" alt="Drawing" width="200px" />
</p>
<p>So what makes reciting the song backwards so excruciatingly difficult? The answer is that counting backwards is done <strong>on the fly</strong>. There is a logical relationship between each number, and knowing the order of the 10 digits and how subtraction works means you can count backwards from, say, 1845098 even if you’ve never done it before. On the other hand, you memorized the lyrics of the song in a specific order. Your brain works by <strong>indexing</strong> from one word to the next, starting from the first word. It’s hard to index backwards for the simple reason that your brain has never done it before, so that specific sequence was never stored. Think of the memorized lyric sequence as a giant ball of yarn whose unraveled end can only be accessed with the correct first word in the forward sequence.</p>
<p>The main takeaway is that our brains are naturally talented at working with sequences and they do so by relying on a deceptively simple, yet powerful concept called <strong>information persistence</strong>.</p>
<p><a name="toc2"></a></p>
<h3 id="the-woes-of-traditional-neural-nets">The Woes of Traditional Neural Nets</h3>
<p>We live in a world that is inherently sequential. Audio, video, and language (even your DNA!) are but a few examples of data in which information at a given time step is intricately dependent on information from previous timesteps. So how is all this related to deep learning? Well, think about feeding a sequence of frames from a video into a neural network and asking it to predict what comes next… Or, back to our previous example, feeding a set of words and asking it to complete the sentence.</p>
<p>It should be obvious to you that information from the past is crucial for outputting a sane and plausible prediction. But traditional neural networks can’t do this because they operate on the fundamental assumption that inputs are independent! This is a problem because it means our output at any given time is completely and <strong>solely</strong> determined by the input at that same time. There is no previous history and our network cannot capitalize on the complex temporal dependencies that exist between the different frames or words to refine its predictions.</p>
<p>This is where <em>Recurrent Neural Networks</em> come in! RNNs allow us to deal with sequences by incorporating a mechanism that stores and leverages information from previous history, sort of like a memory. Put differently, whereas a traditional net maps <strong>one</strong> input to an output, a recurrent net maps an <strong>entire history</strong> of previous inputs to each output. If that’s still obscure to you, just think of RNNs as a traditional neural net enhanced with a loop<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, one that allows for information to persist across timesteps.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/rnn/draw2.gif" width="400" style="border:none;" />
<div class="thecap" style="text-align:center">(<a href="https://www.youtube.com/watch?v=Zt-7MI9eKEo">Video Courtesy</a>) DRAW model improving its output by iterating over the canvas rather than producing the image in one shot.</div>
</p>
</div>
<p>It is important to note that recurrent neural nets aren’t bound exclusively to sequential data: many problems can be tackled by decomposing them into a series of smaller subproblems. The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step. As an example, a recurrent net<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> was used to generate handwritten digits in a sequential fashion, mimicking the way artists refine and reassess their work with brushstrokes.</p>
<blockquote>
<p>The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step.</p>
</blockquote>
<p><a name="toc3"></a></p>
<h3 id="enhancing-neural-nets-with-memory">Enhancing Neural Nets with Memory</h3>
<p>So how exactly can we endow our networks with the ability to memorize? To answer this question, let’s recall our basic hidden layer neural network, which takes as input a vector <code class="highlighter-rouge">X</code>, dot products it with a weight matrix <code class="highlighter-rouge">W</code> and applies a nonlinearity. We’ll consider the output <code class="highlighter-rouge">y</code> when three successive inputs are fed through the network. Note that the bias term has been eliminated so as to simplify the notation, and I’ve taken the liberty of coloring the equations to make certain patterns stand out.</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f(W_x \color{green}{X_1})</script>
<script type="math/tex; mode=display">y_2 = f(W_x \color{red}{X_2})</script>
<p>Given the simple API above, it’s pretty clear that each output is solely determined by its input, i.e. there is no trace of past inputs in the calculation of its value. So let’s alter the API by allowing our hidden layer to use a combination of both the current input and the previous input, and visualize what happens.</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f(W_x \color{green}{X_1} + W_h\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_2 = f(W_x \color{red}{X_2} + W_h\color{green}{X_1})</script>
<p>Nice! By introducing recurrence into the formula, we’ve managed to obtain a mix of 2 colors in each hidden layer. Intuitively, our network now has a memory depth of 1, equivalent to “seeing” one step backwards in time. Remember though that our goal is to be able to capture information across <strong>all</strong> previous timesteps, so this does not cut it.</p>
<p>Hmm… What if we feed in a combination of the current input and the previous hidden layer?</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)</script>
<script type="math/tex; mode=display">y_2 = f\bigg(W_x \color{red}{X_2} + W_h \ f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)\bigg)</script>
<p>Much better! Our layer at each timestep is now a blend of all the colors that have come before it, allowing our network to take into account all its past history when computing its output. This is the power of recurrence in all its glory: creating a loop where information can persist across timesteps.</p>
<p><a name="toc4"></a></p>
<h3 id="the-nitty-gritty-details">The Nitty Gritty Details</h3>
<div class="imgcap">
<img src="/assets/rnn/rnn-1_layer-unrolled.svg" width="300px" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://kbullaughey.github.io/lstm-play/rnn/">Image Courtesy</a></div>
</div>
<p>At its core, an RNN can be represented by an internal, hidden state <code class="highlighter-rouge">h</code> that gets updated with every timestep and from which an output <code class="highlighter-rouge">y</code> can be optionally derived<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. This update behavior is governed by the following equations:</p>
<script type="math/tex; mode=display">\begin{cases}
h_t = f \big(W_{xh}x_t + W_{hh}h_{t-1}+b_1\big) \\
y_t = g \big(W_{hy}h_t + b_2\big)
\end{cases}</script>
<p>Don’t let the above notation scare you. It’s actually very simple once you dissect it.</p>
<ul>
<li><script type="math/tex">W_{xh}x_t</script> - we’re multiplying the input <script type="math/tex">x_t</script> by a weight matrix <script type="math/tex">W_{xh}</script>. You can think of this dot product as a way for the hidden layer to extract information out of the input.</li>
<li><script type="math/tex">W_{hh}h_{t-1}</script> - this dot product is allowing the network to extract information from an entire history of past inputs which it will use in conjunction with information gathered from the current input, to compute its output. This is the crucial, self-defining property of RNNs.</li>
<li><script type="math/tex">f</script> and <script type="math/tex">g</script> are activation functions that squash the dot products to a specific range. The function <script type="math/tex">f</script> is usually <code class="highlighter-rouge">tanh</code> or <code class="highlighter-rouge">ReLU</code>. <script type="math/tex">g</script> can be a <code class="highlighter-rouge">softmax</code> when we want to output class probabilities.</li>
<li><script type="math/tex">b_1</script> and <script type="math/tex">b_2</script> are biases that help offset the outputs away from the origin (similar to the b in your typical <script type="math/tex">ax+b</script> line).</li>
</ul>
<p>As you can see, the Vanilla RNN model is quite simple. Once its architecture has been defined, training it is exactly the same as with normal neural nets, i.e. initializing the weight matrices and biases, defining a loss function and minimizing that loss function using some form of gradient descent.</p>
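Before we get to next week's full implementation, the update equations above can be sketched in a few lines of numpy. The dimensions below are hypothetical, <script type="math/tex">f</script> is <code class="highlighter-rouge">tanh</code>, and <script type="math/tex">g</script> is taken to be the identity for simplicity:

```python
import numpy as np

# Hypothetical dimensions: input size 4, hidden size 3, output size 2.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((3, 4)) * 0.1  # input-to-hidden weights
W_hh = rng.standard_normal((3, 3)) * 0.1  # hidden-to-hidden weights
W_hy = rng.standard_normal((2, 3)) * 0.1  # hidden-to-output weights
b1 = np.zeros(3)
b2 = np.zeros(2)

def rnn_step(x_t, h_prev):
    """One timestep: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b1), y_t = g(W_hy h_t + b2)."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b1)
    y_t = W_hy @ h_t + b2  # g is the identity here; use a softmax for class probabilities
    return h_t, y_t

# Feed a short sequence through the network, carrying the hidden state forward
# so that each output depends on the entire history of inputs.
h = np.zeros(3)
for x in rng.standard_normal((5, 4)):  # 5 timesteps of 4-d inputs
    h, y = rnn_step(x, h)
```

Notice that the loop threads <code class="highlighter-rouge">h</code> from one timestep to the next — that single carried variable is the "memory" we spent this whole post building up to.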
<p>This concludes our first installment in the series. In next week’s blog post, we’ll be coding our very own RNN from the ground up in numpy and applying it to a language modeling task. Stay tuned until then…</p>
<p><a name="toc5"></a></p>
<h3 id="references">References</h3>
<p>There are a ton of resources that helped me better grasp the fundamentals of RNNs. I’d like to thank <a href="https://twitter.com/iamtrask">iamtrask</a> especially, for letting me use his idea of colors to explain neural memory. You can read his amazing blog post <a href="https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/">here</a>.</p>
<ul>
<li>Denny Britz’s RNN series - click <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">here</a></li>
<li>Andrej Karpathy’s Blog Post - click <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">here</a></li>
<li>Chris Olah’s Blog Post - click <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">here</a></li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>If you’re familiar with Control Theory, this should be slightly reminiscent of a feedback loop, although not quite. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I’m referring to the <a href="https://arxiv.org/abs/1502.04623">DRAW</a> model introduced by Gregor et al. at Deepmind. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>In the simplest of cases, the hidden state <script type="math/tex">h_t</script> is used as both the output <script type="math/tex">y_t</script> and input to the next hidden state <script type="math/tex">h_{t+1}</script>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 20 Jul 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/07/20/rnn/
http://kevinzakka.github.io/2017/07/20/rnn/deep learningrnnsequences2017My Short Term Goals For 2017<div class="imgcap">
<img src="/assets/goals/winter.jpg" width="80%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.pinterest.com/bishopspencer/winter/">Image Courtesy</a></div>
</div>
<p>This past month has been extremely productive, and I’m really satisfied with the way my winter break has panned out. In fact, I had the opportunity to read and learn tons from a multitude of arXiv papers and I actually went hands on and coded 2 projects from scratch in Tensorflow/Keras:</p>
<ul>
<li><a href="https://github.com/kevinzakka/style_transfer">Artistic Style Transfer</a></li>
<li><a href="https://github.com/kevinzakka/spatial_transformer_network">Spatial Transformer Networks</a></li>
</ul>
<p>I think sticking to theory all the time makes me very prone to forming misconceptions, so actually reproducing papers, googling questions and looking at people’s code has really helped me concretize the notions in my head and I’ve gained significant experience in the process.</p>
<p>Since tomorrow marks my last day of winter break, I thought I would use this opportunity to compile a list of projects I’d like to tackle in the next few weeks. It’ll be a perfect way of organizing and prioritizing my short term goals and I’ll be able to hold myself accountable if I get too lazy.</p>
<h2 id="vision">Vision</h2>
<p><strong>Image Super-Resolution.</strong> Super-resolution is the task of estimating a high-resolution (HR) image from its low-resolution (LR) counterpart. Think about those CSI-Miami episodes where they enhanced surveillance videos to glean valuable information for the crime case.</p>
<div class="imgcap">
<img src="/assets/goals/super-resolution.gif" width="80%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://github.com/alexjc/neural-enhance">Image Courtesy</a></div>
</div>
<p>My resource will be Alex Champandard’s <a href="https://github.com/alexjc/neural-enhance">Neural-Enhance</a> repository which uses a combination of 4 papers. Kudos to Alex, definitely go and give him a star for the amazing work.</p>
<p>I think super-resolution is a great application of Deep Learning and tackling it will prove to be very entertaining.</p>
<p><strong>Text-To-Image.</strong> My goal is to try and implement <a href="https://arxiv.org/abs/1612.03242">StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks</a> by Zhang et al. Amazing work where they synthesize photo-realistic images from text descriptions with GANs! If that doesn’t sound like wizardry, take a look at the image below from the paper.</p>
<div class="imgcap">
<img src="/assets/goals/txt2img.png" width="65%" style="border:none;" /><div class="thecap" style="text-align:center"></div>
</div>
<p>I’ll definitely have to brush up on the theory of Generative Adversarial Networks so Goodfellow’s <a href="https://arxiv.org/abs/1701.00160">NIPS 2016 GAN tutorial</a> will prove invaluable.</p>
<p><strong>Lip Reading.</strong> My third project in Visual Recognition will be trying to build a model that can recognise phrases and sentences being spoken by a talking face. Specifically, I’ll try and reproduce the results of Son Chung et al.’s <a href="https://arxiv.org/abs/1611.05358">Lip Reading Sentences in the Wild</a>. Lip reading is just so damn useful, and it can really help the hearing impaired, so this is a big priority of mine. Helping society is exactly why I got into this field.</p>
<p>Here’s the youtube video uploaded by the author of the paper. Impressive results!</p>
<p align="center">
<iframe width="330" height="315" src="https://www.youtube.com/embed/5aogzAUPilE" frameborder="0" allowfullscreen=""></iframe>
</p>
<hr />
<h2 id="sound">Sound</h2>
<p>My goal is to tackle 2 seminal papers in this area.</p>
<p><strong>Wavenet.</strong> Google Deepmind’s <a href="https://arxiv.org/abs/1609.03499">paper</a>, which made waves (no pun intended) when it was released, leverages a deep neural network to generate raw audio waveforms. It won “Best paper from the industry” award on Reddit.</p>
<div class="imgcap">
<img src="/assets/goals/wavenet.gif" width="55%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Image Courtesy</a></div>
</div>
<p>I invite you to check out Deepmind’s <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">blog post</a> on the matter where they showcase samples created by the network. Not only have they taught it to generate synthetic utterances (English and Mandarin), but there’s even a sample of some piano playing which blew me away.</p>
<p>The results are currently very hard to reproduce, but my goal is just to get something minimal working and to familiarize myself with dilated convolutions.</p>
<p><strong>SoundNet.</strong> The second paper is from MIT’s CSAIL and was presented at this year’s NIPS. The project <a href="http://projects.csail.mit.edu/soundnet/">landing page</a> is extremely well presented and detailed so again, check it out if you’re interested.</p>
<div class="imgcap">
<img src="/assets/goals/soundnet.png" width="75%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://projects.csail.mit.edu/soundnet/">Image Courtesy</a></div>
</div>
<p>I think this paper is extremely underrated; I didn’t see much talk about it on social media, but it’s actually very elegant: given a video, their ConvNet recognizes objects and scenes <strong>from sound only</strong>!</p>
<hr />
<h2 id="nlp">NLP</h2>
<p>NLP is a very important application of Deep Learning and I’ve never had any experience with it, so I decided I’d like to try and implement two recent approaches that have shifted away from the traditional RNN architecture. The first paper is from Google Deepmind and the second one is from FAIR.</p>
<p><strong>ByteNet.</strong> Google Deepmind’s <a href="https://arxiv.org/abs/1610.10099">paper</a> which can perform language modeling and machine translation in linear time. They use dilated convolutions much like in Wavenet. The below snippet, courtesy of the paper, illustrates the model’s architecture.</p>
<div class="imgcap">
<img src="/assets/goals/bytenet.png" width="50%" style="border:none;" /><div class="thecap" style="text-align:center"></div>
</div>
<p><strong>Gated Convnets.</strong> This <a href="https://arxiv.org/abs/1612.08083">paper</a> is from FAIR. The authors evade the traditional RNN structure for language modeling and replace it with a convnet endowed with a gating mechanism (similar in concept to LSTMs). This model also enjoys an order of magnitude speedup compared to a recurrent baseline because they can parallelize it. The architecture is illustrated below.</p>
<div class="imgcap">
<img src="/assets/goals/gated.png" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"></div>
</div>
<hr />
<h2 id="autonomous-driving">Autonomous Driving</h2>
<p>Finally, my current and main interest is autonomous driving. I’ve decided to tackle the following 3 projects and I feel they will form a solid background before I start messing around with Comma.ai’s <a href="https://github.com/commaai/openpilot">open source project</a>.</p>
<p><strong>Traffic Sign Classification.</strong> I want to implement and train a convolutional neural network to classify traffic signs. This, incidentally, is the subject of my next blog post which is part 3 of the Spatial Transformer <a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">series</a>.</p>
<div class="imgcap">
<img src="/assets/goals/traffic-signs.png" width="85%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">Image Courtesy</a></div>
</div>
<p><strong>Behavioral Cloning.</strong> I want to train a deep neural network to drive a car using OpenAI’s Universe and GTA V. Also would like to test it on MIT’s “Deep Learning for Self-Driving Cars” <a href="http://selfdrivingcars.mit.edu/deeptrafficjs/">Deep Traffic</a>. I don’t know if I need to be a pro in Reinforcement Learning, and I’ll definitely refine this list if need be. We’ll see when the time comes.</p>
<p><strong>Kalman Filters.</strong> The final goal is Kalman filters. This algorithm is super important for autonomous driving (GPS noise smoothing, for example), so I want to understand it more and write a small Python implementation. I’ll also definitely write a blog post about it in the near future.</p>
<h2 id="summary">Summary</h2>
<p>That’s it for today’s blog post. I talked about a few projects I’d like to work on in the fields of Vision, Sound, NLP and Autonomous Driving. I thoroughly hope I can achieve the goals I have set in mind before the end of this Spring semester as I’d like to spend the second part of this year on Deep Reinforcement Learning.</p>
<p>For those of you that are interested, I’ve set up a <a href="https://github.com/kevinzakka/deeplearning-roadmap">roadmap repository</a> on my Github which mirrors the above list, so you can check it out and see my progress step-by-step.</p>
<p>Until next time, cheers!</p>
Sun, 22 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/22/goals/
http://kevinzakka.github.io/2017/01/22/goals/deep learninggoals2017computer visionNLPsoundself-drivingDeep Learning Paper Implementations: Spatial Transformer Networks - Part II<div class="imgcap">
<img src="/assets/stn2/ai.jpg" width="45%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/">Image Courtesy</a></div>
</div>
<p>In last week’s <a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">blog post</a>, we introduced two very important concepts: <strong>affine transformations</strong> and <strong>bilinear interpolation</strong> and mentioned that they would prove crucial in understanding Spatial Transformer Networks.</p>
<p>Today, we’ll provide a detailed, section-by-section summary of the <a href="https://arxiv.org/abs/1506.02025">Spatial Transformer Networks</a> paper, a concept originally introduced by researchers <em>Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu</em> of Google Deepmind.</p>
<p>Hopefully, it’ll give you a clear understanding of the module and prove useful for next week’s blog post, where we’ll cover its implementation in Tensorflow.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Motivation</a></li>
<li><a href="#toc2">Pooling Operator</a></li>
<li><a href="#toc3">Spatial Transformer Network</a>
<ul>
<li><a href="#toc4">Localisation Network</a></li>
<li><a href="#toc5">Parametrised Sampling Grid</a></li>
<li><a href="#toc6">Differentiable Image Sampling</a></li>
</ul>
</li>
<li><a href="#toc7">Fun with STNs</a>
<ul>
<li><a href="#toc8">Distorted MNIST</a></li>
<li><a href="#toc9">GTSRB dataset</a></li>
</ul>
</li>
<li><a href="#toc10">Summary</a></li>
<li><a href="#toc11">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="motivation">Motivation</h2>
<p>When working on a classification task, it is usually desirable that our system be <strong>robust</strong> to input variations. By this, we mean to say that should an input undergo a certain “transformation” so to speak, our classification model should in theory spit out the same class label as before that transformation. A few examples of the “challenges” our image classification model may face include:</p>
<ul>
<li><strong>scale variation</strong>: variations in size both in the real world and in the image.</li>
<li><strong>viewpoint variation</strong>: different object orientation with respect to the viewer.</li>
<li><strong>deformation</strong>: non-rigid bodies can be deformed and twisted into unusual shapes.</li>
</ul>
<div class="imgcap">
<div>
<img src="/assets/stn2/var1.png" style="max-width:49%; height:350px;" />
<img src="/assets/stn2/var2.png" style="max-width:49%; height:200px;" />
</div>
<div class="thecap" style="text-align:center"><a href="http://cs231n.github.io/classification/">Image Courtesy</a></div>
</div>
<p>For illustration purposes, take a look at the images above. While the task of classifying them may seem trivial to a human being, recall that our computer algorithms only work with raw 3D arrays of brightness values so a tiny change in an input image can alter every single pixel value in the corresponding array. Hence, our ideal image classification model should in theory be able to disentangle object pose and deformation from texture and shape.</p>
<p>For a different type of intuition, let’s again take a look at the following cat images.</p>
<div class="imgcap">
<div>
<img src="/assets/stn2/cat2.jpg" style="max-width:49%; height:300px;" />
<img src="/assets/stn2/cat2_.jpg" style="max-width:49%; height:300px;" />
<img src="/assets/stn2/cat1.jpg" style="max-width:49%; height:250px;" />
<img src="/assets/stn2/cat1_.jpg" style="max-width:49%; height:250px;" />
</div>
<div class="thecap" style="text-align:center"> <b>Left:</b> Cat images which may present classification challenges. <b>Right:</b> Transformed images which yield a simplified classification pipeline.</div>
</div>
<p>Would it not be extremely desirable if our model could go from left to right using some sort of crop and scale-normalize combination so as to simplify the subsequent classification task?</p>
<p><a name="toc2"></a></p>
<h2 id="pooling-layers">Pooling Layers</h2>
<p>It turns out that the pooling layers we use in our neural network architectures actually endow our models with a certain degree of spatial invariance. Recall that the pooling operator acts as a sort of downsampling mechanism. It progressively reduces the spatial size of the feature map (acting independently on each depth slice), cutting down the number of parameters and the computational cost.</p>
<hr />
<div class="fig figcenter fighighlight">
<img src="/assets/stn2/pool.jpeg" width="36%" />
<img src="/assets/stn2/maxpool.jpeg" width="59%" style="border-left: 1px solid black;" />
<div class="figcaption">
Pooling layer downsamples the volume spatially. <b>Left:</b> In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. <b>Right:</b> 2x2 max pooling. (<a href="http://cs231n.github.io/convolutional-networks/#pool">Image Courtesy</a>)
</div>
</div>
<hr />
<p><strong>How exactly does it provide invariance?</strong> Well think of it this way. The idea behind pooling is to take a complex input, split it up into cells, and “pool” the information from these complex cells to produce a set of simpler cells that describe the output. So for example, say we have 3 images of the number 7, each in a different orientation. A pool over a small grid in each image would detect the number 7 regardless of its position in that grid since we’d be capturing approximately the same information by aggregating pixel values.</p>
<p>Now there are a few downsides to pooling which make it an undesirable operator. For one, pooling is <strong>destructive</strong>. It discards 75% of feature activations when it is used, meaning we are guaranteed to lose exact positional information. Now you may be wondering why this is bad since we mentioned earlier that it endowed our network with some spatial robustness. Well the thing is that positional information is invaluable in visual recognition tasks. Think of our cat classifier above. It may be important to know where the position of the whiskers are relative to, say the snout. This can’t be achieved when it is this sort of information we throw away when we use max pooling.</p>
<p>Another limitation of pooling is that it is <strong>local and predefined</strong>. With a small receptive field, the effects of a pooling operator are only felt towards deeper layers of the network, meaning intermediate feature maps may suffer from large input distortions. And remember, we can’t just increase the receptive field arbitrarily, because that would downsample our feature map too aggressively.</p>
<p>The main takeaway is that ConvNets are not invariant to relatively large input distortions. This limitation is due to having only a restricted, pre-defined pooling mechanism for dealing with spatial variation of the data. This is where Spatial Transformer Networks come into play!</p>
<blockquote>
<p>The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. (Geoffrey Hinton, Reddit AMA)</p>
</blockquote>
<p><a name="toc3"></a></p>
<h2 id="spatial-transformer-networks-stns">Spatial Transformer Networks (STNs)</h2>
<p>The Spatial Transformer mechanism addresses the issues above by providing Convolutional Neural Networks with explicit spatial transformation capabilities. It possesses 3 defining properties that make it very appealing.</p>
<ul>
<li><strong>modular</strong>: STNs can be inserted anywhere into existing architectures with relatively small tweaking.</li>
<li><strong>differentiable</strong>: STNs can be trained with backprop allowing for end-to-end training of the models they are injected in.</li>
<li><strong>dynamic:</strong> STNs perform an active spatial transformation on a feature map for each input sample, as compared to the pooling layer, which acts identically on all input samples.</li>
</ul>
<p>As you can see, the Spatial Transformer is superior to the Pooling operator in all regards. So this begs the following question: <strong>what exactly is a Spatial Transformer?</strong></p>
<div class="imgcap">
<img src="/assets/stn2/stn_arch.png" width="65%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a></div>
</div>
<p>The Spatial Transformer module consists of three components shown in the figure above: a <strong>localisation network</strong>, a <strong>grid generator</strong> and a <strong>sampler</strong>. Before we dive into each of their details, I’d like to briefly remind you of the 3-step pipeline we talked about last week.</p>
<div class="imgcap">
<img src="/assets/stn2/pipeline.png" width="75%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">Affine Transformation Pipeline</a></div>
</div>
<p>Recall that we can’t just blindly rush to the input image and apply our affine transformation. It’s important to first create a sampling grid, transform it, and then sample the input image using the grid. With that being said, let’s jump into the core components of the Spatial Transformer.</p>
<p><a name="toc4"></a></p>
<h3 id="localisation-network">Localisation Network</h3>
<p>The goal of the localisation network is to spit out the parameters <script type="math/tex">\theta</script> of the affine transformation that’ll be applied to the input feature map. More formally, our localisation net is defined as follows:</p>
<ul>
<li><strong>input</strong>: feature map U of shape (H, W, C)</li>
<li><strong>output</strong>: transformation matrix <script type="math/tex">\theta</script> of shape (6,)</li>
<li><strong>architecture</strong>: fully-connected network or ConvNet.</li>
</ul>
<p>As we train our network, we would like our localisation net to output more and more accurate thetas. <strong>What do we mean by accurate?</strong> Well, think of our digit 7 rotated by 90 degrees counterclockwise. After say 2 epochs, our localisation net may output a transformation matrix which performs a 45 degree clockwise rotation and after 5 epochs for example, it may actually learn to do a complete 90 degree clockwise rotation. The effect is that our output image looks like a standard digit 7, something our neural network has seen in the training data and can easily classify.</p>
<p>Another way to look at it is that the localisation network learns to store the knowledge of how to transform each training sample in the weights of its layers.</p>
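<p>As a rough numpy sketch (the feature size of 32 is hypothetical), the final layer of a localisation net is just a linear regression onto the 6 affine parameters. Initializing its weights to zero and its bias to the identity transform — so the module starts out as a no-op and gradually learns to deviate from it — is the initialization suggested in the paper:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal(32)          # flattened features from earlier layers (hypothetical size)
W_loc = np.zeros((6, 32))                   # zero weights: the features contribute nothing at first
b_loc = np.array([1., 0., 0., 0., 1., 0.])  # bias set to the identity transform's 6 parameters

# The regressed theta, reshaped to the 2x3 affine matrix used by the grid generator.
theta = (W_loc @ features + b_loc).reshape(2, 3)
```

<p>At initialization, <code class="highlighter-rouge">theta</code> is exactly the identity, so the transformer passes its input through unchanged until training pushes it elsewhere.</p>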
<p><a name="toc5"></a></p>
<h3 id="parametrised-sampling-grid">Parametrised Sampling Grid</h3>
<p>The grid generator’s job is to output a parametrised sampling grid, which is a set of points where the input map <strong>should</strong> be sampled to produce the desired transformed output.</p>
<p>Concretely, the grid generator first creates a normalized meshgrid of the same size as the input image U of shape (H, W), that is, a set of indices <script type="math/tex">(x^t, y^t)</script> that cover the whole input feature map (the superscript t here stands for target coordinates in the output feature map). Then, since we’re applying an affine transformation to this grid and would like to use translations, we proceed by adding a row of ones to our coordinate vector to obtain its homogeneous equivalent. This is the little trick we also talked about last week. Finally, we reshape our 6-parameter <script type="math/tex">\theta</script> to a 2x3 matrix and perform the following multiplication, which results in our desired parametrised sampling grid.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
x^{s} \\
y^{s} \\
\end{bmatrix} = \begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
%
\begin{bmatrix}
x^t \\
y^t \\
1
\end{bmatrix} %]]></script>
<p>The column vector <script type="math/tex">\begin{bmatrix}
x^s \\
y^s
\end{bmatrix}</script> consists of a set of indices that tell us where we should sample our input to obtain the desired transformed output.</p>
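<p>The whole grid generator fits in a few lines of numpy. Here’s a sketch with a hypothetical 3x3 output grid in normalized [-1, 1] coordinates, using the identity transform as the example <script type="math/tex">\theta</script>:</p>

```python
import numpy as np

# Target grid: normalized (x^t, y^t) coordinates covering a 3x3 output map.
H, W = 3, 3
xs, ys = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))

# Homogeneous coordinates: the row of ones lets theta encode translations.
grid_t = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # shape (3, H*W)

# The 6-parameter theta reshaped to 2x3; identity transform as an example.
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])

# Source coordinates (x^s, y^s): where to sample the input feature map.
grid_s = theta @ grid_t  # shape (2, H*W)
```

<p>With the identity <code class="highlighter-rouge">theta</code>, the source grid coincides with the target grid; a rotation or zoom in <code class="highlighter-rouge">theta</code> would shift every sampling location accordingly.</p>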
<p><strong>But wait a minute, what if those indices are fractional?</strong> Bingo! That’s why we learned about bilinear interpolation and this is exactly what we do next.</p>
<p><a name="toc6"></a></p>
<h3 id="differentiable-image-sampling">Differentiable Image Sampling</h3>
<p>Since bilinear interpolation is differentiable, it is perfectly suitable for the task at hand. Armed with the input feature map and our parametrised sampling grid, we proceed with bilinear sampling and obtain our output feature map V of shape (H’, W’, C’). Note that this implies that we can perform downsampling and upsampling by specifying the shape of our sampling grid. (take that pooling!) We definitely aren’t restricted to bilinear sampling, and there are other sampling kernels we can use, but the important takeaway is that it must be differentiable to allow the loss gradients to flow all the way back to our localisation network.</p>
<div class="imgcap">
<img src="/assets/stn2/transformation.png" width="60%" style="border:none;" />
<div class="thecap" style="text-align:justify">(<a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a>) Two examples of applying the parameterised sampling grid to an image U producing the output V. <b>(a)</b> Identity transform (i.e. U = V) <b>(2)</b> Affine Transformation (i.e. rotation)</div>
</div>
<p>The above illustrates the inner workings of the Spatial Transformer. Basically it boils down to 2 crucial concepts we’ve been talking about all week: an affine transformation followed by bilinear interpolation. Take a moment and admire the elegance of such a mechanism! We’re letting our network learn the optimal affine transformation parameters that will help it ultimately succeed in the classification task <strong>all on its own</strong>.</p>
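<p>A single-channel bilinear sampling kernel can be sketched as follows — a minimal version that clips indices at the image border and skips proper out-of-bounds handling:</p>

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample image U at (possibly fractional) pixel coordinates (xs, ys)
    using bilinear interpolation. Minimal sketch: single channel only."""
    H, W = U.shape
    # Integer corners surrounding each sampling point, clipped to the image.
    x0 = np.clip(np.floor(xs).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, H - 2)
    x1, y1 = x0 + 1, y0 + 1
    # Fractional offsets from the top-left corner give the blending weights.
    wx, wy = xs - x0, ys - y0
    return (U[y0, x0] * (1 - wx) * (1 - wy) +
            U[y0, x1] * wx * (1 - wy) +
            U[y1, x0] * (1 - wx) * wy +
            U[y1, x1] * wx * wy)

U = np.arange(16, dtype=float).reshape(4, 4)
# Sampling halfway between four neighbours averages them:
# (U[0,0] + U[0,1] + U[1,0] + U[1,1]) / 4 = (0 + 1 + 4 + 5) / 4 = 2.5
print(bilinear_sample(U, np.array([0.5]), np.array([0.5])))
```

<p>Every operation here (floor, clip, weighted sum) is either differentiable or has a usable subgradient with respect to the sampling coordinates, which is exactly what lets the loss gradients flow back through the grid to the localisation network.</p>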
<p><a name="toc7"></a></p>
<h2 id="fun-with-spatial-transformers">Fun with Spatial Transformers</h2>
<p>As a final note, I’ll provide 2 examples that illustrate the power of Spatial Transformers. I’ve attached the references for each example at the bottom of the post, so make sure to look those up if they pique your interest.</p>
<p><a name="toc8"></a></p>
<h3 id="distorted-mnist">Distorted MNIST</h3>
<p>Here is the result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification.</p>
<div class="imgcap">
<img src="/assets/stn2/mnist.png" width="45%" style="border:none;" /><div class="thecap" style="text-align:center">(<a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a>)</div>
</div>
<p>Notice how it has learned to do exactly what we wanted our theoretical “robust” image classification model to do: by zooming in and eliminating background clutter, it has “standardized” the input to facilitate classification. If you want to view a live animation of the transformer in action, click <a href="https://drive.google.com/file/d/0B1nQa_sA3W2iN3RQLXVFRkNXN0k/view">here</a>.</p>
<p><a name="toc9"></a></p>
<h3 id="german-traffic-sign-recognition-benchmark-gtsrb-dataset">German Traffic Sign Recognition Benchmark (GTSRB) dataset</h3>
<div class="imgcap">
<div>
<img src="/assets/stn2/epoch_evolution.gif" style="max-width:49%; height:250px;" />
<img src="/assets/stn2/moving_evolution.gif" style="max-width:49%; height:250px;" />
</div>
<div class="thecap">(<a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">Image Courtesy</a>) <b>Left</b>: Behavior of the Spatial Transformer during training. Notice how it learns to focus on the traffic sign, gradually removing background. <b>Right</b>: Output for different input images. Note how it stays approximately contant regardless of the input variability and distortion. Pretty neat!</div>
</div>
<p><a name="toc10"></a></p>
<h2 id="summary">Summary</h2>
<p>In today’s blog post, we went over Google Deepmind’s Spatial Transformer Network paper. We started by introducing the different challenges classification models face, mainly how distortions in the input images can cause our classifiers to fail. One remedy is to use pooling layers; however they possess a few glaring limitations that have made them fall into disuse. The other remedy, and the subject of this blog post, is to use Spatial Transformer Networks.</p>
<p>The STN is a differentiable module that can be inserted anywhere in a ConvNet architecture to increase its geometric invariance. It effectively endows our networks with the ability to spatially transform feature maps at no extra data or supervision cost. Finally, we saw how the whole mechanism boils down to 2 familiar operations: an affine transformation and bilinear interpolation.</p>
<p>In next week’s blog post we’ll be using what we’ve learned so far to aid us in coding this paper from scratch in Tensorflow. In the meantime, if you have any questions, feel free to post them in the comment section below.</p>
<p>Cheers and see you next week!</p>
<p><a name="toc11"></a></p>
<h2 id="references">References</h2>
<ul>
<li>The original Deepmind paper - click <a href="https://arxiv.org/abs/1506.02025">here</a></li>
<li>Kudos to the Torch blog post on STNs which really helped me during the learning process - click <a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">here</a></li>
<li>Torch Implementation also helped me grasp the inner workings of STNs - check out this <a href="https://github.com/qassemoquab/stnbhwd">repo</a></li>
<li>Stanford’s CS231n as always - click <a href="http://cs231n.github.io">here</a></li>
</ul>
Wed, 18 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/18/stn-part2/
Deep Learning Paper Implementations: Spatial Transformer Networks - Part I<div class="imgcap">
<img src="/assets/stn/ai.jpg" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/">Image Courtesy</a></div>
</div>
<p>The first three blog posts in my “Deep Learning Paper Implementations” series will cover <a href="https://arxiv.org/abs/1506.02025">Spatial Transformer Networks</a> introduced by <em>Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu</em> of Google Deepmind in 2015. The Spatial Transformer Network is a learnable module aimed at increasing the spatial invariance of Convolutional Neural Networks in a computationally and parameter efficient manner.</p>
<p>In this first installment, we’ll be introducing two very important concepts that will prove crucial in understanding the inner workings of the Spatial Transformer layer. We’ll first start by examining a subset of image transformation techniques that fall under the umbrella of <strong>affine transformations</strong>, and then dive into a procedure that commonly follows these transformations: <strong>bilinear interpolation</strong>.</p>
<p>In the second installment, we’ll be going over the Spatial Transformer Layer in detail and summarizing the paper, and then in the third and final part, we’ll be coding it from scratch in Tensorflow and applying it to the <a href="http://benchmark.ini.rub.de/?section=gtsrb&subsection=news">GTSRB dataset</a> (German Traffic Sign Recognition Benchmark).</p>
<p>For the full code that appears on this page, visit my <a href="https://github.com/kevinzakka/blog-code/tree/master/spatial_transformer">Github Repository</a>.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Image Transformations</a>
<ul>
<li><a href="#toc2">Scale</a></li>
<li><a href="#toc3">Rotate</a></li>
<li><a href="#toc4">Shear</a></li>
<li><a href="#toc5">Translate</a></li>
</ul>
</li>
<li><a href="#toc6">Bilinear Interpolation</a>
<ul>
<li><a href="#toc7">Motivation</a></li>
<li><a href="#toc8">Algorithm</a></li>
<li><a href="#toc9">Python Code</a></li>
</ul>
</li>
<li><a href="#toc10">Results</a></li>
<li><a href="#toc11">Conclusion</a></li>
<li><a href="#toc12">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="image-transformations">Image Transformations</h3>
<p>To lay the groundwork for affine transformations, we first need to talk about linear transformations. To that end, we’ll be restricting ourselves to 2 dimensions and working with matrices.</p>
<p>We define the following:</p>
<ul>
<li>a point K with coordinates
<script type="math/tex">\begin{bmatrix}
x \\
y
\end{bmatrix}</script> represented as a <script type="math/tex">(2\times1)</script> column vector.</li>
<li>a matrix
<script type="math/tex">% <![CDATA[
M=
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix} %]]></script> represented as a square matrix of shape <script type="math/tex">(2\times2)</script>.</li>
</ul>
<p>and would like to examine the linear transformation <script type="math/tex">T</script> defined by the matrix product <script type="math/tex">K' = T(K) = MK</script> as we vary the parameters a, b, c and d of M.</p>
<p><strong>Warm-Up Question.</strong></p>
<p>Say we set <script type="math/tex">a = d = 1</script> and <script type="math/tex">b = c = 0</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix} %]]></script>
<p>In that case, what transform do you think we would obtain? Go ahead and give it a few moments’ thought…</p>
<p><strong>Solution.</strong></p>
<p>Let’s write it out:</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x \\
y
\end{bmatrix} = K %]]></script>
<p>We’ve actually represented the identity transform, meaning that the point K does not move in the plane. Let us now jump to more interesting transforms.</p>
<p><a name="toc2"></a></p>
<p><strong>Scaling.</strong></p>
<div class="imgcap">
<img src="/assets/stn/scale.png" width="27%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>We let <script type="math/tex">b = c = 0</script>, and let <script type="math/tex">a = p</script> and <script type="math/tex">d = q</script> take on any positive values.</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
p & 0 \\
0 & q
\end{bmatrix} %]]></script>
<p>Note that there is a special case of scaling called <em>isotropic</em> scaling in which the scaling factor for both the x and y direction is the same, say <script type="math/tex">s</script>. In that case, enlarging an image would correspond to <script type="math/tex">s > 1</script> while shrinking would correspond to <script type="math/tex">% <![CDATA[
s < 1 %]]></script>. It’s a bit non-intuitive then that to zoom in on an image, you need <script type="math/tex">% <![CDATA[
s < 1 %]]></script>: this is because we will be scaling the <em>sampling grid</em> rather than the image itself, so shrinking the grid means sampling from a smaller region of the input, which yields a magnified view.</p>
<p>Anyway, performing the matrix product, we obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
p & 0 \\
0 & q
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
px \\
qy
\end{bmatrix} %]]></script>
<p><a name="toc3"></a></p>
<p><strong>Rotation.</strong></p>
<div class="imgcap">
<img src="/assets/stn/rot.png" width="19%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>Suppose we want to rotate by an angle <script type="math/tex">\theta</script> about the origin. To do so, we set <script type="math/tex">a = d = \cos{\theta}</script>, <script type="math/tex">b = -\sin{\theta}</script> and <script type="math/tex">c = \sin{\theta}</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
\cos{\theta} & -\sin{\theta} \\
\sin{\theta} & \cos{\theta}
\end{bmatrix} %]]></script>
<p>We thus obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
\cos{\theta} & -\sin{\theta} \\
\sin{\theta} & \cos{\theta}
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x\cos{\theta}- y\sin{\theta} \\
x\sin{\theta} + y\cos{\theta}
\end{bmatrix} %]]></script>
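<p>As a quick sanity check (a small numpy sketch of my own, not part of the derivation), rotating the point (1, 0) by 90° counter-clockwise should land it on (0, 1):</p>

```python
import numpy as np

# rotate the point (1, 0) by 90 degrees counter-clockwise about the origin
theta = np.pi / 2
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
K = np.array([[1.], [0.]])  # (2x1) column vector

K_prime = M @ K  # approximately [[0.], [1.]]
```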
<p><a name="toc4"></a></p>
<p><strong>Shear.</strong></p>
<div class="imgcap">
<img src="/assets/stn/shear.png" width="27%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>When we shear an image, we offset the y direction by a distance proportional to x, and the x direction by a distance proportional to y. For example, when we go from normal text to italics, we are effectively applying a shear transform (think about shearing a deck of cards if that helps).</p>
<p>To achieve shearing, we set <script type="math/tex">a = d = 1</script>, <script type="math/tex">b = m</script> and <script type="math/tex">c = n</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
1 & m \\
n & 1
\end{bmatrix} %]]></script>
<p>This yields</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & m \\
n & 1
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x + my \\
y + nx
\end{bmatrix} %]]></script>
<hr />
<p>In summary, we have defined 3 basic linear transformations:</p>
<ul>
<li><strong>scaling:</strong> scales the x and y direction by a scalar.</li>
<li><strong>shearing:</strong> offsets x by an amount proportional to y, and y by an amount proportional to x.</li>
<li><strong>rotating:</strong> rotates the points around the origin by an angle <script type="math/tex">\theta</script>.</li>
</ul>
<p>Now the nice thing about matrices is that we can collapse sequential linear transformations into a single transformation matrix. For example, say we would like to apply a shear, a scale and then a rotation to our column vector K. Given that these transformations can be represented by the matrices <script type="math/tex">H</script>, <script type="math/tex">S</script> and <script type="math/tex">R</script>, and respecting the order of transformations, we can write down this operation as</p>
<script type="math/tex; mode=display">K' = R \big[ S \big( HK \big) \big]</script>
<p>But recall that matrix multiplication is associative! So this reduces to</p>
<script type="math/tex; mode=display">\boxed{K' = MK}</script>
<p>where <script type="math/tex">M = RSH</script>. Be mindful of the order since matrix multiplication <script type="math/tex">\color{red}{\text{is not}}</script> commutative.</p>
<p>A beautiful consequence of this formula is that if we are given multiple transformations to do for a very high-dimensional vector, then we can basically carry out a single matrix multiplication rather than repeatedly manipulating the high-dimensional vector for every sequential transformation.</p>
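<p>Here is a small numpy sketch of this collapse (the particular angle, scale and shear factors are arbitrary choices for illustration): applying the shear, scale and rotation one at a time gives the same result as multiplying by the single collapsed matrix.</p>

```python
import numpy as np

theta = np.pi / 6
H = np.array([[1., 0.5],
              [0., 1. ]])                       # shear
S = np.array([[2., 0.],
              [0., 3.]])                        # scale
R = np.array([[np.cos(theta), -np.sin(theta)],  # rotation
              [np.sin(theta),  np.cos(theta)]])

K = np.array([[1.], [2.]])

# sequential application: shear, then scale, then rotate
K_seq = R @ (S @ (H @ K))

# collapsed into a single matrix -- mind the order, since
# matrix multiplication is not commutative
M = R @ S @ H
K_one = M @ K

assert np.allclose(K_seq, K_one)
```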
<hr />
<p><a name="toc5"></a></p>
<p><strong>Translation.</strong></p>
<p>The only downside to this <script type="math/tex">2 \times 2</script> matrix representation is that we cannot represent translation since it isn’t a linear transformation. Translation however, is a very important and needed transformation, so we would like to be able to encapsulate it in our matrix representation.</p>
<p>To solve this dilemma, we represent our 2D vectors in 3D using <strong>homogeneous coordinates</strong> as follows:</p>
<ul>
<li>our point K becomes a <script type="math/tex">(3\times1)</script> column vector
<script type="math/tex">\begin{bmatrix}
x \\
y \\
1
\end{bmatrix}</script></li>
<li>our matrix M becomes a <script type="math/tex">(3\times3)</script> square matrix
<script type="math/tex">% <![CDATA[
M=
\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
0 & 0 & 1
\end{bmatrix} %]]></script></li>
</ul>
<p>To represent a translation, all we have to do is place 2 new parameters <script type="math/tex">e</script> and <script type="math/tex">f</script> in our third column like so</p>
<script type="math/tex; mode=display">% <![CDATA[
M=
\begin{bmatrix}
a & b & e \\
c & d & f \\
0 & 0 & 1
\end{bmatrix} %]]></script>
<p>and we can thus carry out translations as linear transformations in homogeneous coordinates. Note that if we require a 2D output, then all we need to do is represent M as a <script type="math/tex">2 \times 3</script> matrix and leave K untouched.</p>
<p><strong>Example.</strong></p>
<p>Translate both the x and y direction by <script type="math/tex">\Delta</script>. Result should be 2D.</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & 0 & \Delta \\
0 & 1 & \Delta
\end{bmatrix}
%
\begin{bmatrix}
x \\
y \\
1
\end{bmatrix} =
\begin{bmatrix}
x + \Delta \\
y + \Delta
\end{bmatrix} %]]></script>
<p><strong>Summary.</strong></p>
<div class="imgcap">
<img src="/assets/stn/affine.png" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>By using a little trick, we were able to add a new transformation to our repertoire of linear transformations. This transformation, called translation, is an affine transformation. Hence, we can generalize our results and represent our 4 affine transformations (all linear transformations are affine) by the 6 parameter matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
M=
\begin{bmatrix}
a & b & c \\
d & e & f
\end{bmatrix} %]]></script>
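<p>As a minimal sketch of this 6-parameter form (the specific values are my own, picked for illustration), here is an affine matrix that scales by 2 and translates by (5, -3), applied to a point in homogeneous coordinates:</p>

```python
import numpy as np

# 6-parameter affine matrix: scale x and y by 2, translate by (5, -3)
M = np.array([[2., 0.,  5.],
              [0., 2., -3.]])

# the point (x, y) = (1, 1) in homogeneous coordinates
K = np.array([[1.], [1.], [1.]])

K_prime = M @ K  # [[7.], [-1.]] -- a 2D output, since M is (2x3)
```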
<p><a name="toc6"></a></p>
<h3 id="bilinear-interpolation">Bilinear Interpolation</h3>
<p><a name="toc7"></a></p>
<p><strong>Motivation.</strong> When an image undergoes an affine transformation such as a rotation or scaling, the pixels in the image get moved around. This can be especially problematic when a pixel location in the output does not map directly to one in the input image.</p>
<p>In the illustration below, you can clearly see that the rotation places some points at locations that are not centered in the squares. This means that they would not have a corresponding pixel value in the original image.</p>
<div class="imgcap">
<img src="/assets/stn/stickman.png" width="70%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://northstar-www.dartmouth.edu/doc/idl/html_6.2/Interpolation_Methods.html">Image Courtesy</a></div>
</div>
<p>So for example, suppose that after rotating an image, we need to find the pixel value at the location (6.7, 3.2). The problem with this is that there is no such thing as fractional pixel locations.</p>
<p>To solve this problem, bilinear interpolation uses the 4 nearest pixel values which are located in diagonal directions from a given location in order to find the appropriate color intensity values of that pixel. The result is smoother and more realistic images!</p>
<p><a name="toc8"></a></p>
<p><strong>Algorithm.</strong></p>
<div class="imgcap">
<img src="/assets/stn/interpol.png" width="35%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Image Courtesy</a></div>
</div>
<p>Our goal is to find the pixel value of the point P. To do so, we calculate the pixel value of <script type="math/tex">R_1</script> and <script type="math/tex">R_2</script> using a weighted average of <script type="math/tex">(Q_{11}, Q_{21})</script> and <script type="math/tex">(Q_{12}, Q_{22})</script> respectively. Then, we use a weighted average of <script type="math/tex">R_2</script> and <script type="math/tex">R_1</script> to find the value of P.</p>
<p>Effectively, we are interpolating in the x direction and then the y direction, hence the name bilinear interpolation. You could just as well flip the order of interpolation and get the exact same value.</p>
<p>So given a point <script type="math/tex">P = (x, y)</script> and 4 corner coordinates <script type="math/tex">Q_{11} = (x_1, y_1)</script>, <script type="math/tex">Q_{21} = (x_2, y_1)</script>, <script type="math/tex">Q_{12} = (x_1, y_2)</script> and <script type="math/tex">Q_{22} = (x_2, y_2)</script>, we first interpolate in the x-direction:</p>
<script type="math/tex; mode=display">R_1 = \frac{x_2 - x}{x_2 - x_1}Q_{11} + \frac{x - x_1}{x_2 - x_1}Q_{21}</script>
<script type="math/tex; mode=display">R_2 = \frac{x_2 - x}{x_2 - x_1}Q_{12} + \frac{x - x_1}{x_2 - x_1}Q_{22}</script>
<p>and finally in the y-direction:</p>
<script type="math/tex; mode=display">\boxed{P = \frac{y_2 - y}{y_2 - y_1}R_1 + \frac{y - y_1}{y_2 - y_1}R_2}</script>
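<p>The three equations above translate almost verbatim into code. Here is a minimal scalar sketch (the function name <code class="highlighter-rouge">bilinear</code> and the test values are mine) before we build the batched numpy version later on:</p>

```python
def bilinear(x, y, x1, x2, y1, y2, Q11, Q21, Q12, Q22):
    """Interpolate the value at (x, y) from the 4 corner values Q_ij."""
    # interpolate in the x-direction
    R1 = (x2 - x) / (x2 - x1) * Q11 + (x - x1) / (x2 - x1) * Q21
    R2 = (x2 - x) / (x2 - x1) * Q12 + (x - x1) / (x2 - x1) * Q22
    # then interpolate in the y-direction
    return (y2 - y) / (y2 - y1) * R1 + (y - y1) / (y2 - y1) * R2

# at the exact center of the 4 corners, the result is their average
val = bilinear(0.5, 0.5, 0., 1., 0., 1., 10., 20., 30., 40.)  # 25.0
```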
<p><a name="toc9"></a></p>
<p><strong>Python Code.</strong></p>
<p>One very very important note before we jump into the code!</p>
<hr />
<p>An image processing affine transformation usually follows the 3-step pipeline below:</p>
<ul>
<li>First, we create a sampling grid composed of <script type="math/tex">(x, y)</script> coordinates. For example, given a 400x400 grayscale image, we create a meshgrid of same dimension, that is, evenly spaced <script type="math/tex">x \in [0, W]</script> and <script type="math/tex">y \in [0, H]</script>.</li>
<li>We then apply the transformation matrix to the sampling grid generated in the step above.</li>
<li>Finally, we sample the resulting grid from the original image using the desired interpolation technique.</li>
</ul>
<p>As you can see, this is different than directly applying a transform to the original image.</p>
<hr />
<p>I’ve attached 2 cat images in the Github Repository mentioned at the top of this page which you should go ahead and download. Save them to your Desktop in a folder called <code class="highlighter-rouge">data/</code> or make sure to update the path location if you choose differently.</p>
<p>I’ve also written a function <code class="highlighter-rouge">load_img()</code> that converts images to numpy arrays. I won’t go into its details but it’s pretty basic and you shouldn’t take long to understand what it does. Note that you’ll need both PIL and Numpy to reproduce the results below.</p>
<p>Armed with this function, let’s load both cat images and concatenate them into a single input array. We’re working with 2 images because we want to make our code as general as possible.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c"># params</span>
<span class="n">DIMS</span> <span class="o">=</span> <span class="p">(</span><span class="mi">400</span><span class="p">,</span> <span class="mi">400</span><span class="p">)</span>
<span class="n">CAT1</span> <span class="o">=</span> <span class="s">'cat1.jpg'</span>
<span class="n">CAT2</span> <span class="o">=</span> <span class="s">'cat2.jpg'</span>
<span class="c"># load both cat images</span>
<span class="n">img1</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">CAT1</span><span class="p">,</span> <span class="n">DIMS</span><span class="p">)</span>
<span class="n">img2</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">CAT2</span><span class="p">,</span> <span class="n">DIMS</span><span class="p">,</span> <span class="n">view</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># concat into tensor of shape (2, 400, 400, 3)</span>
<span class="n">input_img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">img1</span><span class="p">,</span> <span class="n">img2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c"># dimension sanity check</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Input Img Shape: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">input_img</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
</code></pre></div></div>
<p>Given that we have 2 images, our batch size is equal to 2. This means that we need an equal number of transformation matrices M, one for each image in the batch.</p>
<p>Let’s go ahead and initialize 2 identity transform matrices. This is the simplest case, and if we implement our bilinear sampler correctly, we should expect our output image to be almost exact to the input image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># grab shape</span>
<span class="n">num_batch</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">C</span> <span class="o">=</span> <span class="n">input_img</span><span class="o">.</span><span class="n">shape</span>
<span class="c"># initialize M to identity transform</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">]])</span>
<span class="c"># repeat num_batch times</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
</code></pre></div></div>
<p>(Recall that our general affine transformation matrix is <script type="math/tex">2 \times 3</script> if we want to include translation.)</p>
<p>Now we need to write a function that will generate a meshgrid for us and output a sampling grid resulting from the product of this meshgrid and our transformation matrix M.</p>
<p>Let’s go ahead and generate our meshgrid. We’ll create a normalized one, that is, the values of x and y range from -1 to 1, with <code class="highlighter-rouge">width</code> values of x and <code class="highlighter-rouge">height</code> values of y. In fact, note that for images, x corresponds to the width of the image (i.e. number of columns of the matrix) while y corresponds to the height of the image (i.e. number of rows of the matrix).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># create normalized 2D grid</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">H</span><span class="p">)</span>
<span class="n">x_t</span><span class="p">,</span> <span class="n">y_t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>Then we need to augment the dimensions to create homogeneous coordinates.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># reshape to (xt, yt, 1) </span>
<span class="n">ones</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">x_t</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
<span class="n">sampling_grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">x_t</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">y_t</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">ones</span><span class="p">])</span>
</code></pre></div></div>
<p>So we’ve created 1 grid here, but we need <code class="highlighter-rouge">num_batch</code> grids. Same as above, our one-liner below repeats our array <code class="highlighter-rouge">num_batch</code> times.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># repeat grid num_batch times</span>
<span class="n">sampling_grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">sampling_grid</span><span class="p">,</span> <span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">H</span><span class="o">*</span><span class="n">W</span><span class="p">))</span>
</code></pre></div></div>
<p>Now we perform step 2 of our image transformation pipeline.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># transform the sampling grid i.e. batch multiply</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">sampling_grid</span><span class="p">)</span>
<span class="c"># batch grid has shape (num_batch, 2, H*W)</span>
<span class="c"># reshape to (num_batch, height, width, 2)</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">moveaxis</span><span class="p">(</span><span class="n">batch_grids</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Finally, let’s write our bilinear sampler. Given our coordinates <code class="highlighter-rouge">x</code> and <code class="highlighter-rouge">y</code> in the sampling grid, we want to interpolate the pixel value in the original image.</p>
<p>Let’s start by separating the x and y dimensions and rescaling them to lie in the height/width interval.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x_s</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
<span class="n">y_s</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
<span class="c"># rescale x and y to [0, W/H]</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x_s</span> <span class="o">+</span> <span class="mf">1.</span><span class="p">)</span> <span class="o">*</span> <span class="n">W</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.5</span>
<span class="n">y</span> <span class="o">=</span> <span class="p">((</span><span class="n">y_s</span> <span class="o">+</span> <span class="mf">1.</span><span class="p">)</span> <span class="o">*</span> <span class="n">H</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.5</span>
</code></pre></div></div>
<p>Now for each coordinate <script type="math/tex">(x_i, y_i)</script> we want to grab 4 corner coordinates.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># grab 4 nearest corner points for each (x_i, y_i)</span>
<span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">x0</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">y0</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>
<p>(Note that we could just as well use the ceiling function rather than the increment by 1).</p>
<p>Now we must make sure that no value goes beyond the image boundaries. For example, suppose we have <script type="math/tex">x = 399</script>; then <script type="math/tex">x_0 = 399</script> and <script type="math/tex">x_1 = x_0 + 1 = 400</script>, which would cause an out-of-bounds indexing error in numpy. Thus we clip our corner coordinates in the following way:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># make sure it's inside img range [0, H] or [0, W]</span>
<span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">W</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">W</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">y0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">H</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">H</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Now we use advanced numpy indexing to grab the pixel value for each corner coordinate. These correspond to <code class="highlighter-rouge">(x0, y0)</code>, <code class="highlighter-rouge">(x0, y1)</code>, <code class="highlighter-rouge">(x1, y0)</code> and <code class="highlighter-rouge">(x1, y1)</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># look up pixel values at corner coords</span>
<span class="n">Ia</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x0</span><span class="p">]</span>
<span class="n">Ib</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x0</span><span class="p">]</span>
<span class="n">Ic</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span>
<span class="n">Id</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span>
</code></pre></div></div>
<p>Almost there! Now, we calculate the weight coefficients,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># calculate deltas</span>
<span class="n">wa</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
<span class="n">wb</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y0</span><span class="p">)</span>
<span class="n">wc</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">x0</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
<span class="n">wd</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">x0</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y0</span><span class="p">)</span>
</code></pre></div></div>
<p>and finally, multiply and add according to the formula mentioned previously.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># add dimension for addition</span>
<span class="n">wa</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wa</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wb</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wb</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wc</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wc</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wd</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wd</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="c"># compute output</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">wa</span><span class="o">*</span><span class="n">Ia</span> <span class="o">+</span> <span class="n">wb</span><span class="o">*</span><span class="n">Ib</span> <span class="o">+</span> <span class="n">wc</span><span class="o">*</span><span class="n">Ic</span> <span class="o">+</span> <span class="n">wd</span><span class="o">*</span><span class="n">Id</span>
</code></pre></div></div>
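Putting the pieces together, the snippets above can be wrapped into one function. Here is a minimal single-image sketch (no batch dimension, and the name <code class="highlighter-rouge">bilinear_sample</code> is my own, not from the original script). A handy sanity check: sampling at integer coordinates must return the source pixels exactly, and sampling halfway between two pixels must return their average.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample img (H, W, C) at float coordinates x, y using the
    corner-gather and weighted-sum steps walked through above."""
    H, W = img.shape[:2]
    x0 = np.floor(x).astype(np.int64)
    x1 = x0 + 1
    y0 = np.floor(y).astype(np.int64)
    y1 = y0 + 1
    # clip corner indices to the valid pixel grid, as in the main code
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    # look up pixel values at the four corners
    Ia, Ib = img[y0c, x0c], img[y1c, x0c]
    Ic, Id = img[y0c, x1c], img[y1c, x1c]
    # weights use the unclipped corners so they always sum to 1
    wa = ((x1 - x) * (y1 - y))[..., None]
    wb = ((x1 - x) * (y - y0))[..., None]
    wc = ((x - x0) * (y1 - y))[..., None]
    wd = ((x - x0) * (y - y0))[..., None]
    return wa * Ia + wb * Ib + wc * Ic + wd * Id

img = np.arange(24, dtype=np.float64).reshape(4, 3, 2)
exact = bilinear_sample(img, np.array([1.0]), np.array([2.0]))    # lands on pixel (y=2, x=1)
halfway = bilinear_sample(img, np.array([0.5]), np.array([0.0]))  # midpoint of two pixels
```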
<hr />
<p><a name="toc10"></a></p>
<h3 id="results">Results</h3>
<p>So now that we’ve gone through the whole code incrementally, let’s have some fun and experiment with different values of the transformation matrix M.</p>
<p>The first thing you need to do is copy and paste the full script, which has been refactored to be more modular. Now let’s test that our function works correctly.</p>
<p><strong>Identity Transform.</strong></p>
<p>Add the following 2 lines at the end of the script and execute it.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>plt.imshow(out[1])
plt.show()
</code></pre></div></div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft1.png" width="300" />
</p>
<p><strong>Translation.</strong></p>
<p>Say we want to translate the picture by <code class="highlighter-rouge">0.5</code> only in the x direction. Since the transformation is applied to the sampling grid rather than to the image itself, this shifts the image content to the left.</p>
<p>Edit the following line of your code as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M = np.array([[1., 0., 0.5], [0., 1., 0.]])
</code></pre></div></div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft2.png" width="300" />
</p>
<p><strong>Rotation.</strong></p>
<p>Finally, say we want to rotate the picture by <code class="highlighter-rouge">45</code> degrees. Given that <script type="math/tex">\cos{(45)} = \sin{(45)} = \frac{\sqrt{2}}{2} \approx 0.707</script>, edit just this line of your code as follows:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>M = np.array([[0.707, -0.707, 0.], [0.707, 0.707, 0.]])
</code></pre></div></div>
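Rather than hard-coding 0.707, the rotation matrix for any angle can be generated programmatically. A small sketch (the helper name is mine, not part of the original script):

```python
import numpy as np

def rotation_matrix(deg):
    """Affine transformation matrix M for a rotation by `deg` degrees."""
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t), 0.],
                     [np.sin(t),  np.cos(t), 0.]])

M = rotation_matrix(45)  # matches the hand-written matrix above
```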
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft3.png" width="300" />
</p>
<p><a name="toc11"></a></p>
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we went over basic linear transformations such as rotation, shear and scale before generalizing to affine transformations, which also include translations. Then, we saw the importance of bilinear interpolation in the context of these transformations. Finally, we went over the algorithm, coded it from scratch in Python and wrote 2 methods that helped us visualize these transformations according to a 3-step image processing pipeline.</p>
<p>In the next installment of this series, we’ll go over the Spatial Transformer Network layer in detail as well as summarize the paper it is described in.</p>
<p>See you next week!</p>
<p><a name="toc12"></a></p>
<h3 id="references">References</h3>
<p>A big thank you to <a href="https://twitter.com/edersantana">Eder Santana</a> for introducing me to this paper!</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Bilinear Interpolation Wikipedia</a></li>
<li><a href="http://supercomputingblog.com/graphics/coding-bilinear-interpolation/">Bilinear Interpolation</a></li>
<li><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Matrix Transformations PDF</a></li>
<li><a href="http://stackoverflow.com/questions/12729228/simple-efficient-bilinear-interpolation-of-images-in-numpy-and-python">Bilinear Interpolation Code</a></li>
</ul>
Tue, 10 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/10/stn-part1/
http://kevinzakka.github.io/2017/01/10/stn-part1/
Tags: deepmind, google, spatial transformer networks, transformations, affine, linear, bilinear interpolation

Nuts and Bolts of Applying Deep Learning

<div class="imgcap">
<img src="/assets/app_dl/bolts.jpg" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://nutsandbolts.mit.edu/">Image Courtesy</a></div>
</div>
<p>This weekend was very hectic (catching up on courses and studying for a statistics quiz), but I managed to squeeze in some time to watch the <a href="http://www.bayareadlschool.org/">Bay Area Deep Learning School</a> livestream on YouTube. For those of you wondering what that is, BADLS is a 2-day conference hosted at Stanford University, consisting of back-to-back presentations on topics ranging from NLP and Computer Vision to Unsupervised Learning and Reinforcement Learning. Additionally, top DL software libraries such as Torch, Theano and TensorFlow were presented.</p>
<p>There were some super interesting talks from leading experts in the field: <a href="http://www.dmi.usherb.ca/~larocheh/index_en.html">Hugo Larochelle</a> from Twitter, <a href="http://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a> from OpenAI, <a href="http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html">Yoshua Bengio</a> from the Université de Montréal, and <a href="http://www.andrewng.org/">Andrew Ng</a> from Baidu, to name a few. Of the plethora of presentations, there was one somewhat non-technical one given by Andrew that really piqued my interest.</p>
<p>In this blog post, I’m gonna try and give an overview of the main ideas outlined in his talk. The goal is to pause a bit and examine the ongoing trends in Deep Learning thus far, as well as gain some insight into applying DL in practice.</p>
<p>By the way, if you missed out on the livestreams, you can still view them at the following: <a href="https://www.youtube.com/watch?v=eyovmAtoUx0">Day 1</a> and <a href="https://www.youtube.com/watch?v=9dXiAecyJrY">Day 2</a>.</p>
<p><strong>Table of Contents</strong>:</p>
<ul>
<li><a href="#toc1">Major Deep Learning Trends</a></li>
<li><a href="#toc2">End-to-End Deep Learning</a></li>
<li><a href="#toc3">Bias-Variance Tradeoff</a></li>
<li><a href="#toc4">Human-level Performance</a></li>
<li><a href="#toc5">Personal Advice</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="major-deep-learning-trends">Major Deep Learning Trends</h3>
<p><strong>Why do DL algorithms work so well?</strong> According to Ng, with the rise of the Internet, Mobile and IoT era, the amount of data accessible to us has greatly increased. This translates directly into a boost in the performance of neural network models, especially the larger ones, which have the capacity to absorb all this data.</p>
<p align="center">
<img src="/assets/app_dl/perf_vs_data.png" width="450" />
</p>
<p>However, in the small data regime (left-hand side of the x-axis), the relative ordering of the algorithms is not that well defined and really depends on who is more motivated to engineer their features better, or refine and tune the hyperparameters of their model.</p>
<p>Thus this trend is more prevalent in the big data realm where hand engineering effectively gets replaced by end-to-end approaches and bigger neural nets combined with a lot of data tend to outperform all other models.</p>
<p><strong>Machine Learning and HPC team.</strong> The rise of big data and the need for larger models has started to put pressure on companies to hire a Computer Systems team. This is because some of the HPC (high-performance computing) applications require highly specialized knowledge and it is difficult to find researchers and engineers with sufficient knowledge in both fields. Thus, cooperation from both teams is the key to boosting performance in AI companies.</p>
<p><strong>Categorizing DL models.</strong> Work in DL can be categorized in the following 4 buckets:</p>
<p align="center">
<img src="/assets/app_dl/bucket.svg" width="350" />
</p>
<p>Most of the value in the industry today is driven by the models in the orange blob (innovation and monetization mostly) but Andrew believes that <strong>unsupervised deep learning</strong> is a super-exciting field that has loads of potential for the future.</p>
<p><a name="toc2"></a></p>
<h3 id="the-rise-of-end-to-end-dl">The rise of End-to-End DL</h3>
<p>A major improvement in the end-to-end approach has been the fact that outputs are becoming more and more complicated. For example, rather than just outputting a simple class score such as 0 or 1, algorithms are starting to generate richer outputs: images in the case of GANs, full captions with RNNs and, most recently, audio as in DeepMind’s WaveNet.</p>
<p><strong>So what exactly does end-to-end training mean?</strong> Essentially, it means that AI practitioners are shying away from intermediate representations and going directly from one end (raw input) to the other end (output). Here’s an example from speech recognition.</p>
<p align="center">
<img src="/assets/app_dl/end-to-end.svg" width="340" />
</p>
<p><strong>Are there any disadvantages to this approach?</strong> End-to-end approaches are data hungry meaning they only perform well when provided with a huge dataset of labelled examples. In practice, not all applications have the luxury of large labelled datasets so other approaches which allow hand-engineered information and field expertise to be added into the model have gained the upper hand. As an example, in a self-driving car setting, going directly from the raw image to the steering direction is pretty difficult. Rather, many features such as trajectory and pedestrian location are calculated first as intermediate steps.</p>
<p>The main take-away from this section is that we should always be cautious of end-to-end approaches in applications where huge data is hard to come by.</p>
<p><a name="toc3"></a></p>
<h3 id="bias-variance-tradeoff">Bias-Variance Tradeoff</h3>
<p><strong>Splitting your data.</strong> In most deep learning problems, train and test come from different distributions. For example, suppose you are working on implementing an AI-powered rearview mirror and have gathered 2 chunks of data: the first, larger chunk comes from many places (could be partly bought, and partly crowdsourced) and the second, much smaller chunk is actual car data.</p>
<p>In this case, splitting the data into train/dev/test can be tricky. One might be tempted to carve the dev set out of the training chunk like in the first example of the diagram below. (Note that the chunk on the left corresponds to data mined from the first distribution and the one on the right to the one from the second distribution.)</p>
<p align="center">
<img src="/assets/app_dl/split.svg" width="500" />
</p>
<p>This is bad because we usually want our dev and test sets to come from the same distribution. The reason is that part of the team will be spending a lot of time tuning the model to work well on the dev set; if the test set turns out to be very different from the dev set, then pretty much all of that work will have been wasted effort.</p>
<p>Hence, a smarter way of splitting the above dataset would be just like the second line of the diagram. Now in practice, Andrew recommends creating dev sets from both data distributions: a train-dev and test-dev set. In this manner, any gap between the different errors can help you tackle the problem more clearly.</p>
<p align="center">
<img src="/assets/app_dl/errors.svg" width="450" />
</p>
<p><strong>Flowchart for working with a model.</strong> Given what we have described above, here’s a simplified flowchart of the actions you should take when confronted with training/tuning a DL model.</p>
<p align="center">
<img src="/assets/app_dl/flowachart.svg" width="500" />
</p>
<p><strong>The importance of data synthesis.</strong> Andrew also stressed the importance of data synthesis as part of any workflow in deep learning. While it may be painful to manually engineer training examples, the relative gain in performance you obtain once the model and its parameters fit well is huge and worth your while.</p>
<p><a name="toc4"></a></p>
<h3 id="human-level-performance">Human-level Performance</h3>
<p>One of the very important concepts underlined in this lecture was that of human-level performance. In the basic setting, DL models tend to plateau once they have reached or surpassed human-level accuracy. While it is important to note that human-level performance doesn’t necessarily coincide with the optimal Bayes error rate, it can serve as a very reliable proxy which can be leveraged to determine your next move when training your model.</p>
<p align="center">
<img src="/assets/app_dl/perf.png" width="550" />
</p>
<p><strong>Reasons for the plateau.</strong> There could be a theoretical limit on the dataset which makes further improvement futile (i.e. a noisy subset of the data). Humans are also very good at these tasks so trying to make progress beyond that suffers from diminishing returns.</p>
<p>Here’s an example that can help illustrate the usefulness of human-level accuracy. Suppose you are working on an image recognition task and measure the following:</p>
<ul>
<li><strong>Train error</strong>: 8%</li>
<li><strong>Dev Error</strong>: 10%</li>
</ul>
<p>If I were to tell you that human accuracy for such a task is on the order of 1%, then this would be a blatant bias problem and you could subsequently try increasing the size of your model, training longer, etc. However, if I told you that human-level accuracy was on the order of 7.5%, then this would be more of a variance problem and you’d focus your efforts on methods such as data synthesis or gathering data more similar to the test set.</p>
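The reasoning in this example can be sketched as a tiny decision rule (the function and its name are purely illustrative, not from the talk): compare the avoidable-bias gap (train error minus human-level error) with the variance gap (dev error minus train error), and work on whichever is larger.

```python
def diagnose(human_err, train_err, dev_err):
    """Crude proxy for the decision described above."""
    bias_gap = train_err - human_err      # avoidable bias
    variance_gap = dev_err - train_err    # generalization gap
    return "bias" if bias_gap >= variance_gap else "variance"

# numbers from the example: 8% train error, 10% dev error
case_low_human = diagnose(human_err=0.01, train_err=0.08, dev_err=0.10)    # "bias"
case_high_human = diagnose(human_err=0.075, train_err=0.08, dev_err=0.10)  # "variance"
```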
<p>By the way, there’s always room for improvement. Even if you are close to human-level accuracy overall, there could be subsets of the data where you perform poorly and working on those can boost production performance greatly.</p>
<p>Finally, one might ask what is a good way of defining human-level accuracy. For example, in the following image diagnosis setting, ignoring the cost of obtaining data, how should one pick the criteria for human-level accuracy?</p>
<ul>
<li><strong>typical human</strong>: 5%</li>
<li><strong>general doctor</strong>: 1%</li>
<li><strong>specialized doctor</strong>: 0.8%</li>
<li><strong>group of specialized doctors</strong>: 0.5%</li>
</ul>
<p>The answer is always the best accuracy possible. This is because, as we mentioned earlier, human-level performance is a proxy for the Bayes optimal error rate, so providing a tighter estimate of the best achievable error can help you strategize your next move.</p>
<p><a name="toc5"></a></p>
<h3 id="personal-advice">Personal Advice</h3>
<p>Andrew ended the presentation with 2 ways one can improve his/her skills in the field of deep learning.</p>
<ul>
<li><strong>Practice, Practice, Practice</strong>: compete in Kaggle competitions and read associated blog posts and forum discussions.</li>
<li><strong>Do the Dirty Work</strong>: read a lot of papers and try to replicate the results. Soon enough, you’ll get your own ideas and build your own models.</li>
</ul>
Mon, 26 Sep 2016 00:00:00 +0000
http://kevinzakka.github.io/2016/09/26/applying-deep-learning/
http://kevinzakka.github.io/2016/09/26/applying-deep-learning/
Tags: deep learning, bias, variance, advice, end-to-end, machine learning

Deriving the Gradient for the Backward Pass of Batch Normalization

<div class="imgcap">
<img src="/assets/batch_norm/cs231n.png" width="70%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://cs231n.stanford.edu/">Image Courtesy</a></div>
</div>
<p>I recently sat down to work on assignment 2 of Stanford’s <a href="http://cs231n.github.io/assignments2016/assignment2/">CS231n</a>. It’s lengthy and definitely a step up from the first assignment, but the insight you gain is tremendous.</p>
<p>Anyway, at one point in the assignment, we were tasked with implementing a Batch Normalization layer in our fully-connected net which required writing a forward and backward pass.</p>
<p>The forward pass is relatively simple since it only requires standardizing the input features (zero mean and unit standard deviation). The backward pass, on the other hand, is a bit more involved. It can be done in 2 different ways:</p>
<ul>
<li><strong>staged computation</strong>: we can break up the function into several parts, derive local gradients for them, and finally multiply them with the chain rule.</li>
<li><strong>gradient derivation</strong>: basically, you have to do a “pen and paper” derivation of the gradient with respect to the inputs.</li>
</ul>
<p>It turns out that the second option is faster, albeit nastier, and after struggling for a few hours, I finally got it to work. This post is mainly a clear summary of the derivation along with my thought process, and I hope it can provide others with some insight into and intuition for the chain rule. There is a similar tutorial online already (though I couldn’t follow along very well), so if you want to check it out, head over to <a href="http://cthorey.github.io./backpropagation/">Clément Thorey’s Blog</a>.</p>
<p>Finally, I’ve summarized the original <a href="https://arxiv.org/abs/1502.03167">research paper</a> and accompanied it with a small numpy implementation which you can view on my <a href="https://github.com/kevinzakka/research-paper-notes">Github</a>. With that being said, let’s jump right into the blog.</p>
<h3 id="notation">Notation</h3>
<p>Let’s start with some notation.</p>
<ul>
<li><strong>BN</strong> will stand for Batch Norm.</li>
<li><script type="math/tex">f</script> represents a layer upwards of the BN one.</li>
<li><script type="math/tex">y</script> is the linear transformation which scales <script type="math/tex">x</script> by <script type="math/tex">\gamma</script> and adds <script type="math/tex">\beta</script>.</li>
<li><script type="math/tex">\hat{x}</script> is the normalized inputs.</li>
<li><script type="math/tex">\mu</script> is the batch mean.</li>
<li><script type="math/tex">\sigma^2</script> is the batch variance.</li>
</ul>
<p>The below table shows you the inputs to each function and will help with the future derivation.</p>
<p align="center">
<img src="/assets/batch_norm/table0.png" width="380" />
</p>
<p><strong>Goal</strong>: Find the partial derivatives with respect to the inputs, that is <script type="math/tex">\dfrac{\partial f}{\partial \gamma}</script>, <script type="math/tex">\dfrac{\partial f}{\partial \beta}</script> and <script type="math/tex">\dfrac{\partial f}{\partial x_i}</script>.</p>
<p><strong>Methodology</strong>: derive the gradient with respect to the centered inputs <script type="math/tex">\hat{x}_i</script> (which requires deriving the gradient w.r.t <script type="math/tex">\mu</script> and <script type="math/tex">\sigma^2</script>) and then use those to derive one for <script type="math/tex">x_i</script>.</p>
<h3 id="chain-rule-primer">Chain Rule Primer</h3>
<p>Suppose we’re given a function <script type="math/tex">u(x, y)</script> where <script type="math/tex">x(r, t)</script> and <script type="math/tex">y(r, t)</script>. Then to determine the value of <script type="math/tex">\frac{\partial u}{\partial r}</script> and <script type="math/tex">\frac{\partial u}{\partial t}</script> we need to use the chain rule which says that:</p>
<script type="math/tex; mode=display">\frac{\partial u}{\partial r} = \frac{\partial u}{\partial x} \cdot \frac{\partial x}{\partial r} + \frac{\partial u}{\partial y} \cdot \frac{\partial y}{\partial r}</script>
<p>That’s basically all there is to it. Using this simple concept can help us solve our problem. We just have to be clear and precise when using it and not get lost with the intermediate variables.</p>
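As a quick illustration of the rule, here is a numerical check for the toy choice u(x, y) = x·y with x(r, t) = r + t and y(r, t) = r·t (these functions are my own example, picked only to make the partials easy): the chain-rule expression for the partial of u with respect to r should agree with a finite-difference estimate.

```python
def u_of(r, t):
    # u(x, y) = x * y, where x = r + t and y = r * t
    x, y = r + t, r * t
    return x * y

r, t, eps = 1.3, 0.7, 1e-6

# chain rule: du/dr = du/dx * dx/dr + du/dy * dy/dr = y * 1 + x * t
x, y = r + t, r * t
chain_rule = y * 1.0 + x * t

# finite-difference estimate of du/dr for comparison
numeric = (u_of(r + eps, t) - u_of(r - eps, t)) / (2 * eps)
```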
<h3 id="partial-derivatives">Partial Derivatives</h3>
<p>Here’s the gist of BN taken from the paper.</p>
<p align="center">
<img src="/assets/batch_norm/alg1.png" width="380" />
</p>
<p>We’re gonna start by traversing the table from left to right. At each step we derive the gradient with respect to the inputs in the cell.</p>
<h4 id="cell-1">Cell 1</h4>
<p align="center">
<img src="/assets/batch_norm/table1.png" width="380" />
</p>
<p>Let’s compute <script type="math/tex">\dfrac{\partial f}{\partial y_i}</script>. It actually turns out we don’t need to compute this derivative since we already have it - it’s the upstream derivative <code class="highlighter-rouge">dout</code> given to us in the function parameter.</p>
<h4 id="cell-2">Cell 2</h4>
<p align="center">
<img src="/assets/batch_norm/table2.png" width="380" />
</p>
<p>Let’s work on cell 2 now. We note that <script type="math/tex">y</script> is a function of three variables, so let’s compute the gradient with respect to each one.</p>
<hr />
<p>Starting with <script type="math/tex">\gamma</script> and using the chain rule:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \gamma} &=& \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} \qquad \\
&=& \boxed{\sum\limits_{i=1}^m \frac{\partial f}{\partial y_i} \cdot \hat{x}_i}
\end{eqnarray} %]]></script>
<p>Notice that we sum from <script type="math/tex">1 \rightarrow m</script> because we’re working with batches! If you’re worried you wouldn’t have caught that, think about the dimensions. The gradient with respect to a variable should be of the same size as that same variable so if those two clash, it should tell you you’ve done something wrong.</p>
<hr />
<p>Moving on to <script type="math/tex">\beta</script> we compute the gradient as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \beta} &=& \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \beta} \qquad \\
&=& \boxed{\sum\limits_{i=1}^m \frac{\partial f}{\partial y_i}}
\end{eqnarray} %]]></script>
<hr />
<p>and finally <script type="math/tex">\hat{x}_i</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \hat{x}_i} &=& \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \hat{x}_i} \qquad \\
&=& \boxed{\frac{\partial f}{\partial y_i} \cdot \gamma}
\end{eqnarray} %]]></script>
<hr />
<p>Up to now, things are relatively simple and we’ve already done 2/3 of the work. We <strong>can’t</strong> compute the gradient with respect to <script type="math/tex">x_i</script> just yet though.</p>
<h4 id="cell-3">Cell 3</h4>
<p align="center">
<img src="/assets/batch_norm/table3.png" width="380" />
</p>
<hr />
<p>We start with <script type="math/tex">\mu</script> and notice that <script type="math/tex">\sigma^2</script> is a function of <script type="math/tex">\mu</script>, therefore we need to add its contribution to the partial - (I’ve highlighted the missing partials in red):</p>
<script type="math/tex; mode=display">\dfrac{\partial f}{\partial \mu} = \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \color{red}{\frac{\partial \hat{x}_i}{\partial \mu}} + \color{red}{\frac{\partial f}{\partial \sigma^2}} \cdot \color{red}{\frac{\partial \sigma^2}{\partial\mu}}</script>
<p>Let’s compute the missing partials one at a time.</p>
<p>From</p>
<script type="math/tex; mode=display">\hat{x}_i = \frac{(x_i - \mu)}{\sqrt{\sigma^2 + \epsilon}}</script>
<p>we compute:</p>
<script type="math/tex; mode=display">\boxed{\dfrac{\partial \hat{x}_i}{\partial \mu} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \cdot (-1)}</script>
<p>and from</p>
<script type="math/tex; mode=display">\sigma^2 = \frac{1}{m} \sum\limits_{i=1}^m (x_i - \mu)^2</script>
<p>we calculate:</p>
<script type="math/tex; mode=display">\boxed{\dfrac{\partial \sigma^2}{\partial \mu} = \frac{1}{m} \sum\limits_{i=1}^m 2 \cdot (x_i - \mu)\cdot (-1)}</script>
<p>We’re missing the partial with respect to <script type="math/tex">\sigma^2</script> and that is our next variable, so let’s get to it and come back and plug it in here.</p>
<hr />
<p>Ok so for this partial, note that <script type="math/tex">f</script> depends on <script type="math/tex">\sigma^2</script> through every <script type="math/tex">\hat{x}_i</script>, so we must sum over the batch:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \sigma^2} &=& \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{\partial \hat{x}_i}{\partial \sigma^2} \qquad \\
\end{eqnarray} %]]></script>
<p>Let’s compute <script type="math/tex">\dfrac{\partial \hat{x}_i}{\partial \sigma^2}</script> in more detail. I’m gonna rewrite <script type="math/tex">\hat{x}_i</script> to make its derivative easier to compute:</p>
<script type="math/tex; mode=display">\hat{x}_i = (x_i - \mu)(\sigma^2 + \epsilon)^{-0.5}</script>
<p><script type="math/tex">(x_i - \mu)</script> is a constant with respect to <script type="math/tex">\sigma^2</script>, therefore:</p>
<script type="math/tex; mode=display">\dfrac{\partial \hat{x}_i}{\partial \sigma^2} = (x_i - \mu) \cdot (-0.5) \cdot (\sigma^2 + \epsilon)^{-1.5}</script>
<p>Plugging this back into the sum gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \sigma^2} &=& -0.5 \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5}
\end{eqnarray} %]]></script>
<hr />
<p>With all that out of the way, let’s plug everything back in our previous partial!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \mu} &=& \bigg(\sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \bigg) + \bigg( \frac{\partial f}{\partial \sigma^2} \cdot \frac{1}{m} \sum\limits_{i=1}^m -2(x_i - \mu) \bigg) \qquad \\
&=& \bigg(\sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \bigg) + \bigg( \frac{\partial f}{\partial \sigma^2} \cdot (-2) \cdot \bigg( \frac{1}{m} \sum\limits_{i=1}^m x_i - \frac{1}{m} \sum\limits_{i=1}^m \mu \bigg) \bigg) \qquad \\
&=& \bigg(\sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \bigg) + \bigg( \frac{\partial f}{\partial \sigma^2} \cdot (-2) \cdot \bigg( \mu - \frac{m \cdot \mu}{m} \bigg) \bigg) \qquad \\
&=& \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \qquad \\
\end{eqnarray} %]]></script>
<p>Thus we have:</p>
<script type="math/tex; mode=display">\boxed{\frac{\partial f}{\partial \mu} = \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}}</script>
<hr />
<p>We finally arrive at the last variable <script type="math/tex">x</script>. Again, adding the contributions from every intermediate variable that depends on <script type="math/tex">x_i</script>, we obtain:</p>
<script type="math/tex; mode=display">\dfrac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i} \cdot \color{red}{\frac{\partial \hat{x}_i}{\partial x_i}} + \frac{\partial f}{\partial \mu} \cdot \color{red}{\frac{\partial \mu}{\partial x_i}} + \frac{\partial f}{\partial \sigma^2} \cdot \color{red}{\frac{\partial \sigma^2}{\partial x_i}}</script>
<p>The missing pieces are super easy to compute at this point.</p>
<script type="math/tex; mode=display">\dfrac{\partial \hat{x}_i}{\partial x_i} = \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}</script>
<script type="math/tex; mode=display">\dfrac{\partial \mu}{\partial x_i} = \dfrac{1}{m}</script>
<script type="math/tex; mode=display">\dfrac{\partial \sigma^2}{\partial x_i} = \dfrac{2(x_i - \mu)}{m}</script>
<p>That’s it, our final gradient is</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial x_i} = \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}\bigg) + \bigg(\frac{\partial f}{\partial \mu} \cdot \dfrac{1}{m}\bigg) + \bigg(\frac{\partial f}{\partial \sigma^2} \cdot \dfrac{2(x_i - \mu)}{m}\bigg)</script>
<p><span style="color:red"><strong>Note the following trick</strong></span></p>
<script type="math/tex; mode=display">\boxed{(\sigma^2 + \epsilon)^{-1.5} = (\sigma^2 + \epsilon)^{-0.5}(\sigma^2 + \epsilon)^{-1} = (\sigma^2 + \epsilon)^{-0.5} \frac{1}{\sqrt{\sigma^2 + \epsilon}}\frac{1}{\sqrt{\sigma^2 + \epsilon}}}</script>
<p>With that in mind, let’s plug in the partials and see if we can simplify the expression some more.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial x_i} &=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}\bigg) + \bigg(\frac{\partial f}{\partial \mu} \cdot \dfrac{1}{m}\bigg) + \bigg(\frac{\partial f}{\partial \sigma^2} \cdot \dfrac{2(x_i - \mu)}{m}\bigg) \qquad \\
&=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}\bigg) + \bigg(\frac{1}{m} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}\bigg) - \bigg(0.5 \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot (x_j - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5} \cdot \dfrac{2(x_i - \mu)}{m} \bigg) \qquad \\
&=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot (\sigma^2 + \epsilon)^{-0.5} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \frac{(x_j - \mu)}{\sqrt{\sigma^2 + \epsilon}} \bigg )\qquad \\
&=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot (\sigma^2 + \epsilon)^{-0.5} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \cdot \hat{x}_i \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \hat{x}_j \bigg )\qquad \\
\end{eqnarray} %]]></script>
<p>Finally, we factor out the common <code class="highlighter-rouge">(sigma^2 + epsilon)^(-0.5) / m</code> term and obtain:</p>
<script type="math/tex; mode=display">\boxed{\frac{\partial f}{\partial x_i} = \frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \bigg [\color{red}{m \frac{\partial f}{\partial \hat{x}_i}} - \color{blue}{\sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j}} - \color{green}{\hat{x}_i \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \hat{x}_j}\bigg ]}</script>
<h3 id="recap">Recap</h3>
<p>For organizational purposes, let’s summarize the main equations we derived. Using <script type="math/tex">\dfrac{\partial f}{\partial \hat{x}_i} = \dfrac{\partial f}{\partial y_i} \cdot \gamma</script>, we obtain the gradients with respect to the learnable parameters and the inputs:</p>
<script type="math/tex; mode=display">\boxed{\color{red}{\frac{\partial f}{\partial \beta} = \sum\limits_{i=1}^m \frac{\partial f}{\partial y_i}}}</script>
<script type="math/tex; mode=display">\boxed{\color{blue}{\frac{\partial f}{\partial \gamma} = \sum\limits_{i=1}^m \frac{\partial f}{\partial y_i} \cdot \hat{x}_i}}</script>
<script type="math/tex; mode=display">\boxed{\frac{\partial f}{\partial x_i} = \frac{\color{red}{m \dfrac{\partial f}{\partial \hat{x}_i}} - \color{blue}{\sum\limits_{j=1}^m \dfrac{\partial f}{\partial \hat{x}_j}} - \color{green}{\hat{x}_i \sum\limits_{j=1}^m \dfrac{\partial f}{\partial \hat{x}_j} \cdot \hat{x}_j}}{m\sqrt{\sigma^2 + \epsilon}}}</script>
<h3 id="python-implementation">Python Implementation</h3>
<p>Here’s an example implementation using the equations we derived. The expression for <code class="highlighter-rouge">dx</code> is 88 characters long, so I’m still wondering how the course instructors managed to fit it in fewer than 80 - shorter variable names, maybe?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">batchnorm_backward</span><span class="p">(</span><span class="n">dout</span><span class="p">,</span> <span class="n">cache</span><span class="p">):</span>
<span class="n">N</span><span class="p">,</span> <span class="n">D</span> <span class="o">=</span> <span class="n">dout</span><span class="o">.</span><span class="n">shape</span>
<span class="n">x_mu</span><span class="p">,</span> <span class="n">inv_var</span><span class="p">,</span> <span class="n">x_hat</span><span class="p">,</span> <span class="n">gamma</span> <span class="o">=</span> <span class="n">cache</span>
<span class="c"># intermediate partial derivatives</span>
<span class="n">dxhat</span> <span class="o">=</span> <span class="n">dout</span> <span class="o">*</span> <span class="n">gamma</span>
<span class="c"># final partial derivatives</span>
<span class="n">dx</span> <span class="o">=</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">/</span> <span class="n">N</span><span class="p">)</span> <span class="o">*</span> <span class="n">inv_var</span> <span class="o">*</span> <span class="p">(</span><span class="n">N</span><span class="o">*</span><span class="n">dxhat</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dxhat</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="o">-</span> <span class="n">x_hat</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dxhat</span><span class="o">*</span><span class="n">x_hat</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
<span class="n">dbeta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dout</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">dgamma</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">x_hat</span><span class="o">*</span><span class="n">dout</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">return</span> <span class="n">dx</span><span class="p">,</span> <span class="n">dgamma</span><span class="p">,</span> <span class="n">dbeta</span>
</code></pre></div></div>
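<p>As a sanity check, the boxed formula for <code class="highlighter-rouge">dx</code> can be compared against a numerical gradient. The sketch below reconstructs a minimal forward pass and a centered finite-difference helper (both helpers and their names are my own, not CS231n's):</p>

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Minimal forward pass; returns the intermediates the backward pass needs."""
    mu = x.mean(axis=0)
    inv_var = 1.0 / np.sqrt(x.var(axis=0) + eps)
    x_hat = (x - mu) * inv_var
    return gamma * x_hat + beta, (inv_var, x_hat)

def numerical_grad(f, x, h=1e-5):
    """Centered finite differences of a scalar-valued f at x (perturbs x in place)."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)
        x[i] = old - h; fm = f(x)
        x[i] = old
        grad[i] = (fp - fm) / (2.0 * h)
        it.iternext()
    return grad

np.random.seed(0)
x = np.random.randn(8, 4)
gamma, beta = np.random.randn(4), np.random.randn(4)

# analytic gradient of f(x) = out.sum() via the boxed formula
out, (inv_var, x_hat) = batchnorm_forward(x, gamma, beta)
dout = np.ones_like(out)
dxhat = dout * gamma
N = x.shape[0]
dx = (1. / N) * inv_var * (N * dxhat - dxhat.sum(axis=0)
                           - x_hat * (dxhat * x_hat).sum(axis=0))

# numerical gradient of the same scalar function
dx_num = numerical_grad(lambda x_: batchnorm_forward(x_, gamma, beta)[0].sum(),
                        x.copy())
```

<p>The two gradients should agree up to finite-difference error, which gives some confidence that no sign or factor was lost along the way.</p>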
<p>This simplified version of the batchnorm backward pass can give you a significant boost in speed. I timed both versions and measured roughly a threefold speedup:</p>
<p align="center">
<img src="/assets/speedup.png" width="400" />
</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we learned how to use the chain rule in a staged manner to derive the expression for the gradient of the batch norm layer. We also saw how a smart simplification can significantly reduce the complexity of the expression for <code class="highlighter-rouge">dx</code>. Finally, we implemented the backward pass in Python using the code from CS231n, which resulted in a roughly 3x speedup!</p>
<p>If you’re interested in the staged computation method, head over to <a href="https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html">Kratzert’s nicely written post</a>.</p>
<p>Cheers!</p>
Wed, 14 Sep 2016 00:00:00 +0000
http://kevinzakka.github.io/2016/09/14/batch_normalization/