Kevin Zakka's Blog

kNN classification using Neighbourhood Components Analysis

Mon, 10 Feb 2020 00:00:00 +0000

Update (12/02/2020): The implementation is now available as a pip package. Simply run pip install torchnca.

While reading related work¹ for my current research project, I stumbled upon a reference to a classic paper from 2004 called Neighbourhood Components Analysis (NCA). After giving it a read, I was instantly charmed by its simplicity and elegance. Long story short, NCA allows you to learn a linear transformation of your data that maximizes k-nearest neighbours performance. By forcing the transformation to be low-rank, NCA will perform dimensionality reduction, leading to vastly reduced storage sizes and search times for kNN! NCA is a very useful algorithm to have in your toolkit – just like PCA – but it’s very rarely mentioned in the wild. In fact, I couldn’t find any tutorial or reference outside of academic papers. This post is an attempt to rectify this.

Figure 1: Visualizing the embedding space of a synthetic dataset as NCA trains.

I’ve implemented NCA in PyTorch with some added bells and whistles. It took almost 1 week to get it to work right, but I gained a lot of insight along the way. I think implementing algorithms from scratch is a great way of building intuition² for why things work – and by extension when and why they don’t – so I encourage the reader to do the same. There’s also a video presentation of NCA by one of the co-authors on YouTube which should serve as a good supplement to this post.

Paper PyTorch Code

kNN: The Good, The Bad, The Ugly
NCA to the rescue
- Formulating the loss function
- NCA as a special case of the contrastive loss
NCA in PyTorch
Boring… Show me what it can do!
- Dimensionality reduction
- kNN on MNIST
Acknowledgements

kNN: The Good, The Bad, The Ugly

Figure 2: kNN's nonlinear decision boundary (source).

You’ve probably heard of k-nearest neighbours (kNN) at least once in your life. It’s one of the first algorithms taught in many machine learning classes, and not without good reason. There’s lots to love about kNN! To name a few:

It has an extremely simple implementation. In fact, kNN has absolutely no computational training cost.
Its decision boundary, controlled by $k$ , is highly nonlinear (the lines in Figure 2 are locally linear but their overall shape can’t be defined by a hyperplane). For low values of $k$ , kNN has very little inductive bias.
There’s just a single hyperparameter to tune: the number of neighbours $k$ . You can easily find its optimal value with cross-validation.
It comes with strong asymptotic guarantees. One can show that as the amount of data approaches infinity, the 1-NN rule is guaranteed to yield an error rate no worse than twice the Bayes error rate – the lowest possible error rate for any classifier – on a binary classification task. In other words, you can expect the performance of kNN to improve automatically as the number of training examples increases.

But kNN does have some annoying drawbacks that limit its efficiency in big-data regimes. Specifically,

It has to store and search through the entire training data to classify just one test point. Without any optimizations, test-time classification is roughly $\mathcal{O}(n)$ given $n \gg d$ . That’s extremely unappealing from a deployment perspective since we usually aim for high test-time efficiency and a low memory footprint.
In high dimensions, it suffers from the curse of dimensionality.
The choice of the distance metric can have a significant effect on its performance. What then is the optimal distance metric? How should one go about choosing it?
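To make the first drawback concrete, here is a minimal brute-force kNN sketch in plain Python. It is an illustration rather than an optimized implementation: the function name `knn_predict` and the toy data are made up for this example, and the distance is plain Euclidean.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute force: one distance per stored training point, so every query
    # touches the whole training set.
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D dataset with two well separated classes (made up for illustration).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

prediction = knn_predict(train_X, train_y, query=(0.15, 0.15), k=3)
```

Note that there is no training step at all: both the memory and the per-query cost live entirely in `train_X`.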

NCA to the Rescue

Rather than having the user specify some arbitrary distance metric, NCA learns it by choosing a parameterized family of quadratic distance metrics, constructing a loss function of the parameters, and optimizing it with gradient descent. Furthermore, the learned distance metric can explicitly be made low-dimensional, solving test-time storage and search issues. How does NCA do this?

It turns out that learning a quadratic distance metric $d$ of the input space that maximizes the performance of kNN is equivalent to learning a linear transformation $\mathcal{A}$ of the input space such that, in the transformed space, the performance of kNN with a Euclidean distance metric is maximized. In fact, quadratic distance metrics³ can be represented by a positive semi-definite matrix $Q = \mathcal{A}^T \mathcal{A}$ such that, writing $y = \mathcal{A}x$ for a transformed point:

$\begin{equation} \label{eq1} \begin{split} d(x_1, x_2) &= (x_1 - x_2)^T Q (x_1- x_2) \\ & = (\mathcal{A}x_1 - \mathcal{A}x_2)^T (\mathcal{A}x_1 - \mathcal{A}x_2) \\ &= \langle y_1 - y_2, y_1 - y_2 \rangle \end{split} \end{equation}$
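If you want to convince yourself of this identity numerically, the sketch below evaluates both sides for a small hand-picked matrix. The helper names and the values of `A`, `x1` and `x2` are arbitrary choices for this example.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def quadratic_distance(A, x1, x2):
    """(x1 - x2)^T Q (x1 - x2) with Q = A^T A built explicitly."""
    D = len(x1)
    diff = [a - b for a, b in zip(x1, x2)]
    # Q[i][j] = sum over rows r of A[r][i] * A[r][j]
    Q = [[sum(row[i] * row[j] for row in A) for j in range(D)] for i in range(D)]
    return sum(d * q for d, q in zip(diff, matvec(Q, diff)))


def transformed_sq_distance(A, x1, x2):
    """Squared Euclidean distance between A x1 and A x2."""
    y1, y2 = matvec(A, x1), matvec(A, x2)
    return sum((a - b) ** 2 for a, b in zip(y1, y2))


A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]  # maps 3-D inputs to a 2-D embedding
x1 = [1.0, 0.0, 2.0]
x2 = [0.0, 1.0, 1.0]

lhs = quadratic_distance(A, x1, x2)
rhs = transformed_sq_distance(A, x1, x2)
```

Both `lhs` and `rhs` come out to 17 here, which is just $(-1)^2 + 4^2$ for the transformed difference vector.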

The goal of the learning algorithm then, is to optimize the performance of kNN on future test data. Since we don’t a priori know the test data, we can choose instead to optimize the closest thing in our toolbox: the leave-one-out (LOO) performance of the training data.

At this point, I’d like the reader to appreciate the elegance of NCA. We’ve transformed the problem of maximizing the classification accuracy of kNN into an optimization problem involving a two-dimensional matrix $\mathcal{A}$ . What remains is specifying a loss function that’s parameterized by $\mathcal{A}$ and that can serve as a proxy for the LOO classification accuracy.

Figure 3: The discontinuous graph of the LOO cross validation error. The red rectangle in particular illustrates how an infinitesimal change in the x-axis may change the value of the y-axis by a finite amount.

Formulating The Loss Function. There’s a slight bump in our road: LOO error is a highly discontinuous loss function. The reason is that it depends solely on the neighbourhood graph of each point. If the distance metric changes slightly at first, there might be no change in the neighbourhood graph and thus no change of the LOO error. But then suddenly, an infinitesimal change in the metric can alter the neighbourhood graph of many points, causing a significant jump in the LOO error. This is illustrated in the figure above.
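The jump is easy to reproduce on a toy problem. The sketch below is an illustration with made-up data: it uses 1-NN, restricts the metric to a single parameter `scale` that stretches the second feature, and sweeps that parameter while recording the LOO error.

```python
def loo_1nn_error(X, y, scale):
    """Leave-one-out 1-NN error after scaling the second feature by `scale`."""
    n = len(X)
    errors = 0
    for i in range(n):
        best_j, best_d = None, float("inf")
        for j in range(n):
            if j == i:
                continue  # leave point i out of its own neighbour search
            d = (X[i][0] - X[j][0]) ** 2 + (scale * (X[i][1] - X[j][1])) ** 2
            if d < best_d:
                best_j, best_d = j, d
        errors += y[best_j] != y[i]
    return errors / n


# Four points whose class depends only on the second feature.
X = [(0.0, 0.0), (0.2, 1.0), (1.0, 0.1), (1.2, 1.1)]
y = [0, 1, 0, 1]

scales = [i / 100 for i in range(201)]  # sweep scale from 0.00 to 2.00
errors = [loo_1nn_error(X, y, s) for s in scales]
distinct_errors = sorted(set(errors))
```

On this dataset the error stays at 1.0 for every scale up to 0.98 and drops to 0.0 from 0.99 onwards, so `distinct_errors` contains only those two values. The gradient of such a step function is zero almost everywhere, which is exactly why it can't be optimized with gradient descent.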

Clearly, a discontinuous loss function is terrible for optimization so we need to construct an alternative that is smooth and differentiable. The key to doing this is to replace fixed neighbourhood selection (i.e. what is done in LOO cross-validation) with stochastic neighbourhood selection. That is, each point $i$ in the training set selects another point $j$ as its neighbour with some probability $p_{ij}$ that decays exponentially with the squared Euclidean distance $d_{ij}$ between them in the transformed space. By summing $p_{ij}$ over all points $j$ in the same class as $i$ , we can compute the probability $p_i$ that point $i$ will be correctly classified, and then sum over all values of $p_i$ to obtain the total number of points we can expect to classify correctly.

Denoting the set of points in the same class as $i$ by $C_i$ , our loss function⁴ thus becomes:

$\mathcal{L}(X; \mathcal{A}) = -\sum_i p_i = - \sum_i \sum_{j \in C_i} p_{ij}$

where

$\begin{equation} \label{eq2} \begin{split} p_{ij} &= \frac{e^{-d_{ij}}}{\sum_{k \neq i} e^{-d_{ik}}} \\ &= \frac{\exp{\big(-\lVert \mathcal{A}x_i - \mathcal{A}x_j \rVert^2\big)}}{\sum_{k \neq i} \exp{\big(-\lVert \mathcal{A}x_i - \mathcal{A}x_k \rVert^2\big)}}, \qquad p_{ii} = 0 \end{split} \end{equation}$
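Before vectorizing anything, it can help to see these two equations written as naive loops. The following is a plain-Python sketch, not the PyTorch implementation discussed later: it takes already-embedded points (the rows of `Y` stand for $\mathcal{A}x_i$ ), and the function names and toy data are made up for this example.

```python
import math


def nca_probabilities(Y):
    """Matrix of p_ij for embedded points Y, with p_ii = 0."""
    n = len(Y)
    P = []
    for i in range(n):
        # Unnormalized weights exp(-||y_i - y_k||^2), skipping k == i.
        weights = [
            0.0 if k == i
            else math.exp(-sum((a - b) ** 2 for a, b in zip(Y[i], Y[k])))
            for k in range(n)
        ]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P


def nca_loss(Y, labels):
    """Negative expected number of correctly classified points."""
    P = nca_probabilities(Y)
    n = len(Y)
    return -sum(
        P[i][j]
        for i in range(n)
        for j in range(n)
        if i != j and labels[i] == labels[j]
    )


# Two tight, well separated clusters.
Y = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0), (3.1, 0.0)]
good_loss = nca_loss(Y, [0, 0, 1, 1])  # labels agree with the clusters
bad_loss = nca_loss(Y, [0, 1, 0, 1])   # labels cut across the clusters
```

With labels that match the clusters, nearly all four points are expected to be classified correctly, so `good_loss` is close to -4. With labels that cut across the clusters it is close to 0.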

The really neat thing about this stochastic assignment is that we’ve completely avoided having to specify a value of $k$ . It gets learned implicitly through the scale of the matrix $\mathcal{A}$ :

With larger values of $\mathcal{A}$ , the distance between points increases and as a result their probabilities decrease (think exponential of smaller and smaller values). This means kNN will consult fewer neighbours for each point.
With smaller values of $\mathcal{A}$ , the distance between points decreases and as a result their probabilities increase (think exponential of larger and larger values). This means kNN will consult more neighbours for each point.
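A rough way to see this effect is to count how many neighbours effectively share the probability mass of a point as the scale of the transformation changes. In the sketch below the points live on a line, the transformation is a single scalar `scale`, and the "effective number of neighbours" is measured as the exponential of the entropy of $p_{ij}$ . That measure is my own choice for illustration, not something taken from the paper.

```python
import math


def neighbour_probs(xs, i, scale):
    """p_ij for point i when the 1-D points xs are mapped through A = [scale]."""
    weights = [
        0.0 if j == i else math.exp(-((scale * (xs[i] - xs[j])) ** 2))
        for j in range(len(xs))
    ]
    total = sum(weights)
    return [w / total for w in weights]


def effective_neighbours(probs):
    """exp(entropy): about m when the mass is spread evenly over m neighbours."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.exp(entropy)


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
scales = [0.1, 0.5, 1.0, 3.0]
counts = [effective_neighbours(neighbour_probs(xs, 0, s)) for s in scales]
```

For the smallest scale, the first point spreads its probability almost evenly over all four other points. For the largest scale, essentially all of it goes to the single closest point, which behaves like 1-NN.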

NCA as a special case of the contrastive loss. If we slightly alter our loss function to sum over log probabilities $-\sum_i \log{p_i}$ , you’ll notice it looks just like a categorical cross entropy loss. In fact, you can think of NCA as a single hidden layer feed-forward neural network that performs metric learning with a contrastive loss function. Recall that a contrastive loss takes on the form:

$\mathcal{L}_ {contr} = \alpha \mathcal{L}_ {pos} + \beta \mathcal{L}_ {neg}$

In most papers, $\mathcal{L}_ {pos}$ is an L2 loss, $\mathcal{L}_ {neg}$ is a hinge loss and $\alpha = \beta = 1$ . The NCA loss function uses a categorical cross-entropy loss for $\mathcal{L}_ {pos}$ with $\alpha = 1$ and $\beta = 0$ . This insight is going to be very valuable in our implementation of NCA when we talk about tricks to stabilize the training.

NCA In PyTorch

There’s currently no GPU-accelerated version of NCA. The two most common ones at the time of this post are sklearn’s Python implementation and a C++ implementation. This meant I had the perfect excuse to implement a version in PyTorch that could leverage (a) automatic differentiation to compute the gradient of the loss function with respect to $\mathcal{A}$ and (b) blazing fast GPU acceleration that would prove super useful for large datasets. While the implementation was pretty straightforward, getting it to converge consistently took quite a while. In this section, I’ll walk you through the high-level components needed to implement NCA plus all the additional bells and whistles I added to get it to converge. The entirety of the code is available on GitHub.

Initialization. Since NCA is a gradient-based iterative optimization process, it requires that we specify an initialization strategy for the matrix $\mathcal{A}$ . The two obvious ones (no, not zero init!) are identity initialization and random initialization. Recall that if $d$ is the chosen dimension of the embedding space, and if $X \in \mathbb{R}^{N \times D}$ is our input dataset, then $\mathcal{A} \in \mathbb{R}^{d \times D}$ .

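import torch
import torch.nn as nn

D = 3  # feature space dimension
d = 2  # embedding space dimension
init = "random"  # or "identity"

if init == "random":
  # random init from a normal distribution
  # with mean 0 and standard deviation 0.01
  A = nn.Parameter(torch.randn(d, D) * 0.01)
elif init == "identity":
  # identity init: ones on the main diagonal of a d x D matrix
  A = nn.Parameter(torch.eye(d, D))
else:
  raise ValueError(f"unknown init: {init}")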

Loss Function. Computing the loss function requires forming a matrix of pairwise Euclidean distances in the transformed space, applying a softmax over the negative distances to compute pairwise probabilities, then summing over probabilities belonging to the same class. The trick here is to vectorize the softmax computation whilst ignoring diagonal values of the distance matrix (i.e. values where $i = j$ ) and probabilities that don’t have the same class labels.

To compute a pairwise Euclidean distance matrix, we make use of the following code:

def pairwise_l2_sq(x):
  """Compute pairwise squared Euclidean distances.
  """
  dot = torch.mm(x.double(), torch.t(x.double()))
  norm_sq = torch.diag(dot)
  dist = norm_sq[None, :] - 2*dot + norm_sq[:, None]
  dist = torch.clamp(dist, min=0)  # replace negative values with 0
  return dist.float()

Note the cast to double to increase numerical precision in the dot product computation and the clamp method to replace any negative values that could have arisen from numerical imprecisions with zeros.

Next, we want to compute a softmax over the negative distances to obtain the pairwise probability matrix $p_{ij}$ . Unlike a typical softmax implementation, the denominator in our equation sums over all $k \neq i$ , i.e. it skips the diagonal entries of the pairwise distance matrix. A neat trick to achieve this without modifying the softmax function is to fill the diagonal entries with np.inf. That way, taking the exponential of their negative evaluates to 0 and doesn’t contribute to the normalization.
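Here is the diagonal trick in isolation, written in plain Python so it runs without PyTorch. It is a sketch: `softmax_rows` is a hand-rolled stand-in for a library softmax, `math.inf` plays the role of `np.inf`, and the distance matrix is made up.

```python
import math


def softmax_rows(logits):
    """Row-wise softmax of a list-of-lists matrix."""
    out = []
    for row in logits:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# Pairwise squared distances between three points on a line at 0, 1 and 2.
distances = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 1.0],
    [4.0, 1.0, 0.0],
]
n = len(distances)

# Fill the diagonal with inf so that point i can never select itself.
masked = [
    [math.inf if i == j else distances[i][j] for j in range(n)]
    for i in range(n)
]
p = softmax_rows([[-d for d in row] for row in masked])
```

The middle point is equidistant from its two neighbours, so its row of `p` is `[0.5, 0.0, 0.5]`. Without the mask, the zero self-distance would have claimed the largest share of every row.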

Now for each row $i$ in $p_{ij}$ , we need to sum over all columns $j \in C_i$ . We can achieve this simply by creating a pairwise boolean mask of class labels, element-wise multiplying it with $p_{ij}$ then calling the sum method. The code below executes all the aforementioned computations:

# compute pairwise boolean class label mask
y_mask = (y[:, None] == y[None, :]).float()

# compute pairwise squared Euclidean distances
# in transformed space
embedding = torch.mm(X, torch.t(A))
distances = pairwise_l2_sq(embedding)

# fill the diagonal with inf so that exp(-inf) = 0, i.e. a point
# never selects itself and the normalization skips k = i
diag_mask = torch.eye(len(distances), device=distances.device).bool()
distances = distances.masked_fill(diag_mask, float("inf"))

# compute pairwise probability matrix p_ij defined by a
# softmax over negative squared distances in the transformed space.
# each row i is normalized over its columns j (dim=1); since all
# inputs are non-positive, the exponentials cannot overflow
p_ij = torch.softmax(-distances, dim=1)

# for each p_i, zero out any p_ij that is not of the same
# class label as i
p_ij_mask = p_ij * y_mask

# sum over js to compute p_i
p_i = p_ij_mask.sum(dim=1)

# compute expected number of points correctly classified by summing
# over all p_i's.
loss = -p_i.sum()

Replacing Conjugate Gradients with SGD. The authors originally optimized NCA with conjugate gradients. I decided to stick with mini-batch Stochastic Gradient Descent. My reasoning was two-fold. First, with very large datasets, the size of the pairwise matrix grows quadratically with the number of points so it was essential that I use a mini-batch optimizer that could run very fast on a memory-limited GPU. Second, SGD has been shown to be a tried and true optimizer in deep learning that tends to generalize better than its counterparts.

Stability Tricks. It took an intense session of debugging to get the implementation to work consistently across the various initializations and input data sizes. Here are the tricks that helped, in no particular order:

Summing over log probabilities was more stable than the non-log variant. In other words, I ended up using a categorical cross-entropy loss.
Initially, the random initialization was sampled from a unit variance Gaussian. Lowering the standard deviation to 0.01 seemed to make the optimization more stable.
Selecting the batch size was crucial for convergence. A small batch size leads to a very jittery loss function. This makes sense intuitively: a small batch means the pairwise matrix is only a very crude approximation of the neighbourhood graph since it only considers a random subset of all possible neighbours. I noticed a good rule of thumb was to try to maximize the batch size within the GPU limits.
Normalizing the input data (i.e. subtracting the mean and dividing by the standard deviation) helped with convergence. Note that doing this requires that we store the computed statistics and scale any test data appropriately.
Without L2 regularization, the final matrix $\mathcal{A}$ tended to blow up in scale. Adding L2 regularization to the loss function helped tame the matrix and speed up convergence.
Random init always converged to a collapsed projection where the points lay on a hyperplane. This is possible because there is no term in the loss function that explicitly pulls different classes apart. To combat this, I added a hinge loss component to the loss function, essentially turning the NCA loss into a contrastive loss function.
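The normalization trick deserves a small example, because forgetting to reuse the training statistics at test time is an easy mistake to make. The sketch below is plain Python with made-up helper names (`fit_standardizer`, `standardize`) and toy data. It uses the population standard deviation, which is an arbitrary choice on my part.

```python
from statistics import mean, pstdev


def fit_standardizer(X):
    """Compute per-feature mean and standard deviation on the training set."""
    columns = list(zip(*X))
    means = [mean(c) for c in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def standardize(X, means, stds):
    """Apply stored statistics to any split (train or test)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


X_train = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]
X_test = [[3.0, 30.0]]

means, stds = fit_standardizer(X_train)          # statistics come from train only
X_train_norm = standardize(X_train, means, stds)
X_test_norm = standardize(X_test, means, stds)   # reuse them, never refit on test
```

The test point sits exactly at the training mean, so it maps to the origin after standardization.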

Boring… Show Me What It Can Do!

At this point, you’re probably curious to know if NCA lives up to its claims. Let’s go ahead and test the PyTorch implementation on 2 tasks: dimensionality reduction and kNN classification.

Using the NCA API is super simple. Very briefly, you first instantiate an NCA object with an embedding dimension and an initialization strategy. Then you call the train method on the input and ground-truth tensors, specifying a batch size and learning rate. There are other parameters you can change, all documented in the class docstring.

nca = NCA(dim=2, init="random")  # instantiate nca object
nca.train(X, y, batch_size=64, lr=1e-4)  # fit nca model
X_nca = nca(X)  # apply the learned transformation

Dimensionality Reduction

For this task, I replicated a portion of the results from section 4 of the paper. Specifically, I generated a synthetic three-dimensional dataset which consists of 5 classes, shown in different colors in Figure 4. The first two dimensions of the dataset correspond to concentric circles, while the third dimension is just Gaussian noise with high variance.

Figure 4: NCA vs. PCA vs. LDA on the synthetic dataset.

I then embed the dataset to a 2D space using PCA, LDA and NCA. The results are shown in Figure 4. While NCA seems to have recovered the original concentric pattern, PCA fails to project out the noise, a direct consequence of its high variance. If we lower the noise variance to 0.1, for example, PCA successfully recovers the pattern. LDA also struggles to recover the concentric pattern since the classes themselves are not linearly separable.

kNN On MNIST

The whole motivation for NCA was that it would vastly reduce the storage and search costs of kNN for high-dimensional datasets. To put this to the test, we compared the storage, run time and error rates of two variants of kNN on the MNIST dataset:

5-NN on the raw MNIST dataset (784 dimensional)
5-NN on the 32 dimensional NCA projection of MNIST

The results are shown in the table below.

Algorithm	Raw kNN	NCA + kNN
Error (%)	2.8	3.3
Time (s)	155.25	2.37
Storage (MB)	156.8	6.40

That’s roughly a 66x speedup and a 25x reduction in storage⁵, at the cost of half a percentage point of error!
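The ratios follow directly from the table, and the storage figures can be sanity-checked with a bit of arithmetic. The snippet below assumes 50,000 stored points in 32-bit floats with 1 MB counted as 10^6 bytes. Those assumptions are mine: they are not stated in the post, but they reproduce both storage numbers.

```python
# Numbers copied from the table above.
raw_time_s, nca_time_s = 155.25, 2.37
raw_storage_mb, nca_storage_mb = 156.8, 6.40

speedup = raw_time_s / nca_time_s                  # about 65.5
storage_ratio = raw_storage_mb / nca_storage_mb    # about 24.5

# ASSUMPTION: 50,000 stored points, float32 (4 bytes), 1 MB = 1e6 bytes.
n_points = 50_000
bytes_per_float = 4
raw_estimate_mb = n_points * 784 * bytes_per_float / 1e6
nca_estimate_mb = n_points * 32 * bytes_per_float / 1e6
```

Since storage scales linearly with the number of dimensions, the storage ratio is simply 784 / 32 = 24.5, whatever the number of stored points.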

Acknowledgements

I’d like to thank Nick Hynes, Alex Nichol and Brent Yi for their valuable feedback throughout my debugging session and blog writing. I also want to thank Chris Choy for the insight he provided on mode collapse. The javascript code for the animation was adapted from Sam Greydanus’ blog – check him out, he’s got some great content.

The paper in question is Temporal Cycle Consistency Learning from Dwibedi et al. ↩
John Schulman discusses this in more depth in his latest blog post. ↩
You can convince yourself that this is a valid distance metric by checking that the non-negativity, symmetry and triangle inequality conditions are satisfied. ↩
We negate the expression because our goal is to maximize the expectation and we’re going to feed it to an optimizer that performs minimization. ↩
Performance on MNIST isn’t very representative of real world performance on tougher datasets but this is still a very cool result. ↩

Learning to Assemble and to Generalize from Self-Supervised Disassembly

Thu, 31 Oct 2019 00:00:00 +0000

This is a crosspost from the official Google AI Blog.

Our physical world is full of different shapes, and learning how they are all interconnected is a natural part of interacting with our surroundings — for example, we understand that coat hangers hook onto clothing racks, power plugs insert into wall outlets, and USB cables fit into USB sockets. This general concept of “how things fit together’’ based on their shapes is something that we acquire over time and experience, and it helps to increase the efficiency with which we perform tasks, like assembling DIY furniture kits or packing gifts into a box. If robots could also learn “how things fit together,” then perhaps they could become more adaptable to new manipulation tasks involving objects they have never seen before, like reconnecting severed pipes, or building makeshift shelters by piecing together debris during disaster response scenarios.

To explore this idea, we worked with researchers from Stanford and Columbia Universities to develop Form2Fit, a robotic manipulation algorithm that uses deep neural networks to learn to visually recognize how objects correspond (or “fit”) to each other. To test this algorithm, we tasked a real robot to perform kit assembly, where it needed to accurately assemble objects into a blister pack or corrugated display to form a single unit. Previous systems built for this task required extensive manual tuning to assemble a single kit unit at a time. However, we demonstrate that by learning the general concept of “how things fit together,” Form2Fit enables our robot to assemble various types of kits with a 94% success rate. Furthermore, Form2Fit is one of the first systems capable of generalizing to new objects and kitting tasks not seen during training.

Form2Fit learns to assemble a wide variety of kits by finding geometric correspondences between object surfaces and their target placement locations. By leveraging geometric information learned from multiple kits during training, the system generalizes to new objects and kits.

While often overlooked, shape analysis plays an important role in manipulation, especially for tasks like kit assembly. In fact, the shape of an object often matches the shape of its corresponding space in the packaging, and understanding this relationship is what allows people to do this task with minimal guesswork. At its core, Form2Fit aims to learn this relationship by training over numerous pairs of objects and their corresponding placing locations across multiple different kitting tasks – with the goal to acquire a broader understanding of how shapes and surfaces fit together. Form2Fit improves itself over time with minimal human supervision, gathering its own training data by repeatedly disassembling completed kits through trial and error, then time-reversing the disassembly sequences to get assembly trajectories. After training overnight for 12 hours, our robot learns effective pick and place policies for a variety of kits, achieving 94% assembly success rates with objects and kits in varying configurations, and over 86% assembly success rates when handling completely new objects and kits.

Data-Driven Shape Descriptors For Generalizable Assembly

The core of Form2Fit is a two-stream matching network that learns to infer orientation-sensitive geometric pixel-wise descriptors for objects and their target placement locations from visual data. These descriptors can be understood as compressed 3D point representations that encode object geometry, textures, and contextual task-level knowledge. Form2Fit uses these descriptors to establish correspondences between objects and their target locations (i.e., where they should be placed). Since these descriptors are orientation-sensitive, they allow Form2Fit to infer how the picked object should be rotated before it is placed in its target location.

Form2Fit uses two additional networks to generate valid pick and place candidates. A suction network gets fed a 3D image of the objects and generates pixel-wise predictions of suction success. The suction probability map is visualized as a heatmap, where hotter pixels indicate better locations to grasp the object at the 3D location of the corresponding pixel. In parallel, a place network gets fed a 3D image of the target kit and outputs pixel-wise predictions of placement success. These, too, are visualized as a heatmap, where higher confidence values serve as better locations for the robot arm to approach from a top-down angle to place the object. Finally, the planner integrates the output of all three modules to produce the final pick location, place location and rotation angle.

Overview of Form2Fit: the suction and place networks infer candidate picking and placing locations in the scene respectively. The matching network generates pixel-wise orientation-sensitive descriptors to match picking locations to their corresponding placing locations. The planner then integrates it all to control the robot to execute the next best pick and place action.

Learning Assembly from Disassembly

Neural networks require large amounts of training data, which can be difficult to collect for tasks like assembly. Precisely inserting objects into tight spaces with the correct orientation (e.g., in kits) is challenging to learn through trial and error, because the chances of success from random exploration can be slim. In contrast, disassembling completed units is often easier to learn through trial and error, since there are fewer incorrect ways to remove an object than there are to correctly insert it. We leveraged this difference in order to amass training data for Form2Fit.

An example of self-supervision through time-reversal: rewinding a disassembly sequence of a deodorant kit over time generates a valid assembly sequence.

Our key observation is that in many cases of kit assembly, a disassembly sequence – when reversed over time – becomes a valid assembly sequence. This concept, called time-reversed disassembly, enables Form2Fit to train entirely through self-supervision by randomly picking with trial and error to disassemble a fully-assembled kit, then reversing that disassembly sequence to learn how the kit should be put together.
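As a toy illustration of the time-reversal idea, the sketch below treats each disassembly step as a (pick location, place location) pair and turns a disassembly sequence into an assembly sequence. The step representation and the location names are hypothetical, made up for this example. The real system works on images and robot poses.

```python
def time_reverse(disassembly_steps):
    """Reverse the order of the steps and swap pick and place in each one."""
    return [(place, pick) for pick, place in reversed(disassembly_steps)]


# Hypothetical disassembly: objects are taken out of the kit onto the table.
disassembly = [
    ("kit_slot_A", "table_spot_1"),
    ("kit_slot_B", "table_spot_2"),
]

# The last object removed becomes the first object inserted.
assembly = time_reverse(disassembly)
```

Applying `time_reverse` twice returns the original sequence, which is a quick sanity check that no step is lost along the way.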

Generalization Results

The results of our experiments show great potential for learning generalizable policies for assembly. For instance, when a policy is trained to assemble a kit in only one specific position and orientation, it can still robustly assemble random rotations and translations of the kit 90% of the time.

Form2Fit policies are robust to a wide range of rotations and translations of the kits.

We also find that Form2Fit is capable of tackling novel configurations it has not been exposed to during training. For example, when training a policy on two single-object kits (floss and tape), we find that it can successfully assemble new combinations and mixtures of those kits, even though it has never seen such configurations before.

Form2Fit policies can generalize to novel kit configurations such as multiple versions of the same kit and mixtures of different kits.

Furthermore, when given completely novel kits on which it has not been trained, Form2Fit can generalize using its learned shape priors to assemble those kits with over 86% assembly accuracy.

Form2Fit policies can generalize to never-before-seen single and multi-object kits.

What Have the Descriptors Learned?

To explore what the descriptors of the matching network from Form2Fit have learned to encode, we visualize the pixel-wise descriptors of various objects in RGB colorspace through use of an embedding technique called t-SNE.

The t-SNE embedding of the learned object descriptors. Similarly oriented objects of the same category display identical colors (e.g. A, B or F, G), while different objects (e.g. C, H) or same objects with different orientations (e.g. A, C, D or H, F) exhibit different colors.

We observe that the descriptors have learned to encode (a) rotation — objects oriented differently have different descriptors (A, C, D, E) and (H, F); (b) spatial correspondence — same points on the same oriented objects share similar descriptors (A, B) and (F, G); and (c) object identity — zoo animals and fruits exhibit unique descriptors (columns 3 and 4).

Limitations & Future Work

While Form2Fit’s results are promising, its limitations suggest directions for future work. In our experiments, we assume a 2D planar workspace to constrain the kit assembly task so that it can be solved by sequencing top-down picking and placing actions. This may not work for all cases of assembly – for example, when a peg needs to be precisely inserted at a 45 degree angle. It would be interesting to expand Form2Fit to more complex action representations for 3D assembly.

You can learn more about this work and download the code from our GitHub repository.

Acknowledgments

This research was done by Kevin Zakka, Andy Zeng, Johnny Lee, and Shuran Song (faculty at Columbia University), with special thanks to Nick Hynes, Alex Nichol, and Ivan Krasin for fruitful technical discussions; Adrian Wong, Brandon Hurd, Julian Salazar, and Sean Snyder for hardware support; Ryan Hickman for valuable managerial support; and Chad Richards for helpful feedback on writing.

Manifesto

Sun, 26 May 2019 00:00:00 +0000

I find writing to be a very fascinating and therapeutic activity. There’s nothing quite like twiddling a bunch of words into a sequence, reading the result out loud, grimacing, and adjusting until it sounds just right. It’s the reason I started this blog yet I find that I haven’t been able to write as much as I would like to. It sucks, but articles on here have usually been academic and because I prioritize quality over quantity, finding the time to write them has been very challenging.

To combat this dry spell, I’ve decided to create a new section of the blog entitled Miscellany, where I’ll post on a variety of topics such as interesting research papers, books I read, and philosophical ponderings of life. I still intend to publish on the main section, but posts there will be reserved for tutorials and research expositions primarily in machine learning. I’m aiming to write once a month and while it’s not much, it’s still better than nothing. As Andy Dufresne puts it beautifully in The Shawshank Redemption:

Get busy living, or get busy dying¹.

A tad bit dramatic for my case, but I couldn’t resist. ↩

Dex-Net 2.0: Deep Learning to Plan Robust Grasps

Mon, 05 Nov 2018 00:00:00 +0000

In this blog post, we’re going to take a close look at Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics by Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg.

Overview

TL;DR. This paper tackles grasp planning, which is the task of finding a gripper configuration (pose and width) that maximizes a success metric subject to kinematic and collision constraints. The suggested approach is to train a Grasp Quality Convolutional Neural Network (GQ-CNN) on a large synthetic dataset of depth images with associated positive and negative grasps. Then during test time, one can sample various grasps from a depth image, feed each through the GQ-CNN, pick the one with the highest probability of success, and execute the grasp open-loop.
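The test-time procedure boils down to sample, score, argmax. The sketch below shows that loop in plain Python. Everything in it is a stand-in: `sample_grasp` is a hypothetical uniform sampler, and `toy_quality` is a made-up score that takes the place of the trained GQ-CNN, not a real grasp metric.

```python
import math
import random


def sample_grasp(depth_image, rng):
    """Hypothetical sampler: a random pixel center and an in-plane angle."""
    height, width = len(depth_image), len(depth_image[0])
    row, col = rng.randrange(height), rng.randrange(width)
    center = (col, row, depth_image[row][col])  # z is read from the depth image
    angle = rng.uniform(-math.pi / 2, math.pi / 2)
    return center, angle


def toy_quality(grasp, depth_image):
    """Stand-in for the GQ-CNN: NOT a real metric, just a score in (0, 1)."""
    (_, _, z), angle = grasp
    return (1.0 - abs(angle) / math.pi) / (1.0 + z)


def plan_grasp(depth_image, quality_fn, num_samples=50, seed=0):
    """Sample candidate grasps, score each one, and return the best."""
    rng = random.Random(seed)
    candidates = [sample_grasp(depth_image, rng) for _ in range(num_samples)]
    scores = [quality_fn(u, depth_image) for u in candidates]
    best = max(range(num_samples), key=scores.__getitem__)
    return candidates[best], scores[best], scores


# Tiny fake depth image (values in meters, made up).
depth_image = [[0.7, 0.6, 0.7], [0.6, 0.5, 0.6], [0.7, 0.6, 0.7]]
best_grasp, best_score, all_scores = plan_grasp(depth_image, toy_quality)
```

Swapping `toy_quality` for a learned model is the only change needed to go from this toy to the structure described above.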

Variables

Let’s start by introducing the variables that appear in the paper.

$x = (O, T_o, T_c, \gamma)$ : the state describing the variable properties of the camera and objects in the environment, where:
- $O$ : the geometry and mass properties of the object.
- $T_o, T_c$ : 3D poses of the object and camera respectively.
- $\gamma$ : the coefficient of friction between the object and the gripper.
$u = (p, \phi)$ : a parallel-jaw grasp in 3D space, specified by a center $p = (x, y, z)$ relative to the camera and an angle in the table plane $\phi$ .
$y \in \mathbb{R}^{H \times W}$ : a point cloud represented as a depth image with height $H$ and width $W$ taken by the camera with known intrinsics $K$ and pose $T_c$ .
$S(u, x) \in \{0, 1\}$ : a binary-valued grasp success metric, such as force closure.

Using these random variables, we can define a joint distribution $p(S, x, u, y)$ that models the inherent uncertainty associated with our assumptions, such as erroneous sensor readings (calibration error, noise, limiting pinhole model, etc.), and imprecise control (kinematic inaccuracies, etc.).

Goal. Ingest a depth image $y$ of an object in a scene with an associated grasp candidate $u$ , and spit out the probability that $u$ will succeed under the above uncertainties. This is equivalent to predicting the robustness $Q$ of a grasp, defined as the expected value of $S$ conditioned on $u$ and $y$ , i.e. $Q(u, y) = \mathbb{E}[S \vert u, y]$ .

Solution. Use a neural network with weights $\theta$ to approximate the complex, high-dimensional function $Q$ . Concretely,

$\hat{\theta} = \arg \min_{\theta} \ \mathbb{E}_{p(S, u, x, y)} \big[L(S, Q_{\theta}(u, y)) \big]$

And finally, using Monte-Carlo sampling of input-output pairs from our joint distribution, we obtain:

$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} L(S_i, Q_{\theta}(u_i, y_i))$

where $(S_i, x_i, u_i, y_i) \sim p(S, x, u, y)$ .
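To make the Monte-Carlo objective concrete, here is a plain-Python sketch that evaluates the sum for two hand-written predictors. The post leaves the loss $L$ unspecified, so the binary cross-entropy used here is an assumption, and the samples and predictors are made up.

```python
import math


def cross_entropy(s, q, eps=1e-12):
    """Binary cross-entropy between a label s in {0, 1} and a prediction q."""
    q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(s * math.log(q) + (1 - s) * math.log(1.0 - q))


def empirical_risk(samples, q_theta):
    """Sum of L(S_i, Q_theta(u_i, y_i)) over sampled (S, u, y) triples."""
    return sum(cross_entropy(S, q_theta(u, y)) for S, u, y in samples)


# Made-up samples: the grasp u is a single number and the image y is unused.
samples = [(1, 0.8, None), (1, 0.3, None), (0, -0.5, None), (0, -0.9, None)]


def constant_predictor(u, y):
    return 0.5  # knows nothing about the grasp


def sign_predictor(u, y):
    return 0.9 if u > 0 else 0.1  # toy rule that happens to match the labels


risk_constant = empirical_risk(samples, constant_predictor)
risk_sign = empirical_risk(samples, sign_predictor)
```

The state $x$ does not appear in the sum at all: it is needed to generate the samples, but the network only ever sees $u$ and $y$ .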

Generative Graphical Model

We can think of our joint $p(S, x, u, y)$ as a generative model of images, grasps and success metrics. The relationship between the different variables is illustrated in the graphical model below.

Graphical Model

Using the chain rule, we can express the joint $p(S, x, u, y)$ as the product of 4 terms: $p(S \vert u, y, x)$ , $p(u \vert x, y)$ , $p(y \vert x)$ and $p(x)$ . And since $S$ and $u$ are conditionally independent of $y$ given $x$ (no arrow going from $y$ to $S$ or $u$ ), we can reduce the expression to

$p(S, u, y, x) = {\color{red}{p(S \vert u, x)}} \cdot {\color{orange}{p(u \vert x)}} \cdot {\color{blue}{p(y \vert x)}} \cdot {\color{green}{p(x)}}$

where:

${\color{green}{p(x)}}$ is the state distribution.
${\color{blue}{p(y \vert x)}}$ is the observation model, conditioned on the current state.
${\color{orange}{p(u \vert x)}}$ is the grasp candidate model, conditioned on the current state.
${\color{red}{p(S \vert u, x)}}$ is the analytic model of grasp success conditioned on the grasp candidate and current state.

The state $x = (O, T_o, T_c, \gamma)$ is represented by the blue nodes in the graphical model. Using the chain rule and independence properties, we can express its underlying distribution as the product of:

$\begin{align*} {\color{green}{p(x)}} &= p(\gamma \vert T_c, T_o, O) \cdot p(T_c \vert T_o, O) \cdot p(T_o \vert O) \cdot p(O) \\ &= p(\gamma) \cdot p(T_c) \cdot p(T_o \vert O) \cdot p(O) \end{align*}$

with:

$p(\gamma)$ : truncated Gaussian over friction coefficients.
$p(O)$ : discrete uniform distribution over 3D object models.
$p(T_o \vert O)$ : continuous uniform distribution over discrete set of stable object poses.
$p(T_c)$ : continuous uniform distribution over spherical coordinates and polar angle.

The grasp candidate model ${\color{orange}{p(u \vert x)}}$ is a uniform distribution over pairs of antipodal contact points on the object surface whose grasp axis is parallel to the table plane (we want top-down grasps), the observation model ${\color{blue}{p(y \vert x)}}$ is a rendered depth image of the scene corrupted with multiplicative and Gaussian Process noise, and the success model ${\color{red}{p(S \vert u, x)}}$ is a binary-valued reward function subject to 2 constraints: epsilon quality and collision freedom.
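To make the factorization of $p(x)$ concrete, here is a minimal sketch of ancestral sampling: draw each factor in turn, conditioning only on what it depends on. Everything specific in it is an assumption of mine for illustration (the object names, the pose labels, the camera ranges, the Gaussian parameters and the function names); it also picks a stable pose from a discrete list and ignores any continuous planar offset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a real pipeline would load meshes and precomputed stable poses.
STABLE_POSES = {
    "mug": ["upright", "on_side"],
    "wrench": ["flat"],
    "bottle": ["upright", "on_side", "upside_down"],
}

def sample_truncated_gaussian(mean, std, low, high):
    """Rejection-sample a Gaussian restricted to [low, high]."""
    while True:
        value = rng.normal(mean, std)
        if low <= value <= high:
            return value

def sample_state():
    """Draw x = (O, T_o, T_c, gamma) one factor at a time."""
    obj = str(rng.choice(sorted(STABLE_POSES)))        # p(O): uniform over objects
    pose = str(rng.choice(STABLE_POSES[obj]))          # p(T_o | O): a stable pose of that object
    camera = {                                         # p(T_c): uniform over assumed ranges
        "radius": rng.uniform(0.5, 0.7),
        "azimuth": rng.uniform(0.0, 2.0 * np.pi),
        "polar": rng.uniform(0.0, 0.1 * np.pi),
    }
    gamma = sample_truncated_gaussian(0.5, 0.1, 0.0, 1.0)  # p(gamma): friction coefficient
    return {"object": obj, "pose": pose, "camera": camera, "friction": gamma}

print(sample_state())
```

Note that only the pose draw looks at a previously sampled variable, which is exactly what the simplified second line of the factorization says.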

Now that we’ve examined the inner workings of our generative model $p$ , let’s see how we can use it to generate the massive Dex-Net dataset.

Generating Dex-Net

To train our GQ-CNN, we need to generate i.i.d. samples, consisting of depth images, grasps, and grasp robustness labels, by sampling from the generative joint $p(S, x, u, y)$ .

Data Generation Pipeline

Randomly select, from a database of 1,500 meshes, a 3D object mesh using a discrete uniform distribution.
Randomly select, from a set of stable poses, a pose for this object using a continuous uniform distribution.
Use rejection sampling to generate top-down parallel-jaw grasps covering the surface of the object.
Randomly sample the camera pose (also from a continuous uniform distribution) and use it to render the object and its pose w.r.t. the camera into a depth image using ray tracing.
Classify the robustness of each sampled grasp to obtain a set of positive and negative grasps. Robustness is estimated using force closure probability, which is a function of object pose, gripper pose, and friction coefficient uncertainty.

Training the GQ-CNN

Once the synthetic dataset has been generated, it becomes trivial to train the network.

Overview of the Model

Remember how we mentioned that GQ-CNN takes as input a depth image and a grasp candidate? Well, it actually turns out that the authors have a very clever way of encoding the grasp information into the depth image: they take a depth image and grasp candidate and transform the depth image such that the grasp pixel location $(i, j)$ – projected from the grasp position $(x, y)$ – is aligned with the image center and the grasp axis $\phi$ corresponds to the middle row of the image. Then, at every iteration of SGD, we sample the transformed depth image and the remaining grasp variable $z$ (i.e. the gripper depth from the camera), normalize the depth image to zero mean and unit standard deviation, and feed the tuple to the 18M-parameter GQ-CNN model.

Note 1. The model is a typical deep learning architecture composed of convolutional, max-pool and fully-connected primitives.

Note 2. The depth alignment makes it easier for the model to train since it doesn’t have to worry about any rotational invariances. As for feeding the gripper depth to the model, I would think this is useful for pruning grasps that have the correct 2D position and orientation, but are too far away from the object (i.e. either not touching or barely touching).
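The alignment step described above can be sketched in a few lines of NumPy. This is my own illustration and not the authors' code: the function names are hypothetical, the angle is assumed to be measured from the image column axis (positive towards increasing rows), and I use nearest-neighbour lookup with a constant fill value where a real implementation would probably interpolate.

```python
import numpy as np

def align_depth_image(depth, grasp_pixel, angle, fill=0.0):
    """Re-centre `depth` on `grasp_pixel` and rotate it so the grasp axis is horizontal."""
    height, width = depth.shape
    i, j = grasp_pixel
    rows, cols = np.mgrid[0:height, 0:width]
    d_row = rows - height // 2          # offset of each output pixel from the centre
    d_col = cols - width // 2
    # A step along the output row direction is a step along the grasp axis in the source.
    src_row = np.rint(i + d_col * np.sin(angle) + d_row * np.cos(angle)).astype(int)
    src_col = np.rint(j + d_col * np.cos(angle) - d_row * np.sin(angle)).astype(int)
    inside = (src_row >= 0) & (src_row < height) & (src_col >= 0) & (src_col < width)
    aligned = np.full(depth.shape, fill, dtype=float)
    aligned[inside] = depth[src_row[inside], src_col[inside]]
    return aligned

def normalize(image):
    """Zero mean, unit standard deviation (epsilon guards against flat images)."""
    return (image - image.mean()) / (image.std() + 1e-8)

# Toy check: the grasp pixel should end up at the centre of the aligned image.
depth = np.zeros((9, 9))
depth[2, 6] = 5.0
aligned = align_depth_image(depth, grasp_pixel=(2, 6), angle=0.3)
print(aligned[4, 4])
```

With the image canonicalized like this, the only grasp variable left to feed separately is the gripper depth $z$ .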

Grasp Planning (Inference Time)

Once the model is trained, we can pair the GQ-CNN with a policy of choice. The one used in the paper is $\pi_{\theta}(y) = \arg \max_{u \in C} Q_{\theta}(u, y)$ which amounts to sampling a set of predefined grasps from a depth image subject to a set of constraints $C$ (e.g. kinematic and collision constraints), scoring each grasp using the GQ-CNN, and finally executing the most robust grasp. There are two sampling strategies used to generate grasp candidates: antipodal grasp sampling and cross-entropy sampling.
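In code, the policy above is little more than a filtered arg max. Here is a minimal sketch, assuming a scoring function `q_fn(u, y)` that stands in for the trained GQ-CNN and a predicate `is_feasible(u)` that stands in for the constraint set $C$ (both names are mine, as is the toy data):

```python
def plan_grasp(candidates, depth_image, q_fn, is_feasible):
    """Return the feasible candidate with the highest predicted robustness, or None."""
    feasible = [u for u in candidates if is_feasible(u)]   # keep only u in C
    if not feasible:
        return None
    return max(feasible, key=lambda u: q_fn(u, depth_image))

# Toy usage: grasps are scalars, the "network" prefers values near 0.6,
# and anything at or above 0.8 is treated as infeasible.
best = plan_grasp(
    candidates=[0.1, 0.5, 0.9],
    depth_image=None,
    q_fn=lambda u, y: -abs(u - 0.6),
    is_feasible=lambda u: u < 0.8,
)
print(best)
```

The interesting part is therefore not the arg max itself but where the candidates come from.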

Antipodal Grasp Sampling.

First, we perform edge detection by locating pixel areas with high gradient magnitude. This is especially useful since graspable regions usually correspond to contact points on opposite edges of an object.

Then we sample pairs of pixels belonging to these areas to generate antipodal contact points on the object. We enforce the constraint that point pairs are parallel to the table plane.

We repeat this step until we reach the desired number of grasps, potentially increasing the friction coefficient if the amount is insufficient. In the final step, 2D grasps are deprojected to 3D grasps using the camera intrinsics and extrinsics and multiple grasps are obtained from the same contact points by discretizing the height starting from the object surface to the table surface ( $h = 0$ ).
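Here is a rough sketch of the first two steps (edge detection, then sampling antipodal pixel pairs), written by me for illustration and not taken from the Dex-Net code. I assume the depth gradient direction can stand in for the surface normal at an edge pixel, and I call a pair antipodal when the line joining the two pixels lies inside the friction cone at both contacts. The function name, threshold and toy scene are all hypothetical, and the deprojection to 3D and the height discretization are left out.

```python
import numpy as np

def sample_antipodal_pairs(depth, num_pairs, grad_thresh, friction_coef, rng, max_tries=10000):
    """Sample pairs of edge pixels whose connecting axis lies in both friction cones."""
    g_row, g_col = np.gradient(depth.astype(float))
    magnitude = np.hypot(g_row, g_col)
    is_edge = magnitude > grad_thresh                     # step 1: high-gradient pixels
    edge_rc = np.argwhere(is_edge)
    normals = np.stack([g_row, g_col], axis=-1)[is_edge]
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    cos_cone = np.cos(np.arctan(friction_coef))           # half-angle of the friction cone
    pairs = []
    for _ in range(max_tries):
        if len(pairs) == num_pairs:
            break
        a, b = rng.choice(len(edge_rc), size=2, replace=False)
        axis = (edge_rc[b] - edge_rc[a]).astype(float)
        axis /= np.linalg.norm(axis)
        opposed = normals[a] @ normals[b] < 0             # contacts face away from each other
        in_cones = abs(normals[a] @ axis) >= cos_cone and abs(normals[b] @ axis) >= cos_cone
        if opposed and in_cones:
            pairs.append((tuple(edge_rc[a]), tuple(edge_rc[b])))
    return pairs

# Toy scene: a square block (depth 0.5) sitting on a table (depth 1.0).
scene = np.ones((20, 20))
scene[5:15, 5:15] = 0.5
found = sample_antipodal_pairs(scene, num_pairs=5, grad_thresh=0.1,
                               friction_coef=0.5, rng=np.random.default_rng(0))
print(found)
```

If the function returns fewer pairs than requested, the caller can retry with a larger `friction_coef`, which widens the cone and mirrors the relaxation described above.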

Cross Entropy Method.

Evolution of grasp robustness as the gripper center sweeps the depth image from top to bottom.

Randomly choosing a grasp from a set of candidates doesn’t work very well in cases where the grasping regions are small and require very precise gripper configurations. Taking a look at the image above, we can see that as we sweep candidate grasps from top to bottom, grasp robustness stays near zero and spikes momentarily when we reach the good, yet narrow grasping area. Thus, uniform sampling of grasp candidates is inefficient especially since we’re trying to perform real-time grasp planning.

This is where importance sampling – one of my favorite techniques – can help! We can modify our sampling strategy such that at every iteration, we refit the candidate distribution to the grasps with the highest predicted robustness. The algorithm to perform this fitting is the cross-entropy method (CEM) which tries to minimize the cross-entropy between a mixture of Gaussians and the top-k percentile of grasps ranked by GQ-CNN. The result is that at every iteration, we are more likely to sample grasps with high-robustness values (grasps in the spike area) and converge to an optimal grasp candidate. This fitting process is illustrated below.
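To get a feel for the refitting loop, here is a toy 1-D version. It is a simplification on two counts, both my own assumptions: I fit a single Gaussian instead of a mixture of Gaussians, and the score comes from a made-up `toy_robustness` function with a narrow spike at 0.7 instead of a GQ-CNN.

```python
import numpy as np

def toy_robustness(position):
    """Hypothetical stand-in for the GQ-CNN score: a narrow spike around 0.7."""
    return np.exp(-0.5 * ((position - 0.7) / 0.02) ** 2)

def cross_entropy_method(score_fn, num_iters=10, num_samples=200, elite_frac=0.1, seed=0):
    """Repeatedly sample candidates, keep the best ones, and refit the proposal to them."""
    rng = np.random.default_rng(seed)
    mean, std = 0.5, 0.3                                     # broad initial proposal
    num_elite = max(1, int(elite_frac * num_samples))
    for _ in range(num_iters):
        candidates = rng.normal(mean, std, size=num_samples)
        scores = score_fn(candidates)
        elite = candidates[np.argsort(scores)[-num_elite:]]  # top-scoring candidates
        mean, std = elite.mean(), elite.std() + 1e-6         # refit; epsilon avoids collapse to 0
    return mean

print(cross_entropy_method(toy_robustness))
```

The proposal starts out wide enough to land a few samples in the spike, and each refit concentrates the next round of samples around it.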

Discussion

The sampling of grasps is inefficient. It would be interesting to extend the GQ-CNN to a fully-convolutional architecture where robustness labels can be computed for every pixel in the depth image in a single forward pass.
Dex-Net is open-loop which means that once a grasp candidate has been picked, it is executed blindly with no visual feedback. This sets it up for failure when camera calibration is imprecise or the environment it is placed in is dynamic and susceptible to change.
If we can speed-up Dex-Net by creating a smaller, fully-convolutional GQ-CNN, we may be able to run it at a high enough frequency to incorporate visual feedback and close the loop.

Learning What to Learn and When to Learn It

Fri, 28 Sep 2018 00:00:00 +0000

Disclaimer: This blog post describes unfinished research and should be treated as a work in progress.

Hello world! I’m coming out of hibernation after 14 months of radio silence on this blog. I have a lot of things to blog about, from my research internship at Stanford University this past summer, to wrapping up my B.Eng. in EE in July – and I’ll hopefully get to those in future blog posts – but today, I’d like to talk about some of the cool research I did in my senior year of undergrad. Unfortunately, it’s not GAN/RL related (read as fortunately) but it’s definitely an interesting aspect of the field that could use some more attention.

The problem we’ll be investigating today is whether we can get Deep Neural Networks (DNNs) to converge faster and learn more efficiently. In particular, we’ll try to answer the following questions:

Do we really need all the training samples in a dataset to reach a desired accuracy?
Can we do better than (lazy) uniform sampling of the data in a given training epoch?

It actually turns out that on MNIST, we can reliably speed up training by a factor of 2 using just 30% of the available data¹!

Motivation
Refresher
- Stochastic Gradient Descent
- Importance Sampling
Quantifying Sample Importance
Loss Patterns
SGD on Steroids
- Mini-Batch Resampling
- Auxiliary Model
Things I Wish I Tried
Closing Thoughts

Motivation

Human beings acquire knowledge in a unique way, accelerating their learning by choosing where and when to focus their efforts on the available training material. For example, when practicing a new musical composition, a pianist will spend more time on the difficult measures – breaking them down into manageable pieces that can be progressively mastered – rather than wasting her efforts on the simpler, more familiar parts.

Annotated Copy of Bach’s Solo Violin Sonata No. 2

Much of the same can be said about our formal primary and secondary education: our teachers help us learn from a smart selection of examples, leveraging previously acquired concepts to help guide our learning of new tools and abstractions. Human learning thus exhibits resource and time efficiency: we become proficient at mastering new concepts by selecting first, a subset of what is available to us in terms of learning material, and second, the sequence in which to learn the selected items such that we minimize acquisition time.

Unfortunately, the training algorithms we use in AI, unlike human learning, are data hungry and time consuming. With vanilla stochastic gradient descent (SGD) for example, the standard go-to optimizer, we repetitively iterate over the training data in sequential mini-batches for a large number of epochs, where a mini-batch is constructed by uniformly sampling $b$ training points from the dataset. On large datasets – a necessity for good generalization – the naiveté of this sampling strategy hinders convergence and bottlenecks computation.

Refresher

So how can we improve SGD? Can we replace uniform sampling with a more efficient sampling distribution? More specifically, can we somehow predict a sample’s importance such that we adaptively construct training batches that catalyze more learning-per-iteration? These are all excellent questions we’ll be tackling further in the post, so let’s begin by refreshing a few concepts.

Stochastic Gradient Descent. Given a neural network $M$ parameterized by a set of weights $W$ , a dataset $\mathcal{D}$ , and a loss function $L$ , we can express the goal of training as finding the optimal set of weights $\hat{W}$ such that,

$\begin{equation} \begin{split} \hat{W} & = \arg \min_{W} \ L_{\mathcal{D}} \\ & = \arg \min_{W} \ \frac{1}{B} \sum_{i=1}^{B} L_i \\ & = \arg \min_{W} \frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{b} L_{ij} \big( M(x_j; W), y_j \big) \\ \end{split} \end{equation}$

where $B$ corresponds to the number of batches in an epoch, $b$ the number of training observations in a batch, and $(x_j, y_j)$ an input-output training pair.

Converging to an Optimum with SGD

Without loss of generality, we can simplify the notation by considering just one training observation, a special case where the batch size is equal to 1. In that case, training our neural network $M$ amounts to updating the weight vector $W$ by taking a small step in the direction opposite to the gradient of the loss with respect to $W$ between two consecutive iterations:

$W_{t+1} = W_t - \alpha \ \mu_i \ \nabla_{W_t} L_i$

In the above equation, $i$ is a discrete random variable sampled from $\mathcal{D}$ according to a probability distribution $\mathcal{P}$ with probabilities $p_i$ and sampling weights $\mu_i$ . With vanilla SGD and uniform sampling, we have that $\forall i \in \mathcal{D}$ ,

$\begin{equation*} \mu_i = 1 \\ p_i = \frac{1}{|\mathcal{D}|} \end{equation*}$

Importance Sampling. Importance sampling is a neat little trick for reducing the variance of an integral estimation by selecting a better distribution from which to sample a random variable. The trick is to multiply the integrand by a cleverly disguised $1$ :

$\begin{equation} \begin{split} E_{x \sim p(x)} \big[\ f(x) \big] & = \int f(x)\ p(x)\ dx \\ & = \int f(x)\ p(x)\ \frac{q(x)}{q(x)}\ dx \\ & = \int \frac{p(x)}{q(x)}\cdot f(x)\ q(x)\ dx \\ & = E_{x \sim q(x)} \big[\ f(x)\cdot \frac{p(x)}{q(x)} \big] \\ \end{split} \end{equation}$

Since many quantities of interest (probabilities, sums, integrals)² can be obtained by computing the mean of a function of a random variable $E[f(X)]$ , we can greatly accelerate – and even improve – Monte-Carlo estimates by switching out the original probability distribution with a density that minimizes the sampling of points that contribute very little to the estimate, i.e. points with a function value of 0.

Smaller Point Spread with Importance Sampling

For a tutorial on Monte-Carlo estimation and Importance Sampling, click here.
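As a quick worked instance of the identity above, take $p = \mathcal{N}(0, 1)$ and $f(x) = \mathbb{1}[x > 4]$ , so that $E[f(X)]$ is a tail probability of roughly $3 \times 10^{-5}$ . Sampling from $p$ almost never lands in the region that matters, whereas sampling from a shifted proposal $q = \mathcal{N}(4, 1)$ and reweighting by $p(x)/q(x) = \exp(-4x + 8)$ does. The choice of proposal and the sample size are mine, purely for illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
num_samples = 100_000
threshold = 4.0

# Reference value computed in closed form: P(X > t) = 0.5 * erfc(t / sqrt(2)).
true_value = 0.5 * math.erfc(threshold / math.sqrt(2))

# Naive Monte-Carlo: sample from p and count how often f(x) = 1.
naive_draws = rng.standard_normal(num_samples)
naive_estimate = np.mean(naive_draws > threshold)

# Importance sampling: sample from q = N(t, 1) and reweight by p(x) / q(x).
proposal_draws = rng.normal(loc=threshold, scale=1.0, size=num_samples)
weights = np.exp(-threshold * proposal_draws + 0.5 * threshold**2)
importance_estimate = np.mean((proposal_draws > threshold) * weights)

print("true:      ", true_value)
print("naive:     ", naive_estimate)
print("importance:", importance_estimate)
```

With the same budget, the naive estimate rests on only a handful of hits (if any), while roughly half of the proposal draws contribute to the importance-sampled one.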

Quantifying Sample Importance

In the previous section, we mentioned that uniform sampling assigns equal importance to all the training points in $\mathcal{D}$ . This is obviously wasteful: while some samples are “easy” for the model and can be discarded in the initial stages with minimal impact on performance, the more “difficult” samples should be addressed more frequently throughout the training since they contribute to faster learning. So can we find a way to quantify this “importance”?

Fortunately, the answer is yes: we can theoretically³ show that this quantity is none other than the norm of the gradient of a sample. Intuitively this makes sense: in the classification setting for example, we would expect misclassified examples to exhibit larger gradients than their correctly classified counterparts. Unfortunately, the norm of the gradient is pretty expensive to compute, especially in settings where we would like to avoid computing a full forward and backwards pass.

What about the loss of a sample? We essentially get it for free in the forward pass, so if we can show some degree of correlation with the gradient norm, it would be a less accurate but way cheaper metric for importance. Let’s try and verify this with a small PyTorch experiment. We’re going to train a small convnet on MNIST and record both the loss and gradient norm of every image in an epoch. We’ll then sort the list containing the gradient norms and use it to index the list of losses. A scatter plot of the reindexed losses should reveal a few things:

If there is indeed a correlation, there should be a (potentially noisy) straight line through the scatter plot.
If the correlation is positive – implying that a higher gradient norm corresponds to a higher loss value and vice versa – this line should be increasing.

EDIT (08/06/2019): @AruniRC kindly mentioned that I can compute the Pearson correlation coefficient to numerically quantify the degree of correlation between the gradient norm and the loss value. I’ve now added a cell in the notebook to compute it.

Here’s a code snippet for computing the L2 norm of the gradient of a batch of losses with respect to the parameters of the network. Since there’s a pair of weights and biases associated with every convolutional and fully-connected layer and we want to return a scalar, we can calculate and return the square root of the sum of the squared gradient norms.

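import torch

def gradient_norm(losses, model):
  """Return the L2 norm of the gradient of each per-sample loss w.r.t. all parameters."""
  params = [p for p in model.parameters() if p.requires_grad]
  norms = []
  for l in losses:
    # retain_graph keeps the shared graph alive for the next sample and for the
    # later loss.backward(); create_graph is not needed since we never
    # differentiate through these gradients
    grad_params = torch.autograd.grad(l, params, retain_graph=True)
    grad_norm = 0
    for grad in grad_params:
      # accumulate the squared norm of every weight and bias gradient
      grad_norm += grad.norm(2).pow(2)
    norms.append(grad_norm.sqrt())
  return norms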

Incorporating the above function in the training loop is pretty trivial. All we need to do is record a (grad_norm, loss) tuple for every image in the dataset.

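# train for 1 epoch
epoch_stats = []
for batch_idx, (data, target) in enumerate(train_loader):
  data, target = data.to(device), target.to(device)
  optimizer.zero_grad()
  output = model(data)
  # per-sample losses (no reduction) so that each image gets its own gradient norm
  losses = F.nll_loss(output, target, reduction='none')
  grad_norms = gradient_norm(losses, model)
  # offset by the nominal batch size, since the last batch may be smaller than the rest
  start = batch_idx * train_loader.batch_size
  indices = [start + i for i in range(len(data))]
  batch_stats = []
  for i, g, l in zip(indices, grad_norms, losses):
    # detach so the stored tensors don't keep the computation graph alive
    batch_stats.append([i, [g.detach(), l.detach()]])
  epoch_stats.append(batch_stats)
  loss = losses.mean()
  loss.backward()
  optimizer.step()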

We can compute the correlation between grad_norms and losses using the following one-liner:

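# grad_norms and losses are assumed to be 1-D arrays of per-sample values for the whole epoch;
# np.corrcoef returns the 2x2 correlation matrix, whose off-diagonal entry is Pearson's r
corr = np.corrcoef(grad_norms, losses)
print("Pearson Correlation Coeff: {}".format(corr[0, 1]))  # prints ~0.83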

This returns a value of 0.83 which shows a strong relationship between both variables. Next, we verify this intuition graphically by indexing our losses using the sorted gradient norms and generating the aforementioned scatter plot.

# reindex the losses using the sorted gradient norms
flat = [val for sublist in epoch_stats for val in sublist]
sorted_idx = sorted(range(len(flat)), key=lambda k: flat[k][1][0])
sorted_losses = [flat[idx][1][1].item() for idx in sorted_idx]

Sorted Losses According to Gradient Norm

The above plot suggests that we can indeed use the loss value of a sample as a proxy for its importance. This is exciting news and opens up some interesting avenues for improving SGD.

If you want to reproduce the above logic, click here.

Loss Patterns

In this section, we’ll try to answer the following question:

Is a sample’s importance consistent across epochs? In other words, if a sample exhibits low loss in the early stages of training, is this still the case in later epochs?

There is substantial benefit in providing empirical evidence for this hypothesis. The reasons are two-fold: first, by eliminating consistently low-loss images from the dataset, we reduce training time in proportion to the number of discarded images; second, by oversampling the high-loss images, we reduce the variance of the gradients and speed up convergence to $\hat{W}$ .

To explore this idea, we’re going to track every sample’s loss over a set number of epochs. We’ll bin the loss values into 10 quantiles and compare the histograms over the different epochs. Finally, we’ll repeat these steps with shuffling turned off, then turned on.

NB: We need to be a bit careful with keeping track of a sample’s index when shuffling is turned on. The solution is to create a permutation of [0, 1, 2, ..., 59,999] at the beginning of every epoch and feed it to a sequential sampler with shuffling turned off. By remapping the indices to their true ordering relative to the permutations at the end of training, we would have effectively simulated random shuffling.

If this sounds complicated, let me show you how simple it is to achieve in PyTorch:

from torch.utils.data import DataLoader, Sampler
from torchvision import transforms
from torchvision.datasets import MNIST

# PermSampler takes a list of `indices` and iterates over it sequentially
class PermSampler(Sampler):
  def __init__(self, indices):
    self.indices = indices
  def __iter__(self):
    return iter(self.indices)
  def __len__(self):
    return len(self.indices)

# if `permutation` is None, we return a data loader with no shuffling
# if `permutation` is a list of indices, we return a data loader that iterates
# over the MNIST dataset with indices specified by `permutation`.
def get_data_loader(data_dir, batch_size, permutation=None):
  normalize = transforms.Normalize(mean=(0.1307,), std=(0.3081,))
  transform = transforms.Compose([transforms.ToTensor(), normalize])
  dataset = MNIST(root=data_dir, train=True, download=True, transform=transform)
  sampler = None
  if permutation is not None:
    sampler = PermSampler(permutation)
  # the sampler decides the iteration order, so shuffle must stay False
  loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, sampler=sampler)
  return loader

After training for 5 epochs, we collect a list containing a tuple (idx, loss_idx) for every image in the dataset. We can remap the indices with the following code:

# remap the indices based on the permutations list
for stat, perm in zip(stats_with_shuffling_flat, permutations):
  for i in range(len(stat)):
    stat[i][0] = perm[i]

Finally, we bin the sorted losses of every epoch into 10 bins and compute the percent match of bins across all epochs, the last 4 epochs, and the last 2 epochs.

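import numpy as np

def percentage_split(seq, percentages):
  cdf = np.cumsum(percentages)
  assert np.allclose(cdf[-1], 1.0)
  # round before casting so that floating-point error in the cumulative sum
  # cannot shift a bin boundary by one
  stops = [int(round(c * len(seq))) for c in cdf]
  return [seq[a:b] for a, b in zip([0]+stops, stops)]

def bin_losses(all_epochs, num_quantiles=10):
  percentile_splits = []
  for ep in all_epochs:
    # rank the sample positions from highest to lowest loss
    sorted_loss_idx = sorted(range(len(ep)), key=lambda k: ep[k][1], reverse=True)
    # equal-sized bins whose fractions sum to 1 for any number of quantiles
    splits = percentage_split(sorted_loss_idx, [1/num_quantiles]*num_quantiles)
    percentile_splits.append(splits)
  return percentile_splits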

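from functools import reduce

num_quantiles = 10
percentile_splits = bin_losses(all_epochs, num_quantiles)

# starting epochs: all 5 epochs, the last 4, and the last 2
fr = [0, 1, 3]
all_matches = []
for f in fr:
  percent_matches = []
  for i in range(num_quantiles):
    # gather bin i from every epoch in the range
    percentile_all = []
    for j in range(f, len(percentile_splits)):
      percentile_all.append(percentile_splits[j][i])
    # samples that land in bin i in every one of those epochs
    matching = reduce(np.intersect1d, percentile_all)
    percent = 100 * len(matching) / len(percentile_all[0])
    percent_matches.append(percent)
  all_matches.append(percent_matches)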

It’s interesting to compute percent matches across a varying range of epochs. The reason is that the training dynamics are less stable in the early epochs when the model weights are still random (analogous to transient response and steady state in circuit theory). For example, we would expect to have higher percent matches if we eliminate the first epoch from the analysis – and this is verified in the below plot!

The histograms confirm our hypothesis:

~ 30% of the samples with a loss value in the top 10% consistently rank in those ranges across all epochs. This number increases to ~ 60% across epochs 1 through 4 and ~ 85% across the last two epochs.
~ 30% of the samples with a loss value in the bottom 10% consistently rank in those ranges across all epochs. This number increases to ~ 50% across epochs 1 through 4 and ~ 70% across the last two epochs.
Shuffling has a minimal impact on the loss evolution of the samples across epochs.

If you want to reproduce the histograms, click here.

SGD on Steroids

Mini-Batch Resampling. In the first version of SGD-S, we’re going to split our training epochs into 2 stages:

Transient Epochs: in the transient epochs, we train our model exactly as we would in regular SGD. However, in the last epoch, we record and return the losses of every image in the dataset.
Steady-State Epochs:
- For every epoch in the steady-state, we sample batches using the loss as the sampling distribution.
- At the end of every epoch in the steady-state, we eliminate the 10% of images with the lowest losses. Furthermore, we can choose to randomly reintroduce a fraction of the discarded images to combat potential catastrophic forgetting.

Let’s illustrate how we can use the loss function to construct an importance sampling distribution for mini-batch resampling. This is achievable using PyTorch’s WeightedRandomSampler in conjunction with the DataLoader.

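# sort the loss in decreasing order
sorted_loss_idx = sorted(range(len(losses)), key=lambda k: losses[k][1], reverse=True)

# house cleaning: drop the lowest-loss images, then reintroduce a random 1%
# of the discarded ones (n_remove is at least 1 because a [-0:] slice selects everything)
n_remove = max(1, int((perc_to_remove / 100) * len(sorted_loss_idx)))
to_remove = sorted_loss_idx[-n_remove:]
to_keep = sorted_loss_idx[:-n_remove]
n_add = min(len(to_remove), int(.01*len(sorted_loss_idx)))
to_add = [int(i) for i in np.random.choice(to_remove, n_add, replace=False)]

new_idx = to_keep + to_add
new_idx.sort()

# the sampler draws positions into `weights` (and hence into new_idx), with replacement
weights = [losses[idx][1] for idx in new_idx]
sampler = WeightedRandomSampler(weights, len(weights), True)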
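One detail worth spelling out: WeightedRandomSampler yields positions in the `weights` list (0 to `len(weights) - 1`), not dataset indices, so once images have been discarded the draws need to be mapped back through the list of kept indices. The NumPy sketch below shows that mapping. It also returns the standard importance weights $\mu_i = 1 / (N p_i)$ that keep the gradient estimate unbiased; whether SGD-S should apply this correction is my own assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def sample_batch(kept_indices, kept_losses, batch_size, rng):
    """Draw a batch with probability proportional to loss; return dataset indices and weights."""
    probs = np.asarray(kept_losses, dtype=float)
    probs = probs / probs.sum()                      # p_i proportional to the loss
    positions = rng.choice(len(probs), size=batch_size, replace=True, p=probs)
    dataset_indices = np.asarray(kept_indices)[positions]   # map positions back to the dataset
    mu = 1.0 / (len(probs) * probs[positions])       # mu_i = 1 / (N * p_i)
    return dataset_indices, mu

rng = np.random.default_rng(0)
idx, mu = sample_batch(kept_indices=[10, 20, 30], kept_losses=[0.1, 0.1, 9.8],
                       batch_size=8, rng=rng)
print(idx)
print(mu)
```

High-loss images dominate the batch but get a small $\mu_i$ , while the rare low-loss draws are up-weighted. With uniform losses, every $\mu_i$ reduces to 1 and we recover vanilla SGD.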

Auxiliary Model.

Things I Wish I Tried

Closing Thoughts

CIFAR results pending. ↩
Explain how. ↩
Add proof or point to it. ↩

Getting Up and Running with PyTorch on Amazon Cloud

Sun, 13 Aug 2017 00:00:00 +0000

This is a succinct tutorial aimed at helping you set up an AWS GPU instance so that you can train and test your PyTorch models in the cloud. If, like me, you don’t own a GPU, this can be a great way of drastically reducing the training time of your models, so while your instance is furiously crunching numbers in some faraway Amazon server, you can peacefully experiment with and prototype new architectures from the comfort of a Starbucks couch.

I mean we all love a silent macbook, right?

The cool part is that if you’re a high school or college student, you can sign up for a GitHub Student Developer Pack which will get you $150 worth of free AWS credits. That’s around 167 hours or 7 days of compute time¹, an amply sufficient amount for those fun weekend side projects and experiments. As usual, any code or script that appears on this page can be downloaded from my Blog Repository. And on that note, let’s get started!

Configuring Your EC2 Instance
Launching & Managing Your EC2 Instance
SSH Persistence With TMUX
Conclusion

Configuring Your EC2 Instance

I’m assuming you’ve already created an AWS account but if you haven’t, the whole process shouldn’t take you more than 2 minutes. Note that it will require you to enter your credit card information which is necessary to charge you if and when you exceed your free credits. Now’s also a great time to claim your GitHub Student Developer Pack credits so go ahead and do that.

Pick your Region. Ok, so the instance type we are going to use is located in US West (Oregon) so make sure the region information on the top right of the screen correctly reflects that.

Limit Increase. The next thing we need to do is request a limit increase for EC2 instances. For some weird reason, Amazon automatically sets the limit to 0 upon account creation so it has to be increased by sending in a support ticket.

Go ahead and click Support > Support Center at the top right of your screen. This will direct you to a page with a blue Create Case button that you should click. You’ll be greeted with the following:

We want a Limit Increase for EC2 instances meaning you need to select Service Limit Increase in Regarding and EC2 Instances in Limit Type. Now fill in the Request 1 box and Use Case Description as I’ve done here.

Finally, make sure to select Web as your Contact method and submit the request. Note that the time of response varies: I’ve had limit increases resolved in a matter of minutes and sometimes up to a full day, so be patient. Also, feel free to change the New limit value to suit your needs. I’ve opted for 2 because the p2.xlarge instance type we’ll be working with has a single GPU with memory constraints that may limit the number of jobs I may run concurrently.

Configure Instance. Ok, we’re now ready to create and configure our EC2 instance. Back on the home page console (click on the orange cube in the top left), navigate to EC2 in the Compute services section, and then click on the blue Launch Instance button.

You’ll be greeted with a 7-step process like so.

AMI. First select the Ubuntu Server 16.04 LTS (HVM), SSD Volume Type as the AMI of choice.

Instance. Select p2.xlarge as your instance type. This is an instance with a single GPU which is what we asked for in our limit increase request.

Spot Instances. At this point, you should be on the Configure Instance Details step. This is where things get interesting. In fact, Amazon gives us the ability to bid on spare Amazon EC2 computing capacity for a much cheaper price than the on-demand one.

Basically, what that means is that if our bid price is higher than the current market price, our instance will be launched and charged at that price. The only downside is that if that ever flips around, instances get terminated instantly and with no warning².

TL;DR: Spot instances can be ideal for non-critical experimentation like hyperparameter tuning but stay away from them if you need to train a model for a large number of epochs.

I’ll assume the user uses On-Demand pricing for the remainder of this post but if you do want to find out more about Spot Instances, feel free to watch this YouTube video.

Add Storage. Next, we’ll be increasing the size of our Root Volume to accommodate large datasets such as ImageNet which is around 48 GB. Feel free to enter any number above that.

Note that the Root Volume is EBS-backed meaning it persists on instance termination. The default behavior however is to delete it on termination. Weird right? Well, not really. With ephemeral storage, the other type of storage AWS offers, there is no persist option, whether it be on instance stop or terminate. Thus EBS with delete-on-terminate gives us the ability to keep our data on disk when the instance is stopped!

Configure Security Group. You can skip the Add Tags section and jump to this last step. This part is important because it will allow us to monitor our training with TensorBoard and use Jupyter Notebook. We’ll be adding 4 protocols as shown in the picture below.

Once you click the launch button, a window will pop up and prompt you to create a key-pair. This little file is needed when ssh-ing into your instance, so download it and store it in a secure location you’ll remember. For this tutorial’s sake, I’ll be calling mine aws-dl.pem and storing it in my Downloads folder.

Launching & Managing Your EC2 Instance

We’ve finally arrived at the point where we can ssh into our EC2 instance. To do so, you’ll need to navigate to the Instances page located in the navigation panel on the left of your screen. You’ll be greeted with the following:

You need to take note of 2 things:

Public DNS (IPv4): ec2-52-42-90-161.us-west-2.compute.amazonaws.com
IPv4 Public IP: 52.42.90.161

Other than that, there are just 2 ways to interact with your instance you need to be aware of: login with ssh and copy a file to it with scp.

ssh -v -i X ubuntu@Y where X represents the path to the key-pair file and Y represents the Public IP of your instance.
scp -i W -r X ubuntu@Y:Z where W is the path to the key-pair file, X is the path to the local file, Y is the Public IP, and Z is the destination path on the instance.

It’s important to note that if you’re using the key-pair file for the very first time, you’ll need to restrict its permissions to owner read and write by running chmod 600 ~/Downloads/aws-dl.pem. Otherwise, ssh will refuse to use the key.

With all that being said, we can finally fire up a terminal and execute the following command:

ssh -v -i ~/Downloads/aws-dl.pem ubuntu@52.42.90.161

Enter yes, and voila! You should be successfully logged in. The instance is still not ready for use as there are a few more things that need to be done, but fear not. I’ve created a small bash script that you can execute which automates the following:

It downloads and installs the required NVIDIA GPU drivers.
It updates and upgrades the distribution packages.
It installs python3 along with virtualenv.
It creates a virtualenv called deepL that will house all the required pip packages and PyTorch.
And it finally installs PyTorch v0.2.

Go ahead and download install.sh from my repo and save it to your Desktop. We need to copy it to our instance, so apply the command I mentioned above:

scp -i ~/Downloads/aws-dl.pem -r ~/Desktop/install.sh ubuntu@52.42.90.161:~/.

Next, go back to the terminal window logged into the instance and execute the following 2 commands:

chmod +x install.sh
./install.sh

Once that’s done, you’ll need to reboot your instance. Enter exit at the command line and navigate to your browser as in the image below. Be patient and wait for a few minutes before you ssh back into the instance!

At this point, we should sanity check our installation by seeing if PyTorch loads correctly.

First, activate the virtualenv by executing source ~/envs/deepL/bin/activate.
Enter python and inside the interpreter, import torch then torch.__version__. Fingers crossed, this should print out 0.2.0_1.
Lastly, check that the GPU is visible by typing torch.cuda.is_available() which should print out True.
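If you’d rather not type the checks above by hand, they can be wrapped in a small script. This is only a sketch: the helper name check_pytorch is my own, and it simply reports what it finds rather than asserting a particular version.

```python
def check_pytorch():
    """Report whether PyTorch imports, its version, and whether a GPU is visible."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {"installed": False, "version": None, "cuda": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "cuda": bool(torch.cuda.is_available()),
    }

status = check_pytorch()
print(status)
```

Run it inside the activated virtualenv. If installed is False, the environment was most likely not activated.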

Once you’ve finished working on your instance, you should stop it immediately to avoid incurring additional charges.

SSH Persistence With TMUX

I would be doing you a great disservice if I didn’t mention this nifty little package called tmux that you can use when running your instances for long periods of time. What exactly is tmux, and why should you use it?

Well, if you’re sshed into an instance, peacefully running a job, and your connection suddenly drops, your ssh session will automatically get killed. This means anything running in that session stops as well (i.e. your model will stop training). Closing your laptop to commute from university to your house, for example, becomes a big no-no.

A TMUX session

This is where tmux comes in! Tmux makes it so that anything running within a session persists even if the connection drops or the terminal gets killed. To see it in action, I’d suggest you watch the following video.

Thus, your workflow should always be as follows:

SSH into your aws instance.
Create a new tmux session called work using the command tmux new -s work.
Do everything as you would previously.
Detach from the session by pressing ctrl-b followed by d.

Once you’ve detached yourself from the session, you can work on anything else, even go to sleep… Subsequently, if you need to reattach to that particular tmux session to check your progress, run tmux a -t work.

That’s pretty much it. For a more complete list of tmux commands, you should refer to this lovely cheatsheet.

Conclusion

In this tutorial, we went over the basic steps needed to create a free, GPU-powered Amazon AWS instance. We explored how to interact with our instance using the ssh and scp commands and how a bash script could be leveraged to download and install all the required packages needed to run PyTorch. Finally, we saw how we could make our ssh session persistent using a very important program called tmux.

Until next time!

This is for a GPU-powered p2.xlarge instance with an on-demand price of around $0.9/hr. ↩
A terminated instance gets deleted, meaning you lose whatever’s on there permanently. On the other hand, a stopped instance just goes offline so you don’t get charged for it and you can fire it back up again at a later time. ↩

Understanding Recurrent Neural Networks - Part I

Thu, 20 Jul 2017 00:00:00 +0000

Recurrent Neural Networks have been my Achilles’ heel for the past few months. Admittedly, I haven’t had the grit to sit down and work out their details, but I’ve figured it’s time I stop treating them like black boxes and try instead to discover what makes them tick. My intentions with this series are hence twofold: first, to combat my weakness by understanding their inner workings and coding one from scratch; and second, to write down what I learn in order to reinforce the insights I may gain along the way.

In this first installment, we’ll be introducing the intuition behind RNNs, motivating their use by highlighting a glaring limitation of traditional neural networks. We’ll then transition into a more technical description of their architecture which will be useful for the next installment where we’ll code one from scratch in numpy.

Human Learning
The Woes of Traditional Neural Nets
Enhancing Neural Networks with Memory
The Nitty Gritty Details
References

Human Learning

We are the sum total of our experiences. None of us are the same as we were yesterday, nor will be tomorrow.
B.J. Neblett

There is an inherent truth to the quote above. Our brain pools from past experiences and combines them in intricate ways to solve new and unseen tasks. It is hardwired to work with sequences of information that we perpetually store and call upon over the course of our lives. At its core, human learning can be distilled into two fundamental processes:

memorization: every time we gain new information, we store it for future reference.
combination: not all tasks are the same, so we couple our analytical skills with a combination of our memorized, previous experiences to reason about the world.

Consider the following pictures.

Even though it’s in a very weird position, a child can instantly tell that the fur ball in front of it is a cat. It’ll recognize the ears, the whiskers and the snout (memory) but the shape of it all may throw it off. Subconsciously however, the child may recall how human stretching deforms shape and pose (combination), and infer that the same is happening to the cat.

Not all tasks require the distant past however. At times, solving a problem makes use of information that was processed only moments ago. For example, take a look at this incomplete sentence:

I bought my usual caramel-covered popcorn with iced tea and headed to the ___.

If I asked you to fill in the missing word, you’d probably guess “movies”. How did you know that “library” or “Starbucks” were invalid words? Well, it’s probably because you used context, or information from earlier in the sentence, to infer the correct answer. Now think about the following. If I asked you to recite the lyrics of your favorite song backwards, would you be able to do it? Probably not… What about counting backwards? Yeah, piece of cake!

So what makes reciting the song backwards so excruciatingly difficult? The answer is that counting backwards is done on the fly. There is a logical relationship between each number, and knowing the order of the 10 digits and how subtraction works means you can count backwards from say 1845098 even if you’ve never done it before. On the other hand, you memorized the lyrics of the song in a specific order. Your brain works by indexing from one word to the next, starting from the first word. It’s hard to index backwards for the simple reason that your brain has never done it before, so that specific sequence was never stored. Think of the memorized lyric sequence as a giant ball of yarn whose unraveled end can only be accessed with the correct first word in the forward sequence.

The main takeaway is that our brains are naturally talented at working with sequences and they do so by relying on a deceptively simple, yet powerful concept called information persistence.

The Woes of Traditional Neural Nets

We live in a world that is inherently sequential. Audio, video, and language (even your DNA!) are but a few examples of data in which information at a given time step is intricately dependent on information from previous timesteps. So how is all this related to deep learning? Well, think about feeding a sequence of frames from a video into a neural network and asking it to predict what comes next… Or, back to our previous example, feeding a set of words and asking it to complete the sentence.

It should be obvious to you that information from the past is crucial for outputting a sane and plausible prediction. But traditional neural networks can’t do this because they operate on the fundamental assumption that inputs are independent! This is a problem because it means our output at any given time is completely and solely determined by the input at that same time. There is no previous history and our network cannot capitalize on the complex temporal dependencies that exist between the different frames or words to refine its predictions.

This is where Recurrent Neural Networks come in! RNNs allow us to deal with sequences by incorporating a mechanism that stores and leverages information from previous history, sort of like a memory. Put differently, whereas a traditional net maps one input to an output, a recurrent net maps an entire history of previous inputs to each output. If that’s still obscure to you, just think of RNNs as a traditional neural net enhanced with a loop¹, one that allows for information to persist across timesteps.

(Video Courtesy) DRAW model improving its output by iterating over the canvas rather than producing the image in one shot.

It is important to note that recurrent neural nets aren’t just bound to sequential data in the sense that many problems can be tackled by decomposing them into a series of smaller subproblems. The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step. As an example, a recurrent net² was used to generate handwritten digits in a sequential fashion, mimicking the way artists refine and reassess their work with brushstrokes.


Enhancing Neural Nets with Memory

So how exactly can we endow our networks with the ability to memorize? To answer this question, let’s recall our basic hidden layer neural network, which takes as input a vector X, dot products it with a weight matrix W and applies a nonlinearity. We’ll consider the output y when three successive inputs are fed through the network. Note that the bias term has been eliminated so as to simplify the notation, and I’ve taken the liberty of coloring the equations to make certain patterns stand out.

$y_0 = f(W_x\color{blue}{X_0})$ $y_1 = f(W_x \color{green}{X_1})$ $y_2 = f(W_x \color{red}{X_2})$

Given the simple API above, it’s pretty clear that each output is solely determined by its input, i.e. there is no trace of past inputs in the calculation of its value. So let’s alter the API by allowing our hidden layer to use a combination of both the current input and the previous input, and visualize what happens.

$y_0 = f(W_x\color{blue}{X_0})$ $y_1 = f(W_x \color{green}{X_1} + W_h\color{blue}{X_0})$ $y_2 = f(W_x \color{red}{X_2} + W_h\color{green}{X_1})$

Nice! By introducing recurrence into the formula, we’ve managed to obtain a mix of 2 colors in each hidden layer. Intuitively, our network now has a memory depth of 1, equivalent to “seeing” one step backwards in time. Remember though that our goal is to be able to capture information across all previous timesteps, so this does not cut it.

Hmm… What if we feed in a combination of the current input and the previous hidden layer?

$y_0 = f(W_x\color{blue}{X_0})$ $y_1 = f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)$ $y_2 = f\bigg(W_x \color{red}{X_2} + W_h \ f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)\bigg)$

Much better! Our layer at each timestep is now a blend of all the colors that have come before it, allowing our network to take into account all its past history when computing its output. This is the power of recurrence in all its glory: creating a loop where information can persist across timesteps.
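To make the color-mixing argument concrete, here is a minimal numpy sketch. The layer sizes, the random weights and the choice of tanh for $f$ are all assumptions made for illustration. It perturbs $X_0$ and checks whether $y_2$ notices: in the plain network it cannot, while in the recurrent one it does.

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = 0.5 * rng.normal(size=(4, 3))   # input weights (hypothetical sizes)
W_h = 0.5 * rng.normal(size=(4, 4))   # recurrent weights
f = np.tanh                           # assumed nonlinearity

def feedforward_outputs(xs):
    """y_t = f(W_x X_t): each output sees only its own input."""
    return [f(W_x @ x) for x in xs]

def recurrent_outputs(xs):
    """y_t = f(W_x X_t + W_h y_{t-1}): each output also sees the previous one."""
    ys, prev = [], np.zeros(4)        # zero history, so y_0 = f(W_x X_0)
    for x in xs:
        prev = f(W_x @ x + W_h @ prev)
        ys.append(prev)
    return ys

xs = [rng.normal(size=3) for _ in range(3)]
xs_perturbed = [xs[0] + 1.0, xs[1], xs[2]]   # change only X_0

ff_change = np.abs(feedforward_outputs(xs)[2] - feedforward_outputs(xs_perturbed)[2]).max()
rnn_change = np.abs(recurrent_outputs(xs)[2] - recurrent_outputs(xs_perturbed)[2]).max()
print(ff_change)    # 0.0: y_2 never saw X_0
print(rnn_change)   # nonzero: X_0 leaked into y_2 through the loop
```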

The Nitty Gritty Details

Image Courtesy

At its core, an RNN can be represented by an internal, hidden state h that gets updated with every timestep and from which an output y can be optionally derived³. This update behavior is governed by the following equations:

$\begin{cases} h_t = f \big(W_{xh}x_t + W_{hh}h_{t-1}+b_1\big) \\ y_t = g \big(W_{hy}h_t + b_2\big) \end{cases}$

Don’t let the above notation scare you. It’s actually very simple once you dissect it.

$W_{xh}x_t$ - we’re multiplying the input $x_t$ by a weight matrix $W_{xh}$ . You can think of this dot product as a way for the hidden layer to extract information out of the input.
$W_{hh}h_{t-1}$ - this dot product is allowing the network to extract information from an entire history of past inputs which it will use in conjunction with information gathered from the current input, to compute its output. This is the crucial, self-defining property of RNNs.
$f$ and $g$ are activation functions that squash the dot products to a specific range. The function $f$ is usually tanh or ReLU. $g$ can be a softmax when we want to output class probabilities.
$b_1$ and $b_2$ are biases that help offset the outputs away from the origin (similar to the b in your typical $ax+b$ line).

As you can see, the Vanilla RNN model is quite simple. Once its architecture has been defined, training it is exactly the same as with normal neural nets, i.e. initializing the weight matrices and biases, defining a loss function and minimizing that loss function using some form of gradient descent.
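As a preview, the two update equations translate almost line for line into numpy. The following is a rough sketch of the forward pass only, not the implementation from the next installment. The layer sizes, the zero initial hidden state, and the choice of tanh for $f$ and softmax for $g$ are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_forward(xs, W_xh, W_hh, W_hy, b1, b2):
    """h_t = tanh(W_xh x_t + W_hh h_{t-1} + b1), y_t = softmax(W_hy h_t + b2)."""
    h = np.zeros(W_hh.shape[0])      # initial hidden state, assumed to be zero
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b1)
        hs.append(h)
        ys.append(softmax(W_hy @ h + b2))
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 4, 6      # hypothetical sizes
params = dict(
    W_xh=rng.normal(scale=0.1, size=(n_hidden, n_in)),
    W_hh=rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
    W_hy=rng.normal(scale=0.1, size=(n_out, n_hidden)),
    b1=np.zeros(n_hidden),
    b2=np.zeros(n_out),
)
xs = rng.normal(size=(T, n_in))            # a sequence of T input vectors
hs, ys = rnn_forward(xs, **params)
print(hs.shape, ys.shape)                  # (6, 5) (6, 4)
```

Each row of ys is a probability distribution over the output classes, one per timestep.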

This concludes our first installment in the series. In next week’s blog post, we’ll be coding our very own RNN from the ground up in numpy and applying it to a language modeling task. Stay tuned until then…

References

There are a ton of resources that helped me better grasp the fundamentals of RNNs. I’d like to thank iamtrask especially, for letting me use his idea of colors to explain neural memory. You can read his amazing blog post here.

Denny Britz’s RNN series - click here
Andrej Karpathy’s Blog Post - click here
Chris Olah’s Blog Post - click here

If you’re familiar with Control Theory, this should be slightly reminiscent of a feedback loop, although not quite. ↩
I’m referring to the DRAW model introduced by Gregor et al. at Deepmind. ↩
In the simplest of cases, the hidden state $h_t$ is used as both the output $y_t$ and input to the next hidden state $h_{t+1}$ . ↩

Deep Learning Paper Implementations: Spatial Transformer Networks - Part II

Wed, 18 Jan 2017 00:00:00 +0000

Image Courtesy

In last week’s blog post, we introduced two very important concepts: affine transformations and bilinear interpolation and mentioned that they would prove crucial in understanding Spatial Transformer Networks.

Today, we’ll provide a detailed, section-by-section summary of the Spatial Transformer Networks paper, a concept originally introduced by researchers Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu of Google Deepmind.

Hopefully, it’ll give you a clear understanding of the module and prove useful for next week’s blog post where we’ll cover its implementation in Tensorflow.

Motivation
Pooling Operator
Spatial Transformer Network
Fun with STNs
- Distorted MNIST
- GTSRB dataset
Summary
References

Motivation

When working on a classification task, it is usually desirable that our system be robust to input variations. By this, we mean to say that should an input undergo a certain “transformation” so to speak, our classification model should in theory spit out the same class label as before that transformation. A few examples of the “challenges” our image classification model may face include:

scale variation: variations in size both in the real world and in the image.
viewpoint variation: different object orientation with respect to the viewer.
deformation: non rigid bodies can be deformed and twisted in unusual shapes.

Image Courtesy

For illustration purposes, take a look at the images above. While the task of classifying them may seem trivial to a human being, recall that our computer algorithms only work with raw 3D arrays of brightness values so a tiny change in an input image can alter every single pixel value in the corresponding array. Hence, our ideal image classification model should in theory be able to disentangle object pose and deformation from texture and shape.

For a different type of intuition, let’s again take a look at the following cat images.

Left: Cat images which may present classification challenges. Right: Transformed images which yield a simplified classification pipeline.

Would it not be extremely desirable if our model could go from left to right using some sort of crop and scale-normalize combination so as to simplify the subsequent classification task?

Pooling Layers

It turns out that the pooling layers we use in our neural network architectures actually endow our models with a certain degree of spatial invariance. Recall that the pooling operator acts as a sort of downsampling mechanism. It progressively reduces the spatial size (width and height) of the feature map, operating independently on each depth slice, cutting down the number of parameters and computational cost.

Pooling layer downsamples the volume spatially. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Right: 2x2 max pooling. (Image Courtesy)

How exactly does it provide invariance? Well think of it this way. The idea behind pooling is to take a complex input, split it up into cells, and “pool” the information from these complex cells to produce a set of simpler cells that describe the output. So for example, say we have 3 images of the number 7, each in a different orientation. A pool over a small grid in each image would detect the number 7 regardless of its position in that grid since we’d be capturing approximately the same information by aggregating pixel values.

Now there are a few downsides to pooling which make it an undesirable operator. For one, pooling is destructive. A 2x2 max pool with stride 2 discards 75% of feature activations, meaning we are guaranteed to lose exact positional information. Now you may be wondering why this is bad since we mentioned earlier that it endowed our network with some spatial robustness. Well the thing is that positional information is invaluable in visual recognition tasks. Think of our cat classifier above. It may be important to know where the whiskers are relative to, say, the snout. This can’t be achieved when this is exactly the sort of information we throw away with max pooling.
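Both sides of this trade-off show up in a tiny numpy experiment. This is a sketch that assumes a single-channel map with even height and width, and max_pool_2x2 is a helper name I made up. Shifting an activation by one pixel inside a pooling cell leaves the output unchanged (robustness), but the output also keeps only a quarter of the values (destruction).

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) map; H and W must be even."""
    H, W = x.shape
    # Split the map into (H/2, W/2) cells of size 2x2 and keep each cell's max.
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4))
a[0, 0] = 1.0              # one activation in the top-left corner
b = np.zeros((4, 4))
b[1, 1] = 1.0              # the same activation, shifted by one pixel

pooled_a, pooled_b = max_pool_2x2(a), max_pool_2x2(b)
print(np.array_equal(pooled_a, pooled_b))   # True: the shift is invisible
print(1 - pooled_a.size / a.size)           # 0.75: fraction of values dropped
```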

Another limitation of pooling is that it is local and predefined. With a small receptive field, the effects of a pooling operator are only felt towards deeper layers of the network, meaning intermediate feature maps may suffer from large input distortions. And remember, we can’t just increase the receptive field arbitrarily because that would downsample our feature map too aggressively.

The main takeaway is that ConvNets are not invariant to relatively large input distortions. This limitation is due to having only a restricted, pre-defined pooling mechanism for dealing with spatial variation of the data. This is where Spatial Transformer Networks come into play!

The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. (Geoffrey Hinton, Reddit AMA)

Spatial Transformer Networks (STNs)

The Spatial Transformer mechanism addresses the issues above by providing Convolutional Neural Networks with explicit spatial transformation capabilities. It possesses 3 defining properties that make it very appealing.

modular: STNs can be inserted anywhere into existing architectures with relatively small tweaking.
differentiable: STNs can be trained with backprop allowing for end-to-end training of the models they are injected in.
dynamic: STNs perform active spatial transformation on a feature map for each input sample as compared to the pooling layer which acted identically for all input samples.

As you can see, the Spatial Transformer is superior to the Pooling operator in all regards. So this begs the following question: what exactly is a Spatial Transformer?

Image Courtesy

The Spatial Transformer module consists of three components shown in the figure above: a localisation network, a grid generator and a sampler. Before we dive into each of their details, I’d like to briefly remind you of a 3-step pipeline we talked about last week.

Affine Transformation Pipeline

Recall that we can’t just blindly rush to the input image and apply our affine transformation. It’s important to first create a sampling grid, transform it, and then sample the input image using the grid. With that being said, let’s jump into the core components of the Spatial Transformer.

Localisation Network

The goal of the localisation network is to spit out the parameters $\theta$ of the affine transformation that’ll be applied to the input feature map. More formally, our localisation net is defined as follows:

input: feature map U of shape (H, W, C)
output: transformation matrix $\theta$ of shape (6,)
architecture: fully-connected network or ConvNet as well.

As we train our network, we would like our localisation net to output more and more accurate thetas. What do we mean by accurate? Well, think of our digit 7 rotated by 90 degrees counterclockwise. After say 2 epochs, our localisation net may output a transformation matrix which performs a 45 degree clockwise rotation and after 5 epochs for example, it may actually learn to do a complete 90 degree clockwise rotation. The effect is that our output image looks like a standard digit 7, something our neural network has seen in the training data and can easily classify.

Another way to look at it is that the localisation network learns to store the knowledge of how to transform each training sample in the weights of its layers.

Parametrised Sampling Grid

The grid generator’s job is to output a parametrised sampling grid, which is a set of points where the input map should be sampled to produce the desired transformed output.

Concretely, the grid generator first creates a normalized meshgrid of the same size as the input image U of shape (H, W), that is, a set of indices $(x^t, y^t)$ that cover the whole input feature map (the superscript t here stands for target coordinates in the output feature map). Then, since we’re applying an affine transformation to this grid and would like to use translations, we proceed by adding a row of ones to our coordinate vector to obtain its homogeneous equivalent. This is the little trick we also talked about last week. Finally, we reshape our 6 parameter $\theta$ to a 2x3 matrix and perform the following multiplication which results in our desired parametrised sampling grid.

$\begin{bmatrix} x^{s} \\ y^{s} \end{bmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{bmatrix} x^t \\ y^t \\ 1 \end{bmatrix}$

The column vector $\begin{bmatrix} x^s \\ y^s \end{bmatrix}$ consists of a set of indices that tell us where we should sample our input to obtain the desired transformed output.
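The three steps of the grid generator can be sketched in a few lines of numpy. Treat this as an illustration rather than the implementation from next week’s post. The function name affine_grid is my own, and normalizing the coordinates to [-1, 1] is an assumption, since the text only says the meshgrid is normalized.

```python
import numpy as np

def affine_grid(theta, H, W):
    """Return source coordinates (x_s, y_s), each of shape (H, W)."""
    # 1. Normalized target coordinates covering the whole map (assumed range [-1, 1]).
    x_t, y_t = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    # 2. Homogeneous coordinates: add a row of ones, giving shape (3, H*W).
    grid = np.stack([x_t.ravel(), y_t.ravel(), np.ones(H * W)])
    # 3. Reshape the 6 parameters to a 2x3 matrix and transform the grid.
    src = np.asarray(theta, dtype=float).reshape(2, 3) @ grid
    return src[0].reshape(H, W), src[1].reshape(H, W)

identity = [1, 0, 0, 0, 1, 0]
x_s, y_s = affine_grid(identity, H=3, W=4)
print(np.allclose(x_s[0], np.linspace(-1, 1, 4)))   # True: grid is unchanged

zoom = [0.5, 0, 0, 0, 0.5, 0]
x_z, y_z = affine_grid(zoom, H=3, W=4)
print(x_z.min(), x_z.max())                          # -0.5 0.5
```

With the zoom parameters, the grid only covers the central half of the input in each direction, so sampling on it yields a zoomed-in output.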

But wait a minute, what if those indices are fractional? Bingo! That’s why we learned about bilinear interpolation and this is exactly what we do next.

Differentiable Image Sampling

Since bilinear interpolation is differentiable, it is perfectly suitable for the task at hand. Armed with the input feature map and our parametrised sampling grid, we proceed with bilinear sampling and obtain our output feature map V of shape (H’, W’, C’). Note that this implies that we can perform downsampling and upsampling by specifying the shape of our sampling grid. (take that pooling!) We definitely aren’t restricted to bilinear sampling, and there are other sampling kernels we can use, but the important takeaway is that it must be differentiable to allow the loss gradients to flow all the way back to our localisation network.
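For reference, here is a small numpy sketch of bilinear sampling on a single-channel image. It assumes the sampling coordinates are already expressed in pixel units (a normalized grid would first have to be rescaled) and it clamps out-of-range coordinates to the border, which is just one possible convention.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Sample the 2D array img at fractional pixel coordinates (x, y)."""
    H, W = img.shape
    x = np.clip(x, 0, W - 1)               # clamp to the image (assumed convention)
    y = np.clip(y, 0, H - 1)
    x0 = np.floor(x).astype(int)           # top-left neighbour
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)         # bottom-right neighbour, kept in bounds
    y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0                # fractional offsets act as weights
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bottom = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bottom

img = np.array([[0.0, 10.0],
                [20.0, 30.0]])
center = bilinear_sample(img, np.array([0.5]), np.array([0.5]))
print(center)   # [15.], the average of the four neighbours
```

The output is a weighted sum of pixel values whose weights vary piecewise linearly with the coordinates, which is what lets gradients flow back to the grid and on to the localisation network.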

(Image Courtesy) Two examples of applying the parameterised sampling grid to an image U producing the output V. (a) Identity transform (i.e. U = V) (b) Affine Transformation (i.e. rotation)

The above illustrates the inner workings of the Spatial Transformer. Basically it boils down to 2 crucial concepts we’ve been talking about all week: an affine transformation followed by bilinear interpolation. Take a moment and admire the elegance of such a mechanism! We’re letting our network learn the optimal affine transformation parameters that will help it ultimately succeed in the classification task all on its own.

Fun with Spatial Transformers

As a final note, I’ll provide 2 examples that illustrate the power of Spatial Transformers. I’ve attached the references for each example at the bottom of the post, so make sure to look those up if they pique your interest.

Distorted MNIST

Here is the result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification.

(Image Courtesy)

Notice how it has learned to do exactly what we wanted our theoretical “robust” image classification model to do: by zooming in and eliminating background clutter, it has “standardized” the input to facilitate classification. If you want to view a live animation of the transformer in action, click here.

German Traffic Sign Recognition Benchmark (GTSRB) dataset

(Image Courtesy) Left: Behavior of the Spatial Transformer during training. Notice how it learns to focus on the traffic sign, gradually removing background. Right: Output for different input images. Note how it stays approximately constant regardless of the input variability and distortion. Pretty neat!

Summary

In today’s blog post, we went over Google Deepmind’s Spatial Transformer Network paper. We started by introducing the different challenges classification models face, mainly how distortions in the input images can cause our classifiers to fail. One remedy is to use pooling layers; however they possess a few glaring limitations that have made them fall into disuse. The other remedy, and the subject of this blog post, is to use Spatial Transformer Networks.

An STN is a differentiable module that can be inserted anywhere in a ConvNet architecture to increase its geometric invariance. It effectively endows our networks with the ability to spatially transform feature maps at no extra data or supervision cost. Finally, we saw how the whole mechanism boils down to 2 familiar operations: an affine transformation and bilinear interpolation.

In next week’s blog post we’ll be using what we’ve learned so far to aid us in coding this paper from scratch in Tensorflow. In the meantime, if you have any questions, feel free to post them in the comment section below.

Cheers and see you next week!

References

The original Deepmind paper - click here
Kudos to the Torch blog post on STNs which really helped me during the learning process - click here
Torch Implementation also helped me grasp the inner workings of STNs - check out this repo
Stanford’s CS231n as always - click here

Deep Learning Paper Implementations: Spatial Transformer Networks - Part I

Tue, 10 Jan 2017 00:00:00 +0000

Image Courtesy

The first three blog posts in my “Deep Learning Paper Implementations” series will cover Spatial Transformer Networks introduced by Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu of Google Deepmind in 2016. The Spatial Transformer Network is a learnable module aimed at increasing the spatial invariance of Convolutional Neural Networks in a computationally and parameter efficient manner.

In this first installment, we’ll be introducing two very important concepts that will prove crucial in understanding the inner workings of the Spatial Transformer layer. We’ll first start by examining a subset of image transformation techniques that fall under the umbrella of affine transformations, and then dive into a procedure that commonly follows these transformations: bilinear interpolation.

In the second installment, we’ll be going over the Spatial Transformer Layer in detail and summarizing the paper, and then in the third and final part, we’ll be coding it from scratch in Tensorflow and applying it to the GTSRB dataset (German Traffic Sign Recognition Benchmark).

For the full code that appears on this page, visit my Github Repository.

Image Transformations
- Scale
- Rotate
- Shear
- Translate
Bilinear Interpolation
Results
Conclusion
References

Image Transformations

To lay the groundwork for affine transformations, we first need to talk about linear transformations. To that end, we’ll be restricting ourselves to 2 dimensions and work with matrices.

We define the following:

a point K with coordinates $\begin{bmatrix} x \\ y \end{bmatrix}$ represented as a $(2\times1)$ column vector.
a matrix $M= \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ represented as a square matrix of shape $(2\times2)$ .

and would like to examine the linear transformation $T$ defined by the matrix product $K' = T(K) = MK$ as we vary the parameters a, b, c and d of M.

Warm-Up Question.

Say we set $a = d = 1$ and $b = c = 0$ as follows:

$M = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$

In that case, what transform do you think we would obtain? Go ahead and give it a few moments’ thought…

Solution.

Let’s write it out:

$K' = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} = K$

We’ve actually represented the identity transform, meaning that the point K does not move in the plane. Let us now jump to more interesting transforms.

Scaling.

Image Courtesy

We let $b = c = 0$ , and set $a = p$ and $d = q$ where $p$ and $q$ can take on any positive value.

$M = \begin{bmatrix} p & 0 \\ 0 & q \end{bmatrix}$

Note that there is a special case of scaling called isotropic scaling in which the scaling factor for both the x and y direction is the same, say $s$ . In that case, enlarging an image would correspond to $s > 1$ while shrinking would correspond to $s < 1$ . It’s a bit non-intuitive then that when the matrix is applied to a sampling grid rather than to the image’s pixels directly, zooming in on an image requires $s < 1$ (think about it).

Anyway, performing the matrix product, we obtain

$K' = \begin{bmatrix} p & 0 \\ 0 & q \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} px \\ qy \end{bmatrix}$

Rotation.

Image Courtesy

Suppose we want to rotate by an angle $\theta$ about the origin. To do so, we set $a = d = \cos{\theta}$ , $b = -\sin{\theta}$ and $c = \sin{\theta}$ as follows:

$M = \begin{bmatrix} \cos{\theta} & -\sin{\theta} \\ \sin{\theta} & \cos{\theta} \end{bmatrix}$

We thus obtain

$K' = \begin{bmatrix} \cos{\theta} & -\sin{\theta} \\ \sin{\theta} & \cos{\theta} \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x\cos{\theta}- y\sin{\theta} \\ x\sin{\theta} + y\cos{\theta} \end{bmatrix}$

Shear.

Image Courtesy

When we shear an image, we offset the y direction by a distance proportional to x, and the x direction by a distance proportional to y. For example, when we go from normal text to italics, we are effectively applying a shear transform (think about shearing a deck of cards if that helps).

To achieve shearing, we set $a = d = 1$ , $b = m$ and $c = n$ as follows:

$M = \begin{bmatrix} 1 & m \\ n & 1 \end{bmatrix}$

This yields

$K' = \begin{bmatrix} 1 & m \\ n & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} x + my \\ y + nx \end{bmatrix}$

In summary, we have defined 3 basic linear transformations:

scaling: scales the x and y direction by a scalar.
shearing: offsets x by an amount proportional to y and y by an amount proportional to x.
rotating: rotates the points around the origin by an angle $\theta$ .

Now the nice thing about matrices is that we can collapse sequential linear transformations into a single transformation matrix. For example, say we would like to apply a shear, a scale and then a rotation to our column vector K. Given that these transformations can be represented by the matrices $H$ , $S$ and $R$ , and respecting the order of transformations, we can write down this operation as

$K' = R \big[ S \big( HK \big) \big]$

But recall that matrix multiplication is associative! So this reduces to

$\boxed{K' = MK}$

where $M = RSH$ . Be mindful of the order since matrix multiplication $\color{red}{\text{is not}}$ commutative.

A beautiful consequence of this formula is that if we are given multiple transformations to do for a very high-dimensional vector, then we can basically carry out a single matrix multiplication rather than repeatedly manipulating the high-dimensional vector for every sequential transformation.
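Here’s a quick numpy check of both claims, with arbitrary example values for the shear, scale and rotation. Applying the three transforms one after the other gives the same point as applying the single collapsed matrix $M = RSH$ , while reversing the order of the product gives a different matrix.

```python
import numpy as np

theta = np.pi / 2                                   # rotate by 90 degrees
H = np.array([[1.0, 0.5],
              [0.0, 1.0]])                          # shear with m = 0.5, n = 0
S = np.array([[2.0, 0.0],
              [0.0, 3.0]])                          # scale with p = 2, q = 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

K = np.array([1.0, 1.0])

step_by_step = R @ (S @ (H @ K))    # shear, then scale, then rotate
M = R @ S @ H                       # collapse into a single matrix
print(np.allclose(step_by_step, M @ K))    # True: associativity
print(np.allclose(R @ S @ H, H @ S @ R))   # False: order matters
```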

Translation.

The only downside to this $2 \times 2$ matrix representation is that we cannot represent translation, since it isn’t a linear transformation. Translation, however, is a very important transformation, so we would like to be able to encapsulate it in our matrix representation.

To solve this dilemma, we represent our 2D vectors in 3D using homogeneous coordinates as follows:

our point K becomes a $(3\times1)$ column vector $\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$
our matrix M becomes a $(3\times3)$ square matrix $M= \begin{bmatrix} a & b & 0 \\ c & d & 0 \\ 0 & 0 & 1 \end{bmatrix}$

To represent a translation, all we have to do is place 2 new parameters $e$ and $f$ in our third column like so

$M= \begin{bmatrix} a & b & e \\ c & d & f \\ 0 & 0 & 1 \end{bmatrix}$

and we can thus carry out translations as linear transformations in homogeneous coordinates. Note that if we require a 2D output, then all we need to do is represent M as a $2 \times 3$ matrix and leave K untouched.

Example.

Translate both the x and y direction by $\Delta$ . Result should be 2D.

$K' = \begin{bmatrix} 1 & 0 & \Delta \\ 0 & 1 & \Delta \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} x + \Delta \\ y + \Delta \end{bmatrix}$
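
The same example can be sketched in NumPy. The value of `delta` and the sample points are made up for illustration; the only thing taken from the text is the $2 \times 3$ form of the matrix and the extra row of ones.

```python
import numpy as np

delta = 0.25  # arbitrary translation amount

# 2x3 affine matrix: identity in the linear part, translation in the last column
M = np.array([[1.0, 0.0, delta],
              [0.0, 1.0, delta]])

# three points stored as columns in homogeneous coordinates (x, y, 1)
K = np.array([[0.0, 1.0, -2.0],
              [0.0, 3.0,  0.5],
              [1.0, 1.0,  1.0]])

# one translated 2D point per column, shape (2, 3)
K_prime = M @ K
```

Because `M` has only two rows, the output drops the homogeneous coordinate and comes back as plain 2D points.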

Summary.

Image Courtesy

By using a little trick, we were able to add a new transformation to our repertoire of linear transformations. This transformation, called translation, is an affine transformation. Hence, we can generalize our results and represent our 4 affine transformations (all linear transformations are affine) by the 6 parameter matrix

$M= \begin{bmatrix} a & b & c \\ d & e & f \end{bmatrix}$

Bilinear Interpolation

Motivation. When an image undergoes an affine transformation such as a rotation or scaling, the pixels in the image get moved around. This can be especially problematic when a pixel location in the output does not map directly to one in the input image.

In the illustration below, you can clearly see that the rotation places some points at locations that are not centered in the squares. This means that they would not have a corresponding pixel value in the original image.

Image Courtesy

So for example, suppose that after rotating an image, we need to find the pixel value at the location (6.7, 3.2). The problem with this is that there is no such thing as fractional pixel locations.

To solve this problem, bilinear interpolation uses the 4 nearest pixel values which are located in diagonal directions from a given location in order to find the appropriate color intensity values of that pixel. The result is smoother and more realistic images!

Algorithm.

Image Courtesy

Our goal is to find the pixel value of the point P. To do so, we calculate the pixel value of $R_1$ and $R_2$ using a weighted average of $(Q_{11}, Q_{21})$ and $(Q_{12}, Q_{22})$ respectively. Then, we use a weighted average of $R_2$ and $R_1$ to find the value of P.

Effectively, we are interpolating in the x direction and then the y direction, hence the name bilinear interpolation. You could just as well flip the order of interpolation and get the exact same value.

So given a point $P = (x, y)$ and 4 corner coordinates $Q_{11} = (x_1, y_1)$ , $Q_{21} = (x_2, y_1)$ , $Q_{12} = (x_1, y_2)$ and $Q_{22} = (x_2, y_2)$ , we first interpolate in the x-direction:

$R_1 = \frac{x_2 - x}{x_2 - x_1}Q_{11} + \frac{x - x_1}{x_2 - x_1}Q_{21}$

$R_2 = \frac{x_2 - x}{x_2 - x_1}Q_{12} + \frac{x - x_1}{x_2 - x_1}Q_{22}$

and finally in the y-direction:

$\boxed{P = \frac{y_2 - y}{y_2 - y_1}R_1 + \frac{y - y_1}{y_2 - y_1}R_2}$
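
Before moving to the batched version, here is the formula written out for a single point in plain Python. This is only a sketch: the function name `bilinear` and the corner intensities are hypothetical, and the location (6.7, 3.2) is the fractional pixel from the motivation above.

```python
def bilinear(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """Interpolate the value at (x, y) from the four corner values."""
    # interpolate along x on the row y = y1 and on the row y = y2
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # interpolate along y between the two intermediate values
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2


# made-up corner intensities around the location (6.7, 3.2)
p = bilinear(6.7, 3.2, x1=6, y1=3, x2=7, y2=4,
             q11=10.0, q21=20.0, q12=30.0, q22=40.0)

# evaluating exactly at a corner should return that corner's value
corner = bilinear(6, 3, x1=6, y1=3, x2=7, y2=4,
                  q11=10.0, q21=20.0, q12=30.0, q22=40.0)
```

With these numbers, $R_1 = 17$ and $R_2 = 37$, so `p` comes out at roughly 21.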

Python Code.

One very very important note before we jump into the code!

An image processing affine transformation usually follows the 3-step pipeline below:

First, we create a sampling grid composed of $(x, y)$ coordinates. For example, given a 400x400 grayscale image, we create a meshgrid of same dimension, that is, evenly spaced $x \in [0, W]$ and $y \in [0, H]$ .
We then apply the transformation matrix to the sampling grid generated in the step above.
Finally, we sample the resulting grid from the original image using the desired interpolation technique.

As you can see, this is different than directly applying a transform to the original image.

I’ve attached 2 cat images in the Github Repository mentioned at the top of this page which you should go ahead and download. Save them to your Desktop in a folder called data/ or make sure to update the path location if you choose differently.

I’ve also written a function load_img() that converts images to numpy arrays. I won’t go into its details but it’s pretty basic and you shouldn’t take long to understand what it does. Note that you’ll need both PIL and Numpy to reproduce the results below.

Armed with this function, let’s load both cat images and concatenate them into a single input array. We’re working with 2 images because we want to make our code as general as possible.

import numpy as np
from PIL import Image

# params
DIMS = (400, 400)
CAT1 = 'cat1.jpg'
CAT2 = 'cat2.jpg'

# load both cat images
img1 = load_img(CAT1, DIMS)
img2 = load_img(CAT2, DIMS, view=True)

# concat into tensor of shape (2, 400, 400, 3)
input_img = np.concatenate([img1, img2], axis=0)

# dimension sanity check
print("Input Img Shape: {}".format(input_img.shape))

Given that we have 2 images, our batch size is equal to 2. This means that we need an equal amount of transformation matrices M for each image in the batch.

Let’s go ahead and initialize 2 identity transform matrices. This is the simplest case, and if we implement our bilinear sampler correctly, we should expect our output image to be almost identical to the input image.

# grab shape
num_batch, H, W, C = input_img.shape

# initialize M to identity transform
M = np.array([[1., 0., 0.], [0., 1., 0.]])

# repeat num_batch times
M = np.resize(M, (num_batch, 2, 3))

(Recall that our general affine transformation matrix is $2 \times 3$ if we want to include translation.)
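
A quick note on why `np.resize` works as a "repeat" here: when the requested shape is larger than the array, it fills the new array with repeated copies of the original data. The sketch below checks this against `np.tile`, which is a more explicit alternative that I am suggesting, not something the original code uses.

```python
import numpy as np

num_batch = 2
M = np.array([[1., 0., 0.], [0., 1., 0.]])

# np.resize repeats the data of M until the target shape is filled
M_resized = np.resize(M, (num_batch, 2, 3))

# the same result, spelled out with np.tile
M_tiled = np.tile(M, (num_batch, 1, 1))
```

Either way, every entry along the first axis is a copy of the same $2 \times 3$ matrix.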

Now we need to write a function that will generate a meshgrid for us and output a sampling grid resulting from the product of this meshgrid and our transformation matrix M.

Let’s go ahead and generate our meshgrid. We’ll create a normalized one, that is the values of x and y range from -1 to 1 and there are width and height of them respectively. In fact, note that for images, x corresponds to the width of the image (i.e. number of columns of the matrix) while y corresponds to the height of the image (i.e. number of rows of the matrix).

# create normalized 2D grid
x = np.linspace(-1, 1, W)
y = np.linspace(-1, 1, H)
x_t, y_t = np.meshgrid(x, y)

Then we need to augment the dimensions to create homogeneous coordinates.

# reshape to (xt, yt, 1)
ones = np.ones(np.prod(x_t.shape))
sampling_grid = np.vstack([x_t.flatten(), y_t.flatten(), ones])

So we’ve created 1 grid here, but we need num_batch grids. Same as above, our one-liner below repeats our array num_batch times.

# repeat grid num_batch times
sampling_grid = np.resize(sampling_grid, (num_batch, 3, H*W))

Now we perform step 2 of our image transformation pipeline.

# transform the sampling grid i.e. batch multiply
batch_grids = np.matmul(M, sampling_grid)
# batch grid has shape (num_batch, 2, H*W)

# reshape to (num_batch, height, width, 2)
batch_grids = batch_grids.reshape(num_batch, 2, H, W)
batch_grids = np.moveaxis(batch_grids, 1, -1)

Finally, let’s write our bilinear sampler. Given our coordinates x and y in the sampling grid, we want to interpolate the pixel value in the original image.

Let’s start by separating the x and y dimensions and rescaling them to belong in the height/width interval.

x_s = batch_grids[:, :, :, 0:1].squeeze()
y_s = batch_grids[:, :, :, 1:2].squeeze()

# rescale x to [0, W] and y to [0, H]
x = ((x_s + 1.) * W) * 0.5
y = ((y_s + 1.) * H) * 0.5

Now for each coordinate $(x_i, y_i)$ we want to grab 4 corner coordinates.

# grab 4 nearest corner points for each (x_i, y_i)
x0 = np.floor(x).astype(np.int64)
x1 = x0 + 1
y0 = np.floor(y).astype(np.int64)
y1 = y0 + 1

(Note that for non-integer coordinates we could just as well use the ceiling function rather than the increment by 1; the two only differ when a coordinate is already an integer.)

Now we must make sure that no value goes beyond the image boundaries. For example, suppose we have $x = 399$ , then $x_0 = 399$ and $x_1 = x_0 + 1 = 400$ , which would result in a numpy indexing error. Thus we clip our corner coordinates in the following way:

# make sure it's inside img range [0, H-1] or [0, W-1]
x0 = np.clip(x0, 0, W-1)
x1 = np.clip(x1, 0, W-1)
y0 = np.clip(y0, 0, H-1)
y1 = np.clip(y1, 0, H-1)

Now we use advanced numpy indexing to grab the pixel value for each corner coordinate. These correspond to (x0, y0), (x0, y1), (x1, y0) and (x1, y1).

# look up pixel values at corner coords
Ia = input_img[np.arange(num_batch)[:,None,None], y0, x0]
Ib = input_img[np.arange(num_batch)[:,None,None], y1, x0]
Ic = input_img[np.arange(num_batch)[:,None,None], y0, x1]
Id = input_img[np.arange(num_batch)[:,None,None], y1, x1]
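
If the indexing above looks opaque, here is the same pattern on a toy batch. The array `imgs` and the index values are made up for illustration; only the indexing expression itself mirrors the code above.

```python
import numpy as np

# toy batch: 2 images, 3x4 pixels, 1 channel, with distinct pixel values
num_batch, H, W, C = 2, 3, 4, 1
imgs = np.arange(num_batch * H * W * C).reshape(num_batch, H, W, C)

# one (row, column) index per output pixel, for every image in the batch
y0 = np.zeros((num_batch, H, W), dtype=np.int64)   # always sample row 0
x0 = np.tile(np.arange(W), (num_batch, H, 1))      # column index of each pixel

# batch index of shape (num_batch, 1, 1) broadcasts against (num_batch, H, W)
batch_idx = np.arange(num_batch)[:, None, None]
gathered = imgs[batch_idx, y0, x0]                 # shape (num_batch, H, W, C)
```

The three index arrays broadcast to a common shape of (num_batch, H, W), and the channel axis is carried along untouched, so every row of `gathered[b]` is a copy of row 0 of image `b`.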

Almost there! Now, we calculate the weight coefficients,

# calculate deltas
wa = (x1-x) * (y1-y)
wb = (x1-x) * (y-y0)
wc = (x-x0) * (y1-y)
wd = (x-x0) * (y-y0)

and finally, multiply and add according to the formula mentioned previously.

# add dimension for addition
wa = np.expand_dims(wa, axis=3)
wb = np.expand_dims(wb, axis=3)
wc = np.expand_dims(wc, axis=3)
wd = np.expand_dims(wd, axis=3)

# compute output
out = wa*Ia + wb*Ib + wc*Ic + wd*Id
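
To connect these weights back to the boxed formula: the corner coordinates are one pixel apart, so the denominators $x_2 - x_1$ and $y_2 - y_1$ are both 1 and drop out. The sketch below checks this for a single interior location (the fractional pixel from earlier); it assumes the corners were not clipped.

```python
import numpy as np

# a single interior sample location and its unclipped corner coordinates
x, y = 6.7, 3.2
x0, y0 = int(np.floor(x)), int(np.floor(y))
x1, y1 = x0 + 1, y0 + 1

# same weights as above; denominators are 1 because corners are one pixel apart
wa = (x1 - x) * (y1 - y)   # weight of the pixel at (x0, y0)
wb = (x1 - x) * (y - y0)   # weight of the pixel at (x0, y1)
wc = (x - x0) * (y1 - y)   # weight of the pixel at (x1, y0)
wd = (x - x0) * (y - y0)   # weight of the pixel at (x1, y1)

total = wa + wb + wc + wd
```

For an unclipped location the four weights sum to 1, so the output is a proper weighted average of the four corner pixels.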

Results

So now that we’ve gone through the whole code incrementally, let’s have some fun and experiment with different values of the transformation matrix M.

The first thing you need to do is copy and paste the whole code which has been made more modular. Now let’s test if our function works correctly.

Identity Transform.

Add the following lines at the end of the script and execute.

import matplotlib.pyplot as plt

plt.imshow(out[1])
plt.show()

Translation.

Say we want to translate the picture by 0.5 only in the x direction. This should shift the image to the left.

Edit the following line of your code as follows:

M = np.array([[1., 0., 0.5], [0., 1., 0.]])

Rotation.

Finally, say we want to rotate the picture by 45 degrees. Given that $\cos{(45^{\circ})} = \sin{(45^{\circ})} = \frac{\sqrt{2}}{2} \approx 0.707$ , edit just this line of your code as follows:

M = np.array([[0.707, -0.707, 0.], [0.707, 0.707, 0.]])
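
If you want to try other angles without working out the sines and cosines by hand, a small helper does the job. `rotation_matrix` is a hypothetical name of mine and is not part of the original script.

```python
import numpy as np


def rotation_matrix(degrees):
    """2x3 affine matrix for a rotation about the origin, with no translation."""
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0]])


M = rotation_matrix(45)
```

For 45 degrees this reproduces the hard-coded matrix above up to rounding.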

Conclusion

In this blog post, we went over basic linear transformations such as rotation, shear and scale before generalizing to affine transformations which included translations. Then, we saw the importance of bilinear interpolation in the context of these transformations. Finally, we went over the algorithm, coded it from scratch in Python and wrote 2 methods that helped us visualize these transformations according to a 3 step image processing pipeline.

In the next installment of this series, we’ll go over the Spatial Transformer Network layer in detail as well as summarize the paper it is described in.

See you next week!

References

A big thank you to Eder Santana for introducing me to this paper!

Nuts and Bolts of Applying Deep Learning

Mon, 26 Sep 2016 00:00:00 +0000

Image Courtesy

This weekend was very hectic (catching up on courses and studying for a statistics quiz), but I managed to squeeze in some time to watch the Bay Area Deep Learning School livestream on YouTube. For those of you wondering what that is, BADLS is a 2-day conference hosted at Stanford University, consisting of back-to-back presentations on a variety of topics ranging from NLP and Computer Vision to Unsupervised Learning and Reinforcement Learning. Additionally, top DL software libraries such as Torch, Theano and Tensorflow were presented.

There were some super interesting talks from leading experts in the field: Hugo Larochelle from Twitter, Andrej Karpathy from OpenAI, Yoshua Bengio from the Université de Montréal, and Andrew Ng from Baidu to name a few. Of the plethora of presentations, there was one somewhat non-technical one given by Andrew that really piqued my interest.

In this blog post, I’m gonna try and give an overview of the main ideas outlined in his talk. The goal is to pause a bit and examine the ongoing trends in Deep Learning thus far, as well as gain some insight into applying DL in practice.

By the way, if you missed out on the livestreams, you can still view them at the following: Day 1 and Day 2.

Table of Contents:

Major Deep Learning Trends
End-to-End Deep Learning
Bias-Variance Tradeoff
Human-level Performance
Personal Advice

Major Deep Learning Trends

Why do DL algorithms work so well? According to Ng, with the rise of the Internet, Mobile and IoT era, the amount of data accessible to us has greatly increased. This correlates directly with a boost in the performance of neural network models, especially the larger ones which have the capacity to absorb all this data.

However, in the small data regime (left-hand side of the x-axis), the relative ordering of the algorithms is not that well defined and really depends on who is more motivated to engineer their features better, or refine and tune the hyperparameters of their model.

Thus this trend is more prevalent in the big data realm where hand engineering effectively gets replaced by end-to-end approaches and bigger neural nets combined with a lot of data tend to outperform all other models.

Machine Learning and HPC team. The rise of big data and the need for larger models has started to put pressure on companies to hire a Computer Systems team. This is because some of the HPC (high-performance computing) applications require highly specialized knowledge and it is difficult to find researchers and engineers with sufficient knowledge in both fields. Thus, cooperation from both teams is the key to boosting performance in AI companies.

Categorizing DL models. Work in DL can be categorized in the following 4 buckets:

Most of the value in the industry today is driven by the models in the orange blob (innovation and monetization mostly) but Andrew believes that unsupervised deep learning is a super-exciting field that has loads of potential for the future.

The rise of End-to-End DL

A major improvement in the end-to-end approach has been the fact that outputs are becoming more and more complicated. For example, rather than just outputting a simple class score such as 0 or 1, algorithms are starting to generate richer outputs: images in the case of GANs, full captions with RNNs and, most recently, audio as in DeepMind’s WaveNet.

So what exactly does end-to-end training mean? Essentially, it means that AI practitioners are shying away from intermediate representations and going directly from one end (raw input) to the other end (output). Here’s an example from speech recognition.

Are there any disadvantages to this approach? End-to-end approaches are data hungry, meaning they only perform well when provided with a huge dataset of labelled examples. In practice, not all applications have the luxury of large labelled datasets, so other approaches which allow hand-engineered information and field expertise to be added into the model have gained the upper hand. As an example, in a self-driving car setting, going directly from the raw image to the steering direction is pretty difficult. Rather, many features such as trajectory and pedestrian location are calculated first as intermediate steps.

The main take-away from this section is that we should always be cautious of end-to-end approaches in applications where huge data is hard to come by.

Bias-Variance Tradeoff

Splitting your data. In most deep learning problems, train and test come from different distributions. For example, suppose you are working on implementing an AI powered rearview mirror and have gathered 2 chunks of data: the first, larger chunk comes from many places (could be partly bought, and partly crowdsourced) and the second, much smaller chunk is actual car data.

In this case, splitting the data into train/dev/test can be tricky. One might be tempted to carve the dev set out of the training chunk like in the first example of the diagram below. (Note that the chunk on the left corresponds to data mined from the first distribution and the one on the right to the one from the second distribution.)

This is bad because we usually want our dev and test sets to come from the same distribution. Part of the team will be spending a lot of time tuning the model to work well on the dev set, so if the test set were to turn out very different from the dev set, pretty much all of that work would have been wasted effort.

Hence, a smarter way of splitting the above dataset would be just like the second line of the diagram. Now in practice, Andrew recommends creating dev sets from both data distributions: a train-dev and test-dev set. In this manner, any gap between the different errors can help you tackle the problem more clearly.
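
Here is one way such a split could be sketched with index arrays. Everything specific in it is my assumption rather than something stated in the talk: the function name `split_indices`, the 5% train-dev fraction, and the 50/50 division of the small chunk into test-dev and test.

```python
import numpy as np


def split_indices(n_large, n_target, train_dev_frac=0.05, seed=0):
    """Split two data chunks into train, train-dev, test-dev and test indices.

    Indices are relative to their own chunk: 'train' and 'train_dev' index the
    large chunk, 'test_dev' and 'test' index the small target chunk.
    """
    rng = np.random.default_rng(seed)
    large = rng.permutation(n_large)     # shuffled indices into the large chunk
    target = rng.permutation(n_target)   # shuffled indices into the small chunk

    n_train_dev = int(n_large * train_dev_frac)
    n_test_dev = n_target // 2

    return {
        "train": large[n_train_dev:],
        "train_dev": large[:n_train_dev],
        "test_dev": target[:n_test_dev],
        "test": target[n_test_dev:],
    }


splits = split_indices(n_large=10000, n_target=1000)
```

The point of the layout is that train-dev shares a distribution with train, while test-dev shares one with test, so the gap between train-dev and test-dev error isolates the effect of the distribution mismatch.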

Flowchart for working with a model. Given what we have described above, here’s a simplified flowchart of the actions you should take when confronted with training/tuning a DL model.

The importance of data synthesis. Andrew also stressed the importance of data synthesis as part of any workflow in deep learning. While it may be painful to manually engineer training examples, the relative gain in performance you obtain once the parameters and the model fit well is huge and worth your while.

Human-level Performance

One of the very important concepts underlined in this lecture was that of human-level performance. In the basic setting, DL models tend to plateau once they have reached or surpassed human-level accuracy. While it is important to note that human-level performance doesn’t necessarily coincide with the golden Bayes error rate, it can serve as a very reliable proxy which can be leveraged to determine your next move when training your model.

Reasons for the plateau. There could be a theoretical limit on the dataset which makes further improvement futile (i.e. a noisy subset of the data). Humans are also very good at these tasks so trying to make progress beyond that suffers from diminishing returns.

Here’s an example that can help illustrate the usefulness of human-level accuracy. Suppose you are working on an image recognition task and measure the following:

Train error: 8%
Dev error: 10%

If I were to tell you that human error for such a task is on the order of 1%, then this would be a blatant bias problem and you could subsequently try increasing the size of your model, training longer, etc. However, if I told you that human-level error was on the order of 7.5%, then this would be more of a variance problem and you’d focus your efforts on methods such as data synthesis or gathering data more similar to the test set.
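
The reasoning in this example boils down to comparing two gaps, which can be sketched in a few lines. The function name `diagnose`, the variable names and the "larger gap wins" rule are a simplification of mine, not a procedure given in the talk.

```python
def diagnose(human_error, train_error, dev_error):
    """Compare the two error gaps and report which one dominates."""
    bias_gap = train_error - human_error      # gap to the human-level proxy
    variance_gap = dev_error - train_error    # gap between train and dev
    return "bias" if bias_gap > variance_gap else "variance"


# the two scenarios from the example above, errors in percent
first = diagnose(human_error=1.0, train_error=8.0, dev_error=10.0)
second = diagnose(human_error=7.5, train_error=8.0, dev_error=10.0)
```

In the first scenario the gap to human-level error is 7 points against a 2 point train/dev gap, so bias dominates; in the second it is 0.5 against 2, so variance does.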

By the way, there’s always room for improvement. Even if you are close to human-level accuracy overall, there could be subsets of the data where you perform poorly and working on those can boost production performance greatly.

Finally, one might ask what is a good way of defining human-level accuracy. For example, in the following image diagnosis setting, ignoring the cost of obtaining data, how should one pick the criteria for human-level accuracy?

typical human: 5%
general doctor: 1%
specialized doctor: 0.8%
group of specialized doctors: 0.5%

The answer is always the best accuracy possible (here, the 0.5% error of the group of specialized doctors). This is because, as we mentioned earlier, human-level performance is a proxy for the Bayes optimal error rate, so providing a more accurate upper bound to your performance can help you strategize your next move.

Personal Advice

Andrew ended the presentation with 2 ways one can improve their skills in the field of deep learning.

Practice, Practice, Practice: compete in Kaggle competitions and read associated blog posts and forum discussions.
Do the Dirty Work: read a lot of papers and try to replicate the results. Soon enough, you’ll get your own ideas and build your own models.