Kevin Zakka's BlogAcademic Journal
http://kevinzakka.github.io/
Thu, 05 Oct 2017 17:54:19 +0000Thu, 05 Oct 2017 17:54:19 +0000Jekyll v3.5.2Getting Up and Running with PyTorch on Amazon Cloud<p align="center">
<img src="/assets/aws/splash.png" alt="Drawing" width="60%" />
</p>
<p>This is a succinct tutorial aimed at helping you set up an AWS GPU instance so that you can train and test your PyTorch models in the cloud. If, like me, you don’t own a GPU, this can be a great way of drastically reducing the training time of your models: while your instance is furiously crunching numbers in some faraway Amazon server, you can peacefully experiment with and prototype new architectures from the comfort of a Starbucks couch.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/aws/cpu-meter.png" width="30%" style="border:none;" />
<div class="thecap" style="text-align:center">I mean we all love a silent macbook, right?</div>
</p>
</div>
<p>The cool part is that if you’re a high school or college student, you can sign up for the GitHub Student Developer Pack, which will get you $150 worth of free AWS credits. That’s around 167 hours or 7 days of compute time<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, an amply sufficient amount for those fun weekend side projects and experiments. As usual, any code or script that appears on this page can be downloaded from my <a href="https://github.com/kevinzakka/blog-code/tree/master/aws-pytorch">Blog Repository</a>. And on that note, let’s get started!</p>
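<p>For the curious, here is how the compute-time estimate falls out of the numbers. This is only a back-of-the-envelope sketch: it assumes the $150 credit and the roughly $0.90/hr on-demand price quoted in the footnote, and the variable names are mine.</p>

```python
# Rough estimate of how long the free credits last on a p2.xlarge.
credits_usd = 150.0      # student credits mentioned above
price_per_hour = 0.90    # approximate on-demand price from the footnote

hours = credits_usd / price_per_hour
days = hours / 24

print(f"{hours:.0f} hours, about {days:.1f} days")
```

<p>If the hourly price changes, or you move to a different instance type, only <code class="highlighter-rouge">price_per_hour</code> needs updating.</p>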
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Configuring Your EC2 Instance</a></li>
<li><a href="#toc2">Launching & Managing Your EC2 Instance</a></li>
<li><a href="#toc3">SSH Persistence With TMUX</a></li>
<li><a href="#toc4">Conclusion</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="configuring-your-ec2-instance">Configuring Your EC2 Instance</h2>
<p>I’m assuming you’ve already created an AWS account but if you haven’t, the whole process shouldn’t take you more than 2 minutes. Note that it will require you to enter your credit card information which is necessary to charge you <em>if and when</em> you exceed your free credits. Now’s also a great time to claim your <a href="https://education.github.com/pack">GitHub Student Developer Pack</a> credits so go ahead and do that.</p>
<p><strong>Pick your Region.</strong> Ok, so the instance type we are going to use is located in <strong>US West (Oregon)</strong> so make sure the region information on the top right of the screen correctly reflects that.</p>
<p align="center">
<img src="/assets/aws/step1.png" alt="Drawing" width="80%" />
</p>
<p><strong>Limit Increase.</strong> The next thing we need to do is request a limit increase for EC2 instances. For some weird reason, Amazon automatically sets the limit to 0 upon account creation so it has to be increased by sending in a support ticket.</p>
<p>Go ahead and click <code class="highlighter-rouge">Support > Support Center</code> at the top right of your screen. This will direct you to a page with a blue <code class="highlighter-rouge">Create Case</code> button that you should click. You’ll be greeted with the following:</p>
<p align="center">
<img src="/assets/aws/step2.png" alt="Drawing" width="80%" />
</p>
<p>We want a Limit Increase for EC2 instances meaning you need to select <code class="highlighter-rouge">Service Limit Increase</code> in <strong>Regarding</strong> and <code class="highlighter-rouge">EC2 Instances</code> in <strong>Limit Type</strong>. Now fill in the <strong>Request 1</strong> box and <strong>Use Case Description</strong> as I’ve done here.</p>
<p align="center">
<img src="/assets/aws/step3.png" alt="Drawing" width="80%" />
</p>
<p>Finally, make sure to select <code class="highlighter-rouge">Web</code> as your <strong>Contact method</strong> and submit the request. Note that the response time varies: I’ve had limit increases resolved in a matter of minutes and sometimes up to a full day, so be patient. Also, feel free to change the <strong>New limit value</strong> to suit your needs. I’ve opted for 2 because the <code class="highlighter-rouge">p2.xlarge</code> instance type we’ll be working with has a single GPU whose memory constraints limit the number of jobs I can run concurrently.</p>
<p><strong>Configure Instance.</strong> Ok, we’re now ready to create and configure our EC2 instance. Back on the home page console (click on the orange cube in the top left), navigate to <code class="highlighter-rouge">EC2</code> in the Compute services section, and then click on the blue <code class="highlighter-rouge">Launch Instance</code> button.</p>
<p align="center">
<img src="/assets/aws/step4.png" alt="Drawing" width="80%" />
</p>
<p>You’ll be greeted with a 7-step process like so.</p>
<p align="center">
<img src="/assets/aws/step5.png" alt="Drawing" width="80%" />
</p>
<p><strong>AMI.</strong> First select the <code class="highlighter-rouge">Ubuntu Server 16.04 LTS (HVM), SSD Volume Type</code> as the AMI of choice.</p>
<p><strong>Instance.</strong> Select <code class="highlighter-rouge">p2.xlarge</code> as your instance type. This is an instance with a single GPU which is what we asked for in our limit increase request.</p>
<p><strong>Spot Instances.</strong> At this point, you should be on the <strong>Configure Instance Details</strong> step. This is where things get interesting. In fact, Amazon gives us the ability to bid on spare Amazon EC2 computing capacity for a much cheaper price than the on-demand one.</p>
<p>Basically, what that means is that if our bid price is higher than the current market price, our instance will be launched and charged at the market price. The only downside is that if that ever flips around and the market price rises above our bid, instances get <span style="color:red">terminated</span> with very little warning<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
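<p>To make the bidding logic concrete, here is a toy simulation. It is a deliberately simplified sketch, not how AWS actually bills: the hourly prices are made up, the function name <code class="highlighter-rouge">simulate_spot</code> is hypothetical, and I assume the instance is lost for good the first time the market price exceeds the bid.</p>

```python
def simulate_spot(bid, hourly_market_prices):
    """Toy model: run hour by hour, paying the market price, until outbid."""
    hours_run = 0
    cost = 0.0
    for price in hourly_market_prices:
        if price > bid:
            break  # market price exceeded our bid: the instance is terminated
        hours_run += 1
        cost += price  # we pay the market price, not our bid
    return hours_run, cost


# Made-up market prices in $/hr, with a spike in the fourth hour.
prices = [0.25, 0.27, 0.30, 0.95, 0.28]
hours_run, cost = simulate_spot(bid=0.50, hourly_market_prices=prices)
print(hours_run, round(cost, 2))
```

<p>With these made-up numbers, the job is cut short by the price spike even though the price drops again right after, which is exactly the risk with long training runs.</p>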
<p><span style="color:blue">TL;DR:</span> Spot instances can be ideal for non-critical experimentation like hyperparameter tuning but stay away from them if you need to train a model for a large number of epochs.</p>
<p>I’ll assume you’re using On-Demand pricing for the remainder of this post, but if you do want to find out more about Spot Instances, feel free to watch this YouTube <a href="https://www.youtube.com/watch?v=_XT6McviY7w">video</a>.</p>
<p><strong>Add Storage.</strong> Next, we’ll be increasing the size of our Root Volume to accommodate large datasets such as ImageNet, which is around 48 GB. Feel free to enter any number above that.</p>
<p align="center">
<img src="/assets/aws/step6.png" alt="Drawing" width="80%" />
</p>
<p>Note that the Root Volume is EBS-backed, meaning it can outlive the instance it is attached to. The default behavior, however, is to delete it when the instance is terminated. Weird, right? Well, not really. With ephemeral storage, the other type of storage AWS offers, there is no persist option, whether the instance is stopped or terminated. Thus, EBS with delete-on-terminate still gives us the ability to keep our data on disk when the instance is stopped!</p>
<p><strong>Configure Security Group.</strong> You can skip the <strong>Add Tags</strong> section and jump to this last step. This part is important because it will allow us to monitor our training with Tensorboard and use Jupyter Notebook. We’ll be adding 4 protocols as shown in the picture below.</p>
<p align="center">
<img src="/assets/aws/step7.png" alt="Drawing" width="80%" />
</p>
<p>Once you click the launch button, a window will pop up and prompt you to create a key-pair. This little file is needed when ssh-ing into your instance, so download it and store it in a secure location you’ll remember. For this tutorial’s sake, I’ll be calling mine <code class="highlighter-rouge">aws-dl.pem</code> and storing it in my Downloads folder.</p>
<p align="center">
<img src="/assets/aws/step8.png" alt="Drawing" width="80%" />
</p>
<p><a name="toc2"></a></p>
<h2 id="launching--managing-your-ec2-instance">Launching & Managing Your EC2 Instance</h2>
<p>We’ve finally arrived at the point where we can ssh into our EC2 instance. To do so, you’ll need to navigate to the <code class="highlighter-rouge">Instances</code> page located in the navigation panel on the left of your screen. You’ll be greeted with the following:</p>
<p align="center">
<img src="/assets/aws/step9.png" alt="Drawing" width="80%" />
</p>
<p>You need to take note of 2 things:</p>
<ul>
<li><strong>Public DNS (IPv4)</strong>: <code class="highlighter-rouge">ec2-52-42-90-161.us-west-2.compute.amazonaws.com</code></li>
<li><strong>IPv4 Public IP</strong>: <code class="highlighter-rouge">52.42.90.161</code></li>
</ul>
<p>Other than that, there are just 2 ways to interact with your instance you need to be aware of: <strong>login</strong> with ssh and <strong>copy</strong> a file to it with scp.</p>
<ul>
<li><code class="highlighter-rouge">ssh -v -i X ubuntu@Y</code> where X represents the path to the key-pair file and Y represents the Public IP of your instance.</li>
<li><code class="highlighter-rouge">scp -i W -r X ubuntu@Y:Z</code> where W is the path to the key-pair file, X is the path to the local file, Y is the Public IP, and Z is the destination path on the instance.</li>
</ul>
<p>It’s important to note that if you’re using the key-pair file for the very first time, you’ll need to restrict its permissions so that only you can read and write it, by running <code class="highlighter-rouge">chmod 600 ~/Downloads/aws-dl.pem</code>. Otherwise, ssh will refuse to use the key.</p>
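<p>If you’re wondering what <code class="highlighter-rouge">600</code> actually means, the small sketch below applies the same mode to a throwaway file and prints the resulting permission string. It assumes a Linux or macOS machine and uses a temporary stand-in file rather than your real key.</p>

```python
import os
import stat
import tempfile

# Stand-in for the key file; the real one would be ~/Downloads/aws-dl.pem.
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as handle:
    key_path = handle.name

os.chmod(key_path, 0o600)  # same effect as `chmod 600` on that path

file_mode = os.stat(key_path).st_mode
mode = stat.S_IMODE(file_mode)          # just the permission bits
permissions = stat.filemode(file_mode)  # human-readable, like `ls -l`
print(oct(mode), permissions)

os.remove(key_path)  # clean up the stand-in file
```

<p>The leading 6 grants read and write to the owner, and the two zeros grant nothing to the group or to anyone else.</p>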
<p>With all that being said, we can finally fire up a terminal and execute the following command:</p>
<p><code class="highlighter-rouge">
ssh -v -i ~/Downloads/aws-dl.pem ubuntu@52.42.90.161
</code></p>
<p align="center">
<img src="/assets/aws/term1.png" alt="Drawing" width="80%" />
</p>
<p>Enter yes, and voila! You should be successfully logged in. The instance is still not ready for use as there are a few more things that need to be done, but fear not. I’ve created a small bash script that you can execute that automates the following:</p>
<ul>
<li>It downloads and installs the required nvidia gpu drivers.</li>
<li>It updates and upgrades the distribution packages.</li>
<li>It installs python3 along with virtualenv.</li>
<li>It creates a virtualenv called <code class="highlighter-rouge">deepL</code> that will house all the required pip packages and PyTorch.</li>
<li>And it finally installs PyTorch v0.2.</li>
</ul>
<p>Go ahead and download <code class="highlighter-rouge">install.sh</code> from my <a href="https://github.com/kevinzakka/blog-code/tree/master/aws-pytorch">repo</a> and save it to your Desktop. We need to copy it to our instance, so apply the command I mentioned above:</p>
<p><code class="highlighter-rouge">
scp -i ~/Downloads/aws-dl.pem -r ~/Desktop/install.sh ubuntu@52.42.90.161:~/.
</code></p>
<p>Next, go back to the terminal window logged into the instance and execute the following 2 commands:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>chmod +x install.sh
./install.sh
</code></pre>
</div>
<p>Once that’s done, you’ll need to reboot your instance. Enter <code class="highlighter-rouge">exit</code> at the command line and navigate to your browser as in the image below. Be patient and wait for a few minutes before you ssh back into the instance!</p>
<p align="center">
<img src="/assets/aws/step10.png" alt="Drawing" width="80%" />
</p>
<p>At this point, we should sanity check our installation by seeing if PyTorch loads correctly.</p>
<ul>
<li>First, activate the virtualenv by executing <code class="highlighter-rouge">source ~/envs/deepL/bin/activate</code>.</li>
<li>Enter <code class="highlighter-rouge">python</code> and inside the interpreter, <code class="highlighter-rouge">import torch</code> then <code class="highlighter-rouge">torch.__version__</code>. Fingers crossed, this should print out <code class="highlighter-rouge">0.2.0_1</code>.</li>
<li>Lastly, check that the GPU is visible by typing <code class="highlighter-rouge">torch.cuda.is_available()</code> which should print out True.</li>
</ul>
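<p>If you’d rather run these checks in one go, something like the following sketch should do. The function name <code class="highlighter-rouge">check_pytorch</code> is my own, and it reports a missing install instead of crashing, so it is safe to run before and after the install script.</p>

```python
import importlib.util


def check_pytorch():
    """Report whether PyTorch is importable, its version, and GPU visibility."""
    report = {"installed": False, "version": None, "cuda": False}
    if importlib.util.find_spec("torch") is None:
        return report  # PyTorch is not installed in this environment

    import torch

    report["installed"] = True
    report["version"] = str(torch.__version__)
    report["cuda"] = bool(torch.cuda.is_available())
    return report


print(check_pytorch())
```

<p>Remember to activate the <code class="highlighter-rouge">deepL</code> virtualenv first, otherwise the script will look at the system Python and report that PyTorch is missing.</p>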
<p><span style="color:red">Once you’ve finished working on your instance, you should stop it immediately to avoid incurring additional charges.</span></p>
<p><a name="toc3"></a></p>
<h2 id="ssh-persistence-with-tmux">SSH Persistence With TMUX</h2>
<p>I would be doing you a great disservice if I didn’t mention this nifty little package called <code class="highlighter-rouge">tmux</code> that you can use when running your instances for long periods of time. <em>What exactly is tmux, and why should you use it</em>?</p>
<p>Well, if you’re sshed into an instance, peacefully running a job, and your connection suddenly drops, your ssh connection will automatically get killed. This means anything running on that instance stops as well (i.e. your model will stop training). Closing your laptop to commute from university to your house, for example, becomes a big no-no.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/aws/term3.png" width="80%" style="border:none;" />
<div class="thecap" style="text-align:center">A TMUX session</div>
</p>
</div>
<p>This is where tmux comes in! Tmux makes it so that anything running within a session persists even if the connection drops or the terminal gets killed. To see it in action, I’d suggest you watch the following <a href="https://www.youtube.com/watch?v=BHhA_ZKjyxo">video</a>.</p>
<p>Thus, your workflow should always be as follows:</p>
<ul>
<li>SSH into your aws instance.</li>
<li>Create a new tmux session called work using the command <code class="highlighter-rouge">tmux new -s work</code>.</li>
<li>Do everything as you would previously.</li>
<li>Detach from the session by pressing <code class="highlighter-rouge">ctrl-b</code> followed by <code class="highlighter-rouge">d</code>.</li>
</ul>
<p>Once you’ve detached yourself from the session, you can work on anything else, even go to sleep… Subsequently, if you need to reattach to that particular tmux session to check your progress, run <code class="highlighter-rouge">tmux a -t work</code>.</p>
<p>That’s pretty much it. For a more complete list of tmux commands, you should refer to this lovely <a href="https://gist.github.com/MohamedAlaa/2961058">cheatsheet</a>.</p>
<p><a name="toc4"></a></p>
<h2 id="conclusion">Conclusion</h2>
<p>In this tutorial, we went over the basic steps needed to create a free, GPU-powered Amazon AWS instance. We explored how to interact with our instance using the <code class="highlighter-rouge">ssh</code> and <code class="highlighter-rouge">scp</code> commands and how a bash script could be leveraged to download and install all the required packages needed to run PyTorch. Finally, we saw how we could make our ssh session persistent using a very important program called tmux.</p>
<p>Until next time!</p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is for a GPU-powered p2.xlarge instance with an on-demand price of around $0.9/hr. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>A terminated instance gets deleted, meaning you lose whatever’s on there permanently. On the other hand, a stopped instance just goes offline so you don’t get charged for it and you can fire it back up again at a later time. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 13 Aug 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/08/13/aws-pytorch/deep learningawsamazonpytorch2017Understanding Recurrent Neural Networks - Part I<p>Recurrent Neural Networks have been my Achilles’ heel for the past few months. Admittedly, I haven’t had the grit to sit down and work out their details, but I’ve figured it’s time I stop treating them like black boxes and try instead to discover what makes them tick. My intentions with this series are hence twofold: first, to combat my weakness by understanding their inner workings and coding one from scratch; and second, to write down what I learn in order to reinforce the insights I may gain along the way.</p>
<p>In this first installment, we’ll be introducing the intuition behind RNNs, motivating their use by highlighting a glaring limitation of traditional neural networks. We’ll then transition into a more technical description of their architecture which will be useful for the next installment where we’ll code one from scratch in numpy.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Human Learning</a></li>
<li><a href="#toc2">The Woes of Traditional Neural Nets</a></li>
<li><a href="#toc3">Enhancing Neural Nets with Memory</a></li>
<li><a href="#toc4">The Nitty Gritty Details</a></li>
<li><a href="#toc5">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="human-learning">Human Learning</h3>
<blockquote>
<p>We are the sum total of our experiences. None of us are the same as we were yesterday, nor will be tomorrow. (B.J. Neblett)</p>
</blockquote>
<p>There is an inherent truth to the quote above. Our brain pools from past experiences and combines them in intricate ways to solve new and unseen tasks. It is hardwired to work with sequences of information that we perpetually store and call upon over the course of our lives. At its core, <em>human learning</em> can be distilled into two fundamental processes:</p>
<ul>
<li><strong>memorization</strong>: every time we gain new information, we store it for future reference.</li>
<li><strong>combination</strong>: not all tasks are the same, so we couple our analytical skills with a combination of our memorized, previous experiences to reason about the world.</li>
</ul>
<p>Consider the following pictures.</p>
<p align="center">
<img src="/assets/rnn/weird_cat.jpg" alt="Drawing" width="200px" /><img src="/assets/rnn/weird_cat2.jpg" alt="Drawing" width="200px" />
</p>
<p>Even though it’s in a very weird position, a child can instantly tell that the fur ball in front of it is a cat. It’ll recognize the ears, the whiskers and the snout (memory) but the shape of it all may throw it off. Subconsciously however, the child may recall how human stretching deforms shape and pose (combination), and infer that the same is happening to the cat.</p>
<p>Not all tasks require the distant past however. At times, solving a problem makes use of information that was processed only moments ago. For example, take a look at this incomplete sentence:</p>
<blockquote>
<p>I bought my usual caramel-covered popcorn with iced tea and headed to the ___.</p>
</blockquote>
<p>If I asked you to fill in the missing word, you’d probably guess “movies”. How did you know that <code class="highlighter-rouge">library</code> or <code class="highlighter-rouge">starbucks</code> were invalid words? Well, it’s probably because you used context, or information from earlier in the sentence, to infer the correct answer. Now think about the following. If I asked you to recite the lyrics of your favorite song backwards, would you be able to do it? Probably not… What about counting backwards? Yeah, piece of cake!</p>
<p align="center">
<img src="/assets/rnn/yarn.jpg" alt="Drawing" width="200px" />
</p>
<p>So what makes reciting the song backwards so excruciatingly difficult? The answer is that counting backwards is done <strong>on the fly</strong>. There is a logical relationship between each number, and knowing the order of the 10 digits and how subtraction works means you can count backwards from, say, 1845098 even if you’ve never done it before. On the other hand, you memorized the lyrics of the song in a specific order. Your brain works by <strong>indexing</strong> from one word to the next, starting from the first word. It’s hard to index backwards for the simple reason that your brain has never done it before, so that specific sequence was never stored. Think of the memorized lyric sequence as a giant ball of yarn whose unraveled end can only be accessed with the correct first word in the forward sequence.</p>
<p>The main takeaway is that our brains are naturally talented at working with sequences and they do so by relying on a deceptively simple, yet powerful concept called <strong>information persistence</strong>.</p>
<p><a name="toc2"></a></p>
<h3 id="the-woes-of-traditional-neural-nets">The Woes of Traditional Neural Nets</h3>
<p>We live in a world that is inherently sequential. Audio, video, and language (even your DNA!) are but a few examples of data in which information at a given time step is intricately dependent on information from previous timesteps. So how is all this related to deep learning? Well, think about feeding a sequence of frames from a video into a neural network and asking it to predict what comes next… Or, back to our previous example, feeding a set of words and asking it to complete the sentence.</p>
<p>It should be obvious to you that information from the past is crucial for outputting a sane and plausible prediction. But traditional neural networks can’t do this because they operate on the fundamental assumption that inputs are independent! This is a problem because it means our output at any given time is completely and <strong>solely</strong> determined by the input at that same time. There is no previous history and our network cannot capitalize on the complex temporal dependencies that exist between the different frames or words to refine its predictions.</p>
<p>This is where <em>Recurrent Neural Networks</em> come in! RNNs allow us to deal with sequences by incorporating a mechanism that stores and leverages information from previous history, sort of like a memory. Put differently, whereas a traditional net maps <strong>one</strong> input to an output, a recurrent net maps an <strong>entire history</strong> of previous inputs to each output. If that’s still obscure to you, just think of RNNs as a traditional neural net enhanced with a loop<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, one that allows for information to persist across timesteps.</p>
<div class="imgcap">
<p align="center">
<img src="/assets/rnn/draw2.gif" width="400" style="border:none;" />
<div class="thecap" style="text-align:center">(<a href="https://www.youtube.com/watch?v=Zt-7MI9eKEo">Video Courtesy</a>) DRAW model improving its output by iterating over the canvas rather than producing the image in one shot.</div>
</p>
</div>
<p>It is important to note that recurrent neural nets aren’t bound to sequential data alone: many problems can be tackled by decomposing them into a series of smaller subproblems. The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step. As an example, a recurrent net<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> was used to generate handwritten digits in a sequential fashion, mimicking the way artists refine and reassess their work with brushstrokes.</p>
<blockquote>
<p>The idea is that instead of burdening our model with predicting an output in one go, we allow it the much easier task of predicting iterative sub-outputs, where each sub-output is an improvement or refinement on the previous step.</p>
</blockquote>
<p><a name="toc3"></a></p>
<h3 id="enhancing-neural-nets-with-memory">Enhancing Neural Nets with Memory</h3>
<p>So how exactly can we endow our networks with the ability to remember? To answer this question, let’s recall our basic neural network with a single hidden layer, which takes as input a vector <code class="highlighter-rouge">X</code>, multiplies it by a weight matrix <code class="highlighter-rouge">W</code> and applies a nonlinearity. We’ll consider the output <code class="highlighter-rouge">y</code> when three successive inputs are fed through the network. Note that the bias term has been eliminated so as to simplify the notation and that I’ve taken the liberty of coloring the equations to make certain patterns stand out.</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f(W_x \color{green}{X_1})</script>
<script type="math/tex; mode=display">y_2 = f(W_x \color{red}{X_2})</script>
<p>Given the simple API above, it’s pretty clear that each output is solely determined by its input, i.e. there is no trace of past inputs in the calculation of its value. So let’s alter the API by allowing our hidden layer to use a combination of both the current input and the previous input, and visualize what happens.</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f(W_x \color{green}{X_1} + W_h\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_2 = f(W_x \color{red}{X_2} + W_h\color{green}{X_1})</script>
<p>Nice! By introducing recurrence into the formula, we’ve managed to obtain a mix of 2 colors in each hidden layer. Intuitively, our network now has a memory depth of 1, equivalent to “seeing” one step backwards in time. Remember though that our goal is to be able to capture information across <strong>all</strong> previous timesteps, so this does not cut it.</p>
<p>Hmm… What if we feed in a combination of the current input and the previous hidden layer?</p>
<script type="math/tex; mode=display">y_0 = f(W_x\color{blue}{X_0})</script>
<script type="math/tex; mode=display">y_1 = f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)</script>
<script type="math/tex; mode=display">y_2 = f\bigg(W_x \color{red}{X_2} + W_h \ f\big(W_x \color{green}{X_1} + W_h \ f(W_x\color{blue}{X_0}) \big)\bigg)</script>
<p>Much better! Our layer at each timestep is now a blend of all the colors that have come before it, allowing our network to take into account all its past history when computing its output. This is the power of recurrence in all its glory: creating a loop where information can persist across timesteps.</p>
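<p>The three variants above are easy to compare numerically. The sketch below is a toy check under assumptions of my own: random weights, <code class="highlighter-rouge">tanh</code> as the nonlinearity, and inputs and hidden layer of the same size so that one <code class="highlighter-rouge">W_h</code> fits both the second and third variant. It perturbs the first input and asks whether the third output notices.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3                                 # input size == hidden size (assumption)
W_x = 0.5 * rng.normal(size=(n, n))   # random illustrative weights
W_h = 0.5 * rng.normal(size=(n, n))
f = np.tanh


def no_memory(xs):
    # y_t = f(W_x x_t)
    return [f(W_x @ x) for x in xs]


def previous_input(xs):
    # y_t = f(W_x x_t + W_h x_{t-1})
    ys = []
    for t, x in enumerate(xs):
        pre = W_x @ x
        if t > 0:
            pre = pre + W_h @ xs[t - 1]
        ys.append(f(pre))
    return ys


def previous_hidden(xs):
    # y_t = f(W_x x_t + W_h y_{t-1})
    ys = []
    for x in xs:
        pre = W_x @ x
        if ys:
            pre = pre + W_h @ ys[-1]
        ys.append(f(pre))
    return ys


def y2_depends_on_x0(model):
    """Perturb X_0 and check whether the third output y_2 changes."""
    xs = [rng.normal(size=n) for _ in range(3)]
    xs_changed = [xs[0] + 1.0, xs[1], xs[2]]
    return not np.allclose(model(xs)[2], model(xs_changed)[2])


results = {m.__name__: y2_depends_on_x0(m)
           for m in (no_memory, previous_input, previous_hidden)}
print(results)
```

<p>Only the variant that feeds back the previous hidden layer lets <code class="highlighter-rouge">y_2</code> react to a change in <code class="highlighter-rouge">X_0</code>, which is the “blend of all the colors” in code form.</p>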
<p><a name="toc4"></a></p>
<h3 id="the-nitty-gritty-details">The Nitty Gritty Details</h3>
<div class="imgcap">
<img src="/assets/rnn/rnn-1_layer-unrolled.svg" width="300px" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://kbullaughey.github.io/lstm-play/rnn/">Image Courtesy</a></div>
</div>
<p>At its core, an RNN can be represented by an internal, hidden state <code class="highlighter-rouge">h</code> that gets updated with every timestep and from which an output <code class="highlighter-rouge">y</code> can be optionally derived<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. This update behavior is governed by the following equations:</p>
<script type="math/tex; mode=display">\begin{cases}
h_t = f \big(W_{xh}x_t + W_{hh}h_{t-1}+b_1\big) \\
y_t = g \big(W_{hy}h_t + b_2\big)
\end{cases}</script>
<p>Don’t let the above notation scare you. It’s actually very simple once you dissect it.</p>
<ul>
<li><script type="math/tex">W_{xh}x_t</script> - we’re multiplying the input <script type="math/tex">x_t</script> by a weight matrix <script type="math/tex">W_{xh}</script>. You can think of this dot product as a way for the hidden layer to extract information out of the input.</li>
<li><script type="math/tex">W_{hh}h_{t-1}</script> - this dot product is allowing the network to extract information from an entire history of past inputs which it will use in conjunction with information gathered from the current input, to compute its output. This is the crucial, self-defining property of RNNs.</li>
<li><script type="math/tex">f</script> and <script type="math/tex">g</script> are activation functions that squash the dot products to a specific range. The function <script type="math/tex">f</script> is usually <code class="highlighter-rouge">tanh</code> or <code class="highlighter-rouge">ReLU</code>. <script type="math/tex">g</script> can be a <code class="highlighter-rouge">softmax</code> when we want to output class probabilities.</li>
<li><script type="math/tex">b_1</script> and <script type="math/tex">b_2</script> are biases that help offset the outputs away from the origin (similar to the b in your typical <script type="math/tex">ax+b</script> line).</li>
</ul>
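<p>Here is what those two equations might look like as a forward pass in numpy. Treat it as a minimal sketch rather than a reference implementation: I’ve picked <code class="highlighter-rouge">tanh</code> for <code class="highlighter-rouge">f</code> and a <code class="highlighter-rouge">softmax</code> for <code class="highlighter-rouge">g</code>, assumed an all-zero initial hidden state, and used random weights and arbitrary sizes purely for illustration.</p>

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def rnn_forward(xs, W_xh, W_hh, W_hy, b_1, b_2):
    """Run a vanilla RNN over a sequence, returning all hidden states and outputs."""
    h = np.zeros(W_hh.shape[0])  # initial hidden state (assumed to be zero)
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b_1)  # h_t = f(W_xh x_t + W_hh h_{t-1} + b_1)
        y = softmax(W_hy @ h + b_2)             # y_t = g(W_hy h_t + b_2)
        hs.append(h)
        ys.append(y)
    return np.stack(hs), np.stack(ys)


# Illustrative sizes and random parameters (not from the post).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 5, 8, 3, 4
W_xh = 0.1 * rng.normal(size=(n_hidden, n_in))
W_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden))
W_hy = 0.1 * rng.normal(size=(n_out, n_hidden))
b_1 = np.zeros(n_hidden)
b_2 = np.zeros(n_out)

xs = rng.normal(size=(T, n_in))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_1, b_2)
print(hs.shape, ys.shape)
```

<p>Note that the same three weight matrices are reused at every timestep; the only thing that changes from one step to the next is the hidden state <code class="highlighter-rouge">h</code>.</p>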
<p>As you can see, the Vanilla RNN model is quite simple. Once its architecture has been defined, training it is exactly the same as with normal neural nets, i.e. initializing the weight matrices and biases, defining a loss function and minimizing that loss function using some form of gradient descent.</p>
<p>This concludes our first installment in the series. In next week’s blog post, we’ll be coding our very own RNN from the ground up in numpy and applying it to a language modeling task. Stay tuned until then…</p>
<p><a name="toc5"></a></p>
<h3 id="references">References</h3>
<p>There are a ton of resources that helped me better grasp the fundamentals of RNNs. I’d like to thank <a href="https://twitter.com/iamtrask">iamtrask</a> especially, for letting me use his idea of colors to explain neural memory. You can read his amazing blog post <a href="https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/">here</a>.</p>
<ul>
<li>Denny Britz’s RNN series - click <a href="http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/">here</a></li>
<li>Andrej Karpathy’s Blog Post - click <a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/">here</a></li>
<li>Chris Olah’s Blog Post - click <a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/">here</a></li>
</ul>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>If you’re familiar with Control Theory, this should be slightly reminiscent of a feedback loop, although not quite. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>I’m referring to the <a href="https://arxiv.org/abs/1502.04623">DRAW</a> model introduced by Gregor et al. at DeepMind. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>In the simplest of cases, the hidden state <script type="math/tex">h_t</script> is used as both the output <script type="math/tex">y_t</script> and input to the next hidden state <script type="math/tex">h_{t+1}</script>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 20 Jul 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/07/20/rnn/deep learningrnnsequences2017My Short Term Goals For 2017<div class="imgcap">
<img src="/assets/goals/winter.jpg" width="80%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.pinterest.com/bishopspencer/winter/">Image Courtesy</a></div>
</div>
<p>This past month has been extremely productive, and I’m really satisfied with the way my winter break has panned out. In fact, I had the opportunity to read and learn tons from a multitude of arXiv papers and I actually went hands-on and coded 2 projects from scratch in TensorFlow/Keras:</p>
<ul>
<li><a href="https://github.com/kevinzakka/style_transfer">Artistic Style Transfer</a></li>
<li><a href="https://github.com/kevinzakka/spatial_transformer_network">Spatial Transformer Networks</a></li>
</ul>
<p>I think sticking to theory all the time makes me very prone to forming misconceptions, so actually reproducing papers, googling questions and looking at people’s code has really helped me concretize the notions in my head and I’ve gained significant experience in the process.</p>
<p>Since tomorrow marks my last day of winter break, I thought I would use this opportunity to compile a list of projects I’d like to tackle in the next few weeks. It’ll be a perfect way of organizing and prioritizing my short term goals and I’ll be able to hold myself accountable if I get too lazy.</p>
<h2 id="vision">Vision</h2>
<p><strong>Image Super-Resolution.</strong> Super-resolution is the task of estimating a high-resolution (HR) image from its low-resolution (LR) counterpart. Think about those CSI-Miami episodes where they enhanced surveillance videos to glean valuable information for the crime case.</p>
<div class="imgcap">
<img src="/assets/goals/super-resolution.gif" width="80%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://github.com/alexjc/neural-enhance">Image Courtesy</a></div>
</div>
<p>My resource will be Alex Champandard’s <a href="https://github.com/alexjc/neural-enhance">Neural-Enhance</a> repository which uses a combination of 4 papers. Kudos to Alex, definitely go and give him a star for the amazing work.</p>
<p>I think super-resolution is a great application of Deep Learning and tackling it will prove to be very entertaining.</p>
<p><strong>Text-To-Image.</strong> My goal is to try and implement <a href="https://arxiv.org/abs/1612.03242">StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks</a> by Zhang et al. Amazing work where they synthesize photo-realistic images from text descriptions with GANs! If that doesn’t sound like wizardry, take a look at the image below from the paper.</p>
<div class="imgcap">
<img src="/assets/goals/txt2img.png" width="65%" style="border:none;" /><div class="thecap" style="text-align:center"></div>
</div>
<p>I’ll definitely have to brush up on the theory of Generative Adversarial Networks so Goodfellow’s <a href="https://arxiv.org/abs/1701.00160">NIPS 2016 GAN tutorial</a> will prove invaluable.</p>
<p><strong>Lip Reading.</strong> My third project in Visual Recognition will be trying to build a model that can recognise phrases and sentences being spoken by a talking face. Specifically, I’ll try and reproduce the results of Chung et al.’s <a href="https://arxiv.org/abs/1611.05358">Lip Reading Sentences in the Wild</a>. Lip reading is just so damn useful and it can really help the hearing impaired, so this is a big priority of mine. Helping society is exactly why I got into this field.</p>
<p>Here’s the youtube video uploaded by the author of the paper. Impressive results!</p>
<p align="center">
<iframe width="330" height="315" src="https://www.youtube.com/embed/5aogzAUPilE" frameborder="0" allowfullscreen=""></iframe>
</p>
<hr />
<h2 id="sound">Sound</h2>
<p>My goal is to tackle 2 seminal papers in this area.</p>
<p><strong>Wavenet.</strong> Google Deepmind’s <a href="https://arxiv.org/abs/1609.03499">paper</a>, which made waves (no pun intended) when it was released, leverages a deep neural network to generate raw audio waveforms. It won the “Best paper from the industry” award on Reddit.</p>
<div class="imgcap">
<img src="/assets/goals/wavenet.gif" width="55%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">Image Courtesy</a></div>
</div>
<p>I invite you to check out Deepmind’s <a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/">blog post</a> on the matter where they showcase samples created by the network. Not only have they taught it to generate synthetic utterances (English and Mandarin), but there’s even a sample of some piano playing which blew me away.</p>
<p>The results are currently very hard to reproduce, but my goal is just to get something minimal working and to familiarize myself with dilated convolutions.</p>
<p><strong>SoundNet.</strong> The second paper is from MIT’s CSAIL and was presented at this year’s NIPS. The project <a href="http://projects.csail.mit.edu/soundnet/">landing page</a> is extremely well presented and detailed so again, check it out if you’re interested.</p>
<div class="imgcap">
<img src="/assets/goals/soundnet.png" width="75%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://projects.csail.mit.edu/soundnet/">Image Courtesy</a></div>
</div>
<p>I think this paper is extremely underrated. I didn’t see much talk about it on social media, but it’s actually very elegant: given a video, their ConvNet recognizes objects and scenes <strong>from sound only</strong>!</p>
<hr />
<h2 id="nlp">NLP</h2>
<p>NLP is a very important application of Deep Learning and I’ve never had any experience with it, so I decided I’d like to try and implement two recent approaches that have shifted away from the traditional RNN architecture. The first paper is from Google Deepmind and the second one is from FAIR.</p>
<p><strong>ByteNet.</strong> Google Deepmind’s <a href="https://arxiv.org/abs/1610.10099">paper</a> which can perform language modeling and machine translation in linear time. They use dilated convolutions much like in Wavenet. The below snippet, courtesy of the paper, illustrates the model’s architecture.</p>
<div class="imgcap">
<img src="/assets/goals/bytenet.png" width="50%" style="border:none;" /><div class="thecap" style="text-align:center"></div>
</div>
<p><strong>Gated Convnets.</strong> This <a href="https://arxiv.org/abs/1612.08083">paper</a> is from FAIR. The authors evade the traditional RNN structure for language modeling and replace it with a convnet endowed with a gating mechanism (similar in concept to LSTMs). This model also enjoys an order of magnitude speedup compared to a recurrent baseline because they can parallelize it. The architecture is illustrated below.</p>
<div class="imgcap">
<img src="/assets/goals/gated.png" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"></div>
</div>
<hr />
<h2 id="autonomous-driving">Autonomous Driving</h2>
<p>Finally, my current and main interest is autonomous driving. I’ve decided to tackle the following 3 projects and I feel they will form a solid background before I start messing around with Comma.ai’s <a href="https://github.com/commaai/openpilot">open source project</a>.</p>
<p><strong>Traffic Sign Classification.</strong> I want to implement and train a convolutional neural network to classify traffic signs. This, incidentally, is the subject of my next blog post which is part 3 of the Spatial Transformer <a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">series</a>.</p>
<div class="imgcap">
<img src="/assets/goals/traffic-signs.png" width="85%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">Image Courtesy</a></div>
</div>
<p><strong>Behavioral Cloning.</strong> I want to train a deep neural network to drive a car using OpenAI’s Universe and GTA V. Also would like to test it on MIT’s “Deep Learning for Self-Driving Cars” <a href="http://selfdrivingcars.mit.edu/deeptrafficjs/">Deep Traffic</a>. I don’t know if I need to be a pro in Reinforcement Learning, and I’ll definitely refine this list if need be. We’ll see when the time comes.</p>
<p><strong>Kalman Filters.</strong> The final goal is Kalman filters. This algorithm is super important for autonomous driving (gps noise smoothing for example), so I want to understand it more and write a small python implementation. I’ll also definitely write a blog post about it in the near future.</p>
<h2 id="summary">Summary</h2>
<p>That’s it for today’s blog post. I talked about a few projects I’d like to work on in the fields of Vision, Sound, NLP and Autonomous Driving. I thoroughly hope I can achieve the goals I have set in mind before the end of this Spring semester as I’d like to spend the second part of this year on Deep Reinforcement Learning.</p>
<p>For those of you that are interested, I’ve set up a <a href="https://github.com/kevinzakka/deeplearning-roadmap">roadmap repository</a> on my Github which mirrors the above list, so you can check it out and see my progress step-by-step.</p>
<p>Until next time, cheers!</p>
Sun, 22 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/22/goals/
http://kevinzakka.github.io/2017/01/22/goals/deep learninggoals2017computer visionNLPsoundself-drivingDeep Learning Paper Implementations: Spatial Transformer Networks - Part II<div class="imgcap">
<img src="/assets/stn2/ai.jpg" width="45%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/">Image Courtesy</a></div>
</div>
<p>In last week’s <a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">blog post</a>, we introduced two very important concepts: <strong>affine transformations</strong> and <strong>bilinear interpolation</strong> and mentioned that they would prove crucial in understanding Spatial Transformer Networks.</p>
<p>Today, we’ll provide a detailed, section-by-section summary of the <a href="https://arxiv.org/abs/1506.02025">Spatial Transformer Networks</a> paper, a concept originally introduced by researchers <em>Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu</em> of Google Deepmind.</p>
<p>Hopefully, it’ll give you a clear understanding of the module and prove useful for next week’s blog post where we’ll cover its implementation in Tensorflow.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Motivation</a></li>
<li><a href="#toc2">Pooling Operator</a></li>
<li><a href="#toc3">Spatial Transformer Network</a>
<ul>
<li><a href="#toc4">Localisation Network</a></li>
<li><a href="#toc5">Parametrised Sampling Grid</a></li>
<li><a href="#toc6">Differentiable Image Sampling</a></li>
</ul>
</li>
<li><a href="#toc7">Fun with STNs</a>
<ul>
<li><a href="#toc8">Distorted MNIST</a></li>
<li><a href="#toc9">GTSRB dataset</a></li>
</ul>
</li>
<li><a href="#toc10">Summary</a></li>
<li><a href="#toc11">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h2 id="motivation">Motivation</h2>
<p>When working on a classification task, it is usually desirable that our system be <strong>robust</strong> to input variations. By this, we mean to say that should an input undergo a certain “transformation” so to speak, our classification model should in theory spit out the same class label as before that transformation. A few examples of the “challenges” our image classification model may face include:</p>
<ul>
<li><strong>scale variation</strong>: variations in size both in the real world and in the image.</li>
<li><strong>viewpoint variation</strong>: different object orientation with respect to the viewer.</li>
<li><strong>deformation</strong>: non rigid bodies can be deformed and twisted in unusual shapes.</li>
</ul>
<div class="imgcap">
<div>
<img src="/assets/stn2/var1.png" style="max-width:49%; height:350px;" />
<img src="/assets/stn2/var2.png" style="max-width:49%; height:200px;" />
</div>
<div class="thecap" style="text-align:center"><a href="http://cs231n.github.io/classification/">Image Courtesy</a></div>
</div>
<p>For illustration purposes, take a look at the images above. While the task of classifying them may seem trivial to a human being, recall that our computer algorithms only work with raw 3D arrays of brightness values so a tiny change in an input image can alter every single pixel value in the corresponding array. Hence, our ideal image classification model should in theory be able to disentangle object pose and deformation from texture and shape.</p>
<p>For a different type of intuition, let’s again take a look at the following cat images.</p>
<div class="imgcap">
<div>
<img src="/assets/stn2/cat2.jpg" style="max-width:49%; height:300px;" />
<img src="/assets/stn2/cat2_.jpg" style="max-width:49%; height:300px;" />
<img src="/assets/stn2/cat1.jpg" style="max-width:49%; height:250px;" />
<img src="/assets/stn2/cat1_.jpg" style="max-width:49%; height:250px;" />
</div>
<div class="thecap" style="text-align:center"> <b>Left:</b> Cat images which may present classification challenges. <b>Right:</b> Transformed images which yield a simplified classification pipeline.</div>
</div>
<p>Would it not be extremely desirable if our model could go from left to right using some sort of crop and scale-normalize combination so as to simplify the subsequent classification task?</p>
<p><a name="toc2"></a></p>
<h2 id="pooling-layers">Pooling Layers</h2>
<p>It turns out that the pooling layers we use in our neural network architectures actually endow our models with a certain degree of spatial invariance. Recall that the pooling operator acts as a sort of downsampling mechanism. It progressively reduces the spatial size (width and height) of the feature map, operating independently on every depth slice, which cuts down the number of parameters and the computational cost of the layers that follow.</p>
<hr />
<div class="fig figcenter fighighlight">
<img src="/assets/stn2/pool.jpeg" width="36%" />
<img src="/assets/stn2/maxpool.jpeg" width="59%" style="border-left: 1px solid black;" />
<div class="figcaption">
Pooling layer downsamples the volume spatially. <b>Left:</b> In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. <b>Right:</b> 2x2 max pooling. (<a href="http://cs231n.github.io/convolutional-networks/#pool">Image Courtesy</a>)
</div>
</div>
<hr />
<p><strong>How exactly does it provide invariance?</strong> Well think of it this way. The idea behind pooling is to take a complex input, split it up into cells, and “pool” the information from these complex cells to produce a set of simpler cells that describe the output. So for example, say we have 3 images of the number 7, each in a different orientation. A pool over a small grid in each image would detect the number 7 regardless of its position in that grid since we’d be capturing approximately the same information by aggregating pixel values.</p>
<p>Now there are a few downsides to pooling which make it an undesirable operator. For one, pooling is <strong>destructive</strong>. A 2x2 max pool with stride 2, for example, discards 75% of feature activations, meaning we are guaranteed to lose exact positional information. Now you may be wondering why this is bad since we mentioned earlier that it endowed our network with some spatial robustness. Well, the thing is that positional information is invaluable in visual recognition tasks. Think of our cat classifier above. It may be important to know where the whiskers are relative to, say, the snout. We can’t recover that when it is exactly the sort of information max pooling throws away.</p>
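<p>To make both points concrete, here is a minimal NumPy sketch of 2x2 max pooling with stride 2 on a single channel. The helper name <code>max_pool_2x2</code> and the toy 4x4 inputs are my own choices for illustration and assume even height and width; this is not how any particular framework implements pooling.</p>

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on one (H, W) channel; H and W must be even."""
    h, w = feature_map.shape
    # Group pixels into non-overlapping 2x2 blocks, then keep the max of each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# A single strong activation, and the same activation shifted one pixel to the right.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[0, 1] = 1.0

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)

# Only a quarter of the activations survive, i.e. 75% are discarded.
print(pooled_a.size / a.size)
# The one-pixel shift is invisible after pooling: robustness, but also lost position.
print(np.array_equal(pooled_a, pooled_b))
```

<p>The two pooled maps come out identical, which is the small invariance we wanted, but it also means nothing downstream can tell which of the two positions the activation originally had.</p>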
<p>Another limitation of pooling is that it is <strong>local and predefined</strong>. With a small receptive field, the effects of a pooling operator are only felt towards deeper layers of the network, meaning intermediate feature maps may suffer from large input distortions. And remember, we can’t just increase the receptive field arbitrarily because that would downsample our feature map too aggressively.</p>
<p>The main takeaway is that ConvNets are not invariant to relatively large input distortions. This limitation is due to having only a restricted, pre-defined pooling mechanism for dealing with spatial variation of the data. This is where Spatial Transformer Networks come into play!</p>
<blockquote>
<p>The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. (Geoffrey Hinton, Reddit AMA)</p>
</blockquote>
<p><a name="toc3"></a></p>
<h2 id="spatial-transformer-networks-stns">Spatial Transformer Networks (STNs)</h2>
<p>The Spatial Transformer mechanism addresses the issues above by providing Convolutional Neural Networks with explicit spatial transformation capabilities. It possesses 3 defining properties that make it very appealing.</p>
<ul>
<li><strong>modular</strong>: STNs can be inserted anywhere into existing architectures with relatively small tweaking.</li>
<li><strong>differentiable</strong>: STNs can be trained with backprop allowing for end-to-end training of the models they are injected in.</li>
<li><strong>dynamic:</strong> STNs perform active spatial transformation on a feature map for each input sample as compared to the pooling layer which acted identically for all input samples.</li>
</ul>
<p>As you can see, the Spatial Transformer is superior to the Pooling operator in all regards. So this begs the following question: <strong>what exactly is a Spatial Transformer?</strong></p>
<div class="imgcap">
<img src="/assets/stn2/stn_arch.png" width="65%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a></div>
</div>
<p>The Spatial Transformer module consists of three components shown in the figure above: a <strong>localisation network</strong>, a <strong>grid generator</strong> and a <strong>sampler</strong>. Before we dive into each of their details, I’d like to briefly remind you of the 3-step pipeline we talked about last week.</p>
<div class="imgcap">
<img src="/assets/stn2/pipeline.png" width="75%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://kevinzakka.github.io/2017/01/10/stn-part1/">Affine Transformation Pipeline</a></div>
</div>
<p>Recall that we can’t just blindly rush to the input image and apply our affine transformation. It’s important to first create a sampling grid, transform it, and then sample the input image using the grid. With that being said, let’s jump into the core components of the Spatial Transformer.</p>
<p><a name="toc4"></a></p>
<h3 id="localisation-network">Localisation Network</h3>
<p>The goal of the localisation network is to spit out the parameters <script type="math/tex">\theta</script> of the affine transformation that’ll be applied to the input feature map. More formally, our localisation net is defined as follows:</p>
<ul>
<li><strong>input</strong>: feature map U of shape (H, W, C)</li>
<li><strong>output</strong>: transformation matrix <script type="math/tex">\theta</script> of shape (6,)</li>
<li><strong>architecture</strong>: fully-connected network or ConvNet as well.</li>
</ul>
<p>As we train our network, we would like our localisation net to output more and more accurate thetas. <strong>What do we mean by accurate?</strong> Well, think of our digit 7 rotated by 90 degrees counterclockwise. After say 2 epochs, our localisation net may output a transformation matrix which performs a 45 degree clockwise rotation and after 5 epochs for example, it may actually learn to do a complete 90 degree clockwise rotation. The effect is that our output image looks like a standard digit 7, something our neural network has seen in the training data and can easily classify.</p>
<p>Another way to look at it is that the localisation network learns to store the knowledge of how to transform each training sample in the weights of its layers.</p>
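<p>As a rough illustration of the input and output shapes listed above, here is a toy localisation net in NumPy: a single fully-connected layer that flattens U and regresses 6 numbers. The function name, the 28x28x1 input size, and the choice to start from zero weights with a bias equal to the identity transform are all assumptions made for this sketch. Starting at the identity is a common convention so that the module initially leaves its input untouched, but nothing above requires it.</p>

```python
import numpy as np

def localisation_net(U, W, b):
    """Hypothetical one-layer localisation net: flatten U, regress 6 affine parameters."""
    return U.reshape(-1) @ W + b

# Toy input feature map of shape (H, W, C); the sizes are arbitrary.
height, width, channels = 28, 28, 1
rng = np.random.default_rng(0)
U = rng.random((height, width, channels))

# Zero weights plus an identity bias, so theta starts out as the identity transform.
W = np.zeros((height * width * channels, 6))
b = np.array([1.0, 0.0, 0.0,
              0.0, 1.0, 0.0])

theta = localisation_net(U, W, b)
print(theta.shape)           # the 6 parameters described above
print(theta.reshape(2, 3))   # the same parameters viewed as a 2x3 affine matrix
```

<p>During training, the gradients flowing back from the classification loss would update <code>W</code> and <code>b</code>, which is how the “knowledge of how to transform each sample” ends up stored in the weights.</p>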
<p><a name="toc5"></a></p>
<h3 id="parametrised-sampling-grid">Parametrised Sampling Grid</h3>
<p>The grid generator’s job is to output a parametrised sampling grid, which is a set of points where the input map <strong>should</strong> be sampled to produce the desired transformed output.</p>
<p>Concretely, the grid generator first creates a normalized meshgrid of the same size as the input image U of shape (H, W), that is, a set of indices <script type="math/tex">(x^t, y^t)</script> that cover the whole input feature map (the superscript t here stands for target coordinates in the output feature map). Then, since we’re applying an affine transformation to this grid and would like to use translations, we proceed by adding a row of ones to our coordinate vector to obtain its homogeneous equivalent. This is the little trick we also talked about last week. Finally, we reshape our 6 parameter <script type="math/tex">\theta</script> to a 2x3 matrix and perform the following multiplication which results in our desired parametrised sampling grid.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
x^{s} \\
y^{s} \\
\end{bmatrix} = \begin{bmatrix}
\theta_{11} & \theta_{12} & \theta_{13} \\
\theta_{21} & \theta_{22} & \theta_{23}
\end{bmatrix}
%
\begin{bmatrix}
x^t \\
y^t \\
1
\end{bmatrix} %]]></script>
<p>The column vector <script type="math/tex">\begin{bmatrix}
x^s \\
y^s
\end{bmatrix}</script> consists of a set of indices that tell us where we should sample our input to obtain the desired transformed output.</p>
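<p>The three steps of the grid generator (meshgrid, homogeneous coordinates, matrix product) can be sketched in a few lines of NumPy. The function name <code>affine_grid</code> is hypothetical, and normalizing the coordinates to the range [-1, 1] is an assumption on my part since the text only says “normalized”.</p>

```python
import numpy as np

def affine_grid(theta, height, width):
    """Return source coordinates (x_s, y_s), each of shape (height, width)."""
    # 1. Normalized target coordinates covering the whole map (assumed range [-1, 1]).
    x_t, y_t = np.meshgrid(np.linspace(-1.0, 1.0, width),
                           np.linspace(-1.0, 1.0, height))
    # 2. Add a row of ones to get homogeneous coordinates of shape (3, H*W).
    ones = np.ones(height * width)
    target = np.stack([x_t.ravel(), y_t.ravel(), ones])
    # 3. Reshape the 6 parameters to 2x3 and multiply: (2, 3) @ (3, H*W) -> (2, H*W).
    source = theta.reshape(2, 3) @ target
    return source[0].reshape(height, width), source[1].reshape(height, width)

# The identity transform leaves the grid untouched.
identity = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
x_s, y_s = affine_grid(identity, 3, 4)
print(x_s[0])      # x coordinates along the first row
print(y_s[:, 0])   # y coordinates along the first column

# A translation only changes the third column of the 2x3 matrix.
shift = np.array([1.0, 0.0, 0.5, 0.0, 1.0, 0.0])
x_shifted, _ = affine_grid(shift, 3, 4)
```

<p>Note how the translation only works because of the extra row of ones: without it, the third column of the matrix would have nothing to multiply.</p>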
<p><strong>But wait a minute, what if those indices are fractional?</strong> Bingo! That’s why we learned about bilinear interpolation and this is exactly what we do next.</p>
<p><a name="toc6"></a></p>
<h3 id="differentiable-image-sampling">Differentiable Image Sampling</h3>
<p>Since bilinear interpolation is differentiable, it is perfectly suitable for the task at hand. Armed with the input feature map and our parametrised sampling grid, we proceed with bilinear sampling and obtain our output feature map V of shape (H’, W’, C). The sampling is applied identically to every channel, so the number of channels is unchanged, while H’ and W’ are set by the shape of the sampling grid. Note that this implies that we can perform downsampling and upsampling by specifying the shape of our sampling grid (take that, pooling!). We definitely aren’t restricted to bilinear sampling, and there are other sampling kernels we can use, but the important takeaway is that it must be differentiable to allow the loss gradients to flow all the way back to our localisation network.</p>
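<p>Here is a small NumPy sketch of the sampling step on its own. It assumes the source coordinates are normalized to [-1, 1] and clamps out-of-range samples to the border; both are choices made for this example, and the name <code>bilinear_sample</code> is hypothetical. It only shows the forward pass, so the differentiability argument is not demonstrated here.</p>

```python
import numpy as np

def bilinear_sample(U, x_s, y_s):
    """Sample U of shape (H, W, C) at normalized coordinates x_s, y_s in [-1, 1]."""
    H, W, _ = U.shape
    # Map normalized coordinates to (fractional) pixel coordinates.
    x = np.clip((x_s + 1.0) * (W - 1) / 2.0, 0, W - 1)
    y = np.clip((y_s + 1.0) * (H - 1) / 2.0, 0, H - 1)
    # Indices of the 4 neighbouring pixels, clamped at the border (an assumption).
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    # Fractional offsets act as interpolation weights, broadcast over channels.
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = (1 - wx) * U[y0, x0] + wx * U[y0, x1]
    bottom = (1 - wx) * U[y1, x0] + wx * U[y1, x1]
    return (1 - wy) * top + wy * bottom

U = np.arange(16, dtype=float).reshape(4, 4, 1)

# A 4x4 identity grid reproduces the input; a 2x2 grid downsamples it.
x_full, y_full = np.meshgrid(np.linspace(-1, 1, 4), np.linspace(-1, 1, 4))
x_small, y_small = np.meshgrid(np.linspace(-1, 1, 2), np.linspace(-1, 1, 2))
V_same = bilinear_sample(U, x_full, y_full)
V_small = bilinear_sample(U, x_small, y_small)
print(V_same.shape, V_small.shape)
```

<p>The output shape follows the grid, not the input, and the channel dimension is carried along untouched.</p>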
<div class="imgcap">
<img src="/assets/stn2/transformation.png" width="60%" style="border:none;" />
<div class="thecap" style="text-align:justify">(<a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a>) Two examples of applying the parameterised sampling grid to an image U producing the output V. <b>(a)</b> Identity transform (i.e. U = V) <b>(b)</b> Affine Transformation (i.e. rotation)</div>
</div>
<p>The above illustrates the inner workings of the Spatial Transformer. Basically it boils down to 2 crucial concepts we’ve been talking about all week: an affine transformation followed by bilinear interpolation. Take a moment and admire the elegance of such a mechanism! We’re letting our network learn the optimal affine transformation parameters that will help it ultimately succeed in the classification task <strong>all on its own</strong>.</p>
<p><a name="toc7"></a></p>
<h2 id="fun-with-spatial-transformers">Fun with Spatial Transformers</h2>
<p>As a final note, I’ll provide 2 examples that illustrate the power of Spatial Transformers. I’ve attached the references for each example at the bottom of the post, so make sure to look those up if they pique your interest.</p>
<p><a name="toc8"></a></p>
<h3 id="distorted-mnist">Distorted MNIST</h3>
<p>Here is the result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification.</p>
<div class="imgcap">
<img src="/assets/stn2/mnist.png" width="45%" style="border:none;" /><div class="thecap" style="text-align:center">(<a href="https://arxiv.org/abs/1506.02025">Image Courtesy</a>)</div>
</div>
<p>Notice how it has learned to do exactly what we wanted our theoretical “robust” image classification model to do: by zooming in and eliminating background clutter, it has “standardized” the input to facilitate classification. If you want to view a live animation of the transformer in action, click <a href="https://drive.google.com/file/d/0B1nQa_sA3W2iN3RQLXVFRkNXN0k/view">here</a>.</p>
<p><a name="toc9"></a></p>
<h3 id="german-traffic-sign-recognition-benchmark-gtsrb-dataset">German Traffic Sign Recognition Benchmark (GTSRB) dataset</h3>
<div class="imgcap">
<div>
<img src="/assets/stn2/epoch_evolution.gif" style="max-width:49%; height:250px;" />
<img src="/assets/stn2/moving_evolution.gif" style="max-width:49%; height:250px;" />
</div>
<div class="thecap">(<a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">Image Courtesy</a>) <b>Left</b>: Behavior of the Spatial Transformer during training. Notice how it learns to focus on the traffic sign, gradually removing background. <b>Right</b>: Output for different input images. Note how it stays approximately constant regardless of the input variability and distortion. Pretty neat!</div>
</div>
<p><a name="toc10"></a></p>
<h2 id="summary">Summary</h2>
<p>In today’s blog post, we went over Google Deepmind’s Spatial Transformer Network paper. We started by introducing the different challenges classification models face, mainly how distortions in the input images can cause our classifiers to fail. One remedy is to use pooling layers; however they possess a few glaring limitations that have made them fall into disuse. The other remedy, and the subject of this blog post, is to use Spatial Transformer Networks.</p>
<p>This consists of a differentiable module that can be inserted anywhere in a ConvNet architecture to increase its geometric invariance. It effectively endows our networks with the ability to spatially transform feature maps at no extra data or supervision cost. Finally, we saw how the whole mechanism boils down to 2 familiar operations: an affine transformation and bilinear interpolation.</p>
<p>In next week’s blog post we’ll be using what we’ve learned so far to aid us in coding this paper from scratch in Tensorflow. In the meantime, if you have any questions, feel free to post them in the comment section below.</p>
<p>Cheers and see you next week!</p>
<p><a name="toc11"></a></p>
<h2 id="references">References</h2>
<ul>
<li>The original Deepmind paper - click <a href="https://arxiv.org/abs/1506.02025">here</a></li>
<li>Kudos to the Torch blog post on STNs which really helped me during the learning process - click <a href="http://torch.ch/blog/2015/09/07/spatial_transformers.html">here</a></li>
<li>Torch Implementation also helped me grasp the inner workings of STNs - check out this <a href="https://github.com/qassemoquab/stnbhwd">repo</a></li>
<li>Stanford’s CS231n as always - click <a href="http://cs231n.github.io">here</a></li>
</ul>
Wed, 18 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/18/stn-part2/
http://kevinzakka.github.io/2017/01/18/stn-part2/deepmindgooglespatial transformer networkstransformationsaffinelinearbilinear interpolationDeep Learning Paper Implementations: Spatial Transformer Networks - Part I<div class="imgcap">
<img src="/assets/stn/ai.jpg" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/">Image Courtesy</a></div>
</div>
<p>The first three blog posts in my “Deep Learning Paper Implementations” series will cover <a href="https://arxiv.org/abs/1506.02025">Spatial Transformer Networks</a> introduced by <em>Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu</em> of Google Deepmind in 2015. The Spatial Transformer Network is a learnable module aimed at increasing the spatial invariance of Convolutional Neural Networks in a computationally and parameter efficient manner.</p>
<p>In this first installment, we’ll be introducing two very important concepts that will prove crucial in understanding the inner workings of the Spatial Transformer layer. We’ll first start by examining a subset of image transformation techniques that fall under the umbrella of <strong>affine transformations</strong>, and then dive into a procedure that commonly follows these transformations: <strong>bilinear interpolation</strong>.</p>
<p>In the second installment, we’ll be going over the Spatial Transformer Layer in detail and summarizing the paper, and then in the third and final part, we’ll be coding it from scratch in Tensorflow and applying it to the <a href="http://benchmark.ini.rub.de/?section=gtsrb&subsection=news">GTSRB dataset</a> (German Traffic Sign Recognition Benchmark).</p>
<p>For the full code that appears on this page, visit my <a href="https://github.com/kevinzakka/blog-code/tree/master/spatial_transformer">Github Repository</a>.</p>
<h4 id="table-of-contents">Table of Contents</h4>
<ul>
<li><a href="#toc1">Image Transformations</a>
<ul>
<li><a href="#toc2">Scale</a></li>
<li><a href="#toc3">Rotate</a></li>
<li><a href="#toc4">Shear</a></li>
<li><a href="#toc5">Translate</a></li>
</ul>
</li>
<li><a href="#toc6">Bilinear Interpolation</a>
<ul>
<li><a href="#toc7">Motivation</a></li>
<li><a href="#toc8">Algorithm</a></li>
<li><a href="#toc9">Python Code</a></li>
</ul>
</li>
<li><a href="#toc10">Results</a></li>
<li><a href="#toc11">Conclusion</a></li>
<li><a href="#toc12">References</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="image-transformations">Image Transformations</h3>
<p>To lay the groundwork for affine transformations, we first need to talk about linear transformations. To that end, we’ll be restricting ourselves to 2 dimensions and work with matrices.</p>
<p>We define the following:</p>
<ul>
<li>a point K with coordinates
<script type="math/tex">\begin{bmatrix}
x \\
y
\end{bmatrix}</script> represented as a <script type="math/tex">(2\times1)</script> column vector.</li>
<li>a matrix
<script type="math/tex">% <![CDATA[
M=
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix} %]]></script> represented as a square matrix of shape <script type="math/tex">(2\times2)</script>.</li>
</ul>
<p>and would like to examine the linear transformation <script type="math/tex">T</script> defined by the matrix product <script type="math/tex">K' = T(K) = MK</script> as we vary the parameters a, b, c and d of M.</p>
<p><strong>Warm-Up Question.</strong></p>
<p>Say we set <script type="math/tex">a = d = 1</script> and <script type="math/tex">b = c = 0</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix} %]]></script>
<p>In that case, what transform do you think we would obtain? Go ahead and give it a few moments’ thought…</p>
<p><strong>Solution.</strong></p>
<p>Let’s write it out:</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x \\
y
\end{bmatrix} = K %]]></script>
<p>We’ve actually represented the identity transform, meaning that the point K does not move in the plane. Let us now jump to more interesting transforms.</p>
<p><a name="toc2"></a></p>
<p><strong>Scaling.</strong></p>
<div class="imgcap">
<img src="/assets/stn/scale.png" width="27%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>We let <script type="math/tex">b = c = 0</script>, and let <script type="math/tex">a = p</script> and <script type="math/tex">d = q</script> take on any positive values.</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
p & 0 \\
0 & q
\end{bmatrix} %]]></script>
<p>Note that there is a special case of scaling called <em>isotropic</em> scaling in which the scaling factor for both the x and y direction is the same, say <script type="math/tex">s</script>. In that case, enlarging an image would correspond to <script type="math/tex">s > 1</script> while shrinking would correspond to <script type="math/tex">% <![CDATA[
s < 1 %]]></script>. It’s a bit non-intuitive then that to zoom-in on an image, you need <script type="math/tex">% <![CDATA[
s < 1 %]]></script> (think about it).</p>
<p>Anyway, performing the matrix product, we obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
p & 0 \\
0 & q
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
px \\
qy
\end{bmatrix} %]]></script>
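<p>As a quick numerical check of the product above, here is the same computation in NumPy. The values of <code>p</code>, <code>q</code> and the point K are arbitrary picks for illustration.</p>

```python
import numpy as np

# Arbitrary scaling factors: stretch x by 2 and squash y by half.
p, q = 2.0, 0.5
M = np.array([[p, 0.0],
              [0.0, q]])

# The point K as a (2, 1) column vector.
K = np.array([[3.0],
              [4.0]])

K_prime = M @ K
print(K_prime.ravel())  # [6. 2.], i.e. (p * x, q * y)
```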
<p><a name="toc3"></a></p>
<p><strong>Rotation.</strong></p>
<div class="imgcap">
<img src="/assets/stn/rot.png" width="19%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>Suppose we want to rotate by an angle <script type="math/tex">\theta</script> about the origin. To do so, we set <script type="math/tex">a = d = \cos{\theta}</script>, <script type="math/tex">b = -\sin{\theta}</script> and <script type="math/tex">c = \sin{\theta}</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
\cos{\theta} & -\sin{\theta} \\
\sin{\theta} & \cos{\theta}
\end{bmatrix} %]]></script>
<p>We thus obtain</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
\cos{\theta} & -\sin{\theta} \\
\sin{\theta} & \cos{\theta}
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x\cos{\theta}- y\sin{\theta} \\
x\sin{\theta} + y\cos{\theta}
\end{bmatrix} %]]></script>
<p><a name="toc4"></a></p>
<p><strong>Shear.</strong></p>
<div class="imgcap">
<img src="/assets/stn/shear.png" width="27%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>When we shear an image, we offset the y direction by a distance proportional to x, and the x direction by a distance proportional to y. For example, when we go from normal text to italics, we are effectively applying a shear transform (think about shearing a deck of cards if that helps).</p>
<p>To achieve shearing, we set <script type="math/tex">a = d = 1</script>, <script type="math/tex">b = m</script> and <script type="math/tex">c = n</script> as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
M = \begin{bmatrix}
1 & m \\
n & 1
\end{bmatrix} %]]></script>
<p>This yields</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & m \\
n & 1
\end{bmatrix}
%
\begin{bmatrix}
x \\
y
\end{bmatrix} =
\begin{bmatrix}
x + my \\
y + nx
\end{bmatrix} %]]></script>
<hr />
<p>In summary, we have defined 3 basic linear transformations:</p>
<ul>
<li><strong>scaling:</strong> scales the x and y direction by a scalar.</li>
<li><strong>shearing:</strong> offsets x by a number proportional to y and y by a number proportional to x.</li>
<li><strong>rotating:</strong> rotates the points around the origin by an angle <script type="math/tex">\theta</script>.</li>
</ul>
<p>Now the nice thing about matrices is that we can collapse sequential linear transformations into a single transformation matrix. For example, say we would like to apply a shear, a scale and then a rotation to our column vector K. Given that these transformations can be represented by the matrices <script type="math/tex">H</script>, <script type="math/tex">S</script> and <script type="math/tex">R</script>, and respecting the order of transformations, we can write down this operation as</p>
<script type="math/tex; mode=display">K' = R \big[ S \big( HK \big) \big]</script>
<p>But recall that matrix multiplication is associative! So this reduces to</p>
<script type="math/tex; mode=display">\boxed{K' = MK}</script>
<p>where <script type="math/tex">M = RSH</script>. Be mindful of the order since matrix multiplication <script type="math/tex">\color{red}{\text{is not}}</script> commutative.</p>
<p>A beautiful consequence of this formula is that if we are given multiple transformations to do for a very high-dimensional vector, then we can basically carry out a single matrix multiplication rather than repeatedly manipulating the high-dimensional vector for every sequential transformation.</p>
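<p>To make this concrete, here is a minimal NumPy sketch that builds a shear, a scale and a rotation matrix, applies them one after the other, and compares the result against the single collapsed matrix. The numeric values (the angle, <code class="highlighter-rouge">m</code>, <code class="highlighter-rouge">n</code>, <code class="highlighter-rouge">p</code>, <code class="highlighter-rouge">q</code> and the point K) are arbitrary choices made for illustration.</p>

```python
import numpy as np

theta = np.deg2rad(30.0)  # arbitrary rotation angle, chosen for illustration

# shear (H), scale (S) and rotation (R) matrices as defined above
H = np.array([[1.0, 0.5],
              [0.2, 1.0]])                       # m = 0.5, n = 0.2
S = np.array([[2.0, 0.0],
              [0.0, 3.0]])                       # p = 2, q = 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

K = np.array([[1.0], [2.0]])  # column vector (x, y)

# apply the transformations one at a time: shear, then scale, then rotate
step_by_step = R @ (S @ (H @ K))

# collapse them into a single matrix first, then apply it once
M = R @ S @ H
collapsed = M @ K

# multiplying in a different order generally gives a different result
wrong_order = (H @ S @ R) @ K
```

<p>Here <code class="highlighter-rouge">step_by_step</code> and <code class="highlighter-rouge">collapsed</code> agree up to floating point error, while <code class="highlighter-rouge">wrong_order</code> does not.</p>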
<hr />
<p><a name="toc5"></a></p>
<p><strong>Translation.</strong></p>
<p>The only downside to this <script type="math/tex">2 \times 2</script> matrix representation is that we cannot represent translation since it isn’t a linear transformation. Translation however, is a very important and needed transformation, so we would like to be able to encapsulate it in our matrix representation.</p>
<p>To solve this dilemma, we represent our 2D vectors in 3D using <strong>homogeneous coordinates</strong> as follows:</p>
<ul>
<li>our point K becomes a <script type="math/tex">(3\times1)</script> column vector
<script type="math/tex">\begin{bmatrix}
x \\
y \\
1
\end{bmatrix}</script></li>
<li>our matrix M becomes a <script type="math/tex">(3\times3)</script> square matrix
<script type="math/tex">% <![CDATA[
M=
\begin{bmatrix}
a & b & 0 \\
c & d & 0 \\
0 & 0 & 1
\end{bmatrix} %]]></script></li>
</ul>
<p>To represent a translation, all we have to do is place 2 new parameters <script type="math/tex">e</script> and <script type="math/tex">f</script> in our third column like so</p>
<script type="math/tex; mode=display">% <![CDATA[
M=
\begin{bmatrix}
a & b & e \\
c & d & f \\
0 & 0 & 1
\end{bmatrix} %]]></script>
<p>and we can thus carry out translations as linear transformations in homogeneous coordinates. Note that if we require a 2D output, then all we need to do is represent M as a <script type="math/tex">2 \times 3</script> matrix and leave K untouched.</p>
<p><strong>Example.</strong></p>
<p>Translate both the x and y direction by <script type="math/tex">\Delta</script>. Result should be 2D.</p>
<script type="math/tex; mode=display">% <![CDATA[
K' = \begin{bmatrix}
1 & 0 & \Delta \\
0 & 1 & \Delta
\end{bmatrix}
%
\begin{bmatrix}
x \\
y \\
1
\end{bmatrix} =
\begin{bmatrix}
x + \Delta \\
y + \Delta
\end{bmatrix} %]]></script>
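<p>The same example can be checked numerically. The sketch below assumes an arbitrary value for <script type="math/tex">\Delta</script> and three made-up points, stored one per column in homogeneous coordinates.</p>

```python
import numpy as np

delta = 3.0  # translation amount, arbitrary value for illustration

# 2x3 affine matrix: identity linear part, translation in the third column
M = np.array([[1.0, 0.0, delta],
              [0.0, 1.0, delta]])

# three points (1, 2), (0, 0) and (-4, 5) in homogeneous coordinates,
# stored one per column
K = np.array([[1.0, 0.0, -4.0],
              [2.0, 0.0,  5.0],
              [1.0, 1.0,  1.0]])

# shape (2, 3): each column is (x + delta, y + delta)
K_prime = M @ K
```

<p>Because M is <script type="math/tex">2 \times 3</script>, the output drops the trailing 1 and comes back as ordinary 2D points.</p>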
<p><strong>Summary.</strong></p>
<div class="imgcap">
<img src="/assets/stn/affine.png" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Image Courtesy</a></div>
</div>
<p>By using a little trick, we were able to add a new transformation to our repertoire of linear transformations. This transformation, called translation, is an affine transformation. Hence, we can generalize our results and represent our 4 affine transformations (all linear transformations are affine) by the 6 parameter matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
M=
\begin{bmatrix}
a & b & e \\
c & d & f
\end{bmatrix} %]]></script>
<p><a name="toc6"></a></p>
<h3 id="bilinear-interpolation">Bilinear Interpolation</h3>
<p><a name="toc7"></a></p>
<p><strong>Motivation.</strong> When an image undergoes an affine transformation such as a rotation or scaling, the pixels in the image get moved around. This can be especially problematic when a pixel location in the output does not map directly to one in the input image.</p>
<p>In the illustration below, you can clearly see that the rotation places some points at locations that are not centered in the squares. This means that they would not have a corresponding pixel value in the original image.</p>
<div class="imgcap">
<img src="/assets/stn/stickman.png" width="70%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://northstar-www.dartmouth.edu/doc/idl/html_6.2/Interpolation_Methods.html">Image Courtesy</a></div>
</div>
<p>So for example, suppose that after rotating an image, we need to find the pixel value at the location (6.7, 3.2). The problem with this is that there is no such thing as fractional pixel locations.</p>
<p>To solve this problem, bilinear interpolation uses the 4 nearest pixel values which are located in diagonal directions from a given location in order to find the appropriate color intensity values of that pixel. The result is smoother and more realistic images!</p>
<p><a name="toc8"></a></p>
<p><strong>Algorithm.</strong></p>
<div class="imgcap">
<img src="/assets/stn/interpol.png" width="35%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Image Courtesy</a></div>
</div>
<p>Our goal is to find the pixel value of the point P. To do so, we calculate the pixel value of <script type="math/tex">R_1</script> and <script type="math/tex">R_2</script> using a weighted average of <script type="math/tex">(Q_{11}, Q_{21})</script> and <script type="math/tex">(Q_{12}, Q_{22})</script> respectively. Then, we use a weighted average of <script type="math/tex">R_2</script> and <script type="math/tex">R_1</script> to find the value of P.</p>
<p>Effectively, we are interpolating in the x direction and then the y direction, hence the name bilinear interpolation. You could just as well flip the order of interpolation and get the exact same value.</p>
<p>So given a point <script type="math/tex">P = (x, y)</script> and 4 corner coordinates <script type="math/tex">Q_{11} = (x_1, y_1)</script>, <script type="math/tex">Q_{21} = (x_2, y_1)</script>, <script type="math/tex">Q_{12} = (x_1, y_2)</script> and <script type="math/tex">Q_{22} = (x_2, y_2)</script>, we first interpolate in the x-direction:</p>
<script type="math/tex; mode=display">R_1 = \frac{x_2 - x}{x_2 - x_1}Q_{11} + \frac{x - x_1}{x_2 - x_1}Q_{21}</script>
<script type="math/tex; mode=display">R_2 = \frac{x_2 - x}{x_2 - x_1}Q_{12} + \frac{x - x_1}{x_2 - x_1}Q_{22}</script>
<p>and finally in the y-direction:</p>
<script type="math/tex; mode=display">\boxed{P = \frac{y_2 - y}{y_2 - y_1}R_1 + \frac{y - y_1}{y_2 - y_1}R_2}</script>
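<p>As a sanity check, the three formulas above translate almost line for line into a scalar Python function. This is only a sketch: the function name and the corner intensities are made up for illustration, and the query point reuses the (6.7, 3.2) location from the motivation.</p>

```python
def bilinear_interpolate(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """Interpolate the value at (x, y) from four corner values.

    q11, q21, q12 and q22 are the pixel values at (x1, y1), (x2, y1),
    (x1, y2) and (x2, y2) respectively, following the notation above.
    """
    # interpolate along x on the y1 edge and on the y2 edge
    r1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    r2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # interpolate along y between the two intermediate values
    return (y2 - y) / (y2 - y1) * r1 + (y - y1) / (y2 - y1) * r2


# value at (6.7, 3.2) from the surrounding integer pixel locations,
# with made-up corner intensities
p = bilinear_interpolate(6.7, 3.2, 6, 3, 7, 4,
                         q11=10.0, q21=20.0, q12=30.0, q22=40.0)
```

<p>Evaluating the function exactly at a corner returns that corner’s value, which is a quick way to convince yourself the weights are the right way around.</p>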
<p><a name="toc9"></a></p>
<p><strong>Python Code.</strong></p>
<p>One very very important note before we jump into the code!</p>
<hr />
<p>An image processing affine transformation usually follows the 3-step pipeline below:</p>
<ul>
<li>First, we create a sampling grid composed of <script type="math/tex">(x, y)</script> coordinates. For example, given a 400x400 grayscale image, we create a meshgrid of the same dimensions, that is, evenly spaced <script type="math/tex">x \in [0, W]</script> and <script type="math/tex">y \in [0, H]</script>.</li>
<li>We then apply the transformation matrix to the sampling grid generated in the step above.</li>
<li>Finally, we sample the resulting grid from the original image using the desired interpolation technique.</li>
</ul>
<p>As you can see, this is different than directly applying a transform to the original image.</p>
<hr />
<p>I’ve attached 2 cat images in the Github Repository mentioned at the top of this page which you should go ahead and download. Save them to your Desktop in a folder called <code class="highlighter-rouge">data/</code> or make sure to update the path location if you choose differently.</p>
<p>I’ve also written a function <code class="highlighter-rouge">load_img()</code> that converts images to numpy arrays. I won’t go into its details but it’s pretty basic and you shouldn’t take long to understand what it does. Note that you’ll need both PIL and Numpy to reproduce the results below.</p>
<p>Armed with this function, let’s load both cat images and concatenate them into a single input array. We’re working with 2 images because we want to make our code as general as possible.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="c"># params</span>
<span class="n">DIMS</span> <span class="o">=</span> <span class="p">(</span><span class="mi">400</span><span class="p">,</span> <span class="mi">400</span><span class="p">)</span>
<span class="n">CAT1</span> <span class="o">=</span> <span class="s">'cat1.jpg'</span>
<span class="n">CAT2</span> <span class="o">=</span> <span class="s">'cat2.jpg'</span>
<span class="c"># load both cat images</span>
<span class="n">img1</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">CAT1</span><span class="p">,</span> <span class="n">DIMS</span><span class="p">)</span>
<span class="n">img2</span> <span class="o">=</span> <span class="n">load_img</span><span class="p">(</span><span class="n">CAT2</span><span class="p">,</span> <span class="n">DIMS</span><span class="p">,</span> <span class="n">view</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="c"># concat into tensor of shape (2, 400, 400, 3)</span>
<span class="n">input_img</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">img1</span><span class="p">,</span> <span class="n">img2</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c"># dimension sanity check</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Input Img Shape: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">input_img</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
</code></pre>
</div>
<p>Given that we have 2 images, our batch size is equal to 2. This means that we need an equal amount of transformation matrices M for each image in the batch.</p>
<p>Let’s go ahead and initialize 2 identity transform matrices. This is the simplest case, and if we implement our bilinear sampler correctly, we should expect our output image to be almost identical to the input image.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># grab shape</span>
<span class="n">num_batch</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">,</span> <span class="n">C</span> <span class="o">=</span> <span class="n">input_img</span><span class="o">.</span><span class="n">shape</span>
<span class="c"># initialize M to identity transform</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">],</span> <span class="p">[</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">]])</span>
<span class="c"># repeat num_batch times</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">))</span>
</code></pre>
</div>
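<p>If the use of <code class="highlighter-rouge">np.resize</code> as a “repeat” looks surprising, the small sketch below compares it with an explicit <code class="highlighter-rouge">np.tile</code>. The batch size of 2 is hard-coded here as an assumption so that the snippet runs on its own.</p>

```python
import numpy as np

num_batch = 2  # assumed batch size, matching the two cat images

M = np.array([[1., 0., 0.], [0., 1., 0.]])

# np.resize fills the new shape by cycling through the flattened input,
# so asking for (num_batch, 2, 3) yields num_batch copies of M
M_resized = np.resize(M, (num_batch, 2, 3))

# the same result written as an explicit repeat along a new batch axis
M_tiled = np.tile(M[np.newaxis], (num_batch, 1, 1))
```

<p>Both arrays have shape <code class="highlighter-rouge">(2, 2, 3)</code> and hold the same values; <code class="highlighter-rouge">np.tile</code> is just a more explicit way of saying the same thing.</p>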
<p>(Recall that our general affine transformation matrix is <script type="math/tex">2 \times 3</script> if we want to include translation.)</p>
<p>Now we need to write a function that will generate a meshgrid for us and output a sampling grid resulting from the product of this meshgrid and our transformation matrix M.</p>
<p>Let’s go ahead and generate our meshgrid. We’ll create a normalized one, that is the values of x and y range from -1 to 1 and there are <code class="highlighter-rouge">width</code> and <code class="highlighter-rouge">height</code> of them respectively. In fact, note that for images, x corresponds to the width of the image (i.e. number of columns of the matrix) while y corresponds to the height of the image (i.e. number of rows of the matrix).</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># create normalized 2D grid</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">H</span><span class="p">)</span>
<span class="n">x_t</span><span class="p">,</span> <span class="n">y_t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre>
</div>
<p>Then we need to augment the dimensions to create homogeneous coordinates.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># reshape to (xt, yt, 1) </span>
<span class="n">ones</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">x_t</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
<span class="n">sampling_grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">([</span><span class="n">x_t</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">y_t</span><span class="o">.</span><span class="n">flatten</span><span class="p">(),</span> <span class="n">ones</span><span class="p">])</span>
</code></pre>
</div>
<p>So we’ve created 1 grid here, but we need <code class="highlighter-rouge">num_batch</code> grids. Same as above, our one-liner below repeats our array <code class="highlighter-rouge">num_batch</code> times.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># repeat grid num_batch times</span>
<span class="n">sampling_grid</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">resize</span><span class="p">(</span><span class="n">sampling_grid</span><span class="p">,</span> <span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">H</span><span class="o">*</span><span class="n">W</span><span class="p">))</span>
</code></pre>
</div>
<p>Now we perform step 2 of our image transformation pipeline.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># transform the sampling grid i.e. batch multiply</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">sampling_grid</span><span class="p">)</span>
<span class="c"># batch grid has shape (num_batch, 2, H*W)</span>
<span class="c"># reshape to (num_batch, height, width, 2)</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">num_batch</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">H</span><span class="p">,</span> <span class="n">W</span><span class="p">)</span>
<span class="n">batch_grids</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">moveaxis</span><span class="p">(</span><span class="n">batch_grids</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre>
</div>
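<p>Putting steps 1 and 2 together, the grid generation can be wrapped in a single function. The sketch below is one possible way to do it; the name <code class="highlighter-rouge">affine_grid</code> and the tiny 4x5 grid used to exercise it are assumptions made for illustration.</p>

```python
import numpy as np


def affine_grid(M, H, W):
    """Hypothetical helper: returns sampling grids of shape (num_batch, H, W, 2).

    M has shape (num_batch, 2, 3); the output holds the normalized (x, y)
    sampling coordinates for every output pixel.
    """
    num_batch = M.shape[0]
    # normalized meshgrid: x spans the width, y spans the height
    x_t, y_t = np.meshgrid(np.linspace(-1, 1, W), np.linspace(-1, 1, H))
    # homogeneous coordinates, shape (3, H*W)
    ones = np.ones(H * W)
    grid = np.vstack([x_t.flatten(), y_t.flatten(), ones])
    # one copy of the grid per batch element, shape (num_batch, 3, H*W)
    grid = np.resize(grid, (num_batch, 3, H * W))
    # batched matrix product, shape (num_batch, 2, H*W)
    out = np.matmul(M, grid)
    # reshape to (num_batch, H, W, 2)
    return np.moveaxis(out.reshape(num_batch, 2, H, W), 1, -1)


# identity transform repeated for a batch of 2, on a tiny 4x5 grid
identity = np.resize(np.array([[1., 0., 0.], [0., 1., 0.]]), (2, 2, 3))
grids = affine_grid(identity, H=4, W=5)
```

<p>With the identity transform, <code class="highlighter-rouge">grids[..., 0]</code> is just the x meshgrid and <code class="highlighter-rouge">grids[..., 1]</code> the y meshgrid, so the top-left entry is (-1, -1) and the bottom-right one is (1, 1).</p>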
<p>Finally, let’s write our bilinear sampler. Given our coordinates <code class="highlighter-rouge">x</code> and <code class="highlighter-rouge">y</code> in the sampling grid, we want to interpolate the pixel value in the original image.</p>
<p>Let’s start by separating the x and y dimensions and rescaling them to belong in the height/width interval.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x_s</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
<span class="n">y_s</span> <span class="o">=</span> <span class="n">batch_grids</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:,</span> <span class="mi">1</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">squeeze</span><span class="p">()</span>
<span class="c"># rescale x and y to [0, W/H]</span>
<span class="n">x</span> <span class="o">=</span> <span class="p">((</span><span class="n">x_s</span> <span class="o">+</span> <span class="mf">1.</span><span class="p">)</span> <span class="o">*</span> <span class="n">W</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.5</span>
<span class="n">y</span> <span class="o">=</span> <span class="p">((</span><span class="n">y_s</span> <span class="o">+</span> <span class="mf">1.</span><span class="p">)</span> <span class="o">*</span> <span class="n">H</span><span class="p">)</span> <span class="o">*</span> <span class="mf">0.5</span>
</code></pre>
</div>
<p>Now for each coordinate <script type="math/tex">(x_i, y_i)</script> we want to grab 4 corner coordinates.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># grab 4 nearest corner points for each (x_i, y_i)</span>
<span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">x0</span> <span class="o">+</span> <span class="mi">1</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">floor</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">int64</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">y0</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre>
</div>
<p>(Note that for non-integer coordinates we could just as well use the ceiling function rather than the increment by 1; at integer coordinates the ceiling equals the floor, so the two corners would coincide.)</p>
<p>Now we must make sure that no value goes beyond the image boundaries. For example, suppose we have <script type="math/tex">x = 399</script>, then <script type="math/tex">x_0 = 399</script> and <script type="math/tex">x_1 = x_0 + 1 = 400</script>, which would be an out-of-bounds index for a 400-pixel-wide image and raise an error in numpy. Thus we clip our corner coordinates in the following way:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># make sure it's inside img range [0, H-1] or [0, W-1]</span>
<span class="n">x0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">W</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">x1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">x1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">W</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">y0</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">H</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y1</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">clip</span><span class="p">(</span><span class="n">y1</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">H</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
</code></pre>
</div>
<p>Now we use advanced numpy indexing to grab the pixel value for each corner coordinate. These correspond to <code class="highlighter-rouge">(x0, y0)</code>, <code class="highlighter-rouge">(x0, y1)</code>, <code class="highlighter-rouge">(x1, y0)</code> and <code class="highlighter-rouge">(x1, y1)</code>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># look up pixel values at corner coords</span>
<span class="n">Ia</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x0</span><span class="p">]</span>
<span class="n">Ib</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x0</span><span class="p">]</span>
<span class="n">Ic</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y0</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span>
<span class="n">Id</span> <span class="o">=</span> <span class="n">input_img</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">num_batch</span><span class="p">)[:,</span><span class="bp">None</span><span class="p">,</span><span class="bp">None</span><span class="p">],</span> <span class="n">y1</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span>
</code></pre>
</div>
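<p>If the indexing expression above looks cryptic, the toy sketch below shows what it does on a small made-up array. The shapes and index values are assumptions chosen so the result is easy to inspect.</p>

```python
import numpy as np

num_batch, H, W, C = 2, 3, 4, 3  # small made-up shapes for illustration
input_img = np.arange(num_batch * H * W * C).reshape(num_batch, H, W, C)

# integer row (y) and column (x) indices, one per output pixel and batch element
y0 = np.zeros((num_batch, H, W), dtype=np.int64)         # always read row 0
x0 = np.full((num_batch, H, W), W - 1, dtype=np.int64)   # always read the last column

# batch index of shape (num_batch, 1, 1) broadcasts against (num_batch, H, W)
batch_idx = np.arange(num_batch)[:, None, None]

# result has shape (num_batch, H, W, C): one C-vector per looked-up location
Ia = input_img[batch_idx, y0, x0]
```

<p>The three index arrays broadcast to a common shape of <code class="highlighter-rouge">(num_batch, H, W)</code>, and the untouched channel axis is appended at the end, which is why each of <code class="highlighter-rouge">Ia</code> to <code class="highlighter-rouge">Id</code> comes out with the same shape as the input batch.</p>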
<p>Almost there! Now, we calculate the weight coefficients,</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># calculate deltas</span>
<span class="n">wa</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
<span class="n">wb</span> <span class="o">=</span> <span class="p">(</span><span class="n">x1</span><span class="o">-</span><span class="n">x</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y0</span><span class="p">)</span>
<span class="n">wc</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">x0</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y1</span><span class="o">-</span><span class="n">y</span><span class="p">)</span>
<span class="n">wd</span> <span class="o">=</span> <span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">x0</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">y</span><span class="o">-</span><span class="n">y0</span><span class="p">)</span>
</code></pre>
</div>
<p>and finally, multiply and add according to the formula mentioned previously.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># add dimension for addition</span>
<span class="n">wa</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wa</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wb</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wb</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wc</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wc</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">wd</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">expand_dims</span><span class="p">(</span><span class="n">wd</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="c"># compute output</span>
<span class="n">out</span> <span class="o">=</span> <span class="n">wa</span><span class="o">*</span><span class="n">Ia</span> <span class="o">+</span> <span class="n">wb</span><span class="o">*</span><span class="n">Ib</span> <span class="o">+</span> <span class="n">wc</span><span class="o">*</span><span class="n">Ic</span> <span class="o">+</span> <span class="n">wd</span><span class="o">*</span><span class="n">Id</span>
</code></pre>
</div>
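<p>For reference, here is one way the sampling steps could be gathered into a single function. Note that this is a variant of the snippets above rather than a copy of them: it rescales to <code class="highlighter-rouge">[0, W-1]</code> and <code class="highlighter-rouge">[0, H-1]</code> and clamps the lower corner so that the two corners never collapse onto the same pixel at the border. The function name and the random test batch are assumptions, and the sketch expects images at least 2 pixels wide and tall.</p>

```python
import numpy as np


def bilinear_sampler(input_img, x_s, y_s):
    """Sample input_img (B, H, W, C) at normalized coords x_s, y_s of shape (B, H, W).

    Variant of the steps above: coordinates map to [0, W-1] / [0, H-1] and the
    lower corner is clamped so that both corners always stay inside the image.
    """
    B, H, W, C = input_img.shape
    # rescale from [-1, 1] to the pixel index range
    x = (x_s + 1.0) * (W - 1) * 0.5
    y = (y_s + 1.0) * (H - 1) * 0.5
    # lower corner, clamped to leave room for the upper corner
    x0 = np.clip(np.floor(x).astype(np.int64), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(np.int64), 0, H - 2)
    x1, y1 = x0 + 1, y0 + 1
    # keep sample locations inside the image so the weights stay in [0, 1]
    x = np.clip(x, 0, W - 1)
    y = np.clip(y, 0, H - 1)
    # gather the four corner pixels, each of shape (B, H, W, C)
    b = np.arange(B)[:, None, None]
    Ia, Ib = input_img[b, y0, x0], input_img[b, y1, x0]
    Ic, Id = input_img[b, y0, x1], input_img[b, y1, x1]
    # interpolation weights, with a trailing axis to broadcast over channels
    wa = ((x1 - x) * (y1 - y))[..., None]
    wb = ((x1 - x) * (y - y0))[..., None]
    wc = ((x - x0) * (y1 - y))[..., None]
    wd = ((x - x0) * (y - y0))[..., None]
    return wa * Ia + wb * Ib + wc * Ic + wd * Id


# identity check on a small random batch (made-up data)
rng = np.random.default_rng(0)
img = rng.random((2, 6, 8, 3))
x_t, y_t = np.meshgrid(np.linspace(-1, 1, 8), np.linspace(-1, 1, 6))
x_s = np.resize(x_t, (2, 6, 8))
y_s = np.resize(y_t, (2, 6, 8))
out = bilinear_sampler(img, x_s, y_s)
```

<p>With this variant an untransformed normalized grid lands on pixel centers, so the identity transform reproduces the input up to floating point error. Sample locations that fall outside the image are clamped to the nearest border pixel.</p>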
<hr />
<p><a name="toc10"></a></p>
<h3 id="results">Results</h3>
<p>So now that we’ve gone through the whole code incrementally, let’s have some fun and experiment with different values of the transformation matrix M.</p>
<p>The first thing you need to do is copy and paste the whole code which has been made more modular. Now let’s test if our function works correctly.</p>
<p><strong>Identity Transform.</strong></p>
<p>Add the following lines at the end of the script and execute.</p>
<div class="highlighter-rouge"><pre class="highlight"><code>import matplotlib.pyplot as plt
plt.imshow(out[1])
plt.show()
</code></pre>
</div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft1.png" width="300" />
</p>
<p><strong>Translation.</strong></p>
<p>Say we want to translate the picture by <code class="highlighter-rouge">0.5</code> only in the x direction. This should shift the image to the left.</p>
<p>Edit the following line of your code as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>M = np.array([[1., 0., 0.5], [0., 1., 0.]])
</code></pre>
</div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft2.png" width="300" />
</p>
<p><strong>Rotation.</strong></p>
<p>Finally, say we want to rotate the picture by <code class="highlighter-rouge">45</code> degrees. Given that <script type="math/tex">\cos{(45^\circ)} = \sin{(45^\circ)} = \frac{\sqrt{2}}{2} \approx 0.707</script>, edit just this line of your code as follows:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>M = np.array([[0.707, -0.707, 0.], [0.707, 0.707, 0.]])
</code></pre>
</div>
<p align="center">
<img src="/assets/stn/bef1.png" width="200" /> <img src="/assets/stn/aft3.png" width="300" />
</p>
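<p>Rather than typing the rounded values by hand, the matrix can be computed from the angle. The helper below is a small sketch (its name is made up); the one thing to remember is that NumPy’s trigonometric functions expect radians.</p>

```python
import numpy as np


def rotation_matrix(degrees):
    """Hypothetical helper: 2x3 affine matrix rotating by the given angle, no translation."""
    theta = np.deg2rad(degrees)  # NumPy trig functions expect radians
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0]])


M = rotation_matrix(45)
```

<p>As before, <code class="highlighter-rouge">M</code> still needs to be repeated <code class="highlighter-rouge">num_batch</code> times before it is multiplied with the sampling grid.</p>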
<p><a name="toc11"></a></p>
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we went over basic linear transformations such as rotation, shear and scale before generalizing to affine transformations, which include translations. Then, we saw the importance of bilinear interpolation in the context of these transformations. Finally, we went over the algorithm, coded it from scratch in Python and wrote 2 methods that helped us visualize these transformations according to a 3-step image processing pipeline.</p>
<p>In the next installment of this series, we’ll go over the Spatial Transformer Network layer in detail as well as summarize the paper it is described in.</p>
<p>See you next week!</p>
<p><a name="toc12"></a></p>
<h3 id="references">References</h3>
<p>A big thank you to <a href="https://twitter.com/edersantana">Eder Santana</a> for introducing me to this paper!</p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Bilinear_interpolation">Bilinear Interpolation Wikipedia</a></li>
<li><a href="http://supercomputingblog.com/graphics/coding-bilinear-interpolation/">Bilinear Interpolation</a></li>
<li><a href="https://people.cs.clemson.edu/~dhouse/courses/401/notes/affines-matrices.pdf">Matrix Transformations PDF</a></li>
<li><a href="http://stackoverflow.com/questions/12729228/simple-efficient-bilinear-interpolation-of-images-in-numpy-and-python">Bilinear Interpolation Code</a></li>
</ul>
Tue, 10 Jan 2017 00:00:00 +0000
http://kevinzakka.github.io/2017/01/10/stn-part1/
deepmind, google, spatial transformer networks, transformations, affine, linear, bilinear interpolation
Nuts and Bolts of Applying Deep Learning
<div class="imgcap">
<img src="/assets/app_dl/bolts.jpg" width="40%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://nutsandbolts.mit.edu/">Image Courtesy</a></div>
</div>
<p>This weekend was very hectic (catching up on courses and studying for a statistics quiz), but I managed to squeeze in some time to watch the <a href="http://www.bayareadlschool.org/">Bay Area Deep Learning School</a> livestream on YouTube. For those of you wondering what that is, BADLS is a 2-day conference hosted at Stanford University, consisting of back-to-back presentations on a variety of topics ranging from NLP and Computer Vision to Unsupervised Learning and Reinforcement Learning. Additionally, top DL software libraries such as Torch, Theano and Tensorflow were presented.</p>
<p>There were some super interesting talks from leading experts in the field: <a href="http://www.dmi.usherb.ca/~larocheh/index_en.html">Hugo Larochelle</a> from Twitter, <a href="http://cs.stanford.edu/people/karpathy/">Andrej Karpathy</a> from OpenAI, <a href="http://www.iro.umontreal.ca/~bengioy/yoshua_en/index.html">Yoshua Bengio</a> from the Université de Montréal, and <a href="http://www.andrewng.org/">Andrew Ng</a> from Baidu to name a few. Of the plethora of presentations, there was one somewhat non-technical one given by Andrew that really piqued my interest.</p>
<p>In this blog post, I’m gonna try and give an overview of the main ideas outlined in his talk. The goal is to pause a bit and examine the ongoing trends in Deep Learning thus far, as well as gain some insight into applying DL in practice.</p>
<p>By the way, if you missed out on the livestreams, you can still view them at the following: <a href="https://www.youtube.com/watch?v=eyovmAtoUx0">Day 1</a> and <a href="https://www.youtube.com/watch?v=9dXiAecyJrY">Day 2</a>.</p>
<p><strong>Table of Contents</strong>:</p>
<ul>
<li><a href="#toc1">Major Deep Learning Trends</a></li>
<li><a href="#toc2">End-to-End Deep Learning</a></li>
<li><a href="#toc3">Bias-Variance Tradeoff</a></li>
<li><a href="#toc4">Human-level Performance</a></li>
<li><a href="#toc5">Personal Advice</a></li>
</ul>
<p><a name="toc1"></a></p>
<h3 id="major-deep-learning-trends">Major Deep Learning Trends</h3>
<p><strong>Why do DL algorithms work so well?</strong> According to Ng, with the rise of the Internet, Mobile and IOT era, the amount of data accessible to us has greatly increased. This correlates directly to a boost in the performance of neural network models, especially the larger ones which have the capacity to absorb all this data.</p>
<p align="center">
<img src="/assets/app_dl/perf_vs_data.png" width="450" />
</p>
<p>However, in the small data regime (left-hand side of the x-axis), the relative ordering of the algorithms is not that well defined and really depends on who is more motivated to engineer their features better, or refine and tune the hyperparameters of their model.</p>
<p>Thus this trend is more prevalent in the big data realm where hand engineering effectively gets replaced by end-to-end approaches and bigger neural nets combined with a lot of data tend to outperform all other models.</p>
<p><strong>Machine Learning and HPC team.</strong> The rise of big data and the need for larger models has started to put pressure on companies to hire a Computer Systems team. This is because some of the HPC (high-performance computing) applications require highly specialized knowledge and it is difficult to find researchers and engineers with sufficient knowledge in both fields. Thus, cooperation from both teams is the key to boosting performance in AI companies.</p>
<p><strong>Categorizing DL models.</strong> Work in DL can be categorized in the following 4 buckets:</p>
<p align="center">
<img src="/assets/app_dl/bucket.svg" width="350" />
</p>
<p>Most of the value in the industry today is driven by the models in the orange blob (innovation and monetization mostly) but Andrew believes that <strong>unsupervised deep learning</strong> is a super-exciting field that has loads of potential for the future.</p>
<p><a name="toc2"></a></p>
<h3 id="the-rise-of-end-to-end-dl">The rise of End-to-End DL</h3>
<p>A major improvement in the end-to-end approach has been the fact that outputs are becoming more and more complicated. For example, rather than just outputting a simple class score such as 0 or 1, algorithms are starting to generate richer outputs: images like in the case of GAN’s, full captions with RNN’s and most recently, audio like in DeepMind’s WaveNet.</p>
<p><strong>So what exactly does end-to-end training mean?</strong> Essentially, it means that AI practitioners are shying away from intermediate representations and going directly from one end (raw input) to the other end (output). Here’s an example from speech recognition.</p>
<p align="center">
<img src="/assets/app_dl/end-to-end.svg" width="340" />
</p>
<p><strong>Are there any disadvantages to this approach?</strong> End-to-end approaches are data hungry meaning they only perform well when provided with a huge dataset of labelled examples. In practice, not all applications have the luxury of large labelled datasets so other approaches which allow hand-engineered information and field expertise to be added into the model have gained the upper hand. As an example, in a self-driving car setting, going directly from the raw image to the steering direction is pretty difficult. Rather, many features such as trajectory and pedestrian location are calculated first as intermediate steps.</p>
<p>The main take-away from this section is that we should always be cautious of end-to-end approaches in applications where huge data is hard to come by.</p>
<p><a name="toc3"></a></p>
<h3 id="bias-variance-tradeoff">Bias-Variance Tradeoff</h3>
<p><strong>Splitting your data.</strong> In most deep learning problems, train and test come from different distributions. For example, suppose you are working on implementing an AI powered rearview mirror and have gathered 2 chunks of data: the first, larger chunk comes from many places (could be partly bought, and partly crowdsourced) and the second, much smaller chunk is actual car data.</p>
<p>In this case, splitting the data into train/dev/test can be tricky. One might be tempted to carve the dev set out of the training chunk like in the first example of the diagram below. (Note that the chunk on the left corresponds to data mined from the first distribution and the one on the right to the one from the second distribution.)</p>
<p align="center">
<img src="/assets/app_dl/split.svg" width="500" />
</p>
<p>This is bad because we usually want our dev and test sets to come from the same distribution. Part of the team will spend a lot of time tuning the model to work well on the dev set, so if the test set turned out to be very different from the dev set, pretty much all of that work would be wasted effort.</p>
<p>Hence, a smarter way of splitting the above dataset would be just like the second line of the diagram. Now in practice, Andrew recommends creating dev sets from both data distributions: a train-dev set and a test-dev set. In this manner, the gaps between the different errors tell you which problem to tackle first.</p>
<p align="center">
<img src="/assets/app_dl/errors.svg" width="450" />
</p>
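<p>To make the four-way split concrete, here is a minimal sketch that only shuffles indices. It assumes the large chunk and the small (target-distribution) chunk live in two separate arrays; the function name <code class="highlighter-rouge">split_two_distributions</code>, the split sizes and the 50/50 division of the small chunk are illustrative assumptions, not recommendations from the talk.</p>

```python
import numpy as np


def split_two_distributions(n_large, n_small, n_train_dev=100, seed=0):
    """Return index arrays for train / train-dev / test-dev / test.

    Hypothetical helper: train and train-dev index into the large chunk,
    test-dev and test index into the small, target-distribution chunk.
    """
    rng = np.random.default_rng(seed)
    large_idx = rng.permutation(n_large)  # shuffled indices into the large chunk
    small_idx = rng.permutation(n_small)  # shuffled indices into the small chunk
    half = n_small // 2
    return {
        "train": large_idx[n_train_dev:],
        "train_dev": large_idx[:n_train_dev],  # same distribution as train
        "test_dev": small_idx[:half],          # same distribution as test
        "test": small_idx[half:],
    }


splits = split_two_distributions(n_large=1000, n_small=100)
print({name: len(idx) for name, idx in splits.items()})
```

<p>Because train-dev shares a distribution with the training set and test-dev shares one with the test set, comparing the errors on each pair separates a variance problem from a distribution-mismatch problem.</p>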
<p><strong>Flowchart for working with a model.</strong> Given what we have described above, here’s a simplified flowchart of the actions you should take when confronted with training/tuning a DL model.</p>
<p align="center">
<img src="/assets/app_dl/flowachart.svg" width="500" />
</p>
<p><strong>The importance of data synthesis.</strong> Andrew also stressed the importance of data synthesis as part of any workflow in deep learning. While it may be painful to manually engineer training examples, the relative gain in performance you obtain once the parameters and the model fit well is huge and worth your while.</p>
<p><a name="toc4"></a></p>
<h3 id="human-level-performance">Human-level Performance</h3>
<p>One of the very important concepts underlined in this lecture was that of human-level performance. In the basic setting, DL models tend to plateau once they have reached or surpassed human-level accuracy. While it is important to note that human-level performance doesn’t necessarily coincide with the Bayes optimal error rate, it can serve as a very reliable proxy which can be leveraged to determine your next move when training your model.</p>
<p align="center">
<img src="/assets/app_dl/perf.png" width="550" />
</p>
<p><strong>Reasons for the plateau.</strong> There could be a theoretical limit on the dataset which makes further improvement futile (i.e. a noisy subset of the data). Humans are also very good at these tasks so trying to make progress beyond that suffers from diminishing returns.</p>
<p>Here’s an example that can help illustrate the usefulness of human-level accuracy. Suppose you are working on an image recognition task and measure the following:</p>
<ul>
<li><strong>Train error</strong>: 8%</li>
<li><strong>Dev Error</strong>: 10%</li>
</ul>
<p>If I were to tell you that human error for such a task is on the order of 1%, then this would be a blatant bias problem and you could subsequently try increasing the size of your model, training longer, etc. However, if I told you that human-level error was on the order of 7.5%, then this would be more of a variance problem and you’d focus your efforts on methods such as data synthesis or gathering data more similar to the test set.</p>
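<p>The reasoning in this example boils down to comparing two gaps. Here is a minimal sketch of that arithmetic, treating the human-level number as an error rate; the function name <code class="highlighter-rouge">diagnose</code> and the simple "larger gap wins" rule are my own simplification, not something prescribed in the talk.</p>

```python
def diagnose(human_error, train_error, dev_error):
    """Compare the bias gap with the variance gap (hypothetical helper)."""
    avoidable_bias = train_error - human_error  # distance from the human-level proxy
    variance = dev_error - train_error          # generalization gap
    return "bias" if avoidable_bias > variance else "variance"


# Numbers from the example above: train error 8%, dev error 10%.
print(diagnose(human_error=0.01, train_error=0.08, dev_error=0.10))   # bias gap 7% vs 2%
print(diagnose(human_error=0.075, train_error=0.08, dev_error=0.10))  # bias gap 0.5% vs 2%
```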
<p>By the way, there’s always room for improvement. Even if you are close to human-level accuracy overall, there could be subsets of the data where you perform poorly and working on those can boost production performance greatly.</p>
<p>Finally, one might ask what is a good way of defining human-level accuracy. For example, in the following image diagnosis setting, ignoring the cost of obtaining data, how should one pick the criteria for human-level accuracy?</p>
<ul>
<li><strong>typical human</strong>: 5%</li>
<li><strong>general doctor</strong>: 1%</li>
<li><strong>specialized doctor</strong>: 0.8%</li>
<li><strong>group of specialized doctors</strong>: 0.5%</li>
</ul>
<p>The answer is always the best accuracy possible. This is because, as we mentioned earlier, human-level performance is a proxy for the Bayes optimal error rate, so providing a more accurate upper bound to your performance can help you strategize your next move.</p>
<p><a name="toc5"></a></p>
<h3 id="personal-advice">Personal Advice</h3>
<p>Andrew ended the presentation with 2 ways one can improve his/her skills in the field of deep learning.</p>
<ul>
<li><strong>Practice, Practice, Practice</strong>: compete in Kaggle competitions and read associated blog posts and forum discussions.</li>
<li><strong>Do the Dirty Work</strong>: read a lot of papers and try to replicate the results. Soon enough, you’ll get your own ideas and build your own models.</li>
</ul>
Mon, 26 Sep 2016 00:00:00 +0000
http://kevinzakka.github.io/2016/09/26/applying-deep-learning/
deep learning, bias, variance, advice, end-to-end, machine learning
Deriving the Gradient for the Backward Pass of Batch Normalization
<div class="imgcap">
<img src="/assets/batch_norm/cs231n.png" width="70%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://cs231n.stanford.edu/">Image Courtesy</a></div>
</div>
<p>I recently sat down to work on assignment 2 of Stanford’s <a href="http://cs231n.github.io/assignments2016/assignment2/">CS231n</a>. It’s lengthy and definitely a step up from the first assignment, but the insight you gain is tremendous.</p>
<p>Anyway, at one point in the assignment, we were tasked with implementing a Batch Normalization layer in our fully-connected net which required writing a forward and backward pass.</p>
<p>The forward pass is relatively simple since it only requires standardizing the input features (zero mean and unit standard deviation). The backwards pass, on the other hand, is a bit more involved. It can be done in 2 different ways:</p>
<ul>
<li><strong>staged computation</strong>: we can break up the function into several parts, derive local gradients for them, and finally multiply them with the chain rule.</li>
<li><strong>gradient derivation</strong>: basically, you have to do a “pen and paper” derivation of the gradient with respect to the inputs.</li>
</ul>
<p>It turns out that the second option is faster, albeit nastier, and after struggling for a few hours, I finally got it to work. This post is mainly a clear summary of the derivation along with my thought process, and I hope it can provide others with insight into and intuition for the chain rule. There is a similar tutorial online already (but I couldn’t follow along very well) so if you want to check it out, head over to <a href="http://cthorey.github.io./backpropagation/">Clément Thorey’s Blog</a>.</p>
<p>Finally, I’ve summarized the original <a href="https://arxiv.org/abs/1502.03167">research paper</a> and accompanied it with a small numpy implementation which you can view on my <a href="https://github.com/kevinzakka/research-paper-notes">Github</a>. With that being said, let’s jump right into the blog.</p>
<h3 id="notation">Notation</h3>
<p>Let’s start with some notation.</p>
<ul>
<li><strong>BN</strong> will stand for Batch Norm.</li>
<li><script type="math/tex">f</script> represents a layer upwards of the BN one.</li>
<li><script type="math/tex">y</script> is the linear transformation which scales <script type="math/tex">\hat{x}</script> by <script type="math/tex">\gamma</script> and adds <script type="math/tex">\beta</script>.</li>
<li><script type="math/tex">\hat{x}</script> are the normalized inputs.</li>
<li><script type="math/tex">\mu</script> is the batch mean.</li>
<li><script type="math/tex">\sigma^2</script> is the batch variance.</li>
</ul>
<p>The below table shows you the inputs to each function and will help with the future derivation.</p>
<p align="center">
<img src="\assets\batch_norm\table0.png" width="380" />
</p>
<p><strong>Goal</strong>: Find the partial derivatives with respect to the inputs, that is <script type="math/tex">\dfrac{\partial f}{\partial \gamma}</script>, <script type="math/tex">\dfrac{\partial f}{\partial \beta}</script> and <script type="math/tex">\dfrac{\partial f}{\partial x_i}</script>.</p>
<p><strong>Methodology</strong>: derive the gradient with respect to the normalized inputs <script type="math/tex">\hat{x}_i</script>, then the gradients w.r.t. <script type="math/tex">\mu</script> and <script type="math/tex">\sigma^2</script>, and finally use those to derive the one for <script type="math/tex">x_i</script>.</p>
<h3 id="chain-rule-primer">Chain Rule Primer</h3>
<p>Suppose we’re given a function <script type="math/tex">u(x, y)</script> where <script type="math/tex">x(r, t)</script> and <script type="math/tex">y(r, t)</script>. Then to determine the value of <script type="math/tex">\frac{\partial u}{\partial r}</script> and <script type="math/tex">\frac{\partial u}{\partial t}</script> we need to use the chain rule which says that:</p>
<script type="math/tex; mode=display">\frac{\partial u}{\partial r} = \frac{\partial u}{\partial x} \cdot \frac{\partial x}{\partial r} + \frac{\partial u}{\partial y} \cdot \frac{\partial y}{\partial r}</script>
<p>That’s basically all there is to it. Using this simple concept can help us solve our problem. We just have to be clear and precise when using it and not get lost with the intermediate variables.</p>
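<p>To see the chain rule in action, here is a small worked check. The concrete functions <code class="highlighter-rouge">u = x^2 * y</code>, <code class="highlighter-rouge">x = r * t</code> and <code class="highlighter-rouge">y = r + t</code> are arbitrary choices of mine for illustration; the sketch compares the chain-rule result against a central finite difference.</p>

```python
def x(r, t):
    return r * t


def y(r, t):
    return r + t


def u(x_val, y_val):
    return x_val ** 2 * y_val


def du_dr_chain(r, t):
    """du/dr via the chain rule: du/dx * dx/dr + du/dy * dy/dr."""
    x_val, y_val = x(r, t), y(r, t)
    du_dx = 2 * x_val * y_val  # partial of x^2 * y w.r.t. x
    du_dy = x_val ** 2         # partial of x^2 * y w.r.t. y
    dx_dr = t                  # partial of r * t w.r.t. r
    dy_dr = 1.0                # partial of r + t w.r.t. r
    return du_dx * dx_dr + du_dy * dy_dr


def du_dr_numeric(r, t, h=1e-6):
    """Central finite-difference estimate of du/dr."""
    plus = u(x(r + h, t), y(r + h, t))
    minus = u(x(r - h, t), y(r - h, t))
    return (plus - minus) / (2 * h)


print(du_dr_chain(2.0, 3.0), du_dr_numeric(2.0, 3.0))
```

<p>Both numbers should agree to several decimal places. The same kind of numerical check is a handy way to validate a hand-derived gradient.</p>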
<h3 id="partial-derivatives">Partial Derivatives</h3>
<p>Here’s the gist of BN taken from the paper.</p>
<p align="center">
<img src="\assets\batch_norm\alg1.png" width="380" />
</p>
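<p>For reference, the forward pass summarized in the algorithm above can be sketched in numpy as follows. This is a training-mode-only sketch under my own assumptions: it skips the running averages used at test time, the default <code class="highlighter-rouge">eps</code> is arbitrary, and the cache layout <code class="highlighter-rouge">(x_mu, inv_var, x_hat, gamma)</code> is chosen to match what the backward pass later in this post unpacks. Note that <code class="highlighter-rouge">inv_var</code> here holds the inverse standard deviation, which is how the backward pass uses it.</p>

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm forward pass for an (N, D) input (sketch)."""
    mu = x.mean(axis=0)                 # batch mean, shape (D,)
    var = x.var(axis=0)                 # batch variance, shape (D,)
    x_mu = x - mu                       # centered inputs
    inv_var = 1.0 / np.sqrt(var + eps)  # inverse standard deviation
    x_hat = x_mu * inv_var              # normalized inputs
    out = gamma * x_hat + beta          # scale and shift
    cache = (x_mu, inv_var, x_hat, gamma)
    return out, cache


rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((64, 5)) + 2.0
out, cache = batchnorm_forward(x, gamma=np.ones(5), beta=np.zeros(5))
print(out.mean(axis=0), out.std(axis=0))  # close to 0 and 1 per feature
```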
<p>We’re gonna start by traversing the table from left to right. At each step we derive the gradient with respect to the inputs in the cell.</p>
<h4 id="cell-1">Cell 1</h4>
<p align="center">
<img src="\assets\batch_norm\table1.png" width="380" />
</p>
<p>Let’s compute <script type="math/tex">\dfrac{\partial f}{\partial y_i}</script>. It actually turns out we don’t need to compute this derivative since we already have it - it’s the upstream derivative <code class="highlighter-rouge">dout</code> given to us in the function parameter.</p>
<h4 id="cell-2">Cell 2</h4>
<p align="center">
<img src="\assets\batch_norm\table2.png" width="380" />
</p>
<p>Let’s work on cell 2 now. We note that <script type="math/tex">y</script> is a function of three variables, so let’s compute the gradient with respect to each one.</p>
<hr />
<p>Starting with <script type="math/tex">\gamma</script> and using the chain rule:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \gamma} &=& \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \gamma} \qquad \\
&=& \boxed{\sum\limits_{i=1}^m \frac{\partial f}{\partial y_i} \cdot \hat{x}_i}
\end{eqnarray} %]]></script>
<p>Notice that we sum from <script type="math/tex">1 \rightarrow m</script> because we’re working with batches! If you’re worried you wouldn’t have caught that, think about the dimensions. The gradient with respect to a variable should be of the same size as that same variable so if those two clash, it should tell you you’ve done something wrong.</p>
<hr />
<p>Moving on to <script type="math/tex">\beta</script> we compute the gradient as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \beta} &=& \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \beta} \qquad \\
&=& \boxed{\sum\limits_{i=1}^m \frac{\partial f}{\partial y_i}}
\end{eqnarray} %]]></script>
<hr />
<p>and finally <script type="math/tex">\hat{x}_i</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \hat{x}_i} &=& \frac{\partial f}{\partial y_i} \cdot \frac{\partial y_i}{\partial \hat{x}_i} \qquad \\
&=& \boxed{\frac{\partial f}{\partial y_i} \cdot \gamma}
\end{eqnarray} %]]></script>
<hr />
<p>Up to now, things are relatively simple and we’ve already done 2/3 of the work. We <strong>can’t</strong> compute the gradient with respect to <script type="math/tex">x_i</script> just yet though.</p>
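<p>The three boxed results for cell 2 translate almost one-to-one into numpy. This is a minimal sketch on random data, assuming an <code class="highlighter-rouge">(m, d)</code> batch where the sums run over the batch axis; the shapes and variable names are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3                            # batch size and number of features
x_hat = rng.standard_normal((m, d))    # normalized inputs
dout = rng.standard_normal((m, d))     # upstream gradient df/dy
gamma = rng.standard_normal(d)

dgamma = np.sum(dout * x_hat, axis=0)  # sum over the batch of df/dy_i * x_hat_i
dbeta = np.sum(dout, axis=0)           # sum over the batch of df/dy_i
dxhat = dout * gamma                   # df/dy_i * gamma, no sum needed

# Each gradient has the same shape as the variable it belongs to.
print(dgamma.shape, dbeta.shape, dxhat.shape)
```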
<h4 id="cell-3">Cell 3</h4>
<p align="center">
<img src="\assets\batch_norm\table3.png" width="380" />
</p>
<hr />
<p>We start with <script type="math/tex">\mu</script> and notice that <script type="math/tex">\sigma^2</script> is a function of <script type="math/tex">\mu</script>, therefore we need to add its contribution to the partial - (I’ve highlighted the missing partials in red):</p>
<script type="math/tex; mode=display">\dfrac{\partial f}{\partial \mu} = \frac{\partial f}{\partial \hat{x}_i} \cdot \color{red}{\frac{\partial \hat{x}_i}{\partial \mu}} + \color{red}{\frac{\partial f}{\partial \sigma^2}} \cdot \color{red}{\frac{\partial \sigma^2}{\partial\mu}}</script>
<p>Let’s compute the missing partials one at a time.</p>
<p>From</p>
<script type="math/tex; mode=display">\hat{x}_i = \frac{(x_i - \mu)}{\sqrt{\sigma^2 + \epsilon}}</script>
<p>we compute:</p>
<script type="math/tex; mode=display">\boxed{\dfrac{\partial \hat{x}_i}{\partial \mu} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \cdot (-1)}</script>
<p>and from</p>
<script type="math/tex; mode=display">\sigma^2 = \frac{1}{m} \sum\limits_{i=1}^m (x_i - \mu)^2</script>
<p>we calculate:</p>
<script type="math/tex; mode=display">\boxed{\dfrac{\partial \sigma^2}{\partial \mu} = \frac{1}{m} \sum\limits_{i=1}^m 2 \cdot (x_i - \mu)\cdot (-1)}</script>
<p>We’re missing the partial with respect to <script type="math/tex">\sigma^2</script> and that is our next variable, so let’s get to it and come back and plug it in here.</p>
<hr />
<p>Ok so in the expression of the partial:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \sigma^2} &=& \frac{\partial f}{\partial \hat{x}} \cdot \frac{\partial \hat{x}}{\partial \sigma^2} \qquad \\
\end{eqnarray} %]]></script>
<p>let’s compute <script type="math/tex">\dfrac{\partial \hat{x}}{\partial \sigma^2}</script> in more detail. I’m gonna rewrite <script type="math/tex">\hat{x}</script> to make its derivative easier to compute:</p>
<script type="math/tex; mode=display">\hat{x}_i = (x_i - \mu)(\sigma^2 + \epsilon)^{-0.5}</script>
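<p><script type="math/tex">(x_i - \mu)</script> does not depend on <script type="math/tex">\sigma^2</script>, so we can treat it as a constant when differentiating each <script type="math/tex">\hat{x}_i</script>. Summing the contributions of all <script type="math/tex">m</script> examples in the batch then gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\dfrac{\partial \hat{x}_i}{\partial \sigma^2} &=& (x_i - \mu) \cdot (-0.5) \cdot (\sigma^2 + \epsilon)^{-0.5 - 1} \qquad \\
\dfrac{\partial f}{\partial \sigma^2} &=& -0.5 \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot (x_i - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5}
\end{eqnarray} %]]></script>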
<hr />
<p>With all that out of the way, let’s plug everything back in our previous partial!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial \mu} &=& \bigg(\sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \bigg) + \bigg( \frac{\partial f}{\partial \sigma^2} \cdot \frac{1}{m} \sum\limits_{i=1}^m -2(x_i - \mu) \bigg) \qquad \\
&=& \bigg(\sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \bigg) + \bigg( \frac{\partial f}{\partial \sigma^2} \cdot (-2) \cdot \bigg( \frac{1}{m} \sum\limits_{i=1}^m x_i - \frac{1}{m} \sum\limits_{i=1}^m \mu \bigg) \bigg) \qquad \\
&=& \bigg(\sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \bigg) + \bigg( \frac{\partial f}{\partial \sigma^2} \cdot (-2) \cdot \bigg( \mu - \frac{m \cdot \mu}{m} \bigg) \bigg) \qquad \\
&=& \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}} \qquad \\
\end{eqnarray} %]]></script>
<p>Thus we have:</p>
<script type="math/tex; mode=display">\boxed{\frac{\partial f}{\partial \mu} = \sum\limits_{i=1}^m \frac{\partial f}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}}</script>
<p>EDIT: Just to make it clear, there’s a summation in <script type="math/tex">\dfrac{\partial \hat{x}_i}{\partial \mu}</script> because we want the dimensions to add up with respect to <code class="highlighter-rouge">dfdmu</code> and not <code class="highlighter-rouge">dxhatdmu</code>.</p>
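<p>The reason the second term disappears is that the deviations from the batch mean always sum to zero. Here is a quick numerical sanity check of that step, a sketch on random data with arbitrary shapes:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))  # batch of m = 8 examples with 3 features
mu = x.mean(axis=0)

# d(sigma^2)/d(mu) = (1/m) * sum_i -2 * (x_i - mu)
dvar_dmu = np.mean(-2.0 * (x - mu), axis=0)

print(np.allclose(dvar_dmu, 0.0))  # zero up to floating-point error
```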
<hr />
<p>We finally arrive at the last variable <script type="math/tex">x</script>. Again adding the contributions from any parameter containing <script type="math/tex">x</script> we obtain:</p>
<script type="math/tex; mode=display">\dfrac{\partial f}{\partial x_i} = \frac{\partial f}{\partial \hat{x}_i} \cdot \color{red}{\frac{\partial \hat{x}_i}{\partial x_i}} + \frac{\partial f}{\partial \mu} \cdot \color{red}{\frac{\partial \mu}{\partial x_i}} + \frac{\partial f}{\partial \sigma^2} \cdot \color{red}{\frac{\partial \sigma^2}{\partial x_i}}</script>
<p>The missing pieces are super easy to compute at this point.</p>
<script type="math/tex; mode=display">\dfrac{\partial \hat{x}_i}{\partial x_i} = \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}</script>
<script type="math/tex; mode=display">\dfrac{\partial \mu}{\partial x_i} = \dfrac{1}{m}</script>
<script type="math/tex; mode=display">\dfrac{\partial \sigma^2}{\partial x_i} = \dfrac{2(x_i - \mu)}{m}</script>
<p>That’s it, our final gradient is</p>
<script type="math/tex; mode=display">\frac{\partial f}{\partial x_i} = \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}\bigg) + \bigg(\frac{\partial f}{\partial \mu} \cdot \dfrac{1}{m}\bigg) + \bigg(\frac{\partial f}{\partial \sigma^2} \cdot \dfrac{2(x_i - \mu)}{m}\bigg)</script>
<p><span style="color:red"><strong>Note the following trick</strong></span></p>
<script type="math/tex; mode=display">\boxed{(\sigma^2 + \epsilon)^{-1.5} = (\sigma^2 + \epsilon)^{-0.5}(\sigma^2 + \epsilon)^{-1} = (\sigma^2 + \epsilon)^{-0.5} \frac{1}{\sqrt{\sigma^2 + \epsilon}}\frac{1}{\sqrt{\sigma^2 + \epsilon}}}</script>
<p>With that in mind, let’s plug in the partials and see if we can simplify the expression some more.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{eqnarray}
\frac{\partial f}{\partial x_i} &=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}\bigg) + \bigg(\frac{\partial f}{\partial \mu} \cdot \dfrac{1}{m}\bigg) + \bigg(\frac{\partial f}{\partial \sigma^2} \cdot \dfrac{2(x_i - \mu)}{m}\bigg) \qquad \\
&=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}}\bigg) + \bigg(\frac{1}{m} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}\bigg) - \bigg(0.5 \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot (x_j - \mu) \cdot (\sigma^2 + \epsilon)^{-1.5} \cdot \dfrac{2(x_i - \mu)}{m} \bigg) \qquad \\
&=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot (\sigma^2 + \epsilon)^{-0.5} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \cdot \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \frac{(x_j - \mu)}{\sqrt{\sigma^2 + \epsilon}} \bigg )\qquad \\
&=& \bigg(\frac{\partial f}{\partial \hat{x}_i} \cdot (\sigma^2 + \epsilon)^{-0.5} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \bigg) - \bigg(\frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \cdot \hat{x}_i \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \hat{x}_j \bigg )\qquad \\
\end{eqnarray} %]]></script>
<p>Finally, we factorize by the <code class="highlighter-rouge">sigma + epsilon</code> factor and obtain:</p>
<script type="math/tex; mode=display">\boxed{\frac{\partial f}{\partial x_i} = \frac{(\sigma^2 + \epsilon)^{-0.5}}{m} \bigg [\color{red}{m \frac{\partial f}{\partial \hat{x}_i}} - \color{blue}{\sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j}} - \color{green}{\hat{x}_i \sum\limits_{j=1}^m \frac{\partial f}{\partial \hat{x}_j} \cdot \hat{x}_j}\bigg ]}</script>
<h3 id="recap">Recap</h3>
<p>For organizational purposes, let’s summarize the main equations we were able to derive. Using <script type="math/tex">\dfrac{\partial f}{\partial \hat{x}_i} = \dfrac{\partial f}{\partial y_i} \cdot \gamma</script>, we obtain the gradient with respect to our inputs:</p>
<script type="math/tex; mode=display">\boxed{\color{red}{\frac{\partial f}{\partial \beta} = \sum\limits_{i=1}^m \frac{\partial f}{\partial y_i}}}</script>
<script type="math/tex; mode=display">\boxed{\color{blue}{\frac{\partial f}{\partial \gamma} = \sum\limits_{i=1}^m \frac{\partial f}{\partial y_i} \cdot \hat{x}_i}}</script>
<script type="math/tex; mode=display">\boxed{\frac{\partial f}{\partial x_i} = \frac{\color{red}{m \dfrac{\partial f}{\partial \hat{x}_i}} - \color{blue}{\sum\limits_{j=1}^m \dfrac{\partial f}{\partial \hat{x}_j}} - \color{green}{\hat{x}_i \sum\limits_{j=1}^m \dfrac{\partial f}{\partial \hat{x}_j} \cdot \hat{x}_j}}{m\sqrt{\sigma^2 + \epsilon}}}</script>
<h3 id="python-implementation">Python Implementation</h3>
<p>Here’s an example implementation using the equations we derived. <code class="highlighter-rouge">dx</code> is 88 characters long so I’m still wondering how the course instructors were able to write it in less than 80 - maybe shorter variable names?</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">batchnorm_backward</span><span class="p">(</span><span class="n">dout</span><span class="p">,</span> <span class="n">cache</span><span class="p">):</span>
    <span class="n">N</span><span class="p">,</span> <span class="n">D</span> <span class="o">=</span> <span class="n">dout</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">x_mu</span><span class="p">,</span> <span class="n">inv_var</span><span class="p">,</span> <span class="n">x_hat</span><span class="p">,</span> <span class="n">gamma</span> <span class="o">=</span> <span class="n">cache</span>
    <span class="c"># intermediate partial derivatives</span>
    <span class="n">dxhat</span> <span class="o">=</span> <span class="n">dout</span> <span class="o">*</span> <span class="n">gamma</span>
    <span class="c"># final partial derivatives</span>
    <span class="n">dx</span> <span class="o">=</span> <span class="p">(</span><span class="mf">1.</span> <span class="o">/</span> <span class="n">N</span><span class="p">)</span> <span class="o">*</span> <span class="n">inv_var</span> <span class="o">*</span> <span class="p">(</span><span class="n">N</span><span class="o">*</span><span class="n">dxhat</span> <span class="o">-</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dxhat</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
        <span class="o">-</span> <span class="n">x_hat</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dxhat</span><span class="o">*</span><span class="n">x_hat</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span>
    <span class="n">dbeta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">dout</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">dgamma</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">x_hat</span><span class="o">*</span><span class="n">dout</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">dx</span><span class="p">,</span> <span class="n">dgamma</span><span class="p">,</span> <span class="n">dbeta</span>
</code></pre>
</div>
<p>This version of the batchnorm backward pass can give you a significant boost in speed. I timed both versions and got a superb threefold increase in speed:</p>
<p align="center">
<img src="/assets/speedup.png" width="400" />
</p>
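<p>Speed aside, it’s worth confirming that the simplified expression is actually correct. Here is a minimal gradient check against central finite differences. It repeats the backward pass from above so that the snippet runs on its own; the forward pass, the helper <code class="highlighter-rouge">numerical_gradient</code>, the random test data and the step size <code class="highlighter-rouge">h</code> are my own assumptions and not the CS231n reference code.</p>

```python
import numpy as np


def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode forward pass (sketch); returns the output and a cache."""
    x_mu = x - x.mean(axis=0)
    inv_var = 1.0 / np.sqrt(x.var(axis=0) + eps)  # inverse standard deviation
    x_hat = x_mu * inv_var
    return gamma * x_hat + beta, (x_mu, inv_var, x_hat, gamma)


def batchnorm_backward(dout, cache):
    """Backward pass using the simplified expression derived in this post."""
    N, D = dout.shape
    x_mu, inv_var, x_hat, gamma = cache
    dxhat = dout * gamma
    dx = (1. / N) * inv_var * (N * dxhat - np.sum(dxhat, axis=0)
                               - x_hat * np.sum(dxhat * x_hat, axis=0))
    dbeta = np.sum(dout, axis=0)
    dgamma = np.sum(x_hat * dout, axis=0)
    return dx, dgamma, dbeta


def numerical_gradient(f, param, dout, h=1e-5):
    """Central-difference gradient of sum(f(param) * dout) w.r.t. param."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + h
        pos = f(param)
        param[idx] = old - h
        neg = f(param)
        param[idx] = old  # restore the original value
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5))
gamma = rng.standard_normal(5)
beta = rng.standard_normal(5)
dout = rng.standard_normal((4, 5))

_, cache = batchnorm_forward(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(dout, cache)

dx_num = numerical_gradient(lambda v: batchnorm_forward(v, gamma, beta)[0], x, dout)
dgamma_num = numerical_gradient(lambda g: batchnorm_forward(x, g, beta)[0], gamma, dout)
dbeta_num = numerical_gradient(lambda b: batchnorm_forward(x, gamma, b)[0], beta, dout)

print("max dx error:    ", np.max(np.abs(dx - dx_num)))
print("max dgamma error:", np.max(np.abs(dgamma - dgamma_num)))
print("max dbeta error: ", np.max(np.abs(dbeta - dbeta_num)))
```

<p>All three errors should be tiny. A large discrepancy in <code class="highlighter-rouge">dx</code> usually points to a sign error or a missing sum over the batch.</p>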
<h3 id="conclusion">Conclusion</h3>
<p>In this blog post, we learned how to use the chain rule in a staged manner to derive the expression for the gradient of the batch norm layer. We also saw how a smart simplification can help significantly reduce the complexity of the expression for <code class="highlighter-rouge">dx</code>. Finally, we implemented the backward pass in Python using the code from CS231n. This version of the function resulted in a 3x speed increase!</p>
<p>If you’re interested in the staged computation method, head over to <a href="https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html">Kratzert’s nicely written post</a>.</p>
<p>Cheers!</p>
Wed, 14 Sep 2016 00:00:00 +0000
http://kevinzakka.github.io/2016/09/14/batch_normalization/
batch normalization, gradient, chain rule, cs231n
A Complete Guide to K-Nearest-Neighbors with Applications in Python and R
<div class="imgcap">
<img src="/assets/row.png" width="50%" style="border:none;" /><div class="thecap" style="text-align:center"><a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Image Courtesy</a></div>
</div>
<p>This is an in-depth tutorial designed to introduce you to a simple, yet powerful classification algorithm called K-Nearest-Neighbors (KNN). We will go over the intuition and mathematical detail of the algorithm, apply it to a real-world dataset to see exactly how it works, and gain a deeper understanding of its inner workings by writing it from scratch in code. Finally, we will explore ways in which we can improve the algorithm.</p>
<p>For the full code that appears on this page, visit my <a href="https://github.com/kevinzakka/blog-code/">GitHub repository</a>.</p>
<h2 id="table-of-contents">Table of Contents</h2>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#what-is-knn">What is KNN?</a></li>
<li><a href="#how-does-knn-work">How does KNN work?</a></li>
<li><a href="#more-on-k">More on K</a></li>
<li><a href="#exploring-knn-in-code">Exploring KNN in Code</a></li>
<li><a href="#parameter-tuning-with-cross-validation">Parameter Tuning with Cross Validation</a></li>
<li><a href="#writing-our-own-knn-from-scratch">Writing our Own KNN from Scratch</a></li>
<li><a href="#pros-and-cons-of-knn">Pros and Cons of KNN</a></li>
<li><a href="#improvements">Improvements</a></li>
<li><a href="#tutorial-summary">Tutorial Summary</a></li>
</ol>
<h2 id="introduction">Introduction</h2>
<p>The KNN algorithm is a robust and versatile classifier that is often used as a benchmark for more complex classifiers such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications such as economic forecasting, data compression and genetics. For example, KNN was leveraged in a 2006 <a href="http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-S1-S11">study</a> of functional genomics for the assignment of genes based on their expression profiles.</p>
<h2 id="what-is-knn">What is KNN?</h2>
<p>Let’s first start by establishing some definitions and notations. We will use <script type="math/tex">x</script> to denote a <em>feature</em> (aka. predictor, attribute) and <script type="math/tex">y</script> to denote the <em>target</em> (aka. label, class) we are trying to predict.</p>
<p>KNN falls in the <strong>supervised learning</strong> family of algorithms. Informally, this means that we are given a labelled dataset consisting of training observations <script type="math/tex">(x,y)</script> and would like to capture the relationship between <script type="math/tex">x</script> and <script type="math/tex">y</script>. More formally, our goal is to learn a function <script type="math/tex">h : X \rightarrow Y</script> so that given an unseen observation <script type="math/tex">x</script>, <script type="math/tex">h(x)</script> can confidently predict the corresponding output <script type="math/tex">y</script>.</p>
<p>The KNN classifier is also a <strong>non-parametric</strong> and <strong>instance-based</strong> learning algorithm.</p>
<ul>
<li><strong>Non-parametric</strong> means it makes no explicit assumptions about the functional form of h, avoiding the dangers of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.</li>
<li><strong>Instance-based</strong> learning means that our algorithm doesn’t explicitly learn a model. Instead, it chooses to memorize the training instances which are subsequently used as “knowledge” for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input), will the algorithm use the training instances to spit out an answer.</li>
</ul>
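<p>To make the instance-based idea concrete, here is a minimal sketch of a "lazy" learner. The class name <code class="highlighter-rouge">LazyNearestNeighbor</code> and its methods are made up for illustration (they are not from any library), and it only looks at the single closest point. Notice that fitting does nothing but store the data, while all the real work happens at prediction time.</p>

```python
import math


class LazyNearestNeighbor:
    """Toy 1-nearest-neighbor classifier (hypothetical, for illustration only)."""

    def fit(self, X, y):
        # "training" only memorizes the data; no parameters are estimated
        self.X = list(X)
        self.y = list(y)
        return self

    def predict_one(self, x):
        # all of the work happens at query time: scan every stored instance
        best = min(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))
        return self.y[best]


model = LazyNearestNeighbor().fit([(0.0, 0.0), (5.0, 5.0)], ["a", "b"])
label = model.predict_one((4.0, 4.5))
print(label)
```

<p>The query point sits much closer to <code class="highlighter-rouge">(5.0, 5.0)</code>, so it inherits that point's label.</p>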
<blockquote>
<p>KNN is non-parametric, instance-based and used in a supervised learning setting.</p>
</blockquote>
<p>It is worth noting that the minimal training phase of KNN comes both at a <em>memory cost</em>, since we must store a potentially huge data set, as well as a <em>computational cost</em> during test time since classifying a given observation requires a run down of the whole data set. Practically speaking, this is undesirable since we usually want fast responses.</p>
<blockquote>
<p>Minimal training but expensive testing.</p>
</blockquote>
<h2 id="how-does-knn-work">How does KNN work?</h2>
<p>In the classification setting, the K-nearest neighbor algorithm essentially boils down to forming a majority vote between the K most similar instances to a given “unseen” observation. Similarity is defined according to a distance metric between two data points. A popular choice is the Euclidean distance given by</p>
<script type="math/tex; mode=display">d(x, x') = \sqrt{\left(x_1 - x'_1 \right)^2 + \left(x_2 - x'_2 \right)^2 + \dotsc + \left(x_n - x'_n \right)^2}</script>
<p>but other measures can be more suitable for a given setting and include the Manhattan, Chebyshev and Hamming distance.</p>
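<p>If the formulas feel abstract, the four distances mentioned above can be sketched in a few lines of plain Python. The function names are my own, and I'm sticking to the standard library here so the snippet runs without any extra dependencies.</p>

```python
import math


def euclidean(a, b):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    # sum of absolute coordinate differences
    return sum(abs(p - q) for p, q in zip(a, b))


def chebyshev(a, b):
    # largest absolute coordinate difference
    return max(abs(p - q) for p, q in zip(a, b))


def hamming(a, b):
    # number of positions at which two equal-length sequences differ
    return sum(p != q for p, q in zip(a, b))


x, x_prime = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(x, x_prime), manhattan(x, x_prime), chebyshev(x, x_prime))
print(hamming("karolin", "kathrin"))
```

<p>Hamming distance is the odd one out: it counts mismatched positions, which makes it a natural fit for categorical or string-like features.</p>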
<p>More formally, given a positive integer K, an unseen observation <script type="math/tex">x</script> and a distance metric <script type="math/tex">d</script>, the KNN classifier performs the following two steps:</p>
<ul>
<li>
<p>It runs through the whole dataset computing <script type="math/tex">d</script> between <script type="math/tex">x</script> and each training observation. We’ll call the K points in the training data that are closest to <script type="math/tex">x</script> the set <script type="math/tex">\mathcal{A}</script>. Note that K is usually chosen to be odd, which rules out tied votes when there are only two classes.</p>
</li>
<li>
<p>It then estimates the conditional probability for each class, that is, the fraction of points in <script type="math/tex">\mathcal{A}</script> with that given class label. (Note <script type="math/tex">I(x)</script> is the indicator function which evaluates to <script type="math/tex">1</script> when the argument <script type="math/tex">x</script> is true and <script type="math/tex">0</script> otherwise)</p>
</li>
</ul>
<script type="math/tex; mode=display">P(y = j | X = x) = \frac{1}{K} \sum_{i \in \mathcal{A}} I(y^{(i)} = j)</script>
<p>Finally, our input <script type="math/tex">x</script> gets assigned to the class with the largest probability.</p>
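<p>Here is a small sketch of that second step, assuming we have already found the neighbor set. The labels below are invented for the example, and <code class="highlighter-rouge">class_probabilities</code> is just a helper name I picked.</p>

```python
from collections import Counter


def class_probabilities(neighbor_labels):
    """Fraction of the K neighbors in A that carry each class label."""
    K = len(neighbor_labels)
    counts = Counter(neighbor_labels)
    return {label: count / K for label, count in counts.items()}


# hypothetical labels of the K = 5 nearest neighbors of some query point
labels_in_A = ["setosa", "versicolor", "versicolor", "virginica", "versicolor"]
probs = class_probabilities(labels_in_A)

# the prediction is the class with the largest estimated probability
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

<p>Three of the five neighbors are versicolor, so that class gets an estimated probability of 3/5 and wins the vote.</p>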
<blockquote>
<p>KNN searches the memorized training observations for the K instances that most closely resemble the new instance and assigns to it their most common class.</p>
</blockquote>
<p>An alternate way of understanding KNN is by thinking about it as calculating a decision boundary (i.e. boundaries for more than 2 classes) which is then used to classify new points.</p>
<h2 id="more-on-k">More on K</h2>
<p>At this point, you’re probably wondering how to pick the variable K and what its effects are on your classifier. Well, like most machine learning algorithms, the K in KNN is a hyperparameter that you, as a designer, must pick in order to get the best possible fit for the data set. Intuitively, you can think of K as controlling the shape of the decision boundary we talked about earlier.</p>
<p>When K is small, we are restraining the region of a given prediction and forcing our classifier to be “more blind” to the overall distribution. A small value for K provides the most flexible fit, which will have low bias but high variance. Graphically, our decision boundary will be more jagged.</p>
<p align="center">
<img src="/assets/1nearestneigh.png" />
</p>
<p>On the other hand, a higher K averages more voters in each prediction and hence is more resilient to outliers. Larger values of K will have smoother decision boundaries which means lower variance but increased bias.</p>
<p align="center">
<img src="/assets/20nearestneigh.png" />
</p>
<p>(If you want to learn more about the bias-variance tradeoff, check out <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Scott Fortmann-Roe’s blog post</a>. You can mess around with the value of K and watch the decision boundary change!)</p>
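<p>A tiny, made-up 1-D example shows the same effect without any plotting. The data set and the <code class="highlighter-rouge">knn_predict</code> helper are assumptions for illustration only: one class-1 point sits in the middle of the class-0 region, and we query right next to it.</p>

```python
from collections import Counter

# 1-D toy data as (position, label): class 0 on the left, class 1 on the right,
# plus one stray class-1 point at 2.5 inside the class-0 region
train = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1), (10.0, 1),
         (2.5, 1)]


def knn_predict(x, k):
    # sort by distance to the query and keep the k closest points
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


pred_k1 = knn_predict(2.4, k=1)  # follows the single closest point
pred_k5 = knn_predict(2.4, k=5)  # averages over five voters
print(pred_k1, pred_k5)
```

<p>With K = 1 the stray point dictates the answer (high variance), whereas with K = 5 its four class-0 neighbors outvote it (smoother boundary, more bias).</p>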
<h2 id="exploring-knn-in-code">Exploring KNN in Code</h2>
<p>Without further ado, let’s see how KNN can be leveraged in Python for a classification problem. We’re gonna head over to the UC Irvine Machine Learning Repository, an amazing source for a variety of free and interesting data sets.</p>
<p align="center">
<img src="/assets/flower.jpg" />
</p>
<p>The data set we’ll be using is the <a href="https://archive.ics.uci.edu/ml/datasets/Iris">Iris Flower Dataset</a> (IFD) which was first introduced in 1936 by the famous statistician Ronald Fisher and consists of 50 observations from each of three species of Iris (<em>Iris setosa, Iris virginica and Iris versicolor</em>). Four features were measured from each sample: the length and the width of the sepals and petals. Our goal is to train the KNN algorithm to be able to distinguish the species from one another given the measurements of the 4 features.</p>
<p>Go ahead and <code class="highlighter-rouge">Download Data Folder > iris.data</code> and save it in the directory of your choice.</p>
<p>The first thing we need to do is load the data set. It is in CSV format without a header line so we’ll use pandas’ <code class="highlighter-rouge">read_csv</code> function.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># loading libraries</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="c"># define column names</span>
<span class="n">names</span> <span class="o">=</span> <span class="p">[</span><span class="s">'sepal_length'</span><span class="p">,</span> <span class="s">'sepal_width'</span><span class="p">,</span> <span class="s">'petal_length'</span><span class="p">,</span> <span class="s">'petal_width'</span><span class="p">,</span> <span class="s">'class'</span><span class="p">]</span>
<span class="c"># loading training data</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'path/iris.data.txt'</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="n">names</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</code></pre>
</div>
<p>It’s always a good idea to call <code class="highlighter-rouge">df.head()</code> to see what the first few rows of the data frame look like. Also, note that you should replace <code class="highlighter-rouge">'path/iris.data.txt'</code> with the path to the file where you saved the data set.</p>
<p>Next, it would be cool if we could plot the data before rushing into classification so that we can have a deeper understanding of the problem at hand. R has a beautiful visualization tool called <code class="highlighter-rouge">ggplot2</code> that we will use to create 2 quick scatter plots of <strong>sepal width vs sepal length</strong> and <strong>petal width vs petal length</strong>.</p>
<div class="language-r highlighter-rouge"><pre class="highlight"><code><span class="c1"># ============================== R code ==============================
# loading packages
</span><span class="n">library</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="n">magrittr</span><span class="p">)</span><span class="w">
</span><span class="c1"># sepal width vs. sepal length
</span><span class="n">iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Sepal.Length</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">Sepal.Width</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">Species</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w">
</span><span class="c1"># petal width vs. petal length
</span><span class="n">iris</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="n">Petal.Length</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="o">=</span><span class="n">Petal.Width</span><span class="p">,</span><span class="w"> </span><span class="n">color</span><span class="o">=</span><span class="n">Species</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">geom_point</span><span class="p">()</span><span class="w">
</span><span class="c1"># =====================================================================
</span></code></pre>
</div>
<p>Note that we’ve accessed the <code class="highlighter-rouge">iris</code> dataframe which comes preloaded in R by default.</p>
<p align="center">
<img src="/assets/sep_plot.png" />
</p>
<p align="center">
<img src="/assets/pet_plot.png" />
</p>
<p>A quick study of the above graphs reveals some strong classification criteria. We observe that setosas have small petals, versicolor have medium sized petals and virginica have the largest petals. Furthermore, setosas seem to have shorter and wider sepals than the other two classes. Pretty interesting right? Without even using an algorithm, we’ve managed to intuitively construct a classifier that can perform pretty well on the dataset.</p>
<p>Now, it’s time to get our hands dirty. We’ll be using <code class="highlighter-rouge">scikit-learn</code> to train a KNN classifier and evaluate its performance on the data set using the 4 step modeling pattern:</p>
<ol>
<li>Import the learning algorithm</li>
<li>Instantiate the model</li>
<li>Learn the model</li>
<li>Predict the response</li>
</ol>
<p><code class="highlighter-rouge">scikit-learn</code> requires that the design matrix <script type="math/tex">X</script> and target vector <script type="math/tex">y</script> be numpy arrays so let’s oblige. Furthermore, we need to split our data into training and test sets. The following code does just that.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># loading libraries</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="c"># create design matrix X and target vector y</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">df</span><span class="o">.</span><span class="n">iloc</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">:</span><span class="mi">4</span><span class="p">])</span> <span class="c"># end index is exclusive</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s">'class'</span><span class="p">])</span> <span class="c"># another way of indexing a pandas df</span>
<span class="c"># split into train and test</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">y_test</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.33</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">42</span><span class="p">)</span>
</code></pre>
</div>
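<p>In case you’re wondering what a train/test split boils down to, here is a rough standard-library sketch. It is <em>not</em> scikit-learn’s actual implementation (the function name, the rounding rule and the shuffling are my own simplifications), but it captures the idea: shuffle the row indices, then carve off a fraction of them as the test set.</p>

```python
import random


def simple_train_test_split(X, y, test_size=0.33, seed=42):
    """Shuffle the row indices, then hold out a fraction as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # index X and y with the same positions so rows and labels stay aligned
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


X_demo = [[i, i + 1] for i in range(10)]
y_demo = [i % 2 for i in range(10)]
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo, test_size=0.3)
print(len(X_tr), len(X_te))
```

<p>Fixing the seed plays the same role as <code class="highlighter-rouge">random_state</code>: the split is random, but reproducible from run to run.</p>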
<p>Finally, following the above modeling pattern, we define our classifier, in this case KNN, fit it to our training data and evaluate its accuracy. We’ll be using an arbitrary K but we will see later on how cross validation can be used to find its optimal value.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># loading library</span>
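<span class="kn">from</span> <span class="nn">sklearn.neighbors</span> <span class="kn">import</span> <span class="n">KNeighborsClassifier</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span>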
<span class="c"># instantiate learning model (k = 3)</span>
<span class="n">knn</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="c"># fitting the model</span>
<span class="n">knn</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
<span class="c"># predict the response</span>
<span class="n">pred</span> <span class="o">=</span> <span class="n">knn</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="c"># evaluate accuracy</span>
<span class="k">print</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">pred</span><span class="p">))</span>
</code></pre>
</div>
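<p>The accuracy reported above is simply the fraction of test observations whose predicted label matches the true one. A hand-rolled equivalent might look like the sketch below; the helper name and the toy labels are made up for the example.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true_demo = ["setosa", "virginica", "versicolor", "setosa"]
y_pred_demo = ["setosa", "virginica", "virginica", "setosa"]
acc = accuracy(y_true_demo, y_pred_demo)
print(acc)
```

<p>Three out of four predictions are right here, and the misclassification error used later on is just one minus this number.</p>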
<h2 id="parameter-tuning-with-cross-validation">Parameter Tuning with Cross Validation</h2>
<p>In this section, we’ll explore a method that can be used to <em>tune</em> the hyperparameter K.</p>
<p>Obviously, the best K is the one that corresponds to the lowest test error rate, so let’s suppose we carry out repeated measurements of the test error for different values of K. Inadvertently, what we are doing is using the <code class="highlighter-rouge">test set</code> as a <code class="highlighter-rouge">training set</code>! This means that we are underestimating the true error rate since our model has been forced to fit the test set in the best possible manner. Our model is then incapable of generalizing to newer observations, a process known as <strong>overfitting</strong>. Hence, touching the test set is out of the question and must only be done at the very end of our pipeline.</p>
<blockquote>
<p>Using the test set for hyperparameter tuning can lead to overfitting.</p>
</blockquote>
<p>An alternative and smarter approach involves estimating the test error rate by holding out a subset of the <code class="highlighter-rouge">training set</code> from the fitting process. This subset, called the <code class="highlighter-rouge">validation set</code>, can be used to select the appropriate level of flexibility of our algorithm! There are different validation approaches that are used in practice, and we will be exploring one of the more popular ones called <strong>k-fold cross validation</strong>.</p>
<p align="center">
<img src="/assets/k_fold_cv.jpg" />
</p>
<p>As seen in the image, k-fold cross validation (<em>the k is totally unrelated to K</em>) involves randomly dividing the training set into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining <script type="math/tex">k - 1</script> folds. The misclassification rate is then computed on the observations in the held-out fold. This procedure is repeated k times; each time, a different group of observations is treated as a validation set. This process results in k estimates of the test error which are then averaged out.</p>
<blockquote>
<p>Cross-validation can be used to estimate the test error associated with a learning method in order to evaluate its performance, or to select the appropriate level of flexibility.</p>
</blockquote>
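<p>The fold bookkeeping described above can be sketched with nothing but index lists. This is a simplified illustration under my own assumptions (the function name, the seed and the way leftover observations are spread over the first folds), not what scikit-learn does internally.</p>

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs; every fold is the validation set once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    # the first (n_samples % n_folds) folds get one extra observation
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)
```

<p>Each observation lands in exactly one validation fold, so averaging the k per-fold error rates uses every training point for validation exactly once.</p>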
<p>If that is a bit overwhelming for you, don’t worry about it. We’re gonna make it clearer by performing a 10-fold cross validation on our dataset using a generated list of odd K’s ranging from 1 to 50.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># loading library</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">cross_val_score</span>
<span class="c"># creating list of odd K for KNN (1, 3, ..., 49)</span>
<span class="n">neighbors</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="c"># empty list that will hold cv scores</span>
<span class="n">cv_scores</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c"># perform 10-fold cross validation</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">neighbors</span><span class="p">:</span>
    <span class="n">knn</span> <span class="o">=</span> <span class="n">KNeighborsClassifier</span><span class="p">(</span><span class="n">n_neighbors</span><span class="o">=</span><span class="n">k</span><span class="p">)</span>
    <span class="n">scores</span> <span class="o">=</span> <span class="n">cross_val_score</span><span class="p">(</span><span class="n">knn</span><span class="p">,</span> <span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">cv</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">'accuracy'</span><span class="p">)</span>
    <span class="n">cv_scores</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">scores</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
</code></pre>
</div>
<p>Again, scikit-learn comes in handy with its <code class="highlighter-rouge">cross_val_score()</code> function. We specify that we are performing 10 folds with the <code class="highlighter-rouge">cv=10</code> parameter and that our scoring metric should be <code class="highlighter-rouge">accuracy</code> since we are in a classification setting.</p>
<p>Finally, we plot the misclassification error versus K.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># loading library</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span>
<span class="c"># changing to misclassification error (1 - accuracy)</span>
<span class="n">MSE</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span> <span class="o">-</span> <span class="n">x</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">cv_scores</span><span class="p">]</span>
<span class="c"># determining best k</span>
<span class="n">optimal_k</span> <span class="o">=</span> <span class="n">neighbors</span><span class="p">[</span><span class="n">MSE</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="n">MSE</span><span class="p">))]</span>
<span class="k">print</span><span class="p">(</span><span class="s">"The optimal number of neighbors is </span><span class="si">%</span><span class="s">d"</span> <span class="o">%</span> <span class="n">optimal_k</span><span class="p">)</span>
<span class="c"># plot misclassification error vs k</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">neighbors</span><span class="p">,</span> <span class="n">MSE</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">'Number of Neighbors K'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">'Misclassification Error'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre>
</div>
<p align="center">
<img src="/assets/cv_knn.png" />
</p>
<p>10-fold cross validation tells us that <script type="math/tex">K = 7</script> results in the lowest validation error.</p>
<h2 id="writing-our-own-knn-from-scratch">Writing our Own KNN from Scratch</h2>
<p>So far, we’ve studied how KNN works and seen how we can use it for a classification task using scikit-learn’s generic pipeline (i.e. import, instantiate, train, predict and evaluate). Now, it’s time to delve deeper into KNN by trying to code it ourselves from scratch.</p>
<p>A machine learning algorithm usually consists of 2 main blocks:</p>
<ul>
<li>
<p>a <strong>training</strong> block that takes as input the training data <script type="math/tex">X</script> and the corresponding target <script type="math/tex">y</script> and outputs a learned model <script type="math/tex">h</script>.</p>
</li>
<li>
<p>a <strong>predict</strong> block that takes as input new and unseen observations and uses the function <script type="math/tex">h</script> to output their corresponding responses.</p>
</li>
</ul>
<p>In the case of KNN, which, as discussed earlier, is a lazy algorithm, the training block reduces to just memorizing the training data. Let’s go ahead and write a Python function that does so.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">):</span>
    <span class="c"># do nothing </span>
    <span class="k">return</span>
</code></pre>
</div>
<p>Gosh, that was hard! Now we need to write the predict function which must do the following: it needs to compute the Euclidean distance between the “new” observation and all the data points in the training set. It must then select the K nearest ones and perform a majority vote. It then assigns the corresponding label to the observation. Let’s go ahead and write that.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">Counter</span>

<span class="k">def</span> <span class="nf">predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">x_test</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="c"># create list for distances and targets</span>
    <span class="n">distances</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">targets</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">)):</span>
        <span class="c"># first we compute the euclidean distance</span>
        <span class="n">distance</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">square</span><span class="p">(</span><span class="n">x_test</span> <span class="o">-</span> <span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:])))</span>
        <span class="c"># add it to list of distances</span>
        <span class="n">distances</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">distance</span><span class="p">,</span> <span class="n">i</span><span class="p">])</span>
    <span class="c"># sort the list</span>
    <span class="n">distances</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">distances</span><span class="p">)</span>
    <span class="c"># make a list of the k neighbors' targets</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">k</span><span class="p">):</span>
        <span class="n">index</span> <span class="o">=</span> <span class="n">distances</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">targets</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">y_train</span><span class="p">[</span><span class="n">index</span><span class="p">])</span>
    <span class="c"># return most common target</span>
    <span class="k">return</span> <span class="n">Counter</span><span class="p">(</span><span class="n">targets</span><span class="p">)</span><span class="o">.</span><span class="n">most_common</span><span class="p">(</span><span class="mi">1</span><span class="p">)[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</code></pre>
</div>
<p>In the above code, we create an array of <em>distances</em> which we sort by increasing order. That way, we can grab the K nearest neighbors (first K distances), get their associated labels which we store in the <em>targets</em> array, and finally perform a majority vote using a <em>Counter</em>.</p>
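<p>As a side note, sorting every distance is more work than strictly needed when K is small. One possible variation, sketched here with the standard library only, keeps just the K smallest (distance, index) pairs using <code class="highlighter-rouge">heapq.nsmallest</code>. The function name <code class="highlighter-rouge">predict_with_heap</code> and the toy data are my own and not part of the code above.</p>

```python
import heapq
import math
from collections import Counter


def predict_with_heap(X_train, y_train, x_test, k):
    # pair every training point's distance with its index
    distances = ((math.dist(x_test, row), i) for i, row in enumerate(X_train))
    # keep only the k smallest pairs instead of sorting all of them
    nearest = heapq.nsmallest(k, distances)
    targets = [y_train[i] for _, i in nearest]
    # majority vote among the k neighbors
    return Counter(targets).most_common(1)[0][0]


X_demo = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9)]
y_demo = ["a", "a", "a", "b", "b"]
label = predict_with_heap(X_demo, y_demo, (1.0, 0.9), k=3)
print(label)
```

<p>The vote itself is unchanged; only the way we pick out the K closest points differs.</p>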
<p>Putting it all together, we can define the function <code class="highlighter-rouge">kNearestNeighbor</code>, which loops over every test example and makes a prediction.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">kNearestNeighbor</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="c"># train on the input data</span>
    <span class="n">train</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="c"># loop over all observations</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_test</span><span class="p">)):</span>
        <span class="n">predictions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:],</span> <span class="n">k</span><span class="p">))</span>
</code></pre>
</div>
<p>Let’s go ahead and run our algorithm with the optimal K we found using cross-validation.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="c"># making our predictions </span>
<span class="n">predictions</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">kNearestNeighbor</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
<span class="c"># transform the list into an array</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span>
<span class="c"># evaluating accuracy</span>
<span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">The accuracy of our classifier is </span><span class="si">%</span><span class="s">d</span><span class="si">%%</span><span class="s">'</span> <span class="o">%</span> <span class="p">(</span><span class="n">accuracy</span> <span class="o">*</span> <span class="mi">100</span><span class="p">))</span>
</code></pre>
</div>
<p><script type="math/tex">98\%</script> accuracy! We’re as good as scikit-learn’s algorithm, but probably less efficient. Let’s try again with a value of <script type="math/tex">K = 140</script>. We get an <code class="highlighter-rouge">IndexError: list index out of range</code> error. In fact, K can’t be arbitrarily large since we can’t have more neighbors than the number of observations in the training data set. So let’s fix our code to safeguard against such an error: the function raises a <code class="highlighter-rouge">ValueError</code> when K exceeds the number of training observations, and we catch it with <code class="highlighter-rouge">try, except</code>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">kNearestNeighbor</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="c"># check if k larger than n</span>
    <span class="k">if</span> <span class="n">k</span> <span class="o">></span> <span class="nb">len</span><span class="p">(</span><span class="n">X_train</span><span class="p">):</span>
        <span class="k">raise</span> <span class="nb">ValueError</span>
    <span class="c"># train on the input data</span>
    <span class="n">train</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">)</span>
    <span class="c"># predict for each testing observation</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">X_test</span><span class="p">)):</span>
        <span class="n">predictions</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">[</span><span class="n">i</span><span class="p">,</span> <span class="p">:],</span> <span class="n">k</span><span class="p">))</span>

<span class="c"># making our predictions </span>
<span class="n">predictions</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">kNearestNeighbor</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">y_train</span><span class="p">,</span> <span class="n">X_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">,</span> <span class="mi">7</span><span class="p">)</span>
    <span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">predictions</span><span class="p">)</span>
    <span class="c"># evaluating accuracy</span>
    <span class="n">accuracy</span> <span class="o">=</span> <span class="n">accuracy_score</span><span class="p">(</span><span class="n">y_test</span><span class="p">,</span> <span class="n">predictions</span><span class="p">)</span> <span class="o">*</span> <span class="mi">100</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">The accuracy of OUR classifier is </span><span class="si">%</span><span class="s">d</span><span class="si">%%</span><span class="s">'</span> <span class="o">%</span> <span class="n">accuracy</span><span class="p">)</span>
<span class="k">except</span> <span class="nb">ValueError</span><span class="p">:</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Can</span><span class="se">\'</span><span class="s">t have more neighbors than training samples!!'</span><span class="p">)</span>
</code></pre>
</div>
<p>That’s it, we’ve just written our first machine learning algorithm from scratch!</p>
<h2 id="pros-and-cons-of-knn">Pros and Cons of KNN</h2>
<h4 id="pros">Pros</h4>
<p>As you can already tell from the previous section, one of the most attractive features of the K-nearest neighbor algorithm is that it is simple to understand and easy to implement. With little to no training time, it can be a useful tool for a quick first analysis of a data set you are planning to run more complex algorithms on. Furthermore, KNN works just as easily with multiclass data sets, whereas some other algorithms are designed for the binary setting. Finally, as we mentioned earlier, the non-parametric nature of KNN gives it an edge in certain settings where the data may be highly “unusual”.</p>
<h4 id="cons">Cons</h4>
<p>One of the obvious drawbacks of the KNN algorithm is its computationally expensive testing phase, which can make it impractical in industry settings. Note the sharp contrast between KNN and the more sophisticated Neural Network, which has a lengthy training phase but a <strong>very fast</strong> testing phase. Furthermore, KNN can suffer from skewed class distributions. For example, if a certain class is very frequent in the training set, it will tend to dominate the majority vote for a new example simply because its members are more numerous. Finally, the accuracy of KNN can be severely degraded with high-dimensional data because there is little difference between the distances to the nearest and farthest neighbors.</p>
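<p>The last point can be checked numerically. The sketch below is an illustration under assumed conditions (points drawn uniformly at random from the unit hypercube, a hypothetical helper named <code class="highlighter-rouge">nearest_to_farthest_ratio</code>): it measures how close the nearest neighbor's distance gets to the farthest neighbor's distance as the number of dimensions grows.</p>

```python
import numpy as np


def nearest_to_farthest_ratio(n_points, n_dims, seed=0):
    """Ratio of the smallest to the largest distance from a random query point."""
    rng = np.random.default_rng(seed)
    # assumption: data and query are uniform on the unit hypercube
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()


ratios = {d: nearest_to_farthest_ratio(500, d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"{d:>5} dimensions: nearest/farthest = {r:.3f}")
```

<p>A ratio near 0 means the nearest neighbor is much closer than the farthest one; a ratio approaching 1 means all points look roughly equally far away, which is what makes the vote of the "nearest" neighbors less informative.</p>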
<h2 id="improvements">Improvements</h2>
<p>With that being said, there are many ways in which the KNN algorithm can be improved.</p>
<ul>
<li>A simple and effective way to remedy skewed class distributions is by implementing <strong>weighted voting</strong>. The vote of each of the K neighbors is multiplied by a weight proportional to the inverse of the distance from that neighbor to the given test point. This ensures that nearer neighbors contribute more to the final vote than more distant ones.</li>
<li><strong>Changing the distance metric</strong> for different applications may help improve the accuracy of the algorithm (e.g. Hamming distance for text classification).</li>
<li><strong>Rescaling your data</strong> makes the distance metric more meaningful. For instance, given 2 features <code class="highlighter-rouge">height</code> and <code class="highlighter-rouge">weight</code>, an observation such as <script type="math/tex">x = [180, 70]</script> will clearly skew the distance metric in favor of height. One way of fixing this is by column-wise subtracting the mean and dividing by the standard deviation. Scikit-learn’s <code class="highlighter-rouge">StandardScaler</code> class (or the <code class="highlighter-rouge">scale()</code> function) does exactly this; note that <code class="highlighter-rouge">normalize()</code> instead rescales each sample to unit norm.</li>
<li><strong>Dimensionality reduction</strong> techniques like PCA can be applied prior to running KNN to help make the distance metric more meaningful.</li>
<li><strong>Approximate Nearest Neighbor</strong> techniques such as using <em>k-d trees</em> to store the training observations can be leveraged to decrease testing time. Note however that these methods tend to perform poorly in high dimensions (20+). Try using <strong>locality sensitive hashing (LSH)</strong> for higher dimensions.</li>
</ul>
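<p>Two of the improvements above, rescaling and weighted voting, can be sketched in a few lines of NumPy. This is a minimal illustration and not a definitive implementation: the function names <code class="highlighter-rouge">standardize</code> and <code class="highlighter-rouge">predict_weighted</code>, the Euclidean distance, the small <code class="highlighter-rouge">eps</code> constant and the toy data are all assumptions introduced here.</p>

```python
from collections import defaultdict

import numpy as np


def standardize(X_train, X_test):
    """Column-wise: subtract the training mean, divide by the training std."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X_train - mean) / std, (X_test - mean) / std


def predict_weighted(X_train, y_train, x_test, k, eps=1e-8):
    """KNN prediction where each neighbor votes with weight 1 / distance."""
    distances = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]
    votes = defaultdict(float)
    for index in nearest:
        # eps (an assumed constant) guards against a zero distance
        votes[y_train[index]] += 1.0 / (distances[index] + eps)
    # class with the largest total weight wins
    return max(votes, key=votes.get)


# skewed toy data: three distant points of class 0, one close point of class 1
X_skew = np.array([[1.0], [1.1], [1.2], [0.1]])
y_skew = np.array([0, 0, 0, 1])
weighted_label = predict_weighted(X_skew, y_skew, np.array([0.0]), k=4)
print(weighted_label)

# height/weight toy data, standardized with the training statistics
X_hw = np.array([[180.0, 70.0], [160.0, 60.0], [170.0, 80.0]])
X_hw_scaled, _ = standardize(X_hw, X_hw[:1])
print(X_hw_scaled.mean(axis=0), X_hw_scaled.std(axis=0))
```

<p>With K = 4, a plain majority vote on the skewed toy data would pick class 0 (three votes against one), whereas the inverse-distance weights let the single close neighbor of class 1 win. The test data is scaled with the training mean and standard deviation so that both sets live in the same coordinate system.</p>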
<h2 id="tutorial-summary">Tutorial Summary</h2>
<p>In this tutorial, we learned about the K-Nearest Neighbor algorithm, how it works and how it can be applied in a classification setting using scikit-learn. We also implemented the algorithm in Python from scratch so as to understand its inner workings. We even used R to create visualizations to further understand our data. Finally, we explored the pros and cons of KNN and the many improvements that can be made to adapt it to different project settings.</p>
<p>If you want to practice some more with the algorithm, try and run it on the <strong>Breast Cancer Wisconsin</strong> dataset which you can find in the UC Irvine Machine Learning repository. You’ll need to preprocess the data carefully this time. Do it once with scikit-learn’s algorithm and a second time with our version of the code but try adding the weighted distance implementation.</p>
<h2 id="references">References</h2>
<h4 id="notes">Notes</h4>
<ul>
<li>Stanford’s <strong>CS231n</strong> notes on KNN - click <a href="http://cs231n.github.io/classification/#nn">here</a></li>
<li>Wikipedia’s KNN page - click <a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">here</a></li>
<li>An Introduction to Statistical Learning with Applications in R, Chapters <strong>2</strong> and <strong>3</strong> - click <a href="http://www-bcf.usc.edu/~gareth/ISL/">here</a></li>
<li>Detailed Introduction to KNN - click <a href="https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/">here</a></li>
</ul>
<h4 id="resources">Resources</h4>
<ul>
<li>Scikit-learn’s documentation for KNN - click <a href="http://scikit-learn.org/stable/modules/neighbors.html">here</a></li>
<li>Data wrangling and visualization with pandas and matplotlib from Chris Albon - click <a href="http://chrisalbon.com/">here</a></li>
<li>Intro to machine learning with scikit-learn (Great resource!) - click <a href="https://github.com/justmarkham/scikit-learn-videos">here</a></li>
</ul>
<p>Thank you for reading my guide, and I hope it helps you in theory and in practice!</p>
Wed, 13 Jul 2016 00:00:00 +0000
http://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/
KNN, machine learning, classification, neighbours