This past month has been extremely productive, and I’m really satisfied with the way my winter break has panned out. I had the opportunity to read and learn a lot from a number of arXiv papers, and I went hands-on and coded 2 projects from scratch in TensorFlow/Keras.

I think sticking to theory all the time makes me very prone to forming misconceptions. Actually reproducing papers, googling questions and looking at people’s code have really helped me solidify the concepts in my head, and I’ve gained significant experience in the process.

Since tomorrow marks my last day of winter break, I thought I would use this opportunity to compile a list of projects I’d like to tackle in the next few weeks. It’ll be a perfect way of organizing and prioritizing my short-term goals, and I’ll be able to hold myself accountable if I get too lazy.


Image Super-Resolution. Super-resolution is the task of estimating a high-resolution (HR) image from its low-resolution (LR) counterpart. Think about those CSI: Miami episodes where they enhance surveillance footage to glean valuable information for the case.

My main resource will be Alex Champandard’s Neural-Enhance repository, which combines techniques from 4 papers. Kudos to Alex; definitely go and give him a star for the amazing work.

I think super-resolution is a great application of Deep Learning and tackling it will prove to be very entertaining.

Text-To-Image. My goal is to try and implement StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks by Zhang et al. It’s amazing work in which they synthesize photo-realistic images from text descriptions with GANs! If that doesn’t sound like wizardry, take a look at the image below from the paper.

I’ll definitely have to brush up on the theory of Generative Adversarial Networks, so Goodfellow’s NIPS 2016 GAN tutorial will prove invaluable.

Lip Reading. My third project in Visual Recognition will be trying to build a model that can recognise phrases and sentences being spoken by a talking face. Specifically, I’ll try and reproduce the results of Chung et al.’s Lip Reading Sentences in the Wild. Lip reading is just so damn useful and it can really help the hearing impaired, so this is a big priority of mine. Helping society is exactly why I got into this field.

Here’s the YouTube video uploaded by the author of the paper. Impressive results!


For sound, my goal is to tackle 2 seminal papers.

WaveNet. Google DeepMind’s paper, which made waves (no pun intended) when it was released, leverages a deep neural network to generate raw audio waveforms. It won the “Best paper from the industry” award on Reddit.

I invite you to check out DeepMind’s blog post on the matter, where they showcase samples created by the network. Not only have they taught it to generate synthetic utterances (English and Mandarin), but there’s even a sample of some piano playing which blew me away.

The results are currently very hard to reproduce, but my goal is just to get something minimal working and to familiarize myself with dilated convolutions.
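To make the idea of dilated convolutions concrete, here is a minimal pure-Python sketch of a causal dilated 1-D convolution. It is not WaveNet’s implementation: the function names, the toy signal and the two-tap kernel are assumptions chosen for illustration, and it only shows how spacing the kernel taps apart widens the receptive field.

```python
def causal_dilated_conv1d(signal, kernel, dilation=1):
    """Causal 1-D convolution with `dilation` samples between kernel taps.

    output[t] = sum_k kernel[k] * signal[t - k * dilation],
    where samples before the start of the signal count as zero.
    """
    output = []
    for t in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            index = t - k * dilation
            if index >= 0:  # never look into the future or before the start
                total += weight * signal[index]
        output.append(total)
    return output


def receptive_field(kernel_size, dilations):
    """Number of input samples one output sample depends on after stacking layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
kernel = [1.0, 1.0]  # adds the current sample to one earlier sample

print(causal_dilated_conv1d(signal, kernel, dilation=1))
# [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
print(causal_dilated_conv1d(signal, kernel, dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
print(receptive_field(kernel_size=2, dilations=[1, 2, 4, 8]))
# 16
```

With a kernel of size 2, doubling the dilation at each of four layers lets one output see 16 input samples, whereas four undilated layers would only see 5. That growth is what makes long audio contexts affordable.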

SoundNet. The second paper is from MIT’s CSAIL and was presented at NIPS 2016. The project landing page is extremely well presented and detailed so, again, check it out if you’re interested.

I think this paper is extremely underrated. I didn’t see much talk about it on social media, but it’s actually very elegant: given a video, their ConvNet recognizes objects and scenes from sound only!


NLP is a very important application of Deep Learning and I’ve never had any experience with it, so I decided I’d like to try and implement two recent approaches that have shifted away from the traditional RNN architecture. The first paper is from Google DeepMind and the second one is from FAIR.

ByteNet. Google DeepMind’s paper describes a model that performs language modeling and machine translation in linear time. It uses dilated convolutions, much like WaveNet. The figure below, courtesy of the paper, illustrates the model’s architecture.

Gated Convnets. This paper is from FAIR. The authors avoid the traditional RNN structure for language modeling and replace it with a convnet endowed with a gating mechanism (similar in concept to the gates in LSTMs). This model also enjoys an order-of-magnitude speedup compared to a recurrent baseline because it can be parallelized. The architecture is illustrated below.
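As a rough illustration of the gating mechanism, the sketch below computes a gated layer of the form (X * W + b) multiplied elementwise by sigmoid(X * V + c) on a toy scalar sequence. It is a simplified sketch, not the paper’s model: the single-channel input, the function names and the kernel values are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def causal_conv1d(sequence, kernel, bias=0.0):
    """Causal 1-D convolution: output[t] uses sequence[t], sequence[t-1], ... only."""
    output = []
    for t in range(len(sequence)):
        total = bias
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                total += weight * sequence[t - k]
        output.append(total)
    return output


def gated_conv1d(sequence, content_kernel, gate_kernel,
                 content_bias=0.0, gate_bias=0.0):
    """Gated layer: content path scaled elementwise by a sigmoid gate path."""
    content = causal_conv1d(sequence, content_kernel, content_bias)
    gate = causal_conv1d(sequence, gate_kernel, gate_bias)
    return [c * sigmoid(g) for c, g in zip(content, gate)]


sequence = [0.5, -1.0, 2.0, 0.0]

# A zero gate kernel gives sigmoid(0) = 0.5, so half of the content passes.
half_open = gated_conv1d(sequence, content_kernel=[1.0], gate_kernel=[0.0])
print(half_open)  # [0.25, -0.5, 1.0, 0.0]

# A large positive gate bias opens the gate almost fully.
wide_open = gated_conv1d(sequence, content_kernel=[1.0], gate_kernel=[0.0],
                         gate_bias=20.0)
print(wide_open)
```

Each output position depends only on a fixed window of inputs and not on the previous output, so all positions can be computed independently. An RNN, by contrast, has to process the sequence step by step.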

Autonomous Driving

Finally, my current and main interest is autonomous driving. I’ve decided to tackle the following 3 projects and I feel they will form a solid background before I start messing around with’s open source project.

Traffic Sign Classification. I want to implement and train a convolutional neural network to classify traffic signs. This, incidentally, is the subject of my next blog post which is part 3 of the Spatial Transformer series.

Behavioral Cloning. I want to train a deep neural network to drive a car using OpenAI’s Universe and GTA V. I would also like to test it on DeepTraffic from MIT’s “Deep Learning for Self-Driving Cars” course. I don’t know yet whether I need to be a pro in Reinforcement Learning for this, and I’ll definitely refine this list if need be. We’ll see when the time comes.

Kalman Filters. The final goal is Kalman filters. This algorithm is super important for autonomous driving (GPS noise smoothing, for example), so I want to understand it better and write a small Python implementation. I’ll also definitely write a blog post about it in the near future.
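As a starting point, a small Python implementation could look like the sketch below. It is a one-dimensional filter that assumes the true position is roughly constant. The noise variances, the simulated “GPS” readings and the function names are all assumptions chosen for illustration, not values from any real sensor.

```python
import random


def kalman_filter_1d(measurements, process_var, measurement_var,
                     initial_estimate=0.0, initial_var=1000.0):
    """Scalar Kalman filter for a state assumed to stay (nearly) constant."""
    estimate, variance = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the state is unchanged, but uncertainty grows a little.
        variance += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = variance / (variance + measurement_var)
        estimate += gain * (z - estimate)
        variance *= (1.0 - gain)
        estimates.append(estimate)
    return estimates


def mean_abs_error(values, truth):
    return sum(abs(v - truth) for v in values) / len(values)


# Simulated noisy position readings around a fixed true position.
random.seed(0)
true_position = 10.0
noisy = [true_position + random.gauss(0.0, 2.0) for _ in range(200)]

smoothed = kalman_filter_1d(noisy, process_var=1e-4, measurement_var=4.0)

print("raw error:     ", mean_abs_error(noisy, true_position))
print("filtered error:", mean_abs_error(smoothed, true_position))
```

A moving vehicle would need a state with at least position and velocity, which turns the scalars above into vectors and matrices. The predict and update steps keep the same shape.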


That’s it for today’s blog post. I talked about a few projects I’d like to work on in the fields of Vision, Sound, NLP and Autonomous Driving. I sincerely hope I can achieve the goals I have set before the end of this spring semester, as I’d like to spend the second part of this year on Deep Reinforcement Learning.

For those of you who are interested, I’ve set up a roadmap repository on my GitHub which mirrors the above list, so you can check it out and see my progress step-by-step.

Until next time, cheers!