Practical Machine Learning with Python and Keras

Practical Machine Learning with Python and Keras

AI Business

October 2, 2019

23 Min Read

by Daniel Pyrathon, Kite 2 October 2019

Table of Contents

What is machine learning, and why do we care?

Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to “learn” (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed. Think of how efficiently (or not) Gmail detects spam emails, or how good text-to-speech has become with the rise of Siri, Alexa, and Google Home.

Some of the tasks that can be solved by implementing Machine Learning include:

  • Anomaly and fraud detection: Detect unusual patterns in credit card and bank transactions.

  • Prediction: Predict future prices of stocks, exchange rates, and now cryptocurrencies.

  • Image recognition: Identify objects and faces in images.

Machine Learning is an enormous field, and today we’ll be working to analyze just a small subset of it.

Supervised Machine Learning

Supervised learning is one of Machine Learning’s subfields. The idea behind Supervised Learning is that you first teach a system to understand your past data by providing many examples to a specific problem and desired output. Then, once the system is “trained”, you can show it new inputs in order to predict the outputs.

How would you build an email spam detector? One way to do it is through intuition – manually defining rules that make sense: such as “contains the word money”, or “contains the word ‘Western Union’”. While manually built rule-based systems can work sometimes, often it becomes hard to create or identify patterns and rules based only on human intuition. By using Supervised Learning, we can train systems to learn the underlying rules and patterns automatically with a lot of past spam data. Once our spam detector is trained, we can feed it new a new email so that it can predict how likely an email is a spam.

Earlier I mentioned that you can use Supervised Learning to predict an output. There are two primary kinds of supervised learning problems: regression and classification.

  • In regression problems, we try to predict a continuous output. For example, predicting the price (real value) of a house when given its size.

  • In classification problems, we try to predict a discrete number of categorical labels. For example, predicting if an email is spam or not given the number of words within it.

You can’t talk about Supervised Machine Learning without talking about supervised learning models – it’s like talking about programming without mentioning programming languages or data structures. In fact, the learning models are the structures that are “trained,” and their weights or structure change internally as they mold and understand what we are trying to predict. There are plenty of supervised learning models, some of the ones I have personally used are:

  • Random Forest

  • Naive Bayes

  • Logistic Regression

  • K Nearest Neighbors

Today we’ll be using Artificial Neural Networks (ANNs) as our model of choice.

Understanding Artificial Neural Networks

ANNs are named this way because their internal structure is meant to mimic the human brain. A human brain consists of neurons and synapses that connect these neurons with each other, and when these neurons are stimulated, they “activate” other neurons in our brain through electricity.

In the world of ANNs, each neuron is “activated” by first computing the weighted sum of its incoming inputs (other neurons from the previous layer), and then running the result through activation function. When a neuron is activated, it will, in turn, activate other neurons that will perform similar computations, causing a chain reaction between all the neurons of all the layers.

It’s worth mentioning that, while ANNs are inspired by biological neurons, they are in no way comparable.

  • What the diagram above is describing here is the entire activation process that every neuron goes through. Let’s look at it together from left to right.

  • All the inputs (numerical values) from the incoming neurons are read. The incoming inputs are identified as x1..xn

  • Each input is multiplied by the weight associated with that connection. The weights associated with the connections here are denoted as W1j..Wnj.

  • All the weighted inputs are summed together and passed into the activation function. The activation function reads the single summed weighted input and transforms it into a new numerical value.

  • Finally, the numerical value that was returned by the activation function will then be the input of another neuron in another layer.

Neural Network layers

Neuronsinside the ANN are arranged into layers. Layers are a way to givestructure to the Neural Network, each layer will contain 1 or moreneurons. A Neural Network will usually have 3 or more layers. There are 2special layers that are always defined, which are the input and theoutput layer.

  • The input layer is used as an entry point toour Neural Network. In programming, think of this as the arguments wedefine to a function.

  • The output layer is used as the result to our Neural Network. In programming, think of this as the return value of a function.

Thelayers in between are described as “hidden layers”, and they are wheremost of the computation happens. All layers in an ANN are encoded asfeature vectors.

Choosing how many hidden layers and neurons

There isn’t necessarily a golden rule on choosing how many layers and their size (or the number of neurons they have). Generally, you want to try and at least have 1 hidden layer and tweak around the size to see what works best.

Using the Keras library to train a simple Neural Network that recognizes handwritten digits

For us Python Software Engineers, there’s no need to reinvent the wheel. Libraries like Tensorflow, Torch, Theano, and Keras already define the main data structures of a Neural Network, leaving us with the responsibility of describing the structure of the Neural Network in a declarative way.

Keras gives us a few degrees of freedom here: the number of layers, the number of neurons in each layer, the type of layer, and the activation function. In practice, there are many more of these, but let’s keep it simple. As mentioned above, there are two special layers that need to be defined based on your problematic domain: the size of the input layer and the size of the output layer. All the remaining “hidden layers” can be used to learn the complex non-linear abstractions to the problem.

Today we’ll be using Python and the Keras library to predict handwritten digits from the MNIST dataset. There are three options to follow along: use the rendered Jupyter Notebook hosted on Kite’s github repository, running the notebook locally, or running the code from a minimal Python installation on your machine.

Running the iPython Notebook Locally

If you wish to load this Jupyter Notebook locally instead of following the linked rendered notebook, here is how you can set it up:


  • A Linux or Mac operating system

  • Conda 4.3.27 or later

  • Git 2.13.0 or later

  • wget 1.16.3 or later

In a terminal, navigate to a directory of your choice and run:

Running from a Minimal Python Distribution

To run from a pure Python installation (anything after 3.5 should work), install the required modules with pip, then run the code as typed, excluding lines marked with a % which are used for the iPython environment.

It is strongly recommended, but not necessary, to run example code in a virtual environment. For extra help, see

Okay! If these modules installed successfully, you can now run all the code in this project.

In [1]:

The MNIST Dataset

The MNIST dataset is a large database of handwritten digits that is used as a benchmark and an introduction to machine learning and image processing systems. We like MNIST because the dataset is very clean and this allows us to focus on the actual network training and evaluation. Remember: a clean dataset is a luxury in the ML world! So let’s enjoy and celebrate MNIST’s cleanliness while we can ??

The objective

Given a dataset of 60,000 handwritten digit images (represented by 28×28 pixels, each containing a value 0 – 255 with its grayscale value), train a system to classify each image with it’s respective label (the digit that is displayed).

The dataset

The dataset is composed of a training and testing dataset, but for simplicity we are only going to be using the training set. Below we can download the train dataset

In [2]:

Reading the labels

There are 10 possible handwritten digits: (0-9), therefore every label must be a number from 0 to 9. The file that we downloaded, train-labels-idx1-ubyte.gz, encodes labels as following:

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):






32 bit integer


magic number (MSB first)


32 bit integer


number of items


unsigned byte




unsigned byte








unsigned byte



The labels values are 0 to 9.

It looks like the first 8 bytes (or the first 2 32-bit integers) can be skipped because they contain metadata of the file that is usually useful to lower-level programming languages. To parse the file, we can perform the following operations:

  • Open the file using the gzip library, so that we can decompress the file

  • Read the entire byte array into memory

  • Skip the first 8 bytes

  • Iterate over every byte, and cast that byte to integer

NOTE: If this file was not from a trusted source, a lot more checking would need to be done. For the purpose of this blog post, I’m going to assume the file is valid in it’s integrity.

In [3]:

Reading the images






32 bit integer


magic number


32 bit integer


number of images


32 bit integer


number of rows


32 bit integer


number of columns


unsigned byte




unsigned byte








unsigned byte



Reading images is slightly different than reading labels. The first 16 bytes contain metadata that we already know. We can skip those bytes and directly proceed to reading the images. Every image is represented as a 28*28 unsigned byte array. All we have to do is read one image at a time and save it into an array.

In [4]:

Out [4]: (60000, 784)

Our images list now contains 60,000 images. Each image is represented as a byte vector of SIZE_OF_ONE_IMAGE Let’s try to plot an image using the matplotlib library:

In [5]:

Encoding image labels using one-hot encoding

We are going to use One-hot encoding to transform our target labels into a vector.

In [6]:

Out [6]:

We have successfully created input and output vectors that will be fed into the input and output layers of our neural network. The input vector at index i will correspond to the output vector at index i.

In [7]:labels_np_onehot[999]

Out [7]:array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])

In [8]:plot_image(images[999])

In the example above, we can see that the image at index 999 clearly represents a 6. It’s associated output vector contains 10 digits (since there are 10 available labels) and the digit at index 6 is set to 1, indicating that it’s the correct label.

Building train and test split

Inorder to check that our ANN has correctly been trained, we take apercentage of the train dataset (our 60,000 images) and set it aside fortesting purposes.

In [9]:X_train, X_test, y_train, y_test = train_test_split(images, labels_np_onehot)

In [10]:y_train.shape

Out [10]:(45000, 10)

In [11]:y_test.shape

Out [11]:(15000, 10)

As you can see, our dataset of 60,000 images was split into one dataset of 45,000 images, and the other of 15,000 images.

Training a Neural Network using Keras

In [12]:

Layer (type)

Output Shape

Param #

dense (Dense)

(None, 128)


dense_1 (Dense)

(None, 10)


Total params: 101,770Trainable params: 101,770Non-trainable params: 0

In [13]:X_train.shape

Out [13]:(45000, 784)

In [14], y_train, epochs=20, batch_size=128)

Out [14]:> In [15]:model.evaluate(X_test, y_test) 15000/15000 [==============================] – 2s 158us/step Out [15]:[0.2567395991722743, 0.9264] Inspecting the results Congratulations! You just trained a neural network to predict handwritten digits with more than 90% accuracy! Let’s test out the network with one of the pictures we have in our testset. Let’s take a random image, in this case the image at index 1010. We take the predicted label (in this case, the value is a 4 because the 5th index is set to 1). In [16]:y_test[1010] Out [16]:array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]) Let’s plot the image of the corresponding image. In [17]:plot_image(X_test[1010]) Understanding the output of a softmax activation layer Now, let’s run this number through the neural network and we can see what our predicted output looks like! In [18]:predicted_results = model.predict(X_test[1010].reshape((1, -1))) The output of a softmax layer is a probability distribution for every output. In our case, there are 10 possible outputs (digits 0-9). Of course, every one of our images is expected to only match one specific output (in other words, all of our images only contain one distinct digit). Because this is a probability distribution, the sum of the predicted results is ~1.0 In [19]:predicted_results.sum() Out [19]:1.0000001 Reading the output of a softmax activation layer for our digit As you can see below, the 7th index is really close to 1 (0.9) which means that there is a 90% probability that this digit is a 6… which it is! congrats! In [20]:predicted_results Out [20]: array([[1.2202066e-06, 3.4432333e-08, 3.5151488e-06, 1.2011528e-06, 9.9889344e-01, 3.5855610e-05, 1.6140550e-05, 7.6822333e-05, 1.0446112e-04, 8.6736667e-04]], dtype=float32) Viewing the confusion matrix In [21]: predicted_outputs = np.argmax(model.predict(X_test), axis=1) expected_outputs = np.argmax(y_test, axis=1) predicted_confusion_matrix = confusion_matrix(expected_outputs, predicted_outputs) In [22]:predicted_confusion_matrix Out [22]: array([[1413, 0, 10, 3, 2, 12, 12, 2, 10, 1], [ 0, 1646, 12, 6, 3, 8, 0, 5, 9, 3], [ 16, 9, 1353, 16, 22, 1, 18, 28, 44, 3], [ 1, 6, 27, 1420, 0, 48, 11, 16, 25, 17], [ 3, 7, 5, 1, 1403, 1, 12, 3, 7, 40], [ 15, 13, 7, 36, 5, 1194, 24, 6, 18, 15], [ 10, 8, 9, 1, 21, 16, 1363, 0, 9, 0], [ 2, 14, 18, 4, 16, 4, 2, 1491, 1, 27], [ 4, 28, 19, 31, 10, 28, 13, 2, 1280, 25], [ 5, 13, 1, 21, 58, 10, 1, 36, 13, 1333]]) In [23]: # Source code: def plot_confusion_matrix(cm, classes, title='Confusion matrix', """ This function prints and plots the confusion matrix. Normalization can be applied by setting `normalize=True`. """ plt.imshow(cm, interpolation='nearest', cmap=cmap) plt.title(title) plt.colorbar() tick_marks = np.arange(len(classes)) plt.xticks(tick_marks, classes, rotation=45) plt.yticks(tick_marks, classes) fmt = 'd' thresh = cm.max() / 2. for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])): plt.text(j, i, format(cm[i, j], fmt), horizontalalignment="center", color="white" if cm[i, j] > thresh else "black") plt.ylabel('True label') plt.xlabel('Predicted label') plt.tight_layout() # Compute confusion matrix class_names = [str(idx) for idx in range(10)] cnf_matrix = confusion_matrix(expected_outputs, predicted_outputs) np.set_printoptions(precision=2) # Plot non-normalized confusion matrix plt.figure() plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization') Conclusion During this tutorial, you’ve gotten a taste of a couple important concepts that are a fundamental part of one’s job in Machine Learning. We learned how to: Encode and decode images in the MNIST dataset Encode categorical features using one-hot encoding Define our Neural Network with 2 hidden layers, and an output layer that uses the softmax activation function Inspect the results of a softmax activation function output Plot the confusion matrix of our classifier Libraries like Sci-Kit Learn and Keras have substantially lowered the entry barrier to Machine Learning – just as Python has lowered the bar of entry to programming in general. Of course, it still takes years (or decades) of work to master! Engineers who understand Machine Learning are in high demand. With the help of the libraries I mentioned above, and introductory blog posts focused on practical Machine Learning (like this one), all engineers should be able to get their hands on Machine Learning even if they don’t understand the full theoretical reasoning behind a particular model, library, or framework. And, hopefully, they’ll use this skill to improve whatever they’re building every day. If we start making our components a little bit smarter and a little more personalized every day, we can make customers more engaged and at the center of whatever we are building. Take home exercise Here are a few challenges you can do at home to dig deeper into the world of machine learning using Python: Tweak around with the number of neurons in the hidden layer. Can you increase the accuracy? Try to add more layers. Does the neural network train slower? Can you think of why? Try to train a Random Forest classifier (requires scikit-learn library) instead of a Neural Network. Is the accuracy better?

This article originally appeared on Kite. Want to code faster? Kite is a plugin for PyCharm, Atom, Vim, VSCode, Sublime Text, and IntelliJ that uses machine learning to provide you with code completions in real time sorted by relevance.

Get the newsletter
From automation advancements to policy announcements, stay ahead of the curve with the bi-weekly AI Business newsletter.