This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 3099067.
Machine learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to “learn” (i.e., progressively improve performance on a specific task) from data, without being explicitly programmed. Think of how efficiently (or not) Gmail detects spam emails, or how good text-to-speech has become with the rise of Siri, Alexa, and Google Home.
Some of the tasks that can be solved by implementing Machine Learning include:
Anomaly and fraud detection: Detect unusual patterns in credit card and bank transactions.
Prediction: Predict future prices of stocks, exchange rates, and now cryptocurrencies.
Image recognition: Identify objects and faces in images.
Machine Learning is an enormous field, and today we’ll be working to analyze just a small subset of it.
Supervised Machine Learning
Supervised learning is one of Machine Learning’s subfields. The idea behind Supervised Learning is that you first teach a system to understand your past data by providing many examples of a specific problem together with the desired output. Then, once the system is “trained”, you can show it new inputs in order to predict the outputs.
How would you build an email spam detector? One way to do it is through intuition: manually defining rules that make sense, such as “contains the word money” or “contains the words ‘Western Union’”. While manually built rule-based systems can work sometimes, it often becomes hard to create or identify patterns and rules based only on human intuition. By using Supervised Learning, we can train systems to learn the underlying rules and patterns automatically from a lot of past spam data. Once our spam detector is trained, we can feed it a new email so that it can predict how likely that email is to be spam.
Earlier I mentioned that you can use Supervised Learning to predict an output. There are two primary kinds of supervised learning problems: regression and classification.
In regression problems, we try to predict a continuous output. For example, predicting the price (real value) of a house when given its size.
In classification problems, we try to predict one label out of a discrete set of categories. For example, predicting if an email is spam or not given the number of words within it.
You can’t talk about Supervised Machine Learning without talking about supervised learning models; it’s like talking about programming without mentioning programming languages or data structures. In fact, the learning models are the structures that are “trained,” and their weights or structure change internally as they mold and understand what we are trying to predict. There are plenty of supervised learning models. Some of the ones I have personally used are:
K Nearest Neighbors
Today we’ll be using Artificial Neural Networks (ANNs) as our model of choice.
Understanding Artificial Neural Networks
ANNs are named this way because their internal structure is meant to mimic the human brain. A human brain consists of neurons and synapses that connect these neurons with each other, and when these neurons are stimulated, they “activate” other neurons in our brain through electricity.
In the world of ANNs, each neuron is “activated” by first computing the weighted sum of its incoming inputs (the outputs of neurons in the previous layer), and then running the result through an activation function. When a neuron is activated, it will, in turn, activate other neurons that will perform similar computations, causing a chain reaction between all the neurons of all the layers.
It’s worth mentioning that, while ANNs are inspired by biological neurons, they are in no way comparable.
What the diagram above is describing here is the entire activation process that every neuron goes through. Let’s look at it together from left to right.
All the inputs (numerical values) from the incoming neurons are read. The incoming inputs are identified as x1..xn.
Each input is multiplied by the weight associated with that connection. The weights associated with the connections here are denoted as W1j..Wnj.
All the weighted inputs are summed together and passed into the activation function. The activation function reads the single summed weighted input and transforms it into a new numerical value.
Finally, the numerical value that was returned by the activation function will then be the input of another neuron in another layer.
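The four steps above can be sketched in a few lines of plain Python. This is only an illustration: the sigmoid activation function and the input and weight values are assumptions chosen for the example, not something prescribed by the article.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def activate_neuron(inputs, weights, activation=sigmoid):
    """Multiply each input by its weight, sum the results, apply the activation."""
    assert len(inputs) == len(weights)
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


# Hypothetical values standing in for x1..x3 and W1j..W3j.
inputs = [0.5, 0.1, 0.9]
weights = [0.4, -0.6, 0.2]

neuron_output = activate_neuron(inputs, weights)
print(neuron_output)  # this value becomes an input to neurons in the next layer
```

Swapping `sigmoid` for another function changes how the summed input is transformed, but the weighted-sum step stays the same.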
Neural Network layers
Neurons inside the ANN are arranged into layers. Layers are a way to give structure to the Neural Network; each layer contains 1 or more neurons. A Neural Network will usually have 3 or more layers. There are 2 special layers that are always defined, which are the input and the output layers.
The input layer is used as an entry point to our Neural Network. In programming, think of this as the arguments we define to a function.
The output layer is used as the result of our Neural Network. In programming, think of this as the return value of a function.
The layers in between are described as “hidden layers”, and they are where most of the computation happens. All layers in an ANN are encoded as
Choosing how many hidden layers and neurons
There isn’t necessarily a golden rule on choosing how many layers and their size (or the number of neurons they have). Generally, you want to try and at least have 1 hidden layer and tweak around the size to see what works best.
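To get a feel for what changing the layer sizes does, the sketch below counts the trainable values in a fully connected network. The layer sizes are hypothetical (784 inputs for a 28×28 image, 10 outputs for the digits, and two hidden layers of an arbitrary size), and the one-bias-per-neuron term is an assumption about how such layers are usually built.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases of a fully connected network."""
    total = 0
    for neurons_in, neurons_out in zip(layer_sizes, layer_sizes[1:]):
        total += neurons_in * neurons_out  # one weight per connection
        total += neurons_out               # one bias per neuron (assumed)
    return total


# Hypothetical architectures: input layer, two hidden layers, output layer.
small_network = [784, 32, 32, 10]
wider_network = [784, 64, 64, 10]

print(count_parameters(small_network))  # 26506
print(count_parameters(wider_network))  # 55050
```

Doubling the hidden layer size roughly doubles the number of values to train, which is one reason to start small and grow the network only when the results call for it.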
Using the Keras library to train a simple Neural Network that recognizes handwritten digits
For us Python Software Engineers, there’s no need to reinvent the wheel. Libraries like Tensorflow, Torch, Theano, and Keras already define the main data structures of a Neural Network, leaving us with the responsibility of describing the structure of the Neural Network in a declarative way.
Keras gives us a few degrees of freedom here: the number of layers, the number of neurons in each layer, the type of layer, and the activation function. In practice, there are many more of these, but let’s keep it simple. As mentioned above, there are two special layers that need to be defined based on your problem domain: the size of the input layer and the size of the output layer. All the remaining “hidden layers” can be used to learn the complex non-linear abstractions to the problem.
Today we’ll be using Python and the Keras library to predict handwritten digits from the MNIST dataset. There are three options to follow along: use the rendered Jupyter Notebook hosted on Kite’s GitHub repository, run the notebook locally, or run the code from a minimal Python installation on your machine.
Running the iPython Notebook Locally
If you wish to load this Jupyter Notebook locally instead of following the linked rendered notebook, here is how you can set it up:
A Linux or Mac operating system
Conda 4.3.27 or later
Git 2.13.0 or later
wget 1.16.3 or later
In a terminal, navigate to a directory of your choice and run:
# Clone the repository
git clone https://github.com/kiteco/kite-python-blog-post-code.git
cd kite-python-blog-post-code/Practical\ Machine\ Learning\ with\ Python\ and\ Keras/
# Use Conda to setup and activate the Python environment with the correct dependencies
conda env create -f environment.yml
source activate kite-blog-post
Running from a Minimal Python Distribution
To run from a pure Python installation (anything after 3.5 should work), install the required modules with pip, then run the code as typed, excluding lines marked with a % which are used for the iPython environment.
# Set up and Activate a Virtual Environment under Python3
$ pip3 install virtualenv
$ python3 -m virtualenv venv
$ source venv/bin/activate
# Install Modules with pip (not pip3)
(venv) $ pip install matplotlib
(venv) $ pip install scikit-learn
(venv) $ pip install tensorflow
Okay! If these modules installed successfully, you can now run all the code in this project.
import gzip
import itertools
import numpy as np
import matplotlib.pyplot as plt
from typing import List
from sklearn.preprocessing import OneHotEncoder
import tensorflow.keras as keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
The MNIST Dataset
The MNIST dataset is a large database of handwritten digits that is used as a benchmark and an introduction to machine learning and image processing systems. We like MNIST because the dataset is very clean and this allows us to focus on the actual network training and evaluation. Remember: a clean dataset is a luxury in the ML world! So let’s enjoy and celebrate MNIST’s cleanliness while we can.
Given a dataset of 60,000 handwritten digit images (each represented by 28×28 pixels, with every pixel holding a grayscale value from 0 to 255), train a system to classify each image with its respective label (the digit that is displayed).
The dataset is composed of a training and a testing dataset, but for simplicity we are only going to be using the training set. Below we can download the train dataset.
There are 10 possible handwritten digits (0-9), therefore every label must be a number from 0 to 9. The file that we downloaded, train-labels-idx1-ubyte.gz, encodes labels as follows:
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[type]            [description]
32 bit integer    magic number (MSB first)
32 bit integer    number of items
The label values are 0 to 9.
It looks like the first 8 bytes (or the first 2 32-bit integers) can be skipped because they contain metadata of the file that is usually useful to lower-level programming languages. To parse the file, we can perform the following operations:
Open the file using the gzip library, so that we can decompress the file
Read the entire byte array into memory
Skip the first 8 bytes
Iterate over every byte, and cast that byte to integer
NOTE: If this file were not from a trusted source, a lot more checking would need to be done. For the purpose of this blog post, I’m going to assume the file is intact and valid.
with gzip.open('train-labels-idx1-ubyte.gz') as train_labels:
    data_from_train_file = train_labels.read()
# Skip the first 8 bytes, we know exactly how many labels there are
label_data = data_from_train_file[8:]
assert len(label_data) == 60000
# Convert every byte to an integer. This will be a number between 0 and 9
labels = [int(label_byte) for label_byte in label_data]
assert min(labels) == 0 and max(labels) == 9
assert len(labels) == 60000
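If you would rather check the header than skip it, the two 32-bit integers can be unpacked with the standard `struct` module. The sketch below is an alternative to the code above, not the approach this post relies on. The function name is hypothetical, and a tiny in-memory file with three labels stands in for the real download. The value 2049 is the magic number used by idx1 label files.

```python
import gzip
import io
import struct

LABEL_FILE_MAGIC = 2049  # 0x00000801, the magic number of idx1 label files


def parse_label_file(raw: bytes) -> list:
    """Parse the decompressed bytes of an idx1 label file into integers."""
    # ">II" reads two big-endian (MSB first) unsigned 32-bit integers.
    magic, item_count = struct.unpack(">II", raw[:8])
    if magic != LABEL_FILE_MAGIC:
        raise ValueError(f"unexpected magic number: {magic}")
    labels = list(raw[8:])  # iterating over bytes yields integers
    if len(labels) != item_count:
        raise ValueError("header count does not match the number of labels")
    return labels


# A tiny in-memory stand-in for train-labels-idx1-ubyte.gz with 3 labels.
fake_gz = gzip.compress(struct.pack(">II", LABEL_FILE_MAGIC, 3) + bytes([5, 0, 4]))
with gzip.open(io.BytesIO(fake_gz)) as label_file:
    parsed_labels = parse_label_file(label_file.read())

print(parsed_labels)  # [5, 0, 4]
```

Reading the count from the header also removes the need to hard-code 60000, at the cost of a few extra lines.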
Reading the images
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[type]            [description]
32 bit integer    magic number (MSB first)
32 bit integer    number of images
32 bit integer    number of rows
32 bit integer    number of columns
Reading images is slightly different than reading labels. The first 16 bytes contain metadata that we already know. We can skip those bytes and directly proceed to reading the images. Every image is represented as a 28*28 unsigned byte array. All we have to do is read one image at a time and save it into an array.
SIZE_OF_ONE_IMAGE = 28 ** 2
images = []
# Iterate over the train file, and read one image at a time
with gzip.open('train-images-idx3-ubyte.gz') as train_images:
    # Skip the 16-byte header (four 32-bit integers)
    train_images.read(4 * 4)
    for _ in range(60000):
        image = train_images.read(SIZE_OF_ONE_IMAGE)
        assert len(image) == SIZE_OF_ONE_IMAGE
        # Convert to numpy and scale the pixel values from 0-255 to 0-1
        image_np = np.frombuffer(image, dtype='uint8') / 255
        images.append(image_np)
images = np.array(images)
In : images.shape
Out : (60000, 784)
Our images array now contains 60,000 images. Each image is represented as a vector of SIZE_OF_ONE_IMAGE (784) values, scaled to the range 0 to 1. Let’s try to plot an image using the matplotlib library:
We have successfully created input and output vectors that will be fed into the input and output layers of our neural network. The input vector at index i will correspond to the output vector at index i.
In the example above, we can see that the image at index 999 clearly represents a 6. Its associated output vector contains 10 values (since there are 10 available labels) and the value at index 6 is set to 1, indicating that it’s the correct label.
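The label encoding described here is usually called one-hot encoding. The imports at the top of this post suggest scikit-learn’s OneHotEncoder does the job in the notebook. The plain-Python version below is only a stand-in with hypothetical helper names, meant to show what the resulting vectors look like.

```python
def one_hot(label: int, num_classes: int = 10) -> list:
    """Return a vector of zeros with a single 1.0 at the label's index."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


def decode(vector) -> int:
    """Recover the label: the index holding the largest value."""
    return max(range(len(vector)), key=lambda index: vector[index])


encoded_six = one_hot(6)
print(encoded_six)          # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(decode(encoded_six))  # 6
```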
Building train and test split
In order to check that our ANN has been trained correctly, we take a percentage of the train dataset (our 60,000 images) and set it aside for testing.
In : X_train, X_test, y_train, y_test = train_test_split(images, labels_np_onehot)
In : y_train.shape
Out : (45000, 10)
In : y_test.shape
Out : (15000, 10)
As you can see, our dataset of 60,000 images was split into one dataset of 45,000 images and another of 15,000 images. When no size is given, train_test_split holds out 25% of the data for testing.
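For intuition, here is roughly what a shuffled split does, written with only the standard library. It is a simplified sketch with a hypothetical function name, not scikit-learn’s implementation, and the “images” are just stand-in integers.

```python
import random


def split_train_test(inputs, outputs, test_fraction=0.25, seed=0):
    """Shuffle paired inputs/outputs and hold out a fraction for testing."""
    assert len(inputs) == len(outputs)
    indices = list(range(len(inputs)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    test_count = int(len(indices) * test_fraction)
    test_indices, train_indices = indices[:test_count], indices[test_count:]
    x_train = [inputs[i] for i in train_indices]
    x_test = [inputs[i] for i in test_indices]
    y_train = [outputs[i] for i in train_indices]
    y_test = [outputs[i] for i in test_indices]
    return x_train, x_test, y_train, y_test


# Stand-in data: 60,000 fake "images" (just ids) with fake labels.
fake_images = list(range(60000))
fake_labels = [image_id % 10 for image_id in fake_images]

x_train, x_test, y_train, y_test = split_train_test(fake_images, fake_labels)
print(len(x_train), len(x_test))  # 45000 15000
```

The important property is that inputs and outputs are shuffled with the same indices, so the input at index i still corresponds to the output at index i after the split.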
Understanding the output of a softmax activation layer
Now, let’s run this number through the neural network and we can see what our predicted output looks like!
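In : # sample_index is the position in X_test of the single image we want to classify
In : predicted_results = model.predict(X_test[sample_index].reshape((1, -1)))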
The output of a softmax layer is a probability distribution for every output. In our case, there are 10 possible outputs (digits 0-9). Of course, every one of our images is expected to only match one specific output (in other words, all of our images only contain one distinct digit).
Because this is a probability distribution, the sum of the predicted results is approximately 1.0:
In : predicted_results.sum()
Out : 1.0000001
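The tiny deviation from 1.0 is floating-point rounding. To see where the distribution comes from, here is a minimal softmax in plain Python; the ten raw scores are made-up values standing in for what the last layer might produce before the activation is applied.

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores
    # and does not change the result.
    largest = max(scores)
    exponentials = [math.exp(score - largest) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical raw scores, one per digit 0-9.
raw_scores = [0.1, -1.2, 0.3, 0.0, -0.5, 1.1, 5.0, -2.0, 0.4, 0.2]
probabilities = softmax(raw_scores)

print(sum(probabilities))  # approximately 1.0
```

The largest raw score always ends up with the largest probability, so softmax does not change which digit wins; it only rescales the scores so they can be read as probabilities.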
Reading the output of a softmax activation layer for our digit
As you can see below, the value at index 6 (the 7th element) is really close to 1 (0.9), which means that there is a 90% probability that this digit is a 6… which it is! Congrats!
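Picking the predicted digit out of such an output comes down to finding the index with the largest probability. The vector below is a hypothetical example shaped like the result described above, not the actual output of the trained model.

```python
# Hypothetical softmax output for one image: one probability per digit 0-9.
predicted = [0.01, 0.0, 0.02, 0.01, 0.0, 0.03, 0.9, 0.01, 0.01, 0.01]

# The predicted digit is the index holding the largest probability.
predicted_digit = max(range(len(predicted)), key=lambda index: predicted[index])
confidence = predicted[predicted_digit]

print(predicted_digit)  # 6
print(confidence)       # 0.9
```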
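# Source code: https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()
# Compute confusion matrix
class_names = [str(idx) for idx in range(10)]
cnf_matrix = confusion_matrix(expected_outputs, predicted_outputs)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
plt.show()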
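If the plot is hard to read at first, it helps to see how the underlying table is built. The sketch below counts (true label, predicted label) pairs by hand for a made-up three-class example. The function name and the data are hypothetical, and rows are true labels while columns are predictions, which is the layout scikit-learn’s confusion_matrix uses.

```python
def build_confusion_matrix(expected, predicted, num_classes):
    """Count how often each true label (row) was predicted as each label (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(expected, predicted):
        matrix[true_label][predicted_label] += 1
    return matrix


# Made-up labels for a three-class problem.
expected = [0, 1, 2, 2, 1, 0, 2]
predicted = [0, 2, 2, 2, 1, 0, 1]

matrix = build_confusion_matrix(expected, predicted, num_classes=3)
for row in matrix:
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [0, 1, 2]

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(3))
print(correct, "of", len(expected), "correct")  # 5 of 7 correct
```

Everything off the diagonal is a mistake, and its position tells you which digit was confused with which.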
During this tutorial, you’ve gotten a taste of a couple important concepts that are a fundamental part of one’s job in Machine Learning. We learned how to:
Encode and decode images in the MNIST dataset
Encode categorical features using one-hot encoding
Define our Neural Network with 2 hidden layers, and an output layer that uses the softmax activation function
Inspect the results of a softmax activation function output
Plot the confusion matrix of our classifier
Libraries like scikit-learn and Keras have substantially lowered the entry barrier to Machine Learning, just as Python has lowered the bar of entry to programming in general. Of course, it still takes years (or decades) of work to master!
Engineers who understand Machine Learning are in high demand. With the help of the libraries I mentioned above, and introductory blog posts focused on practical Machine Learning (like this one), all engineers should be able to get their hands on Machine Learning even if they don’t understand the full theoretical reasoning behind a particular model, library, or framework. And, hopefully, they’ll use this skill to improve whatever they’re building every day.
If we start making our components a little bit smarter and a little more personalized every day, we can make customers more engaged and at the center of whatever we are building.
Take home exercise
Here are a few challenges you can do at home to dig deeper into the world of machine learning using Python:
Tweak the number of neurons in the hidden layers. Can you increase the accuracy?
Try to add more layers. Does the neural network train slower? Can you think of why?
Try to train a Random Forest classifier (requires scikit-learn library) instead of a Neural Network. Is the accuracy better?
This article originally appeared on Kite.