Ten essential data science packages for Python

Interest in data science has risen remarkably in the last five years. And while there are many programming languages suited for data science and machine learning, Python is the most popular.

Since it's the language of choice for machine learning, here's a Python-centric roundup of ten essential data science packages, including the most popular machine learning packages.

Scikit-Learn

Scikit-Learn is a Python module for machine learning built on top of SciPy and NumPy. David Cournapeau started it as a Google Summer of Code project. Since then, it's grown to over 20,000 commits and more than 90 releases. Companies such as J.P. Morgan and Spotify use it in their data science work.

Because Scikit-Learn has such a gentle learning curve, even the people on the business side of an organization can use it. For example, a range of tutorials on the Scikit-Learn website show you how to analyze real-world data sets. If you're a beginner and want to pick up a machine learning library, Scikit-Learn is the one to start with.
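To give a feel for that gentle learning curve, here's a minimal sketch of a typical Scikit-Learn workflow. The choice of the bundled iris data set, the k-nearest-neighbors classifier, and the 25% test split are my own assumptions for illustration, not something taken from the Scikit-Learn tutorials.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small data set that ships with Scikit-Learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple classifier and measure its accuracy on the held-out rows.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Most Scikit-Learn estimators follow the same fit and score pattern, so swapping in a different model usually means changing only the line that constructs it.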

Here's what it requires:

  • Python 3.5 or higher
  • NumPy 1.11.0 or higher
  • SciPy 0.17.0 or higher

PyTorch

PyTorch does two things very well. First, it accelerates tensor computation with strong GPU support. Second, it builds dynamic neural networks on a tape-based autograd system, thus allowing reuse and greater performance. If you're an academic or an engineer who wants an easy-to-learn package to perform these two things, PyTorch is for you.

PyTorch is excellent in specific cases. For instance, do you want to compute tensors faster by using a GPU, as I mentioned above? Use PyTorch because you can't do that with NumPy. Want to use an RNN for language processing? Use PyTorch because of its define-by-run feature. Or do you want to use deep learning but you're just a beginner? Use PyTorch because Scikit-Learn doesn't cater to deep learning.
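Here's a small sketch of both ideas, assuming PyTorch is installed. The tensor values and the threshold in the `if` statement are made up for illustration. The code falls back to the CPU when no GPU is present, so it should run either way.

```python
import torch

# Use a GPU when one is available; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# requires_grad=True asks autograd to record operations on this tensor.
x = torch.tensor([1.0, 2.0, 3.0], device=device, requires_grad=True)

# Define-by-run: the graph is built as the code executes, so ordinary
# Python control flow can change which operations are recorded.
y = (x ** 2).sum()
if y.item() > 10:  # illustrative threshold, not part of any API
    y = y * 2

# Replay the recorded operations backwards to compute dy/dx.
y.backward()
grad = x.grad.cpu().tolist()
print(grad)
```

Because the branch is taken here, the recorded function is 2 * sum(x^2), and the gradient is 4 * x for each element.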

Requirements for PyTorch depend on your operating system. The installation is slightly more complicated than, say, Scikit-Learn. I recommend using the "Get Started" page for guidance. It usually requires the following:

  • Python 3.6 or higher
  • Conda 4.6.0 or higher

Caffe

Caffe is one of the fastest implementations of a convolutional network, making it ideal for image recognition. It's best for processing images.

Yangqing Jia started Caffe while working on his PhD at UC Berkeley. It's released under the BSD 2-Clause license, and it's touted as one of the fastest-performing deep-learning frameworks out there. According to the website, Caffe's image processing is quite astounding. They claim it can process "over 60M images per day with a single NVIDIA K40 GPU."

I should highlight that Caffe assumes you have at least a mid-level knowledge of machine learning, although the learning curve is still relatively gentle.

As with PyTorch, requirements depend on your operating system. Check the installation guide for details. If you can, I recommend using the Docker version so it works right out of the box. The compulsory dependencies are below:

  • CUDA for GPU mode
    • Library version 7 or higher and the latest driver version are recommended, but releases in the 6s are fine too
    • Versions 5.5 and 5.0 are compatible but considered legacy
  • BLAS via ATLAS, MKL, or OpenBLAS
  • Boost 1.55 or higher

TensorFlow

TensorFlow is one of the most famous machine learning libraries for some very good reasons. It specializes in numerical computation using dataflow graphs.

Originally developed by Google Brain, TensorFlow is open source. It uses dataflow graphs and differentiable programming across a range of tasks, making it one of the most flexible and powerful machine learning libraries ever created.

If you need to process large data sets quickly, this is a library you shouldn't ignore.

The most recent stable version is v1.13.1, but the new v2.0 is in beta now.

Theano

Theano is one of the earliest open-source software libraries for deep-learning development. It's best for high-speed computation.

While Theano announced that it would stop major developments after the release of v1.0 in 2017, you can still study it for historical reasons. It's made this list of top ten data science packages for Python because if you familiarize yourself with it, you'll get a sense of how its innovations later evolved into the features you now see in competing libraries.

Pandas

Pandas is a powerful and flexible data analysis library written in Python. While not strictly a machine learning library, it's well-suited for analyzing and manipulating large data sets. In particular, I enjoy using it for its data structures, such as the DataFrame, the time series manipulation and analysis, and the numerical data tables. Many business-side employees of large organizations and startups can easily pick up Pandas to perform analysis. Plus, it's fairly easy to learn, and it rivals competing libraries in terms of its features in data analysis.
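As a quick illustration of the DataFrame and the time series tools, here's a minimal sketch. The `units` column, the dates, and the numbers are all made up for the example.

```python
import pandas as pd

# A tiny DataFrame indexed by date; the sales figures are invented.
sales = pd.DataFrame(
    {"units": [3, 5, 2, 8, 6, 4]},
    index=pd.date_range("2019-01-01", periods=6, freq="D"),
)

# Time series manipulation: total units sold in each two-day window.
two_day_totals = sales["units"].resample("2D").sum()

# Time series analysis: a three-day moving average of daily units.
moving_average = sales["units"].rolling(window=3).mean()

print(two_day_totals)
print(moving_average)
```

The first two entries of the moving average are NaN because a three-day window isn't full until the third row.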

If you want to use Pandas, here's what you'll need:

Keras

Keras is built for fast experimentation. It's capable of running on top of other frameworks like TensorFlow, too. Keras is best for easy and fast prototyping as a deep learning library.

Keras is popular amongst deep learning library aficionados for its easy-to-use API. Jeff Hale created a compilation that ranked the major deep learning frameworks, and Keras compares very well.

The only requirement for Keras is one of three possible backend engines: TensorFlow, Theano, or CNTK.

NumPy

NumPy is the fundamental package needed for scientific computing with Python. It's an excellent choice for researchers who want an easy-to-use Python library for scientific computing. In fact, NumPy was designed for this purpose; it makes array computing a lot easier.

Originally, the code for NumPy was part of SciPy. However, scientists who needed the array object in their work had to install the large SciPy package. To avoid that, a new package was separated from SciPy and called NumPy.

If you want to use NumPy, you'll need Python 2.6.x, 2.7.x, 3.2.x, or newer.
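To show what "easier array computing" looks like in practice, here's a short sketch. The temperature readings are invented for the example.

```python
import numpy as np

# Two days of made-up temperature readings in Celsius (rows are days).
temps_c = np.array([[18.0, 21.5, 25.0],
                    [15.0, 19.0, 22.5]])

# Vectorized arithmetic: the formula applies to every element, no loops.
temps_f = temps_c * 9 / 5 + 32

# Reduce along axis 0 to get the mean of each column.
column_means = temps_c.mean(axis=0)

# Broadcasting: the 1-D means are subtracted from every row.
centered = temps_c - column_means

print(temps_f)
print(centered)
```

The same operations on nested Python lists would need explicit loops, which is the main convenience the array object provides.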

Matplotlib

Matplotlib is a Python 2D plotting library that makes it easy to produce cross-platform charts and figures.

So far in this roundup, we've covered plenty of machine learning, deep learning, and even fast computational frameworks. But with data science, you also need to draw graphs and charts. When you talk about data science and Python, Matplotlib is what comes to mind for plotting and data visualization. It's ideal for publication-quality charts and figures across platforms.
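Here's a minimal sketch of a labeled line chart rendered to PNG. The signup numbers are made up, and I've assumed the non-interactive Agg backend so the example also runs on machines without a display. In a script of your own you would more likely pass a file name to `savefig`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
signups = [120, 135, 160, 190]  # invented numbers for illustration

# Build the figure and a single set of axes, then draw and label the line.
fig, ax = plt.subplots(figsize=(4, 3))
ax.plot(months, signups, marker="o", label="Signups")
ax.set_xlabel("Month")
ax.set_ylabel("Signups")
ax.set_title("Monthly signups")
ax.legend()
fig.tight_layout()

# Render to an in-memory PNG; a higher dpi gives a sharper image for print.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
png_bytes = buffer.getvalue()
plt.close(fig)
```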

For long-term support, the current stable version is v2.2.4, but you can get v3.0.3 for the latest features. It does require that you have Python 3 or newer, since support for Python 2 is being dropped.

SciPy

SciPy is a gigantic library of data science packages mainly focused on mathematics, science, and engineering. If you're a data scientist or engineer who wants the whole kitchen sink when it comes to running technical and scientific computing, you've found your match with SciPy.

Since it builds on top of NumPy, SciPy has the same target audience. It has a wide collection of sub packages, each focused on niches such as Fourier transforms, signal processing, optimizing algorithms, spatial algorithms, and nearest neighbor. Essentially, this is the companion Python library for your typical data scientist.
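To make a couple of those sub packages concrete, here's a small sketch that uses `scipy.optimize` and `scipy.spatial`. The function being minimized and the points are invented for the example.

```python
import numpy as np
from scipy import optimize, spatial

# Optimization: find the x that minimizes a simple one-variable function.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)
print(result.x)  # close to 2.0, where the parabola bottoms out

# Spatial algorithms: build a k-d tree and look up the nearest neighbor.
points = np.array([[0.0, 0.0],
                   [1.0, 1.0],
                   [5.0, 5.0]])
tree = spatial.cKDTree(points)
distance, index = tree.query([0.9, 1.2])
print(points[index], distance)
```

Both calls take and return NumPy arrays or plain numbers, which is what makes SciPy feel like a natural extension of NumPy.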

As far as requirements go, you'll need NumPy if you want SciPy. But that's it.

Summary

This brings to an end my roundup of the 10 major data-science-related Python libraries. Is there something else you'd like us to cover that also uses Python extensively? Let us know!

And don't forget that Kite can help you learn these packages faster with its ML-powered autocomplete as well as handy in-editor docs lookups. Check it out for free as an IDE plugin for any of the leading IDEs.


This article originally appeared on Kite. Want to code faster? Kite is a plugin for PyCharm, Atom, Vim, VSCode, Sublime Text, and IntelliJ that uses machine learning to provide you with code completions in real time sorted by relevance.
