by Barbara Murphy
SAN JOSE – Artificial intelligence (AI), with an emphasis on machine learning (ML), has fueled the growth of innovation across a broad range of use cases including autonomous vehicles (AV), fraud detection, speech recognition, and predictive medicine.
However, AI development is complex and requires the right technology and methodology to be successful. In this article, I’ll explore four key technical and infrastructure considerations for effective AI deployments – as well as tips on how your enterprise can get started.
1. Storage constraints can hamper scaling efforts
GPUs have shrunk the processing power of tens of CPU servers into a single GPU server, delivering massively parallel processing and dramatically shortening machine learning cycles. However, the shared storage systems that support AI workloads still rely on technology developed in the 1980s, when networks were slow.
If your data set does not fit inside the local storage on a single GPU server, then scaling the AI workload is a nightmare. NFS, the predominant protocol for data sharing, is limited to about 1.5GB/second in bandwidth, while a single GPU server can easily consume 10x that throughput. GPU workloads demand a low-latency, highly parallel I/O pattern to ensure that AI workloads operate at full bandwidth.
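To put the bandwidth gap in concrete terms, here is a rough back-of-the-envelope sketch of how long a single pass over a data set takes at each rate. The 1.5GB/second and 10x figures come from the paragraph above; the 100TB data set size and the function name are illustrative assumptions, and the calculation ignores caching, protocol overhead and contention.

```python
# Back-of-the-envelope: time to stream one full pass over a data set
# at a sustained bandwidth. Decimal units are used (1 TB = 1000 GB).

NFS_GB_PER_S = 1.5                        # NFS limit quoted in the article
GPU_SERVER_GB_PER_S = 10 * NFS_GB_PER_S   # "10x that throughput"
DATASET_TB = 100                          # hypothetical data set size


def hours_per_pass(dataset_tb: float, gb_per_s: float) -> float:
    """Hours needed to read the whole data set once at a sustained rate."""
    dataset_gb = dataset_tb * 1000
    return dataset_gb / gb_per_s / 3600


nfs_hours = hours_per_pass(DATASET_TB, NFS_GB_PER_S)
gpu_hours = hours_per_pass(DATASET_TB, GPU_SERVER_GB_PER_S)

print(f"Pass limited by NFS bandwidth:    {nfs_hours:.1f} h")
print(f"Pass at the GPU server's appetite: {gpu_hours:.1f} h")
```

Under these assumptions, one pass takes roughly 18.5 hours when limited by NFS and under two hours at the rate the GPU server could consume, so the GPUs would sit idle for most of each pass.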
2. Building competitive advantage with AI means cutting time to market – and improving data collection
The first key element of competitive advantage in AI is being first to get product to market, whether it is a digital MRI machine, an autonomous taxi or an automated trucking fleet. The faster you can train your model, the quicker you will get to market and the better your chance of achieving the number one position.
This means that every minute, hour, and day counts. Training models for autonomous vehicles can take weeks, and reducing that to days has a huge impact on the bottom line. This demands infrastructure with the highest performance and the lowest latency (latency being the secret time killer of machine learning projects). Technologies like InfiniBand, NVMe, multi-node GPUs and fast data access are critical in the race to win.
The other key element is the size of the training data set, because more data generally means better models and hence faster time to production. The larger the training data set, the more accurate the trained model tends to be and the faster it can get to market. Large data sets need a shared storage solution that offers massively high bandwidth, low latency and parallel access so that all GPUs are kept fully busy.
3. Balance compute, networking, and storage to deliver optimal performance for AI workloads
Infrastructure choices have a significant impact on the performance and scalability of a deep learning workflow. Model complexity, catalog data size, and input type (such as images and text) will impact key elements of a solution, including the number of GPUs, servers, network interconnects, and storage type (local disk or shared). The more complex the environment, the greater the need to balance components.
IT infrastructure to support AI and ML is a symbiotic system that must balance compute, networking and storage to get optimal performance from the solution. Any imbalance between these three elements results in wasted resources, both human and hardware. The speed of data insight is a function of computational power and data analysis, so advancements in infrastructure have a significant impact on the rate of innovation and discovery, while flaws in the infrastructure delay time-to-market and time-to-answer.
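One simple way to reason about this balance is to treat the training pipeline as a chain whose end-to-end throughput is set by its slowest stage. The sketch below is a simplified model, not a sizing tool: the function names are hypothetical, the compute and storage rates reuse the figures quoted earlier in this article, and the network rate (a 100Gb/second link, or 12.5GB/second) is an assumption.

```python
# Simplified model: end-to-end throughput equals the slowest stage,
# so every faster stage runs at only a fraction of its capacity.


def bottleneck(rates_gb_per_s: dict[str, float]) -> tuple[str, float]:
    """Return the name and rate of the slowest stage."""
    name = min(rates_gb_per_s, key=rates_gb_per_s.get)
    return name, rates_gb_per_s[name]


def utilization(rates_gb_per_s: dict[str, float]) -> dict[str, float]:
    """Fraction of each stage's capacity in use when the bottleneck is saturated."""
    _, limit = bottleneck(rates_gb_per_s)
    return {stage: limit / rate for stage, rate in rates_gb_per_s.items()}


rates = {
    "compute": 15.0,   # what one GPU server can consume (10 x 1.5 GB/s)
    "network": 12.5,   # assumed 100 Gb/s link
    "storage": 1.5,    # NFS figure quoted earlier
}

stage, limit = bottleneck(rates)
print(f"Bottleneck: {stage} at {limit} GB/s")
for name, used in utilization(rates).items():
    print(f"{name:>8}: {used:.0%} utilized")
```

With these numbers storage is the limiting stage and the GPU server runs at about 10 percent of what it could consume, which is the kind of wasted resource an unbalanced system produces.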
Traditional high-performance computing (HPC) infrastructures that have historically supported research and technical computing workloads are now finding their way into enterprises that are supporting new workloads in ML and AI. These new AI and ML workloads often don’t have the same types of data as traditional HPC, and often require processing millions of tiny files at very high bandwidth. This has forced enterprises to adopt new media types and networking architectures for storage to ensure the compute infrastructure is utilized to its maximum.
Hard disk drives (HDDs) have been the predominant storage medium for HPC workloads since the field's early days. However, they choke under latency-sensitive workloads like AI and ML because of the mechanical seek and rotational latency incurred on each access. The typical read latency for a SATA HDD is around 5.56 milliseconds, while Intel’s enterprise NVMe SSDs are about 65 times faster at 85 microseconds. NVMe flash is well positioned to service the I/O demands of the low-latency applications common in AI and ML workloads.
For real-life training environments, particularly the complex workloads found in AV and fraud detection, data sets can range from hundreds of terabytes to tens of petabytes. At that scale local storage is not an option, and the deep learning (DL) training process requires a high-performance, scalable, shared storage solution.
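The latency figures quoted above can be turned into a quick worked example for a small-file workload. This is a deliberately pessimistic sketch: it assumes one outstanding read at a time, ignores transfer time and the request queuing that real devices use to overlap I/O, and the ten-million-file count and helper name are hypothetical.

```python
# Worked example: what per-read latency means for a small-file workload.
# Assumes strictly serial reads (one request in flight at a time).

HDD_READ_LATENCY_S = 5.56e-3   # SATA HDD, figure quoted in the article
NVME_READ_LATENCY_S = 85e-6    # enterprise NVMe SSD, figure quoted in the article
N_FILES = 10_000_000           # hypothetical number of tiny files


def hours_for_serial_reads(n_files: int, latency_s: float) -> float:
    """Hours spent waiting on latency alone if reads are issued one by one."""
    return n_files * latency_s / 3600


ratio = HDD_READ_LATENCY_S / NVME_READ_LATENCY_S
hdd_hours = hours_for_serial_reads(N_FILES, HDD_READ_LATENCY_S)
nvme_hours = hours_for_serial_reads(N_FILES, NVME_READ_LATENCY_S)

print(f"HDD latency is about {ratio:.0f}x the NVMe latency")
print(f"HDD:  {hdd_hours:.1f} h of latency for {N_FILES:,} reads")
print(f"NVMe: {nvme_hours * 60:.1f} min of latency for {N_FILES:,} reads")
```

Under these assumptions the same ten million reads cost about 15 hours of waiting on HDD and about 14 minutes on NVMe. Parallel access narrows the gap in absolute terms, but the ratio between the two media stays the same.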
4. No business is too small to benefit from AI
Numerous studies have shown that companies that adopt AI are reducing costs, improving efficiency and delivering bottom-line profit. AI can help with problems as basic as setting a maintenance schedule for a factory floor, all the way to targeting the right product to potential buyers and improving sales closure rates.
Look at a company like gong.io that is helping salespeople use the right language to improve the rate of sales closure. No business is too small to utilize readily available AI-powered tools or develop its own AI strategies.
Join Barbara and the weka.io team at The AI Summit London, June 12-13.
Barbara Murphy is VP of Marketing at weka.io