Building an ML pipeline can be challenging, from the lingo to the unfamiliar approaches. DoiT International’s Mike Sparr provides some detailed definitions and explanations.
Observation, model, dimension, feature, fit, train, test, inference. If you’re familiar with machine learning, you’ll probably have heard these terms before. If, however, you’re an IT professional who has yet to fully embrace AI and ML, then this vocab will mean little.
If you fall into this latter camp, don’t worry: you’re one of many still at the early stages of ML adoption.
Learning about an entirely new technology and area of AI, including an entirely new vocabulary of terms, is not straightforward. Many current means of learning about ML and AI (I’m thinking online tutorials, webinars, exams, YouTube videos, etc.) fail to cover one of the most important aspects of ML: how to apply and integrate it into your own business. Of course, there are providers who offer bespoke training tailored to the specific needs of your company. However, this is prohibitively expensive for many.
As a business, this is something we help many organizations with – optimizing their investment with public clouds to realize cost savings, helping solve technical challenges, and counseling them on best practices.
I can’t promise to offer a comprehensive what/why/how of ML in this one article alone. However, I can shed some light on areas that, in my experience, I’ve found people tend to struggle with. In doing so, I hope to begin to demystify ML and give you an appetite for learning more about the tech and its applications in the real world, delivering real benefits for your organization.
Learn the lingo
First things first, let’s start with the lingo. Don’t worry if you don’t remember all of these terms immediately. It’s only by putting into practice what you learn (and applying the terms when doing so) that it all will make sense and become a key part of your own vocab. Here’s a handy cheat sheet for starters:
Model — a math function generated by a computer program that applies unique weights to numerical input values and returns a numerical result.
Feature — a numerical input value or parameter (AKA dimension) you pass to your function for it to return a prediction.
Label — a known answer for past data (AKA class); e.g., if you’re training an ML model to predict whether your friend will like a gift you’re buying them, then the inputs would be those gift ideas. You already know they liked books and clothing in the past, and not homeware, so you give books and clothing a value of 1 (good) and homeware a value of 0 (bad).
Observation — a set of features gathered about a particular input; e.g., with the above example, you could have different genres as the features extracted from books.
Feature engineering — identifying the attributes about a given observation that likely influence the result and converting them to number values so you can build a mathematical function instead of a manually written one.
Training — repeatedly iterating through a data set, trying out different weights (theta) on each feature until most results are correct.
Fit — behind the scenes your ML algorithm is trying to figure out the coordinates of a line that can separate ‘good’ from ‘bad’ records, so future records that plot on one side of the line are ‘good’ and ones that plot on the other side are ‘bad’.
Inference — the result or prediction that is returned by your model function (i.e., given these parameters, I infer this answer).
Serving — using your function in a program by loading the model file and passing it parameters so it can return predictions.
Testing — passing test features to the model to get predictions, and then comparing the results to known values (labels) for a group of records.
Accuracy — the percentage of prediction results that matched the expected result out of all test records.
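To see how the terms above fit together, here is a minimal sketch in Python built on the gift example. The data, the choice of three gift categories as features, and the simple perceptron-style update rule are all assumptions made for illustration; a real project would typically use an ML library and far more data.

```python
# Each observation is a list of numeric features: [is_book, is_clothing, is_homeware].
# The categories and rows are made up for illustration.
train_features = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]]
train_labels = [1, 1, 0, 1, 0]  # labels: 1 = liked the gift, 0 = did not


def predict(weights, bias, features):
    """Inference: apply the weights to one observation and return 1 or 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train(features, labels, epochs=10, learning_rate=0.1):
    """Training: pass over the data repeatedly, nudging the weights after each mistake."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = y - predict(weights, bias, x)  # 0 when the prediction is right
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def accuracy(weights, bias, features, labels):
    """Accuracy: the share of predictions that match the known labels."""
    correct = sum(predict(weights, bias, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


# Fit the model (here the "model" is just the weights and the bias).
weights, bias = train(train_features, train_labels)

# Testing: compare predictions with labels for records held back from training.
test_features = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
test_labels = [1, 0, 1]
test_accuracy = accuracy(weights, bias, test_features, test_labels)
print(f"accuracy: {test_accuracy:.0%}")
```

Serving would then amount to saving `weights` and `bias` to a file, loading them in another program, and calling `predict` with new features.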
Ok, so you’ve learned some terms. Now let’s put them into action. I’ve used a basic example below to illustrate how to apply ML to a problem. Of course, in order to apply it to a specific business problem your organization is facing, a more bespoke approach is required.
Applying ML and building a model
A Minimum Viable Model (MVM) approach to machine learning can help you develop an initial ML model that demonstrates proof of concept for a specific business use case.
1. Data exploration
First, analyze the available data sources to assess the state of the data and how useful it is likely to be in an ML model. This includes analyzing data characteristics, quality and cleanliness, as well as potential correlations and patterns. Check for class imbalance and validate your hypotheses against the data.
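As one small example of data exploration, the sketch below checks for class imbalance by counting how often each label appears. The label values and the 20 percent threshold are assumptions chosen for illustration, not a recommended rule.

```python
from collections import Counter

# Hypothetical labels: 1 = visitor subscribed, 0 = visitor did not.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]


def class_shares(labels):
    """Return the fraction of records that fall into each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


shares = class_shares(labels)
for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.0%}")

# Arbitrary illustrative threshold: flag any class below 20% of the records.
minority = [label for label, share in shares.items() if share < 0.2]
print("under-represented classes:", minority)
```

With data like this, a model that always predicts 0 would be right 90 percent of the time without learning anything, which is why imbalance is worth spotting before you trust an accuracy figure.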
2. Algorithm selection
Research modeling strategies, including existing approaches and whitepapers, to determine the appropriate ML algorithm for the business problem. Then select known algorithms based on your hypothesis, the type of features and the patterns in the data.
3. Feature engineering
Create ML model features based on analysis and tests of the raw data, using domain knowledge to identify potential features and to advise on how the raw data should be transformed into them.
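In practice, feature engineering often means turning text and dates into numbers. The sketch below converts one raw advertising record into a feature list. The channel names, the date and the spend figure are hypothetical, and the choice of features (one flag per channel, day of week, month, spend) is only one of many possible designs.

```python
from datetime import date

# Hypothetical advertising channels; the order fixes the position of each flag.
CHANNELS = ["google", "facebook", "tv", "radio", "print"]


def to_features(channel, day, spend):
    """Turn one raw advertising record into a list of numbers."""
    # One-hot encoding: a 1.0 in the slot for this channel, 0.0 elsewhere.
    one_hot = [1.0 if channel == name else 0.0 for name in CHANNELS]
    day_of_week = float(day.weekday())  # Monday = 0 ... Sunday = 6
    month = float(day.month)
    return one_hot + [day_of_week, month, float(spend)]


# Example record: a TV ad on a made-up date with a made-up spend.
features = to_features("tv", date(2021, 3, 15), 250.0)
print(features)
```

Encoding the day of the week as a single number implies an order between days; if that does not suit the algorithm you select, the same one-hot approach used for channels can be applied to days.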
4. Initial model development
Finally, develop an initial ML model using the data to solve the business problem and iterate.
Let’s use a specific example here. Say you have a website and want to increase subscriber numbers. In the past, you spent equal amounts on advertising on Google, Facebook, TV, radio and print. You have the same budget next year and need to know where to spend to increase new subscribers by 20 percent.
You’ve analyzed your new subscribers each month from last year, including the source with cost and sign-up counts. Most subscribers came from Google, then Facebook, TV and a few from radio and print, but the amounts vary by season and month.
Now you can use the ‘Goal Seek’ feature in Excel to work out how to allocate the budget. Paste in last year’s results so that each row shows how much you spent on each provider, how many new subscribers each delivered and at what cost, and possibly other data points. Goal Seek then automatically changes an input value and recalculates, repeating until the output cell reaches the desired value, such as your target number of subscribers.
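The idea behind Goal Seek can be sketched in a few lines of Python: adjust one input until a formula's output hits a target. The response curve below is invented for illustration, and bisection is simply a stand-in search method; this is not a description of how Excel implements the feature.

```python
import math


def subscribers(spend):
    """Hypothetical response curve with diminishing returns (made up for illustration)."""
    return 40.0 * math.sqrt(spend)


def goal_seek(formula, target, low, high, tolerance=1e-6):
    """Adjust a single input until formula(input) is within tolerance of target.

    Uses bisection, so it assumes the formula increases between low and high.
    """
    for _ in range(200):
        mid = (low + high) / 2
        value = formula(mid)
        if abs(value - target) <= tolerance:
            return mid
        if value < target:
            low = mid   # output too small: try a larger input
        else:
            high = mid  # output too large: try a smaller input
    return (low + high) / 2


# How much spend would this made-up curve need to reach 1,200 subscribers?
spend_needed = goal_seek(subscribers, target=1200.0, low=0.0, high=10_000.0)
print(round(spend_needed, 2))
```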
A process like this would be far too difficult to carry out by hand. Say your data includes 100 signups per day for 365 days of the year, and different days resulted in more signups with one provider over another. It would be extremely hard to spot the differences across 36,500 rows of data covering the provider, day of the week, day of the year, views, clicks, conversions, signups, cost per click or impression, etc. ML algorithms are a lot like Goal Seek, but they can iteratively change the values in many cells at once and test each change until the output value gets closer to the known value (label or class) for any given input.
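Here is a rough sketch of that "many cells at once" idea, using gradient descent to fit one weight per advertising channel. The spend figures are invented, and the signup counts are generated from hidden weights (50, 30 and 10) so you can check that training recovers them; real data would be noisier and rarely this linear.

```python
# Each row: spend (in thousands) on [google, facebook, tv]; values are made up.
spend_rows = [
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.0],
    [0.5, 0.5, 2.0],
    [1.5, 1.0, 0.0],
    [0.0, 2.0, 1.5],
    [2.0, 0.0, 1.0],
]
# Known signups per row (the labels), generated from hidden weights 50, 30 and 10.
signups = [50 * g + 30 * f + 10 * t for g, f, t in spend_rows]


def fit(rows, targets, learning_rate=0.05, steps=5000):
    """Adjust all weights together to shrink the mean squared prediction error."""
    weights = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for row, target in zip(rows, targets):
            # How far the current prediction is from the known answer.
            error = sum(w * x for w, x in zip(weights, row)) - target
            for j, x in enumerate(row):
                gradient[j] += 2 * error * x / n
        # Move every weight a small step in the direction that reduces the error.
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]
    return weights


weights = fit(spend_rows, signups)
print([round(w, 2) for w in weights])
```

The learned weights estimate how many signups each extra thousand spent on a channel delivers, which is the kind of figure you would then use to decide where next year's budget goes.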
Building an ML pipeline can be challenging, from the lingo to the unfamiliar approaches. So, if you’re still in the dark after reading the above, or just want to ensure you achieve your goals as quickly and efficiently as possible, consider partnering with a service provider packed to the gills with specialists who can unlock the full potential of your public cloud infrastructure – to demystify ML and make it work for you and your organization.
Mike Sparr is a staff cloud architect at DoiT International. Mike has over 20 years of experience in software technology, ranging from co-founding several startups to working in Fortune Global 50 companies. A former customer of both DoiT International and Google, Mike joined DoiT in early 2020.