A Quickstart Guide to AutoML

Posted by GUPTA, Gagan
Published: July 21, 2021

Background

AI has penetrated every industry, leading to bold claims like 'data is the new oil' and 'AI is the new electricity'. Three business titans - Jeff Bezos, Bill Gates, and Elon Musk - recently became centibillionaires, as their AI-powered products and services have taken off.
AI has traditionally been limited largely to academia and big tech, for three main reasons:

1. AI used to be expensive, requiring teams of data scientists, who average six-figure salaries in the United States.
2. AI used to be time-consuming, with even simple Proofs-of-Concept taking at least 1-2 months.
3. AI used to be exceptionally challenging, with intensive data engineering and deployment demands.

AutoML addresses these three problems by enabling organizations to implement AI quickly, easily, and affordably.

Automated Machine Learning provides methods and processes to make Machine Learning available for non-Machine Learning experts, to improve efficiency of Machine Learning and to accelerate research on Machine Learning.

The high degree of automation in AutoML allows non-experts to use machine learning models and techniques without first becoming experts in machine learning. AutoML thus opens AI development to people who lack the theoretical background currently needed for a role in data science.

Standard ML approach

In a typical ML application, a data science team has a set of input data to be used for training. The raw data may not be in a form that every algorithm can work with directly. To make the data amenable to machine learning, a team of experts may have to apply appropriate data pre-processing, feature engineering, feature extraction, and feature selection methods. After these steps, the team must perform algorithm selection and hyperparameter optimization to maximize the predictive performance of the model. Each of these steps can be challenging and can consume considerable time and resources, which creates significant hurdles to using machine learning.

AutoML dramatically simplifies these steps for non-experts. One can think of AutoML - whether it is building classifiers or training regression models - as a generalized search problem, with specialized search algorithms for finding the best solution for each component of the ML pipeline. By automating just three key pieces - feature engineering, hyperparameter optimization, and neural architecture search - AutoML promises a future where democratized machine learning is a reality.
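To make the "generalized search" idea concrete, here is a minimal sketch in plain Python. It is not taken from any AutoML library: the search space, the toy data, and the function names (`SEARCH_SPACE`, `make_data`, `random_search`, and so on) are all assumptions made for illustration. It treats a pipeline as one choice per component (scaling, feature selection, a k-nearest-neighbour hyperparameter) and uses plain random search, the simplest possible search algorithm, to pick the combination with the best validation accuracy.

```python
import random

# Hypothetical search space: each key is one pipeline component.
SEARCH_SPACE = {
    "scale": [False, True],            # preprocessing: standardize features or not
    "features": [(0,), (1,), (0, 1)],  # feature selection: which columns to keep
    "k": [1, 3, 5, 7, 9],              # hyperparameter of a k-nearest-neighbour model
}

def make_data(n, rng):
    """Toy data: the label depends only on column 0; column 1 is large-scale noise."""
    rows = [(rng.uniform(-1, 1), rng.uniform(-100, 100)) for _ in range(n)]
    labels = [1 if x0 > 0 else 0 for x0, _ in rows]
    return rows, labels

def column_stats(rows):
    """Mean and standard deviation of every column (std falls back to 1.0 if zero)."""
    stats = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        stats.append((mean, var ** 0.5 or 1.0))
    return stats

def transform(rows, config, stats):
    """Apply feature selection, then optional standardization with training statistics."""
    out = []
    for row in rows:
        vals = []
        for j in config["features"]:
            v = row[j]
            if config["scale"]:
                mean, std = stats[j]
                v = (v - mean) / std
            vals.append(v)
        out.append(vals)
    return out

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points (k is odd, so no ties)."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(t, x)), y)
                   for t, y in zip(train_x, train_y))
    votes = sum(y for _, y in dists[:k])
    return 1 if votes * 2 > k else 0

def evaluate(config, train, valid):
    """Validation accuracy of the pipeline described by `config`."""
    (tr_rows, tr_y), (va_rows, va_y) = train, valid
    stats = column_stats(tr_rows)
    tr_x = transform(tr_rows, config, stats)
    va_x = transform(va_rows, config, stats)
    hits = sum(knn_predict(tr_x, tr_y, x, config["k"]) == y for x, y in zip(va_x, va_y))
    return hits / len(va_y)

def random_search(train, valid, n_trials, rng):
    """Sample random pipeline configurations and keep the best-scoring one."""
    best_config, best_score = None, -1.0
    for _ in range(n_trials):
        config = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
        score = evaluate(config, train, valid)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

rng = random.Random(0)
train, valid = make_data(200, rng), make_data(100, rng)
best_config, best_score = random_search(train, valid, n_trials=20, rng=rng)
print(best_config, best_score)
```

Real AutoML tools replace the random sampling with smarter strategies such as Bayesian optimization or evolutionary search, and they score pipelines with cross-validation rather than a single holdout split, but the overall loop has the same shape.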

Advantages of AutoML

There are three main advantages to incorporating AutoML. These are:
- Productivity - automation reduces the manual resources needed to monitor and perform repetitive ML tasks. This frees teams to focus on model refinement and packaging.
- Standardization - automated pipelines help reduce the chance of configuration errors and ensure that training and tests are performed uniformly.
- Democratization - AutoML lowers the barrier to entry for organizations with little to no ML expertise. This increases competitiveness and can increase innovation.

Popular open-source AutoML tools

If you're ready to get started with AutoML, there is a growing number of tools available to you. Below are a few to consider using.

auto-sklearn

auto-sklearn is an open source toolkit based on scikit-learn that you can use to perform model selection, feature engineering, and hyperparameter tuning. It uses Bayesian optimization, ensembles, and meta-learning to improve model accuracy and training. When using auto-sklearn, you can set time and memory limits for the search, restrict the search space, and control preprocessing. You can also inspect training statistics and results and run computations in parallel.

Auto-WEKA

Auto-WEKA is an open source library that you can use to optimize your hyperparameter selection. It uses Bayesian optimization to select a learning algorithm and hyperparameters from those available in the WEKA package. Auto-WEKA has the same requirements as WEKA and includes a graphical user interface (GUI) for ease of use.

AutoKeras

AutoKeras is an open source library based on Keras that you can use for classification and regression of images, text, and structured data. It enables you to use pre-built blocks to construct a model, leaving you to focus on high-level architecture. AutoKeras supports use with Python 3.5 and up and TensorFlow 2.1.0 and up.

H2O AutoML

H2O is an open source platform for ML that is distributed and runs in-memory. It supports a wide range of ML and statistical algorithms, including deep learning, generalized linear models, and gradient boosted machines. While the core H2O algorithms are not automated, the platform includes a module called H2O AutoML, which ships with the open source H2O distribution rather than as a paid add-on. H2O AutoML enables you to train and tune models with automatic feature selection and extraction, hyperparameter optimization, and ensembles (multiple models combined for greater performance). You can use H2O AutoML from a web GUI. It integrates with Hadoop, Spark, and Kubernetes.

H2O AutoML = Random Grid Search + Stacking
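The "random grid search + stacking" recipe can be illustrated with a small, self-contained sketch. This is not H2O's implementation and does not use the H2O API. The grid, the toy k-nearest-neighbour regressor, and all function names are assumptions chosen to keep the example short. The sketch evaluates a random subset of a hyperparameter grid, then stacks the two best models with a one-parameter least-squares meta-learner.

```python
import itertools
import random

# Hypothetical hyperparameter grid for a 1-D k-nearest-neighbour regressor.
GRID = {"k": [1, 3, 5, 9, 15], "weighted": [False, True]}

def make_data(n, rng):
    """Toy regression data: y = x^2 plus Gaussian noise."""
    xs = [rng.uniform(-2, 2) for _ in range(n)]
    ys = [x * x + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

def knn_predict(train, x, k, weighted):
    """Predict y at x as a (distance-weighted) mean of the k nearest training targets."""
    xs, ys = train
    nearest = sorted((abs(a - x), y) for a, y in zip(xs, ys))[:k]
    if weighted:
        weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    else:
        weights = [1.0] * len(nearest)
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def random_grid_search(train, valid, n_models, rng):
    """Evaluate a random subset of the grid instead of every combination."""
    combos = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
    results = []
    for params in rng.sample(combos, n_models):
        preds = [knn_predict(train, x, **params) for x in valid[0]]
        results.append((mse(preds, valid[1]), params, preds))
    return sorted(results, key=lambda r: r[0])  # leaderboard, best model first

def fit_blend_weight(p1, p2, ys):
    """Meta-learner: least-squares weight w for the blend w*p1 + (1 - w)*p2."""
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:
        return 0.5  # identical predictions, any weight gives the same blend
    return sum((y - b) * (a - b) for a, b, y in zip(p1, p2, ys)) / den

rng = random.Random(0)
train, valid = make_data(150, rng), make_data(80, rng)
leaderboard = random_grid_search(train, valid, n_models=5, rng=rng)
(err1, params1, p1), (err2, params2, p2) = leaderboard[:2]

# Stack the two best models. NOTE: the weight is fitted on the same validation
# data used for scoring, which is only acceptable in a toy example.
w = fit_blend_weight(p1, p2, valid[1])
stacked = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
stacked_err = mse(stacked, valid[1])
print(params1, err1, params2, err2, w, stacked_err)
```

On the data used to fit the weight, the blend can never be worse than the better of the two base models, which is why stacking is attractive. A production system would fit the meta-learner on cross-validated predictions and report the error on data that neither the base models nor the meta-learner has seen.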

MLBox

MLBox is an open source Python library that you can use to automate many aspects of model training. These aspects include data preprocessing, feature selection, and hyperparameter optimization. It also includes predictive models for regression and classification, such as LightGBM, Stacking, and Deep Learning.

TPOT

TPOT is an open source Python data science automation tool. It optimizes a series of feature preprocessors and models in order to maximize cross-validation accuracy on a data set.
TPOT is built on the scikit-learn library and follows the scikit-learn API closely. It can be used for regression and classification tasks and has special implementations for medical research.
TPOT uses what its developers call a genetic search algorithm to find the best parameters and model ensembles. It can also be thought of as a natural selection or evolutionary algorithm.
TPOT tries a pipeline, evaluates its performance, and randomly changes parts of the pipeline in search of better-performing pipelines.
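The try, evaluate, and randomly change loop can be sketched in a few lines of plain Python. The code below does not use TPOT and is much simpler than TPOT's genetic programming: it is mutation-only, the "pipeline" is a toy threshold rule, and the names (`OPTIONS`, `fitness`, `mutate`, `evolve`) are assumptions made for illustration. Fitness here is accuracy on a single toy data set, where a real tool would use a cross-validation score.

```python
import random

# Hypothetical pipeline "genes": each component has a small set of options.
OPTIONS = {
    "feature": [0, 1, 2],
    "transform": ["identity", "square", "abs"],
    "threshold": [0.0, 0.25, 0.5, 0.75, 1.0],
}
TRANSFORMS = {"identity": lambda v: v, "square": lambda v: v * v, "abs": abs}

def make_data(n, rng):
    """Toy data: label is 1 when column 1 is far from zero; other columns are noise."""
    rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
    labels = [1 if abs(r[1]) > 0.5 else 0 for r in rows]
    return rows, labels

def fitness(pipeline, data):
    """Accuracy of the rule: predict 1 when transform(row[feature]) > threshold."""
    rows, labels = data
    f = TRANSFORMS[pipeline["transform"]]
    hits = sum(
        (1 if f(r[pipeline["feature"]]) > pipeline["threshold"] else 0) == y
        for r, y in zip(rows, labels)
    )
    return hits / len(labels)

def random_pipeline(rng):
    return {name: rng.choice(opts) for name, opts in OPTIONS.items()}

def mutate(pipeline, rng):
    """Copy the pipeline and randomly change one component."""
    child = dict(pipeline)
    gene = rng.choice(list(OPTIONS))
    child[gene] = rng.choice(OPTIONS[gene])
    return child

def evolve(data, population_size, generations, rng):
    """Keep the better half of each generation and refill it with mutated copies."""
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, data), reverse=True)
        history.append(fitness(ranked[0], data))
        survivors = ranked[: population_size // 2]  # selection, always keeps the best
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    best = max(population, key=lambda p: fitness(p, data))
    return best, fitness(best, data), history

rng = random.Random(0)
data = make_data(300, rng)
best, best_score, history = evolve(data, population_size=10, generations=15, rng=rng)
print(best, best_score)
```

Because the best pipeline always survives into the next generation, the best score never decreases from one generation to the next. TPOT itself also recombines pipelines (crossover) and can grow pipelines of varying length, which this sketch leaves out.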

Auto-Pytorch

Auto-PyTorch is an AutoML library built on top of PyTorch, the deep learning framework originally developed at Facebook. It automates many time-consuming steps in training deep neural networks.
Finding the right architecture and hyperparameter settings for training a deep neural network is crucial for achieving top performance.
Auto-PyTorch automates these two aspects by combining multi-fidelity optimization with Bayesian optimization (BOHB) to search for the best settings.
The current version of Auto-PyTorch is an early alpha and only supports featurized data. Upcoming versions will also support image data, natural language processing, speech, and video.
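The multi-fidelity half of that idea can be sketched with successive halving: score many configurations on a small training budget, keep the best fraction, and give the survivors more budget. The code below is only an illustration in plain Python and does not use Auto-PyTorch. It samples learning rates at random instead of using a Bayesian model, the "network" is a one-variable linear regression, and the names (`train_and_score`, `successive_halving`, `eta`) are assumptions.

```python
import random

def make_data(n, rng):
    """Toy regression data: y = 3x + 1 plus small noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3 * x + 1 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def train_and_score(lr, epochs, data):
    """Fit y = w*x + b by gradient descent for `epochs` passes; return final MSE."""
    xs, ys = data
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

def successive_halving(configs, data, min_epochs, eta):
    """Score all configs on a small budget, keep the best 1/eta, repeat with eta times more."""
    budget = min_epochs
    rounds = []  # (epochs per config, number of configs) for each round
    while True:
        # For simplicity every round retrains from scratch instead of resuming.
        scored = sorted((train_and_score(lr, budget, data), lr) for lr in configs)
        rounds.append((budget, len(configs)))
        if len(configs) == 1:
            return scored[0][1], scored[0][0], rounds
        keep = max(1, len(configs) // eta)
        configs = [lr for _, lr in scored[:keep]]
        budget *= eta

rng = random.Random(0)
data = make_data(100, rng)
# Candidate learning rates sampled log-uniformly between 1e-4 and about 0.5.
candidates = [10 ** rng.uniform(-4, -0.3) for _ in range(9)]
best_lr, best_mse, rounds = successive_halving(candidates, data, min_epochs=2, eta=3)
print(best_lr, best_mse, rounds)
```

With nine candidates, a minimum budget of 2 epochs, and `eta=3`, the schedule is 9 configurations at 2 epochs, 3 at 6 epochs, and 1 at 18 epochs, for 54 training epochs in total. Training all nine candidates for the full 18 epochs would cost 162. BOHB adds a Bayesian model that proposes promising configurations instead of sampling them at random.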

Commercial AutoML services


- Automated ML in the Microsoft Azure Machine Learning cloud service.
- Google Cloud AutoML on Google Cloud Platform.
- AutoAI in IBM Watson Studio, for automation of data preparation, model development, feature engineering, and hyperparameter optimization.
- Oracle Accelerated Data Science SDK, a Python library included as part of the Oracle Cloud Infrastructure Data Science service.

Will AutoML replace Data Scientists?

The short answer is NO, not in its current form.

Besides the difficulty of automating many data science tasks, replacement is not really the point of AutoML. Its purpose is to assist data scientists and free them from repetitive, less demanding tasks, so they can invest their time in tasks that are more challenging, creative, and harder to automate. Several limitations stand in the way of full automation:

- Unsupervised learning - Unsupervised learning techniques aim to discover patterns in data when no ground truth is available. Without ground truth there is no clear measure of success, so it is hard to judge the quality of the results or to compare methods directly. How do we define 'success' here? AutoML overlooks the more challenging tasks of unsupervised and reinforcement learning and focuses on supervised tasks that require labeled data as input.
- Automation always comes at a cost - most products that highlight automated machine learning as their core feature are relatively expensive.
- Automated machine learning has a switching cost - the more you 'automate' your pipeline with a specific provider, the harder it is to switch to another.
- Value / differentiation - The AI / data science role at senior levels is all about intellectual property, differentiation, and scale. These elements need customization. If features that can easily be automated are the core value proposition of your service, it may lack differentiation.
- The 80/20 rule - Automated machine learning mostly automates the 80% of the work that you could often do yourself. The remaining 20% will require a lot of work in any case, probably regardless of whether you use automated machine learning. The same idea applies to industries. Most data science work today is based on financial services, insurance, and similar sectors. If your industry falls outside these, you may find fewer pre-built components.

Final Thoughts

We live in an era where the growth of data outpaces our ability to make sense of it. AutoML technology has now been implemented in many scenarios, but the challenge is to implement it at large scale and in more industries. The obstacle is that further breakthroughs in AutoML require deeper research at the theoretical and algorithmic levels. In practice, AutoML has gone through many iterations at companies. It has evolved from binary classification to multi-class classification and regression, and from structured data to unstructured data such as images and video. It is also used in automated supervised learning that copes with low-quality data and in automated multi-party machine learning that protects privacy. On the theory side, researchers are exploring the boundaries of AutoML algorithms, because no general algorithm can solve all problems.

Our team of experts at Vyom Data Science can assist you in setting up your very first AutoML pipeline. Do contact us if you want to become an expert in the field of AutoML. It would be fun!

Support our effort by subscribing to our YouTube channel. Update yourself with our latest videos on Data Science.

Looking forward to seeing you soon. Till then, keep learning!
