Machine Learning Development Life Cycle

Posted by GUPTA, Gagan | Published: June 20, 2021


Machine Learning Development Life Cycle (MLDLC) is a multi-stage process (with underlying sub-stages) that teams adopt to develop, train and serve models, using the data lakes involved in various applications, so that the organization can take advantage of AI & ML algorithms to derive practical business value. MLDLC is still evolving toward a standard model; organizations and projects often customize the base model to fit their specific requirements.

It can be tempting to confuse MLDLC with SDLC. They are not the same thing. SDLC is a relatively deterministic process. In Data Science, we do not know the outputs beforehand. This risk is mitigated by following good practices and by defining success metrics, throughout the MLDLC process.
Machine learning is about data - no lie there. But it is also about development, manipulating data, and modeling. All of these separate parts together form the machine learning development life cycle.
Machine learning provides the benefits of power, speed, efficiency, and intelligence through learning without explicitly programming these into an application. It provides opportunities for improved performance, productivity, and robustness.

0-Problem Understanding

In computing, we start counting from 0. Many organizations and projects choose to skip this step. Every project starts with a problem that one needs to solve. Generating an ML model is not the solution; solving the underlying problem is. Many people seem to be under the assumption that an ML project is fairly straightforward once you have the data and computing resources necessary to train a model. They could not be more wrong. Moreover, just because the problem is defined in business terms, machine learning does not happen by itself. A lot of mountains must be moved.

Understand the business and the use case one is working with, and define a proper problem statement. Asking the right questions of the business people to get the required information plays a prominent role. Frame the problem in the terms of an AI/ML project. If the business problem is not translated into a proper ML problem, the final model will not be deployable for the business use case. One might generate the ML model successfully, yet still be far from solving the underlying business problem that needed attention.
Involve the various stakeholders (business users, business SMEs, project sponsor, project manager, BI analysts, DS engineers, data scientists, DBAs). Produce a feasibility report.

Have a system in place that provides data in real time or in batches to a machine learning model, and, based on the model's output, the system reacts and prescribes a new outcome. Simple enough!

1-Data Access And Collection

Once the problem is defined, the next step in a machine learning project is accessing the data. While the end goal is a high-quality model, the lifeblood of training a good model is the amount and quality of the data being passed into it.

Typically, data engineers will obtain the data for the business problems they are working on by querying the databases where their companies store their data. In addition, there is a lot of value in unstructured datasets that do not fit well into a relational database (e.g. logs, raw texts, images, videos, etc.). These datasets are heavily processed via Extract, Transform, Load (ETL) pipelines written by data engineers and data scientists. These datasets often reside in a data lake.

When data scientists do not have the data needed to solve their problems, they can get it by scraping websites, purchasing data from third-party providers, or collecting it from surveys, clickstream data, sensors, cameras, etc. Some datasets come from government organizations, others from public companies and universities. Public datasets usually come with annotations (when applicable), so you and your team can avoid the manual operations that take a significant amount of project time and cost.
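To make this concrete, here is a minimal sketch of pulling tabular data from a relational store into pandas; the SQLite file and the transactions table are purely hypothetical, for illustration only.

import sqlite3
import pandas as pd

# Hypothetical warehouse file and table, standing in for the company database
conn = sqlite3.connect("warehouse.db")
df = pd.read_sql_query(
    "SELECT customer_id, amount, purchased_at FROM transactions", conn
)
conn.close()
print(df.shape)   # rows and columns pulled for the problem at hand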


2-Data Preparation And Exploration

When data comes from various sources, prior knowledge is needed to combine the data from the different sources. Come up with insights in the data that can lead to potential solutions. This involves analyzing and visualizing data, and sometimes also building a strawman model to understand the feasibility of solving the problem.

Data scientists have to prepare the raw data, perform data exploration, visualize the data, transform it, and possibly repeat these steps until the data is ready to use for modeling. Data preparation is cleansing and processing raw data before analysis. Raw data can be messy, unstructured, duplicated or inaccurate. Data scientists explore the data available to them, then cleanse it by identifying corrupt, inaccurate and incomplete records and replacing or deleting them.
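As an illustration of such cleansing, here is a small self-contained sketch using pandas; the toy dataframe and the imputation choices are assumptions, not a prescription.

import pandas as pd

# Toy data with a duplicate row and missing values, for illustration
df = pd.DataFrame({
    "age": [34, 34, None, 51],
    "income": [72000, 72000, 58000, None],
    "label": [1, 1, 0, 1],
})
df = df.drop_duplicates()                          # remove exact duplicate rows
df["age"] = df["age"].fillna(df["age"].median())   # impute missing numeric values
df = df.dropna(subset=["income"])                  # drop rows missing a key field
print(df)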

After the data is cleansed, data scientists explore the features (or variables) in their dataset, identify relationships between the features, and apply any needed transformations. There are various tools data scientists can use for exploratory data analysis (EDA), from open-source libraries (such as pandas-profiling) to analytics/data science platforms.
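For instance, the pandas-profiling library mentioned above can generate a full EDA report in a couple of lines (the dataframe df is assumed to already exist; the library was later renamed ydata-profiling).

from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="EDA Report")   # summary stats, correlations, missing values
profile.to_file("eda_report.html")                # browse the report as interactive HTML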

EDA is a science in itself and is beyond the scope of this article, so I shall keep it short and to the point here. Follow my blogs for detailed reading on EDA.

3-Model: Build, Train And Evaluation

Model build consists of choosing the correct machine learning models to solve the problems and features that go into the models. In the first step of model build, data scientists need to decide what might be the appropriate machine learning model to solve the problem. There are two main types of machine learning models: supervised and unsupervised. Supervised learning involves modeling a set of input data to an output or a label. Classification and regression are supervised learning problems. Unsupervised learning involves modeling a set of input data without a label. For example, customer segmentation is an unsupervised learning problem. You do not know a priori what customer segment a customer belongs to. The segment will be assigned by the model.

Different classes of machine learning models are used to solve unsupervised and supervised learning problems. Typically, data scientists will try multiple models and algorithms and generate multiple model candidates. Data scientists do not know a priori what model will perform best on the dataset, so they experiment with several of them. During the model training, a data scientist might do feature selection which is the process of selecting only a subset of features as input to the machine learning model.

During the model training, the dataset is split up into training and testing sets. The training dataset is used to train the model, and the testing dataset is used to see how well the model performs on data it has not seen.
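With scikit-learn, that split is a single call; X and y below stand for an assumed feature matrix and label vector.

from sklearn.model_selection import train_test_split

# Hold out 20% of the data to evaluate on examples the model has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)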

Model hyperparameter tuning is a major task in the model training process. Models are algorithms, and hyperparameters are the knobs that a data scientist can tune to improve the performance of the model.
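One common way to turn those knobs is an exhaustive grid search with cross-validation, sketched below; the random forest and its parameter grid are illustrative choices, not the only option.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)                      # X_train/y_train from the split above
print(search.best_params_, search.best_score_)    # best knob settings found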

Some models can be trained faster on specialized hardware (e.g., training perceptrons/deep neural network models on GPUs.) You may also explore distributed training environments that can speed up the process, especially when the amount of data cannot fit in the memory of the largest machine available, through splitting and distributing the data across multiple machines, or when you want to simultaneously train multiple model candidates in parallel on separate machines.

AutoML can improve the productivity of data scientists by automating the training process. It also allows data analysts and developers to build machine learning models without tweaking every aspect of the model training process that normally requires data science expertise.

Model explanations typically fall into global explanations and local explanations. A global explanation describes the general behavior of a machine learning model as a whole; a local explanation describes why the model made a particular prediction for a single instance.
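One widely used global-explanation technique is permutation importance: shuffle one feature at a time and measure how much the model's score degrades. A minimal sketch, reusing the fitted search and held-out data from the tuning example above:

from sklearn.inspection import permutation_importance

result = permutation_importance(search.best_estimator_, X_test, y_test, n_repeats=10)
for i in result.importances_mean.argsort()[::-1]:            # most important first
    print(f"feature {i}: {result.importances_mean[i]:.3f}")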

There are many open-source tools that help data scientists calculate the metrics for evaluating machine learning models and visualize them. AutoML and model evaluation are sciences in their own right; details are beyond the scope of this article.


4-Model Deployment

After the model training and evaluation processes are complete, the best candidate models are saved, usually in Pickle, ONNX or PMML format. Depending on the objectives, data scientists might work on a machine learning problem for a proof of concept, for experimentation, or to deploy it to production. Typically, data scientists will work with engineers on model deployment. Depending on how you intend to consume the predictions, you can deploy for batch consumption or real-time consumption. There are different tools and cloud platform offerings for model deployment, such as Functions-as-a-Service (FaaS) platforms, fully managed deployment of models as HTTP endpoints, or DIY with Flask or Django on a container orchestration platform such as Kubernetes (k8s) or Docker Swarm.
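As one concrete flavor of the DIY route, here is a minimal Flask sketch serving real-time predictions from a pickled model; the file name model.pkl and the request shape are assumptions for illustration.

import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:     # hypothetical model saved during training
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]       # e.g. {"features": [[25, 400]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)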

5-Model: Monitoring, Maintenance And Updates

Model monitoring is a challenging step that is sometimes forgotten by organizations without mature machine learning and data science initiatives. The data that ML models ingest from various data stores and data lakes, which are often created and maintained by other teams, must be constantly monitored for unexpected changes that may affect ML model outputs. Subtle changes in the data often occur silently and can lead to performance degradation.

Model monitoring can be broken down into two components: drift/statistical monitoring of the model performance and ops monitoring.
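On the drift/statistical side, a toy check is to compare a feature's training-time distribution against live data with a two-sample Kolmogorov-Smirnov test; the synthetic distributions below merely simulate drift.

import numpy as np
from scipy.stats import ks_2samp

train_feature = np.random.normal(0.0, 1.0, 1000)   # distribution seen at training time
live_feature = np.random.normal(0.5, 1.0, 1000)    # hypothetical shifted live data

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print("Possible drift detected; investigate the data source or retrain.")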

Ops monitoring of the machine learning system will require partnership between the data scientists and engineering team. Things to monitor include serving latency, memory/CPU usage, throughput and system reliability. Logs and metrics need to be set up for tracking and monitoring. Logs contain records of events, along with the time when they occurred. They can be used to investigate specific incidents and figure out the cause of the incident. Kibana is an open-source tool used for searching and viewing logs. Metrics measure the usage and behavior of the machine learning system. Prometheus and Grafana are tools for monitoring metrics.
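On the ops side, here is a sketch of exporting serving metrics for Prometheus with the prometheus_client library; the metric names and the sleep standing in for inference are illustrative.

import time
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Number of predictions served")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")

def serve_prediction(features):
    with LATENCY.time():          # records how long each prediction takes
        PREDICTIONS.inc()
        time.sleep(0.01)          # stand-in for model.predict(features)

start_http_server(9100)           # Prometheus scrapes http://host:9100/metrics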

Conclusion

It is important to remember that machine learning is a very iterative process, and one that is still evolving. Adopting machine learning isn't simply a question of learning to train a model and being done. One needs to think deeply about how those ML models will fit into your existing systems and processes, and how they help in solving your business needs.

Support our effort by subscribing to our YouTube channel. Stay updated with our latest videos on Data Science.

Looking forward to seeing you soon; till then, keep learning!
