A simple and economical MLOps approach with AWS and GCP!

With advancements in AI and the growth of data, many individuals and businesses have started leveraging Machine Learning (ML), a sub-field of AI, to solve their complex problems. The use cases of ML span many areas: detecting patterns in customer behavior, making data-driven decisions, speech and image recognition, translation, and more.
Generally, an ML model is trained on past data to predict the best outcome its algorithm has learned. These trained models are deployed on the Cloud or locally to serve the business case, i.e. to make data-driven recommendations and decisions based on the given input data. However, as new data comes in, the model's performance may degrade because it has not learned from these latest scenarios. In such cases, the ML model needs to be retrained on the latest data to maintain or improve its performance.
How often an ML model needs retraining varies from scenario to scenario. Some business cases require no retraining as the data never changes, while others require daily or hourly retraining on new data. It quickly becomes difficult to manage frequent training, data preprocessing, performance monitoring, and model deployment by hand. To tackle this, the Machine Learning community came up with an approach called Machine Learning Operations, or MLOps.

The MLOps concept is primarily derived from DevOps, which focuses on Continuous Integration and Continuous Delivery (CI/CD). MLOps mainly differs from DevOps by adding the stages of Continuous Training and Monitoring. This blog will focus on how one can orchestrate MLOps best practices, i.e. Continuous Training, Monitoring, and Inference, using simple and economical services on a Cloud platform.
Many of us have heard of the fancy MLOps tools available in the market: Kubeflow, MLflow, TFX, and a never-ending list of others. Having used TFX with Airflow and Kubeflow, I found it rather restrictive compared to the conventional way of developing a machine learning pipeline. It is a good architecture when building an end-to-end pipeline from scratch. However, not every Machine Learning use case requires such complex and expensive infrastructure to support MLOps.
Most of these services and tools can be configured locally for pipeline creation and experimentation. However, managing them locally quickly becomes complicated and infeasible. That's where Cloud services come into the picture. The Cloud offers scalable services on a pay-per-use model, making it possible to use high-end resources for ML training at low cost. It enables running multiple experiments, easy integration, and deployment of ML models with automated pipelines.
Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft's Azure have proven to be the market leaders in Cloud services because of their robustness, reliability, and wide range of well-resourced offerings. I started my Cloud journey with Google, knowing that they are pioneers in the Machine Learning and AI space. Over the past year, I have been using AWS more at work and realized that the offerings of the two platforms do not differ much; they sometimes use different technologies under the hood, but their services map closely onto each other.
Because of my familiarity with AWS and GCP, this article will focus on a custom MLOps approach using those two platforms. However, the approach can easily be extended to other Cloud platforms.
Scenario
Let's consider the following simple scenarios:
- To support your business case, you need to periodically retrain your ML model as new data becomes available and monitor its performance.
- Or you need to trigger training only when a specific amount of new data has been added, or when the model's performance drops below a threshold.
With these business cases in mind, and assuming the data is available on the respective Cloud platform (AWS or GCP), we will tackle the problem in the following steps.

1. Machine Learning Model
If you are aiming for continuous training, it is most convenient to package the trainer module as a Docker image. Containerizing the ML trainer gives the flexibility to train on any platform and opens the door to distributed training where supported.
This Docker image can be pushed to a Cloud container registry, e.g. AWS ECR or GCP Container Registry, using GitHub Actions or any other CI/CD tool, which takes care of the continuous deployment of the ML trainer module. Having the trainer image in the Cloud container registry opens up the possibility of using various Cloud services to train the model.
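To make this concrete, here is a minimal sketch of what the entrypoint of such a containerized trainer could look like. The argument names, bucket paths, and the scikit-learn model are purely illustrative and will differ per use case.

```python
# train.py: a minimal, hypothetical entrypoint for the containerized trainer.
# Argument names, bucket paths, and the model choice are illustrative only.
import argparse

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def main() -> None:
    parser = argparse.ArgumentParser(description="Containerized ML trainer")
    parser.add_argument("--train-data", required=True,
                        help="e.g. s3://my-bucket/data/train.csv or gs://my-bucket/data/train.csv")
    parser.add_argument("--model-dir", required=True,
                        help="Bucket prefix where the trained model should be exported")
    args = parser.parse_args()

    # pandas can read s3:// or gs:// paths directly if s3fs / gcsfs is installed in the image
    df = pd.read_csv(args.train_data)
    X, y = df.drop(columns=["label"]), df["label"]

    model = RandomForestClassifier(n_estimators=100)
    model.fit(X, y)

    # Export the trained artifact; a real trainer would also write out evaluation metrics here
    joblib.dump(model, "model.joblib")
    # Uploading model.joblib to args.model_dir is left to the job's surrounding tooling


if __name__ == "__main__":
    main()
```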
2. Continuous Training
Once the trainer module is available in the Cloud's container registry and data arrives through regular ingestion into a Storage bucket (like S3 or GCP Cloud Storage) or any Cloud DB, we can move on to the next step. To schedule and perform continuous training, a fully managed Airflow service can be used, e.g. Amazon Managed Workflows for Apache Airflow on AWS or Cloud Composer on GCP.
However, these managed services tend to be expensive. It is more convenient and economical to take the Airflow Docker image and deploy it on a Linux-based instance, such as an EC2 instance on AWS or a Compute Engine VM on GCP, with the required configuration and network settings. The CI/CD of the DAGs can be handled with GitHub Actions or similar tools.
Since training the model on the Airflow machine itself can be inefficient, the training can be offloaded to an independent service, e.g. AWS Batch, AWS SageMaker, or AWS EMR, depending on the amount of data to be processed and whether a Spark job is needed. As AWS Batch provides more custom control, it is preferred here for training the model. Similarly, AI Platform or Vertex AI can be used to train the model on GCP.
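As a rough illustration, submitting the containerized trainer to AWS Batch with boto3 could look like the snippet below; the job queue, job definition, and paths are placeholders for whatever is configured in your account.

```python
# Hypothetical sketch: kicking off the containerized trainer on AWS Batch with boto3.
# The job queue, job definition, and S3 paths are placeholders.
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

response = batch.submit_job(
    jobName="ml-trainer-manual-run",
    jobQueue="ml-training-queue",      # Batch queue backed by a suitable compute environment
    jobDefinition="ml-trainer:3",      # job definition pointing at the trainer image in ECR
    containerOverrides={
        "command": [
            "python", "train.py",
            "--train-data", "s3://my-bucket/data/train.csv",
            "--model-dir", "s3://my-bucket/models/",
        ],
    },
)
print("Submitted Batch job:", response["jobId"])
```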
The Airflow DAGs can be as simple as below.
Here, the DAG is scheduled for daily runs, or at whatever custom frequency is required, and the training of the model happens only if certain conditions are met. Through these conditions, we can cover both business scenarios stated earlier. In the last step, the DAG reaches out to AWS Batch to initiate training using the trainer Docker image from the container registry.
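A minimal sketch of such a DAG could look like this, assuming Airflow 2.x and boto3; the bucket name, the data threshold, and the Batch job queue and definition are hypothetical placeholders.

```python
# Hypothetical sketch of a simple daily retraining DAG (Airflow 2.x).
# Bucket name, threshold, and the Batch job queue/definition are illustrative.
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def enough_new_data(**_) -> bool:
    """Skip the run unless enough new objects have landed under the incoming prefix."""
    s3 = boto3.client("s3")
    objects = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/incoming/")
    return objects.get("KeyCount", 0) >= 100  # illustrative threshold


def submit_training_job(**_) -> None:
    """Hand the actual training over to AWS Batch, away from the Airflow machine."""
    boto3.client("batch").submit_job(
        jobName="scheduled-ml-trainer",
        jobQueue="ml-training-queue",
        jobDefinition="ml-trainer:3",
    )


with DAG(
    dag_id="daily_retraining",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    check_data = ShortCircuitOperator(task_id="check_new_data", python_callable=enough_new_data)
    train = PythonOperator(task_id="submit_batch_training", python_callable=submit_training_job)

    check_data >> train
```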
The Airflow pipeline can be customized as needed and can get as complicated as the example below.
In this example, the Airflow pipeline runs weekly. Based on the amount of newly scraped data added to the data repository, a training job is triggered and submitted to AI Platform on GCP. Since the pipeline is for an NLP task, AI Platform then further fine-tunes an already fine-tuned transformer-based model on the new data. Training is skipped for the runs where the amount of newly added data is not enough to train a new model.
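A rough sketch of this kind of branching logic is shown below; the row-count helper, project, region, and container image URI are hypothetical, and the gcloud CLI is just one possible way to submit the AI Platform fine-tuning job.

```python
# Hypothetical sketch of the weekly branching pipeline (Airflow 2.x).
# The row-count helper, project, region, and image URI are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator


def count_newly_scraped_rows() -> int:
    """Placeholder: in a real pipeline this would query the data repository."""
    return 0


def decide_if_training(**_) -> str:
    # Fine-tune only when enough new data has been scraped since the last run
    return "submit_finetuning" if count_newly_scraped_rows() >= 10_000 else "skip_training"


with DAG(
    dag_id="weekly_finetuning",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    branch = BranchPythonOperator(task_id="check_new_data", python_callable=decide_if_training)

    submit_finetuning = BashOperator(
        task_id="submit_finetuning",
        bash_command=(
            "gcloud ai-platform jobs submit training finetune_{{ ds_nodash }} "
            "--region=europe-west1 "
            "--master-image-uri=eu.gcr.io/my-project/nlp-trainer:latest"
        ),
    )
    skip_training = EmptyOperator(task_id="skip_training")

    branch >> [submit_finetuning, skip_training]
```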
As part of the training step, it is also important to export all the metadata, evaluation metrics, and business metrics you want to track, in JSONL or CSV format, to a structured folder on a Storage bucket, which can later be used for continuous monitoring. Similarly, if the newly trained model performs better than the previously deployed one, it can be exported or deployed in the last step, taking care of the model's continuous deployment cycle.
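For example, the training step could end by writing one JSONL record per run to a date-partitioned prefix on the bucket, along the lines of this sketch; the bucket name, folder layout, and metric values are illustrative only.

```python
# Hypothetical sketch: write one JSONL record per training run to a structured
# prefix on the bucket. Bucket name, layout, and metric values are illustrative;
# on GCP, google-cloud-storage would be the analogous client.
import json
from datetime import datetime, timezone

import boto3

record = {
    "model_name": "nlp-classifier",
    "model_version": "v42",
    "hyperparameters": {"learning_rate": 3e-5, "epochs": 4},
    "metrics": {"f1": 0.91, "accuracy": 0.93},
    "training_data": "s3://my-bucket/data/train.csv",
    "created_at": datetime.now(timezone.utc).isoformat(),
}

run_date = datetime.now(timezone.utc).strftime("%Y/%m/%d")
boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key=f"metrics/{run_date}/run.jsonl",  # date-partitioned layout helps Glue/BigQuery later
    Body=json.dumps(record) + "\n",
)
```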
3. Continuous Monitoring
It is easier to track training performance and experiments when the evaluation metrics and the metadata (model information, hyperparameters used, training data used, timestamp, etc.) are stored on the Storage bucket. These stored metrics can then be accessed and queried by creating an external table using AWS Glue or GCP BigQuery.
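On the GCP side, for instance, a BigQuery external table over the exported JSONL files can be set up with a few lines of the BigQuery Python client; the project, dataset, table, and URIs below are placeholders, and AWS Glue with Athena would be the analogous setup on AWS.

```python
# Hypothetical sketch (GCP side): expose the exported JSONL metrics as a BigQuery
# external table so they can be queried and plugged into a dashboard.
# Project, dataset, table, and URIs are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://my-bucket/metrics/*.jsonl"]
external_config.autodetect = True  # infer the schema from the exported records

table = bigquery.Table("my-project.mlops.training_metrics")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# The metrics can now be queried like any other table, e.g. as a dashboard data source
rows = client.query(
    "SELECT model_version, metrics.f1 AS f1 "
    "FROM `my-project.mlops.training_metrics` "
    "ORDER BY created_at DESC LIMIT 20"
).result()
for row in rows:
    print(row.model_version, row.f1)
```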
This enables the creation of a dashboard in Grafana or a similar tool to track the performance of these models. Through such dashboards, it becomes easier to keep track of all experiments and to continuously monitor data and model drift, with everything related to the model and data on the same page.

4. Documentation
Unfortunately, documentation is one of the most important tasks and also one of the most overlooked. No process or approach is good enough if it is not backed by good documentation. The simple practice of documenting new findings, changes, and summaries of past, current, and future experiments or processes can save a lot of technical debt.
If you think something is worth documenting, Document IT! If you feel something is not worth documenting but can help future employees, Document IT! Period.
I hope this article presented a good high-level overview of how one can architect MLOps by simply combining different Cloud services on AWS and GCP. This approach is not only simple and economical but also provides a lot of control and flexibility over which services are used for each operation.
The solution presented is more transparent and lean, and can be set up much faster than complicated, specialized tools and services like TFX, MLflow, Kubeflow, etc. It might not fit every use case, but anyone can take inspiration from it to develop a custom MLOps approach for their own needs. Please feel free to share ideas and improvements, and ask questions in the comments.