Putting A Machine Learning Model into Production
Suppose we encounter a major problem at our company and take on the challenge of solving it with a fancy machine-learning model. After loads of research and experimentation, we manage to train it, and it’s nailing its predictions. We’re pretty stoked about using this model to make our users’ lives better.
But guess what? Building the model was just the easy part, like the tip of an iceberg. The real grind comes in getting that model into action, into production. This second phase can gobble up as much as 90% of our time and effort on the project.
So, what exactly does this second phase involve, and why does it chew up so much time? To answer that, let’s start by looking at what we have by the end of the first phase: Model Building and Training.
The first phase, which I won’t cover in this article, is usually handled by the Data Science crew. By the end of it, we have the model ready to go: some model code tucked away in Jupyter notebooks, complete with the trained weights.
Typically, this model has been trained using a static snapshot of our dataset, which could be in the form of a CSV or an Excel file. And more often than not, this snapshot is just a subset of our entire dataset. As for where the magic happens, the training takes place either on a developer’s local laptop or maybe on a virtual machine (VM) in the cloud. So, to sum it up, this phase of model development is quite separate from our company’s applications and data pipelines. It’s like working in a little bubble.
THE TERM ‘PRODUCTION’
Alright, let’s break down what “Production” means in the world of machine learning. When a model is put into production, it has two major jobs:
- Real-time Inference: This is where the model performs online predictions on new input data, handling one sample at a time as requests arrive. It’s the instant-response mode.
- Retraining: This is about offline retraining of the model, which usually happens on a nightly or weekly basis. The model refreshes itself with the latest data to stay up-to-date and accurate.
Now, the interesting part is that these two modes have quite different requirements and tasks. So, when our model gets the green light for production, it has to go live in two different environments:
- Serving Environment: This is where the model lives for Real-time Inference. It’s all about making predictions and serving them up when needed.
- Training Environment: This is where the model gets refreshed through retraining, so it remains sharp and accurate.
Typically, when people think about “production,” they often imagine Real-time Inference, where the model is making predictions on the fly. But there are also scenarios where we need Batch Inference:
- Batch Inference: This involves performing predictions offline, often on a whole dataset, nightly or weekly, for use cases that don’t need an immediate, per-request answer.
Now, here’s the kicker. To make each of these modes work, the model has to be seamlessly integrated into our company’s production systems. This includes our business applications, data pipelines, and deployment infrastructure. So, it’s not just about building the model; it’s about making it a functional part of our daily operations.
We’ll dig deeper into Real-time Inference first, and later, we’ll explore the Batch cases (Retraining and Batch Inference). Some challenges we’ll encounter here are unique to machine learning, but many are your classic software engineering hurdles. It’s a bit like solving a puzzle to make everything fit together smoothly.
INFERENCE — APPLICATION INTEGRATION
In the world of machine learning, making a model work in the real world is not a solo act. It’s like having a star player on our basketball team. They need to be in sync with the rest of the team. In this case, our model is like that star player, and it needs to be seamlessly integrated with a business application that’s meant for end-users.
For instance, think of a recommender model for an e-commerce website. This model needs to be part of the entire interaction flow and business logic of the application. It’s like the brains behind the recommendations.
Now, how the application and the model communicate is key. The application might receive end-user input through a user interface (UI) and then pass it over to the model. Alternatively, it could fetch data from an API endpoint or from a real-time data stream. For example, imagine a fraud detection system that approves credit card transactions; it could process transaction data from a Kafka topic.
In a similar way, the results or predictions from the model have to be incorporated back into the application. These predictions might be shown to users in the UI, or the application could use them to make important business decisions.
Creating this communication link between the model and the application is a big part of the job. We might deploy the model as a standalone service that the application can access through an API call. If the application is written in the same programming language as the model (let’s say Python), it can simply make a local function call to the model’s code.
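To make the standalone-service option concrete, here’s a minimal sketch using Flask, assuming a scikit-learn-style model saved with pickle. The file name, feature names, and port are all made up for illustration:

```python
# A minimal sketch of the "standalone service" option, using Flask.
# Assumes a scikit-learn-style model saved with pickle; names are illustrative.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup, not on every request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    # The application and the model must agree on this input format.
    features = [[payload["quantity"], payload["unit_price"]]]
    prediction = model.predict(features)[0]
    return jsonify({"prediction": float(prediction)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

The application then calls this with a plain HTTP POST (for example, a requests.post call to http://model-service:8080/predict from Python, or the equivalent from whatever language the application is written in).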
This part of the work is typically a collaboration between the Application Developer and the Data Scientist. Just like any teamwork in software development, they need to make sure both sides agree on the details, so that the data formats and meanings are consistent on both ends. You know how tricky it can get. For example, if the model expects a ‘quantity’ field to be a positive number, should the application validate it before sending it to the model, or should the model handle that? And then there’s the question of date formats: does what the application sends match what the model expects? It’s like making sure everyone’s speaking the same language.
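One lightweight way to pin down that shared contract is to write it as an explicit schema that both teams can read. Here’s a sketch using pydantic; the field names and validation rules just mirror the examples above and aren’t from any real model:

```python
# A sketch of an explicit input contract, using pydantic.
# Field names and rules mirror the examples above; adjust to the real model.
from datetime import date

from pydantic import BaseModel, Field


class PredictionRequest(BaseModel):
    quantity: int = Field(gt=0)  # the model expects a positive number
    order_date: date             # parsed from ISO 8601, e.g. "2024-01-31"
    customer_id: str


# The application validates before calling the model...
req = PredictionRequest(quantity=3, order_date="2024-01-31", customer_id="c-42")

# ...and a bad payload fails loudly instead of silently skewing predictions:
# PredictionRequest(quantity=-1, order_date="31/01/2024", customer_id="c-42")
# raises a ValidationError.
```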
Application integration is only one piece of the puzzle, though. Putting Inference into production involves several more critical phases, all part of transitioning from model development to real-world application. Let’s explore what else is involved:
DATA INTEGRATION FOR INFERENCE
The model can no longer rely on a static dataset that contains all the features it needs to make its predictions. It needs to fetch ‘live’ data from the organization’s data stores. These features might reside in transactional data sources, such as SQL or NoSQL databases, or they might be in semi-structured or unstructured datasets like log files or text documents. Some features might be fetched by calling an API, either an internal microservice or application or an external third-party endpoint. If any of this data isn’t in the right place or in the right format, some ETL (Extract, Transform, Load) jobs may have to be built to pre-fetch the data to the store that the application will use.
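As a rough illustration, assembling ‘live’ features at inference time might look something like this; the table, columns, and internal API endpoint are invented for the example:

```python
# A sketch of assembling "live" features at inference time.
# The table, columns, and API endpoint are invented for illustration.
import sqlite3

import requests


def fetch_features(customer_id: str) -> dict:
    # 1. Transactional store: recent order history for this customer.
    conn = sqlite3.connect("orders.db")
    row = conn.execute(
        "SELECT COUNT(*), AVG(total) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    conn.close()

    # 2. Internal microservice: the customer's loyalty tier.
    resp = requests.get(
        f"https://loyalty.internal/api/customers/{customer_id}",
        timeout=0.5,  # keep the overall latency budget in mind
    )
    resp.raise_for_status()
    tier = resp.json()["tier"]

    return {"order_count": row[0], "avg_order_total": row[1], "loyalty_tier": tier}
```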
Dealing with all the data integration issues can be a major undertaking, involving considerations like:
- Access requirements: How do we connect to each data source, and what are its security and access control policies?
- Handling errors: What if the request times out, or the system is down? (A simple retry-and-fallback sketch follows this list.)
- Matching latencies: How long does a query to the data source take versus how quickly do we need to respond to the user?
- Handling sensitive data: Is there personally identifiable information that has to be masked or anonymized?
- Decryption: Does data need to be decrypted before the model can use it?
- Internationalization: Can the model handle the necessary character encodings and number/date formats?
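Picking up the error-handling and latency bullets above, here’s a hedged sketch of one common pattern: retry briefly with a tight timeout, then fall back to a sensible default so the prediction can still be served. The endpoint, thresholds, and default value are illustrative only:

```python
# A sketch of handling timeouts and outages when fetching a feature:
# retry briefly, then fall back to a neutral default so the request still
# gets an answer. Endpoint, thresholds, and default are illustrative.
import time

import requests


def get_loyalty_tier(customer_id: str, retries: int = 2) -> str:
    for attempt in range(retries + 1):
        try:
            resp = requests.get(
                f"https://loyalty.internal/api/customers/{customer_id}",
                timeout=0.3,  # stay inside the response-time budget
            )
            resp.raise_for_status()
            return resp.json()["tier"]
        except requests.RequestException:
            if attempt < retries:
                time.sleep(0.1 * (attempt + 1))  # small backoff before retrying
    # The feature source is unreachable: degrade gracefully with a default
    # value rather than failing the whole prediction.
    return "standard"
```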
This tooling for data integration typically gets built by a Data Engineer, who collaborates with the Data Scientist to ensure that the assumptions are consistent and the integration goes smoothly.
DEPLOYMENT
Once our model and data are integrated for real-time inference, it’s time to deploy the model to the production environment. This involves several considerations similar to any software deployment, including:
- Model Hosting: Where will the model be hosted? In a mobile app, in an on-premise data center, or in the cloud? It could even be on an embedded device.
- Model Packaging: What dependent software and ML libraries does the model need? These are typically different from your regular application libraries.
- Co-location: Will the model be co-located with the application or act as an external service?
- Model Configuration settings: How will they be maintained and updated? (A small sketch of one common approach follows this list.)
- System resources required: What kind of hardware resources are needed? This includes CPU, RAM, and disk, and, importantly, whether the model needs a GPU or other specialized hardware.
- Non-functional requirements: What are the expected volume and throughput of request traffic? What is the acceptable response time and latency?
- Auto-Scaling: Is the infrastructure capable of auto-scaling to support changing demands?
- Containerization: Does the model need to be packaged into a Docker container? How will container orchestration and resource scheduling be done?
- Security requirements: How will the credentials and private keys used to access data be stored and managed? Are there any cloud-service integrations, such as AWS S3, that bring their own access control requirements?
- Automated deployment tooling: What tools will be used to provision and configure the infrastructure, deploy it, and install the software? How does this integrate with the organization’s CI/CD pipeline?
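As one small, hedged example of the configuration and containerization points above: a common pattern is to read the model’s settings from environment variables and expose a health-check endpoint for the orchestrator. The variable names and paths here are invented:

```python
# A sketch of one common pattern in containerized deployments: configuration
# comes from environment variables, and a health check is exposed for the
# orchestrator. Variable names and paths are illustrative.
import os
import pickle

from flask import Flask, jsonify

# The same container image can run in staging and production with
# different settings supplied by the environment.
MODEL_PATH = os.environ.get("MODEL_PATH", "/models/model.pkl")
PREDICTION_THRESHOLD = float(os.environ.get("PREDICTION_THRESHOLD", "0.5"))

app = Flask(__name__)

with open(MODEL_PATH, "rb") as f:
    model = pickle.load(f)


@app.route("/health")
def health():
    # The orchestrator (e.g. Kubernetes) polls this endpoint to decide
    # whether the instance is ready to receive traffic.
    return jsonify({"status": "ok", "threshold": PREDICTION_THRESHOLD})
```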
The responsibility for implementing this deployment phase typically falls on the ML Engineer. Once the model is deployed, it can be put in front of the customer, marking a significant milestone.
MONITORING
After deployment, the journey isn’t over. Now comes the MLOps task of monitoring the application to ensure that it continues to perform optimally in production. Monitoring serves several critical purposes, including:
- Checking that the model continues to make correct predictions in production, with live customer data, as it did during development.
- Monitoring standard DevOps application metrics, such as latency, response time, throughput, and system metrics like CPU utilization and RAM.
- Running normal health checks to ensure uptime and stability of the application.
- Continuously comparing current evaluation metrics to past metrics to detect deviations from historical trends, which could occur because of data drift.
Data Validation is an essential part of this phase because, as time goes on, our data will evolve and change, potentially leading to shifts in its distribution. Monitoring and validating the model with current data should be an ongoing activity. It’s crucial to evaluate metrics for different slices and segments of the data to account for changes in customer demographics, preferences, and behavior.
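As a rough sketch of the data-drift piece, one simple approach is to compare each feature’s recent distribution against its distribution in the training data using a statistical test, such as the two-sample Kolmogorov-Smirnov test from SciPy. The threshold, feature names, and alerting hook below are illustrative:

```python
# A sketch of a simple data-drift check: compare each numeric feature's
# recent distribution against its training distribution with a
# Kolmogorov-Smirnov test. Threshold and feature names are illustrative.
from scipy.stats import ks_2samp


def check_drift(training_df, recent_df, features, p_threshold=0.01):
    """Return the features whose recent distribution differs significantly
    from the training distribution."""
    drifted = []
    for feature in features:
        stat, p_value = ks_2samp(training_df[feature], recent_df[feature])
        if p_value < p_threshold:
            drifted.append((feature, stat))
    return drifted


# For example, as part of a nightly monitoring job:
# drifted = check_drift(train_df, last_week_df, ["quantity", "avg_order_total"])
# if drifted:
#     send_alert(drifted)  # send_alert is a hypothetical notification hook
```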
BATCH RETRAINING
Finally, Batch Retraining is another critical aspect of production. Here’s what’s involved (a rough end-to-end sketch follows the list):
1. Data Integration for Retraining: Retraining involves fetching a full dataset of historical data, typically residing in an organization’s analytics stores, such as data warehouses or data lakes. Data may have to be transferred into the warehouse in the required format if it’s not already present.
2. Application Integration for Retraining: No application integration work is typically needed for retraining, as you’re retraining the model in isolation.
3. Deployment for Retraining: Retraining often involves a massive amount of data, potentially far larger than what was used during development. The hardware infrastructure needed to train the model must be determined, considering GPU and RAM requirements. Training may have to be distributed across many nodes in a cluster to complete in a reasonable time, with a Resource Scheduler managing each node. This setup ensures that hardware resources are allocated efficiently to each training process and that large data volumes can be moved around efficiently.
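Here’s a rough sketch of what a nightly retraining job might look like on a small scale, assuming a scikit-learn model and data exported from the warehouse as a Parquet file. The columns, paths, and model choice are invented, and a real job would more likely run on the distributed training cluster described above:

```python
# A sketch of a nightly retraining job: pull historical data, retrain the
# model in isolation, and save a versioned artifact. Columns, paths, and
# the model choice are invented for illustration.
from datetime import date
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# 1. Data integration: load the historical dataset exported from the
#    warehouse (in practice this might be a query against BigQuery,
#    Snowflake, or a data lake, far too large for one machine).
df = pd.read_parquet("orders_history.parquet")

X = df[["quantity", "avg_order_total", "loyalty_tier_encoded"]]
y = df["churned"]

# 2. Retrain the model in isolation from the serving application.
model = RandomForestClassifier(n_estimators=200)
model.fit(X, y)

# 3. Write a versioned artifact that the serving environment can pick up.
artifact_name = f"model_{date.today().isoformat()}.pkl"
with open(artifact_name, "wb") as f:
    pickle.dump(model, f)
```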
Lastly, Batch Inference, where pre-computed results for a large set of data samples are generated and cached, is another scenario that may be part of the production environment. Many of the same application and data integration challenges seen in Real-time Inference apply here, but Batch Inference doesn’t have the same real-time response time requirements. Instead, it focuses on high throughput to handle extensive data volumes. This setup can be part of a workflow involving a network of applications, each executed after its dependencies have been completed.
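To round things off, here’s a hedged sketch of a batch-inference job that trades latency for throughput by scoring the data in large chunks and caching the results. The file names, columns, and chunk size are illustrative:

```python
# A sketch of a batch-inference job: score a large dataset in chunks and
# write the pre-computed predictions somewhere the application can read
# them. File names, columns, and chunk size are illustrative.
import pickle

import pandas as pd

with open("model.pkl", "rb") as f:
    model = pickle.load(f)

results = []
# Throughput over latency: score the data in large chunks.
for chunk in pd.read_csv("all_customers.csv", chunksize=100_000):
    chunk["score"] = model.predict(chunk[["quantity", "avg_order_total"]])
    results.append(chunk[["customer_id", "score"]])

# Cache the pre-computed predictions where the application can read them
# (here a CSV; in practice more likely a database table or key-value store).
pd.concat(results).to_csv("precomputed_scores.csv", index=False)
```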