An efficient AutoML ~Magical TPOT lamp in SageMaker Container~
Introduction
MLOps has been on an upward trend for years, and companies are struggling to establish pipelines and automation for the ML lifecycle. The focal points seem to be model generation, inference deployment and orchestration: ML engineers in particular need to automate these elements efficiently and build valuable products with machine learning and deep learning algorithms, not just deploy good models. [1]
MLOps applies to the entire lifecycle — from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.
A common MLOps architecture includes data science platforms where models are constructed and analytical engines where computations are performed, with the MLOps tool orchestrating the movement of machine learning models, data and outcomes between the systems. If we build a CI/CD system for MLOps, we may need to intervene in these platforms or tools to control pipelines and workflows, and managed services such as Amazon SageMaker can help with that. MLOps mainly covers these areas:
- Deployment and automation
- Reproducibility of models and predictions
- Diagnostics
- Governance and regulatory compliance
- Scalability
- Collaboration
- Business uses
Amazon SageMaker is an AWS-based machine learning platform that enables developers to build and train machine learning models and deploy inference endpoints on the public cloud. It consists of various services, such as Ground Truth for building and managing training data sets, SageMaker Notebooks, which are one-click notebooks backed by EC2 (Elastic Compute Cloud), and SageMaker Studio, an integrated development environment (IDE) for machine learning. SageMaker leverages EC2 computing resources for training a machine learning model and running a deployed inference service. Even without SageMaker Notebooks, there are bindings for a number of languages, including Ruby, Python, Java and Node.js, to control a set of workflows from code. [2]
This article also helps a lot in understanding ML workflows in detail.
Amazon SageMaker is a fabulous tool for two of the aforementioned areas that MLOps especially needs: deployment and automation, and reproducibility of models and predictions. MLOps requires extensive work and many interconnected services (for example, imagine a separate microservice for each workflow in that illustration), on either an on-prem or a cloud-based architecture, to automate building, training and deploying a machine learning model.
AutoML in CI/CD
Progressive and profound algorithms and models that humans are still investigating will continue to need manual work: someone has to invent the algorithm and code it in Python or R. For relatively simple or well-known models, however, no human intervention should be needed in an MLOps CI/CD pipeline to write a Jupyter Notebook, train on a dataset and deploy an inference service. AutoML (Automated Machine Learning) automates the machine learning process across the complete pipeline, from the raw dataset to the deployable machine learning model, and was proposed as an artificial intelligence-based solution to the ever-growing challenge of applying machine learning. Here we use the term “AutoML” for a library or tool that automatically applies appropriate methods to the raw data, such as data pre-processing, feature engineering, feature extraction and feature selection, and performs algorithm selection and hyperparameter optimization to maximize the predictive performance of the resulting model. [3]
You might have already noticed that AutoML can handle most of the MLOps workflows in the diagram from the preceding section, except “Fetch”, “Deploy” and “Monitor/Evaluate”. “Deploy” and “Monitor/Evaluate” can be covered by a managed service such as Amazon SageMaker. It therefore seems realistic to establish the MLOps lifecycle as pipelines and automate them with AutoML and Amazon SageMaker. There are several well-known AutoML libraries and tools. The open-source libraries auto-sklearn and TPOT are free and easy to use if you know scikit-learn in Python, while commercial services include Google Cloud AutoML and DataRobot. auto-sklearn, presented at NIPS 2015, is said to leverage Bayesian optimization, meta-learning and ensemble construction. It is a drop-in replacement for a scikit-learn estimator and feels the same as scikit-learn.
TPOT is a Python-based AutoML library built on top of scikit-learn that uses genetic programming to find the best-performing ML pipelines. TPOT mainly automates feature selection, model selection and parameter optimization, including data transformations such as dimensionality reduction or rescaling. It does not cover the “wrangling” part (data cleaning/data cleansing) or model validation. The following diagram illustrates what TPOT can automate. [4]
If we could eliminate the most tedious part of machine learning with AutoML tools such as TPOT, our CI/CD process could run more seamlessly and effortlessly than ever. Most MLOps workflows now seem achievable with AutoML for known algorithms. So let’s say TPOT (AutoML) plus SageMaker (CI/CD) is a good starting point for automating the two aforementioned areas of the MLOps lifecycle: deployment and automation, and reproducibility of models and predictions.
That’s why I previously introduced a container with scikit-learn and Optuna to automate hyperparameter optimization, one of the automatable processes in machine learning, within limited computing resources and time (Optuna has an option to set the optimization time in seconds). The accompanying SageMaker Notebook pulls the Boston Housing data, trains a model with the scikit-learn gradient-boosting regressor and deploys an inference service for predictions. [5]
https://github.com/yuyasugano/sagemaker-optuna-container
https://hub.docker.com/r/suganoyuya/optuna-sklearn-container
Let’s now pursue more sophisticated automation of the machine learning process with TPOT, one of the AutoML libraries, covering model selection and parameter optimization in one place in a SageMaker container, and deploy an inference service with an Amazon SageMaker Notebook for maximized predictive performance.
TPOT Container
I modified the official example “Bring-your-own Algorithm Sample” to package my own algorithm in a Docker container and use it from an ECR repository. A training job takes the ECR path where the container image is stored, pulls the image and runs the train script in the container to train on the dataset. The training job includes the following information, as explained in this article. [6]
- The URL of the Amazon Simple Storage Service (Amazon S3) bucket where you’ve stored the training data.
- The compute resources that you want Amazon SageMaker to use for model training. Compute resources are ML compute instances that are managed by Amazon SageMaker.
- The URL of the S3 bucket where you want to store the output of the job.
- The Amazon Elastic Container Registry path where the training code is stored.
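The four items above correspond to fields of the request accepted by SageMaker's CreateTrainingJob API. As a rough sketch, the snippet below only assembles such a request as a plain dictionary. The helper name build_training_job_request, the default instance type, the volume size and the runtime limit are placeholders chosen for illustration and are not taken from the sample.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri,
                               instance_type="ml.m4.xlarge"):
    """Assemble a CreateTrainingJob-style request as a plain dict (illustrative helper)."""
    return {
        "TrainingJobName": job_name,
        # The ECR path where the training code (container image) is stored
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # The S3 location of the training data, exposed as the "training" channel
        "InputDataConfig": [{
            "ChannelName": "training",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # The S3 location where the output of the job is stored
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The compute resources used for training (placeholder sizes)
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }
```

A dictionary like this could be passed as keyword arguments to the create_training_job call of the boto3 SageMaker client. The SageMaker Python SDK used later in this article builds an equivalent request for you.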
Amazon SageMaker provides several built-in algorithms and Apache Spark support, and accepts custom Python code that uses TensorFlow or Apache MXNet for model training. We use our own container that supports both execution modes in one image: training, where the algorithm uses input data to train a new model, and serving, where the algorithm accepts HTTP requests and uses the previously trained model to do an inference (also called “scoring”, “prediction”, or “transformation”). The train and serve scripts are invoked when the container is called for training on a dataset or for serving an inference. [7]
Here’s a Dockerfile for the Amazon SageMaker container. The required libraries such as numpy, scipy, scikit-learn and tpot are listed in requirements.txt so that they are installed appropriately. If an error like the following occurs due to incompatible library versions, pin the library versions in the requirements.txt file.
ERROR: tpot 0.11.1 has requirement numpy>=1.16.3, but you'll have numpy 1.16.2 which is incompatible.
ERROR: tpot 0.11.1 has requirement scikit-learn>=0.22.0, but you'll have scikit-learn 0.21.2 which is incompatible.
ERROR: tpot 0.11.1 has requirement scipy>=1.3.1, but you'll have scipy 1.2.1 which is incompatible.
Example requirements.txt
numpy==1.16.3
scipy==1.3.1
scikit-learn==0.22.0
pandas==0.25.3
tpot
flask
gunicorn
gevent
All required application code is packed in the tpot directory and copied into the /opt/program path in the container, where Amazon SageMaker expects the application code to be stored.
I needed to modify the train file to read the iris flower dataset as a CSV file and to read some parameters of the TPOT object as JSON from the /opt/ml/input/config/hyperparameters.json file. The JSON file is originally intended to provide hyperparameters to algorithms. “generations”, “populations” and “cv” are the controllable parameters for the TPOT object, and it is important to set these values appropriately for your purpose. We can now pass these three as hyperparameters when we call a training job. Because TPOT performs an intelligent search over machine learning pipelines and goes through a broad range of supervised models, transformers and their hyperparameters, it may take a long time if you give a large number for “generations”, the number of iterations of the optimization process.
According to the TPOT API documentation, a run time limit can be given with the max_time_mins option when “generations” is set to None. But again, TPOT explores an extensive set of supervised models, preprocessors, feature selection techniques, and any other estimator or transformer that follows the scikit-learn API. So I personally recommend finding an appropriate number of generations instead of relying on the run time limit. [8]
# Read the TPOT settings from the hyperparameters, falling back to defaults
generations = params.get('generations', None)
if generations is not None:
    generations = int(generations)
else:
    generations = 100 # default generations value
populations = params.get('populations', None)
if populations is not None:
    populations = int(populations)
else:
    populations = 100 # default population value
cv = params.get('cv', None)
if cv is not None:
    cv = int(cv)
else:
    cv = 5 # default cv value
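A training job sends hyperparameter values as strings, while the local test file may hold plain numbers, which is why each value above is cast with int(). The same logic could be written more compactly as in the sketch below. This is not the code in the container: the helper names parse_tpot_params and load_tpot_params and the DEFAULTS table are my own. The keyword names generations, population_size and cv are the ones that appear in the TPOTClassifier representation printed in the training log.

```python
import json

# Defaults mirror the values used in the train script above.
DEFAULTS = {"generations": 100, "populations": 100, "cv": 5}

def parse_tpot_params(raw):
    """Turn the raw hyperparameter dict into keyword arguments for a TPOT estimator."""
    # int() accepts both "10" (string sent by a training job) and 10 (local test file)
    values = {key: int(raw.get(key, default)) for key, default in DEFAULTS.items()}
    return {
        "generations": values["generations"],
        "population_size": values["populations"],  # JSON key -> TPOT keyword
        "cv": values["cv"],
    }

def load_tpot_params(path):
    """Read a hyperparameters.json file and parse it (illustrative helper)."""
    with open(path) as f:
        return parse_tpot_params(json.load(f))
```

The resulting dictionary could then be unpacked into the estimator, for example TPOTClassifier(**load_tpot_params(path)), with path pointing at the hyperparameters file inside the container.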
Unfortunately a TPOT object isn’t picklable as of now, so the optimized pipeline can’t be persisted the way we usually save and load a model with pickle, or with joblib as an alternative. I hope this will be overcome in a future release. What TPOT supports now is exporting the corresponding Python code for the optimized pipeline to a text file with the export function.
file_name = 'pipeline.py'
pipeline_optimizer.export(file_name)
The exported Python code needs some modification to run inside the container. It was tedious and did not seem like the best approach, but I defined a function that uses regular expressions to replace the necessary lines in the exported code, so that it imports scikit-learn and pickle, trains on the dataset and saves a model in the model_path directory /opt/ml/model as expected. Here’s the train code.
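As a rough illustration of the regex-replacement idea only, and not the function from the train script, the sketch below patches a one-line excerpt of exported code. It assumes the exported file contains the placeholders 'PATH/TO/DATA/FILE' and 'COLUMN_SEPARATOR' and names the fitted estimator exported_pipeline. The helper name patch_exported_pipeline and the file name model.pkl are hypothetical.

```python
import re

def patch_exported_pipeline(source, data_path, model_file, sep=","):
    """Fill in the data placeholders and append a pickle step (illustrative helper)."""
    # Function replacements insert the values literally, without regex escape handling
    patched = re.sub(r"'PATH/TO/DATA/FILE'", lambda m: repr(data_path), source)
    patched = re.sub(r"'COLUMN_SEPARATOR'", lambda m: repr(sep), patched)
    # Persist the fitted scikit-learn pipeline, since the TPOT object itself is not pickled
    patched += (
        "\nimport pickle\n"
        f"with open({model_file!r}, 'wb') as out:\n"
        "    pickle.dump(exported_pipeline, out)\n"
    )
    return patched

# Assumed excerpt of an exported pipeline file
sample = ("tpot_data = pd.read_csv('PATH/TO/DATA/FILE', "
          "sep='COLUMN_SEPARATOR', dtype=np.float64)\n")
print(patch_exported_pipeline(sample,
                              "/opt/ml/input/data/training/iris.csv",
                              "/opt/ml/model/model.pkl"))
```

Patching generated source this way is brittle, because it depends on the exact text TPOT writes, so it is worth checking the exported file of your TPOT version before relying on any pattern.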
Now you can build a container with an arbitrary name and tag. I used sklearn-tpot-container as below and pushed it to Docker Hub with the same image name. [9]
$ docker build -t sklearn-tpot-container .
$ docker run --rm -it sklearn-tpot-container /bin/bash
root@90edf781e616:/opt/program# python
Python 3.7.5 (default, Oct 19 2019, 00:03:48)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tpot
>>> print(tpot.__version__)
0.11.1
Local test
There is a local_test directory containing scripts and a setup for running simple training and inference jobs locally, so that you can test that everything is set up correctly. payload.csv must be modified to hold samples of your dataset, but the rest of the shell scripts work as they are. Here I picked five rows at random from each label of the iris flower dataset, all taken from the same data used for training, so the total number of samples was fifteen. [9]
- train_local.sh: Instantiate the container configured for training.
- serve_local.sh: Instantiate the container configured for serving.
- predict.sh: Run predictions against a locally instantiated server.
- test_dir: The directory that gets mounted into the container with test data mounted in all the places that match the container schema.
- payload.csv: Sample data used by predict.sh for testing the server.
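A payload of five random rows per label could be produced with a few lines of standard-library Python. This is a sketch under assumptions: the helper name sample_per_label is hypothetical, the position of the label column is passed in by the caller, and the label column is dropped because the payload sent to the server holds features only.

```python
import random
from collections import defaultdict

def sample_per_label(rows, label_index, per_label=5, seed=0):
    """Pick `per_label` random rows for each label and drop the label column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_index]].append(row)
    rng = random.Random(seed)  # fixed seed so the same payload can be regenerated
    payload = []
    for label in sorted(groups):
        for row in rng.sample(groups[label], per_label):
            payload.append([v for i, v in enumerate(row) if i != label_index])
    return payload
```

The returned rows could be written out with csv.writer to create payload.csv. With three labels and the default of five rows each, the payload holds fifteen rows.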
The tree under test_dir is mounted into the container and mimics the directory structure that SageMaker would create for the running container during training or hosting.
- input/config/hyperparameters.json: The TPOT parameters for the training job.
- input/data/training/iris.csv: The iris flower dataset for training.
- model: The directory where the algorithm writes the model file.
- output: The directory where the algorithm can write its success or failure file.
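If you want to recreate this layout from scratch, for example for a different dataset, a short script is enough. The sketch below builds the tree in a temporary directory. The helper name make_test_dir is hypothetical, and the training CSV still has to be copied into input/data/training by hand.

```python
import json
import tempfile
from pathlib import Path

def make_test_dir(root, hyperparameters):
    """Create the directory tree mounted at /opt/ml and write the hyperparameters file."""
    root = Path(root)
    for sub in ("input/config", "input/data/training", "model", "output"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    config = root / "input/config/hyperparameters.json"
    config.write_text(json.dumps(hyperparameters))
    return config

# Build the layout in a throwaway directory and show the generated config
with tempfile.TemporaryDirectory() as tmp:
    config = make_test_dir(Path(tmp) / "test_dir",
                           {"generations": 10, "populations": 10, "cv": 5})
    print(config.read_text())
```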
./train_local.sh sklearn-tpot-container calls the train function inside the container and creates a model. I ran the training for the iris flower dataset with the following content in the /opt/ml/input/config/hyperparameters.json file.
{"generations": 10, "populations": 10, "cv": 5}
You can change the parameters both when you run it locally and when you call a training job on Amazon SageMaker. As a result, the best pipeline was a RandomForestClassifier with a best internal CV score of 0.9666666666666668. The score of the same classifier on the held-out test split was 0.9333333333333333. That score is not particularly good for the iris flower dataset. You may have to run for more hours to reach the truly best pipeline, because TPOT’s optimization algorithm is stochastic in nature, which means that it uses randomness to search the space of possible pipelines. Alternatively, we could run it several times, or run multiple TPOT objects, for shorter periods to search for the best pipeline.
$ ./train_local.sh sklearn-tpot-container
Starting the training.
X shape: (150,4)
y shape: (150,1)
Warning: xgboost.XGBClassifier is not available and will not be used by TPOT.
Generation 1 - Current best internal CV score: 0.9666666666666668
Optimization Progress: 27%|??? | 30/110 [00:20<01:00, 1.33pipeline/s]
...
Generation 10 - Current best internal CV score: 0.9666666666666668
Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.2, min_samples_leaf=8, min_samples_split=4, n_estimators=100)
TPOT Accuracy score: 0.9333333333333333
TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
disable_update_check=False, early_stop=None, generations=10,
max_eval_time_mins=5, max_time_mins=None, memory=None,
mutation_rate=0.9, n_jobs=1, offspring_size=None,
periodic_checkpoint_folder=None, population_size=10,
random_state=42, scoring=None, subsample=1.0, template=None,
use_dask=False, verbosity=2, warm_start=False)
Training completed.
./serve_local.sh sklearn-tpot-container calls the serve function in the container locally. This is exactly the same as the docker run -v $(pwd)/test_dir:/opt/ml -p 8080:8080 --rm sklearn-tpot-container serve command. It runs a Flask web application that listens for requests on the /ping and /invocations endpoints for the inference service.
$ ./serve_local.sh sklearn-tpot-container
Starting the inference server with 2 workers.
[2020-01-09 10:41:50 +0000] [9] [INFO] Starting gunicorn 20.0.4
[2020-01-09 10:41:50 +0000] [9] [INFO] Listening at: unix:/tmp/gunicorn.sock (9)
[2020-01-09 10:41:50 +0000] [9] [INFO] Using worker: gevent
[2020-01-09 10:41:50 +0000] [13] [INFO] Booting worker with pid: 13
[2020-01-09 10:41:50 +0000] [14] [INFO] Booting worker with pid: 14
...
# after sending the 15 records
Invoked with 15 records
172.17.0.1 - - [09/Jan/2020:10:43:56 +0000] "POST /invocations HTTP/1.1" 200 60 "-" "curl/7.47.0"
./predict.sh ./payload.csv invokes curl in the script and requests predictions with a POST to /invocations on the local inference server. The returned result was perfect!
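The same request can also be made from Python instead of curl, which is handy when you want to post-process the predictions. The following is a minimal sketch using only the standard library. The function names build_invocation_request and predict are hypothetical, and the URL assumes the local server started by serve_local.sh on port 8080.

```python
import urllib.request

LOCAL_URL = "http://localhost:8080/invocations"

def build_invocation_request(body, url=LOCAL_URL):
    """Build a POST request that carries CSV rows as bytes."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "text/csv"},
    )

def predict(csv_path, url=LOCAL_URL):
    """Send the file content to the server and return one prediction string per row."""
    with open(csv_path, "rb") as f:
        request = build_invocation_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").split()
```

Calling predict("payload.csv") while the local server is running would return the predicted labels as a list of strings, one per row of the payload.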
$ ./predict.sh ./payload.csv
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 8080 (#0)
> POST /invocations HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: text/csv
> Content-Length: 237
>
* upload completely sent off: 237 out of 237 bytes
< HTTP/1.1 200 OK
< Server: nginx/1.14.2
< Date: Thu, 09 Jan 2020 10:43:56 GMT
< Content-Type: text/csv; charset=utf-8
< Content-Length: 60
< Connection: keep-alive
<
0.0
0.0
0.0
0.0
0.0
1.0
1.0
1.0
1.0
1.0
2.0
2.0
2.0
2.0
2.0
Next we will test that this container works on a SageMaker Notebook with ECR (Elastic Container Registry) and the SageMaker Python SDK.
Push to ECR(Elastic Container Registry)
A script named build_and_push.sh has been prepared; given the name of the built image as its argument, it creates a repository and pushes the image to it. Or you can tag and push the image manually as you usually do:
$ aws ecr get-login --no-include-email
$ docker login -u AWS -p <password> https://<account>.dkr.ecr.<region>.amazonaws.com
Login Succeeded
Create a repository with an arbitrary name.
$ aws ecr create-repository --repository-name "sklearn-tpot-container"
{
"repository": {
"repositoryUri": "<account>.dkr.ecr.<region>.amazonaws.com/sklearn-tpot-container",
"registryId": "<registry id>",
"imageTagMutability": "MUTABLE",
"repositoryArn": "arn:aws:ecr:<region>:<account>:repository/sklearn-tpot-container",
"repositoryName": "sklearn-tpot-container",
"createdAt": 1578568683.0
}
}
# Describe the existing repository
$ aws ecr describe-repositories
Now you can push the image to the created ECR repository.
$ docker tag sklearn-tpot-container <account>.dkr.ecr.<region>.amazonaws.com/sklearn-tpot-container
$ docker push <account>.dkr.ecr.<region>.amazonaws.com/sklearn-tpot-container
# If deletion is needed, run this
$ aws ecr delete-repository --repository-name sklearn-tpot-container --force
As you recall, this image contains both execution modes: training, where the algorithm uses input data to train a new model, and serving, where the algorithm accepts HTTP requests and uses the previously trained model to do an inference for predictions. We will now use the ECR path on a SageMaker Notebook for testing.
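The ECR path follows the repositoryUri pattern shown in the create-repository output, optionally followed by a tag. A small helper can build it from its parts so that the notebook does not hard-code the string. The function name ecr_image_uri and the account and region values in the example are made up for illustration. The default tag latest is what Docker uses when no tag is given.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Compose the ECR image path from account, region and repository name."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# Example with a made-up account id and region
print(ecr_image_uri("123456789012", "us-east-1", "sklearn-tpot-container"))
```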
SageMaker Notebook
The SageMaker Python SDK provides several high-level abstractions for working with Amazon SageMaker, and it’s easy to use if you’re familiar with Python. Amazon SageMaker launches the compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts in the specified location in an S3 bucket. We use an Estimator, which runs SageMaker-compatible custom Docker containers, from a SageMaker Notebook. [10]
Training a model is straightforward: just call fit on the sagemaker.estimator.Estimator instance. The built model is saved in the specified output_path once the training job is completed.
It went well and the model was saved in the output_path. We’re ready to deploy an inference service with the saved model.
from sagemaker.predictor import csv_serializer
predictor = clf.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=csv_serializer)
Predictions were as accurate as in the local test we conducted. Please do not forget to delete the endpoint when you no longer need it.
Here are the SageMaker Notebook and GitHub link for this sample.
In summary, we’ve discussed MLOps and considered integrating AutoML, with the magical TPOT lamp and a SageMaker Notebook, to pipeline the workflows of model selection and parameter optimization.
Reference
- [1] Wikipedia — MLOps
- [2] Machine Learning with Amazon SageMaker
- [3] Wikipedia — Automated Machine Learning
- [4] TPOT — Introduction
- [5] yuyasugano/sagemaker-optuna-container
- [6] Train a Model with Amazon SageMaker
- [7] amazon-sagemaker-examples/Bring-your-own Algorithm Sample
- [8] TPOT — TPOT API
- [9] suganoyuya/sklearn-tpot-container
- [10] Using the SageMaker Python SDK