Machine Learning Infrastructure ~Terraforming SageMaker Part 1~

Yuya Sugano
10 min read · Jan 16, 2020


Introduction

MLOps has been on an upward trend for years, and companies are confronting the challenge of establishing pipelines and automation for the ML lifecycle. The focal points are model generation, inference deployment, and orchestration: ML engineers in particular need to automate these elements efficiently and create valuable products with machine learning and deep learning algorithms, rather than merely deploying good models. [1]

MLOps applies to the entire lifecycle — from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics.

AWS Architecture with Terraform

A common architecture for an MLOps system would include data science platforms where models are constructed and the analytical engines where computations are performed, with the MLOps tool orchestrating the movement of machine learning models, data, and outcomes between the systems. In the MLOps context, we may need some way to intervene in the platforms or tools to control pipelines or workflows when building a CI/CD system, and managed services such as Amazon SageMaker can help here. MLOps mainly covers these areas:

  • Deployment and automation
  • Reproducibility of models and predictions
  • Diagnostics
  • Governance and regulatory compliance
  • Scalability
  • Collaboration
  • Business uses

Amazon SageMaker is an AWS-based machine learning platform that enables developers to build machine learning models, train them on data, and deploy an inference endpoint on the public cloud. It consists of various services, such as Ground Truth for building and managing training datasets, SageMaker Notebooks, which are one-click notebooks running on EC2 (Elastic Compute Cloud), SageMaker Studio, an integrated development environment (IDE) for machine learning, and so on. SageMaker leverages EC2 computing resources for training a machine learning model and running a deployed inference endpoint. Even without SageMaker Notebooks, there are SDK bindings for a number of languages, including Ruby, Python, Java, and Node.js, to control a set of workflows from code. [2]

https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-mlconcepts.html

The page linked above also helps a lot in understanding ML workflows in detail.

Amazon SageMaker is a fabulous tool for covering two of the aforementioned MLOps areas in particular: deployment and automation, and reproducibility of models and predictions. MLOps requires extensive work and inter-connection of services (for example, think of separate microservices providing each workflow in that illustration), in either an on-premises or cloud-based architecture, to automate building, training, and deploying a machine learning model.

However, I think we must keep the infrastructure for the MLOps lifecycle consistent and manageable so that we can operate it adequately, especially as the infrastructure grows and expands. This is the motivation for adopting a tool like Terraform to automate provisioning and manage the infrastructure lifecycle with code. [3]

Terraform is an open-source infrastructure as code software tool created by HashiCorp. It enables users to define and provision a datacenter infrastructure using a high-level configuration language known as Hashicorp Configuration Language (HCL).

Terraform is a powerful tool for delivering infrastructure as code in HCL (HashiCorp Configuration Language), which is interoperable with JSON (JavaScript Object Notation). It has some distinctive characteristics as an infrastructure-as-code tool, listed below (a minimal HCL sketch follows the list).

  • Automated infrastructure provisioning
  • Declarative configuration files
  • Consistent and repeatable workflows
  • Reproducible and reusable infrastructure
  • Infrastructure versioning with shared code
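As a quick illustration of the declarative style, here is a minimal, hypothetical HCL sketch that provisions a single S3 bucket; the provider region and bucket name are assumptions for illustration only:

# Declarative: we describe the desired state; Terraform works out the API calls
provider "aws" {
  region = "ap-northeast-1" # assumed region
}

resource "aws_s3_bucket" "example" {
  bucket = "my-terraform-example-bucket" # hypothetical bucket name
  acl    = "private"
}

Running terraform apply against this file creates the bucket; running it again changes nothing, because the desired state is already met.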

Terraform has great coverage, with providers that are responsible for understanding API interactions and exposing resources for more than 200 cloud services, IaaS platforms, and well-known products: not only AWS/GCP/Azure/OpenStack but also datacenter technologies such as Cisco, VMware, and Nutanix, as shown in this link. [4]

In this first part (part 1) and the coming part (part 2), my aim is to automate the two aforementioned areas of the MLOps lifecycle with Terraform code, eliminating manual work that might cause operational mistakes and repeated re-creation of components. Once the code is written, it is reusable for deploying similar infrastructure and saves time, because we can simply adjust parameters and variables in the Terraform HCL files for different environments. This article discusses how to initialize all required AWS components, such as the IAM role/policy, S3 buckets, and the SageMaker instance, to run a training job and deploy an inference service with the Terraform AWS provider. The next part (part 2) will focus on bringing some degree of automation with CloudWatch and Lambda to trigger reproducibility of models, which can be considered part of an MLOps pipeline, again with the Terraform AWS provider.

Overview

The following diagram illustrates the overall AWS architecture with Terraform automation. There are several AWS components for the tool to create and deploy with the appropriate code: IAM (Identity and Access Management), a SageMaker Notebook instance, and S3 buckets for storing the dataset and the built model. Let's assume that data scientists have already developed a Jupyter notebook that can be used in SageMaker, so we can ask them to upload it to the appropriate directory for this integration.

AWS Architecture with Terraform

As seen above, the notebook instance requires some supporting resources, including the following (a minimal HCL sketch of these resources follows the list):

  • IAM Role and Policy for the SageMaker Notebook instance

The notebook needs an extensive set of privileges in its policy to access S3, SageMaker, CloudWatch, EC2, and so on, as required.

  • S3 bucket for storing the SageMaker Notebook and dataset/built model

Here there are two separate S3 buckets to store the notebook and the dataset/built model, but this could be a single bucket if you don't mind.

  • SageMaker Notebook instance

A SageMaker Notebook instance should be launched with the IAM role created for it. The instance must obtain the provided notebook from the S3 notebook bucket above.
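Before looking at the full modules, here is a minimal, hedged HCL sketch of these three supporting resources in one place; the resource names, managed policy choice, and instance type are assumptions for illustration, not the repository's exact code:

# IAM role that the SageMaker notebook instance assumes
resource "aws_iam_role" "sagemaker_role" {
  name = "sagemaker-notebook-role" # hypothetical name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "sagemaker.amazonaws.com" }
    }]
  })
}

# Broad managed policy for simplicity; scope this down in production
resource "aws_iam_role_policy_attachment" "sagemaker_full_access" {
  role       = aws_iam_role.sagemaker_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

# Bucket for the dataset and the built model
resource "aws_s3_bucket" "sagemaker_bucket" {
  bucket = "sagemaker-bucket-sample-test" # must match the bucket used in the notebook
}

# The notebook instance itself, launched with the role above
resource "aws_sagemaker_notebook_instance" "notebook" {
  name          = "sagemaker-terraform-test" # hypothetical name
  role_arn      = aws_iam_role.sagemaker_role.arn
  instance_type = "ml.t2.medium" # assumed instance type
}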

Let's take a rudimentary dataset and a simple scikit-learn estimator for the example notebook. The goal of this architecture is to predict a housing price with a model built from the Boston housing price dataset. I will use the gradient boosting regressor algorithm (with the scikit-learn container image provided in ECR) for training. The Jupyter notebook file is prepared in SageMaker in advance; it's located at source/notebooks/sagemaker-terraform-boston.ipynb in the repository and is referenced in an aws_s3_bucket_object resource block so that it is uploaded to the S3 bucket properly.

The main purpose of this article is not how to write a Jupyter notebook, of course. But let me outline briefly what we're going to do (again, my assumption is that data scientists have already developed the Jupyter notebook for us). The Boston housing dataset is a well-known dataset consisting of 506 rows and 14 columns. The last column, "medv", is the target whose value we need to predict (the median value of owner-occupied homes in $1000s).

Boston housing dataset

We convert the DataFrame into CSV format and upload the CSV file to an S3 location. I specified the S3 bucket location bucket='sagemaker-bucket-sample-test' for the test in this snippet. Please note that this bucket must match the bucket for storing the dataset and model created by the Terraform automation. We will configure input variables in terraform.tfvars for main.tf, so you may need to modify the bucket and prefix parameters in the Jupyter notebook at the same time as you decide the variable sagemaker_bucket_name in the terraform.tfvars file.

Dataset upload to specific S3 location

It's straightforward now to use estimator.SKLearn and call scikit_learn_script.py in the same directory to generate a model. This script can also be uploaded to S3 by Terraform; it's located at source/scripts/scikit_learn_script.py in our repository.

We can deploy an inference service with sagemaker.predictor, using specific input/output formats. Please be careful: predictor.delete_endpoint() is not called here, so you will need to delete the endpoint yourself manually.

Deploy a predictor for the model

Complete code: https://github.com/yuyasugano/terraform-sagemaker-sample-1/blob/master/source/notebooks/Scikit-learn_Estimator_Example_With_Terraform.ipynb

Let's start terraforming

It's time to terraform those components! To keep the code maintainable and reusable, you may want to follow Terraform's module concept. Modularizing .tf files is now common practice for achieving some level of isolation, abstracting common blocks of configuration into reusable and manageable infrastructure elements. [5]

Infrastructure code like application code benefits from a well-managed approach consisting of three steps: write, test, and refactor. Modules can help with this as they significantly reduce duplication, enable isolation, and enhance testability.

Here's the directory structure I've made for the Terraform building blocks. The main main.tf calls all the required modules under the modules directory: iam, s3, and sagemaker. The source files, such as the Jupyter notebook and the training script used in the SageMaker container, are saved under the source directory. To check the directory structure, run tree as below (a hedged sketch of the module wiring follows the tree output).

$ tree --charset=o -I "*.template"
Terraform building blocks
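Since the tree output above comes from a screenshot, here is a hedged sketch of how main.tf might call the three modules. The variable names follow the terraform.tfvars file shown later, but the module inputs and outputs (such as module.iam.role_arn) are assumptions, not the repository's exact code:

module "iam" {
  source     = "../modules/iam"
  iam_name   = var.iam_name
  identifier = var.identifier
}

module "s3" {
  source                = "../modules/s3"
  notebook_bucket_name  = var.notebook_bucket_name
  sagemaker_bucket_name = var.sagemaker_bucket_name
}

module "sagemaker" {
  source                  = "../modules/sagemaker"
  sagemaker_notebook_name = var.sagemaker_notebook_name
  role_arn                = module.iam.role_arn # assumed module output
}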

I used Terraform v0.12.6 and the following providers:

$ cd main
$ terraform version
Terraform v0.12.6
+ provider.aws v2.23.0
+ provider.template v2.1.2
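For reproducible runs across machines, you can pin these versions in the configuration itself. A small sketch; the provider block is an assumption consistent with the variables in terraform.tfvars shown later:

terraform {
  required_version = "0.12.6" # fail fast on other Terraform versions
}

provider "aws" {
  version = "~> 2.23" # pin the provider series used in this article
  region  = var.aws_region
  profile = var.aws_profile
}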

Please consider upgrading to major version 0.12.x if you haven't yet. You also need to upgrade your Terraform configuration files by following the official guide. [6]

Here are some tips about SageMaker in Terraform. The resource aws_sagemaker_notebook_instance in Terraform initializes only a notebook instance, so the instance has to obtain the prepared Jupyter notebook from the S3 bucket and run it to train on the dataset and deploy an inference service. A "Lifecycle Configuration" can be leveraged to accomplish these tasks: it provides shell scripts that run when we create and start the notebook instance. Be careful that these shell scripts cannot run for longer than 5 minutes. As far as I confirmed, the initializing notebook stayed "Pending" while the shell script was running the training and deploying the inference endpoint. If a script runs for longer than 5 minutes, it fails and the notebook instance is not created or started. The following are recommendations from the official Amazon SageMaker documentation for avoiding such situations. [7]

I chose option 3, using the nohup command to run the prepared notebook:

  1. Cut down on necessary steps. For example, limit the conda environments in which large packages are installed
  2. Run tasks in parallel processes
  3. Use the nohup command in your script

Under the Terraform sagemaker module directory, I created a template directory to hold the shell script (sagemaker_instance_init.sh) for the lifecycle configuration. Terraform supports a resource called aws_sagemaker_notebook_instance_lifecycle_configuration that provides a lifecycle configuration for a SageMaker notebook automatically. This is the sample main.tf that was created for the SageMaker module.

modules/sagemaker/main.tf
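The screenshot above comes from the repository; as a stand-in, here is a hedged sketch of what such a module main.tf might contain, wiring the lifecycle configuration to the notebook instance. The template path, variable names, and instance type are assumptions:

# Render the init script kept under the module's template directory
data "template_file" "instance_init" {
  template = file("${path.module}/template/sagemaker_instance_init.sh")
}

# Lifecycle configuration; on_start must be base64-encoded
resource "aws_sagemaker_notebook_instance_lifecycle_configuration" "init" {
  name     = "sagemaker-instance-init"
  on_start = base64encode(data.template_file.instance_init.rendered)
}

resource "aws_sagemaker_notebook_instance" "notebook" {
  name                  = var.sagemaker_notebook_name
  role_arn              = var.role_arn # assumed module input
  instance_type         = "ml.t2.medium" # assumed instance type
  lifecycle_config_name = aws_sagemaker_notebook_instance_lifecycle_configuration.init.name
}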

Even with the lightweight data sample and the ordinary scikit-learn container, it took a long time, more than 5 minutes. When you leverage the lifecycle configuration with a different dataset, algorithm, or container that might take longer, you would need a more elaborate and sophisticated approach (maybe not the lifecycle configuration at all). Here's the example shell script that is called on_start when the notebook starts.

sagemaker_instance_init.sh
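The script itself lives in the repository; if you prefer to keep it inline rather than in a template file, here is a hedged HCL sketch of the same idea using a heredoc, with nohup pushing the long-running notebook execution into the background. The bucket name, notebook path, and nbconvert invocation are assumptions:

resource "aws_sagemaker_notebook_instance_lifecycle_configuration" "inline_init" {
  name = "sagemaker-instance-init-inline"

  on_start = base64encode(<<-SCRIPT
    #!/bin/bash
    set -e
    # Fetch the prepared notebook from S3 (hypothetical bucket/key)
    aws s3 cp s3://sagemaker-notebook-bucket/sagemaker-terraform-boston.ipynb /home/ec2-user/SageMaker/
    # nohup keeps the notebook run alive after this hook exits, so the
    # lifecycle script itself finishes well within the 5-minute limit
    nohup jupyter nbconvert --to notebook --execute /home/ec2-user/SageMaker/sagemaker-terraform-boston.ipynb &
  SCRIPT
  )
}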

Deployment

The main Terraform code is located under the main directory, and this is where we issue the terraform plan and terraform apply commands. Before running them, there are a few set-up steps to perform:

Complete code: https://github.com/yuyasugano/terraform-sagemaker-sample-1


i. Copy terraform_backend.tf.template to terraform_backend.tf and modify values accordingly. You need to manually create an S3 bucket or use an existing one to store the Terraform state file.

terraform {
  required_version = "0.12.6"
  backend "s3" {
    bucket = "<bucket-name>"
    key    = "sagemaker-sample/terraform.tfstate"
    region = "<region>"
  }
}

ii. Copy terraform.tfvars.template to terraform.tfvars and modify values accordingly. You don't need to create any of the buckets specified here; they will be created by terraform apply.

aws_region = "<aws_region>"
aws_profile = "<aws_profile>"
iam_name = "<iam_name>"
identifier = "sagemaker.amazonaws.com"
notebook_bucket_name = "<notebook_bucket_name>"
sagemaker_bucket_name = "<sagemaker_bucket_name>"
sagemaker_notebook_name = "<sagemaker_notebook_name>"

iii. Once the above files are created, simply run through the following terraform commands. Remember to ensure all commands succeed and to review the terraform plan output before applying.

$ terraform init
$ terraform validate
Success! The configuration is valid.
$ terraform plan -var-file=terraform.tfvars
Plan: 9 to add, 0 to change, 0 to destroy.
$ terraform apply -var-file=terraform.tfvars # yes
Apply complete! Resources: 9 added, 0 changed, 0 destroyed.

Finally, you will see all the resources created and an inference endpoint enabled. I configured the endpoint name sagemaker-terraform-test for this experiment.

sagemaker-terraform-test endpoint

Cleanup

Once you are done with the experiment, simply run the following to delete all resources. Again, recall that predictor.delete_endpoint() is not called in this notebook, so you need to delete the endpoint manually in the AWS console, or enable the predictor.delete_endpoint() call at the bottom of the notebook, before running the terraform commands.

$ terraform plan -destroy -var-file=terraform.tfvars
$ terraform destroy -var-file=terraform.tfvars

In summary, we've discussed the use of Terraform for one of the aforementioned MLOps areas and written tf files for AWS components such as IAM (Identity and Access Management), a SageMaker Notebook, and S3 buckets in this article. In the next article, I will consider adding a further part that builds reproducibility of models, mainly with a Lambda function.
