DynamoDB for high-frequency crypto board data ~ the high-volume time-series pattern ~
Introduction
One distinctive characteristic of the crypto market is that it lacks concepts the traditional markets have long had, such as exchange trading hours, board lots, and the real-world demand that commodities and real estate carry. This makes the crypto market flatter and open 24 hours to everyone; there is no barrier to starting from 1 USD, wherever you are. Price formation tends to be more predictable by perceiving patterns and analyzing the indicators behind surges and plunges than by following news such as reward halvings or investor movements. Especially in short-term trading, a significant share of the market appears to use algorithmic trading and bots, which leverage high-frequency data to analyze patterns and pursue high yields in a short time. In this post I will introduce the AWS blog “Design patterns for high-volume, time-series data in Amazon DynamoDB” and implement its DynamoDB pattern for the board information of a crypto exchange. The first step is to design and prebuild tables, optimizing table resources with the introduced pattern for board data; the second step is to write a Lambda function that pulls data via the exchange API and stores it in the appropriate table every second.
I looked through some OHLCV data, but it did not give insightful information because it lacks details such as the thickness of the board, trade prices, and the distribution of orders around the close price. This is the motivation to implement this system: to put useful data into schemaless DynamoDB for analysis purposes.
For such high-volume events, a data ingestion system requires: [1]
- High write throughput to ingest new records related to the current time
- Low read throughput for recent records
- Very low read throughput for older records
That AWS blog uses three throughput classifications for DynamoDB tables (high, low, and very low), but I will use only two of them, high write throughput and low read throughput, for the crypto exchange board.
Every DynamoDB access pattern requires a different allocation of read capacity units and write capacity units. For this system there are two distinct groups based on how often data is read and written:
- New records written every second
- Old records read sometimes
The general design principles of DynamoDB recommend using the smallest possible number of tables. When it comes to time-series data, though, you break from these principles and create multiple tables, one for each time period. In this post, I show you how to use such an anti-pattern for DynamoDB, because it is a great fit for time-series data.
Automatically prebuild tables with optimized RCU/WCU
Against the general design principles of DynamoDB, we create a DynamoDB table on a daily basis. This part focuses on how to prebuild the daily tables automatically and optimize the RCU/WCU configuration for the current day's table versus the older tables. For newly ingested records in today's table we can allocate more write capacity units and fewer read capacity units, while the older tables, which are only read for analysis, need read capacity but almost no write capacity; technically we can reduce their write capacity units to 1.
Here is my assumed requirement for the daily tables; the official documentation explains how to calculate RCU/WCU in provisioned mode [2], and a quick capacity check follows the list.
- Today’s table with 3 write capacity units and 3 read capacity units
- Older tables with 1 write capacity unit and 1 read capacity unit
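As a quick sanity check on those numbers (the item size here is my assumption, not from the post): in provisioned mode, one WCU covers one standard write per second for an item up to 1 KB, and one RCU covers one strongly consistent read per second (or two eventually consistent reads) for an item up to 4 KB. So 3 WCUs can absorb one board snapshot of up to about 3 KB every second, while the older, read-only tables just need to stay queryable for occasional analysis, for which 1 RCU / 1 WCU is the floor.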
Here’s the code of the Lambda function and resizer.py
written in Python to achieve this task. The tasks are comprised of…
- Create a new table with 3 WCU and 3 RCU
- Update the old table to 1 WCU and 1 RCU
These tasks are encapsulated in the DailyResize class, which is called from the Lambda function handler. The rest is to schedule this Lambda function invocation every day, a few minutes before and a few minutes after midnight. CloudWatch Events lets us do this with a cron-like syntax, as the template sketch below shows.
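Here is a sketch of the scheduling part of the SAM template; the exact cron expressions (five minutes before and after midnight UTC), the runtime, the handler path, and the update_old operation name are my assumptions, not the post's exact template:

DailyResizeFunction:
  Type: AWS::Serverless::Function
  Properties:
    Handler: lambda_function.lambda_handler
    Runtime: python3.8
    Events:
      CreateNewTable:
        Type: Schedule
        Properties:
          Schedule: cron(55 23 * * ? *)   # a few minutes before midnight UTC
          Input: '{"Operation": "create_new"}'
      UpdateOldTable:
        Type: Schedule
        Properties:
          Schedule: cron(5 0 * * ? *)     # a few minutes after midnight UTC
          Input: '{"Operation": "update_old"}'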
The Lambda function takes an Operation parameter from the incoming event and calls the DailyResize class to perform the required process: create the new table or update the older one.
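A minimal sketch of that dispatch logic, assuming a lambda_handler entry point and a TABLE_PREFIX environment variable fed by the template's TablePrefix parameter (both naming assumptions of mine):

# lambda_function.py -- a sketch, not the post's exact code
import os
from resizer import DailyResize

def lambda_handler(event, context):
    # Operation comes from the scheduled CloudWatch Events rule
    operation = event['Operation']
    resizer = DailyResize(table_prefix=os.environ['TABLE_PREFIX'])
    if operation == 'create_new':
        resizer.create_new()
    elif operation == 'update_old':
        resizer.update_old()
    else:
        raise ValueError('Unknown Operation: %s' % operation)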
resizer.py was coded so that the name is the partition key and the datetime is the sort key. All items with the same partition key value are stored together, sorted by sort key value, so it is important to consider how partitioning works in DynamoDB and how to query efficiently across partitions. Here the name can be a currency pair, for example btc/jpy in bitbank.cc trading. [3]
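In the same spirit, here is a sketch of what DailyResize plausibly does with boto3; the daily table-name scheme and the method names are my assumptions:

# resizer.py -- a sketch, not the post's exact code
from datetime import datetime, timedelta
import boto3

dynamodb = boto3.client('dynamodb')

class DailyResize:
    """Create the next day's table with generous capacity and
    shrink the previous day's table to the 1 RCU / 1 WCU floor."""

    def __init__(self, table_prefix):
        self.table_prefix = table_prefix

    def _table_name(self, day):
        return '%s_%s' % (self.table_prefix, day.strftime('%Y-%m-%d'))

    def create_new(self):
        # Runs a few minutes before midnight: prebuild tomorrow's table
        # with partition key "name" and sort key "datetime"
        tomorrow = datetime.utcnow() + timedelta(days=1)
        dynamodb.create_table(
            TableName=self._table_name(tomorrow),
            KeySchema=[
                {'AttributeName': 'name', 'KeyType': 'HASH'},
                {'AttributeName': 'datetime', 'KeyType': 'RANGE'},
            ],
            AttributeDefinitions=[
                {'AttributeName': 'name', 'AttributeType': 'S'},
                {'AttributeName': 'datetime', 'AttributeType': 'S'},
            ],
            ProvisionedThroughput={'ReadCapacityUnits': 3, 'WriteCapacityUnits': 3},
        )

    def update_old(self):
        # Runs a few minutes after midnight: yesterday's table no longer
        # takes writes, so drop it to the minimum provisioned capacity
        yesterday = datetime.utcnow() - timedelta(days=1)
        dynamodb.update_table(
            TableName=self._table_name(yesterday),
            ProvisionedThroughput={'ReadCapacityUnits': 1, 'WriteCapacityUnits': 1},
        )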
You can deploy this Lambda function and configure CloudWatch Events in cron syntax manually. I instead followed the procedure in the AWS blog and deployed the Lambda application with the AWS Serverless Application Model (AWS SAM). SAM is a framework for building serverless applications. It provides shorthand syntax to express functions, APIs, databases, and event source mappings in a YAML file, which makes it quite useful for deploying a serverless application to AWS declaratively. When you use SAM, it transforms and expands the SAM syntax into AWS CloudFormation syntax to build the application. [4]
To get started with building SAM-based applications, use the AWS SAM CLI. SAM CLI provides a Lambda-like execution environment that lets you locally build, test, and debug applications defined by SAM templates. You can also use the SAM CLI to deploy your applications to AWS.
Please install aws-sam-cli, a tool for local development and testing of serverless applications. [5]
$ pip3 install --user aws-sam-cli
...Successfully installed Flask-1.0.4 arrow-0.15.5 aws-lambda-builders-0.8.0 aws-sam-cli-0.48.0 aws-sam-translator-1.22.0 binaryornot-0.4.4 chevron-0.13.1 cookiecutter-1.6.0 docker-4.2.0 future-0.18.2 importlib-metadata-1.6.0 jinja2-time-0.2.0 jmespath-0.9.5 jsonschema-3.2.0 poyo-0.5.0 pyrsistent-0.16.0 requests-2.23.0 serverlessrepo-0.1.9 tomlkit-0.5.8 websocket-client-0.57.0 wheel-0.34.2 whichcraft-0.6.1 zipp-3.1.0
$ sam --version
SAM CLI, version 0.48.0
Prepare an S3 bucket for storing the Lambda function code and the resizer.py class code; I synced lambda_function.py and resizer.py to the target bucket. A bucket name must be globally unique, so change it to your own unique bucket name.
# create new s3 bucket
$ aws s3 mb s3://serverless-aws-sam-deployment
make_bucket: serverless-aws-sam-deployment
# sync the codes to the created s3 folder
$ aws s3 sync . s3://serverless-aws-sam-deployment
upload: ./lambda_function.py to s3://serverless-aws-sam-deployment/lambda_function.py
upload: ./resizer.py to s3://serverless-aws-sam-deployment/resizer.py
The sam package command generates a compiled version of the template YAML file. Then run sam deploy with the compiled template to submit it to CloudFormation, which applies the AWS::Serverless transform and provisions all resources and triggers described in the compiled file.
# generate a compiled version of the template.yml
$ sam package --template-file template.yml --s3-bucket serverless-aws-sam-deployment --output-template-file compiled.yml
Uploading to c35eb979efe4219093e237288b8e 10475 / 10475.0 (100.00%)
Successfully packaged artifacts and wrote output template to file compiled.yml.
Execute the following command to deploy the packaged template:
sam deploy --template-file compiled.yml --stack-name <YOUR STACK NAME>
# invoke CloudFormation with the compiled template with deploy
$ sam deploy --template-file compiled.yml --stack-name dynamodb-time-series --capabilities CAPABILITY_IAM --parameter-overrides TablePrefix=btcjpy
Successfully created/updated stack - dynamodb-time-series in None
You can confirm that the CloudFormation stack was created with the name you specified as --stack-name <YOUR STACK NAME>; in this example dynamodb-time-series is the stack name.
If you want to test the deployed Lambda function, you can send an event with this Operation parameter to invoke the create_new function and create a new table:
{
"Operation": "create_new"
}
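You can send this event from the Lambda console or with the AWS CLI; the function name below is a placeholder for the one CloudFormation created in your stack, and with AWS CLI v2 you also need --cli-binary-format raw-in-base64-out so the JSON payload is sent as-is:

# invoke the function with the test event (function name is a placeholder)
$ aws lambda invoke --function-name <YOUR FUNCTION NAME> --payload '{"Operation": "create_new"}' response.json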
Serverless application to ingest ticker information and depth of the board
In the same way as the SAM application for prebuilding DynamoDB tables, I packaged lambda_function.py and bitbank.py, which call the bitbank.cc API and save the ticker and order board information into the daily tables generated by the former SAM application. We can invoke the Lambda function every second or on a sub-minute basis for HFT (high-frequency trading) and scalping techniques. [6]
I used bitbank's public API to retrieve the ticker information and the thickness of the board as lists with the right timestamp. Board information, such as trade prices and the distribution of orders around the last price, cannot be obtained from candlestick data.
The Lambda function takes an Operation parameter from the incoming event, analogous to the resizer handler sketched above, and calls the BitbankApi class to call the API and save the data into today's DynamoDB table.
bitbank.py is the code that ingests a new item into the DynamoDB table. It is similar to resizer.py in that it encapsulates the required processes in a single class, which is called from lambda_function.py. The name attribute is the partition key and the datetime attribute is the sort key; there is no schema defined for the rest of the attributes, which are stored in an ad-hoc way in DynamoDB.
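A sketch of what such a class might look like with python-bitbankcc and boto3; the class shape, method name, and daily table-name scheme are my assumptions:

# bitbank.py -- a sketch, not the post's exact code
from datetime import datetime
import boto3
import python_bitbankcc

class BitbankApi:
    """Fetch ticker and order-book depth from bitbank.cc's public API
    and store them as a single item in today's daily table."""

    def __init__(self, table_prefix, pair='btc_jpy'):
        self.pair = pair
        table_name = '%s_%s' % (table_prefix, datetime.utcnow().strftime('%Y-%m-%d'))
        self.table = boto3.resource('dynamodb').Table(table_name)
        self.public = python_bitbankcc.public()

    def ingest_new(self):
        # bitbank's public API returns prices and amounts as strings,
        # which DynamoDB stores directly without float-precision issues
        ticker = self.public.get_ticker(self.pair)
        depth = self.public.get_depth(self.pair)
        self.table.put_item(Item={
            'name': self.pair,                          # partition key
            'datetime': datetime.utcnow().isoformat(),  # sort key
            'ticker': ticker,
            'depth': depth,
        })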
This application needs the external libraries requests and python_bitbankcc to be included in S3 when we make the application package in SAM.
# install required libraries in the same directory
$ cd lambda-sam/lambda_function/
$ pip install requests -t ./
$ pip install git+https://github.com/bitbankinc/python-bitbankcc.git -t ./
# sync the codes and libraries to the created s3 folder
$ aws s3 sync . s3://serverless-aws-sam-deployment
...2020-05-03 13:44:08 1802 bitbank.py
2020-05-03 13:44:08 285 lambda_function.py
Next, as before, the sam package command generates a compiled version of the template YAML file, and sam deploy submits the compiled template to CloudFormation, which applies the AWS::Serverless transform and provisions all resources and triggers described in the compiled file.
# generate a compiled version of the template.yml
$ sam package --template-file template.yml --s3-bucket serverless-aws-sam-deployment --output-template-file compiled.yml
Uploading to 6afbc4807a8bb09a27b86269242800c6 914945 / 914945.0 (100.00%)
Successfully packaged artifacts and wrote output template to file compiled.yml.
Execute the following command to deploy the packaged template:
sam deploy --template-file compiled.yml --stack-name <YOUR STACK NAME>
# invoke CloudFormation with the compiled template with deploy
$ sam deploy --template-file compiled.yml --stack-name lambda-bitbankcc-depth --capabilities CAPABILITY_IAM --parameter-overrides TablePrefix=btcjpy
Successfully created/updated stack - lambda-bitbankcc-depth in None
You can confirm that the CloudFormation stack was created with the name you specified as --stack-name <YOUR STACK NAME>; in this example lambda-bitbankcc-depth is the stack name.
If you want to test the deployed Lambda function, you can send an event with this Operation parameter to invoke the ingest_new function and create a new item in DynamoDB:
{
"Operation": "ingest_new"
}
Here is one example of an item that was added to the table. The depth attribute of the item holds a list of asks and a list of bids, along with a sequenceId and a timestamp. A drawback of this implementation is that the minimum interval of CloudWatch Events invocations is 1 minute, so I could not set the interval to a second or a sub-minute value.
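Roughly, such an item has the following shape (the values here are illustrative, not real market data):

{
  "name": "btc_jpy",
  "datetime": "2020-05-03T13:44:08.123456",
  "ticker": { "last": "970000", "buy": "969900", "sell": "970100", ... },
  "depth": {
    "asks": [["970100", "0.1500"], ["970200", "0.4000"], ...],
    "bids": [["969900", "0.2000"], ["969800", "1.0000"], ...],
    "sequenceId": "123456789",
    "timestamp": 1588513448000
  }
}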