Crypto Backtesting Datalake

Yuya Sugano
Mar 9, 2023

There has been a ton of information about technical indicators and how to unleash the power of the cloud for technical trading in the crypto field. I’d like to share one implementation that I worked on recently: a datalake in AWS S3, with Glue tables, crawlers, and jobs to maintain and operate the datalake from daily backtesting data.

Here’s the result of an Amazon Athena query about which pair and technical strategy were the most profitable over the last year from ohlcv data. n1, n2, and n3 are parameters used by each technical indicator. With that said, you can see the pair link/usdt with Bollinger Bands was relatively performant.

Please note that equity starts from $10,000 in backtesting, and the shown values are what the equity reached over 365 days of last year’s daily candle data. The top result, 53970, means that the initial $10,000 gained roughly 440% and became about $53,970 if we had followed that strategy. Let’s delve into how to set up a backtesting datalake.

Athena: Sort top 10 equity in desc order (as of writing)

First things first, I won’t go too deep technically. I’ll share the idea and the concept of how you can implement your own backtesting datalake. You don’t need to use the same cloud service provider, language, or libraries that I used. However, since I used AWS services for this datalake, I’ll use the AWS service names. You can read those names interchangeably if you use a different cloud service provider.

Here is the basic concept.

  1. Create a file-based datalake (I only put csv files in it)
  2. Crawl the datalake and create/maintain a metadata catalog (Hive)
  3. Inspect the datalake by query and find the latest parameters

Create a file-based datalake

Let’s start off with the datalake. I need to store daily backtesting data (just a csv file with parameters and the resulting equity from backtesting.py) that is produced in a Jupyter notebook. I run a few notebooks via cron.

You can choose AWS S3, an Azure Storage Account, or whatever else to store the data, as long as it can hold a file (csv). It is better to have a feature to archive data or change storage tiers automatically for the sake of reducing overall costs.

Some tips:

  • Partitioning (I followed Hive partitioning for Athena; see the sketch after this list)
  • Cost reduction with archiving or tiering
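
To make the partitioning tip concrete, here is a minimal sketch of uploading a daily backtesting csv under a Hive-style partition key with boto3. The bucket name, prefix, and file layout are illustrative assumptions, not the exact ones I used.

```python
# Minimal sketch: upload a daily backtesting csv with Hive-style partitioning
# (pair=.../dt=YYYY-MM-DD) so Athena and Glue can prune partitions.
# Bucket name, prefix, and file layout are assumptions for illustration.
import datetime

import boto3

s3 = boto3.client("s3")

def upload_backtest_csv(local_path: str, pair: str, run_date: datetime.date) -> str:
    """Upload a csv under a Hive-partitioned key and return the S3 key."""
    key = (
        "backtesting/raw/"
        f"pair={pair}/"
        f"dt={run_date.isoformat()}/"
        "result.csv"
    )
    s3.upload_file(local_path, "my-backtesting-datalake", key)  # hypothetical bucket
    return key

# Example: upload today's ETH/USDT result
upload_backtest_csv("ethusdt_result.csv", "eth_usdt", datetime.date.today())
```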

Here’s a sample csv file for the ETH/USDT pair. Parameter optimization is run in backtesting and the final equity is calculated. This example uses n1, the short moving average window, and n2, the long moving average window, on daily ohlcv data.

csv file
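
As a rough sketch of how such a csv row can be produced, the snippet below optimizes an SMA cross strategy (n1 short window, n2 long window) with backtesting.py on daily ohlcv data and writes the best parameters and final equity to csv. The input file, symbol, and output path are assumptions for illustration.

```python
# Minimal sketch: optimize an SMA cross strategy with backtesting.py and
# write the best parameters plus the final equity to a csv row.
import pandas as pd
from backtesting import Backtest, Strategy
from backtesting.lib import crossover
from backtesting.test import SMA

class SmaCross(Strategy):
    n1 = 10  # short moving average window
    n2 = 30  # long moving average window

    def init(self):
        self.sma1 = self.I(SMA, self.data.Close, self.n1)
        self.sma2 = self.I(SMA, self.data.Close, self.n2)

    def next(self):
        if crossover(self.sma1, self.sma2):
            self.buy()
        elif crossover(self.sma2, self.sma1):
            self.position.close()

# ohlcv: DataFrame with Open/High/Low/Close/Volume columns and a DatetimeIndex
ohlcv = pd.read_csv("eth_usdt_daily.csv", index_col=0, parse_dates=True)

bt = Backtest(ohlcv, SmaCross, cash=10_000, commission=0.001)
stats = bt.optimize(
    n1=range(5, 30, 5),
    n2=range(10, 90, 5),
    maximize="Equity Final [$]",
    constraint=lambda p: p.n1 < p.n2,
)

best = stats._strategy  # strategy instance carrying the optimized parameters
pd.DataFrame(
    [{"pair": "eth_usdt", "n1": best.n1, "n2": best.n2,
      "equity_final": stats["Equity Final [$]"]}]
).to_csv("result.csv", index=False)
```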

Crawl the datalake and create/maintain a metadata catalog

The datalake is ready. Next I created a database and tables in the metadata catalog. AWS supports a Hive-like metadata catalog in Glue. You can automate crawling your datalake and creating/maintaining your tables in a database with such services. I’ve got two tables: one for raw data and one for transformed data.

Glue is a fully managed ETL service in AWS. I ran an automated job that transforms the csv files into parquet format with snappy compression. The key is automation: you can automate all processes here, from crawling the raw and transformed data to running the ETL job that transforms the raw data.
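
Here is a minimal sketch of what such a Glue job can look like: it reads the raw csv table registered by the crawler and writes it back to S3 as snappy-compressed parquet. The database, table, and output path names are assumptions, not the actual ones in my setup.

```python
# Minimal sketch of a Glue ETL job: raw csv table (from the crawler) -> parquet.
# Database, table, and S3 output path are hypothetical names.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw csv table that the crawler registered in the Data Catalog
raw = glue_context.create_dynamic_frame.from_catalog(
    database="backtesting_db",          # hypothetical database name
    table_name="raw_backtesting_csv",   # hypothetical table name
)

# Write parquet to the transformed prefix (snappy is Glue's default for parquet)
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-backtesting-datalake/backtesting/transformed/"},
    format="parquet",
)

job.commit()
```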

Some tips:

  • Optimize data format and compression (I chose parquet and snappy)
  • Automate crawling and jobs for data transformation (csv to parquet)

Simplified folder structure in S3

Inspect the datalake by query and find the latest parameters

Once you’ve got your data in the Glue metadata catalog, it is time to run queries for your analytic purpose. For example, you can search for the most profitable token pair and its parameters across all backtesting results in the datalake. If your datalake is partitioned, you also have the option to run a query over a specific time period.

A sample query
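
A query along these lines, run here through boto3 and Athena, ranks backtesting results by final equity. The database, table, and column names (including the dt partition) are assumptions that should be adapted to whatever your crawler registered.

```python
# Minimal sketch: run an Athena query that ranks backtesting results by final
# equity. Table, columns, and the dt string partition are hypothetical names.
import time

import boto3

athena = boto3.client("athena")

QUERY = """
SELECT pair, strategy, n1, n2, n3, equity_final
FROM backtesting_db.transformed_backtesting
WHERE dt BETWEEN '2022-03-01' AND '2023-03-01'   -- dt assumed to be a string partition
ORDER BY equity_final DESC
LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-backtesting-datalake/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows (skipping the header)
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]:
        print([col.get("VarCharValue") for col in row["Data"]])
```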

I programmed a script that takes the latest parameters from the datalake and invokes a buy/sell order automatically. Backtesting results vary when we provide a different timespan, pair, or timeframe for the ohlcv data. The idea is that our trading strategy needs to be updated on a daily basis from the backtesting results.

A sample script for SMA strategy
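
Below is a minimal sketch of such a script, assuming ccxt against Binance as the exchange wrapper (the article does not prescribe one) and a small csv holding the latest optimized parameters. The symbol, position size, and credentials are placeholders.

```python
# Minimal sketch: read the latest optimized n1/n2 and place an order on an
# SMA crossover. Exchange (ccxt/Binance), symbol, size, and parameter source
# are assumptions for illustration, not the author's exact setup.
import ccxt
import pandas as pd

# Latest parameters, e.g. exported from the Athena query above
params = pd.read_csv("latest_params.csv").iloc[0]   # columns: pair, n1, n2
n1, n2 = int(params["n1"]), int(params["n2"])
symbol = "ETH/USDT"

exchange = ccxt.binance({"apiKey": "...", "secret": "..."})

# Fetch enough daily candles to compute both moving averages
ohlcv = exchange.fetch_ohlcv(symbol, timeframe="1d", limit=n2 + 3)
close = pd.Series([candle[4] for candle in ohlcv])

sma_short = close.rolling(n1).mean()
sma_long = close.rolling(n2).mean()

# Simple crossover check on the last two fully closed candles
crossed_up = sma_short.iloc[-3] <= sma_long.iloc[-3] and sma_short.iloc[-2] > sma_long.iloc[-2]
crossed_down = sma_short.iloc[-3] >= sma_long.iloc[-3] and sma_short.iloc[-2] < sma_long.iloc[-2]

if crossed_up:
    exchange.create_market_buy_order(symbol, 0.1)    # position size is illustrative
elif crossed_down:
    exchange.create_market_sell_order(symbol, 0.1)
```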

Some tips:

  • Come up with your use cases before analysis (for me, maximizing profits)
  • Rearchitect your entire process and system into a more efficient state

I hope this article gives you some insights. Please let me know if you have any questions. I’m available on Twitter or here. Thanks for reading.


Yuya Sugano

Cloud Architect and Blockchain Enthusiast, techflare.blog, Vinyl DJ, Backpacker. I publish articles about blockchain and cloud (AWS/Azure). I love traveling; as a backpacker I have crossed the Eurasian continent overland.