Next level of time series data storage and delivery: MarketStore
Alpaca’s products deal with capital market data, the majority of which is time series. As our business expands from a niche market to a broader audience, we have come to realize some of the data challenges we face. Some are common across the industry and others are unique to us, but here are the high-level data demands of capital market applications.
- Maximal throughput and lowest latency with limited resources
- Scale with growing number of users and algos, as well as asset classes
- Reliable delivery of market data to applications
Maximal throughput and lowest latency with limited resources
Speed is key. If a chart loads slowly, people stop using the application. We need to serve thousands of people who dynamically scroll intraday charts back as far as 10 years to take a close look at what would have happened in the market and in their algos. This contributes a lot to the user experience as well as to reliable live-test operation. Time series data is notoriously hard to serve with high performance from a general-purpose data storage system, since it requires many specialized operations such as sorting and resampling. True, we could spend a lot of money on huge hardware to achieve the best performance, but that’s not what we want at Alpaca. We squeeze the best capability out of minimum resources. Aside from computing resources, another important resource is our team members’ effort. Beyond managing the running software, we should be able to do quick trial and error when developing new applications using the market data.
Scale with growing number of users and algos, as well as asset classes
We started in a niche market but are seeing more demand for bigger use cases every day. The user base grows day by day, and we envision eventually serving millions of people. Even if we are talking about only tens of thousands of users, with each typically building a handful of algos, the number of algos we host on our platform reaches the order of hundreds of thousands. What hedge fund or investment bank in the world has ever run as many trading algos simultaneously, all the time, as we do? Remember, each algo is like a person reading the chart every minute, so the data needs to be served without any issues to that many algos. And while the number of algos keeps increasing, we are also expanding our data coverage from a handful of currency pairs to more than ten thousand stock symbols, futures, bitcoins and ETFs across the world. We don’t want our business to stop growing because of a data scalability issue.
Reliable delivery of market data to applications
Data is like sashimi: it’s best when served as fresh as possible. As described above, we host many algos that keep their mouths open all the time to catch the fresh data coming out of the market. The appropriate data should be delivered to the right consumer at the earliest possible time; otherwise, algos cannot generate meaningful results. Since our neural network-based algos are extremely computationally heavy, we always distribute the tasks across many computers and shuffle them to make the best use of computing resources, and what matters even more is keeping the data close to the computation. At the same time, we have to watch the market constantly and get the latest information on any of these ever-growing asset classes. Finally, this is a system that deals with real money. If we fail to deliver correct data to the right calculation, our business will lose customers’ confidence.
How and why are we approaching it now?
Last year, to start our closed beta program, we wrote a few hundred lines of Python that did the job for the data size at the time, but we knew it wouldn’t scale to what we wanted it to be. That’s not bad; as a small startup, we always start small and see how big things need to be. But we realized the time had come to revisit this problem.
So, a couple of months ago we sat together and had a long discussion about how we would tackle this capital market data problem. Chris, who recently joined the team, brings his experience from mission-critical defense systems and IBM/DB2 development. Luke built high-performance computing clusters for numerical simulation back in the ’90s and later founded a company and database called Greenplum, which became one of the most successful Big Data databases. I built many data-driven applications before joining Greenplum, back when it wasn’t yet called Big Data or Data Science, and later I architected the Greenplum database engine. During that time, we worked with many customers in the financial sector and learned a lot about the problems capital market applications face. Out of those experiences, we came up with the system that is now called MarketStore. It is designed specifically to solve the problems described earlier, by incorporating modern technologies.
It is still too early to finalize the design, and we continue to improve it as we better understand the data demands on our platform, but the key design points are as follows.
- 100s of thousands of time series to serve and update in real time (10k+ symbols * 5-6 timeframes)
- more than 10 years of second-level data
- 100s of thousands of concurrent clients (algos), for both historical queries and a real-time pub/sub model
- updates as frequent as once per second for each of these time series
- sub-millisecond latency
- with minimal hardware resources
- highly available, with minimal manual intervention
These requirements say, in effect, that this is a mix of fast data and big data, which would typically be separated into different products. Because it is hard to satisfy such different requirements at the same time, each product usually focuses on one, not both. From my database development experience, I can see how challenging these points are. The reason MarketStore can do both jobs is that it is optimized for our financial application purposes.
It is written mainly in Go to overcome the concurrency and memory problems we previously had with Python’s interpreter. It is designed with modern hardware and software in mind, such as SSDs and sparse file support, while it persists the data to global object storage. Queries from clients are processed much as in a typical database, going through a parser/planner/executor model. The client interface uses HTTP (currently 1.1, possibly expanding to 2.0), including WebSocket.
Now we have MarketStore to expand our business to US equities and a broader range of data to manage. While running our products, we also listen to our users and are keen on building things that solve our customers’ real problems. Thanks to this data store, our application development has become much easier, and we can iterate in short hypothesis-and-verification cycles. It may sound easy to build such applications around market data, but it would be much, much harder without MarketStore to fetch one symbol of one timeframe out of tens of thousands of symbol/timeframe pairs that update very frequently.
You may ask why we are building a new data store even though there are so many existing data products out there. Let’s look at them one by one.
RDBs such as PostgreSQL and MySQL are viable solutions, and I like them. That said, it is hard to optimize them for time series applications, since relational algebra has no notion of order in a dataset. And we didn’t need SQL.
HDFS is a good foundation for any kind of modern data platform. It is especially good if you want to throw gigantic data at it and forget about it, but we quite frequently have to retrieve the majority of the data, and for that purpose we have S3 anyway. MapReduce and Hive latency is not in our scope at all, and Spark would be good if we wanted to run k-means or the like over large data, but we don’t. HBase is more write-oriented than our read-heavy workload calls for, and none of these solutions provides time series capability with low-latency delivery. I could list many more arguments in this space, but most importantly, there are just too many products, and managing a Hadoop cluster at our team size was simply not an option.
Yes, there are good solutions for message delivery, such as Kafka and RabbitMQ. They are built for that purpose, are highly available and are easy to manage. We could have used one, but it is also not hard to build one. Especially in this sector, newly delivered data should be persisted and should be consistent with subsequent history queries. A message bus is not designed to keep the data, and we needed to build the storage part anyway; thanks to Go’s concurrency support in the language and runtime, we simply didn’t need a separate messaging system.
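Go’s channels and goroutines make a minimal in-process pub/sub straightforward, which is part of why a separate messaging system felt unnecessary. The sketch below is an illustration of the pattern, not MarketStore’s actual implementation; in a real system, persisting the update would come before fanning it out.

```go
package main

import (
	"fmt"
	"sync"
)

// Hub fans incoming updates out to all subscribers of a series key.
type Hub struct {
	mu   sync.Mutex
	subs map[string][]chan float64
}

func NewHub() *Hub { return &Hub{subs: make(map[string][]chan float64)} }

// Subscribe returns a buffered channel that receives future updates
// for the given series key.
func (h *Hub) Subscribe(key string) <-chan float64 {
	ch := make(chan float64, 16)
	h.mu.Lock()
	h.subs[key] = append(h.subs[key], ch)
	h.mu.Unlock()
	return ch
}

// Publish fans a value out to subscribers, dropping the update for a
// consumer whose buffer is full rather than blocking the publisher.
func (h *Hub) Publish(key string, v float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, ch := range h.subs[key] {
		select {
		case ch <- v:
		default: // slow consumer: drop instead of stalling the market feed
		}
	}
}

func main() {
	hub := NewHub()
	prices := hub.Subscribe("USDJPY/1Min")
	hub.Publish("USDJPY/1Min", 110.25)
	fmt.Println(<-prices) // prints 110.25
}
```

Because the hub lives in the same process as the storage engine, a subscriber sees exactly the data that was just persisted, which is the consistency property a detached message bus makes hard to guarantee.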
DataFrame is great, and in fact, we started with it. It was originally developed by quants, so it makes total sense that it has a full set of time series functionality for financial data applications. It’s just that Pandas wasn’t meant for a server-side backend system. We still use DataFrame in some of our systems, and the MarketStore clients can build local DataFrame objects from the server response, but one of our immediate goals is to replace all DataFrame persistence with MarketStore.
You may have heard about a relatively new open source database, also written in Go. It is a great effort to address time series database problems in the community with an open approach, and I like that. However, it seemed to me to be designed primarily for system metrics data (and possibly IoT-style applications), which is not the best fit for our financial data. When it comes to technical analysis in financial applications, there are many windowing operations, such as moving averages and other derived indicators. That kind of query extensibility was not there when I looked at it (I guess it’s getting there these days), and it also lacks seamless integration between historical queries and real-time message delivery.
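The moving average mentioned above is the simplest of these windowing operations. Here is a sketch of a simple moving average over closing prices, using a running sum; this is an illustration of the operation, not MarketStore’s aggregator API.

```go
package main

import "fmt"

// SMA computes a simple moving average with the given window over a
// series of closing prices. The first window-1 positions have no full
// window, so the result corresponds to indices window-1 onward.
func SMA(closes []float64, window int) []float64 {
	if window <= 0 || len(closes) < window {
		return nil
	}
	out := make([]float64, 0, len(closes)-window+1)
	var sum float64
	for i, c := range closes {
		sum += c
		if i >= window {
			sum -= closes[i-window] // slide the window: drop the oldest value
		}
		if i >= window-1 {
			out = append(out, sum/float64(window))
		}
	}
	return out
}

func main() {
	fmt.Println(SMA([]float64{1, 2, 3, 4, 5}, 3)) // prints [2 3 4]
}
```

The point is that such operations are order-dependent and window-based, so a query engine built for unordered metrics rollups does not express them naturally.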
If you have ever been a DBA in the financial industry, you probably know this unique database, which is actually quite popular in that sector. It’s a columnar, in-memory database designed specifically for financial applications. It is, however, a commercial product, and it runs as a single-node database that doesn’t scale, so you would need expensive hardware with sizeable RAM.
So, although there really are many different data products around the world, I didn’t think any of them could solve our problems, and I am guessing that’s the case for many other people in this sector. That’s one of the reasons we are putting our effort into this storage.
I have summarized the current state of our financial data storage, called MarketStore. I am sure this is an interesting topic to many engineers, and I look forward to hearing any kind of feedback or questions. Although we invest our resources in this component because we need it, it is not our main business, so we may also think about open sourcing it once it gets ready. For now, we will incubate it on our internal servers, so stay tuned…
Hitoshi, CTO of Alpaca