Gary Collier, CTO of Man Alpha Technology, agreed to do an interview with me about ArcticDB. Here it is:
Can you give us some background information about ArcticDB?
Certainly. ArcticDB is a high-performance Python-native database. It was built in response to the increasing amount of data and complexity of front-office research at Man Group, which we recognise is a challenge faced by many institutions.
What kind of database is ArcticDB? (Relational, NoSQL, etc)
We call it a DataFrame database. Under the hood, ArcticDB is a columnar database. It shares the best features of a number of different technologies, which we think makes it unique in this space. Those features include:
- Support for wide tables. Unlike metrics time series databases, we support hundreds of thousands of columns.
- NoSQL-like support for variable schemas: you can add and remove columns over time as data is appended.
- Serverless architecture supporting both high-performance vertical scaling and scale-out horizontal cluster compute.
- Bi-temporality as a first-class concern.
- Cloud-native storage support. S3 backend built-in.
Does ArcticDB support other programming languages besides Python?
Today ArcticDB supports Python. In the future, we’ll support the Java and C#/.NET ecosystems too, and other languages as demand requires.
What industries is ArcticDB best suited for?
Our view is that it’s relevant to anyone or any organisation working with data in Python; any organisation where data is being processed and insights extracted at scale will find value in ArcticDB. Financial data science was a natural starting point for us, given our background, but opportunities to apply the technology exist across lots of different sectors — from bio sciences to aerospace.
How much faster is ArcticDB versus Arctic 1.0?
I should start by saying that we’re proud of Arctic 1.0 and it served us well for the terabyte-scale data challenges of its time. What we’ve built in ArcticDB is something more powerful: it offers the same user-friendly Python interface as its predecessor but with a brand-new C++ engine. This offers performance at petabyte scale with full enterprise support and compatibility with modern S3 storage. There is an order-of-magnitude increase in performance. In terms of speed, it processes billions of rows and thousands of columns in seconds.
Will you be supporting Polars in addition to Pandas?
Polars is on our radar, and if the demand exists we will absolutely support it. We would also gladly accept contributions and work with the wider Python open source community.
What are some of the most interesting or unusual use cases you have seen ArcticDB used for?
I mentioned before that we think ArcticDB is relevant to anyone or any organisation carrying out data science in the modern Python data science ecosystem. Financial data science was a logical place for us to start, given the industry we work in. One of the use cases we’ve seen in that sphere relates to alternative data. You’ll know that alternative datasets are often large (multi-terabyte) and require some form of ETL before they can be stored in a database. They also require concurrent processing to reduce the time needed to complete a backfill; re-partitioning of the dataset into a usable format; and a point-in-time record of historical changes for research purposes.
ArcticDB solves that in the following way:
- The data can be structured as one or more symbols in ArcticDB.
- ETL tools concurrently process the dataset and write to ArcticDB.
- Each ETL batch is versioned, to allow point-in-time access to the history of changes.
- Block-level deduplication between versions ensures efficient usage of storage.
- And finally, users can read and slice the resultant data as a single DataFrame.
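To make the versioning and deduplication steps above a little more concrete, here is a toy, in-memory sketch in plain Python. It does not use ArcticDB and is not a description of how ArcticDB is implemented. The class `ToyVersionedStore`, its `write` and `read` methods, the `as_of` parameter, and the sample data are all hypothetical names chosen for illustration. The idea it shows is simply that each write creates a new version, and that blocks with identical content are stored only once across versions.

```python
import hashlib
import json


class ToyVersionedStore:
    """Toy illustration only: versioned symbols built from content-addressed blocks."""

    def __init__(self, rows_per_block=2):
        self.rows_per_block = rows_per_block
        self.blocks = {}    # block hash -> list of rows (each distinct block stored once)
        self.versions = {}  # symbol -> list of versions; each version is a list of block hashes

    def write(self, symbol, rows):
        """Store `rows` as a new version of `symbol` and return the version number."""
        hashes = []
        for start in range(0, len(rows), self.rows_per_block):
            block = rows[start:start + self.rows_per_block]
            payload = json.dumps(block, sort_keys=True).encode()
            digest = hashlib.sha256(payload).hexdigest()
            # A block whose content already exists is shared, not stored again.
            self.blocks.setdefault(digest, block)
            hashes.append(digest)
        self.versions.setdefault(symbol, []).append(hashes)
        return len(self.versions[symbol]) - 1

    def read(self, symbol, as_of=None):
        """Return all rows of `symbol` at version `as_of` (latest version if None)."""
        history = self.versions[symbol]
        hashes = history[-1] if as_of is None else history[as_of]
        return [row for h in hashes for row in self.blocks[h]]


store = ToyVersionedStore(rows_per_block=2)

# Two hypothetical ETL batches: the second repeats the first and adds one row.
batch_1 = [{"date": "2023-01-02", "px": 100.0}, {"date": "2023-01-03", "px": 101.5}]
batch_2 = batch_1 + [{"date": "2023-01-04", "px": 99.8}]

v0 = store.write("vendor_feed", batch_1)
v1 = store.write("vendor_feed", batch_2)

print(len(store.read("vendor_feed", as_of=v0)))  # rows as they stood after the first batch
print(len(store.read("vendor_feed")))            # rows in the latest version
print(len(store.blocks))                         # distinct blocks held across both versions
```

In this sketch, reading with `as_of=v0` returns the data as it stood after the first batch, while the unchanged first block is shared between the two versions rather than stored twice. A real system would persist blocks to storage such as S3 and handle concurrent writers, which this sketch deliberately leaves out.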