What is time-series data?
With ever-growing amounts and variety of data, almost any data analysis eventually involves the date-time axis.
- How many products do I sell today compared to the day before?
- What is the trend of search terms in Google Ads?
- How has the bitcoin price changed over the past year?
All of these questions imply time-series analysis. So what is the logical difference from a regular database? It's a matter of priorities and usage patterns.
For example, a regular database of employees probably doesn't require much time-series analysis; it certainly won't be the main type of query.
On the other hand, if we want to analyze sensor data from IoT (Internet of Things) devices, the key dimension, besides the sensor name or location, is the timestamp. That makes time-series analysis the main usage scenario for IoT data.
What are some examples of time-series data?
As we previously mentioned, time-series data is a special type of data that changes over time. The most obvious examples of such data are the following:
- Stock prices
- Monthly website traffic
- City population
- Network speed
- Amount of goods at a warehouse
Time-series data can come at different granularities: per second, per minute, or even per year. But in all cases it changes over time, and businesses can take advantage of that.
Why do I need a time-series database?
And do I even need one at all if I want to analyze anything from the datetime perspective? Not necessarily. Any relational database can analyze data along the datetime axis, and small data volumes or a low insert rate don't call for a time-series-specific DB.
To put it simply, you need to consider a time-series DB when you aim to consume and analyze a huge amount of time-series data, both in terms of the insert rate and the total volume.
If I put the purpose of a time-series database in a single sentence, it would be something like this:
‘The ability to consume and store huge amounts of data while ensuring stable query performance’
By saying ‘huge’, I mean billions of records or thousands of inserts per second.
But how can we achieve that? Normally, time-series data analysis happens on various granularity levels, e.g. we want to see price changes every minute for the past hour, every hour for the past week, and every day for the past month.
Those types of aggregations are pretty common and can dramatically reduce the amount of storage required for your data. At the same time, we still keep all the data received so far, even though it comes at the cost of losing precision.
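To make the idea concrete, here is a minimal sketch in plain Python (the data, bucket sizes, and function names are made up for illustration): raw per-second price ticks are rolled up into per-minute and per-hour aggregates, which is exactly the kind of downsampling a time-series DB automates.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw readings: one (timestamp, price) pair per second for one hour.
epoch = datetime(2024, 1, 1)
raw = [(epoch + timedelta(seconds=i), 100.0 + (i % 60) * 0.01) for i in range(3600)]

def downsample(records, bucket_seconds, epoch):
    """Aggregate raw records into fixed-size time buckets, keeping avg/min/max."""
    buckets = defaultdict(list)
    for ts, price in records:
        # Index of the bucket this timestamp falls into, counted from `epoch`.
        idx = int((ts - epoch).total_seconds() // bucket_seconds)
        buckets[epoch + timedelta(seconds=idx * bucket_seconds)].append(price)
    return {
        bucket: (sum(v) / len(v), min(v), max(v))
        for bucket, v in sorted(buckets.items())
    }

per_minute = downsample(raw, 60, epoch)    # 3600 raw rows -> 60 aggregate rows
per_hour = downsample(raw, 3600, epoch)    # 3600 raw rows -> 1 aggregate row
print(len(raw), len(per_minute), len(per_hour))  # 3600 60 1
```

The 60x storage reduction per level is why databases keep old data only at coarse granularity.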
It gives us a second (optional) quality of time-series databases:
‘The ability to automatically manage data aggregations’
This also includes data retention policies (data gets deleted automatically after it expires).
This quality is optional because it can be achieved with a custom ETL process, though many DB vendors support it out of the box.
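For illustration, a hand-rolled retention job might look like the toy sketch below (SQLite stands in for a real time-series store; the table, column names, and data are made up). Vendors that support retention out of the box effectively run the equivalent of this on a schedule.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metrics (ts TEXT, sensor TEXT, value REAL)")

# Seed 100 daily readings, one per day going back from "now".
now = datetime(2024, 6, 1)
rows = [((now - timedelta(days=d)).isoformat(), "s1", float(d)) for d in range(100)]
conn.executemany("INSERT INTO metrics VALUES (?, ?, ?)", rows)

def apply_retention(conn, retention_days, now):
    """Delete rows older than the retention window - the job a vendor's
    built-in retention policy would run automatically."""
    cutoff = (now - timedelta(days=retention_days)).isoformat()
    cur = conn.execute("DELETE FROM metrics WHERE ts < ?", (cutoff,))
    conn.commit()
    return cur.rowcount

deleted = apply_retention(conn, retention_days=30, now=now)
remaining = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
print(deleted, remaining)  # 69 31
```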
Learn more about what ETL is and which steps it includes.
To sum it up, let’s see how time-series databases help handle big amounts of data:
- On a logical level, it’s aggregations and automatic retention policies (if it applies to your data).
- On a physical level, it’s all about storage optimization. Depending on your database, that could be sharding strategies, partitioning, clustering, compression, etc. Many column-oriented databases claim good time-series support because they can store data sorted or clustered.
- On a computational level, there may also be custom functions that aren’t part of the SQL standard (like locf() in TimescaleDB).
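As a rough illustration of the physical level, here is a toy Python sketch of time-based partitioning (an in-memory dict stands in for real on-disk partitions, and all names are invented): rows are routed into per-day partitions, so a query for a single day only scans that day's partition and never touches the rest.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Each day's rows live in their own partition, keyed by date.
partitions = defaultdict(list)

def insert(ts, value):
    partitions[ts.date()].append((ts, value))

start = datetime(2024, 1, 1)
for i in range(72):  # 72 hourly readings spanning 3 days
    insert(start + timedelta(hours=i), float(i))

def query_day(day):
    # Partition pruning: only the requested day's partition is scanned.
    return partitions.get(day, [])

print(len(partitions), len(query_day(start.date())))  # 3 partitions, 24 rows
```

Real databases apply the same idea on disk, so time-range queries read a small fraction of the total data.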
A couple of use cases:
- IoT. Sensor data could be usable in many areas such as telco (with metrics from cells), healthcare, production lines, etc. Collections of IoT sensors can produce vast amounts of data that needs to be efficiently stored and queried to detect anomalies and predict future failures, for instance.
- Trading or blockchain prices. Millions of transactions in the markets produce gigantic amounts of data to be efficiently stored, processed, and analyzed to extract insights and predict price movements.
- Monitoring systems. Systems like Prometheus were designed specifically to collect, store, and analyze metric data.
How to choose the right database for time-series data
There are several factors you should consider when selecting the right time-series DB.
Cost of usage
Time-series data tends to grow, so you may end up maintaining gigabytes or terabytes of it. Be careful when estimating: a Proof of Concept often runs smooth and easy, but you have to properly calculate an approximate production-like workload and the data volumes you'll handle, and only then estimate the hardware needed to sustain it.
Complexity of migration
Time-series projects sometimes start on relational databases, so if you already have one, reusing it seems like a sweet spot. But the ease of migration may backfire later: it's better to suffer at the beginning while migrating and reap the benefits of a proper solution afterwards. Also, be aware of vendor lock-in. Once you opt for a particular cloud provider's solution, migrating out of that cloud becomes hard. If you know you'll be using the same cloud provider for good, then it's not a concern.
Effectiveness for use cases
It's a common mistake to test your solution on a small data set or low workload, spend a lot of time migrating, and then discover that a solution that looked perfect at first glance can't keep up with the production workload. Try to keep your test environment as close to production as possible.
What are the most popular time-series databases?
Nowadays, there are plenty of solutions with time-series support. I'll mention some of them here.
- TimescaleDB—a Postgres-based open-source database with a cloud version. It offers a seamless transition from any Postgres-based application. You can maintain hybrid databases, i.e. where only some of the tables are optimized for time-series analysis. In recent releases, they also claim support for clustering, but we had no chance to check it out.
- ClickHouse—an open-source database optimized for fast ingestion and processing of time-series data. It's SQL-based, but keep in mind that some features of regular relational databases are missing. Overall, it's a solid option with the fastest ingestion I've seen and some unconventional solutions for common problems.
- InfluxDB—an open-source database with a cloud version. It's a ground-up solution for time-series data. It's not a relational database, so it can't be a drop-in replacement for your regular DB.
- AWS Timestream—AWS's time-series DB, with SQL support. We've never compared it to the other solutions, so I'll leave it without insights.
- SingleStore (formerly MemSQL)—a solid column-oriented database with a cloud version and strong support for time-series data. It's scalable by design, so it can sustain a high insert rate and decent query performance for analysis.
How to get started with a minimum amount of time and effort
Now let's assume you're ready to start implementing your solution to collect time-series data: you have a great idea, and it's the right time to begin. But something is missing, and that's Broscorp's team, a key element of your success. Why?
- We’ve got vast experience designing and implementing high-load storage and database solutions.
- We conduct time-series database comparisons to help you choose the best time-series database for your project and use time and money effectively.
- We're a business-oriented company. Our goal is to build a system that brings profit to your company, not just to ship some solution.
Don’t hesitate to contact us, and we’ll discuss your project right away.
We’re open to discussing your ideas and questions. Leave us your email and we will contact you to arrange an initial conversation.