Cloud Solution Architect - Data Platform @Microsoft. Databases | Data Engineering | Analytics. Drums, good food, fine wine. I speak for myself ♥ != endorse
Time-series Data Is Everywhere!
What is TimescaleDB?
TimescaleDB is an open-source time-series database optimized for fast ingest and complex queries. It speaks "full SQL" and is correspondingly easy to use like a traditional relational database, yet scales in ways previously reserved for NoSQL databases.
Easy to Use
- Full SQL interface for all SQL natively supported by PostgreSQL (including secondary indexes, non-time based aggregates, sub-queries, JOINs, window functions).
- Connects to any client or tool that speaks PostgreSQL, no changes needed.
- Time-oriented features, API functions, and optimizations.
- Robust support for Data retention policies.
- Transparent time/space partitioning for both scaling up (single node) and scaling out (forthcoming).
- High data write rates (including batched commits, in-memory indexes, transactional support, support for data backfill).
- Right-sized chunks (two-dimensional data partitions) on single nodes to ensure fast ingest even at large data sizes.
- Parallelized operations across chunks and servers.
- Engineered up from PostgreSQL, packaged as an extension.
- Proven foundations benefiting from 20+ years of PostgreSQL research (including streaming replication, backups).
- Flexible management options (compatible with existing PostgreSQL ecosystem and tooling).
What is Time-series data?
"Time-series data is data that collectively represents how a system, process, or behaviour changes over time."
TimescaleDB utilizes a "wide-table" data model, which is quite common in the world of relational databases. This makes Timescale somewhat different than most other time-series databases, which typically use a "narrow-table" model.
Architecture & Concepts
TimescaleDB is implemented as an extension on PostgreSQL, which means that a Timescale database runs within an overall PostgreSQL instance. The extension model allows the database to take advantage of many of the attributes of PostgreSQL such as reliability, security, and connectivity to a wide range of third-party tools.
Architecture & Concepts
TimescaleDB exposes what look like singular tables, called hypertables, that are actually an abstraction or a virtual view of many individual tables holding the data, called chunks.
- The primary point of interaction with your data is a hypertable, the abstraction of a single continuous table across all space and time intervals, such that one can query it via standard SQL.
- Virtually all user interactions with TimescaleDB are with hypertables. Creating tables and indexes, altering tables, inserting data, selecting data, etc. can (and should) all be executed on the hypertable.
- A hypertable is defined by a standard schema with column names and types, with at least one column specifying a time value, and one (optional) column specifying an additional partitioning key.
- Internally, TimescaleDB automatically splits each hypertable into chunks, with each chunk corresponding to a specific time interval and a region of the partition key’s space (using hashing). These partitions are disjoint (non-overlapping), which helps the query planner to minimize the set of chunks it must touch to resolve a query.
- Each chunk is implemented using a standard database table. (In PostgreSQL internals, the chunk is actually a "child table" of the "parent" hypertable.)
- Chunks are right-sized, ensuring that all of the B-trees for a table’s indexes can reside in memory during inserts.
Single Node vs. Clustering
The current open-source release of TimescaleDB only supports single-node deployments. Of note is that the single-node version of TimescaleDB has been benchmarked to over 10-billion-row hypertables on commodity machines without a loss in insert performance.
*clustered deployments (in development)
TimescaleDB offers three key benefits over vanilla PostgreSQL or other traditional RDBMSs for storing time-series data:
- Much higher data ingest rates, especially at larger database sizes.
- Query performance ranging from equivalent to orders of magnitude greater.
- Time-oriented features.
Much Higher Ingest Rates
TimescaleDB includes new functions for time-oriented analytics, including some of the following:
- Time bucketing: A more powerful version of the standard date_trunc function, it allows for arbitrary time intervals (e.g., 5 minutes, 6 hours, etc.), as well as flexible groupings and offsets, instead of just second, minute, hour, etc.
SELECT time_bucket('3 hours', time) AS period asset_code, first(price, time) AS opening, last(price, time) AS closing, max(price) AS high, min(price) AS low FROM prices WHERE time > NOW() - interval '7 days' GROUP BY period, asset_code ORDER BY period DESC, asset_code;
Last and first aggregates: These functions allow you to get the value of one column as ordered by another. For example, last(temperature, time) will return the latest temperature value based on time within a group (e.g., an hour).
SELECT time_bucket('1 day', time) AS day, asset_code, last(price, time_recorded) FROM prices WHERE time > '2017-01-01' GROUP BY day, asset_code ORDER BY day DESC, asset_code;
Time-oriented Data Management
TimescaleDB allows efficient deletion of old data at the chunk level, rather than at the row level, via its drop_chunks functionality.
SELECT drop_chunks(interval '7 days', 'conditions');
Why Use TimescaleDB over NoSQL?
Compared to general NoSQL databases (e.g., MongoDB, Cassandra) or even more specialized time-oriented ones (e.g., InfluxDB, KairosDB), TimescaleDB provides both qualitative and quantitative differences:
- Normal SQL
- Operational simplicity
- Query performance
- Manage like PostgreSQL
- Native support for geospatial data
- Third-party tools
When Not to Use TimescaleDB?
If any of the following is true, you might not want to use TimescaleDB:
- Simple read requirements
- Very sparse or unstructured data
- Heavy compression is a priority
- Infrequent or offline analysis
By Stéphane Fréchette