ClickHouse
An open-source columnar database for real-time analytics on very large datasets.
olap columnar analytical open-source sql
What it is
ClickHouse is an open-source columnar database engineered for high-throughput analytical queries on large datasets. It was built at Yandex (the Russian search engine) starting around 2009 to power its web analytics product, Yandex.Metrica, and was open-sourced under the Apache 2.0 license in 2016. ClickHouse Inc., the commercial company behind it, was founded in the US in 2021.
The pitch: scan billions of rows per second per node, with 5–15× compression, on commodity hardware. The classic use cases are observability, ad tech, clickstream, and any “real-time dashboard over a firehose of events” workload.
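As a rough illustration of what those numbers mean in practice, the following back-of-envelope sketch applies the 5–15× compression range to a dataset. The 10 TB raw size, the 50 billion row count, and the 500 million rows/s scan rate are hypothetical inputs chosen for the arithmetic, not measurements.

```python
# Back-of-envelope sizing under the 5-15x compression range quoted above.
# All inputs below are hypothetical; they are not benchmark results.

def compressed_size_tb(raw_tb: float, ratio: float) -> float:
    """Disk footprint in TB for a given compression ratio."""
    return raw_tb / ratio

def scan_seconds(rows: int, rows_per_second: int) -> float:
    """Time to scan `rows` at a sustained scan rate."""
    return rows / rows_per_second

RAW_TB = 10.0  # assumed uncompressed dataset size

best = compressed_size_tb(RAW_TB, 15)   # high end of the compression range
worst = compressed_size_tb(RAW_TB, 5)   # low end of the compression range

# 50 billion rows at an assumed 500 million rows/s
full_scan = scan_seconds(50_000_000_000, 500_000_000)

print(f"disk footprint: {best:.2f}-{worst:.2f} TB")  # disk footprint: 0.67-2.00 TB
print(f"full scan: {full_scan:.0f} s")               # full scan: 100 s
```

Real queries usually read only a few columns and skip data via the primary key, so a full-table scan like this is closer to a worst case than a typical one.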
Why people use it
- Speed. Vectorized execution over compressed columnar data. Single-node ClickHouse on a modern machine routinely scans hundreds of millions of rows per second. With a cluster, billions per second.
- Compression. Typical compression ratios of 5–15× on real-world data. This makes storage cheap and lets more data fit in the page cache.
- Cost. For many analytical workloads, ClickHouse can be one to two orders of magnitude cheaper than Snowflake or BigQuery on a per-query basis, depending on the workload.
- Materialized views. Updated incrementally on insert. Aggregate views stay up to date without batch refresh jobs.
- Streaming ingest. Kafka/Kinesis integrations, native streaming inserts, and a strong story for sub-second-fresh analytics.
- Wide ecosystem. Connectors for dbt, Airflow, Grafana, Superset, and most popular BI tools.
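The incremental materialized view behavior in the list above can be modeled conceptually as a trigger that runs on each inserted batch. The sketch below is a toy in plain Python, not ClickHouse's API: `EventsTable`, `HitsPerPageView`, and the `page` field are hypothetical names used only to show that the aggregate is updated from the new rows alone.

```python
from collections import defaultdict


class EventsTable:
    """Toy source table whose inserts feed attached aggregate views."""

    def __init__(self):
        self.rows = []
        self.views = []

    def insert(self, batch):
        # Append the batch to the base table, then let each attached view
        # fold in only the new rows; existing rows are never rescanned.
        self.rows.extend(batch)
        for view in self.views:
            view.on_insert(batch)


class HitsPerPageView:
    """Toy aggregate view: running count of events per page."""

    def __init__(self):
        self.counts = defaultdict(int)

    def on_insert(self, batch):
        for row in batch:
            self.counts[row["page"]] += 1


events = EventsTable()
hits = HitsPerPageView()
events.views.append(hits)

events.insert([{"page": "/home"}, {"page": "/pricing"}])
events.insert([{"page": "/home"}])

print(dict(hits.counts))  # {'/home': 2, '/pricing': 1}
```

The work done per insert is proportional to the batch size, not the table size, which is why dashboards reading the aggregate stay fresh without a scheduled rebuild.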
When to use ClickHouse
- Real-time analytics over event streams (clickstream, ad tech, observability, product analytics).
- Time-series at scale beyond what TimescaleDB or InfluxDB can handle.
- Replacement for Snowflake / BigQuery when query cost is the bottleneck.
- Workloads needing streaming ingest plus sub-second analytical queries.
- Logs and metrics platforms (Highlight, PostHog, Grafana Cloud all use ClickHouse internally).
When not to use ClickHouse
- OLTP workloads. ClickHouse is not designed for single-row updates or point reads. Use Postgres.
- Strong-consistency requirements. Replication is async; ZooKeeper / Keeper coordinate, but you don’t get linearizable writes.
- Small datasets. If your data fits in a Postgres or DuckDB instance, ClickHouse is operational overhead you don’t need.
- High-concurrency point lookups. ClickHouse is optimized for analytical scans, not thousands of concurrent users hitting the same indexed rows.
- Apps that need traditional ACID transactions across many rows. Transaction support is limited.
Notable trade-offs
- SQL dialect quirks. ClickHouse SQL is mostly standard but has its own array functions, custom syntax for some operations, and quirks around NULL handling. Tools that assume Postgres or MySQL SQL won’t always work without adjustments.
- Joins are weaker than in row stores. Distributed joins exist but are expensive. Schemas tend toward denormalization.
- No traditional transactions. Atomic batch inserts, but no multi-statement transactions.
- Replication is async. Eventually consistent. For zero data loss, careful configuration is required.
- Operationally complex at scale. Multi-node ClickHouse with replication, sharding, and Keeper coordination has a real learning curve. Smaller deployments are simpler.
- Schema evolution. Adding columns is effectively free; modifying or dropping them is more expensive than in row stores.
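The pull toward denormalization mentioned in the joins trade-off above often means copying dimension attributes onto each event before it is inserted, so queries read one wide table instead of joining. A minimal sketch of that idea, assuming a hypothetical ad-tech event shape; `CAMPAIGNS`, `denormalize`, and all field names are illustrative only.

```python
# Hypothetical dimension data that a normalized schema would keep in its own table.
CAMPAIGNS = {
    101: {"campaign_name": "spring_sale", "advertiser": "acme"},
    102: {"campaign_name": "launch", "advertiser": "globex"},
}


def denormalize(event: dict) -> dict:
    """Copy campaign attributes onto the event so later queries need no join."""
    dims = CAMPAIGNS.get(event["campaign_id"], {})
    return {**event, **dims}


raw_events = [
    {"ts": "2024-01-01T00:00:00", "campaign_id": 101, "clicks": 3},
    {"ts": "2024-01-01T00:00:05", "campaign_id": 102, "clicks": 1},
    {"ts": "2024-01-01T00:00:09", "campaign_id": 101, "clicks": 2},
]

# Enrich once at ingest time; these wide rows are what would be inserted.
wide_rows = [denormalize(e) for e in raw_events]

# Aggregating by advertiser now reads only the wide rows.
clicks_by_advertiser = {}
for row in wide_rows:
    key = row.get("advertiser", "unknown")
    clicks_by_advertiser[key] = clicks_by_advertiser.get(key, 0) + row["clicks"]

print(clicks_by_advertiser)  # {'acme': 5, 'globex': 1}
```

The cost is repeated attribute values in every row, which columnar compression tends to absorb well, and stale values if a dimension changes after the event was written.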
Ecosystem
- ClickHouse Cloud. The managed service from ClickHouse Inc.: serverless, with compute separated from storage on S3.
- Altinity. Commercial support, Altinity.Cloud, and a strong Kubernetes operator.
- Tinybird. Real-time data API built on ClickHouse, popular with engineers who want to expose ClickHouse-backed APIs without the ops work.
- chDB. An in-process ClickHouse, similar to DuckDB. Ships as a library.
- Self-hosting. Docker images and Kubernetes operators (Altinity, official) make self-hosting realistic. Production deployments still need real ops investment.