Projections
Introduction
ClickHouse offers various mechanisms of speeding up analytical queries on large amounts of data for real-time scenarios. One such mechanism to speed up your queries is through the use of Projections. Projections help optimize queries by creating a reordering of data by attributes of interest. This can be:
- A complete reordering
- A subset of the original table with a different order
- A precomputed aggregation (similar to a Materialized View) but with an ordering aligned to the aggregation.
How do Projections work?
Practically, a Projection can be thought of as an additional, hidden table to the original table. The projection can have a different row order, and therefore a different primary index, to that of the original table and it can automatically and incrementally pre-compute aggregate values. As a result, using Projections provide two "tuning knobs" for speeding up query execution:
- Properly using primary indexes
- Pre-computing aggregates
Projections are in some ways similar to Materialized Views , which also allow you to have multiple row orders and pre-compute aggregations at insert time. Projections are automatically updated and kept in-sync with the original table, unlike Materialized Views, which are explicitly updated. When a query targets the original table, ClickHouse automatically samples the primary keys and chooses a table that can generate the same correct result, but requires the least amount of data to be read as shown in the figure below:
When to use Projections?
Projections are an appealing feature for new users as they are automatically maintained as data is inserted. Furthermore, queries can just be sent to a single table where the projections are exploited where possible to speed up the response time.
This is in contrast to Materialized Views, where the user has to select the appropriate optimized target table or rewrite their query, depending on the filters. This places greater emphasis on user applications and increases client-side complexity.
Despite these advantages, projections come with some inherent limitations which users should be aware of and thus should be deployed sparingly.
- Projections don't allow using different TTL for the source table and the (hidden) target table, materialized views allow different TTLs.
- Projections don't currently support
optimize_read_in_order
for the (hidden) target table. - Lightweight updates and deletes are not supported for tables with projections.
- Materialized Views can be chained: the target table of one Materialized View can be the source table of another Materialized View, and so on. This is not possible with projections.
- Projections don't support joins, but Materialized Views do.
- Projections don't support filters (
WHERE
clause), but Materialized Views do.
We recommend using projections when:
- A complete re-ordering of the data is required. While the expression in the
projection can, in theory, use a
GROUP BY,
materialized views are more effective for maintaining aggregates. The query optimizer is also more likely to exploit projections that use a simple reordering, i.e.,SELECT * ORDER BY x
. Users can select a subset of columns in this expression to reduce storage footprint. - Users are comfortable with the associated increase in storage footprint and overhead of writing data twice. Test the impact on insertion speed and evaluate the storage overhead.
Examples
Filtering on columns which aren't in the primary key
In this example, we'll show you how to add a projection to a table. We'll also look at how the projection can be used to speed up queries which filter on columns which are not in the primary key of a table.
For this example, we'll be using the New York Taxi Data
dataset available at sql.clickhouse.com which is ordered
by pickup_datetime
.
Let's write a simple query to find all the trip IDs for which passengers tipped their driver greater than $200:
Notice that because we are filtering on tip_amount
which is not in the ORDER BY
, ClickHouse
had to do a full table scan. Let's speed this query up.
So as to preserve the original table and results, we'll create a new table and copy the data using an INSERT INTO SELECT
:
To add a projection we use the ALTER TABLE
statement together with the ADD PROJECTION
statement:
It is necessary after adding a projection to use the MATERIALIZE PROJECTION
statement so that the data in it is physically ordered and rewritten according
to the specified query above:
Let's run the query again now that we've added the projection:
Notice how we were able to decrease the query time substantially, and needed to scan less rows.
We can confirm that our query above did indeed use the projection we made by
querying the system.query_log
table:
Using projections to speed up UK price paid queries
To demonstrate how projections can be used to speed up query performance, let's take a look at an example using a real life dataset. For this example we'll be using the table from our UK Property Price Paid tutorial with 30.03 million rows. This dataset is also available within our sql.clickhouse.com environment.
If you would like to see how the table was created and data inserted, you can refer to "The UK property prices dataset" page.
We can run two simple queries on this dataset. The first lists the counties in London which have the highest prices paid, and the second calculates the average price for the counties:
Notice that despite being very fast how a full table scan of all 30.03 million rows occurred for both queries, due
to the fact that neither town
nor price
were in our ORDER BY
statement when we
created the table:
Let's see if we can speed this query up using projections.
To preserve the original table and results, we'll create a new table and copy the data using an INSERT INTO SELECT
:
We create and populate projection prj_oby_town_price
which produces an
additional (hidden) table with a primary index, ordering by town and price, to
optimize the query that lists the counties in a specific town for the highest
paid prices:
The mutations_sync
setting is
used to force synchronous execution.
We create and populate projection prj_gby_county
– an additional (hidden) table
that incrementally pre-computes the avg(price) aggregate values for all existing
130 UK counties:
If there is a GROUP BY
clause used in a projection like in the prj_gby_county
projection above, then the underlying storage engine for the (hidden) table
becomes AggregatingMergeTree
, and all aggregate functions are converted to
AggregateFunction
. This ensures proper incremental data aggregation.
The figure below is a visualization of the main table uk_price_paid_with_projections
and its two projections:
If we now run the query that lists the counties in London for the three highest paid prices again, we see an improvement in query performance:
Likewise, for the query that lists the U.K. counties with the three highest average-paid prices:
Note that both queries target the original table, and that both queries resulted in a full table scan (all 30.03 million rows got streamed from disk) before we created the two projections.
Also, note that the query that lists the counties in London for the three highest paid prices is streaming 2.17 million rows. When we directly used a second table optimized for this query, only 81.92 thousand rows were streamed from disk.
The reason for the difference is that currently, the optimize_read_in_order
optimization mentioned above isn’t supported for projections.
We inspect the system.query_log
table to see that ClickHouse
automatically used the two projections for the two queries above (see the
projections column below):
Further examples
The following examples use the same UK price dataset, contrasting queries with and without projections.
In order to preserve our original table (and performance), we again create a copy of the table using CREATE AS
and INSERT INTO SELECT
.
Build a Projection
Let's create an aggregate projection by the dimensions toYear(date)
, district
, and town
:
Populate the projection for existing data. (Without materializing it, the projection will be created for only newly inserted data):
The following queries contrast performance with and without projections. To disable projection use we use the setting optimize_use_projections
, which is enabled by default.
Query 1. Average price per year
The results should be the same, but the performance better on the latter example!
Query 2. Average price per year in London
Query 3. The most expensive neighborhoods
The condition (date >= '2020-01-01') needs to be modified so that it matches the projection dimension (toYear(date) >= 2020)
:
Again, the result is the same but notice the improvement in query performance for the 2nd query.