Skip to main content
Skip to main content

Projections

Introduction

ClickHouse offers various mechanisms of speeding up analytical queries on large amounts of data for real-time scenarios. One such mechanism to speed up your queries is through the use of Projections. Projections help optimize queries by creating a reordering of data by attributes of interest. This can be:

  1. A complete reordering
  2. A subset of the original table with a different order
  3. A precomputed aggregation (similar to a Materialized View) but with an ordering aligned to the aggregation.

How do Projections work?

Practically, a Projection can be thought of as an additional, hidden table to the original table. The projection can have a different row order, and therefore a different primary index, to that of the original table and it can automatically and incrementally pre-compute aggregate values. As a result, using Projections provide two "tuning knobs" for speeding up query execution:

  • Properly using primary indexes
  • Pre-computing aggregates

Projections are in some ways similar to Materialized Views , which also allow you to have multiple row orders and pre-compute aggregations at insert time. Projections are automatically updated and kept in-sync with the original table, unlike Materialized Views, which are explicitly updated. When a query targets the original table, ClickHouse automatically samples the primary keys and chooses a table that can generate the same correct result, but requires the least amount of data to be read as shown in the figure below:

When to use Projections?

Projections are an appealing feature for new users as they are automatically maintained as data is inserted. Furthermore, queries can just be sent to a single table where the projections are exploited where possible to speed up the response time.

This is in contrast to Materialized Views, where the user has to select the appropriate optimized target table or rewrite their query, depending on the filters. This places greater emphasis on user applications and increases client-side complexity.

Despite these advantages, projections come with some inherent limitations which users should be aware of and thus should be deployed sparingly.

  • Projections don't allow using different TTL for the source table and the (hidden) target table, materialized views allow different TTLs.
  • Projections don't currently support optimize_read_in_order for the (hidden) target table.
  • Lightweight updates and deletes are not supported for tables with projections.
  • Materialized Views can be chained: the target table of one Materialized View can be the source table of another Materialized View, and so on. This is not possible with projections.
  • Projections don't support joins, but Materialized Views do.
  • Projections don't support filters (WHERE clause), but Materialized Views do.

We recommend using projections when:

  • A complete re-ordering of the data is required. While the expression in the projection can, in theory, use a GROUP BY, materialized views are more effective for maintaining aggregates. The query optimizer is also more likely to exploit projections that use a simple reordering, i.e., SELECT * ORDER BY x. Users can select a subset of columns in this expression to reduce storage footprint.
  • Users are comfortable with the associated increase in storage footprint and overhead of writing data twice. Test the impact on insertion speed and evaluate the storage overhead.

Examples

Filtering on columns which aren't in the primary key

In this example, we'll show you how to add a projection to a table. We'll also look at how the projection can be used to speed up queries which filter on columns which are not in the primary key of a table.

For this example, we'll be using the New York Taxi Data dataset available at sql.clickhouse.com which is ordered by pickup_datetime.

Let's write a simple query to find all the trip IDs for which passengers tipped their driver greater than $200:

Notice that because we are filtering on tip_amount which is not in the ORDER BY, ClickHouse had to do a full table scan. Let's speed this query up.

So as to preserve the original table and results, we'll create a new table and copy the data using an INSERT INTO SELECT:

To add a projection we use the ALTER TABLE statement together with the ADD PROJECTION statement:

It is necessary after adding a projection to use the MATERIALIZE PROJECTION statement so that the data in it is physically ordered and rewritten according to the specified query above:

Let's run the query again now that we've added the projection:

Notice how we were able to decrease the query time substantially, and needed to scan less rows.

We can confirm that our query above did indeed use the projection we made by querying the system.query_log table:

Using projections to speed up UK price paid queries

To demonstrate how projections can be used to speed up query performance, let's take a look at an example using a real life dataset. For this example we'll be using the table from our UK Property Price Paid tutorial with 30.03 million rows. This dataset is also available within our sql.clickhouse.com environment.

If you would like to see how the table was created and data inserted, you can refer to "The UK property prices dataset" page.

We can run two simple queries on this dataset. The first lists the counties in London which have the highest prices paid, and the second calculates the average price for the counties:

Notice that despite being very fast how a full table scan of all 30.03 million rows occurred for both queries, due to the fact that neither town nor price were in our ORDER BY statement when we created the table:

Let's see if we can speed this query up using projections.

To preserve the original table and results, we'll create a new table and copy the data using an INSERT INTO SELECT:

We create and populate projection prj_oby_town_price which produces an additional (hidden) table with a primary index, ordering by town and price, to optimize the query that lists the counties in a specific town for the highest paid prices:

The mutations_sync setting is used to force synchronous execution.

We create and populate projection prj_gby_county – an additional (hidden) table that incrementally pre-computes the avg(price) aggregate values for all existing 130 UK counties:

note

If there is a GROUP BY clause used in a projection like in the prj_gby_county projection above, then the underlying storage engine for the (hidden) table becomes AggregatingMergeTree, and all aggregate functions are converted to AggregateFunction. This ensures proper incremental data aggregation.

The figure below is a visualization of the main table uk_price_paid_with_projections and its two projections:

If we now run the query that lists the counties in London for the three highest paid prices again, we see an improvement in query performance:

Likewise, for the query that lists the U.K. counties with the three highest average-paid prices:

Note that both queries target the original table, and that both queries resulted in a full table scan (all 30.03 million rows got streamed from disk) before we created the two projections.

Also, note that the query that lists the counties in London for the three highest paid prices is streaming 2.17 million rows. When we directly used a second table optimized for this query, only 81.92 thousand rows were streamed from disk.

The reason for the difference is that currently, the optimize_read_in_order optimization mentioned above isn’t supported for projections.

We inspect the system.query_log table to see that ClickHouse automatically used the two projections for the two queries above (see the projections column below):

Further examples

The following examples use the same UK price dataset, contrasting queries with and without projections.

In order to preserve our original table (and performance), we again create a copy of the table using CREATE AS and INSERT INTO SELECT.

Build a Projection

Let's create an aggregate projection by the dimensions toYear(date), district, and town:

Populate the projection for existing data. (Without materializing it, the projection will be created for only newly inserted data):

The following queries contrast performance with and without projections. To disable projection use we use the setting optimize_use_projections, which is enabled by default.

Query 1. Average price per year

The results should be the same, but the performance better on the latter example!

Query 2. Average price per year in London

Query 3. The most expensive neighborhoods

The condition (date >= '2020-01-01') needs to be modified so that it matches the projection dimension (toYear(date) >= 2020):

Again, the result is the same but notice the improvement in query performance for the 2nd query.