
ClickHouse Release 24.9

The ClickHouse Team
Oct 4, 2024

Another month goes by, which means it’s time for another release!

ClickHouse version 24.9 contains 23 new features 🎁 14 performance optimisations 🛷 76 bug fixes 🐛

In this release, we’ve got the APPEND clause for refreshable materialized views, new functions for the JSON data type, and automatic schema inference that can return the Variant type.

New Contributors

As always, we send a special welcome to all the new contributors in 24.9! ClickHouse's popularity is, in large part, due to the efforts of the community that contributes. Seeing that community grow is always humbling.

Below are the names of the new contributors:

1on, Alexey Olshanskiy, Alexis Arnaud, Austin Bruch, Denis Hananein, Dergousov, Gabriel Mendes, Konstantin Smirnov, Kruglov Kirill, Marco Vilas Boas, Matt Woenker, Maxim Dergousov, Michal Tabaszewski, NikBarykin, Oleksandr, Pedro Ferreira, Rodrigo Garcia, Samuel Warfield, Sergey (Finn) Gnezdilov, Tuan Pham Anh, Zhigao Hong, baolin.hbl, gao chuan, haozelong, imddba, kruglov, leonkozlowski, m4xxx1m, marco-vb, megao, mmav, neoman36, okunev, siyuan

Hint: if you’re curious how we generate this list… here.

You can also view the slides from the presentation.

APPEND for Refreshable Materialized Views

Contributed by Michael Kolupaev

Refreshable materialized views are an experimental type of materialized view that periodically re-runs a query and stores the result for quick retrieval later. In this release, we’ve added APPEND functionality, which means that rather than each refresh replacing the whole view, new rows are appended to the end of the target table.

One use of this feature is to capture snapshots of values at a point in time. For example, let’s imagine that we have an events table populated by a stream of messages from Redpanda, Kafka, or another streaming data platform.
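We haven’t shown the table definition here, but a minimal sketch consistent with the query output below might look like this (the engine and ordering key are assumptions for illustration):

CREATE TABLE events (
    ts DateTime,
    uuid String,
    count UInt64
)
ENGINE = MergeTree
ORDER BY (uuid, ts);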

SELECT *
FROM events
LIMIT 10

┌──────────────────ts─┬─uuid─┬─count─┐
│ 2008-08-06 17:07:19 │ 0eb  │   547 │
│ 2008-08-06 17:07:19 │ 60b  │   148 │
│ 2008-08-06 17:07:19 │ 106  │   750 │
│ 2008-08-06 17:07:19 │ 398  │   875 │
│ 2008-08-06 17:07:19 │ ca0  │   318 │
│ 2008-08-06 17:07:19 │ 6ba  │   105 │
│ 2008-08-06 17:07:19 │ df9  │   422 │
│ 2008-08-06 17:07:19 │ a71  │   991 │
│ 2008-08-06 17:07:19 │ 3a2  │   495 │
│ 2008-08-06 17:07:19 │ 598  │   238 │
└─────────────────────┴──────┴───────┘

This dataset has 4,096 unique values in the uuid column, and we can write the following query to find the ones with the highest total count:

SELECT
    uuid,
    sum(count) AS count
FROM events
GROUP BY ALL
ORDER BY count DESC
LIMIT 10

┌─uuid─┬───count─┐
│ c6f  │ 5676468 │
│ 951  │ 5669731 │
│ 6a6  │ 5664552 │
│ b06  │ 5662036 │
│ 0ca  │ 5658580 │
│ 2cd  │ 5657182 │
│ 32a  │ 5656475 │
│ ffe  │ 5653952 │
│ f33  │ 5653783 │
│ c5b  │ 5649936 │
└──────┴─────────┘

Let’s say we want to capture the count for each uuid every 10 seconds and store it in a new table called events_snapshot. The schema of events_snapshot would look like this:

CREATE TABLE events_snapshot (
    ts DateTime32,
    uuid String,
    count UInt64
) 
ENGINE = MergeTree 
ORDER BY uuid;

We could then create a refreshable materialized view to populate this table:

SET allow_experimental_refreshable_materialized_view=1;

CREATE MATERIALIZED VIEW events_snapshot_mv
REFRESH EVERY 10 SECOND APPEND TO events_snapshot
AS SELECT
    now() AS ts,
    uuid,
    sum(count) AS count
FROM events
GROUP BY ALL;
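If you don’t want to wait for the next scheduled refresh, you can also trigger one manually:

SYSTEM REFRESH VIEW events_snapshot_mv;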

We can then query events_snapshot to get the count over time for a specific uuid:

SELECT *
FROM events_snapshot
WHERE uuid = 'fff'
ORDER BY ts ASC
FORMAT PrettyCompactMonoBlock

┌──────────────────ts─┬─uuid─┬───count─┐
│ 2024-10-01 16:12:56 │ fff  │ 5424711 │
│ 2024-10-01 16:13:00 │ fff  │ 5424711 │
│ 2024-10-01 16:13:10 │ fff  │ 5424711 │
│ 2024-10-01 16:13:20 │ fff  │ 5424711 │
│ 2024-10-01 16:13:30 │ fff  │ 5674669 │
│ 2024-10-01 16:13:40 │ fff  │ 5947912 │
│ 2024-10-01 16:13:50 │ fff  │ 6203361 │
│ 2024-10-01 16:14:00 │ fff  │ 6501695 │
└─────────────────────┴──────┴─────────┘
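You can keep an eye on scheduled refreshes by querying the system.view_refreshes table. A minimal sketch, noting that the exact set of columns varies between versions:

SELECT view, status, next_refresh_time
FROM system.view_refreshes;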

Variant Types in schema inference

Contributed by Shaun Struwig

ClickHouse can now automatically infer the Variant data type during schema inference. This feature is disabled by default but can be enabled with the input_format_try_infer_variants setting.

Let’s have a look at how it works when reading the following file:

data1.json

{"id": [1], "name": "Mark"}
{"id": "agerty", "name": "Dale"}

The id field is an array of integers on the first row and a string on the second. Let’s query the file and return the type of the id column:

SELECT *, toTypeName(id)
FROM file('data1.json')
SETTINGS input_format_try_infer_variants=1;

┌─id─────┬─name─┬─toTypeName(id)──────────────────────────┐
│ [1]    │ Mark │ Variant(Array(Nullable(Int64)), String) │
│ agerty │ Dale │ Variant(Array(Nullable(Int64)), String) │
└────────┴──────┴──────────────────────────────────────────┘

If we do that query without setting input_format_try_infer_variants=1, we’ll see the following error message instead:

Received exception:
Code: 636. DB::Exception: The table structure cannot be extracted from a JSON format file. Error:
Code: 53. DB::Exception: Automatically defined type String for column 'id' in row 1 differs from type defined by previous rows: Array(Int64). You can specify the type for this column using setting schema_inference_hints. (TYPE_MISMATCH) (version 24.9.1.3278 (official build)).
You can specify the structure manually: (in file/path/to/24.9/data1.json). (CANNOT_EXTRACT_TABLE_STRUCTURE)
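Once a column has been inferred as Variant, you can check which variant each row holds with the variantType function and extract a specific variant with variantElement. A quick sketch against the same file:

SELECT id,
    variantType(id) AS type,
    variantElement(id, 'String') AS str
FROM file('data1.json')
SETTINGS input_format_try_infer_variants=1;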

Remember that the Variant data type won’t always be inferred where you want (or even expect) it to be. For example, if every value in the id field can be cast to String, String will be the inferred type even though Variant could also fit. This is the case in the following file:

data2.json

{"id": 1, "name": "Mark"}
{"id": "agerty", "name": "Dale"}
{"id": "2021-01-04", "name": "Tom"}

If we run the following query:

SELECT *, toTypeName(id)
FROM file('data2.json')
SETTINGS input_format_try_infer_variants=1;

┌─id─────────┬─name─┬─toTypeName(id)───┐
│ 1          │ Mark │ Nullable(String) │
│ agerty     │ Dale │ Nullable(String) │
│ 2021-01-04 │ Tom  │ Nullable(String) │
└────────────┴──────┴──────────────────┘

The id column is inferred as Nullable(String) because every value can be cast to a string. You can still have it infer the id column as Variant, but you’ll have to supply a hint:

SET allow_experimental_variant_type=1;

SELECT *, toTypeName(id)
FROM file('data2.json')
SETTINGS schema_inference_hints='id Variant(String, Int64, Date)';

┌─id─────────┬─name─┬─toTypeName(id)───────────────┐
│ 1          │ Mark │ Variant(Date, Int64, String) │
│ agerty     │ Dale │ Variant(Date, Int64, String) │
│ 2021-01-04 │ Tom  │ Variant(Date, Int64, String) │
└────────────┴──────┴──────────────────────────────┘
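Assuming allow_experimental_variant_type is still enabled from the query above, you can also read a single variant as a subcolumn by appending the type name to the column; rows holding a different variant come back as NULL. A sketch using the same hint:

SELECT id.String, id.Int64, id.Date
FROM file('data2.json')
SETTINGS schema_inference_hints='id Variant(String, Int64, Date)';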

Aggregate functions to analyze JSON

Contributed by Pavel Kruglov

In the 24.8 release post, we learned about the new JSON data type. This release sees more functions to operate on data in the JSON and Dynamic data types. Let’s see how to use them on this sample dataset:

data3.json

{"id": 1, "name": "Mark"}
{"id": "agerty", "name": "Dale"}
{"id": "2021-01-04", "name": "Tom"}
{"id": ["1", 2, "3"], "name": "Alexey", "location": "Netherlands"}

First up is the distinctJSONPaths function, which returns the distinct paths found in a JSON column.

SELECT distinctJSONPaths(json)
FROM file('data3.json', JSONAsObject)
FORMAT Vertical;

Row 1:
──────
distinctJSONPaths(json): ['id','location','location.city','location.country','name']

If you want to return the types as well, distinctJSONPathsAndTypes does that:

SELECT distinctJSONPathsAndTypes(json)
FROM file('data3.json', JSONAsObject)
FORMAT Vertical;

Row 1:
──────
distinctJSONPathsAndTypes(json): {'id':['Array(Nullable(String))','Int64','String'],'location':['String'],'location.city':['String'],'location.country':['String'],'name':['String']}

Finally, we have distinctDynamicTypes, which returns distinct types for Dynamic columns.

SELECT distinctDynamicTypes(json.id)
FROM file('data3.json', JSONAsObject)
FORMAT Vertical

Row 1:
──────
distinctDynamicTypes(json.id): ['Array(Nullable(String))','Int64','String']
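These aggregate functions summarize a whole column. If you want the type of each individual value instead, the dynamicType function works row by row:

SELECT json.id, dynamicType(json.id) AS type
FROM file('data3.json', JSONAsObject);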

_headers column for URL engine

Contributed by Flynn

When you query the url table function, you can now access the response headers via the _headers virtual column:

SELECT _headers
FROM url(
'https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rcprop=title%7Cids%7Csizes%7Cflags%7Cuser%7Cuserid%7Ctimestamp&format=json&rcdir=newer'
)
LIMIT 1
FORMAT Vertical;


Row 1:
──────
_headers: {'accept-ranges':'bytes','age':'0','cache-control':'private, must-revalidate, max-age=0','content-disposition':'inline; filename=api-result.json','content-type':'application/json; charset=utf-8','date':'Tue, 01 Oct 2024 15:32:59 GMT','nel':'{ "report_to": "wm_nel", "max_age": 604800, "failure_fraction": 0.05, "success_fraction": 0.0}','report-to':'{ "group": "wm_nel", "max_age": 604800, "endpoints": [{ "url": "https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" }] }','server':'mw-api-ext.codfw.main-54d5bc66d9-98km5','server-timing':'cache;desc="pass", host;desc="cp3067"','set-cookie':'WMF-Last-Access=01-Oct-2024;Path=/;HttpOnly;secure;Expires=Sat, 02 Nov 2024 12:00:00 GMT','strict-transport-security':'max-age=106384710; includeSubDomains; preload','transfer-encoding':'chunked','vary':'Accept-Encoding,Treat-as-Untrusted,X-Forwarded-Proto,Cookie,Authorization','x-cache':'cp3067 miss, cp3067 pass','x-cache-status':'pass','x-client-ip':'82.35.72.115','x-content-type-options':'nosniff','x-frame-options':'DENY'}
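Since _headers is a Map, you can also pull out a single header with bracket access:

SELECT _headers['content-type'] AS contentType
FROM url(
'https://en.wikipedia.org/w/api.php?action=query&list=recentchanges&rcprop=title%7Cids%7Csizes%7Cflags%7Cuser%7Cuserid%7Ctimestamp&format=json&rcdir=newer'
)
LIMIT 1;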

overlay function

If you need to replace a string fragment with another string, that just got easier with the overlay function. You provide the initial string, the replacement string, and then the index where you want the replacement string to start and how many characters should be replaced.

We can use this function to make sure everyone knows that chDB is cool as well!

SELECT overlay('ClickHouse is cool', 'and chDB are', 12, 2) AS res

┌─res──────────────────────────┐
│ ClickHouse and chDB are cool │
└──────────────────────────────┘
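There’s also a UTF-8-aware variant, overlayUTF8, which counts characters rather than bytes. For example, this should replace the four characters of 'lait' (positions 9 to 12) with 'thé':

SELECT overlayUTF8('Café au lait', 'thé', 9, 4) AS res;

-- expected result: 'Café au thé'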