
Full-text search with text indexes


Text indexes (also known as inverted indexes) enable fast full-text search on text data. A text index stores a mapping from tokens to the numbers of the rows that contain them. Tokens are generated by a process called tokenization. For example, ClickHouse's default tokenizer converts the English sentence "The cat likes mice." to tokens ["The", "cat", "likes", "mice"].
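
For example, you can inspect what this tokenizer produces by calling the tokens function (covered in more detail below):

SELECT tokens('The cat likes mice.', 'splitByNonAlpha');

Result:

['The','cat','likes','mice']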

As an example, assume a table with a single column and three rows

1: The cat likes mice.
2: Mice are afraid of dogs.
3: I have two dogs and a cat.

The corresponding tokens are:

1: The, cat, likes, mice
2: Mice, are, afraid, of, dogs
3: I, have, two, dogs, and, a, cat

We usually want to search case-insensitively, so we lower-case the tokens:

1: the, cat, likes, mice
2: mice, are, afraid, of, dogs
3: i, have, two, dogs, and, a, cat

We also remove stop words (filler words) such as "I", "the", and "and", as they occur in almost every row:

1: cat, likes, mice
2: mice, afraid, dogs
3: have, two, dogs, cat

A text index then (conceptually) contains this information:

afraid : [2]
cat    : [1, 3]
dogs   : [2, 3]
have   : [3]
likes  : [1]
mice   : [1, 2]
two    : [3]

Given a search token, this index structure makes it possible to find all matching rows quickly.

Creating a Text Index

Text indexes are generally available (GA) in ClickHouse version 26.2 and newer. In these versions, no special settings need to be configured to use the text index. We strongly recommend using ClickHouse versions >= 26.2 for production use cases.

Note

If you have upgraded (or were upgraded, e.g. in ClickHouse Cloud) from a ClickHouse version older than 26.2, the presence of a compatibility setting may still cause the index to be disabled and/or text-index-related performance optimizations to be deactivated.

If the query

SELECT value FROM system.settings WHERE name = 'compatibility';

returns

25.4

or any other value smaller than 26.2, you will need to set three additional settings to use the text index:

SET enable_full_text_index = true;
SET query_plan_direct_read_from_text_index = true;
SET use_skip_indexes_on_data_read = true;

Alternatively, you can raise the compatibility setting to 26.2 or newer, but this affects many settings and typically requires prior testing.

Text indexes can be defined on String, FixedString, Array(String), Array(FixedString), and Map (via mapKeys and mapValues map functions) columns using the following syntax:

CREATE TABLE table
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(
                                -- Mandatory parameters:
                                tokenizer = splitByNonAlpha
                                            | splitByString[(S)]
                                            | ngrams[(N)]
                                            | sparseGrams[(min_length[, max_length[, min_cutoff_length]])]
                                            | array
                                -- Optional parameters:
                                [, preprocessor = expression(str)]
                                -- Optional advanced parameters:
                                [, dictionary_block_size = D]
                                [, dictionary_block_frontcoding_compression = B]
                                [, posting_list_block_size = C]
                                [, posting_list_codec = 'none' | 'bitpacking' ]
                            )
)
ENGINE = MergeTree
ORDER BY key

Alternatively, to add a text index to an existing table:

ALTER TABLE table
    ADD INDEX text_idx(str) TYPE text(
                                -- Mandatory parameters:
                                tokenizer = splitByNonAlpha
                                            | splitByString[(S)]
                                            | ngrams[(N)]
                                            | sparseGrams[(min_length[, max_length[, min_cutoff_length]])]
                                            | array
                                -- Optional parameters:
                                [, preprocessor = expression(str)]
                                -- Optional advanced parameters:
                                [, dictionary_block_size = D]
                                [, dictionary_block_frontcoding_compression = B]
                                [, posting_list_block_size = C]
                                [, posting_list_codec = 'none' | 'bitpacking' ]
                            )

If you add an index to an existing table, we recommend materializing the index for existing table parts (otherwise, searches on parts without an index will fall back to slow brute-force scans).

ALTER TABLE table MATERIALIZE INDEX text_idx SETTINGS mutations_sync = 2;

To remove a text index, please run

ALTER TABLE table DROP INDEX text_idx;

Tokenizer argument (mandatory). The tokenizer argument specifies the tokenizer:

  • splitByNonAlpha splits strings along non-alphanumeric ASCII characters (see function splitByNonAlpha).
  • splitByString(S) splits strings along certain user-defined separator strings S (see function splitByString). The separators can be specified using an optional parameter, for example, tokenizer = splitByString([', ', '; ', '\n', '\\']). Note that each string can consist of multiple characters (', ' in the example). The default separator list, if not specified explicitly (for example, tokenizer = splitByString), is a single whitespace [' '].
  • ngrams(N) splits strings into equally large N-grams (see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, tokenizer = ngrams(3). The default ngram size, if not specified explicitly (for example, tokenizer = ngrams), is 3.
  • sparseGrams(min_length, max_length, min_cutoff_length) splits strings into variable-length n-grams of at least min_length and at most max_length (inclusive) characters (see function sparseGrams). Unless specified explicitly, min_length and max_length default to 3 and 100. If parameter min_cutoff_length is provided, only n-grams with a length greater than or equal to min_cutoff_length are returned. Compared to ngrams(N), the sparseGrams tokenizer produces variable-length n-grams, allowing for a more flexible representation of the original text. For example, tokenizer = sparseGrams(3, 5, 4) internally generates 3-, 4-, and 5-grams from the input string, but only the 4- and 5-grams are returned.
  • array performs no tokenization, i.e. every row value is a token (see function array).

All available tokenizers are listed in system.tokenizers.
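
For example, to list them:

SELECT * FROM system.tokenizers;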

Note

The splitByString tokenizer applies the split separators left-to-right. This can create ambiguities. For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas switching both separator strings, ['%', '%21'], will output ['21abc']. In most cases, you want matching to prefer longer separators first. This can generally be done by passing the separator strings in order of descending length. If the separator strings happen to form a prefix code, they can be passed in arbitrary order.

To understand how a tokenizer splits the input string, you can use the tokens function:

Example:

SELECT tokens('abc def', 'ngrams', 3);

Result:

['abc','bc ','c d',' de','def']

Working with non-ASCII inputs. While text indexes can in principle be built on top of text data in any language and character set, we currently recommend doing so only for input in extended ASCII character sets, i.e. Western languages. In particular, Chinese, Japanese, and Korean currently lack comprehensive indexing support, leading to potentially huge index sizes and long query times. We plan to add specialized language-specific tokenizers to handle these cases better in the future.

Preprocessor argument (optional). The preprocessor refers to an expression which is applied to the input string before tokenization.

Typical use cases for the preprocessor argument include

  1. Lower- or upper-casing to enable case-insensitive matching, e.g., lower, lowerUTF8 (see the first example below).
  2. UTF-8 normalization, e.g. normalizeUTF8NFC, normalizeUTF8NFD, normalizeUTF8NFKC, normalizeUTF8NFKD, toValidUTF8.
  3. Removing or transforming unwanted characters or substrings, e.g. extractTextFromHTML, substring, idnaEncode, translate.

The preprocessor expression must transform an input value of type String or FixedString to a value of the same type.

Examples:

  • INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))
  • INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))
  • INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))

Also, the preprocessor expression must only reference the column on top of which the text index is defined. Using non-deterministic functions is disallowed.

The preprocessor can also be used with Array(String) and Array(FixedString) columns. In this case, the preprocessor expression transforms the array elements individually.

Example:

CREATE TABLE table
(
    col Array(String),
    INDEX idx col TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))

    -- This is not legal:
    INDEX idx_illegal col TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = arraySort(col))
)
ENGINE = MergeTree
ORDER BY tuple();

Functions hasToken, hasAllTokens and hasAnyTokens use the preprocessor to first transform the search term before tokenizing it.

For example,

CREATE TABLE table
(
    key UInt64,
    str String,
    INDEX idx(str) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(str))
)
ENGINE = MergeTree
ORDER BY tuple();

SELECT count() FROM table WHERE hasToken(str, 'Foo');

is equivalent to:

CREATE TABLE table
(
    key UInt64,
    str String,
    INDEX idx(lower(str)) TYPE text(tokenizer = 'splitByNonAlpha')
)
ENGINE = MergeTree
ORDER BY tuple();

SELECT count() FROM table WHERE hasToken(str, lower('Foo'));

Other arguments (optional).

Optional advanced parameters

The default values of the following advanced parameters will work well in virtually all situations. We do not recommend changing them.

Optional parameter dictionary_block_size (default: 512) specifies the size of dictionary blocks in rows.

Optional parameter dictionary_block_frontcoding_compression (default: 1) specifies whether the dictionary blocks are compressed using front coding.

Optional parameter posting_list_block_size (default: 1048576) specifies the size of posting list blocks in rows.

Optional parameter posting_list_codec (default: none) specifies the codec for posting lists:

  • none - the posting lists are stored without additional compression.
  • bitpacking - applies differential (delta) coding followed by bit-packing, each within fixed-size blocks. Slows down SELECT queries; not recommended at the moment.
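
If you do need to set them, the advanced parameters are passed alongside the tokenizer in the index definition, following the syntax shown above. A sketch (the values shown are simply the documented defaults):

CREATE TABLE table
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(
                                tokenizer = splitByNonAlpha,
                                dictionary_block_size = 512,
                                dictionary_block_frontcoding_compression = 1,
                                posting_list_block_size = 1048576,
                                posting_list_codec = 'none')
)
ENGINE = MergeTree
ORDER BY key;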

Index granularity. Text indexes are implemented within ClickHouse as a type of skip index. However, unlike other skip indexes, text indexes use an effectively infinite granularity (100 million). This can be seen in the table definition of a text index.

Example:

CREATE TABLE table(
    k UInt64,
    s String,
    INDEX idx(s) TYPE text(tokenizer = ngrams(2)))
ENGINE = MergeTree()
ORDER BY k;

SHOW CREATE TABLE table;

Result:

┌─statement──────────────────────────────────────────────────────────────┐
│ CREATE TABLE default.table                                            ↴│
│↳(                                                                     ↴│
│↳    `k` UInt64,                                                       ↴│
│↳    `s` String,                                                       ↴│
│↳    INDEX idx s TYPE text(tokenizer = ngrams(2)) GRANULARITY 100000000↴│ <-- here
│↳)                                                                     ↴│
│↳ENGINE = MergeTree                                                    ↴│
│↳ORDER BY k                                                            ↴│
│↳SETTINGS index_granularity = 8192                                      │
└────────────────────────────────────────────────────────────────────────┘

The huge index granularity ensures that the text index is created for the entire part. An explicitly specified index granularity is ignored.

Using a Text Index

Using a text index in SELECT queries is straightforward as common string search functions will leverage the index automatically. If no index exists on a column or table part, the string search functions will fall back to slow brute-force scans.

Note

We recommend using functions hasAnyTokens and hasAllTokens to search the text index, please see below. These functions work with all available tokenizers and all possible preprocessor expressions. As the other supported functions historically preceded the text index, they had to retain their legacy behavior in many cases (e.g. no preprocessor support).

Supported functions

The text index can be used if text search functions are used in the WHERE or PREWHERE clause:

SELECT [...]
FROM [...]
WHERE string_search_function(column_with_text_index)

= and !=

= (equals) and != (notEquals) match the entire given search term.

Example:

SELECT * FROM table WHERE str = 'Hello';

The text index supports = and !=, yet equality and inequality search only make sense with the array tokenizer (it causes the index to store entire row values).
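
For illustration, an index intended for such equality searches could be defined with the array tokenizer (a sketch reusing the table layout from above):

CREATE TABLE table
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(tokenizer = array)
)
ENGINE = MergeTree
ORDER BY key;

-- Each row value is stored as a single token, so equality searches can use the index:
SELECT * FROM table WHERE str = 'Hello';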

IN and NOT IN

IN (in) and NOT IN (notIn) are similar to functions equals and notEquals, but they match rows whose value equals any of the search terms (IN) or none of them (NOT IN).

Example:

SELECT * FROM table WHERE str IN ('Hello', 'World');

The same restrictions as for = and != apply, i.e. IN and NOT IN only make sense in conjunction with the array tokenizer.

LIKE, NOT LIKE and match

Note

These functions currently use the text index for filtering only if the index tokenizer is either splitByNonAlpha, ngrams or sparseGrams.

In order to use LIKE (like), NOT LIKE (notLike), and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term. For an index with the ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal to or longer than the ngram length.

Example for the text index with splitByNonAlpha tokenizer:

SELECT count() FROM table WHERE comment LIKE 'support%';

In the example, support could match support, supports, supporting, etc. This kind of query is a substring query, and it cannot be sped up by a text index.

To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:

SELECT count() FROM table WHERE comment LIKE ' support %'; -- or `% support %`

The spaces left and right of support make sure that the term can be extracted as a token.

startsWith and endsWith

Similar to LIKE, functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term. For an index with the ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal to or longer than the ngram length.

Example for the text index with splitByNonAlpha tokenizer:

SELECT count() FROM table WHERE startsWith(comment, 'clickhouse support');

In the example, only clickhouse is considered a token. support is not a token because it could match support, supports, supporting, etc.

To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:

startsWith(comment, 'clickhouse supports ')

Similarly, endsWith should be used with a leading space:

SELECT count() FROM table WHERE endsWith(comment, ' olap engine');

hasToken and hasTokenOrNull

Note

Function hasToken looks straightforward to use but it has certain pitfalls with non-default tokenizers and preprocessor expressions. We recommend using functions hasAnyTokens and hasAllTokens instead.

Functions hasToken and hasTokenOrNull match against a single given token.

Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).

Example:

SELECT count() FROM table WHERE hasToken(comment, 'clickhouse');

hasAnyTokens and hasAllTokens

Functions hasAnyTokens and hasAllTokens match against one or all of the given tokens.

These two functions accept the search tokens either as a string, which is tokenized using the same tokenizer as the index column, or as an array of already processed tokens, to which no tokenization is applied prior to searching. See the function documentation for more info.

Example:

-- Search tokens passed as string argument
SELECT count() FROM table WHERE hasAnyTokens(comment, 'clickhouse olap');
SELECT count() FROM table WHERE hasAllTokens(comment, 'clickhouse olap');

-- Search tokens passed as Array(String)
SELECT count() FROM table WHERE hasAnyTokens(comment, ['clickhouse', 'olap']);
SELECT count() FROM table WHERE hasAllTokens(comment, ['clickhouse', 'olap']);

has

Array function has matches against a single token in the array of strings.

Example:

SELECT count() FROM table WHERE has(array, 'clickhouse');

mapContains

Function mapContains (an alias of mapContainsKey) matches against tokens extracted from the searched string in the keys of a map. The behaviour is similar to the equals function with a String column. The text index is only used if it was created on a mapKeys(map) expression.

Example:

SELECT count() FROM table WHERE mapContainsKey(map, 'clickhouse');
-- OR
SELECT count() FROM table WHERE mapContains(map, 'clickhouse');

mapContainsValue

Function mapContainsValue matches against tokens extracted from the searched string in the values of a map. The behaviour is similar to the equals function with a String column. The text index is only used if it was created on a mapValues(map) expression.

Example:

SELECT count() FROM table WHERE mapContainsValue(map, 'clickhouse');

mapContainsKeyLike and mapContainsValueLike

The functions mapContainsKeyLike and mapContainsValueLike match a pattern against all keys or values (respectively) of a map.

Example:

SELECT count() FROM table WHERE mapContainsKeyLike(map, '% clickhouse %');
SELECT count() FROM table WHERE mapContainsValueLike(map, '% clickhouse %');

operator[]

Access operator[] can be used with the text index to filter out keys and values. The text index is only used if it is created on mapKeys(map) or mapValues(map) expressions, or both.

Example:

SELECT count() FROM table WHERE map['engine'] = 'clickhouse';

See the following examples for using columns of type Array(T) and Map(K, V) with the text index.

Examples for Array and Map columns with text indexes

Indexing Array(String) columns

Imagine a blogging platform where authors categorize their blog posts using keywords. We want users to discover related content by searching for or clicking on topics.

Consider this table definition:

CREATE TABLE posts
(
    post_id UInt64,
    title String,
    content String,
    keywords Array(String)
)
ENGINE = MergeTree
ORDER BY (post_id);

Without a text index, finding posts with a specific keyword (e.g. clickhouse) requires scanning all entries:

SELECT count() FROM posts WHERE has(keywords, 'clickhouse'); -- slow full-table scan - checks every keyword in every post

As the platform grows, this becomes increasingly slow because the query must examine every keywords array in every row. To overcome this performance issue, we define a text index for column keywords:

ALTER TABLE posts ADD INDEX keywords_idx(keywords) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE posts MATERIALIZE INDEX keywords_idx; -- Don't forget to rebuild the index for existing data
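
With the index materialized, the same keyword lookup from above can be answered via the text index instead of a full scan:

SELECT count() FROM posts WHERE has(keywords, 'clickhouse'); -- uses keywords_idx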

Indexing Map columns

In many observability use cases, log messages are split into "components" and stored using appropriate data types, e.g. DateTime for the timestamp, an Enum for the log level, etc. Metrics fields are best stored as key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring.

Consider this logs table:

CREATE TABLE logs
(
    id UInt64,
    timestamp DateTime,
    message String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY (timestamp);

Without a text index, searching through Map data requires full table scans:

-- Finds all logs with rate limiting data:
SELECT * FROM logs WHERE has(mapKeys(attributes), 'rate_limit'); -- slow full-table scan

-- Finds all logs from a specific IP:
SELECT * FROM logs WHERE has(mapValues(attributes), '192.168.1.1'); -- slow full-table scan

As log volume grows, these queries become slow.

The solution is to create text indexes on the Map keys and values. Use mapKeys to create a text index when you need to find logs by field names or attribute types:

ALTER TABLE logs ADD INDEX attributes_keys_idx mapKeys(attributes) TYPE text(tokenizer = array);
ALTER TABLE logs MATERIALIZE INDEX attributes_keys_idx;

Use mapValues to create a text index when you need to search within the actual content of attributes:

ALTER TABLE logs ADD INDEX attributes_vals_idx mapValues(attributes) TYPE text(tokenizer = array);
ALTER TABLE logs MATERIALIZE INDEX attributes_vals_idx;

Example queries:

-- Find all rate-limited requests:
SELECT * FROM logs WHERE mapContainsKey(attributes, 'rate_limit'); -- fast

-- Finds all logs from a specific IP:
SELECT * FROM logs WHERE has(mapValues(attributes), '192.168.1.1'); -- fast

-- Finds all logs where any attribute includes an error:
SELECT * FROM logs WHERE mapContainsValueLike(attributes, '% error %'); -- fast

Performance Tuning

Direct read

Certain types of text queries can be sped up significantly by an optimization called "direct read".

Example:

SELECT column_a, column_b, ...
FROM [...]
WHERE string_search_function(column_with_text_index)

The direct read optimization answers the query exclusively using the text index (i.e., text index lookups) without accessing the underlying text column. Text index lookups read relatively little data and are therefore much faster than usual skip indexes in ClickHouse (which do a skip index lookup, followed by loading and filtering remaining granules).

Direct read is controlled by two settings:

  • Setting query_plan_direct_read_from_text_index (true by default) which specifies if direct read is generally enabled.
  • Setting use_skip_indexes_on_data_read, another prerequisite for direct read. In ClickHouse versions >= 26.1, the setting is enabled by default. In earlier versions, you need to run SET use_skip_indexes_on_data_read = 1 explicitly.

Supported functions

The direct read optimization supports functions hasToken, hasAllTokens, and hasAnyTokens. If the text index is defined with an array tokenizer, direct read is also supported for functions equals, has, mapContainsKey, and mapContainsValue. These functions can also be combined with the AND, OR, and NOT operators. The WHERE or PREWHERE clause can also contain additional non-text-search-function filters (on text columns or other columns); in that case, the direct read optimization will still be used but will be less effective (it only applies to the supported text search functions).
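
For example, a filter that combines supported functions with boolean operators (the table and column names are placeholders) could look like this:

SELECT count()
FROM table
WHERE hasToken(col, 'clickhouse') AND NOT hasToken(col, 'deprecated');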

To check whether a query utilizes direct read, run the query with EXPLAIN PLAN actions = 1. As an example, a query with direct read disabled

EXPLAIN PLAN actions = 1
SELECT count()
FROM table
WHERE hasToken(col, 'some_token')
SETTINGS query_plan_direct_read_from_text_index = 0, -- disable direct read
         use_skip_indexes_on_data_read = 1;

returns

[...]
Filter ((WHERE + Change column names to column identifiers))
Filter column: hasToken(__table1.col, 'some_token'_String) (removed)
Actions: INPUT : 0 -> col String : 0
         COLUMN Const(String) -> 'some_token'_String String : 1
         FUNCTION hasToken(col :: 0, 'some_token'_String :: 1) -> hasToken(__table1.col, 'some_token'_String) UInt8 : 2
[...]

whereas the same query run with query_plan_direct_read_from_text_index = 1

EXPLAIN PLAN actions = 1
SELECT count()
FROM table
WHERE hasToken(col, 'some_token')
SETTINGS query_plan_direct_read_from_text_index = 1, -- enable direct read
         use_skip_indexes_on_data_read = 1;

returns

[...]
Expression (Before GROUP BY)
Positions:
  Filter
  Filter column: __text_index_idx_hasToken_94cc2a813036b453d84b6fb344a63ad3 (removed)
  Actions: INPUT :: 0 -> __text_index_idx_hasToken_94cc2a813036b453d84b6fb344a63ad3 UInt8 : 0
[...]

The second EXPLAIN PLAN output contains a virtual column __text_index_<index_name>_<function_name>_<id>. If this column is present, then direct read is used.

If the WHERE clause contains only text search functions, the query can avoid reading the column data entirely and benefits the most from direct read. However, even if the text column is accessed elsewhere in the query, direct read still provides a performance improvement.

Direct read as a hint

Direct read as a hint is based on the same principles as normal direct read, but it instead adds an additional filter built from the text index data without removing the underlying text column. It is used for functions for which reading only from the text index would produce false positives.

Supported functions are: like, startsWith, endsWith, equals, has, mapContainsKey, and mapContainsValue.

In combination with other filters, the additional filter provides extra selectivity that further restricts the result set, helping to reduce the amount of data read from other columns.

Direct read as a hint is controlled by setting query_plan_text_index_add_hint (enabled by default).

Example of query without hint:

EXPLAIN actions = 1
SELECT count()
FROM table
WHERE (col LIKE '%some-token%') AND (d >= today())
SETTINGS use_skip_indexes_on_data_read = 1, query_plan_text_index_add_hint = 0
FORMAT TSV

returns

[...]
Prewhere filter column: and(like(__table1.col, \'%some-token%\'_String), greaterOrEquals(__table1.d, _CAST(20440_Date, \'Date\'_String))) (removed)
[...]

whereas the same query run with query_plan_text_index_add_hint = 1

EXPLAIN actions = 1
SELECT count()
FROM table
WHERE (col LIKE '%some-token%') AND (d >= today())
SETTINGS use_skip_indexes_on_data_read = 1, query_plan_text_index_add_hint = 1

returns

[...]
Prewhere filter column: and(__text_index_idx_col_like_d306f7c9c95238594618ac23eb7a3f74, like(__table1.col, \'%some-token%\'_String), greaterOrEquals(__table1.d, _CAST(20440_Date, \'Date\'_String))) (removed)
[...]

In the second EXPLAIN output, you can see that an additional conjunct (__text_index_...) has been added to the filter condition. Thanks to the PREWHERE optimization, the filter condition is broken down into three separate conjuncts, which are applied in order of increasing computational complexity. For this query, the application order is __text_index_..., then greaterOrEquals(...), and finally like(...). This ordering makes it possible to skip even more granules than those skipped by the text index and the original filter alone, before the heavy columns used in the query are read, further reducing the amount of data to read.

Caching

Different caches are available to buffer parts of the text index in memory (see section Implementation Details). Currently, there are caches for the deserialized dictionary blocks, headers, and posting lists of the text index to reduce I/O. They can be enabled via settings use_text_index_dictionary_cache, use_text_index_header_cache, and use_text_index_postings_cache. By default, all caches are disabled. To clear the caches, use the statement SYSTEM CLEAR TEXT INDEX CACHES.
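
For example, to enable all three caches for the current session and clear them again later:

SET use_text_index_dictionary_cache = 1;
SET use_text_index_header_cache = 1;
SET use_text_index_postings_cache = 1;

-- Later, to drop the cached index data:
SYSTEM CLEAR TEXT INDEX CACHES;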

Please refer to the following server settings to configure the caches.

Dictionary blocks cache settings

  • text_index_dictionary_block_cache_policy: Text index dictionary block cache policy name.
  • text_index_dictionary_block_cache_size: Maximum cache size in bytes.
  • text_index_dictionary_block_cache_max_entries: Maximum number of deserialized dictionary blocks in the cache.
  • text_index_dictionary_block_cache_size_ratio: The size of the protected queue in the text index dictionary block cache relative to the cache's total size.

Header cache settings

  • text_index_header_cache_policy: Text index header cache policy name.
  • text_index_header_cache_size: Maximum cache size in bytes.
  • text_index_header_cache_max_entries: Maximum number of deserialized headers in the cache.
  • text_index_header_cache_size_ratio: The size of the protected queue in the text index header cache relative to the cache's total size.

Posting lists cache settings

  • text_index_postings_cache_policy: Text index postings cache policy name.
  • text_index_postings_cache_size: Maximum cache size in bytes.
  • text_index_postings_cache_max_entries: Maximum number of deserialized postings in the cache.
  • text_index_postings_cache_size_ratio: The size of the protected queue in the text index postings cache relative to the cache's total size.

Limitations

The text index currently has the following limitations:

  • The materialization of text indexes with a high number of tokens (e.g. 10 billion tokens) can consume significant amounts of memory. Text index materialization can happen directly (ALTER TABLE <table> MATERIALIZE INDEX <index>) or indirectly in part merges.
  • It is not possible to materialize text indexes on parts with more than 4,294,967,296 (= 2^32, ca. 4.3 billion) rows. Without a materialized text index, queries fall back to slow brute-force search within the part. As a worst-case estimate, assume a part contains a single column of type String and the MergeTree setting max_bytes_to_merge_at_max_space_in_pool (default: 150 GB) was not changed. In this case, the situation occurs if the column contains fewer than 29.5 characters per row on average. In practice, tables also contain other columns, and the threshold is multiple times smaller than that (depending on the number, type, and size of the other columns).

Text Indexes vs Bloom-Filter-Based Indexes

String predicates can be sped up using text indexes and bloom-filter-based indexes (index types bloom_filter, ngrambf_v1, tokenbf_v1, sparse_grams), yet the two are fundamentally different in their design and intended use cases:

Bloom filter indexes

  • Are based on probabilistic data structures which may produce false positives.
  • Are only able to answer set membership questions, i.e. the column may contain token X vs. definitely does not contain X.
  • Store granule-level information to enable skipping coarse ranges during query execution.
  • Are hard to tune properly (see here for an example).
  • Are rather compact (a few kilobytes or megabytes per part).

Text indexes

  • Build a deterministic inverted index over tokens. No false positives are possible by the index itself.
  • Are specifically optimized for text search workloads.
  • Store row-level information which enables efficient term lookup.
  • Are rather large (dozens to hundreds of megabytes per part).

Bloom-filter-based indexes support full-text search only as a "side effect":

  • They do not support advanced tokenization and preprocessing.
  • They do not support multi-token search.
  • They do not provide the performance characteristics expected from an inverted index.

Text indexes, in contrast, are purpose-built for full-text search:

  • They provide tokenization and preprocessing.
  • They provide efficient support for hasAllTokens, LIKE, match, and similar text-search functions.
  • They have significantly better scalability for large text corpora.

Implementation Details

Each text index consists of two (abstract) data structures:

  • a dictionary which maps each token to a postings list, and
  • a set of postings lists, each representing a set of row numbers.

A text index is built for the whole part. Unlike other skip indexes, a text index can be merged instead of being rebuilt when data parts are merged (see below).

During index creation, three files are created (per part):

Dictionary blocks file (.dct)

The tokens in the text index are sorted and stored in dictionary blocks of 512 tokens each (the block size is configurable by parameter dictionary_block_size). The dictionary blocks file (.dct) contains all the dictionary blocks of all index granules in a part.

Index header file (.idx)

The index header file contains for each dictionary block the block's first token and its relative offset in the dictionary blocks file.

This sparse index structure is similar to ClickHouse's sparse primary key index.

Postings lists file (.pst)

The posting lists for all tokens are laid out sequentially in the postings lists file. To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps. If a posting list is larger than posting_list_block_size, it is split into multiple blocks that are stored sequentially in the postings lists file.

Merging of text indexes

When data parts are merged, the text index does not need to be rebuilt from scratch; instead, it can be merged efficiently in a separate step of the merge process. During this step, the sorted dictionaries of the text indexes of each input part are read and combined into a new unified dictionary. The row numbers in the postings lists are also recalculated to reflect their new positions in the merged data part, using a mapping of old to new row numbers that is created during the initial merge phase. This method of merging text indexes is similar to how projections with a _part_offset column are merged. If the index is not materialized in a source part, it is first built and written to a temporary file, and then merged together with the indexes from the other parts and from other temporary index files.

Example: Hackernews dataset

Let's look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments from the popular Hacker News website. Here is the table without a text index:

CREATE TABLE hackernews (
    id UInt64,
    deleted UInt8,
    type String,
    author String,
    timestamp DateTime,
    comment String,
    dead UInt8,
    parent UInt64,
    poll UInt64,
    children Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32
)
ENGINE = MergeTree
ORDER BY (type, author);

The 28.7M rows are in a Parquet file in S3 - let's insert them into the hackernews table:

INSERT INTO hackernews
    SELECT * FROM s3Cluster(
        'default',
        'https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet',
        'Parquet',
        '
    id UInt64,
    deleted UInt8,
    type String,
    by String,
    time DateTime,
    text String,
    dead UInt8,
    parent UInt64,
    poll UInt64,
    kids Array(UInt32),
    url String,
    score UInt32,
    title String,
    parts Array(UInt32),
    descendants UInt32');

We will use ALTER TABLE to add a text index on the comment column, then materialize it:

-- Add the index
ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = splitByNonAlpha);

-- Materialize the index for existing data
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx SETTINGS mutations_sync = 2;

Now, let's run queries using hasToken, hasAnyTokens, and hasAllTokens functions. The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization.

1. Using hasToken

hasToken checks if the text contains a specific single token. We'll search for the case-sensitive token 'ClickHouse'.

Direct read disabled (Standard scan)

By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules. We can simulate this behavior by disabling direct read.

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;

┌─count()─┐
│     516 │
└─────────┘

1 row in set. Elapsed: 0.362 sec. Processed 24.90 million rows, 9.51 GB

Direct read enabled (Fast index read)

Now we run the same query with direct read enabled (the default).

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 1;

┌─count()─┐
│     516 │
└─────────┘

1 row in set. Elapsed: 0.008 sec. Processed 3.15 million rows, 3.15 MB

The direct read query is over 45 times faster (0.362s vs 0.008s) and processes significantly less data (9.51 GB vs 3.15 MB) by reading from the index alone.

2. Using hasAnyTokens

hasAnyTokens checks if the text contains at least one of the given tokens. We'll search for comments containing either 'love' or 'ClickHouse'.

Direct read disabled (Standard scan)

SELECT count()
FROM hackernews
WHERE hasAnyTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0;

┌─count()─┐
│  408426 │
└─────────┘

1 row in set. Elapsed: 1.329 sec. Processed 28.74 million rows, 9.72 GB

Direct read enabled (Fast index read)

SELECT count()
FROM hackernews
WHERE hasAnyTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1;

┌─count()─┐
│  408426 │
└─────────┘

1 row in set. Elapsed: 0.015 sec. Processed 27.99 million rows, 27.99 MB

The speedup is even more dramatic for this common "OR" search. The query is nearly 89 times faster (1.329s vs 0.015s) by avoiding the full column scan.

3. Using hasAllTokens

hasAllTokens checks if the text contains all of the given tokens. We'll search for comments containing both 'love' and 'ClickHouse'.

Direct read disabled (Standard scan)

Even with direct read disabled, the standard skip index is still effective. It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column.

SELECT count()
FROM hackernews
WHERE hasAllTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0;

┌─count()─┐
│      11 │
└─────────┘

1 row in set. Elapsed: 0.184 sec. Processed 147.46 thousand rows, 57.03 MB

Direct read enabled (Fast index read)

Direct read answers the query by operating on the index data, reading only 147.46 KB.

SELECT count()
FROM hackernews
WHERE hasAllTokens(comment, 'love ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1;

┌─count()─┐
│      11 │
└─────────┘

1 row in set. Elapsed: 0.007 sec. Processed 147.46 thousand rows, 147.46 KB

For this "AND" search, the direct read optimization is over 26 times faster (0.184s vs 0.007s) than the standard skip index scan.

The direct read optimization also applies to compound boolean expressions. Here, we'll perform a case-insensitive search for 'ClickHouse' OR 'clickhouse'.

Direct read disabled (Standard scan)

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 0, use_skip_indexes_on_data_read = 0;

┌─count()─┐
│     769 │
└─────────┘

1 row in set. Elapsed: 0.450 sec. Processed 25.87 million rows, 9.58 GB

Direct read enabled (Fast index read)

SELECT count()
FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 1, use_skip_indexes_on_data_read = 1;

┌─count()─┐
│     769 │
└─────────┘

1 row in set. Elapsed: 0.013 sec. Processed 25.87 million rows, 51.73 MB

By combining the results from the index, the direct read query is 34 times faster (0.450s vs 0.013s) and avoids reading the 9.58 GB of column data. For this specific case, hasAnyTokens(comment, ['ClickHouse', 'clickhouse']) would be the preferred, more efficient syntax.
