Full-text search with text indexes
Text indexes (also known as inverted indexes) enable fast full-text search on text data. A text index stores a mapping from tokens to the numbers of the rows that contain them. Tokens are generated by a process called tokenization. For example, ClickHouse's default tokenizer converts the English sentence "The cat likes mice." to the tokens ["The", "cat", "likes", "mice"].
As an example, assume a table with a single column and three rows
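For illustration, assume these three (hypothetical) row values:

```text
row 0: 'The cat likes mice'
row 1: 'I like ClickHouse and cats'
row 2: 'Mice fear the cat'
```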
The corresponding tokens are:
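```text
row 0: ['The', 'cat', 'likes', 'mice']
row 1: ['I', 'like', 'ClickHouse', 'and', 'cats']
row 2: ['Mice', 'fear', 'the', 'cat']
```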
We usually like to search case-insensitively, therefore we lower-case the tokens:
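```text
row 0: ['the', 'cat', 'likes', 'mice']
row 1: ['i', 'like', 'clickhouse', 'and', 'cats']
row 2: ['mice', 'fear', 'the', 'cat']
```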
We will also remove fill words such as "I", "the" and "and" as they occur in almost every row:
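```text
row 0: ['cat', 'likes', 'mice']
row 1: ['like', 'clickhouse', 'cats']
row 2: ['mice', 'fear', 'cat']
```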
A text index then (conceptually) contains this information:
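```text
'cat'        -> [0, 2]
'cats'       -> [1]
'clickhouse' -> [1]
'fear'       -> [2]
'like'       -> [1]
'likes'      -> [0]
'mice'       -> [0, 2]
```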
Given a search token, this index structure makes it possible to find all matching rows quickly.
Creating a Text Index
Text indexes are generally available (GA) in ClickHouse version 26.2 and newer. In these versions, no special settings need to be configured to use the text index. We strongly recommend using ClickHouse versions >= 26.2 for production use cases.
If you have upgraded (or were upgraded, e.g. in ClickHouse Cloud) from a ClickHouse version older than 26.2, the compatibility setting may still cause the index to be disabled and/or text-index-related performance optimizations to be deactivated.
If query
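```sql
-- a possible way to check the current compatibility setting (illustrative)
SELECT getSetting('compatibility');
```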
returns
or any other value smaller than 26.2, you will need to set three additional settings to use the text index:
Alternatively, you can increase the compatibility setting to 26.2 or newer, but this affects many settings and typically requires prior testing.
Text indexes can be defined on String, FixedString, Array(String), Array(FixedString), and Map (via mapKeys and mapValues map functions) columns using the following syntax:
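A minimal sketch (table, column, and index names are illustrative):

```sql
CREATE TABLE tab
(
    key UInt64,
    str String,
    INDEX text_idx(str) TYPE text(tokenizer = splitByNonAlpha)
)
ENGINE = MergeTree
ORDER BY key;
```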
Alternatively, to add a text index to an existing table:
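```sql
-- assumes the illustrative table `tab` from above
ALTER TABLE tab ADD INDEX text_idx(str) TYPE text(tokenizer = splitByNonAlpha);
```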
If you add an index to an existing table, we recommend materializing the index for existing table parts (otherwise search on parts without index will fall back to slow brute-force scans).
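```sql
-- materialize the index for existing parts (illustrative)
ALTER TABLE tab MATERIALIZE INDEX text_idx;
```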
To remove a text index, please run
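```sql
-- assumes the illustrative table and index from above
ALTER TABLE tab DROP INDEX text_idx;
```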
Tokenizer argument (mandatory). The tokenizer argument specifies the tokenizer:
- `splitByNonAlpha` splits strings along non-alphanumeric ASCII characters (see function splitByNonAlpha).
- `splitByString(S)` splits strings along user-defined separator strings `S` (see function splitByString). The separators can be specified using an optional parameter, for example, `tokenizer = splitByString([', ', '; ', '\n', '\\'])`. Note that each separator string can consist of multiple characters (`', '` in the example). The default separator list, if not specified explicitly (for example, `tokenizer = splitByString`), is a single whitespace `[' ']`.
- `ngrams(N)` splits strings into equally large `N`-grams (see function ngrams). The ngram length can be specified using an optional integer parameter between 1 and 8, for example, `tokenizer = ngrams(3)`. The default ngram size, if not specified explicitly (for example, `tokenizer = ngrams`), is 3.
- `sparseGrams(min_length, max_length, min_cutoff_length)` splits strings into variable-length n-grams of at least `min_length` and at most `max_length` (inclusive) characters (see function sparseGrams). Unless specified explicitly, `min_length` and `max_length` default to 3 and 100. If parameter `min_cutoff_length` is provided, only n-grams with length greater than or equal to `min_cutoff_length` are returned. Compared to `ngrams(N)`, the `sparseGrams` tokenizer produces variable-length n-grams, allowing for a more flexible representation of the original text. For example, `tokenizer = sparseGrams(3, 5, 4)` internally generates 3-, 4-, and 5-grams from the input string but only the 4- and 5-grams are returned.
- `array` performs no tokenization, i.e. every row value is a token (see function array).
All available tokenizers are listed in system.tokenizers.
The splitByString tokenizer applies the split separators left-to-right.
This can create ambiguities.
For example, the separator strings ['%21', '%'] will cause %21abc to be tokenized as ['abc'], whereas swapping the two separator strings to ['%', '%21'] will output ['21abc'].
In most cases, you want matching to prefer longer separators first.
This can generally be done by passing the separator strings in order of descending length.
If the separator strings happen to form a prefix code, they can be passed in arbitrary order.
To understand how a tokenizer splits an input string, you can use the tokens function:
Example:
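```sql
-- tokenize with the default tokenizer (splitByNonAlpha); illustrative
SELECT tokens('The cat likes mice.') AS tok;
```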
Result:
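```text
['The','cat','likes','mice']
```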
Working with non-ASCII inputs. While text indexes can in principle be built on top of text data in any language and character set, we currently recommend doing so only for input in extended ASCII character sets, i.e. Western languages. In particular, Chinese, Japanese, and Korean currently lack comprehensive indexing support, leading to potentially huge index sizes and long query times. We plan to add specialized language-specific tokenizers to handle these cases better in the future.
Preprocessor argument (optional). The preprocessor refers to an expression which is applied to the input string before tokenization.
Typical use cases for the preprocessor argument include
- Lower- or upper-casing to enable case-insensitive matching, e.g., lower, lowerUTF8 (see the first example below).
- UTF-8 normalization, e.g. normalizeUTF8NFC, normalizeUTF8NFD, normalizeUTF8NFKC, normalizeUTF8NFKD, toValidUTF8.
- Removing or transforming unwanted characters or substrings, e.g. extractTextFromHTML, substring, idnaEncode, translate.
The preprocessor expression must transform an input value of type String or FixedString to a value of the same type.
Examples:
- `INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(col))`
- `INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = substringIndex(col, '\n', 1))`
- `INDEX idx(col) TYPE text(tokenizer = 'splitByNonAlpha', preprocessor = lower(extractTextFromHTML(col)))`
Also, the preprocessor expression must only reference the column on top of which the text index is defined. Using non-deterministic functions is disallowed.
The preprocessor can also be used with Array(String) and Array(FixedString) columns. In this case, the preprocessor expression transforms the array elements individually.
Example:
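One possible definition, assuming a column `arr` of type `Array(String)` and using `arrayMap` to lower-case each element (illustrative, not the only way to write it):

```sql
INDEX idx(arr) TYPE text(tokenizer = splitByNonAlpha, preprocessor = arrayMap(x -> lower(x), arr))
```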
Functions hasToken, hasAllTokens and hasAnyTokens use the preprocessor to first transform the search term before tokenizing it.
For example,
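```sql
-- assuming an index with preprocessor = lower(col) (illustrative)
SELECT count() FROM tab WHERE hasToken(col, 'ClickHouse');
```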
is equivalent to:
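```sql
SELECT count() FROM tab WHERE hasToken(col, lower('ClickHouse'));
```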
Other arguments (optional).
Optional advanced parameters
The default values of the following advanced parameters will work well in virtually all situations. We do not recommend changing them.
Optional parameter dictionary_block_size (default: 512) specifies the size of dictionary blocks in rows.
Optional parameter dictionary_block_frontcoding_compression (default: 1) specifies if the dictionary blocks use front coding as compression.
Optional parameter posting_list_block_size (default: 1048576) specifies the size of posting list blocks in rows.
Optional parameter posting_list_codec (default: none) specifies the codec for posting list:
- `none` - the posting lists are stored without additional compression.
- `bitpacking` - apply differential (delta) coding, followed by bit-packing (each within blocks of fixed size). Slows down SELECT queries; not recommended at the moment.
Index granularity. Text indexes are implemented within ClickHouse as a type of skip indexes. However, unlike other skip indexes, text indexes use an infinite granularity (100 million). This can be seen in the table definition of a text index.
Example:
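```sql
-- inspect the stored index definition of the illustrative table from above
SHOW CREATE TABLE tab;
```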
Result:
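```text
... INDEX text_idx str TYPE text(tokenizer = splitByNonAlpha) GRANULARITY 100000000 ...
```

(abridged; the implicit GRANULARITY 100000000 is the point of interest)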
The huge index granularity ensures that the text index is created for the entire part. An explicitly specified index granularity is ignored.
Using a Text Index
Using a text index in SELECT queries is straightforward as common string search functions will leverage the index automatically. If no index exists on a column or table part, the string search functions will fall back to slow brute-force scans.
We recommend using the functions hasAnyTokens and hasAllTokens to search the text index; see below.
These functions work with all available tokenizers and all possible preprocessor expressions.
As the other supported functions historically preceded the text index, they had to retain their legacy behavior in many cases (e.g. no preprocessor support).
Supported functions
The text index can be used if text functions are used in the WHERE or PREWHERE clause:
= and !=
= (equals) and != (notEquals) match the entire given search term.
Example:
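```sql
-- illustrative; most useful with an index using the array tokenizer (see below)
SELECT count() FROM tab WHERE str = 'ClickHouse rocks';
```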
The text index supports = and !=, yet equality and inequality searches only make sense with the array tokenizer (which causes the index to store entire row values).
IN and NOT IN
IN (in) and NOT IN (notIn) are similar to functions equals and notEquals but they match all (IN) or no (NOT IN) search terms.
Example:
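```sql
-- illustrative
SELECT count() FROM tab WHERE str IN ('ClickHouse rocks', 'ClickHouse rules');
```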
The same restrictions as for = and != apply, i.e. IN and NOT IN only make sense in conjunction with the array tokenizer.
LIKE, NOT LIKE and match
These functions currently use the text index for filtering only if the index tokenizer is either splitByNonAlpha, ngrams or sparseGrams.
In order to use LIKE (like), NOT LIKE (notLike), and the match function with text indexes, ClickHouse must be able to extract complete tokens from the search term.
For an index with the ngrams tokenizer, this is the case if the length of the searched strings between wildcards is equal to or longer than the ngram length.
Example for the text index with splitByNonAlpha tokenizer:
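```sql
-- a substring pattern like this cannot use the text index (illustrative)
SELECT count() FROM tab WHERE str LIKE '%support%';
```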
support in the example could match support, supports, supporting etc.
This kind of query is a substring query and it cannot be sped up by a text index.
To leverage a text index for LIKE queries, the LIKE pattern must be rewritten in the following way:
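```sql
SELECT count() FROM tab WHERE str LIKE '% support %';
```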
The spaces left and right of support make sure that the term can be extracted as a token.
startsWith and endsWith
Similar to LIKE, the functions startsWith and endsWith can only use a text index if complete tokens can be extracted from the search term.
For an index with the ngrams tokenizer, this is the case if the length of the searched prefix or suffix is equal to or longer than the ngram length.
Example for the text index with splitByNonAlpha tokenizer:
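```sql
-- illustrative; only 'clickhouse' can be extracted as a complete token
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse support');
```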
In the example, only clickhouse is considered a token.
support is not a complete token because it could match support, supports, supporting, etc.
To find all rows that start with clickhouse supports, please end the search pattern with a trailing space:
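```sql
SELECT count() FROM tab WHERE startsWith(str, 'clickhouse supports ');
```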
Similarly, endsWith should be used with a leading space:
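```sql
-- illustrative; the leading space makes 'clickhouse' extractable as a complete token
SELECT count() FROM tab WHERE endsWith(str, ' clickhouse');
```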
hasToken and hasTokenOrNull
Function hasToken looks straightforward to use but it has certain pitfalls with non-default tokenizers and preprocessor expressions.
We recommend using functions hasAnyTokens and hasAllTokens instead.
Functions hasToken and hasTokenOrNull match against a single given token.
Unlike the previously mentioned functions, they do not tokenize the search term (they assume the input is a single token).
Example:
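```sql
-- the search term is treated as a single token (illustrative)
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```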
hasAnyTokens and hasAllTokens
Functions hasAnyTokens and hasAllTokens match against one or all of the given tokens.
These two functions accept the search tokens as either a string which will be tokenized using the same tokenizer used for the index column, or as an array of already processed tokens to which no tokenization will be applied prior to searching. See the function documentation for more info.
Example:
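```sql
-- a string search term is tokenized with the index tokenizer (illustrative)
SELECT count() FROM tab WHERE hasAnyTokens(str, 'clickhouse olap');

-- an array of tokens is used as-is, without further tokenization
SELECT count() FROM tab WHERE hasAllTokens(str, ['clickhouse', 'olap']);
```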
has
Array function has matches against a single token in the array of strings.
Example:
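```sql
-- assumes a column `arr` of type Array(String) with a text index (illustrative)
SELECT count() FROM tab WHERE has(arr, 'clickhouse');
```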
mapContains
Function mapContains (an alias of mapContainsKey) matches against tokens extracted from the searched string in the keys of a map.
The behaviour is similar to the equals function with a String column.
The text index is only used if it was created on a mapKeys(map) expression.
Example:
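```sql
-- assumes a Map column `attributes` with a text index on mapKeys(attributes) (illustrative)
SELECT count() FROM tab WHERE mapContains(attributes, 'error_code');
```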
mapContainsValue
Function mapContainsValue matches against tokens extracted from the searched string in the values of a map.
The behaviour is similar to the equals function with a String column.
The text index is only used if it was created on a mapValues(map) expression.
Example:
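```sql
-- assumes a text index on mapValues(attributes) (illustrative)
SELECT count() FROM tab WHERE mapContainsValue(attributes, 'timeout');
```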
mapContainsKeyLike and mapContainsValueLike
The functions mapContainsKeyLike and mapContainsValueLike match a pattern against all keys or values (respectively) of a map.
Example:
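```sql
-- illustrative patterns
SELECT count() FROM tab WHERE mapContainsKeyLike(attributes, 'http.%');
SELECT count() FROM tab WHERE mapContainsValueLike(attributes, 'error%');
```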
operator[]
Access operator[] can be used with the text index to filter out keys and values. The text index is only used if it is created on mapKeys(map) or mapValues(map) expressions, or both.
Example:
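```sql
-- uses the text index on mapKeys and/or mapValues, if present (illustrative)
SELECT count() FROM tab WHERE attributes['log_level'] = 'error';
```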
See the following examples for using columns of type Array(T) and Map(K, V) with the text index.
Examples for Array and Map columns with text indexes
Indexing Array(String) columns
Imagine a blogging platform where authors categorize their blog posts using keywords. We want users to discover related content by searching for or clicking on topics.
Consider this table definition:
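A possible schema (names and columns are illustrative):

```sql
CREATE TABLE posts
(
    post_id UInt64,
    title String,
    body String,
    keywords Array(String)
)
ENGINE = MergeTree
ORDER BY post_id;
```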
Without a text index, finding posts with a specific keyword (e.g. clickhouse) requires scanning all entries:
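```sql
SELECT count() FROM posts WHERE has(keywords, 'clickhouse');
```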
As the platform grows, this becomes increasingly slow because the query must examine every keywords array in every row.
To overcome this performance issue, we define a text index for column keywords:
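```sql
ALTER TABLE posts ADD INDEX keywords_idx(keywords) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE posts MATERIALIZE INDEX keywords_idx;
```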
Indexing Map columns
In many observability use cases, log messages are split into "components" and stored using appropriate data types, e.g. a DateTime for the timestamp, an Enum for the log level, etc. Metric fields are best stored as key-value pairs. Operations teams need to efficiently search through logs for debugging, security incidents, and monitoring.
Consider this logs table:
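A possible schema (illustrative):

```sql
CREATE TABLE logs
(
    timestamp DateTime,
    level Enum8('debug' = 0, 'info' = 1, 'warn' = 2, 'error' = 3),
    message String,
    attributes Map(String, String)
)
ENGINE = MergeTree
ORDER BY timestamp;
```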
Without a text index, searching through Map data requires full table scans:
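```sql
-- find logs that carry an 'error_code' attribute (full scan without an index)
SELECT count() FROM logs WHERE mapContainsKey(attributes, 'error_code');
```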
As log volume grows, these queries become slow.
The solution is creating a text index for the Map keys and values. Use mapKeys to create a text index when you need to find logs by field names or attribute types:
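```sql
ALTER TABLE logs ADD INDEX attr_keys_idx(mapKeys(attributes)) TYPE text(tokenizer = splitByNonAlpha);
```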
Use mapValues to create a text index when you need to search within the actual content of attributes:
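```sql
ALTER TABLE logs ADD INDEX attr_vals_idx(mapValues(attributes)) TYPE text(tokenizer = splitByNonAlpha);
```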
Example queries:
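```sql
-- served by the index on mapKeys(attributes)
SELECT count() FROM logs WHERE mapContainsKey(attributes, 'error_code');

-- served by the index on mapValues(attributes)
SELECT count() FROM logs WHERE mapContainsValue(attributes, 'timeout');

-- operator[] can use the indexes on mapKeys and mapValues
SELECT count() FROM logs WHERE attributes['environment'] = 'production';
```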
Performance Tuning
Direct read
Certain types of text queries can be sped up significantly by an optimization called "direct read".
Example:
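```sql
-- a count query that can be answered from the index alone (illustrative)
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse');
```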
The direct read optimization answers the query exclusively using the text index (i.e., text index lookups) without accessing the underlying text column. Text index lookups read relatively little data and are therefore much faster than usual skip indexes in ClickHouse (which do a skip index lookup, followed by loading and filtering remaining granules).
Direct read is controlled by two settings:
- Setting query_plan_direct_read_from_text_index (true by default), which specifies whether direct read is generally enabled.
- Setting use_skip_indexes_on_data_read, another prerequisite for direct read. In ClickHouse versions >= 26.1, this setting is enabled by default. In earlier versions, you need to run `SET use_skip_indexes_on_data_read = 1` explicitly.
Supported functions
The direct read optimization supports functions hasToken, hasAllTokens, and hasAnyTokens.
If the text index is defined with an array tokenizer, direct read is also supported for functions equals, has, mapContainsKey, and mapContainsValue.
These functions can also be combined by AND, OR, and NOT operators.
The WHERE or PREWHERE clause can also contain additional filters that are not text-search functions (on text columns or other columns). In that case, the direct read optimization is still used, but it is less effective (it only applies to the supported text-search functions).
To check whether a query utilizes direct read, run the query with EXPLAIN PLAN actions = 1.
As an example, a query with disabled direct read
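```sql
-- illustrative
EXPLAIN PLAN actions = 1
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 0;
```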
returns
whereas the same query run with query_plan_direct_read_from_text_index = 1
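```sql
EXPLAIN PLAN actions = 1
SELECT count() FROM tab WHERE hasToken(str, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 1;
```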
returns
The second EXPLAIN PLAN output contains a virtual column __text_index_<index_name>_<function_name>_<id>.
If this column is present, then direct read is used.
If the WHERE filter clause only contains text-search functions, the query can avoid reading the column data entirely and benefits the most from direct read. However, even if the text column is accessed elsewhere in the query, direct read still provides a performance improvement.
Direct read as a hint
Direct read as a hint is based on the same principles as normal direct read, but instead of removing the underlying text column, it adds an additional filter built from the text index data. It is used for functions where reading only the text index would produce false positives.
Supported functions are: like, startsWith, endsWith, equals, has, mapContainsKey, and mapContainsValue.
The additional filter provides extra selectivity that, in combination with other filters, restricts the result set further and helps to reduce the amount of data read from other columns.
Direct read as a hint is controlled by setting query_plan_text_index_add_hint (enabled by default).
Example of query without hint:
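```sql
-- illustrative; assumes an additional DateTime column `ts` alongside the indexed text column
EXPLAIN PLAN actions = 1
SELECT count() FROM tab WHERE str LIKE '% support %' AND ts >= '2025-01-01 00:00:00'
SETTINGS query_plan_text_index_add_hint = 0;
```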
returns
whereas the same query run with query_plan_text_index_add_hint = 1
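```sql
EXPLAIN PLAN actions = 1
SELECT count() FROM tab WHERE str LIKE '% support %' AND ts >= '2025-01-01 00:00:00'
SETTINGS query_plan_text_index_add_hint = 1;
```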
returns
In the second EXPLAIN PLAN output, you can see that an additional conjunct (__text_index_...) has been added to the filter condition.
Thanks to the PREWHERE optimization, the filter condition is broken down into three separate conjuncts, which are applied in order of increasing computational complexity.
For this query, the application order is __text_index_..., then greaterOrEquals(...), and finally like(...).
This ordering enables skipping even more data granules than the text index and the original filter alone would skip before the heavy columns referenced after the WHERE clause are read, further reducing the amount of data to read.
Caching
Different caches are available to buffer parts of the text index in memory (see section Implementation Details). Currently, there are caches for the deserialized dictionary blocks, headers, and posting lists of the text index to reduce I/O. They can be enabled via the settings use_text_index_dictionary_cache, use_text_index_header_cache, and use_text_index_postings_cache. By default, all caches are disabled. To clear the caches, use the statement SYSTEM CLEAR TEXT INDEX CACHES.
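For example, to enable all three caches for the current session (a sketch using the settings listed above):

```sql
SET use_text_index_dictionary_cache = 1,
    use_text_index_header_cache = 1,
    use_text_index_postings_cache = 1;
```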
Please refer to the following server settings to configure the caches.
Dictionary blocks cache settings
| Setting | Description |
|---|---|
| text_index_dictionary_block_cache_policy | Text index dictionary block cache policy name. |
| text_index_dictionary_block_cache_size | Maximum cache size in bytes. |
| text_index_dictionary_block_cache_max_entries | Maximum number of deserialized dictionary blocks in cache. |
| text_index_dictionary_block_cache_size_ratio | The size of the protected queue in the text index dictionary block cache relative to the cache's total size. |
Header cache settings
| Setting | Description |
|---|---|
| text_index_header_cache_policy | Text index header cache policy name. |
| text_index_header_cache_size | Maximum cache size in bytes. |
| text_index_header_cache_max_entries | Maximum number of deserialized headers in cache. |
| text_index_header_cache_size_ratio | The size of the protected queue in the text index header cache relative to the cache's total size. |
Posting lists cache settings
| Setting | Description |
|---|---|
| text_index_postings_cache_policy | Text index postings cache policy name. |
| text_index_postings_cache_size | Maximum cache size in bytes. |
| text_index_postings_cache_max_entries | Maximum number of deserialized postings in cache. |
| text_index_postings_cache_size_ratio | The size of the protected queue in the text index postings cache relative to the cache's total size. |
Limitations
The text index currently has the following limitations:
- The materialization of text indexes with a high number of tokens (e.g. 10 billion tokens) can consume significant amounts of memory. Text index materialization can happen directly (`ALTER TABLE <table> MATERIALIZE INDEX <index>`) or indirectly in part merges.
- It is not possible to materialize text indexes on parts with more than 4,294,967,296 (= 2^32, ca. 4.3 billion) rows. Without a materialized text index, queries fall back to slow brute-force search within the part. As a worst-case estimation, assume a part contains a single column of type String and the MergeTree setting `max_bytes_to_merge_at_max_space_in_pool` (default: 150 GB) was not changed. In this case, the situation occurs if the column contains fewer than 29.5 characters per row on average. In practice, tables also contain other columns and the threshold is multiple times smaller than that (depending on the number, type, and size of the other columns).
Text Indexes vs Bloom-Filter-Based Indexes
String predicates can be sped up using text indexes and bloom-filter-based indexes (index types bloom_filter, ngrambf_v1, tokenbf_v1, sparse_grams), yet the two are fundamentally different in their design and intended use cases:
Bloom filter indexes
- Are based on probabilistic data structures which may produce false positives.
- Are only able to answer set membership questions, i.e. the column may contain token X vs. definitely does not contain X.
- Store granule-level information to enable skipping coarse ranges during query execution.
- Are hard to tune properly (see here for an example).
- Are rather compact (a few kilobytes or megabytes per part).
Text indexes
- Build a deterministic inverted index over tokens. No false positives are possible by the index itself.
- Are specifically optimized for text search workloads.
- Store row-level information which enables efficient term lookup.
- Are rather large (dozens to hundreds of megabytes per part).
Bloom-filter-based indexes support full-text search only as a "side effect":
- They do not support advanced tokenization and preprocessing.
- They do not support multi-token search.
- They do not provide the performance characteristics expected from an inverted index.
Text indexes, in contrast, are purpose-built for full-text search:
- They provide tokenization and preprocessing.
- They provide efficient support for hasAllTokens, LIKE, match, and similar text-search functions.
- They have significantly better scalability for large text corpora.
Implementation Details
Each text index consists of two (abstract) data structures:
- a dictionary which maps each token to a postings list, and
- a set of postings lists, each representing a set of row numbers.
The text index is built for the whole part. Unlike other skip indexes, a text index can be merged instead of rebuilt when data parts are merged (see below).
During index creation, three files are created (per part):
Dictionary blocks file (.dct)
The tokens in the text index are sorted and stored in dictionary blocks of 512 tokens each (the block size is configurable by parameter dictionary_block_size).
A dictionary blocks file (.dct) contains all the dictionary blocks of all index granules in a part.
Index header file (.idx)
The index header file contains for each dictionary block the block's first token and its relative offset in the dictionary blocks file.
This sparse index structure is similar to ClickHouse's sparse primary key index.
Postings lists file (.pst)
The posting lists for all tokens are laid out sequentially in the postings list file.
To save space while still allowing fast intersection and union operations, the posting lists are stored as roaring bitmaps.
If a posting list is larger than posting_list_block_size, it is split into multiple blocks that are stored sequentially in the postings lists file.
Merging of text indexes
When data parts are merged, the text index does not need to be rebuilt from scratch; instead, it can be merged efficiently in a separate step of the merge process.
During this step, the sorted dictionaries of the text indexes of each input part are read and combined into a new unified dictionary.
The row numbers in the postings lists are also recalculated to reflect their new positions in the merged data part, using a mapping of old to new row numbers that is created during the initial merge phase.
This method of merging text indexes is similar to how projections with _part_offset column are merged.
If the index is not materialized in a source part, it is built, written into a temporary file, and then merged together with the indexes from the other parts and from other temporary index files.
Example: Hackernews dataset
Let's look at the performance improvements of text indexes on a large dataset with lots of text. We will use 28.7M rows of comments on the popular Hacker News website. Here is the table without text index:
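A reduced, illustrative version of the table (the real dataset has more columns):

```sql
CREATE TABLE hackernews
(
    id UInt64,
    author String,
    timestamp DateTime,
    comment String
)
ENGINE = MergeTree
ORDER BY id;
```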
The 28.7M rows are in a Parquet file in S3 - let's insert them into the hackernews table:
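```sql
-- <dataset-url> is a placeholder for the Parquet file location in S3 (illustrative)
INSERT INTO hackernews
SELECT id, author, timestamp, comment FROM s3('<dataset-url>', 'Parquet');
```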
We will use ALTER TABLE and add a text index on comment column, then materialize it:
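```sql
ALTER TABLE hackernews ADD INDEX comment_idx(comment) TYPE text(tokenizer = splitByNonAlpha);
ALTER TABLE hackernews MATERIALIZE INDEX comment_idx;
```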
Now, let's run queries using hasToken, hasAnyTokens, and hasAllTokens functions.
The following examples will show the dramatic performance difference between a standard index scan and the direct read optimization.
1. Using hasToken
hasToken checks if the text contains a specific single token.
We'll search for the case-sensitive token 'ClickHouse'.
Direct read disabled (Standard scan)
By default, ClickHouse uses the skip index to filter granules and then reads the column data for those granules. We can simulate this behavior by disabling direct read.
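```sql
-- illustrative query shape
SELECT count() FROM hackernews WHERE hasToken(comment, 'ClickHouse')
SETTINGS query_plan_direct_read_from_text_index = 0;
```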
Direct read enabled (Fast index read)
Now we run the same query with direct read enabled (the default).
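```sql
SELECT count() FROM hackernews WHERE hasToken(comment, 'ClickHouse');
```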
The direct read query is over 45 times faster (0.362s vs 0.008s) and processes significantly less data (9.51 GB vs 3.15 MB) by reading from the index alone.
2. Using hasAnyTokens
hasAnyTokens checks if the text contains at least one of the given tokens.
We'll search for comments containing either 'love' or 'ClickHouse'.
Direct read disabled (Standard scan)
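```sql
-- illustrative query shape
SELECT count() FROM hackernews WHERE hasAnyTokens(comment, ['love', 'ClickHouse'])
SETTINGS query_plan_direct_read_from_text_index = 0;
```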
Direct read enabled (Fast index read)
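```sql
SELECT count() FROM hackernews WHERE hasAnyTokens(comment, ['love', 'ClickHouse']);
```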
The speedup is even more dramatic for this common "OR" search. The query is nearly 89 times faster (1.329s vs 0.015s) by avoiding the full column scan.
3. Using hasAllTokens
hasAllTokens checks if the text contains all of the given tokens.
We'll search for comments containing both 'love' and 'ClickHouse'.
Direct read disabled (Standard scan)
Even with direct read disabled, the standard skip index is still effective. It filters down the 28.7M rows to just 147.46K rows, but it still must read 57.03 MB from the column.
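```sql
-- illustrative query shape
SELECT count() FROM hackernews WHERE hasAllTokens(comment, ['love', 'ClickHouse'])
SETTINGS query_plan_direct_read_from_text_index = 0;
```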
Direct read enabled (Fast index read)
Direct read answers the query by operating on the index data, reading only 147.46 KB.
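```sql
SELECT count() FROM hackernews WHERE hasAllTokens(comment, ['love', 'ClickHouse']);
```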
For this "AND" search, the direct read optimization is over 26 times faster (0.184s vs 0.007s) than the standard skip index scan.
4. Compound search: OR, AND, NOT, ...
The direct read optimization also applies to compound boolean expressions. Here, we'll perform a case-insensitive search for 'ClickHouse' OR 'clickhouse'.
Direct read disabled (Standard scan)
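```sql
-- illustrative query shape
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse')
SETTINGS query_plan_direct_read_from_text_index = 0;
```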
Direct read enabled (Fast index read)
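```sql
SELECT count() FROM hackernews
WHERE hasToken(comment, 'ClickHouse') OR hasToken(comment, 'clickhouse');
```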
By combining the results from the index, the direct read query is 34 times faster (0.450s vs 0.013s) and avoids reading the 9.58 GB of column data.
For this specific case, hasAnyTokens(comment, ['ClickHouse', 'clickhouse']) would be the preferred, more efficient syntax.
Related content
- Presentation: https://github.com/ClickHouse/clickhouse-presentations/blob/master/2025-tumuchdata-munich/ClickHouse_%20full-text%20search%20-%2011.11.2025%20Munich%20Database%20Meetup.pdf
- Presentation: https://presentations.clickhouse.com/2026-fosdem-inverted-index/Inverted_indexes_the_what_the_why_the_how.pdf