For most of the time that I've been aware of ClickHouse, my understanding was that it was an analytics database with a traditional client-server architecture, designed to serve a high volume of concurrent queries with low latency.
It's only in the last few months that I became aware of a couple of tools that have changed my understanding: ClickHouse Local, which lets us run an in-process version of ClickHouse fronted by a CLI, and chDB, an embedded SQL OLAP Engine powered by ClickHouse.
I've found myself using a mixture of these tools for the majority of the videos that I've been creating for the ClickHouse YouTube channel.
The most notable exception was a video explaining Materialized Views with help from ClickPy. ClickPy is a ClickHouse service hosted on ClickHouse Cloud that contains data on the downloads of packages from Python's PyPi package manager. The dataset helps us understand the most downloaded packages over time, grouped by country, installer, version, and a number of other dimensions.
The database also contains metadata about each package, including a project homepage, which is often a GitHub repository. It got me thinking that it would be quite interesting to compare the GitHub metrics (e.g. star count or number of forks) of those libraries against the download numbers.
I asked my colleague Dale whether we could add the GitHub data onto the ClickPy server and he suggested that I first take a look at ClickHouse's remote and remoteSecure table functions. These functions let you query remote ClickHouse servers on the fly from another ClickHouse client. We can also join the data from a remote query with data in the local ClickHouse, which, when used with ClickHouse Local, means that we can achieve a kind of hybrid query execution.
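To give a flavour of the shape of these functions, here's a minimal sketch of remoteSecure (the hostname, credentials, and table below are placeholders rather than anything real):

-- Query a table on a remote ClickHouse server as if it were local.
-- remote() talks to the plain native port; remoteSecure() uses TLS.
SELECT count()
FROM remoteSecure('myhost.clickhouse.cloud', 'mydb.mytable', 'myuser', 'mypassword');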
I should point out that this isn't a use case for which ClickHouse is currently optimized, but I thought it would be a fun experiment, so let's get to it!
Querying GitHub metrics with ClickHouse Local #
I wrote a little Python script to download data from the GitHub API for as many of the projects as possible, storing each project in its own JSON file on my machine. For example, below is a subset of the data for the Langchain project:
{
"id": 552661142,
"node_id": "R_kgDOIPDwlg",
"name": "langchain",
"full_name": "langchain-ai/langchain",
...
"topics": [],
"visibility": "public",
"forks": 10190,
"open_issues": 2109,
"watchers": 69585,
"default_branch": "master",
...
"subscribers_count": 606
}
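As an aside, you could probably sketch the same data collection without leaving SQL by pointing ClickHouse's url table function at the GitHub API, although I haven't tested this against GitHub's authentication and rate limits, so treat it as a rough idea rather than a recipe:

-- Untested sketch: fetch one repository's metadata straight from the GitHub API.
-- GitHub may require an auth token for anything beyond very light usage.
SELECT full_name, stargazers_count, forks
FROM url('https://api.github.com/repos/langchain-ai/langchain', JSONEachRow);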
We're going to explore these files using ClickHouse Local, so let's launch that on our machine:
./clickhouse local -m
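Before writing any real queries, we can ask ClickHouse what schema it infers from those JSON files:

DESCRIBE file('data/*.json', JSONEachRow);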
We can run the following query to find the most popular PyPi packages according to GitHub stars:
FROM file('data/*.json', JSONEachRow)
SELECT full_name, stargazers_count AS stars, forks
ORDER BY stargazers_count DESC
LIMIT 10;
┌─full_name────────────────┬──stars─┬─forks─┐
│ huggingface/transformers │ 116073 │ 23147 │
│ langchain-ai/langchain   │  69585 │ 10190 │
│ tiangolo/fastapi         │  65210 │  5519 │
│ yt-dlp/yt-dlp            │  60914 │  4994 │
│ keras-team/keras         │  59836 │ 19477 │
│ ansible/ansible          │  59352 │ 23867 │
│ openai/whisper           │  51217 │  5828 │
│ localstack/localstack    │  50301 │  3822 │
│ Textualize/rich          │  45582 │  1686 │
│ psf/black                │  35545 │  2339 │
└──────────────────────────┴────────┴───────┘
10 rows in set. Elapsed: 0.140 sec. Processed 2.08 thousand rows, 14.97 MB (14.91 thousand rows/s., 107.28 MB/s.)
Peak memory usage: 48.50 KiB.
I suppose it's not too surprising to see that quite a few of the libraries used in Generative AI applications are much loved on GitHub.
Querying popular PyPi projects on ClickHouse Cloud #
Now we need to work out which projects in the ClickPy database have a GitHub repository as their project home page. Let's first connect to the ClickPy database using the read-only play user:
./clickhouse client -m \
-h clickpy-clickhouse.clickhouse.com \
--user play --secure
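Once connected, we can have a quick look at what's available (output omitted):

SHOW TABLES FROM pypi;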
And now let's write a query that finds the most popular PyPi projects that have a GitHub repository. We'll do this by joining the pypi_downloads and projects tables. We run the following directly on the server:
SELECT name,
replaceOne(home_page, 'https://github.com/', '') AS repository,
sum(count) AS count
FROM pypi.pypi_downloads AS downloads
INNER JOIN (
SELECT name, argMax(home_page, version) AS home_page
FROM pypi.projects
GROUP BY name
) AS projects ON projects.name = downloads.project
WHERE projects.home_page LIKE '%github%'
GROUP BY ALL
ORDER BY count DESC
LIMIT 10;
┌─name───────────────┬─repository─────────────────┬───────count─┐
│ boto3              │ boto/boto3                 │ 16031894410 │
│ botocore           │ boto/botocore              │ 11033306159 │
│ certifi            │ certifi/python-certifi     │  8606959885 │
│ s3transfer         │ boto/s3transfer            │  8575775398 │
│ python-dateutil    │ dateutil/dateutil          │  8144178765 │
│ charset-normalizer │ Ousret/charset_normalizer  │  5891178066 │
│ jmespath           │ jmespath/jmespath.py       │  5405618311 │
│ pyasn1             │ pyasn1/pyasn1              │  5378303214 │
│ google-api-core    │ googleapis/python-api-core │  5022394699 │
│ importlib-metadata │ python/importlib_metadata  │  4353215364 │
└────────────────────┴────────────────────────────┴─────────────┘
10 rows in set. Elapsed: 0.260 sec. Processed 12.28 million rows, 935.69 MB (47.16 million rows/s., 3.59 GB/s.)
Peak memory usage: 1.02 GiB.
Let's have a look at a diagram that shows where the different bits of data reside.
Configuring permissions for remote querying #
What I want to do next is combine the query that finds PyPi projects with the one that returns GitHub metrics. The main challenge is that the PyPi data is on ClickHouse Cloud, while the GitHub metrics are on my machine.
I don't want to pollute the ClickHouse Cloud instance with my GitHub data, so instead I'm going to use the remoteSecure table function to query ClickHouse Cloud from my machine. In order to use this function to join the projects and pypi_downloads tables, we'll need to create a user that has the following permission:
GRANT CREATE TEMPORARY TABLE, REMOTE ON *.* TO <user>
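For reference, creating such a user looks something like this (a sketch run on the remote server; the name and password are placeholders):

CREATE USER mark IDENTIFIED BY 'my-password';
GRANT CREATE TEMPORARY TABLE, REMOTE ON *.* TO mark;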
Once I've created a user called mark on the ClickPy server that has this permission, we can return to our ClickHouse Local session and define a password as a parameter (parameters set with set param_<name> can then be referenced as {name:Type} in later queries):
set param_password = 'my-password';
Querying ClickHouse Cloud from ClickHouse Local #
And now we're going to run a version of the above query that finds the most popular PyPi projects, this time using the remoteSecure function.
SELECT name,
replaceOne(home_page, 'https://github.com/', '') AS repository,
sum(count) AS count
FROM remoteSecure(
'clickpy-clickhouse.clickhouse.com',
'pypi.pypi_downloads',
'mark', {password:String}
) AS pypi_downloads
INNER JOIN
(
SELECT name, argMax(home_page, version) AS home_page
FROM remoteSecure(
'clickpy-clickhouse.clickhouse.com',
'pypi.projects',
'mark', {password:String}
)
GROUP BY name
) AS projects ON projects.name = pypi_downloads.project
WHERE projects.home_page LIKE '%github%'
GROUP BY ALL
ORDER BY count DESC
LIMIT 10;
┌─name───────────────┬─repository─────────────────┬───────count─┐
│ boto3              │ boto/boto3                 │ 16031894410 │
│ botocore           │ boto/botocore              │ 11033306159 │
│ certifi            │ certifi/python-certifi     │  8606959885 │
│ s3transfer         │ boto/s3transfer            │  8575775398 │
│ python-dateutil    │ dateutil/dateutil          │  8144178765 │
│ charset-normalizer │ Ousret/charset_normalizer  │  5891178066 │
│ jmespath           │ jmespath/jmespath.py       │  5405618311 │
│ pyasn1             │ pyasn1/pyasn1              │  5378303214 │
│ google-api-core    │ googleapis/python-api-core │  5022394699 │
│ importlib-metadata │ python/importlib_metadata  │  4353215364 │
└────────────────────┴────────────────────────────┴─────────────┘
10 rows in set. Elapsed: 1.703 sec.
As we'd expect, we get the same results as before. This query takes a bit longer to run because, although the JOIN is done on the ClickPy server, we are initializing a new connection to the ClickPy server each time we run the query. We can check that the join is done remotely by prefixing the query with EXPLAIN PLAN, which will return the following:
┌─explain───────────────────────────────────┐
│ ReadFromRemote (Read from remote replica) │
└───────────────────────────────────────────┘
If the JOIN was being done locally, we would see a Join operator in the query plan.
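The same EXPLAIN PLAN prefix works on any of these queries; for instance, here it is against a simplified query that touches only pypi_downloads:

EXPLAIN PLAN
SELECT project, sum(count) AS count
FROM remoteSecure(
    'clickpy-clickhouse.clickhouse.com',
    'pypi.pypi_downloads',
    'mark', {password:String}
)
GROUP BY project;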
Joining data from ClickHouse Cloud with ClickHouse Local #
Next, let's join this data with the local GitHub dataset:
SELECT
projects.name,
replaceOne(home_page, 'https://github.com/', '') AS repository,
sum(count) AS count,
gh.stargazers_count AS stars
FROM remoteSecure(
'clickpy-clickhouse.clickhouse.com',
'pypi.pypi_downloads',
'mark', {password:String}
) AS pypi_downloads
INNER JOIN
(
SELECT name, argMax(home_page, version) AS home_page
FROM remoteSecure(
'clickpy-clickhouse.clickhouse.com',
'pypi.projects',
'mark', {password:String}
)
GROUP BY name
) AS projects ON projects.name = pypi_downloads.project
INNER JOIN
(
SELECT *
FROM file('data/*.json', JSONEachRow)
) AS gh ON gh.svn_url = projects.home_page
GROUP BY ALL
ORDER BY stars DESC
LIMIT 10;
This results in the following output:
┌─projects.name─────────────┬─repository───────────────┬─────count─┬──stars─┐
│ in-transformers           │ huggingface/transformers │       881 │ 116073 │
│ richads-transformers      │ huggingface/transformers │      1323 │ 116073 │
│ transformers-machinify    │ huggingface/transformers │       999 │ 116073 │
│ transformers-phobert      │ huggingface/transformers │      4550 │ 116073 │
│ transformers              │ huggingface/transformers │ 302008339 │ 116073 │
│ langchain                 │ langchain-ai/langchain   │  35657607 │  69585 │
│ langchain-by-johnsnowlabs │ langchain-ai/langchain   │       565 │  69585 │
│ langchain-core            │ langchain-ai/langchain   │   2440921 │  69585 │
│ gigachain-core            │ langchain-ai/langchain   │      4181 │  69585 │
│ langchain-community       │ langchain-ai/langchain   │   1438159 │  69585 │
│ gigachain-community       │ langchain-ai/langchain   │      1914 │  69585 │
│ yt-dlp-custom             │ yt-dlp/yt-dlp            │       948 │  60914 │
│ yt-dlp                    │ yt-dlp/yt-dlp            │  86175495 │  60914 │
│ keras                     │ keras-team/keras         │ 374424308 │  59836 │
│ keras-nightly             │ keras-team/keras         │  20349029 │  59836 │
│ symai-whisper             │ openai/whisper           │       790 │  51217 │
│ test10101010101           │ openai/whisper           │        46 │  51217 │
│ whisper-openai            │ openai/whisper           │     11486 │  51217 │
│ openai-whisper            │ openai/whisper           │   2029106 │  51217 │
│ localstack                │ localstack/localstack    │   3998353 │  50301 │
└───────────────────────────┴──────────────────────────┴───────────┴────────┘
20 rows in set. Elapsed: 3.704 sec. Processed 12.28 million rows, 950.66 MB (3.31 million rows/s., 256.66 MB/s.)
Peak memory usage: 339.80 MiB.
huggingface/transformers, langchain-ai/langchain, and openai/whisper are repeated several times. This is because there are different PyPi projects using the same GitHub repository as their homepage. Some of those look like genuinely different projects, but others seem to be abandoned forks of the main project.
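If you're curious how common this is, we can count how many packages share each homepage by running the following directly on the ClickPy server, reusing the deduplication subquery from earlier:

SELECT home_page, count() AS packages
FROM
(
    SELECT name, argMax(home_page, version) AS home_page
    FROM pypi.projects
    GROUP BY name
)
WHERE home_page LIKE '%github%'
GROUP BY home_page
ORDER BY packages DESC
LIMIT 10;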
This query takes almost 4 seconds to run because the result of the join of the projects and pypi_downloads tables is being streamed to my ClickHouse Local instance before the join with the GitHub data is done. We can see a diagram showing how this works below:
Apart from keras and transformers, most of the most-downloaded packages don't appear on our list. We can find out how many stars those packages have by changing the ORDER BY clause to sort by downloads instead of stars. We need to change the following line:
ORDER BY stars DESC
To be:
ORDER BY count DESC
And if we run the query with that change, we'll see the following output:
┌─projects.name──────┬─repository─────────────────┬───────count─┬─stars─┐
│ boto3              │ boto/boto3                 │ 16031894410 │  8440 │
│ botocore           │ boto/botocore              │ 11033306159 │  1352 │
│ certifi            │ certifi/python-certifi     │  8606959885 │   707 │
│ s3transfer         │ boto/s3transfer            │  8575775398 │   189 │
│ python-dateutil    │ dateutil/dateutil          │  8144178765 │  2164 │
│ charset-normalizer │ Ousret/charset_normalizer  │  5891178066 │   448 │
│ jmespath           │ jmespath/jmespath.py       │  5405618311 │  1975 │
│ pyasn1             │ pyasn1/pyasn1              │  5378303214 │    18 │
│ google-api-core    │ googleapis/python-api-core │  5022394699 │    98 │
│ importlib-metadata │ python/importlib_metadata  │  4353215364 │   101 │
└────────────────────┴────────────────────────────┴─────────────┴───────┘
10 rows in set. Elapsed: 3.957 sec. Processed 11.96 million rows, 941.07 MB (3.02 million rows/s., 237.81 MB/s.)
Peak memory usage: 336.19 MiB.
There's not much love for most of these projects on GitHub! The query still takes 4 seconds, but this time we can speed it up because we're sorting by a field that lives on the remote table. This means that we can restrict the number of records returned by the remote join, as shown in the diagram below:
Let's restrict the number of records to 1,000, as shown in the query below:
WITH pypiProjects AS (
SELECT home_page, projects.name, sum(count) AS count
FROM remoteSecure(
'clickpy-clickhouse.clickhouse.com',
'pypi.pypi_downloads',
'mark', {password:String}
) AS pypi_downloads
INNER JOIN
(
SELECT name, argMax(home_page, version) AS home_page
FROM remoteSecure(
'clickpy-clickhouse.clickhouse.com',
'pypi.projects',
'mark', {password:String}
)
GROUP BY name
) AS projects ON projects.name = pypi_downloads.project
GROUP BY ALL
ORDER BY count DESC
LIMIT 1000
)
SELECT
name,
replaceOne(home_page, 'https://github.com/', '') AS repository,
count,
gh.stargazers_count AS stars
FROM pypiProjects
INNER JOIN
(
SELECT *
FROM file('data/*.json', JSONEachRow)
) AS gh ON gh.svn_url = pypiProjects.home_page
GROUP BY ALL
ORDER BY count DESC
LIMIT 10;
┌─name───────────────┬─repository─────────────────┬───────count─┬─stars─┐
│ boto3              │ boto/boto3                 │ 16031894410 │  8440 │
│ botocore           │ boto/botocore              │ 11033306159 │  1352 │
│ certifi            │ certifi/python-certifi     │  8606959885 │   707 │
│ s3transfer         │ boto/s3transfer            │  8575775398 │   189 │
│ python-dateutil    │ dateutil/dateutil          │  8144178765 │  2164 │
│ charset-normalizer │ Ousret/charset_normalizer  │  5891178066 │   448 │
│ jmespath           │ jmespath/jmespath.py       │  5405618311 │  1975 │
│ pyasn1             │ pyasn1/pyasn1              │  5378303214 │    18 │
│ google-api-core    │ googleapis/python-api-core │  5022394699 │    98 │
│ importlib-metadata │ python/importlib_metadata  │  4353215364 │   101 │
└────────────────────┴────────────────────────────┴─────────────┴───────┘
10 rows in set. Elapsed: 1.758 sec. Processed 2.08 thousand rows, 14.97 MB (1.18 thousand rows/s., 8.51 MB/s.)
Peak memory usage: 448.22 MiB.
This time it takes just under 2 seconds because we aren't streaming as many records to ClickHouse Local before doing the join with the GitHub data. This isn't a perfect solution, however, because we could have ended up with fewer than 10 records if more than 990 of our 1,000 records didn't have a match in the GitHub dataset.
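To see how close we are to that edge, we could count how many of the 1,000 remote rows actually find a local match, reusing the same CTE:

WITH pypiProjects AS (
    SELECT home_page, projects.name, sum(count) AS count
    FROM remoteSecure(
        'clickpy-clickhouse.clickhouse.com',
        'pypi.pypi_downloads',
        'mark', {password:String}
    ) AS pypi_downloads
    INNER JOIN
    (
        SELECT name, argMax(home_page, version) AS home_page
        FROM remoteSecure(
            'clickpy-clickhouse.clickhouse.com',
            'pypi.projects',
            'mark', {password:String}
        )
        GROUP BY name
    ) AS projects ON projects.name = pypi_downloads.project
    GROUP BY ALL
    ORDER BY count DESC
    LIMIT 1000
)
SELECT count() AS matched
FROM pypiProjects
INNER JOIN
(
    SELECT *
    FROM file('data/*.json', JSONEachRow)
) AS gh ON gh.svn_url = pypiProjects.home_page;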
Summary #
And that's about it for now. I'd be curious to know what you all think. Can you see a real use case for this functionality? If so, let us know in the comments or on ClickHouse Slack.