Writing queries in ClickHouse using GitHub data

This dataset contains all of the commits and changes for the ClickHouse repository. It can be generated with the native git-import tool distributed with ClickHouse.

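The generated data provides one tsv file for each of the following tables:

commits - commits with statistics.
file_changes - files changed in every commit, with information about the change and statistics.
line_changes - every changed line in every changed file in every commit, with full information about the line and about the previous change to that line.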

As of November 8, 2022, each TSV is approximately the following size and number of rows:

commits - 7.8M - about 62.78 thousand rows
file_changes - 53M - 266,051 rows
line_changes - 2.7G - 7,535,157 rows

Generating the data

This step is optional. We distribute the data freely; see Downloading and inserting the data.

git clone git@github.com:ClickHouse/ClickHouse.git
cd ClickHouse
clickhouse git-import --skip-paths 'generated\.cpp|^(contrib|docs?|website|libs/(libcityhash|liblz4|libdivide|libvectorclass|libdouble-conversion|libcpuid|libzstd|libfarmhash|libmetrohash|libpoco|libwidechar_width))/' --skip-commits-with-messages '^Merge branch '

As of November 8, 2022, this takes around 3 minutes to complete for the ClickHouse repository on a 2021 MacBook Pro.

A full list of available options can be obtained from the tool's built-in help.

clickhouse git-import -h

The help output also provides the DDL for each of the above tables, e.g.

CREATE TABLE git.commits
(
    hash String,
    author LowCardinality(String),
    time DateTime,
    message String,
    files_added UInt32,
    files_deleted UInt32,
    files_renamed UInt32,
    files_modified UInt32,
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

These queries should work on any repository. Feel free to explore and share your findings. Some guidelines on execution times (as of November 2022):

Linux - ~/clickhouse git-import - 160 mins

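Downloading and inserting the data

The following data can be used to reproduce a working environment. Alternatively, this dataset is available in play.clickhouse.com; see Queries for further details.

Generated files for the following repositories can be found below:

ClickHouse (November 8, 2022)
Linux (November 8, 2022)

To insert this data, prepare the database by executing the following queries: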

DROP DATABASE IF EXISTS git;
CREATE DATABASE git;

CREATE TABLE git.commits
(
    hash String,
    author LowCardinality(String),
    time DateTime,
    message String,
    files_added UInt32,
    files_deleted UInt32,
    files_renamed UInt32,
    files_modified UInt32,
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

CREATE TABLE git.file_changes
(
    change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
    path LowCardinality(String),
    old_path LowCardinality(String),
    file_extension LowCardinality(String),
    lines_added UInt32,
    lines_deleted UInt32,
    hunks_added UInt32,
    hunks_removed UInt32,
    hunks_changed UInt32,

    commit_hash String,
    author LowCardinality(String),
    time DateTime,
    commit_message String,
    commit_files_added UInt32,
    commit_files_deleted UInt32,
    commit_files_renamed UInt32,
    commit_files_modified UInt32,
    commit_lines_added UInt32,
    commit_lines_deleted UInt32,
    commit_hunks_added UInt32,
    commit_hunks_removed UInt32,
    commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

CREATE TABLE git.line_changes
(
    sign Int8,
    line_number_old UInt32,
    line_number_new UInt32,
    hunk_num UInt32,
    hunk_start_line_number_old UInt32,
    hunk_start_line_number_new UInt32,
    hunk_lines_added UInt32,
    hunk_lines_deleted UInt32,
    hunk_context LowCardinality(String),
    line LowCardinality(String),
    indent UInt8,
    line_type Enum('Empty' = 0, 'Comment' = 1, 'Punct' = 2, 'Code' = 3),

    prev_commit_hash String,
    prev_author LowCardinality(String),
    prev_time DateTime,

    file_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3, 'Rename' = 4, 'Copy' = 5, 'Type' = 6),
    path LowCardinality(String),
    old_path LowCardinality(String),
    file_extension LowCardinality(String),
    file_lines_added UInt32,
    file_lines_deleted UInt32,
    file_hunks_added UInt32,
    file_hunks_removed UInt32,
    file_hunks_changed UInt32,

    commit_hash String,
    author LowCardinality(String),
    time DateTime,
    commit_message String,
    commit_files_added UInt32,
    commit_files_deleted UInt32,
    commit_files_renamed UInt32,
    commit_files_modified UInt32,
    commit_lines_added UInt32,
    commit_lines_deleted UInt32,
    commit_hunks_added UInt32,
    commit_hunks_removed UInt32,
    commit_hunks_changed UInt32
) ENGINE = MergeTree ORDER BY time;

Insert the data using INSERT INTO SELECT and the s3 function. For example, below we insert the ClickHouse files into each of their respective tables:

commits

INSERT INTO git.commits SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/commits.tsv.xz', 'TSV', 'hash String,author LowCardinality(String), time DateTime, message String, files_added UInt32, files_deleted UInt32, files_renamed UInt32, files_modified UInt32, lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32')

0 rows in set. Elapsed: 1.826 sec. Processed 62.78 thousand rows, 8.50 MB (34.39 thousand rows/s., 4.66 MB/s.)

file_changes

INSERT INTO git.file_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/file_changes.tsv.xz', 'TSV', 'change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6), path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), lines_added UInt32, lines_deleted UInt32, hunks_added UInt32, hunks_removed UInt32, hunks_changed UInt32, commit_hash String, author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32')

0 rows in set. Elapsed: 2.688 sec. Processed 266.05 thousand rows, 48.30 MB (98.97 thousand rows/s., 17.97 MB/s.)

line_changes

INSERT INTO git.line_changes SELECT *
FROM s3('https://datasets-documentation.s3.amazonaws.com/github/commits/clickhouse/line_changes.tsv.xz', 'TSV', '    sign Int8, line_number_old UInt32, line_number_new UInt32, hunk_num UInt32, hunk_start_line_number_old UInt32, hunk_start_line_number_new UInt32, hunk_lines_added UInt32,\n    hunk_lines_deleted UInt32, hunk_context LowCardinality(String), line LowCardinality(String), indent UInt8, line_type Enum(\'Empty\' = 0, \'Comment\' = 1, \'Punct\' = 2, \'Code\' = 3), prev_commit_hash String, prev_author LowCardinality(String), prev_time DateTime, file_change_type Enum(\'Add\' = 1, \'Delete\' = 2, \'Modify\' = 3, \'Rename\' = 4, \'Copy\' = 5, \'Type\' = 6),\n    path LowCardinality(String), old_path LowCardinality(String), file_extension LowCardinality(String), file_lines_added UInt32, file_lines_deleted UInt32, file_hunks_added UInt32, file_hunks_removed UInt32, file_hunks_changed UInt32, commit_hash String,\n    author LowCardinality(String), time DateTime, commit_message String, commit_files_added UInt32, commit_files_deleted UInt32, commit_files_renamed UInt32, commit_files_modified UInt32, commit_lines_added UInt32, commit_lines_deleted UInt32, commit_hunks_added UInt32, commit_hunks_removed UInt32, commit_hunks_changed UInt32')

0 rows in set. Elapsed: 50.535 sec. Processed 7.54 million rows, 2.09 GB (149.11 thousand rows/s., 41.40 MB/s.)
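Once the three inserts have finished, a quick sanity check is to compare the row count of each table with the "Processed ... rows" figures reported above. The following is a minimal sketch that assumes the tables were created in the git database exactly as defined earlier; the table_name and row_count aliases are illustrative only.

```sql
-- Count the rows loaded into each table.
-- Assumes the git database and the three tables created above.
SELECT 'commits' AS table_name, count() AS row_count FROM git.commits
UNION ALL
SELECT 'file_changes', count() FROM git.file_changes
UNION ALL
SELECT 'line_changes', count() FROM git.line_changes;
```

The order of the returned rows is not guaranteed. If a count is far below the corresponding figure above, the insert was probably interrupted and can be repeated after truncating the table.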

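Queries

The tool suggests several queries via its help output. We have answered these, along with some supplementary questions of interest. The queries are ordered by approximately increasing complexity rather than in the tool's arbitrary order.

This dataset is available in play.clickhouse.com in the git_clickhouse database. We provide a link to this environment for all queries, adapting the database name as required. Note that play results may differ from those presented here because the data was collected at a different time.

History of a single file

This is the simplest of the queries. Here we look at all commit messages for StorageReplicatedMergeTree.cpp. Since the most recent ones are likely to be more interesting, we sort in reverse chronological order so that the newest messages come first.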
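Because results depend on when the data was collected, it can help to check the time window covered by the environment being queried. The sketch below assumes that the play environment exposes a commits table with the same schema under the git_clickhouse database; for a local setup, replace git_clickhouse with git. The column aliases are illustrative only.

```sql
-- Inspect the time range and size of the commits table.
-- Assumption: git_clickhouse.commits has the same columns as git.commits.
SELECT
    min(time) AS first_commit,
    max(time) AS last_commit,
    count() AS commits
FROM git_clickhouse.commits;
```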
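A minimal sketch of such a query is shown here. It assumes the git.file_changes table defined earlier and that the file is stored under the path src/Storages/StorageReplicatedMergeTree.cpp. The path, the 11-character hash prefix, and the LIMIT are illustrative choices.

```sql
-- Most recent changes to a single file, newest first.
-- Assumption: the path below matches how the file is recorded in file_changes.
SELECT
    time,
    substring(commit_hash, 1, 11) AS commit,
    change_type,
    author,
    path,
    old_path,
    lines_added,
    lines_deleted,
    commit_message
FROM git.file_changes
WHERE path = 'src/Storages/StorageReplicatedMergeTree.cpp'
ORDER BY time DESC
LIMIT 10;
```

Filtering on path alone only returns changes recorded under the file's current name; changes made before a rename are stored under old_path and would need a separate condition.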

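Generating the data

Downloading and inserting the data

Queries

History of a single file

Find the current active files

List files with most modifications

What day of the week do commits usually occur?

History of subdirectory/file - number of lines, commits and contributors over time

List files with maximum number of authors

Oldest lines of code in the repository

Files with longest history

Distribution of contributors with respect to docs and code over the month

Authors with the most diverse impact

Favorite files for an author

Largest files with lowest number of authors

Commits and lines of code distribution by time; by weekday, by author; for specific subdirectories

Matrix of authors that shows which authors tend to rewrite other authors' code

Who is the highest percentage contributor per day of week?

Distribution of code age across repository

What percentage of code for an author has been removed by other authors?

List files that were rewritten the most times?

What weekday does the code have the highest chance to stay in the repository?

Files sorted by average code age

Who tends to write more tests / CPP code / comments?

How do an author's commits change over time with respect to code/comments percentage?

What is the average time before code will be rewritten and the median (half-life of code decay)?

What is the worst time to write code, in the sense that the code has the highest chance of being rewritten?

Which author's code is the most sticky?

Most consecutive days of commits by author

Line by line commit history of a file

Unsolved questions

Git blame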
