Debugging DataStore

DataStore provides a comprehensive set of debugging tools for understanding and optimizing your data pipelines.

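Debugging Tools Overview

Tool	Purpose	When to use
`explain()`	View the execution plan	To understand which SQL will be executed
Profiler	Measure performance	To find slow operations
Logging	View execution details	To debug unexpected behavior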

Quick Decision Matrix

Goal	Tool	Command
View the execution plan	`explain()`	`ds.explain()`
Measure performance	Profiler	`config.enable_profiling()`
Debug SQL queries	Logging	`config.enable_debug()`
All of the above	Combined	See below

Quick Setup

Enable All Debugging

from chdb import datastore as pd
from chdb.datastore.config import config

# Enable all debugging
config.enable_debug()        # Verbose logging
config.enable_profiling()    # Performance tracking

ds = pd.read_csv("data.csv")
result = ds.filter(ds['age'] > 25).groupby('city').agg({'salary': 'mean'})

# View execution plan
result.explain()

# Get profiler report
from chdb.datastore.config import get_profiler
profiler = get_profiler()
profiler.report()

explain() Method

View a query's execution plan before running it.

ds = pd.read_csv("data.csv")

query = (ds
    .filter(ds['amount'] > 1000)
    .groupby('region')
    .agg({'amount': ['sum', 'mean']})
)

# View plan
query.explain()

Output:

Pipeline:
  Source: file('data.csv', 'CSVWithNames')
  Filter: amount > 1000
  GroupBy: region
  Aggregate: sum(amount), avg(amount)

Generated SQL:
SELECT region, SUM(amount) AS sum, AVG(amount) AS mean
FROM file('data.csv', 'CSVWithNames')
WHERE amount > 1000
GROUP BY region

See the explain() documentation for details.

Profiling

Measure the execution time of each operation.

from chdb.datastore.config import config, get_profiler

# Enable profiling
config.enable_profiling()

# Run operations
ds = pd.read_csv("large_data.csv")
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('category')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# View report
profiler = get_profiler()
profiler.report(min_duration_ms=0.1)

Output:

Performance Report
==================
Step                          Duration    Calls
----                          --------    -----
read_csv                      1.234s      1
filter                        0.002s      1
groupby                       0.001s      1
agg                           0.089s      1
sort                          0.045s      1
head                          0.001s      1
to_df (SQL execution)         0.567s      1
----                          --------    -----
Total                         1.939s      7

See the Profiling Guide for details.
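
Once a report like the one above is long, it can help to rank the steps by duration. The following is a minimal sketch that assumes you have the report available as plain text in the layout shown above; the helper names `parse_report` and `slowest_steps` are hypothetical and are not part of DataStore.

```python
import re

# Sample text copied from the report above. In practice, supply your own
# captured report text (how you capture it is up to you).
REPORT = """\
Step                          Duration    Calls
----                          --------    -----
read_csv                      1.234s      1
filter                        0.002s      1
groupby                       0.001s      1
agg                           0.089s      1
sort                          0.045s      1
head                          0.001s      1
to_df (SQL execution)         0.567s      1
----                          --------    -----
Total                         1.939s      7
"""

# A data row is: step name, two or more spaces, seconds with an "s" suffix,
# then the call count. Header and rule lines do not match this pattern.
ROW = re.compile(
    r"^(?P<step>\S.*?)\s{2,}(?P<secs>\d+(?:\.\d+)?)s\s+(?P<calls>\d+)\s*$"
)


def parse_report(text):
    """Return (step, seconds, calls) tuples, skipping the Total row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match and match["step"] != "Total":
            rows.append((match["step"], float(match["secs"]), int(match["calls"])))
    return rows


def slowest_steps(text, top=3):
    """Return the `top` steps with the largest duration, slowest first."""
    return sorted(parse_report(text), key=lambda row: row[1], reverse=True)[:top]


for step, secs, calls in slowest_steps(REPORT):
    print(f"{step:<24} {secs:.3f}s  ({calls} call(s))")
```

In this sample, `read_csv` and `to_df (SQL execution)` account for most of the total, so those are the steps worth investigating first.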

Logging

View detailed execution logs.

from chdb.datastore.config import config

# Enable debug logging
config.enable_debug()

# Run operations - logs will show:
# - SQL queries generated
# - Execution engine used
# - Cache hits/misses
# - Timing information

Example log output:

DEBUG - DataStore: Creating from file 'data.csv'
DEBUG - Query: SELECT region, SUM(amount) FROM ... WHERE amount > 1000 GROUP BY region
DEBUG - Engine: Using chdb for aggregation
DEBUG - Execution time: 0.089s
DEBUG - Cache: Storing result (key: abc123)

See Logging Configuration for details.
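
When debug output is noisy, you may want to collect only the lines that report generated SQL. The sketch below uses Python's standard `logging` module and rests on two assumptions that you should verify against your installation: that DataStore emits its messages through standard `logging`, and that the logger is named `chdb.datastore` (a hypothetical name used here for illustration). The two `debug()` calls at the end are stand-ins that mimic the sample log above, so the sketch runs without chdb installed.

```python
import logging


class ListHandler(logging.Handler):
    """Collect log messages in memory for later inspection."""

    def __init__(self):
        super().__init__(level=logging.DEBUG)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


# ASSUMPTION: hypothetical logger name; check which logger your version uses.
LOGGER_NAME = "chdb.datastore"


def capture_logs(logger_name=LOGGER_NAME):
    """Attach a ListHandler to the named logger and return the handler."""
    handler = ListHandler()
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return handler


def query_lines(handler):
    """Return only the messages that start with 'Query:'."""
    return [msg for msg in handler.messages if msg.startswith("Query:")]


handler = capture_logs()

# Stand-ins for real library output, modeled on the sample log above.
log = logging.getLogger(LOGGER_NAME)
log.debug("Query: SELECT region, SUM(amount) FROM ... WHERE amount > 1000 GROUP BY region")
log.debug("Execution time: 0.089s")

for line in query_lines(handler):
    print(line)
```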

Common Debugging Scenarios

1. Query does not return the expected results

# Step 1: View the execution plan
query = ds.filter(ds['age'] > 25).groupby('city').sum()
query.explain(verbose=True)

# Step 2: Enable logging to see SQL
config.enable_debug()

# Step 3: Run and check logs
result = query.to_df()

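2. Query runs slowly

# Step 1: Enable profiling
from chdb.datastore.config import config, get_profiler
config.enable_profiling()

# Step 2: Run your query (process_data() stands in for your own pipeline)
result = process_data()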

# Step 3: Check profiler report
profiler = get_profiler()
profiler.report()

# Step 4: Identify slow operations and optimize

3. Understanding engine selection

# Enable verbose logging
config.enable_debug()

# Run operations
result = ds.filter(ds['x'] > 10).apply(custom_func)

# Logs will show which engine was used for each operation:
# DEBUG - filter: Using chdb engine
# DEBUG - apply: Using pandas engine (custom function)

4. Debugging cache issues

# Enable debug to see cache operations
config.enable_debug()

# First run
result1 = ds.filter(ds['x'] > 10).to_df()
# LOG: Cache miss, executing query

# Second run (should use cache)
result2 = ds.filter(ds['x'] > 10).to_df()
# LOG: Cache hit, returning cached result

# If not caching when expected, check:
# - Are operations identical?
# - Is cache enabled? config.cache_enabled
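
To see why the "are operations identical?" check matters, consider a toy cache keyed by a hash of the operation list. This is an illustrative sketch only, not DataStore's actual cache implementation; `cache_key` and `TinyCache` are hypothetical names.

```python
import hashlib
import json


def cache_key(operations):
    """Hash an ordered list of [operation, argument] pairs into a short key."""
    payload = json.dumps(operations, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


class TinyCache:
    """Toy result cache that counts hits and misses."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, operations, compute):
        key = cache_key(operations)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute()
        return self._store[key]


cache = TinyCache()
cache.get_or_compute([["filter", "x > 10"]], lambda: "rows")    # miss: first run
cache.get_or_compute([["filter", "x > 10"]], lambda: "rows")    # hit: identical operations
cache.get_or_compute([["filter", "x > 10.0"]], lambda: "rows")  # miss: the description differs
print(cache.hits, cache.misses)
```

In a scheme like this, even a small difference in how an operation is written yields a different key, which would show up as an unexpected cache miss in the logs.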

Best Practices

1. Debug in development, not in production

import logging
from chdb.datastore.config import config

# Development
config.enable_debug()
config.enable_profiling()

# Production
config.set_log_level(logging.WARNING)
config.set_profiling_enabled(False)
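
The development/production split above could be driven by an environment variable so the same code runs in both places. The sketch below calls only the four `config` methods shown above; the variable name `APP_ENV` and the helper `configure_debugging` are hypothetical, and `RecordingConfig` is a stand-in that records calls so the example runs without chdb installed.

```python
import logging
import os


def configure_debugging(config, environ=None):
    """Enable debugging only when APP_ENV (hypothetical) is 'development'."""
    environ = os.environ if environ is None else environ
    is_dev = environ.get("APP_ENV", "production") == "development"
    if is_dev:
        config.enable_debug()
        config.enable_profiling()
    else:
        config.set_log_level(logging.WARNING)
        config.set_profiling_enabled(False)
    return is_dev


class RecordingConfig:
    """Stand-in for the real config object: records each method call."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the config methods.
        def record(*args):
            self.calls.append((name, args))
        return record


demo = RecordingConfig()
configure_debugging(demo, {"APP_ENV": "development"})
print(demo.calls)
```

With chdb installed, you would pass the real object instead: `configure_debugging(config)` after `from chdb.datastore.config import config`. Defaulting to production when the variable is unset keeps verbose logging off unless it is requested explicitly.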

2. Use explain() before running large queries

# Build query
query = ds.filter(...).groupby(...).agg(...)

# Check plan first
query.explain()

# If plan looks good, execute
result = query.to_df()

3. Profile before optimizing

# Don't guess what's slow - measure it
config.enable_profiling()
result = your_pipeline()
get_profiler().report()
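
If you also want to time code that sits outside DataStore (for example, post-processing of a result), a small timer can serve as a cross-check. This is a generic standard-library sketch, independent of the built-in profiler; the `timed` helper and the `timings` dictionary are hypothetical names.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per label.
timings = {}


@contextmanager
def timed(label):
    """Add the elapsed time of the enclosed block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + (time.perf_counter() - start)


# Placeholder workloads; replace them with the stages of your own pipeline.
with timed("build"):
    data = [n * n for n in range(100_000)]
with timed("aggregate"):
    total = sum(data)

# Print the stages slowest first.
for label, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{label:<12} {seconds * 1000:.2f} ms")
```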

4. Check the SQL when results are unexpected

# View generated SQL
print(query.to_sql())

# Compare with expected SQL
# Run SQL directly in ClickHouse to verify
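
Comparing generated SQL with expected SQL by eye is error-prone when the two differ only in line breaks or indentation. The sketch below normalizes whitespace and then diffs the queries token by token using the standard library. The helper names are hypothetical, and the sample strings are modeled on the SQL shown earlier; in practice the first argument would be the string returned by `query.to_sql()`.

```python
import difflib


def normalize_sql(sql):
    """Collapse runs of whitespace so layout differences are ignored."""
    return " ".join(sql.split())


def sql_diff(generated, expected):
    """Return unified-diff lines over whitespace-separated tokens."""
    left = normalize_sql(generated).split(" ")
    right = normalize_sql(expected).split(" ")
    return list(difflib.unified_diff(left, right, "generated", "expected", lineterm=""))


generated = """SELECT region, SUM(amount) AS sum
FROM file('data.csv', 'CSVWithNames')
WHERE amount > 1000
GROUP BY region"""

expected = "SELECT region, SUM(amount) AS sum FROM file('data.csv', 'CSVWithNames') WHERE amount > 100 GROUP BY region"

# Lines starting with "-" appear only in the generated SQL, "+" only in the expected SQL.
for line in sql_diff(generated, expected):
    print(line)
```

An empty diff means the two queries match apart from whitespace. Note that this simple normalization also collapses whitespace inside string literals, so treat it as a quick check rather than a full SQL comparison.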

Debugging Tools Summary

Tool	Command	Output
Explain plan	`ds.explain()`	Execution steps + SQL
Verbose explain	`ds.explain(verbose=True)`	+ Metadata
View SQL	`ds.to_sql()`	SQL query string
Enable debug	`config.enable_debug()`	Detailed logs
Enable profiling	`config.enable_profiling()`	Timing data
Profiler report	`get_profiler().report()`	Performance summary
Reset profiler	`get_profiler().reset()`	Clears timing data

Next Steps

explain() Method - Detailed documentation of execution plans
Profiling Guide - Performance measurement
Logging Configuration - Log levels and formats
