DataStore Profiling - ClickHouse Documentation

The DataStore profiler lets you measure execution time and identify performance bottlenecks.

Quick start

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Enable profiling
config.enable_profiling()

# Run some operations
ds = pd.read_csv("large_data.csv")
result = (ds
    .filter(ds['amount'] > 100)
    .groupby('category')
    .agg({'amount': 'sum'})
    .sort('sum', ascending=False)
    .head(10)
    .to_df()
)

# Print the report
profiler = get_profiler()
print(profiler.report())

Enabling profiling

from chdb.datastore.config import config

# Enable profiling
config.enable_profiling()

# Disable profiling
config.disable_profiling()

# Check whether profiling is enabled
print(config.profiling_enabled)  # True or False

Profiler API

Getting the profiler

from chdb.datastore.config import get_profiler

profiler = get_profiler()

report()

Displays a performance report.

profiler.report(min_duration_ms=0.1)

Parameters:

Parameter	Type	Default	Description
`min_duration_ms`	float	`0.1`	Show only steps that took at least this many milliseconds

Example output:

======================================================================
EXECUTION PROFILE
======================================================================
   45.79ms (100.0%) Total Execution
     23.25ms ( 50.8%) Query Planning [ops_count=2]
     22.29ms ( 48.7%) SQL Segment 1 [ops=2]
       20.48ms ( 91.9%) SQL Execution
        1.74ms (  7.8%) Result to DataFrame
----------------------------------------------------------------------
      TOTAL:    45.79ms
======================================================================

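The report shows:

The duration of each step, in milliseconds
Each step's share of its parent step's time (of the total, for top-level steps)
The hierarchy of operations
Metadata for each step (for example, ops_count and ops)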
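
As a rough illustration of how the percentages relate to the hierarchy, the sketch below recomputes each step's share of its parent from a dictionary of dotted step names and durations. The helper name `share_of_parent` and the dictionary literal are illustrative assumptions, not part of the chdb API; the numbers are copied from the example report above.

```python
def share_of_parent(durations):
    """Return {step: percent of its parent's duration} for dotted step names.

    Top-level steps (no dot in the name) are reported as 100%.
    """
    shares = {}
    for name, ms in durations.items():
        parent = name.rpartition(".")[0]  # "" for top-level steps
        parent_ms = durations.get(parent, ms)
        shares[name] = 100.0 * ms / parent_ms if parent_ms else 0.0
    return shares


# Durations copied from the example report (milliseconds).
example = {
    "Total Execution": 45.79,
    "Total Execution.Query Planning": 23.25,
    "Total Execution.SQL Segment 1": 22.29,
    "Total Execution.SQL Segment 1.SQL Execution": 20.48,
    "Total Execution.SQL Segment 1.Result to DataFrame": 1.74,
}

shares = share_of_parent(example)
for name, pct in shares.items():
    print(f"{pct:5.1f}%  {name}")
```

Note that SQL Execution comes out at roughly 91.9% because it is measured against SQL Segment 1, not against the overall total.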

step()

Manually times a block of code.

with profiler.step("custom_operation"):
    # Your code goes here
    expensive_operation()
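
To make the mechanics concrete, here is a minimal, self-contained sketch of how a context-manager step timer of this kind can be built with the standard library. It is not chdb's implementation: `MiniProfiler` and its methods are hypothetical names chosen to mirror the calls on this page, and recording nested steps under dotted names is an assumption about the naming scheme.

```python
import time
from contextlib import contextmanager


class MiniProfiler:
    """Toy stand-in for a step profiler (hypothetical, not the chdb class)."""

    def __init__(self):
        self._stack = []      # names of the steps that are currently open
        self._durations = {}  # dotted step name -> elapsed milliseconds

    @contextmanager
    def step(self, name):
        self._stack.append(name)
        key = ".".join(self._stack)
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the block raises.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self._durations[key] = self._durations.get(key, 0.0) + elapsed_ms
            self._stack.pop()

    def summary(self):
        return dict(self._durations)

    def clear(self):
        self._durations.clear()


mini = MiniProfiler()
with mini.step("load"):
    with mini.step("parse"):
        total = sum(range(100_000))  # stand-in for real work

for name, ms in mini.summary().items():
    print(f"{name}: {ms:.2f}ms")
```

The `try`/`finally` is the important design choice: a step that fails still shows up in the timings, which is often exactly the step worth inspecting.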

clear()

Clears all profiling data.

profiler.clear()

summary()

Returns a dictionary that maps step names to durations in milliseconds.

summary = profiler.summary()
for name, duration in summary.items():
    print(f"{name}: {duration:.2f}ms")

Example output:

Total Execution: 45.79ms
Total Execution.Cache Check: 0.00ms
Total Execution.Query Planning: 23.25ms
Total Execution.SQL Segment 1: 22.29ms
Total Execution.SQL Segment 1.SQL Execution: 20.48ms
Total Execution.SQL Segment 1.Result to DataFrame: 1.74ms
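
Because the summary is plain data, it is easy to post-process. The sketch below uses a hard-coded dictionary copied from the output above rather than a live profiler; it ranks steps by duration and drops those below a threshold, similar in spirit to min_duration_ms. `slowest_steps` is a hypothetical helper name.

```python
def slowest_steps(summary, min_duration_ms=0.1, limit=3):
    """Return up to `limit` (name, ms) pairs, slowest first, skipping fast steps."""
    kept = [(name, ms) for name, ms in summary.items() if ms >= min_duration_ms]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:limit]


# Values copied from the example output; a real run would pass profiler.summary().
summary = {
    "Total Execution": 45.79,
    "Total Execution.Cache Check": 0.00,
    "Total Execution.Query Planning": 23.25,
    "Total Execution.SQL Segment 1": 22.29,
    "Total Execution.SQL Segment 1.SQL Execution": 20.48,
    "Total Execution.SQL Segment 1.Result to DataFrame": 1.74,
}

top = slowest_steps(summary)
for name, ms in top:
    print(f"{ms:8.2f}ms  {name}")
```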

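Understanding the report

Step names

Step name	Description
`Total Execution`	Overall execution time
`Query Planning`	Time spent planning the query
`SQL Segment N`	Execution of SQL segment N
`SQL Execution`	Execution of the actual SQL query
`Result to DataFrame`	Conversion of the result to a pandas DataFrame
`Cache Check`	Query cache lookup
`Cache Write`	Writing the result to the cache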

Execution time

Planning step (Query Planning): usually fast
Execution step (SQL Execution): where the actual work happens
Transfer step (Result to DataFrame): where the data is converted to pandas

Identifying bottlenecks

======================================================================
EXECUTION PROFILE
======================================================================
  200.50ms (100.0%) Total Execution
    10.25ms (  5.1%) Query Planning [ops_count=4]
   190.00ms ( 94.8%) SQL Segment 1 [ops=4]
     185.00ms ( 97.4%) SQL Execution    <- Main bottleneck
       5.00ms (  2.6%) Result to DataFrame
----------------------------------------------------------------------
      TOTAL:   200.50ms
======================================================================
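
One way to automate this reading of the report is to start at the root and repeatedly descend into the slowest child until a leaf is reached. The sketch below does that over a dictionary of dotted step names, using the durations from the report above. `find_bottleneck` is a hypothetical helper, and the dotted-name layout is assumed to match what summary() returns.

```python
def find_bottleneck(summary, root="Total Execution"):
    """Follow the slowest child at each level and return the leaf reached."""
    current = root
    while True:
        prefix = current + "."
        # Direct children: names that extend `current` by exactly one segment.
        children = {
            name: ms
            for name, ms in summary.items()
            if name.startswith(prefix) and "." not in name[len(prefix):]
        }
        if not children:
            return current, summary[current]
        current = max(children, key=children.get)


# Durations copied from the report above (milliseconds).
report = {
    "Total Execution": 200.50,
    "Total Execution.Query Planning": 10.25,
    "Total Execution.SQL Segment 1": 190.00,
    "Total Execution.SQL Segment 1.SQL Execution": 185.00,
    "Total Execution.SQL Segment 1.Result to DataFrame": 5.00,
}

name, ms = find_bottleneck(report)
print(f"Bottleneck: {name} ({ms:.2f}ms)")
```

This greedy descent mirrors how a person scans the indented report. It can miss a slow leaf that sits under a smaller branch, so treat the result as a starting point rather than a verdict.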

Profiling patterns

Profiling a single query

config.enable_profiling()
profiler = get_profiler()
profiler.clear()  # Clear previous data

# Run the query
result = ds.filter(...).groupby(...).agg(...).to_df()

# Show the profile for this query
print(profiler.report())

Profiling multiple queries

config.enable_profiling()
profiler = get_profiler()
profiler.clear()

# Query 1
with profiler.step("Query 1"):
    result1 = query1.to_df()

# Query 2
with profiler.step("Query 2"):
    result2 = query2.to_df()

print(profiler.report())

Comparing approaches

profiler = get_profiler()

# Approach 1: filter, then group
profiler.clear()
with profiler.step("filter_then_groupby"):
    result1 = ds.filter(ds['x'] > 10).groupby('y').sum().to_df()
summary1 = profiler.summary()
time1 = summary1.get('filter_then_groupby', 0)

# Approach 2: group, then filter
profiler.clear()
with profiler.step("groupby_then_filter"):
    result2 = ds.groupby('y').sum().filter(ds['x'] > 10).to_df()
summary2 = profiler.summary()
time2 = summary2.get('groupby_then_filter', 0)

print(f"Approach 1: {time1:.2f}ms")
print(f"Approach 2: {time2:.2f}ms")
print(f"Winner: {'Approach 1' if time1 < time2 else 'Approach 2'}")
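
A single run can be skewed by caching or warm-up effects, so a comparison is more trustworthy when each approach is run several times. The following sketch is independent of the DataStore profiler: it times any zero-argument callable with time.perf_counter and reports the median. `median_ms` is a hypothetical helper, and the two functions are placeholder workloads standing in for the real pipelines.

```python
import statistics
import time


def median_ms(fn, repeats=5):
    """Run `fn` `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Placeholder workloads that compute the same result in two ways.
# In practice each would wrap a pipeline that ends in .to_df().
def approach_1():
    return sum(i for i in range(50_000) if i % 2 == 0)


def approach_2():
    return sum(range(0, 50_000, 2))


time1 = median_ms(approach_1)
time2 = median_ms(approach_2)
print(f"Approach 1: {time1:.2f}ms")
print(f"Approach 2: {time2:.2f}ms")
```

The median is used instead of the mean so that one unusually slow run does not dominate the comparison.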

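Optimization tips

1. Check SQL execution time

If SQL Execution is the bottleneck:

Add filters to reduce the amount of data
Use Parquet instead of CSV
Check that appropriate indexes are in place (for database sources)

2. Check I/O time

If read_csv or read_parquet is the bottleneck:

Use Parquet (columnar and compressible)
Read only the columns you need
Filter at the source where possible

3. Check data transfer

If to_df is slow:

The result set may be too large
Add filters or limit the number of rows
Use head() for previews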

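4. Compare engines

from chdb.datastore.config import config, get_profiler

config.enable_profiling()
profiler = get_profiler()

# Profile with chdb (`query` is a DataStore pipeline built beforehand)
config.use_chdb()
profiler.clear()
result_chdb = query.to_df()
time_chdb = profiler.total_duration_ms

# Profile with pandas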
config.use_pandas()
profiler.clear()
result_pandas = query.to_df()
time_pandas = profiler.total_duration_ms

print(f"chdb: {time_chdb:.2f}ms")
print(f"pandas: {time_pandas:.2f}ms")

Best practices

1. Profile before optimizing

# Measure first instead of guessing!
config.enable_profiling()
result = your_query.to_df()
print(get_profiler().report())

2. Clear between tests

profiler.clear()  # Clear previous data
# Run the test
print(profiler.report())

3. Use min_duration_ms to focus on slow steps

# Show only operations that took 100 ms or longer
profiler.report(min_duration_ms=100)

4. Profile representative data

# Profile with realistic data sizes.
# Small test datasets may not reveal the real bottlenecks.
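
If production data is not available, a synthetic file of realistic size is a reasonable substitute. The sketch below writes a CSV with the standard library only. The column names (`region`, `category`, `amount`, `quantity`), the value ranges, and the `make_sales_csv` helper are all illustrative assumptions to adapt to your own schema.

```python
import csv
import os
import random
import tempfile


def make_sales_csv(path, rows, seed=0):
    """Write `rows` synthetic sales records to `path`, after a header line."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same file
    regions = ["north", "south", "east", "west"]
    categories = ["books", "games", "tools"]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["region", "category", "amount", "quantity"])
        for _ in range(rows):
            writer.writerow([
                rng.choice(regions),
                rng.choice(categories),
                round(rng.uniform(1, 5000), 2),
                rng.randint(1, 20),
            ])


# Small row count for demonstration; raise it to match production volume.
csv_path = os.path.join(tempfile.mkdtemp(), "synthetic_sales.csv")
make_sales_csv(csv_path, rows=1_000)
print(f"Wrote {csv_path}")
```

Uniformly random values will not reproduce the skew of real data, so results from a file like this are only a first approximation.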

5. Disable in production

# Development
config.enable_profiling()

# Production
config.disable_profiling()  # Avoid the profiling overhead
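
One way to avoid hard-coding this choice is to derive it from the environment. The sketch below reads a made-up variable named `DATASTORE_PROFILING` (nothing on this page suggests chdb reads it itself) and returns a boolean that the caller can use to decide whether to call config.enable_profiling() or config.disable_profiling(). `profiling_requested` is a hypothetical helper.

```python
import os


def profiling_requested(environ=os.environ):
    """Return True when DATASTORE_PROFILING is set to a truthy value.

    DATASTORE_PROFILING is a made-up variable name for this example.
    """
    value = environ.get("DATASTORE_PROFILING", "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


# Profiling stays off unless explicitly requested, which is the safe
# default for production.
if profiling_requested():
    print("profiling on")   # here you would call config.enable_profiling()
else:
    print("profiling off")  # here you would call config.disable_profiling()
```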

Example: a complete profiling session

from chdb import datastore as pd
from chdb.datastore.config import config, get_profiler

# Setup
config.enable_profiling()
config.enable_debug()  # Also show what is happening under the hood
profiler = get_profiler()

# Load the data
profiler.clear()
print("=== Loading Data ===")
ds = pd.read_csv("sales_2024.csv")  # 10 million rows
print(profiler.report())

# Query 1: simple filter
profiler.clear()
print("\n=== Query 1: Simple Filter ===")
result1 = ds.filter(ds['amount'] > 1000).to_df()
print(profiler.report())

# Query 2: complex aggregation
profiler.clear()
print("\n=== Query 2: Complex Aggregation ===")
result2 = (ds
    .filter(ds['amount'] > 100)
    .groupby('region', 'category')
    .agg({
        'amount': ['sum', 'mean', 'count'],
        'quantity': 'sum'
    })
    .sort('sum', ascending=False)
    .head(20)
    .to_df()
)
print(profiler.report())

# Summary
print("\n=== Summary ===")
print(f"Query 1: {len(result1)} rows")
print(f"Query 2: {len(result2)} rows")
