跳转到主要内容

简介

ClickHouse 没有 PIVOT 运算符,但我们可以借助聚合函数组合器,尤其是带有 -Map 后缀的组合器,实现类似的效果。 本文将介绍具体做法。 另外,还有一个涵盖相同内容的视频,你可以在下方观看:

理解聚合函数组合器

我们先从一个简单的示例开始。我们将使用 clickhouse-local,可通过运行以下命令启动它:
clickhouse -m --output_format_pretty_row_numbers=0
以下查询调用了 sumMap 函数,该函数接受一个 Map,并对其中每个键的值求和:
SELECT sumMap(map('ClickHouse', 1, 'ClickBench', 2));
┌─sumMap(map('ClickHouse', 1, 'ClickBench', 2))─┐
│ {'ClickBench':2,'ClickHouse':1}               │
└───────────────────────────────────────────────┘
这个示例没什么特别意思,因为它返回的就是我们传入的同一个 Map。 现在我们来对多行 Map 调用 sumMap
WITH values AS (
  SELECT map('ClickHouse', 3) AS value
  UNION ALL
  SELECT map('ClickBench', 2, 'ClickHouse', 4) AS value
)
SELECT sumMap(value)
FROM values;
┌─sumMap(value)───────────────────┐
│ {'ClickBench':2,'ClickHouse':7} │
└─────────────────────────────────┘
ClickHouse 在两行中都出现了,因此它对应的值被求和了。键 ClickBench 只在一行中出现,因此只对单个值求和,结果就是该值本身! 我们也可以使用 maxMap 来找出每个键的最大值:
WITH values AS (
  SELECT map('ClickHouse', 3) AS value
  UNION ALL
  SELECT map('ClickBench', 2, 'ClickHouse', 4) AS value
)
SELECT maxMap(value)
FROM values;
┌─maxMap(value)───────────────────┐
│ {'ClickBench':2,'ClickHouse':4} │
└─────────────────────────────────┘
或者,我们可以使用 avgMap 来计算每个键的平均值:
WITH values AS (
  SELECT map('ClickHouse', 3) AS value
  UNION ALL
  SELECT map('ClickBench', 2, 'ClickHouse', 4) AS value
)
SELECT avgMap(value)
FROM values;
┌─avgMap(value)─────────────────────┐
│ {'ClickBench':2,'ClickHouse':3.5} │
└───────────────────────────────────┘
希望这能让你大致了解这些函数组合器的工作方式。

实际应用:英国房价数据集

现在,我们要在更大的数据集上,在 ClickHouse SQL playground 中使用这些方法。 我们可以使用 clickhouse-client 连接到 playground:
clickhouse client -m \
  -h sql-clickhouse.clickhouse.com \
  -u demo \
  --secure
我们将查询 uk_price_paid 表,先来看看该表中的数据:
SELECT * FROM uk.uk_price_paid LIMIT 1 FORMAT Vertical;
Row 1:
──────
price:     145000
date:      2008-11-19
postcode1:
postcode2:
type:      semi-detached
is_new:    0
duration:  leasehold
addr1:
addr2:
street:    CURLEW DRIVE
locality:  SCARBOROUGH
town:      SCARBOROUGH
district:  SCARBOROUGH
county:    NORTH YORKSHIRE
category:  0
从上文可以看出,该表包含英国房产销售相关的各种字段。

按十年期分组并聚合

接下来,我们来计算数据集中每个十年期内按郡分组的房价中位数:
WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
    county,
    medianMap(map(year, price)) AS medianPrices
FROM uk.uk_price_paid
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices───────────────────────────────────────┐
 1. │ GREATER LONDON     │ {1990:89972.5,2000:215000,2010:381500,2020:485000} │
 2. │ TYNE AND WEAR      │ {1990:46500,2000:93000,2010:130000,2020:139000}    │
 3. │ WEST MIDLANDS      │ {1990:50000,2000:110000,2010:149950,2020:185000}   │
 4. │ GREATER MANCHESTER │ {1990:47000,2000:97000,2010:141171,2020:178000}    │
 5. │ MERSEYSIDE         │ {1990:46750,2000:94972.5,2010:128000,2020:149000}  │
 6. │ HERTFORDSHIRE      │ {1990:86500,2000:193000,2010:315000,2020:415000}   │
 7. │ WEST YORKSHIRE     │ {1990:48995,2000:99950,2010:139000,2020:164950}    │
 8. │ BRIGHTON AND HOVE  │ {1990:70000,2000:173000,2010:288000,2020:387000}   │
 9. │ DORSET             │ {1990:76500,2000:182000,2010:250000,2020:315000}   │
10. │ HAMPSHIRE          │ {1990:79950,2000:177500,2010:260000,2020:335000}   │
    └────────────────────┴────────────────────────────────────────────────────┘

过滤结果

我们可以对结果进行过滤,只保留 2010 年及之后的数据:
WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
    county,
    medianMap(map(year, price)) AS medianPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices────────────────┐
 1. │ GREATER LONDON     │ {2010:384975,2020:485919.5} │
 2. │ TYNE AND WEAR      │ {2010:130000,2020:140000}   │
 3. │ WEST MIDLANDS      │ {2010:146500,2020:185000}   │
 4. │ GREATER MANCHESTER │ {2010:140000,2020:177500}   │
 5. │ MERSEYSIDE         │ {2010:130000,2020:150000}   │
 6. │ HERTFORDSHIRE      │ {2010:315000,2020:415000}   │
 7. │ WEST YORKSHIRE     │ {2010:140000,2020:162500}   │
 8. │ BRIGHTON AND HOVE  │ {2010:287500,2020:387000}   │
 9. │ DORSET             │ {2010:255750,2020:315000}   │
10. │ HAMPSHIRE          │ {2010:265000,2020:330000}   │
    └────────────────────┴─────────────────────────────┘

组合多个聚合操作

如果我们想找出每个十年的最高价格,可以使用前面提到的 maxMap 函数:
WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
    county,
    medianMap(map(year, price)) AS medianPrices,
    maxMap(map(year, price)) AS maxPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices──────────────┬─maxPrices───────────────────────┐
 1. │ GREATER LONDON     │ {2010:385000,2020:485250} │ {2010:594300000,2020:630000000} │
 2. │ TYNE AND WEAR      │ {2010:130000,2020:141000} │ {2010:448300979,2020:93395000}  │
 3. │ WEST MIDLANDS      │ {2010:149000,2020:184250} │ {2010:415000000,2020:104500000} │
 4. │ GREATER MANCHESTER │ {2010:140000,2020:175000} │ {2010:107086856,2020:319186000} │
 5. │ MERSEYSIDE         │ {2010:129950,2020:150000} │ {2010:300000000,2020:93395000}  │
 6. │ HERTFORDSHIRE      │ {2010:315000,2020:415000} │ {2010:254325163,2020:93395000}  │
 7. │ WEST YORKSHIRE     │ {2010:138500,2020:165000} │ {2010:246300000,2020:109686257} │
 8. │ BRIGHTON AND HOVE  │ {2010:285000,2020:387000} │ {2010:200000000,2020:71540000}  │
 9. │ DORSET             │ {2010:250000,2020:315000} │ {2010:150000000,2020:20230000}  │
10. │ HAMPSHIRE          │ {2010:264000,2020:330000} │ {2010:150000000,2020:48482500}  │
    └────────────────────┴───────────────────────────┴─────────────────────────────────┘

对 map 值应用函数

或者,我们也可以使用 avgMap 来计算平均价格。 这些值的小数位很多,我们可以使用 mapApply 函数,对 map 中的每个值调用 floor 函数来将其向下取整,从而让结果更简洁:
WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
    county,
    medianMap(map(year, price)) AS medianPrices,
    mapApply((k, v) -> (k, floor(v)), avgMap(map(year, price))) AS avgPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10;
    ┌─county─────────────┬─medianPrices──────────────┬─avgPrices─────────────────┐
 1. │ GREATER LONDON     │ {2010:382000,2020:490000} │ {2010:626091,2020:807240} │
 2. │ TYNE AND WEAR      │ {2010:127000,2020:140000} │ {2010:176955,2020:225770} │
 3. │ WEST MIDLANDS      │ {2010:148500,2020:183000} │ {2010:204128,2020:257226} │
 4. │ GREATER MANCHESTER │ {2010:140000,2020:177500} │ {2010:195592,2020:251165} │
 5. │ MERSEYSIDE         │ {2010:127995,2020:150000} │ {2010:182194,2020:206062} │
 6. │ HERTFORDSHIRE      │ {2010:317500,2020:415000} │ {2010:414134,2020:529409} │
 7. │ WEST YORKSHIRE     │ {2010:140000,2020:164500} │ {2010:185121,2020:234870} │
 8. │ BRIGHTON AND HOVE  │ {2010:285000,2020:387000} │ {2010:372285,2020:527184} │
 9. │ DORSET             │ {2010:250000,2020:315000} │ {2010:305581,2020:370739} │
10. │ HAMPSHIRE          │ {2010:265000,2020:330000} │ {2010:335945,2020:425196} │
    └────────────────────┴───────────────────────────┴───────────────────────────┘

灵活分组:郡、行政区和邮政编码

我们来尝试按一些不同的字段分组。 这次我们将按郡和行政区分组,计算每个十年的房价中位数:
WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
    county,
    district,
    medianMap(map(year, price)) AS medianPrices
FROM uk.uk_price_paid
WHERE year >= 2010
GROUP BY ALL
ORDER BY max(price) DESC
LIMIT 10
    ┌─county─────────────┬─district───────────────┬─medianPrices────────────────┐
 1. │ GREATER LONDON     │ CROYDON                │ {2010:298475,2020:400000}   │
 2. │ GREATER LONDON     │ CITY OF WESTMINSTER    │ {2010:800000,2020:935000}   │
 3. │ GREATER LONDON     │ SOUTHWARK              │ {2010:437000,2020:540000}   │
 4. │ TYNE AND WEAR      │ NEWCASTLE UPON TYNE    │ {2010:144000,2020:162500}   │
 5. │ WEST MIDLANDS      │ WALSALL                │ {2010:137450,2020:162000}   │
 6. │ GREATER LONDON     │ CITY OF LONDON         │ {2010:725875,2020:840000}   │
 7. │ GREATER LONDON     │ HILLINGDON             │ {2010:329125,2020:439000}   │
 8. │ GREATER MANCHESTER │ MANCHESTER             │ {2010:144972.5,2020:190000} │
 9. │ GREATER LONDON     │ HAMMERSMITH AND FULHAM │ {2010:622250,2020:750000}   │
10. │ GREATER LONDON     │ ISLINGTON              │ {2010:500000,2020:640000}   │
    └────────────────────┴────────────────────────┴─────────────────────────────┘
我们也可以选择按年份分组,再在 map 中拼接 postcode1postcode2
WITH year(toStartOfInterval(date, toIntervalYear(10))) AS year
SELECT
    year,
    medianMap(map(postcode1 || ' ' || postcode2, price)) AS medianPrices
FROM uk.uk_price_paid
WHERE postcode1 LIKE 'NP1'
GROUP BY ALL;
   ┌─year─┬─medianPrices────────────────────────────────────────────────────────┐
1. │ 1990 │ {'NP1 4PB':9000}                                                    │
2. │ 2000 │ {'NP1 4SR':28475,'NP1 7HZ':200000}                                  │
3. │ 2010 │ {'NP1 4PB':5000,'NP1 4QJ':1075000,'NP1 4SR':58000,'NP1 8BR':200000} │
4. │ 2020 │ {'NP1 5DW':140000}                                                  │
   └──────┴─────────────────────────────────────────────────────────────────────┘
最后修改于 2026年6月10日