azureBlobStorage Table Function - ClickHouse Documentation

Provides a table-like interface for querying and inserting files in Azure Blob Storage. Similar to the s3 function.

Syntax

Connection string

Credentials are embedded in the connection string, so account_name and account_key do not need to be passed separately:

azureBlobStorage(connection_string, container_name, blobpath [, format, compression, structure])

Storage account URL

account_name and account_key must be passed as separate arguments:

azureBlobStorage(storage_account_url, container_name, blobpath, account_name, account_key [, format, compression, structure])

Named collection

For the full list of supported keys, see Named collections below:

azureBlobStorage(named_collection[, option=value [,..]])

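Arguments

Argument	Description
`connection_string`	A connection string with embedded credentials (account name plus account key, or a SAS token). When using this form, `account_name` and `account_key` should not be passed separately. See Configure a connection string.
`storage_account_url`	The storage account endpoint URL, for example `https://myaccount.blob.core.windows.net/`. When using this form, `account_name` and `account_key` must also be passed.
`container_name`	Container name.
`blobpath`	File path. Supports the following wildcards in read-only mode: `*`, `**`, `?`, `{abc,def}` and `{N..M}`, where `N` and `M` are numbers and `'abc'` and `'def'` are strings.
`account_name`	Storage account name. Required when using `storage_account_url` without SAS; must not be passed when using `connection_string`.
`account_key`	Storage account key. Required when using `storage_account_url` without SAS; must not be passed when using `connection_string`.
`format`	The format of the file.
`compression`	Supported values: `none`, `gzip/gz`, `brotli/br`, `xz/LZMA`, `zstd/zst`. By default, compression is detected automatically from the file extension (same as setting `auto`).
`structure`	Structure of the table. Format: `'column1_name column1_type, column2_name column2_type, ...'`.
`partition_strategy`	Optional. Supported values: `WILDCARD` or `HIVE`. `WILDCARD` requires a `{_partition_id}` placeholder in the path, which is replaced with the partition key. `HIVE` does not allow wildcards, assumes the path is the table root, and generates Hive-style partitioned directories with Snowflake IDs as file names and the file format as the extension. Defaults to `WILDCARD`.
`partition_columns_in_data_file`	Optional. Only used with the `HIVE` partition strategy. Tells ClickHouse whether to expect partition columns to be written in the data file. Defaults to `false`.
`extra_credentials`	Use `client_id` and `tenant_id` for authentication. If `extra_credentials` is provided, it takes priority over `account_name` and `account_key`.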
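
The `blobpath` wildcards listed in the table can be illustrated with a short sketch. The blob names below (`data/part_1.parquet` and so on, and the `logs/` prefix) are hypothetical, and the account name and key are placeholders; only the wildcard syntax itself comes from the table above.

```sql
-- Assumed layout: data/part_1.parquet, data/part_2.parquet, data/part_3.parquet
-- {N..M} expands to every number in the range, so this reads all three files.
SELECT count()
FROM azureBlobStorage(
    'https://myaccount.blob.core.windows.net/',
    'mycontainer',
    'data/part_{1..3}.parquet',
    'myaccount',
    'mykey...==',
    'Parquet'
);

-- {abc,def} selects one of the listed strings; ** also matches '/' and so
-- descends into nested directories, while * stays within one path segment.
SELECT count()
FROM azureBlobStorage(
    'https://myaccount.blob.core.windows.net/',
    'mycontainer',
    'logs/{2023,2024}/**.csv',
    'myaccount',
    'mykey...==',
    'CSVWithNames'
);
```

Because wildcards are supported in read-only mode, patterns like these are meant for SELECT queries rather than INSERT.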

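Named collections

Arguments can also be passed using named collections. In this case the following keys are supported:

Key	Required	Description
`container`	Yes	Container name. Corresponds to the positional argument `container_name`.
`blob_path`	Yes	File path (optionally with wildcards). Corresponds to the positional argument `blobpath`.
`connection_string`	No*	Connection string with embedded credentials. *One of `connection_string` or `storage_account_url` must be provided.
`storage_account_url`	No*	Storage account endpoint URL. *One of `connection_string` or `storage_account_url` must be provided.
`account_name`	No	Required when using `storage_account_url`.
`account_key`	No	Required when using `storage_account_url`.
`format`	No	File format.
`compression`	No	Compression type.
`structure`	No	Table structure.
`client_id`	No	Client ID for authentication.
`tenant_id`	No	Tenant ID for authentication.

Named collection key names differ from the positional argument names of the function: container (not container_name) and blob_path (not blobpath).

Example: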

CREATE NAMED COLLECTION azure_my_data AS
    storage_account_url = 'https://myaccount.blob.core.windows.net/',
    container = 'mycontainer',
    blob_path = 'data/*.parquet',
    account_name = 'myaccount',
    account_key = 'mykey...==',
    format = 'Parquet';

SELECT *
FROM azureBlobStorage(azure_my_data)
LIMIT 5;

You can also override named collection values at query time:

SELECT *
FROM azureBlobStorage(azure_my_data, blob_path = 'other_data/*.csv', format = 'CSVWithNames')
LIMIT 5;

Returned value

A table with the specified structure for reading data from or writing data to the specified file.
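
When `structure` is omitted, ClickHouse infers the columns from the file itself. As a minimal sketch, assuming the `azure_my_data` named collection defined in the Named collections example, the resulting structure can be inspected without reading any rows:

```sql
-- Shows the column names and types that ClickHouse infers (or that were
-- given explicitly via structure) for the files matched by the collection.
DESCRIBE TABLE azureBlobStorage(azure_my_data);
```

This is a convenient check before running a larger query, since a mismatch between the expected and inferred types shows up here first.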

Examples

Reading with the `storage_account_url` form

SELECT *
FROM azureBlobStorage(
    'https://myaccount.blob.core.windows.net/',
    'mycontainer',
    'data/*.parquet',
    'myaccount',
    'mykey...==',
    'Parquet'
)
LIMIT 5;

Reading with the `connection_string` form

SELECT *
FROM azureBlobStorage(
    'DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey...==;EndPointSuffix=core.windows.net',
    'mycontainer',
    'data/*.csv',
    'CSVWithNames'
)
LIMIT 5;

Writing data using partitions

INSERT INTO TABLE FUNCTION azureBlobStorage(
    'DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey...==;EndPointSuffix=core.windows.net',
    'mycontainer',
    'test_{_partition_id}.csv',
    'CSV',
    'auto',
    'column1 UInt32, column2 UInt32, column3 UInt32'
) PARTITION BY column3
VALUES (1, 2, 3), (3, 2, 1), (78, 43, 3);

Then read back a specific partition:

SELECT *
FROM azureBlobStorage(
    'DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey...==;EndPointSuffix=core.windows.net',
    'mycontainer',
    'test_1.csv',
    'CSV',
    'auto',
    'column1 UInt32, column2 UInt32, column3 UInt32'
);

┌─column1─┬─column2─┬─column3─┐
│       3 │       2 │       1 │
└─────────┴─────────┴─────────┘
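
To read every partition back in one query, a wildcard can replace the partition ID in the path. This is a sketch that reuses the placeholder connection string from the INSERT above; given the partition values written there (1 and 3), it should match `test_1.csv` and `test_3.csv`.

```sql
-- _file is a virtual column, so each row shows which partition file it came from.
SELECT _file, *
FROM azureBlobStorage(
    'DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey...==;EndPointSuffix=core.windows.net',
    'mycontainer',
    'test_*.csv',
    'CSV',
    'auto',
    'column1 UInt32, column2 UInt32, column3 UInt32'
)
ORDER BY _file;
```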

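Virtual columns

_path: Path to the file. Type: LowCardinality(String).
_file: Name of the file. Type: LowCardinality(String).
_size: Size of the file in bytes. Type: Nullable(UInt64). If the file size is unknown, the value is NULL.
_time: Last modified time of the file. Type: Nullable(DateTime). If the time is unknown, the value is NULL.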
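
The virtual columns can be selected like ordinary columns. The following sketch assumes the `azure_my_data` named collection from the Named collections example and produces one summary row per matched file; the aliases `size_bytes`, `modified`, and `row_count` are illustrative names only.

```sql
-- Virtual columns are constant within a file, so any() returns the
-- per-file value while count() gives the number of rows read from it.
SELECT
    _path,
    any(_size) AS size_bytes,
    any(_time) AS modified,
    count()    AS row_count
FROM azureBlobStorage(azure_my_data)
GROUP BY _path
ORDER BY size_bytes DESC;
```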

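Partitioned write

Partition strategy

Supported for INSERT queries only.

WILDCARD (default): replaces the {_partition_id} wildcard in the file path with the actual partition key.

HIVE: implements Hive-style partitioning for reads and writes. It generates files in the following format: <prefix>/<key1=val1/key2=val2...>/<snowflakeid>.<toLower(file_format)>.

Example of the HIVE partition strategy: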

INSERT INTO TABLE FUNCTION azureBlobStorage(
    azure_conf2,
    storage_account_url = 'https://myaccount.blob.core.windows.net/',
    container = 'cont',
    blob_path = 'azure_table_root',
    format = 'CSVWithNames',
    compression = 'auto',
    structure = 'year UInt16, country String, id Int32',
    partition_strategy = 'hive'
) PARTITION BY (year, country)
VALUES (2020, 'Russia', 1), (2021, 'Brazil', 2);

SELECT _path, * FROM azureBlobStorage(
    azure_conf2,
    storage_account_url = 'https://myaccount.blob.core.windows.net/',
    container = 'cont',
    blob_path = 'azure_table_root/**.csvwithnames'
)

   ┌─_path───────────────────────────────────────────────────────────────────────────┬─id─┬─year─┬─country─┐
1. │ cont/azure_table_root/year=2021/country=Brazil/7351307847391293440.csvwithnames │  2 │ 2021 │ Brazil  │
2. │ cont/azure_table_root/year=2020/country=Russia/7351307847378710528.csvwithnames │  1 │ 2020 │ Russia  │
   └─────────────────────────────────────────────────────────────────────────────────┴────┴──────┴─────────┘

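use_hive_partitioning setting

This setting is a hint for ClickHouse to parse Hive-style partitioned files at read time. It has no effect on writing. For symmetrical reads and writes, use the partition_strategy argument.

When use_hive_partitioning is set to 1, ClickHouse detects Hive-style partitioning in the path (/name=value/) and allows partition columns to be used as virtual columns in the query. These virtual columns have the same names as in the partitioned path.

Example using virtual columns created by Hive-style partitioning: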

SELECT *
FROM azureBlobStorage(
    config,
    storage_account_url = '...',
    container = '...',
    blob_path = 'http://data/path/date=*/country=*/code=*/*.parquet'
)
WHERE date > '2020-01-01' AND country = 'Netherlands' AND code = 42;
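
The setting can also be applied to a single query through a SETTINGS clause. The sketch below assumes the `azure_conf2` named collection and the `azure_table_root` data written in the HIVE partition strategy example; the explicit `format` override is an added precaution rather than something the source requires.

```sql
-- year and country are taken from the path segments year=.../country=...,
-- so they can be filtered on even though they come from the directory layout.
SELECT id, year, country
FROM azureBlobStorage(
    azure_conf2,
    storage_account_url = 'https://myaccount.blob.core.windows.net/',
    container = 'cont',
    blob_path = 'azure_table_root/**.csvwithnames',
    format = 'CSVWithNames'
)
WHERE year = 2021
SETTINGS use_hive_partitioning = 1;
```

Scoping the setting to the query keeps the session defaults unchanged, which can be useful when other queries in the same session read paths that merely look like Hive partitions.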

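Using Shared Access Signatures (SAS)

A Shared Access Signature (SAS) is a URI that grants restricted access to an Azure Storage container or file. Use it to provide time-limited access to storage account resources without sharing your storage account key. For more details, see the Azure Storage documentation on shared access signatures.

The azureBlobStorage function supports Shared Access Signatures (SAS).

A Blob SAS token contains all the information needed to authenticate the request, including the target blob, permissions, and validity period. To construct a blob URL, append the SAS token to the blob service endpoint. For example, if the endpoint is https://clickhousedocstest.blob.core.windows.net/, the request becomes: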

SELECT count()
FROM azureBlobStorage('BlobEndpoint=https://clickhousedocstest.blob.core.windows.net/;SharedAccessSignature=sp=r&st=2025-01-29T14:58:11Z&se=2025-01-29T22:58:11Z&spr=https&sv=2022-11-02&sr=c&sig=Ac2U0xl4tm%2Fp7m55IilWl1yHwk%2FJG0Uk6rMVuOiD0eE%3D', 'exampledatasets', 'example.csv')

┌─count()─┐
│      10 │
└─────────┘

1 row in set. Elapsed: 0.425 sec.

Alternatively, users can use the generated Blob SAS URL:

SELECT count()
FROM azureBlobStorage('https://clickhousedocstest.blob.core.windows.net/?sp=r&st=2025-01-29T14:58:11Z&se=2025-01-29T22:58:11Z&spr=https&sv=2022-11-02&sr=c&sig=Ac2U0xl4tm%2Fp7m55IilWl1yHwk%2FJG0Uk6rMVuOiD0eE%3D', 'exampledatasets', 'example.csv')

┌─count()─┐
│      10 │
└─────────┘

1 row in set. Elapsed: 0.153 sec.

Related

AzureBlobStorage table engine
