Hacker News 向量搜索数据集 - ClickHouse Documentation

引言

Hacker News 数据集包含 2874 万条帖子及其嵌入向量。这些嵌入向量由 SentenceTransformers 模型 all-MiniLM-L6-v2 生成。每个嵌入向量的维度为 384。该数据集可用于讲解基于用户生成文本数据构建的大规模真实世界向量搜索应用在设计、容量规划和性能等方面的考量。

数据集详情

ClickHouse 将包含嵌入向量的完整数据集作为单个 Parquet 文件提供，文件位于一个 S3 bucket 中。我们建议用户先参考文档进行容量规划，以估算该数据集所需的存储和内存资源。

步骤

创建表

创建 hackernews 表，用于存储帖子、其嵌入向量及相关属性：

CREATE TABLE hackernews
(
    `id` Int32,
    `doc_id` Int32,
    `text` String,
    `vector` Array(Float32),
    `node_info` Tuple(
        start Nullable(UInt64),
        end Nullable(UInt64)),
    `metadata` String,
    `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by` LowCardinality(String),
    `time` DateTime,
    `title` String,
    `post_score` Int32,
    `dead` UInt8,
    `deleted` UInt8,
    `length` UInt32
)
ENGINE = MergeTree
ORDER BY id;

id 只是一个递增的整数。附加属性可在谓词中使用，以帮助理解如文档中所述的、结合后过滤/前过滤的向量相似性搜索

加载数据

要从 Parquet 文件中加载数据集，请运行以下 SQL 语句：

INSERT INTO hackernews SELECT * FROM s3('https://clickhouse-datasets.s3.amazonaws.com/hackernews-miniLM/hackernews_part_1_of_1.parquet');

向表中插入 2874 万行数据需要几分钟时间。

构建向量相似度索引

运行以下 SQL，在 hackernews 表的 vector 列上定义并构建向量相似度索引：

ALTER TABLE hackernews ADD INDEX vector_index vector TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 64, 512);

ALTER TABLE hackernews MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;

有关索引创建和搜索的参数及性能注意事项，请参见文档。上述语句中，HNSW 超参数 M 和 ef_construction 分别使用了 64 和 512。你需要通过评估索引构建时间和搜索结果质量，为这些参数谨慎选择最佳配置值。对于完整的 2874 万数据集，构建并保存索引甚至可能需要几分钟到几小时，具体取决于可用的 CPU 核心数量和存储带宽。

执行 ANN 搜索

向量相似度索引构建完成后，向量搜索查询会自动使用该索引：

Query

SELECT id, title, text
FROM hackernews
ORDER BY cosineDistance( vector, <search vector>)
LIMIT 10

首次将向量索引加载到内存中可能需要几秒到几分钟。

为搜索查询生成嵌入向量

Sentence Transformers 提供本地化、易于使用的嵌入模型，用于捕捉句子和段落的语义。此 HackerNews 数据集包含由 all-MiniLM-L6-v2 模型生成的向量嵌入。下方提供了一个示例 Python 脚本，演示如何使用 sentence_transformers Python 包以编程方式生成嵌入向量。搜索嵌入向量随后作为参数传入 SELECT 查询中的 cosineDistance() 函数。

from sentence_transformers import SentenceTransformer
import sys

import clickhouse_connect

print("Initializing...")

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client() # ClickHouse credentials here

while True:
    # 从用户获取搜索查询
    print("Enter a search query :")
    input_query = sys.stdin.readline();
    texts = [input_query]

    # 运行模型并获取搜索向量
    print("Generating the embedding for ", input_query);
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1':list(embeddings[0]), 'v2':20}
    result = chclient.query("SELECT id, title, text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)
    print("Results :")
    for row in result.result_rows:
        print(row[0], row[2][:100])
        print("---------")

以下展示了运行上述 Python 脚本及相似度搜索结果的示例 (每条结果仅打印前 20 个帖子各自的前 100 个字符) ：

正在初始化...

请输入搜索查询：
Are OLAP cubes useful

正在为"Are OLAP cubes useful"生成嵌入向量

正在查询 ClickHouse...

结果：

27742647 smartmic:
slt2021: OLAP Cube is not dead, as long as you use some form of:<p>1. GROUP BY multiple fi
---------
27744260 georgewfraser:A data mart is a logical organization of data to help humans understand the schema. Wh
---------
27761434 mwexler:&quot;We model data according to rigorous frameworks like Kimball or Inmon because we must r
---------
28401230 chotmat:
erosenbe0: OLAP database is just a copy, replica, or archive of data with a schema designe
---------
22198879 Merick:+1 for Apache Kylin, it&#x27;s a great project and awesome open source community. If anyone i
---------
27741776 crazydoggers:I always felt the value of an OLAP cube was uncovering questions you may not know to as
---------
22189480 shadowsun7:
_Codemonkeyism: After maintaining an OLAP cube system for some years, I&#x27;m not that
---------
27742029 smartmic:
gengstrand: My first exposure to OLAP was on a team developing a front end to Essbase that
---------
22364133 irfansharif:
simo7: I&#x27;m wondering how this technology could work for OLAP cubes.<p>An OLAP cube
---------
23292746 scoresmoke:When I was developing my pet project for Web analytics (<a href="https:&#x2F;&#x2F;github
---------
22198891 js8:It seems that the article makes a categorical error, arguing that OLAP cubes were replaced by co
---------
28421602 chotmat:
7thaccount: Is there any advantage to OLAP cube over plain SQL (large historical database r
---------
22195444 shadowsun7:
lkcubing: Thanks for sharing. Interesting write up.<p>While this article accurately capt
---------
22198040 lkcubing:Thanks for sharing. Interesting write up.<p>While this article accurately captures the issu
---------
3973185 stefanu:
sgt: Interesting idea. Ofcourse, OLAP isn't just about the underlying cubes and dimensions,
---------
22190903 shadowsun7:
js8: It seems that the article makes a categorical error, arguing that OLAP cubes were r
---------
28422241 sradman:OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the history of
---------
28421480 chotmat:
sradman: OLAP Cubes have been disrupted by Column Stores. Unless you are interested in the
---------
27742515 BadInformatics:
quantified: OP posts with inverted condition: “OLAP != OLAP Cube” is the actual titl
---------
28422935 chotmat:
rstuart4133: I remember hearing about OLAP cubes donkey&#x27;s years ago (probably not far
---------

摘要 Demo 应用

上述示例演示了如何使用 ClickHouse 进行语义搜索和文档检索。接下来将介绍一个简单却极具潜力的生成式 AI 示例应用。该应用程序执行以下步骤：

接收用户输入的 topic
使用 SentenceTransformers 和模型 all-MiniLM-L6-v2 为主题生成 embedding 向量
使用 hackernews 表上的向量相似度搜索来检索高度相关的帖子/评论
使用 LangChain 和 OpenAI gpt-3.5-turbo Chat API 对第 3 步中检索到的内容进行总结。第 3 步中检索到的帖子/评论会作为 上下文 传递给 Chat API，是 Generative AI 的关键一环。

下面首先列出运行摘要应用程序的示例，随后是该应用程序的完整代码。运行此应用程序需要在环境变量 OPENAI_API_KEY 中设置 OpenAI API key。OpenAI API key 可在 https://platform.openai.com 注册后获取。该应用程序展示了一个 Generative AI 用例，适用于多个企业领域，例如：客户情感分析、技术支持自动化、用户对话挖掘、法律文件、医疗记录、会议记录、财务报表等。

$ python3 summarize.py

Enter a search topic :
ClickHouse performance experiences

Generating the embedding for ---->  ClickHouse performance experiences

Querying ClickHouse to retrieve relevant articles...

Initializing chatgpt-3.5-turbo model...

Summarizing search results retrieved from ClickHouse...

Summary from chatgpt-3.5:
The discussion focuses on comparing ClickHouse with various databases like TimescaleDB, Apache Spark,
AWS Redshift, and QuestDB, highlighting ClickHouse's cost-efficient high performance and suitability
for analytical applications. Users praise ClickHouse for its simplicity, speed, and resource efficiency
in handling large-scale analytics workloads, although some challenges like DMLs and difficulty in backups
are mentioned. ClickHouse is recognized for its real-time aggregate computation capabilities and solid
engineering, with comparisons made to other databases like Druid and MemSQL. Overall, ClickHouse is seen
as a powerful tool for real-time data processing, analytics, and handling large volumes of data
efficiently, gaining popularity for its impressive performance and cost-effectiveness.

上述应用程序的代码：

print("Initializing...")

import sys
import json
import time
from sentence_transformers import SentenceTransformer

import clickhouse_connect

from langchain.docstore.document import Document
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
import textwrap
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    encoding = tiktoken.encoding_for_model(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

chclient = clickhouse_connect.get_client(compress=False) # ClickHouse credentials here

while True:
    # 从用户获取搜索查询
    print("Enter a search topic :")
    input_query = sys.stdin.readline();
    texts = [input_query]

    # 运行模型并获取搜索向量或参考向量
    print("Generating the embedding for ----> ", input_query);
    embeddings = model.encode(texts)

    print("Querying ClickHouse...")
    params = {'v1':list(embeddings[0]), 'v2':100}
    result = chclient.query("SELECT id,title,text FROM hackernews ORDER BY cosineDistance(vector, %(v1)s) LIMIT %(v2)s", parameters=params)

    # 拼接所有搜索结果
    doc_results = ""
    for row in result.result_rows:
        doc_results = doc_results + "\n" + row[2]

    print("Initializing chatgpt-3.5-turbo model")
    model_name = "gpt-3.5-turbo"

    text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
        model_name=model_name
    )

    texts = text_splitter.split_text(doc_results)

    docs = [Document(page_content=t) for t in texts]

    llm = ChatOpenAI(temperature=0, model_name=model_name)

    prompt_template = """
Write a concise summary of the following in not more than 10 sentences:

{text}

CONSCISE SUMMARY :
"""

    prompt = PromptTemplate(template=prompt_template, input_variables=["text"])

    num_tokens = num_tokens_from_string(doc_results, model_name)

    gpt_35_turbo_max_tokens = 4096
    verbose = False

    print("Summarizing search results retrieved from ClickHouse...")

    if num_tokens <= gpt_35_turbo_max_tokens:
        chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt, verbose=verbose)
    else:
        chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=prompt, combine_prompt=prompt, verbose=verbose)

    summary = chain.run(docs)

    print(f"Summary from chatgpt-3.5: {summary}")

​引言

​数据集详情

​步骤

​创建表

​加载数据

​构建向量相似度索引

​执行 ANN 搜索

​为搜索查询生成嵌入向量

​摘要 Demo 应用

引言