测试 DuckDB 的全文搜索扩展

Laurens Kuiper

2021-01-25 · 4 分钟阅读

TL;DR: DuckDB 现在拥有全文搜索功能，类似于 SQLite 中的 FTS5 扩展。主要区别在于我们的 FTS 扩展完全用 SQL 编写。我们使用 TREC 磁盘 4 和 5 对其进行了测试。

在数据库中搜索存储的文本数据可能很麻烦，因为 SQL 无法很好地表达诸如“给我所有关于绿头鸭的文档”之类的问题：使用 LIKE 的字符串模式只能达到有限的效果。尽管 SQL 在这方面存在不足，但在数据库中存储文本数据是很常见的。考虑表 products (id INTEGER, name VARCHAR, description VARCHAR) – 对于销售这些产品的网站来说，搜索 name 和 description 列会很有用。

我们期望搜索引擎在几毫秒内返回结果。长期以来，数据库不适合这项任务，因为它们无法以这种速度搜索大型倒排索引：事务型数据库系统不是为此用例而设计的。然而，分析型数据库系统可以跟上最先进的信息检索系统。公司 Spinque 就是一个很好的例子。在 Spinque，MonetDB 被用作定制搜索引擎的计算引擎。

DuckDB 的 FTS 实现遵循论文“老狗也能玩出新花样”。一个敏锐的观察是，数据库系统的进步（例如并行化）将“免费”加速您的搜索引擎！

好了，“为什么”说得够多了，让我们来谈谈“如何”。

准备数据

TREC 2004 鲁棒检索赛道在 TREC 磁盘 4 和 5 上有 250 个“主题”（搜索查询）。数据由许多以 SGML 格式存储的文本文件以及相应的 DTD（文档类型定义）文件组成。这种格式现在很少使用，但它类似于 XML。我们将使用 OpenSP 的命令行工具 osx 将其转换为 XML。因为文件很多，我写了一个 Bash 脚本

mkdir -p latimes/xml
for i in $(seq -w 1 9); do
    cat dtds/la.dtd latimes-$i | osx > latimes/xml/latimes-$i.xml
done

这会对 latimes 文件进行排序。对 fbis、cr、fr94 和 ft 文件重复此操作。

为了解析 XML，我使用了 BeautifulSoup。每个文档都有一个 docno 标识符和一个 text 字段。由于文档来自不同的来源，它们在其他字段上有所不同。我选择获取所有字段。

import duckdb
import multiprocessing
import pandas as pd
import re
from bs4 import BeautifulSoup as bs
from tqdm import tqdm

# fill variable 'files' with the path to each .xml file that we created here

def process_file(fpath):
    dict_list = []
    with open(fpath, 'r') as f:
        content = f.read()
        bs_content = bs(content, "html.parser")
        # find all 'doc' nodes
        for doc in bs_content.findChildren('doc', recursive=True):
            row_dict = {}
            for c in doc.findChildren(recursive=True):
                row_dict[c.name] = ''.join(c.findAll(text=True, recursive=False)).trim()
            dict_list.append(row_dict)
    return dict_list

# process documents (in parallel to speed things up)
pool = multiprocessing.Pool(multiprocessing.cpu_count())
list_of_dict_lists = []
for x in tqdm(pool.imap_unordered(process_file, files), total=len(files)):
    list_of_dict_lists.append(x)
pool.close()
pool.join()

# create pandas dataframe from the parsed data
documents_df = pd.DataFrame([x for sublist in list_of_dict_lists for x in sublist])

现在我们有了数据框，我们可以在 DuckDB 中注册它。

# create database connection and register the dataframe
con = duckdb.connect(database='db/trec04_05.db', read_only=False)
con.register('documents_df', documents_df)

# create a table from the dataframe so that it persists
con.execute("CREATE TABLE documents AS (SELECT * FROM documents_df)")
con.close()

这是我准备脚本的结尾，所以我关闭了数据库连接。

构建搜索引擎

现在我们可以使用 PRAGMA 语句构建倒排索引和检索模型。该扩展在此处有文档。我们在使用脚本创建的表 documents 或 main.documents 上创建一个索引表。标识我们文档的列名为 docno，我们希望在提供的字段上创建倒排索引。我通过使用 '*' 快捷方式提供了所有字段。

con = duckdb.connect(database='db/trec04_05.db', read_only=False)
con.execute("PRAGMA create_fts_index('documents', 'docno', '*', stopwords='english')")

在内部，会调用一个参数化 SQL 脚本。创建了模式 fts_main_documents，以及构成倒排索引的表 docs、terms、dict 和 stats。如果您好奇这看起来如何，请查看 DuckDB 源代码中 extension 文件夹下的源代码！

运行基准测试

数据现已完全准备好。现在我们想逐一运行基准测试中的查询。我们按如下方式加载主题文件

# the 'topics' file is not structured nicely, therefore we need parse some of it using regex
def after_tag(s, tag):
    m = re.findall(r'<' + tag + r'>([\s\S]*?)<.*>', s)
    return m[0].replace('\n', '').strip()

topic_dict = {}
with open('../../trec/topics', 'r') as f:
    bs_content = bs(f.read(), "lxml")
    for top in bs_content.findChildren('top'):
        top_content = top.getText()
        # we need the number and title of each topic
        num = after_tag(str(top), 'num').split(' ')[1]
        title = after_tag(str(top), 'title')
        topic_dict[num] = title

这给了我们一个字典，其中查询编号作为键，查询字符串作为值，例如 301 -> 'International Organized Crime'。

我们希望以特定格式存储结果，以便它们可以通过 trec eval 进行评估

# create a prepared statement to make querying our document collection easier
con.execute("""
    PREPARE fts_query AS (
        WITH scored_docs AS (
            SELECT *, fts_main_documents.match_bm25(docno, ?) AS score FROM documents)
        SELECT docno, score
        FROM scored_docs
        WHERE score IS NOT NULL
        ORDER BY score DESC
        LIMIT 1000)
    """)

# enable parallelism
con.execute('PRAGMA threads=32')
results = []
for query in topic_dict:
    q_str = topic_dict[query].replace('\'', ' ')
    con.execute("EXECUTE fts_query('" + q_str + "')")
    for i, row in enumerate(con.fetchall()):
        results.append(query + " Q0 " + row[0].trim() + " " + str(i) + " " + str(row[1]) + " STANDARD")
con.close()

with open('results', 'w+') as f:
    for r in results:
        f.write(r + '\n')

结果

现在我们已经创建了“结果”文件，我们可以使用 trec_eval 将它们与相关性评估 qrels 进行比较。

./trec_eval -m P.30 -m map qrels results

map                     all 0.2324
P_30                    all 0.2948

还不错！虽然这些结果不如 Anserini 可复现的结果高，但它们绝对是可以接受的。性能差异可以通过以下因素来解释：

使用了哪个词干提取器（我们使用了“porter”）
使用了哪些停用词（我们使用了 SMART 系统中使用的 571 个英语停用词列表）
预处理（去除重音、标点、数字）
BM25 参数（我们使用了默认的 k=1.2 和 b=0.75，非连词）
索引了哪些字段（我们通过提供 '*' 使用了所有列）

在我们机器上，每个查询的检索时间在 0.5 到 1.3 秒之间，这将在 DuckDB 的进一步改进中得到改善。我希望您喜欢阅读这篇博客，并受到启发也来测试这个扩展！

测试 DuckDB 的全文搜索扩展

准备数据

构建搜索引擎

运行基准测试

结果

本文中

近期文章

DuckLake 0.2

通过 GitHub 发现 DuckDB 用例