SQL on Apache Arrow – DuckDB

SQL on Apache Arrow

DuckDB can query several different types of Apache Arrow objects.

Apache Arrow Tables

Arrow Tables stored in local variables can be queried as if they were regular tables within DuckDB.

import duckdb
import pyarrow as pa

# connect to an in-memory database
con = duckdb.connect()

my_arrow_table = pa.Table.from_pydict({'i': [1, 2, 3, 4],
                                       'j': ["one", "two", "three", "four"]})

# query the Apache Arrow Table "my_arrow_table" and return as an Arrow Table
results = con.execute("SELECT * FROM my_arrow_table WHERE i = 2").arrow()

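Apache Arrow Datasets

Arrow Datasets stored as variables can also be queried as if they were regular tables. Datasets are useful for pointing to a directory of Parquet files in order to analyze large datasets. DuckDB pushes column selections and row filters down into the dataset scan, so that only the necessary data is pulled into memory.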

import duckdb
import pyarrow as pa
import tempfile
import pathlib
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# connect to an in-memory database
con = duckdb.connect()

my_arrow_table = pa.Table.from_pydict({'i': [1, 2, 3, 4],
                                       'j': ["one", "two", "three", "four"]})

# create example Parquet files and save in a folder
base_path = pathlib.Path(tempfile.gettempdir())
(base_path / "parquet_folder").mkdir(exist_ok = True)
pq.write_to_dataset(my_arrow_table, str(base_path / "parquet_folder"))

# link to Parquet files using an Arrow Dataset
my_arrow_dataset = ds.dataset(str(base_path / 'parquet_folder/'))

# query the Apache Arrow Dataset "my_arrow_dataset" and return as an Arrow Table
results = con.execute("SELECT * FROM my_arrow_dataset WHERE i = 2").arrow()

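Apache Arrow Scanners

Arrow Scanners stored as variables can also be queried as if they were regular tables. Scanners read over a dataset and select specific columns or apply row-wise filtering. This is similar to how DuckDB pushes column selections and filters down into an Arrow Dataset, but it uses Arrow compute operations instead. Arrow can use asynchronous I/O to quickly access files.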

import duckdb
import pyarrow as pa
import tempfile
import pathlib
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.compute as pc

# connect to an in-memory database
con = duckdb.connect()

my_arrow_table = pa.Table.from_pydict({'i': [1, 2, 3, 4],
                                       'j': ["one", "two", "three", "four"]})

# create example Parquet files and save in a folder
base_path = pathlib.Path(tempfile.gettempdir())
(base_path / "parquet_folder").mkdir(exist_ok = True)
pq.write_to_dataset(my_arrow_table, str(base_path / "parquet_folder"))

# link to Parquet files using an Arrow Dataset
my_arrow_dataset = ds.dataset(str(base_path / 'parquet_folder/'))

# define the filter to be applied while scanning
# equivalent to "WHERE i = 2"
scanner_filter = (pc.field("i") == pc.scalar(2))

arrow_scanner = ds.Scanner.from_dataset(my_arrow_dataset, filter = scanner_filter)

# query the Apache Arrow scanner "arrow_scanner" and return as an Arrow Table
results = con.execute("SELECT * FROM arrow_scanner").arrow()

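Apache Arrow RecordBatchReaders

Arrow RecordBatchReaders are readers for Arrow's streaming binary format and can also be queried directly as if they were tables. This streaming format is useful when sending Arrow data for tasks like interprocess communication or communicating between language runtimes.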

import duckdb
import pyarrow as pa

# connect to an in-memory database
con = duckdb.connect()

my_recordbatch = pa.RecordBatch.from_pydict({'i': [1, 2, 3, 4],
                                             'j': ["one", "two", "three", "four"]})

my_recordbatchreader = pa.ipc.RecordBatchReader.from_batches(my_recordbatch.schema, [my_recordbatch])

# query the Apache Arrow RecordBatchReader "my_recordbatchreader" and return as an Arrow Table
results = con.execute("SELECT * FROM my_recordbatchreader WHERE i = 2").arrow()
