Conversion between DuckDB and Python

This page documents the rules for converting Python objects to DuckDB and DuckDB results to Python.

Object Conversion: Python Object to DuckDB

This is a mapping of Python object types to DuckDB logical types:

None → NULL
bool → BOOLEAN
datetime.timedelta → INTERVAL
str → VARCHAR
bytearray → BLOB
memoryview → BLOB
decimal.Decimal → DECIMAL / DOUBLE
uuid.UUID → UUID

The rest of the conversion rules are as follows.

`int`

Since integers can be of arbitrary size in Python, a one-to-one conversion is not possible for ints. Instead, we perform these casts in order until one succeeds:

BIGINT
INTEGER
UBIGINT
UINTEGER
DOUBLE

When using the DuckDB Value class, it is possible to set a target type, which will influence the conversion.
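
The "try each cast in order" rule can be modeled in plain Python. This is only a sketch of the order listed above, not DuckDB's internal code; `INTEGER_RANGES` and `first_fitting_type` are hypothetical names introduced for illustration, and the type DuckDB actually reports for a given value may differ.

```python
# Inclusive value ranges of the integer types, in the order listed above.
INTEGER_RANGES = [
    ("BIGINT", -2**63, 2**63 - 1),
    ("INTEGER", -2**31, 2**31 - 1),
    ("UBIGINT", 0, 2**64 - 1),
    ("UINTEGER", 0, 2**32 - 1),
]


def first_fitting_type(value: int) -> str:
    """Return the first listed type whose range contains value (hypothetical helper)."""
    for name, low, high in INTEGER_RANGES:
        if low <= value <= high:
            return name
    # No integer type fits: fall back to DOUBLE, which may lose precision.
    return "DOUBLE"


for sample in (42, -5, 2**63, 2**64):
    print(sample, "->", first_fitting_type(sample))
```

In this model, values that overflow the signed 64-bit range still fit UBIGINT when they are non-negative, and only values beyond every integer range end up as DOUBLE.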

`float`

These casts are tried in order until one succeeds:

DOUBLE
FLOAT

`datetime.datetime`

For datetime, we check pandas.isnull if it is available and return NULL if it returns true. We check against datetime.datetime.min and datetime.datetime.max to convert to -inf and +inf respectively.

If the datetime has tzinfo, we use TIMESTAMPTZ; otherwise it becomes TIMESTAMP.

`datetime.time`

If the time has tzinfo, we use TIMETZ; otherwise it becomes TIME.

`datetime.date`

date converts to the DATE type. We check against datetime.date.min and datetime.date.max to convert to -inf and +inf respectively.
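
The rules for datetime.datetime, datetime.time, and datetime.date can be summarized in a small standard-library sketch. It is an illustrative model rather than DuckDB's implementation: `describe_temporal` is a hypothetical helper, the pandas.isnull check is left out, and the infinite values are represented by the strings "-infinity" and "infinity".

```python
import datetime as dt


def describe_temporal(value):
    """Return (duckdb_type, special_value) for a datetime, time, or date.

    special_value is "-infinity", "infinity", or None for ordinary values.
    """
    # datetime must be tested before date: datetime is a subclass of date.
    if isinstance(value, dt.datetime):
        type_name = "TIMESTAMPTZ" if value.tzinfo is not None else "TIMESTAMP"
        if value == dt.datetime.min:
            return type_name, "-infinity"
        if value == dt.datetime.max:
            return type_name, "infinity"
        return type_name, None
    if isinstance(value, dt.time):
        return ("TIMETZ" if value.tzinfo is not None else "TIME"), None
    if isinstance(value, dt.date):
        if value == dt.date.min:
            return "DATE", "-infinity"
        if value == dt.date.max:
            return "DATE", "infinity"
        return "DATE", None
    raise TypeError(f"not a temporal value: {value!r}")


print(describe_temporal(dt.datetime(2024, 1, 1, tzinfo=dt.timezone.utc)))
print(describe_temporal(dt.date.max))
```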

`bytes`

bytes converts to BLOB by default. When it is used to construct a Value object of type BITSTRING, it maps to BITSTRING instead.

`list`

list becomes a LIST type of the "most permissive" type of its children, for example:

my_list_value = [
    12345,
    "test"
]

becomes VARCHAR[], because 12345 can be converted to VARCHAR but test cannot be converted to INTEGER.

[12345, test]

`dict`

dict objects can convert to either STRUCT(...) or MAP(..., ...) depending on their structure. If the dict has a structure similar to:

import duckdb

my_map_dict = {
    "key": [
        1, 2, 3
    ],
    "value": [
        "one", "two", "three"
    ]
}

duckdb.values(my_map_dict)

then we convert it to a MAP of key-value pairs of the two lists zipped together. The example above becomes a MAP(INTEGER, VARCHAR):

┌─────────────────────────┐
│ {1=one, 2=two, 3=three} │
│  map(integer, varchar)  │
├─────────────────────────┤
│ {1=one, 2=two, 3=three} │
└─────────────────────────┘

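If the dict is returned by a function, the function will return a MAP, so the return_type of the function must be specified. Providing a return type that cannot be converted to a MAP raises an error: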

import duckdb
duckdb_conn = duckdb.connect()

def get_map() -> dict[str,list[str]|list[int]]:
    return {
        "key": [
            1, 2, 3
        ],
        "value": [
            "one", "two", "three"
        ]
    }

duckdb_conn.create_function("get_map", get_map, return_type=dict[int, str])

duckdb_conn.sql("select get_map()").show()

duckdb_conn.create_function("get_map_error", get_map)

duckdb_conn.sql("select get_map_error()").show()

┌─────────────────────────┐
│        get_map()        │
│  map(bigint, varchar)   │
├─────────────────────────┤
│ {1=one, 2=two, 3=three} │
└─────────────────────────┘

ConversionException: Conversion Error: Type VARCHAR can't be cast as UNION(u1 VARCHAR[], u2 BIGINT[]). VARCHAR can't be implicitly cast to any of the union member types: VARCHAR[], BIGINT[]

The names of the fields matter, and the two lists need to have the same size.

Otherwise, we try to convert it to a STRUCT.

import duckdb

my_struct_dict = {
    1: "one",
    "2": 2,
    "three": [1, 2, 3],
    False: True
}

duckdb.values(my_struct_dict)

becomes:

┌────────────────────────────────────────────────────────────────────┐
│      {'1': 'one', '2': 2, 'three': [1, 2, 3], 'False': true}       │
│ struct("1" varchar, "2" integer, three integer[], "false" boolean) │
├────────────────────────────────────────────────────────────────────┤
│ {'1': one, '2': 2, 'three': [1, 2, 3], 'False': true}              │
└────────────────────────────────────────────────────────────────────┘

If the dict is returned by a function, the function will return a MAP due to automatic conversion. To return a STRUCT, the return_type must be provided:

import duckdb
from duckdb.typing import BOOLEAN, INTEGER, VARCHAR
from duckdb import list_type, struct_type

duckdb_conn = duckdb.connect()

my_struct_dict = {
    1: "one",
    "2": 2,
    "three": [1, 2, 3],
    False: True
}

def get_struct() -> dict[str|int|bool,str|int|list[int]|bool]:
    return my_struct_dict

duckdb_conn.create_function("get_struct_as_map", get_struct)

duckdb_conn.sql("select get_struct_as_map()").show()

duckdb_conn.create_function("get_struct", get_struct, return_type=struct_type({
    1: VARCHAR,
    "2": INTEGER,
    "three": list_type(duckdb.typing.INTEGER),
    False: BOOLEAN
}))

duckdb_conn.sql("select get_struct()").show()

┌──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                         get_struct_as_map()                                          │
│ map(union(u1 varchar, u2 bigint, u3 boolean), union(u1 varchar, u2 bigint, u3 bigint[], u4 boolean)) │
├──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {1=one, 2=2, three=[1, 2, 3], false=true}                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│                            get_struct()                            │
│ struct("1" varchar, "2" integer, three integer[], "false" boolean) │
├────────────────────────────────────────────────────────────────────┤
│ {'1': one, '2': 2, 'three': [1, 2, 3], 'False': true}              │
└────────────────────────────────────────────────────────────────────┘

Every key of the dictionary is converted to a string.
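
The MAP-versus-STRUCT decision for values passed directly to DuckDB can be sketched in plain Python. This is a model of the rules described in this section, not DuckDB's code: `classify_dict` is a hypothetical helper, and it assumes that a dict qualifies as a MAP only when its keys are exactly "key" and "value", both entries are lists, and the lists have equal length. Anything else, including lists of unequal length, is treated here as a STRUCT.

```python
def classify_dict(obj: dict):
    """Return ("MAP", pairs) or ("STRUCT", fields) following the rules above."""
    keys, values = obj.get("key"), obj.get("value")
    has_map_shape = (
        set(obj) == {"key", "value"}
        and isinstance(keys, list)
        and isinstance(values, list)
        and len(keys) == len(values)
    )
    if has_map_shape:
        # Zip the two lists into key-value pairs.
        return "MAP", dict(zip(keys, values))
    # Otherwise every key is converted to a string and becomes a field name.
    return "STRUCT", {str(k): v for k, v in obj.items()}


my_map_dict = {"key": [1, 2, 3], "value": ["one", "two", "three"]}
my_struct_dict = {1: "one", "2": 2, "three": [1, 2, 3], False: True}

print(classify_dict(my_map_dict))
print(classify_dict(my_struct_dict))
```

The second call shows why the STRUCT field in the output above is named 'False': the key False is stringified like every other key.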

`tuple`

tuple converts to LIST by default. When it is used to construct a Value object of type STRUCT, it converts to STRUCT instead.

`numpy.ndarray` and `numpy.datetime64`

ndarray and datetime64 are converted by calling tolist() and converting the result.

Result Conversion: DuckDB Results to Python

DuckDB's Python client provides multiple additional methods that can be used to efficiently retrieve data.

NumPy

fetchnumpy() fetches the data as a dictionary of NumPy arrays

Pandas

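df() fetches the data as a Pandas DataFrame
fetchdf() is an alias of df()
fetch_df() is an alias of df()
fetch_df_chunk(vector_multiple) fetches a portion of the results into a DataFrame. The number of rows returned in each chunk is the vector size (2048 by default) * vector_multiple (1 by default).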
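
The chunk size arithmetic for fetch_df_chunk can be worked through in a few lines. This is a back-of-the-envelope sketch using the default vector size stated above; `rows_per_chunk` and `chunks_needed` are hypothetical helpers, and the computed size is an upper bound, since the final chunk of a result may hold fewer rows.

```python
import math

DEFAULT_VECTOR_SIZE = 2048  # default vector size stated above


def rows_per_chunk(vector_multiple: int = 1, vector_size: int = DEFAULT_VECTOR_SIZE) -> int:
    """Maximum number of rows in one chunk."""
    return vector_size * vector_multiple


def chunks_needed(total_rows: int, vector_multiple: int = 1) -> int:
    """Number of chunks required to read total_rows rows."""
    return math.ceil(total_rows / rows_per_chunk(vector_multiple))


print(rows_per_chunk())          # 2048
print(rows_per_chunk(4))         # 8192
print(chunks_needed(10_000))     # 5
print(chunks_needed(10_000, 2))  # 3
```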

Apache Arrow

arrow() fetches the data as an Arrow table
fetch_arrow_table() is an alias of arrow()
fetch_record_batch(chunk_size) returns an Arrow record batch reader with chunk_size rows per batch

Polars

pl() fetches the data as a Polars DataFrame

Examples

Below are some examples using this functionality. See the Python guides for more examples.

Fetch as a Pandas DataFrame

df = con.execute("SELECT * FROM items").fetchdf()
print(df)

       item   value  count
0     jeans    20.0      1
1    hammer    42.2      2
2    laptop  2000.0      1
3  chainsaw   500.0     10
4    iphone   300.0      2

Fetch as a dictionary of NumPy arrays

arr = con.execute("SELECT * FROM items").fetchnumpy()
print(arr)

{'item': masked_array(data=['jeans', 'hammer', 'laptop', 'chainsaw', 'iphone'],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'value': masked_array(data=[20.0, 42.2, 2000.0, 500.0, 300.0],
             mask=[False, False, False, False, False],
       fill_value=1e+20), 'count': masked_array(data=[1, 2, 1, 10, 2],
             mask=[False, False, False, False, False],
       fill_value=999999,
            dtype=int32)}

Fetch as an Arrow table. The conversion to Pandas afterwards is only for pretty printing

tbl = con.execute("SELECT * FROM items").fetch_arrow_table()
print(tbl.to_pandas())

       item    value  count
0     jeans    20.00      1
1    hammer    42.20      2
2    laptop  2000.00      1
3  chainsaw   500.00     10
4    iphone   300.00      2