<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://duckdb.org/feed.xml" rel="self" type="application/atom+xml" /><link href="https://duckdb.org/" rel="alternate" type="text/html" /><updated>2026-03-09T22:43:01+00:00</updated><id>https://duckdb.org/feed.xml</id><title type="html">DuckDB</title><subtitle>DuckDB is an in-process SQL database management system focused on analytical query processing. It is designed to be easy to install and easy to use. DuckDB has no external dependencies. DuckDB has bindings for C/C++, Python, R, Java, Node.js, Go and other languages.</subtitle><author><name>GitHub User</name><email>your-email@domain.com</email></author><entry><title type="html">Announcing DuckDB 1.5.0</title><link href="https://duckdb.org/2026/03/09/announcing-duckdb-150.html" rel="alternate" type="text/html" title="Announcing DuckDB 1.5.0" /><published>2026-03-09T00:00:00+00:00</published><updated>2026-03-09T00:00:00+00:00</updated><id>https://duckdb.org/2026/03/09/announcing-duckdb-150</id><content type="html" xml:base="https://duckdb.org/2026/03/09/announcing-duckdb-150.html"><![CDATA[<p>We are proud to release DuckDB v1.5.0, codenamed “Variegata” after the Paradise shelduck (<em>Tadorna variegata</em>), endemic to New Zealand.</p>

<p>In this blog post, we cover the most important updates in this release to support, features, and extensions. As always, there is more: for the complete release notes, see the <a href="https://github.com/duckdb/duckdb/releases/tag/v1.5.0">release page on GitHub</a>.</p>

<blockquote>
  <p>To install the new version, please visit the <a href="/install/">installation page</a>. Note that it can take a few days to release some extensions (e.g., the <a href="/docs/current/core_extensions/ui.html">UI</a>) and client libraries (e.g., Go, R, Java) due to the extra changes and review rounds required.</p>
</blockquote>

<p>With this release, two DuckDB release lines are available: v1.4 (LTS) and v1.5 (current).
The next release – planned for September – will ship a major version, DuckDB 2.0.</p>

<h2 id="new-features">New Features</h2>

<h3 id="command-line-client">Command Line Client</h3>

<p>For users who work with DuckDB in the terminal, the highlight of the new release is a reworked CLI client with a new color scheme, dynamic prompts, a pager, and many other convenience features.</p>

<h4 id="color-scheme">Color Scheme</h4>

<p>We shipped a <a href="/docs/current/clients/cli/friendly_cli.html">new color palette</a> and harmonized it with the documentation. The color palette is available in both dark mode and light mode. Both use two shades of gray, and five colors for keywords, strings, errors, functions and numbers. You can find the color palette in the <a href="/design/manual/#color-palette">Design Manual</a>.</p>

<p>You can customize the color scheme using the <code class="language-plaintext highlighter-rouge">.highlight_colors</code> dot command:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="py">.highlight_colors</span> <span class="n">column_name</span> <span class="n">darkgreen</span> <span class="n">bold_underline</span>
<span class="py">.highlight_colors</span> <span class="n">numeric_value</span> <span class="n">red</span> <span class="n">bold</span>
<span class="py">.highlight_colors</span> <span class="n">string_value</span> <span class="n">purple2</span>
<span class="k">FROM</span> <span class="n">ducks</span><span class="p">;</span>
</code></pre></div></div>

<p><img src="/images/blog/v150/cli-colors-example-light.png" alt="DuckDB CLI light mode" class="lightmode-img" />
<img src="/images/blog/v150/cli-colors-example-dark.png" alt="DuckDB CLI dark mode" class="darkmode-img" /></p>

<h4 id="dynamic-prompts-in-the-cli">Dynamic Prompts in the CLI</h4>

<p>DuckDB v1.5.0 introduces dynamic prompts for the CLI (<a href="https://github.com/duckdb/duckdb/pull/19579">PR #19579</a>). By default, these show the database and schema that you are currently connected to:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">duckdb</span>
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">memory</span> <span class="n">D</span> <span class="k">ATTACH</span> <span class="s1">'my_database.duckdb'</span><span class="p">;</span>
<span class="gp">memory</span> <span class="n">D</span> <span class="k">USE</span> <span class="n">my_database</span><span class="p">;</span>
<span class="gp">my_database</span> <span class="n">D</span> <span class="k">CREATE</span> <span class="k">SCHEMA</span> <span class="n">my_schema</span><span class="p">;</span>
<span class="gp">my_database</span> <span class="n">D</span> <span class="k">USE</span> <span class="n">my_schema</span><span class="p">;</span>
<span class="gp">my_database.my_schema</span> <span class="n">D</span> <span class="p">...</span>
</code></pre></div></div>

<p>These prompts can be configured using bracket codes to set a maximum length, run a custom query, use different colors, and more (<a href="https://github.com/duckdb/duckdb/pull/19579">#19579</a>).</p>

<h4 id="tables-and-describe"><code class="language-plaintext highlighter-rouge">.tables</code> and <code class="language-plaintext highlighter-rouge">DESCRIBE</code></h4>

<p>To show the columns of an individual table, use the <a href="/docs/current/sql/statements/describe.html"><code class="language-plaintext highlighter-rouge">DESCRIBE</code> statement</a>:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">memory</span> <span class="n">D</span> <span class="k">ATTACH</span> <span class="s1">'https://blobs.duckdb.org/data/animals.db'</span> <span class="k">AS</span> <span class="n">animals_db</span><span class="p">;</span>
<span class="gp">memory</span> <span class="n">D</span> <span class="k">USE</span> <span class="n">animals_db</span><span class="p">;</span>
<span class="gp">animals_db</span> <span class="n">D</span> <span class="k">DESCRIBE</span> <span class="n">ducks</span><span class="p">;</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────┐
│        ducks         │
│                      │
│ id           integer │
│ name         varchar │
│ extinct_year integer │
└──────────────────────┘
</code></pre></div></div>

<p>The <a href="/docs/current/clients/cli/dot_commands.html"><code class="language-plaintext highlighter-rouge">.tables</code> dot command</a> lists the attached catalogs, the schemas and tables in them, and the columns in each table.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">memory</span> <span class="n">D</span> <span class="k">ATTACH</span> <span class="s1">'https://blobs.duckdb.org/data/animals.db'</span> <span class="k">AS</span> <span class="n">animals_db</span><span class="p">;</span>
<span class="gp">memory</span> <span class="n">D</span> <span class="k">ATTACH</span> <span class="s1">'https://blobs.duckdb.org/data/numbers1.db'</span><span class="p">;</span>
<span class="gp">memory</span> <span class="n">D</span> <span class="py">.tables</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code> ────────────── animals_db ───────────────
 ───────────────── main ──────────────────
┌─────────────────┐┌──────────────────────┐
│      swans      ││        ducks         │
│                 ││                      │
│ id      integer ││ id           integer │
│ name    varchar ││ name         varchar │
│ species varchar ││ extinct_year integer │
│ color   varchar ││                      │
│ habitat varchar ││        5 rows        │
│                 │└──────────────────────┘
│     3 rows      │
└─────────────────┘
  numbers1
 ── main ──
┌──────────┐
│   tbl    │
│          │
│ i bigint │
│          │
│  2 rows  │
└──────────┘
</code></pre></div></div>

<h4 id="accessing-the-last-result-using-_">Accessing the Last Result Using <code class="language-plaintext highlighter-rouge">_</code></h4>

<p>You can access the last result of a query inline using the underscore character <code class="language-plaintext highlighter-rouge">_</code>. This is not only convenient but also makes it unnecessary to re-run potentially long-running queries:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">memory</span> <span class="n">D</span> <span class="k">ATTACH</span> <span class="s1">'https://blobs.duckdb.org/data/animals.db'</span> <span class="k">AS</span> <span class="n">animals_db</span><span class="p">;</span>
<span class="gp">memory</span> <span class="n">D</span> <span class="k">USE</span> <span class="n">animals_db</span><span class="p">;</span>
<span class="gp">animals_db</span> <span class="n">D</span> <span class="k">FROM</span> <span class="n">ducks</span> <span class="k">WHERE</span> <span class="n">extinct_year</span> <span class="k">IS</span> <span class="k">NOT</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="err">┌───────┬──────────────────┬──────────────┐</span>
<span class="err">│</span>  <span class="n">id</span>   <span class="err">│</span>       <span class="n">name</span>       <span class="err">│</span> <span class="n">extinct_year</span> <span class="err">│</span>
<span class="err">│</span> <span class="n">int32</span> <span class="err">│</span>     <span class="n">varchar</span>      <span class="err">│</span>    <span class="n">int32</span>     <span class="err">│</span>
<span class="err">├───────┼──────────────────┼──────────────┤</span>
<span class="err">│</span>     <span class="mi">1</span> <span class="err">│</span> <span class="n">Labrador</span> <span class="n">Duck</span>    <span class="err">│</span>         <span class="mi">1878</span> <span class="err">│</span>
<span class="err">│</span>     <span class="mi">3</span> <span class="err">│</span> <span class="n">Crested</span> <span class="n">Shelduck</span> <span class="err">│</span>         <span class="mi">1964</span> <span class="err">│</span>
<span class="err">│</span>     <span class="mi">5</span> <span class="err">│</span> <span class="n">Pink</span><span class="o">-</span><span class="n">headed</span> <span class="n">Duck</span> <span class="err">│</span>         <span class="mi">1949</span> <span class="err">│</span>
<span class="err">└───────┴──────────────────┴──────────────┘</span>
<span class="gp">animals_db</span> <span class="n">D</span> <span class="k">FROM</span> <span class="n">_</span><span class="p">;</span>
<span class="err">┌───────┬──────────────────┬──────────────┐</span>
<span class="err">│</span>  <span class="n">id</span>   <span class="err">│</span>       <span class="n">name</span>       <span class="err">│</span> <span class="n">extinct_year</span> <span class="err">│</span>
<span class="err">│</span> <span class="n">int32</span> <span class="err">│</span>     <span class="n">varchar</span>      <span class="err">│</span>    <span class="n">int32</span>     <span class="err">│</span>
<span class="err">├───────┼──────────────────┼──────────────┤</span>
<span class="err">│</span>     <span class="mi">1</span> <span class="err">│</span> <span class="n">Labrador</span> <span class="n">Duck</span>    <span class="err">│</span>         <span class="mi">1878</span> <span class="err">│</span>
<span class="err">│</span>     <span class="mi">3</span> <span class="err">│</span> <span class="n">Crested</span> <span class="n">Shelduck</span> <span class="err">│</span>         <span class="mi">1964</span> <span class="err">│</span>
<span class="err">│</span>     <span class="mi">5</span> <span class="err">│</span> <span class="n">Pink</span><span class="o">-</span><span class="n">headed</span> <span class="n">Duck</span> <span class="err">│</span>         <span class="mi">1949</span> <span class="err">│</span>
<span class="err">└───────┴──────────────────┴──────────────┘</span>
</code></pre></div></div>

<h4 id="pager">Pager</h4>

<p>Last but not least, the CLI now has a pager! It is triggered when there are more than 50 rows in the results.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">memory</span> <span class="n">D</span> <span class="py">.maxrows</span> <span class="mi">100</span>
<span class="gp">memory</span> <span class="n">D</span> <span class="k">FROM</span> <span class="nf">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">100</span><span class="p">);</span>
</code></pre></div></div>

<p>You can navigate on Linux and Windows using <code class="language-plaintext highlighter-rouge">Page Up</code> / <code class="language-plaintext highlighter-rouge">Page Down</code>. On macOS, use <code class="language-plaintext highlighter-rouge">Fn</code> + <code class="language-plaintext highlighter-rouge">Up</code> / <code class="language-plaintext highlighter-rouge">Down</code>. To exit the pager, press <code class="language-plaintext highlighter-rouge">Q</code>.</p>

<p>The initial implementation of the pager was provided by <a href="https://github.com/tobwen"><code class="language-plaintext highlighter-rouge">tobwen</code></a> in <a href="https://github.com/duckdb/duckdb/pull/19004">#19004</a>.</p>

<h3 id="peg-parser">PEG Parser</h3>

<p>DuckDB v1.5 ships an experimental parser based on PEG (Parsing Expression Grammar). The new parser enables better suggestions and improved error messages, and allows extensions to extend the grammar. The PEG parser is currently disabled by default but you can opt in using:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CALL</span> <span class="n">enable_peg_parser</span><span class="p">();</span>
</code></pre></div></div>

<p>The PEG parser is already used for generating suggestions. You can cycle through the options using <code class="language-plaintext highlighter-rouge">TAB</code>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="gp">animals_db</span> <span class="n">D</span> <span class="k">FROM</span> <span class="n">ducks</span> <span class="k">WHERE</span> <span class="n">habitat</span> <span class="k">IS</span> 
<span class="k">IS</span>           <span class="k">ISNULL</span>       <span class="k">ILIKE</span>        <span class="gs">IN</span>           <span class="k">INTERSECT</span>    <span class="k">LIKE</span>
</code></pre></div></div>

<p>We are planning to make the switch to the new parser in the upcoming DuckDB release.</p>

<blockquote>
  <p>As a tradeoff, the parser has a slight performance overhead; however, this is in the range of milliseconds and thus negligible for analytical queries. For more details on the rationale for using a PEG parser and benchmark results, please refer to the <a href="/library/runtime-extensible-parsers/">CIDR 2026 paper</a> by Hannes and Mark, or their <a href="/2024/11/22/runtime-extensible-parsers.html">blog post</a> summarizing the paper.</p>
</blockquote>

<h3 id="variant-type"><code class="language-plaintext highlighter-rouge">VARIANT</code> Type</h3>

<p>DuckDB now natively supports the <a href="https://github.com/duckdb/duckdb/pull/18609"><code class="language-plaintext highlighter-rouge">VARIANT</code> type</a>, inspired by <a href="https://docs.snowflake.com/en/sql-reference/data-types-semistructured">Snowflake's semi-structured <code class="language-plaintext highlighter-rouge">VARIANT</code> data type</a> and available <a href="https://github.com/apache/parquet-format/blob/master/VariantEncoding.md">in Parquet since 2025</a>. Unlike the <a href="/docs/current/data/json/json_type.html">JSON type</a>, which is physically stored as text, <code class="language-plaintext highlighter-rouge">VARIANT</code> stores typed, binary data. Each row in a <code class="language-plaintext highlighter-rouge">VARIANT</code> column is self-contained with its own type information. This leads to better compression and query performance. Here are a few examples of using <code class="language-plaintext highlighter-rouge">VARIANT</code>.</p>

<p>Store different types in the same column:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">events</span> <span class="p">(</span><span class="n">id</span> <span class="nb">INTEGER</span><span class="p">,</span> <span class="n">data</span> <span class="n">VARIANT</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">events</span> <span class="k">VALUES</span>
    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">42</span><span class="p">::</span><span class="n">VARIANT</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'hello world'</span><span class="p">::</span><span class="n">VARIANT</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]::</span><span class="n">VARIANT</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="p">{</span><span class="s1">'name'</span><span class="p">:</span> <span class="s1">'Alice'</span><span class="p">,</span> <span class="s1">'age'</span><span class="p">:</span> <span class="mi">30</span><span class="p">}::</span><span class="n">VARIANT</span><span class="p">);</span>

<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">events</span><span class="p">;</span>
</code></pre></div></div>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────┬────────────────────────────┐
│  id   │            data            │
│ int32 │          variant           │
├───────┼────────────────────────────┤
│     1 │ 42                         │
│     2 │ hello world                │
│     3 │ [1, 2, 3]                  │
│     4 │ {'name': Alice, 'age': 30} │
└───────┴────────────────────────────┘
</code></pre></div></div>
<p>Check the underlying type of each row:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">id</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">variant_typeof</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="k">AS</span> <span class="n">vtype</span>
<span class="k">FROM</span> <span class="n">events</span><span class="p">;</span>
</code></pre></div></div>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────┬────────────────────────────┬───────────────────┐
│  id   │            data            │       vtype       │
│ int32 │          variant           │      varchar      │
├───────┼────────────────────────────┼───────────────────┤
│     1 │ 42                         │ INT32             │
│     2 │ hello world                │ VARCHAR           │
│     3 │ [1, 2, 3]                  │ ARRAY(3)          │
│     4 │ {'name': Alice, 'age': 30} │ OBJECT(name, age) │
└───────┴────────────────────────────┴───────────────────┘
</code></pre></div></div>

<p>You can extract fields from nested variants using the dot notation or the <code class="language-plaintext highlighter-rouge">variant_extract</code> function:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="n">data.name</span> <span class="k">FROM</span> <span class="n">events</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
<span class="c1">-- or </span>
<span class="k">SELECT</span> <span class="n">variant_extract</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="s1">'name'</span><span class="p">)</span> <span class="k">AS</span> <span class="n">name</span> <span class="k">FROM</span> <span class="n">events</span> <span class="k">WHERE</span> <span class="n">id</span> <span class="o">=</span> <span class="mi">4</span><span class="p">;</span>
</code></pre></div></div>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────┐
│  name   │
│ variant │
├─────────┤
│ Alice   │
└─────────┘
</code></pre></div></div>

<p>DuckDB also supports reading <code class="language-plaintext highlighter-rouge">VARIANT</code> types from Parquet files, including <em>shredding</em> (where parts of the semi-structured data are stored as regular typed columns).</p>
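<p>For illustration, a <code class="language-plaintext highlighter-rouge">VARIANT</code> column can be round-tripped through Parquet. The following sketch reuses the <code class="language-plaintext highlighter-rouge">events</code> table from above; the file name is arbitrary:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Write the table, including its VARIANT column, to a Parquet file
COPY events TO 'events.parquet' (FORMAT parquet);
-- Read it back: the data column retains its VARIANT type
SELECT id, variant_typeof(data) AS vtype
FROM read_parquet('events.parquet');
</code></pre></div></div>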

<h3 id="read_duckdb-function"><code class="language-plaintext highlighter-rouge">read_duckdb</code> Function</h3>

<p>The <code class="language-plaintext highlighter-rouge">read_duckdb</code> table function reads DuckDB databases without requiring you to attach them first. This can be more ergonomic – for example, it allows globbing. You can read the <a href="#appendix-example-dataset">example</a> <code class="language-plaintext highlighter-rouge">numbers</code> databases as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="nf">min</span><span class="p">(</span><span class="n">i</span><span class="p">),</span> <span class="nf">max</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">FROM</span> <span class="nf">read_duckdb</span><span class="p">(</span><span class="s1">'numbers*.db'</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌────────┬────────┐
│ min(i) │ max(i) │
│ int64  │ int64  │
├────────┼────────┤
│      1 │      5 │
└────────┴────────┘
</code></pre></div></div>

<h3 id="azure-writes">Azure Writes</h3>

<p>You can now <a href="/docs/current/core_extensions/azure.html#writing-to-azure-blob-storage">write to Azure Blob Storage or ADLSv2 storage</a> using the <code class="language-plaintext highlighter-rouge">COPY</code> statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Write query results to a Parquet file on Blob Storage</span>
<span class="k">COPY</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">my_table</span><span class="p">)</span>
<span class="k">TO</span> <span class="s1">'az://my_container/path/output.parquet'</span><span class="p">;</span>

<span class="c1">-- Write a table to a CSV file on ADLSv2 Storage</span>
<span class="k">COPY</span> <span class="n">my_table</span>
<span class="k">TO</span> <span class="s1">'abfss://my_container/path/output.csv'</span><span class="p">;</span>
</code></pre></div></div>

<h3 id="odbc-scanner">ODBC Scanner</h3>

<p>We are now shipping an ODBC scanner extension. This allows you to query remote databases through their ODBC drivers, as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">LOAD</span><span class="n"> odbc_scanner</span><span class="p">;</span>
<span class="k">SET</span> <span class="k">VARIABLE</span> <span class="n">conn</span> <span class="o">=</span> <span class="n">odbc_connect</span><span class="p">(</span><span class="s1">'Driver={Oracle Driver};DBQ=//127.0.0.1:1521/XE;UID=scott;PWD=tiger;'</span><span class="p">);</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">odbc_query</span><span class="p">(</span><span class="nf">getvariable</span><span class="p">(</span><span class="s1">'conn'</span><span class="p">),</span> <span class="s1">'SELECT SYSTIMESTAMP FROM dual;'</span><span class="p">);</span>
</code></pre></div></div>

<p>In the coming weeks, we'll publish the documentation page and release a follow-up post on the ODBC scanner.
In the meantime, please refer to the <a href="https://github.com/duckdb/odbc-scanner/blob/main/README.md">project's README</a>.</p>

<h2 id="major-changes">Major Changes</h2>

<h3 id="lakehouse-updates">Lakehouse Updates</h3>

<p>All of DuckDB’s supported Lakehouse formats have received some updates for v1.5.</p>

<h4 id="ducklake">DuckLake</h4>

<p>The main <a href="https://ducklake.select/">DuckLake</a> change for DuckDB v1.5 is updating the DuckLake specification to v0.4.
We are aiming for this to be the same specification that ships with DuckLake 1.0, which will be released in April.
Its main highlights include:</p>

<ul>
  <li>Macro support.</li>
  <li>Sorted tables.</li>
  <li>Deletion inlining and addition of partial delete files.</li>
  <li>Internal rework of DuckLake options.</li>
</ul>

<p>We'll announce more details about these features in the blog post for DuckLake v1.</p>

<h4 id="delta-lake">Delta Lake</h4>

<p>For the <a href="/docs/current/core_extensions/delta.html">Delta Lake extension</a>, the team has focused on improving support for writes via <a href="/docs/current/core_extensions/unity_catalog.html">Unity Catalog</a>, idempotent Delta writes, and table <code class="language-plaintext highlighter-rouge">CHECKPOINT</code>s.</p>

<h4 id="iceberg">Iceberg</h4>

<p>For the <a href="/docs/current/core_extensions/iceberg/overview.html">Iceberg extension</a>, the team is working on a larger release for v1.5.1. For v1.5.0, the main feature is the addition of table properties in the <code class="language-plaintext highlighter-rouge">CREATE TABLE</code> statement:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">test_create_table</span> <span class="p">(</span><span class="n">a</span> <span class="nb">INTEGER</span><span class="p">)</span>
<span class="k">WITH</span> <span class="p">(</span>
    <span class="s1">'format-version'</span> <span class="o">=</span> <span class="s1">'2'</span><span class="p">,</span> <span class="c1">-- elevated to the table's format version when the table is created</span>
    <span class="s1">'location'</span> <span class="o">=</span> <span class="s1">'s3://path/to/data'</span><span class="p">,</span> <span class="c1">-- elevated to the table's location when the table is created</span>
    <span class="s1">'property1'</span> <span class="o">=</span> <span class="s1">'value1'</span><span class="p">,</span>
    <span class="s1">'property2'</span> <span class="o">=</span> <span class="s1">'value2'</span>
<span class="p">);</span>
</code></pre></div></div>

<p>Other minor additions have been made to enable passing <code class="language-plaintext highlighter-rouge">EXTRA_HTTP_HEADERS</code> when attaching to an Iceberg catalog, which has unlocked support for <a href="https://cloud.google.com/biglake">Google’s BigLake</a>.</p>
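<p>As a rough sketch only – the endpoint, catalog name, header names, and the exact option syntax below are assumptions, so please consult the Iceberg extension documentation for the authoritative form – attaching with extra headers could look like:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Hypothetical example: endpoint, project, and header values are placeholders
ATTACH 'my_catalog' AS iceberg_catalog (
    TYPE iceberg,
    ENDPOINT 'https://biglake.googleapis.com/iceberg/v1/restcatalog',
    EXTRA_HTTP_HEADERS MAP {'x-goog-user-project': 'my-project'}
);
</code></pre></div></div>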

<blockquote>
  <p>Both Delta and DuckLake have implemented the <a href="#variant-type"><code class="language-plaintext highlighter-rouge">VARIANT</code> type</a>. Iceberg’s <code class="language-plaintext highlighter-rouge">VARIANT</code> type will ship in the v1.5.1 release with some other features that are specific to the Iceberg v3 specification.</p>
</blockquote>

<h3 id="network-stack">Network Stack</h3>

<p>The default backend for the <a href="/docs/current/core_extensions/httpfs/overview.html">httpfs extension</a> has changed from <a href="https://github.com/yhirose/cpp-httplib"><code class="language-plaintext highlighter-rouge">httplib</code></a> to <a href="https://curl.se/"><code class="language-plaintext highlighter-rouge">curl</code></a>. As <code class="language-plaintext highlighter-rouge">curl</code> is one of the most popular and well-tested open-source projects, we expect it to provide long-term stability and security for DuckDB. Regardless of the HTTP library used, <code class="language-plaintext highlighter-rouge">openssl</code> remains the backing SSL library, and options such as <code class="language-plaintext highlighter-rouge">http_timeout</code>, <code class="language-plaintext highlighter-rouge">http_retries</code>, etc. work as before.</p>

<p>Our community has been <a href="https://github.com/duckdb/duckdb/issues/20977">testing the new network stack</a> for the last few weeks. Still, if you encounter any issues, please submit them to the <a href="https://github.com/duckdb/duckdb-httpfs"><code class="language-plaintext highlighter-rouge">duckdb-httpfs</code> repository</a>.</p>
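<p>The existing settings carry over unchanged to the new backend; for example (the values here are illustrative, not recommendations):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Illustrative values; see the httpfs extension page for defaults and units
SET http_timeout = 30000;
SET http_retries = 5;
</code></pre></div></div>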

<details>
  <summary>
If you are interested in more details, click here.
</summary>
  <p>For technical reasons, <code class="language-plaintext highlighter-rouge">httplib</code> is still the library we use for downloading the <code class="language-plaintext highlighter-rouge">httpfs</code> extension. When <code class="language-plaintext highlighter-rouge">httpfs</code> is loaded with the (now default) <code class="language-plaintext highlighter-rouge">curl</code> backend, subsequent extension installations go through <code class="language-plaintext highlighter-rouge">https://</code>, with the default endpoint for core extensions pointing to <a href="https://extensions.duckdb.org"><code class="language-plaintext highlighter-rouge">https://extensions.duckdb.org</code></a>.</p>

  <p>All core and community extensions are cryptographically signed, so installing them through <code class="language-plaintext highlighter-rouge">http://</code> does not pose a security risk. However, some users reported issues with <code class="language-plaintext highlighter-rouge">http://</code> extension installs in environments with firewalls.</p>
</details>

<h3 id="lambda-syntax">Lambda Syntax</h3>

<p>Up to DuckDB v1.2, the syntax for defining lambda expressions used the arrow notation <code class="language-plaintext highlighter-rouge">x -&gt; x + 1</code>. While this was a nice syntax, it clashed with the JSON extract operator (<code class="language-plaintext highlighter-rouge">-&gt;</code>) due to operator precedence and led to error messages that some users found difficult to troubleshoot. To work around this, we introduced a new, Python-style <a href="/2025/05/21/announcing-duckdb-130.html#lambda-function-syntax">lambda syntax in v1.3</a>, <code class="language-plaintext highlighter-rouge">lambda x: x + 1</code>.</p>

<p>While DuckDB v1.5 supports both styles of writing lambda expressions, using the deprecated arrow syntax now emits a warning:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="nf">list_transform</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">x</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-console highlighter-rouge"><div class="highlight"><pre class="highlight"><code>WARNING:
Deprecated lambda arrow (-&gt;) detected. Please transition to the new lambda syntax, i.e., lambda x, i: x + i, before DuckDB's next release.
</code></pre></div></div>
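<p>For reference, the equivalent query in the new syntax, which does not produce a warning:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT list_transform([1, 2, 3], lambda x: x + 1);
</code></pre></div></div>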

<p>You can use the <code class="language-plaintext highlighter-rouge">lambda_syntax</code> configuration option to change this behavior, either suppressing the warning or turning it into an error:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Suppress the warning</span>
<span class="k">SET</span> <span class="py">lambda_syntax</span> <span class="o">=</span> <span class="s1">'ENABLE_SINGLE_ARROW'</span><span class="p">;</span>
<span class="c1">-- Turn the deprecation warning into an error</span>
<span class="k">SET</span> <span class="py">lambda_syntax</span> <span class="o">=</span> <span class="s1">'DISABLE_SINGLE_ARROW'</span><span class="p">;</span>
</code></pre></div></div>

<p>DuckDB 2.0 will disable the single arrow syntax by default; it will only be available if you explicitly opt in.</p>

<h3 id="spatial-extension">Spatial Extension</h3>

<p>The <a href="/docs/current/core_extensions/spatial/overview.html">spatial extension</a> ships several important changes.</p>

<h4 id="breaking-change-flipping-of-axis-order">Breaking Change: Flipping of Axis Order</h4>

<p>Most functions in <code class="language-plaintext highlighter-rouge">spatial</code> operate in Cartesian space and are unaffected by axis order, e.g., whether the <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">Y</code> axes represent “longitude” and “latitude” or the other way around. But there are some functions where this matters, and where the assumption, counterintuitively, is that all input geometries use (x = latitude, y = longitude). These are:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">ST_Distance_Spheroid</code></li>
  <li><code class="language-plaintext highlighter-rouge">ST_Perimeter_Spheroid</code></li>
  <li><code class="language-plaintext highlighter-rouge">ST_Area_Spheroid</code></li>
  <li><code class="language-plaintext highlighter-rouge">ST_Distance_Sphere</code></li>
  <li><code class="language-plaintext highlighter-rouge">ST_DWithin_Spheroid</code></li>
</ul>

<p>Additionally, <code class="language-plaintext highlighter-rouge">ST_Transform</code> expects that the input geometries are in the axis order defined by the source coordinate reference system, which for, e.g., <code class="language-plaintext highlighter-rouge">EPSG:4326</code> is also (x = latitude, y = longitude).</p>

<p>This has been a long-standing source of confusion and numerous issues, as other databases, formats and GIS systems tend to always treat <code class="language-plaintext highlighter-rouge">X</code> as “easting”, “left-right” or “longitude”, and <code class="language-plaintext highlighter-rouge">Y</code> as “northing”, “up-down” or “latitude”.</p>

<p>We are changing how this currently works in DuckDB to be consistent with how other systems operate, which should cause less confusion for new users in the future. However, to avoid silently breaking existing workflows that have adapted to this quirk (e.g., by using <code class="language-plaintext highlighter-rouge">ST_FlipCoordinates</code>), we are rolling out this change gradually via a new <code class="language-plaintext highlighter-rouge">geometry_always_xy</code> setting:</p>

<ul>
  <li>In DuckDB v1.5, setting <code class="language-plaintext highlighter-rouge">geometry_always_xy = true</code> enables the new behavior (x = longitude, y = latitude). Without it, affected functions emit a warning.</li>
  <li>In DuckDB v2.0, the warning will become an error. Set <code class="language-plaintext highlighter-rouge">geometry_always_xy = false</code> to preserve the old behavior.</li>
  <li>In DuckDB v2.1, <code class="language-plaintext highlighter-rouge">geometry_always_xy = true</code> will become the default.</li>
</ul>

<p>So to summarize, nothing is changing by default in this release, but to avoid being affected by this change in the future, set <code class="language-plaintext highlighter-rouge">geometry_always_xy</code> explicitly now. Set it to <code class="language-plaintext highlighter-rouge">true</code> to opt into the new behavior, or <code class="language-plaintext highlighter-rouge">false</code> to keep the existing one.</p>
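<p>For example, based on the setting described above, you can pin the behavior explicitly in your configuration:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Opt in to the new behavior: x = longitude, y = latitude
SET geometry_always_xy = true;
-- Or keep the legacy behavior past the v2.0 transition
SET geometry_always_xy = false;
</code></pre></div></div>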

<h3 id="geometry-rework">Geometry Rework</h3>

<h4 id="geometry-becomes-a-built-in-type"><code class="language-plaintext highlighter-rouge">GEOMETRY</code> Becomes a Built-In Type</h4>

<p>The <code class="language-plaintext highlighter-rouge">GEOMETRY</code> type has been moved from the <code class="language-plaintext highlighter-rouge">spatial</code> extension into core DuckDB!</p>

<p>Geospatial data is no longer niche. The Parquet standard now treats <code class="language-plaintext highlighter-rouge">GEOMETRY</code> as a first-class column type, and open table formats like Apache Iceberg and DuckLake are moving in the same direction. Many widely used data formats and systems also have geospatial counterparts—GeoJSON, PostGIS, GeoPandas, GeoPackage/Spatialite, and more.</p>

<p>DuckDB already offers extensions that integrate with many of these formats and systems. But there’s a structural problem: as long as <code class="language-plaintext highlighter-rouge">GEOMETRY</code> lives inside the <code class="language-plaintext highlighter-rouge">spatial</code> extension, other extensions that want to read or write geospatial data must either depend on <code class="language-plaintext highlighter-rouge">spatial</code>, implement their own incompatible geometry representation, or force users to handle the conversions themselves.</p>

<p>By moving <code class="language-plaintext highlighter-rouge">GEOMETRY</code> into DuckDB’s core, extensions can now produce and consume geometry values natively, without depending on <code class="language-plaintext highlighter-rouge">spatial</code>. While the <code class="language-plaintext highlighter-rouge">spatial</code> extension still provides most of the functions for working with geometries, the type itself becomes a shared foundation that the entire ecosystem can build on. We’ve already added <code class="language-plaintext highlighter-rouge">GEOMETRY</code> support to the Postgres scanner and GeoArrow conversion for Arrow import and export. Geometry support in additional extensions is coming soon.</p>

<p>This change also enables deeper integration with DuckDB’s storage engine and query optimizer, unlocking new compression techniques, query optimizations, and CRS awareness capabilities that were not possible when <code class="language-plaintext highlighter-rouge">GEOMETRY</code> only existed as an extension type. This is all documented in the new <a href="/docs/current/sql/data_types/geometry.html">geometry page</a> in the documentation, but we will highlight some below.</p>

<h4 id="improved-storage-wkb-and-shredding">Improved Storage: WKB and Shredding</h4>

<p>Geometry values are now stored using the industry-standard little-endian <a href="https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary">Well-Known Binary (WKB)</a> encoding, replacing the custom format used by the <code class="language-plaintext highlighter-rouge">spatial</code> extension. However, we are still experimenting with the in-memory representation we want to use in the execution engine, so you should still use the conversion functions (e.g., <code class="language-plaintext highlighter-rouge">ST_AsWKT</code>, <code class="language-plaintext highlighter-rouge">ST_AsWKB</code>, <code class="language-plaintext highlighter-rouge">ST_GeomFromText</code>, <code class="language-plaintext highlighter-rouge">ST_GeomFromWKB</code>) when moving data in and out of DuckDB.</p>
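<p>For example, a minimal round trip through the conversion functions mentioned above:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Parse WKT into a GEOMETRY value, then serialize it back out as WKB
SELECT ST_AsWKB(ST_GeomFromText('POINT (1 2)'));
</code></pre></div></div>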

<p>We’ve also implemented a new storage technique specialized for <code class="language-plaintext highlighter-rouge">GEOMETRY</code>. When a geometry column contains values that all share the same type and vertex dimensions, DuckDB can additionally apply "shredding": rather than storing opaque blobs, the column is decomposed into primitive <code class="language-plaintext highlighter-rouge">STRUCT</code>, <code class="language-plaintext highlighter-rouge">LIST</code>, and <code class="language-plaintext highlighter-rouge">DOUBLE</code> segments that compress far more efficiently. This can reduce on-disk size by roughly 3x for uniform geometry columns such as point clouds. Shredding is applied automatically for uniform row groups of a certain size, but can be configured via the <code class="language-plaintext highlighter-rouge">geometry_minimum_shredding_size</code> configuration option.</p>
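<p>The shredding threshold can be changed like any other configuration option; note that the value below is purely illustrative:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Illustrative value; see the configuration documentation for the exact semantics
SET geometry_minimum_shredding_size = 10000;
</code></pre></div></div>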

<h4 id="geometry-statistics-and-query-optimization">Geometry Statistics and Query Optimization</h4>

<p>Geometry columns now track per-row-group statistics, including the bounding box and the set of geometry types and vertex dimensions present. The query optimizer can use these to skip row groups that cannot match a query's spatial predicates, similar to min/max pruning for numeric columns. The <code class="language-plaintext highlighter-rouge">&amp;&amp;</code> (bounding box intersection) operator is the first to benefit; broader support across <code class="language-plaintext highlighter-rouge">spatial</code> functions is in progress.</p>
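<p>A sketch of the kind of query that can benefit from this pruning (the table and column names are hypothetical):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Row groups whose bounding box cannot intersect the query window are skipped
SELECT count(*)
FROM buildings
WHERE geom &amp;&amp; ST_GeomFromText('POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))');
</code></pre></div></div>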

<h4 id="coordinate-reference-system-support">Coordinate Reference System Support</h4>

<p>The <code class="language-plaintext highlighter-rouge">GEOMETRY</code> type now accepts an optional CRS parameter (e.g., <code class="language-plaintext highlighter-rouge">GEOMETRY('OGC:CRS84')</code>), making CRS part of the type system rather than implicit metadata. Spatial functions enforce CRS consistency across their inputs, catching a common class of silent errors that arises when mixing geometries from different coordinate systems. Only a couple of CRSs are built in by default, but loading the <code class="language-plaintext highlighter-rouge">spatial</code> extension registers over 7,000 CRSs from the EPSG dataset. While CRS support is still a bit experimental, we are planning to develop it further to support e.g., custom CRS definitions.</p>
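<p>As a sketch, with hypothetical table and column names, a CRS-annotated column can be declared like this:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The CRS is now part of the column's type
CREATE TABLE cities (name VARCHAR, location GEOMETRY('OGC:CRS84'));
</code></pre></div></div>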

<h3 id="optimizations">Optimizations</h3>

<h4 id="non-blocking-checkpointing">Non-Blocking Checkpointing</h4>

<p>During checkpointing, it's now possible to run concurrent reads (<a href="https://github.com/duckdb/duckdb/pull/19867">#19867</a>), writes (<a href="https://github.com/duckdb/duckdb/pull/20052">#20052</a>), insertions with indexes (<a href="https://github.com/duckdb/duckdb/pull/20160">#20160</a>) and deletes (<a href="https://github.com/duckdb/duckdb/pull/20286">#20286</a>). The rework of checkpointing benefits concurrent RW workloads and increases the TPC-H throughput score on SF100 from 246,115.60 to 287,122.97, a <strong>17% improvement</strong>.</p>

<h4 id="aggregates">Aggregates</h4>

<p>Aggregate functions received several optimizations. For example, the <code class="language-plaintext highlighter-rouge">last</code> aggregate function was optimized by community member <a href="https://github.com/xe-nvdk"><code class="language-plaintext highlighter-rouge">xe-nvdk</code></a> to iterate from the end of each vector batch instead of the beginning. In synthetic benchmarks, this results in a <a href="https://github.com/duckdb/duckdb/pull/20567">40% speedup</a>.</p>

<!-- markdownlint-disable MD001 -->

<h2 id="distribution">Distribution</h2>

<h4 id="python-pip">Python Pip</h4>

<p>You can install the DuckDB CLI on any platform where pip is available:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">pip </span>install duckdb-cli
</code></pre></div></div>

<p>You can then launch DuckDB in your virtual environment using:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">duckdb</span>
</code></pre></div></div>

<p>Both DuckDB v1.4 and v1.5 are supported. We are working on shipping extensions as extras using the <code class="language-plaintext highlighter-rouge">duckdb[extension_name]</code> syntax – stay tuned!</p>

<h4 id="windows-install-script-beta">Windows Install Script (Beta)</h4>

<p>On Windows, you can now use an install script:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code>powershell <span class="nt">-NoExit</span> iex <span class="o">(</span>iwr <span class="s2">"https://install.duckdb.org/install.ps1"</span><span class="o">)</span>.Content
</code></pre></div></div>

<p>Please note that this is currently in the beta stage. If you have any feedback, please <a href="https://github.com/duckdb/duckdb/issues">let us know</a>.</p>

<h4 id="cli-for-linux-with-musl-libc">CLI for Linux with musl libc</h4>

<p>We are distributing CLI clients that work with <a href="/docs/stable/dev/building/linux.html">musl libc</a> (e.g., for Alpine Linux, commonly used in Docker images). The archives are available <a href="https://github.com/duckdb/duckdb/releases/tag/v1.5.0">on GitHub</a>.</p>

<p>Note that the musl libc CLI client requires <code class="language-plaintext highlighter-rouge">libstdc++</code>. To install this package, run:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">apk </span>add libstdc++
</code></pre></div></div>

<h4 id="extension-sizes">Extension Sizes</h4>

<p>We reworked our build system to make the extension binaries smaller! The DuckLake extension's size was reduced by ~30%, from 17 MB to 12 MB. For smaller extensions such as Excel, the reduction is more than 60%, from 9 MB to 3 MB.</p>

<!-- markdownlint-enable MD001 -->

<h2 id="summary">Summary</h2>

<p>These were a few highlights – but there are many more features and improvements in this release.
There have been over 6500 commits by close to 100 contributors since v1.4. The full <a href="https://github.com/duckdb/duckdb/releases/tag/v1.5.0">release notes can be found on GitHub</a>. We would like to thank our community for providing detailed issue reports and feedback. And again, our special thanks go to external contributors!</p>

<p>PS: If you visited this blog post through a direct link – we also rolled out a new <a href="/">landing page</a>!</p>

<!-- markdownlint-disable MD040 -->

<h2 id="appendix-example-dataset">Appendix: Example Dataset</h2>

<details>
  <summary>
See the code that creates the example databases.
</summary>
  <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'numbers1.db'</span><span class="p">;</span>
<span class="k">ATTACH</span> <span class="s1">'numbers2.db'</span><span class="p">;</span>
<span class="k">ATTACH</span> <span class="s1">'animals.db'</span><span class="p">;</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">numbers1.tbl</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="nf">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">t</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">numbers2.tbl</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="nf">range</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">)</span> <span class="n">t</span><span class="p">(</span><span class="n">i</span><span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">animals.ducks</span> <span class="k">AS</span>
<span class="k">FROM</span> <span class="p">(</span><span class="k">VALUES</span>
    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'Labrador Duck'</span><span class="p">,</span> <span class="mi">1878</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'Mallard'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'Crested Shelduck'</span><span class="p">,</span> <span class="mi">1964</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="s1">'Wood Duck'</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="s1">'Pink-headed Duck'</span><span class="p">,</span> <span class="mi">1949</span><span class="p">)</span>
<span class="p">)</span> <span class="n">t</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">extinct_year</span><span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">animals.swans</span> <span class="k">AS</span>
<span class="k">FROM</span> <span class="p">(</span><span class="k">VALUES</span>
    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'Aurora'</span><span class="p">,</span> <span class="s1">'Mute Swan'</span><span class="p">,</span> <span class="s1">'White'</span><span class="p">,</span> <span class="s1">'European lakes and rivers'</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'Midnight'</span><span class="p">,</span> <span class="s1">'Black Swan'</span><span class="p">,</span> <span class="s1">'Black'</span><span class="p">,</span> <span class="s1">'Australian wetlands'</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'Tundra'</span><span class="p">,</span> <span class="s1">'Tundra Swan'</span><span class="p">,</span> <span class="s1">'White'</span><span class="p">,</span> <span class="s1">'Arctic and subarctic regions'</span><span class="p">)</span>
<span class="p">)</span> <span class="n">t</span><span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">species</span><span class="p">,</span> <span class="n">color</span><span class="p">,</span> <span class="n">habitat</span><span class="p">);</span>

<span class="k">DETACH</span> <span class="n">numbers1</span><span class="p">;</span>
<span class="k">DETACH</span> <span class="n">numbers2</span><span class="p">;</span>
<span class="k">DETACH</span> <span class="n">animals</span><span class="p">;</span>
</code></pre></div>  </div>
</details>]]></content><author><name>The DuckDB team</name></author><category term="release" /><summary type="html"><![CDATA[We are releasing DuckDB version 1.5.0, codenamed “Variegata”. This release comes with a friendly CLI (a new, more ergonomic command line client), support for the `VARIANT` type, a built-in `GEOMETRY` type, along with many other features and optimizations. The v1.4.0 LTS line (“Andium”) will keep receiving updates until its end-of-life in September 2026.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-5-0.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-5-0.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing DuckDB 1.4.4 LTS</title><link href="https://duckdb.org/2026/01/26/announcing-duckdb-144.html" rel="alternate" type="text/html" title="Announcing DuckDB 1.4.4 LTS" /><published>2026-01-26T00:00:00+00:00</published><updated>2026-01-26T00:00:00+00:00</updated><id>https://duckdb.org/2026/01/26/announcing-duckdb-144</id><content type="html" xml:base="https://duckdb.org/2026/01/26/announcing-duckdb-144.html"><![CDATA[<p>In this blog post, we highlight a few important fixes in DuckDB v1.4.4, the fourth patch release in <a href="/2025/09/16/announcing-duckdb-140.html">DuckDB's 1.4 LTS line</a>.
The release ships bugfixes, performance improvements and security patches. You can find the complete <a href="https://github.com/duckdb/duckdb/releases/tag/v1.4.4">release notes on GitHub</a>.</p>

<p>To install the new version, please visit the <a href="/install/">installation page</a>.</p>

<h2 id="fixes">Fixes</h2>

<p>This version ships a number of performance improvements and bugfixes.</p>


<h3 id="correctness">Correctness</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/20008"><code class="language-plaintext highlighter-rouge">#20008</code> Unexpected Result when Using Utility Function ALIAS</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/20410"><code class="language-plaintext highlighter-rouge">#20410</code> ANTI JOIN produces wrong results with materialized CTEs</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/20156"><code class="language-plaintext highlighter-rouge">#20156</code> Streaming window unions produce incorrect results</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/20413"><code class="language-plaintext highlighter-rouge">#20413</code> ASOF joins with <code class="language-plaintext highlighter-rouge">predicate</code> fail with different errors for FULL, RIGHT, SEMI, and ANTI join types</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/20090"><code class="language-plaintext highlighter-rouge">#20090</code> mode() produces corrupted UTF-8 strings in parallel execution</a></li>
</ul>

<h3 id="crashes-and-internal-errors">Crashes and Internal Errors</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb-python/issues/127"><code class="language-plaintext highlighter-rouge">#20468</code> Segfault in Hive partitioning with NULL values</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/20086"><code class="language-plaintext highlighter-rouge">#20086</code> Incorrect results when using positional joins and indexes</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/20415"><code class="language-plaintext highlighter-rouge">#20415</code> C API data creation causes segfault</a></li>
</ul>

<h3 id="performance">Performance</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/pull/20252"><code class="language-plaintext highlighter-rouge">#20252</code> Optimize prepared statement parameter lookups</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/20284"><code class="language-plaintext highlighter-rouge">#20284</code> dbgen: use TaskExecutor framework to respect the <code class="language-plaintext highlighter-rouge">threads</code> setting</a></li>
</ul>

<h3 id="miscellaneous">Miscellaneous</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/20233"><code class="language-plaintext highlighter-rouge">#20233</code> Function chaining not allowed in QUALIFY</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/20339"><code class="language-plaintext highlighter-rouge">#20339</code> Use UTF-16 console output in Windows shell</a></li>
</ul>

<h2 id="conclusion">Conclusion</h2>

<p>This post was a short summary of the changes in v1.4.4. As usual, you can find the <a href="https://github.com/duckdb/duckdb/releases/tag/v1.4.4">full release notes on GitHub</a>.
We would like to thank our contributors for providing detailed issue reports and patches.
In the coming month, we'll release DuckDB v1.5.0.
We'll also keep v1.4 LTS updated until mid-September. We'll announce the release date of v1.4.5 in the <a href="/release_calendar.html">release calendar</a> in the coming months.</p>

<blockquote>
  <p>Earlier today, we pushed an incorrect tag that was visible for a few minutes.
No binaries or extensions were available under this tag and we replaced it as soon as we noticed the issue.
Our apologies for the erroneous release.</p>
</blockquote>]]></content><author><name>The DuckDB team</name></author><category term="release" /><summary type="html"><![CDATA[Today we are releasing DuckDB 1.4.4 with bugfixes and performance improvements.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-4-4-lts.jpg" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-4-4-lts.jpg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing Vortex Support in DuckDB</title><link href="https://duckdb.org/2026/01/23/duckdb-vortex-extension.html" rel="alternate" type="text/html" title="Announcing Vortex Support in DuckDB" /><published>2026-01-23T00:00:00+00:00</published><updated>2026-01-23T00:00:00+00:00</updated><id>https://duckdb.org/2026/01/23/duckdb-vortex-extension</id><content type="html" xml:base="https://duckdb.org/2026/01/23/duckdb-vortex-extension.html"><![CDATA[<p>I think it is worth starting this intro by talking a little bit about the established format for columnar data. Parquet has done some amazing things for analytics. If you go back to the days when CSV was the best alternative, then you know how important Parquet is. However, even though the specification has evolved over time, Parquet has some design constraints. A particular limitation is that it is block-compressed: engines need to decompress whole pages before they can do further operations like filtering or decoding values. For a while, <a href="https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-retrospective.html?#fileformats">researchers and private companies</a> have been working on alternatives that improve on some of Parquet’s shortcomings. Vortex, from the SpiralDB team, is one of them.</p>

<h2 id="what-is-vortex">What is Vortex?</h2>

<p><a href="https://vortex.dev/">Vortex</a> is an extensible, open source format for columnar data. It was created to handle heterogeneous compute patterns and different data modalities. But what does this mean?</p>

<blockquote>
  <p>The project was donated to the Linux Foundation by the <a href="https://spiraldb.com/post/vortex-a-linux-foundation-project">SpiralDB</a> team in August 2025.</p>
</blockquote>

<p>Vortex provides different layouts and encodings for different data types. Some of the most notable are <a href="/library/alp/">ALP</a> for floating point encoding or <a href="/2022/10/28/lightweight-compression.html">FSST</a> for string encoding. This lightweight compression strategy keeps data sizes down while allowing one of Vortex’s most important features: compute functions. By knowing the encoded layout of the data, Vortex is able to run arbitrary expressions on compressed data. This allows a Vortex reader to execute, for example, filter expressions within storage segments without decompressing data.</p>

<p>We mentioned heterogeneous compute to emphasize that Vortex was designed with the idea of having optimized layouts for different data types, including vectors, large text or even image or audio, but also to maximize CPU or GPU saturation. The idea is that decompression is deferred all the way to the GPU or CPU, enabling what Vortex calls “late materialization”. The <a href="/library/fastlanes/">FastLanes</a> encoding, a project originating at CWI (like DuckDB), is one of the main drivers behind this feature.</p>

<p>Vortex also supports dynamically loaded libraries (similar to DuckDB extensions) that provide new encodings for specific types as well as specific compute functions, e.g., for geospatial data. Another very interesting feature is embedding WebAssembly into the file, which allows the reader to benefit from compute kernels shipped with the file itself.</p>

<p>Besides DuckDB, other engines such as DataFusion, Spark and Arrow already offer integration with Vortex.</p>

<blockquote>
  <p>For more information, check out the <a href="https://spiraldb.com/post/vortex-a-linux-foundation-project">Vortex documentation</a>.</p>
</blockquote>

<h2 id="the-duckdb-vortex-extension">The DuckDB Vortex Extension</h2>

<p>DuckDB is, as its name says, a database, but it is also widely used as an engine to query many different data sources. Through core or community extensions, DuckDB can integrate with:</p>

<ul>
  <li>Databases like Snowflake, BigQuery or PostgreSQL.</li>
  <li>Lakehouse formats like Delta, Iceberg or DuckLake.</li>
  <li>File formats, most notably JSON, CSV, Parquet and most recently Vortex.</li>
</ul>

<blockquote>
  <p>The community has gotten very creative, though, so these days you can even read YAML and Markdown with DuckDB using <a href="/community_extensions/">community extensions</a>.</p>
</blockquote>

<p>All this is possible due to the DuckDB <a href="/docs/stable/extensions/overview.html">extension system</a>, which makes it relatively easy to implement logic to interact with different file formats or external systems.</p>

<p>The SpiralDB team built a <a href="https://github.com/vortex-data/duckdb-vortex">DuckDB extension</a>. Together with the <a href="https://duckdblabs.com/">DuckDB Labs</a> team, we have made the extension available as a <a href="/docs/stable/core_extensions/overview.html">core DuckDB extension</a>, so that the community can enjoy Vortex as a first-class citizen in DuckDB.</p>

<h3 id="example-usage">Example Usage</h3>

<p>Installing and using the Vortex extension is very simple:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span><span class="n"> vortex</span><span class="p">;</span>
<span class="k">LOAD</span><span class="n"> vortex</span><span class="p">;</span>
</code></pre></div></div>

<p>Then, you can easily use it to read and write, similar to other extensions such as Parquet.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">read_vortex</span><span class="p">(</span><span class="s1">'my.vortex'</span><span class="p">);</span>

<span class="k">COPY</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">generate_series</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">t</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>
<span class="k">TO</span> <span class="s1">'my.vortex'</span> <span class="p">(</span><span class="k">FORMAT</span> <span class="k">vortex</span><span class="p">);</span>
</code></pre></div></div>

<h3 id="why-vortex-and-duckdb">Why Vortex and DuckDB?</h3>

<p>Vortex claims to do well primarily at three use cases:</p>

<ul>
  <li>Traditional SQL analytics: Through late decompression and compute expressions on compressed data, Vortex can filter down data within the storage segment, reducing IO and memory consumption.</li>
  <li>Machine learning pre-processing pipelines: By supporting a wide variety of encodings for different data types, Vortex claims to be effective at reading and writing data, whether it is audio, text, images or vectors.</li>
  <li>AI model training: Encodings such as FastLanes allow for very efficient copying of data to the GPU. Vortex aims to be able to copy data directly from S3 object storage to the GPU.</li>
</ul>

<p>The promise of more efficient IO and memory use through late decompression is a good reason to try DuckDB and Vortex for SQL analytics. Moreover, if you are looking to run analytics on unified datasets that serve multiple use cases, including pre-processing pipelines and AI training, Vortex may be a good candidate, since it is designed to fit all of these use cases well.</p>

<h3 id="performance-experiment">Performance Experiment</h3>

<p>For those hungry for numbers, we ran the TPC-H benchmark at scale factor 100 with DuckDB to understand how Vortex performs as a storage format compared to Parquet. We tried to make the benchmark as fair as possible. These are the parameters:</p>

<ul>
  <li>Run on Mac M1 with 10 cores &amp; 32 GB of memory.</li>
  <li>The benchmark runs each query 5 times and the average is used for the final report.</li>
  <li>The DuckDB connection is closed after each query to make runs “colder” and prevent DuckDB’s caching (particularly with Parquet) from influencing the results. OS page caching does influence subsequent runs, but we decided to acknowledge this factor and still keep the first run.</li>
  <li>Each TPC-H table is a single file, which means that lineitem files for Parquet and Vortex are quite large (both around 20 GB). This allows us to ignore the effect of globbing and having many small files.</li>
  <li>Data files used for the benchmark are generated with <a href="https://github.com/clflushopt/tpchgen-rs">tpchgen-rs</a> and are copied out using DuckDB’s Parquet and Vortex extensions.</li>
  <li>We compared Vortex against Parquet v1 and v2. The v2 specification allows for considerably faster reading than the v1 specification but many writers do not support this, so we thought it was worth including both.</li>
</ul>
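<p>As an illustration of this setup, the single-file tables can be exposed to the unmodified TPC-H queries through views, one per format. This is only a sketch; the file names here are hypothetical:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- run the queries against the Vortex copy of lineitem ...
CREATE OR REPLACE VIEW lineitem AS
    SELECT * FROM read_vortex('lineitem.vortex');

-- ... or against the Parquet copy, without changing the query text
CREATE OR REPLACE VIEW lineitem AS
    SELECT * FROM read_parquet('lineitem.parquet');
</code></pre></div></div>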

<p><strong>The results are very good.</strong> With Vortex, the TPC-H benchmark runs 18% faster than Parquet v2 and 35% faster than Parquet v1 (comparing geometric means, which is the recommended approach).</p>
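<p>For reference, the geometric mean can be computed in DuckDB itself as the exponential of the mean of the log-transformed timings. The sketch below assumes a hypothetical <code class="language-plaintext highlighter-rouge">results</code> table with one timing per query and format:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
    format,
    exp(avg(ln(seconds))) AS geometric_mean,
    avg(seconds) AS arithmetic_mean
FROM results
GROUP BY format;
</code></pre></div></div>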

<p>Another interesting result is the standard deviation across runs. There was a considerable difference between the first (and coldest) run of each query and subsequent runs in Parquet, while Vortex performed very well across all runs with a much smaller standard deviation.</p>

<p><img src="/images/blog/duckdb-vortex/tpch_summary.png" alt="summary" /></p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Format</th>
      <th style="text-align: right">Geometric Mean (s)</th>
      <th style="text-align: right">Arithmetic Mean (s)</th>
      <th style="text-align: right">Avg Std Dev (s)</th>
      <th style="text-align: right">Total Time (s)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">parquet_v1</td>
      <td style="text-align: right">2.324712</td>
      <td style="text-align: right">2.875722</td>
      <td style="text-align: right">0.145914</td>
      <td style="text-align: right">63.265881</td>
    </tr>
    <tr>
      <td style="text-align: left">parquet_v2</td>
      <td style="text-align: right">1.839171</td>
      <td style="text-align: right">2.288013</td>
      <td style="text-align: right">0.182962</td>
      <td style="text-align: right">50.336281</td>
    </tr>
    <tr>
      <td style="text-align: left">vortex</td>
      <td style="text-align: right">1.507675</td>
      <td style="text-align: right">1.991289</td>
      <td style="text-align: right">0.078893</td>
      <td style="text-align: right">43.808349</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p>The times did vary across different runs of the same benchmark, and subsequent runs have yielded similar results with slight variations. The differences between Parquet v2 and Vortex have consistently been around 12-18% in geometric means and around 8-14% in total times. Benchmarking is very hard!</p>
</blockquote>

<!-- markdownlint-disable MD040 MD046 -->

<details>
  <summary>
Click here to see a more detailed breakdown of the benchmark results.
</summary>

  <p>This figure shows the results per query, including the standard deviation error bar.<br />
<img src="/images/blog/duckdb-vortex/tpch_rowgram.png" alt="mean_per_query" /><br />
The following is a summary of the dataset sizes in GB. Note that both Parquet v1 and v2 use the default compression of the DuckDB Parquet writer, which is Snappy. Vortex does not use any general-purpose compression here but still keeps the data sizes competitive.</p>

  <table>
    <thead>
      <tr>
        <th style="text-align: left">Table</th>
        <th style="text-align: left">parquet_v1</th>
        <th style="text-align: left">parquet_v2</th>
        <th style="text-align: left">vortex</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td style="text-align: left">customer</td>
        <td style="text-align: left">1.15</td>
        <td style="text-align: left">0.99</td>
        <td style="text-align: left">1.06</td>
      </tr>
      <tr>
        <td style="text-align: left">lineitem</td>
        <td style="text-align: left">21.15</td>
        <td style="text-align: left">16.02</td>
        <td style="text-align: left">18.14</td>
      </tr>
      <tr>
        <td style="text-align: left">nation</td>
        <td style="text-align: left">0.00</td>
        <td style="text-align: left">0.00</td>
        <td style="text-align: left">0.00</td>
      </tr>
      <tr>
        <td style="text-align: left">orders</td>
        <td style="text-align: left">6.02</td>
        <td style="text-align: left">4.54</td>
        <td style="text-align: left">5.03</td>
      </tr>
      <tr>
        <td style="text-align: left">part</td>
        <td style="text-align: left">0.59</td>
        <td style="text-align: left">0.47</td>
        <td style="text-align: left">0.54</td>
      </tr>
      <tr>
        <td style="text-align: left">partsupp</td>
        <td style="text-align: left">4.07</td>
        <td style="text-align: left">3.33</td>
        <td style="text-align: left">3.72</td>
      </tr>
      <tr>
        <td style="text-align: left">region</td>
        <td style="text-align: left">0.00</td>
        <td style="text-align: left">0.00</td>
        <td style="text-align: left">0.00</td>
      </tr>
      <tr>
        <td style="text-align: left">supplier</td>
        <td style="text-align: left">0.07</td>
        <td style="text-align: left">0.06</td>
        <td style="text-align: left">0.07</td>
      </tr>
      <tr>
        <td style="text-align: left"><strong>total</strong></td>
        <td style="text-align: left">33.06</td>
        <td style="text-align: left">25.40</td>
        <td style="text-align: left">28.57</td>
      </tr>
    </tbody>
  </table>

</details>

<!-- markdownlint-enable MD040 MD046 -->

<h2 id="conclusion">Conclusion</h2>

<p>Vortex is a very interesting alternative to established columnar formats like Parquet. Its focus on lightweight compression encodings, late decompression and being able to run compute expressions on compressed data makes it very interesting for a wide range of use cases. With regard to DuckDB, we see that Vortex is already very performant for analytical queries, where it is on par or better than Parquet v2 on the TPC-H benchmark queries.</p>

<blockquote>
  <p>Vortex has been <a href="https://docs.vortex.dev/specs/file-format">backwards compatible</a> since version 0.36.0, which was released more than 6 months ago. Vortex is now at version 0.56.0.</p>
</blockquote>]]></content><author><name>Guillermo Sanchez, SpiralDB Team</name></author><category term="benchmark" /><summary type="html"><![CDATA[Vortex is a new columnar file format with a very promising design. SpiralDB and DuckDB Labs have partnered to give you a very fast experience while reading and writing Vortex files!]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/vortex.svg" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/vortex.svg" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">DuckDB on LoongArch</title><link href="https://duckdb.org/2026/01/06/duckdb-on-loongarch-morefine.html" rel="alternate" type="text/html" title="DuckDB on LoongArch" /><published>2026-01-06T00:00:00+00:00</published><updated>2026-01-06T00:00:00+00:00</updated><id>https://duckdb.org/2026/01/06/duckdb-on-loongarch-morefine</id><content type="html" xml:base="https://duckdb.org/2026/01/06/duckdb-on-loongarch-morefine.html"><![CDATA[<p>It’s not every day that a new CPU architecture arrives on your desk. I grew up on the <a href="https://en.wikipedia.org/wiki/I486">Intel 486</a> back in the early 90s. I also still remember AMD releasing its <a href="https://en.wikipedia.org/wiki/X86-64#History">64-bit x86 extension</a> in 2000. Then not a lot happened until Apple released the ARM-based M1 architecture in 2020. But today is the day again (for me), with the long-awaited arrival of the “MOREFINE M700S” in our office.</p>

<p><img src="/images/blog/loongarch/morefine-computer.jpg" width="800" /></p>

<p>The M700S contains a Loongson CPU. Also called “LoongArch” or “Godson” processors, this CPU was developed in China <a href="https://www.tomshardware.com/pc-components/cpus/chinese-chipmaker-loongson-wins-case-over-rights-to-mips-architecture-companys-new-cpu-architecture-heavily-resembles-existing-mips">based</a> on the (somewhat esoteric) <a href="https://en.wikipedia.org/wiki/MIPS_architecture">MIPS architecture</a>. This is part of a plan to become technologically self-sufficient as part of the government-funded <a href="https://en.wikipedia.org/wiki/Made_in_China_2025">Made in China 2025</a> plan.</p>

<p>It is probably safe to assume that – given the ongoing trade shenanigans – the Loongson will become much more popular in China as time goes on. DuckDB already sees quite a lot of usage from China, so naturally we want to make sure that DuckDB runs well on the Loongson. Thankfully, one of our community members has already opened a <a href="https://github.com/duckdb/duckdb/pull/19962">pull request</a> with two minimal changes to allow DuckDB to compile. We became curious.</p>

<p>We purchased the M700S on (where else?) <a href="https://nl.aliexpress.com/item/1005008047862187.html?spm=a2g0o.order_list.order_list_main.5.685479d21SDmQG&amp;gatewayAdapt=glo2nld">AliExpress</a> for around 500 EUR. Besides the Loongson 8-core 3A6000 CPU it contains 16 GB of main memory and a 256 GB solid-state disk.</p>

<p><img src="/images/blog/loongarch/morefine-aliexpress-listing.png" width="800" /></p>

<p>Once plugged in and booted up, things feel pretty normal besides the loud fan that seems to be always on. On the screen, a variant of Debian called <a href="https://www.loongson.cn/EN/system/loongnix">Loongnix</a> boots up. The GUI seems to be KDE-based and comes with a custom browser “LBrowser”, which is a fork of Chromium. Because it was not obvious, we document it here: the default <code class="language-plaintext highlighter-rouge">root</code> password is <code class="language-plaintext highlighter-rouge">M700S</code>. There is also a user account <code class="language-plaintext highlighter-rouge">m700s</code> with the same password.</p>

<p><img src="/images/blog/loongarch/loongnix.jpg" width="800" /></p>

<p>Overall, the software seems a little dated, even after running <code class="language-plaintext highlighter-rouge">apt upgrade</code>: the Linux kernel seems to be version 4.19, which was released back in 2018, and which has been EOL for a year now. The GCC version is 8.3, which similarly came out in 2019.</p>

<p>With the <a href="https://github.com/duckdb/duckdb/pull/19962">aforementioned patch</a>, we managed to compile DuckDB 1.4.3 on Loongnix. There was one small issue where the CMake file <code class="language-plaintext highlighter-rouge">append_metadata.cmake</code> was not compatible with the older CMake version (3.13.4) available on Loongnix. But simply replacing that file with an empty one allowed us to complete the build. Of course we could also have updated CMake, but life is short. Once completed, we ran DuckDB’s extensive unit test suite (<code class="language-plaintext highlighter-rouge">make allunit</code>) to confirm that our build runs correctly on the Loongson CPU. Results looked good.</p>

<p>For performance comparison, we re-used the methodology from our <a href="https://duckdb.org/2025/01/17/raspberryi-pi-tpch">previous blog post</a> that ran DuckDB on a Raspberry Pi. In short, we run the 22 TPC-H benchmark queries on “Scale Factor” 100 and 300, which in DuckDB format is a 25 GB and 78 GB database file, respectively. We compare those numbers with the nearest computer, which is my day-to-day MacBook Pro with an M3 Max CPU. For fairness, we limit DuckDB to 14 GB of RAM on both platforms. The reported timings are “hot” runs, meaning we re-ran the query set and took the timings from the second run.</p>
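<p>This setup can be sketched in a few SQL statements. The database file name below is hypothetical, and the TPC-H queries are available via DuckDB's <code class="language-plaintext highlighter-rouge">tpch</code> extension:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- open the pre-generated TPC-H database file
ATTACH 'tpch-sf100.duckdb' AS db (READ_ONLY);
USE db;

-- cap memory identically on both platforms
SET memory_limit = '14GB';

INSTALL tpch;
LOAD tpch;

-- run TPC-H query 1; the reported timing is the second ("hot") run
PRAGMA tpch(1);
</code></pre></div></div>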

<p>Here are the results, and they are not great. We start with aggregated timings:</p>

<table>
  <thead>
    <tr>
      <th>SF</th>
      <th>System</th>
      <th style="text-align: right">Geometric mean</th>
      <th style="text-align: right">Sum</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>SF100</td>
      <td>MacBook</td>
      <td style="text-align: right">0.6</td>
      <td style="text-align: right">16.9</td>
    </tr>
    <tr>
      <td>SF100</td>
      <td>MOREFINE</td>
      <td style="text-align: right">6.1</td>
      <td style="text-align: right">192.8</td>
    </tr>
    <tr>
      <td>SF300</td>
      <td>MacBook</td>
      <td style="text-align: right">2.8</td>
      <td style="text-align: right">78.8</td>
    </tr>
    <tr>
      <td>SF300</td>
      <td>MOREFINE</td>
      <td style="text-align: right">27.3</td>
      <td style="text-align: right">791.6</td>
    </tr>
  </tbody>
</table>

<p>We can see that the MacBook is around <em>ten times faster</em> than the MOREFINE on this benchmark, both in the geometric mean of runtimes as well as in the sum.
If you are interested in the individual query runtimes, you can find them below.</p>
<details>
  <summary>
Click here to see the individual query runtimes.
</summary>
  <div>
<table>
<thead>
<tr>
<th style="text-align: right;">Q</th>
<th style="text-align: right;">SF100/MacBook</th>
<th style="text-align: right;">SF100/MOREFINE</th>
<th style="text-align: right;">SF300/MacBook</th>
<th style="text-align: right;">SF300/MOREFINE</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: right;">1</td>
<td style="text-align: right;">1.247</td>
<td style="text-align: right;">7.363</td>
<td style="text-align: right;">4.528</td>
<td style="text-align: right;">26.475</td>
</tr>
<tr>
<td style="text-align: right;">2</td>
<td style="text-align: right;">0.117</td>
<td style="text-align: right;">1.058</td>
<td style="text-align: right;">0.474</td>
<td style="text-align: right;">4.101</td>
</tr>
<tr>
<td style="text-align: right;">3</td>
<td style="text-align: right;">0.697</td>
<td style="text-align: right;">8.563</td>
<td style="text-align: right;">2.759</td>
<td style="text-align: right;">32.432</td>
</tr>
<tr>
<td style="text-align: right;">4</td>
<td style="text-align: right;">0.570</td>
<td style="text-align: right;">7.348</td>
<td style="text-align: right;">2.331</td>
<td style="text-align: right;">27.185</td>
</tr>
<tr>
<td style="text-align: right;">5</td>
<td style="text-align: right;">0.631</td>
<td style="text-align: right;">8.498</td>
<td style="text-align: right;">3.217</td>
<td style="text-align: right;">34.462</td>
</tr>
<tr>
<td style="text-align: right;">6</td>
<td style="text-align: right;">0.180</td>
<td style="text-align: right;">1.236</td>
<td style="text-align: right;">1.395</td>
<td style="text-align: right;">13.225</td>
</tr>
<tr>
<td style="text-align: right;">7</td>
<td style="text-align: right;">0.620</td>
<td style="text-align: right;">7.702</td>
<td style="text-align: right;">3.119</td>
<td style="text-align: right;">37.411</td>
</tr>
<tr>
<td style="text-align: right;">8</td>
<td style="text-align: right;">0.640</td>
<td style="text-align: right;">5.593</td>
<td style="text-align: right;">3.611</td>
<td style="text-align: right;">29.914</td>
</tr>
<tr>
<td style="text-align: right;">9</td>
<td style="text-align: right;">1.906</td>
<td style="text-align: right;">30.560</td>
<td style="text-align: right;">6.670</td>
<td style="text-align: right;">99.884</td>
</tr>
<tr>
<td style="text-align: right;">10</td>
<td style="text-align: right;">0.923</td>
<td style="text-align: right;">11.755</td>
<td style="text-align: right;">4.036</td>
<td style="text-align: right;">40.412</td>
</tr>
<tr>
<td style="text-align: right;">11</td>
<td style="text-align: right;">0.102</td>
<td style="text-align: right;">1.037</td>
<td style="text-align: right;">0.709</td>
<td style="text-align: right;">4.444</td>
</tr>
<tr>
<td style="text-align: right;">12</td>
<td style="text-align: right;">0.535</td>
<td style="text-align: right;">6.422</td>
<td style="text-align: right;">2.918</td>
<td style="text-align: right;">31.501</td>
</tr>
<tr>
<td style="text-align: right;">13</td>
<td style="text-align: right;">1.847</td>
<td style="text-align: right;">21.185</td>
<td style="text-align: right;">6.394</td>
<td style="text-align: right;">74.081</td>
</tr>
<tr>
<td style="text-align: right;">14</td>
<td style="text-align: right;">0.408</td>
<td style="text-align: right;">5.616</td>
<td style="text-align: right;">3.240</td>
<td style="text-align: right;">26.613</td>
</tr>
<tr>
<td style="text-align: right;">15</td>
<td style="text-align: right;">0.252</td>
<td style="text-align: right;">2.652</td>
<td style="text-align: right;">1.906</td>
<td style="text-align: right;">17.454</td>
</tr>
<tr>
<td style="text-align: right;">16</td>
<td style="text-align: right;">0.273</td>
<td style="text-align: right;">3.108</td>
<td style="text-align: right;">0.879</td>
<td style="text-align: right;">11.480</td>
</tr>
<tr>
<td style="text-align: right;">17</td>
<td style="text-align: right;">0.805</td>
<td style="text-align: right;">5.184</td>
<td style="text-align: right;">4.655</td>
<td style="text-align: right;">28.469</td>
</tr>
<tr>
<td style="text-align: right;">18</td>
<td style="text-align: right;">1.538</td>
<td style="text-align: right;">15.492</td>
<td style="text-align: right;">7.619</td>
<td style="text-align: right;">71.845</td>
</tr>
<tr>
<td style="text-align: right;">19</td>
<td style="text-align: right;">0.779</td>
<td style="text-align: right;">9.143</td>
<td style="text-align: right;">4.379</td>
<td style="text-align: right;">39.111</td>
</tr>
<tr>
<td style="text-align: right;">20</td>
<td style="text-align: right;">0.441</td>
<td style="text-align: right;">4.993</td>
<td style="text-align: right;">3.234</td>
<td style="text-align: right;">25.967</td>
</tr>
<tr>
<td style="text-align: right;">21</td>
<td style="text-align: right;">1.996</td>
<td style="text-align: right;">23.231</td>
<td style="text-align: right;">9.503</td>
<td style="text-align: right;">96.452</td>
</tr>
<tr>
<td style="text-align: right;">22</td>
<td style="text-align: right;">0.441</td>
<td style="text-align: right;">5.036</td>
<td style="text-align: right;">1.237</td>
<td style="text-align: right;">18.709</td>
</tr>
</tbody>
</table>

</div>
</details>

<p>It is always exciting to get DuckDB running on a new platform. Of course, we have built DuckDB to be ultra-portable and agnostic to hardware environments while still delivering excellent performance. So it was not too surprising that getting DuckDB running on the MOREFINE with its new-ish CPU was not that difficult. However, performance on the standard TPC-H benchmark was not that impressive, with the MacBook being around ten times faster than the MOREFINE.</p>

<p>Of course, there are many opportunities for improvement. For starters, the <code class="language-plaintext highlighter-rouge">gcc</code> toolchain on LoongArch is likely far less mature than its x86/ARM counterparts, so advances there could make a big difference. The same applies to IO performance, which we have not measured separately. But hey, the “glass half full” department could also rightfully claim that the Loongson CPU can complete TPC-H SF300!</p>

<p>One could also argue that a MacBook Pro is much more expensive than the 500 EUR MOREFINE. However, a recent M4 Mac Mini with the same memory and storage specs costs around 700 EUR, not that much more all things considered. It will run circles around the MOREFINE. And it will not constantly annoy you with its fan.</p>
After elaborating on the DuckDB ecosystem changes required to unlock this capability, we demonstrate our approach to interacting with an Iceberg REST Catalog.
It's browser-only, no extra setup required.</p>

<h2 id="interaction-models-for-iceberg-catalogs">Interaction Models for Iceberg Catalogs</h2>

<p><img src="/images/blog/iceberg-wasm/iceberg-analytics-today-dark.svg" alt="Iceberg analytics today" class="darkmode-img" />
<img src="/images/blog/iceberg-wasm/iceberg-analytics-today-light.svg" alt="Iceberg analytics today" class="lightmode-img" /></p>

<p><em>Iceberg</em> is an <em>open table format,</em> which allows you to capture a mutable database table as a set of static files on object storage (such as AWS S3).
<em>Iceberg catalogs</em> allow you to track and organize Iceberg tables.
For example, <a href="https://iceberg.apache.org/rest-catalog-spec/">Iceberg REST Catalogs</a> provide these functionalities through a REST API.</p>

<p>There are two common ways to interact with Iceberg catalogs:</p>

<ul>
  <li>The <em>client–server model,</em> where the compute part of the operation is delegated to a managed infrastructure (such as the cloud). Users can interact with the server by installing a local client or using a lightweight client such as a browser.</li>
  <li>The <em>client-is-the-server model,</em> where the user first installs the relevant libraries, and then performs queries directly on their machine.</li>
</ul>

<p>Iceberg engines follow these interaction models: they are either run natively in managed compute infrastructure or they are run locally by the user.
Let's see how things look with DuckDB in the mix!</p>

<h2 id="iceberg-with-duckdb">Iceberg with DuckDB</h2>

<p><img src="/images/blog/iceberg-wasm/iceberg-with-duckdb-dark.svg" alt="Iceberg with DuckDB" class="darkmode-img" />
<img src="/images/blog/iceberg-wasm/iceberg-with-duckdb-light.svg" alt="Iceberg with DuckDB" class="lightmode-img" /></p>

<p>DuckDB supports both Iceberg interaction models.
In the <em>client–server model,</em> DuckDB runs on the server to read the Iceberg datasets.
From the user's point of view, the choice of engine is transparent, and DuckDB is just one of many engines that the server could use in the background.
The <em>client-is-the-server</em> model is more interesting:
here, users <a href="/install/">install a DuckDB client locally</a>
and use it through its SQL interface to query Iceberg catalogs.
For example:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">SECRET</span> <span class="n">test_secret</span> <span class="p">(</span>
    <span class="k">TYPE</span> <span class="k">S3</span><span class="p">,</span> 
    <span class="k">KEY_ID</span> <span class="s1">'</span><span class="ge">AKIAIOSFODNN7EXAMPLE</span><span class="s1">'</span><span class="p">,</span>
    <span class="k">SECRET</span> <span class="s1">'</span><span class="ge">wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</span><span class="s1">'</span>
<span class="p">);</span>

<span class="k">ATTACH</span> <span class="s1">'</span><span class="ge">warehouse</span><span class="s1">'</span> <span class="k">AS</span> <span class="n">db</span> <span class="p">(</span>
    <span class="k">TYPE</span> <span class="n">ICEBERG</span><span class="p">,</span>
    <span class="k">ENDPOINT_URL</span> <span class="s1">'</span><span class="ge">https://your-iceberg-endpoint</span><span class="s1">'</span>
<span class="p">);</span>

<span class="k">SELECT</span> <span class="nf">sum</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">db.table</span>
<span class="k">WHERE</span> <span class="n">other_column</span> <span class="o">=</span> <span class="s1">'</span><span class="ge">some_value</span><span class="s1">'</span><span class="p">;</span>
</code></pre></div></div>

<p>The client-is-the-server model unlocks <a href="https://youtu.be/YQEUkFWa69o?t=3085">empowered clients</a>, which can operate directly on the data.</p>

<blockquote>
  <p>You can discover the full DuckDB-Iceberg extension feature set, including insert and update capabilities, in our <a href="/2025/11/28/iceberg-writes-in-duckdb.html">earlier blog post</a>.</p>
</blockquote>

<h2 id="iceberg-with-duckdb-in-the-browser">Iceberg with DuckDB in the Browser</h2>

<p>While setting up a local DuckDB installation is quite simple, opening a browser tab is even quicker.
Therefore, we asked ourselves: could we support the <em>client-is-the-server</em> model directly from within a browser tab?
This could provide a zero-setup, no-infrastructure, properly serverless option for interacting with Iceberg catalogs.</p>

<p><img src="/images/blog/iceberg-wasm/duckdb-iceberg-with-duckdb-wasm-dark.svg" alt="Iceberg with DuckDB-Wasm" class="darkmode-img" />
<img src="/images/blog/iceberg-wasm/duckdb-iceberg-with-duckdb-wasm-light.svg" alt="Iceberg with DuckDB-Wasm" class="lightmode-img" /></p>

<p>Luckily, DuckDB has a client that can run in any browser!
<a href="/docs/stable/clients/wasm/overview.html">DuckDB-Wasm</a> is a WebAssembly port of DuckDB, which <a href="/2023/12/18/duckdb-extensions-in-wasm.html">supports loading of extensions</a>.</p>

<p>Interacting with an Iceberg REST Catalog requires a number of functionalities: the ability to talk to a REST API over HTTP(S), the ability to read and write <code class="language-plaintext highlighter-rouge">avro</code> and <code class="language-plaintext highlighter-rouge">parquet</code> files on object storage, and, finally, the ability to negotiate authentication to access those resources on behalf of the user. All of this must be done from within a browser, without calling any native components.</p>

<p>To support these functionalities, we implemented the following high-level changes:</p>

<ul>
  <li>In the core <code class="language-plaintext highlighter-rouge">duckdb</code> codebase, we redesigned HTTP interactions, so that extensions and clients have a uniform interface to the networking stack. (<a href="https://github.com/duckdb/duckdb/pull/17464">PR</a>)</li>
  <li>In <code class="language-plaintext highlighter-rouge">duckdb-wasm</code>, we implemented such an interface, which in this case is a wrapper around the available JavaScript network stack. (<a href="https://github.com/duckdb/duckdb-wasm/pull/2056">PR</a>)</li>
  <li>In <code class="language-plaintext highlighter-rouge">duckdb-iceberg</code>, we routed all networking through the common HTTP interface, so that native DuckDB and DuckDB-Wasm execute the same logic. (<a href="https://github.com/duckdb/duckdb-iceberg/pull/576">PR</a>)</li>
</ul>

<p><strong>The result is that you can now query Iceberg with DuckDB running directly in a browser!</strong> You can now access the same Iceberg catalog using the <em>client–server</em> model, the <em>client-is-the-server</em> model, or fully serverless from the isolation of a browser tab!</p>

<h2 id="welcome-to-serverless-iceberg-analytics">Welcome to Serverless Iceberg Analytics</h2>

<p>Check out our demo of serverless Iceberg analytics using the <a href="/visualizer/?iceberg" class="button yellow">DuckDB Table Visualizer</a></p>

<video muted="" controls="" loop="" width="700">
  <source src="https://blobs.duckdb.org/videos/iceberg-wasm-demo.mp4" type="video/mp4" />
</video>

<blockquote>
  <p>The current credentials in the demo are provided via a throwaway account with minimal permissions. If you enter your own credentials and share a link, you will be sharing your credentials.</p>
</blockquote>

<h2 id="access-your-own-data">Access Your Own Data</h2>

<p>By substituting your own S3 Tables bucket ARN and credentials with the policy <a href="https://us-east-1.console.aws.amazon.com/iam/home?region=us-east-2#/policies/details/arn%3Aaws%3Aiam%3A%3Aaws%3Apolicy%2FAmazonS3TablesReadOnlyAccess"><code class="language-plaintext highlighter-rouge">AmazonS3TablesReadOnlyAccess</code></a>, you can also access your own catalog, metadata and data.
Computations are fully local, and the credentials and warehouse ID are only sent to the catalog endpoint specified in your <code class="language-plaintext highlighter-rouge">ATTACH</code> command.
Inputs are translated to SQL, and added to the hash segment of the URL.</p>

<p>This means that:</p>

<ul>
  <li>no sensitive data is handled or sent to <code class="language-plaintext highlighter-rouge">duckdb.org</code></li>
  <li>computations are local, fully in your browser</li>
  <li>you can use the familiar SQL interface with the same code snippets that can run everywhere DuckDB runs</li>
  <li>if you edit the credentials and share the resulting link, you will be sharing the new credentials</li>
</ul>

<p>As of today, this works with <a href="/docs/stable/core_extensions/iceberg/amazon_s3_tables.html">Amazon S3 Tables</a>. This has been implemented through a collaboration with the Amazon S3 Tables team.
To learn more about S3 Tables, how to get started and their feature set, you can take a look at their <a href="https://aws.amazon.com/s3/features/tables/">product page</a> or <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html">documentation</a>.
A demo of DuckDB querying S3 Tables from a browser was presented at AWS re:Invent 2025 – <a href="https://www.youtube.com/watch?v=Pi82g0YGklU&amp;t=2603s">see the presentation</a>.</p>
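<p>For Amazon S3 Tables specifically, attaching looks roughly as follows. This is a sketch following the extension's documentation; the ARN is a placeholder:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- pick up AWS credentials from the environment / config files
CREATE SECRET (
    TYPE s3,
    PROVIDER credential_chain
);

-- attach the S3 Tables bucket as an Iceberg catalog
ATTACH 'arn:aws:s3tables:us-east-1:123456789012:bucket/your-bucket' AS s3_tables_db (
    TYPE iceberg,
    ENDPOINT_TYPE s3_tables
);

SHOW ALL TABLES;
</code></pre></div></div>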

<h2 id="conclusion">Conclusion</h2>

<p>The DuckDB-Iceberg extension is now supported in DuckDB-Wasm, and it can read from and write to Iceberg REST Catalogs.
Users can now access Iceberg data from within a browser, without having to install or manage any compute nodes!</p>

<p>If you would like to provide feedback or file issues, please reach out to us on either the <a href="https://github.com/duckdb/duckdb-wasm">DuckDB-Wasm</a> or <a href="https://github.com/duckdb/duckdb-iceberg">DuckDB-Iceberg</a> repository. If you are interested in using any part of this within your organization, feel free to <a href="https://duckdblabs.com/contact/">reach out</a>.</p>]]></content><author><name>Carlo Piovesan, Tom Ebergen, Gábor Szárnyas</name></author><category term="deep dive" /><summary type="html"><![CDATA[DuckDB is the first end-to-end interface to Iceberg REST Catalogs within a browser tab. You can now read and write tables in Iceberg catalogs without needing to manage any infrastructure – directly from your browser!]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/iceberg-in-the-browser.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/iceberg-in-the-browser.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing DuckDB 1.4.3 LTS</title><link href="https://duckdb.org/2025/12/09/announcing-duckdb-143.html" rel="alternate" type="text/html" title="Announcing DuckDB 1.4.3 LTS" /><published>2025-12-09T00:00:00+00:00</published><updated>2025-12-09T00:00:00+00:00</updated><id>https://duckdb.org/2025/12/09/announcing-duckdb-143</id><content type="html" xml:base="https://duckdb.org/2025/12/09/announcing-duckdb-143.html"><![CDATA[<p>In this blog post, we highlight a few important fixes in DuckDB v1.4.3, the third patch release in <a href="/2025/09/16/announcing-duckdb-140.html">DuckDB's 1.4 LTS line</a>.
You can find the complete <a href="https://github.com/duckdb/duckdb/releases/tag/v1.4.3">release notes on GitHub</a>.</p>

<p>To install the new version, please visit the <a href="/install/">installation page</a>.</p>

<h2 id="fixes">Fixes</h2>

<p>This version ships a number of performance improvements and bugfixes.</p>

<h3 id="correctness">Correctness</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/18782"><code class="language-plaintext highlighter-rouge">#18782</code> The ART index reported an incorrect “rows affected” count</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19313"><code class="language-plaintext highlighter-rouge">#19313</code> Corner case: a <code class="language-plaintext highlighter-rouge">HAVING</code> clause without a <code class="language-plaintext highlighter-rouge">GROUP BY</code> returned an incorrect result</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19517"><code class="language-plaintext highlighter-rouge">#19517</code> <code class="language-plaintext highlighter-rouge">JOIN</code> with a <code class="language-plaintext highlighter-rouge">LIKE</code> pattern resulted in columns being incorrectly included</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19924"><code class="language-plaintext highlighter-rouge">#19924</code> The optimizer incorrectly removed the <code class="language-plaintext highlighter-rouge">ORDER BY</code> from aggregates</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/19970"><code class="language-plaintext highlighter-rouge">#19970</code> Fixed updates on indexed tables with DICT_FSST compression</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/20009"><code class="language-plaintext highlighter-rouge">#20009</code> Fixed updates with DICT_FSST compression</a></li>
</ul>

<h3 id="crashes-and-internal-errors">Crashes and Internal Errors</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/19469"><code class="language-plaintext highlighter-rouge">#19469</code> Potential error in the constraint violation message when checking foreign key constraints</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19754"><code class="language-plaintext highlighter-rouge">#19754</code> Race condition could trigger a segfault in the encryption key cache</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/20044"><code class="language-plaintext highlighter-rouge">#20044</code> Fixed edge case in index deletion code path</a></li>
</ul>

<h3 id="performance">Performance</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/18997"><code class="language-plaintext highlighter-rouge">#18997</code> Macro binding had slow performance for unbalanced trees</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/19901"><code class="language-plaintext highlighter-rouge">#19901</code> Memory management has been improved during WAL replay in the presence of indexes</a></li>
  <li>The <a href="/docs/stable/core_extensions/vortex.html"><code class="language-plaintext highlighter-rouge">vortex</code> extension</a> ships significant performance improvements for writing Vortex files</li>
</ul>

<h3 id="miscellaneous">Miscellaneous</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/19575"><code class="language-plaintext highlighter-rouge">#19575</code> Invalid Unicode error with <code class="language-plaintext highlighter-rouge">LIKE</code> expressions</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19916"><code class="language-plaintext highlighter-rouge">#19916</code> The default time zone of DuckDB-Wasm had an offset inverted from what it should be</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19884"><code class="language-plaintext highlighter-rouge">#19884</code> Copying to Parquet with a prepared statement did not work</a></li>
</ul>

<h2 id="azure-blob-storage-writes">Azure Blob Storage Writes</h2>

<p>The <a href="/docs/stable/core_extensions/azure.html"><code class="language-plaintext highlighter-rouge">azure</code> extension</a> can now <a href="https://github.com/duckdb/duckdb-azure/pull/131">write to Azure Blob Storage</a>.
This unlocks several other Azure and Fabric features, including using <a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview">OneLake</a> instances.</p>
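As a sketch of what this enables, writing query results straight to a container could look as follows. The container name and connection string below are placeholders; the secret setup follows the extension's <code class="language-plaintext highlighter-rouge">CREATE SECRET</code> syntax:

```sql
INSTALL azure;
LOAD azure;

-- Placeholder credentials: substitute your own connection string
CREATE SECRET azure_secret (
    TYPE azure,
    CONNECTION_STRING '<your_connection_string>'
);

-- Write query results directly to a blob in Azure Blob Storage
COPY (SELECT 42 AS answer)
    TO 'azure://my-container/answer.parquet';
```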

<h2 id="windows-arm64">Windows Arm64</h2>

<p>With this release, we are introducing beta support for Windows Arm64 by distributing native DuckDB extensions and Python wheels.</p>

<h3 id="extension-distribution">Extension Distribution</h3>

<p>On Windows Arm64, you can now natively install core extensions, including complex ones like <a href="/docs/stable/core_extensions/spatial/overview.html"><code class="language-plaintext highlighter-rouge">spatial</code></a>:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">duckdb</span>
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">PRAGMA</span> <span class="py">platform</span><span class="p">;</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────────────┐
│   platform    │
│    varchar    │
├───────────────┤
│ windows_arm64 │
└───────────────┘
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span><span class="n"> spatial</span><span class="p">;</span>
<span class="k">LOAD</span><span class="n"> spatial</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="nf">ST_Area</span><span class="p">(</span><span class="nf">ST_GeomFromText</span><span class="p">(</span>
        <span class="s1">'POLYGON((0 0, 4 0, 4 3, 0 3, 0 0))'</span>
    <span class="p">))</span> <span class="k">AS</span> <span class="n">area</span><span class="p">;</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌────────┐
│  area  │
│ double │
├────────┤
│  12.0  │
└────────┘
</code></pre></div></div>

<h3 id="python-wheel-distribution">Python Wheel Distribution</h3>

<p>We now distribute Python wheels for Windows Arm64 for Python 3.11+. This means that you can take, for example, a Copilot+ PC, install the native Arm64 Python interpreter, and run:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">pip </span>install <span class="nb">duckdb</span>
</code></pre></div></div>

<p>This installs the <code class="language-plaintext highlighter-rouge">duckdb</code> package using the binary distributed through <a href="https://pypi.org/project/duckdb/">PyPI</a>.
Then, you can use it as follows:</p>

<div class="language-batch highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">python</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Python 3.13.9
    (tags/v3.13.9:8183fa5, Oct 14 2025, 14:51:39)
    [MSC v.1944 64 bit (ARM64)] on win32

&gt;&gt;&gt; import duckdb
&gt;&gt;&gt; duckdb.__version__
'1.4.3'
</code></pre></div></div>

<blockquote>
  <p>Currently, many Python installations that you'll find on Windows Arm64 computers use the x86_64 (AMD64) Python distribution and run through Microsoft's <a href="https://learn.microsoft.com/en-us/windows/arm/apps-on-arm-x86-emulation">Prism emulator</a>. For example, if you install Python through the Windows Store, you will get the Python AMD64 installation. To understand which platform your Python installation is using, observe the Python CLI's first line (e.g., <code class="language-plaintext highlighter-rouge">Python 3.13.9 ... (ARM64)</code>).</p>
</blockquote>
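Rather than reading the CLI banner, you can also check this programmatically. Below is a minimal sketch using only the Python standard library; the printed value depends on your machine, with a native Windows Arm64 interpreter reporting `ARM64` and an emulated one reporting `AMD64`:

```python
import platform
import struct

# Which CPU architecture this interpreter was built for:
# 'ARM64' for a native Windows Arm64 build, 'AMD64' for an
# x86_64 build running under the Prism emulator.
machine = platform.machine()

# Pointer size confirms whether the interpreter is 64-bit.
bits = struct.calcsize("P") * 8

print(f"{machine} ({bits}-bit)")
```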

<h3 id="odbc-driver">ODBC Driver</h3>

<p>We are now shipping a native ODBC driver for Windows Arm64.
Head to the <a href="https://duckdb.org/install/?platform=windows&amp;environment=odbc">ODBC Windows installation page</a> to try it out!</p>

<h2 id="conclusion">Conclusion</h2>

<p>This post was a short summary of the changes in v1.4.3. As usual, you can find the <a href="https://github.com/duckdb/duckdb/releases/tag/v1.4.3">full release notes on GitHub</a>.
We would like to thank our contributors for providing detailed issue reports and patches.
Stay tuned for DuckDB v1.4.4 and v1.5.0, both released <a href="/release_calendar.html">early next year</a>!</p>]]></content><author><name>The DuckDB team</name></author><category term="release" /><summary type="html"><![CDATA[Today we are releasing DuckDB 1.4.3. Along with bugfixes, we are shipping native extensions and Python support for Windows Arm64.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-4-3-lts.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-4-3-lts.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Writes in DuckDB-Iceberg</title><link href="https://duckdb.org/2025/11/28/iceberg-writes-in-duckdb.html" rel="alternate" type="text/html" title="Writes in DuckDB-Iceberg" /><published>2025-11-28T00:00:00+00:00</published><updated>2025-11-28T00:00:00+00:00</updated><id>https://duckdb.org/2025/11/28/iceberg-writes-in-duckdb</id><content type="html" xml:base="https://duckdb.org/2025/11/28/iceberg-writes-in-duckdb.html"><![CDATA[<p>Over the past several months, the DuckDB Labs team has been hard at work on the <a href="/docs/stable/core_extensions/iceberg/overview.html">DuckDB-Iceberg extension</a>, with <em>full read support</em> and <em>initial write support</em> released in <a href="/2025/09/16/announcing-duckdb-140.html">v1.4.0</a>.
Today, we are happy to announce that delete and update support for Iceberg v2 tables is available in <a href="/2025/11/12/announcing-duckdb-142.html">v1.4.2</a>!</p>

<p>The Iceberg open table format has become extremely popular in the past two years, with many databases announcing support for the format <a href="https://softwareengineeringdaily.com/2024/03/07/iceberg-at-netflix-and-beyond-with-ryan-blue/">originally developed at Netflix</a>. This past year, the DuckDB team has made Iceberg integration a <a href="/roadmap.html">priority</a>, and today we are taking another step in that direction. In this blog post, we describe the current feature set of DuckDB-Iceberg in DuckDB v1.4.2.</p>

<h2 id="getting-started">Getting Started</h2>

<p>To experiment with the new DuckDB-Iceberg features, you will need to connect to your favorite Iceberg REST Catalog. There are many ways to do so: please have a look at the <a href="/docs/stable/core_extensions/iceberg/iceberg_rest_catalogs.html">Connecting to REST Catalogs</a> page for catalogs like <a href="https://polaris.apache.org/">Apache Polaris</a> or <a href="https://lakekeeper.io/">Lakekeeper</a>, and the <a href="/docs/stable/core_extensions/iceberg/amazon_s3_tables.html">Connecting to S3 Tables</a> page if you would like to connect to <a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html">Amazon S3 Tables</a>.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'</span><span class="ge">warehouse_name</span><span class="s1">'</span> <span class="k">AS</span> <span class="n">iceberg_catalog</span> <span class="p">(</span>
    <span class="k">TYPE</span> <span class="k">iceberg</span><span class="p">,</span>
    <span class="ge">other options</span>
<span class="p">);</span>
</code></pre></div></div>

<h2 id="inserts-deletes-and-updates">Inserts, Deletes and Updates</h2>

<p>Support for creating tables and inserting to tables was already added in DuckDB v1.4.0: you can use standard DuckDB SQL syntax to insert data into your Iceberg table.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">iceberg_catalog.default.simple_table</span> <span class="p">(</span>
    <span class="n">col1</span> <span class="nb">INTEGER</span><span class="p">,</span>
    <span class="n">col2</span> <span class="nb">VARCHAR</span>
<span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">iceberg_catalog.default.simple_table</span>
    <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'hello'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'world'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'duckdb is great'</span><span class="p">);</span>
</code></pre></div></div>

<p>You can also use any DuckDB table scan function to insert data into an Iceberg table:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">iceberg_catalog.default.more_data</span>
    <span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">read_parquet</span><span class="p">(</span><span class="s1">'path/to/parquet'</span><span class="p">);</span>
</code></pre></div></div>

<p>Starting with v1.4.2, the standard SQL syntax also works for deletes and updates:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span>
<span class="k">WHERE</span> <span class="n">col1</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>

<span class="k">UPDATE</span><span class="n"> iceberg_catalog.default.simple_table</span>
<span class="k">SET</span> <span class="n">col1</span> <span class="o">=</span> <span class="n">col1</span> <span class="o">+</span> <span class="mi">5</span>
<span class="k">WHERE</span> <span class="n">col1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span><span class="p">;</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     3 │ duckdb is great │
│     6 │ hello           │
└───────┴─────────────────┘
</code></pre></div></div>

<blockquote>
  <p>The Iceberg write support currently has two limitations:</p>

  <p>Write support is limited to <em>tables that are not partitioned and not sorted</em>. Attempting to perform update, insert, or delete operations on partitioned or sorted tables using DuckDB-Iceberg will result in an error.</p>

  <p>DuckDB-Iceberg only writes positional deletes for <code class="language-plaintext highlighter-rouge">DELETE</code> and <code class="language-plaintext highlighter-rouge">UPDATE</code> statements. Copy-on-write functionality is not yet supported.</p>
</blockquote>

<h2 id="functions-for-table-properties">Functions for Table Properties</h2>

<p>Currently, DuckDB-Iceberg only supports <em>merge-on-read semantics</em>. Within <a href="https://iceberg.apache.org/spec/#table-metadata-fields">Iceberg Table Metadata</a>, table properties can be used to describe what form of deletes or updates are allowed. DuckDB-Iceberg respects the <code class="language-plaintext highlighter-rouge">write.update.mode</code> and <code class="language-plaintext highlighter-rouge">write.delete.mode</code> table properties for updates and deletes. If a table has these properties and they are not <code class="language-plaintext highlighter-rouge">merge-on-read</code>, DuckDB will throw an error and the <code class="language-plaintext highlighter-rouge">UPDATE</code> or <code class="language-plaintext highlighter-rouge">DELETE</code> will not be committed. DuckDB v1.4.2 introduces three new functions to add, remove, and view table properties for an Iceberg table:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">set_iceberg_table_properties</code></li>
  <li><code class="language-plaintext highlighter-rouge">iceberg_table_properties</code></li>
  <li><code class="language-plaintext highlighter-rouge">remove_iceberg_table_properties</code></li>
</ul>

<p>You can use them as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- to set table properties</span>
<span class="k">CALL</span> <span class="nf">set_iceberg_table_properties</span><span class="p">(</span><span class="n">iceberg_catalog.default.simple_table</span><span class="p">,</span> <span class="p">{</span>
    <span class="s1">'write.update.mode'</span><span class="p">:</span> <span class="s1">'merge-on-read'</span><span class="p">,</span>
    <span class="s1">'write.file.size'</span><span class="p">:</span> <span class="s1">'100000kb'</span>
<span class="p">});</span>
<span class="c1">-- to read table properties</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">iceberg_table_properties</span><span class="p">(</span><span class="n">iceberg_catalog.default.simple_table</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────────────────┬───────────────┐
│        key        │     value     │
│      varchar      │    varchar    │
├───────────────────┼───────────────┤
│ write.update.mode │ merge-on-read │
│ write.file.size   │ 100000kb      │
└───────────────────┴───────────────┘
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- to remove table properties</span>
<span class="k">CALL</span> <span class="nf">remove_iceberg_table_properties</span><span class="p">(</span>
    <span class="n">iceberg_catalog.default.simple_table</span><span class="p">,</span>
    <span class="p">[</span><span class="s1">'some.other.property'</span><span class="p">]</span>
<span class="p">);</span>
</code></pre></div></div>

<h2 id="iceberg-table-metadata">Iceberg Table Metadata</h2>

<p>DuckDB-Iceberg also allows you to view the metadata of your Iceberg tables using the <code class="language-plaintext highlighter-rouge">iceberg_metadata()</code> and <code class="language-plaintext highlighter-rouge">iceberg_snapshots()</code> functions.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">iceberg_metadata</span><span class="p">(</span><span class="n">iceberg_catalog.default.table_1</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────────────────┬──────────────────────┬──────────────────┬─────────┬──────────────────┬─────────────────────────────────────────────────────────────┬─────────────┬──────────────┐
│    manifest_path     │ manifest_sequence_…  │ manifest_content │ status  │     content      │                         file_path                           │ file_format │ record_count │
│       varchar        │        int64         │     varchar      │ varchar │     varchar      │                          varchar                            │   varchar   │    int64     │
├──────────────────────┼──────────────────────┼──────────────────┼─────────┼──────────────────┼─────────────────────────────────────────────────────────────┼─────────────┼──────────────┤
│ s3://warehouse/def…  │                    1 │ DATA             │ ADDED   │ EXISTING         │ s3://&lt;storage_location&gt;/simple_table/data/019a6ecc-9e9e-7…  │ parquet     │            3 │
│ s3://warehouse/def…  │                    2 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3://&lt;storage_location&gt;/simple_table/data/d65b1db8-9fa8-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3://&lt;storage_location&gt;/simple_table/data/8d1b92dc-5f6e-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DATA             │ ADDED   │ EXISTING         │ s3://&lt;storage_location&gt;/simple_table/data/019a6ecf-5261-7…  │ parquet     │            1 │
└──────────────────────┴──────────────────────┴──────────────────┴─────────┴──────────────────┴─────────────────────────────────────────────────────────────┴─────────────┴──────────────┘
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">iceberg_snapshots</span><span class="p">(</span><span class="n">iceberg_catalog.default.simple_table</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────┬─────────────────────┬─────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │                                                manifest_list                                                 │
│     uint64      │       uint64        │        timestamp        │                                                   varchar                                                    │
├─────────────────┼─────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│               1 │ 1790528822676766947 │ 2025-11-10 17:24:55.075 │ s3://&lt;storage_location&gt;/simple_table/data/snap-1790528822676766947-f09658c4-ca52-4305-943f-6a8073529fef.avro │
│               2 │ 6333537230056014119 │ 2025-11-10 17:27:35.602 │ s3://&lt;storage_location&gt;/simple_table/data/snap-6333537230056014119-316d09bc-549d-46bc-ae13-a9fab5cbf09b.avro │
│               3 │ 7452040077415501383 │ 2025-11-10 17:27:52.169 │ s3://&lt;storage_location&gt;/simple_table/data/snap-7452040077415501383-93dee94e-9ec1-45fa-aec2-13ef434e50eb.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

</code></pre></div></div>

<h2 id="time-travel">Time Travel</h2>

<p>Time travel is also possible via snapshot ids or timestamps using the <code class="language-plaintext highlighter-rouge">AT (VERSION =&gt; ...)</code> or <code class="language-plaintext highlighter-rouge">AT (TIMESTAMP =&gt; ...)</code> syntax.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- via snapshot id</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span> <span class="k">AT</span> <span class="p">(</span>
	<span class="k">VERSION</span> <span class="o">=&gt;</span> <span class="ge">snapshot_id</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘
</code></pre></div></div>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- via timestamp</span>
<span class="k">SELECT</span> <span class="o">*</span>
<span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span> <span class="k">AT</span> <span class="p">(</span>
    <span class="nb">TIMESTAMP</span> <span class="o">=&gt;</span> <span class="s1">'2025-11-10 17:27:45.602'</span>
<span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘
</code></pre></div></div>

<h2 id="viewing-requests-to-the-iceberg-rest-catalog">Viewing Requests to the Iceberg REST Catalog</h2>

<p>You may also be curious about which requests DuckDB makes to the Iceberg REST Catalog.
To see them, enable HTTP <a href="/docs/stable/operations_manual/logging/overview.html">logging</a>, run your workload, then select from the <code class="language-plaintext highlighter-rouge">HTTP</code> logs.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CALL</span> <span class="nf">enable_logging</span><span class="p">(</span><span class="s1">'HTTP'</span><span class="p">);</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="n">request.type</span><span class="p">,</span> <span class="n">request.url</span><span class="p">,</span> <span class="n">response.status</span>
<span class="k">FROM</span> <span class="nf">duckdb_logs_parsed</span><span class="p">(</span><span class="s1">'HTTP'</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                                             url                          │       status       │
│ varchar │                                                                           varchar                        │      varchar       │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https://&lt;catalog_endpoint&gt;/iceberg/v1/&lt;warehouse&gt;/iceberg-testing/namespaces/default                     │ NULL               │
│ HEAD    │ https://&lt;catalog_endpoint&gt;/iceberg/v1/&lt;warehouse&gt;/iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https://&lt;catalog_endpoint&gt;/iceberg/v1/&lt;warehouse&gt;/iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https://&lt;storage_endpoint&gt;/data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-…                       │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                             │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                             │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                             │ PartialContent_206 │
│ GET     │ https://&lt;storage_endpoint&gt;/data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                             │ PartialContent_206 │
│ GET     │ https://&lt;storage_endpoint&gt;/data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                             │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parq…                       │ PartialContent_206 │
│ GET     │ https://&lt;storage_endpoint&gt;/data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                             │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parq…                       │ PartialContent_206 │
├─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                       3 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Here we can see calls to the Iceberg REST Catalog, followed by calls to the storage endpoint. The first three calls to the Iceberg REST Catalog are to verify the schema still exists and to get the latest <code class="language-plaintext highlighter-rouge">metadata.json</code> of the DuckDB-Iceberg table. Next, it queries the manifest list, manifest files, and eventually the files with data and deletes. The data and delete files are stored locally in a cache to speed up subsequent reads.</p>

<h2 id="transactions">Transactions</h2>

<p>DuckDB is an ACID-compliant database that supports <a href="/docs/stable/sql/statements/transactions.html">transactions</a>.
DuckDB-Iceberg has been built with this in mind. Within a transaction, the following conditions hold for Iceberg tables.</p>

<ol>
  <li>The first time a table is read in a transaction, its snapshot information is stored in the transaction and will remain consistent within that transaction.</li>
  <li>Updates, inserts and deletes will only be committed to an Iceberg table when the transaction is committed (i.e., <code class="language-plaintext highlighter-rouge">COMMIT</code>).</li>
</ol>

<p>Point #1 is important for read performance. If you run analytics on an Iceberg table and do not need the latest version of the table for every query, running your analytics in a transaction prevents DuckDB from fetching the latest snapshot each time.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- truncate the logs</span>
<span class="k">CALL</span> <span class="n">truncate_duckdb_logs</span><span class="p">();</span>
<span class="k">CALL</span> <span class="nf">enable_logging</span><span class="p">(</span><span class="s1">'HTTP'</span><span class="p">);</span>
<span class="k">BEGIN</span><span class="p">;</span>
<span class="c1">-- first read gets latest snapshot information</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span><span class="p">;</span>
<span class="c1">-- subsequent read reads from local cached data</span>
<span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span><span class="p">;</span>
<span class="c1">-- get logs</span>
<span class="k">SELECT</span> <span class="n">request.type</span><span class="p">,</span> <span class="n">request.url</span><span class="p">,</span> <span class="n">response.status</span>
<span class="k">FROM</span> <span class="nf">duckdb_logs_parsed</span><span class="p">(</span><span class="s1">'HTTP'</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                  url                                                        │       status       │
│ varchar │                                                varchar                                                      │      varchar       │
├─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https://&lt;catalog_endpoint&gt;/iceberg/v1/&lt;warehouse&gt;/iceberg-testing/namespaces/default                        │ NULL               │
│ HEAD    │ https://&lt;catalog_endpoint&gt;/iceberg/v1/&lt;warehouse&gt;/iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https://&lt;catalog_endpoint&gt;/iceberg/v1/&lt;warehouse&gt;/iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https://&lt;storage_endpoint&gt;/data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-1…                         │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                                │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                                │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                                │ PartialContent_206 │
│ GET     │ https://&lt;storage_endpoint&gt;/data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                                │ PartialContent_206 │
│ GET     │ https://&lt;storage_endpoint&gt;/data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                                │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parquet                        │ PartialContent_206 │
│ GET     │ https://&lt;storage_endpoint&gt;/data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                                │ OK_200             │
│ GET     │ https://&lt;storage_endpoint&gt;/data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parquet                        │ PartialContent_206 │
├─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                          3 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
</code></pre></div></div>

<p>Here we see all the same requests we saw in the previous section. However, now we are in a transaction, which means the second time we read from <code class="language-plaintext highlighter-rouge">iceberg_catalog.default.simple_table</code>, we do not need to query the REST Catalog for table updates. This means DuckDB-Iceberg performs no extra requests when reading a table a second time, significantly improving performance.</p>
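<p>The last point in the list above can be observed the same way: writes issued inside a transaction are only committed to the Iceberg table at <code class="language-plaintext highlighter-rouge">COMMIT</code>. A minimal sketch, reusing the table from the example above (the inserted values assume a hypothetical two-integer-column schema):</p>

```sql
BEGIN;
-- staged in the transaction, not yet visible to other readers
INSERT INTO iceberg_catalog.default.simple_table VALUES (1, 10);
INSERT INTO iceberg_catalog.default.simple_table VALUES (2, 20);
-- the staged changes are committed to the Iceberg table here
COMMIT;
```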

<h2 id="conclusion-and-future-work">Conclusion and Future Work</h2>

<p>With these features, DuckDB-Iceberg now has a solid foundation of support for Iceberg tables, which enables users to unlock the analytical power of DuckDB on their Iceberg data. There is still more work to come, and the Iceberg table specification has many more features the DuckDB team would like to support in DuckDB-Iceberg. If any feature is a priority for your analytical workloads, please reach out to us in the <a href="https://github.com/duckdb/duckdb-iceberg">DuckDB-Iceberg GitHub repository</a> or <a href="https://duckdblabs.com/contact/">get in touch</a> with our engineers.</p>

<p>Below is a list of improvements planned for the near future (in no particular order):</p>

<ul>
  <li>Performance improvements</li>
  <li>Updates / deletes / inserts to partitioned tables</li>
  <li>Updates / deletes / inserts to sorted tables</li>
  <li>Schema evolution</li>
  <li>Support for Iceberg v3 tables, focusing on binary deletion vectors and row lineage tracking</li>
</ul>]]></content><author><name>{&quot;twitter&quot; =&gt; &quot;the_Tmonster&quot;, &quot;picture&quot; =&gt; &quot;/images/blog/authors/tom_ebergen.jpg&quot;}</name></author><category term="deep dive" /><summary type="html"><![CDATA[We shipped a number of features and improvements to the DuckDB-Iceberg extension: insert, update, and delete statements are all supported now.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/iceberg-writes.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/iceberg-writes.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Data-at-Rest Encryption in DuckDB</title><link href="https://duckdb.org/2025/11/19/encryption-in-duckdb.html" rel="alternate" type="text/html" title="Data-at-Rest Encryption in DuckDB" /><published>2025-11-19T00:00:00+00:00</published><updated>2025-11-19T00:00:00+00:00</updated><id>https://duckdb.org/2025/11/19/encryption-in-duckdb</id><content type="html" xml:base="https://duckdb.org/2025/11/19/encryption-in-duckdb.html"><![CDATA[<blockquote>
  <p>If you would like to use encryption in DuckDB, we recommend using the latest stable version, v1.4.2. For more details, see the <a href="/2025/11/12/announcing-duckdb-142.html#vulnerabilities">latest release blog post</a>.</p>
</blockquote>

<p>Many years ago, we read the excellent “<a href="https://en.wikipedia.org/wiki/The_Code_Book">Code Book</a>” by <a href="https://en.wikipedia.org/wiki/Simon_Singh">Simon Singh</a>. Did you know that <a href="https://en.wikipedia.org/wiki/Mary,_Queen_of_Scots">Mary, Queen of Scots</a>, used an <a href="https://en.wikipedia.org/wiki/Caesar_cipher">encryption method harking back to Julius Caesar</a> to encrypt her more saucy letters? But alas: the cipher was broken and the contents of the letters got her <a href="https://en.wikipedia.org/wiki/Execution_of_Mary,_Queen_of_Scots">executed</a>.</p>

<p>These <a href="https://en.wikipedia.org/wiki/Crypto_Wars">days</a>, strong encryption is a commodity in both software and hardware. Modern CPUs <a href="https://developer.arm.com/documentation/ddi0602/2025-09/SIMD-FP-Instructions/AESE--AES-single-round-encryption-">come with specialized cryptography instructions</a>, and operating systems small and big contain <a href="https://www.heartbleed.com/">mostly</a>-robust cryptography software like OpenSSL.</p>

<p>Since databases store arbitrary information, it is clear that many, if not most, datasets of any value should not be plainly available to everyone. Even if stored on tightly controlled hardware like a cloud virtual machine, there have been <a href="https://haveibeenpwned.com/">many cases</a> of files being lost through various privilege escalations. Unsurprisingly, compliance frameworks like the common <a href="https://secureframe.com/hub/soc-2/what-is-soc-2">SOC 2</a> “highly recommend” encrypting data when it is stored on storage media like hard drives.</p>

<p>However, database systems and encryption have a somewhat problematic track record. Even PostgreSQL, self-proclaimed “The World's Most Advanced Open Source Relational Database”, has very <a href="https://www.postgresql.org/docs/current/encryption-options.html">limited options</a> for data encryption. SQLite, the world’s “<a href="https://www.sqlite.org/mostdeployed.html">Most Widely Deployed and Used Database Engine</a>”, does not support data encryption out of the box; its encryption extension is <a href="https://sqlite.org/com/see.html">a $2000 add-on</a>.</p>

<p>DuckDB has supported <a href="https://parquet.apache.org/docs/file-format/data-pages/encryption/">Parquet Modular Encryption</a> <a href="https://duckdb.org/docs/stable/data/parquet/encryption">for a while</a>. This feature allows reading and writing Parquet files with encrypted columns. However, while Parquet files are great and <a href="https://materializedview.io/p/nimble-and-lance-parquet-killers">reports of their impending death</a> are greatly exaggerated, they cannot – for example – be updated in place, a pretty basic feature of a database management system.</p>

<p>Starting with DuckDB 1.4.0, DuckDB supports <em>transparent data encryption</em> of data-at-rest using industry-standard AES encryption.</p>

<blockquote>
  <p>DuckDB's encryption does not yet meet the official <a href="https://csrc.nist.gov/projects/cryptographic-standards-and-guidelines">NIST requirements</a>.
Please follow issue <a href="https://github.com/duckdb/duckdb/issues/20162"><code class="language-plaintext highlighter-rouge">#20162</code> “Store and verify tag for canary encryption”</a> to track our progress towards NIST-compliance.</p>
</blockquote>

<h2 id="some-basics-of-encryption">Some Basics of Encryption</h2>

<p>There are many different ways to encrypt data, some more secure than others. In database systems and elsewhere, the standard is the <a href="https://en.wikipedia.org/wiki/Advanced_Encryption_Standard">Advanced Encryption Standard</a> (AES), which is a block cipher algorithm standardized by <a href="https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology">US NIST</a>. AES is a symmetric encryption algorithm, meaning that the <em>same</em> key is used for both encryption and decryption of data.</p>

<p>For this reason, most systems choose to only support <em>randomized</em> encryption, meaning that identical plaintexts will always yield different ciphertexts (if used correctly!). The most commonly used and recommended industry-standard algorithm is AES in <a href="https://en.wikipedia.org/wiki/Galois/Counter_Mode">Galois/Counter Mode</a> (AES-GCM). On top of its ability to randomize encryption, AES-GCM also <em>authenticates</em> data by calculating a tag to ensure the data has not been tampered with.</p>

<p>DuckDB v1.4 supports encryption at rest using the AES-GCM-256 and AES-CTR-256 (counter mode) ciphers. AES-CTR is a simpler and faster variant of AES-GCM, but less secure, since it does not provide authentication by calculating a tag. The 256 refers to the size of the key in bits, meaning that DuckDB currently only supports 32-byte keys.</p>

<p>GCM and CTR both take as input (1) a plaintext, (2) an initialization vector (IV) and (3) an encryption key. The plaintext is the data that a user wants to encrypt. The IV is a unique bytestream, usually 16 bytes, that ensures identical plaintexts get encrypted into different ciphertexts. A <em>number used once</em> (nonce) is a bytestream, usually 12 bytes, that together with a 4-byte counter constructs the IV. Note that the IV needs to be unique for every encrypted block, but it does not necessarily have to be random. Reusing the same IV is problematic, since an attacker could XOR the two ciphertexts and extract both messages. The tag in AES-GCM is calculated after all blocks are encrypted, much like a checksum, but it adds an integrity check that securely authenticates the entire ciphertext.</p>

<h2 id="implementation-in-duckdb">Implementation in DuckDB</h2>

<p>Before diving deeper into how we actually implemented encryption in DuckDB, we’ll explain some things about the DuckDB file format.</p>

<p>DuckDB has one <strong>main database header</strong> which stores data that enables it to correctly load and verify a DuckDB database. At the start of each main database header, the magic bytes (“DUCKDB”) are stored and read upon initialization to verify whether the file is a valid DuckDB database file. The magic bytes are followed by four 8-byte flags that can be set for different purposes.</p>

<p>When a database is encrypted in DuckDB, the main database header remains plaintext at all times, since it contains <em>no sensitive data</em> about the contents of the database file.
Upon initializing an encrypted database, DuckDB sets the first bit of the first flag to indicate that the database is encrypted. After setting this bit, additional metadata required for encryption is stored: (1) the <em>database identifier</em>, (2) 8 bytes of additional metadata, e.g., for the encryption cipher used, and (3) the encrypted canary.</p>

<p>The <em>database identifier</em> is used as a “salt” and consists of 16 randomly generated bytes created upon initialization of each database. A salt is typically used to ensure uniqueness, i.e., it makes sure that identical input keys or passwords are transformed into <em>different</em> derived keys. The 8 bytes of metadata comprise the key derivation function (first byte), the usage of additional authenticated data (second byte), the encryption cipher (third byte), and the key length (fifth byte). After the metadata, the main header stores the encrypted canary, which is used to check whether the input key is correct.</p>

<h3 id="encryption-key-management">Encryption Key Management</h3>

<p>To encrypt data in DuckDB, you can use practically <em>any</em> plaintext or base64-encoded string as a key, but we recommend using a secure 32-byte base64 key. Users are themselves responsible for key management and thus for using a secure key. Instead of directly using the plain key provided by the user, DuckDB always derives a more secure key by means of a key derivation function (KDF), which reduces or extends the input key to a secure 32-byte key. Once the input key has been verified by deriving the secure key and decrypting the canary, the derived key is managed in a <em>secure</em> encryption key cache. This cache manages encryption keys for the current DuckDB context and ensures that the derived encryption keys are never swapped to disk by locking their memory. To strengthen security even further, the original input keys are immediately wiped from memory once they have been transformed into secure derived keys.</p>
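<p>For example, a key generated out-of-band (e.g., with <code class="language-plaintext highlighter-rouge">openssl rand -base64 32</code>) can be passed directly to <code class="language-plaintext highlighter-rouge">ATTACH</code>. The key and file name below are of course only placeholders:</p>

```sql
-- the key is a placeholder 32-byte base64 string;
-- generate your own and never hard-code real keys in scripts
ATTACH 'secure.duckdb' AS secure_db (
    ENCRYPTION_KEY 'MDEyMzQ1Njc4OWFiY2RlZjAxMjM0NTY3ODlhYmNkZWY='
);
```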

<h3 id="duckdb-block-structure">DuckDB Block Structure</h3>

<p>After the main database header, DuckDB stores two 4KB database headers that contain more information about, e.g., the block (header) size and the storage version used. Apart from the main database header, which stays plaintext, <em>all</em> remaining headers and blocks are encrypted when encryption is used.</p>

<p>Blocks in DuckDB are by default 256KB, but their size is configurable. At the start of each <em>plaintext</em> block there is an 8-byte block header, which stores an 8-byte checksum. The checksum is a simple calculation that is often used in database systems to check for any corrupted data.</p>

<div>
    <img src="/images/blog/encryption/plaintext-block-light.svg" alt="Plaintext block" class="lightmode-img" />
    <img src="/images/blog/encryption/plaintext-block-dark.svg" alt="Plaintext block" class="darkmode-img" />
</div>

<p>For encrypted blocks, however, the block header consists of 40 bytes instead of 8. In addition to the checksum, it contains a 16-byte <em>nonce/IV</em> and, optionally, a 16-byte <em>tag</em>, depending on which encryption cipher is used. The nonce and tag are stored in plaintext, but the checksum is encrypted for better security. Note that the block header always needs to be 8-byte aligned to calculate the checksum.</p>

<div>
    <img src="/images/blog/encryption/encrypted-block-light.svg" alt="Encrypted block" class="lightmode-img" />
    <img src="/images/blog/encryption/encrypted-block-dark.svg" alt="Encrypted block" class="darkmode-img" />
</div>

<h3 id="write-ahead-log-encryption">Write-Ahead-Log Encryption</h3>

<p>The write-ahead log (WAL) in database systems is a crash recovery mechanism that ensures <em>durability</em>. It is an append-only file used when the database crashes or is closed abruptly before all changes have been written to the main database file. The WAL makes sure these changes can be replayed from the last checkpoint, which is a consistent snapshot of the database at a certain point in time. When a checkpoint is enforced, which happens in DuckDB by either (1) closing the database or (2) reaching a certain storage threshold, the WAL gets written into the main database file.</p>
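<p>A checkpoint can also be triggered manually with the <code class="language-plaintext highlighter-rouge">CHECKPOINT</code> statement, which flushes the WAL into the main database file:</p>

```sql
-- merge the WAL into the main database file
CHECKPOINT;
```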

<p>In DuckDB, you can force the creation of a WAL by setting</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">PRAGMA</span> <span class="py">disable_checkpoint_on_shutdown</span><span class="p">;</span>
<span class="k">PRAGMA</span> <span class="py">wal_autocheckpoint</span> <span class="o">=</span> <span class="s1">'1TB'</span><span class="p">;</span>
</code></pre></div></div>

<p>This disables checkpointing when the database is closed, meaning that the WAL does not get merged into the main database file. In addition, setting <code class="language-plaintext highlighter-rouge">wal_autocheckpoint</code> to a high threshold avoids intermediate checkpoints, so the WAL persists. For example, we can create a persistent WAL file by first setting the above pragmas, then attaching an encrypted database, and then creating a table into which we insert 3 values.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'encrypted.db'</span> <span class="k">AS</span> <span class="n">enc</span> <span class="p">(</span>
    <span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">,</span>
    <span class="k">ENCRYPTION_CIPHER</span> <span class="s1">'GCM'</span>
<span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">enc.test</span> <span class="p">(</span><span class="n">a</span> <span class="nb">INTEGER</span><span class="p">,</span> <span class="n">b</span> <span class="nb">INTEGER</span><span class="p">);</span>
<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">enc.test</span> <span class="k">VALUES</span> <span class="p">(</span><span class="mi">11</span><span class="p">,</span> <span class="mi">22</span><span class="p">),</span> <span class="p">(</span><span class="mi">13</span><span class="p">,</span> <span class="mi">22</span><span class="p">),</span> <span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">21</span><span class="p">)</span>
</code></pre></div></div>

<p>If we now close the DuckDB process, we can see that there is a <code class="language-plaintext highlighter-rouge">.wal</code> file shown: <code class="language-plaintext highlighter-rouge">encrypted.db.wal</code>. But how is the WAL created internally?</p>

<p>Before new entries (inserts, updates, deletes) are written to the database, they are logged and appended to the WAL. Only <em>after</em> the logged entries are flushed to disk is a transaction considered committed. A plaintext WAL entry has the following structure:</p>

<div>
    <img src="/images/blog/encryption/plaintext-wal-entry-light.svg" alt="Plaintext block" class="lightmode-img" />
    <img src="/images/blog/encryption/plaintext-wal-entry-dark.svg" alt="Plaintext block" class="darkmode-img" />
</div>

<p>Since the WAL is append-only, we encrypt each WAL entry individually. For AES-GCM this means that we append a nonce and a tag to each entry. The structure in which we do this is depicted below. When we serialize an encrypted entry to the encrypted WAL, we first store the length in plaintext, because we need to know how many bytes to decrypt. The length is followed by a nonce, which in turn is followed by the encrypted checksum and the encrypted entry itself. After the entry, a 16-byte tag is stored for verification.</p>

<div>
    <img src="/images/blog/encryption/encrypted-wal-entry-light.svg" alt="Plaintext block" class="lightmode-img" />
    <img src="/images/blog/encryption/encrypted-wal-entry-dark.svg" alt="Plaintext block" class="darkmode-img" />
</div>

<p>Encrypting the WAL is triggered by default when an encryption key is given for any (un)encrypted database.</p>

<h3 id="temporary-file-encryption">Temporary File Encryption</h3>

<p>Temporary files are used to store intermediate data that is often necessary for large, out-of-core operations such as <a href="/2025/09/24/sorting-again.html">sorting</a>, large joins and <a href="https://duckdb.org/2021/10/13/windowing">window functions</a>. This data could contain sensitive information and can, in case of a crash, remain on disk. To protect this leftover data, DuckDB automatically encrypts temporary files too.</p>

<h4 id="the-structure-of-temporary-files">The Structure of Temporary Files</h4>

<p>There are three different types of temporary files in DuckDB: (1) temporary files that have the same layout as a regular 256KB block, (2) compressed temporary files and (3) temporary files that exceed the standard 256KB block size. The former two are suffixed with <code class="language-plaintext highlighter-rouge">.tmp</code>, while the latter is suffixed with <code class="language-plaintext highlighter-rouge">.block</code>. To keep track of the size of <code class="language-plaintext highlighter-rouge">.block</code> temporary files, they are always prefixed with their length. As opposed to regular database blocks, temporary files do not contain a checksum for detecting data corruption, since calculating a checksum is somewhat expensive.</p>

<h4 id="encrypting-temporary-files">Encrypting Temporary Files</h4>

<p>Temporary files are encrypted (1) <strong>automatically</strong> when you attach an encrypted database, or (2) when you use the setting <code class="language-plaintext highlighter-rouge">SET temp_file_encryption = true</code>. In the latter case, the main database file is plaintext, but the temporary files are encrypted. To encrypt temporary files, DuckDB internally generates <em>temporary keys</em>. This means that when the database crashes, the temporary keys are lost as well; the temporary files can then no longer be decrypted and are essentially garbage.</p>
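<p>For instance, to protect spilled intermediate data even for an otherwise plaintext database (the file name below is illustrative):</p>

```sql
-- temporary files get encrypted with internally generated keys,
-- while the attached database file itself stays plaintext
SET temp_file_encryption = true;
ATTACH 'plain.duckdb' AS plain;
```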

<p>To force DuckDB to produce temporary files, a simple trick is to set the memory limit low: temporary files are created once the memory limit is exceeded. For example, we can create a new encrypted database, load it with TPC-H data (SF 1), and set the memory limit to 1 GB. If we then perform a large join, we force DuckDB to spill intermediate data to disk:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="py">memory_limit</span> <span class="o">=</span> <span class="s1">'1GB'</span><span class="p">;</span>
<span class="k">ATTACH</span> <span class="s1">'tpch_encrypted.db'</span> <span class="k">AS</span> <span class="n">enc</span> <span class="p">(</span>
    <span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">,</span>
    <span class="k">ENCRYPTION_CIPHER</span> <span class="s1">'cipher'</span>
<span class="p">);</span>
<span class="k">USE</span> <span class="n">enc</span><span class="p">;</span>
<span class="k">CALL</span> <span class="nf">dbgen</span><span class="p">(</span><span class="k">sf</span> <span class="o">=</span> <span class="mi">1</span><span class="p">);</span>

<span class="k">ALTER</span> <span class="k">TABLE</span> <span class="n">lineitem</span>
    <span class="k">RENAME</span> <span class="k">TO</span> <span class="n">lineitem1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">lineitem2</span> <span class="k">AS</span>
    <span class="k">FROM</span> <span class="n">lineitem1</span><span class="p">;</span>
<span class="k">CREATE</span> <span class="k">OR</span> <span class="k">REPLACE</span> <span class="k">TABLE</span> <span class="n">ans</span> <span class="k">AS</span>
    <span class="k">SELECT</span> <span class="n">l1</span><span class="p">.</span><span class="o">*</span> <span class="p">,</span> <span class="n">l2</span><span class="p">.</span><span class="o">*</span>
    <span class="k">FROM</span> <span class="n">lineitem1</span> <span class="n">l1</span>
    <span class="k">JOIN</span> <span class="n">lineitem2</span> <span class="n">l2</span> <span class="k">USING</span> <span class="p">(</span><span class="n">l_orderkey</span> <span class="p">,</span> <span class="n">l_linenumber</span><span class="p">);</span>
</code></pre></div></div>

<p>This sequence of commands will result in encrypted temporary files being written to disk. Once the query completes or when the DuckDB shell is exited, the temporary files are automatically cleaned up. In case of a crash however, it may happen that temporary files will be left on disk and need to be cleaned up manually.</p>

<h2 id="how-to-use-encryption-in-duckdb">How to Use Encryption in DuckDB</h2>

<p>In DuckDB, you can (1) encrypt an existing database, (2) initialize a new, empty encrypted database or (3) reencrypt a database. For example, let's create a new database, load it with TPC-H data at scale factor 1, and then encrypt it.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span><span class="n"> tpch</span><span class="p">;</span>
<span class="k">LOAD</span><span class="n"> tpch</span><span class="p">;</span>
<span class="k">ATTACH</span> <span class="s1">'encrypted.duckdb'</span> <span class="k">AS</span> <span class="k">encrypted</span> <span class="p">(</span><span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">);</span>
<span class="k">ATTACH</span> <span class="s1">'unencrypted.duckdb'</span> <span class="k">AS</span> <span class="n">unencrypted</span><span class="p">;</span>
<span class="k">USE</span> <span class="n">unencrypted</span><span class="p">;</span>
<span class="k">CALL</span> <span class="nf">dbgen</span><span class="p">(</span><span class="k">sf</span> <span class="o">=</span> <span class="mi">1</span><span class="p">);</span>
<span class="k">COPY</span> <span class="k">FROM</span> <span class="k">DATABASE</span> <span class="n">unencrypted</span> <span class="k">TO</span> <span class="k">encrypted</span><span class="p">;</span>
</code></pre></div></div>

<p>There is no trivial way to prove that a database is encrypted, but correctly encrypted data should look like random noise and have high entropy. So, to check whether a database is actually encrypted, we can use tools that calculate the entropy or visualize the binary, such as <a href="https://github.com/lsauer/entropy">ent</a> and <a href="https://github.com/sharkdp/binocle">binocle</a>.</p>

<p>Running <code class="language-plaintext highlighter-rouge">ent encrypted.duckdb</code> after executing the above chunk of SQL reports an entropy of 7.99999 bits per byte. Doing the same for the plaintext (unencrypted) database results in 7.65876 bits per byte. Note that the plaintext database also has high entropy, but this is due to compression.</p>

<p>Let’s now visualize both the plaintext and the encrypted data with binocle. For the visualization, we created both a plaintext DuckDB database with TPC-H data at scale factor 0.001 and an encrypted one:</p>

<details>
  <summary>
Click here to see the entropy of a plaintext database
</summary>
  <div>
    <img src="https://blobs.duckdb.org/images/duckdb-plaintext-database.png" width="800" />
</div>
</details>

<details style="margin-top: 15px">
  <summary>
Click here to see the entropy of an encrypted database
</summary>
  <div>
    <img src="https://blobs.duckdb.org/images/duckdb-encrypted-database.png" width="800" />
</div>
</details>

<p>In these figures, we can clearly observe that the encrypted database file seems completely random, while the plaintext database file shows some clear structure in its binary data.</p>

<p>To decrypt an encrypted database, we can use the following SQL:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'encrypted.duckdb'</span> <span class="k">AS</span> <span class="k">encrypted</span> <span class="p">(</span><span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">);</span>
<span class="k">ATTACH</span> <span class="s1">'new_unencrypted.duckdb'</span> <span class="k">AS</span> <span class="n">unencrypted</span><span class="p">;</span>
<span class="k">COPY</span> <span class="k">FROM</span> <span class="k">DATABASE</span> <span class="k">encrypted</span> <span class="k">TO</span> <span class="n">unencrypted</span><span class="p">;</span>
</code></pre></div></div>

<p>And to reencrypt an existing database, we can simply copy the old encrypted database to a new one:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'encrypted.duckdb'</span> <span class="k">AS</span> <span class="k">encrypted</span> <span class="p">(</span><span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">);</span>
<span class="k">ATTACH</span> <span class="s1">'new_encrypted.duckdb'</span> <span class="k">AS</span> <span class="n">new_encrypted</span> <span class="p">(</span><span class="k">ENCRYPTION_KEY</span> <span class="s1">'xxxx'</span><span class="p">);</span>
<span class="k">COPY</span> <span class="k">FROM</span> <span class="k">DATABASE</span> <span class="k">encrypted</span> <span class="k">TO</span> <span class="n">new_encrypted</span><span class="p">;</span>
</code></pre></div></div>

<p>The default encryption cipher is AES-GCM, which is recommended since it also authenticates data by calculating a tag. Depending on the use case, you can also use AES-CTR, which is faster than AES-GCM since it skips calculating a tag after encrypting all data. You can specify the CTR cipher as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'encrypted.duckdb'</span> <span class="k">AS</span> <span class="k">encrypted</span> <span class="p">(</span>
    <span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">,</span>
    <span class="k">ENCRYPTION_CIPHER</span> <span class="s1">'CTR'</span>
<span class="p">);</span>
</code></pre></div></div>

<p>To check which of the attached databases are encrypted, run:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="nf">duckdb_databases</span><span class="p">();</span>
</code></pre></div></div>

<p>This will show which databases are encrypted, and which cipher is used:</p>

<div class="monospace_table"></div>

<table>
  <thead>
    <tr>
      <th>database_name</th>
      <th>database_oid</th>
      <th>path</th>
      <th>…</th>
      <th>encrypted</th>
      <th>cipher</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>encrypted</td>
      <td>2103</td>
      <td>encrypted.duckdb</td>
      <td>…</td>
      <td>true</td>
      <td>GCM</td>
    </tr>
    <tr>
      <td>unencrypted</td>
      <td>2050</td>
      <td>unencrypted.duckdb</td>
      <td>…</td>
      <td>false</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>memory</td>
      <td>592</td>
      <td>NULL</td>
      <td>…</td>
      <td>false</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>system</td>
      <td>0</td>
      <td>NULL</td>
      <td>…</td>
      <td>false</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>temp</td>
      <td>1995</td>
      <td>NULL</td>
      <td>…</td>
      <td>false</td>
      <td>NULL</td>
    </tr>
  </tbody>
</table>

<!-- markdownlint-disable MD036 -->
<p><em>5 rows —  10 columns (5 shown)</em>
<!-- markdownlint-enable MD036 --></p>

<h2 id="implementation-and-performance">Implementation and Performance</h2>

<p>Here at DuckDB, we strive for a good out-of-the-box experience with zero external dependencies and a small footprint. Encryption and decryption, however, are usually performed by fairly heavy external libraries such as OpenSSL. We would much prefer not to rely on external libraries or statically link huge codebases just so that people can use encryption in DuckDB without additional steps. This is why we actually implemented encryption <em>twice</em> in DuckDB: once with the (excellent) <a href="https://github.com/Mbed-TLS/mbedtls">Mbed TLS</a> library and once with the ubiquitous OpenSSL library.</p>

<p>DuckDB already shipped parts of Mbed TLS because we use it to verify RSA extension signatures. However, for maximum compatibility we disable Mbed TLS's hardware acceleration, which has a performance impact. Furthermore, Mbed TLS is not particularly hardened against attacks such as timing attacks. OpenSSL, on the other hand, contains heavily vetted and hardware-accelerated code for performing AES operations, which is why we can also use it for encryption.</p>

<p>In DuckDB Land, OpenSSL is part of the <code class="language-plaintext highlighter-rouge">httpfs</code> extension. Once you load that extension, encryption <em>automatically</em> switches to using OpenSSL. After we shipped encryption in DuckDB 1.4.0, security experts found <a href="https://github.com/duckdb/duckdb/security/advisories/GHSA-vmp8-hg63-v2hp">issues with the random number generator</a> we used in Mbed TLS mode. Even though these would be difficult to actually exploit, we <em>disabled writing</em> to encrypted databases in Mbed TLS mode as of DuckDB 1.4.1. Instead, DuckDB now (version 1.4.2+) tries to auto-install and auto-load the <code class="language-plaintext highlighter-rouge">httpfs</code> extension whenever a write is attempted. We might be able to revisit this in the future, but for now this seems the safest path forward that still allows high compatibility for reading. OpenSSL mode has always used a cryptographically secure random number generator, so that mode is unaffected.</p>

<p>Encrypting and decrypting database files is an additional step in writing tables to disk, so we would naturally assume that there is some performance impact. Let’s investigate the performance impact of DuckDB’s new encryption feature with a very basic experiment.</p>

<p>We first create two DuckDB database files, one encrypted and one unencrypted. We use the TPC-H benchmark generator again to create the table data, particularly the (somewhat tired) <code class="language-plaintext highlighter-rouge">lineitem</code> table.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span><span class="n"> httpfs</span><span class="p">;</span>
<span class="k">INSTALL</span><span class="n"> tpch</span><span class="p">;</span>
<span class="k">LOAD</span><span class="n"> tpch</span><span class="p">;</span>

<span class="k">ATTACH</span> <span class="s1">'unencrypted.duckdb'</span> <span class="k">AS</span> <span class="n">unencrypted</span><span class="p">;</span>
<span class="k">CALL</span> <span class="nf">dbgen</span><span class="p">(</span><span class="k">sf</span> <span class="o">=</span> <span class="mi">10</span><span class="p">,</span> <span class="k">catalog</span> <span class="o">=</span> <span class="s1">'unencrypted'</span><span class="p">);</span>

<span class="k">ATTACH</span> <span class="s1">'encrypted.duckdb'</span> <span class="k">AS</span> <span class="k">encrypted</span> <span class="p">(</span><span class="k">ENCRYPTION_KEY</span> <span class="s1">'asdf'</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="k">encrypted</span><span class="n">.lineitem</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="n">unencrypted.lineitem</span><span class="p">;</span>
</code></pre></div></div>

<p>Now we use DuckDB’s neat <code class="language-plaintext highlighter-rouge">SUMMARIZE</code> command three times: once on the unencrypted database, once on the encrypted database using Mbed TLS, and once on the encrypted database using OpenSSL. We set a very low memory limit to force more reading from and writing to disk.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SET</span> <span class="py">memory_limit</span> <span class="o">=</span> <span class="s1">'200MB'</span><span class="p">;</span>
<span class="py">.timer</span> <span class="n">on</span>

<span class="k">SUMMARIZE</span> <span class="n">unencrypted.lineitem</span><span class="p">;</span>
<span class="k">SUMMARIZE</span> <span class="k">encrypted</span><span class="n">.lineitem</span><span class="p">;</span>

<span class="k">LOAD</span><span class="n"> httpfs</span><span class="p">;</span> <span class="c1">-- use OpenSSL</span>
<span class="k">SUMMARIZE</span> <span class="k">encrypted</span><span class="n">.lineitem</span><span class="p">;</span>
</code></pre></div></div>

<p>Here are the results on a fairly recent MacBook: <code class="language-plaintext highlighter-rouge">SUMMARIZE</code> on the unencrypted table took about 5.4 seconds. Using Mbed TLS, this went up to around 6.2 seconds. However, when enabling OpenSSL, the end-to-end time went straight back to 5.4 seconds. How is this possible? Is decryption not expensive? First, there is a lot more happening in query processing than reading blocks from storage, so the impact of decryption is not all that large, even with a slow implementation. Second, when using hardware acceleration in OpenSSL, the overall overhead of encryption and decryption becomes almost negligible.</p>

<p>But just running summarization is overly simplistic. Real™ database workloads include modifications to data: insertions of new rows, updates and deletions of existing rows, and so on. Also, multiple clients will be updating and querying at the same time. So we resurrected the full TPC-H “Power” test from our previous blog post “<a href="https://duckdb.org/2024/09/25/changing-data-with-confidence-and-acid">Changing Data with Confidence and ACID</a>”. We slightly tweaked the <a href="https://github.com/duckdb/duckdb-tpch-power-test/blob/main/benchmark.py">benchmark script</a> to enable the new database encryption. For this experiment, we used the OpenSSL encryption implementation due to the issues outlined above. We observe “Power@Size” and “Throughput@Size”: the former measures raw sequential query performance, while the latter measures multiple parallel query streams in the presence of updates.</p>

<p>When running on the same MacBook with DuckDB 1.4.1 and a “scale factor” of 100, we get a Power@Size metric of 624,296 and a Throughput@Size metric of 450,409 <em>without</em> encryption.</p>

<p>When we enable encryption, the results are almost unchanged, confirming the observation from the small microbenchmark above. However, at this ratio of available memory to benchmark size, we are not stressing temporary file encryption. So we re-ran everything with an 8 GB memory limit. We confirmed constant reading from and writing to disk in this configuration by observing operating system statistics. For the unencrypted case, the Power@Size metric predictably went down to 591,841 and Throughput@Size went down to 153,690. With encryption enabled, we observed a slight performance decrease, with a Power@Size of 571,985 and a Throughput@Size of 145,353. That difference is small, however, and likely not relevant in real operational scenarios.</p>

<h2 id="conclusion">Conclusion</h2>

<p>With the new encrypted database feature, we can now safely pass around DuckDB database files with all the information inside them completely opaque to prying eyes. This allows for some interesting new deployment models for DuckDB. For example, we could now put an encrypted DuckDB database file on a Content Delivery Network (CDN), and a fleet of DuckDB instances could attach to this file read-only using the decryption key. This elegantly allows efficient distribution of private background data, similarly to encrypted Parquet files, but with many more features such as multi-table storage. Using DuckDB with encrypted storage can also simplify threat modeling when, for example, running DuckDB on cloud providers. While in the past access to DuckDB storage would have been enough to leak data, we can now relax paranoia regarding storage a little, especially since temporary files and the WAL are also encrypted. Best of all, there is almost no performance overhead to using encryption in DuckDB, especially with the OpenSSL implementation.</p>
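<p>As a sketch of that CDN deployment model, a reader instance could attach the shared, encrypted file read-only. Note that the URL, key, and table name below are hypothetical:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>LOAD httpfs; -- enables remote access and the OpenSSL-based encryption

ATTACH 'https://cdn.example.com/shared.duckdb' AS shared (
    READ_ONLY,
    ENCRYPTION_KEY 'key-distributed-out-of-band'
);

FROM shared.background_data;
</code></pre></div></div>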

<p>We are very much looking forward to what you are going to do with this feature, and please <a href="https://github.com/duckdb/duckdb/issues/new">let us know</a> if you run into any issues.</p>]]></content><author><name>Lotte Felius, Hannes Mühleisen</name></author><category term="deep dive" /><summary type="html"><![CDATA[DuckDB v1.4 ships database encryption capabilities. In this blog post, we dive into the implementation details of the encryption, show how to use it and demonstrate its performance implications.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/encryption-in-duckdb.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/encryption-in-duckdb.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Announcing DuckDB 1.4.2 LTS</title><link href="https://duckdb.org/2025/11/12/announcing-duckdb-142.html" rel="alternate" type="text/html" title="Announcing DuckDB 1.4.2 LTS" /><published>2025-11-12T00:00:00+00:00</published><updated>2025-11-12T00:00:00+00:00</updated><id>https://duckdb.org/2025/11/12/announcing-duckdb-142</id><content type="html" xml:base="https://duckdb.org/2025/11/12/announcing-duckdb-142.html"><![CDATA[<p>In this blog post, we highlight a few important fixes and convenience improvements in DuckDB v1.4.2, the second patch release in <a href="/2025/09/16/announcing-duckdb-140.html">DuckDB's 1.4 LTS line</a>. To see the complete list of updates, please consult the <a href="https://github.com/duckdb/duckdb/releases/tag/v1.4.2">release notes on GitHub</a>.</p>

<p>While this is a patch release, we are shipping some small features. In LTS releases, these can come in two forms:</p>

<ol>
  <li>We add small opt-in features, such as <a href="#accessing-the-profilers-output-from-the-logger">accessing the profiler's output from the logger</a> in this release. These features have been highly requested by the community, and we are confident that they will not cause any issues for people upgrading to the latest release. In fact, using them carefully can help detect and understand performance regressions.</li>
  <li>Some of DuckDB's extensions that are marked as <a href="/docs/stable/core_extensions/overview.html">“experimental”</a> are shipping full-fledged features. For example, this is how we have rolled out support for <a href="#iceberg-improvements">Iceberg deletes and updates</a>. Extensions are opt-in by nature, so if you stick to core DuckDB and its stable extensions, changes in the experimental extensions will not affect the stability of your installation.</li>
</ol>

<blockquote>
  <p>To install the new version, please visit the <a href="/install/">installation page</a>. Note that it can take a few hours to days for some client libraries (e.g., R, Rust) to be released due to the extra changes and review rounds required.</p>
</blockquote>

<h2 id="features-and-improvements">Features and Improvements</h2>

<h3 id="iceberg-improvements">Iceberg Improvements</h3>

<p>Similarly to the <a href="/2025/10/07/announcing-duckdb-141.html#iceberg-improvements">v1.4.1 release blog post</a>, we can start with some good news for our Iceberg users: DuckDB v1.4.2 ships a number of improvements for the <a href="/docs/stable/core_extensions/iceberg/overview.html"><code class="language-plaintext highlighter-rouge">iceberg</code> extension</a>. Insert, update, and delete statements are all supported now:</p>

<!-- markdownlint-disable MD040 -->

<details>
  <summary>
Click to see the SQL code sample for Iceberg updates.
</summary>
  <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">ATTACH</span> <span class="s1">'</span><span class="ge">warehouse_name</span><span class="s1">'</span> <span class="k">AS</span> <span class="n">iceberg_catalog</span> <span class="p">(</span>
    <span class="k">TYPE</span> <span class="k">iceberg</span><span class="p">,</span>
    <span class="ge">other options</span>
<span class="p">);</span>

<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">iceberg_catalog.default.simple_table</span>
    <span class="p">(</span><span class="n">col1</span> <span class="nb">INTEGER</span><span class="p">,</span> <span class="n">col2</span> <span class="nb">VARCHAR</span><span class="p">);</span>

<span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">iceberg_catalog.default.simple_table</span>
    <span class="k">VALUES</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="s1">'hello'</span><span class="p">),</span> <span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="s1">'world'</span><span class="p">),</span> <span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="s1">'duckdb is great'</span><span class="p">);</span>

<span class="k">DELETE</span> <span class="k">FROM</span> <span class="n">iceberg_catalog.default.simple_table</span>
<span class="k">WHERE</span> <span class="n">col1</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span>

<span class="k">UPDATE</span><span class="n"> iceberg_catalog.default.simple_table</span>
<span class="k">SET</span> <span class="n">col1</span> <span class="o">=</span> <span class="n">col1</span> <span class="o">+</span> <span class="mi">5</span>
<span class="k">WHERE</span> <span class="n">col1</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
</code></pre></div>  </div>
</details>

<!-- markdownlint-enable MD040 -->

<p>We will publish a separate blog post on these improvements shortly. Stay tuned!</p>

<h3 id="logger-and-profiler-improvements">Logger and Profiler Improvements</h3>

<h4 id="time-http-requests">Time HTTP Requests</h4>

<p>The logger can now log the duration of HTTP requests (<a href="https://github.com/duckdb/duckdb/pull/19691"><code class="language-plaintext highlighter-rouge">#19691</code></a>).
For example, if we query the Dutch railway tariffs table as a Parquet file (<a href="https://blobs.duckdb.org/tariffs.parquet"><code class="language-plaintext highlighter-rouge">tariffs.parquet</code></a>),
we can see multiple HTTP requests: a <code class="language-plaintext highlighter-rouge">HEAD</code> request and three <code class="language-plaintext highlighter-rouge">GET</code> requests:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CALL</span> <span class="nf">enable_logging</span><span class="p">(</span><span class="s1">'HTTP'</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">railway_tariffs</span> <span class="k">AS</span>
    <span class="k">FROM</span> <span class="s1">'https://blobs.duckdb.org/tariffs.parquet'</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="n">request.type</span><span class="p">,</span> <span class="n">request.url</span><span class="p">,</span> <span class="n">request.duration_ms</span>
<span class="k">FROM</span> <span class="nf">duckdb_logs_parsed</span><span class="p">(</span><span class="s1">'HTTP'</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────┬──────────────────────────────────────────┬─────────────┐
│  type   │                   url                    │ duration_ms │
│ varchar │                 varchar                  │    int64    │
├─────────┼──────────────────────────────────────────┼─────────────┤
│ HEAD    │ https://blobs.duckdb.org/tariffs.parquet │         177 │
│ GET     │ https://blobs.duckdb.org/tariffs.parquet │         103 │
│ GET     │ https://blobs.duckdb.org/tariffs.parquet │         176 │
│ GET     │ https://blobs.duckdb.org/tariffs.parquet │         182 │
└─────────┴──────────────────────────────────────────┴─────────────┘
</code></pre></div></div>

<h4 id="accessing-the-profilers-output-from-the-logger">Accessing the Profiler's Output from the Logger</h4>

<p>The logger can now also access the profiler's output (<a href="https://github.com/duckdb/duckdb/pull/19572"><code class="language-plaintext highlighter-rouge">#19572</code></a>).
This means that if both the profiler and the logger are enabled, you can log information such as the execution time of queries:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">-- Enable profiling to JSON file</span>
<span class="c1">-- This is necessary to make sure that queries are profiled</span>
<span class="k">PRAGMA</span> <span class="py">profiling_output</span> <span class="o">=</span> <span class="s1">'profiling_output.json'</span><span class="p">;</span>
<span class="k">PRAGMA</span> <span class="py">enable_profiling</span> <span class="o">=</span> <span class="s1">'json'</span><span class="p">;</span>

<span class="c1">-- Enable logging to an in-memory table</span>
<span class="k">CALL</span> <span class="nf">enable_logging</span><span class="p">();</span>

<span class="c1">-- Run some queries</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">small</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="nf">range</span><span class="p">(</span><span class="mi">1_000_000</span><span class="p">);</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">large</span> <span class="k">AS</span> <span class="k">FROM</span> <span class="nf">range</span><span class="p">(</span><span class="mi">1_000_000_000</span><span class="p">);</span>

<span class="k">PRAGMA</span> <span class="py">disable_profiling</span><span class="p">;</span>

<span class="k">SELECT</span> <span class="n">query_id</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">metric</span><span class="p">,</span> <span class="n">value</span><span class="p">::</span><span class="nb">DECIMAL</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="k">AS</span> <span class="n">value</span>
<span class="k">FROM</span> <span class="nf">duckdb_logs_parsed</span><span class="p">(</span><span class="s1">'Metrics'</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">metric</span> <span class="o">=</span> <span class="s1">'CPU_TIME'</span><span class="p">;</span>
</code></pre></div></div>

<p>You can see in the output that the first <code class="language-plaintext highlighter-rouge">CREATE</code> statement took about 3 milliseconds, while the second one took 3.3 seconds.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌──────────┬─────────┬──────────┬───────────────┐
│ query_id │  type   │  metric  │     value     │
│  uint64  │ varchar │ varchar  │ decimal(15,3) │
├──────────┼─────────┼──────────┼───────────────┤
│        8 │ Metrics │ CPU_TIME │         0.003 │
│        9 │ Metrics │ CPU_TIME │         3.267 │
└──────────┴─────────┴──────────┴───────────────┘
</code></pre></div></div>

<h4 id="profiler-metrics">Profiler Metrics</h4>

<p>The profiler now supports <a href="/docs/stable/dev/profiling.html#metrics">several new metrics</a>.
These allow you to get a deeper understanding of where execution time is spent in queries.</p>
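<p>As a sketch of how to opt into individual metrics via custom profiling settings (the selection below is just an example; see the profiling documentation for the full list of metric names, including the newly added ones):</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PRAGMA enable_profiling = 'json';
PRAGMA profiling_output = 'profile.json';
-- collect only the listed metrics (example selection)
PRAGMA custom_profiling_settings = '{"CPU_TIME": "true", "OPERATOR_TIMING": "true", "OPERATOR_CARDINALITY": "true"}';

FROM range(1_000_000);
</code></pre></div></div>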

<h3 id="performance-improvements">Performance Improvements</h3>

<p>DuckDB v1.4.2 also ships some small performance improvements:</p>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/pull/19477"><code class="language-plaintext highlighter-rouge">#19477</code> DuckDB now buffers WAL index deletes, not only appends</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/19644"><code class="language-plaintext highlighter-rouge">#19644</code> Detaching from a database is now faster</a></li>
</ul>

<h3 id="vortex-support">Vortex Support</h3>

<p>DuckDB now supports the <a href="https://vortex.dev/">Vortex file format</a> through the <a href="/docs/stable/core_extensions/vortex.html"><code class="language-plaintext highlighter-rouge">vortex</code> extension</a> on Linux and macOS.</p>

<p>To use <code class="language-plaintext highlighter-rouge">vortex</code>, first install and load the extension:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSTALL</span><span class="n"> vortex</span><span class="p">;</span>
<span class="k">LOAD</span><span class="n"> vortex</span><span class="p">;</span>
</code></pre></div></div>

<p>Then, you can write Vortex files as follows:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">COPY</span> <span class="p">(</span><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">generate_series</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="n">t</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>
<span class="k">TO</span> <span class="s1">'my.vortex'</span> <span class="p">(</span><span class="k">FORMAT</span> <span class="k">vortex</span><span class="p">);</span>
</code></pre></div></div>

<p>And read them using the <code class="language-plaintext highlighter-rouge">read_vortex</code> function:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="o">*</span> <span class="k">FROM</span> <span class="nf">read_vortex</span><span class="p">(</span><span class="s1">'my.vortex'</span><span class="p">);</span>
</code></pre></div></div>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌───────┐
│   i   │
│ int64 │
├───────┤
│     0 │
│     1 │
│     2 │
└───────┘
</code></pre></div></div>

<h2 id="fixes">Fixes</h2>

<p>We fixed several vulnerabilities, crashes, internal errors, incorrect results, and regressions.
We also fixed several issues discovered by our <a href="https://github.com/duckdb/duckdb-fuzzer/">fuzzer</a>.</p>

<h3 id="vulnerabilities">Vulnerabilities</h3>

<p>We fixed <a href="https://github.com/duckdb/duckdb/security/advisories/GHSA-vmp8-hg63-v2hp">four vulnerabilities</a> in DuckDB's <a href="/docs/stable/sql/statements/attach.html#database-encryption">database encryption</a>:</p>

<ol>
  <li>DuckDB could fall back to an insecure random number generator (<code class="language-plaintext highlighter-rouge">pcg32</code>) to generate cryptographic keys or IVs.</li>
  <li>When clearing keys from memory, the compiler could optimize away the <code class="language-plaintext highlighter-rouge">memset()</code> call and leave sensitive data on the heap.</li>
  <li>By modifying the database header, an attacker could downgrade the encryption mode from GCM to CTR to bypass integrity checks.</li>
  <li>The return value of the OpenSSL <code class="language-plaintext highlighter-rouge">rand_bytes()</code> call was not checked.</li>
</ol>

<p>See the <a href="https://github.com/duckdb/duckdb/security/advisories/GHSA-vmp8-hg63-v2hp">security advisory</a> for more details.</p>

<p>If you are using database encryption, you are strongly encouraged to update to DuckDB v1.4.2.</p>

<blockquote>
  <p>We would like to thank <a href="https://github.com/SalusaSecondus">Greg Rubin</a> and <a href="https://github.com/skosowski">Sławek Kosowski</a> for reporting these vulnerabilities!</p>
</blockquote>

<h3 id="crashes-and-internal-errors">Crashes and Internal Errors</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/19238"><code class="language-plaintext highlighter-rouge">#19238</code> <code class="language-plaintext highlighter-rouge">MERGE INTO</code> Iceberg table with <code class="language-plaintext highlighter-rouge">TIMESTAMPTZ</code> columns crashes</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19355"><code class="language-plaintext highlighter-rouge">#19355</code> Unknown expression type invalidates database</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19357"><code class="language-plaintext highlighter-rouge">#19357</code> Expected unified vector format of type <code class="language-plaintext highlighter-rouge">VARCHAR</code>, but found type <code class="language-plaintext highlighter-rouge">INT32</code></a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19455"><code class="language-plaintext highlighter-rouge">#19455</code> <code class="language-plaintext highlighter-rouge">MERGE INTO</code> failed: logical operator type mismatch</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19498"><code class="language-plaintext highlighter-rouge">#19498</code> Window function crash with <code class="language-plaintext highlighter-rouge">pdqsort_loop</code></a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19700"><code class="language-plaintext highlighter-rouge">#19700</code> RLE select bug</a></li>
</ul>

<h3 id="incorrect-results">Incorrect Results</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/17757"><code class="language-plaintext highlighter-rouge">#17757</code> UUID Comparison in aggregation filter broken on Linux</a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19327"><code class="language-plaintext highlighter-rouge">#19327</code> Wrong result for <code class="language-plaintext highlighter-rouge">DISTINCT</code> and <code class="language-plaintext highlighter-rouge">LEFT JOIN</code></a></li>
  <li><a href="https://github.com/duckdb/duckdb/issues/19377"><code class="language-plaintext highlighter-rouge">#19377</code> Array with values shows null depending on query</a></li>
</ul>

<h3 id="regressions">Regressions</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/issues/19333"><code class="language-plaintext highlighter-rouge">#19333</code> DuckDB hangs when using <code class="language-plaintext highlighter-rouge">ATTACH IF NOT EXISTS</code> on subsequent connections to databases that have previously attached a database file</a></li>
</ul>

<h3 id="storage">Storage</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb/pull/19424"><code class="language-plaintext highlighter-rouge">#19424</code> Fix issue in MetadataManager triggered when doing concurrent reads while checkpointing</a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/19527"><code class="language-plaintext highlighter-rouge">#19527</code> Ensure that DuckDB outputs the expected <code class="language-plaintext highlighter-rouge">STORAGE_VERSION</code></a></li>
  <li><a href="https://github.com/duckdb/duckdb/pull/19543"><code class="language-plaintext highlighter-rouge">#19543</code> Error when setting <code class="language-plaintext highlighter-rouge">force_compression = 'zstd'</code> in an in-memory database</a></li>
</ul>

<h3 id="issues-discovered-by-the-fuzzer">Issues Discovered by the Fuzzer</h3>

<ul>
  <li><a href="https://github.com/duckdb/duckdb-fuzzer/issues/3389"><code class="language-plaintext highlighter-rouge">duckdb-fuzzer#3389</code></a></li>
  <li><a href="https://github.com/duckdb/duckdb-fuzzer/issues/4208"><code class="language-plaintext highlighter-rouge">duckdb-fuzzer#4208</code></a></li>
  <li><a href="https://github.com/duckdb/duckdb-fuzzer/issues/4296"><code class="language-plaintext highlighter-rouge">duckdb-fuzzer#4296</code></a></li>
</ul>]]></content><author><name>The DuckDB team</name></author><category term="release" /><summary type="html"><![CDATA[Today we are releasing DuckDB 1.4.2, the second patch release of our LTS edition. The new release ships several bugfixes and performance optimizations. We also fixed vulnerabilities in DuckDB's database encryption, and introduced some (opt-in) logger/profiler features that help users understand performance, and full write support through the Iceberg extension.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-4-2-lts.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/duckdb-release-1-4-2-lts.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry><title type="html">Relational Charades: Turning Movies into Tables</title><link href="https://duckdb.org/2025/10/27/movies-in-databases.html" rel="alternate" type="text/html" title="Relational Charades: Turning Movies into Tables" /><published>2025-10-27T00:00:00+00:00</published><updated>2025-10-27T00:00:00+00:00</updated><id>https://duckdb.org/2025/10/27/movies-in-databases</id><content type="html" xml:base="https://duckdb.org/2025/10/27/movies-in-databases.html"><![CDATA[<p style="text-align: right"><i>“Your scientists were so preoccupied with whether they could,<br /> they didn't stop to think if they should.”</i><br />
   Dr. Ian Malcolm, Jurassic Park (1993)</p>

<p>Here at team DuckDB, we <em>love</em> tables. Tables are a timeless, elegant abstraction that precedes literature by <a href="https://www.youtube.com/watch?v=-wCzn9gKoUk">about a thousand years</a>. Relational tables specifically can represent <em>any</em> kind of information imaginable. But just because something <em>can</em> be done does not mean it is a great idea to do so. Can we build a rocket propelled by a nuclear chain reaction that irradiates the land it flies over? <a href="https://en.wikipedia.org/wiki/Project_Pluto">Yes</a>. Should we? Probably not.</p>

<h2 id="disclaimer">Disclaimer</h2>

<p>Array-like data such as images and videos is a <em>textbook example</em> of something that <a href="https://stackoverflow.com/questions/3748/storing-images-in-db-yea-or-nay">might not benefit</a> from being stored in a database. While of course any binary data can be added to tables as <code class="language-plaintext highlighter-rouge">BLOB</code>s, there is not that much added value in it. Sure, it's harder to lose the image compared to the industry-standard solution of storing a file name that points to the image. But there are not that many meaningful operations that can be performed on BLOBs other than store and load. Without adding some <a href="https://pluralistic.net/2025/09/27/econopocalypse/">overhyped</a> AI tech, you can't even ask the database <a href="https://xkcd.com/1425/">what the picture shows</a>.</p>

<p>Array data also has its own world of highly specialized file formats and compression algorithms. Just think of the ubiquitous <a href="https://en.wikipedia.org/wiki/MPEG-4">MPEG-4 standard</a> for storing movies. These are approximate (lossy, not exact) formats designed around human perception models, which is why they can avoid storing things people do not notice. They achieve impressive compression rates, with a two-hour "Full" HD movie compressing to about 2 GB using MPEG-4.</p>
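<p>For scale, here is a back-of-the-envelope calculation of the uncompressed size of such a movie, assuming 24-bit RGB pixels at 25 frames per second and ignoring audio. You can run it in DuckDB itself:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT
    1920 * 1080   AS pixels_per_frame,
    25 * 60 * 120 AS frames, -- 25 fps for two hours
    1920 * 1080 * 3 * 25 * 60 * 120 / 1e9 AS raw_gigabytes; -- 3 bytes per pixel
</code></pre></div></div>

<p>That comes out to roughly 1,120 GB of raw pixel data, so compressing it down to about 2 GB is a reduction of more than 500×.</p>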

<h2 id="ignoring-the-disclaimer">Ignoring the Disclaimer</h2>

<p>But what would it feel like to turn a movie into a table? (Very) deep down, a movie is just a series of fast-moving pictures ("frames"), typically shown at around 25 frames per second. At that speed, our monkey-brain cannot distinguish the separate images anymore and is fooled into thinking that we are watching smooth movement. Side note for the younger generations: a strip of pictures was the way we <a href="https://en.wikipedia.org/wiki/Film_stock#/media/File:ButterflyDancebis.jpg">shipped around movies</a> for more than 100 years.</p>

<p>So a series of pictures it is. Each picture can be further deconstructed into a two-dimensional array (a "matrix") of points, so-called "pixels". Every pixel in turn consists of three numbers, one each for the intensity of red, green and blue, or <a href="https://en.wikipedia.org/wiki/RGB_color_model">RGB</a> for short. Note that we're ignoring the audio tracks in this post, but in principle they would work the exact same way, just with a different kind of intensity.</p>
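<p>In NumPy terms (the representation video libraries typically hand back), a frame is then a height × width × 3 array of bytes. A toy sketch with a made-up 2×3 "frame" shows how flattening interleaves the channels, which is exactly what the conversion code later exploits:</p>

```python
import numpy as np

# A tiny 2x3 "frame": height x width x 3 (R, G, B), one byte per channel.
# The pixel values are arbitrary, chosen only for illustration.
frame = np.array(
    [[[255, 0, 0], [0, 255, 0], [0, 0, 255]],
     [[10, 20, 30], [40, 50, 60], [70, 80, 90]]],
    dtype=np.uint8,
)

v = frame.flatten()   # row-major: R, G, B, R, G, B, ...
r = v[0::3]           # every third byte, starting at offset 0, is a red value
print(r.tolist())     # [255, 0, 0, 10, 40, 70]
```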

<p>As an added complexity, the relational model (famously) <a href="https://www.reddit.com/r/Database/comments/1l7tbrc/why_is_inherent_order_in_rdbms_so_neglected/">does not require</a> an absolute order of records. So all the various offsets have to be made explicit so as not to lose information. This of course greatly increases the size of our data set. We end up with a table that looks like this:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">i</th>
      <th style="text-align: right">y</th>
      <th style="text-align: right">x</th>
      <th style="text-align: right">r</th>
      <th style="text-align: right">g</th>
      <th style="text-align: right">b</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">1</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">1</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">2</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">6</td>
      <td style="text-align: right">2</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">3</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">4</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">4</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">5</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">5</td>
      <td style="text-align: right">11</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">8</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">6</td>
      <td style="text-align: right">11</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">8</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">7</td>
      <td style="text-align: right">11</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">8</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">5</td>
    </tr>
    <tr>
      <td style="text-align: right">0</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">9</td>
      <td style="text-align: right">10</td>
      <td style="text-align: right">5</td>
    </tr>
  </tbody>
</table>

<p>We have the time offset or frame number <code class="language-plaintext highlighter-rouge">i</code>, we have <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code> for the pixel position in the frame, and <code class="language-plaintext highlighter-rouge">r</code>, <code class="language-plaintext highlighter-rouge">g</code> and <code class="language-plaintext highlighter-rouge">b</code> for the color components red, green and blue. Quite involved.</p>

<p>But now the movie is just a single table. If only we had a conventional, guaranteed total order of rows, we could in theory skip all columns except <code class="language-plaintext highlighter-rouge">r</code>, <code class="language-plaintext highlighter-rouge">g</code> and <code class="language-plaintext highlighter-rouge">b</code>, because with a known resolution all other columns can be inferred. This is, incidentally, also how actual movie data files are stored, ignoring compression. That is another reason why relational tables are maybe not the best place for a movie to live, but if <a href="https://en.wikipedia.org/wiki/Law_of_the_instrument">all you have is a hammer</a>… We could also have used some more modern SQL features such as nested fields (a <code class="language-plaintext highlighter-rouge">LIST</code> in DuckDB), but let's keep it to a table that even <a href="https://en.wikipedia.org/wiki/IBM_System_R">System R</a> could have dealt with. In addition, having explicit offsets means we do not need nebulous conventions or additional metadata to know in which axis order the array data was serialized.</p>
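<p>To make the "all other columns can be inferred" argument concrete, here is a small sketch (plain Python, with a hypothetical helper name) that recovers the explicit columns from nothing but a row's position and the known resolution:</p>

```python
# With a guaranteed total row order and a known resolution, (i, y, x) are
# pure functions of the flat row position n -- only r, g, b carry information.
DIM_X, DIM_Y = 720, 392  # resolution of the movie used in this post

def offsets(n, dim_x=DIM_X, dim_y=DIM_Y):
    """Recover frame number and pixel position from a flat row index."""
    i, rest = divmod(n, dim_x * dim_y)  # which frame
    y, x = divmod(rest, dim_x)          # which row and column within it
    return i, y, x

print(offsets(0))       # (0, 0, 0): first pixel of the first frame
print(offsets(282240))  # (1, 0, 0): 720 * 392 rows later, frame two starts
```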

<h2 id="experiments">Experiments</h2>

<p>To investigate this daft idea further (for <a href="https://www.ru.nl/personen/muhleisen-h">Science</a>!), we convert the 1963 classic "<a href="https://en.wikipedia.org/wiki/Charade_(1963_film)">Charade</a>", a "romantic screwball comedy mystery film" starring Audrey Hepburn and Cary Grant, to a DuckDB table. This movie was picked because it is <em>accidentally in the Public Domain</em> due to a screw-up in the wording of the copyright notice (no, really). Because of this, you can actually <a href="https://archive.org/details/Charade19631280x696">freely download this movie</a> from the Internet Archive.</p>

<p><img src="/images/blog/movies/charade-poster.jpg" width="400" /></p>

<p>Since we're just creating a table, we will use DuckDB's native storage format. Here is the <em>complete</em> code snippet we used to convert the movie. In fact, this code should be generic enough to convert anything that <code class="language-plaintext highlighter-rouge">ffmpeg</code> can read into a table, in case you want to try this at home on your own movies.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">imageio</span>
<span class="kn">import</span> <span class="nn">duckdb</span>

<span class="c1"># setup movie reading
</span><span class="n">vid</span> <span class="o">=</span> <span class="n">imageio</span><span class="p">.</span><span class="n">get_reader</span><span class="p">(</span><span class="s">"Charade-1963.mp4"</span><span class="p">,</span> <span class="s">"ffmpeg"</span><span class="p">)</span>
<span class="n">dim_x</span> <span class="o">=</span> <span class="n">vid</span><span class="p">.</span><span class="n">get_meta_data</span><span class="p">()[</span><span class="s">'size'</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
<span class="n">dim_y</span> <span class="o">=</span> <span class="n">vid</span><span class="p">.</span><span class="n">get_meta_data</span><span class="p">()[</span><span class="s">'size'</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span>
<span class="n">rows_per_frame</span> <span class="o">=</span> <span class="n">dim_y</span> <span class="o">*</span> <span class="n">dim_x</span>

<span class="c1"># setup a DuckDB database and table
</span><span class="n">con</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="n">connect</span><span class="p">()</span>
<span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"ATTACH 'charade.duckdb' AS m (STORAGE_VERSION 'latest'); USE m;"</span><span class="p">)</span>
<span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"CREATE TABLE movie (i BIGINT, y USMALLINT, x USMALLINT, r UTINYINT, g UTINYINT, b UTINYINT)"</span><span class="p">)</span>

<span class="c1"># those offsets don't change between frames, so pre-compute them
</span><span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"CREATE TEMPORARY TABLE y AS SELECT unnest(list_sort(repeat(range(?), ?))) y"</span><span class="p">,</span> <span class="p">[</span><span class="n">dim_y</span><span class="p">,</span> <span class="n">dim_x</span><span class="p">])</span>
<span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"CREATE TEMPORARY TABLE x AS SELECT unnest(repeat(range(?), ?)) x"</span><span class="p">,</span> <span class="p">[</span><span class="n">dim_x</span><span class="p">,</span> <span class="n">dim_y</span><span class="p">])</span>

<span class="c1"># loop over each frame in the movie and insert the pixel data
</span><span class="k">for</span> <span class="n">i_idx</span><span class="p">,</span> <span class="n">im</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">vid</span><span class="p">):</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">im</span><span class="p">.</span><span class="n">flatten</span><span class="p">()</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span>
    <span class="n">g</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span>
    <span class="n">b</span> <span class="o">=</span> <span class="n">v</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span>

    <span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">'''INSERT INTO movie 
        FROM repeat(?, ?) i -- frame offset 
        POSITIONAL JOIN   y -- temp table
        POSITIONAL JOIN   x -- temp table
        POSITIONAL JOIN   r -- numpy scan
        POSITIONAL JOIN   g -- numpy scan
        POSITIONAL JOIN   b -- numpy scan
        '''</span><span class="p">,</span> <span class="p">[</span><span class="n">i_idx</span><span class="p">,</span> <span class="n">rows_per_frame</span><span class="p">])</span>
</code></pre></div></div>

<p>This script makes use of not just one, but (at least) <em>two</em> cool DuckDB features. First, we use so-called <a href="/docs/stable/clients/c/replacement_scans.html">replacement scans</a> to directly query the NumPy arrays <code class="language-plaintext highlighter-rouge">r</code>, <code class="language-plaintext highlighter-rouge">g</code>, and <code class="language-plaintext highlighter-rouge">b</code>. Note that those have not been created as tables in DuckDB nor registered in any way, but they are referenced by name in the <code class="language-plaintext highlighter-rouge">INSERT</code>. What happens here is that DuckDB inspects the Python context for the missing "tables" and finds objects with those names that it can read. The other cool feature is the <a href="/docs/stable/sql/query_syntax/from.html#positional-joins"><code class="language-plaintext highlighter-rouge">POSITIONAL JOIN</code></a>, which lets us stack multiple tables horizontally by position without running an actual (expensive) <code class="language-plaintext highlighter-rouge">JOIN</code>. This way, we assemble all the columns we need for a single frame in a bulk <code class="language-plaintext highlighter-rouge">INSERT</code>, which executes quite efficiently.</p>

<p>The movie file we have has a frame rate of 25 frames per second at a (DVD-ish) resolution of 720×392 pixels. The total runtime is 01:53:02.56, which comes down to 169 563 individual frames. Because we have a row for each pixel, we end up with 169 563 × 720 × 392 rows, or 47 857 461 120. 47 billion rows! Finally, <a href="https://motherduck.com/blog/big-data-is-dead/">Big Data</a>! When stored as a DuckDB database, however, the file size is "only" around 200 GB. Totally doable on a laptop!</p>

<p>DuckDB's <a href="/2022/10/28/lightweight-compression.html">lightweight compression</a> performs quite well here, given that a naive binary format would have to store at least 15 bytes per row. If we multiply that by the row count (47 billion, remember), we end up at around 700 GB of storage for this hypothetical naive format.</p>
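<p>Both the row count and the naive-format estimate are simple arithmetic, with the 15 bytes coming straight from the schema (8 for the <code class="language-plaintext highlighter-rouge">BIGINT</code> frame number, 2 + 2 for the pixel coordinates, 1 + 1 + 1 for the channels):</p>

```python
# 169,563 frames of 720 x 392 pixels, one row per pixel
rows = 169_563 * 720 * 392
print(f"{rows:,} rows")  # 47,857,461,120 rows

# Naive fixed-width encoding: BIGINT(8) + 2x USMALLINT(2) + 3x UTINYINT(1)
bytes_per_row = 8 + 2 + 2 + 1 + 1 + 1
naive_gb = rows * bytes_per_row / 1e9
print(f"~{naive_gb:.0f} GB uncompressed")  # ~718 GB, vs. ~200 GB in DuckDB
```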

<p>Of course, by turning the data into a relational table we materialize a bunch of previously implicit information, due to the lack of ordering in relations. If we just stored the raw pixel bytes, for example as an implicitly ordered series of BMP (bitmap) files, we would end up with the same number of bytes as there are rows, times three, or 133 GB. Even <em>including materializing all the offsets</em>, the DuckDB file still manages to end up at a comparable size (200 GB). And of course, comparing the size of the table with the MPEG-4 version of the movie is not entirely fair, because MPEG-4 is a <em>lossy</em> compression format. Databases can't just randomly decide to compromise on the numerical accuracy of the tables they store!</p>

<p>To prove that the transformation is accurate, let's try to turn the table data for one random frame back into a human-consumable picture: we will retrieve the corresponding rows from DuckDB, and use some Python magic to turn them back into a PNG image file:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">duckdb</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">PIL.Image</span>

<span class="n">frame</span> <span class="o">=</span> <span class="mi">48000</span>

<span class="n">con</span> <span class="o">=</span> <span class="n">duckdb</span><span class="p">.</span><span class="n">connect</span><span class="p">(</span><span class="s">'charade.duckdb'</span><span class="p">,</span> <span class="n">read_only</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">dim_y</span><span class="p">,</span> <span class="n">dim_x</span> <span class="o">=</span> <span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT max(y) + 1 dim_y, max(x) + 1 dim_x FROM movie WHERE i=0"</span><span class="p">).</span><span class="n">fetchone</span><span class="p">()</span>

<span class="n">res</span> <span class="o">=</span> <span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT r, g, b FROM movie WHERE i = ? ORDER BY y, x"</span><span class="p">,</span> <span class="p">[</span><span class="n">frame</span><span class="p">]).</span><span class="n">fetchnumpy</span><span class="p">()</span>
<span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">dim_y</span> <span class="o">*</span> <span class="n">dim_x</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
<span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">res</span><span class="p">[</span><span class="s">'r'</span><span class="p">]</span>
<span class="n">v</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">res</span><span class="p">[</span><span class="s">'g'</span><span class="p">]</span>
<span class="n">v</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">res</span><span class="p">[</span><span class="s">'b'</span><span class="p">]</span>

<span class="n">img</span> <span class="o">=</span> <span class="n">PIL</span><span class="p">.</span><span class="n">Image</span><span class="p">.</span><span class="n">fromarray</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">dim_y</span><span class="p">,</span> <span class="n">dim_x</span><span class="p">,</span> <span class="mi">3</span><span class="p">)))</span>
<span class="n">img</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'frame.png'</span><span class="p">)</span>
</code></pre></div></div>

<p><img src="/images/blog/movies/frame.png" width="800" /></p>

<p>And voilà, a wonderful frame with Audrey and Cary appears. This trick can also be used to create a sequence of pictures and write them to an MPEG-4 file again using, for example, the <code class="language-plaintext highlighter-rouge">moviepy</code> library.</p>

<p>But now that we have a table, we can have some fun with it. Let's do some basic exploration first: we start with <code class="language-plaintext highlighter-rouge">DESCRIBE</code>, which basically tells us the schema. We knew this of course.</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">DESCRIBE</span> <span class="n">movie</span><span class="p">;</span>
</code></pre></div></div>

<div class="monospace_table"></div>

<table>
  <thead>
    <tr>
      <th>column_name</th>
      <th>column_type</th>
      <th>null</th>
      <th>key</th>
      <th>default</th>
      <th>extra</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>i</td>
      <td>BIGINT</td>
      <td>YES</td>
      <td>NULL</td>
      <td>NULL</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>y</td>
      <td>USMALLINT</td>
      <td>YES</td>
      <td>NULL</td>
      <td>NULL</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>x</td>
      <td>USMALLINT</td>
      <td>YES</td>
      <td>NULL</td>
      <td>NULL</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>r</td>
      <td>UTINYINT</td>
      <td>YES</td>
      <td>NULL</td>
      <td>NULL</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>g</td>
      <td>UTINYINT</td>
      <td>YES</td>
      <td>NULL</td>
      <td>NULL</td>
      <td>NULL</td>
    </tr>
    <tr>
      <td>b</td>
      <td>UTINYINT</td>
      <td>YES</td>
      <td>NULL</td>
      <td>NULL</td>
      <td>NULL</td>
    </tr>
  </tbody>
</table>

<p>No surprises there. How many rows are there?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="n">movie</span> <span class="k">SELECT</span> <span class="nf">count</span><span class="p">(</span><span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th style="text-align: right">count_star()</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">47857461120</td>
    </tr>
  </tbody>
</table>

<p>Ah yes, 47 billion. What are the numeric properties of the columns? DuckDB has this neat <code class="language-plaintext highlighter-rouge">SUMMARIZE</code> statement that computes single-pass summary statistics on a table (or arbitrary query).</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SUMMARIZE</span> <span class="n">movie</span><span class="p">;</span>
</code></pre></div></div>

<p>This one is admittedly a bit of a flex: DuckDB can compute elaborate summary statistics on all 47 billion rows in ca. 20 minutes on a MacBook. Here are the results:</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">column_name</th>
      <th>column_type</th>
      <th style="text-align: right">min</th>
      <th style="text-align: right">max</th>
      <th style="text-align: right">approx_unique</th>
      <th style="text-align: right">avg</th>
      <th style="text-align: right">std</th>
      <th style="text-align: right">q25</th>
      <th style="text-align: right">q50</th>
      <th style="text-align: right">q75</th>
      <th style="text-align: right">count</th>
      <th style="text-align: right">null_percentage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">i</td>
      <td>BIGINT</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">169562</td>
      <td style="text-align: right">150076</td>
      <td style="text-align: right">84781.0</td>
      <td style="text-align: right">48948.621846957954</td>
      <td style="text-align: right">42429</td>
      <td style="text-align: right">84751</td>
      <td style="text-align: right">127137</td>
      <td style="text-align: right">47857461120</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td style="text-align: right">y</td>
      <td>USMALLINT</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">391</td>
      <td style="text-align: right">430</td>
      <td style="text-align: right">195.5</td>
      <td style="text-align: right">113.16028455346597</td>
      <td style="text-align: right">98</td>
      <td style="text-align: right">196</td>
      <td style="text-align: right">294</td>
      <td style="text-align: right">47857461120</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td style="text-align: right">x</td>
      <td>USMALLINT</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">719</td>
      <td style="text-align: right">840</td>
      <td style="text-align: right">359.5</td>
      <td style="text-align: right">207.84589644146592</td>
      <td style="text-align: right">180</td>
      <td style="text-align: right">359</td>
      <td style="text-align: right">540</td>
      <td style="text-align: right">47857461120</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td style="text-align: right">r</td>
      <td>UTINYINT</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">255</td>
      <td style="text-align: right">252</td>
      <td style="text-align: right">65.32575855816732</td>
      <td style="text-align: right">44.85627602555231</td>
      <td style="text-align: right">27</td>
      <td style="text-align: right">54</td>
      <td style="text-align: right">96</td>
      <td style="text-align: right">47857461120</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td style="text-align: right">g</td>
      <td>UTINYINT</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">249</td>
      <td style="text-align: right">249</td>
      <td style="text-align: right">56.79713844669577</td>
      <td style="text-align: right">37.03562456032193</td>
      <td style="text-align: right">28</td>
      <td style="text-align: right">44</td>
      <td style="text-align: right">77</td>
      <td style="text-align: right">47857461120</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td style="text-align: right">b</td>
      <td>UTINYINT</td>
      <td style="text-align: right">0</td>
      <td style="text-align: right">255</td>
      <td style="text-align: right">252</td>
      <td style="text-align: right">43.249715985643995</td>
      <td style="text-align: right">38.39218963268899</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">28</td>
      <td style="text-align: right">61</td>
      <td style="text-align: right">47857461120</td>
      <td style="text-align: right">0.00</td>
    </tr>
  </tbody>
</table>

<p>Since we're basically storing a lot of colors, just how many different combinations of red, green and blue are there, DuckDB?</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="p">(</span><span class="k">FROM</span> <span class="n">movie</span> <span class="k">SELECT</span> <span class="k">DISTINCT</span> <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="k">SELECT</span> <span class="nf">count</span><span class="p">(</span><span class="o">*</span><span class="p">);</span>
</code></pre></div></div>

<p>Any seasoned data engineer would rightfully caution you against running a <code class="language-plaintext highlighter-rouge">DISTINCT</code> on this many rows. Too many production outages have been caused by overflowing aggregations. But thanks to DuckDB's <a href="/2024/03/29/external-aggregation.html">larger-than-memory aggregate hash table</a>, we can confidently issue this query. We even get a nice progress bar and (since v1.4.0) a surprisingly accurate estimate of how long the query will take.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: right">count_star()</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">826568</td>
    </tr>
  </tbody>
</table>

<p>So roughly 800 thousand distinct colors. Computing this took about 2 minutes in the end. But what are the frequencies of those colors? Let's compute a histogram of the 10 most-used colors!</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">FROM</span> <span class="n">movie</span>
<span class="k">SELECT</span> <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="nf">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">AS</span> <span class="n">ct</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="k">ALL</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">ct</span> <span class="k">DESC</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</code></pre></div></div>

<table>
  <thead>
    <tr>
      <th style="text-align: right">r</th>
      <th style="text-align: right">g</th>
      <th style="text-align: right">b</th>
      <th style="text-align: right">ct</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: right">17</td>
      <td style="text-align: right">20</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">106521429</td>
    </tr>
    <tr>
      <td style="text-align: right">23</td>
      <td style="text-align: right">25</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">93004303</td>
    </tr>
    <tr>
      <td style="text-align: right">23</td>
      <td style="text-align: right">25</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">85552738</td>
    </tr>
    <tr>
      <td style="text-align: right">13</td>
      <td style="text-align: right">22</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">81734796</td>
    </tr>
    <tr>
      <td style="text-align: right">22</td>
      <td style="text-align: right">24</td>
      <td style="text-align: right">13</td>
      <td style="text-align: right">76560295</td>
    </tr>
    <tr>
      <td style="text-align: right">24</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">15</td>
      <td style="text-align: right">75376896</td>
    </tr>
    <tr>
      <td style="text-align: right">15</td>
      <td style="text-align: right">19</td>
      <td style="text-align: right">8</td>
      <td style="text-align: right">74285763</td>
    </tr>
    <tr>
      <td style="text-align: right">23</td>
      <td style="text-align: right">24</td>
      <td style="text-align: right">19</td>
      <td style="text-align: right">72904497</td>
    </tr>
    <tr>
      <td style="text-align: right">22</td>
      <td style="text-align: right">24</td>
      <td style="text-align: right">12</td>
      <td style="text-align: right">69269099</td>
    </tr>
    <tr>
      <td style="text-align: right">24</td>
      <td style="text-align: right">26</td>
      <td style="text-align: right">16</td>
      <td style="text-align: right">62230136</td>
    </tr>
  </tbody>
</table>

<p>The most common colors here seem to be dark shades of grey. Makes sense! Keep in mind that MPEG-4 compression is lossy and will probably have produced some odd colors as rounding artifacts.</p>

<p>But we can also have some more fun. We have an analytical database system. How about we compute the average frame for every thousand frames and stitch the results back into a movie? It's just a big aggregation. We first create the actual averages:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">averages</span> <span class="k">AS</span>
    <span class="k">FROM</span> <span class="n">movie</span>
    <span class="k">SELECT</span>
        <span class="n">i</span> <span class="o">//</span> <span class="mi">1000</span> <span class="k">AS</span> <span class="n">idx</span><span class="p">,</span>
        <span class="n">y</span><span class="p">,</span>
        <span class="n">x</span><span class="p">,</span>
        <span class="nf">avg</span><span class="p">(</span><span class="n">r</span><span class="p">)::</span><span class="nb">UTINYINT</span> <span class="k">AS</span> <span class="n">r</span><span class="p">,</span>
        <span class="nf">avg</span><span class="p">(</span><span class="n">g</span><span class="p">)::</span><span class="nb">UTINYINT</span> <span class="k">AS</span> <span class="n">g</span><span class="p">,</span>
        <span class="nf">avg</span><span class="p">(</span><span class="n">b</span><span class="p">)::</span><span class="nb">UTINYINT</span> <span class="k">AS</span> <span class="n">b</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="k">ALL</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">idx</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">;</span>
</code></pre></div></div>

<p>Then, we use Python again to turn this <code class="language-plaintext highlighter-rouge">averages</code> table into a movie:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># some setup omitted
</span>
<span class="c1"># fetch a bunch of frames in bulk
</span><span class="n">res</span> <span class="o">=</span> <span class="n">con</span><span class="p">.</span><span class="n">execute</span><span class="p">(</span><span class="s">"SELECT r, g, b FROM averages ORDER BY idx, y, x"</span><span class="p">).</span><span class="n">fetchnumpy</span><span class="p">()</span>

<span class="c1"># split the rgb arrays by frame again
</span><span class="n">r_splits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">res</span><span class="p">[</span><span class="s">'r'</span><span class="p">],</span> <span class="n">num_frames</span><span class="p">)</span>
<span class="n">g_splits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">res</span><span class="p">[</span><span class="s">'g'</span><span class="p">],</span> <span class="n">num_frames</span><span class="p">)</span>
<span class="n">b_splits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="n">res</span><span class="p">[</span><span class="s">'b'</span><span class="p">],</span> <span class="n">num_frames</span><span class="p">)</span>

<span class="c1"># generate pictures
</span><span class="n">image_files</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">num_frames</span><span class="p">):</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">dim_y</span> <span class="o">*</span> <span class="n">dim_x</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="n">uint8</span><span class="p">)</span>
    <span class="n">v</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">r_splits</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="n">v</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">g_splits</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="n">v</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">):</span><span class="mi">3</span><span class="p">]</span> <span class="o">=</span> <span class="n">b_splits</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="n">image_files</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">dim_y</span><span class="p">,</span> <span class="n">dim_x</span><span class="p">,</span> <span class="mi">3</span><span class="p">),</span> <span class="n">order</span><span class="o">=</span><span class="s">'C'</span><span class="p">))</span>

<span class="c1"># write movie file
</span><span class="n">clip</span> <span class="o">=</span> <span class="n">moviepy</span><span class="p">.</span><span class="n">video</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">ImageSequenceClip</span><span class="p">.</span><span class="n">ImageSequenceClip</span><span class="p">(</span><span class="n">image_files</span><span class="p">,</span> <span class="n">fps</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span>
<span class="n">clip</span><span class="p">.</span><span class="n">write_videofile</span><span class="p">(</span><span class="s">'averages.mp4'</span><span class="p">)</span>
</code></pre></div></div>

<p>There is some wrangling here because we want to retrieve the whole frame dataset in bulk rather than run a query for every single frame. We then use NumPy to split the arrays into frames and stitch the RGB channels together into the three-dimensional array that image libraries expect. This does not achieve any business purpose, but the results are kind of funny. Here is average frame #68, with apologies to the actors:</p>
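<p>The channel interleave can be sketched on a toy frame. Here is a minimal, self-contained example (the array values are made up) showing that the strided assignment used in the loop is equivalent to stacking the channels along the last axis:</p>

```python
import numpy as np

# Toy 2x2 "frame": flat per-channel arrays in row-major (y, x) order,
# as fetchnumpy() would hand them back (the values here are made up).
dim_y, dim_x = 2, 2
r = np.array([255, 0, 0, 10], dtype=np.uint8)
g = np.array([0, 255, 0, 20], dtype=np.uint8)
b = np.array([0, 0, 255, 30], dtype=np.uint8)

# Strided assignment, as in the loop above: r, g, b interleaved per pixel.
v = np.zeros(dim_y * dim_x * 3, dtype=np.uint8)
v[0::3], v[1::3], v[2::3] = r, g, b
frame = v.reshape((dim_y, dim_x, 3), order="C")

# np.stack along the last axis produces the same array in one step.
stacked = np.stack([r, g, b], axis=-1).reshape((dim_y, dim_x, 3))
assert (frame == stacked).all()

print(frame[0, 0])  # top-left pixel: [255 0 0], i.e. pure red
```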

<p><img src="/images/blog/movies/average_frame_68.png" width="800" /></p>

<p>We can also stitch all the averages together to make a somewhat twitchy average movie:</p>

<details>
  <summary>
Click here to see the twitchy movie generated from “Charade”:
</summary>
  <p><img src="https://blobs.duckdb.org/data/movie-averages.gif" width="800" /></p>
</details>

<p>For some added fun, we could even write a SQL query that turns a frame into an HTML table with one-pixel cells. Below is the result; let's hope your browser can render it, and let's thank Cloudflare again for <a href="/foundation/#technical-sponsors">sponsoring our traffic</a>. Here is the somewhat unholy query to generate it:</p>

<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">SELECT</span> <span class="s1">'&lt;html&gt;&lt;body&gt;&lt;table style="padding:0px; margin: 0px; border-collapse: collapse;"&gt;'</span><span class="p">;</span>
<span class="k">FROM</span> <span class="n">movie</span>
<span class="k">SELECT</span>
    <span class="k">IF</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">'&lt;tr&gt;'</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="o">||</span>
    <span class="nf">printf</span><span class="p">(</span><span class="s1">'&lt;td style="background-color: #%02x%02x%02x; height: 1px; width: 1px"; &gt;&lt;/td&gt;'</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span> <span class="o">||</span>
    <span class="k">IF</span><span class="p">(</span><span class="n">x</span> <span class="o">=</span> <span class="mi">719</span><span class="p">,</span> <span class="s1">'&lt;/tr&gt;'</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">48000</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="p">;</span>
<span class="k">SELECT</span> <span class="s1">'&lt;/table&gt;&lt;/body&gt;&lt;/html&gt;'</span><span class="p">;</span>
</code></pre></div></div>

<p>You can see the result in <a href="https://blobs.duckdb.org/data/movies-table.html"><code class="language-plaintext highlighter-rouge">movies-table.html</code></a> (keep in mind that it's 20 MB and renders each movie pixel as a table cell!).</p>
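<p>For illustration, the same pixels-to-table transformation can be sketched in plain Python (on made-up toy pixels, not the actual movie table), mirroring the <code class="language-plaintext highlighter-rouge">printf</code> and row-boundary logic of the query:</p>

```python
# Toy (y, x, r, g, b) pixels for a 1x2 "frame" (the values are made up).
pixels = [
    (0, 0, 255, 0, 0),   # red pixel
    (0, 1, 0, 0, 255),   # blue pixel
]
dim_x = 2

parts = ['<table style="padding:0px; margin: 0px; border-collapse: collapse;">']
for y, x, r, g, b in pixels:
    if x == 0:                 # open a row at the left edge, like IF(x = 0, ...)
        parts.append("<tr>")
    # one 1x1 cell per pixel, colored via the same #rrggbb hex format
    parts.append(
        f'<td style="background-color: #{r:02x}{g:02x}{b:02x}; '
        f'height: 1px; width: 1px"></td>'
    )
    if x == dim_x - 1:         # close the row at the right edge
        parts.append("</tr>")
parts.append("</table>")
html = "".join(parts)
print(html)
```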

<h2 id="conclusion">Conclusion</h2>

<p>You can probably tell that this post is not entirely serious. Fun was had. But what did we learn? A few things: first, basically anything can be represented as a table, even an obscure 1963 movie. In the grand scheme of things, it is probably not a great idea: there are amazing open-source libraries like <code class="language-plaintext highlighter-rouge">ffmpeg</code> and apps like <code class="language-plaintext highlighter-rouge">VLC</code> to deal with movie files, and the same goes for their array cousins that contain music or plain images. Despite the massive blow-up into billions of rows of data, DuckDB handled this pretty well, both in its data format and in its execution engine. Here at team DuckDB, our <a href="https://www.youtube.com/watch?v=TsWNMwH1NyM">mission</a> is to raise your overall confidence wrangling data of all shapes and sizes, and we hope this post contributes to that. And to finish up: do pay attention to your copyright notices!</p>]]></content><author><name>{&quot;twitter&quot; =&gt; &quot;hfmuehleisen&quot;, &quot;picture&quot; =&gt; &quot;/images/blog/authors/hannes_muehleisen.jpg&quot;}</name></author><summary type="html"><![CDATA[You can store and even process videos in DuckDB. In this post, we show you how.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://duckdb.org/images/blog/thumbs/movies-in-databases.png" /><media:content medium="image" url="https://duckdb.org/images/blog/thumbs/movies-in-databases.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry></feed>