WITH
子句允许您指定公共表表达式 (CTE)。常规(非递归)公共表表达式本质上是视图,其作用域仅限于特定查询。CTE 可以相互引用,并且可以嵌套。递归 CTE 可以引用自身。
基本 CTE 示例
创建一个名为 cte
的 CTE 并在主查询中使用它
WITH cte AS (SELECT 42 AS x)
SELECT * FROM cte;
x |
---|
42 |
创建两个 CTE cte1
和 cte2
,其中第二个 CTE 引用第一个 CTE
WITH
cte1 AS (SELECT 42 AS i),
cte2 AS (SELECT i * 100 AS x FROM cte1)
SELECT * FROM cte2;
x |
---|
4200 |
您可以为 CTE 指定列名
WITH cte(j) AS (SELECT 42 AS i)
FROM cte;
CTE 实体化
DuckDB 可以采用 CTE 实体化,而不是将 CTE 内联到主查询中。这是通过启发式方法执行的:如果 CTE 执行分组聚合并被多次查询,则会进行实体化。可以通过使用 AS MATERIALIZED
定义 CTE 来显式激活实体化,并使用 AS NOT MATERIALIZED
禁用实体化。
以下面这个多次调用同一个 CTE 的查询为例
WITH t(x) AS (complex_query)
SELECT *
FROM
t AS t1,
t AS t2,
t AS t3;
内联会为每个引用重复 t
的定义,从而产生以下查询
SELECT *
FROM
(complex_query) AS t1(x),
(complex_query) AS t2(x),
(complex_query) AS t3(x);
如果 complex_query
的开销很大,使用 MATERIALIZED
关键字对其进行实体化可以提高性能。在这种情况下,complex_query
只会被评估一次。
WITH t(x) AS MATERIALIZED (complex_query)
SELECT *
FROM
t AS t1,
t AS t2,
t AS t3;
如果想要禁用实体化,请使用 NOT MATERIALIZED
WITH t(x) AS NOT MATERIALIZED (complex_query)
SELECT *
FROM
t AS t1,
t AS t2,
t AS t3;
递归 CTE
WITH RECURSIVE
允许定义可以引用自身的 CTE。请注意,查询必须以确保终止的方式编写,否则可能会陷入无限循环。
示例:斐波那契数列
WITH RECURSIVE
可以用于进行递归计算。例如,以下是 WITH RECURSIVE
如何用于计算前十个斐波那契数的方法
WITH RECURSIVE FibonacciNumbers (
RecursionDepth, FibonacciNumber, NextNumber
) AS (
-- Base case
SELECT
0 AS RecursionDepth,
0 AS FibonacciNumber,
1 AS NextNumber
UNION ALL
-- Recursive step
SELECT
fib.RecursionDepth + 1 AS RecursionDepth,
fib.NextNumber AS FibonacciNumber,
fib.FibonacciNumber + fib.NextNumber AS NextNumber
FROM
FibonacciNumbers fib
WHERE
fib.RecursionDepth + 1 < 10
)
SELECT
fn.RecursionDepth AS FibonacciNumberIndex,
fn.FibonacciNumber
FROM
FibonacciNumbers fn;
斐波那契数索引 | 斐波那契数 |
---|---|
0 | 0 |
1 | 1 |
2 | 1 |
3 | 2 |
4 | 3 |
5 | 5 |
6 | 8 |
7 | 13 |
8 | 21 |
9 | 34 |
示例:树遍历
WITH RECURSIVE
可以用于遍历树。例如,以标签层次结构为例
CREATE TABLE tag (id INTEGER, name VARCHAR, subclassof INTEGER);
INSERT INTO tag VALUES
(1, 'U2', 5),
(2, 'Blur', 5),
(3, 'Oasis', 5),
(4, '2Pac', 6),
(5, 'Rock', 7),
(6, 'Rap', 7),
(7, 'Music', 9),
(8, 'Movies', 9),
(9, 'Art', NULL);
以下查询返回从节点 Oasis
到树根 (Art
) 的路径。
WITH RECURSIVE tag_hierarchy(id, source, path) AS (
SELECT id, name, [name] AS path
FROM tag
WHERE subclassof IS NULL
UNION ALL
SELECT tag.id, tag.name, list_prepend(tag.name, tag_hierarchy.path)
FROM tag, tag_hierarchy
WHERE tag.subclassof = tag_hierarchy.id
)
SELECT path
FROM tag_hierarchy
WHERE source = 'Oasis';
路径 |
---|
[Oasis, Rock, Music, Art] |
图遍历
WITH RECURSIVE
子句可用于在任意图上表达图遍历。但是,如果图包含循环,查询必须执行循环检测以防止无限循环。实现这一点的一种方法是将遍历路径存储在列表中,并且在用新边扩展路径之前,检查其端点是否已被访问过(请参阅后面的示例)。
以下是LDBC Graphalytics 基准测试中的有向图
CREATE TABLE edge (node1id INTEGER, node2id INTEGER);
INSERT INTO edge VALUES
(1, 3), (1, 5), (2, 4), (2, 5), (2, 10), (3, 1),
(3, 5), (3, 8), (3, 10), (5, 3), (5, 4), (5, 8),
(6, 3), (6, 4), (7, 4), (8, 1), (9, 4);
请注意,该图包含有向循环,例如节点 1、5 和 8 之间。
枚举从一个节点开始的所有路径
以下查询返回从节点 1 开始的所有路径
WITH RECURSIVE paths(startNode, endNode, path) AS (
SELECT -- Define the path as the first edge of the traversal
node1id AS startNode,
node2id AS endNode,
[node1id, node2id] AS path
FROM edge
WHERE startNode = 1
UNION ALL
SELECT -- Concatenate new edge to the path
paths.startNode AS startNode,
node2id AS endNode,
array_append(path, node2id) AS path
FROM paths
JOIN edge ON paths.endNode = node1id
-- Prevent adding a repeated node to the path.
-- This ensures that no cycles occur.
WHERE list_position(paths.path, node2id) IS NULL
)
SELECT startNode, endNode, path
FROM paths
ORDER BY length(path), path;
起始节点 | 结束节点 | 路径 |
---|---|---|
1 | 3 | [1, 3] |
1 | 5 | [1, 5] |
1 | 5 | [1, 3, 5] |
1 | 8 | [1, 3, 8] |
1 | 10 | [1, 3, 10] |
1 | 3 | [1, 5, 3] |
1 | 4 | [1, 5, 4] |
1 | 8 | [1, 5, 8] |
1 | 4 | [1, 3, 5, 4] |
1 | 8 | [1, 3, 5, 8] |
1 | 8 | [1, 5, 3, 8] |
1 | 10 | [1, 5, 3, 10] |
请注意,此查询的结果不限于最短路径,例如,对于节点 5,结果包括路径 [1, 5]
和 [1, 3, 5]
。
枚举从一个节点开始的无权最短路径
在大多数情况下,枚举所有路径是不切实际或不可行的。相反,只有(无权)最短路径才值得关注。要找到这些路径,WITH RECURSIVE
查询的第二部分应进行调整,使其仅包含尚未访问过的节点。这是通过使用子查询实现的,该子查询检查之前的任何路径是否包含该节点
WITH RECURSIVE paths(startNode, endNode, path) AS (
SELECT -- Define the path as the first edge of the traversal
node1id AS startNode,
node2id AS endNode,
[node1id, node2id] AS path
FROM edge
WHERE startNode = 1
UNION ALL
SELECT -- Concatenate new edge to the path
paths.startNode AS startNode,
node2id AS endNode,
array_append(path, node2id) AS path
FROM paths
JOIN edge ON paths.endNode = node1id
-- Prevent adding a node that was visited previously by any path.
-- This ensures that (1) no cycles occur and (2) only nodes that
-- were not visited by previous (shorter) paths are added to a path.
WHERE NOT EXISTS (
FROM paths previous_paths
WHERE list_contains(previous_paths.path, node2id)
)
)
SELECT startNode, endNode, path
FROM paths
ORDER BY length(path), path;
起始节点 | 结束节点 | 路径 |
---|---|---|
1 | 3 | [1, 3] |
1 | 5 | [1, 5] |
1 | 8 | [1, 3, 8] |
1 | 10 | [1, 3, 10] |
1 | 4 | [1, 5, 4] |
1 | 8 | [1, 5, 8] |
枚举两个节点之间的无权最短路径
WITH RECURSIVE
也可以用于查找两个节点之间的所有(无权)最短路径。为了确保递归查询在到达结束节点时立即停止,我们使用一个窗口函数,该函数检查结束节点是否在新增节点中。
以下查询返回节点 1(起始节点)和 8(结束节点)之间的所有无权最短路径
WITH RECURSIVE paths(startNode, endNode, path, endReached) AS (
SELECT -- Define the path as the first edge of the traversal
node1id AS startNode,
node2id AS endNode,
[node1id, node2id] AS path,
(node2id = 8) AS endReached
FROM edge
WHERE startNode = 1
UNION ALL
SELECT -- Concatenate new edge to the path
paths.startNode AS startNode,
node2id AS endNode,
array_append(path, node2id) AS path,
max(CASE WHEN node2id = 8 THEN 1 ELSE 0 END)
OVER (ROWS BETWEEN UNBOUNDED PRECEDING
AND UNBOUNDED FOLLOWING) AS endReached
FROM paths
JOIN edge ON paths.endNode = node1id
WHERE NOT EXISTS (
FROM paths previous_paths
WHERE list_contains(previous_paths.path, node2id)
)
AND paths.endReached = 0
)
SELECT startNode, endNode, path
FROM paths
WHERE endNode = 8
ORDER BY length(path), path;
起始节点 | 结束节点 | 路径 |
---|---|---|
1 | 8 | [1, 3, 8] |
1 | 8 | [1, 5, 8] |
带有 USING KEY
的递归 CTE
USING KEY
改变了常规递归 CTE 的行为。
在每次迭代中,常规递归 CTE 将结果行追加到联合表(union table)中,联合表最终定义了 CTE 的整体结果。相比之下,带有 USING KEY
的 CTE 能够更新先前迭代中已放入联合表中的行:如果当前迭代生成一个键为 k
的行,它将替换联合表中具有相同键 k
的行(类似于字典)。如果联合表中尚不存在这样的行,则新行照常追加到联合表中。
这允许 CTE 对联合表的内容进行细粒度控制。避免仅追加行为可以显著减小联合表的大小。这有助于查询运行时、内存消耗,并使得在迭代进行中访问联合表成为可能(这对于常规递归 CTE 是不可能的):在 WITH RECURSIVE T(...) USING KEY ...
形式的 CTE 中,表 T
表示上次迭代添加的行(这对于递归 CTE 是常见的),而表 recurring.T
表示到目前为止构建的联合表。对 recurring.T
的引用使得将相当复杂的算法优雅地、地道地转换为可读的 SQL 代码成为可能。
示例:USING KEY
这是一个递归 CTE,其中 USING KEY
有一个键列 (a
) 和一个有效载荷列 (b
)。有效载荷列对应于将被覆盖的列。在第一次迭代中,我们有两个不同的键,1
和 2
。这两个键将生成两行新数据,(1, 3)
和 (2, 4)
。在下一次迭代中,我们生成一个新键 3
,它会生成一行新数据。我们还生成行 (2, 3)
,其中 2
是上一次迭代中已经存在的键。这将用新的有效载荷 3
覆盖旧的有效载荷 4
。
WITH RECURSIVE tbl(a, b) USING KEY (a) AS (
SELECT a, b
FROM (VALUES (1, 3), (2, 4)) t(a, b)
UNION
SELECT a + 1, b
FROM tbl
WHERE a < 3
)
SELECT *
FROM tbl;
a | b |
---|---|
1 | 3 |
2 | 3 |
3 | 3 |
使用 VALUES
您可以使用 VALUES
子句作为 CTE 的初始(锚定)部分
WITH RECURSIVE tbl(a, b) USING KEY (a) AS (
VALUES (1, 3), (2, 4)
UNION
SELECT a + 1, b
FROM tbl
WHERE a < 3
)
SELECT *
FROM tbl;
示例:USING KEY
引用联合表
除了将联合表用作字典之外,我们现在还可以在查询中引用它。这使您不仅可以使用上次迭代的结果,还可以使用更早迭代的结果。这项新功能使某些算法更容易实现。
一个例子是连通分量算法。对于每个节点,该算法确定与其连接的具有最低 ID 的节点。为了实现这一点,我们使用联合表中的条目来跟踪为节点找到的最低 ID。如果新传入的行包含较低的 ID,我们会更新此值。
CREATE TABLE nodes (id INTEGER);
INSERT INTO nodes VALUES (1), (2), (3), (4), (5), (6), (7), (8);
CREATE TABLE edges (node1id INTEGER, node2id INTEGER);
INSERT INTO edges VALUES
(1, 3), (2, 3), (3, 7), (7, 8), (5, 4),
(6, 4);
WITH RECURSIVE cc(id, comp) USING KEY (id) AS (
SELECT n.id, n.id AS comp
FROM nodes AS n
UNION
(SELECT DISTINCT ON (u.id) u.id, v.comp
FROM recurring.cc AS u, cc AS v, edges AS e
WHERE ((e.node1id, e.node2id) = (u.id, v.id)
OR (e.node2id, e.node1id) = (u.id, v.id))
AND v.comp < u.comp
ORDER BY u.id ASC, v.comp ASC)
)
TABLE cc
ORDER BY id;
id | comp |
---|---|
1 | 1 |
2 | 1 |
3 | 1 |
4 | 4 |
5 | 4 |
6 | 4 |
7 | 1 |
8 | 1 |
限制
DuckDB 不支持相互递归的 CTE。请参阅 DuckDB 仓库中相关的问题和讨论。