R dplyr pipeline syntax support for DuckDB - transpiles dplyr verbs to SQL
Installing and Loading
INSTALL dplyr FROM community;
LOAD dplyr;
Example
-- Transform R dplyr pipeline syntax directly in DuckDB
-- First, create a sample table
CREATE TABLE iris AS SELECT * FROM (VALUES
(5.1, 3.5, 1.4, 0.2, 'setosa'),
(4.9, 3.0, 1.4, 0.2, 'setosa'),
(7.0, 3.2, 4.7, 1.4, 'versicolor'),
(6.4, 3.2, 4.5, 1.5, 'versicolor'),
(6.3, 3.3, 6.0, 2.5, 'virginica'),
(5.8, 2.7, 5.1, 1.9, 'virginica')
) AS t(sepal_length, sepal_width, petal_length, petal_width, species);
-- Run a pipeline directly as a statement (parser extension)
iris %>%
filter(sepal_length > 5) %>%
select(species, sepal_length, petal_length) %>%
arrange(desc(sepal_length));
┌────────────┬──────────────┬──────────────┐
│ species │ sepal_length │ petal_length │
├────────────┼──────────────┼──────────────┤
│ versicolor │ 7.0 │ 4.7 │
│ versicolor │ 6.4 │ 4.5 │
│ virginica │ 6.3 │ 6.0 │
│ virginica │ 5.8 │ 5.1 │
│ setosa │ 5.1 │ 1.4 │
└────────────┴──────────────┴──────────────┘
-- Group and summarize with dplyr syntax
iris %>%
group_by(species) %>%
summarise(avg_sepal = mean(sepal_length), count = n());
┌───────────┬───────┐
│ avg_sepal │ count │
│ double │ int64 │
├───────────┼───────┤
│ 6.05 │ 2 │
│ 5.0 │ 2 │
│ 6.7 │ 2 │
└───────────┴───────┘
-- Create derived columns with mutate
iris %>%
mutate(sepal_ratio = sepal_length / sepal_width) %>%
select(species, sepal_ratio) %>%
arrange(sepal_ratio);
-- Embed a pipeline inside normal SQL using (| ... |)
SELECT species, COUNT(*) AS n
FROM (| iris %>% filter(sepal_length > 5) %>% select(species) |)
GROUP BY species
ORDER BY n DESC;
About dplyr
dplyr is a DuckDB extension that brings R's popular dplyr package syntax to DuckDB.
It transpiles dplyr pipeline syntax (using the %>% pipe operator) directly to SQL,
allowing R users to write familiar data manipulation code in DuckDB.
Supported Functions
Core Verbs
| Function | Description | Example |
| :— | :— | :— |
| select() | Select/rename columns | select(id, name) |
| filter() | Filter rows | filter(age > 18) |
| mutate() | Create/modify columns | mutate(total = price * qty) |
| rename() | Rename columns | rename(new = old) |
| arrange() | Sort rows | arrange(desc(date)) |
| group_by() | Group rows | group_by(dept) |
| summarise() | Aggregate data | summarise(avg = mean(val)) |
| *_join() | Joins (inner, left, etc.) | left_join(other, by="id") |
| Set Ops | union, intersect, setdiff | union(other) |
Helper Functions
- Aggregation:
mean,sum,min,max,n,count,median,mode - Window:
row_number,rank,lead,lag,ntile - Math:
abs,sqrt,round,floor,log,exp - String:
tolower,toupper,substr,trimws - Logic:
ifelse,is.na,coalesce
Key Features:
- Native R dplyr syntax support with
%>%pipe operator - Transpiles to optimized SQL at runtime
- Works with any DuckDB table (pipelines must start with a table name)
- Error messages include helpful diagnostics
- Embed pipelines in normal SQL using
(| ... |)(e.g.,SELECT * FROM (| iris %>% filter(x > 0) |);)
Usage:
-- Using the dplyr() table function
SELECT * FROM dplyr('table_name %>% filter(x > 5) %>% select(a, b)');
-- Or write a pipeline directly as a statement (parser extension)
table_name %>% filter(x > 5) %>% select(a, b);
For more information, visit the GitHub repository.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| dplyr | table | NULL | NULL |