A Julia library that provides a clean, flexible interface for executing SQL queries against various data sources using DuckDB as the backend engine.
DuckQuery.jl combines the power of DuckDB's SQL engine with Julia's DataFrames, providing a unified interface for working with tabular data. It enables you to query data from various sources using familiar SQL syntax while maintaining a clean, intuitive API.
- Multiple Source Types: Query database files, in-memory DataFrames, or combinations of both
- Connection Management: Automatic management or control via do-blocks
- Data Processing: Pre- and post-processing hooks for query and result transformation
- Performance Tuning: Configuration options for memory limits, threading, and more
- Extension Support: Easily use DuckDB extensions for specialized functionality
- Error Handling: Flexible error handling strategies
- Performance Profiling: Built-in timing for query optimization
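For a first taste of the API (a minimal sketch; installation instructions and a fuller quick start follow below):
using DuckQuery
using DataFrames
df = DataFrame(id = 1:3, value = [10, 20, 30])
# A bare DataFrame is exposed to SQL under the table name `df`
results = querydf(df, "SELECT id, value FROM df WHERE value > 15")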
using Pkg
Pkg.add(url="https://github.com/yourusername/DuckQuery.jl")
Or, to install the latest registered version from the Pkg REPL:
]add DuckQuery
Here's a quick overview of how to use DuckQuery.jl:
using DuckQuery
using DataFrames
using DuckDB
# Create a sample database file with a 'products' table and some example rows
conn = DBInterface.connect(DuckDB.DB, "data.db")
DBInterface.execute(conn, "CREATE TABLE products (id INTEGER, name TEXT, price FLOAT, inventory INTEGER)")
DBInterface.execute(conn, "INSERT INTO products VALUES (1, 'Laptop', 999.99, 25), (2, 'Phone', 599.99, 50), (3, 'Headphones', 129.99, 100)")
DBInterface.close!(conn)
# Query a database file
results = querydf("data.db", "SELECT * FROM products LIMIT 10")
# Query an in-memory DataFrame
df = DataFrame(id = 1:3, name = ["Alice", "Bob", "Charlie"])
results = querydf(df, "SELECT * FROM df WHERE id > 1")
# Execute a query with verbosity and profiling
results = querydf(
"data.db",
"SELECT * FROM products WHERE price > 100",
verbose = true,
profile = true
)
using DuckQuery
using DataFrames
# Query an existing database file
results = querydf(
"customers.db",
"SELECT * FROM customers WHERE region = 'East' LIMIT 20"
)
# Create a new database and tables
querydf(
"new_database.db",
[
"CREATE TABLE customers (id INTEGER, name TEXT, region TEXT)",
"INSERT INTO customers VALUES (1, 'Alice', 'East'), (2, 'Bob', 'West')",
"CREATE INDEX idx_region ON customers(region)"
],
verbose = true
)
# Query with configuration settings
results = querydf(
"large_data.db",
"SELECT * FROM transactions WHERE amount > 1000",
init_config = Dict{Symbol, Any}(
:memory_limit => "8GB",
:threads => 8
)
)
# Load data from external files
results = querydf(
":memory:",
"SELECT * FROM read_csv_auto('data.csv') WHERE value > 100"
)
# Export results to files
export_df = querydf(
"database.db",
[
"COPY (SELECT * FROM customers WHERE signup_date > '2023-01-01') TO 'new_customers.csv'",
"SELECT * FROM customers WHERE signup_date > '2023-01-01'"
]
)
# Work with Parquet files
results = querydf(
":memory:",
"SELECT * FROM parquet_scan('large_dataset.parquet') LIMIT 1000",
init_queries = ["LOAD parquet"]
)
using DuckQuery
using DataFrames
# Create a sample DataFrame
df = DataFrame(
id = 1:5,
name = ["Alice", "Bob", "Charlie", "David", "Eve"],
age = [25, 32, 45, 28, 36],
department = ["Sales", "IT", "HR", "Sales", "IT"]
)
# Simple SELECT queries
results = querydf(df, "SELECT * FROM df WHERE age > 30")
# Aggregation and grouping
results = querydf(
df,
"""
SELECT
department,
COUNT(*) as employee_count,
AVG(age) as avg_age
FROM df
GROUP BY department
ORDER BY employee_count DESC
"""
)
# Using SQL functions with DataFrame data
results = querydf(
df,
"""
SELECT
id,
name,
age,
CASE
WHEN age < 30 THEN 'Junior'
WHEN age < 40 THEN 'Mid-level'
ELSE 'Senior'
END as seniority
FROM df
ORDER BY age
"""
)
# Use SQL to reshape/pivot data
sales_data = DataFrame(
date = repeat(["2023-01-01", "2023-01-02", "2023-01-03"], inner=3),
region = repeat(["East", "West", "Central"], 3),
amount = [100, 150, 200, 120, 180, 210, 130, 190, 220]
)
# Pivot table using SQL
pivot_results = querydf(
sales_data,
"""
SELECT
date,
SUM(CASE WHEN region = 'East' THEN amount ELSE 0 END) as East,
SUM(CASE WHEN region = 'West' THEN amount ELSE 0 END) as West,
SUM(CASE WHEN region = 'Central' THEN amount ELSE 0 END) as Central
FROM df
GROUP BY date
ORDER BY date
"""
)
using DuckQuery
using DataFrames
# Create sample DataFrames
customers = DataFrame(
id = 1:5,
name = ["Alice", "Bob", "Charlie", "David", "Eve"],
city = ["New York", "Los Angeles", "Chicago", "Houston", "Phoenix"]
)
orders = DataFrame(
id = 101:107,
customer_id = [1, 2, 1, 3, 2, 4, 5],
product = ["Widget A", "Widget B", "Widget C", "Widget B", "Widget A", "Widget C", "Widget A"],
amount = [150.0, 200.0, 120.0, 300.0, 250.0, 175.0, 220.0]
)
# Join DataFrames and perform analysis
results = querydf(
Dict("customers" => customers, "orders" => orders),
"""
SELECT
c.name,
c.city,
COUNT(o.id) as order_count,
SUM(o.amount) as total_spent,
AVG(o.amount) as avg_order_value
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name, c.city
ORDER BY total_spent DESC
"""
)
# Multi-table analysis
products = DataFrame(
name = ["Widget A", "Widget B", "Widget C"],
category = ["Standard", "Premium", "Standard"],
cost = [80.0, 120.0, 90.0]
)
# Profitability analysis across multiple tables
profitability = querydf(
Dict("customers" => customers, "orders" => orders, "products" => products),
"""
SELECT
p.category,
COUNT(o.id) as sales_count,
SUM(o.amount) as revenue,
SUM(o.amount - p.cost) as profit,
SUM(o.amount - p.cost) / SUM(o.amount) * 100 as profit_margin
FROM orders o
JOIN products p ON o.product = p.name
GROUP BY p.category
ORDER BY profit_margin DESC
"""
)
using DuckQuery
using DataFrames
using Dates
using Random
# 1. First, create a DataFrame with existing customer data
# These will be stored in the database file
existing_customers = DataFrame(
id = 1:5,
name = ["Alice", "Bob", "Charlie", "David", "Eva"],
signup_date = [
Date(2023, 2, 15), # 2023-02-15 (after 2023-01-01)
Date(2023, 3, 20), # 2023-03-20 (after 2023-01-01)
Date(2022, 12, 10), # 2022-12-10 (before 2023-01-01)
Date(2023, 4, 5), # 2023-04-05 (after 2023-01-01)
Date(2022, 11, 25) # 2022-11-25 (before 2023-01-01)
]
)
# Remove the database file if it exists (for a clean start)
db_path = "customers.db"
isfile(db_path) && rm(db_path)
# 2. Write the existing customers DataFrame to the database file
querydf(
Dict(
"customers_data" => existing_customers,
"db" => db_path
),
"""
-- Create a customers table in the database file
CREATE TABLE db.customers AS
SELECT * FROM customers_data
"""
)
println("Existing customers have been written to the database file")
# 3. Create a sample DataFrame with new customer data
new_customers = DataFrame(
id = 101:105,
name = ["Frank", "Grace", "Heidi", "Ivan", "Julia"],
signup_date = fill(today(), 5)
)
println("\nNew customers data:")
println(new_customers)
# 4. Query combining database file and DataFrames
results = querydf(
Dict(
"existing" => db_path, # Database file
"new_data" => new_customers # DataFrame
),
"""
SELECT
'Existing' as source,
name,
signup_date
FROM existing.customers
WHERE signup_date > '2023-01-01'
UNION ALL
SELECT
'New' as source,
name,
signup_date
FROM new_data
ORDER BY signup_date DESC, name
""",
verbose=true # Turn on verbose logging
)
# Display the results
println("\nCombined query results:")
println(results)
# Using database file for lookup tables with in-memory data
using DuckQuery
using DataFrames
using Random
using Dates
# 1. First, create a product catalog DataFrame
product_catalog = DataFrame(
code = ["A001", "B002", "C003", "D004", "E005"],
name = ["Laptop", "Desk Chair", "Coffee Maker", "Headphones", "Desk Lamp"],
category = ["Electronics", "Furniture", "Appliances", "Electronics", "Lighting"],
price = [1200.00, 150.00, 75.50, 89.99, 45.00],
description = [
"High-performance laptop",
"Ergonomic office chair",
"Programmable coffee maker",
"Noise-cancelling headphones",
"LED desk lamp with adjustable brightness"
]
)
# Remove the database file if it exists (for a clean start)
db_path = "product_catalog.db"
isfile(db_path) && rm(db_path)
# 2. Write the product catalog DataFrame to the database file
# We'll use DuckQuery to create the catalog table in the database
querydf(
Dict(
"products" => product_catalog,
"db" => db_path
),
"""
-- Create a catalog table in the database file
CREATE TABLE db.catalog AS
SELECT * FROM products
"""
)
println("Product catalog has been written to the database file")
# 3. Create a transactions DataFrame
Random.seed!(123) # For reproducible results
transactions = DataFrame(
id = 1:100,
product_code = rand(["A001", "B002", "C003", "D004", "E005"], 100),
amount = rand(10.0:500.0, 100)
)
println("\nExample transactions data:")
println(first(transactions, 5))
# 4. Now query the database together with the transactions DataFrame
enriched_data = querydf(
Dict(
"transactions" => transactions,
"products" => db_path
),
"""
SELECT
t.id as transaction_id,
t.product_code,
p.name as product_name,
p.category,
t.amount
FROM transactions t
JOIN products.catalog p ON t.product_code = p.code
ORDER BY t.id
""",
verbose=true # Turn on verbose logging
)
# Display the first few rows of the result
println("\nEnriched data result:")
println(first(enriched_data, 10))
using DuckQuery
using DataFrames
using DuckDB
# Create a sample DataFrame
data = DataFrame(
id = 1:1000,
value = rand(1000),
category = rand(["A", "B", "C"], 1000)
)
# Using do-block to control the connection
results = querydf(
":memory:",
"SELECT category, COUNT(*) as count, AVG(value) as avg_value FROM data_table GROUP BY category"
) do conn
# Register the DataFrame
DuckQuery.register_dataframe(conn, "data_temp", data)
# Create a filtered view for analysis
DBInterface.execute(conn, """
CREATE TEMP VIEW data_table AS
SELECT * FROM data_temp
WHERE value > 0.3
""")
# Could perform more operations on the connection here
# The query in the outer function will be executed after this block
end
# Performing multiple operations with database file connection
querydf(
"analytics.db",
[
"CREATE TABLE IF NOT EXISTS summary AS SELECT * FROM aggregated_data",
"SELECT * FROM summary ORDER BY category"
]
) do conn
# Register input data
DuckQuery.register_dataframe(conn, "input_data", data)
# Calculate aggregations and store in temporary table
DBInterface.execute(conn, """
CREATE TEMP TABLE aggregated_data AS
SELECT
category,
COUNT(*) as count,
MIN(value) as min_value,
MAX(value) as max_value,
AVG(value) as avg_value,
SUM(value) as total_value
FROM input_data
GROUP BY category
""")
end
# Example of using with_connection for multiple operations
# (Requires implementation of with_connection function)
DuckQuery.with_connection("analysis.db", verbose=true) do conn, config
# First query: create and populate a table
DBInterface.execute(conn, """
CREATE TABLE IF NOT EXISTS metrics (
id INTEGER PRIMARY KEY,
name TEXT,
value DOUBLE
)
""")
# Insert data
for i in 1:10
DBInterface.execute(conn, """
INSERT INTO metrics VALUES ($i, 'Metric-$i', $(rand() * 100))
""")
end
# Create a view for filtered data
DBInterface.execute(conn, """
CREATE VIEW IF NOT EXISTS high_metrics AS
SELECT * FROM metrics WHERE value > 50
""")
# Query the view
high_values = DuckQuery.execute_query(conn, "SELECT * FROM high_metrics", config)
println("High metrics: ", high_values)
# Create an index for better performance
DuckDB.execute(conn, "CREATE INDEX IF NOT EXISTS idx_metric_value ON metrics(value)")
# Run a final query
result = DuckQuery.execute_query(conn, """
SELECT
AVG(value) as avg_value,
COUNT(*) as total_count,
COUNT(CASE WHEN value > 50 THEN 1 END) as high_count
FROM metrics
""", config)
println("Summary: ", result)
# Return the final result
return result
end
using DuckQuery
using DataFrames
using Dates
# Create a sample DataFrame
df = DataFrame(
id = 1:10,
value = rand(10),
timestamp = now() .- Day.(1:10),
status = rand(["active", "inactive", "pending"], 10)
)
# Define preprocessing function to modify query
function add_timestamp_filtering(query)
# Add a timestamp condition if not present
if !occursin("WHERE", uppercase(query))
return query * " WHERE timestamp > '$(now() - Day(7))'"
else
return query * " AND timestamp > '$(now() - Day(7))'"
end
end
# Define postprocessing functions
function add_derived_columns(result_df)
if :value in propertynames(result_df)
result_df[!, :value_squared] = result_df.value .^ 2
end
return result_df
end
function filter_low_values(result_df)
return filter(row -> row.value > 0.3, result_df)
end
# Use both preprocessors and postprocessors
results = querydf(
df,
"SELECT id, value, status FROM df",
preprocessors = [add_timestamp_filtering],
postprocessors = [add_derived_columns, filter_low_values],
verbose = true
)
# Example with complex query transformation
sales_data = DataFrame(
date = Date(2023,1,1):Day(1):Date(2023,3,31),
sales = rand(50:500, 90),
region = rand(["North", "South", "East", "West"], 90)
)
# Preprocessor that adds date filtering based on the current month
function filter_current_month(query)
current_month = month(today())
current_year = year(today())
return replace(
query,
"FROM sales_data" =>
"FROM sales_data WHERE EXTRACT(MONTH FROM date) = $current_month AND EXTRACT(YEAR FROM date) = $current_year"
)
end
# Postprocessor that adds calculated columns
function add_performance_metrics(df)
if !isempty(df) && :sales in propertynames(df)
df[!, :performance] = map(df.sales) do sales
if sales > 400
return "Excellent"
elseif sales > 300
return "Good"
elseif sales > 200
return "Average"
else
return "Poor"
end
end
end
return df
end
# Use the processing pipeline
monthly_analysis = querydf(
Dict("sales_data" => sales_data),  # register under the name used in the SQL below
"""
SELECT
region,
AVG(sales) as avg_sales,
MAX(sales) as max_sales
FROM sales_data
GROUP BY region
ORDER BY avg_sales DESC
""",
preprocessors = [filter_current_month],
postprocessors = [add_performance_metrics]
)
using DuckQuery
using DataFrames
using Dates
# Create sample data
orders = DataFrame(
id = 1:100,
customer_id = rand(1:20, 100),
order_date = [Date(2023, 1, 1) + Day(rand(0:90)) for _ in 1:100],
total_amount = rand(50:500, 100)
)
# Use CTEs for complex multi-step analysis
results = querydf(
Dict("orders" => orders),
"""
WITH
-- Calculate monthly sales
monthly_sales AS (
SELECT
EXTRACT(MONTH FROM order_date) AS month,
EXTRACT(YEAR FROM order_date) AS year,
SUM(total_amount) AS monthly_revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY month, year
),
-- Calculate customer spending segments
customer_segments AS (
SELECT
customer_id,
SUM(total_amount) AS total_spent,
CASE
WHEN SUM(total_amount) > 2000 THEN 'High'
WHEN SUM(total_amount) > 1000 THEN 'Medium'
ELSE 'Low'
END AS segment
FROM orders
GROUP BY customer_id
)
-- Main query using both CTEs
SELECT
cs.segment,
COUNT(DISTINCT cs.customer_id) AS customer_count,
SUM(ms.monthly_revenue) AS total_revenue,
AVG(ms.monthly_revenue) AS avg_monthly_revenue
FROM customer_segments cs
JOIN orders o ON cs.customer_id = o.customer_id
JOIN monthly_sales ms ON
EXTRACT(MONTH FROM o.order_date) = ms.month AND
EXTRACT(YEAR FROM o.order_date) = ms.year
GROUP BY cs.segment
ORDER BY total_revenue DESC
"""
)
using DuckQuery
using DataFrames
using Dates
# Create sample data
# Build one 90-day price series per symbol (4 symbols x 90 days = 360 rows)
dates = Date(2023,1,1):Day(1):Date(2023,3,31)
symbols = ["AAPL", "MSFT", "GOOGL", "AMZN"]
stock_prices = DataFrame(
date = repeat(collect(dates), outer = length(symbols)),
symbol = repeat(symbols, inner = length(dates)),
price = [100 + 5*sin(i) + rand(-3:0.1:3) for i in 1:(length(dates) * length(symbols))]
)
# Use window functions for time series analysis
analysis = querydf(
stock_prices,
"""
SELECT
symbol,
date,
price,
-- Calculate moving averages
AVG(price) OVER (
PARTITION BY symbol
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7day,
-- Calculate price changes
price - LAG(price, 1) OVER (
PARTITION BY symbol
ORDER BY date
) AS daily_change,
-- Calculate percent changes
(price - LAG(price, 1) OVER (
PARTITION BY symbol
ORDER BY date
)) / LAG(price, 1) OVER (
PARTITION BY symbol
ORDER BY date
) * 100 AS daily_percent_change,
-- Calculate ranks
RANK() OVER (
PARTITION BY symbol
ORDER BY price DESC
) AS price_rank,
-- Calculate percentiles
NTILE(4) OVER (
PARTITION BY symbol
ORDER BY price
) AS price_quartile
FROM df
ORDER BY symbol, date
"""
)
using DuckQuery
using DataFrames
# Create a sample DataFrame with potential issues
df = DataFrame(
id = 1:5,
value = [10.5, missing, 20.3, 15.7, missing],
name = ["Item A", "Item B", "Item C", "Item D", "Item E"]
)
# Default error handling (throw exception)
try
result = querydf(
df,
"SELECT id, value / 0 AS impossible_value FROM df"
)
catch e
println("Error caught: ", e)
end
# Return empty DataFrame on error
empty_result = querydf(
df,
"SELECT id, SQRT(value) FROM df WHERE value < 0",
on_error = :return_empty
)
println("Result when error occurs with on_error=:return_empty:")
println(empty_result)
# Log error and return empty DataFrame
log_result = querydf(
df,
"SELECT * FROM nonexistent_table",
on_error = :log,
verbose = true
)
# Combine with preprocessors for safer queries
function make_query_safe(query)
# Add handling for missing values
return replace(
query,
"SELECT" => "SELECT COALESCE(value, 0) AS value_safe,"
)
end
safe_result = querydf(
df,
"SELECT id, value FROM df",
preprocessors = [make_query_safe],
on_error = :log
)
using DuckQuery
using DataFrames
using BenchmarkTools
# Generate a larger dataset for testing
large_df = DataFrame(
id = 1:100_000,
value_a = rand(100_000),
value_b = rand(100_000),
category = rand(["A", "B", "C", "D", "E"], 100_000)
)
# Basic query without optimization
@time basic_result = querydf(
large_df,
"""
SELECT
category,
COUNT(*) as count,
AVG(value_a) as avg_a,
AVG(value_b) as avg_b
FROM df
GROUP BY category
"""
)
# Example handling complex types with missing values
complex_df = DataFrame(
int_with_missing = Union{Int64, Missing}[1, 2, missing, 4],
float_with_missing = Union{Float64, Missing}[1.1, missing, 3.3, 4.4],
string_with_missing = Union{String, Missing}["a", "b", missing, "d"]
)
# Query filtering out missing values
@time missing_filtered = querydf(
complex_df,
"""
SELECT * FROM df
WHERE int_with_missing IS NOT NULL
AND float_with_missing IS NOT NULL
AND string_with_missing IS NOT NULL
"""
)
# Same query with performance configuration
@time optimized_result = querydf(
large_df,
"""
SELECT
category,
COUNT(*) as count,
AVG(value_a) as avg_a,
AVG(value_b) as avg_b
FROM df
GROUP BY category
""",
init_config = Dict{Symbol, Any}(
:memory_limit => "4GB",
:threads => 8
),
profile = true
)
# Using profiling to optimize queries
complex_query_result = querydf(
Dict("data" => large_df),
"""
SELECT
category,
percentile_cont(0.5) WITHIN GROUP (ORDER BY value_a) AS median_a,
percentile_cont(0.5) WITHIN GROUP (ORDER BY value_b) AS median_b,
CORR(value_a, value_b) AS correlation
FROM data
GROUP BY category
""",
profile = true
)
# Database file with a single query
querydf(dbfile::String, query::String; kwargs...)::DataFrame
# Database file with multiple queries
querydf(dbfile::String, queries::Vector{String}; kwargs...)::DataFrame
# DataFrame as source
querydf(df::DataFrame, query::String; kwargs...)::DataFrame
# Multiple DataFrames as sources
querydf(dfs::Dict{String, DataFrame}, query::String; kwargs...)::DataFrame
# Mixed sources (database files and DataFrames)
querydf(sources::Dict{String, <:Any}, query::String; kwargs...)::DataFrame
# Do-block variants for connection control
querydf(f::Function, dbfile::String, query::String; kwargs...)::DataFrame
querydf(f::Function, dbfile::String, queries::Vector{String}; kwargs...)::DataFrame
All variants accept the following keyword arguments:
- init_queries::Union{String, Vector{String}}: Initialization queries to run before the main query
- init_config::Dict{Symbol, Any}: Configuration settings for DuckDB
- verbose::Bool: Enable verbose logging
- profile::Bool: Enable query profiling
- preprocessors::Vector{<:Function}: Functions to transform the query before execution
- postprocessors::Vector{<:Function}: Functions to transform the result DataFrame
- on_error::Symbol: Error handling strategy (:throw, :return_empty, or :log)
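These options can be combined freely. The sketch below uses only keyword arguments documented above; the file name, the anonymous preprocessor, and the anonymous postprocessor are illustrative:
results = querydf(
"sales.db",
"SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
init_queries = ["SET threads TO 4"],                # run before the main query
preprocessors = [q -> q * " ORDER BY total DESC"],  # rewrite the query text
postprocessors = [df -> first(df, 10)],             # transform the result DataFrame
on_error = :return_empty,
verbose = true,
profile = true
)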
The init_config dictionary can include:
- :memory_limit: Maximum memory usage (e.g., "4GB")
- :threads: Number of threads to use
- :extensions: Extensions to load (e.g., ["parquet", "json", "httpfs"])
- :batch_size: Number of rows to process in each batch when manually registering DataFrames (default: 1000)
- :force_manual_registration: Force manual DataFrame registration instead of using DuckDB's native registration (default: false)
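For example (a sketch; "warehouse.db" and "events.parquet" are placeholder file names, and extension availability depends on your DuckDB build):
results = querydf(
"warehouse.db",
"SELECT * FROM parquet_scan('events.parquet') LIMIT 100",
init_config = Dict{Symbol, Any}(
:memory_limit => "4GB",
:threads => 4,
:extensions => ["parquet", "httpfs"],
:batch_size => 5_000,
:force_manual_registration => false
)
)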
Contributions to DuckQuery.jl are welcome! Please feel free to:
- Open issues for bugs or feature requests
- Submit pull requests with improvements
- Add more examples or improve documentation
DuckQuery.jl is available under the MIT License. See the LICENSE file for more details.