Getting Started with DuckLake 1.0: A SQL-Based Data Lake Format

<h2 id="overview">Overview</h2> <p>DuckLake 1.0 introduces a fresh approach to managing data lake metadata. Instead of scattering metadata across numerous files in object storage, it centralizes table metadata in a SQL database, making updates, partitioning, and snapshot management more efficient. Built as a DuckDB extension, DuckLake integrates with existing workflows and offers interoperability with Iceberg-style features. This guide walks you through its setup, core operations, and common pitfalls.</p><figure style="margin:20px 0"><img src="https://res.infoq.com/news/2026/05/ducklake-sql-catalog/en/headerimage/generatedHeaderImage-1776423164012.jpg" alt="Getting Started with DuckLake 1.0: A SQL-Based Data Lake Format" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: www.infoq.com</figcaption></figure> <h2 id="prerequisites">Prerequisites</h2> <ul> <li><strong>DuckDB</strong>: Version 1.3.0 or higher (command-line interface or Python binding).</li> <li><strong>Object Storage</strong>: A bucket or directory (e.g., S3, MinIO, local filesystem) for storing Parquet files.</li> <li><strong>SQL Database</strong>: For the catalog; DuckDB itself works for local testing, while production setups typically use PostgreSQL or MySQL.</li> <li><strong>DuckLake Extension</strong>: Install via <code>INSTALL ducklake; LOAD ducklake;</code>.</li> </ul> <h2 id="step-by-step">Step-by-Step Instructions</h2> <h3 id="install-extension">1. Install and Load the DuckLake Extension</h3> <p>Open DuckDB and run:</p> <pre><code>INSTALL ducklake;
LOAD ducklake;</code></pre> <p>DuckLake ships as a core DuckDB extension, so no <code>FROM community</code> clause is needed. This registers DuckLake's functions and types. Verify with <code>SELECT extension_name, loaded FROM duckdb_extensions() WHERE extension_name = 'ducklake';</code></p> <h3 id="create-catalog">2. Create a DuckLake Catalog</h3> <p>A catalog holds all table metadata. 
Use <code>ATTACH</code> with a <code>ducklake:</code> connection string:</p> <pre><code>-- A local DuckDB file serves as the catalog for testing;
-- Parquet data files are written under DATA_PATH.
ATTACH 'ducklake:my_catalog.ducklake' AS my_catalog (DATA_PATH 'sales_data/');

-- Switch to the catalog
USE my_catalog;</code></pre> <p><em>Tip</em>: For a remote catalog database, point the connection string at it instead, e.g. <code>ATTACH 'ducklake:postgres:dbname=ducklake host=myhost' AS my_catalog (DATA_PATH 's3://bucket/prefix/');</code>.</p> <h3 id="create-table">3. Create a DuckLake Table</h3> <p>Define a table, then configure its partitioning:</p> <pre><code>CREATE TABLE sales (
    order_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE,
    region VARCHAR
);

-- Partition the table's data files by region
ALTER TABLE sales SET PARTITIONED BY (region);</code></pre> <p>This creates a logical table; data is stored as Parquet files in your object storage. DuckLake has no declared sort order: to speed up range queries, insert data ordered by the relevant column so file-level min/max statistics stay tight.</p> <h3 id="write-data">4. Insert Data</h3> <p>Insert directly or from a SELECT:</p> <pre><code>INSERT INTO sales VALUES
    (1, 150.00, '2025-01-15', 'East'),
    (2, 200.50, '2025-01-16', 'West');</code></pre> <p>DuckLake automatically writes new Parquet files per partition and updates the catalog.</p> <h3 id="read-data">5. Query the Table</h3> <p>Standard SQL works; DuckLake reads the catalog to locate files:</p> <pre><code>SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE order_date >= '2025-01-01'
GROUP BY region;</code></pre> <p>Partition pruning and file-level statistics are applied automatically.</p> <h3 id="manage-partitions">6. Manage Partitions and Small Updates</h3> <p>DuckLake supports incremental updates without rewriting whole partitions. 
Use <code>MERGE</code> or <code>DELETE</code>:</p> <pre><code>DELETE FROM sales WHERE order_id = 1;

MERGE INTO sales AS target
USING (VALUES (3, 300.00, DATE '2025-01-20', 'East'))
    AS src(order_id, amount, order_date, region)
ON target.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET amount = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount, order_date, region)
    VALUES (src.order_id, src.amount, src.order_date, src.region);</code></pre> <p>The catalog tracks these small changes efficiently: deletes are recorded as compact delete files and new snapshots rather than rewritten partitions.</p> <h3 id="iceberg-compat">7. Iceberg Compatibility</h3> <p>Reading existing Iceberg tables is handled by DuckDB's separate <code>iceberg</code> extension, which works alongside DuckLake:</p> <pre><code>INSTALL iceberg;
LOAD iceberg;

SELECT * FROM iceberg_scan('s3://bucket/iceberg_table');</code></pre> <p>Write support is limited to DuckLake-native tables.</p> <h2 id="common-mistakes">Common Mistakes</h2> <ul> <li><strong>Forgetting to load the extension</strong>: Always run <code>LOAD ducklake;</code> after installation.</li> <li><strong>Wrong catalog connection string</strong>: Ensure the catalog path or database URL in the <code>ATTACH</code> string is correct and accessible.</li> <li><strong>Misunderstanding partition columns</strong>: Partition columns are ordinary table columns; their values decide which data files a query can skip, so make sure inserts populate them meaningfully.</li> <li><strong>Accumulating small files</strong>: DuckLake handles small updates well, but frequent tiny inserts create many small Parquet files; compact periodically with <code>CALL ducklake_merge_adjacent_files('my_catalog');</code>.</li> <li><strong>Ignoring data order</strong>: Insert data ordered by your most common filter column so min/max file statistics can prune files; otherwise queries read more data than necessary.</li> </ul> <h2 id="summary">Summary</h2> <p>DuckLake 1.0 simplifies data lake management by storing metadata in SQL, enabling faster updates and smarter partitioning. With its DuckDB extension, you get a lightweight yet powerful alternative to Hive or Iceberg for analytical workloads. Start small, tune your partitions, and enjoy seamless SQL-driven data lakes.</p>
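<p>The steps above can also be driven end to end from the DuckDB Python binding listed in the prerequisites. A minimal sketch follows; the catalog file name and data path are arbitrary placeholders, and the first run needs network access to download the extension:</p> <pre><code>import duckdb

# Open a DuckDB connection (the Python binding from the prerequisites).
con = duckdb.connect()

# Install and load the DuckLake extension (downloaded on first use).
con.install_extension("ducklake")
con.load_extension("ducklake")

# Attach a DuckLake catalog: a local DuckDB file holds the metadata,
# and Parquet data files are written under DATA_PATH.
con.execute("ATTACH 'ducklake:my_catalog.ducklake' AS my_catalog (DATA_PATH 'sales_data/')")
con.execute("USE my_catalog")

# The same SQL as in steps 3-5.
con.execute("""
    CREATE TABLE IF NOT EXISTS sales (
        order_id INTEGER, amount DECIMAL(10,2),
        order_date DATE, region VARCHAR)
""")
con.execute("ALTER TABLE sales SET PARTITIONED BY (region)")
con.execute("INSERT INTO sales VALUES (1, 150.00, '2025-01-15', 'East'), (2, 200.50, '2025-01-16', 'West')")
print(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall())</code></pre> <p>If the catalog lives in PostgreSQL instead, only the <code>ATTACH</code> string changes; the rest of the script is identical.</p>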
