S3 Data Lake

Availability: Airbyte Cloud, Airbyte OSS, PyAirbyte
Support Level: Marketplace
Connector Version: 0.3.12

Caution

This connector is in early access and still evolving. Future updates may introduce breaking changes.

We're interested in hearing about your experience! See GitHub for more information on joining the beta.

This page guides you through the process of setting up the S3 Data Lake destination connector.

This connector writes data in the Iceberg table format to S3 or an S3-compatible storage backend. It currently supports the REST, AWS Glue, and Nessie catalogs.

Setup Guide

S3 Data Lake requires configuring two components: S3 storage, and your Iceberg catalog.

S3 Setup

The connector needs certain permissions to be able to write Iceberg-format files to S3:

s3:ListAllMyBuckets
s3:GetObject*
s3:PutObject
s3:PutObjectAcl
s3:DeleteObject
s3:ListBucket*
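
As an illustration, the permissions above could be expressed as an IAM policy document. The sketch below builds one in Python and prints it as JSON. The bucket name and the way actions are scoped to resources (account-wide, bucket, objects) are assumptions for the example, not requirements stated by the connector.

```python
import json


def build_s3_policy(bucket: str) -> dict:
    """Build an IAM policy document granting the S3 actions listed above."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing all buckets is account-wide and cannot be scoped to one bucket.
                "Effect": "Allow",
                "Action": ["s3:ListAllMyBuckets"],
                "Resource": "*",
            },
            {
                # Bucket-level actions target the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket*"],
                "Resource": bucket_arn,
            },
            {
                # Object-level actions target keys inside the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject*",
                    "s3:PutObject",
                    "s3:PutObjectAcl",
                    "s3:DeleteObject",
                ],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


# "my-data-lake-bucket" is a placeholder bucket name.
print(json.dumps(build_s3_policy("my-data-lake-bucket"), indent=2))
```

The printed JSON can be pasted into an IAM policy editor after replacing the placeholder bucket name.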

Iceberg Catalog Setup

Different catalogs have different setup requirements.

AWS Glue

In addition to the S3 permissions, you should also grant these Glue permissions:

glue:TagResource
glue:UntagResource
glue:BatchCreatePartition
glue:BatchDeletePartition
glue:BatchDeleteTable
glue:BatchGetPartition
glue:CreateDatabase
glue:CreateTable
glue:CreatePartition
glue:DeletePartition
glue:DeleteTable
glue:GetDatabase
glue:GetDatabases
glue:GetPartition
glue:GetPartitions
glue:GetTable
glue:GetTables
glue:UpdateDatabase
glue:UpdatePartition
glue:UpdateTable
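
One way to confirm that an existing policy covers this list is to compare it against the required actions. The sketch below is a simplified check written for this page: it only looks at Allow statements, ignores Deny statements, resources, and conditions, and treats action names as case-insensitive patterns with wildcards. The function name missing_actions is hypothetical.

```python
from fnmatch import fnmatchcase

REQUIRED_GLUE_ACTIONS = [
    "glue:TagResource", "glue:UntagResource",
    "glue:BatchCreatePartition", "glue:BatchDeletePartition",
    "glue:BatchDeleteTable", "glue:BatchGetPartition",
    "glue:CreateDatabase", "glue:CreateTable", "glue:CreatePartition",
    "glue:DeletePartition", "glue:DeleteTable",
    "glue:GetDatabase", "glue:GetDatabases",
    "glue:GetPartition", "glue:GetPartitions",
    "glue:GetTable", "glue:GetTables",
    "glue:UpdateDatabase", "glue:UpdatePartition", "glue:UpdateTable",
]


def missing_actions(policy: dict, required=REQUIRED_GLUE_ACTIONS) -> list:
    """Return the required actions that no Allow statement in the policy matches."""
    patterns = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # "Action" may be a single string or a list
            actions = [actions]
        patterns.extend(action.lower() for action in actions)
    return [
        action for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in patterns)
    ]


# Example policy that only grants read access: the write actions are reported.
read_only = {"Statement": [{"Effect": "Allow", "Action": ["glue:Get*", "glue:BatchGet*"]}]}
print(missing_actions(read_only))
```

An empty result means every listed action is matched by at least one Allow pattern. It does not prove the policy is effective, since Deny statements and resource scoping are not evaluated here.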

Set the "warehouse location" option to s3://<bucket name>/path/within/bucket.

The "Role ARN" option is only available in Airbyte Cloud.

REST catalog

You will need the URI of your REST catalog.

Nessie

You will need the URI of your Nessie catalog, and an access token to authenticate to that catalog.

Set the "warehouse location" option to s3://<bucket name>/path/within/bucket.

Iceberg schema generation

The top-level fields of the stream will be mapped to Iceberg fields. Nested fields (objects, arrays, and unions) will be mapped to STRING columns, and written as serialized JSON. This is the full mapping between Airbyte types and Iceberg types:

Airbyte type	Iceberg type
Boolean	Boolean
Date	Date
Integer	Long
Number	Double
String	String
Time with timezone	Time
Time without timezone	Time
Timestamp with timezone	Timestamp with timezone
Timestamp without timezone	Timestamp without timezone
Object	String (JSON-serialized value)
Array	String (JSON-serialized value)
Union	String (JSON-serialized value)

Note that for the time/timestamp with timezone types, the value is first adjusted to UTC, and then written into the Iceberg file.
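
The mapping can be illustrated with a small Python sketch. The Airbyte type labels used as dictionary keys and the helper to_iceberg_value are hypothetical names chosen for the example, and only the timestamp case of the UTC adjustment is shown. This is not the connector's implementation.

```python
import json
from datetime import datetime, timezone

# Type mapping from the table above. The keys are illustrative labels,
# the values are Iceberg primitive type names.
AIRBYTE_TO_ICEBERG = {
    "boolean": "boolean",
    "date": "date",
    "integer": "long",
    "number": "double",
    "string": "string",
    "time_with_timezone": "time",
    "time_without_timezone": "time",
    "timestamp_with_timezone": "timestamptz",
    "timestamp_without_timezone": "timestamp",
    "object": "string",
    "array": "string",
    "union": "string",
}

NESTED_TYPES = {"object", "array", "union"}


def to_iceberg_value(airbyte_type: str, value):
    """Convert a single top-level field value following the rules described above."""
    if value is None:
        return None
    if airbyte_type in NESTED_TYPES:
        # Nested values are written as serialized JSON in a string column.
        return json.dumps(value)
    if airbyte_type == "timestamp_with_timezone":
        # Adjust to UTC before writing.
        return datetime.fromisoformat(value).astimezone(timezone.utc)
    return value


print(to_iceberg_value("object", {"city": "Paris"}))
print(to_iceberg_value("timestamp_with_timezone", "2025-02-12T10:00:00+02:00").isoformat())
```

The second call prints 2025-02-12T08:00:00+00:00: the original +02:00 offset is not preserved, only the instant it denotes.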

Schema evolution

This connector supports limited schema evolution. Outside of refreshes/clears, the connector will never rewrite existing data files. This means that we can only handle specific schema changes:

Adding/removing a column
Widening columns
Changing the primary key

If your source goes through an unsupported schema change, the connector will fail at sync time. To resolve this, you can either:

Manually edit your table schema via Iceberg directly
Refresh your connection (removing existing records) or clear your connection

Full refresh overwrite syncs can also handle otherwise unsupported schema changes transparently.
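
To make the distinction between supported and unsupported changes concrete, here is a minimal sketch that compares two column-to-type mappings. The widening rules are an assumption based on the type promotions Iceberg allows (int to long, float to double); the connector's actual rules may differ, and primary key changes are not modeled.

```python
# Assumed widening rules, taken from Iceberg's allowed type promotions.
WIDENINGS = {("int", "long"), ("float", "double")}


def classify_changes(old: dict, new: dict) -> dict:
    """Classify per-column differences between two {column: type} schemas."""
    changes = {"added": [], "removed": [], "widened": [], "unsupported": []}
    for name in sorted(old.keys() | new.keys()):
        if name not in old:
            changes["added"].append(name)
        elif name not in new:
            changes["removed"].append(name)
        elif old[name] != new[name]:
            # A type change is only safe if existing files remain readable.
            kind = "widened" if (old[name], new[name]) in WIDENINGS else "unsupported"
            changes[kind].append(name)
    return changes


old_schema = {"id": "long", "price": "float", "name": "string"}
new_schema = {"id": "long", "price": "double", "name": "long", "email": "string"}
print(classify_changes(old_schema, new_schema))
```

In this example, name changing from string to long lands in the unsupported bucket, which corresponds to the case where a sync would fail until the table is edited, refreshed, or cleared.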

Deduplication

This connector uses a merge-on-read strategy to support deduplication:

The stream's primary keys are translated to Iceberg's identifier columns.
An "upsert" is an equality-based delete on that row's primary key, followed by an insertion of the new data.
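
The upsert behavior described above can be modeled with a toy in-memory table. This is a simplified illustration, not how Iceberg stores data: the class MergeOnReadTable and its append-only log are hypothetical stand-ins for data files and equality-delete files. The point it shows is that writes only append, and deletes are applied when the table is read.

```python
class MergeOnReadTable:
    """Toy model: writes append to a log, reads merge deletes and inserts."""

    def __init__(self, primary_key):
        self.primary_key = primary_key
        self.log = []  # append-only entries: ("delete", key) or ("insert", row)

    def _key(self, row):
        return tuple(row[column] for column in self.primary_key)

    def upsert(self, row):
        # An upsert is an equality delete on the primary key...
        self.log.append(("delete", self._key(row)))
        # ...followed by an insert of the new row.
        self.log.append(("insert", row))

    def read(self):
        # Deletes are resolved at read time; nothing written earlier is rewritten.
        live = []
        for operation, payload in self.log:
            if operation == "delete":
                live = [row for row in live if self._key(row) != payload]
            else:
                live.append(payload)
        return live


table = MergeOnReadTable(primary_key=["id"])
table.upsert({"id": 1, "value": "a"})
table.upsert({"id": 2, "value": "b"})
table.upsert({"id": 1, "value": "c"})
print(table.read())
```

After the three upserts, the log holds six entries, but a read returns only two rows: the latest row for each id.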

Assumptions

The S3 Data Lake connector assumes that one of two things is true:

The source will never emit the same primary key twice in a single sync attempt.
If the source emits the same PK multiple times in a single attempt, it will always emit those records in cursor order (oldest to newest).

If these conditions are not met, you may see inaccurate data in the destination (i.e. older records taking precedence over newer records). If this happens, you should use the append or overwrite sync mode.
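
If you are unsure whether a source meets these assumptions, a quick check over a sample of its output can help. The sketch below flags records that arrive after a newer record with the same primary key. The function name and the record layout (dictionaries with a cursor field) are assumptions made for the example.

```python
def find_out_of_order(records, primary_key, cursor):
    """Return records emitted after a newer record with the same primary key."""
    latest_cursor = {}  # primary key -> highest cursor value seen so far
    violations = []
    for record in records:
        key = tuple(record[column] for column in primary_key)
        if key in latest_cursor and record[cursor] < latest_cursor[key]:
            # An older version arrived after a newer one: with dedup,
            # this older record would end up winning.
            violations.append(record)
        else:
            latest_cursor[key] = record[cursor]
    return violations


emitted = [
    {"id": 1, "updated_at": 1, "value": "first"},
    {"id": 1, "updated_at": 3, "value": "third"},
    {"id": 1, "updated_at": 2, "value": "second"},  # out of cursor order
    {"id": 2, "updated_at": 1, "value": "other"},
]
print(find_out_of_order(emitted, primary_key=["id"], cursor="updated_at"))
```

A non-empty result suggests the source can violate the second assumption, in which case the append or overwrite sync mode is the safer choice.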

Branching

Iceberg supports Git-like semantics over your data. Most query engines target the main branch.

This connector leverages those semantics to provide resilient syncs:

Within each sync, each microbatch creates a new snapshot.
During truncate syncs, the connector writes the refreshed data to the airbyte_staging branch, and fast-forwards the main branch at the end of the sync. This means that your data remains queryable right up to the end of a truncate sync, at which point it is atomically swapped to the updated version.
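
The staging-branch flow can be pictured with a toy model in which a branch is just a named pointer to a snapshot. The class BranchedTable below is hypothetical and greatly simplified: real Iceberg snapshots track files rather than rows, and a real fast-forward has ancestry requirements that are skipped here.

```python
class BranchedTable:
    """Toy model: each branch name points at one immutable snapshot."""

    def __init__(self):
        self.snapshots = {}  # snapshot id -> rows visible in that snapshot
        self.branches = {}   # branch name -> snapshot id
        self._next_id = 1

    def commit(self, branch, rows, base_rows):
        """Create a new snapshot holding base_rows plus rows, and point branch at it."""
        snapshot_id = self._next_id
        self._next_id += 1
        self.snapshots[snapshot_id] = list(base_rows) + list(rows)
        self.branches[branch] = snapshot_id
        return snapshot_id

    def read(self, branch="main"):
        return self.snapshots[self.branches[branch]]

    def fast_forward(self, target, source):
        # Moving a pointer is a single step, so readers see old or new data, never a mix.
        self.branches[target] = self.branches[source]


table = BranchedTable()
table.commit("main", [{"id": 1, "value": "old"}], base_rows=[])

# Truncate sync: every microbatch becomes a new snapshot on the staging branch.
staged = []
for batch in [[{"id": 1, "value": "new"}], [{"id": 2, "value": "new"}]]:
    table.commit("airbyte_staging", batch, base_rows=staged)
    staged = table.read("airbyte_staging")
    assert table.read("main") == [{"id": 1, "value": "old"}]  # main is untouched

table.fast_forward("main", "airbyte_staging")
print(table.read("main"))
```

Until the final fast_forward call, reads of main keep returning the old row, which mirrors the queryability guarantee described above.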

Reference

Config fields reference

Type	Property name
string	s3_bucket_name
string	s3_bucket_region
string	warehouse_location
string	main_branch_name
object	catalog_type
string	access_key_id
string	secret_access_key
string	s3_endpoint
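
As a rough illustration of how these properties fit together, the sketch below checks a configuration dictionary against the types in this reference and against the warehouse location format given in the setup guide. The validate_config helper and all example values are hypothetical. The contents of the catalog_type object are not listed on this page, so it is left empty here.

```python
# Property names and types from the reference above.
EXPECTED_TYPES = {
    "s3_bucket_name": str,
    "s3_bucket_region": str,
    "warehouse_location": str,
    "main_branch_name": str,
    "catalog_type": dict,
    "access_key_id": str,
    "secret_access_key": str,
    "s3_endpoint": str,
}


def validate_config(config: dict) -> list:
    """Return a list of human-readable problems found in the config."""
    errors = []
    for name, value in config.items():
        expected = EXPECTED_TYPES.get(name)
        if expected is None:
            errors.append(f"unknown property: {name}")
        elif not isinstance(value, expected):
            errors.append(f"{name} should be of type {expected.__name__}")
    location = config.get("warehouse_location")
    if isinstance(location, str) and not location.startswith("s3://"):
        errors.append("warehouse_location should look like s3://<bucket name>/path/within/bucket")
    return errors


# Placeholder values for illustration only.
example_config = {
    "s3_bucket_name": "my-data-lake-bucket",
    "s3_bucket_region": "us-east-1",
    "warehouse_location": "s3://my-data-lake-bucket/path/within/bucket",
    "main_branch_name": "main",
    "catalog_type": {},  # catalog-specific settings; keys are not listed on this page
}
print(validate_config(example_config))
```

The check does not decide which properties are required, since this reference does not say. It only catches wrong types, unknown names, and a malformed warehouse location.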

Changelog


Version	Date	Pull Request	Subject
0.3.12	2025-02-12	#53170	Improve documentation, tweak error handling of invalid schema evolution
0.3.11	2025-02-12	#53216	Support arbitrary schema change in overwrite / truncate refresh / clear sync
0.3.10	2025-02-11	#53622	Enable the Nessie integration tests
0.3.9	2025-02-10	#53165	Very basic usability improvements and documentation
0.3.8	2025-02-10	#52666	Change the chunk size to 1.5Gb
0.3.7	2025-02-07	#53141	Adding integration tests around the Rest catalog
0.3.6	2025-02-06	#53172	Internal refactor
0.3.5	2025-02-06	#53164	Improve error message on null primary key in dedup mode
0.3.4	2025-02-05	#53173	Tweak spec wording
0.3.3	2025-02-05	#53176	Fix time_with_timezone handling (values are now adjusted to UTC)
0.3.2	2025-02-04	#52690	Handle special characters in stream name/namespace when using AWS Glue
0.3.1	2025-02-03	#52633	Fix dedup
0.3.0	2025-01-31	#52639	Make the database/namespace a required field
0.2.23	2025-01-27	#51600	Internal refactor
0.2.22	2025-01-22	#52081	Implement support for REST catalog
0.2.21	2025-01-27	#52564	Fix crash on stream with 0 records
0.2.20	2025-01-23	#52068	Add support for default namespace (/database name)
0.2.19	2025-01-16	#51595	Clarifications in connector config options
0.2.18	2025-01-15	#51042	Write structs as JSON strings instead of Iceberg structs.
0.2.17	2025-01-14	#51542	New identifier fields should be marked as required.
0.2.16	2025-01-14	#51538	Update identifier fields if incoming fields are different than existing ones
0.2.15	2025-01-14	#51530	Set AWS region for S3 bucket for nessie catalog
0.2.14	2025-01-14	#50413	Update existing table schema based on the incoming schema
0.2.13	2025-01-14	#50412	Implement logic to determine super types between iceberg types
0.2.12	2025-01-10	#50876	Add support for AWS instance profile auth
0.2.11	2025-01-10	#50971	Internal refactor in AWS auth flow
0.2.10	2025-01-09	#50400	Add S3DataLakeTypesComparator
0.2.9	2025-01-09	#51022	Rename all classes and files from Iceberg V2
0.2.8	2025-01-09	#51012	Rename/Cleanup package from Iceberg V2
0.2.7	2025-01-09	#50957	Add support for GLUE RBAC (Assume role)
0.2.6	2025-01-08	#50991	Initial public release.
