ClickHouse adds S3 table function for data lake queries

ClickHouse's s3() table function lets users run SQL on Amazon S3 files, reducing ETL steps and storage duplication for analytics workloads [DevTo].

Sources: [DevTo]

ClickHouse's native S3 support, exposed through the s3() table function, lets users run SQL directly on CSV, Parquet, JSON, and ORC files stored in S3 [DevTo]. The engine treats an S3 URL as a virtual table and accepts wildcard patterns that span thousands of partitions. For example, SELECT * FROM s3('https://my-bucket.s3.amazonaws.com/sales.csv', 'CSVWithNames') LIMIT 10; returns the first ten rows without loading any data into ClickHouse. On Parquet files, ClickHouse reads only the columns a query references, which reduces I/O and query latency [DevTo]. Users can also materialize data into a MergeTree table with a single CREATE … AS SELECT statement, gaining the full performance of local storage while keeping the source schema.

Querying files in place avoids keeping a second copy of the data in ClickHouse, so storage spend falls by the size of the copy that is no longer needed. By removing a dedicated ETL step, analysts can query logs, IoT streams, or archived business records directly from the data lake, accelerating time-to-insight [DevTo]. Frequently accessed datasets, however, still benefit from being loaded into a MergeTree table, which lowers S3 access costs and speeds up queries on specific columns. Engineers therefore have to balance query latency against storage overhead.
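The column-pruned Parquet read and the materialization step described above could look roughly like the sketch below. It is an illustration, not a tested recipe. The sales/ prefix, the region and amount columns, and the sales_local table name are hypothetical. Access credentials are omitted, as in the CSV example, so a private bucket would need them supplied.

```sql
-- Hypothetical layout: Parquet files under sales/ in the example bucket.
-- Only region and amount are referenced, so only those columns need to be read.
-- The * wildcard matches every Parquet file under the prefix.
SELECT
    region,
    sum(amount) AS total_amount
FROM s3('https://my-bucket.s3.amazonaws.com/sales/*.parquet', 'Parquet')
GROUP BY region
ORDER BY total_amount DESC
LIMIT 10;

-- Materialize the same files into local storage in one statement.
-- ORDER BY tuple() means no sorting key; it is used here because
-- columns inferred from Parquet are typically Nullable, and Nullable
-- columns are not accepted in a sorting key by default.
CREATE TABLE sales_local
ENGINE = MergeTree
ORDER BY tuple()
AS SELECT *
FROM s3('https://my-bucket.s3.amazonaws.com/sales/*.parquet', 'Parquet');
```

In practice a real sorting key chosen for the dominant query filters would probably serve a frequently accessed table better than tuple(); the empty key only keeps the sketch independent of the actual schema.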
