Apache Kafka + AWS S3

Stream Kafka Events Directly into AWS S3 for Scalable Data Archiving

Automate the flow of real-time Kafka messages into durable S3 storage without writing a single line of custom infrastructure code.

Why integrate Apache Kafka and AWS S3?

Apache Kafka and AWS S3 are two of the most widely adopted tools in modern data infrastructure. Kafka handles high-throughput real-time event streaming; S3 handles scalable, cost-effective object storage. Together they show up at the core of data lake, analytics, and compliance architectures everywhere. Connecting Kafka to S3 lets engineering and data teams continuously capture, archive, and replay event streams with full durability and minimal operational overhead.

Automate & integrate Apache Kafka & AWS S3

Use case

Real-Time Event Archiving to Data Lake

Continuously consume messages from Kafka topics and write them as batched files into designated S3 prefixes to build a scalable, queryable data lake. Each batch can be partitioned by date, topic, or event type for efficient downstream querying. This pattern eliminates data loss risk while giving analysts direct access to historical event streams via Athena or Spark.

Use case

Compliance and Audit Log Storage

Route Kafka topics containing user activity logs, access events, or transaction records into immutable S3 buckets to satisfy compliance requirements like SOC 2, GDPR, and HIPAA. Events are written in tamper-evident formats with strict bucket policies and lifecycle rules, so audit trails stay durable, accessible, and cost-optimized over time.

Use case

ML Training Data Pipeline

Capture raw Kafka event streams — clickstream data, sensor readings, recommendation signals — and land them in S3 as structured files ready for ML model training jobs. Tray.ai can apply lightweight transformations before writing, so data arrives clean and consistently formatted. Data science teams get a reliable, versioned dataset without needing direct Kafka access.

Use case

Multi-Region Event Replication

Consume events from a primary Kafka cluster and write them to S3 buckets in multiple AWS regions to support geo-redundancy and disaster recovery. Each region's S3 bucket acts as a durable checkpoint that can seed secondary Kafka clusters or power regional analytics workloads. This decouples disaster recovery logic from Kafka's native replication mechanisms.

Use case

Dead Letter Queue Capture and Analysis

Automatically route Kafka dead letter queue (DLQ) messages — events that failed processing — into a dedicated S3 bucket for investigation and reprocessing. Engineers can inspect failed payloads, identify schema mismatches or upstream errors, and replay corrected events back into Kafka. Nothing gets lost, and the whole error handling workflow stays in one place.

Use case

Change Data Capture (CDC) Event Storage

Stream database change events published to Kafka — via tools like Debezium — directly into S3 to create a full changelog of database mutations over time. These stored CDC events can power data reconciliation, historical backfills, and audit workflows without querying the source database. The S3-based changelog also works as a reliable source of truth for rebuilding downstream system state.

Use case

IoT and Sensor Data Aggregation

Ingest high-volume IoT device telemetry published to Kafka topics and consolidate it into time-partitioned S3 files for analytics, anomaly detection, and long-term trend analysis. Tray.ai handles batching and format conversion — such as JSON to Parquet — before writing to S3, cutting storage costs and improving query efficiency. Operations and data engineering teams get a unified sensor data repository without managing custom ETL pipelines.

Get started with Apache Kafka & AWS S3 integration today

Apache Kafka & AWS S3 Challenges

What challenges come up when working with Apache Kafka & AWS S3, and how does Tray.ai help?

Challenge

Managing Kafka Consumer Offsets Reliably

Custom Kafka-to-S3 pipelines must carefully manage consumer group offsets to avoid duplicate writes or missed messages, especially after failures or restarts. This requires persistent offset storage, careful commit timing relative to S3 write success, and idempotent write logic — all of which add significant engineering complexity.
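To make the ordering constraint concrete, here is a minimal sketch of the commit-after-write pattern a DIY pipeline must get right. The `write_to_s3` and `commit_offsets` callbacks are hypothetical stand-ins for a real S3 client and Kafka consumer:

```python
def make_batch_key(topic: str, partition: int, first_offset: int) -> str:
    # Idempotent key: retrying the same batch overwrites the same S3 object
    # instead of creating a duplicate.
    return f"{topic}/partition={partition}/offset={first_offset}.json"

def archive_batch(messages, write_to_s3, commit_offsets):
    """Write a batch to S3, then commit offsets -- never the other way around.

    If the S3 write raises, offsets stay uncommitted and the batch is
    re-delivered on restart; the idempotent key means the retry overwrites
    rather than duplicates.
    """
    if not messages:
        return None
    first = messages[0]
    key = make_batch_key(first["topic"], first["partition"], first["offset"])
    write_to_s3(key, messages)                   # may raise; nothing committed yet
    commit_offsets(messages[-1]["offset"] + 1)   # commit only after S3 success
    return key
```

Reversing the two calls at the end is exactly the bug that produces silent data loss: a commit followed by a failed write skips those offsets forever.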

How Tray.ai Can Help:

Tray.ai manages consumer group offset coordination internally, committing offsets only after successful S3 writes are confirmed. Built-in retry logic and idempotent file naming mean pipeline restarts won't produce duplicate or missing data in S3, which removes the most error-prone part of DIY Kafka consumer development.

Challenge

Handling Schema Evolution Across Kafka Topics

Kafka topics frequently evolve their message schemas over time — adding fields, changing types, or restructuring payloads. Writing these heterogeneous messages to S3 without a schema management strategy results in unreadable files, broken downstream queries, and silent data quality issues.

How Tray.ai Can Help:

Tray.ai's data mapping layer lets teams define flexible, version-aware transformation logic that normalizes incoming Kafka messages before writing to S3. When schemas change, you update the mapping in the Tray.ai workflow editor rather than touching infrastructure. S3 files stay consistently structured and query-ready.
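The idea behind version-aware mapping can be sketched as a normalizer per schema version, all producing one output shape. The field names and version tags here are illustrative, not a real topic's schema:

```python
# Hypothetical per-version normalizers: v1 used short field names, v2 renamed
# them and added an optional "value" field. Both map to the same output shape.
NORMALIZERS = {
    1: lambda m: {"user_id": m["uid"], "event": m["event"], "value": None},
    2: lambda m: {"user_id": m["user_id"], "event": m["event_type"],
                  "value": m.get("value")},
}

def normalize(message: dict) -> dict:
    """Route a Kafka message to the normalizer for its schema version."""
    version = message.get("schema_version", 1)
    try:
        mapper = NORMALIZERS[version]
    except KeyError:
        raise ValueError(f"no mapping for schema version {version}")
    return mapper(message)
```

Failing loudly on an unknown version is deliberate: writing unmapped payloads to S3 is how silent data quality issues start.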

Challenge

Optimizing S3 Write Performance and Cost

Writing each Kafka message as an individual S3 object leads to enormous numbers of tiny files — a well-known anti-pattern that degrades Athena and Spark query performance and inflates S3 request costs. Proper batching, file sizing, and compression all require non-trivial buffering logic in custom pipelines.

How Tray.ai Can Help:

Tray.ai gives you configurable batching parameters — message count thresholds, time windows, and maximum file size limits — so S3 objects get written at optimal sizes. Native compression support (gzip and others) cuts storage costs, and consistent file sizing keeps downstream analytics queries fast and affordable.
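The three flush thresholds interact like the sketch below, a simplified buffer where gzip-compressed bytes stand in for an S3 object upload. The default limits are illustrative, not Tray.ai's actual defaults:

```python
import gzip
import json
import time

class BatchBuffer:
    """Flush when any threshold is hit: message count, byte size, or age."""

    def __init__(self, max_messages=1000, max_bytes=64_000_000, max_age_s=300):
        self.max_messages = max_messages
        self.max_bytes = max_bytes
        self.max_age_s = max_age_s
        self._items, self._size, self._started = [], 0, None

    def add(self, message: dict):
        """Buffer one message; return compressed batch bytes on flush, else None."""
        if self._started is None:
            self._started = time.monotonic()
        encoded = json.dumps(message).encode()
        self._items.append(encoded)
        self._size += len(encoded) + 1  # +1 for the newline separator
        if (len(self._items) >= self.max_messages
                or self._size >= self.max_bytes
                or time.monotonic() - self._started >= self.max_age_s):
            return self.flush()
        return None

    def flush(self):
        """Compress buffered messages as newline-delimited JSON and reset."""
        if not self._items:
            return None
        body = gzip.compress(b"\n".join(self._items))
        self._items, self._size, self._started = [], 0, None
        return body
```

Flushing on whichever limit trips first is what keeps object sizes in the sweet spot: large enough to avoid the tiny-file anti-pattern, small enough that data lands in S3 promptly.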

Challenge

Securing Data in Transit Between Kafka and S3

Moving sensitive event data from Kafka brokers to S3 across network boundaries introduces real security risks: unencrypted transmission, overly permissive IAM policies, and credential management headaches that can create compliance exposure.

How Tray.ai Can Help:

Tray.ai enforces TLS encryption for all Kafka broker connections and uses AWS IAM role-based authentication for S3 access, so you don't need long-lived credentials embedded in pipeline code. Bucket-level and prefix-level access controls can be precisely scoped within the Tray.ai connector configuration, keeping data access least-privilege and audit-ready.

Challenge

Monitoring Pipeline Health and Alerting on Lag

Without dedicated observability, Kafka-to-S3 pipelines can silently fall behind — accumulating consumer lag — or fail mid-batch, leaving S3 datasets stale or incomplete. Catching these issues before they affect downstream analytics or compliance reporting is something custom scripts rarely handle well.
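Consumer lag itself is simple arithmetic: the distance between each partition's log end offset and the consumer's committed offset. A sketch, assuming hypothetical offset snapshots keyed by (topic, partition) as a real Kafka admin API would provide:

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: how far the consumer trails the log end offset.

    Partitions with no committed offset are treated as lagging from offset 0.
    """
    return {tp: max(end_offsets[tp] - committed.get(tp, 0), 0)
            for tp in end_offsets}

def lag_alerts(lag: dict, threshold: int) -> list:
    """Partitions whose lag exceeds the alert threshold, sorted for stable output."""
    return sorted(tp for tp, n in lag.items() if n > threshold)
```

The hard part is not this calculation but running it continuously, correlating it with failed workflow steps, and routing the alert, which is what the built-in monitoring handles.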

How Tray.ai Can Help:

Tray.ai includes built-in workflow execution logs, error notifications, and configurable alerting so your team knows immediately when a Kafka consumer falls behind, an S3 write fails, or a pipeline step times out. Per-step diagnostic data makes it straightforward to find bottlenecks and restore healthy data flow without sifting through raw broker logs.

Start using our pre-built Apache Kafka & AWS S3 templates today

Start from scratch or use one of our pre-built Apache Kafka & AWS S3 templates to quickly solve your most common use cases.

Apache Kafka & AWS S3 Templates

Find pre-built Apache Kafka & AWS S3 solutions for common use cases

Browse all templates

Template

Kafka Topic to S3 Batch File Writer

Polls a specified Kafka topic on a configurable interval, accumulates messages into a batch, serializes them as JSON or CSV, and writes the output file to a target S3 bucket with a timestamped, partitioned key.

Steps:

  • Trigger on schedule or message threshold; connect to configured Kafka topic and consumer group
  • Accumulate and serialize messages into a single JSON or CSV payload with configurable batch size
  • Write serialized file to S3 using a partitioned key pattern such as /topic/year/month/day/batch_id

Connectors Used: Kafka, AWS S3

Template

Kafka DLQ to S3 Dead Letter Archive

Monitors a Kafka dead letter queue topic, captures all failed messages with their original metadata and error context, and writes them to a dedicated S3 bucket path for engineer review and replay.

Steps:

  • Subscribe to the designated Kafka DLQ topic and consume new failure events in real time
  • Enrich each message with failure timestamp, topic of origin, consumer group, and error reason
  • Write enriched failure records to S3 under a structured path for triage and downstream reprocessing

Connectors Used: Kafka, AWS S3

Template

S3 File Event Trigger to Kafka Producer

Watches for new file uploads in a specified S3 bucket or prefix, reads and parses the file content, and publishes individual records as messages to a target Kafka topic for downstream stream processing.

Steps:

  • Trigger on S3 object creation event via S3 event notification or scheduled S3 prefix poll
  • Download and parse the new file content — supporting JSON, CSV, or newline-delimited formats
  • Publish each parsed record as an individual Kafka message to the configured target topic
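The parse step for the three supported formats can be sketched with the standard library alone; the downloaded S3 body arrives here as a string:

```python
import csv
import io
import json

def parse_records(body: str, fmt: str) -> list:
    """Split a downloaded S3 file into individual records, one per future
    Kafka message. Supports a JSON array, newline-delimited JSON, and CSV
    with a header row."""
    if fmt == "json":
        data = json.loads(body)
        return data if isinstance(data, list) else [data]
    if fmt == "ndjson":
        return [json.loads(line) for line in body.splitlines() if line.strip()]
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(body)))
    raise ValueError(f"unsupported format: {fmt}")
```

Each returned record would then be serialized and published as its own Kafka message, preserving one-record-per-message granularity for downstream consumers.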

Connectors Used: AWS S3, Kafka

Template

Kafka CDC Event Log to S3 Changelog Store

Consumes change data capture events from a Kafka topic fed by Debezium or a similar CDC tool, transforms the event envelope into a normalized schema, and appends records to a partitioned S3 changelog for historical audit and replay.

Steps:

  • Connect to the Kafka CDC topic and consume new insert, update, and delete event records
  • Normalize each CDC event envelope — extracting before/after state, operation type, and source metadata
  • Append normalized records to a date-partitioned S3 prefix as a durable, replayable changelog
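The normalization step might flatten a Debezium-style envelope like this. The input field names (`op`, `before`, `after`, `source`, `ts_ms`) follow Debezium's conventions, but treat the exact layout as an assumption since CDC tools vary:

```python
def normalize_cdc(envelope: dict) -> dict:
    """Flatten a Debezium-style CDC envelope into a changelog record with
    before/after state, a readable operation type, and source metadata."""
    ops = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot"}
    return {
        "operation": ops.get(envelope["op"], envelope["op"]),
        "before": envelope.get("before"),   # None for inserts
        "after": envelope.get("after"),     # None for deletes
        "table": envelope.get("source", {}).get("table"),
        "ts_ms": envelope.get("ts_ms"),
    }
```

Storing both `before` and `after` is what lets the S3 changelog support reconciliation: any row's state at any point in time can be reconstructed by replaying its operations in order.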

Connectors Used: Kafka, AWS S3

Template

Kafka IoT Telemetry to Parquet on S3

Reads high-frequency IoT sensor messages from a Kafka topic, buffers them into configurable time windows, converts the batch to Parquet format, and uploads the file to a time-partitioned S3 location for analytics-ready storage.

Steps:

  • Consume IoT telemetry messages from the designated Kafka topic using a managed consumer group
  • Buffer messages over a configurable time window and serialize the batch into columnar Parquet format
  • Upload the Parquet file to S3 under a time-partitioned prefix compatible with Athena and Redshift Spectrum
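The time-window buffering in the second step can be sketched as bucketing by window start. Parquet serialization itself needs a columnar library such as pyarrow, so it is left out of this stdlib-only sketch:

```python
from collections import defaultdict

def window_start(ts: float, window_s: int) -> int:
    """Floor a Unix timestamp to the start of its fixed-size window."""
    return int(ts // window_s) * window_s

def bucket_by_window(messages, window_s=300):
    """Group telemetry messages into fixed windows (5 minutes by default),
    keyed by window start. Each bucket would then be serialized as one
    Parquet file and uploaded under a time-partitioned S3 prefix."""
    buckets = defaultdict(list)
    for m in messages:
        buckets[window_start(m["ts"], window_s)].append(m)
    return dict(buckets)
```

Fixed window boundaries mean every pipeline run puts the same message in the same bucket, so reprocessing a window overwrites one file instead of scattering duplicates.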

Connectors Used: Kafka, AWS S3

Template

Multi-Topic Kafka Fan-Out to S3 Buckets

Monitors multiple Kafka topics simultaneously and routes messages from each topic to a corresponding dedicated S3 bucket or prefix, enabling clean separation of data domains within a shared data lake.

Steps:

  • Define a topic-to-bucket routing configuration mapping each Kafka topic to its target S3 destination
  • Consume messages across all configured Kafka topics using parallel consumer group instances
  • Route and write each message batch to the correct S3 bucket and prefix according to the routing map
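The routing configuration in the first step amounts to a lookup table from topic to S3 destination. The topic and bucket names below are hypothetical:

```python
# Hypothetical topic-to-destination map: each Kafka topic gets its own
# S3 bucket and prefix, keeping data domains cleanly separated.
ROUTING = {
    "orders": ("datalake-orders", "orders/raw/"),
    "clicks": ("datalake-clicks", "clicks/raw/"),
}

def route(topic: str) -> tuple:
    """Resolve a Kafka topic to its (bucket, prefix) S3 destination."""
    try:
        return ROUTING[topic]
    except KeyError:
        # Fail loudly rather than silently dropping an unmapped data domain.
        raise ValueError(f"no S3 destination configured for topic {topic!r}")
```

Raising on an unmapped topic is the safer default for a fan-out: a new topic added to the consumer without a routing entry surfaces immediately instead of vanishing.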

Connectors Used: Kafka, AWS S3