
S3 Partitioning

Bacalhau's S3 partitioning feature builds on the core partitioning system to automatically distribute data from S3 buckets across multiple job executions. This S3-specific implementation adds graceful failure handling and independent retries of failed partitions, optimized for S3 data sources.

Key Benefits

  • Automatic Data Distribution: Intelligently distributes S3 objects across partitions
  • Multiple Partitioning Strategies: Choose from various strategies based on your data organization
  • Clean Processing Logic: Write code focused on processing, not partitioning
  • Failure Isolation: Failures are contained to individual partitions
  • Independent Retries: Failed partitions are retried automatically without affecting successful ones

Partitioning Strategies

Bacalhau supports multiple S3 partitioning strategies to match different data organization patterns:

No Partitioning (Shared Data)

When all executions need access to all the data, omit the partition configuration:

inputSources:
  - target: /data
    source:
      type: s3
      params:
        bucket: config-bucket
        key: reference-data/
        # No partition config - all executions see all files

Perfect for:

  • Loading shared reference data
  • Processing configuration files
  • Running analysis that needs the complete dataset
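
Since no partition configuration is given, every execution mounts the complete dataset at the target path. As a minimal sketch (assuming the /data target above and a count greater than one), every execution would report the same file total:

# Runs identically in every execution: without a partition config,
# each execution sees the full contents of /data.
echo "This execution sees $(find /data -type f | wc -l) files"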

Object-Based Distribution

Evenly distributes objects across partitions without specific grouping logic:

inputSources:
  - target: /uploads
    source:
      type: s3
      params:
        bucket: data-bucket
        key: user-uploads/
        partition:
          type: object

Ideal for:

  • Processing large volumes of user uploads
  • Handling randomly named files
  • Large-scale data transformation tasks
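
Each execution only sees the objects assigned to its partition under the mount target, so the task script can simply process whatever it finds there. Below is a minimal sketch, assuming the /uploads target from the snippet above and the BACALHAU_PARTITION_INDEX and BACALHAU_PARTITION_COUNT environment variables shown in the complete example later on; the per-file work is a placeholder:

#!/bin/bash
echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
find /uploads -type f | while read -r file; do
  # Placeholder per-file work; replace with real transformation logic
  echo "processing $file ($(wc -c < "$file") bytes)"
done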

Date-Based Partitioning

Process each day's data in parallel using a configurable date format (Go reference-time layout, so '2006-01-02' means YYYY-MM-DD):

inputSources:
  - target: /logs
    source:
      type: s3
      params:
        bucket: app-logs
        key: 'logs/*'
        partition:
          type: date
          dateFormat: '2006-01-02'

Perfect for:

  • Daily analytics processing
  • Log aggregation and analysis
  • Time-series computations
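
Objects whose keys yield the same date string (in the configured format) are grouped into the same partition, so each execution typically receives whole days of logs. The sketch below is illustrative only and assumes the date appears somewhere in each object key, e.g. logs/2024-03-01/app.log:

#!/bin/bash
# List the distinct dates present in this partition, assuming a
# YYYY-MM-DD date appears somewhere in each key/path under /logs.
echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
find /logs -type f | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' | sort -u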

Regex-Based Partitioning

Distribute data based on patterns in object keys:

inputSources:
  - target: /sales
    source:
      type: s3
      params:
        bucket: global-sales
        key: 'regions/*'
        partition:
          type: regex
          pattern: '([^/]+)/.*'

Enables scenarios like:

  • Regional sales analysis
  • Geographic data processing
  • Territory-specific reporting
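
The first capture group of the pattern — here, presumably the region directory beneath the regions/ prefix — serves as the partition key, so objects for the same region land in the same partition. A task-side sketch, assuming one directory per region appears under the /sales mount:

#!/bin/bash
# Report which regions this partition received (the directory layout is an assumption)
echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
find /sales -mindepth 1 -maxdepth 1 -type d | sort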

Substring-Based Partitioning

Distributes data based on a fixed character range (substring) of the object keys:

inputSources:
  - target: /segments
    source:
      type: s3
      params:
        bucket: customer-data
        key: segments/*
        partition:
          type: substring
          startIndex: 0
          endIndex: 3

Perfect for:

  • Customer cohort analysis
  • Segment-specific processing
  • Category-based computations
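
With startIndex 0 and endIndex 3, a three-character slice of each object key (taken relative to the configured prefix, as assumed here) becomes the partition key, so keys sharing that slice are processed together. A small illustrative script, assuming the same three-character prefix also appears at the start of the mounted file names:

#!/bin/bash
# Count files per leading three-character prefix in this partition
echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
find /segments -type f -exec basename {} \; | cut -c1-3 | sort | uniq -c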

Combining Partitioned and Shared Data

You can combine partitioned data with shared reference data in the same job:

inputSources:
  - target: /config
    source:
      type: s3
      params:
        bucket: config-bucket
        key: reference/*
        # No partitioning - all executions see all reference data
  - target: /daily-logs
    source:
      type: s3
      params:
        bucket: app-logs
        key: logs/*
        partition:
          type: date
          dateFormat: '2006-01-02'

This pattern supports:

  • Processing daily logs with shared lookup tables
  • Analyzing data using common reference files
  • Running calculations that need both partitioned data and shared configuration
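
As a more concrete (and purely hypothetical) sketch of this pattern, a task could filter its partition's logs against a shared pattern file; the file name /config/error-patterns.txt below is an assumption for illustration:

#!/bin/bash
# Count log lines in this partition that match any shared error pattern
total=0
while read -r logfile; do
  matches=$(grep -c -F -f /config/error-patterns.txt "$logfile")
  total=$((total + ${matches:-0}))
done < <(find /daily-logs -type f)
echo "Partition $BACALHAU_PARTITION_INDEX: $total matching lines"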

Complete Job Examples

Example 1: Object-Based Partitioning

Here's a complete job specification using object-based partitioning:

name: process-uploads
count: 5
type: batch
tasks:
  - name: process-uploads
    engine:
      type: docker
      params:
        image: ubuntu:latest
        parameters:
          - bash
          - -c
          - |
            echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
            file_count=$(find /uploads -type f | wc -l)
            echo "Found $file_count files to process in this partition"
    inputSources:
      - target: /uploads
        source:
          type: s3
          params:
            bucket: data-bucket
            key: user-uploads/
            partition:
              type: object

Example 2: Combining Partitioned and Shared Data

Here's a complete job specification that combines partitioned and shared data sources:

name: daily-analysis
count: 7 # Process a week of data
type: batch
tasks:
  - name: daily-analytics
    engine:
      type: docker
      params:
        image: ubuntu:latest
        parameters:
          - bash
          - -c
          - |
            echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
            echo "Reference data files:"
            find /config -type f | sort
            echo "Daily log files for this partition:"
            find /daily-logs -type f | wc -l
    inputSources:
      - target: /config
        source:
          type: s3
          params:
            bucket: config-bucket
            key: reference/*
            # No partitioning - all executions see all reference data
      - target: /daily-logs
        source:
          type: s3
          params:
            bucket: app-logs
            key: logs/*
            partition:
              type: date
              dateFormat: '2006-01-02'
    outputs:
      - name: results
        path: /outputs

Usage

To run a job with S3 partitioning, define your job with the appropriate partitioning strategy and set the number of partitions with the count parameter, then submit:

bacalhau job run job-spec.yaml
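
Each partition then runs as its own execution of the job, and failed partitions are retried without affecting those that succeeded. To check per-partition progress after submission (the job ID is a placeholder):

bacalhau job describe <job-id>   # inspect execution status; one execution per partition
bacalhau job logs <job-id>       # view task output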