S3 Partitioning
Bacalhau's S3 partitioning feature builds on the core partitioning system to automatically distribute data from S3 buckets across multiple job executions. The implementation is optimized for S3 data sources and handles failures gracefully, retrying failed partitions independently of the ones that succeeded.
Key Benefits
- Automatic Data Distribution: Intelligently distributes S3 objects across partitions
- Multiple Partitioning Strategies: Choose from various strategies based on your data organization
- Clean Processing Logic: Write code focused on processing, not partitioning
- Failure Isolation: Failures are contained to individual partitions
- Independent Retries: Failed partitions are retried automatically without affecting successful ones
Partitioning Strategies
Bacalhau supports multiple S3 partitioning strategies to match different data organization patterns:
No Partitioning (Shared Data)
When all executions need access to all the data, omit the partition configuration:
inputSources:
  - target: /data
    source:
      type: s3
      params:
        bucket: config-bucket
        key: reference-data/
        # No partition config - all executions see all files
Perfect for:
- Loading shared reference data
- Processing configuration files
- Running analysis that needs the complete dataset
Object-Based Distribution
Evenly distributes objects across partitions without specific grouping logic:
inputSources:
  - target: /uploads
    source:
      type: s3
      params:
        bucket: data-bucket
        key: user-uploads/
        partition:
          type: object
Ideal for:
- Processing large volumes of user uploads
- Handling randomly named files
- Large-scale data transformation tasks
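How objects are assigned to partitions is an implementation detail, but conceptually the strategy amounts to mapping each object key deterministically onto one of the count partitions, for example by hashing the key. The Go sketch below illustrates that idea only; the hash function, sample keys, and partition count are illustrative assumptions, not Bacalhau's actual code.

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // partitionFor assigns an object key to one of n partitions by hashing it.
    // Illustrative sketch of even, key-based distribution, not Bacalhau's algorithm.
    func partitionFor(key string, n uint32) uint32 {
        h := fnv.New32a()
        h.Write([]byte(key))
        return h.Sum32() % n
    }

    func main() {
        // Hypothetical object keys (illustration only).
        keys := []string{
            "user-uploads/a1b2c3.jpg",
            "user-uploads/d4e5f6.jpg",
            "user-uploads/g7h8i9.jpg",
        }
        for _, k := range keys {
            fmt.Printf("%s -> partition %d of 5\n", k, partitionFor(k, 5))
        }
    }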
Date-Based Partitioning
Process each day's data in parallel using a configurable date format:
inputSources:
  - target: /logs
    source:
      type: s3
      params:
        bucket: app-logs
        key: 'logs/*'
        partition:
          type: date
          dateFormat: '2006-01-02'
Perfect for:
- Daily analytics processing
- Log aggregation and analysis
- Time-series computations
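The dateFormat value uses Go's reference-time layout, where 2006-01-02 stands for year-month-day. Conceptually, objects that resolve to the same parsed date are grouped together and handled by a single partition. The sketch below only illustrates parsing and grouping with that layout; the sample dates are invented, and exactly how Bacalhau derives the date for each object is not shown here.

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        // Go's reference-time layout: 2006 = year, 01 = month, 02 = day.
        const layout = "2006-01-02"

        // Hypothetical dates derived from S3 objects (illustration only).
        dates := []string{"2024-03-01", "2024-03-01", "2024-03-02"}

        // Objects sharing a parsed date would be grouped together,
        // with each group handled by a single partition.
        groups := map[time.Time][]string{}
        for _, d := range dates {
            t, err := time.Parse(layout, d)
            if err != nil {
                fmt.Println("skipping unparsable date:", d)
                continue
            }
            groups[t] = append(groups[t], d)
        }
        for day, members := range groups {
            fmt.Printf("%s -> %d object(s)\n", day.Format(layout), len(members))
        }
    }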
Regex-Based Partitioning
Distribute data based on patterns in object keys:
inputSources:
  - target: /sales
    source:
      type: s3
      params:
        bucket: global-sales
        key: 'regions/*'
        partition:
          type: regex
          pattern: '([^/]+)/.*'
Enables scenarios like:
- Regional sales analysis
- Geographic data processing
- Territory-specific reporting
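Conceptually, the capture groups extracted from each object key determine its grouping: keys whose captured values are equal end up in the same partition. The sketch below shows what the example pattern captures for a few hypothetical keys, assuming the pattern is applied to the part of the key after the configured regions/ prefix (as the example implies); it is an illustration, not Bacalhau's actual matching code.

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // The same pattern as in the job spec: capture the first path segment.
        re := regexp.MustCompile(`([^/]+)/.*`)

        // Hypothetical object keys under regions/ (illustration only).
        keys := []string{
            "us-east/2024/orders.csv",
            "us-east/2024/returns.csv",
            "eu-west/2024/orders.csv",
        }

        for _, k := range keys {
            m := re.FindStringSubmatch(k)
            if m == nil {
                fmt.Printf("%s -> no match\n", k)
                continue
            }
            // Keys with the same captured value ("us-east", "eu-west", ...)
            // would be grouped into the same partition.
            fmt.Printf("%s -> group %q\n", k, m[1])
        }
    }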
Substring-Based Partitioning
Distributes data based on a substring (character range) of each object key:
inputSources:
  - target: /segments
    source:
      type: s3
      params:
        bucket: customer-data
        key: segments/*
        partition:
          type: substring
          startIndex: 0
          endIndex: 3
Perfect for:
- Customer cohort analysis
- Segment-specific processing
- Category-based computations
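Conceptually, the configured character range (here indexes 0 through 3) is cut out of each object key, and keys that share that substring are grouped into the same partition. The sketch below illustrates the idea with invented segment names; whether the range applies to the full key or to the portion after the segments/ prefix is an assumption made for illustration, not a statement about Bacalhau's implementation.

    package main

    import "fmt"

    func main() {
        // startIndex: 0, endIndex: 3 — take the first three characters of the key.
        const start, end = 0, 3

        // Hypothetical segment files (illustration only).
        keys := []string{
            "A01-premium-customers.csv",
            "A01-premium-orders.csv",
            "B17-trial-customers.csv",
        }

        for _, k := range keys {
            if len(k) < end {
                fmt.Printf("%s -> key shorter than range, skipped\n", k)
                continue
            }
            // Keys sharing the same substring ("A01", "B17", ...) would be
            // grouped into the same partition.
            fmt.Printf("%s -> group %q\n", k, k[start:end])
        }
    }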
Combining Partitioned and Shared Data
You can combine partitioned data with shared reference data in the same job:
inputSources:
  - target: /config
    source:
      type: s3
      params:
        bucket: config-bucket
        key: reference/*
        # No partitioning - all executions see all reference data
  - target: /daily-logs
    source:
      type: s3
      params:
        bucket: app-logs
        key: logs/*
        partition:
          type: date
          dateFormat: '2006-01-02'
This pattern supports:
- Processing daily logs with shared lookup tables
- Analyzing data using common reference files
- Running calculations that need both partitioned data and shared configuration
Complete Job Examples
Example 1: Object-Based Partitioning
Here's a complete job specification using object-based partitioning:
name: process-uploads
count: 5
type: batch
tasks:
  - name: process-uploads
    engine:
      type: docker
      params:
        image: ubuntu:latest
        parameters:
          - bash
          - -c
          - |
            echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
            file_count=$(find /uploads -type f | wc -l)
            echo "Found $file_count files to process in this partition"
    inputSources:
      - target: /uploads
        source:
          type: s3
          params:
            bucket: data-bucket
            key: user-uploads/
            partition:
              type: object
Example 2: Combining Partitioned and Shared Data
Here's a complete job specification that combines partitioned and shared data sources:
name: daily-analysis
count: 7 # Process a week of data
type: batch
tasks:
  - name: daily-analytics
    engine:
      type: docker
      params:
        image: ubuntu:latest
        parameters:
          - bash
          - -c
          - |
            echo "Processing partition $BACALHAU_PARTITION_INDEX of $BACALHAU_PARTITION_COUNT"
            echo "Reference data files:"
            find /config -type f | sort
            echo "Daily log files for this partition:"
            find /daily-logs -type f | wc -l
    inputSources:
      - target: /config
        source:
          type: s3
          params:
            bucket: config-bucket
            key: reference/*
            # No partitioning - all executions see all reference data
      - target: /daily-logs
        source:
          type: s3
          params:
            bucket: app-logs
            key: logs/*
            partition:
              type: date
              dateFormat: '2006-01-02'
    outputs:
      - name: results
        path: /outputs
Usage
To run a job with S3 partitioning, define your job with the appropriate partitioning strategy, set the number of partitions with the count parameter, and then submit:
bacalhau job run job-spec.yaml