Timeouts
This guide explains how Bacalhau uses timeouts to manage job execution duration, protecting both users and compute resources from problematic jobs.
Understanding Timeouts
Timeouts in Bacalhau set limits on how long a job can run before being automatically terminated. They serve several important purposes:
- Resource Protection: Prevent runaway jobs from consuming resources indefinitely
- Fair Scheduling: Ensure compute nodes remain available for other jobs
- Error Recovery: Help identify and abort jobs that are stuck in infinite loops
- QoS Management: Allow operators to define service levels for different job types
Timeout Types
Bacalhau supports several types of timeouts:
Timeout Type | Description |
---|---|
Execution Timeout | Maximum time a job can spend in execution |
Queue Timeout | Maximum time a job can wait for resources |
Total Timeout | Overall maximum lifespan of a job including queue time |
For Job Submitters: Setting Timeouts
When submitting jobs, you can specify timeouts to control job duration.
Command Line Specification
# Set execution timeout to 30 minutes (1800 seconds)
bacalhau docker run --timeout 1800 ubuntu:latest -- <long-running-command>
# Set queue timeout (how long to wait for resources)
bacalhau docker run --queue-timeout 600 ubuntu:latest -- <command>
YAML Job Specification
# timeout-job.yaml
Type: batch
Count: 1
Tasks:
- Name: main
Engine:
Type: docker
Params:
Image: ubuntu:latest
Parameters:
- sleep
- '300'
Timeouts:
ExecutionTimeout: 1800 # 30 minutes in seconds
QueueTimeout: 600 # 10 minutes in seconds
Submit using:
bacalhau job run timeout-job.yaml
Timeout Behavior
Different timeout types affect job execution in different ways:
-
Execution Timeout: When reached, the current execution is terminated. If the Total Timeout hasn't been reached, the job may be rescheduled on another available node.
-
Total Timeout: When reached, the job is completely terminated across all nodes, no results are published, and the job status is marked as
Failed
with a timeout message. -
Queue Timeout: When reached, a job waiting for resources will fail rather than continue waiting indefinitely.
This approach provides resilience for intermittent failures while still protecting the system from problematic jobs.
For Node Operators: Configuring Timeout Limits
Compute node operators can configure default and maximum timeout values for their nodes.
Node Timeout Configuration
Node operators can set default and maximum timeouts in the config.yaml
file:
# config.yaml
JobDefaults:
Batch:
Task:
Timeouts:
# Default timeout for batch jobs (1 hour)
ExecutionTimeout: '1h'
# Maximum allowed timeout (4 hours)
TotalTimeout: '4h'
Ops:
Task:
Timeouts:
# Default timeout for ops jobs (30 minutes)
ExecutionTimeout: '30m'
# Maximum allowed timeout (2 hours)
TotalTimeout: '2h'
Configuration Format
Timeout values should be specified with a numeric value followed by a time unit:
s
for secondsm
for minutesh
for hours
For example: 30m
, 2h
, or 3600s
Job Type Support
Timeouts can be configured for these job types:
- Batch jobs
- Ops jobs
Note: Timeout configuration is not available for Daemon and Service jobs, which are designed to run continuously.
Queue Timeouts
Queue timeouts control how long a job will wait for suitable resources before failing:
# Set queue timeout to 15 minutes
bacalhau docker run --queue-timeout 900 ubuntu -- <command>
In YAML:
Timeouts:
QueueTimeout: 900 # seconds
This is useful when:
- Your job has specific resource requirements
- You want to fail fast if resources aren't available
- You have time-sensitive workloads
Balancing Timeout Values
Setting appropriate timeouts requires balancing several factors:
- Job Duration: How long your job actually needs to run
- Resource Usage: Longer timeouts tie up resources for extended periods
- Network Stability: Account for possible interruptions or slowdowns
- Data Size: Consider the size of input/output data when setting timeouts
Best Practices
- Set Realistic Timeouts: Allow enough time for your job to complete, plus a buffer
- Use Queue Timeouts: For time-sensitive jobs, set reasonable queue timeouts
- Test and Adjust: Monitor job execution times and refine timeout values
- Document Timeout Requirements: Include timeout information in job documentation
- Consider Resource Impact: Shorter timeouts help prevent resource monopolization
Troubleshooting
If your jobs are timing out unexpectedly:
- Review Job Logs: Check logs to understand where time is being spent
- Optimize Performance: Look for ways to speed up your job
- Check Resource Allocation: Insufficient resources can slow job execution
- Segment Large Jobs: Break very large jobs into smaller, faster parts
- Verify Node Settings: Node operators may have maximum timeout limits