Troubleshooting
This guide provides solutions for common issues encountered by Bacalhau users. By understanding these troubleshooting scenarios, you'll be able to create more reliable jobs and workflows.
What You'll Learn
- How to diagnose and resolve common Bacalhau job issues
- Strategies for debugging stuck, failed, or misbehaving jobs
- Best practices to prevent common problems
Job Lifecycle Issues
Jobs Stuck in Pending State
One of the most common issues users encounter is jobs remaining in the "Pending" state and never executing.
Possible Causes
- No available nodes: No compute nodes are connected to the orchestrator
- Resource constraints too high: Requesting more CPU, memory, or GPU than any available node can provide
- Mismatched node selector: Job requirements don't match available node capabilities
- Network partitioning: Orchestrator can't communicate with compute nodes
Diagnosis
Check the job status and specifications for clues:
bacalhau job describe <jobID>
# For more detailed information in YAML format
bacalhau job describe <jobID> --output yaml
Look for status messages that might indicate scheduling issues.
Check available compute nodes:
bacalhau node list
Ensure there are active compute nodes with sufficient resources.
Solutions
- Reduce resource requests: Lower CPU, memory, or GPU requirements
- Add more compute nodes: Add capacity to your cluster
- Check network connectivity: Ensure the orchestrator and compute nodes can reach each other
- Modify job requirements: Adjust constraints to match available resources
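The two diagnostic steps above can be bundled into one helper so they are always run together. This is a minimal sketch that uses only the `bacalhau job describe` and `bacalhau node list` commands shown in this guide; the function name `diagnose_pending` is hypothetical and not part of the Bacalhau CLI.

```shell
#!/usr/bin/env bash
# Sketch: gather both diagnostics for a job that stays in Pending.
# diagnose_pending is a hypothetical helper, not a Bacalhau command.
diagnose_pending() {
  local job_id="$1"
  if [ -z "$job_id" ]; then
    echo "usage: diagnose_pending <jobID>" >&2
    return 2
  fi

  echo "== Job description =="
  bacalhau job describe "$job_id" \
    || echo "warning: could not describe job $job_id" >&2

  echo "== Compute nodes =="
  bacalhau node list \
    || echo "warning: could not list nodes (is the orchestrator reachable?)" >&2
}

# Usage (substitute a real job ID):
#   diagnose_pending <jobID>
```

If the node list comes back empty or the command fails, the cause is more likely missing nodes or connectivity than the job's resource requests.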
Input Data Access Issues
Problems accessing or mounting input data are another common source of failures.
Possible Causes
- Wrong path or URL: Incorrect or inaccessible source location
- Missing credentials: Missing or invalid credentials for S3 or private URLs
- Network limitations: Compute node can't reach data source
- Path mapping errors: Incorrect source-to-destination mapping
Diagnosis
Check job specs and status:
bacalhau job describe <jobID> --output yaml
If the job started but failed during execution, check logs:
bacalhau job logs <jobID>
Look for messages like "file not found" or "access denied".
Solutions
- Validate paths: Double-check that source paths, URLs, or S3 buckets exist and are accessible
- Check credentials: Ensure proper environment variables or configuration for authenticated sources
- Test connectivity: Verify the compute node can reach the data source
- Local testing: Test data access locally before running on Bacalhau
Example of corrected input mounting:
# INCORRECT (missing file)
bacalhau docker run --input /path/does/not/exist:/data ubuntu:latest -- cat /data/file.txt
# CORRECT
bacalhau docker run --input /path/that/exists:/data ubuntu:latest -- cat /data/file.txt
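As a rough illustration of the "local testing" advice, the sketch below checks that a path exists and is readable before a job is submitted. The helper name `check_input` is hypothetical. It only inspects the machine it runs on, so it cannot confirm that a compute node sees the same path.

```shell
#!/usr/bin/env bash
# Sketch: verify a local input path before submitting a job.
# check_input is a hypothetical helper, not part of the Bacalhau CLI.
check_input() {
  local src="$1"
  if [ ! -e "$src" ]; then
    echo "missing: $src does not exist" >&2
    return 1
  fi
  if [ ! -r "$src" ]; then
    echo "unreadable: $src exists but cannot be read" >&2
    return 1
  fi
  echo "ok: $src"
}

# Usage: submit only when the check passes.
#   check_input /path/that/exists && \
#     bacalhau docker run --input /path/that/exists:/data ubuntu:latest -- cat /data/file.txt
```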
No Output Found
Jobs complete successfully, but expected output files are missing.
Possible Causes
- Wrong output path: Not writing to the /outputs directory
- Command errors: The job ran but the command failed to produce output
- Permission issues: Container user can't write to output location
- Publisher configuration: Publisher not configured correctly
Diagnosis
Check job specification and execution details:
bacalhau job describe <jobID> --output yaml
If the job executed, check logs for clues about what the job did:
bacalhau job logs <jobID>
Verify your job actually wrote to the /outputs directory.
Solutions
- Use absolute paths: Always use absolute paths in your commands
- Write to /outputs: Ensure your job writes to the /outputs directory specifically
- Add debugging: Add commands to list directories and print current working directory
- Check permissions: Ensure your process has permission to write to the output location
Examples
# INCORRECT (the local shell handles the redirect, so result.txt is written on your machine, not in /outputs)
bacalhau docker run ubuntu:latest -- echo "Hello" > result.txt
# CORRECT
bacalhau docker run ubuntu:latest -- bash -c 'echo "Hello" > /outputs/result.txt'
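The "add debugging" suggestion might look like the following sketch: print the working directory and list the output directory before and after writing. The function `write_with_debug` and its optional directory argument are hypothetical and exist only so the steps can be dry-run locally. Inside a job, the same commands would go into the `bash -c '...'` string.

```shell
#!/usr/bin/env bash
# Sketch: container-side commands with extra debugging output.
# write_with_debug is a hypothetical helper; the directory defaults to /outputs
# and is overridable only for local dry runs.
write_with_debug() {
  local out_dir="${1:-/outputs}"
  echo "working directory: $(pwd)"
  echo "before:"
  ls -la "$out_dir"
  echo "Hello" > "$out_dir/result.txt"
  echo "after:"
  ls -la "$out_dir"
}

# Equivalent inline form for a job:
#   bacalhau docker run ubuntu:latest -- bash -c \
#     'pwd; ls -la /outputs; echo "Hello" > /outputs/result.txt; ls -la /outputs'
```

The listings then appear in `bacalhau job logs <jobID>`, which shows whether the file was created where you expected.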
Container and Resource Issues
Container Errors
Issues with container execution or container image availability.
Possible Causes
- Image not found: The specified container image doesn't exist or is inaccessible
- Command errors: The command specified doesn't exist in the container
- Resource limitations: The container runs out of resources during execution
- Exit codes: The container process exits with a non-zero code
Diagnosis
Check job specification for container configuration:
bacalhau job describe <jobID> --output yaml
If the container started, check logs for execution errors:
bacalhau job logs <jobID>
Look for messages about image pulling or command execution.
Solutions
- Verify image exists: Check that the image name is correct and accessible
- Test locally: Try running the container locally with Docker first
- Check command: Ensure the command exists in the container and has correct syntax
- Adjust resources: Provide sufficient CPU, memory, and disk for your workload
Example of corrected container image:
# INCORRECT (typo in image name)
bacalhau docker run ubuntuu:latest -- echo "Hello"
# CORRECT
bacalhau docker run ubuntu:latest -- echo "Hello"
# CORRECT (with specific image version)
bacalhau docker run ubuntu:20.04 -- echo "Hello"
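To follow the "test locally" advice, a small wrapper can pull the image and run the intended command with plain Docker first. This is a sketch that assumes Docker is installed on your machine; `try_image_locally` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: pull and run an image locally before submitting it to Bacalhau.
# try_image_locally is a hypothetical helper and assumes Docker is installed.
try_image_locally() {
  local image="$1"
  shift
  if ! docker pull "$image"; then
    echo "image not available: $image" >&2
    return 1
  fi
  # Run the same command you plan to pass after `--` in `bacalhau docker run`.
  docker run --rm "$image" "$@"
}

# Usage:
#   try_image_locally ubuntu:latest echo "Hello"
```

A typo such as `ubuntuu:latest` fails at the pull step, before any job is submitted.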
Resource Exhaustion
Jobs fail because they run out of resources during execution.
Possible Causes
- Out of memory (OOM): Job exceeds allocated memory
- Disk space exhaustion: Job writes more data than allocated disk space
- CPU thrashing: Insufficient CPU allocation causes extreme slowdown
- GPU memory errors: CUDA out of memory errors for GPU jobs
Diagnosis
Check job specification and status:
bacalhau job describe <jobID> --output yaml
If the job executed, check logs for error messages:
bacalhau job logs <jobID>
Look for error messages about memory, disk space, or resource limits.
Solutions
- Increase resources: Allocate more memory, CPU, or disk space
- Optimize code: Reduce resource usage in your application
- Process in batches: Break large workloads into smaller chunks
- Clean up temporary files: Remove unneeded files during processing
Example of increased resource allocation:
# Increased memory allocation
bacalhau docker run --memory 4GB python:3.9 -- python memory_intensive_script.py
# Increased disk space
bacalhau docker run --disk 20GB ubuntu:latest -- dd if=/dev/zero of=/outputs/large_file bs=1M count=15000
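One way to "process in batches" is to split a large line-oriented input into fixed-size chunks and submit one smaller job per chunk. The sketch below uses the standard `split` utility. The helper name `split_into_batches` and the `chunk_` prefix are illustrative choices, not Bacalhau conventions.

```shell
#!/usr/bin/env bash
# Sketch: split a large line-oriented file into chunks for separate jobs.
# split_into_batches is a hypothetical helper built on the standard `split` tool.
split_into_batches() {
  local input="$1"
  local lines_per_chunk="$2"
  local out_dir="$3"
  mkdir -p "$out_dir" || return 1
  # Produces files named chunk_aa, chunk_ab, ... inside out_dir.
  split -l "$lines_per_chunk" "$input" "$out_dir/chunk_"
}

# Usage: 1000 lines per chunk, written to ./chunks
#   split_into_batches big_input.txt 1000 ./chunks
```

Each chunk can then be passed to its own job with a smaller memory and disk request than the full input would need.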
Command and Syntax Issues
Command Line Parsing Issues
Problems related to how commands and arguments are passed to containers.
Possible Causes
- Missing separator: No -- between Bacalhau flags and container command
- Quote handling: Issues with shell quotes and argument passing
- Special characters: Problems with special characters in commands
Diagnosis
Check the exact command being executed:
bacalhau job describe <jobID> --output yaml
Look at the command fields to see what was actually executed.
Solutions
- Use the separator: Always use -- between Bacalhau flags and the container command
- Quote properly: Be careful with nested quotes in shell commands
- Use bash -c: For complex commands, wrap them in bash -c '...'
- Use YAML specs: For very complex commands, use declarative YAML specifications
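A declarative spec avoids shell quoting, because the command is stored as a list of strings. The sketch below writes such a spec to a file. The field names (`Type`, `Tasks`, `Engine`, `Params`, `Entrypoint`, `Parameters`) reflect my understanding of the Bacalhau YAML job format and should be checked against the job specification for your version. The job name and the helper `write_job_spec` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: write a declarative job spec to a file.
# ASSUMPTION: field names follow the Bacalhau YAML job format as I understand it;
# verify them against the documentation for your Bacalhau version.
write_job_spec() {
  local path="$1"
  cat > "$path" <<'EOF'
Name: loop-example
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:latest
        Entrypoint:
          - /bin/bash
        Parameters:
          - -c
          - 'for i in {1..5}; do echo "Number $i"; done'
EOF
}

# Usage (submission command assumed to be `bacalhau job run`):
#   write_job_spec job.yaml
#   bacalhau job run job.yaml
```

The quoted heredoc delimiter (`'EOF'`) stops the local shell from expanding `$i`, so the loop reaches the container unchanged.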
Example of corrected command syntax:
# INCORRECT (missing separator)
bacalhau docker run ubuntu:latest echo "Hello"
# CORRECT
bacalhau docker run ubuntu:latest -- echo "Hello"
# CORRECT (complex command)
bacalhau docker run ubuntu:latest -- bash -c 'for i in {1..5}; do echo "Number $i"; done > /outputs/result.txt'
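When quoting is in doubt, it can help to see how your local shell splits a command before Bacalhau ever receives it. The following sketch prints each argument on its own line in brackets. `show_args` is a hypothetical helper for local experimentation only.

```shell
#!/usr/bin/env bash
# Sketch: show how the local shell splits a command into arguments.
# show_args is a hypothetical helper, not part of the Bacalhau CLI.
show_args() {
  local arg
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Single quotes keep the script as one argument and leave $i unexpanded:
show_args bash -c 'echo "Number $i"'
# prints:
#   [bash]
#   [-c]
#   [echo "Number $i"]
```

If the script shows up split across several brackets, or a variable appears already expanded, the container will not receive the command you intended.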