What is Bacalhau?
Bacalhau is an open-source distributed compute orchestration framework designed to bring compute to the data. Instead of moving large datasets around networks, Bacalhau makes it easy to execute jobs close to the data's location, drastically reducing latency and resource overhead.
Why It Matters
- Highly Distributed Architecture: Deploy compute networks that span regions, cloud providers, and on-premises datacenters—all working together as a unified system.
- Resilient Operation: Compute nodes operate effectively even with intermittent connectivity to orchestrators, maintaining service availability during network partitioning or isolation.
- Data Sovereignty & Security: Process sensitive data within security boundaries without requiring it to leave your premises, enabling computation while preserving data control.
- Cross-Organizational Computation: Allow specific vetted computations on protected datasets without exposing raw data, breaking data silos between organizations.
- Resource Efficiency: By minimizing data transfers, Bacalhau saves bandwidth costs and ensures jobs run faster.
- High Scalability: As your data and processing needs grow, simply add more compute nodes on demand—whether on-premises or in the cloud.
- Ease of Integration: Bacalhau works with existing container images (Docker, etc.), meaning you can leverage your current workflows without major rewrites.
Key Features
- Single Binary Simplicity: Bacalhau is a single self-contained binary that functions as a client, orchestrator, and compute node—making it incredibly easy to set up and scale your distributed compute network.
- Modular Architecture: Bacalhau's design supports multiple execution engines (Docker, WebAssembly) and storage providers through clean interfaces, allowing for easy extension.
- Orchestrator-Compute Model: A dedicated orchestrator coordinates job scheduling, while compute nodes run tasks—all from the same binary with different runtime modes.
- Flexible Storage Integrations: Bacalhau integrates with S3, HTTP/HTTPS, and other storage systems, letting you pull data from various sources.
- Multiple Job Types: Support for batch, ops, daemon, and service job types to accommodate different workflow requirements.
- Declarative & Imperative Submissions: Define jobs in a YAML spec (declarative) or pass all arguments via CLI (imperative).
- Publisher Support: Output results to local volumes, S3, or other storage backends—so your artifacts are readily accessible.
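The declarative and imperative styles above can be sketched with a minimal batch job. The field names follow Bacalhau's YAML job spec; the image and command here are illustrative placeholders, not part of any official example:

```yaml
# job.yaml — a minimal declarative batch job spec.
# The image and command are illustrative placeholders.
Name: hello-bacalhau
Type: batch
Count: 1
Tasks:
  - Name: main
    Engine:
      Type: docker
      Params:
        Image: ubuntu:22.04
        Entrypoint:
          - /bin/echo
        Parameters:
          - "Hello from Bacalhau"
```

Submit the spec with `bacalhau job run job.yaml`; the rough imperative equivalent is `bacalhau docker run ubuntu:22.04 -- /bin/echo "Hello from Bacalhau"`.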
Use Cases
Bacalhau's distributed compute framework enables a wide range of applications across different industries:
Log Processing
Process logs efficiently at scale by running distributed jobs directly at the source, reducing bandwidth costs by up to 93% while improving real-time insights. Bacalhau supports various job types for log management:
- Daemon Jobs: Continuously run on each node for real-time log aggregation and compression
- Service Jobs: Handle ongoing processing tasks like log aggregation and issue detection
- Batch Jobs: Execute on-demand in-depth analysis of historical log data
- Ops Jobs: Enable real-time querying of live logs for urgent investigations
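The daemon pattern above might be expressed as follows. This is a sketch under stated assumptions: the `example.org/log-shipper` image, the `/var/log/app` path, and the use of a `localDirectory` input source are hypothetical placeholders, not built-in names:

```yaml
# Daemon job sketch: runs continuously on each matching node,
# aggregating logs that stay local to that node.
# Image name and paths are hypothetical placeholders.
Name: log-aggregator
Type: daemon
Tasks:
  - Name: aggregate
    Engine:
      Type: docker
      Params:
        Image: example.org/log-shipper:latest  # hypothetical image
    InputSources:
      - Source:
          Type: localDirectory
          Params:
            SourcePath: /var/log/app
        Target: /logs
```

Because the job type is `daemon`, the orchestrator keeps an instance running on every node that matches the job's placement rules, rather than scheduling a one-off execution.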
Distributed Data Warehousing
Query and analyze data across multiple regions by deploying compute tasks directly where your data resides. This approach reduces latency, enhances performance, and ensures compliance with data sovereignty regulations. Bacalhau integrates with modern data tools like Apache Iceberg and DuckDB to enable:
- Reduced data movement with local query execution
- Improved query performance through compute-data proximity
- Seamless scalability with dynamic node addition
- Compliance with data regulations through region-specific processing
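As one sketch of the local-query pattern above, a batch job can pull a regional partition from S3, query it where it runs, and publish only the result. The bucket names, keys, and query-runner image below are hypothetical placeholders:

```yaml
# Batch job sketch: fetch a data partition from S3, query it locally,
# publish only the (small) result. Buckets, keys, and the image
# are hypothetical placeholders.
Name: regional-query
Type: batch
Count: 1
Tasks:
  - Name: query
    Engine:
      Type: docker
      Params:
        Image: example.org/duckdb-runner:latest  # hypothetical image
        Parameters:
          - "SELECT count(*) FROM '/inputs/events.parquet'"
    InputSources:
      - Source:
          Type: s3
          Params:
            Bucket: my-warehouse          # hypothetical bucket
            Key: events/us-east/
        Target: /inputs
    Publisher:
      Type: s3
      Params:
        Bucket: my-results               # hypothetical bucket
        Key: query-output/
```

Only the query output crosses the network; the raw partition never leaves its region.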
Fleet Management
Efficiently manage distributed nodes across multiple environments with capabilities for:
- Remote execution of commands without requiring SSH access
- Automated software deployment and configuration updates
- Real-time metrics and logs collection
- Targeted job execution based on node attributes
- Rapid incident response and automated recovery
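Targeted execution can be sketched with an ops job plus node constraints. The `region` label below is an example of an operator-defined node attribute, not a built-in:

```yaml
# Ops job sketch: runs once on every currently connected node
# matching the constraints — no SSH access required.
# The "region" label is an example operator-defined attribute.
Name: disk-check
Type: ops
Constraints:
  - Key: region
    Operator: "="
    Values:
      - eu-west
Tasks:
  - Name: check
    Engine:
      Type: docker
      Params:
        Image: ubuntu:22.04
        Entrypoint:
          - /bin/df
        Parameters:
          - "-h"
```

Because the job type is `ops`, it fans out to all matching nodes at submission time, which suits fleet-wide diagnostics and incident response.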
Distributed Machine Learning
Train and deploy ML models across a distributed compute fleet, optimizing performance while keeping data in place:
- Distribute training across multiple machines to handle larger models
- Process data locally to minimize network transfers
- Deploy inference jobs near users for low-latency predictions
- Support federated learning for privacy-sensitive applications
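A training run that keeps data in place might be sketched as below. The trainer image, bucket, and the assumption that the node advertises a GPU are all illustrative:

```yaml
# Batch training job sketch: data is mounted where the job runs,
# and a GPU is requested via task resources.
# Image and bucket names are hypothetical placeholders.
Name: train-model
Type: batch
Count: 1
Tasks:
  - Name: train
    Engine:
      Type: docker
      Params:
        Image: example.org/trainer:latest  # hypothetical image
    Resources:
      GPU: "1"
    InputSources:
      - Source:
          Type: s3
          Params:
            Bucket: training-data         # hypothetical bucket
            Key: shard-001/
        Target: /data
```

Raising `Count` schedules multiple parallel executions, which is one way to spread training shards across the fleet.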
Edge Computing
Run compute tasks closer to the data source for applications requiring low latency and minimal bandwidth usage:
- Process and analyze sensor, IoT, or video data in real time
- Perform pre-processing and filtering at the edge before sending refined data
- Distribute tasks across available edge resources dynamically
- Ensure data privacy by keeping computations near the source
How It Works
Bacalhau's architecture enables you to create compute networks that bridge traditional infrastructure boundaries. When you submit a job, Bacalhau intelligently determines which compute nodes are best positioned to process the data based on locality, availability, and your defined constraints—without requiring data movement or constant connectivity.
This approach is particularly valuable for:
- Organizations with data that cannot leave certain security boundaries
- Multi-region operations where data transfer is expensive or impractical
- Scenarios where multiple parties need to collaborate on analysis without sharing raw data
- Edge computing environments with intermittent connectivity
Community
Bacalhau has a very friendly community and we are always happy to help you get started:
- Join the Slack community and head to the #bacalhau channel – it is the easiest way to engage with other members of the community and get help.
- Contributing – learn how to contribute to the Bacalhau project.