Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
This PR introduces a new transport layer for inter-node connectivity based on NATS, in addition to the existing libp2p transport. ## Integration Architecture NATS is embedded into Bacalhau nodes. Each Orchestrator node (aka Requester node) also runs a NATS server and connects with other Orchestrator nodes to form a NATS cluster. Compute nodes on the other hand only run a NATS client and connect to any of the Orchestrator nodes to listen for job requests and publish their node info to the network. With this approach, only the Orchestrator nodes need to be reachable, which simplies the network requirements for the Compute nodes. Also, the Compute nodes only need to know the address of only a single Orchestrator node, and will learn about the other Orchestrators at runtime. This simplifies bootstrapping, but also enables the Compute node to failover and reconnect to other Orchestrators at runtime. For completeness, the Orchestrator nodes also run a NATS client to subscribe for Compute node info, publish job requests and listen for responses. ### NATS Subjects NATS follows subject based addressing where clients publish and subscribe to subjects, and the servers take care of routing of those messages. These are the current subject patterns used to orchestrate jobs: - `node.compute.<node-id>.*` Every compute node subscribes to a topic in the format of `node.compute.<node-id>.*` to handle job requests from the orchestrator, such as `AskForBid` and `BidAccepted`. For example, if the Orchestrator selects node `QmUg1MoAUMEbpzg6jY45zpfNo2dmTBJS6CzZCTMbniYsvC` to run a job, it will send a message to `node.compute.QmUg1MoAUMEbpzg6jY45zpfNo2dmTBJS6CzZCTMbniYsvC.AskForBid/1`, which will land on the correct compute node. The suffix `/1` is just for versioning in case we change the `AskForBid` message format - `node.orchestrator.<node-id>.*` Similarly, Orchestrator nodes listen to a dedicate subject to handle compute callbacks, such as `OnRunComplete` and `OnComputeFailure` - `node.info.<node-id>` Compute nodes periodically publish their node info to `node.info.<node-id>` subject, whereas Orchestrator nodes listen to `node.info.*` so their can handle node info from all compute nodes ## Running with NATS NATS is currently opt-in, and libp2p is still the default transport layer. To run a network with NATS, use the following commands: ``` # Run Orchestrator node bacalhau serve --node-type=requester --use-nats # Run a second Orchestrator node bacalhau serve --node-type=requester --use-nats --cluster-peers=<HOST>:6222 # Run a compute node bacalhau serve --node-type=compute --use-nats --orchestrators=<HOST>:4222 # Run devstack with NATS bacalhau devstack --use-nats ``` ## Testing Done - Deployed and tested in `development` environment - All existing tests are passing when using NATS transport ## Remaining Work 1. Logstream support for NATS transport 1. Auth, as currently any node can join the network 1. Devstack tests based on NATS and deployed as part of circleci checks
- Loading branch information