Auth doesn't work between requester nodes #3840
Comments
Interesting. Can you share the steps you followed so others can also reproduce this? Perhaps we could target the cluster running on bacalhau.tech. It contains one requester node and two compute nodes. All of the nodes use a token to authenticate:
We already understand this (I think) - we generate a token before deploying the cluster and configure the nodes to use the token. The nodes are then started with the token already present in their config file. This is how the new terraform works.
I'd propose we review the marketplace terraform which does this, and which is deployed at bacalhau.tech using NATS as the network transport. I am also happy to sync on this and walk through things - that might be easier than reading terraform.
At a high-level I think the answer is: "It depends" - what constitutes a network?
When deploying you can provide tokens via the vars file. If you do not provide tokens they will be randomly generated. Before any nodes start, their config files are defined to include the tokens. This allows them to start up and connect to each other. |
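A minimal sketch of the "use the provided token, otherwise generate one" step described above. This is not the terraform or Bacalhau code; the function name `resolveToken` and the 32-byte token length are assumptions made for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// resolveToken is a hypothetical helper: it returns the operator-supplied
// token when one is given, otherwise a random 32-byte token, hex encoded.
func resolveToken(provided string) (string, error) {
	if provided != "" {
		return provided, nil
	}
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generating token: %w", err)
	}
	return hex.EncodeToString(buf), nil
}

func main() {
	// An empty string stands in for "no token set in the vars file".
	token, err := resolveToken("")
	if err != nil {
		panic(err)
	}
	// The same value would then be written into every node's config file
	// before any node starts.
	fmt.Println("token length:", len(token))
}
```

The point of doing this once, ahead of deployment, is that every node starts with the same token already in its config, so no node has to be started first just to discover the token.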
For what it's worth I am unable to reproduce this on bacalhau.tech - although I'll admit I'm not exactly sure what a successful reproduction looks like. Here is the config file I am using on a node I am running on my desktop.

    node:
      network:
        cluster:
          name: global
          port: 6222
        orchestrators:
          - bacalhau.tech:4222
        port: 4222
        type: nats
      type:
        - compute
        - requester
      # added so I can target my node
      labels:
        - foo : bar

Running this job targeting
|
To clarify, what I mean by auth not working is that a requester node can join even without an auth key. Here is how you can reproduce this:
That is the ideal scenario, but we also generate a token if the user doesn't pass a token instead of assuming the network is open https://github.com/bacalhau-project/bacalhau/blob/main/cmd/cli/serve/util.go#L181 |
OK, I wasn't aware there was a difference. We can add the auth to cluster routes too.
That is already possible – just create a token of your choosing (via any method) and add it to the config of each node when you deploy them.
The terraform that Forrest has developed decides on an auth key ahead of time. We can do the same.
Users have a choice. For "unsophisticated" users who are hand-cranking a deployment, they can just use the output on stdout or the
That is the desired behaviour. We should not support open networks where possible because there's more chance one will make it into production through crappy configuration. You can say that's the user's fault if you like, but it'll be our support ticket to hear them complain about it and our reputational risk if our software is considered insecure. By requiring all networks to be secure, we turn an open network from a silent, invisible security risk into an upfront problem that is immediately apparent to users when they can't connect. |
I hear your point. I am just concerned that we are prematurely optimizing and overcomplicating the onboarding process at an early stage of the project. We are also giving the false impression that it is secure by default with no user intervention required, when in fact it is not production-grade secure. We are printing the token in plain text in stdout and in bacalhau.run, and we have a single token to access the whole network, which won't allow us to authenticate that nodeX is actually nodeX. This means someone who is deploying bacalhau in production should use a different auth mechanism, like the ones explained in Node ACL and supported by NATS.

Most systems that I've played with are open by default (Elasticsearch, Nomad, Grafana, Spark, ...), which greatly simplified locally testing and playing with these solutions. Also, someone who is deploying and self-managing a bacalhau network in production is expected to understand that their network is open and that they have to take additional steps to secure it, where they will have multiple options instead of us doing some magic on their behalf. I know I'll be surprised by some, but we can't check all the boxes and we need to make some tradeoffs.

In short, a single token for the whole network is not secure and doesn't solve authN. I understand it is better than a wide open network, but it will give the wrong impression that security is covered. Sometimes less is more! What I think is a good tradeoff is to introduce commands to help operators secure their network and generate keys on demand. NATS are doing something similar.

Product's input is needed @aronchick |
Talking about Node ACLs at this point of the project is far more premature than authentication with a simple token. We have no requirement to apply access control on a node-by-node basis that has come from any actual use case. I deliberately decided against using NATS usernames and passwords because it only adds complexity to our existing setup.
The solution we already have makes the auth token part of the peer address, so there is zero extra complexity added. Could you tell me what the additional complexity is for testing with the existing token solution? From my POV, forcing users to generate an auth token themselves and/or set config to choose between secure or insecure is adding complexity.
Users should not need to be software engineers to understand how to run Bacalhau. My points above are about how the system fails when the user does not have that level of understanding. As I said elsewhere, you can't just chuck this sort of problem over the wall and say "the users are too stupid" – you have to take steps to make sure a misconfigured system is not a complete security write-off. It is much better to have a secure and inaccessible system that needs to be reconfigured before use than to have an accidentally and invisibly insecure system. Also, you don't seem to have acknowledged that there is already a way for users to set their own auth token via config. So they already have the multiple options that you specify.
This is plainly incorrect given the reasons you have set out. Do you have a better reason for believing this? There is no such absolute as "secure" or "not secure" – there is only "secure enough". Using an auth token is secure enough for our networks today and doesn't introduce needless complexity into the setup experience. |
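To illustrate what "the auth token is part of the peer address" can look like, here is a small sketch using only Go's standard library. It assumes the token is carried in the user-info part of a `nats://` URL, which NATS clients accept; whether Bacalhau encodes it exactly this way is an assumption, and `withToken` and `tokenFrom` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"net/url"
)

// withToken embeds a token in the user-info part of an orchestrator address.
// Hypothetical helper, for illustration only.
func withToken(addr, token string) (string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", err
	}
	u.User = url.User(token)
	return u.String(), nil
}

// tokenFrom extracts the token again; it returns "" when none is present.
func tokenFrom(addr string) (string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", err
	}
	return u.User.Username(), nil
}

func main() {
	addr, err := withToken("nats://bacalhau.tech:4222", "s3cr3t")
	if err != nil {
		panic(err)
	}
	fmt.Println(addr)

	token, err := tokenFrom(addr)
	if err != nil {
		panic(err)
	}
	fmt.Println(token)
}
```

With this shape, a node operator only has one value to copy around: the address already carries the credential, so there is no separate auth setting to forget.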
This PR switches staging to use the NATS transport layer instead of Libp2p. It has been baking for a couple of days with no issues reported by the canaries.

Keep in mind that this is working with no additional changes related to auth because of the bug reported [here](https://github.com/bacalhau-project/expanso-planning/issues/518), where requester nodes can join a network even without auth keys. When we fix that issue, we will need to pre-provision the auth key instead of letting the requester node auto-generate it, or reuse the terraform modules used for marketplace.

Closes bacalhau-project/expanso-planning#521

## Summary by CodeRabbit

- **Chores**
  - Updated stage environment configuration to support a new network type.
The shortcut here is to disable clustering of requester nodes. |
@wdbaruni Per the above I'm going to resolve this by adding the auth token to the cluster routes. Please shout if we need to talk about the solution further. |
Today we try to make the networks secure by default by auto-generating an auth token if the user does not provide one. There has been a long discussion about this that you can find [here](https://github.com/bacalhau-project/expanso-planning/issues/518). The summary is:

1. Adds complexity to user onboarding as they will have to go through the logs or console output to figure out and copy the auto-generated token
2. We are printing the generated auth token in plain text in the console
3. I prefer to decouple auto-auth from launching NATS as our transport layer
4. While better than making the network open, token based auth is not secure enough and we don't want to give the impression to the users that their networks are secure by default. Reasons include:
   1. Token based auth doesn't encrypt traffic in transit
   2. We are using a global token and don't identify or authorize the compute nodes differently
   3. No easy way to rotate or expire the token
   4. We are planning to add more auth options in the future that are more secure than global tokens, and this shouldn't be the default for our users

This PR enables users to run open networks, which will simplify testing out bacalhau, and they will need to provide their auth token to secure their networks instead of us doing magic on their behalf and generating a random one for them.

In the future it might make more sense to fail the network from starting if not secure instead of doing some magic
Will address this as part of #3867 |
`AuthSecret` only applies to compute nodes when they want to join a network, but requester nodes are still able to join others without authentication. The reason is that compute nodes join as NATS clients and we are applying auth only for NATS clients. Requester nodes, on the other hand, are NATS servers and they join other requesters using a NATS cluster, which accepts a different configuration for auth.
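The gap described above can be modelled with a toy example. This is not Bacalhau or NATS code: the `listener` and `requester` types and the `secureCluster` flag are hypothetical, and only mirror the idea that a requester exposes a client entry point and a cluster-route entry point whose auth settings are configured independently.

```go
package main

import "fmt"

// listener models one entry point of a requester node (hypothetical type).
type listener struct {
	name          string
	requiredToken string // empty means no auth is enforced
}

// accepts reports whether a connection presenting token is let in.
func (l listener) accepts(token string) bool {
	return l.requiredToken == "" || l.requiredToken == token
}

// requester has separate entry points for clients (compute nodes) and for
// cluster routes (other requester nodes).
type requester struct {
	client  listener
	cluster listener
}

// newRequester applies authSecret to the client listener only, unless
// secureCluster is set, in which case cluster routes require it as well.
func newRequester(authSecret string, secureCluster bool) requester {
	r := requester{
		client:  listener{name: "client", requiredToken: authSecret},
		cluster: listener{name: "cluster"},
	}
	if secureCluster {
		r.cluster.requiredToken = authSecret
	}
	return r
}

func main() {
	current := newRequester("s3cr3t", false)
	fmt.Println("compute without token:", current.client.accepts(""))
	fmt.Println("requester without token:", current.cluster.accepts(""))

	fixed := newRequester("s3cr3t", true)
	fmt.Println("requester without token after fix:", fixed.cluster.accepts(""))
}
```

In this model the reported behaviour is the `secureCluster == false` case: the client path rejects a missing token while the cluster path lets any peer in.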
This allows a hybrid node to join existing networks, and start accepting and running jobs without any auth. It works since the node will join as a NATS server where it will handle and forward messages, and at the same time it will run a NATS client that will connect to itself. This will allow the node to publish its NodeInfo to the whole network and handle `AskForBid` and all other messages.

Caution
Before jumping in and adding the auth token to cluster routes, we need to understand how we can deploy a cluster in one go, without having to deploy the requester first, inspect the generated token, and then deploy the remaining nodes. Today I was able to deploy staging using NATS and it worked out of the box without pre-generating an auth token because of this security hole. The questions are: