Auth doesn't work between requester nodes #3840

Open
wdbaruni opened this issue Feb 14, 2024 · 11 comments
Labels
th/auth Theme: Relates to authentication and authorization type/bug Type: Something is not working as expected

@wdbaruni
Member

AuthSecret only applies to compute nodes when they want to join a network, but requester nodes are still able to join others without authentication.

The reason is that compute nodes join as NATS clients, and we apply auth only to NATS clients. Requester nodes on the other hand are NATS servers and they join other Requesters using NATS cluster, which accepts a different configuration for auth.

This allows a hybrid node to join existing networks, and start accepting and running jobs without any auth. It works since the node will join as a NATS server where it will handle and forward messages, and at the same time it will run a NATS client that will connect to itself. This will allow the node to publish its NodeInfo to the whole network and handle AskForBid and all other messages.
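To make the client/cluster distinction concrete, here is a minimal Python sketch that renders a nats-server style config with the two separate places where credentials can be set: the top-level `authorization` block, which applies to client connections, and the `authorization` block nested under `cluster`, which applies to routes between servers. The block names follow the standalone nats-server config format as far as I know; the helper name, token, and route credentials are hypothetical, and this is not how Bacalhau configures its embedded server.

```python
def render_nats_config(client_token, route_user=None, route_password=None):
    """Render an illustrative nats-server config (all values are made up)."""
    lines = [
        "port: 4222",
        # Top-level authorization: checked for NATS *client* connections.
        "authorization {",
        f'  token: "{client_token}"',
        "}",
        "cluster {",
        "  name: global",
        "  port: 6222",
    ]
    # Route authorization is configured separately, inside the cluster block.
    # If it is left out, other servers can form routes without credentials.
    if route_user is not None and route_password is not None:
        lines += [
            "  authorization {",
            f'    user: "{route_user}"',
            f'    password: "{route_password}"',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


# Client auth only: the situation described in this issue.
open_routes = render_nats_config("example-token")
# Client auth plus route auth.
secured_routes = render_nats_config("example-token", "route-user", "route-pass")
print(secured_routes)
```

With only the first block set, a compute node (client) needs the token while a requester (server) joining over port 6222 does not, which matches the behaviour reported here.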

Caution

Before jumping and adding the auth token to cluster routes, we need to understand how we can deploy a cluster in one go and without having to deploy the requester first, inspect the generated token, and then deploy the remaining nodes. Today I was able to deploy staging using NATS, and it worked out of the box without pre-generating an auth token because of this security hole. The questions are:

  1. How should we deploy our existing networks and switch them to NATS?
  2. How do we expect users to deploy their networks in one go?
  3. How is bacalhau-tech network working today, and how is the auth key generated?
@wdbaruni
Member Author

cc @simonwo @frrist

@frrist
Member

frrist commented Feb 14, 2024

Interesting. Can you share the steps you followed so others can also reproduce this?

Perhaps we could target the cluster running on bacalhau.tech. It contains one requester node and two compute nodes. All of the nodes use a token to authenticate:

  • a token for authenticating to the requester API is required by clients.
  • the requester requires a token from compute nodes to connect.

Before jumping and adding the auth token to cluster routes, we need to understand how we can deploy a cluster in one go and without having to deploy the requester first, inspect the generated token, and then deploy the remaining nodes.

We already understand this (I think) - we generate a token before deploying the cluster and configure the nodes to use the token. The nodes are then started with the token already present in their config file. This is how the new terraform works.

How should we deploy our existing networks and switch them to NATS

I'd propose we review the marketplace terraform, which does this and is deployed at bacalhau.tech using NATS as the network transport. I am also happy to sync on this and walk through things - that might be easier than reading terraform.

How do we expect users to deploy their networks in one go?

At a high-level I think the answer is: "It depends" - what constitutes a network?

  • Are all the compute nodes connected to a single requester node, or do different sets of compute nodes talk to their respective requester nodes?
  • How do requester nodes communicate with each other?
  • Can a compute node connected to requester A also talk to requester B?

How is bacalhau-tech network working today, and how is the auth key generated?

When deploying, you can provide tokens via the vars file. If you do not provide tokens, they will be randomly generated. Before any nodes start, their config files are written to include the tokens. This allows them to start up and connect to each other.
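The flow described above (use the supplied token if there is one, otherwise generate a random one, and write it into every node's config before anything starts) can be sketched in a few lines of Python. This is only an illustration: the function name, the `auth_token` key, and the node names are hypothetical and are not the terraform's actual variables.

```python
import secrets


def plan_cluster_configs(node_names, token=None):
    """Return one config dict per node, all sharing the same token.

    If no token is supplied, a random one is generated once, up front,
    so no node has to start first and have its output inspected.
    """
    shared = token if token else secrets.token_urlsafe(32)
    return {name: {"auth_token": shared} for name in node_names}


nodes = ["requester", "compute-0", "compute-1"]  # hypothetical node names
generated = plan_cluster_configs(nodes)
provided = plan_cluster_configs(nodes, token="token-from-vars-file")
```

Because the token is decided before any node boots, the whole cluster can be deployed in one go.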

@frrist
Member

frrist commented Feb 14, 2024

For what it's worth I am unable to reproduce this on bacalhau.tech - although I'll admit I'm not exactly sure what a successful reproduction looks like. Here is the config file I am using on a node I am running on my desktop.

node:
    network:
        cluster:
            name: global
            port: 6222
        orchestrators:
            - bacalhau.tech:4222
        port: 4222
        type: nats
    type:
        - compute
        - requester
    # added so I can target my node
    labels:
      - foo : bar

Running this job targeting bacalhau.tech as my API fails:

bacalhau docker run --selector=foo=bar ubuntu:latest echo hello world 
Job successfully submitted. Job ID: 2ef2f0dd-69c3-4914-add4-08e42ff3fef3
Checking job status... (Enter Ctrl+C to exit at any time, your job will continue running):

	Communicating with the network  ................  done ✅  0.0s
	   Creating job for submission  ................  err  ❌  0.0s

Error submitting job: not enough nodes to run job. requested: 1, available: 0. 
	Node QmTWp1LL: labels map[Architecture:amd64 Operating-System:linux git-lfs:true] don't match required selectors [foo = bar]
	Node QmfRfevR: labels map[Architecture:amd64 Operating-System:linux git-lfs:true] don't match required selectors [foo = bar]
Job Results By Node:

To get more details about the run, execute:
	bacalhau describe 2ef2f0dd-69c3-4914-add4-08e42ff3fef3

@wdbaruni
Member Author

To clarify, what I mean by auth not working is that a requester node can join even without an auth key. Here is how you can reproduce this:

# hybrid node joining without token
bacalhau serve --node-type requester,compute --network nats --orchestrators=bacalhau.tech:4222  --cluster-peers=bacalhau.tech:6222

# calling node list against this node will print all nodes in the network
bacalhau node list

# you can even submit jobs and they will be orchestrated on the compute nodes in the network
bacalhau docker run --concurrency 3 ubuntu echo hello

We already understand this (I think) - we generate a token before deploying the cluster and configure the nodes to use the token. The nodes are then started with the token already present in their config file. This is how the #3089 works.

That is the ideal scenario, but we also generate a token if the user doesn't pass a token instead of assuming the network is open: https://github.com/bacalhau-project/bacalhau/blob/main/cmd/cli/serve/util.go#L181

@simonwo
Contributor

simonwo commented Feb 15, 2024

Requester nodes on the other hand are NATS servers and they join other Requesters using NATS cluster, which accepts a different configuration for auth.

OK, I wasn't aware there was a difference. We can add the auth to cluster routes too.

Before jumping and adding the auth token to cluster routes, we need to understand how we can deploy a cluster in one go and without having to deploy the requester first, inspect the generated token, and then deploy the remaining nodes.

That is already possible – just create a token of your choosing (via any method) and add it to the config of each node when you deploy them.

How should we deploy our existing networks and switch them to NATS?

The terraform that Forrest has developed decides on an auth key ahead of time. We can do the same.

How do we expect users to deploy their networks in one go?

Users have a choice. "Unsophisticated" users who are hand-cranking a deployment can just use the output on stdout or in bacalhau.run as they do currently and receive secure NATS usage automatically. More sophisticated users can pre-choose a token as above.

That is the ideal scenario, but we also generate a token if the user doesn't pass a token instead of assuming the network is open.

That is the desired behaviour. We should not support open networks where possible because there's more chance one will make it into production through crappy configuration. You can say that's the user's fault if you like, but it'll be our support ticket to hear them complain about it and our reputational risk if our software is considered insecure.

By requiring all networks to be secure, we turn an open network from a silent, invisible security risk into an up front problem that is immediately apparent to users when they can't connect.

@wdbaruni
Member Author

wdbaruni commented Feb 15, 2024

That is the desired behaviour. We should not support open networks where possible because there's more chance one will make it into production through crappy configuration. You can say that's the user's fault if you like, but it'll be our support ticket to hear them complain about it and our reputational risk if our software is considered insecure.

By requiring all networks to be secure, we turn an open network from a silent, invisible security risk into an up front problem that is immediately apparent to users when they can't connect.

I hear your point. I am just concerned that we are prematurely optimizing and overcomplicating the onboarding process at an early stage of the project. We are also giving the false impression that it is secure by default with no user intervention required, when in fact it is not production-grade secure. We are printing the token in plain text in stdout and in bacalhau.run, and we have a single token to access the whole network, which won't allow us to authenticate that nodeX is actually nodeX. This means someone who is deploying bacalhau in production should use a different auth mechanism, like the ones explained in Node ACL and supported by NATS.

Most systems that I've played with are open by default (Elasticsearch, Nomad, Grafana, Spark, ...), which greatly simplifies testing and playing with them locally. Also, someone who is deploying and self-managing a bacalhau network in production is expected to understand that their network is open and that they have to take additional steps to secure it, where they will have multiple options instead of us doing some magic on their behalf. I know I'll be surprised by some, but we can't check all the boxes and we need to make some tradeoffs.

In short, a single token for the whole network is not secure and doesn't solve authN. I understand it is better than wide open network, but it will give wrong assumptions that security is covered. Sometimes less is more!
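The identity argument can be illustrated with a small Python sketch. With one network-wide token, the check never looks at who the caller claims to be, so any token holder can present itself as any node. A per-node credential table (shown only for contrast) ties the secret to the claimed identity. All names and secrets below are hypothetical; this is not Bacalhau or NATS code.

```python
import hmac

SHARED_TOKEN = "network-wide-secret"  # hypothetical single token

# Hypothetical per-node secrets, for contrast only.
PER_NODE_SECRETS = {"nodeX": "x-secret", "nodeY": "y-secret"}


def authenticate_shared(claimed_node_id, presented_token):
    # The claimed node ID plays no role: the token only proves
    # "I know the network secret", not "I am nodeX".
    return hmac.compare_digest(presented_token, SHARED_TOKEN)


def authenticate_per_node(claimed_node_id, presented_secret):
    expected = PER_NODE_SECRETS.get(claimed_node_id)
    if expected is None:
        return False
    return hmac.compare_digest(presented_secret, expected)


# nodeY holds the shared token and claims to be nodeX: accepted.
impersonation_shared = authenticate_shared("nodeX", SHARED_TOKEN)
# nodeY presents its own secret while claiming to be nodeX: rejected.
impersonation_per_node = authenticate_per_node("nodeX", "y-secret")
```

The shared token still keeps out anyone who does not hold it, which is the part of the tradeoff being debated in this thread.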

I think a good tradeoff is to introduce commands that help operators secure their network and generate keys on demand. NATS does something similar.

Product's input is needed @aronchick

@simonwo
Contributor

simonwo commented Feb 16, 2024

I am just concerned that we are prematurely optimizing and overcomplicating the onboarding process at an early stage of the project.

Talking about Node ACLs at this point of the project is far more premature than authentication with a simple token. We have no requirement to apply access control on a node-by-node basis that has come from any actual use case. I deliberately decided against using NATS usernames and passwords because it only adds complexity to our existing setup.

Most systems that I've played with are open by default (Elasticsearch, Nomad, Grafana, Spark, ...), which greatly simplifies testing and playing with them locally.

The solution we already have makes the auth token part of the peer address, so there is zero extra complexity added. Could you tell me what the additional complexity is for testing with the existing token solution?
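As a rough illustration of what "the token is part of the peer address" can look like, the Python sketch below embeds a token in the userinfo part of a URL and reads it back. The `nats://<token>@host:port` shape is an assumption based on how NATS URLs commonly carry credentials; the thread does not show the exact format Bacalhau prints, and the helper names are hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def address_with_token(host_port, token):
    """Build a peer address that carries the token (assumed URL shape)."""
    return f"nats://{quote(token, safe='')}@{host_port}"


def split_address(url):
    """Return (token, host:port); token is None if the address has none."""
    parts = urlsplit(url)
    token = unquote(parts.username) if parts.username else None
    return token, f"{parts.hostname}:{parts.port}"


peer = address_with_token("bacalhau.tech:4222", "example-token")
token, endpoint = split_address(peer)
```

A node given this single string has both where to connect and what to present, so no separate auth setting is needed.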

From my POV, forcing users to generate an auth token themselves and/or set config to choose between secure or insecure is adding complexity.

Also, someone who is deploying and self-managing a bacalhau network in production is expected to understand that their network is open and that they have to take additional steps to secure it, where they will have multiple options instead of us doing some magic on their behalf.

Users should not need to be software engineers to understand how to run Bacalhau. My points above are about how the system fails when the user does not have that level of understanding. As I said elsewhere, you can't just chuck this sort of problem over the wall and say "the users are too stupid"; you have to take steps to make sure a misconfigured system is not a complete security write-off. It is much better to have a secure and inaccessible system that needs to be reconfigured before use than to have an accidentally and invisibly insecure system.

Also, you don't seem to have acknowledged that there is already a way for users to set their own auth token via config. So they already have the multiple options that you specify.

In short, a single token for the whole network is not secure and doesn't solve authN.

This is plainly incorrect given the reasons you have set out. Do you have a better reason for believing this?

There is no such absolute as "secure" or "not secure" – there is only "secure enough". Using an auth token is secure enough for our networks today and doesn't introduce needless complexity into the setup experience.

wdbaruni referenced this issue Feb 19, 2024
This PR switches staging to use NATS transport layer instead of Libp2p.
It has been baking for a couple of days with no issues reported by the
canaries.

Keep in mind that this is working with no additional changes related to
auth because of the bug reported
[here](https://github.com/bacalhau-project/expanso-planning/issues/518),
where requester nodes can join a network even without auth keys. When we
fix that issue, we will need to pre-provision the auth key instead of
letting the requester node auto-generate it, or reuse the terraform
modules used for marketplace.

Closes bacalhau-project/expanso-planning#521

## Summary by CodeRabbit

- **Chores**
- Updated stage environment configuration to support a new network type.

@frrist
Member

frrist commented Mar 13, 2024

A shortcut here is to disable clustering of requester nodes.

@simonwo
Contributor

simonwo commented Mar 20, 2024

@wdbaruni Per the above I'm going to resolve this by adding the auth token to the cluster routes. Please shout if we need to talk about the solution further.

@wdbaruni
Member Author

@simonwo yeah go for it. This will break staging and demo networks as we don't pre-generate auth-tokens and they only connect today using this bug. But heads up that I am removing auto-generation of auth tokens and allowing open networks for now #3670

wdbaruni referenced this issue Mar 21, 2024
Today we try to make the networks secure by default by auto-generating
an auth token if the user does not provide one. There has been a long
discussion about this that you can find
[here](https://github.com/bacalhau-project/expanso-planning/issues/518).
The summary is:
1. Adds complexity to user onboarding as they will have to go through
the logs or console output to figure out and copy the auto-generated
token
2. We are printing the generated auth token in plain text in the console
3. I prefer to decouple auto-auth from launching NATS as our transport
layer
4. While better than making the network open, token based auth is not
secure enough and we don't want to give the impression to the users that
their networks are secure by default. Reasons include:
    1. Token based auth doesn't encrypt traffic in transit
    2. We are using a global token and don't identify or authorize the
    compute nodes differently
    3. No easy way to rotate or expire the token
5. We are planning to add more auth options in the future that are more
secure than global tokens, and this shouldn't be the default for our
users

This PR enables users to run open networks which will simplify testing
out bacalhau, and they will need to provide their auth token to secure
their networks instead of us doing magic on their behalf and generating
a random one for them. In the future it might make more sense to fail
the network from starting if not secure instead of doing some magic
@simonwo simonwo removed their assignment Apr 12, 2024
@wdbaruni wdbaruni added this to the v1.4.0 milestone Apr 16, 2024
@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni wdbaruni added type/bug Type: Something is not working as expected th/auth Theme: Relates to authentication and authorization labels Apr 22, 2024
@wdbaruni wdbaruni modified the milestones: v1.4.0, v1.7.0 Jun 25, 2024
@wdbaruni
Member Author

Will address this as part of #3867

@wdbaruni wdbaruni moved this from Inbox to Backlog in Engineering Planning Jun 26, 2024
Projects
Status: Backlog