Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support NATS TLS Communication #4686

Merged
merged 4 commits into from
Nov 5, 2024
Merged

Support NATS TLS Communication #4686

merged 4 commits into from
Nov 5, 2024

Conversation

jamlo
Copy link
Contributor

@jamlo jamlo commented Nov 5, 2024

This PR adds the ability to enable TLS for NATS communication in a cluster.
It supports custom CA as well as regular signed certificate by known CA. If the TLS configuration was not set in the config, it will default back to non-tls. The NATS server within the orchestrator node dictates if NATS TLS is enabled or not. When enabled, only TLS communication is allowed.

Sample Orchestrator Config:

NameProvider: "uuid"
API:
  Port: 1234
Orchestrator:
  Enabled: true
  Auth:
    Token: "i_am_very_secret_token"
  TLS:
    ServerCert: "/path/to/cert"
    ServerKey: "/path/to/key"
    CACert: "/path/to/ca-cert" # Optional, will default to the system trust-store 
    ServerTimeout: 15

Sample Compute Node Config:

NameProvider: "uuid"
API:
  Port: 1234
Compute:
  Enabled: true
  Orchestrators:
    - nats://bacalhau-orchestrator-node:4222
  Auth:
    Token: "i_am_very_secret_token"
  TLS:
    CACert: "/path/to/ca-cert" # Optional, will default to the system trust-store 

For details on how to enable TLS for the NATS Server on the orchestrator node and compute node, please check the integration tests in the commits within this PR as they have detailed coverage for multiple use-cases as well as detailed configuration files.

Linear: https://linear.app/expanso/issue/ENG-305/plan-and-implement-nats-tls-enabling-in-core-bacalhau
Github: #4681

Summary by CodeRabbit

Release Notes

  • New Features

    • Added TLS configuration options for both compute and orchestrator nodes, enhancing secure communication capabilities.
    • Introduced new test suites for validating NATS integration with TLS, including support for custom CA configurations.
  • Bug Fixes

    • Improved error handling in Docker image building and command execution within test suites.
  • Documentation

    • Updated run_locally.sh script with new command-line options for improved usability.
    • Enhanced README documentation for integration tests.
  • Chores

    • Added new certificates and keys for secure communications.
    • Updated dependency management in go.mod to include new libraries for JSON path evaluation.

This commit add the ability to enable TLS for NATS communication
in a cluster. It supports custom CA as well as regular signed
certificate by known CA.

To Enable TLS for the NATS Server on the orchestrator node and
compute node, please check the integration tests in this commit
as they have detailed coverage for multiple usecases as well as
detailed configuration files.

Linear: https://linear.app/expanso/issue/ENG-305/plan-and-implement-nats-tls-enabling-in-core-bacalhau
Github: #4681
Copy link

linear bot commented Nov 5, 2024

Copy link
Contributor

coderabbitai bot commented Nov 5, 2024

Walkthrough

This pull request introduces several changes, primarily focused on enhancing TLS (Transport Layer Security) capabilities within the Bacalhau project. Key modifications include the addition of new configuration fields for TLS in various structs, updates to related validation logic, and the introduction of new test suites for validating TLS functionality. Additionally, several new certificate and key files are included to support secure communications. Changes to scripts and configuration files improve usability and flexibility in testing and deployment scenarios.

Changes

File Path Change Summary
.cspell/custom-dictionary.txt Added new entry: tlsca.
.gitignore Modified path: test_integration/assets/dockerfiles/bacalhau_bintest_integration/common_assets/bacalhau_bin.
.pre-commit-config.yaml Updated exclusion pattern for detect-private-key hook.
pkg/config/types/compute.go Added field: TLS ComputeTLS and new type: ComputeTLS with field CACert string.
pkg/config/types/generated_constants.go Added constants for TLS configuration.
pkg/config/types/generated_descriptions.go Added new entries for TLS-related configuration and updated description for DisableAnalyticsKey.
pkg/config/types/orchestrator.go Added new struct: OrchestratorTLS and field for TLS settings in Orchestrator.
pkg/nats/transport/nats.go Updated NATSTransportConfig struct with new TLS fields and validation logic.
pkg/nats/transport/nats_config_test.go Added new test cases for TLS validation.
pkg/node/node.go Enhanced NATSTransportConfig with new TLS fields.
test_integration/README.md Updated command-line options for run_locally.sh.
test_integration/base_suite.go Added method for converting JSON strings to dynamic JSON.
test_integration/go.mod Added new dependencies for JSON path evaluation.
test_integration/utils/dynamic_json.go Introduced utility for handling dynamic JSON data.
test_integration/common_assets/certificates/generated_assets/ Added new certificate and key files for orchestrator and NATS.
test_integration/common_assets/nodes_configs/ Added YAML configurations for TLS-enabled compute and orchestrator with custom CA.
test_integration/common_assets/docker_compose_files/ Created new Docker Compose file for orchestrator and compute services.
test_integration/run_locally.sh Enhanced script for better command-line control and usability.
test_integration/init_test.go Reordered import statements.

Assessment against linked issues

Objective Addressed Explanation
Implement NATS TLS support within the core (ENG-305)
Define steps to enable TLS communication for NATS (ENG-305)
Define how certificates and keys will be generated and fed to server and clients (ENG-305)

Possibly related issues

Possibly related PRs

Suggested reviewers

  • wdbaruni

🐇 In the code, we hop and play,
Adding TLS in a secure way.
Certificates and keys, oh what a sight,
For safe connections, we’ll code all night!
With tests in place, our work is done,
Hopping forward, let’s have some fun! 🐇


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@jamlo jamlo requested a review from wdbaruni November 5, 2024 01:03
@jamlo jamlo marked this pull request as ready for review November 5, 2024 01:03
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 27

🧹 Outside diff range and nitpick comments (41)
test_integration/common_assets/nodes_configs/7_tls_enabled_compute.yaml (1)

1-2: Document TLS configuration details.

While the comment mentions CA cert being embedded in the trust store, it would be helpful to:

  1. Document the expected CA certificate location in the trust store
  2. Specify whether client certificate authentication is required
  3. Document any TLS version or cipher suite requirements
test_integration/common_assets/nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml (1)

8-9: Security concern: Hardcoded authentication token.

While this is a test configuration, using a hardcoded token that suggests it's a secret ("i_am_very_secret_token") could be misleading. Consider:

  1. Using a more obvious test token name (e.g., "test_token_123")
  2. Adding a comment indicating this is for testing only

Apply this diff:

  Auth:
-    Token: "i_am_very_secret_token"
+    # Note: This is a test-only token
+    Token: "test_token_123"
test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml (3)

1-3: Document the port choice rationale.

While using port 1234 is fine for testing, it would be helpful to add a comment explaining why this specific port was chosen and whether it has any significance for the tests.


15-17: Document the purpose of test labels.

Consider adding a comment explaining the purpose of these labels in the test configuration and how they are used in the integration tests.


1-17: Add security requirements documentation.

Consider adding a comment block at the top of the file that documents:

  1. Required certificate formats and key sizes
  2. Expected file permissions
  3. Validation requirements for production use
  4. Links to security best practices for NATS TLS configuration

This will help users understand the security implications when adapting this test configuration for production use.

test_integration/common_assets/nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml (3)

1-3: Consider making the API port configurable

While the port 1234 is valid for testing, consider making it configurable through environment variables to avoid potential conflicts during integration tests.

 NameProvider: "uuid"
 API:
-  Port: 1234
+  Port: ${TEST_API_PORT:-1234}

16-18: Use more descriptive label names

Consider using label names that reflect their purpose in the test scenario (e.g., test_scenario, tls_enabled).

 Labels:
-  label1: label1Value
-  label2: label2Value
+  test_scenario: tls_custom_ca
+  tls_enabled: "true"

1-18: Consider implementing a configuration validation mechanism

To ensure robust testing, consider implementing:

  1. A configuration validator that checks for required TLS files
  2. Environment variable substitution for flexible testing
  3. A schema definition for this YAML configuration to catch misconfigurations early

This will help prevent test failures due to misconfiguration and make the test suite more maintainable.

test_integration/common_assets/dockerfiles/Dockerfile-JumpboxNode (1)

10-12: Optimize Dockerfile by combining RUN commands.

Multiple RUN commands create unnecessary layers in the Docker image. Consider combining related commands to reduce the image size.

Here's an optimized version:

-# Install Nats CLI
-RUN curl -sf https://binaries.nats.dev/nats-io/natscli/[email protected] | sh
-RUN mv /app/nats /usr/local/bin/
+# Install Nats CLI
+RUN curl -sf https://binaries.nats.dev/nats-io/natscli/[email protected] | sh && \
+    mv /app/nats /usr/local/bin/
.pre-commit-config.yaml (1)

11-11: Document the rationale for excluding certificate files.

Consider adding a comment explaining why certificate files (including private keys) are excluded from the private key detection, to help future maintainers understand the security implications.

-    exclude: 'testdata/.*|test_integration/common_assets/certificates/.*|docker-compose-deployment/certificates/.*'
+    # Exclude test certificates and keys needed for TLS testing
+    exclude: 'testdata/.*|test_integration/common_assets/certificates/.*|docker-compose-deployment/certificates/.*'
test_integration/README.md (1)

41-46: Consider adding TLS-specific documentation.

Since this PR introduces NATS TLS support, it would be helpful to:

  1. Add examples of running TLS-specific tests
  2. Document any environment prerequisites needed for TLS tests (certificates, keys, etc.)
  3. Explain test naming conventions to help users identify TLS-related tests

Example addition:

 ./run_locally.sh -c  # Compile bacalhau binary only, do not runs tests
 ./run_locally.sh -s -t TestName  # Skip compilation and run specific test
+
+# Running TLS-specific tests:
+./run_locally.sh -t TestNATSTLS  # Run TLS communication tests
+
+# Note: TLS tests require valid certificates and keys in test/integration/common_assets/certs/
pkg/config/types/compute.go (1)

14-15: Enhance field documentation for clarity.

The current comment could be more specific about the TLS configuration capabilities.

-	// NATS specifies the NATS related configuration on the compute node.
+	// NATS specifies the NATS configuration for the compute node, including TLS settings
+	// for secure communication with the NATS server.
test_integration/utils/dynamic_json.go (1)

10-20: Add input validation to prevent processing empty JSON strings.

While the implementation is good, consider adding input validation to handle edge cases.

 func ParseToDynamicJSON(jsonStr string) (*DynamicJSON, error) {
+	if len(strings.TrimSpace(jsonStr)) == 0 {
+		return nil, fmt.Errorf("empty JSON string")
+	}
 	var data interface{}
 	if err := json.Unmarshal([]byte(jsonStr), &data); err != nil {
 		return nil, fmt.Errorf("failed to parse JSON: %w", err)
 	}
 	return &DynamicJSON{data: data}, nil
 }
test_integration/common_assets/scripts/compute_node_with_start_cmd.sh (3)

3-14: Enhance script documentation and error handling.

The usage example could be more specific to NATS TLS configuration since this script is part of the TLS support PR. Additionally, consider adding signal handling for graceful shutdown.

 # Function to display usage and exit
 usage() {
     echo "Usage: $0 -c \"command with args\" (Required)"
-    echo "Example: $0 -c \"bacalhau server --config=...\""
+    echo "Example: $0 -c \"bacalhau server --config=/path/to/config.yaml --nats-tls-ca=/path/to/ca.crt\""
     exit 1
 }

+# Handle cleanup on script exit
+cleanup() {
+    echo "Cleaning up..."
+    pkill dockerd
+    exit "${1:-0}"
+}
+
+# Set up signal traps
+trap 'cleanup 1' SIGTERM SIGINT

52-53: Consider making Docker daemon path configurable.

The script assumes dockerd-entrypoint.sh is in PATH. Consider making this configurable for different environments.

+# Docker daemon entrypoint script
+DOCKER_ENTRYPOINT="${DOCKER_ENTRYPOINT:-dockerd-entrypoint.sh}"
+
 # Start the first docker daemon
-dockerd-entrypoint.sh &
+${DOCKER_ENTRYPOINT} &

1-81: Add security considerations for TLS testing.

Since this script is part of the NATS TLS support implementation, consider adding validation for TLS-related parameters in the custom command to prevent misconfiguration during testing:

  1. Verify that TLS certificate paths exist when TLS is enabled
  2. Check certificate permissions
  3. Validate certificate expiration dates

Would you like me to provide an implementation for these security checks?

test_integration/run_locally.sh (4)

11-18: Adjust exit code for help display.

The usage function should exit with code 0 when help is explicitly requested via -h, as this is not an error condition. Exit code 1 should be reserved for error scenarios.

 usage() {
     echo "Usage: ${0} [-s] [-t TEST_NAME] [-c]"
     echo "  -s: Skip compilation"
     echo "  -t TEST_NAME: Run specific test by name"
     echo "  -c: Compile only (don't run tests)"
     echo "  -h: Show this help message"
-    exit 1
+    # Exit 0 if help was requested, 1 for invalid usage
+    if [ "${1:-}" = "help" ]; then
+        exit 0
+    else
+        exit 1
+    fi
 }

21-39: Document the shellcheck disable directive.

While the command-line parsing is well-implemented, please add a comment explaining why SC2249 is disabled. This helps maintainers understand the reasoning behind bypassing the shellcheck warning.

-    # shellcheck disable=SC2249
+    # shellcheck disable=SC2249  # Explain: This rule suggests using case instead of if/elif for pattern matching,
+    # but we're already using case here. This might be a false positive.

68-77: Add error handling and improve test output.

The test running function could benefit from better error handling and more informative output.

 run_tests() {
+    local test_exit_code
+
     if [[ -n "${SPECIFIC_TEST}" ]]; then
-        echo "Running specific test: ${SPECIFIC_TEST}"
+        echo "=== Running specific test: ${SPECIFIC_TEST} ==="
         go test -v -run "${SPECIFIC_TEST}"
+        test_exit_code=$?
     else
-        echo "Running all tests"
+        echo "=== Running all tests ==="
         go test -v -count=1 ./...
+        test_exit_code=$?
     fi
+
+    if [ ${test_exit_code} -ne 0 ]; then
+        echo "=== Tests failed with exit code: ${test_exit_code} ==="
+        return ${test_exit_code}
+    fi
+    echo "=== Tests completed successfully ==="
 }

Line range hint 1-97: Consider organizing tests into focused suites.

Given that this PR introduces NATS TLS support, consider organizing the integration tests into focused suites (e.g., TLS tests, non-TLS tests) to allow for more targeted testing scenarios. This would make it easier to validate specific functionality and maintain test coverage as features evolve.

test_integration/4_orchestrator_config_override_and_flag_and_config_flag_suite_test.go (2)

76-91: Consider adding WebUI TLS configuration validation

While the test effectively validates basic WebUI configuration, consider adding assertions for TLS-related settings if the WebUI supports secure communication. This would align with the PR's objective of implementing TLS support.

Example additions:

// Validate TLS configuration if applicable
webUITLSEnabled, err := agentConfigJSON.Query("$.WebUI.TLS.Enabled")
s.Require().NoError(err)
// Add appropriate assertion based on expected configuration

Line range hint 1-91: Consider expanding test coverage for TLS configurations

While the test suite effectively validates configuration overrides and the changes improve code quality, consider expanding it to include comprehensive TLS configuration testing scenarios:

  1. Validation of TLS certificate paths and configurations
  2. Testing of TLS fallback behavior
  3. Verification of TLS-only mode
  4. Testing invalid certificate scenarios

This would ensure robust coverage of the new TLS functionality being introduced.

test_integration/6_jobs_basic_runs_scenarios_suite_test.go (2)

11-11: Remove unnecessary empty line change.

This change appears to be unintentional and doesn't affect functionality.


Line range hint 42-50: Consider adding TLS-specific test scenarios.

Given that this PR introduces TLS support for NATS communication, consider adding test cases that specifically verify:

  1. Job execution with TLS enabled
  2. Job execution with custom CA certificates
  3. Error handling for invalid certificates

Would you like me to help create additional test cases for TLS scenarios?

test_integration/5_orchestrator_no_config_suite_test.go (1)

Line range hint 54-74: Consider adding TLS configuration assertions.

Given that this PR introduces NATS TLS support, this test should verify the default TLS behavior when no configuration is provided. Consider adding assertions to verify:

  • The default non-TLS NATS configuration
  • The absence of TLS-related fields in the default configuration

Add these assertions after line 74:

    apiPort := int(unmarshalledOutputMap["API"].(map[string]interface{})["Port"].(float64))
    s.Require().Equalf(1234, apiPort, "Incorrect API.Port, got: %d", apiPort)
+
+   // Verify NATS TLS configuration defaults
+   natsConfig, exists := unmarshalledOutputMap["NATS"].(map[string]interface{})
+   s.Require().True(exists, "NATS configuration should exist")
+   
+   // Verify TLS is disabled by default
+   tlsConfig, exists := natsConfig["TLS"].(map[string]interface{})
+   s.Require().False(exists, "TLS configuration should not exist in default config")
test_integration/8_nats_tls_custom_ca_suite_test.go (3)

16-25: Consider adding documentation for the test suite

While the implementation is correct, adding documentation comments would help explain the purpose of this test suite and its relationship to the NATS TLS feature.

Add documentation like this:

+// NatsTLSCustomCASuite tests NATS TLS functionality with custom CA certificates.
+// It verifies that:
+// - TLS connections with custom CA certificates work correctly
+// - Non-TLS connections are rejected
+// - Regular operations work with TLS enabled
 type NatsTLSCustomCASuite struct {
     BaseDockerComposeTestSuite
 }

27-48: Add config file existence validation

While the setup is well-structured, it's recommended to validate the existence of configuration files before using them.

Consider adding validation:

 func (s *NatsTLSCustomCASuite) SetupSuite() {
     rawDockerComposeFilePath := "./common_assets/docker_compose_files/orchestrator-and-compute-custom-startup.yml"
+    // Validate file existence
+    files := []string{
+        rawDockerComposeFilePath,
+        s.commonAssets("nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml"),
+        s.commonAssets("nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml"),
+    }
+    for _, file := range files {
+        if _, err := os.Stat(file); os.IsNotExist(err) {
+            s.T().Fatalf("Required config file not found: %s", file)
+        }
+    }

50-71: Consider environment-specific timeout configuration

The 30-second timeout might be too short for slower CI environments or resource-constrained systems.

Consider making the timeout configurable:

+    // Default timeout that can be overridden via environment variable
+    timeout := 30 * time.Second
+    if envTimeout := os.Getenv("BACALHAU_TEST_TIMEOUT"); envTimeout != "" {
+        if d, err := time.ParseDuration(envTimeout); err == nil {
+            timeout = d
+        }
+    }
-    _, err = s.waitForJobToComplete(jobID, 30*time.Second)
+    _, err = s.waitForJobToComplete(jobID, timeout)
pkg/config/types/orchestrator.go (3)

14-15: Enhance NATS field documentation to mention TLS capabilities.

The current comment could be more specific about the TLS capabilities being introduced.

Consider updating the comment to:

-	// NATS specifies the NATS related configuration on the orchestrator.
+	// NATS specifies the NATS related configuration on the orchestrator, including TLS settings for secure communication.

28-31: Enhance security documentation for TLS key and cert.

The comments for TLS key and certificate should include security recommendations.

Consider updating the comments:

-	// ServerTLSKey specifies the private key file path given to NATS server to serve TLS connections.
+	// ServerTLSKey specifies the private key file path for TLS connections.
+	// The file should be in PEM format and readable only by the orchestrator process.
 	ServerTLSKey string `yaml:"ServerTLSKey,omitempty" json:"ServerTLSKey,omitempty"`
-	// ServerTLSCert specifies the certificate file path given to NATS server to serve TLS connections.
+	// ServerTLSCert specifies the certificate file path for TLS connections.
+	// The certificate should be in PEM format and signed by a trusted CA.
 	ServerTLSCert string `yaml:"ServerTLSCert,omitempty" json:"ServerTLSCert,omitempty"`

35-36: Clarify CA certificate requirements and chain of trust.

The CACert field comment should provide more details about the chain of trust and certificate requirements.

Consider updating the comment:

-	// CACert specifies the CA file path that the orchestrator node trusts when connecting to NATS server.
+	// CACert specifies the Certificate Authority (CA) file path for verifying client certificates.
+	// If not specified, the system's default trust store will be used.
+	// The CA certificate should be in PEM format and contain the complete chain of trust.
pkg/publicapi/client/v2/options.go (3)

Line range hint 70-76: LGTM! Consider adding more documentation.

The updated comment better clarifies that this is specifically for server TLS CA certificates. This aligns well with the PR's objective of implementing NATS TLS support.

Consider adding examples in the comment to show typical usage patterns, such as:

// Example:
//   client.WithCACertificate("/path/to/ca.crt")

Line range hint 134-164: Consider enhancing error handling and security measures.

While the TLS implementation is solid, there are a few areas for improvement:

  1. The panic in getTLSTransport when reading the CA certificate could be handled more gracefully.
  2. Consider adding validation for the CACert file path before attempting to read it.
  3. The implementation could be extended to support client certificates for mutual TLS authentication.

Here's a suggested improvement for the error handling:

 func getTLSTransport(config *Config) *http.Transport {
 	tr := &http.Transport{}
 
 	if !config.TLS.UseTLS {
 		return tr
 	}
 
 	if config.TLS.CACert != "" {
+		// Validate file exists and is readable
+		if _, err := os.Stat(config.TLS.CACert); err != nil {
+			// Log error and fall back to system certificates
+			return tr
+		}
+
 		caCert, err := os.ReadFile(config.TLS.CACert)
 		if err != nil {
-			// unreachable: we already checked that the file exists at CLI startup
-			// if it has gone missing in the meantime then something is very wrong
-			newErr := errors.Wrap(err, fmt.Sprintf("Error: unable to read CA certificate: %s", config.TLS.CACert))
-			panic(newErr.Error())
+			// Log error and fall back to system certificates
+			return tr
 		}
 
 		caCertPool := x509.NewCertPool()
-		caCertPool.AppendCertsFromPEM(caCert)
+		if !caCertPool.AppendCertsFromPEM(caCert) {
+			// Log error: invalid certificate
+			return tr
+		}

Line range hint 25-39: Consider adding TLS configuration validation.

The Config struct could benefit from a validation method to ensure TLS settings are consistent and secure.

Consider adding a validation method:

// Validate checks if the configuration is valid
func (c *Config) Validate() error {
    if c.TLS.UseTLS && c.TLS.CACert != "" {
        if _, err := os.Stat(c.TLS.CACert); err != nil {
            return fmt.Errorf("CA certificate file not accessible: %w", err)
        }
    }
    return nil
}
pkg/config_legacy/types/node.go (1)

68-68: LGTM! Consider adding documentation.

The rename from CACert to ServerTLSCACert improves clarity by explicitly indicating its purpose for server TLS CA certificates.

Consider adding a comment to document that this field specifies the CA certificate used to verify the server's TLS certificate:

+	// ServerTLSCACert specifies the path to the CA certificate used to verify
+	// the server's TLS certificate. If not specified, the system's default
+	// trust store will be used.
	CACert string `yaml:"ServerTLSCACert"`
pkg/nats/transport/nats_config_test.go (1)

168-181: Document the default TLS timeout value.

While the test correctly validates the default timeout behavior, consider adding a comment explaining why 10 seconds was chosen as the default timeout value. This helps future maintainers understand the rationale behind this specific duration.

 func (suite *NATSTransportConfigSuite) TestNATSTransportDefaultTLSTimeoutConfig() {
+	// When ServerTLSTimeout is set to 0, it should default to 10 seconds
+	// which is a reasonable timeout for most TLS handshakes while preventing
+	// indefinite waiting in case of connection issues
 	natsConfig := NATSTransportConfig{
test_integration/utils/test_utils.go (1)

54-60: LGTM! Good addition of image verification.

The added verification step improves test reliability by ensuring the Docker image was successfully created. The error message is clear and includes all necessary details.

Consider wrapping the error in a more specific error type to help with error handling in test cases:

 if err != nil {
-    return fmt.Errorf("unable to find image  %s:%s. Error: %v", imageName, imageTag, err)
+    return fmt.Errorf("image build verification failed for %s:%s: %w", imageName, imageTag, err)
 }

This change:

  • Uses %w to wrap the original error for better error handling
  • Improves the error message clarity
  • Makes it easier to handle specific error cases in tests
pkg/nats/transport/nats.go (4)

62-62: Typo in comment: 'Use' should be 'Used'

The comment on line 62 has a typo. It should read 'Used by the Nats Client when node acts as compute'.

Apply this diff to fix the typo:

-	// Use by the Nats Client when node acts as compute
+	// Used by the Nats Client when node acts as compute

87-88: Use idiomatic string comparison for empty checks

In Go, it's more conventional to check if a string is empty using == "" rather than len(s) > 0. Consider updating the checks on lines 87 and 88 for clarity and readability.

Apply this diff to make the change:

-	serverCertProvided := len(c.ServerTLSCert) > 0
-	serverKeyProvided := len(c.ServerTLSKey) > 0
+	serverCertProvided := c.ServerTLSCert != ""
+	serverKeyProvided := c.ServerTLSKey != ""

104-106: Set default TLS timeout for non-positive values

Currently, the default ServerTLSTimeout is set when the value is zero, but an error is raised if it's negative. To enhance robustness, consider setting the default timeout for any non-positive values.

Apply this diff to adjust the validation:

 if serverCertProvided && serverKeyProvided && c.ServerTLSTimeout <= 0 {
-	c.ServerTLSTimeout = NATSServerDefaultTLSTimeout
+	c.ServerTLSTimeout = NATSServerDefaultTLSTimeout
 }

269-278: Review conditional logic for TLS client configuration

The conditions in CreateClient (lines 271-278) might not cover all TLS configuration scenarios, especially when some TLS fields are missing. To ensure all possible configurations are handled correctly, consider refactoring the logic for clarity and completeness.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between f68f794 and d76252c.

⛔ Files ignored due to path filters (1)
  • test_integration/go.sum is excluded by !**/*.sum
📒 Files selected for processing (40)
  • .cspell/custom-dictionary.txt (1 hunks)
  • .gitignore (1 hunks)
  • .pre-commit-config.yaml (1 hunks)
  • pkg/config/types/compute.go (1 hunks)
  • pkg/config/types/generated_constants.go (2 hunks)
  • pkg/config/types/generated_descriptions.go (2 hunks)
  • pkg/config/types/orchestrator.go (2 hunks)
  • pkg/config_legacy/types/generated_descriptions.go (2 hunks)
  • pkg/config_legacy/types/node.go (1 hunks)
  • pkg/nats/transport/nats.go (6 hunks)
  • pkg/nats/transport/nats_config_test.go (2 hunks)
  • pkg/node/node.go (1 hunks)
  • pkg/publicapi/client/v2/options.go (1 hunks)
  • test_integration/1_orchestrator_basic_config_suite_test.go (1 hunks)
  • test_integration/4_orchestrator_config_override_and_flag_and_config_flag_suite_test.go (1 hunks)
  • test_integration/5_orchestrator_no_config_suite_test.go (1 hunks)
  • test_integration/6_jobs_basic_runs_scenarios_suite_test.go (1 hunks)
  • test_integration/7_nats_tls_default_truststore_suite_test.go (1 hunks)
  • test_integration/8_nats_tls_custom_ca_suite_test.go (1 hunks)
  • test_integration/README.md (1 hunks)
  • test_integration/base_suite.go (2 hunks)
  • test_integration/common_assets/certificates/generated_assets/bacalhau-orchestrator-node.crt (1 hunks)
  • test_integration/common_assets/certificates/generated_assets/bacalhau-orchestrator-node.key (1 hunks)
  • test_integration/common_assets/certificates/nats_custom/nats-server.crt (1 hunks)
  • test_integration/common_assets/certificates/nats_custom/nats-server.key (1 hunks)
  • test_integration/common_assets/certificates/nats_custom/nats_root_ca.crt (1 hunks)
  • test_integration/common_assets/certificates/nats_custom/nats_root_ca.key (1 hunks)
  • test_integration/common_assets/docker_compose_files/orchestrator-and-compute-custom-startup.yml (1 hunks)
  • test_integration/common_assets/dockerfiles/Dockerfile-JumpboxNode (1 hunks)
  • test_integration/common_assets/nodes_configs/7_tls_enabled_compute.yaml (1 hunks)
  • test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml (1 hunks)
  • test_integration/common_assets/nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml (1 hunks)
  • test_integration/common_assets/nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml (1 hunks)
  • test_integration/common_assets/scripts/compute_node_with_start_cmd.sh (1 hunks)
  • test_integration/go.mod (2 hunks)
  • test_integration/init_test.go (1 hunks)
  • test_integration/run_locally.sh (1 hunks)
  • test_integration/utils/dynamic_json.go (1 hunks)
  • test_integration/utils/test_utils.go (2 hunks)
  • test_integration/utils/yaml_template_engine.go (2 hunks)
✅ Files skipped from review due to trivial changes (7)
  • .cspell/custom-dictionary.txt
  • .gitignore
  • test_integration/common_assets/certificates/generated_assets/bacalhau-orchestrator-node.crt
  • test_integration/common_assets/certificates/nats_custom/nats-server.crt
  • test_integration/common_assets/certificates/nats_custom/nats_root_ca.crt
  • test_integration/init_test.go
  • test_integration/utils/yaml_template_engine.go
🧰 Additional context used
🪛 Gitleaks
test_integration/common_assets/certificates/generated_assets/bacalhau-orchestrator-node.key

1-52: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.

(private-key)

test_integration/common_assets/certificates/nats_custom/nats-server.key

1-52: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.

(private-key)

test_integration/common_assets/certificates/nats_custom/nats_root_ca.key

1-52: Identified a Private Key, which may compromise cryptographic security and sensitive data encryption.

(private-key)

🔇 Additional comments (38)
test_integration/common_assets/nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml (2)

1-3: LGTM: Basic configuration looks appropriate.

The UUID name provider and API port configuration are suitable for integration testing purposes.


10-13: Verify the CA certificate path and add validation checks.

The CA certificate path is hardcoded to a specific location within the container. Consider:

  1. Adding a comment about the expected certificate format
  2. Verifying the certificate exists during test setup

Run the following script to verify the certificate path and format:

Consider adding a comment about the expected certificate format:

  Nats:
    # The paths for these certs are within the container after
    # the assets has been copied during image creation
+   # Expected format: PEM-encoded X.509 certificate
    CACert: "/bacalhau_integration_tests/common_assets/certificates/nats_custom/nats_root_ca.crt"
✅ Verification successful

Certificate path and format have been verified successfully

The verification script confirmed that:

  • The certificate file exists at the expected location in test assets
  • The file contains a valid PEM-encoded certificate (starts with "BEGIN CERTIFICATE")

The suggested comment about the certificate format would still be valuable for documentation purposes, but the core concern about certificate validation has been addressed.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify CA certificate existence and format

# Test 1: Check if the certificate file exists in the test assets
CERT_PATH="common_assets/certificates/nats_custom/nats_root_ca.crt"
if ! fd -t f "nats_root_ca.crt" test_integration/; then
    echo "Error: Certificate file not found in test assets"
    exit 1
fi

# Test 2: Verify it's a valid certificate file (should start with BEGIN CERTIFICATE)
if ! head -n 1 "test_integration/$CERT_PATH" | grep -q "BEGIN CERTIFICATE"; then
    echo "Error: Invalid certificate format"
    exit 1
fi

Length of output: 300

test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml (2)

13-13: Review the TLS timeout value.

The ServerTLSTimeout of 15 seconds seems high for a local test environment. Consider:

  1. Documenting why 15 seconds was chosen
  2. Making this value configurable through environment variables
  3. Adding a comment about recommended values for different environments
#!/bin/bash
# Description: Check for other timeout configurations to ensure consistency
# Expect: Similar timeout values in related configurations

rg "TLSTimeout:" 

4-7: Add a warning comment about the test-only token.

The hardcoded token "i_am_very_secret_token" should be clearly marked as a test-only value to prevent accidental use in production.

test_integration/common_assets/nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml (1)

9-14: Verify TLS certificate paths and consider timeout configuration

The TLS configuration looks good, but consider:

  1. Adding runtime validation to ensure certificate files exist
  2. Making the timeout configurable for different test environments
✅ Verification successful

Let me gather more information about the certificate validation and test setup.


TLS configuration is properly set up and validated

The initial concerns have been addressed by the evidence:

  • All required certificates (crt, key, and root CA) are present in the assets directory
  • The certificates are used consistently across test configurations
  • The 15-second timeout is standardized across TLS-enabled configurations and suitable for test environments
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify that the certificate files exist in other test configurations
rg -l "nats_custom/nats-server\.(crt|key)" test_integration/
rg -l "nats_custom/nats_root_ca\.crt" test_integration/

Length of output: 443


Script:

#!/bin/bash
# Check if there's any certificate validation logic in the test files
rg -A 5 "nats_root_ca\.crt" test_integration/8_nats_tls_custom_ca_suite_test.go

# Check if certificates are actually present in the assets directory
fd . test_integration/common_assets/certificates/nats_custom/

# Look for any timeout configuration patterns
rg "ServerTLSTimeout" test_integration/

Length of output: 866

.pre-commit-config.yaml (1)

11-11: LGTM! Verify the broader exclusion pattern.

The updated exclusion pattern for the detect-private-key hook makes sense for TLS testing purposes.

Let's verify what additional files would be excluded with this change:

✅ Verification successful

The broader certificate exclusion pattern is safe and appropriate

The verification confirms this is a safe change because:

  • The certificates are strictly for testing purposes, located in the test directory
  • The README confirms these are test certificate generation scripts
  • No usage of these certificates was found outside of test directories
  • The broader exclusion pattern is necessary to cover both generated and custom NATS certificates used in TLS testing
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check what new files would be excluded by the broader pattern
# that weren't covered by the original pattern

# List all files under the certificates directory
fd . "test_integration/common_assets/certificates" \
  --exclude "generated_assets" \
  --type f

Length of output: 561


Script:

#!/bin/bash
# Let's check the contents of the README to understand the purpose and usage of these certificates
cat test_integration/common_assets/certificates/README.md

# Also check if these certificates are used only in tests
rg -l "nats-server\.(crt|key)" --type-add 'yaml:*.{yaml,yml}' -g '!test*' -g '!testdata'
rg -l "nats_root_ca\.(crt|key)" --type-add 'yaml:*.{yaml,yml}' -g '!test*' -g '!testdata'

Length of output: 595

test_integration/README.md (1)

41-46: LGTM! Clear documentation of command-line options.

The new command-line options are well documented with clear examples showing each use case.

test_integration/common_assets/docker_compose_files/orchestrator-and-compute-custom-startup.yml (1)

17-17: Justify the use of privileged mode.

Running containers in privileged mode grants extensive permissions and poses security risks. Please document why privileged mode is necessary for the orchestrator node.

pkg/config/types/compute.go (2)

14-26: Implementation aligns well with TLS support objectives.

The addition of NATS TLS configuration capabilities is well-structured and follows existing patterns in the codebase. The implementation provides a clean way for compute nodes to specify their TLS settings while maintaining backward compatibility through the use of omitempty tags.


23-26: Consider adding path validation and documentation.

While the implementation is correct, consider these security and usability improvements:

  1. Add validation for the CA certificate file path to ensure it exists and is readable
  2. Document the expected format and requirements for the CA certificate file

Let's check how certificate paths are handled elsewhere in the codebase:

✅ Verification successful

Let me gather more information about certificate validation patterns in the codebase.


Certificate path validation is already handled appropriately in the codebase

Based on the findings:

  1. The codebase has a validateFile function in cmd/util/flags/cliflags/config.go that handles file validation
  2. The actual CA certificate validation is performed at runtime in pkg/publicapi/client/v2/options.go where the certificate is read and verified
  3. Similar CACert fields are consistently used across the codebase (in both compute and orchestrator configurations) without additional validation at the struct level

The current implementation aligns with the codebase's patterns for handling certificate paths, where validation occurs during the operational phase rather than at the configuration level. This is a reasonable approach as certificate paths may not exist during configuration parsing but only need to be valid when establishing connections.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar certificate path handling patterns
rg -g '!{vendor,third_party}/**' -A 3 'CACert.*string.*`yaml:"' 

# Look for certificate validation patterns
ast-grep --pattern 'func $_(path string) error {
  $$$
  os.Open($path)
  $$$
}'

Length of output: 883


Script:

#!/bin/bash
# Search for certificate validation functions
rg -g '!{vendor,third_party}/**' -B 2 -A 5 'func.*[Vv]alidat.*[Cc]ertificate'

# Look for file validation patterns
rg -g '!{vendor,third_party}/**' -B 2 -A 5 'func.*[Vv]alidat.*[Ff]ile'

# Search for actual usage of CACert fields
rg -g '!{vendor,third_party}/**' -B 2 -A 5 '\.[Cc][Aa][Cc]ert'

Length of output: 11367

test_integration/utils/dynamic_json.go (1)

1-9: LGTM! Clean package structure and imports.

The package name and imports are well-organized and appropriate for the utility's purpose.

test_integration/run_locally.sh (1)

5-10: LGTM! Clear and well-structured variable initialization.

The default values are properly initialized and follow shell scripting best practices.

test_integration/7_nats_tls_default_truststore_suite_test.go (5)

1-15: LGTM! Clean imports organization.

The imports are well-organized and all are being used in the code.


45-48: LGTM! Clean teardown implementation.

The teardown properly delegates to the base class after logging.


73-75: LGTM! Standard test suite runner.

The test runner follows the standard testify suite pattern.


27-43: Consider making configuration paths more flexible.

While the implementation is solid, consider making the paths to configuration files and Docker compose files configurable through environment variables or test parameters for better flexibility.

#!/bin/bash
# Verify the existence of referenced configuration files
echo "Checking for required configuration files..."
ls -l test_integration/common_assets/docker_compose_files/orchestrator-and-compute-custom-startup.yml \
     test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml \
     test_integration/common_assets/nodes_configs/7_tls_enabled_compute.yaml

16-25: Verify the GlobalRunIdentifier variable.

The code references globalTestExecutionId which should be defined elsewhere in the package.

✅ Verification successful

Let me try a different search pattern to locate the variable definition.


globalTestExecutionId is properly defined and used across test suites

The variable is correctly defined in init_test.go and is used consistently across all test suites for test run identification. The initialization happens at the package level, making it available to all test suites in the package.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for globalTestExecutionId definition
rg "globalTestExecutionId.*=|var globalTestExecutionId|const globalTestExecutionId" test_integration/

Length of output: 743


Script:

#!/bin/bash
# Search for globalTestExecutionId with broader pattern
rg "globalTestExecutionId" test_integration/

# Also search for test suite initialization patterns
rg "func init|var global" test_integration/

Length of output: 1499

test_integration/4_orchestrator_config_override_and_flag_and_config_flag_suite_test.go (2)

57-73: LGTM! Improved JSON handling and validation

The changes enhance the code by:

  • Using a more robust JSON parsing approach
  • Implementing proper error handling
  • Using clean JSON query syntax for field access
  • Validating configuration override behavior

47-54: Verify TLS configuration validation

Given that this PR implements NATS TLS support, consider adding test cases to verify that TLS-related flags and configurations are properly handled by the node list command.

test_integration/6_jobs_basic_runs_scenarios_suite_test.go (1)

Line range hint 1-108: Verify TLS configuration in orchestrator config.

The test suite references an orchestrator config file at nodes_configs/6_orchestrator_config.yaml. Since this PR adds TLS support for NATS, we should verify that this configuration properly tests both TLS and non-TLS scenarios.

test_integration/5_orchestrator_no_config_suite_test.go (1)

54-60: LGTM! Command execution improvement.

The change from using curl to directly using the bacalhau CLI is a good improvement that:

  • Reduces external dependencies
  • Provides more consistent and reliable test execution
  • Aligns with the command usage in other test suites
test_integration/8_nats_tls_custom_ca_suite_test.go (3)

1-15: LGTM: Clean imports structure

The imports are well-organized and all dependencies are properly utilized in the code.


73-84: LGTM: Good negative test case

The test properly verifies that non-TLS connections are rejected with the expected error message.


100-102: LGTM: Standard suite runner

The suite runner follows the standard testify/suite pattern.

pkg/config_legacy/types/node.go (1)

Line range hint 1-1: Consider migration strategy for legacy package.

The file resides in pkg/config_legacy, but implements core TLS functionality. Consider creating a migration plan to move this configuration to the main config package, ensuring a clean separation between legacy and current code.

Let's check if there are any new TLS configurations in the main config package:

test_integration/1_orchestrator_basic_config_suite_test.go (1)

43-66: LGTM on the refactoring, but consider adding TLS-specific validations.

The refactoring to use convertStringToDynamicJSON and specific JSON path queries improves the test's robustness. However, given this PR's objective to add NATS TLS support, consider adding validations to ensure nodes are communicating over a secure channel when TLS is enabled.

Let's verify if there are any TLS-specific tests elsewhere:

pkg/nats/transport/nats_config_test.go (2)

116-150: LGTM! Comprehensive TLS configuration validation tests.

The new test cases effectively validate critical TLS configuration requirements:

  • Ensuring both certificate and key are provided together
  • Preventing negative timeout values

These tests align well with security best practices for TLS implementation.


183-202: Consider adding path validation tests.

The happy path test verifies field preservation but doesn't validate path existence or accessibility. Consider adding test cases that:

  1. Use temporary test certificates/keys
  2. Test with both valid and invalid file paths
  3. Verify file permission requirements

This would ensure robust handling of TLS file configurations in production environments.

pkg/config/types/generated_constants.go (2)

27-27: LGTM: Compute node CA certificate configuration.

The constant follows the established naming convention and is appropriately placed among other compute-related configurations.


88-91: LGTM: Comprehensive NATS TLS configuration for orchestrator.

The constants provide a complete set of TLS configuration options including CA certificate, server certificate, private key, and timeout settings. They follow the established naming convention and are properly placed among other orchestrator-related configurations.

Let's verify these constants are used consistently in the configuration structs and their corresponding YAML/env var bindings:

✅ Verification successful

NATS TLS configuration constants are properly integrated and consistent

The verification confirms that:

  • The constants are properly mapped to struct fields in OrchestratorNats
  • Each constant has corresponding field with matching YAML tags and proper documentation
  • The fields cover all TLS aspects: CA cert, server cert, private key, and timeout
  • The descriptions in generated_descriptions.go match the struct field documentation

The configuration structure is complete and consistent with the established patterns in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the new constants are properly used in configuration structs and bindings

# Test 1: Check for corresponding struct fields in compute.go and orchestrator.go
echo "Checking for corresponding struct fields..."
rg -A 2 "type (Compute|Orchestrator)Nats struct" pkg/config/types/

# Test 2: Check for YAML tag bindings
echo "Checking for YAML tag bindings..."
rg "yaml:.*NATS\.(CACert|ServerTLSCert|ServerTLSKey|ServerTLSTimeout)" pkg/config/types/

# Test 3: Check for environment variable bindings
echo "Checking for environment variable bindings..."
rg "env:.*NATS_(CA_CERT|SERVER_TLS)" pkg/config/types/

Length of output: 1081


Script:

#!/bin/bash
# Let's check the complete struct definitions and their fields
echo "Checking OrchestratorNats struct definition..."
rg -A 10 "type OrchestratorNats struct" pkg/config/types/

echo -e "\nChecking usage of the new constants..."
rg "OrchestratorNATSServerTLS|OrchestratorNATSCACert" pkg/config/types/

echo -e "\nChecking YAML/env bindings with broader pattern..."
rg "yaml:\".*TLS.*\".*env:" pkg/config/types/

Length of output: 2815

test_integration/base_suite.go (1)

8-11: LGTM: Required imports reinstated.

The reinstated imports are necessary for both existing functionality and the new convertStringToDynamicJSON method.

test_integration/go.mod (1)

6-6: LGTM on the new dependencies.

The addition of github.com/PaesslerAG/jsonpath and its indirect dependency github.com/PaesslerAG/gval are appropriate for the integration tests. These libraries will help in validating JSON responses and configurations in the new NATS TLS integration tests.

Also applies to: 24-24

pkg/config_legacy/types/generated_descriptions.go (3)

Line range hint 1-150: Add descriptions for NATS TLS configuration entries.

Based on the PR objectives which introduce NATS TLS support, this configuration map appears to be missing entries for the new NATS TLS configuration options (e.g., ComputeNATSCACertKey, OrchestratorNATSCACertKey, etc.) that were mentioned in the AI summary. These descriptions are essential for users to understand how to configure TLS for NATS communication.

#!/bin/bash
# Search for NATS-related configuration keys in the codebase
echo "Searching for NATS configuration keys..."
rg "ComputeNATS|OrchestratorNATS"

# Search for NATS-related constants
echo "Searching for NATS-related constants..."
ast-grep --pattern 'const $_ = "ComputeNATS" | const $_ = "OrchestratorNATS"'

Line range hint 1-3: Update the source generator instead of this file.

This is a generated file (indicated by "Code generated by go generate; DO NOT EDIT"). Any changes to descriptions should be made in the source generator rather than directly in this file.

#!/bin/bash
# Find the generator file
echo "Searching for go:generate directives related to descriptions..."
rg "//go:generate.*description"

# Find potential source files
echo "Searching for potential source files..."
fd -e go -e gotmpl "description|config.*gen"

11-11: Verify TLS auto-enabling conditions.

The descriptions state that TLS is "Auto enabled if Insecure or ServerTLSCACert are specified". This seems to differ from the AI summary which mentioned "CACert" was changed to "ServerTLSCACert". Let's verify this change is consistent with the actual implementation.

Also applies to: 21-21

✅ Verification successful

The TLS auto-enabling description is accurate and consistent with implementation

Based on the code evidence:

  1. The struct definition in pkg/config_legacy/types/node.go confirms that TLS is "Auto enabled if Insecure or CACert are specified"
  2. The implementation in pkg/publicapi/client/v2/options.go shows both WithCACertificate and WithInsecureTLS options that can trigger TLS usage
  3. The description in generated_descriptions.go correctly states "Auto enabled if Insecure or ServerTLSCACert are specified", which matches the actual field name ServerTLSCACert used in the codebase

The description accurately reflects the implementation and the name change from CACert to ServerTLSCACert is consistently applied throughout the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for TLS auto-enabling logic in the codebase
echo "Searching for TLS auto-enabling conditions..."
rg -A 5 "func.*UseTLS|TLS.*enabled" 

# Search for CACert/ServerTLSCACert usage
echo "Searching for CACert/ServerTLSCACert usage..."
rg "CACert|ServerTLSCACert"

Length of output: 596


Script:

#!/bin/bash
# Let's try a different approach to find TLS-related code

# Search for TLS configuration in the codebase
echo "Searching for TLS configuration..."
rg -l "TLS.*Config|UseTLS"

# Look for the specific files that might contain the implementation
echo -e "\nSearching for specific TLS implementation..."
ast-grep --pattern 'if $_ || $_ {
  $$.UseTLS = true
  $$$
}'

# Search for ServerTLSCACert in configuration
echo -e "\nSearching for ServerTLSCACert in configuration..."
rg "ServerTLSCACert" -A 3 -B 3

# Look for any TLS auto-enabling logic
echo -e "\nSearching for TLS auto-enabling logic..."
ast-grep --pattern 'func $_($$) bool {
  $$$
  UseTLS $$$
  $$$
}'

Length of output: 8536


Script:

#!/bin/bash
# Let's check the actual implementation of TLS auto-enabling logic

# Search for TLS configuration struct and its usage
echo "Searching for TLS configuration implementation..."
rg -A 10 "type.*TLS.*Config struct" 

# Look for places where UseTLS is set
echo -e "\nSearching for UseTLS assignments..."
ast-grep --pattern '$$.UseTLS = $_'

# Check if there's any logic in client initialization
echo -e "\nSearching for client initialization logic..."
rg -A 10 "func.*WithCACertificate|func.*WithInsecure"

Length of output: 3694

pkg/config/types/generated_descriptions.go (3)

29-29: LGTM: Clear description for compute node's NATS CA certificate configuration.

The description accurately explains the purpose of the CA certificate file path for compute nodes connecting to the NATS server.


32-32: LGTM: Improved clarity in analytics configuration description.

The description now clearly explains the purpose and impact of disabling analytics.


90-93: LGTM: Comprehensive NATS TLS configuration descriptions for orchestrator.

The descriptions for orchestrator's NATS TLS configuration are clear and cover all necessary aspects:

  • CA certificate for trust verification
  • Server certificate for TLS connections
  • Private key for TLS connections
  • TLS timeout configuration

Compute:
Enabled: true
Orchestrators:
- nats://bacalhau-orchestrator-node:4222
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Update NATS URL scheme for TLS.

The URL scheme nats:// typically indicates non-TLS communication. For TLS-enabled connections, consider using the tls:// scheme to explicitly indicate secure communication.

-    - nats://bacalhau-orchestrator-node:4222
+    - tls://bacalhau-orchestrator-node:4222
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- nats://bacalhau-orchestrator-node:4222
- tls://bacalhau-orchestrator-node:4222

Comment on lines +10 to +11
Auth:
Token: "i_am_very_secret_token"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider using environment variables for sensitive values.

Storing authentication tokens directly in configuration files poses a security risk. Consider using environment variables or secure secrets management.

  Auth:
-    Token: "i_am_very_secret_token"
+    Token: ${COMPUTE_AUTH_TOKEN}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Auth:
Token: "i_am_very_secret_token"
Auth:
Token: ${COMPUTE_AUTH_TOKEN}

Comment on lines +6 to +9
Compute:
Enabled: true
Orchestrators:
- nats://bacalhau-orchestrator-node:4222
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

TLS configuration needs to be added for NATS connection

The compute node configuration in 7_tls_enabled_compute.yaml is missing the required TLS configuration for connecting to the orchestrator. Based on the evidence:

  • The orchestrator node (in other config files) uses TLS with server certificates and CA certificates
  • The compute node with custom CA (8_tls_enabled_compute_with_custom_ca.yaml) includes NATS TLS configuration
  • The current config only has a plain NATS URL without TLS settings

Add the following TLS configuration under the Compute section:

  Nats:
    CACert: "/bacalhau_integration_tests/common_assets/certificates/nats_custom/nats_root_ca.crt"
🔗 Analysis chain

Verify TLS configuration with orchestrator.

The compute node configuration needs to match the orchestrator's TLS settings. Let's verify the compatibility:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check orchestrator's TLS configuration to ensure compatibility

# Look for orchestrator TLS configuration
echo "Searching for orchestrator TLS settings..."
rg -l "tls:" --type yaml

# Check for any TLS-related environment variables in deployment files
echo "Checking for TLS environment variables..."
rg -l "TLS_" --type yaml

# Look for certificate paths in deployment configurations
echo "Checking for certificate configurations..."
rg -l "cert.*path|ca.*path" --type yaml

Length of output: 611


Script:

#!/bin/bash
# Let's examine the TLS configurations in the found files and related orchestrator settings

# Check the content of deployment files for TLS settings
echo "=== Checking docker-compose deployment TLS settings ==="
rg -A 5 -B 5 "TLS_" test_integration/common_assets/docker_compose_files/deployment-with-minio-and-registry.yml docker-compose-deployment/docker-compose.yml

# Check orchestrator configurations in test files
echo "=== Checking orchestrator configurations in test files ==="
fd -e yaml -e yml . test_integration/common_assets/nodes_configs/ -x cat {}

# Look for any NATS configuration with TLS
echo "=== Checking NATS configurations ==="
rg -A 5 -B 5 "nats://" --type yaml

Length of output: 10121

Comment on lines +4 to +7
Compute:
Enabled: true
Orchestrators:
- nats://bacalhau-orchestrator-node:4222
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Verify NATS URL scheme for TLS.

The NATS URL should use the tls:// scheme instead of nats:// when TLS is enabled to explicitly indicate secure communication.

Apply this diff:

  Orchestrators:
-    - nats://bacalhau-orchestrator-node:4222
+    - tls://bacalhau-orchestrator-node:4222
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Compute:
Enabled: true
Orchestrators:
- nats://bacalhau-orchestrator-node:4222
Compute:
Enabled: true
Orchestrators:
- tls://bacalhau-orchestrator-node:4222

Comment on lines 11 to 12
ServerTLSCert: "/bacalhau_integration_tests/common_assets/certificates/generated_assets/bacalhau-orchestrator-node.crt"
ServerTLSKey: "/bacalhau_integration_tests/common_assets/certificates/generated_assets/bacalhau-orchestrator-node.key"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codebase verification

Certificate files need stricter permissions in test environment

The verification revealed that the TLS certificate and key files in the test environment have overly permissive permissions:

  • Key files (.key) are readable by others (644) instead of required 600
  • All files are owned by jailuser:jailuser

While these are test certificates, it's recommended to:

  • Set key files to 600 permissions (owner read/write only)
  • Set certificate files to 644 permissions (world readable)

This maintains good security practices even in test environments and prevents accidental key exposure.

🔗 Analysis chain

Verify certificate path permissions in container.

The TLS certificate and key paths are within the container. Ensure that:

  1. The files have appropriate permissions (600 for key, 644 for cert)
  2. Only necessary users have access to these directories

The previous search might have failed due to special characters. Let's try a more focused search to understand the certificate usage and verify it's only in test contexts.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any other references to these certificate paths
# to ensure they're only used in test contexts

rg "/bacalhau_integration_tests/common_assets/certificates/generated_assets/bacalhau-orchestrator-node" | grep -v "test"

Length of output: 618


Script:

#!/bin/bash
# Let's check for certificate paths in test contexts and their usage
rg -l "/bacalhau_integration_tests/common_assets/certificates/generated_assets"

# Check file permissions in the certificate directory
fd . "test_integration/common_assets/certificates/generated_assets" -t f -x ls -l {}

Length of output: 1099

Comment on lines +75 to +80
// TODO: Unfortunately the "bacalhau agent config" command is
// TODO: unpredictable and returns weird first characters.
// TODO: For now, Trim the first character if it is not "{"
cleanedJsonString := strings.TrimLeftFunc(agentConfigOutput, func(r rune) bool {
return r != '{'
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Critical: Unpredictable command output needs proper handling.

The TODO comments indicate that the bacalhau agent config command produces unpredictable output with "weird first characters". This is concerning because:

  1. It could affect parsing of TLS configurations
  2. The current workaround of trimming characters is fragile
  3. The issue needs proper fixing in the command implementation

Consider:

  1. Opening an issue to track fixing the command output
  2. Adding proper error handling for malformed output
  3. Adding test cases for various malformed outputs

Comment on lines +82 to +87
output, err := s.convertStringToDynamicJSON(cleanedJsonString)
s.Require().NoError(err)

orchestratorEnabled := marshalledOutput.(map[string]interface{})["config"].(map[string]interface{})["Orchestrator"].(map[string]interface{})["Enabled"].(bool)
s.Require().Truef(orchestratorEnabled, "Expected orchestrator to be enabled, got: %t", orchestratorEnabled)
orchestratorEnabled, err := output.Query("$.Orchestrator.Enabled")
s.Require().NoError(err)
s.Require().True(orchestratorEnabled.Bool())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add TLS configuration validation.

Since this PR adds TLS support, this test should verify that the agent configuration includes the expected TLS settings when enabled.

Add TLS-specific checks:

 output, err := s.convertStringToDynamicJSON(cleanedJsonString)
 s.Require().NoError(err)

 orchestratorEnabled, err := output.Query("$.Orchestrator.Enabled")
 s.Require().NoError(err)
 s.Require().True(orchestratorEnabled.Bool())
+
+// Verify TLS configuration when enabled
+tlsEnabled, err := output.Query("$.NATS.TLS.Enabled")
+s.Require().NoError(err)
+if tlsEnabled.Bool() {
+    // Verify required TLS fields are present
+    certPath, err := output.Query("$.NATS.TLS.CertificatePath")
+    s.Require().NoError(err)
+    s.Require().NotEmpty(certPath.String())
+
+    keyPath, err := output.Query("$.NATS.TLS.KeyPath")
+    s.Require().NoError(err)
+    s.Require().NotEmpty(keyPath.String())
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
output, err := s.convertStringToDynamicJSON(cleanedJsonString)
s.Require().NoError(err)
orchestratorEnabled := marshalledOutput.(map[string]interface{})["config"].(map[string]interface{})["Orchestrator"].(map[string]interface{})["Enabled"].(bool)
s.Require().Truef(orchestratorEnabled, "Expected orchestrator to be enabled, got: %t", orchestratorEnabled)
orchestratorEnabled, err := output.Query("$.Orchestrator.Enabled")
s.Require().NoError(err)
s.Require().True(orchestratorEnabled.Bool())
output, err := s.convertStringToDynamicJSON(cleanedJsonString)
s.Require().NoError(err)
orchestratorEnabled, err := output.Query("$.Orchestrator.Enabled")
s.Require().NoError(err)
s.Require().True(orchestratorEnabled.Bool())
// Verify TLS configuration when enabled
tlsEnabled, err := output.Query("$.NATS.TLS.Enabled")
s.Require().NoError(err)
if tlsEnabled.Bool() {
// Verify required TLS fields are present
certPath, err := output.Query("$.NATS.TLS.CertificatePath")
s.Require().NoError(err)
s.Require().NotEmpty(certPath.String())
keyPath, err := output.Query("$.NATS.TLS.KeyPath")
s.Require().NoError(err)
s.Require().NotEmpty(keyPath.String())
}

Comment on lines +219 to +235
func (s *BaseDockerComposeTestSuite) convertStringToDynamicJSON(jsonString string) (*utils.DynamicJSON, error) {
// Cleanup the Json output. Unfortunate that the CLI prints extra
// characters at the beginning and at the end
cleanedJsonString := strings.TrimLeftFunc(jsonString, func(r rune) bool {
return r != '{' && r != '['
})
cleanedJsonString = strings.TrimRightFunc(cleanedJsonString, func(r rune) bool {
return r != '}' && r != ']'
})

parsedDynamicJSON, err := utils.ParseToDynamicJSON(cleanedJsonString)
if err != nil {
return nil, err
}

return parsedDynamicJSON, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve code organization and documentation.

The implementation is solid but could benefit from the following improvements:

  1. Extract the duplicated JSON cleaning logic (shared with unmarshalJSONString) into a private helper method:
+// cleanJSONString removes non-JSON characters from the beginning and end of the string
+func (s *BaseDockerComposeTestSuite) cleanJSONString(jsonString string) string {
+    cleaned := strings.TrimLeftFunc(jsonString, func(r rune) bool {
+        return r != '{' && r != '['
+    })
+    return strings.TrimRightFunc(cleaned, func(r rune) bool {
+        return r != '}' && r != ']'
+    })
+}

+// convertStringToDynamicJSON converts a JSON string to a DynamicJSON object.
+// It handles both object and array JSON types and cleans any non-JSON characters
+// from the input string before parsing.
 func (s *BaseDockerComposeTestSuite) convertStringToDynamicJSON(jsonString string) (*utils.DynamicJSON, error) {
+    if strings.TrimSpace(jsonString) == "" {
+        return nil, fmt.Errorf("empty JSON string provided")
+    }
-    cleanedJsonString := strings.TrimLeftFunc(jsonString, func(r rune) bool {
-        return r != '{' && r != '['
-    })
-    cleanedJsonString = strings.TrimRightFunc(cleanedJsonString, func(r rune) bool {
-        return r != '}' && r != ']'
-    })
+    cleanedJsonString := s.cleanJSONString(jsonString)
     parsedDynamicJSON, err := utils.ParseToDynamicJSON(cleanedJsonString)
     if err != nil {
         return nil, err
     }
     return parsedDynamicJSON, nil
 }
  1. Update unmarshalJSONString to use the new helper:
 func (s *BaseDockerComposeTestSuite) unmarshalJSONString(jsonString string, expectedType JSONResponseType) (interface{}, error) {
-    cleanedJsonString := strings.TrimLeftFunc(jsonString, func(r rune) bool {
-        return r != '{' && r != '['
-    })
-    cleanedJsonString = strings.TrimRightFunc(cleanedJsonString, func(r rune) bool {
-        return r != '}' && r != ']'
-    })
+    cleanedJsonString := s.cleanJSONString(jsonString)

Committable suggestion skipped: line range outside the PR's diff.

@@ -3,6 +3,7 @@ module bacalhau/integration_tests
go 1.23
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incorrect Go version.

The specified Go version 1.23 is invalid as it does not exist yet. The latest stable version is 1.22.

Apply this diff to fix the version:

-go 1.23
+go 1.22
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
go 1.23
go 1.22

pkg/node/node.go Outdated
Comment on lines 357 to 361
ServerTLSCACert: cfg.BacalhauConfig.Orchestrator.NATS.CACert,
ServerTLSCert: cfg.BacalhauConfig.Orchestrator.NATS.ServerTLSCert,
ServerTLSKey: cfg.BacalhauConfig.Orchestrator.NATS.ServerTLSKey,
ServerTLSTimeout: cfg.BacalhauConfig.Orchestrator.NATS.ServerTLSTimeout,
ClientTLSCACert: cfg.BacalhauConfig.Compute.NATS.CACert,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add validation and documentation for TLS configurations.

While the TLS field additions look good, consider the following improvements:

  1. Add validation to ensure:

    • If ServerTLSCert is provided, ServerTLSKey must also be provided
    • ServerTLSTimeout has a reasonable value (not too short or too long)
    • Certificates are valid and not expired
  2. Add documentation comments explaining:

    • The purpose of each TLS field
    • Required format and paths for certificates
    • Expected timeout duration units

Here's a suggested implementation:

 config := &nats_transport.NATSTransportConfig{
     NodeID:                   cfg.NodeID,
     Host:                     cfg.BacalhauConfig.Orchestrator.Host,
     Port:                     cfg.BacalhauConfig.Orchestrator.Port,
     AdvertisedAddress:        cfg.BacalhauConfig.Orchestrator.Advertise,
     AuthSecret:               cfg.BacalhauConfig.Orchestrator.Auth.Token,
     Orchestrators:            cfg.BacalhauConfig.Compute.Orchestrators,
     StoreDir:                 storeDir,
     ClusterName:              cfg.BacalhauConfig.Orchestrator.Cluster.Name,
     ClusterPort:              cfg.BacalhauConfig.Orchestrator.Cluster.Port,
     ClusterPeers:             cfg.BacalhauConfig.Orchestrator.Cluster.Peers,
     ClusterAdvertisedAddress: cfg.BacalhauConfig.Orchestrator.Cluster.Advertise,
     IsRequesterNode:          cfg.BacalhauConfig.Orchestrator.Enabled,
+    // Path to the CA certificate used to verify the server's certificate
     ServerTLSCACert:          cfg.BacalhauConfig.Orchestrator.NATS.CACert,
+    // Path to the server's TLS certificate (required if ServerTLSKey is provided)
     ServerTLSCert:            cfg.BacalhauConfig.Orchestrator.NATS.ServerTLSCert,
+    // Path to the server's TLS private key (required if ServerTLSCert is provided)
     ServerTLSKey:             cfg.BacalhauConfig.Orchestrator.NATS.ServerTLSKey,
+    // Timeout duration for TLS handshake (e.g. "30s", "1m")
     ServerTLSTimeout:         cfg.BacalhauConfig.Orchestrator.NATS.ServerTLSTimeout,
+    // Path to the CA certificate used to verify the client's certificate
     ClientTLSCACert:          cfg.BacalhauConfig.Compute.NATS.CACert,
 }

+// Validate TLS configuration
+if config.ServerTLSCert != "" || config.ServerTLSKey != "" {
+    if config.ServerTLSCert == "" || config.ServerTLSKey == "" {
+        return nil, fmt.Errorf("both ServerTLSCert and ServerTLSKey must be provided")
+    }
+}
+
+// Validate TLS timeout
+if config.ServerTLSTimeout != "" {
+    timeout, err := time.ParseDuration(config.ServerTLSTimeout)
+    if err != nil {
+        return nil, fmt.Errorf("invalid ServerTLSTimeout format: %w", err)
+    }
+    if timeout < 5*time.Second || timeout > 5*time.Minute {
+        return nil, fmt.Errorf("ServerTLSTimeout must be between 5s and 5m")
+    }
+}

Committable suggestion skipped: line range outside the PR's diff.


💡 Codebase verification

Based on the search results, I can now provide a final response. The code shows that certificates are generated with a hardcoded 1-year expiration (NotAfter: time.Now().AddDate(1, 0, 0)) but there's no active validation of certificate expiration during runtime.

Add runtime certificate expiration validation

The current implementation only sets certificate expiration during creation but doesn't validate it during usage. Consider:

  • Adding a certificate validation helper in pkg/lib/crypto/certificate.go to check expiration
  • Implementing periodic certificate expiration checks in the NATS transport layer
  • Adding expiration validation when loading certificates in pkg/nats/transport/nats.go

Example locations needing validation:

  • pkg/lib/crypto/certificate.go: Add expiration check when certificates are loaded/used
  • pkg/nats/transport/nats.go: Add validation when setting up TLS config
  • pkg/node/node.go: Add validation when loading certificates from config
🔗 Analysis chain

Verify TLS certificate expiration.

Consider adding certificate expiration checks to prevent using expired certificates.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the TLS certificate expiration check is implemented elsewhere in the codebase.

# Test: Search for certificate expiration checks
rg -A 5 "certificate.*expir|expir.*certificate|x509.*Not.*After|cert.*valid"

# Test: Search for certificate validation functions
ast-grep --pattern 'func $_(cert $_) {
  $$$
  expir
  $$$
}'

Length of output: 142


Script:

#!/bin/bash
# Let's try a more comprehensive search for TLS certificate validation

# Search for TLS certificate related code
rg -i "tls.*cert|cert.*valid|cert.*check" -A 5

# Search for x509 certificate handling
rg "x509\." -A 5

# Search for specific NATS TLS configuration and validation
rg "nats.*tls|tls.*nats" -A 5

# Look for any certificate loading or verification functions
ast-grep --pattern 'func $_($_) {
  $$$
  LoadCert
  $$$
}'

ast-grep --pattern 'func $_($_) {
  $$$
  VerifyCert
  $$$
}'

Length of output: 66564


Script:

#!/bin/bash
# Let's search for certificate expiration validation in the crypto package
# and any certificate validation functions

# Search for certificate validation and expiration checks in crypto package
rg -A 10 "NotAfter|NotBefore|Verify|Valid" pkg/lib/crypto/

# Look for any certificate validation in NATS transport
rg -A 10 "cert.*valid|valid.*cert" pkg/nats/transport/

# Search for any certificate loading and validation patterns
ast-grep --pattern 'func $_($_) {
  $$$
  cert := &x509.Certificate{
    $$$
    NotAfter: $_
    $$$
  }
  $$$
}'

# Look for any TLS config validation
ast-grep --pattern 'func $_($_) {
  $$$
  tls.Config{
    $$$
  }
  $$$
}'

Length of output: 3496

Copy link
Member

@wdbaruni wdbaruni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Just minor comments

Comment on lines +59 to +63
// Used by the Nats Client when node acts as orchestrator
ServerTLSCACert string

// Use by the Nats Client when node acts as compute
ClientTLSCACert string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if you are running a hybrid node? do you think this differentiating is helpful here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally separated these 2 to make it explicit separation, with configuration coming from 2 different places: Orchestrator and Compute. For hybrid, I would keep it as is, even if we defined it is 2 places, since technically we have a "compute node" and an "orchestrator node" concept.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (10)
test_integration/common_assets/nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml (1)

2-3: Consider using a less common port number.

While port 1234 works for testing, it's a commonly probed port. For production deployments, consider using a less predictable port number.

test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml (3)

6-7: Use a more secure token format even in test configurations.

While this is a test configuration, using obvious test tokens can lead to bad practices. Consider using a more secure format that mirrors production requirements (e.g., a UUID or a sufficiently long random string).

  Auth:
-    Token: "i_am_very_secret_token"
+    Token: "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12" # Example secure format

15-17: Document the purpose of labels.

The current labels appear generic. Consider:

  • Adding comments to explain their purpose in testing
  • Using more descriptive names that reflect their testing context
 Labels:
+  # Label used for node identification in integration tests
   label1: label1Value
+  # Label used for test scenario classification
   label2: label2Value

2-3: Document the significance of port 1234.

If this port number is significant for testing purposes, please add a comment explaining why this specific port was chosen.

 API:
+  # Port 1234 is used for integration testing to avoid conflicts with default ports
   Port: 1234
test_integration/common_assets/nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml (2)

2-3: Consider using a less common port number for tests

Using well-known port numbers like 1234 could lead to port conflicts in test environments. Consider using a port number in the ephemeral range (49152-65535) to minimize potential conflicts.

API:
-  Port: 1234
+  Port: 54321

16-18: Use more descriptive test labels

The current labels are generic and don't indicate their purpose in tests. Consider using labels that reflect their test context.

Labels:
-  label1: label1Value
-  label2: label2Value
+  test_environment: integration
+  tls_mode: custom_ca
pkg/nats/transport/nats.go (4)

54-63: LGTM! Well-structured TLS configuration fields.

The separation of TLS configuration fields into server, orchestrator client, and compute client contexts provides clear boundaries for different security roles. This design allows for flexible TLS configurations while maintaining clear separation of concerns.

Consider documenting these fields with comments explaining their specific usage contexts and requirements, especially for hybrid node scenarios where both orchestrator and compute roles are present.


147-166: Improve TLS configuration error handling.

The server TLS configuration is secure using NATS's built-in GenTLSConfig. However, the error handling could be more specific to help with troubleshooting.

 serverTLSConfig, err := server.GenTLSConfig(serverTLSConfigOpts)
 if err != nil {
-  log.Error().Msgf("failed to configure NATS server TLS: %v", err)
-  return nil, err
+  return nil, fmt.Errorf("failed to configure NATS server TLS - ensure valid certificate and key files: %w", err)
 }

265-274: Consider refactoring client TLS configuration logic.

While the current implementation works and handles both orchestrator and compute roles, the logic could be simplified for better maintainability.

Consider introducing a role-based configuration approach:

-// We need to do this logic since the Nats Transport Layer does not differentiate
-// between orchestrator mode and compute mode
-if config.ServerTLSCert == "" && config.ClientTLSCACert != "" {
-  // this client is for a compute node
-  clientOptions = append(clientOptions, nats.RootCAs(config.ClientTLSCACert))
-} else if config.ServerTLSCert != "" && config.ServerTLSCACert != "" {
-  // this client is for a orchestrator node
-  clientOptions = append(clientOptions, nats.RootCAs(config.ServerTLSCACert))
-}
+// Determine the appropriate CA certificate based on node role
+var caCert string
+switch {
+case config.ServerTLSCert != "":
+  // Orchestrator role
+  caCert = config.ServerTLSCACert
+case config.ClientTLSCACert != "":
+  // Compute role
+  caCert = config.ClientTLSCACert
+}
+
+if caCert != "" {
+  clientOptions = append(clientOptions, nats.RootCAs(caCert))
+}

This refactor:

  1. Makes the role-based logic more explicit
  2. Reduces nesting
  3. Simplifies the conditions
  4. Maintains the same functionality

25-25: Document TLS timeout security implications.

The default TLS timeout of 10 seconds should be documented with its security implications and recommendations for different deployment scenarios.

Add a comment explaining the timeout's purpose and recommendations:

-const NATSServerDefaultTLSTimeout = 10
+// NATSServerDefaultTLSTimeout is the default timeout in seconds for TLS handshake.
+// A shorter timeout (e.g., 5s) may be preferred in low-latency environments,
+// while a longer timeout might be needed in high-latency or resource-constrained environments.
+const NATSServerDefaultTLSTimeout = 10
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d76252c and 1767a51.

📒 Files selected for processing (10)
  • pkg/config/types/compute.go (1 hunks)
  • pkg/config/types/generated_constants.go (2 hunks)
  • pkg/config/types/generated_descriptions.go (2 hunks)
  • pkg/config/types/orchestrator.go (2 hunks)
  • pkg/nats/transport/nats.go (6 hunks)
  • pkg/nats/transport/nats_config_test.go (1 hunks)
  • pkg/node/node.go (1 hunks)
  • test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml (1 hunks)
  • test_integration/common_assets/nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml (1 hunks)
  • test_integration/common_assets/nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • pkg/config/types/compute.go
  • pkg/config/types/generated_descriptions.go
  • pkg/config/types/orchestrator.go
  • pkg/nats/transport/nats_config_test.go
  • pkg/node/node.go
🔇 Additional comments (6)
test_integration/common_assets/nodes_configs/8_tls_enabled_compute_with_custom_ca.yaml (2)

6-7: NATS URL scheme needs updating for TLS.

A previous review has already identified this issue.


10-13: Verify CA certificate availability.

Let's ensure the CA certificate exists in the specified path and has appropriate permissions.

✅ Verification successful

CA certificate path is valid and accessible

The certificate exists at the correct relative path test_integration/common_assets/certificates/nats_custom/nats_root_ca.crt and is properly referenced in the configuration. The path structure aligns with the integration test setup, and the certificate is part of a complete TLS setup including server certificates and keys.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify CA certificate existence and permissions
# Expected: File should exist and be readable

# Check if the certificate file exists
fd -t f "nats_root_ca.crt" | grep -q "certificates/nats_custom" || echo "CA certificate not found"

# Check file permissions if found
fd -t f "nats_root_ca.crt" -x stat -c "%a %n" {} \; | grep "certificates/nats_custom"

Length of output: 184


Script:

#!/bin/bash
# Let's try a different approach using ripgrep to locate the certificate
# and check its existence in the repository

# Search for references to the certificate file
rg -l "nats_root_ca.crt"

# Look for the actual certificate file
find . -name "nats_root_ca.crt" -type f

# Check the directory structure
ls -R test_integration/common_assets/certificates 2>/dev/null || echo "Directory not found"

Length of output: 932

test_integration/common_assets/nodes_configs/7_tls_enabled_orchestrator.yaml (2)

11-12: Verify certificate path consistency across configurations.

Ensure these certificate paths are consistently used across all test configurations and documentation.

#!/bin/bash
# Check for consistency in certificate path usage
rg "/bacalhau_integration_tests/common_assets/certificates/generated_assets/bacalhau-orchestrator-node" --type yaml

13-13: Verify if the TLS timeout value is sufficient.

The 15-second timeout might need adjustment based on your deployment environment and NATS connection patterns. Consider:

  • Network latency in your target environment
  • TLS handshake overhead
  • Connection pool behavior
✅ Verification successful

Let me gather more information about timeout configurations across the codebase.


Let me search for NATS server configuration and any documentation about timeout settings.


Based on the search results, I can see that the 15-second ServerTimeout is used consistently in test configurations and is properly defined in the codebase. Let me verify if there are any default timeout values or documentation about recommended timeout settings.


Let me check specifically for TLS server timeout configurations and any related documentation.


Based on the search results, I can now provide a definitive response about the TLS timeout value:

The 15-second ServerTimeout setting is appropriate for NATS TLS connections.

The codebase shows that:

  • NATS has a default TLS timeout of 10 seconds (NATSServerDefaultTLSTimeout = 10)
  • The 15-second value provides a reasonable buffer above the default
  • The timeout is only used for the TLS handshake phase, not for ongoing connections
  • The value is properly validated to be positive (ServerTLSTimeout must be a positive number)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for other timeout configurations in the codebase for consistency
rg -A 2 "ServerTimeout.*15" --type yaml

Length of output: 510


Script:

#!/bin/bash
# Search for any ServerTimeout configurations in yaml files
rg "ServerTimeout:" --type yaml -A 2

# Search for timeout related configurations in Go files that might set defaults
rg "ServerTimeout|DefaultTimeout" --type go -A 2

Length of output: 2494


Script:

#!/bin/bash
# Search for NATS server configuration and documentation
rg -g "*.md" -g "*.go" "NATS.*config|TLS.*timeout" -A 3

# Look for any comments or documentation about timeouts in the codebase
fd README.md | xargs cat

Length of output: 86057


Script:

#!/bin/bash
# Search for any default timeout values in Go files
rg "DefaultTimeout|default.*timeout" --type go -A 2

# Search for any timeout related documentation
rg -g "*.md" "timeout|TLS" -A 2

Length of output: 8161


Script:

#!/bin/bash
# Search for TLS server timeout configurations and documentation
rg "ServerTimeout|TLS.*timeout" -g "*.go" -g "*.md" -A 3

# Look for any NATS TLS configuration documentation
rg "NATS.*TLS|TLS.*config" -g "*.md" -g "*.go" -A 3

Length of output: 7952

test_integration/common_assets/nodes_configs/8_tls_enabled_orchestrator_with_custom_ca.yaml (1)

8-14: 🛠️ Refactor suggestion

Enhance TLS configuration robustness

Several improvements can be made to the TLS configuration:

  1. The comment about container paths should be more explicit about the required directory structure
  2. The ServerTimeout lacks units specification (seconds/minutes)
  3. Consider adding validation for certificate existence

Let's verify if the certificate paths exist in the repository:

Suggested improvements:

 TLS:
-    # The paths for these certs are within the container after
-    # the assets has been copied during image creation
+    # These paths are relative to the container's root directory.
+    # Required directory structure:
+    #   /bacalhau_integration_tests/
+    #     common_assets/
+    #       certificates/
+    #         nats_custom/
+    #           - nats-server.crt
+    #           - nats-server.key
+    #           - nats_root_ca.crt
     ServerCert: "/bacalhau_integration_tests/common_assets/certificates/nats_custom/nats-server.crt"
     ServerKey: "/bacalhau_integration_tests/common_assets/certificates/nats_custom/nats-server.key"
     CACert: "/bacalhau_integration_tests/common_assets/certificates/nats_custom/nats_root_ca.crt"
-    ServerTimeout: 15
+    # Timeout in seconds for TLS handshake
+    ServerTimeout: 15
pkg/config/types/generated_constants.go (1)

28-28: LGTM! Constants follow consistent naming patterns.

The new TLS-related constants are well-structured and complete, providing configuration keys for both compute nodes (client-side) and orchestrator (server-side) TLS setup. Note that this is a generated file (from pkg/config/types/gen/generate.go), so any changes should be made to the generator rather than directly to this file.

Let's verify the generator file exists and contains the corresponding entries:

Also applies to: 95-98

✅ Verification successful

Constants correctly reflect the TLS configuration structure

The generated constants accurately map to the TLS-related fields defined in the codebase:

  • ComputeTLSCACertKey maps to ComputeTLS.CACert
  • OrchestratorTLSServerCertKey maps to OrchestratorTLS.ServerCert
  • OrchestratorTLSServerKeyKey maps to OrchestratorTLS.ServerKey
  • OrchestratorTLSServerTimeoutKey maps to OrchestratorTLS.ServerTimeout
  • OrchestratorTLSCACertKey maps to OrchestratorTLS.CACert

The generator is working as expected, creating constants from the struct field definitions while maintaining the proper naming patterns.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify the generator file exists and contains TLS-related entries
rg -A 2 "TLS" pkg/config/types/gen/generate.go

Length of output: 46


Script:

#!/bin/bash
# First, let's check if the generator file exists and its content
fd "generate.go" pkg/config/types/

# If the file exists, let's look for any TLS-related content with case-insensitive search
fd "generate.go" pkg/config/types/ --exec cat {} | rg -i "tls"

# Let's also check where these constants are actually used
rg "ComputeTLSCACertKey|OrchestratorTLSCACertKey|OrchestratorTLSServerCertKey|OrchestratorTLSServerKeyKey|OrchestratorTLSServerTimeoutKey" -A 2

Length of output: 2827


Script:

#!/bin/bash
# Let's check the generator file content to see how these constants are defined
cat pkg/config/types/gen/generate.go

# Also check for any config schema or definition files that might be the source
fd -e yaml -e json . pkg/config/types/gen/ --exec cat {}

Length of output: 8267


Script:

#!/bin/bash
# Let's check the Bacalhau config struct to see where these TLS fields are defined
rg -A 5 "type Bacalhau struct" pkg/config/types/

Length of output: 653


Script:

#!/bin/bash
# Let's check for TLS-related structs and fields in the config types
rg -A 5 "type (Compute|Orchestrator) struct" pkg/config/types/

# Also check for any TLS-related struct definitions
rg -B 2 -A 5 "TLS.*struct" pkg/config/types/

Length of output: 3128

@jamlo jamlo merged commit 5c34a47 into main Nov 5, 2024
14 checks passed
@jamlo jamlo deleted the eng-305/jamlo/enable-nats-tls branch November 5, 2024 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants