fix: only collect OS type; fix method name; add readme (#4579)
### What:
- Fix NodeID option name (cosmetic change)
- Hash the NodeID before recording it.
- Only collect OS type, instead of all details.
- Add a README.md describing what is collected, and how to opt out of
collection

---------

Co-authored-by: frrist <[email protected]>
frrist and frrist authored Oct 3, 2024
1 parent d1a9e86 commit b5a4918
Showing 3 changed files with 123 additions and 5 deletions.
2 changes: 1 addition & 1 deletion cmd/cli/serve/serve.go
@@ -260,7 +260,7 @@ func serve(cmd *cobra.Command, cfg types.Bacalhau, fsRepo *repo.FsRepo) error {

     if !cfg.DisableAnalytics {
         err = analytics.SetupAnalyticsProvider(ctx,
-            analytics.WithNodeNodeID(sysmeta.NodeName),
+            analytics.WithNodeID(sysmeta.NodeName),
             analytics.WithInstallationID(system.InstallationID()),
             analytics.WithInstanceID(sysmeta.InstanceID),
             analytics.WithNodeType(isRequesterNode, isComputeNode),
118 changes: 118 additions & 0 deletions pkg/analytics/README.md
@@ -0,0 +1,118 @@
# What Data is shared by users of Bacalhau?

When a job is submitted or completed, data is collected about it to help track, manage, and optimize its execution.

## What information is collected on the bacalhau agent:

- **Node Type**: One of: ‘hybrid’, ‘orchestrator’, ‘compute’.
- **Node Version**: The version of bacalhau the node is running.
- **Node ID**: A hash of the identifier of the bacalhau node.
- **Installation ID**: The identifier associated with the installation of bacalhau.
- **Instance ID**: An anonymous identifier of the bacalhau node.
- **Operating System Type**: The name of the operating system the bacalhau node is running on.

## **What information is collected on job submissions and completions:**

1. **Job Identification**
   - **ID**: A unique identifier for the job.
   - **Namespace Hash**: A hashed version of the job’s namespace, used for grouping related jobs.
   - **Name Set**: Whether a specific name was set for the job.
   - **Type**: The type of job you’re running.
   - **Count**: The number of tasks associated with the job.
   - **Labels & Metadata Counts**: The number of labels and metadata entries attached to the job.
2. **State and Timing Information (Terminal Jobs Only)**
   - **State**: The current state of the job (e.g., completed, failed).
   - **Creation & Modification Times**: When the job was created and last modified.
3. **Versioning and Revisions**
   - **Version & Revision**: These fields help track changes to the job’s configuration over time.
4. **Task-Specific Information**
   - **Task Name Hash**: A hashed version of the task name for internal tracking.
   - **Task Engine & Publisher Types**: The type of engine and publisher used for the task.
   - **Environment Variables & Metadata**: The number of environment variables and metadata entries tied to the task.
   - **Input Source Types**: The types of input sources for the task (e.g., file, database).
   - **Result Paths Count**: The number of result paths generated by the task.
5. **Resource Allocation**
   - **CPU, Memory, Disk, GPU Usage**: The amount of CPU, memory, disk, and GPU resources requested by the task.
   - **Network Details**: The network type and number of network domains used by the task.
6. **Timeouts**
   - **Execution Timeout**: The maximum allowed time for the task to run.
   - **Queue Timeout**: The maximum time the task can wait in the queue.
   - **Total Timeout**: The total allowed time for the job, including both queue and execution time.
7. **Warnings and Errors (Submitted Jobs Only)**
   - Any warnings or errors that occurred during the job submission or execution process.
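
The job fields listed above are mostly derived values (hashes, flags, and counts) rather than raw job contents. A minimal sketch of that kind of reduction follows, assuming a simplified job shape and SHA-256 hashing. `Job`, `JobSummary`, `Summarize`, and `hashString` are hypothetical names used for illustration only; they are not the types in `pkg/analytics`, and treating an empty name as "not set" is also an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Job is a hypothetical, simplified job spec used only for this sketch.
type Job struct {
	ID        string
	Namespace string
	Name      string
	Type      string
	Count     int
	Labels    map[string]string
	Meta      map[string]string
}

// JobSummary keeps derived values only: a hash, a flag, and counts.
type JobSummary struct {
	ID            string
	NamespaceHash string
	NameSet       bool
	Type          string
	Count         int
	LabelsCount   int
	MetaCount     int
}

// hashString returns the hex-encoded SHA-256 digest of s (assumed hash choice).
func hashString(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// Summarize reduces a Job to the values that would be reported.
// The namespace is hashed, the name becomes a boolean, maps become counts.
func Summarize(j Job) JobSummary {
	return JobSummary{
		ID:            j.ID,
		NamespaceHash: hashString(j.Namespace),
		NameSet:       j.Name != "", // assumption: empty means "no name set"
		Type:          j.Type,
		Count:         j.Count,
		LabelsCount:   len(j.Labels),
		MetaCount:     len(j.Meta),
	}
}

func main() {
	s := Summarize(Job{
		ID:        "j-1",
		Namespace: "team-a",
		Type:      "batch",
		Count:     2,
		Labels:    map[string]string{"env": "prod"},
	})
	fmt.Printf("%+v\n", s)
}
```

The point of the sketch is that label keys and values, the job name, and the namespace string never appear in the summary; only their count, presence, or digest does.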

## **What Information is Collected on Job Execution**

When a job is executed, detailed information about the execution process is collected to help monitor and optimize performance, as well as assist with troubleshooting. Here’s a breakdown of what is collected:

1. **Execution Identification**
   - **Execution ID**: A unique identifier for the execution.
   - **Job ID**: The identifier for the associated job.
   - **Evaluation ID**: An identifier linking the execution to its evaluation process.
   - **Node Name Hash**: A hashed version of the name of the node where the execution is running.
   - **Namespace Hash**: A hashed version of the namespace under which the execution is running.
2. **Execution Metadata**
   - **Execution Name Set**: Whether a specific name was set for the execution.
   - **Previous & Next Executions**: Links to any preceding or subsequent executions, if applicable.
   - **Follow-up Evaluation ID**: An identifier for any follow-up evaluations related to the execution.
   - **Revision**: A version number that tracks changes to the execution configuration over time.
   - **Creation & Modification Times**: Timestamps indicating when the execution was created and last modified.
3. **Resource Allocation**
   - **Total CPU Units**: The total CPU resources allocated for the execution.
   - **Total Memory, Disk, and GPU Usage**: The memory, disk space, and GPU resources used by the execution.
4. **Execution States**
   - **Desired State**: The intended state of the execution (e.g., running, completed).
   - **Compute State & Message**: The actual state of the execution, including any details about its progress or errors.
   - **Compute Error Code**: An error code related to any issues with the execution's state on the compute node.
5. **Published Results**
   - **Published Result Type**: The type of result produced by the execution, such as output files or data.
6. **Run Command Results**
   - **Run Output Details**: Information about the command’s execution, including:
     - **Exit Code**: The exit code returned by the executed task (typically 0 for success).
     - **RunResultStdoutTruncated**: Whether stdout was truncated during execution.
     - **RunResultStderrTruncated**: Whether stderr was truncated during execution.

# How do users opt out of sharing data?

To opt out of sharing data, users may run one of the following commands before starting their bacalhau node:

**Disable collection via `config set`**

```bash
bacalhau config set DisableAnalytics true
```

**Disable collection via environment variable**

```bash
export BACALHAU_DISABLEANALYTICS=true
```

**Disable collection via editing the config file**

```bash
echo 'disableanalytics: true' >> ~/.bacalhau/config.yaml
```

**Disable collection via a config flag**

```bash
bacalhau --config=DisableAnalytics=true <command>
```

## **How can users verify they have opted out?**

```bash
bacalhau config list | grep disableanalytics
```

Expected output when collection is disabled:

```bash
disableanalytics true No description available
```

Expected output when collection is enabled:

```bash
disableanalytics false No description available
```
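
On the code side, the `serve.go` hunk in this commit shows the opt-out being honored by skipping `analytics.SetupAnalyticsProvider` when `cfg.DisableAnalytics` is true. The sketch below illustrates the same gate for the environment-variable path only. It is an assumption-laden illustration: `analyticsDisabled` is a hypothetical helper, Bacalhau resolves this setting through its own config system rather than reading the variable directly, and treating unset or unparsable values as "not disabled" is a choice made for the sketch.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// disableAnalyticsEnv is the variable named in the README section above.
const disableAnalyticsEnv = "BACALHAU_DISABLEANALYTICS"

// analyticsDisabled reports whether the opt-out variable holds a true value.
// The lookup function is injected so the logic can be exercised without
// touching the real process environment.
func analyticsDisabled(lookup func(string) (string, bool)) bool {
	raw, ok := lookup(disableAnalyticsEnv)
	if !ok {
		return false // assumption: unset means collection stays enabled
	}
	disabled, err := strconv.ParseBool(raw)
	return err == nil && disabled
}

func main() {
	if analyticsDisabled(os.LookupEnv) {
		fmt.Println("analytics disabled; skipping provider setup")
		return
	}
	fmt.Println("analytics enabled; provider setup would run here")
}
```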
8 changes: 4 additions & 4 deletions pkg/analytics/analytics.go
@@ -23,7 +23,7 @@ const DefaultOtelCollectorEndpoint = "t.bacalhau.org:4317"
 const (
     NodeInstallationIDKey = "installation_id"
     NodeInstanceIDKey = "instance_id"
-    NodeIDKey = "node_id"
+    NodeIDHashKey = "node_id_hash"
     NodeTypeKey = "node_type"
     NodeVersionKey = "node_version"
 )
@@ -41,9 +41,9 @@ func WithEndpoint(endpoint string) Option {
     }
 }

-func WithNodeNodeID(id string) Option {
+func WithNodeID(id string) Option {
     return func(c *Config) {
-        c.attributes = append(c.attributes, attribute.String(NodeIDKey, id))
+        c.attributes = append(c.attributes, attribute.String(NodeIDHashKey, hashString(id)))
     }
 }
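
The hunk above calls `hashString`, whose body is not part of this diff. The following stdlib-only sketch shows how a functional option of this shape can record a digest instead of the raw node ID. It assumes SHA-256 with hex encoding for `hashString`, and it replaces the OpenTelemetry `attribute.KeyValue` with a hypothetical `Attribute` struct so the example runs without third-party modules. `NewConfig` is likewise a stand-in for the setup code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Attribute is a stand-in for an OpenTelemetry key/value attribute.
type Attribute struct {
	Key   string
	Value string
}

// Config collects the attributes attached to analytics events.
type Config struct {
	attributes []Attribute
}

// Option mutates a Config, mirroring the functional-option shape in the diff.
type Option func(c *Config)

const NodeIDHashKey = "node_id_hash"

// hashString is an assumed implementation: hex-encoded SHA-256 of the input.
func hashString(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// WithNodeID records only the digest of the node ID, never the raw value.
func WithNodeID(id string) Option {
	return func(c *Config) {
		c.attributes = append(c.attributes, Attribute{Key: NodeIDHashKey, Value: hashString(id)})
	}
}

// NewConfig applies each option in order to an empty Config.
func NewConfig(opts ...Option) *Config {
	c := &Config{}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	cfg := NewConfig(WithNodeID("node-0"))
	for _, a := range cfg.attributes {
		fmt.Printf("%s=%s\n", a.Key, a.Value)
	}
}
```

Because the digest is deterministic, the same node still maps to the same `node_id_hash` across events, so events can be grouped per node without the node name itself being recorded.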

@@ -108,7 +108,7 @@ func SetupAnalyticsProvider(ctx context.Context, opts ...Option) error {

     // Create a new resource with auto-detected host information
     res, err := resource.New(ctx,
-        resource.WithOS(),
+        resource.WithOSType(),
         resource.WithSchemaURL(semconv.SchemaURL),
         resource.WithAttributes(config.attributes...),
     )
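
Swapping `resource.WithOS()` for `resource.WithOSType()` narrows what the resource detector records. In the OpenTelemetry Go SDK, `WithOS` is documented as adding both the `os.type` and `os.description` attributes, while `WithOSType` adds only `os.type`. The stdlib-only sketch below approximates that difference. `osAttributes` is a hypothetical helper, and the description string is a placeholder: the real SDK builds a more detailed description (OS version and similar details) on its own.

```go
package main

import (
	"fmt"
	"runtime"
)

// osAttributes returns the coarse OS type and, if requested, a placeholder
// description. Only the type is kept when includeDescription is false,
// which is the behavior this commit moves to.
func osAttributes(includeDescription bool) map[string]string {
	attrs := map[string]string{"os.type": runtime.GOOS}
	if includeDescription {
		// Placeholder only; not the format the SDK produces.
		attrs["os.description"] = fmt.Sprintf("%s/%s (details omitted)", runtime.GOOS, runtime.GOARCH)
	}
	return attrs
}

func main() {
	fmt.Println("type only:       ", osAttributes(false))
	fmt.Println("type+description:", osAttributes(true))
}
```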
