vector_component_allocated_bytes negative memory #24361

@hunsbea

Problem

The vector_component_allocated_bytes metric is showing negative values for some series.

This does not happen immediately, but I've been test-running Vector on some servers for a couple of months now, and I've noticed it on many of them.

Example:

[Screenshot not preserved]
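For anyone trying to confirm the same symptom on their own hosts, a minimal sketch of a check is below. It scans Prometheus text exposition output for samples of vector_component_allocated_bytes that are below zero. The function name `negative_samples`, the regex, and the label names in the usage note are illustrative assumptions, not anything taken from Vector itself.

```python
import re

# Metric name as reported in this issue.
METRIC = "vector_component_allocated_bytes"

# Matches sample lines of the form: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?P<labels>\{.*\})?\s+(?P<value>\S+)'
)


def negative_samples(exposition_text, metric=METRIC):
    """Return (labels, value) pairs for samples of `metric` whose value is below zero."""
    hits = []
    for line in exposition_text.splitlines():
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        m = SAMPLE_RE.match(line)
        if not m or m.group("name") != metric:
            continue
        value = float(m.group("value"))
        if value < 0:
            hits.append((m.group("labels") or "", value))
    return hits
```

To use it, fetch the exporter output (for example from the prometheus_exporter sink on port 9598 in the configuration below; the `/metrics` path is an assumption) and pass the response body to `negative_samples`. Any returned pair identifies a series that has gone negative.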

This is a problem because, without accurate memory metrics, it's difficult to establish a baseline for how much memory Vector needs to perform well in my environment. It also makes capacity planning harder, since it's unclear whether Vector is safe to run alongside other apps with high memory requirements. There are other ways to measure memory, but since this is supposed to be a Vector feature, it would be great if it worked.
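As one example of those other ways, a rough cross-check is to read the process's resident set size from `/proc/<pid>/status` on Linux. The sketch below is only an illustration; the helper names `parse_vm_rss_kib` and `read_vm_rss_kib` are hypothetical. It gives a whole-process figure, not the per-component breakdown that vector_component_allocated_bytes is meant to provide.

```python
def parse_vm_rss_kib(status_text):
    """Extract VmRSS (in KiB) from the contents of /proc/<pid>/status, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:\t  123456 kB"
            return int(line.split()[1])
    return None


def read_vm_rss_kib(pid):
    """Read the current resident set size of process `pid` in KiB (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kib(f.read())
```

Calling `read_vm_rss_kib` with Vector's PID at regular intervals gives a simple baseline to compare against while the internal metric is unreliable.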

This is Vector 0.48.0 on AlmaLinux 9, installed via the publicly available RPM. I will try the latest version, but it may be a month before I can confirm whether the issue still exists. In the meantime, it would be great if you could check whether there's some kind of silly counter bug to fix.

Configuration

acknowledgements:
  enabled: true
api:
  enabled: true

sources:
  maven_logs:
    type: file
    fingerprint:
      strategy: device_and_inode
    multiline:
      # Start collecting a multiline when the line starts with a date
      # and stop when you see another line starting with a date
      start_pattern: '^\d{4}-\d{2}-\d{2}'
      mode: "halt_before"
      condition_pattern: '^\d{4}-\d{2}-\d{2}'
      timeout_ms: 1000
    max_line_bytes: 1024000
    include:
      - "/opt/maven/logs/*.log"
      - "/opt/maven/logs/*.log_[0-9]*"
    exclude:
      - "/opt/maven/logs/*.bz2"
      - "/opt/maven/logs/*.gz"
      - "/opt/maven/logs/*.zip"
      - "/opt/maven/logs/*.tar"
  journal:
    type: journald
  metrics:
    type: internal_metrics
    scrape_interval_secs: 10

transforms:
  process_maven_logs:
    type: remap
    inputs:
      - enrich_maven_logs
    source: |
      path_tokens, err = split(.file, "/")
      file_tokens, err = split(path_tokens[-1], ".")
      .app_name = file_tokens[0] || "UNKNOWN"
      .app_instance = file_tokens[1] || "UNKNOWN"
  process_journal:
    type: remap
    inputs:
      - journal
    source: |
      .app_name = .source_type  # "journald"
  enrich_maven_logs:
    type: lua
    inputs:
      - maven_logs
    version: "2"
    # This will only run the expensive system commands the first time we see a log file, then we'll use the cached results
    source: |
      function get_file_owner(filepath)
        local handle = io.popen("stat -c '%U' '" .. filepath .. "' 2>/dev/null")
        local user = handle and handle:read("*a"):gsub("%s+", "") or "unknown"
        if handle then handle:close() end
        return user
      end

      function get_process_name(filepath)
        local handle = io.popen("fuser '" .. filepath .. "' 2>/dev/null | xargs -r ps -o comm= -p | grep -v vector | head -1")
        local result = handle and handle:read("*a"):gsub("%s+", "") or ""
        if handle then handle:close() end
        return result
      end

      function init(emit)
        file_metadata_cache = {}
      end

      function process(event, emit)
        local filepath = event.log.file
        if filepath and not file_metadata_cache[filepath] then
          local user = get_file_owner(filepath)
          local process_name = get_process_name(filepath)
          file_metadata_cache[filepath] = {
            process_name = process_name,
            owner_user = user
          }
        end

        if filepath and file_metadata_cache[filepath] then
          local metadata = file_metadata_cache[filepath]
          -- NOTE: setting event.log.<field> in Lua ACTUALLY sets event.<field>
          -- so, event.log.process_name becomes .process_name in VRL later
          event.log.process_name = metadata.process_name
          event.log.owner_user = metadata.owner_user
        end

        emit(event)
      end
    hooks:
      init: "init"
      process: "process"

sinks:
  kafka_maven:
    type: kafka
    inputs:
      - process_maven_logs
    encoding:
      codec: native
    compression: zstd
    librdkafka_options:
      queue.buffering.max.kbytes: "40960"  # 40 MB
      socket.send.buffer.bytes: "41943040" # 40 MB
    bootstrap_servers: "<REDACTED>"
    # Vector spends too much CPU on context switching and has high memory usage
    # if we have too many topic names
    topic: vector.logs.maven.apps
    rate_limit_duration_secs: 1
    rate_limit_num: 50000
  kafka_system:
    type: kafka
    inputs:
      - process_journal
    encoding:
      codec: native
    compression: zstd
    librdkafka_options:
      queue.buffering.max.kbytes: "40960"
      socket.send.buffer.bytes: "41943040"
      message.max.bytes: "2048000"  # twice the file source max_line_bytes
    bootstrap_servers: "<REDACTED>"
    topic: "vector.logs.system.{{app_name}}"
    rate_limit_duration_secs: 1
    rate_limit_num: 50000
  prometheus_sink:
    type: prometheus_exporter
    inputs:
      - metrics
    address: 0.0.0.0:9598

Version

0.48.0

Debug Output

Because the bug does not show up immediately, I don't think a backtrace from me would be helpful.

Example Data

I have tons of different log lines. This probably isn't relevant, and if it is, it would be hard for me to pin down which ones are causing the issue.

Additional Context

It's running on AlmaLinux 9 as a systemd unit:

[Unit]
Description=Vector
Documentation=https://vector.dev
After=network-online.target
Requires=network-online.target

[Service]
# Vector must run as root because we will run system commands to infer metadata from log files
User=root
Group=root
ExecStartPre=/usr/bin/vector validate
ExecStart=/usr/bin/vector --watch-config --allocation-tracing
ExecReload=/usr/bin/vector validate --no-environment
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
AmbientCapabilities=CAP_NET_BIND_SERVICE
EnvironmentFile=-/etc/default/vector
StartLimitInterval=10
StartLimitBurst=5

# Lower CPU priority to prevent interference but allow using any idle CPU
CPUWeight=20
Nice=10
LimitNOFILE=4096
[Install]
WantedBy=multi-user.target

References

I didn't see anyone else talking about this.

Labels: type: bug (a code-related bug)
