Brian/ai optimize tokenizer #45705

gh123man · 2026-01-29T20:26:52Z

What does this PR do?

Benchmark Results

Comparing main vs this PR for the tokenize() function:

Benchmark	main	PR	Speedup	Memory (B/op, main → PR)
TokenizerShort	57.1 ns/op	26.7 ns/op	2.1x faster	56B → 18B (-68%)
TokenizerLong	1354 ns/op	506 ns/op	2.7x faster	1440B → 864B (-40%)

New Comprehensive Benchmarks (PR only)

Benchmark	ns/op	B/op	Description
TokenizerMedium	169	256	Typical log line (~80 bytes)
TokenizerVeryLong	1297	2288	Verbose log (~400 bytes)
TokenizerJSON	484	864	JSON-heavy messages
TokenizerTimestampHeavy	355	656	Multiple timestamp formats
TokenizerNumberHeavy	301	480	Numeric data
TokenizerSpecialCharsHeavy	296	576	Special characters
TokenizerStackTrace	345	480	Java stack traces
TokenizerLongWords	86	16	Long character runs
TokenizerLongNumbers	65	16	Long digit sequences
TokenizerRealisticApacheLog	463	800	Apache access log
TokenizerRealisticAppLog	407	656	Application log

Key Optimizations

256-byte lookup table for token classification - O(1) array index vs switch
256-byte lookup table for case conversion - no branches
Reusable working buffers on Tokenizer struct - amortized allocation cost
Exact-sized result slices - allocate actual token count, not input length
Length-based dispatch in getSpecialLongToken - fast rejection of impossible matches
Extracted emitToken method - cleaner code, compiler inlines it
Only buffer letter tokens - skip buffering for non-letter characters
unsafe.String - avoid allocation when checking special tokens
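
As a rough illustration of three of the ideas listed above (byte lookup tables, a reusable working buffer, and exact-sized results), here is a minimal standalone sketch. It is not the PR's code: the `tokenClass` categories, the `tokenizer` type, and the `classify` method are all assumptions made up for this example, and the real tokenizer's token set and API are not shown on this page.

```go
package main

import "fmt"

// tokenClass is a hypothetical token category used only in this sketch.
type tokenClass uint8

const (
	classOther tokenClass = iota
	classLetter
	classDigit
	classSpace
)

// classTable maps every possible byte value to its class.
var classTable = buildClassTable()

// upperTable maps every byte to its ASCII upper-case form
// (identity for anything that is not a lower-case ASCII letter).
var upperTable = buildUpperTable()

func buildClassTable() [256]tokenClass {
	var t [256]tokenClass // zero value is classOther
	for c := 'a'; c <= 'z'; c++ {
		t[c] = classLetter
		t[c-'a'+'A'] = classLetter
	}
	for c := '0'; c <= '9'; c++ {
		t[c] = classDigit
	}
	for _, c := range []byte{' ', '\t', '\n', '\r'} {
		t[c] = classSpace
	}
	return t
}

func buildUpperTable() [256]byte {
	var t [256]byte
	for i := range t {
		t[i] = byte(i)
	}
	for c := byte('a'); c <= 'z'; c++ {
		t[c] = c - 'a' + 'A'
	}
	return t
}

// tokenizer keeps a scratch buffer so repeated calls reuse its backing array.
type tokenizer struct {
	scratch []tokenClass
}

// classify returns one class per run of consecutive same-class bytes.
func (t *tokenizer) classify(input []byte) []tokenClass {
	t.scratch = t.scratch[:0] // keep capacity from earlier calls
	for i, b := range input {
		c := classTable[b] // single array index instead of a switch
		if i == 0 || c != t.scratch[len(t.scratch)-1] {
			t.scratch = append(t.scratch, c)
		}
	}
	// Copy into a slice sized to the token count, not the input length.
	out := make([]tokenClass, len(t.scratch))
	copy(out, t.scratch)
	return out
}

func main() {
	var tk tokenizer
	// "GET 404" has three runs: letters, a space, digits.
	fmt.Println(tk.classify([]byte("GET 404")))
	fmt.Printf("%c\n", upperTable['q'])
}
```

Because the scratch buffer lives on the struct, a `tokenizer` value in this sketch is not safe for concurrent use; whether the PR's Tokenizer has the same constraint is not stated here.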

Motivation

Describe how you validated your changes

Additional Notes

agent-platform-auto-pr · 2026-01-29T21:04:32Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor d8f3269
📊 Static Quality Gates Dashboard

Successful checks

Info

	Quality gate	Change	Size in MiB (prev → curr → max)
✅	agent_rpm_arm64	-4.0 KiB (0.00% reduction)	727.128 → 727.124 → 737.340
✅	agent_suse_arm64	-4.0 KiB (0.00% reduction)	727.128 → 727.124 → 737.340
✅	docker_agent_arm64	-4.0 KiB (0.00% reduction)	814.215 → 814.211 → 824.020
✅	docker_agent_jmx_arm64	-4.0 KiB (0.00% reduction)	993.813 → 993.809 → 1003.620

27 successful checks with minimal change (< 2 KiB)

	Quality gate	Current Size
✅	agent_deb_amd64	748.049 MiB
✅	agent_deb_amd64_fips	696.836 MiB
✅	agent_heroku_amd64	325.593 MiB
✅	agent_msi	659.733 MiB
✅	agent_rpm_amd64	748.033 MiB
✅	agent_rpm_amd64_fips	696.819 MiB
✅	agent_rpm_arm64_fips	679.273 MiB
✅	agent_suse_amd64	748.033 MiB
✅	agent_suse_amd64_fips	696.819 MiB
✅	agent_suse_arm64_fips	679.273 MiB
✅	docker_agent_amd64	810.525 MiB
✅	docker_agent_jmx_amd64	1001.403 MiB
✅	docker_cluster_agent_amd64	180.824 MiB
✅	docker_cluster_agent_arm64	196.669 MiB
✅	docker_cws_instrumentation_amd64	7.135 MiB
✅	docker_cws_instrumentation_arm64	6.689 MiB
✅	docker_dogstatsd_amd64	38.414 MiB
✅	docker_dogstatsd_arm64	36.749 MiB
✅	dogstatsd_deb_amd64	29.630 MiB
✅	dogstatsd_deb_arm64	27.802 MiB
✅	dogstatsd_rpm_amd64	29.630 MiB
✅	dogstatsd_suse_amd64	29.630 MiB
✅	iot_agent_deb_amd64	42.751 MiB
✅	iot_agent_deb_arm64	39.872 MiB
✅	iot_agent_deb_armhf	40.442 MiB
✅	iot_agent_rpm_amd64	42.751 MiB
✅	iot_agent_suse_amd64	42.751 MiB

On-wire sizes (compressed)

	Quality gate	Change	Size in MiB (prev → curr → max)
✅	agent_deb_amd64	+4.22 KiB (0.00% increase)	182.850 → 182.854 → 184.810
✅	agent_deb_amd64_fips	neutral	171.277 MiB → 173.790
✅	agent_heroku_amd64	-7.2 KiB (0.01% reduction)	87.289 → 87.282 → 88.450
✅	agent_msi	+44.0 KiB (0.03% increase)	142.406 → 142.449 → 143.300
✅	agent_rpm_amd64	-24.48 KiB (0.01% reduction)	185.780 → 185.756 → 188.160
✅	agent_rpm_amd64_fips	+29.54 KiB (0.02% increase)	173.553 → 173.581 → 176.600
✅	agent_rpm_arm64	+22.53 KiB (0.01% increase)	168.368 → 168.390 → 169.930
✅	agent_rpm_arm64_fips	-14.23 KiB (0.01% reduction)	159.088 → 159.074 → 160.550
✅	agent_suse_amd64	-24.48 KiB (0.01% reduction)	185.780 → 185.756 → 188.160
✅	agent_suse_amd64_fips	+29.54 KiB (0.02% increase)	173.553 → 173.581 → 176.600
✅	agent_suse_arm64	+22.53 KiB (0.01% increase)	168.368 → 168.390 → 169.930
✅	agent_suse_arm64_fips	-14.23 KiB (0.01% reduction)	159.088 → 159.074 → 160.550
✅	docker_agent_amd64	neutral	275.087 MiB → 277.400
✅	docker_agent_arm64	-23.83 KiB (0.01% reduction)	262.678 → 262.655 → 266.040
✅	docker_agent_jmx_amd64	-2.55 KiB (0.00% reduction)	343.720 → 343.718 → 346.020
✅	docker_agent_jmx_arm64	-32.76 KiB (0.01% reduction)	327.305 → 327.273 → 330.660
✅	docker_cluster_agent_amd64	neutral	63.874 MiB → 64.510
✅	docker_cluster_agent_arm64	neutral	60.135 MiB → 61.170
✅	docker_cws_instrumentation_amd64	neutral	2.994 MiB → 3.330
✅	docker_cws_instrumentation_arm64	neutral	2.726 MiB → 3.090
✅	docker_dogstatsd_amd64	neutral	14.863 MiB → 15.820
✅	docker_dogstatsd_arm64	neutral	14.202 MiB → 14.830
✅	dogstatsd_deb_amd64	neutral	7.831 MiB → 8.790
✅	dogstatsd_deb_arm64	-2.25 KiB (0.03% reduction)	6.719 → 6.717 → 7.710
✅	dogstatsd_rpm_amd64	neutral	7.844 MiB → 8.800
✅	dogstatsd_suse_amd64	neutral	7.844 MiB → 8.800
✅	iot_agent_deb_amd64	neutral	11.213 MiB → 12.040
✅	iot_agent_deb_arm64	neutral	9.585 MiB → 10.450
✅	iot_agent_deb_armhf	+2.18 KiB (0.02% increase)	9.780 → 9.782 → 10.620
✅	iot_agent_rpm_amd64	neutral	11.230 MiB → 12.060
✅	iot_agent_suse_amd64	neutral	11.230 MiB → 12.060

cit-pr-commenter-54b7da · 2026-01-29T21:18:01Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 39f43a86-76a0-400b-b4c7-6128911b448f

Baseline: d8f3269
Comparison: eaf1e3a
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	+0.88	[-2.22, +3.98]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	tcp_syslog_to_blackhole	ingress throughput	+1.33	[+1.26, +1.39]	1	Logs
➖	docker_containers_cpu	% cpu utilization	+0.88	[-2.22, +3.98]	1	Logs
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	+0.58	[+0.52, +0.63]	1	Logs
➖	otlp_ingest_logs	memory utilization	+0.48	[+0.38, +0.58]	1	Logs
➖	ddot_metrics_sum_delta	memory utilization	+0.45	[+0.25, +0.65]	1	Logs
➖	file_tree	memory utilization	+0.36	[+0.31, +0.42]	1	Logs
➖	ddot_logs	memory utilization	+0.35	[+0.29, +0.41]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	+0.12	[-0.04, +0.28]	1	Logs
➖	docker_containers_memory	memory utilization	+0.04	[-0.03, +0.12]	1	Logs
➖	ddot_metrics	memory utilization	+0.04	[-0.19, +0.27]	1	Logs
➖	otlp_ingest_metrics	memory utilization	+0.03	[-0.12, +0.18]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.03	[-0.36, +0.41]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.01	[-0.08, +0.09]	1	Logs
➖	quality_gate_idle	memory utilization	+0.00	[-0.04, +0.05]	1	Logs bounds checks dashboard
➖	uds_dogstatsd_to_api	ingress throughput	+0.00	[-0.12, +0.13]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	-0.01	[-0.15, +0.12]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.02	[-0.05, +0.02]	1	Logs bounds checks dashboard
➖	file_to_blackhole_100ms_latency	egress throughput	-0.02	[-0.07, +0.03]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.02	[-0.44, +0.40]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.09	[-0.58, +0.40]	1	Logs
➖	quality_gate_metrics_logs	memory utilization	-0.30	[-0.53, -0.07]	1	Logs bounds checks dashboard
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	-0.52	[-0.75, -0.29]	1	Logs
➖	quality_gate_logs	% cpu utilization	-3.51	[-4.98, -2.04]	1	Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	docker_containers_cpu	simple_check_run	10/10
✅	docker_containers_memory	memory_usage	10/10
✅	docker_containers_memory	simple_check_run	10/10
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	lost_bytes	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_logs	lost_bytes	10/10	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	lost_bytes	10/10	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we classify a change in performance as a "regression" -- a change worth investigating further -- only if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
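
The three criteria can be read as a single boolean test. The sketch below encodes them literally as written in the explanation; the `experiment` struct and `isRegression` function are hypothetical names for illustration and are not the Regression Detector's actual implementation.

```go
package main

import (
	"fmt"
	"math"
)

// experiment holds the fields needed by the decision rule.
// (Hypothetical type for this sketch.)
type experiment struct {
	name          string
	deltaMeanPct  float64 // "Δ mean %"
	ciLow, ciHigh float64 // bounds of "Δ mean % CI"
	erratic       bool    // true if the configuration sets erratic: true
}

// effectSizeTolerance is the 5.00% threshold stated in the report.
const effectSizeTolerance = 5.0

// isRegression applies the three listed criteria, all of which must hold.
func isRegression(e experiment) bool {
	bigEnough := math.Abs(e.deltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.ciLow > 0 || e.ciHigh < 0
	return bigEnough && ciExcludesZero && !e.erratic
}

func main() {
	// Values taken from the quality_gate_logs row of the table above.
	e := experiment{
		name:         "quality_gate_logs",
		deltaMeanPct: -3.51,
		ciLow:        -4.98,
		ciHigh:       -2.04,
	}
	fmt.Println(e.name, "flagged:", isRegression(e))
}
```

For quality_gate_logs the interval excludes zero, but |Δ mean %| is 3.51, below the 5.00% tolerance, so the experiment is not flagged under these criteria.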

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_metrics_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.

gh123man added 5 commits January 29, 2026 14:35

AI optimize tokenizer

f7ab139

faster

d236865

even faster

d579b27

cleanup

414fd55

Merge branch 'main' into brian/ai-optimize-tokenizer

b43c9cf

github-actions bot added the team/agent-log-pipelines and medium review (PR review might take time) labels Jan 29, 2026

Fix init

eaf1e3a
