[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #2001

nadongjun · 2024-07-07T06:14:57Z

Use case

Although Headscale is primarily designed for small-scale VPN environments (e.g., home VPNs), there is increasing interest in using Headscale to deploy large-scale VPN networks with over 500 clients. However, there is a lack of comprehensive guides and benchmark materials for such use cases. This benchmark provides recommendations for database configurations and comparison metrics to assist users in setting up large-scale VPN networks with Headscale.

The aim of this benchmark was to evaluate the suitability of SQLite (with the WAL option) versus PostgreSQL for large-scale client networks. The evaluation criteria included error occurrence frequency, client request processing time, and profiling results.
The benchmark results showed that PostgreSQL outperformed SQLite in terms of total client creation time, error occurrence frequency, and profiling results.
Based on these results, it is recommended to update the documentation and config.yaml to include criteria for selecting a database based on different use cases.

Description

Benchmark Environment

A single Headscale server was used to create and connect 60 Tailscale clients on each of 10 EC2 instances.
- Headscale Instance: t2.large instance deployed in the ap-northeast-2 region
- Tailscale Clients: 10 t2.medium instances (5 in ap-northeast-2 and 5 in us-east-1), with 60 Tailscale clients deployed per instance (total 600 tailscale client)
  - The benchmark was performed in a consistent environment where the only variable was the choice of database specified in the headscale config.yaml: either PostgreSQL or SQLite (with the WAL option).

Comparison Metrics and Results

The benchmark evaluated SQLite (with the WAL option) and PostgreSQL based on the following three criteria: total client creation time and error occurrence rate.

Total Client (600) Creation Time and Error Occurrence Rate
- SQLite: sqlite-node-list.json
  - Total Creation Time: 3.46 hours (First user created at: 1720269701, Last user created at: 1720282145)
  - Successful Client Creations: 442
  - Errors Encountered: 158 (ERR: context deadline exceeded)
- PostgreSQL: postgres-node-list.json
  - Total Creation Time: 2.47 hours (First user created at: 1720183268, Last user created at: 1720192120)
  - Successful Client Creations: 501
  - Errors Encountered: 99 (ERR: context deadline exceeded)
Error Occurrences
The following provides a verification of the errors that occurred in Issue [Bug] Node Connection Issues(~600 nodes) in v0.23.0-alpha12 #1966, including their occurrence rates:
- ERROR-1: The "ERR update not sent, context cancelled..." error occurred more frequently with SQLite. PostgreSQL experienced the Cannot create user: context deadline exceeded error.
- ERROR-2: The issue where some clients go from online to offline was caused by insufficient resources on the EC2 instances. This was not related to Headscale itself. With t2.medium instances, up to 60 Tailscale clients could be deployed effectively, but deploying more than 100 clients led to resource shortages, causing some clients to go offline.
Profiling Results
- The profiling results show that SQLite had higher query and database-related time compared to PostgreSQL. Please refer to the attached files for detailed profiling data.

Profiling Results: cpu

postgres

sqlite

Profiling Results: memory

postgres

sqlite

Conclusion

Client Creation Time: PostgreSQL outperformed SQLite in terms of faster client creation and a higher number of successful client creations.
Error Occurrence Frequency: PostgreSQL had a lower error(context deadline exceeded) occurrence rate compared to SQLite.
Profiling Performance: PostgreSQL demonstrated better performance in query execution and database-related operations compared to SQLite.

Recommendation

Based on the benchmark results, PostgreSQL is recommended for large-scale client environments over SQLite. It is suggested to update the documentation and config.yaml to include guidelines for selecting the appropriate database based on the use case.

Attachments

Headscale Node Lists:
- postgres-node-list.json
- sqlite-node-list.json
pprof Profiling and Heap Measurement Results:
- cpu-postgres.png
- cpu-sqlite.png
- heap-postgres.png
- heap-sqlite.png

Contribution

I can write the design doc for this feature
I can contribute this feature

How can it be implemented?

No response

The text was updated successfully, but these errors were encountered:

chriswiggins · 2024-07-11T11:33:07Z

Does this highlight a need (regardless of database) to be keeping more info in memory, or doing less database work every time a peer update occurs (or a mixture of both)?

I've been diving into the source, and I see a couple of TODOs by @kradalby mentioning caching. Would there be appetite for that?

CNLHC · 2024-07-11T11:44:11Z

Thanks for your meticulous benchmark.
It astonishes me that creating just 600 clients will take hours both in postgresql and sqlite. Intuitively speaking, I think the vital bottleneck here maybe not the database🤔
According to your benchmark results, the gorm overhead seems unexpectedly high.

BTW, if you have raw profile data, maybe flamegraph will provide more insight.

nadongjun · 2024-07-11T12:56:45Z

@CNLHC

The reason creating 600 clients took several hours seems to be because I made the requests synchronously (using tailscale CLI to headscale by shell script). The latency observed in the benchmark is being analyzed in two areas: network latency due to AWS region distances and the database. Once the analysis is complete, I'll share the results.

Currently, I only have raw profile data for SQLite, so I am attaching a flamegraph for that.

kradalby · 2024-07-13T09:05:07Z

I've posted a comment touching on some of these topics here: #1993 (comment)

nadongjun added the enhancement New feature or request label Jul 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #2001

[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #2001

nadongjun commented Jul 7, 2024

chriswiggins commented Jul 11, 2024

CNLHC commented Jul 11, 2024 •

edited

Loading

nadongjun commented Jul 11, 2024

kradalby commented Jul 13, 2024

[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #2001

[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #2001

Comments

nadongjun commented Jul 7, 2024

Use case

Description

Benchmark Environment

Comparison Metrics and Results

Profiling Results: cpu

postgres

sqlite

Profiling Results: memory

postgres

sqlite

Conclusion

Recommendation

Attachments

Contribution

How can it be implemented?

chriswiggins commented Jul 11, 2024

CNLHC commented Jul 11, 2024 • edited Loading

nadongjun commented Jul 11, 2024

kradalby commented Jul 13, 2024

CNLHC commented Jul 11, 2024 •

edited

Loading