Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Benchmarking PostgreSQL vs. SQLite for Large-Scale VPN Networks Using Headscale #2001

Open
2 tasks done
nadongjun opened this issue Jul 7, 2024 · 4 comments
Open
2 tasks done
Labels
enhancement New feature or request

Comments

@nadongjun
Copy link
Contributor

Use case

Although Headscale is primarily designed for small-scale VPN environments (e.g., home VPNs), there is increasing interest in using Headscale to deploy large-scale VPN networks with over 500 clients. However, there is a lack of comprehensive guides and benchmark materials for such use cases. This benchmark provides recommendations for database configurations and comparison metrics to assist users in setting up large-scale VPN networks with Headscale.

  • The aim of this benchmark was to evaluate the suitability of SQLite (with the WAL option) versus PostgreSQL for large-scale client networks. The evaluation criteria included error occurrence frequency, client request processing time, and profiling results.
  • The benchmark results showed that PostgreSQL outperformed SQLite in terms of total client creation time, error occurrence frequency, and profiling results.
  • Based on these results, it is recommended to update the documentation and config.yaml to include criteria for selecting a database based on different use cases.

Description

Benchmark Environment

  • A single Headscale server was used to create and connect 60 Tailscale clients on each of 10 EC2 instances.
    • Headscale Instance: t2.large instance deployed in the ap-northeast-2 region
    • Tailscale Clients: 10 t2.medium instances (5 in ap-northeast-2 and 5 in us-east-1), with 60 Tailscale clients deployed per instance (total 600 tailscale client)
      - The benchmark was performed in a consistent environment where the only variable was the choice of database specified in the headscale config.yaml: either PostgreSQL or SQLite (with the WAL option).

Comparison Metrics and Results

The benchmark evaluated SQLite (with the WAL option) and PostgreSQL based on the following three criteria: total client creation time and error occurrence rate.

  1. Total Client (600) Creation Time and Error Occurrence Rate

    • SQLite: sqlite-node-list.json
      • Total Creation Time: 3.46 hours (First user created at: 1720269701, Last user created at: 1720282145)
      • Successful Client Creations: 442
      • Errors Encountered: 158 (ERR: context deadline exceeded)
    • PostgreSQL: postgres-node-list.json
      • Total Creation Time: 2.47 hours (First user created at: 1720183268, Last user created at: 1720192120)
      • Successful Client Creations: 501
      • Errors Encountered: 99 (ERR: context deadline exceeded)
  2. Error Occurrences
    The following provides a verification of the errors that occurred in Issue [Bug] Node Connection Issues(~600 nodes) in v0.23.0-alpha12 #1966, including their occurrence rates:

    • ERROR-1: The "ERR update not sent, context cancelled..." error occurred more frequently with SQLite. PostgreSQL experienced the Cannot create user: context deadline exceeded error.
    • ERROR-2: The issue where some clients go from online to offline was caused by insufficient resources on the EC2 instances. This was not related to Headscale itself. With t2.medium instances, up to 60 Tailscale clients could be deployed effectively, but deploying more than 100 clients led to resource shortages, causing some clients to go offline.
  3. Profiling Results

    • The profiling results show that SQLite had higher query and database-related time compared to PostgreSQL. Please refer to the attached files for detailed profiling data.

Profiling Results: cpu

postgres

cpu-postgres

sqlite

cpu-sqlite

Profiling Results: memory

postgres

heap-postgres

sqlite

heap-sqlite

Conclusion

  • Client Creation Time: PostgreSQL outperformed SQLite in terms of faster client creation and a higher number of successful client creations.
  • Error Occurrence Frequency: PostgreSQL had a lower error(context deadline exceeded) occurrence rate compared to SQLite.
  • Profiling Performance: PostgreSQL demonstrated better performance in query execution and database-related operations compared to SQLite.

Recommendation

Based on the benchmark results, PostgreSQL is recommended for large-scale client environments over SQLite. It is suggested to update the documentation and config.yaml to include guidelines for selecting the appropriate database based on the use case.

Attachments

  • Headscale Node Lists:
    • postgres-node-list.json
    • sqlite-node-list.json
  • pprof Profiling and Heap Measurement Results:
    • cpu-postgres.png
    • cpu-sqlite.png
    • heap-postgres.png
    • heap-sqlite.png

Contribution

  • I can write the design doc for this feature
  • I can contribute this feature

How can it be implemented?

No response

@nadongjun nadongjun added the enhancement New feature or request label Jul 7, 2024
@chriswiggins
Copy link

Does this highlight a need (regardless of database) to be keeping more info in memory, or doing less database work every time a peer update occurs (or a mixture of both)?

I've been diving into the source, and I see a couple of TODOs by @kradalby mentioning caching. Would there be appetite for that?

@CNLHC
Copy link
Contributor

CNLHC commented Jul 11, 2024

Thanks for your meticulous benchmark.
It astonishes me that creating just 600 clients will take hours both in postgresql and sqlite. Intuitively speaking, I think the vital bottleneck here maybe not the database🤔
According to your benchmark results, the gorm overhead seems unexpectedly high.

BTW, if you have raw profile data, maybe flamegraph will provide more insight.

@nadongjun
Copy link
Contributor Author

@CNLHC

The reason creating 600 clients took several hours seems to be because I made the requests synchronously (using tailscale CLI to headscale by shell script). The latency observed in the benchmark is being analyzed in two areas: network latency due to AWS region distances and the database. Once the analysis is complete, I'll share the results.

Currently, I only have raw profile data for SQLite, so I am attaching a flamegraph for that.

sqlite

@kradalby
Copy link
Collaborator

I've posted a comment touching on some of these topics here: #1993 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants