Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to dropshot 0.15 #7196

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

update to dropshot 0.15 #7196

wants to merge 11 commits into from

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Dec 3, 2024

No description provided.

hawkw and others added 3 commits December 3, 2024 12:11
As of oxidecomputer/dropshot@f0e6a46,
the OpenAPI documents for APIs that don't use custom user-defined error
types should be unchanged from the previous Dropshot release. This
commit updates that and undoes the diff to the OpenAPI docs that broke
progenitor.
@iliana iliana changed the title [WIP] update to oxidecomputer/dropshot#1180 update to dropshot 0.15 Dec 5, 2024
@iliana iliana marked this pull request as ready for review December 5, 2024 21:58
@iliana
Copy link
Contributor

iliana commented Dec 5, 2024

I think this is ready to merge barring any test failures, but I want to merge it after we branch for R12.

@iliana
Copy link
Contributor

iliana commented Dec 6, 2024

I'm not completely sure yet but I think this is failing in CI because sled-agent is using dropshot 0.15's default_request_body_max_bytes to write a config file for MGS, but MGS is on an older dropshot. https://buildomat.eng.oxide.computer/wg/0/artefact/01JECD8056131TGQ88YT5461PR/vf0F9NvP9NS6utkc5JpUxVireyQNiKooCBzR9UHjjaHeQJGo/01JECD8EAYXDC7E33WV6TV4JNC/01JECG53D8QTFM4MT6VK7BZH49/oxide-mgs:default.log?format=x-bunyan

EDIT: wait no that's not a generated config.

@iliana iliana force-pushed the eliza/dropshot-1180 branch from 45b9156 to f7cfe6d Compare December 6, 2024 17:56
@davepacheco
Copy link
Collaborator

We'll want to update this to 0.15.1, which I expect to be available shortly. (See oxidecomputer/dropshot#1195.)

@hawkw
Copy link
Member Author

hawkw commented Dec 11, 2024

IIRC we wanted to wait until after the release to merge this, is that still the case?

@iliana
Copy link
Contributor

iliana commented Dec 11, 2024

After we branch for R12 yeah, but I think that's just my risk assessment. I don't know how soon we're branching.

@@ -60,7 +60,7 @@ bind_address = "127.0.0.1:12222"
#
# This should be brought back down to a more reasonable value once per-endpoint
# request body limits are implemented.
request_body_max_bytes = 3221225472
default_request_body_max_bytes = 3221225472
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for picking this up!

Think the comment above here may need to be updated since 0.15 implements these limits. (Just saying that this should be brought down is good, I think.)

@@ -46,7 +46,7 @@ bind_address = "127.0.0.1:12220"
#
# This should be brought back down to a more reasonable value once per-endpoint
# request body limits are implemented.
request_body_max_bytes = 3221225472
default_request_body_max_bytes = 3221225472
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@@ -2359,7 +2359,7 @@ impl ServiceManager {
bind_address: SocketAddr::new(*opte_ip, nexus_port),
// This has to be large enough to support:
// - bulk writes to disks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one too I think.

Comment on lines 45 to 47
# Host OS images are just over 800 MiB currently; set this to 2 GiB to give some
# breathing room.
request_body_max_bytes = 2_147_483_648
default_request_body_max_bytes = 2_147_483_648
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here -- just a todo mentioning that it's now possible to bring this down and use the per-endpoint limit.

Comment on lines 41 to 43
# Host OS images are just over 800 MiB currently; set this to 2 GiB to give some
# breathing room.
request_body_max_bytes = 2_147_483_648
default_request_body_max_bytes = 2_147_483_648
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

Comment on lines 82 to 84
# Host OS images are just over 800 MiB currently; set this to 2 GiB to give some
# breathing room.
request_body_max_bytes = 2_147_483_648
default_request_body_max_bytes = 2_147_483_648
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here.

@iliana
Copy link
Contributor

iliana commented Dec 13, 2024

I figure since I intend this to land post-R12 I may as well make the values sensible now. If we log request body sizes it might be good to pull those logs from dogfood as well to see if we would be overlooking any endpoints.

@iliana
Copy link
Contributor

iliana commented Dec 16, 2024

I've found most of the larger-than-necessary API-wide limits but I'm still finding some more.

@iliana
Copy link
Contributor

iliana commented Dec 16, 2024

I think that's everything, except for a few defaults that remain set above 1 MiB:

  • A handful of Oximeter producer endpoints use 10 MiB as the max body size. I'm not entirely sure why? (Oximeter retrieves these metrics without sending large request bodies, right?)
  • External/internal DNS use 100 MiB as the max body size but I don't have the necessary context to safely lower this.
  • wicketd is 8 MiB (this was the previous value before it was bumped to 4 GiB to support TUF repos).

And for completion, testing-only values:

  • The simulated Crucible Pantry in sled-agent/src/sim has an 8 MiB limit.
  • Transient DNS servers use a 4 MiB limit; the DNS server used in nexus-test-utils and for testing the resolver implementation use an 8 MiB limit.

Given how wide-reaching of a change this has the opportunity to be, I'm running mupdate, rack init, and live-tests on dublin from this branch.

@hawkw
Copy link
Member Author

hawkw commented Dec 16, 2024

Given how wide-reaching of a change this has the opportunity to be, I'm running mupdate, rack init, and live-tests on dublin from this branch.

Yeah, that sounds like a good call --- thanks for taking this over while I was moving!

@iliana
Copy link
Contributor

iliana commented Dec 16, 2024

[switch 1 00:00:08] Completed .. 4a (8/8) Checking the new RoT bootloader: after 3.01s
[switch 1 00:00:08] Completed (4/6) Updating RoT bootloader: after 6.86s
[switch 1 00:00:08]   Running (5/6) Updating RoT
[switch 1 00:00:08]   Skipped (5/6) Updating RoT: RoT active slot already at version 1.0.30
[switch 1 00:00:08]   Running (6/6) Updating SP
[switch 1 00:00:08]   Running .. 6a (1/4) Sending data to MGS (slot 0)
[switch 1 00:00:08]    Failed .. 6a (1/4) Sending data to MGS (slot 0): after 42.87ms: SP component update failed at stage "sending" for switch_sp:sidecar-c (version 1.0.32)
[switch 1 00:00:08]           .. 6a   Caused by:
[switch 1 00:00:08]           .. 6a   - Error Response: status: 400 Bad Request; headers: {"content-type": "application/json", "x-request-id": "0b1e394e-1722-4ff9-81b5-4767c196b43f", "content-length": "126", "date": "Sun, 28 Dec 1986 00:09:02 GMT"}; value: Error { error_code: None, message: "request body exceeded maximum size of 1048576 bytes", request_id: "0b1e394e-1722-4ff9-81b5-4767c196b43f" }

haha woooo

@iliana
Copy link
Contributor

iliana commented Dec 17, 2024

Okay, mupdate, rack init, and basic operation including instance launch all seem fine on a racklette with this installed. Mupdate was performed from a switch zone built on this branch. Trying to get live-tests going but running into unrelated issues.

There appear to be no rejected requests due to unexpected large request bodies:

root@oxz_switch0:~# pilot host exec -O -c "grep 'request body exceeded maximum size' /var/svc/log/* /zone/oxz_switch/root/var/svc/log/* /pool/ext/*/crypt/zone/*/root/var/svc/log/* /pool/ext/*/crypt/debug/*" 14-17
=== Failure from BRM42220026 (cubby 14):
exit code 1

=== Failure from BRM27230037 (cubby 15):
exit code 2: grep: can't open "/zone/oxz_switch/root/var/svc/log/*"

=== Failure from BRM23230018 (cubby 16):
exit code 1

=== Failure from BRM23230010 (cubby 17):
exit code 2: grep: can't open "/zone/oxz_switch/root/var/svc/log/*"

@sunshowers
Copy link
Contributor

This is awesome, thanks for being so thorough iliana!

@iliana
Copy link
Contributor

iliana commented Dec 17, 2024

I may have found an issue while running live tests related to internal DNS, but it's unclear as of yet if that's caused by this PR or something else.

@hawkw
Copy link
Member Author

hawkw commented Jan 9, 2025

@iliana are you still actively working on this? if not, i'm happy to pick it back up, if there's anything remaining to do --- any updates on the live test internal DNS issue?

@iliana
Copy link
Contributor

iliana commented Jan 9, 2025

Oh I knew I was forgetting something. Feel free to pick this back up!

The live test issue was actually #7265 -- the live test failed because there was a different problem. I think we should rebase this PR and do another test run on a racklette. All of the various issues with live tests should be fixed in current main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants