
Restore: add and fill host info in restore progress #4088

Open · wants to merge 7 commits into master
Conversation

@Michal-Leszczynski Michal-Leszczynski commented Oct 29, 2024

This PR creates and fills the new Swagger host restore progress definitions from #4082 in the restore service.
It also updates managerclient to correctly display bandwidth and shard information in sctool progress.

Fixes #4042

@Michal-Leszczynski
Collaborator Author

Examples:

Before first load&stream
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       10s
Progress:       0% | 37%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    316.128k/s
  - Load&stream: unknown

╭─────────────────────────────────────────────────┬──────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │ Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼──────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 0% | 37% │  86k │       0 │    32.245k │      0 │
╰─────────────────────────────────────────────────┴──────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   160.801k/s/shard │               unknown │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   155.433k/s/shard │               unknown │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During restore
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         RUNNING (restoring backed-up data)
Start time:     29 Oct 24 09:27:27 CET
Duration:       31s
Progress:       74% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.585k/s

╭─────────────────────────────────────────────────┬────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │   Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 74% | 100% │  86k │ 64.368k │        86k │      0 │
╰─────────────────────────────────────────────────┴────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           813/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           810/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

During repair
miles@fedora:~/scylla-manager$ ./sctool.dev progress -c myc restore/81d15a37-0774-453b-b6b9-10d9ece8a0e4 --details
Restore progress
Run:            a19376f0-95cf-11ef-9136-0892040e83bb
Status:         DONE
Start time:     29 Oct 24 09:27:27 CET
End time:       29 Oct 24 09:28:02 CET
Duration:       34s
Progress:       100% | 100%
Snapshot Tag:   sm_20241021091028UTC
Bandwidth:
  - Download:    306.050k/s
  - Load&stream: 1.415k/s


╭─────────────────────────────────────────────────┬─────────────┬──────┬─────────┬────────────┬────────╮
│ Keyspace                                        │    Progress │ Size │ Success │ Downloaded │ Failed │
├─────────────────────────────────────────────────┼─────────────┼──────┼─────────┼────────────┼────────┤
│ multi_location_4d99b6b98f8c11efb0cb0892040e83bb │ 100% | 100% │  86k │     86k │        86k │      0 │
╰─────────────────────────────────────────────────┴─────────────┴──────┴─────────┴────────────┴────────╯

Hosts info
╭────────────────┬────────┬────────────────────┬───────────────────────╮
│ Host           │ Shards │ Download bandwidth │ Load&stream bandwidth │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.12 │      2 │   150.350k/s/shard │           724/s/shard │
├────────────────┼────────┼────────────────────┼───────────────────────┤
│ 192.168.200.13 │      2 │   155.797k/s/shard │           724/s/shard │
╰────────────────┴────────┴────────────────────┴───────────────────────╯

@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review October 29, 2024 09:34
@karol-kokoszka
Collaborator

Why is it displayed as kilobytes (160.801k/s/shard) instead of (160.801kB/s/shard)?

Is (813/s/shard) bytes per second per shard? The B is missing.

		}
		return endV.Sub(*start)
	}
	return 0

Suggested change:
-	return 0
+	return time.Duration(0)

for clarity

@karol-kokoszka left a comment
@Michal-Leszczynski idle time per node is missing

@Michal-Leszczynski
Collaborator Author

Why is it displayed as kilobytes (160.801k/s/shard) instead of (160.801kB/s/shard)?
Is (813/s/shard) bytes per second per shard? The B is missing.

I just used the standard way in which we display bytes in managerclient.
I will update it to use KiB instead of k, etc. (in other places as well).

@Michal-Leszczynski idle time per node is missing

That's true! I forgot about your suggestion in the issue.

Unfortunately, it can't be calculated as Restore duration - (time reported as download) - (time reported as load&stream), since we don't really have the Restore duration part. Note that the duration displayed in sctool progress is the duration of the current run, not of the entire task execution. So if a restore ran for 1h and was then paused and resumed, sctool progress would display a duration close to 0, while we would like the idle time to be consistent across runs.

Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side), but even that won't solve the whole problem.

Another issue with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

To overcome that, we would need to know when the download and load&stream stages started and finished, but we currently don't have that information.

For those reasons, I would prefer to skip the idle time display, as it can still be observed via SM metrics.
@karol-kokoszka what do you think about it?

It was a leftover from feature development :/
It's going to be needed for calculating per-shard
download/load&stream bandwidth in the progress command.
This commit also moves host shard info to the tablesWorker,
as it is commonly reused during the restore procedure.
This allows calculating per-shard download/load&stream
bandwidth in the 'sctool progress' display.

s: add and fill host info in prog
It is nicer to see:
"Size: 10B" instead of "Size: 10" or
"Size: 20KiB" instead of "Size: 20k".
@karol-kokoszka
Collaborator

Having information that the node was spending time on something other than l&s or download helps with finding potential optimizations for the restore.
Seeing high bandwidth for load&stream and high bandwidth for download may be misleading when there is no information about how the node was utilized during the restore.

Examples from 3.3.3 show that a node can download fast and l&s fast, yet remain idle for most of the time.

Of course, we could manually calculate the entire task execution duration by traversing previous runs (on the SM side),

Yes, then let's do it.

Another issue with this approach is that the total task execution duration also includes other time-consuming restore stages (e.g. indexing, changing tombstone_gc, rebuilding views, ...). This would result in reporting an overestimated idle time, which could cause more harm than good.

Call it other instead of idle then.

Successfully merging this pull request may close these issues.

Store bandwidth characteristic of Manager restore process