DAOS-18387 test: recovery/ddb.py test_recovery_ddb_ls MD-on-SSD Support#17332

Open
shimizukko wants to merge 21 commits into master from makito/DAOS-18387

@shimizukko (Contributor) commented Dec 31, 2025

To support MD-on-SSD for ddb, we need to support two commands: ddb prov_mem and ddb ls with --db_path.

Update ddb_utils.py to support the new commands.

Update test_recovery_ddb_ls to support MD-on-SSD with the new ddb commands.

We need to update the test yaml to run on MD-on-SSD/HW Medium, but that will break other tests in ddb.py because they don't support MD-on-SSD yet. Keep the original tests as ddb_pmem.py and ddb_pmem.yaml and keep running them on VM (except test_recovery_ddb_ls because that's updated in this PR).

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-medium: false
Test-tag: test_recovery_ddb_ls DdbPMEMTest

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

To support MD-on-SSD for ddb, we need to support two commands:
ddb prov_mem and ddb ls with --db_path.

Update ddb_utils.py to support the new commands.

Add check_ram_used in recovery_utils.py to detect whether
the system is MD-on-SSD.

Update test_recovery_ddb_ls to support MD-on-SSD with the
new ddb commands.

We need to update the test yaml to run on MD-on-SSD/HW Medium,
but that will break other tests in ddb.py because they don't
support MD-on-SSD yet. Keep the original tests as ddb_pmem.py
and ddb_pmem.yaml and keep running them on VM (except
test_recovery_ddb_ls because that's updated in this PR).

Skip-unit-tests: true
Skip-fault-injection-test: true
Skip-func-hw-test-medium: false
Test-tag: test_recovery_ddb_ls DdbPMEMTest
Signed-off-by: Makito Kano <makito.kano@hpe.com>

github-actions bot commented Dec 31, 2025

Ticket title is 'CR Test Update - recovery/ddb.py test_recovery_ddb_ls MD-on-SSD Support'
Status is 'In Review'
Labels: 'catastrophic_recovery'
https://daosio.atlassian.net/browse/DAOS-18387

@daosbuild3 (Collaborator)

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17332/1/display/redirect

@daosbuild3 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17332/5/execution/node/898/log

@shimizukko shimizukko marked this pull request as ready for review January 2, 2026 06:44
@shimizukko shimizukko requested review from a team as code owners January 2, 2026 06:44
@shimizukko shimizukko requested review from dinghwah and phender January 2, 2026 06:44
@shimizukko (Contributor Author)

@phender @dinghwah
In the Git diff, ddb_pmem.py and ddb_pmem.yaml look like new files, but that is only because I removed test_recovery_ddb_ls from the original files and renamed them by adding _pmem. Please focus on the rest of the files.

I have two questions:

  1. In recovery_utils.py, I added check_ram_used to determine whether the system is MD-on-SSD. The logic is that if the test runs on HW Medium and the server config has a ram field, it must be running on MD-on-SSD. Is this okay, or is there a better way?

  2. On MD-on-SSD, we need to load/mount the pool dir to a new location. That location can be anywhere, but I chose /mnt/daos_load to make it consistent with the existing pattern. At the end of the test, I call umount and rm -rf on /mnt/daos_load. Is this okay?

Thanks.
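For reference, the heuristic in question 1 could be sketched roughly as follows. This is only an illustration, not the actual check_ram_used in recovery_utils.py: the function name uses_md_on_ssd, the on_hw_medium flag, and the dictionary layout (a list of engines, each with a list of storage tiers that name their type under "class") are assumptions made for this example.

```python
def uses_md_on_ssd(server_config, on_hw_medium):
    """Guess whether a server runs MD-on-SSD (hypothetical helper).

    Assumption: server_config is the parsed server YAML as a dict, where each
    engine has a "storage" list and each tier names its type under "class".
    """
    # The heuristic only applies to runs on HW Medium.
    if not on_hw_medium:
        return False
    # A "ram" storage tier is taken as the sign of MD-on-SSD.
    for engine in server_config.get("engines", []):
        for tier in engine.get("storage", []):
            if tier.get("class") == "ram":
                return True
    return False


# Two made-up configurations for illustration.
pmem_config = {"engines": [{"storage": [{"class": "dcpm"}, {"class": "nvme"}]}]}
ram_config = {"engines": [{"storage": [{"class": "ram"}, {"class": "nvme"}]}]}
```

With these sample dictionaries, only ram_config combined with on_hw_medium=True is reported as MD-on-SSD. The weak point of such a heuristic is that it infers the mode from the test stage plus the config shape rather than from an explicit setting, which is what the question asks reviewers to confirm.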

Args:
    remote_file_path (str): File path to copy to local.
    test_dir (str): Test directory. Usually self.test_dir.
    remote (str): Remote hostname to copy file from.

Contributor:

get_clush_command requires a NodeSet.

Suggested change
- remote (str): Remote hostname to copy file from.
+ remote (NodeSet): Remote hostname to copy file from.

Contributor Author:

Done

Contributor:

Did the change not get committed?

Contributor Author:

I forgot to add this file. Fixed.

Also use
self.server_managers[0].manager.job.yaml.metadata_params.path.value
to get control_metadata_path.

Update self.fail log message to include failed hosts.

Update comment. Include up-to-date sample output.

Update test yaml timeout value.

@shimizukko shimizukko requested a review from phender February 4, 2026 05:47

vos_paths = self.server_managers[0].get_vos_files(pool)
if not vos_paths:
    self.fail(
        f"vos file wasn't found in {self.server_managers[0].get_vos_paths(pool)[0]}")

Contributor:

Why would we call get_vos_files() a second time for this error message, if it returned an empty list the first time we called it?

Contributor Author:

I don't remember why I wrote this. Fixed.
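One way to avoid the second server query is to build the failure message only from values already in hand. The sketch below is a standalone illustration, not the code that was committed: first_vos_file and its pool_label parameter are hypothetical names, and it raises an exception where the test would call self.fail.

```python
def first_vos_file(vos_files, pool_label):
    """Return the first vos file, or raise without querying the server again.

    Hypothetical helper: vos_files is the list already returned by the lookup,
    and pool_label is any string that identifies the pool in the message.
    """
    if not vos_files:
        # Only values that are already known go into the message.
        raise FileNotFoundError(f"No vos file was found for pool {pool_label}")
    return vos_files[0]
```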

test_clients: 1

- timeout: 1800
+ timeout: 30M

Contributor:

Based upon https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17332/12/testReport/FTEST_recovery/DdbTest/ should this be 6 minutes?

Suggested change
- timeout: 30M
+ timeout: 360

Contributor Author:

On my FTC node, it takes much longer than 6 minutes, so I set it to 30 minutes. Maybe it's slow when the node doesn't have PMEM? I adjusted it to a reasonable value for CI.
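The values discussed in this thread (1800, 30M, 360) mix plain seconds with a minutes suffix. The sketch below shows how they compare once normalized to seconds. It assumes that a bare number means seconds and that a trailing M means minutes, which is how the thread reads; timeout_to_seconds is a hypothetical helper, not the parser the test framework uses.

```python
import re


def timeout_to_seconds(value):
    """Convert a timeout such as 1800, "360", or "30M" to seconds.

    Assumption: a bare number is seconds and a trailing "M" is minutes.
    """
    match = re.fullmatch(r"(\d+)(M?)", str(value).strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized timeout: {value!r}")
    number, unit = match.groups()
    return int(number) * (60 if unit == "M" else 1)
```

Under that assumption, 30M and 1800 are the same duration, and the suggested 360 is 6 minutes.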

@@ -0,0 +1,468 @@
"""
(C) Copyright 2022-2024 Intel Corporation.

Contributor:

Seems this copyright is wrong / copy-pasted?

Contributor Author:

I thought we leave the Intel Copyright statement and add HPE below it.

Contributor:

But this is a new file, right? So there was no work done on it at Intel

Contributor:

Oh, I see from your other comment that you moved it.

ddb_pmem.py used to be ddb.py.

That wasn't obvious to me from GitHub's diff.

Contributor Author:

Yes, I mentioned that in the comment above, but I didn't @-mention you.

        self.random_akey = get_random_string(10)
        self.random_data = get_random_string(10)

    def test_recovery_ddb_rm(self):

Contributor:

In general we should use log_step in these new tests

Contributor Author:

ddb_pmem.py used to be ddb.py. I added MD-on-SSD support for test_recovery_ddb_ls, but then I had to also update the test yaml to support MD-on-SSD, so I renamed the original test to ddb_pmem.py and moved test_recovery_ddb_ls to ddb.py, which now supports both modes. I'll move the remaining tests in ddb_pmem.py to ddb.py and add MD-on-SSD support. At that time, I want to do a bunch of refactoring including log_step.

Contributor:

You're just moving these temporarily to ddb_pmem.py and then plan to move them back? Is it too much work to make them all work now?

Contributor Author:

I refactored ddb_pmem.py to include log_step, etc.

Remove get_vos_paths() from self.fail.

Reduce timeout to 7M.

Use single quotes to surround double quotes.

Move DdbCommand instantiation outside of if-else.

Comment on lines 292 to 293
Before calling this method, "" (two double quotes) needs to be set to
self.vos_path.

Contributor:

I don't understand this comment. Why can't this function handle that?

Contributor Author:

This is my opinion.

If we do self.vos_path.value = '""' here, vos_path = '""' in ddb.py (line 138) would be unnecessary, so we should remove that. In that case, what do we set vos_path to when we instantiate DdbCommand? Setting None and letting prov_mem() update it to '""' later seems odd. I meant the comment as an instruction rather than a requirement.

Contributor:

Why does it need to be set to '""' to begin with? Does the command itself require an empty path? That seems odd

Contributor Author:

Yes, the command itself requires an empty path. I agree it's odd.
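To illustrate why the Python value has to be '""' rather than an empty string, here is a small sketch. It assumes the command line is assembled by joining its parts with spaces and that the vos path comes right after the binary name; build_ddb_command is a hypothetical helper, not the DdbCommand class, and the real arguments of prov_mem are left out because the thread does not show them.

```python
import shlex


def build_ddb_command(vos_path, subcommand, *args):
    """Join a ddb command line from its parts (argument order is an assumption)."""
    return " ".join(["ddb", vos_path, subcommand, *args])


# '""' keeps a literal pair of double quotes in the command string.
quoted = build_ddb_command('""', "prov_mem")
# An empty Python string leaves nothing but an extra space.
unquoted = build_ddb_command("", "prov_mem")

# shlex.split mimics how a POSIX shell would break the string into arguments.
quoted_args = shlex.split(quoted)
unquoted_args = shlex.split(unquoted)
```

Here quoted_args still contains an empty string as its own argument, while in unquoted_args the path is gone and prov_mem moves into its position. This also matches the commit note about using single quotes to surround double quotes.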

@shimizukko shimizukko requested a review from a team February 16, 2026 07:56
@shimizukko (Contributor Author)

@daos-stack/daos-gatekeeper I updated the commit message, but if we simply merge it, I believe the old message will be used. Please use the one shown above when merging.
