CWL steps that return arrays of files should be discovered datasets #1484

Open
davidjsherman opened this issue Dec 12, 2024 · 1 comment

@davidjsherman

We would like to test CWL CommandLineTools that produce an output containing an array of files discovered during execution. This ought to be the CWL equivalent of individual datasets in the Advanced Tool Development Topic on dynamic numbers of outputs, where CWL's internal logic provides the same functionality as the Galaxy discover_datasets element.

All of the documentation and examples of how to make assertions for CWL outputs seem to cover only the case where the output is a single file.

Even though the CommandLineTool doesn't know how many outputs will be produced, for every concrete test case we do know what outputs to expect, and can name them explicitly in the assertions.

If we try to use element_tests on a set of expected outputs, Planemo raises a TypeError in verify_elements, which suggests that Planemo isn't converting the array of files into a data collection as galaxy/tool_util expects.

- doc: generate some subsets by sampling
  job: sample_job.yaml
  outputs:
    samples:
      element_tests:
        subset-1.txt:
          asserts: {"has_n_lines": {"n": 100}}
        subset-2.txt:
          asserts: {"has_n_lines": {"n": 100}}

The error is

File "lib/python3.13/site-packages/galaxy/tool_util/verify/interactor.py", line 1205, in verify_collection
    verify_elements(data_collection["elements"], output_collection_def.element_tests)

Running Planemo under Pdb reveals that data_collection is an array of CWL objects of class File, not a data collection that verify_collection can consume.
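The mismatch can be illustrated in isolation. The following is a minimal sketch, not Galaxy or Planemo code: the two literals are simplified stand-ins, the `element_identifier` key is an assumption about what a collection element looks like, and `get_elements` is a hypothetical helper that only mirrors the `data_collection["elements"]` subscript shown in the traceback. One plausible reading of the failure is that a plain list is being subscripted with a string.

```python
# Simplified stand-ins; the field values are made up for illustration.
cwl_array_output = [
    {"class": "File", "basename": "subset-1.txt"},
    {"class": "File", "basename": "subset-2.txt"},
]

# Assumed shape of what the verifier wants: a mapping with an "elements" key.
collection_like = {
    "elements": [
        {"element_identifier": "subset-1.txt"},
        {"element_identifier": "subset-2.txt"},
    ]
}


def get_elements(data_collection):
    # Hypothetical helper mirroring the subscript in the traceback.
    return data_collection["elements"]


# A mapping with an "elements" key can be subscripted this way.
identifiers = [e["element_identifier"] for e in get_elements(collection_like)]

# A plain list cannot: a string subscript on a list raises TypeError.
try:
    get_elements(cwl_array_output)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(identifiers)
print(error_message)
```

If this reading is right, the fix belongs wherever the CWL output value is handed to the verifier, rather than in the assertions themselves.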

So, in decreasing order of preference, the hope is that

  1. Planemo can in fact make assertions about CWL arrays, but we couldn't find it in the documentation. We would be willing to make a PR to improve the documentation.
  2. There is a way in the Planemo test to declare that the array of files is a data collection, or coerce it.
  3. Planemo needs to be modified to convert CWL arrays to collections, on which assertions can be expressed. I would need advice about where in the code this should happen, before I could say whether we could help.
  4. There is a workaround that involves using another representation for the array of files. This could be considered but would be costly, since our CWL CommandLineTools really do return arrays that subsequent steps scatter over. Normally I would be reluctant to change the representation and the pipelines just to satisfy the testing framework.
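To make option 3 concrete, here is a rough sketch of what such a conversion might look like. Everything in it is an assumption for discussion: `cwl_files_to_collection` is a hypothetical name, and the keys `collection_type`, `elements`, `element_index`, `element_identifier` and `object` are a guess at the structure that verify_collection consumes, not something confirmed from the Galaxy source. Using the CWL `basename` as the element identifier is also only one possible choice.

```python
import posixpath
from typing import Any
from urllib.parse import urlparse


def cwl_files_to_collection(files: list[dict[str, Any]]) -> dict[str, Any]:
    """Hypothetical helper: wrap a CWL File[] value in a list-like collection dict."""
    elements = []
    for index, file_obj in enumerate(files):
        if file_obj.get("class") != "File":
            raise ValueError(f"item {index} is not a CWL File object")
        # Prefer the CWL basename; fall back to the last segment of location.
        identifier = file_obj.get("basename")
        if not identifier:
            location_path = urlparse(file_obj.get("location", "")).path
            identifier = posixpath.basename(location_path)
        if not identifier:
            raise ValueError(f"item {index} has no basename or location")
        elements.append(
            {
                "element_index": index,            # assumed key name
                "element_identifier": identifier,  # assumed key name
                "object": file_obj,                # assumed key name
            }
        )
    return {"collection_type": "list", "elements": elements}


example = cwl_files_to_collection(
    [
        {"class": "File", "basename": "xaa"},
        {"class": "File", "location": "file:///tmp/out/xab"},
    ]
)
print([element["element_identifier"] for element in example["elements"]])
```

With a shape like this, the keys under element_tests would match the file names the glob discovers, which is how the test files in this issue are already written.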

Thanks in advance for any advice you might have

@davidjsherman
Author

Minimal reproducible example

split_test.cwl:

- job: split_job.yaml
  outputs:
    lines:
      element_tests:
        xaa:
          asserts: {"has_n_lines": {"n": 1}}
        xab:
          asserts: {"has_n_lines": {"n": 1}}
        xac:
          asserts: {"has_n_lines": {"n": 1}}

split.cwl:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
id: "split"
label: "split"
doc: "Split file into single lines"

inputs:
  input:
    type: File
    inputBinding:
      position: 1

outputs:
  lines:
    type: File[]
    outputBinding:
      glob: "x??"

baseCommand: ["split", "-l", "1"]

split_job.yaml:

input:
  class: File
  path: "/etc/shells"
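For reference, the expectation encoded in the test file above (each named output exists and has exactly one line) can be written as a plain Python check on a CWL output array. This is a sketch of the intent only, not Planemo's mechanism: `count_lines`, `check_array_output` and `demo` are hypothetical names, the sketch assumes each File object carries `basename` and `path` fields, and the demo writes its own temporary files rather than running split.

```python
import tempfile
from pathlib import Path


def count_lines(path):
    # Count newline-terminated lines without loading the whole file.
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_array_output(files, expected_line_counts):
    """Return a list of problems; an empty list means every expectation held."""
    by_name = {file_obj["basename"]: file_obj for file_obj in files}
    problems = []
    for name, expected in expected_line_counts.items():
        if name not in by_name:
            problems.append(f"missing output file: {name}")
            continue
        actual = count_lines(by_name[name]["path"])
        if actual != expected:
            problems.append(f"{name}: expected {expected} line(s), found {actual}")
    return problems


def demo():
    # Simulate three one-line outputs named like the files in the test above.
    with tempfile.TemporaryDirectory() as tmp:
        files = []
        for name in ("xaa", "xab", "xac"):
            path = Path(tmp) / name
            path.write_text("one line\n")
            files.append({"class": "File", "basename": name, "path": str(path)})
        ok = check_array_output(files, {"xaa": 1, "xab": 1, "xac": 1})
        missing = check_array_output(files, {"xad": 1})
    return ok, missing


ok_problems, missing_problems = demo()
print(ok_problems)
print(missing_problems)
```

The point of the sketch is that the per-file expectations are simple to state once the array is addressable by file name, which is what element_tests would provide if the array were treated as a collection.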
