CWL steps that return arrays of files should be discovered datasets #1484

davidjsherman · 2024-12-12T11:47:03Z

We would like to test CWL CommandLineTools that produce an output containing an array of files discovered during execution. This ought to be the CWL equivalent of individual datasets in the Advanced Tool Development Topic on dynamic numbers of outputs, where CWL's internal logic provides the same functionality as the Galaxy discover_datasets element.

All of the documentation and examples of how to make assertions for CWL outputs seem to only treat the case where the output is a single file.

Even though the CommandLineTool doesn't know how many outputs will be made, concretely for every test case we do know what outputs to expect, and can name them explicitly in the assertions.

If we try to use element_tests on a set of expected outputs, Planemo raises a TypeError in verify_elements, which suggests that Planemo isn't converting the array of files into a data collection as galaxy/tool_util expects.

- doc: generate some subsets by sampling
  job: sample_job.yaml
  outputs:
    samples:
      element_tests:
        subset-1.txt:
          asserts: {"has_n_lines": {"n": 100}}
        subset-2.txt:
          asserts: {"has_n_lines": {"n": 100}}

The error is

File "lib/python3.13/site-packages/galaxy/tool_util/verify/interactor.py", line 1205, in verify_collection
    verify_elements(data_collection["elements"], output_collection_def.element_tests)

Running Planemo under Pdb reveals that data_collection is an array of CWL objects of class File, not a data collection that verify_collection can consume.

So, in decreasing order of preference, the hope is that one of the following holds:

1. Planemo can in fact make assertions about CWL arrays, but we couldn't find how in the documentation. We would be willing to make a PR to improve the documentation.
2. There is a way in the Planemo test to declare that the array of files is a data collection, or to coerce it into one.
3. Planemo needs to be modified to convert CWL arrays to collections, on which assertions can be expressed. I would need advice about where in the code this should happen before I could say whether we could help.
4. There is a workaround that involves using another representation for the array of files. This could be considered but would be costly, since our CWL CommandLineTools really do return arrays that subsequent steps scatter over. Normally I would be reluctant to change the representation and the pipelines just to satisfy the testing framework.
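
To make the idea of converting a CWL array into a collection concrete, here is a minimal sketch of what such a coercion might look like. It is not Planemo or Galaxy code: the helper name `cwl_file_array_to_collection` is hypothetical, and the element keys (`element_identifier`, `element_index`, `element_type`, `object`) are an assumption about the shape that verify_elements expects, not something we have confirmed in galaxy/tool_util.

```python
from typing import Any


def cwl_file_array_to_collection(files: list[dict[str, Any]]) -> dict[str, Any]:
    """Wrap a CWL File[] output in a list-collection-like dict.

    Hypothetical helper for illustration only; the output shape is an
    assumption, not the documented contract of verify_collection.
    """
    elements = []
    for index, cwl_file in enumerate(files):
        if cwl_file.get("class") != "File":
            raise TypeError(f"expected a CWL File object, got {cwl_file!r}")
        # Assumption: the basename serves as the element identifier, since
        # that is what the element_tests keys in the test file refer to.
        elements.append(
            {
                "element_index": index,
                "element_identifier": cwl_file["basename"],
                "element_type": "hda",
                "object": cwl_file,
            }
        )
    return {"collection_type": "list", "elements": elements}


# Example input shaped like the array of File objects seen under Pdb
# (the locations are made up for the illustration).
outputs = [
    {"class": "File", "basename": "xaa", "location": "file:///tmp/out/xaa"},
    {"class": "File", "basename": "xab", "location": "file:///tmp/out/xab"},
]
collection = cwl_file_array_to_collection(outputs)
print([e["element_identifier"] for e in collection["elements"]])
```

With something along these lines, each key under element_tests could be matched against an element identifier, but where such a step belongs in Planemo is exactly the open question.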

Thanks in advance for any advice you might have


davidjsherman · 2024-12-12T14:03:51Z

Minimal reproducible example

split_test.cwl

- job: split_job.yaml
  outputs:
    lines:
      element_tests:
        xaa:
          asserts: {"has_n_lines": {"n": 1}}
        xab:
          asserts: {"has_n_lines": {"n": 1}}
        xac:
          asserts: {"has_n_lines": {"n": 1}}

split.cwl:

#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: CommandLineTool
id: "split"
label: "split"
doc: "Split file into single lines"

inputs:
  input:
    type: File
    inputBinding:
      position: 1

outputs:
  lines:
    type: File[]
    outputBinding:
      glob: "x??"

baseCommand: ["split", "-l", "1"]

split_job.yaml

input:
  class: File
  path: "/etc/shells"
