Context
Undersampling occurs in augur filter when the number of sequences available in a group is lower than the targeted group size. This is not reported in any output. It is explained in this recently added docs section:
Consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. --group-by year --subsample-max-sequences 300 is equivalent to --group-by year --sequences-per-group 150. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.
Some historical context from #1454 (comment):
In the original formulation, with only --sequences-per-group, the idea was that specifying --sequences-per-group 10 and --group-by country would target 10 sequences per country, randomly sampled within each country group. In that formulation, we wouldn't top up other countries. I think this is a semantic complication of adding the convenience parameter --subsample-max-sequences. I'd think of --subsample-max-sequences as solely a way of specifying --sequences-per-group.
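To make the arithmetic in the docs example concrete, here is a minimal sketch. It assumes the per-group target is simply the maximum divided evenly across groups, which matches the 300 / 2 = 150 example above; augur's actual calculation may differ, and simulate_uniform_sampling is a hypothetical helper, not part of augur.

```python
def simulate_uniform_sampling(available, max_sequences):
    """Return (per-group target, {group: output size}) for an even split.

    Assumption: the target is max_sequences divided evenly across groups.
    Groups with fewer sequences than the target are not topped up from others.
    """
    target = max_sequences // len(available)
    output = {group: min(count, target) for group, count in available.items()}
    return target, output


# Numbers from the docs example: 200 sequences from 2023, 100 from 2024.
available = {"2023": 200, "2024": 100}
target, output = simulate_uniform_sampling(available, 300)

print(target)                # 150
print(output)                # {'2023': 150, '2024': 100}
print(sum(output.values()))  # 250, short of the requested 300
```

The shortfall of 50 comes entirely from the 2024 group, and nothing in the result signals it unless the caller compares the total against the requested maximum.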
Possible solutions
Roughly sorted from least to most work involved.
1. Add warnings. Example:
   WARNING: Targeted 150 sequences for group [year='2024'] but only 100 are available.
2. Add an option --output-group-by-sizes to highlight any discrepancies. Example:
   | year | target size | available sequences | output size |
   | --- | --- | --- | --- |
   | 2023 | 150 | 200 | 150 |
   | 2024 | 150 | 100 | 100 |
   Both (1) and (2) have been adopted for --group-by-weights in #1454, but they could be extended to other sampling methods.
3. Create an "augur filter GUI" with a sidebar of controls for adjusting augur filter parameters and graphs in the main view that show the spread of the output data.
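As a rough illustration of how options (1) and (2) could work together for uniform sampling, the sketch below builds the warning messages and a tab-separated size report from per-group counts. It is not augur's implementation: report_group_sizes and its arguments are hypothetical, and the warning wording and column names are copied from the examples above.

```python
import csv
import io


def report_group_sizes(group_by, target, available):
    """Build warning messages and a TSV size report for each group.

    Hypothetical helper: `available` maps a group value to the number of
    sequences available in that group; `target` is the per-group target.
    """
    messages = []
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow([group_by, "target size", "available sequences", "output size"])
    for group, count in sorted(available.items()):
        output_size = min(count, target)
        if count < target:
            # Same wording as the warning example in option (1).
            messages.append(
                f"WARNING: Targeted {target} sequences for group "
                f"[{group_by}='{group}'] but only {count} are available."
            )
        writer.writerow([group, target, count, output_size])
    return messages, buffer.getvalue()


messages, table = report_group_sizes("year", 150, {"2023": 200, "2024": 100})
for message in messages:
    print(message)
print(table, end="")
```

With the numbers from this issue, only the 2024 group produces a warning, and the report has one row per group with the same four columns as the example table.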