A small Python CLI for synchronizing data files from a "drop" directory into a canonical repository structure, validating integrity (checksums), and preparing metadata for recording in a LabKey server. Configuration is loaded solely from a per-drop-folder sync.yml to allow site-specific behavior.
- Overview
- Features
- Architecture and Key Files
- Installation
- Quickstart
- Configuration
- Usage
  - sync
- How It Works (Detailed Flow)
- Output and Tables
- Known Limitations and Gaps
- Roadmap / Next Steps
- Development
- Troubleshooting
- License
## Overview

This tool automates synchronizing files discovered in a "drop" directory into a structured repository path. It uses a filename filter and a strict regular expression with named groups to parse metadata from filenames, computes checksums, performs a presence check in LabKey, and copies files when run in execute mode. It can update existing LabKey rows (and optionally insert new ones) based on a configurable mapping.

After a successful and verified copy to the repository, the original drop files can be moved (archived) into a configured processed folder, preserving the subdirectory structure relative to the drop folder. Both copy and move actions are executed only when `--do-it` is provided; otherwise, the tool prints detailed dry-run plans.
## Features

- [Discovery and templating] Glob discovery (`drop_filename_filter`) and strict regex parsing (`drop_filename_regex`) with named capture groups.
- [Target rendering] `repository_filename` supports placeholders: regex groups, `<run>` (auto-increment), `<hash>` (CRC32), and `<source.Field>` from matched metadata.
- [Integrity] If a `.md5` sidecar exists, verify before copying; else compute a BLAKE3 digest and write a `.blake3` sidecar pre-copy. Post-copy verification is a block-by-block compare.
- [LabKey presence + write-back] Presence check via `field_parameters.file_list` (default CONTAINS by target path) or a configured presence field with `equal|contains` semantics. In execute mode, updates existing rows and optionally inserts new ones unless `writeback.skip_creates: true`.
- [Skip creates] When `skip_creates: true`, planned creates are suppressed in logs; copy and move are also skipped for files where write-back is skipped (including existing rows without a `RowId`).
- [Dry-run plans]
  - Pre-run plan table for each file (including `write_action: update|create|skip_create`).
  - Copy plan summary: `would_copy` or `would_skip:<reason>`.
  - Update diffs per matched row; planned create fields (suppressed when `skip_creates: true`).
  - Archive/move plan summary mirroring copy gating: `would_move` or `would_skip:<reason>`.
- [Executed run summaries]
  - Copy summary (copied vs. skipped with reasons), update diffs, planned creates (suppressed when `skip_creates: true`).
  - Archive/move summary after verified copies: moves originals under `processed_folder`, preserving structure.
- [Console + log] Styled tables via `rich` and a tee'd log file at `dm/logs/sync/<runmode>/<dataset>-<timestamp>.log`.
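The tee'd logging behavior can be sketched in a few lines. This is an illustrative stand-in, not the actual helper in `dm/helpers.py`; the log path layout is taken from the feature list above, while the class name, timestamp format, and constructor arguments are assumptions:

```python
import sys
from datetime import datetime
from pathlib import Path


class TeeLog:
    """Mirror everything written to it to both the console and a log file.

    Illustrative sketch only; dm/helpers.py may implement this differently.
    """

    def __init__(self, dataset: str, runmode: str = "dry-run",
                 base: Path = Path("dm/logs/sync")):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        log_dir = base / runmode
        log_dir.mkdir(parents=True, exist_ok=True)
        self.path = log_dir / f"{dataset}-{stamp}.log"
        self._log = open(self.path, "w")

    def write(self, text: str) -> None:
        sys.stdout.write(text)   # console
        self._log.write(text)    # log file

    def flush(self) -> None:
        sys.stdout.flush()
        self._log.flush()


# Usage: route output through the tee
tee = TeeLog("example-dataset")
tee.write("plan table would be printed here\n")
tee.flush()
```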
## Architecture and Key Files

- `dm/dm.py` — `sync`: main CLI command. Loads `<drop_folder>/sync.yml`, discovers files, renders targets, checks integrity, performs optional metadata matching and the LabKey presence check, and outputs dry-run plans. With `--do-it`, performs copy + verify, write-back (update/insert), copies sidecars, and archives originals into `processed_folder`.
- `dm/helpers.py` — `Message`, `TableOutput`, `Hasher` utilities.
- `dm/integrity.py` — Read `.md5`, write `.blake3`, and copy matching sidecar files.
- `dm/metadata.py` — General-purpose metadata loaders (LabKey/Excel/CSV). Note: the current pipeline in `dm/dm.py` performs simplified LabKey-based lookups inline.
- `dm/sync.yml.TEMPLATE` — Reference configuration for per-drop `sync.yml` files.
- `environment.yml`, `Makefile`, `.vscode/settings.json` — Environment, development helpers, and editor config.
## Installation

Recommended with conda or mamba.

- Create/update the environment with the Makefile: `make envupdate`
- Or manually with conda/mamba:

  ```bash
  # Recommended: create a new environment named dm
  yes | conda env create -f environment.yml -n dm || conda env update -f environment.yml -n dm --prune
  conda activate dm
  ```

  Note: this project uses a standard, portable `environment.yml` without a hard-coded prefix.

- Verify dependencies (installed via pip in the environment): `pyyaml`, `labkey`, `yachalk`, `rich`, `click`, `python-dotenv`, `openpyxl`, `blake3`
## Quickstart

Follow these steps to try the tool with a minimal configuration.

- Create a per-drop config at `<drop_folder>/sync.yml`. Minimal config (no metadata usage):

  ```yaml
  # <drop_folder>/sync.yml
  drop_filename_filter: "**/*.fastq.gz"
  drop_filename_regex: "[^\\/]+\\/(?P<phase>[^\\/]+)\\/.*(?P<seq>SEQ_[A-Z]{5}).*(?P<prefix>(?P<lib>[A-Z]{3}_[A-Z]{6})_[A-Z][0-9]+_[A-Z][0-9]{3})_(?P<suffix>[^.]+)"
  repository_folder: /cluster/work/nexusintercept/data-repository
  repository_filename: scRNA/raw/<phase>/<lib>/<lib>_<seq>_r<run>__<suffix>.fastq.gz
  processed_folder: /cluster/work/nexusintercept/data-processed
  filename_sequence: run
  labkey:
    host: intercept-labkey-dev.nexus.ethz.ch  # set to your LabKey host
    container: "LOOP Intercept"               # set to your LabKey container/folder
    schema: exp                               # adjust to your schema
    table: data                               # adjust to your table
  field_parameters:
    Path_To_Synced_Data: file_list
  ```

- Run a dry run (planning only), then actually copy with `--do-it`:

  ```bash
  python dm/dm.py sync --drop-folder /path/to/drop
  python dm/dm.py sync --drop-folder /path/to/drop --do-it
  ```

Tiny example: include a field from LabKey in the target path using `<source.Field>`.
Add a simple metadata source and match rule, then reference it in `repository_filename`:

```yaml
# Snippet to add to <drop_folder>/sync.yml
metadata_sources:
  - name: lk_experiments
    type: labkey
    host: intercept-labkey-dev.nexus.ethz.ch
    container: "LOOP Intercept"
    schema: exp
    table: data
    columns: [Name, Uploaded_Filename_Prefix]

metadata_match:
  key_template: <prefix>
  search:
    - source: lk_experiments
      field: Uploaded_Filename_Prefix

# Now you can use a metadata placeholder in the filename template
repository_filename: scRNA/raw/<phase>/<lib>/<lk_experiments.Name>_r<run>__<suffix>.fastq.gz
```

Notes:

- Ensure your drop files match the `drop_filename_regex` (update the regex to fit your naming scheme).
- Adjust LabKey `schema` and `table` to your server's layout. If you use a custom table, set `schema` (e.g., `exp`) and `table` (e.g., `scRNA_Experiments`) accordingly.
- If a `.md5` sidecar exists next to a source file, it will be verified. Otherwise a `.blake3` sidecar is created before copying.
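Before a first run, it can help to confirm that the quickstart regex actually matches your drop paths. A quick standalone check with plain Python (the sample path below is hypothetical; substitute one of your own files):

```python
import re

# The quickstart drop_filename_regex, as Python sees it after YAML unescaping.
DROP_REGEX = (
    r"[^/]+/(?P<phase>[^/]+)/"
    r".*(?P<seq>SEQ_[A-Z]{5})"
    r".*(?P<prefix>(?P<lib>[A-Z]{3}_[A-Z]{6})_[A-Z][0-9]+_[A-Z][0-9]{3})"
    r"_(?P<suffix>[^.]+)"
)

# Hypothetical drop path matching the quickstart naming scheme.
path = "drop01/phase1/run_SEQ_ABCDE_XYZ_ABCDEF_L1_R001_R1.fastq.gz"

m = re.search(DROP_REGEX, path)
if m:
    for name, value in m.groupdict().items():
        print(f"{name} = {value}")
else:
    print("no match -- this file would be skipped by the regex filter")
```

Files that fail the regex are listed as regex-skipped during discovery, so a non-matching path here predicts a skip in the plan output.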
## Configuration

Configuration is loaded from a single YAML file located in the drop folder:

- Local: `<drop_folder>/sync.yml` (must exist in the drop folder you pass via `--drop-folder`).

Use `dm/sync.yml.TEMPLATE` as a reference when creating per-drop sync files. The drop folder path itself is provided only via the CLI option `--drop-folder` and is not included in any YAML.

Key settings in `sync.yml`:

- `drop_filename_filter`: Glob pattern to find files (e.g., `**/*.fastq.gz`).
- `drop_filename_regex`: Regex with named groups to parse metadata from file paths. Example groups used: `phase`, `seq`, `prefix`, `lib`, `suffix`.
- `repository_folder`: Root folder where files are to be synchronized.
- `repository_filename`: Template for the target filename, e.g., `scRNA/raw/<phase>/<lib>/<lib>_<seq>_r<run>__<suffix>.fastq.gz`. Placeholders are replaced using the named groups and special values. Also supports metadata placeholders of the form `<source.Field>` after metadata matching, e.g., `scRNA/raw/<phase>/<lib>/<phase>_<lk_experiments.Name>_r<run>__<suffix>.fastq.gz`.
- `processed_folder`: Parent folder where original drop files are moved after a verified copy. The relative path under the drop folder is preserved.
- `filename_sequence`: Either `run` (increments `<run>`) or `hash` (sets `<hash>` to a short CRC32-derived value).
- `date_format`: Datetime format string intended for use by helper functions like `now()` or `drop_file_mtime()`. Reserved for future LabKey write-back (not used by the core flow yet).
- `labkey`:
  - `host`: LabKey host
  - `container`: LabKey container/folder (e.g., `LOOP mTORUS`)
  - `schema`: Target schema name only (no dots), e.g., `exp` or your custom schema
  - `table`: Target table name in that schema, e.g., `16S_Experiments`
  - `context`: Optional context path; passed through to the API wrapper
- `metadata_sources`: External metadata sources. The current pipeline in `dm/dm.py` performs inline LabKey lookups using the top-level LabKey connection. Other types (Excel/CSV) exist in `dm/metadata.py` but are not invoked by the default pipeline.
- `metadata_match`: Rules to find the metadata row for each file before syncing.
  - `key_template`: Default template string to render a metadata key from the filename regex variables (e.g., `<prefix>`, `<lib>_<seq>`).
  - `search`: Ordered list of rule objects, each with:
    - `source`: Name of the source from `metadata_sources` to search.
    - `field`: Field/column name within that source to match against.
    - `key_template` (optional): Override the default template for this rule.
- `metadata_required`: Boolean flag. If `true`, files without a matching metadata row are skipped (reason `metadata_missing`) for both copy and archive plans.
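The two `filename_sequence` modes can be illustrated with a short sketch. This is a hypothetical helper, not the tool's code; in particular, the exact string fed into CRC32 for `<hash>` is an assumption:

```python
import zlib
from pathlib import Path


def resolve_sequence(rendered: str, repo_root: Path, mode: str = "run") -> str:
    """Resolve <run>/<hash> in an already-templated target path.

    Hypothetical sketch of the behavior described above; dm/dm.py's
    actual collision handling and hash input may differ.
    """
    if mode == "hash":
        # Deterministic short value derived from CRC32 (the hash input is assumed).
        short = f"{zlib.crc32(rendered.encode()) & 0xFFFFFFFF:08x}"
        return rendered.replace("<hash>", short)
    # mode == "run": increment until the candidate target does not exist yet.
    run = 1
    while (repo_root / rendered.replace("<run>", str(run))).exists():
        run += 1
    return rendered.replace("<run>", str(run))


# Example with the quickstart template after group substitution
# (repo path is hypothetical and assumed empty, so <run> resolves to 1):
target = resolve_sequence(
    "scRNA/raw/phase1/XYZ_ABCDEF/XYZ_ABCDEF_SEQ_ABCDE_r<run>__R1.fastq.gz",
    Path("/tmp/empty-repo"),
)
print(target)
```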
Example `metadata_sources` configuration:

```yaml
metadata_sources:
  - name: lk_experiments
    type: labkey
    host: your-labkey-host.example.org
    container: "Your Container"
    schema: exp
    table: data
    columns: [Name, Created, Run, Path_To_Synced_Data]
    filters:
      - { field: Path_To_Synced_Data, type: contains, value: "/scRNA/raw/" }
  - name: sample_manifest
    type: excel
    path: manifests/sample_manifest.xlsx  # relative to the drop folder unless absolute
    sheet: Sheet1                         # optional; defaults to active sheet
  - name: barcodes
    type: csv
    path: manifests/barcodes.csv          # relative to the drop folder unless absolute
    delimiter: ","                        # optional; default is comma

metadata_match:
  # Default key template used to render a key per file from regex variables
  key_template: <prefix>
  # Ordered rules (first match wins)
  search:
    - source: lk_experiments
      field: Uploaded_Filename_Prefix
      # key_template: <prefix>
    - source: sample_manifest
      field: Prefix
      key_template: <lib>_<seq>
```

Notes:
- `fields`: Mapping used to build LabKey rows during write-back. Supports placeholders and functions (`now()`, `drop_file_mtime()`).
- `field_parameters`:
  - `file_list`: identifies the field used for the presence CONTAINS check by target path (e.g., `Path To Synced Data`).
  - `file_list_aggregator`: reserved for future aggregation behavior during write-back.
- `replacements`:
  - `before_match` applies to captured/derived variables before templating and metadata matching.
  - `before_writeback` can transform variables (`target: var`) prior to rendering, or fields (`target: field`) after rendering.
Example template `dm/sync.yml.TEMPLATE` (excerpt):

```yaml
drop_filename_regex: "[^\/]+\/(?P<phase>[^\/]+)\/.*(?P<seq>SEQ_[A-Z]{5}).*(?P<prefix>(?P<lib>[A-Z]{3}_[A-Z]{6})_[A-Z][0-9]+_[A-Z][0-9]{3})_(?P<suffix>[^.]+)"
drop_filename_filter: "**/*.fastq.gz"
repository_folder: data/repository
repository_filename: scRNA/raw/<phase>/<lib>/<lib>_<seq>_r<run>__<suffix>.fastq.gz
processed_folder: data/processed
filename_sequence: run
labkey:
  host: your-labkey-host.example.org
  container: LOOP Intercept
  schema: exp
  table: data
fields:
  Name: <lib>_<seq>_r<run>
  Project_Phase: <phase>
  Run_Number: <run>
  Uploaded_Filename_Prefix: <prefix>
  Data_synced: true
  Date_Of_Syncing: now()
  Date_Of_Uploading: drop_file_mtime()
  Path_To_Synced_Data:
field_parameters:
  Path_To_Synced_Data: file_list
  Uploaded_Filename_Prefix: file_list_aggregator
lookups:
  phase:
    btki: BTKi
```

## Usage

Run commands from the repo root unless otherwise noted.
### sync

Synchronize files from a drop folder into the repository.

Typical usage (configuration is taken from `<drop_folder>/sync.yml`):

```bash
python dm/dm.py sync --drop-folder /path/to/drop

# Execute copy/write-back/move after confirmation
python dm/dm.py sync --drop-folder /path/to/drop --do-it
```

Notes:

- The regex is enforced for planning; non-matching files are listed as skipped by regex in discovery output.
- `<run>` increments to avoid collisions within a run when `filename_sequence: run`. If `hash` is enabled, `<hash>` is a deterministic CRC32.
- Integrity policy:
  - If `.md5` exists: MD5 must match; otherwise the file is skipped.
  - If `.md5` is absent: a `.blake3` sidecar is computed and written before copy.
- With `--do-it`, after a verified copy, sidecars are copied to the repository and originals are moved under `processed_folder` (preserving structure).
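The integrity policy above amounts to something like the following sketch. It is illustrative only: the real logic lives in `dm/integrity.py`, the sidecar naming (`<file>.md5` / `<file>.blake3` next to the source) is an assumption, and the tool uses the `blake3` package rather than the hashlib stand-in used in the fallback branch:

```python
import hashlib
from pathlib import Path


def check_integrity(src: Path) -> bool:
    """Return True if src may be copied, per the policy described above.

    Illustrative sketch; sidecar naming is an assumption.
    """
    md5_sidecar = Path(str(src) + ".md5")
    if md5_sidecar.exists():
        # .md5 present: the recorded digest must match, else the file is skipped.
        actual = hashlib.md5(src.read_bytes()).hexdigest()
        recorded = md5_sidecar.read_text().split()[0].lower()
        return actual == recorded
    # No .md5: record a digest sidecar before copying (never treated as failure).
    try:
        from blake3 import blake3  # the tool's actual dependency
        digest = blake3(src.read_bytes()).hexdigest()
    except ImportError:
        digest = hashlib.blake2b(src.read_bytes()).hexdigest()  # stand-in only
    Path(str(src) + ".blake3").write_text(digest + "\n")
    return True
```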
## How It Works (Detailed Flow)

Inside `dm/dm.py` (`sync` command):

- Configuration load: The tool reads `<drop_folder>/sync.yml` directly, based on the `--drop-folder` argument.
- File discovery: Uses `glob` with `drop_filename_filter` to find candidate files in the drop folder; prints matched and regex-skipped lists.
- Regex validation and capture: Validates each file path against `drop_filename_regex`; extracts named groups (e.g., `phase`, `seq`, `lib`, `prefix`, `suffix`).
- Derivations and replacements: Optionally derive variables from matched metadata per `metadata_derive`, then apply `replacements.before_match` to captured/derived variables.
- Target filename rendering: Render `repository_filename` with variables and `<source.Field>` placeholders (from matched metadata). Resolve `<run>` collisions or set `<hash>` per `filename_sequence`.
- Integrity check:
  - If `.md5` exists: compute and compare MD5; mismatches skip copying.
  - If no `.md5`: compute BLAKE3 and write a `.blake3` sidecar prior to copy.
- LabKey presence check: By default uses `field_parameters.file_list` with a CONTAINS filter on the target path, or `presence_check.field` with `equal|contains`. Annotates each file with `in_labkey` and `existing_row`.
- Pre-run reporting:
  - Plan table with key annotations including `write_action: update|create|skip_create`.
  - Dry-run write-back tables: update diffs and (unless `skip_creates`) planned create fields.
  - Copy plan summary: `would_copy` or `would_skip:<reason>`.
  - Archive/move plan summary: `would_move` or `would_skip:<reason>` (mirrors copy gating).
- Execute mode (`--do-it`):
  - Enforce gating: metadata required, MD5 pass if present, and write-back viability (skip if `skip_creates` blocks or `RowId` is missing for updates).
  - Copy source → target, verify with a block-by-block compare, then copy the matching sidecar to the repository.
  - Build write-back rows from `fields`; update existing rows by `RowId` and insert new rows unless `writeback.skip_creates: true`.
  - Post-run reporting: copy summary, update diffs, and (unless `skip_creates`) planned create fields.
  - Archive/move originals: after a verified copy, move original files under `processed_folder` preserving the drop-relative structure; print the archive/move summary.
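The post-copy verification step amounts to a plain block-by-block comparison, which can be sketched as below (the standard library's `filecmp.cmp(src, dst, shallow=False)` is an equivalent option):

```python
def files_identical(src: str, dst: str, block_size: int = 1 << 20) -> bool:
    """Compare two files block by block; True only if contents are identical.

    Sketch of the verification step described above.
    """
    with open(src, "rb") as a, open(dst, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if block_a != block_b:
                # Differing content, or one file ended before the other.
                return False
            if not block_a:
                # Both files exhausted with no differences found.
                return True
```

Reading in fixed-size blocks keeps memory use constant even for multi-gigabyte sequencing files.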
## Output and Tables

Primary tables printed via `rich`:

- [Plan table] One row per file with key annotations. `write_action` shows `update|create|skip_create`. Rows with `skip_create` are highlighted yellow; MD5 mismatches are red.
- [Dry-run write-back] Per-row update diffs and (unless `skip_creates`) planned create fields.
- [Copy plan summary] `would_copy` or `would_skip:<reason>`; skipped rows highlighted red.
- [Archive/move plan summary] `would_move` or `would_skip:<reason>`; skipped rows highlighted red.
- [Copy summary (executed)] `copied` or `skipped:<reason>` with red highlighting for skips.
- [Archive/move summary (executed)] `moved` or `skipped:<reason>` with red highlighting for skips.
## Known Limitations and Gaps

- [Metadata sources] The default pipeline performs inline LabKey lookups only. Excel/CSV helpers exist in `dm/metadata.py` but are not currently invoked.
- [Aggregation] `file_list_aggregator` is reserved for future aggregation behavior.
- [Field resolution] Write-back maps field captions/names on a best-effort basis; verify your `fields` keys against LabKey when in doubt.
- [Move semantics] Archive/move skips when the destination exists; collision/retention policies can be extended if needed.
- [Imports] Imports are module-safe with fallbacks for script execution; running via `python -m dm.dm` should work.
## Roadmap / Next Steps

- Optional: Excel/CSV metadata matching in the default pipeline.
- Optional: Aggregation using `file_list_aggregator`.
- Robust retries for LabKey write-back and file operations.
- Unit tests and CI.
## Development

- Code style: Black and Flake8 are configured in `setup.cfg`. VS Code will auto-format on save per `.vscode/settings.json`.
- Update environment:

  ```bash
  make envupdate
  conda activate dm
  ```

- Run commands:

  ```bash
  # Dry-run sync using config in the drop folder (prints planned actions)
  python dm/dm.py sync --drop-folder /path/to/drop

  # Execute copy after confirmation
  python dm/dm.py sync --drop-folder /path/to/drop --do-it
  ```

## Troubleshooting

- Regex mismatch → program exits with an error: ensure `drop_filename_regex` matches the discovered file paths.
- LabKey errors: Review host/container/schema/table and your credentials/permissions. The tool catches and prints errors such as `ServerContextError`, `ServerNotFoundError`, `QueryNotFoundError`, and `RequestError`.
- No `.md5` sidecar: The tool computes and writes a `.blake3` sidecar before copying; this is not treated as a failure.
- Configuration source: Only `<drop_folder>/sync.yml` is read; there is no global merge.
- Running as a module: Imports are module-safe with a fallback for direct script execution.
## License

TBD