feat(iatlas): update iatlas-data with the latest improvements made to GitLab (ARCH-315, ARCH-317) #2900

Merged

tschaffter
Member

@tschaffter tschaffter commented Oct 30, 2024

Closes https://sagebionetworks.jira.com/browse/ARCH-315
Closes https://sagebionetworks.jira.com/browse/ARCH-317

Changelog

  • Recreate the project iatlas-data using the sandbox-py-app project template
  • Update to Python 3.11 and update Schematic DB to 0.1.5
  • Update Dockerfile to install Python dependencies with Poetry

Preview

Prepare the project

nx prepare iatlas-data

Build the Docker image

nx build-image iatlas-data

Start the PostgreSQL DB

nx build-image iatlas-postgres
nx serve-detach iatlas-postgres

Build the DB (does not work)

Directly with Python:

nx serve iatlas-data

With the containerized script:

nx serve-detach iatlas-data

The Python script that builds the local DB could not be tested successfully. Building the entire DB takes more than 2 hours, and I stopped the process after that time.

Then I tried to build only the Patients table as suggested by Andrew:

The second is to just build a minimal database, using the table_names parameter:
https://github.com/Sage-Bionetworks/schematic_db/blob/main/schematic_db/rdb_updater/rdb_updater.py#L117

If you just need to get the database built, and don't need all the data, you can just do table_names="Patients" and only the Patients table will be populated with data (the other tables will still exist in the schema)

I updated the Python script to specify the table name:

updater = RDBUpdater(db, ms)
updater.update_database(method="insert", table_names=["Patients"], chunk_size=10000)

The following error is thrown:

$ nx serve iatlas-data

> nx run iatlas-data:serve

> poetry run python src/build_database.py

[31/Oct/2024 16:51:23] INFO [root._drop_all_tables:33] Dropping all tables
[31/Oct/2024 16:51:24] INFO [root._drop_all_tables:35] Dropped all tables
[31/Oct/2024 16:51:24] INFO [root._get_database_schema:43] Getting database schema
[31/Oct/2024 16:51:37] INFO [root._get_database_schema:45] Got database schema
[31/Oct/2024 16:51:37] INFO [root._build_database_from_schema:54] Building database
[31/Oct/2024 16:51:37] INFO [root._build_database_from_schema:56] Adding table to database schema: patients
[31/Oct/2024 16:51:37] INFO [root._build_database_from_schema:56] Adding table to database schema: mutation_types
[31/Oct/2024 16:51:37] INFO [root._build_database_from_schema:56] Adding table to database schema: genes
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: features
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: datasets
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: tags
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: publications
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: samples
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: mutations
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: gene_sets
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: nodes
[31/Oct/2024 16:51:38] INFO [root._build_database_from_schema:56] Adding table to database schema: snps
[31/Oct/2024 16:51:39] INFO [root._build_database_from_schema:56] Adding table to database schema: cohorts
[31/Oct/2024 16:51:39] INFO [root._build_database_from_schema:56] Adding table to database schema: cells
[31/Oct/2024 16:51:39] INFO [root._build_database_from_schema:56] Adding table to database schema: tags_to_tags
[31/Oct/2024 16:51:39] INFO [root._build_database_from_schema:56] Adding table to database schema: tags_to_publications
[31/Oct/2024 16:51:39] INFO [root._build_database_from_schema:56] Adding table to database schema: slides
[31/Oct/2024 16:51:39] INFO [root._build_database_from_schema:56] Adding table to database schema: single_cell_pseudobulk_features
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: single_cell_pseudobulk
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: samples_to_tags
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: samples_to_mutations
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: rare_variant_pathway_associations
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: publications_to_genes_to_gene_sets
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: neoantigens
[31/Oct/2024 16:51:40] INFO [root._build_database_from_schema:56] Adding table to database schema: heritability_results
[31/Oct/2024 16:51:41] INFO [root._build_database_from_schema:56] Adding table to database schema: genes_to_samples
[31/Oct/2024 16:51:41] INFO [root._build_database_from_schema:56] Adding table to database schema: genes_to_gene_sets
[31/Oct/2024 16:51:41] INFO [root._build_database_from_schema:56] Adding table to database schema: features_to_samples
[31/Oct/2024 16:51:41] INFO [root._build_database_from_schema:56] Adding table to database schema: edges
[31/Oct/2024 16:51:41] INFO [root._build_database_from_schema:56] Adding table to database schema: germline_gwas_results
[31/Oct/2024 16:51:42] INFO [root._build_database_from_schema:56] Adding table to database schema: driver_results
[31/Oct/2024 16:51:42] INFO [root._build_database_from_schema:56] Adding table to database schema: datasets_to_tags
[31/Oct/2024 16:51:42] INFO [root._build_database_from_schema:56] Adding table to database schema: datasets_to_samples
[31/Oct/2024 16:51:42] INFO [root._build_database_from_schema:56] Adding table to database schema: copy_number_results
[31/Oct/2024 16:51:42] INFO [root._build_database_from_schema:56] Adding table to database schema: colocalizations
[31/Oct/2024 16:51:43] INFO [root._build_database_from_schema:56] Adding table to database schema: cohorts_to_tags
[31/Oct/2024 16:51:43] INFO [root._build_database_from_schema:56] Adding table to database schema: cohorts_to_samples
[31/Oct/2024 16:51:43] INFO [root._build_database_from_schema:56] Adding table to database schema: cohorts_to_mutations
[31/Oct/2024 16:51:43] INFO [root._build_database_from_schema:56] Adding table to database schema: cohorts_to_genes
[31/Oct/2024 16:51:43] INFO [root._build_database_from_schema:56] Adding table to database schema: cohorts_to_features
[31/Oct/2024 16:51:44] INFO [root._build_database_from_schema:56] Adding table to database schema: cells_to_samples
[31/Oct/2024 16:51:44] INFO [root._build_database_from_schema:56] Adding table to database schema: cells_to_genes
[31/Oct/2024 16:51:44] INFO [root._build_database_from_schema:56] Adding table to database schema: cells_to_features
[31/Oct/2024 16:51:45] INFO [root._build_database_from_schema:56] Adding table to database schema: cell_stats
[31/Oct/2024 16:51:45] INFO [root._build_database_from_schema:58] Database built
[31/Oct/2024 16:51:45] INFO [root.update_database:129] Updating database
10000
[31/Oct/2024 16:51:45] INFO [root.update_database:138] Database updated
[31/Oct/2024 16:51:45] INFO [root._update_table_with_manifest:278] Updating table with manifest; table name: cohorts_to_samples; manifest id: None
Traceback (most recent call last):
  File "/workspaces/sage-monorepo/apps/iatlas/data/src/build_database.py", line 1913, in <module>
    updater._update_table_with_manifest(
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/schematic_db/rdb_updater/rdb_updater.py", line 281, in _update_table_with_manifest
    split_tables = split_table_into_chunks(table, chunk_size)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/schematic_db/utils/dataframe_utils.py", line 34, in split_table_into_chunks
    return np.array_split(table, n_chunks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/numpy/lib/_shape_base_impl.py", line 794, in array_split
    raise ValueError('number sections must be larger than 0.') from None
ValueError: number sections must be larger than 0.
Warning: command "poetry run python src/build_database.py" exited with non-zero status code
————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 NX   Running target serve for project iatlas-data failed

Failed tasks:

- iatlas-data:serve

Hint: run the command with --verbose for more details.
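
A possible reading of the traceback above: np.array_split raises this ValueError when it is asked for zero sections, so n_chunks was presumably 0 for cohorts_to_samples (the preceding log line reports manifest id: None, which hints at an empty table). This has not been verified against the schematic_db source. The sketch below is a hypothetical, pure-Python stand-in for a chunking helper that returns no chunks for an empty table instead of raising. The function name and the fixed-size slicing strategy are assumptions for illustration, not the library's implementation.

```python
import math
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_rows_into_chunks(rows: Sequence[T], chunk_size: int) -> list[Sequence[T]]:
    """Split rows into consecutive chunks of at most chunk_size items.

    Hypothetical helper for illustration; this is not the schematic_db code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    # An empty table produces zero chunks. Returning early avoids asking a
    # splitter for 0 sections, which is the request np.array_split rejects.
    if len(rows) == 0:
        return []
    # Round up so that a trailing partial chunk is kept.
    n_chunks = math.ceil(len(rows) / chunk_size)
    return [rows[i * chunk_size:(i + 1) * chunk_size] for i in range(n_chunks)]
```

With a guard like this, a table with no rows would be skipped rather than aborting the whole build. Whether skipping is the right behavior for iatlas-data is a separate question.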

@tschaffter
Member Author

Error when running the latest version of the build DB script

After updating from https://gitlab.com/cri-iatlas/iatlas-data and using Schematic DB 0.1.6:

vscode@7d06102369c4:/workspaces/sage-monorepo$ nx serve iatlas-data

> nx run iatlas-data:serve

> poetry run python src/build_database.py

Traceback (most recent call last):
  File "/workspaces/sage-monorepo/apps/iatlas/data/src/build_database.py", line 1875, in <module>
    DatabaseConfig(iatlas_config),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/schematic_db/schema/database_config.py", line 103, in __init__
    self.tables: list[TableConfig] = [TableConfig(**table) for table in tables]
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/schematic_db/schema/database_config.py", line 103, in <listcomp>
    self.tables: list[TableConfig] = [TableConfig(**table) for table in tables]
                                      ^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/schematic_db/schema/database_config.py", line 85, in __init__
    self.columns = [
                   ^
  File "/workspaces/sage-monorepo/apps/iatlas/data/.venv/lib/python3.11/site-packages/schematic_db/schema/database_config.py", line 87, in <listcomp>
    name=column["name"],
         ~~~~~~^^^^^^^^
KeyError: 'name'
Warning: command "poetry run python src/build_database.py" exited with non-zero status code
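
The KeyError above is raised while TableConfig reads column["name"], which suggests that at least one column entry in iatlas_config has no "name" key under Schematic DB 0.1.6. A small pre-check can locate the offending entries before DatabaseConfig is constructed. The sketch below assumes the config is a dict with a "tables" list whose items carry a "columns" list of dicts. Those two key names and the helper name are guesses for illustration and are not confirmed from the library.

```python
from typing import Any


def find_columns_missing_name(config: dict[str, Any]) -> list[str]:
    """Return the locations of column entries that lack a "name" key.

    Assumed layout: {"tables": [{"name": ..., "columns": [{...}, ...]}]}.
    The "tables" and "columns" keys are assumptions, not the confirmed schema.
    """
    problems: list[str] = []
    for t_index, table in enumerate(config.get("tables", [])):
        # Fall back to a positional label when the table itself is unnamed.
        table_label = table.get("name", f"<table #{t_index}>")
        for c_index, column in enumerate(table.get("columns", [])):
            if "name" not in column:
                # Report the keys that are present to make a rename easy to spot.
                problems.append(
                    f"{table_label}.columns[{c_index}]: keys={sorted(column)}"
                )
    return problems
```

Running this on the config just before the DatabaseConfig(iatlas_config) call would list every column entry that needs attention, for example one that uses a different key for its name.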

@tschaffter tschaffter marked this pull request as ready for review October 31, 2024 16:52
@tschaffter tschaffter changed the title feat(iatlas): Update iatlas-data app with the latest improvements made to GitLab (ARCH-315) feat(iatlas): update iatlas-data with the latest improvements made to GitLab (ARCH-315) Oct 31, 2024
@tschaffter tschaffter changed the title feat(iatlas): update iatlas-data with the latest improvements made to GitLab (ARCH-315) feat(iatlas): update iatlas-data with the latest improvements made to GitLab (ARCH-315, ARCH-317) Oct 31, 2024
@tschaffter tschaffter merged commit d341394 into Sage-Bionetworks:main Oct 31, 2024
10 of 14 checks passed
@tschaffter tschaffter deleted the iatlas/update-from-gitlab branch October 31, 2024 17:05