Scripts to perform curation and harmonization of channel names / antibody targets for multiplexed tissue imaging data from the Human Tumor Atlas Network
- Found 12733 combined attributes.
- Found 1129 unique combined attributes.
- 513 antigens after LLM harmonization.
Top 10 antigens by number of HTAN data files using them:
| LLM Harmonized Antigen | entityId Count |
| --- | --- |
| Nuclear | 2257 |
| CD68 | 2004 |
| CD45 | 1948 |
| Ki-67 | 1909 |
| CD20 | 1781 |
| CD8 | 1740 |
| CD3 | 1690 |
| VIM | 1597 |
| CD4 | 1587 |
| PD-1 | 1473 |
Ensure the following Python packages are installed:
- `boto3`
- `google-cloud-bigquery`
- `pandas`
- `tqdm`

(`concurrent.futures` is also used, but it is part of the Python standard library and needs no installation.)
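For example, a single pip command covers the external dependencies:

```bash
pip install boto3 google-cloud-bigquery pandas tqdm
```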
You also need to set up authentication for AWS and Google Cloud Platform (GCP):
- AWS: Configure your AWS credentials using the profile name `htan-dev`.
- GCP: Ensure your GCP credentials are set up to use BigQuery.
- AWS Bedrock Runtime Client: Used to interact with the language model for text processing.
- BigQuery Client: Used to execute SQL queries and retrieve data.
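A minimal setup sketch, assuming the `htan-dev` profile for AWS and Application Default Credentials for GCP (the Bedrock region is a placeholder):

```python
import boto3
from google.cloud import bigquery

# AWS Bedrock Runtime client, authenticated via the htan-dev profile
session = boto3.Session(profile_name="htan-dev")
bedrock = session.client("bedrock-runtime", region_name="us-east-1")  # region is an assumption

# BigQuery client; picks up Application Default Credentials
bq_client = bigquery.Client()
```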
- `curate_antigen_manual(antigen)`: Uses regular expressions to clean antigen names by removing unwanted characters and standardizing names.
- `parse_json_garbage(s)`: Parses JSON strings from potentially malformed input, attempting to recover valid JSON data.
- `initial_prompt(antigen)`: Generates a prompt for the language model to extract and harmonize gene names from a given input string.
- `prompt_llm(user_message, model_id, client)`: Sends the prompt to the LLM and retrieves the response.
- Read SQL Query: Load the SQL query from a file.
- Execute Query: Retrieve data from BigQuery and convert it to a pandas DataFrame.
- Extract Unique Antigens: Identify unique antigen names from the DataFrame.
- Manual Cleaning: Apply `curate_antigen_manual` to clean the antigen names.
- Prepare User Prompts: Generate prompts for each cleaned antigen name.
- Concurrent Processing: Use a thread pool to send prompts to the LLM and parse the responses (see the sketch after this list).
- Response Handling: Store and process the responses to create a mapping of original to harmonized antigen names.
- Combine Data: Merge original, cleaned, and harmonized antigen names into a final output table.
- Count Table: Create a count table to summarize the number of unique source IDs per harmonized antigen.
- Save Data: Output the results to CSV and JSON files.
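A condensed sketch of this workflow, reusing the clients from the setup section and the helper functions described above (the file name `query.sql` comes from this README; the `Antigen` column name and the model ID are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # placeholder model ID

# Load the SQL query and pull the data into a pandas DataFrame
with open("query.sql") as f:
    query = f.read()
df = bq_client.query(query).to_dataframe()

# Unique antigen names, manually cleaned ("Antigen" column name is an assumption)
antigens = df["Antigen"].dropna().unique()
cleaned = {a: curate_antigen_manual(a) for a in antigens}

# Send prompts to the LLM concurrently and recover JSON from each reply
responses = {}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {
        pool.submit(prompt_llm, initial_prompt(c), MODEL_ID, bedrock): a
        for a, c in cleaned.items()
    }
    for future in as_completed(futures):
        responses[futures[future]] = parse_json_garbage(future.result())
```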
`curate_antigen_manual(antigen)`: This function uses a series of regular expressions to clean and standardize antigen names. It removes common unwanted patterns such as numbers in brackets, prefixes like "Target:" or "Antigen", and suffixes like "-AF488". It also standardizes common variations of certain antigen names, ensuring uniformity.
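A minimal sketch of this kind of regex cleaning (the specific patterns here are illustrative, not the exact rules used):

```python
import re

def curate_antigen_manual(antigen: str) -> str:
    """Best-effort regex cleanup of a raw antigen name (illustrative patterns)."""
    name = antigen.strip()
    name = re.sub(r"\[\d+\]", "", name)                         # numbers in brackets, e.g. "[123]"
    name = re.sub(r"(?i)^(target|antigen)\s*:?\s*", "", name)   # "Target:" / "Antigen" prefixes
    name = re.sub(r"-AF\d+$", "", name)                         # fluorophore suffixes like "-AF488"
    name = re.sub(r"(?i)^ki-?67$", "Ki-67", name)               # standardize a common variant
    return name.strip()
```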
`parse_json_garbage(s)`: This function attempts to parse JSON data from a given string. If the string contains errors, it tries to parse up to the position of the error, making a best-effort attempt to recover valid JSON.
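A best-effort recovery sketch, assuming the idea is to truncate at the decoder's reported error position and retry:

```python
import json

def parse_json_garbage(s: str):
    """Parse JSON from a possibly malformed string, best effort."""
    start = s.find("{")
    if start != -1:
        s = s[start:]  # drop any preamble before the first brace (assumption)
    try:
        return json.loads(s)
    except json.JSONDecodeError as e:
        # Retry on the prefix up to where decoding failed (handles trailing garbage)
        return json.loads(s[: e.pos])
```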
`initial_prompt(antigen)`: This function generates a detailed prompt for the LLM to process an antigen name. The prompt instructs the LLM to extract the gene name, separate additional identifiers, and format the information into a JSON dictionary.
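A sketch of what such a prompt might look like (the exact wording and JSON keys are assumptions):

```python
def initial_prompt(antigen: str) -> str:
    """Build an instruction prompt asking the LLM for a harmonized gene name."""
    return (
        "You are harmonizing antibody target names from imaging metadata.\n"
        f"Input: '{antigen}'\n"
        "Extract the gene name, separate any additional identifiers "
        "(clones, fluorophores, metal tags), and respond with only a JSON "
        'dictionary of the form {"gene_name": ..., "additional_info": ...}.'
    )
```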
`prompt_llm(user_message, model_id, client)`: This function sends the user message to the LLM and retrieves the response. The response is expected to be in JSON format, containing the harmonized gene name and additional information.
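A sketch using the Bedrock Runtime `invoke_model` API with an Anthropic-style messages payload (the body format depends on the model family, so treat this as an assumption):

```python
import json

def prompt_llm(user_message: str, model_id: str, client) -> str:
    """Send a single user message to a Bedrock model and return its text reply."""
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",  # Anthropic-on-Bedrock payload format
        "max_tokens": 512,
        "messages": [{"role": "user", "content": user_message}],
    })
    response = client.invoke_model(modelId=model_id, body=body)
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]
```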
- Query Execution: Executes a predefined SQL query to retrieve antigen data from BigQuery.
- Data Sampling: Extracts unique antigen names from the dataset.
- Manual Cleaning: Applies the manual curation function to clean the antigen names.
- LLM Processing: Sends each cleaned antigen name to the LLM for further harmonization and extracts the harmonized name from the response.
The script generates and saves the following outputs:
- `manually_cleaned_antigens.csv`: Contains manually cleaned antigen names.
- `output_responses.json`: Contains the LLM responses for each antigen.
- `output_antigens.csv`: Combines original, cleaned, and harmonized antigen names.
- `output_count_table.csv`: Summarizes the unique source IDs per harmonized antigen.
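A sketch of the final assembly and save step, reusing `df`, `cleaned`, and `responses` from the pipeline sketch above (the `gene_name` key and the `Antigen`/`entityId` column names are assumptions):

```python
import json
import pandas as pd

# Manually cleaned names (original -> cleaned)
pd.Series(cleaned, name="cleaned").rename_axis("original") \
    .to_csv("manually_cleaned_antigens.csv")

# Raw LLM responses, keyed by original antigen name
with open("output_responses.json", "w") as f:
    json.dump(responses, f, indent=2)

# Combined table of original, cleaned, and harmonized names
out = pd.DataFrame({
    "original": list(responses),
    "cleaned": [cleaned[a] for a in responses],
    "harmonized": [r.get("gene_name") if r else None for r in responses.values()],
})
out.to_csv("output_antigens.csv", index=False)

# Count of unique source IDs (entityId) per harmonized antigen
counts = (
    df.merge(out, left_on="Antigen", right_on="original")
      .groupby("harmonized")["entityId"].nunique()
      .sort_values(ascending=False)
)
counts.to_csv("output_count_table.csv")
```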
- Ensure your AWS and GCP credentials are set up.
- Install the required Python packages.
- Place your SQL query in a file named `query.sql`.
- Run the script.
The script will process the data, interact with the LLM, and generate the output files.
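For example (the script name here is hypothetical; substitute the actual entry point):

```bash
python harmonize_antigens.py
```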