CLINICAL TRIAL DATA EXTRACTOR WITH LLM PARSING

Project Summary 📝

The Clinical Trial Data Extractor with LLM Parsing project scrapes clinical trial data from a specified website (which will remain unnamed), processes it using a Large Language Model (LLM) via the OpenRouter API, and exports the results to a CSV file. This tool is designed for researchers, providing a streamlined and customizable solution for extracting and analyzing clinical trial data.

Report outline 🧾

Project Summary

Features 🚀

Customizable Scraping: Extract clinical trial data based on user-defined keywords entered via the terminal.
LLM-Powered Analysis: Process scraped data using advanced LLM models through OpenRouter API.
CSV Output: Generate CSV for trial data processed from the LLM response.
Data Control: Specify the number of pages to scrape, giving control over the data volume.
Page Count Detection: Automatically retrieves the total number of pages for any search query.
Automated Directory Setup: Automatically creates required directories for storing scraped and processed data.
Modular Design: Clean architecture with separate modules for scraping, processing, and saving data.
Real-Time Feedback: Displays live progress updates during scraping and data processing phases.
Error Handling: Robust error management for network issues and unexpected data formats.

Requirements 💻

Python 3.12.5+
All required packages are listed in requirements.txt.

Installation ⚙️

Clone the repository:

git clone https://github.com/mrjxtr/Clinical_Trial_Data_Extractor.git
cd Clinical_Trial_Data_Extractor

Install the required dependencies:
```
pip install -r requirements.txt
```
Configure your OpenRouter API key by adding it to the .env file or directly in src/main.py.

Usage 🖥

Run the script using:

python src/main.py

You will be prompted to provide a search keyword and specify the number of pages to scrape.

Project Structure 📂

src/main.py: Main orchestrator for scraping, processing, and saving data.
src/scraper.py: Contains the Scraper class for fetching clinical trial data.
src/llm_processor.py: Implements the LLMProcessor class for analyzing data with the LLM.
src/data_saver.py: Saves processed data in CSV format.
src/prompts.py: Houses customizable LLM prompt templates.

Notes 📌

Randomized Delays: To avoid server overload, requests include randomized delays.
Compliance: Always adhere to the website's terms of service when scraping data.
OpenRouter API Usage: Ensure you have sufficient API credits and follow OpenRouter's usage policies.
Ethical Considerations: Use this tool responsibly and only for research purposes. It is not intended for medical diagnosis or treatment.
Maintenance: Updates may be needed to adapt to changes in the website, LLM models, or API specifications.
Debugging: If issues occur with LLM parsing or CSV saving, additional debugging may be required.
Environment: Ensure a stable internet connection for running the script on a single machine.

Important: The current parser is optimized for "Breast Cancer" search results. You may need to modify the parser to suit other use cases. All intermediate data is stored in the output/ directory. The parsing code is located in src/llm_processor.py with the parse_llm_response function.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
output		output
src		src
.example.env		.example.env
.gitignore		.gitignore
LICENCE.txt		LICENCE.txt
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CLINICAL TRIAL DATA EXTRACTOR WITH LLM PARSING

Project Summary 📝

Report outline 🧾

Features 🚀

Requirements 💻

Installation ⚙️

Usage 🖥

Project Structure 📂

Notes 📌

About

Releases

Packages

Languages

License

mrjxtr/Data_Extractor_LLM_Parser_Project

Folders and files

Latest commit

History

Repository files navigation

CLINICAL TRIAL DATA EXTRACTOR WITH LLM PARSING

Project Summary 📝

Report outline 🧾

Features 🚀

Requirements 💻

Installation ⚙️

Usage 🖥

Project Structure 📂

Notes 📌

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages