The Clinical Trial Data Extractor with LLM Parsing scrapes clinical trial data from a specified website (which will remain unnamed), processes it with a Large Language Model (LLM) via the OpenRouter API, and exports the results to a CSV file. The tool gives researchers a streamlined, customizable way to extract and analyze clinical trial data.
- Customizable Scraping: Extract clinical trial data based on user-defined keywords entered via the terminal.
- LLM-Powered Analysis: Process scraped data using advanced LLM models through OpenRouter API.
- CSV Output: Export the trial data parsed from the LLM response to a CSV file.
- Data Control: Specify the number of pages to scrape, giving control over the data volume.
- Page Count Detection: Automatically retrieves the total number of pages for any search query.
- Automated Directory Setup: Automatically creates required directories for storing scraped and processed data.
- Modular Design: Clean architecture with separate modules for scraping, processing, and saving data.
- Real-Time Feedback: Displays live progress updates during scraping and data processing phases.
- Error Handling: Robust error management for network issues and unexpected data formats.
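The LLM-powered step talks to OpenRouter's chat-completions endpoint. A minimal sketch of such a call is shown below; the endpoint URL and request shape follow OpenRouter's public API, but the model ID, prompt, and function names here are illustrative assumptions, not the project's actual code:

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def build_payload(raw_text, model="openai/gpt-4o-mini"):
    """Assemble a chat-completions request body for one scraped trial record."""
    return {
        "model": model,  # any model ID available on OpenRouter
        "messages": [
            {"role": "system",
             "content": "Extract the trial phase, condition, and status from the text."},
            {"role": "user", "content": raw_text},
        ],
    }

def summarize_trial(raw_text):
    """POST the payload to OpenRouter and return the model's reply text."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(build_payload(raw_text)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Separating `build_payload` from the network call keeps the prompt assembly testable without API credits.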
- Python 3.12.5+
- All required packages are listed in `requirements.txt`.
- Clone the repository:

  ```shell
  git clone https://github.com/mrjxtr/Clinical_Trial_Data_Extractor.git
  cd Clinical_Trial_Data_Extractor
  ```

- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```
- Configure your OpenRouter API key by adding it to the `.env` file or directly in `src/main.py`.
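For the `.env` route, a minimal file would hold just the key; the variable name `OPENROUTER_API_KEY` is an assumption here, so check how `src/main.py` reads the environment before relying on it:

```
OPENROUTER_API_KEY=your-openrouter-key-here
```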
Run the script using:

```shell
python src/main.py
```
You will be prompted to provide a search keyword and specify the number of pages to scrape.
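The interactive prompts might be gathered roughly as follows; the function name, prompt wording, and validation logic are illustrative assumptions rather than the repository's actual code:

```python
def get_search_settings(input_fn=input):
    """Prompt for a search keyword and a positive page count; re-ask on bad input."""
    keyword = input_fn("Enter search keyword: ").strip()
    while True:
        raw = input_fn("Number of pages to scrape: ").strip()
        if raw.isdigit() and int(raw) > 0:
            return keyword, int(raw)
        print("Please enter a positive whole number.")
```

Taking `input_fn` as a parameter makes the prompt loop easy to unit-test with canned answers.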
- `src/main.py`: Main orchestrator for scraping, processing, and saving data.
- `src/scraper.py`: Contains the `Scraper` class for fetching clinical trial data.
- `src/llm_processor.py`: Implements the `LLMProcessor` class for analyzing data with the LLM.
- `src/data_saver.py`: Saves processed data in CSV format.
- `src/prompts.py`: Houses customizable LLM prompt templates.
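The orchestrator presumably wires these modules together as a scrape → process → save pipeline. A hedged sketch of that flow (the method names `fetch`, `process`, and `save` are assumptions, not the repository's actual API):

```python
def run_pipeline(scraper, processor, saver, keyword, pages):
    """Scrape each results page, run every record through the LLM processor,
    then hand the accumulated rows to the saver. Returns the row count."""
    rows = []
    for page in range(1, pages + 1):
        for record in scraper.fetch(keyword, page):
            rows.append(processor.process(record))
    saver.save(rows)
    return len(rows)
```

Passing the three collaborators in as objects mirrors the modular design noted above and lets each stage be swapped or stubbed independently.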
- Randomized Delays: To avoid server overload, requests include randomized delays.
- Compliance: Always adhere to the website's terms of service when scraping data.
- OpenRouter API Usage: Ensure you have sufficient API credits and follow OpenRouter's usage policies.
- Ethical Considerations: Use this tool responsibly and only for research purposes. It is not intended for medical diagnosis or treatment.
- Maintenance: Updates may be needed to adapt to changes in the website, LLM models, or API specifications.
- Debugging: If issues occur with LLM parsing or CSV saving, additional debugging may be required.
- Environment: The script runs on a single machine; ensure a stable internet connection while it runs.
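The randomized-delay behavior noted above can be sketched with the standard library alone; the function name and delay bounds here are illustrative, not the scraper's actual values:

```python
import random
import time

def polite_fetch_all(fetch, urls, min_delay=1.0, max_delay=3.0):
    """Call fetch(url) for each URL, sleeping a random interval between
    requests so the target server is not hit in rapid bursts."""
    results = []
    for i, url in enumerate(urls):
        if i:  # no delay before the very first request
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

`random.uniform` spreads the gaps between requests, which looks less like automated hammering than a fixed sleep.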
Important: The current parser is optimized for "Breast Cancer" search results; you may need to modify it for other use cases. All intermediate data is stored in the `output/` directory. The parsing logic lives in the `parse_llm_response` function in `src/llm_processor.py`.
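As a rough illustration of what such a parser can look like: if the LLM were instructed to reply with one pipe-separated line per trial, the function might resemble the sketch below. The delimiter, column names, and row-skipping rule are all assumptions; consult `src/llm_processor.py` for the real response format:

```python
# Assumed output schema for illustration only; the project's real
# columns are defined by its prompts in src/prompts.py.
EXPECTED_COLUMNS = ["title", "phase", "status"]

def parse_llm_response(text):
    """Parse an LLM reply of pipe-separated lines into row dicts,
    silently skipping lines that do not have the expected field count."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == len(EXPECTED_COLUMNS):
            rows.append(dict(zip(EXPECTED_COLUMNS, parts)))
    return rows
```

Skipping malformed lines instead of raising keeps one garbled LLM line from aborting a whole batch, at the cost of silently dropping data, so logging the skips would be a sensible extension.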