A sophisticated Retrieval-Augmented Generation (RAG) API system designed to provide accurate, scientifically-grounded answers to research questions by leveraging both local scientific document databases and web searches.
- Scientific Query Validation: Ensures questions are scientific in nature before processing
- Multi-Stage Retrieval Pipeline:
  - Scientific Database Search: Primary source, a local store of millions of documents from scientific journals
  - Web Search Fallback: Searches top scientific journals, Google Scholar, and reputable scientific domains for relevant evidence
  - Document Reranking: Uses Cohere to rerank retrieved documents by relevance
- Quality Assurance:
  - Self-Corrective RAG: Employs iterative refinement and verification to ensure factual accuracy
  - Hallucination Detection: Verifies answer accuracy against source documents
  - Answer Relevance Grading: Ensures responses directly address the question
- Citation Support: Includes proper academic citations in responses
- Adaptive Search: Falls back to web search when local database results are insufficient
- Streaming Support: Real-time status updates during query processing
- Query History: Track and retrieve past queries and answers
- User Management: Individual user query tracking and history
- Comprehensive Responses: Includes source documents, processing time, and search origin
- Query Caching: Caches frequent or similar questions to reduce response time
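The query-caching feature could be sketched as a small cache keyed by a normalized form of the question. All names below are illustrative, not the project's actual implementation:

```python
class QueryCache:
    """Minimal sketch of a query cache: answers are keyed by a normalized
    form of the question so trivially different phrasings
    ('What is DNA?' vs. '  what is dna? ') hit the same entry."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _normalize(question: str) -> str:
        # Lowercase and collapse whitespace; a real system might also use
        # embedding similarity to match paraphrased questions.
        return " ".join(question.lower().split())

    def get(self, question: str):
        return self._store.get(self._normalize(question))

    def put(self, question: str, answer: str) -> None:
        self._store[self._normalize(question)] = answer
```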
- arXiv Dataset: Over 1.7 million research paper abstracts from Cornell University's arXiv dataset
- Data Ingestion Pipeline: Custom pipeline for processing and embedding scientific papers
- Planned Additional Sources:
  - bioRxiv: Preprint server for biological sciences
  - PubMed: Database of biomedical literature
  - SSRN: Social Science Research Network
  - Nature: Leading international scientific journal
  - And more scientific sources to come...
- Python 3.12+
- PostgreSQL Database
- Supabase Account
- API Keys for:
- OpenAI
- Cohere
- Tavily
- SerpAPI/Serper (optional)
- Clone the repository:
  ```bash
  git clone https://github.com/BilalKamal/scienceproves-me.git
  cd scienceproves-me
  ```
- Create and activate a virtual environment:
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables (`.env`):
  ```bash
  # Database
  SUPABASE_URL=your_supabase_url
  SUPABASE_ANON_KEY=your_supabase_key

  # API Keys
  OPENAI_API_KEY=your_openai_key
  COHERE_API_KEY=your_cohere_key
  TAVILY_API_KEY=your_tavily_key
  SERP_API_KEY=your_serp_key
  SERPER_API_KEY=your_serper_key
  ```
- Initialize the database tables:
  ```bash
  psql -U your_username -d your_database -f SQL/create_queries_table.sql
  ```
Run the development server:
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
```http
POST /ask
Content-Type: application/json

{
  "question": "What is the role of mitochondria in cell energy production?",
  "stream": true,
  "user_id": "user123"
}
```

```http
GET /history/{user_id}
```
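As a rough illustration, a client could target these endpoints as follows (assumes a server running locally on port 8000; the helpers only assemble the request, and all names are illustrative):

```python
import json

BASE_URL = "http://localhost:8000"  # assumes a local dev server

def build_ask_request(question: str, user_id: str, stream: bool = True):
    """Assemble the URL and JSON body for POST /ask.

    Sending it is left to any HTTP client, e.g.:
        requests.post(url, data=body,
                      headers={"Content-Type": "application/json"})
    """
    url = f"{BASE_URL}/ask"
    body = json.dumps({"question": question, "stream": stream,
                       "user_id": user_id})
    return url, body

def history_url(user_id: str) -> str:
    """URL for GET /history/{user_id}."""
    return f"{BASE_URL}/history/{user_id}"
```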
The RAG system follows a multi-stage pipeline to ensure accurate scientific answers:
1. Query Validation
   - Validates that the question is scientific in nature
   - Uses an LLM to assess query validity
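A minimal sketch of what such an LLM gate might look like (the prompt and function names are illustrative; `llm` stands in for any text-completion callable):

```python
VALIDATION_PROMPT = (
    "You are a strict classifier. Answer 'yes' if the question below is "
    "scientific in nature, otherwise answer 'no'.\n"
    "Question: {question}"
)

def is_scientific(question: str, llm) -> bool:
    """Gate the pipeline: only scientific questions proceed.

    `llm` is any callable mapping a prompt string to the model's reply.
    """
    reply = llm(VALIDATION_PROMPT.format(question=question))
    return reply.strip().lower().startswith("yes")
```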
2. Document Retrieval

   ```python
   # Primary database search
   db_docs = self._retrieve_local_docs(question)

   # Fall back to web search if needed
   if not db_docs or self._all_docs_below_threshold(db_docs):
       return self._websearch_path(question)
   ```
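The fallback check relies on a relevance threshold; a plausible shape for that helper is sketched below (the threshold value and document format are assumptions, not taken from the source):

```python
def all_docs_below_threshold(docs, threshold: float = 0.75) -> bool:
    """True when no retrieved document reaches the relevance threshold,
    i.e. the local results are too weak and web search should take over.
    Each doc is assumed to carry a similarity `score` in [0, 1]."""
    return all(doc.get("score", 0.0) < threshold for doc in docs)
```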
3. Document Processing
   - Reranking for relevance
   - Metadata extraction
   - Citation formatting
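Citation formatting could look roughly like this (the field names are illustrative, not the project's actual metadata schema):

```python
def format_citation(meta: dict) -> str:
    """Build a simple inline citation from document metadata.
    Expected keys (illustrative): authors, year, title, source_id."""
    return "{authors} ({year}). {title}. {source_id}.".format(**meta)
```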
4. Answer Generation
   - Context-aware response generation
   - Citation inclusion
   - Quality verification
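Context-aware generation typically starts by packing the top documents into the prompt window; a minimal sketch (the character budget and numbering scheme are assumptions):

```python
def build_context(docs, max_chars: int = 4000) -> str:
    """Concatenate document texts, numbered for citation, until the
    character budget is exhausted. Docs are assumed pre-sorted by rank."""
    parts, used = [], 0
    for i, doc in enumerate(docs, start=1):
        chunk = f"[{i}] {doc['content']}"
        if used + len(chunk) > max_chars:
            break
        parts.append(chunk)
        used += len(chunk)
    return "\n\n".join(parts)
```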
5. Quality Control

   ```python
   # Verify answer quality
   if not self._grade_hallucination(answer, docs):
       return self._websearch_path(question)
   ```
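The actual hallucination grader checks the answer against the retrieved documents with an LLM; as a toy stand-in, a lexical-overlap check conveys the idea (the 0.5 support ratio and document format are arbitrary assumptions):

```python
def grade_hallucination(answer: str, docs) -> bool:
    """Toy grounding check: True when enough of the answer's content
    words appear somewhere in the source documents. The production
    system delegates this judgment to an LLM instead."""
    doc_text = " ".join(d["content"] for d in docs).lower()
    words = [w.strip(".,") for w in answer.lower().split() if len(w) > 4]
    if not words:
        return True  # nothing substantive to verify
    supported = sum(1 for w in words if w in doc_text)
    return supported / len(words) >= 0.5
```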
```
.
├── app/
│   ├── __init__.py
│   ├── api/
│   │   ├── __init__.py
│   │   └── routes/
│   │       ├── __init__.py
│   │       └── question.py
│   ├── core/
│   │   ├── __init__.py
│   │   └── config.py
│   ├── db/
│   │   ├── __init__.py
│   │   ├── manager.py
│   │   └── models.py
│   ├── main.py
│   └── services/
│       ├── __init__.py
│       └── request_manager.py
├── rag/
│   ├── __init__.py
│   ├── rag.py
│   ├── rag_embeddings.py
│   ├── rag_llm.py
│   ├── rag_prompts.py
│   ├── rag_reranker.py
│   ├── rag_retriever.py
│   └── rag_search_manager.py
├── SQL/
│   ├── create_documents_table.sql
│   ├── create_queries_table.sql
│   └── match_documents.sql
├── Data-ingestion/
│   └── data-ingestion.ipynb   # Notebook for processing the arXiv dataset
├── main.py
├── requirements.txt
└── vercel.json
```
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Thanks to Cornell University and arXiv for providing the extensive research papers dataset
- Thanks to @JasonGoodison for the idea of a RAG app that answers questions using Scientific papers
- Thanks to @mckaywrigley for being an incredible mentor and inspiration in RAG development
- Thanks to @RLanceMartin and @LangChainAI for many of the ideas implemented
- Thanks to Pixegami for inspiration
- Thanks to all the amazing developers and creators I've learned from along the way
For support, email [email protected] or open an issue in the repository.