feat: Add SpaCy Processor for Enhanced NLP Support in Quivr #3468

Sahil-2101 · 2024-11-12T03:32:13Z

Description

This pull request introduces the SpaCyProcessor class to handle various text file types (PDF, DOCX, TXT, and CSV) and perform NLP processing using spaCy. This addition includes:

Key Features:

File Extraction: Asynchronously extracts text from PDFs using fitz (PyMuPDF) and from DOCX files via python-docx, and reads TXT and CSV files.
NLP Processing: Integrates spaCy's NLP pipeline for entity recognition and sentence tokenization, adding metadata on entities and sentences to each document chunk.
Document Chunking: Uses RecursiveCharacterTextSplitter to divide documents into chunks of a configured maximum size with a specified overlap.
Error Handling and Logging: Logs extraction errors and failed validation checks to improve traceability.
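
For readers who want a concrete picture of the file-extraction feature, here is a minimal sketch of dispatching on file extension and running the blocking readers off the event loop. It is not the code from this PR: the names `extract_text`, `_READERS`, and the `_read_*` helpers are hypothetical, and flattening CSV rows into comma-joined lines is an assumption. The PDF and DOCX readers import fitz and python-docx lazily, so the TXT and CSV paths run with the standard library alone.

```python
import asyncio
import csv
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def _read_txt(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def _read_csv(path: Path) -> str:
    # Assumption: each CSV row is flattened into one comma-joined line.
    with path.open(newline="", encoding="utf-8", errors="replace") as handle:
        return "\n".join(", ".join(row) for row in csv.reader(handle))


def _read_pdf(path: Path) -> str:
    import fitz  # PyMuPDF, imported lazily so TXT/CSV work without it

    doc = fitz.open(str(path))
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def _read_docx(path: Path) -> str:
    from docx import Document  # python-docx, imported lazily

    return "\n".join(p.text for p in Document(str(path)).paragraphs)


# Hypothetical extension-to-reader table.
_READERS = {
    ".txt": _read_txt,
    ".csv": _read_csv,
    ".pdf": _read_pdf,
    ".docx": _read_docx,
}


async def extract_text(path) -> str:
    """Pick a reader by extension and run it in a worker thread."""
    path = Path(path)
    reader = _READERS.get(path.suffix.lower())
    if reader is None:
        raise ValueError(f"unsupported file type: {path.suffix!r}")
    try:
        # The readers are blocking, so keep them off the event loop.
        return await asyncio.to_thread(reader, path)
    except Exception:
        logger.exception("text extraction failed for %s", path)
        raise
```

A caller would use it as `text = await extract_text("report.pdf")`. Re-raising after logging keeps the failure visible to the caller while still leaving a trace in the logs.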

Motivation:
This feature adds support for spaCy NLP processing to enable richer text analysis and processing across various file types. The processor now efficiently handles different file formats, extracts meaningful text, and applies NLP, making it easier to work with structured document data in downstream applications.

Checklist before requesting a review

Please delete options that are not relevant.

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented hard-to-understand areas
New and existing unit tests pass locally with my changes
Any dependent changes have been merged

Screenshots (if appropriate):

- Introduced SpaCyProcessor to handle various file formats (PDF, DOCX, TXT, CSV)
- Supports recursive text splitting for chunked processing
- Applies spaCy NLP pipeline for tokenization and entity recognition on file content

…sing
- Added support for PDF, DOCX, TXT, and CSV file extraction using fitz (PyMuPDF) and python-docx.
- Integrated spaCy NLP pipeline for entity recognition and sentence tokenization.
- Configured RecursiveCharacterTextSplitter for document chunking with metadata (chunk size, entities, sentences).
- Added error handling and logging for robust file processing and validation.
- Improved support for asynchronous text extraction, enabling efficient file reading.
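
To illustrate the chunk metadata described in this commit message (chunk size, entities, sentences), here is a rough sketch under stated assumptions. The splitter below is a plain fixed-window splitter with overlap, used as a dependency-free stand-in; it is not RecursiveCharacterTextSplitter, which also tries to break on separators. `Chunk`, `split_with_overlap`, `spacy_annotator`, and `build_chunks` are hypothetical names, and the `en_core_web_sm` model is an assumption, since the PR text does not say which spaCy model is loaded.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-window splitter; a simplified stand-in, not the recursive one."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


def spacy_annotator(model: str = "en_core_web_sm") -> Callable[[str], dict]:
    """Build an annotator backed by spaCy (model name is an assumption)."""
    import spacy  # imported lazily; requires spaCy and the model to be installed

    nlp = spacy.load(model)

    def annotate(text: str) -> dict:
        doc = nlp(text)
        return {
            "entities": [(ent.text, ent.label_) for ent in doc.ents],
            "sentences": [sent.text for sent in doc.sents],
        }

    return annotate


def build_chunks(
    text: str,
    annotate: Callable[[str], dict],
    chunk_size: int = 1000,
    chunk_overlap: int = 100,
) -> list[Chunk]:
    """Split text and attach size, entity, and sentence metadata per chunk."""
    chunks = []
    for index, piece in enumerate(split_with_overlap(text, chunk_size, chunk_overlap)):
        metadata = {"chunk_index": index, "chunk_size": len(piece)}
        metadata.update(annotate(piece))
        chunks.append(Chunk(piece, metadata))
    return chunks
```

With spaCy installed, `build_chunks(text, spacy_annotator())` would yield chunks whose metadata lists entities and sentences. Passing the annotator as a parameter is a design choice of this sketch: it lets the chunking logic be exercised with a stub when no model is available.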

StanGirard · 2024-11-12T17:11:12Z

Very nice PR! We'll take a closer look :) @AmineDiro

Sahil-2101 added 2 commits November 12, 2024 14:19

add SpaCy processor for NLP text analysis

32a82bd


dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Nov 12, 2024

Sahil-2101 changed the title ~~Add SpaCy Processor for Enhanced NLP Support in Quivr~~ feat: Add SpaCy Processor for Enhanced NLP Support in Quivr Nov 12, 2024
