📄 Docsifer: Efficient Data Conversion to Markdown

Docsifer is a FastAPI + Gradio service that converts a range of data formats (PDF, PowerPoint, Word, Excel, images, audio, HTML, and others) to Markdown. It builds on the MarkItDown library and can optionally use LLMs (via OpenAI) for richer extraction such as OCR and speech-to-text.

✨ Key Features

Comprehensive Format Support:
- PDF: Extracts text and structure effectively.
- PowerPoint: Converts slides into Markdown-friendly content.
- Word: Processes .docx files with precision.
- Excel: Extracts tabular data as Markdown tables.
- Images: Reads EXIF metadata and applies OCR for text extraction.
- Audio: Retrieves EXIF metadata and performs speech transcription.
- HTML: Transforms web pages into Markdown.
- Text-Based Formats: Handles CSV, JSON, XML with ease.
- ZIP Files: Iterates over contents for batch processing.
LLM Integration: Leverages OpenAI's GPT-4 for enhanced extraction quality and contextual understanding.
Efficient and Fast: Optimized for speed while maintaining high accuracy.
Easy Deployment: Dockerized for hassle-free setup and scalability.
Interactive Playground: Test conversion processes interactively using a Gradio-powered interface.
Usage Analytics: Tracks token usage and access statistics via Upstash Redis.
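
To make the "tabular data as Markdown tables" item above concrete, here is a minimal sketch of what such a conversion produces. It is an illustration only, not MarkItDown's or Docsifer's actual implementation; the function name `rows_to_markdown` and the choice to treat the first row as the header are assumptions made for this example.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row = header) as a Markdown table.

    Hypothetical helper for illustration; not part of Docsifer or MarkItDown.
    """
    if not rows:
        return ""

    def cell(value):
        # Escape pipes and flatten newlines so a cell cannot break the table.
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    # Pad short rows so every line has the same number of columns.
    width = max(len(row) for row in rows)
    padded = [list(row) + [None] * (width - len(row)) for row in rows]
    header, *body = padded

    lines = [
        "| " + " | ".join(cell(v) for v in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(cell(v) for v in row) + " |" for row in body]
    return "\n".join(lines)
```

Calling `rows_to_markdown([["Name", "Qty"], ["Widget", 3]])` yields a three-line table: a header row, a `---` separator row, and one data row.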

🚀 Use Cases

Knowledge Indexing: Convert various document formats into Markdown for indexing and search.
Text Analysis: Prepare data for semantic analysis and NLP tasks.
Content Transformation: Simplify content preparation for blogs, documentation, or databases.
Metadata Extraction: Extract meaningful metadata from images and audio for categorization and tagging.

🛠️ Getting Started

1. Clone the Repository

git clone https://github.com/lh0x00/docsifer.git
cd docsifer

2. Build and Run with Docker

Make sure Docker is installed and running on your machine.

docker build -t docsifer .
docker run -p 7860:7860 docsifer

The API will now be accessible at http://localhost:7860.
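
The container can take a moment to start, so a script that depends on it may want to wait until the port answers. The following is a minimal sketch using only the Python standard library; the helper names `is_up` and `wait_until_ready` are hypothetical, and the check only confirms that something responds at the URL, not that a conversion would succeed.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if the server answers the request with any HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def wait_until_ready(url, attempts=30, delay=1.0, probe=is_up, sleep=time.sleep):
    """Poll `url` until `probe` succeeds or `attempts` runs out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, `wait_until_ready("http://localhost:7860")` returns `True` once the service responds, or `False` after roughly 30 seconds of failed attempts.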

📖 API Overview

Endpoints

/v1/convert: Convert a file to Markdown. Supports both file uploads and file path inputs. Accepts optional OpenAI parameters to enable LLM-based enhancements.
/v1/stats: Retrieve usage statistics, including access counts and token usage.
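
As a rough illustration of calling these endpoints from Python, the sketch below builds a multipart upload for /v1/convert and a plain request for /v1/stats using only the standard library. Several details are assumptions rather than documented facts: the form field names `file` and `openai`, the use of POST for conversion and GET for stats, and a JSON response body. Check the interactive docs of a running instance for the real schema before relying on any of them.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

BASE_URL = "http://localhost:7860"  # address from the Docker instructions


def build_convert_request(path, openai=None, base_url=BASE_URL):
    """Build (without sending) a multipart POST for /v1/convert.

    ASSUMPTION: the field names "file" and "openai" are illustrative only.
    """
    path = Path(path)
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"

    parts = []
    if openai is not None:
        # Optional LLM settings, sent as a JSON string in a form field.
        parts.append(
            f"--{boundary}\r\n"
            'Content-Disposition: form-data; name="openai"\r\n\r\n'
            f"{json.dumps(openai)}\r\n".encode("utf-8")
        )
    # The uploaded file itself.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n".encode("utf-8")
        + path.read_bytes()
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        f"{base_url}/v1/convert",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def build_stats_request(base_url=BASE_URL):
    """Build a request for /v1/stats (ASSUMPTION: a plain GET)."""
    return urllib.request.Request(f"{base_url}/v1/stats", method="GET")


def send(request, timeout=120):
    """Send a request and decode the body, assuming the service returns JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_convert_request("report.pdf"))` would upload a local file named `report.pdf` (a placeholder), and `send(build_stats_request())` would fetch the usage statistics. Building the request separately from sending it keeps the payload easy to inspect when the field names need adjusting.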

Interactive Docs

Visit the Swagger UI for detailed, interactive documentation (FastAPI serves it at /docs by default).
Explore additional resources with ReDoc (served at /redoc by default).

🔬 Playground

Interactive Conversion

Test file conversion directly in the browser using the Gradio interface.
Simply visit http://localhost:7860 after starting the server to access the playground.

Features

File Upload: Upload a file directly or provide a local file path.
OpenAI Integration: Optionally provide OpenAI API details to enhance conversion with LLM capabilities.
Conversion Result: View the resulting Markdown output instantly.
Usage Statistics: Monitor access and token usage through the Gradio interface.

🌐 Resources

Documentation: Explore full documentation
Hugging Face Space: Try the live demo
GitHub Repository: View source code at https://github.com/lh0x00/docsifer

💡 Why Docsifer?

Versatile and Comprehensive: Handles a wide range of formats, making it a one-stop solution for content conversion.
AI-Powered: Uses OpenAI's GPT-4 to enhance extraction accuracy and adapt to complex data structures.
User-Friendly: Offers intuitive APIs and a built-in interactive interface for experimentation.
Scalable and Efficient: Optimized for performance with Docker support and asynchronous processing.
Transparent Analytics: Tracks usage metrics to help monitor and manage service consumption.

👥 Contributors

lamhieu / lh0x00 – Creator and Maintainer (GitHub, HuggingFace)

Contributions are welcome! Check out the contribution guidelines.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.
