Description
Implement duplicate detection for documents based on filename similarity, content similarity, and metadata matching, with options to merge, keep both, or delete duplicates.
Requirements
- Detect similar document names (fuzzy matching)
- Detect similar content (text similarity score)
- Detect exact duplicates (file hash)
- Show duplicate warnings on upload
- Provide options to merge, keep both, or delete
- Version comparison view
- Batch duplicate detection
- Smart merge with conflict resolution
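For the batch detection requirement, one possible starting point is to group already-stored documents by their content hash. This is only a sketch: the `doc_hashes` mapping (document id to hex digest) and the function name are assumptions for illustration, not part of any existing code.

```python
from collections import defaultdict


def group_exact_duplicates(doc_hashes):
    """Group document ids that share the same content hash.

    doc_hashes: hypothetical mapping of document id -> hex digest.
    Returns a list of id groups, each containing two or more documents.
    """
    groups = defaultdict(list)
    for doc_id, digest in doc_hashes.items():
        groups[digest].append(doc_id)
    # Only hashes seen more than once represent duplicates.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

Grouping by hash is a single pass over the library, so a batch scan for exact duplicates avoids comparing every pair of documents.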
Detection Methods
1. Exact Duplicates (File Hash)
- Calculate a hash of the file contents (SHA-256 preferred; MD5 is faster but not collision-resistant)
- Compare with existing document hashes
- Identical hash = exact duplicate
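A minimal sketch of the exact-duplicate check, assuming SHA-256 and a lookup table of previously stored hashes. The `known_hashes` dictionary and both function names are hypothetical; a real implementation would likely query the document store instead.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicate(path, known_hashes):
    """Return the id of a stored document with the same hash, or None.

    known_hashes: hypothetical mapping of hex digest -> document id.
    """
    return known_hashes.get(file_sha256(path))
```

Storing the digest alongside each document at upload time would keep this check to a single lookup per new file.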
2. Filename Similarity (Fuzzy Matching)
- Levenshtein distance < 3
- Same base name with version suffix
- Example: "Resume_v1.pdf" vs "Resume_v2.pdf"
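The two filename rules above could be sketched as follows. The version-suffix pattern (`_v1`, `-v2`, and so on) and the function names are assumptions; the issue does not specify which suffix formats should count.

```python
import re
from pathlib import PurePath


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Assumed suffix format: optional separator, "v", then digits (e.g. "_v2").
VERSION_SUFFIX = re.compile(r"[ _-]?v\d+$", re.IGNORECASE)


def base_name(filename):
    """Lowercased filename without extension or trailing version suffix."""
    return VERSION_SUFFIX.sub("", PurePath(filename).stem).lower()


def names_similar(a, b, max_distance=3):
    """True if the base names match or the stems are fewer than max_distance edits apart."""
    stem_a = PurePath(a).stem.lower()
    stem_b = PurePath(b).stem.lower()
    return base_name(a) == base_name(b) or levenshtein(stem_a, stem_b) < max_distance
```

With these rules, "Resume_v1.pdf" and "Resume_v2.pdf" match on both the shared base name and an edit distance of 1.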
3. Content Similarity
- Extract text from both documents
- Calculate similarity score (0-100%)
- Threshold: >85% = likely duplicate
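One way to produce the 0-100% score is a word-level comparison with Python's standard `difflib`. This is a sketch under the assumption that text extraction has already happened; the issue does not name a similarity algorithm, so `SequenceMatcher` is a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def content_similarity(text_a, text_b):
    """Return a 0-100 similarity score based on matching word sequences."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    # autojunk=False keeps frequent words from being ignored in long documents.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio() * 100


def is_likely_duplicate(text_a, text_b, threshold=85.0):
    """Apply the >85% threshold from the detection rules."""
    return content_similarity(text_a, text_b) > threshold
```

`SequenceMatcher` is quadratic in the worst case, so for large documents or batch scans a cheaper fingerprinting method may be a better fit.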
UI Design
Duplicate Warning on Upload
┌────────────────────────────────────────┐
│ ⚠️ Potential Duplicate Detected        │
├────────────────────────────────────────┤
│ Senior_Dev_Resume_2025.pdf             │
│                                        │
│ Similar to:                            │
│ 📄 Senior_Dev_Resume.pdf               │
│    Uploaded: Oct 15, 2025              │
│    87% content match                   │
│                                        │
│ What would you like to do?             │
│                                        │
│ ◉ Keep both as separate versions       │
│ ○ Replace old with new                 │
│ ○ Keep old, discard new                │
│ ○ View comparison first                │
│                                        │
│ [Cancel] [Proceed]                     │
└────────────────────────────────────────┘
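The four choices in the dialog could be modeled as an enum so the upload handler can branch on them. This is a hypothetical sketch: the `Resolution` names and the `documents_after` helper are illustrative and do not describe existing code.

```python
from enum import Enum


class Resolution(Enum):
    """User choices offered by the duplicate warning dialog."""
    KEEP_BOTH = "keep_both"
    REPLACE_OLD = "replace_old"
    DISCARD_NEW = "discard_new"
    COMPARE_FIRST = "compare_first"


def documents_after(resolution, existing_id, new_id):
    """Return the document ids that remain stored once the choice is applied."""
    if resolution is Resolution.KEEP_BOTH:
        return [existing_id, new_id]
    if resolution is Resolution.REPLACE_OLD:
        return [new_id]
    if resolution is Resolution.DISCARD_NEW:
        return [existing_id]
    # COMPARE_FIRST only opens the comparison view; no storage decision exists yet.
    raise ValueError("comparison requested; no storage decision has been made")
```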
Acceptance Criteria
- Detects exact duplicate files
- Detects similar filenames
- Detects similar content
- Shows duplicate warnings
- Can compare documents side-by-side
- Can merge duplicates
- Can keep both versions
- Batch duplicate scan works