Document Duplicate Detection & Merging #234

@adriandarian

Description

Implement duplicate detection for documents based on filename similarity, content similarity, and metadata matching, with options to merge, keep both, or delete duplicates.


Requirements

  • Detect similar document names (fuzzy matching)
  • Detect similar content (similarity score on extracted text)
  • Detect exact duplicates (file hash)
  • Show duplicate warnings on upload
  • Provide merge or keep options
  • Version comparison view
  • Batch duplicate detection
  • Smart merge with conflict resolution

Detection Methods

1. Exact Duplicates (File Hash)

  • Calculate MD5/SHA-256 hash of file
  • Compare with existing document hashes
  • Identical hash = exact duplicate
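
A minimal sketch of the exact-duplicate check above, assuming SHA-256 and an in-memory mapping from hash to document ID. The names `file_sha256`, `find_exact_duplicate`, and `known_hashes` are hypothetical and stand in for whatever storage the project actually uses.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Optional


def file_sha256(stream: BinaryIO, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    # Reading in chunks avoids loading a large upload into memory at once.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicate(
    stream: BinaryIO, known_hashes: Dict[str, str]
) -> Optional[str]:
    """Return the ID of an existing document with the same hash, if any.

    known_hashes maps hex digest -> document ID (hypothetical storage).
    """
    return known_hashes.get(file_sha256(stream))


if __name__ == "__main__":
    existing = {file_sha256(io.BytesIO(b"resume contents")): "doc-1"}
    print(find_exact_duplicate(io.BytesIO(b"resume contents"), existing))
    print(find_exact_duplicate(io.BytesIO(b"something else"), existing))
```

Storing the digest alongside each document at upload time would let this check run as a single indexed lookup rather than rehashing existing files.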

2. Filename Similarity (Fuzzy Matching)

  • Levenshtein distance < 3
  • Same base name with version suffix
  • Example: "Resume_v1.pdf" vs "Resume_v2.pdf"
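
The filename rules above could be sketched as follows. This assumes a case-insensitive comparison and a version suffix of the form `_v<digits>` (also accepting `-` or a space as the separator); both are assumptions, as are the names `levenshtein`, `base_name`, and `names_look_similar`.

```python
import re
from pathlib import PurePath


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance using two rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


# Assumed suffix convention: separator, "v", then digits at the end of the stem.
VERSION_SUFFIX = re.compile(r"[ _-]v\d+$", re.IGNORECASE)


def base_name(filename: str) -> str:
    """Strip the extension and any trailing version suffix, lowercased."""
    stem = PurePath(filename).stem
    return VERSION_SUFFIX.sub("", stem).lower()


def names_look_similar(a: str, b: str, max_distance: int = 3) -> bool:
    """True if the base names match or the edit distance is below max_distance."""
    if base_name(a) == base_name(b):
        return True
    return levenshtein(a.lower(), b.lower()) < max_distance


if __name__ == "__main__":
    print(levenshtein("Resume_v1.pdf", "Resume_v2.pdf"))
    print(names_look_similar("Resume_v1.pdf", "Resume_v2.pdf"))
```

A fixed distance threshold treats short and long filenames the same way, so very short names may need a stricter limit in practice.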

3. Content Similarity

  • Extract text from both documents
  • Calculate similarity score (0-100%)
  • Threshold: >85% = likely duplicate
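
The issue does not name a similarity metric, so the sketch below swaps in Python's standard-library `difflib.SequenceMatcher` over whitespace-separated tokens as one possible choice. Text extraction is assumed to have happened already; `content_similarity` and `is_likely_duplicate` are hypothetical names.

```python
from difflib import SequenceMatcher


def content_similarity(text_a: str, text_b: str) -> float:
    """Return a 0-100 similarity score between two extracted texts."""
    tokens_a = text_a.lower().split()
    tokens_b = text_b.lower().split()
    if not tokens_a and not tokens_b:
        # Two empty documents are treated as identical (an assumption).
        return 100.0
    # autojunk=False disables the popularity heuristic for long inputs.
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    return 100.0 * matcher.ratio()


def is_likely_duplicate(text_a: str, text_b: str, threshold: float = 85.0) -> bool:
    """Apply the >85% rule from the issue to two extracted texts."""
    return content_similarity(text_a, text_b) > threshold


if __name__ == "__main__":
    old = "senior developer with ten years of python and cloud experience"
    new = "senior developer with ten years of python and devops experience"
    print(round(content_similarity(old, new), 1))
    print(is_likely_duplicate(old, new))
```

Sequence matching is quadratic in the worst case, so a batch scan over many large documents might call for a cheaper pre-filter such as shingling or a locality-sensitive hash before running a comparison like this one.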

UI Design

Duplicate Warning on Upload

┌────────────────────────────────────────┐
│ ⚠️  Potential Duplicate Detected        │
├────────────────────────────────────────┤
│ Senior_Dev_Resume_2025.pdf             │
│                                        │
│ Similar to:                            │
│ 📄 Senior_Dev_Resume.pdf               │
│    Uploaded: Oct 15, 2025              │
│    87% content match                   │
│                                        │
│ What would you like to do?             │
│                                        │
│ ◉ Keep both as separate versions       │
│ ○ Replace old with new                 │
│ ○ Keep old, discard new                │
│ ○ View comparison first                │
│                                        │
│ [Cancel]               [Proceed]       │
└────────────────────────────────────────┘
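
One way the three final choices in the dialog might be modeled is shown below, as a sketch only. It assumes documents are tracked as a list of IDs and that "View comparison first" is a navigation step handled elsewhere rather than a resolution. `Resolution` and `apply_resolution` are hypothetical names.

```python
from enum import Enum
from typing import List


class Resolution(Enum):
    KEEP_BOTH = "keep_both"      # Keep both as separate versions
    REPLACE_OLD = "replace_old"  # Replace old with new
    DISCARD_NEW = "discard_new"  # Keep old, discard new


def apply_resolution(
    existing: List[str], old_id: str, new_id: str, choice: Resolution
) -> List[str]:
    """Return the document ID list after applying the user's choice."""
    if choice is Resolution.KEEP_BOTH:
        return existing + [new_id]
    if choice is Resolution.REPLACE_OLD:
        # The new document takes the old one's position in the list.
        return [new_id if doc_id == old_id else doc_id for doc_id in existing]
    # DISCARD_NEW: the upload is dropped and the list is unchanged.
    return list(existing)


if __name__ == "__main__":
    docs = ["doc-1", "doc-2"]
    print(apply_resolution(docs, "doc-1", "doc-3", Resolution.REPLACE_OLD))
```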

Acceptance Criteria

  • Detects exact duplicate files
  • Detects similar filenames
  • Detects similar content
  • Shows duplicate warnings
  • Can compare documents side-by-side
  • Can merge duplicates
  • Can keep both versions
  • Batch duplicate scan works
