Feature Request: Automatic Anonymization of Transcribed Text #4

@erklu

Description

To protect sensitive information in transcribed content, we should implement automatic anonymization.
By using a combination of Named Entity Recognition (NER) and regex-based matching, we can identify and mask personal data in transcripts.

In the proof-of-concept (POC) phase, anonymization results should be saved to a sidecar file. This allows the original and anonymized text to be compared and the quality to be reviewed before the anonymization is applied permanently.

Expected Behavior
• The system should automatically identify and replace sensitive information such as:
    • Personal names → [NAME]
    • Personal identity numbers → [PERSONAL_ID]
    • Phone numbers → [PHONE]
    • Addresses → [ADDRESS]
    • Organizations/companies → [ORGANIZATION]
• The anonymized version is stored in a separate sidecar file, so it can be reviewed without affecting the original transcription.
• Users may choose the level of anonymization:
    • Full anonymization – replaces all identified sensitive data.
    • Pseudonymization – retains structure but masks selected elements.
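
A minimal sketch of how the two levels could be selected in code. The names `Level`, `DEFAULT_SELECTED`, and `entities_to_mask` are hypothetical, and the choice of which labels count as "selected elements" under pseudonymization is an assumption, not something this issue specifies.

```python
from enum import Enum


class Level(Enum):
    FULL = "full"
    PSEUDONYMIZE = "pseudonymize"


# Hypothetical default: the labels masked when pseudonymizing.
DEFAULT_SELECTED = frozenset({"NAME", "PERSONAL_ID"})


def entities_to_mask(entities, level, selected=DEFAULT_SELECTED):
    """Return the entities that should be replaced at the given level.

    Each entity is a dict with at least a "label" key.
    """
    if level is Level.FULL:
        # Full anonymization: every identified entity is replaced.
        return list(entities)
    # Pseudonymization: only entities with a selected label are replaced.
    return [e for e in entities if e["label"] in selected]
```

Keeping the level decision separate from detection means the same NER and regex results could feed either mode.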

Technical Guidelines

NER-based anonymization
• Use a Swedish NER model (e.g., KB-BERT-NER).
• Identify and replace relevant entities with general labels.

Example output:
[NAME] bor på [ADDRESS] i [GPE] och jobbar på [ORGANIZATION].
(In English: "[NAME] lives at [ADDRESS] in [GPE] and works at [ORGANIZATION].")
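
The replacement step could look roughly like the sketch below. It assumes the NER model returns character-offset spans with a tag. The function `mask_spans`, the mapping `TAG_TO_LABEL`, and the tag names `PER` and `ORG` are illustrative assumptions, since the real tag set depends on the model chosen.

```python
# Hypothetical mapping from model tags to the placeholder labels used here.
TAG_TO_LABEL = {"PER": "NAME", "ORG": "ORGANIZATION"}


def mask_spans(text, spans):
    """Replace each (start, end, tag) span in text with a [LABEL] placeholder.

    Spans are character offsets and are assumed not to overlap.
    Spans whose tag has no mapping are left untouched.
    """
    pieces = []
    cursor = 0
    for start, end, tag in sorted(spans):
        label = TAG_TO_LABEL.get(tag)
        if label is None:
            continue
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(f"[{label}]")        # placeholder instead of the entity
        cursor = end
    pieces.append(text[cursor:])           # remaining tail
    return "".join(pieces)


text = "Anna bor i Lund och jobbar på Volvo."
# Offsets are computed here for illustration; a model would supply them.
spans = [
    (text.index(word), text.index(word) + len(word), tag)
    for word, tag in [("Anna", "PER"), ("Volvo", "ORG")]
]
print(mask_spans(text, spans))
# [NAME] bor i Lund och jobbar på [ORGANIZATION].
```

Working from offsets rather than string replacement avoids masking unrelated occurrences of the same word elsewhere in the transcript.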

Regex-based anonymization
• Apply regular expressions to detect and mask structured data, such as personal IDs and phone numbers.

Example output:
Mitt personnummer är [PERSONAL_ID] och mitt nummer är [PHONE].
(In English: "My personal identity number is [PERSONAL_ID] and my number is [PHONE].")
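
As a rough illustration, the regex pass might be sketched as follows. Both patterns are simplified assumptions about common Swedish formats (for example `YYMMDD-XXXX` and `070-123 45 67`) and would need tuning against real transcripts. `mask_structured` is a hypothetical name.

```python
import re

# Simplified, illustrative patterns; real-world formats vary more than this.
# Personal identity number: YYMMDD-XXXX or YYYYMMDD-XXXX, separator optional.
PERSONAL_ID_RE = re.compile(r"\b(?:\d{2})?\d{6}[-+]?\d{4}\b")
# Phone number: "+46" or a leading "0", then digit groups that may be
# separated by a space or a hyphen.
PHONE_RE = re.compile(r"(?<!\w)(?:\+46|0)[\s-]?\d{1,3}(?:[\s-]?\d{2,3}){2,3}\b")


def mask_structured(text):
    """Mask personal identity numbers and phone numbers in text."""
    # Personal IDs are masked first so their digits cannot be
    # picked up by the phone pattern afterwards.
    text = PERSONAL_ID_RE.sub("[PERSONAL_ID]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_structured(
    "Mitt personnummer är 850101-1234 och mitt nummer är 070-123 45 67."
))
# Mitt personnummer är [PERSONAL_ID] och mitt nummer är [PHONE].
```

Transcribed speech may render numbers as words or with unusual spacing, which patterns like these would miss. That is one reason to review the sidecar output before relying on them.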

Sidecar File: anonymized_output.json

The sidecar file is a JSON document containing:
• The original text,
• The anonymized version,
• A list of identified entities.

This allows for manual or automated review before full application.
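
One possible shape for the sidecar file is sketched below. The key names (`original`, `anonymized`, `entities`), the fields of each entity, and the function `write_sidecar` are assumptions for illustration. The issue only specifies which three pieces of information the file should hold.

```python
import json
from pathlib import Path


def write_sidecar(path, original, anonymized, entities):
    """Write the original text, anonymized text, and entity list to a JSON file."""
    payload = {
        "original": original,
        "anonymized": anonymized,
        "entities": entities,
    }
    # ensure_ascii=False keeps Swedish characters (å, ä, ö) readable in the file.
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return payload


write_sidecar(
    "anonymized_output.json",
    original="Anna jobbar på Volvo.",
    anonymized="[NAME] jobbar på [ORGANIZATION].",
    entities=[
        {"text": "Anna", "label": "NAME", "start": 0, "end": 4},
        {"text": "Volvo", "label": "ORGANIZATION", "start": 15, "end": 20},
    ],
)
```

Storing character offsets with each entity would let a reviewer or a later automated check map every placeholder back to the exact span in the original text.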

Future extension ideas:
• Add LLM-based review of anonymization quality.
• Evaluate and compare multiple NER models to determine the best fit for production use.
