Feature Request: Automatic Anonymization of Transcribed Text

Description

To protect sensitive information in transcribed content, we should implement automatic anonymization.
By using a combination of Named Entity Recognition (NER) and regex-based matching, we can identify and mask personal data in transcripts.

In the proof-of-concept (POC) phase, anonymization results should be saved to a sidecar file. This allows the original and anonymized text to be compared and the quality reviewed before anonymization is applied permanently.

⸻

Expected Behavior
• The system should automatically identify and replace sensitive information such as:
  • Personal names → [NAME]
  • Personal identity numbers → [PERSONAL_ID]
  • Phone numbers → [PHONE]
  • Addresses → [ADDRESS]
  • Organizations/companies → [ORGANIZATION]
• The anonymized version is stored in a separate sidecar file, so it can be reviewed without affecting the original transcription.
• Users may choose the level of anonymization:
  • Full anonymization – replaces all identified sensitive data.
  • Pseudonymization – retains structure but masks selected elements.

⸻

Technical Guidelines

NER-based anonymization
• Use a Swedish NER model (e.g., KB-BERT-NER).
• Identify and replace relevant entities with general labels.

Example output:
[NAME] bor på [ADDRESS] i [GPE] och jobbar på [ORGANIZATION].
(English: [NAME] lives at [ADDRESS] in [GPE] and works at [ORGANIZATION].)
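
The replacement step could look roughly like the sketch below. It assumes the NER model returns character spans as dicts with `start`, `end`, and `entity_group` keys (the shape produced by a Hugging Face token-classification pipeline when `aggregation_strategy` is set). The function name `mask_entities` and the label mapping are hypothetical; the label names must be checked against the model that is actually chosen.

```python
# Assumed mapping from model labels to the placeholders used in this issue.
# PER, LOC and ORG are illustrative label names, not confirmed for any model.
LABEL_TO_PLACEHOLDER = {
    "PER": "[NAME]",
    "LOC": "[ADDRESS]",
    "ORG": "[ORGANIZATION]",
}


def mask_entities(text, entities):
    """Replace each recognized entity span with its placeholder label.

    `entities` is a list of dicts with `start`, `end` and `entity_group`.
    Spans are assumed not to overlap.
    """
    result = text
    # Work from the end of the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = LABEL_TO_PLACEHOLDER.get(ent["entity_group"])
        if placeholder is None:
            continue  # Labels without a mapping are left untouched.
        result = result[: ent["start"]] + placeholder + result[ent["end"] :]
    return result
```

Keeping the replacement separate from the model call means the same function can be reused when comparing several NER models.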

Regex-based anonymization
• Apply regular expressions to detect and mask structured data, such as personal IDs and phone numbers.

Example output:
Mitt personnummer är [PERSONAL_ID] och mitt nummer är [PHONE].
(English: My personal identity number is [PERSONAL_ID] and my number is [PHONE].)
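
A minimal sketch of the regex step, under stated assumptions: the two patterns below are illustrative guesses at common Swedish formats (personal identity numbers written as YYMMDD-XXXX or YYYYMMDDXXXX, phone numbers such as 070-123 45 67 or +46 70 123 45 67) and are not a complete specification. The names `PERSONAL_ID_RE`, `PHONE_RE`, and `mask_structured` are hypothetical.

```python
import re

# Assumed pattern: optional century, six date digits, optional separator,
# four trailing digits. No date or checksum validation is performed.
PERSONAL_ID_RE = re.compile(r"(?<!\d)(?:\d{2})?\d{6}[-+]?\d{4}(?!\d)")

# Assumed pattern: national (leading 0) or international (+46) prefix,
# followed by digit groups separated by an optional space or hyphen.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+46\s?|0)\d{1,3}[-\s]?\d{2,3}(?:\s?\d{2}){2}(?!\d)"
)


def mask_structured(text):
    """Mask personal identity numbers first, then phone numbers."""
    text = PERSONAL_ID_RE.sub("[PERSONAL_ID]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(mask_structured(
    "Mitt personnummer är 850312-1234 och mitt nummer är 070-123 45 67."
))
```

One known limitation of this sketch: a ten-digit number written without any separator fits both formats, and here it would be masked as [PERSONAL_ID]. Such cases are a good candidate for the quality review.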

⸻

Sidecar File: anonymized_output.json

The JSON sidecar file stores:
• The original text,
• The anonymized version,
• A list of identified entities.

This allows for manual or automated review before the anonymization is applied permanently.
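
One possible shape for the sidecar file is sketched below. The field names (`original_text`, `anonymized_text`, `entities`) and the helper names are assumptions for illustration; the issue does not define a schema.

```python
import json
from pathlib import Path


def build_sidecar(original, anonymized, entities):
    """Collect the three pieces of review data in one dict (assumed schema)."""
    return {
        "original_text": original,
        "anonymized_text": anonymized,
        "entities": list(entities),
    }


def write_sidecar(path, payload):
    """Write the payload as UTF-8 JSON, keeping Swedish characters readable."""
    Path(path).write_text(
        json.dumps(payload, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

A call such as `write_sidecar("anonymized_output.json", build_sidecar(original, anonymized, entities))` would leave the original transcription untouched while making both versions available for comparison.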

Future extension ideas:
• Add LLM-based review of anonymization quality.
• Evaluate and compare multiple NER models to determine the best fit for production use.
