Description
To protect sensitive information in transcribed content, we should implement automatic anonymization.
By using a combination of Named Entity Recognition (NER) and regex-based matching, we can identify and mask personal data in transcripts.
In the POC phase, anonymization results should be saved to a sidecar file. This allows the original and anonymized text to be compared and the quality to be reviewed before anonymization is applied permanently.
⸻
Expected Behavior
• The system should automatically identify and replace sensitive information such as:
    • Personal names → [NAME]
    • Personal identity numbers → [PERSONAL_ID]
    • Phone numbers → [PHONE]
    • Addresses → [ADDRESS]
    • Organizations/companies → [ORGANIZATION]
• The anonymized version is stored in a separate sidecar file, so it can be reviewed without affecting the original transcription.
• Users may choose the level of anonymization:
    • Full anonymization – replaces all identified sensitive data.
    • Pseudonymization – preserves the structure of the text and masks only selected entity types.
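One way to represent the two levels is as a setting that selects which labels get masked. The sketch below is only an illustration: the names `AnonymizationLevel`, `labels_to_mask`, and `PSEUDONYMIZATION_LABELS` are hypothetical, and the subset masked under pseudonymization is an assumption, since this issue does not specify which elements are "selected".

```python
from enum import Enum
from typing import FrozenSet

# Labels taken from the list of placeholders in this issue.
ALL_LABELS: FrozenSet[str] = frozenset(
    {"NAME", "PERSONAL_ID", "PHONE", "ADDRESS", "ORGANIZATION"}
)

# Assumption: the subset masked under pseudonymization is not defined in
# the issue; this particular choice is purely illustrative.
PSEUDONYMIZATION_LABELS: FrozenSet[str] = frozenset(
    {"NAME", "PERSONAL_ID", "PHONE"}
)


class AnonymizationLevel(Enum):
    FULL = "full"
    PSEUDONYMIZATION = "pseudonymization"


def labels_to_mask(level: AnonymizationLevel) -> FrozenSet[str]:
    """Return the set of labels that should be masked for the given level."""
    if level is AnonymizationLevel.FULL:
        return ALL_LABELS
    return PSEUDONYMIZATION_LABELS
```

A masking step could then skip any detected entity whose label is not in the returned set.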
⸻
Technical Guidelines
NER-based anonymization
• Use a Swedish NER model (e.g., KB-BERT-NER).
• Identify and replace relevant entities with general labels.
Example output:
[NAME] bor på [ADDRESS] i [GPE] och jobbar på [ORG].
(English: "[NAME] lives at [ADDRESS] in [GPE] and works at [ORG].")
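The replacement step itself can be sketched independently of the model. The example below assumes the NER model returns character-offset spans as dicts with `entity_group`, `start`, and `end` keys, which is the shape produced by the Hugging Face `transformers` token-classification pipeline when an aggregation strategy is set. The function name `mask_entities`, the tag names, and the tag-to-placeholder mapping are hypothetical and would need to match the tags of the model that is actually chosen.

```python
from typing import Dict, List

# Hypothetical mapping from model tags to the placeholders used in this issue.
# The actual tag names depend on the NER model that is chosen.
TAG_TO_PLACEHOLDER: Dict[str, str] = {
    "PER": "[NAME]",
    "LOC": "[ADDRESS]",
    "ORG": "[ORGANIZATION]",
}


def mask_entities(text: str, entities: List[dict]) -> str:
    """Replace each recognized span with its placeholder.

    `entities` is assumed to be a list of dicts with `entity_group`,
    `start` and `end` (character offsets into `text`).
    """
    # Work from the end of the text so that earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = TAG_TO_PLACEHOLDER.get(ent["entity_group"])
        if placeholder is None:
            continue  # tag is not configured for masking
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written spans standing in for real model output.
sample = "Anna Svensson arbetar hos Volvo."
spans = [
    {"entity_group": "PER", "start": 0, "end": 13},
    {"entity_group": "ORG", "start": 26, "end": 31},
]
print(mask_entities(sample, spans))
```

Replacing from the last span to the first avoids recomputing offsets after each substitution, since the placeholders differ in length from the text they replace.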
Regex-based anonymization
• Apply regular expressions to detect and mask structured data, such as personal IDs and phone numbers.
Example output:
Mitt personnummer är [PERSONAL_ID] och mitt nummer är [PHONE].
(English: "My personal identity number is [PERSONAL_ID] and my number is [PHONE].")
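A minimal sketch of the regex pass, using Python's standard `re` module. Both patterns are deliberately simplified assumptions: the personal ID pattern only matches the `YYMMDD-XXXX` and `YYYYMMDD-XXXX` forms with an explicit separator (a bare 10-digit string is ambiguous with a phone number), and the phone pattern only covers Swedish mobile numbers written as `07X...` or `+46 7X...`. The names `PERSONAL_ID_RE`, `PHONE_RE`, and `mask_structured_data` are hypothetical.

```python
import re

# Illustrative patterns only; real transcripts will need broader coverage.
# Personal identity number: YYMMDD-XXXX or YYYYMMDD-XXXX, separator required.
PERSONAL_ID_RE = re.compile(r"\b(?:19|20)?\d{6}[-+]\d{4}\b")
# Swedish mobile number: 07X or +46 7X followed by seven more digits,
# each optionally preceded by a single space or hyphen.
PHONE_RE = re.compile(r"(?<!\d)(?:\+46\s?|0)7\d(?:[ -]?\d){7}(?!\d)")

# Personal IDs are masked first so their digits cannot be mistaken for a phone number.
RULES = [
    (PERSONAL_ID_RE, "[PERSONAL_ID]"),
    (PHONE_RE, "[PHONE]"),
]


def mask_structured_data(text: str) -> str:
    """Apply each pattern in order and replace matches with its placeholder."""
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text


print(mask_structured_data(
    "Mitt personnummer är 850101-1234 och mitt nummer är 070-123 45 67."
))
```

In practice the patterns would be validated against real transcripts, since speech-to-text output may write numbers as words or with irregular spacing.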
⸻
Sidecar File: anonymized_output.json
The JSON sidecar file stores three things together:
• The original text,
• The anonymized version,
• A list of identified entities.
This allows manual or automated review before anonymization is applied permanently.
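The sidecar layout described above could look roughly like the following. The field names `original_text`, `anonymized_text`, and `entities`, the shape of each entity record, and the helper name `write_sidecar` are assumptions made for illustration; the issue does not define a schema.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def write_sidecar(path: Path, original: str, anonymized: str, entities: List[dict]) -> dict:
    """Write the original text, anonymized text and entity list to a JSON sidecar."""
    record = {
        "original_text": original,
        "anonymized_text": anonymized,
        "entities": entities,
    }
    # ensure_ascii=False keeps Swedish characters (å, ä, ö) readable in the file.
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return record


# Demo: write to a temporary directory and read the file back.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = Path(tmp) / "anonymized_output.json"
    write_sidecar(
        sidecar,
        original="Anna Svensson arbetar hos Volvo.",
        anonymized="[NAME] arbetar hos [ORGANIZATION].",
        entities=[
            {"label": "NAME", "start": 0, "end": 13},
            {"label": "ORGANIZATION", "start": 26, "end": 31},
        ],
    )
    loaded = json.loads(sidecar.read_text(encoding="utf-8"))

print(loaded["anonymized_text"])
```

Keeping the original text and the entity offsets in the same file lets a reviewer see exactly which spans were replaced without touching the transcription itself.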
Future extension ideas:
• Add LLM-based review of anonymization quality.
• Evaluate and compare multiple NER models to determine the best fit for production use.