Summary
Introduce optional audio tagging in the worker pipeline so that subtitles (SRT/VTT) can include neutral audio cues such as music, applause, laughter and crowd noise. The feature should be available both:
- When triggered via flags set in Kaltura REACH, and
- When users order captions in the standalone Sunet Scribe web interface, via a simple on/off toggle.
Background
No organisation currently uses automatic “Captions Audio Tags” in Kaltura REACH, but several have expressed interest in richer accessibility features. Sunet Scribe should therefore support this capability so that institutions can enable it when they are ready.
In parallel, the standalone web frontend should provide a user-facing toggle to request audio tagging when ordering captions, independent of REACH.
For the initial version, we should restrict tagging to four neutral and non-sensitive categories:
- music
- applause
- laughter
- crowd (ambient audience noise)
These tags must be rendered in the same language as the subtitles (e.g. [musik] in Swedish, [music] in English).
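A minimal sketch of how this localisation could be structured; only the Swedish and English renderings of [music] are given in this issue, so the other tag texts below are illustrative assumptions to be confirmed by translators:

```python
# Hypothetical mapping from tag category to localised cue text.
# Only [musik]/[music] are specified in this issue; the remaining
# strings are placeholder translations, not final wording.
AUDIO_TAG_LABELS = {
    "music":    {"en": "[music]",    "sv": "[musik]"},
    "applause": {"en": "[applause]", "sv": "[applåder]"},
    "laughter": {"en": "[laughter]", "sv": "[skratt]"},
    "crowd":    {"en": "[crowd]",    "sv": "[publiksorl]"},
}

def render_tag(category: str, subtitle_lang: str) -> str | None:
    """Return the cue text for a category in the subtitle language,
    or None if no translation exists (then no tag is emitted)."""
    return AUDIO_TAG_LABELS.get(category, {}).get(subtitle_lang)
```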
Model suggestion
Use an AudioSet-trained classifier such as PANNs CNN14 (lighter, mainly for CPU use) or PANNs ResNet22 (heavier, more accurate, and intended for GPU pipelines), which covers the required categories reliably and runs well on-premises; a loading sketch follows the links below.
Example model links:
https://huggingface.co/nicofarr/panns_Wavegram_Logmel_Cnn14
https://huggingface.co/nicofarr/panns_ResNet22
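A rough sketch of clip-level tagging, assuming the `panns_inference` package (which wraps the CNN14 checkpoint by default) rather than the Hugging Face ports linked above. The exact AudioSet label strings ("Music", "Applause", etc.) are assumptions that should be verified against `panns_inference.labels`:

```python
# Clip-level audio tagging with PANNs CNN14 via panns_inference.
# The label strings below are assumed AudioSet class names; verify
# against panns_inference.labels before relying on them.
import librosa
from panns_inference import AudioTagging, labels

NEUTRAL_CATEGORIES = {"Music": "music", "Applause": "applause",
                      "Laughter": "laughter", "Crowd": "crowd"}

at = AudioTagging(checkpoint_path=None, device="cpu")  # downloads CNN14 by default

def tag_clip(path: str) -> dict[str, float]:
    """Return {category: probability} for the four neutral categories."""
    audio, _ = librosa.load(path, sr=32000, mono=True)  # PANNs expects 32 kHz
    clipwise_output, _ = at.inference(audio[None, :])   # shape (1, n_samples)
    scores = clipwise_output[0]                         # one score per AudioSet class
    return {NEUTRAL_CATEGORIES[name]: float(scores[i])
            for i, name in enumerate(labels) if name in NEUTRAL_CATEGORIES}
```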
Requirements (high-level)
- Audio tagging activates only when explicitly requested:
- via REACH job metadata or
- via a toggle in the standalone web interface.
- Only the four defined neutral categories may generate tags.
- Tags must follow the output subtitle language.
- No sensitive categories (crying, screaming, coughing, gasping, emotional sounds, etc.) may ever be emitted.
- If the classifier is uncertain or another category dominates, no tag should be produced (see the gating sketch after this list).
- Tags should appear as standard cues in SRT/VTT (placement to be decided by the team).
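One possible reading of the gating rule above, assuming per-clip probabilities for all classifier classes (the four neutral ones plus everything else); the thresholds are illustrative placeholders for the team to tune:

```python
# Hypothetical gating: emit a tag only when one of the four neutral
# categories is both confident and clearly dominant over every other
# class. MIN_CONFIDENCE and MIN_MARGIN are placeholder values.
MIN_CONFIDENCE = 0.5   # classifier must be reasonably sure
MIN_MARGIN = 0.2       # ...and no other class may come close

def decide_tag(neutral_scores: dict[str, float],
               best_other_score: float) -> str | None:
    """Return the winning neutral category, or None (no cue emitted).

    neutral_scores: {category: probability} for the four neutral classes.
    best_other_score: highest probability among all remaining classes.
    """
    category, score = max(neutral_scores.items(), key=lambda kv: kv[1])
    if score < MIN_CONFIDENCE:
        return None                       # classifier is uncertain
    if best_other_score > score - MIN_MARGIN:
        return None                       # another category dominates
    return category
```

If a tag survives the gate, it could be written as an ordinary cue in the output file, for example in Swedish SRT (cue number and timestamps are illustrative):

```
12
00:03:14,000 --> 00:03:18,500
[musik]
```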
Extensibility
The feature should be designed so that additional neutral, non-sensitive categories can be added later after evaluation. The initial set is intentionally limited.
Out of scope
- No changes to Whisper/VAD decision-making.
- No tagging of sensitive or person-related audio categories.
- No automatic expansion of the tag set without a product decision.
Expected outcome
When enabled, the pipeline should produce subtitles enriched with clear, language-appropriate audio tags for neutral events (music, applause, laughter, crowd noise). This improves accessibility for both REACH-integrated workflows and the standalone web interface, even though no institution uses REACH audio tagging today.