[Bug] Fix missing subtitle text in manually downloaded *.SRT files. (issue #10030) #12575

TransZAllen · 2025-08-27T07:49:43Z

What is it?

[✔ ] Bugfix (user facing)
Feature (user facing)
Codebase improvement (dev facing)
Meta improvement to the project (dev facing)

Description of the changes in your PR

Problem

This issue persists in the latest version (0.28.0) and the most recent commit on the dev branch.
Downloaded SRT subtitles for some videos are empty, containing only timestamps and sequence numbers, as reported in #10030. This affects videos with styled subtitles, such as:

https://youtu.be/mtb-qa8xvFU (https://www.youtube.com/watch?v=mtb-qa8xvFU) (【HimeHina MV】Roki (Cover))
https://www.youtube.com/watch?v=zbQRY8KSVbU (Ousama Ranking Opening 2 Full『Hadaka no Yuusha』by Vaundy)
https://youtu.be/-eDYT_20YhM (https://www.youtube.com/watch?v=-eDYT_20YhM) (炜WARD ROMANCE ft. Feng Yi)
https://www.youtube.com/watch?v=L-BgxLtMxh0 (Styled subtitles for YouTube demonstration)
https://www.youtube.com/watch?v=Cc2nkx77U24 (Test: Rainbow Captions In Youtube)
https://www.youtube.com/watch?v=lUDPjyfmJrs (【original anime MV】III【hololive/宝鐘マリン＆こぼ・かなえる】)
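
To make the symptom concrete, here is a minimal sketch of how a single SRT frame is laid out, following the common SubRip convention (sequence number, time range, text, blank line). The class and method names are hypothetical and not taken from NewPipe. The dot-to-comma timestamp conversion is an assumption about how TTML times map onto SRT. With an empty text argument, the frame contains only the sequence number and the timestamps, which is what the affected files looked like.

```java
final class SrtFrameSketch {
    private static final String NEW_LINE = "\r\n";

    // Assumption: TTML times look like "HH:MM:SS.mmm", while SRT uses a
    // comma before the milliseconds.
    static String toSrtTimestamp(final String ttmlTimestamp) {
        return ttmlTimestamp.replace('.', ',');
    }

    // Builds one frame: number, "begin --> end", text, then an empty line.
    static String frame(final int number, final String begin,
                        final String end, final String text) {
        return number + NEW_LINE
                + toSrtTimestamp(begin) + " --> " + toSrtTimestamp(end) + NEW_LINE
                + text + NEW_LINE
                + NEW_LINE;
    }

    public static void main(final String[] args) {
        // Frame as it should look, followed by the empty variant from the bug.
        System.out.print(frame(1, "00:00:01.000", "00:00:03.000", "Hello World!"));
        System.out.print(frame(2, "00:00:03.000", "00:00:05.000", ""));
    }
}
```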

Problem Analysis

The issue occurs because some YouTube subtitles use TTML format with nested tags, which the original parser did not handle correctly.

Problematic TTML Example

<p begin="00:00:01.000" end="00:00:03.000">
  <span style="s4">Hello World!</span>
</p>

The text ("Hello World!") is nested inside a <span> tag used for styling (e.g., colors or karaoke effects). The original parser only processed the direct child nodes of <p>, so it missed the text inside <span> and produced empty SRT output.

Non-Problematic TTML Example

<p begin="00:00:01.000" end="00:00:03.000" style="s2">
  Hello World!
</p>

This TTML has its text directly under <p>, which the original code parsed correctly.

Root Cause

The original SrtFromTtmlWriter.build method used a non-recursive loop, so it failed to extract text from nested tags such as <span> in styled subtitles (e.g., rainbow or karaoke captions).

Solution

Added a new extractText() method that recursively extracts text from all nodes, including TextNode and <br> tags, and handles nested tags like <span>.
Replaced the non-recursive loop in SrtFromTtmlWriter.build with a call to extractText().
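
The recursive idea can be sketched as follows. This is not the PR's code: to stay self-contained it uses the JDK's built-in DOM parser instead of jsoup, and the class name and the textOf helper are hypothetical. It only illustrates the technique of visiting every node depth-first, appending text nodes, and turning <br> into a line break.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class TtmlTextSketch {
    private static final String NEW_LINE = "\r\n";

    // Depth-first walk: append text nodes, emit a line break for <br>,
    // then descend into the children so text inside <span> is kept.
    static void extractText(final Node node, final StringBuilder text) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE
                && node.getNodeName().equalsIgnoreCase("br")) {
            text.append(NEW_LINE);
        }
        final NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            extractText(children.item(i), text);
        }
    }

    // Hypothetical helper: parse a single <p> element and return its text.
    static String textOf(final String paragraphXml) {
        try {
            final Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            paragraphXml.getBytes(StandardCharsets.UTF_8)));
            final StringBuilder text = new StringBuilder();
            extractText(doc.getDocumentElement(), text);
            return text.toString();
        } catch (final Exception e) {
            throw new IllegalStateException("Could not parse paragraph", e);
        }
    }

    public static void main(final String[] args) {
        System.out.println(textOf(
                "<p begin=\"00:00:01.000\" end=\"00:00:03.000\">"
                + "<span style=\"s4\">Hello World!</span></p>"));
    }
}
```

A loop that only looks at the direct children of <p> would see a single <span> element and no text node here, which matches the empty output described above.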

Before/After Screenshots/Screen Record

Before: None
After: None

Fixes the following issue(s)

Fixes Download subtitles, when the language is zh-TW Chinese (Taiwan) to download, the subtitles in the srt file will be blank, only the timeline has no text records #10030

Relies on the following changes

None

APK testing

The APK can be found by going to the "Checks" tab below the title. On the left pane, click on "CI", scroll down to "artifacts" and click "app" to download the zip file which contains the debug APK of this PR. You can find more info and a video demonstration on this wiki page.

Fixed Cases (previously empty SRT files)

https://youtu.be/mtb-qa8xvFU (【HimeHina MV】Roki (Cover))
https://www.youtube.com/watch?v=zbQRY8KSVbU (Ousama Ranking Opening 2 Full『Hadaka no Yuusha』by Vaundy)
https://youtu.be/-eDYT_20YhM (炜WARD ROMANCE ft. Feng Yi)
https://www.youtube.com/watch?v=L-BgxLtMxh0 (Styled subtitles for YouTube demonstration)
https://www.youtube.com/watch?v=Cc2nkx77U24 (Test: Rainbow Captions In Youtube)
https://www.youtube.com/watch?v=lUDPjyfmJrs (【original anime MV】III【hololive/宝鐘マリン＆こぼ・かなえる】)

Regression Testing

Tested a video with simple subtitles that downloaded correctly in the original NewPipe: https://www.youtube.com/watch?v=BVAIImxcv4g .
Confirmed SRT output remains correct after the fix, ensuring no regression.

All tested videos now produce correct SRT files with subtitle text.

Due diligence

[ ✔] I read the contribution guidelines.

…issue TeamNewPipe#10030)
- Previously, *.SRT files only contained timestamps and sequence numbers, without the actual text content.
- Added recursive text extraction to handle nested tags in TTML files (e.g. <span> tags).

TobiGr · 2025-08-27T08:39:26Z

Thank you for the bug fix. I applied a few JDoc and codestyle fixes. You already compiled a good list of subtitles which can be used for testing. Do you mind creating a few unit tests to test this?

TransZAllen · 2025-08-27T09:58:07Z

Thank you for the bug fix. I applied a few JDoc and codestyle fixes. You already compiled a good list of subtitles which can be used for testing. Do you mind creating a few unit tests to test this?

I'm a newcomer to the NewPipe project and this is my first code submission. When I heard about adding unit tests, I got a bit confused 😄. I quickly started preparing them, but then I saw that your review had already been approved!
So, I’ll skip the unit tests this time. Thanks for your review and feedback! 👍
And by the way, I noticed the changes were merged into the v0.28.x branch.

Isira-Seneviratne · 2025-08-28T00:05:14Z

app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java

+        if (node instanceof TextNode textNode) {
+            text.append((textNode).text());
+        } else if (node instanceof Element element) {
+            // <br> is a self-closing HTML tag used to insert a line break.
+            if (element.tagName().equalsIgnoreCase("br")) {
+                // Add a newline for <br> tags
+                text.append(NEW_LINE);
+            }
+        }
+        // Recursively process child nodes
+        for (final Node child : node.childNodes()) {
+            extractText(child, text);
+        }


The build method could be rewritten like this:

Suggested change

-        if (node instanceof TextNode textNode) {
-            text.append((textNode).text());
-        } else if (node instanceof Element element) {
-            // <br> is a self-closing HTML tag used to insert a line break.
-            if (element.tagName().equalsIgnoreCase("br")) {
-                // Add a newline for <br> tags
-                text.append(NEW_LINE);
-            }
-        }
-        // Recursively process child nodes
-        for (final Node child : node.childNodes()) {
-            extractText(child, text);
-        }
+        final List<Pair<Element, String>> pairList = doc.selectStream("body > div > p")
+                .map(paragraph -> {
+                    // Element.text extracts from child nodes as well
+                    return new Pair<>(paragraph, paragraph.text());
+                })
+                .filter(pair -> !ignoreEmptyFrames || !pair.second.isEmpty())
+                .toList();
+        for (final var pair : pairList) {
+            final var paragraph = pair.first;
+            final var text = pair.second;
+            final String begin = getTimestamp(paragraph, "begin");
+            final String end = getTimestamp(paragraph, "end");
+            writeFrame(begin, end, text);
+        }

Element.text extracts from child nodes as well

But does it convert <br> tags to a new line? At least not from what I could find in the docs.

@Isira-Seneviratne

Thanks for the suggestion!

I tried it locally, but the build failed (on the dev branch) because:

Pair requires the commons-lang3 dependency, which is currently not part of NewPipe.

selectStream(...) is only available in newer Jsoup versions. The current setup doesn’t support it.

In addition, the recursive extractText() method is dependency-free, has been tested extensively, and keeps the logic clearer for future modifications (e.g. handling special tags).

For this reason, I'd prefer to keep the current extractText() approach instead of introducing new dependencies or relying on newer APIs.

Android has a built-in Pair type. Also, the current Jsoup version in the project has the method.

@TobiGr You're right, but it works fine for the subtitles from the linked videos. Alternatively, the map operation could be changed to use the existing logic.

I honestly don't see a good reason to use streams here. It makes the whole code harder to read and I am not sure if we achieve an increased performance by using a stream here. If stream is significantly faster here, we could make use of it though.

Right, I changed the parameter to String as well, forgot to add that.

I have pushed the changes to the remote repository : TransZAllen@c7ab950

allen@allen-Inspiron-3558:/media/allen/WinUbuShare/NewPipe_ubuntu/branch_debug_issue10030/NewPipe$ git show
commit c7ab950807d061f0a0f6b1c87be711e00df12e76 (HEAD -> debug_issue10030, origin/debug_issue10030)
Author: TransZAllen <tree.story@outlook.com>
Date:   Sat Aug 30 16:06:52 2025 +0800

    Test a new method 2 to get the actual text content.
    There are still the two observations.

diff --git a/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java b/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
index 5f43185c6..6dc132939 100644
--- a/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
+++ b/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
@@ -6,7 +6,6 @@ import org.jsoup.nodes.Element;
 import org.jsoup.nodes.Node;
 import org.jsoup.nodes.TextNode;
 import org.jsoup.parser.Parser;
-import org.jsoup.select.Elements;
 import org.schabi.newpipe.streams.io.SharpStream;
 
 import java.io.ByteArrayInputStream;
@@ -101,27 +100,14 @@ public class SrtFromTtmlWriter {
                 Parser.xmlParser());
 
         final StringBuilder text = new StringBuilder(128);
-        final Elements paragraphList = doc.select("body > div > p");
-
-        // check if has frames
-        if (paragraphList.isEmpty()) {
-            return;
-        }
-
-        for (final Element paragraph : paragraphList) {
-            text.setLength(0);
-
-            // Recursively extract text from all child nodes
-            extractText(paragraph, text);
-
-            if (ignoreEmptyFrames && text.length() < 1) {
-                continue;
-            }
-
+        final var paragraphs = doc.selectStream("body > div > p")
+                .filter(paragraph -> !ignoreEmptyFrames || paragraph.hasText())
+                .toList();
+        for (final var paragraph : paragraphs) {
             final String begin = getTimestamp(paragraph, "begin");
             final String end = getTimestamp(paragraph, "end");
-
-            writeFrame(begin, end, text);
+            //writeFrame(begin, end, paragraph.text());
+            writeFrame(begin, end, new StringBuilder(paragraph.text()));
         }
     }
 }
allen@allen-Inspiron-3558:/media/allen/WinUbuShare/NewPipe_ubuntu/branch_debug_issue10030/NewPipe$

Right, I changed the parameter to String as well, forgot to add that.

I accidentally edited your reply, sorry about that

Switching to paragraph.text() without verifying whether all edge cases (like special characters or tags) are handled properly could introduce new issues.

The extractText() function consolidates NewPipe's scattered code into a single recursive function that processes the text content of all tags. This code has been tested for a long time and is well suited to NewPipe's design requirements. It guarantees the handling of special characters, especially line breaks (<br>), which are explicitly replaced with \r\n; paragraph.text() may not do that correctly. Some software, for instance, treats <br> as \n or even as a space, and we can't be sure that paragraph.text() handles these cases the same way NewPipe does. The same uncertainty applies to any other special characters.

Switching to paragraph.text() introduces the risk of losing control over special character handling. If we need to add special handling later, paragraph.text() could lock us out of such flexibility and we can't change its code. Essentially, it becomes a black box, and we’re unsure of how it’ll behave with our specific use cases.

For these reasons, I recommend continuing to use extractText().

Here is the paragraph.text() definition from https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L1552

public String text() {
    final StringBuilder accum = StringUtil.borrowBuilder();
    new TextAccumulator(accum).traverse(this);
    return StringUtil.releaseBuilder(accum).trim();
}

Note that it calls trim().
After researching it, I found that trim() is a standard Java String method that removes leading and trailing whitespace from a string.

Since it strips that whitespace, it is not suitable for subtitles.
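
To illustrate what trim() does and does not touch, here is a minimal sketch that uses only the JDK. It does not call jsoup, and the class and method names are made up for the example. It shows that String.trim() strips whitespace at the two ends of the string and leaves interior whitespace, including a \r\n line break, unchanged. Whether losing the leading and trailing whitespace matters for a given cue is a separate question that this sketch does not answer.

```java
final class TrimSketch {
    // Wraps the value in brackets so leading/trailing whitespace is visible.
    static String show(final String value) {
        return "[" + value + "]";
    }

    public static void main(final String[] args) {
        // Hypothetical cue text with padding on both ends and a line break inside.
        final String cue = "  first line \r\n second line  ";

        System.out.println(show(cue));
        System.out.println(show(cue.trim()));
    }
}
```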

TobiGr · 2025-08-28T06:35:23Z

I'm a newcomer to the NewPipe project and this is my first code submission.

Hi and welcome!

When I heard about adding unit tests, I got a bit confused 😄. I quickly started preparing them, but then I saw that your review had already been approved!

If you wrote some, feel free to add them to the PR. Our test coverage is dramatically low.

And by the way, I noticed the changes were merged into the v0.28.x branch.

This PR has not been merged yet. I just added it to the project to ensure it is merged before the 0.28.1 release.

github-actions bot added the size/small PRs with less than 50 changed lines label Aug 27, 2025

TobiGr added bug Issue is related to a bug downloader Issue is related to the downloader labels Aug 27, 2025

Fix JDoc and apply suggestions

e1888ed

TobiGr force-pushed the dev branch from 312cbf9 to e1888ed Compare August 27, 2025 08:38

TobiGr approved these changes Aug 27, 2025

TobiGr added this to v0.28.x Aug 27, 2025

github-project-automation bot moved this to Todo in v0.28.x Aug 27, 2025

TobiGr moved this from Todo to In Progress in v0.28.x Aug 27, 2025

Isira-Seneviratne reviewed Aug 28, 2025

Merge branch 'TeamNewPipe:dev' into dev

74518c8
