Conversation

@TransZAllen TransZAllen commented Aug 27, 2025

What is it?

  • [✔] Bugfix (user facing)
  • Feature (user facing)
  • Codebase improvement (dev facing)
  • Meta improvement to the project (dev facing)

Description of the changes in your PR

Problem

This issue persists in the latest version (0.28.0) and the most recent commit on the dev branch.
Downloaded SRT subtitles for some videos are empty, containing only timestamps and sequence numbers, as reported in #10030. This affects videos with styled subtitles, such as:

Problem Analysis

The issue occurs because some YouTube subtitles use TTML format with nested tags, which the original parser did not handle correctly.

Problematic TTML Example
<p begin="00:00:01.000" end="00:00:03.000">
  <span style="s4">Hello World!</span>
</p>

The text ("Hello World!") is nested inside a <span> tag for styling (e.g., colors or karaoke effects). The original parser only processed direct child nodes, missing the text inside <span>, resulting in empty SRT output.

Non-Problematic TTML Example

<p begin="00:00:01.000" end="00:00:03.000" style="s2">
  Hello World!
</p>

This TTML has text directly under <p>, which was parsed correctly by the original code.

Root Cause

The original SrtFromTtmlWriter.build method used a non-recursive loop, failing to extract text from nested tags like <span> in styled subtitles (e.g., rainbow or karaoke captions).
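The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not NewPipe's code: it assumes the JDK's built-in DOM parser instead of Jsoup so that it runs without dependencies, and the class and method names (DirectChildDemo, directChildText) are hypothetical. It only mirrors the general shape of a loop that looks at direct children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DirectChildDemo {
    /** Parses a small XML fragment and returns its root element. */
    static Element parse(final String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (final Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Collects text from direct text-node children only (non-recursive). */
    static String directChildText(final Element paragraph) {
        final StringBuilder text = new StringBuilder();
        final NodeList children = paragraph.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            final Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
            // Child elements such as <span> are never descended into,
            // so any text they contain is lost.
        }
        return text.toString();
    }

    public static void main(final String[] args) {
        final Element plain = parse("<p>Hello World!</p>");
        final Element styled = parse("<p><span style=\"s4\">Hello World!</span></p>");
        System.out.println("[" + directChildText(plain) + "]");  // prints [Hello World!]
        System.out.println("[" + directChildText(styled) + "]"); // prints []
    }
}
```

The styled paragraph yields an empty string, which corresponds to an SRT frame that has a sequence number and timestamps but no text.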

Solution

Added a new extractText() method to recursively extract text from all nodes, including TextNode and <br> tags, handling nested tags like <span>.
Replaced the non-recursive loop in SrtFromTtmlWriter.build with a call to extractText().
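As a rough illustration of the recursive approach, here is a sketch that again assumes the JDK DOM parser rather than Jsoup, so the node-type checks differ from the PR's instanceof checks. The class name RecursiveExtractDemo is hypothetical; NEW_LINE is assumed to be "\r\n", as stated later in this thread.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class RecursiveExtractDemo {
    private static final String NEW_LINE = "\r\n";

    /** Parses a small XML fragment and returns its root element. */
    static Element parse(final String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (final Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Appends the text of the node and all of its descendants, mapping <br> to a line break. */
    static void extractText(final Node node, final StringBuilder text) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE
                && "br".equalsIgnoreCase(node.getNodeName())) {
            text.append(NEW_LINE);
        }
        // Recurse so that text inside nested tags such as <span> is reached.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            extractText(child, text);
        }
    }

    public static void main(final String[] args) {
        final StringBuilder text = new StringBuilder();
        extractText(parse("<p><span style=\"s4\">Hello<br/>World!</span></p>"), text);
        System.out.println(text.toString().equals("Hello\r\nWorld!")); // prints true
    }
}
```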

Before/After Screenshots/Screen Record

  • Before: None
  • After: None

Fixes the following issue(s)

Relies on the following changes

  • None

APK testing

The APK can be found by going to the "Checks" tab below the title. On the left pane, click on "CI", scroll down to "artifacts" and click "app" to download the zip file which contains the debug APK of this PR. You can find more info and a video demonstration on this wiki page.

  • Fixed Cases (previously empty SRT files)

https://youtu.be/mtb-qa8xvFU (【HimeHina MV】Roki (Cover))
https://www.youtube.com/watch?v=zbQRY8KSVbU (Ousama Ranking Opening 2 Full『Hadaka no Yuusha』by Vaundy)
https://youtu.be/-eDYT_20YhM (炜WARD ROMANCE ft. Feng Yi)
https://www.youtube.com/watch?v=L-BgxLtMxh0 (Styled subtitles for YouTube demonstration)
https://www.youtube.com/watch?v=Cc2nkx77U24 (Test: Rainbow Captions In Youtube)
https://www.youtube.com/watch?v=lUDPjyfmJrs (【original anime MV】III【hololive/宝鐘マリン&こぼ・かなえる】)

  • Regression Testing

Tested a video with simple subtitles that downloaded correctly in the original NewPipe: https://www.youtube.com/watch?v=BVAIImxcv4g .
Confirmed SRT output remains correct after the fix, ensuring no regression.

All tested videos now produce correct SRT files with subtitle text.
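A check like this could also be automated. The sketch below is a hypothetical helper, not part of NewPipe: it counts frames in an SRT string whose timing line is not followed by any text. It assumes that timing lines are the only lines containing " --> ", so a subtitle whose text contains that sequence would be miscounted.

```java
final class SrtEmptyFrameCheck {
    /** Counts frames whose timing line is followed by a blank line or by nothing. */
    static int countEmptyFrames(final String srt) {
        // Accept both \r\n and \n line endings; keep trailing empty strings.
        final String[] lines = srt.split("\\r?\\n", -1);
        int empty = 0;
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains(" --> ")) {
                final boolean hasText = i + 1 < lines.length
                        && !lines[i + 1].trim().isEmpty();
                if (!hasText) {
                    empty++;
                }
            }
        }
        return empty;
    }

    public static void main(final String[] args) {
        final String srt = "1\r\n00:00:01,000 --> 00:00:03,000\r\n\r\n\r\n"
                + "2\r\n00:00:03,000 --> 00:00:05,000\r\nHello World!\r\n\r\n";
        System.out.println(countEmptyFrames(srt)); // prints 1
    }
}
```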

Due diligence

…issue TeamNewPipe#10030)

- Previously, *.SRT files only contained timestamps and sequence numbers, without the actual text content.
- Added recursive text extraction to handle nested tags in TTML
  files (e.g. <span> tags).
@github-actions github-actions bot added the size/small PRs with less than 50 changed lines label Aug 27, 2025
@TobiGr TobiGr added bug Issue is related to a bug downloader Issue is related to the downloader labels Aug 27, 2025
@TobiGr
Contributor

TobiGr commented Aug 27, 2025

Thank you for the bug fix. I applied a few JDoc and codestyle fixes. You already compiled a good list of subtitles which can be used for testing. Do you mind creating a few unit tests to test this?

@TobiGr TobiGr added this to v0.28.x Aug 27, 2025
@github-project-automation github-project-automation bot moved this to Todo in v0.28.x Aug 27, 2025
@TobiGr TobiGr moved this from Todo to In Progress in v0.28.x Aug 27, 2025
@TransZAllen
Author

TransZAllen commented Aug 27, 2025

Thank you for the bug fix. I applied a few JDoc and codestyle fixes. You already compiled a good list of subtitles which can be used for testing. Do you mind creating a few unit tests to test this?

I'm a newcomer to the NewPipe project and this is my first code submission. When I heard about adding unit tests, I got a bit confused 😄. I quickly started preparing them, but then I saw that your review had already been approved!
So, I’ll skip the unit tests this time. Thanks for your review and feedback! 👍
And by the way, I noticed the changes were merged into the v0.28.x branch.

Comment on lines +71 to +83
if (node instanceof TextNode textNode) {
    text.append((textNode).text());
} else if (node instanceof Element element) {
    // <br> is a self-closing HTML tag used to insert a line break.
    if (element.tagName().equalsIgnoreCase("br")) {
        // Add a newline for <br> tags
        text.append(NEW_LINE);
    }
}
// Recursively process child nodes
for (final Node child : node.childNodes()) {
    extractText(child, text);
}
Member

@Isira-Seneviratne Isira-Seneviratne Aug 28, 2025


The build method could be rewritten like this:

Suggested change
-if (node instanceof TextNode textNode) {
-    text.append((textNode).text());
-} else if (node instanceof Element element) {
-    // <br> is a self-closing HTML tag used to insert a line break.
-    if (element.tagName().equalsIgnoreCase("br")) {
-        // Add a newline for <br> tags
-        text.append(NEW_LINE);
-    }
-}
-// Recursively process child nodes
-for (final Node child : node.childNodes()) {
-    extractText(child, text);
-}
+final List<Pair<Element, String>> pairList = doc.selectStream("body > div > p")
+        .map(paragraph -> {
+            // Element.text extracts from child nodes as well
+            return new Pair<>(paragraph, paragraph.text());
+        })
+        .filter(pair -> !ignoreEmptyFrames || !pair.second.isEmpty())
+        .toList();
+for (final var pair : pairList) {
+    final var paragraph = pair.first;
+    final var text = pair.second;
+    final String begin = getTimestamp(paragraph, "begin");
+    final String end = getTimestamp(paragraph, "end");
+    writeFrame(begin, end, text);
+}

Contributor

@TobiGr TobiGr Aug 28, 2025


Element.text extracts from child nodes as well

But does it convert <br> tags to a new line? At least not from what I could find in the docs.
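The concern can be shown by analogy without Jsoup. In the sketch below, the JDK DOM's getTextContent() also gathers text from all descendants, yet an empty <br/> element contributes nothing, so the line break disappears. This is only an analogy under that assumption: Jsoup's Element.text() applies its own whitespace rules and would need to be checked against its documentation or source. The class name TextContentDemo is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class TextContentDemo {
    /** Parses a small XML fragment and returns its root element. */
    static Element parse(final String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (final Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(final String[] args) {
        final Element paragraph = parse("<p><span>Hello<br/>World!</span></p>");
        // Descendant text is concatenated; the empty <br/> adds no characters.
        System.out.println(paragraph.getTextContent()); // prints HelloWorld!
    }
}
```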

Author


@Isira-Seneviratne

Thanks for the suggestion!

I gave it a try locally, but it failed to build (built with dev branch) because:

  1. Pair requires the commons-lang3 dependency, which is currently not part of NewPipe.
  2. selectStream(...) is only available in newer Jsoup versions. The current setup doesn’t support it.

In addition, the recursive extractText() method is dependency-free, has been tested extensively, and keeps the logic clearer for future modifications (e.g. handling special tags).

For this reason, I’d prefer to keep the current approach extractText() instead of introducing new dependencies or relying on newer APIs.

Member

@Isira-Seneviratne Isira-Seneviratne Aug 28, 2025


Android has a built-in Pair type. Also, the current Jsoup version in the project has the method.

@TobiGr You're right, but it works fine for the subtitles from the linked videos. Alternatively, the map operation could be changed to use the existing logic.

Contributor


I honestly don't see a good reason to use streams here. It makes the whole code harder to read and I am not sure if we achieve an increased performance by using a stream here. If stream is significantly faster here, we could make use of it though.
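For comparison, here is a small sketch of the two styles applied to the same empty-frame filter. It works on plain strings instead of Jsoup elements, and the names FrameFilterDemo, withLoop, and withStream are hypothetical. Both methods return the same list; the sketch makes no claim about relative performance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class FrameFilterDemo {
    /** Loop version: skips empty frame texts when ignoreEmptyFrames is set. */
    static List<String> withLoop(final List<String> frames, final boolean ignoreEmptyFrames) {
        final List<String> kept = new ArrayList<>();
        for (final String text : frames) {
            if (ignoreEmptyFrames && text.isEmpty()) {
                continue;
            }
            kept.add(text);
        }
        return kept;
    }

    /** Stream version: the same condition, expressed as a filter predicate. */
    static List<String> withStream(final List<String> frames, final boolean ignoreEmptyFrames) {
        return frames.stream()
                .filter(text -> !ignoreEmptyFrames || !text.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(final String[] args) {
        final List<String> frames = Arrays.asList("Hello", "", "World!");
        System.out.println(withLoop(frames, true));   // prints [Hello, World!]
        System.out.println(withStream(frames, true)); // prints [Hello, World!]
    }
}
```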

Author

@TransZAllen TransZAllen Aug 30, 2025


Right, I changed the parameter to String as well, forgot to add that.

Author


I have pushed the changes to the remote repository: TransZAllen@c7ab950

allen@allen-Inspiron-3558:/media/allen/WinUbuShare/NewPipe_ubuntu/branch_debug_issue10030/NewPipe$ git show
commit c7ab950807d061f0a0f6b1c87be711e00df12e76 (HEAD -> debug_issue10030, origin/debug_issue10030)
Author: TransZAllen <tree.story@outlook.com>
Date:   Sat Aug 30 16:06:52 2025 +0800

    Test a new method 2 to get the actual text content.
    
    There are still the two observations.

diff --git a/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java b/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
index 5f43185c6..6dc132939 100644
--- a/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
+++ b/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
@@ -6,7 +6,6 @@ import org.jsoup.nodes.Element;
 import org.jsoup.nodes.Node;
 import org.jsoup.nodes.TextNode;
 import org.jsoup.parser.Parser;
-import org.jsoup.select.Elements;
 import org.schabi.newpipe.streams.io.SharpStream;
 
 import java.io.ByteArrayInputStream;
@@ -101,27 +100,14 @@ public class SrtFromTtmlWriter {
                 Parser.xmlParser());
 
         final StringBuilder text = new StringBuilder(128);
-        final Elements paragraphList = doc.select("body > div > p");
-
-        // check if has frames
-        if (paragraphList.isEmpty()) {
-            return;
-        }
-
-        for (final Element paragraph : paragraphList) {
-            text.setLength(0);
-
-            // Recursively extract text from all child nodes
-            extractText(paragraph, text);
-
-            if (ignoreEmptyFrames && text.length() < 1) {
-                continue;
-            }
-
+        final var paragraphs = doc.selectStream("body > div > p")
+                .filter(paragraph -> !ignoreEmptyFrames || paragraph.hasText())
+                .toList();
+        for (final var paragraph : paragraphs) {
             final String begin = getTimestamp(paragraph, "begin");
             final String end = getTimestamp(paragraph, "end");
-
-            writeFrame(begin, end, text);
+            //writeFrame(begin, end, paragraph.text());
+            writeFrame(begin, end, new StringBuilder(paragraph.text()));
         }
     }
 }
allen@allen-Inspiron-3558:/media/allen/WinUbuShare/NewPipe_ubuntu/branch_debug_issue10030/NewPipe$ 

Member


Right, I changed the parameter to String as well, forgot to add that.

I accidentally edited your reply, sorry about that

Author


Switching to paragraph.text() without verifying whether all edge cases (like special characters or tags) are handled properly could introduce new issues.

The extractText() function consolidates NewPipe’s scattered code into a single, recursive function that processes all text content from various tags. This code has been tested for a long time and is well-suited to NewPipe’s design requirements. It guarantees the handling of special characters, especially line breaks (<br>), which are explicitly replaced with \r\n, something that paragraph.text() may not do correctly. Some software, for instance, defines <br> as \n or even as spaces, and we can't be sure that paragraph.text() handles these cases the same way as NewPipe does. We also don't know how it treats other special characters.

Switching to paragraph.text() introduces the risk of losing control over special character handling. If we need to add special handling later, paragraph.text() could lock us out of such flexibility and we can't change its code. Essentially, it becomes a black box, and we’re unsure of how it’ll behave with our specific use cases.

For these reasons, I recommend continuing to use extractText().

Author


Here is the paragraph.text() definition from https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L1552

    public String text() {
        final StringBuilder accum = StringUtil.borrowBuilder();
        new TextAccumulator(accum).traverse(this);
        return StringUtil.releaseBuilder(accum).trim();
    }

And it uses the trim() function.
After researching it on Google, I found that trim() is a standard Java String method used to remove leading and trailing whitespace from a string.

Since it removes whitespace, it is not suitable for subtitles.
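To make the effect of trim() concrete, here is a short sketch using only java.lang.String. It shows that trim() strips whitespace, including \r and \n, from the two ends of the string and leaves interior whitespace and line breaks untouched. It says nothing about what Jsoup does before calling trim(). The class name TrimDemo and the helper visible() are hypothetical.

```java
final class TrimDemo {
    /** Makes carriage returns and line feeds visible for printing. */
    static String visible(final String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(final String[] args) {
        final String raw = "  Hello\r\nWorld!  \r\n";
        // Only leading and trailing whitespace is removed; the inner \r\n stays.
        System.out.println("[" + visible(raw.trim()) + "]"); // prints [Hello\r\nWorld!]
    }
}
```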

@TobiGr
Contributor

TobiGr commented Aug 28, 2025

I'm a newcomer to the NewPipe project and this is my first code submission.

Hi and welcome!

When I heard about adding unit tests, I got a bit confused 😄. I quickly started preparing them, but then I saw that your review had already been approved!

If you wrote some, feel free to add them to the PR. Our test coverage is dramatically low.

And by the way, I noticed the changes were merged into the v0.28.x branch.

This PR has not been merged yet. I just added it to the project to ensure it is merged before the 0.28.1 release.

Labels
bug Issue is related to a bug downloader Issue is related to the downloader size/small PRs with less than 50 changed lines
Projects
Status: In Progress
3 participants