[Bug] Fix missing subtitle text in manually downloaded *.SRT files. (issue #10030) #12575
base: dev
…issue TeamNewPipe#10030)
- Previously, *.SRT files only contained timestamps and sequence numbers, without the actual text content.
- Added recursive text extraction to handle nested tags in TTML files (e.g., <span> tags).
Thank you for the bug fix. I applied a few Javadoc and code style fixes. You have already compiled a good list of subtitles that can be used for testing. Would you mind creating a few unit tests for this?
I'm a newcomer to the NewPipe project and this is my first code submission. When I heard about adding
if (node instanceof TextNode textNode) {
    text.append((textNode).text());
} else if (node instanceof Element element) {
    // <br> is a self-closing HTML tag used to insert a line break.
    if (element.tagName().equalsIgnoreCase("br")) {
        // Add a newline for <br> tags
        text.append(NEW_LINE);
    }
}
// Recursively process child nodes
for (final Node child : node.childNodes()) {
    extractText(child, text);
}
The build method could be rewritten like this:
final List<Pair<Element, String>> pairList = doc.selectStream("body > div > p")
        .map(paragraph -> {
            // Element.text extracts from child nodes as well
            return new Pair<>(paragraph, paragraph.text());
        })
        .filter(pair -> !ignoreEmptyFrames || !pair.second.isEmpty())
        .toList();
for (final var pair : pairList) {
    final var paragraph = pair.first;
    final var text = pair.second;
    final String begin = getTimestamp(paragraph, "begin");
    final String end = getTimestamp(paragraph, "end");
    writeFrame(begin, end, text);
}
Element.text extracts from child nodes as well
But does it convert <br> tags to a new line? At least not from what I could find in the docs.
Thanks for the suggestion!
I gave it a try locally, but it failed to build (built on the dev branch) because:
- Pair requires the commons-lang3 dependency, which is currently not part of NewPipe.
- selectStream(...) is only available in newer Jsoup versions. The current setup doesn't support it.
In addition, the recursive extractText() method is dependency-free, has been tested extensively, and keeps the logic clearer for future modifications (e.g. handling special tags).
For this reason, I'd prefer to keep the current extractText() approach instead of introducing new dependencies or relying on newer APIs.
Android has a built-in Pair type. Also, the current Jsoup version in the project has the method.
@TobiGr You're right, but it works fine for the subtitles from the linked videos. Alternatively, the map operation could be changed to use the existing logic.
I honestly don't see a good reason to use streams here. They make the code harder to read, and I am not sure we gain any performance by using a stream. If a stream turns out to be significantly faster here, we could use it, though.
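To make the readability comparison concrete, here is a minimal sketch of the same empty-frame filter written once as a loop and once as a stream. It works on plain strings instead of Jsoup elements, and the class and method names are hypothetical; it says nothing about which variant is faster.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only: frame texts are plain strings here,
// not Jsoup elements, and none of these names exist in NewPipe.
class FrameFilterSketch {

    // Loop variant: skip empty texts when ignoreEmptyFrames is set.
    static List<String> withLoop(List<String> texts, boolean ignoreEmptyFrames) {
        final List<String> kept = new ArrayList<>();
        for (final String text : texts) {
            if (ignoreEmptyFrames && text.isEmpty()) {
                continue;
            }
            kept.add(text);
        }
        return kept;
    }

    // Stream variant: the same condition expressed as a filter predicate.
    static List<String> withStream(List<String> texts, boolean ignoreEmptyFrames) {
        return texts.stream()
                .filter(text -> !ignoreEmptyFrames || !text.isEmpty())
                .collect(Collectors.toList());
    }
}
```

Both methods return the same list for the same input, so the choice between them is about readability and project conventions.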
Right, I changed the parameter to String as well, forgot to add that.
I have pushed the changes to the remote repository: TransZAllen@c7ab950
allen@allen-Inspiron-3558:/media/allen/WinUbuShare/NewPipe_ubuntu/branch_debug_issue10030/NewPipe$ git show
commit c7ab950807d061f0a0f6b1c87be711e00df12e76 (HEAD -> debug_issue10030, origin/debug_issue10030)
Author: TransZAllen <tree.story@outlook.com>
Date: Sat Aug 30 16:06:52 2025 +0800
Test a new method 2 to get the actual text content.
There are still the two observations.
diff --git a/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java b/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
index 5f43185c6..6dc132939 100644
--- a/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
+++ b/app/src/main/java/org/schabi/newpipe/streams/SrtFromTtmlWriter.java
@@ -6,7 +6,6 @@ import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Parser;
-import org.jsoup.select.Elements;
import org.schabi.newpipe.streams.io.SharpStream;
import java.io.ByteArrayInputStream;
@@ -101,27 +100,14 @@ public class SrtFromTtmlWriter {
Parser.xmlParser());
final StringBuilder text = new StringBuilder(128);
- final Elements paragraphList = doc.select("body > div > p");
-
- // check if has frames
- if (paragraphList.isEmpty()) {
- return;
- }
-
- for (final Element paragraph : paragraphList) {
- text.setLength(0);
-
- // Recursively extract text from all child nodes
- extractText(paragraph, text);
-
- if (ignoreEmptyFrames && text.length() < 1) {
- continue;
- }
-
+ final var paragraphs = doc.selectStream("body > div > p")
+ .filter(paragraph -> !ignoreEmptyFrames || paragraph.hasText())
+ .toList();
+ for (final var paragraph : paragraphs) {
final String begin = getTimestamp(paragraph, "begin");
final String end = getTimestamp(paragraph, "end");
-
- writeFrame(begin, end, text);
+ //writeFrame(begin, end, paragraph.text());
+ writeFrame(begin, end, new StringBuilder(paragraph.text()));
}
}
}
allen@allen-Inspiron-3558:/media/allen/WinUbuShare/NewPipe_ubuntu/branch_debug_issue10030/NewPipe$
Right, I changed the parameter to String as well, forgot to add that.
I accidentally edited your reply, sorry about that
Switching to paragraph.text() without verifying whether all edge cases (like special characters or tags) are handled properly could introduce new issues.
The extractText() function consolidates NewPipe's scattered code into a single, recursive function that processes all text content from various tags. This code has been tested for a long time and is well-suited to NewPipe's design requirements. It guarantees the handling of special characters, especially line breaks (<br>), which are explicitly replaced with \r\n, something that paragraph.text() may not do correctly. Some software, for instance, treats <br> as \n or even as a space, and we can't be sure that paragraph.text() handles these cases the same way as NewPipe does. We also don't know how it treats other special characters.
Switching to paragraph.text() introduces the risk of losing control over special character handling. If we need to add special handling later, paragraph.text() could lock us out of such flexibility, because we can't change its code. Essentially, it becomes a black box, and we're unsure of how it will behave with our specific use cases.
For these reasons, I recommend continuing to use extractText().
Here is the paragraph.text() definition from https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L1552
public String text() {
final StringBuilder accum = StringUtil.borrowBuilder();
new TextAccumulator(accum).traverse(this);
return StringUtil.releaseBuilder(accum).trim();
}
And it uses the trim() function.
After researching it on Google, I found that trim() is a standard Java String method used to remove leading and trailing whitespace from a string.
Since it removes whitespace, it is not suitable for subtitles.
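As a quick reference for the trim() point, here is a minimal sketch of what Java's String.trim() does on its own: it strips leading and trailing characters up to U+0020 and leaves interior whitespace, including \r\n, untouched. The class and method names are made up for illustration, and the sketch does not reproduce whatever whitespace handling Jsoup's TextAccumulator applies before trim() is called.

```java
// Illustration of String.trim() only; this is not Jsoup's Element.text().
class TrimSketch {

    // Makes CR and LF visible so the result can be printed on one line.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    // Trims the input and returns it with line breaks made visible.
    static String trimmedAndVisible(String raw) {
        return visible(raw.trim());
    }

    public static void main(String[] args) {
        // Hypothetical two-line subtitle text with padding at both ends.
        final String raw = "  first line\r\nsecond line \n";
        System.out.println("before: [" + visible(raw) + "]");
        System.out.println("after:  [" + trimmedAndVisible(raw) + "]");
    }
}
```

Whether losing leading and trailing whitespace matters for SRT output is a separate question from how <br> is converted, which happens before trim() runs.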
Hi and welcome!
If you wrote some, feel free to add them to the PR. Our test coverage is dramatically low.
This PR has not been merged yet. I just added it to the project to ensure it is merged before the 0.28.1 release.
What is it?
Description of the changes in your PR
Problem
This issue persists in the latest version (0.28.0) and the most recent commit on the dev branch.
Downloaded SRT subtitles for some videos are empty, containing only timestamps and sequence numbers, as reported in #10030. This affects videos with styled subtitles, such as:
Problem Analysis
The issue occurs because some YouTube subtitles use TTML format with nested tags, which the original parser did not handle correctly.
Problematic TTML Example
The text ("Hello World!") is nested inside a <span> tag for styling (e.g., colors or karaoke effects). The original parser only processed direct child nodes, missing the text inside <span>, resulting in empty SRT output.
Non-Problematic TTML Example
This TTML has text directly under <p>, which was parsed correctly by the original code.
Root Cause
The original SrtFromTtmlWriter.build method used a non-recursive loop, failing to extract text from nested tags like <span> in styled subtitles (e.g., rainbow or karaoke captions).
Solution
- Added a new extractText() method to recursively extract text from all nodes, including TextNode and <br> tags, handling nested tags like <span>.
- Replaced the non-recursive loop in SrtFromTtmlWriter.build with a call to extractText().
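The difference between the two approaches can be sketched without any third-party dependency. The example below uses the JDK's built-in org.w3c.dom parser rather than Jsoup, so it is not the PR's code: directChildrenOnly() is a simplified stand-in for the old loop as described here, extractText() mirrors the recursive idea, and the class name, helper names, and sample TTML are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch using the JDK DOM API instead of Jsoup; all names are hypothetical.
class TtmlTextSketch {
    static final String NEW_LINE = "\r\n";

    // Stand-in for the old behaviour: only direct children of <p> are read.
    static String directChildrenOnly(Node paragraph) {
        final StringBuilder text = new StringBuilder();
        final NodeList children = paragraph.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            final Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            } else if ("br".equalsIgnoreCase(child.getNodeName())) {
                text.append(NEW_LINE);
            }
        }
        return text.toString();
    }

    // Recursive variant: visits every descendant, so text inside <span> is kept.
    static void extractText(Node node, StringBuilder text) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            text.append(node.getNodeValue());
        } else if (node.getNodeType() == Node.ELEMENT_NODE
                && "br".equalsIgnoreCase(node.getNodeName())) {
            text.append(NEW_LINE);
        }
        final NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            extractText(children.item(i), text);
        }
    }

    // Parses the XML and returns the first <p> element.
    static Node firstParagraph(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getElementsByTagName("p").item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sample TTML", e);
        }
    }

    static String extractDirect(String xml) {
        return directChildrenOnly(firstParagraph(xml));
    }

    static String extractRecursively(String xml) {
        final StringBuilder text = new StringBuilder();
        extractText(firstParagraph(xml), text);
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up styled frame: the text sits inside <span>, not directly under <p>.
        final String styled = "<tt><body><div><p begin=\"00:00:01.000\" end=\"00:00:02.000\">"
                + "<span>Hello<br/>World!</span></p></div></body></tt>";
        System.out.println("direct:    [" + extractDirect(styled) + "]");
        System.out.println("recursive: [" + extractRecursively(styled) + "]");
    }
}
```

With the sample frame, the direct-children variant finds no text because the only child of <p> is the <span> element, while the recursive variant collects both words and turns <br/> into \r\n.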
Before/After Screenshots/Screen Record
Fixes the following issue(s)
Relies on the following changes
APK testing
The APK can be found by going to the "Checks" tab below the title. On the left pane, click on "CI", scroll down to "artifacts" and click "app" to download the zip file which contains the debug APK of this PR. You can find more info and a video demonstration on this wiki page.
https://youtu.be/mtb-qa8xvFU (【HimeHina MV】Roki (Cover))
https://www.youtube.com/watch?v=zbQRY8KSVbU (Ousama Ranking Opening 2 Full『Hadaka no Yuusha』by Vaundy)
https://youtu.be/-eDYT_20YhM (炜WARD ROMANCE ft. Feng Yi)
https://www.youtube.com/watch?v=L-BgxLtMxh0 (Styled subtitles for YouTube demonstration)
https://www.youtube.com/watch?v=Cc2nkx77U24 (Test: Rainbow Captions In Youtube)
https://www.youtube.com/watch?v=lUDPjyfmJrs (【original anime MV】III【hololive/宝鐘マリン&こぼ・かなえる】)
Tested a video with simple subtitles that downloaded correctly in the original NewPipe: https://www.youtube.com/watch?v=BVAIImxcv4g .
Confirmed SRT output remains correct after the fix, ensuring no regression.
All tested videos now produce correct SRT files with subtitle text.
Due diligence