youtubetranscript.com cc selection option #179

Open
pasdesinfos opened this issue Dec 22, 2022 · 13 comments
Labels
enhancement New feature or request

Comments

@pasdesinfos

Is your feature request related to a problem? Please describe.
:( Unknown error: Could not retrieve a transcript for the video http://www.youtube.com/watch?v=oBfDbucxPU4! This is most likely caused by:

No transcripts were found for any of the requested language codes: ('en',)

For this video (oBfDbucxPU4) transcripts are available in the following languages:

(MANUALLY CREATED)
None

(GENERATED)
 - es ("Spanish (auto-generated)")[TRANSLATABLE]

(TRANSLATION LANGUAGES)
 - af ("Afrikaans") - ak ("Akan") - sq ("Albanian") - am ("Amharic") - ar ("Arabic") - hy ("Armenian") - as ("Assamese") - ay ("Aymara") - az ("Azerbaijani") - bn ("Bangla") - eu ("Basque") - be ("Belarusian") - bho ("Bhojpuri") - bs ("Bosnian") - bg ("Bulgarian") - my ("Burmese") - ca ("Catalan") - ceb ("Cebuano") - zh-Hans ("Chinese (Simplified)") - zh-Hant ("Chinese (Traditional)") - co ("Corsican") - hr ("Croatian") - cs ("Czech") - da ("Danish") - dv ("Divehi") - nl ("Dutch") - en ("English") - eo ("Esperanto") - et ("Estonian") - ee ("Ewe") - fil ("Filipino") - fi ("Finnish") - fr ("French") - gl ("Galician") - lg ("Ganda") - ka ("Georgian") - de ("German") - el ("Greek") - gn ("Guarani") - gu ("Gujarati") - ht ("Haitian Creole") - ha ("Hausa") - haw ("Hawaiian") - iw ("Hebrew") - hi ("Hindi") - hmn ("Hmong") - hu ("Hungarian") - is ("Icelandic") - ig ("Igbo") - id ("Indonesian") - ga ("Irish") - it ("Italian") - ja ("Japanese") - jv ("Javanese") - kn ("Kannada") - kk ("Kazakh") - km ("Khmer") - rw ("Kinyarwanda") - ko ("Korean") - kri ("Krio") - ku ("Kurdish") - ky ("Kyrgyz") - lo ("Lao") - la ("Latin") - lv ("Latvian") - ln ("Lingala") - lt ("Lithuanian") - lb ("Luxembourgish") - mk ("Macedonian") - mg ("Malagasy") - ms ("Malay") - ml ("Malayalam") - mt ("Maltese") - mi ("Māori") - mr ("Marathi") - mn ("Mongolian") - ne ("Nepali") - nso ("Northern Sotho") - no ("Norwegian") - ny ("Nyanja") - or ("Odia") - om ("Oromo") - ps ("Pashto") - fa ("Persian") - pl ("Polish") - pt ("Portuguese") - pa ("Punjabi") - qu ("Quechua") - ro ("Romanian") - ru ("Russian") - sm ("Samoan") - sa ("Sanskrit") - gd ("Scottish Gaelic") - sr ("Serbian") - sn ("Shona") - sd ("Sindhi") - si ("Sinhala") - sk ("Slovak") - sl ("Slovenian") - so ("Somali") - st ("Southern Sotho") - es ("Spanish") - su ("Sundanese") - sw ("Swahili") - sv ("Swedish") - tg ("Tajik") - ta ("Tamil") - tt ("Tatar") - te ("Telugu") - th ("Thai") - ti ("Tigrinya") - ts ("Tsonga") - tr ("Turkish") - tk ("Turkmen") - uk ("Ukrainian") - und ("Unknown Language") - ur ("Urdu") - ug ("Uyghur") - uz ("Uzbek") - vi ("Vietnamese") - cy ("Welsh") - fy ("Western Frisian") - xh ("Xhosa") - yi ("Yiddish") - yo ("Yoruba") - zu ("Zulu")

If you are sure that the described cause is not responsible for this error and that a transcript should be retrievable, please create an issue at https://github.com/jdepoix/youtube-transcript-api/issues. Please add which version of youtube_transcript_api you are using and provide the information needed to replicate the error. Also make sure that there are no open issues which already describe your problem!

Describe the solution you'd like
When an auto-generated subtitle is available, it should be translated to en and transcribed by default.

Describe alternatives you've considered
cc selection option

Additional context
n/a

@erseco

erseco commented Dec 22, 2022

Same error here. Maybe adding an option to select language solves the problem :)

@ghost

ghost commented Dec 22, 2022

yeah same here, option to select would be good.

@jdepoix jdepoix added the enhancement New feature or request label Jan 2, 2023
@jdepoix
Owner

jdepoix commented Jan 2, 2023

Hi @pasdesinfos,
I definitely see the use case for a feature where transcripts are auto-translated if they are not available in the requested language. However, this should not be the default. As this module is commonly used to train/validate Machine Learning models, translating the transcripts will introduce another variable into the data quality, which the user should always be aware of (by opting into it).

I actually thought about introducing this as an optional feature before, but there is an implementation detail that stopped me from doing so: if we want to automatically translate to the user-requested language, which transcript do we choose to translate from (if there are multiple)? Depending on the transcript we are translating from, the quality of the output will vary. A few things to consider:

  • I generally expect manually generated transcripts to be of higher quality than ASR transcripts. However, there is no data on the average quality of manually generated transcripts on YouTube, so I cannot verify this. With how good modern ASR models have become, I could also imagine ASR transcripts being more reliable (on average) for high-resource languages like English, while being less reliable for low-resource languages.
  • Translating from high-resource languages (English, German, French, etc.) will most likely yield the best quality results. So they should probably be prioritised. However, this could conflict with prioritising manually generated transcripts.

So which heuristic for choosing the transcript to translate from is most likely to yield the highest quality transcript? Any thoughts on this?
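To make the trade-off above concrete, here is a minimal sketch of how the two competing priorities could be expressed as a sort key. It is not part of youtube_transcript_api: the `Candidate` class, the `HIGH_RESOURCE` tuple, and the `pick_source` function are all hypothetical names introduced for illustration, and the choice of which languages count as "high-resource" is an assumption. The library's own transcript objects expose similarly named attributes (`language_code`, `is_generated`, `is_translatable`), which is why the stand-in uses them.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a transcript entry. The attribute names mirror
# those on youtube_transcript_api transcript objects, but this class is only
# an illustration and does not talk to YouTube.
@dataclass(frozen=True)
class Candidate:
    language_code: str
    is_generated: bool
    is_translatable: bool


# Assumption: an illustrative set of "high-resource" languages.
HIGH_RESOURCE = ("en", "de", "fr")


def pick_source(candidates, prefer_manual=True):
    """Pick the transcript to translate from, or None if none is translatable.

    prefer_manual=True ranks manually created transcripts first and uses the
    high-resource check as a tie-breaker; prefer_manual=False swaps the two.
    """
    translatable = [c for c in candidates if c.is_translatable]
    if not translatable:
        return None

    def score(c):
        manual = not c.is_generated
        high = c.language_code.split("-")[0] in HIGH_RESOURCE
        return (manual, high) if prefer_manual else (high, manual)

    # max() keeps the first candidate on ties, so input order is the
    # final tie-breaker.
    return max(translatable, key=score)
```

With a manually created Turkish transcript and an auto-generated German one, `prefer_manual=True` selects the Turkish transcript and `prefer_manual=False` selects the German one, which is exactly the conflict described in the second bullet.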

@ghost

ghost commented Jan 4, 2023

@jdepoix First of all, I don't know what it means to translate transcripts, but the ASRs created in Turkish were understandable, if not completely accurate.

@erseco

erseco commented Jan 5, 2023

Hi, IMHO the problem arises when the main language of the video is not English. @toprak, @pasdesinfos and I are talking about adding an option (or doing it automatically) to get the video's original generated subtitles, not about translating them. If you open any Spanish video, like this one: https://youtubetranscript.com/?v=Dby0_0vdr30, you will see the error. In the CLI tool you have to set the language to Spanish to get the correct transcript.

Hope this explains the use case, best regards!

@jdepoix
Owner

jdepoix commented Jan 5, 2023

Hi @toprak and @erseco,
I think what you are asking for is something different and it already is documented as a feature request in #133.
To my understanding, @pasdesinfos is asking about a feature where the transcripts are automatically translated to the requested language if no transcripts are available in that language. @pasdesinfos, could you maybe clarify to make sure we are on the same page here?

@pasdesinfos
Author

pasdesinfos commented Jan 8, 2023

Hi @jdepoix @toprak @erseco,

I trust everything is well.

That's right, @jdepoix. For instance, the output for the video https://youtu.be/BOKqyl0VT7A at https://youtubetranscript.com/?v=BOKqyl0VT7A indicates "No transcripts were found for any of the requested language codes: ('en',)", yet it also shows that "transcripts are available in the following languages: (MANUALLY CREATED) None (GENERATED) - fr ("French (auto-generated)")[TRANSLATABLE]".

Could the heuristic be to obtain, by default, the auto-translated English version when a GENERATED transcript exists and is TRANSLATABLE? The output ":( Unknown error" would then appear only when no transcripts exist at all.

Kind regards to everyone!
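The fallback proposed here, combined with the earlier point that translation should be opt-in rather than the default, could look roughly like the sketch below. Everything in it is hypothetical: `Available`, `NoTranscript`, `resolve`, and the `allow_translation` flag are illustrative names, not part of youtube_transcript_api, and `translation_languages` is simplified to a tuple of language codes. The sketch also considers any translatable transcript, not only GENERATED ones.

```python
from dataclasses import dataclass


# Hypothetical, simplified view of one transcript listed for a video.
@dataclass(frozen=True)
class Available:
    language_code: str
    is_generated: bool
    translation_languages: tuple = ()  # codes this transcript can be translated into


class NoTranscript(Exception):
    """Raised when nothing matches, directly or via translation."""


def resolve(available, requested, allow_translation=False):
    """Decide how to serve a request for the given language codes.

    Returns a tuple (mode, source_code, target_code) where mode is
    "direct" or "translate". Translation is only attempted when the
    caller opts in.
    """
    # 1. Prefer a transcript that already exists in a requested language,
    #    honouring the caller's priority order.
    for code in requested:
        for t in available:
            if t.language_code == code:
                return ("direct", t.language_code, code)

    # 2. Opt-in fallback: translate from the first transcript that
    #    supports one of the requested languages.
    if allow_translation:
        for code in requested:
            for t in available:
                if code in t.translation_languages:
                    return ("translate", t.language_code, code)

    raise NoTranscript(f"No transcript for any of {list(requested)}")
```

For the French video mentioned above, `resolve([Available("fr", True, ("en",))], ["en"], allow_translation=True)` would return `("translate", "fr", "en")`, while the same call without the flag would still raise, so users who need untranslated data keep the current behaviour.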

@p-toni

p-toni commented Jun 27, 2023

Hi all. Same here, only if the YT source isn't in EN. As mentioned, just a selector can handle it.

@pasdesinfos
Author

Hi @jdepoix, @toprak, @erseco, @toniseldr,

I wanted to take a moment to express my heartfelt gratitude to each of you for your invaluable contributions, unwavering dedication, commitment, and hard work. Your efforts have truly made a significant impact in making lives more wonderful. 🙏🎉

I mean, let's be honest here, without your brilliance, I'd probably be lost in a sea of confusion and chaos. 🌊😅

With self-deprecating humor and sincere appreciation,
@pasdesinfos 😄🙌

@jdepoix
Owner

jdepoix commented Jun 28, 2023

Hi @pasdesinfos,

thank you very much for the kind words! 😊

However, this hasn't been implemented so I think it is okay for the ticket to stay open. Although I am not actively working on this, it might be something that someone wants to contribute to!

@jdepoix jdepoix reopened this Jun 28, 2023
@MarouaneZhani

Hi,
I'm getting the same error with the following video: https://www.youtube.com/watch?v=EtpRcefOD6M, even if I specify the correct language 'de' in the languages parameter:

from llama_index.readers.youtube_transcript import YoutubeTranscriptReader

loader = YoutubeTranscriptReader()
documents = loader.load_data(
    ytlinks=['https://www.youtube.com/watch?v=EtpRcefOD6M'],
    languages=["de", "en"],
)

Do you have any idea how this can be solved?

@jdepoix
Owner

jdepoix commented Aug 12, 2024

Hi @MarouaneZhani,
what is the exact error message you are getting?

@MarouaneZhani

Hi @jdepoix,
Sorry, I already got it running by using "de-DE" in languages. The error I was getting was:
Could not retrieve a transcript for the video https://www.youtube.com/watch?v=EtpRcefOD6M This is most likely caused by: No transcripts were found for any of the requested language codes.

I saw in the error message that the available language code was something like "de-DE", and it worked after I tried it!

Thanks
Marouane
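The "de" versus "de-DE" mismatch above suggests that requested codes are compared exactly against the codes a video offers. One way a caller could work around that is to expand the requested codes against the available ones before asking for a transcript. The helper below is a hypothetical sketch, not a function of youtube_transcript_api or llama_index; `expand_language_codes` is an illustrative name, and it assumes the list of available codes has already been obtained (for example, from the error message or from the library's transcript listing).

```python
def expand_language_codes(requested, available):
    """Map requested codes onto the codes a video actually offers.

    For each requested code, exact matches come first, followed by
    variants that share the same primary subtag (so "de" also picks up
    "de-DE"). The caller's priority order is preserved.
    """
    expanded = []
    for code in requested:
        primary = code.split("-")[0].lower()
        exact = [a for a in available if a.lower() == code.lower()]
        variants = [
            a for a in available
            if a.lower() != code.lower()
            and a.split("-")[0].lower() == primary
        ]
        for match in exact + variants:
            if match not in expanded:
                expanded.append(match)
    return expanded
```

For a video that only offers "de-DE", `expand_language_codes(["de", "en"], ["de-DE"])` yields `["de-DE"]`, which could then be passed as the `languages` argument instead of the original list.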
