Training on Russian #20

Open
Lorenzoncina opened this issue Nov 2, 2023 · 5 comments
Comments

@Lorenzoncina

To train a model on Russian data from a web crawl, do you suggest a specific pre-trained BERT model?

@benob
Owner

benob commented Nov 2, 2023

I don't know about Russian BERTs, but what you want to care about is tokenization. In particular, the preprocessing stage needs to normalize punctuation in a tokenization-neutral manner.
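One possible reading of the normalization advice above, as a rough sketch: map typographic punctuation variants to a single canonical form and tidy the spacing around punctuation before the text reaches the tokenizer. The mapping table and the `normalize_punctuation` name are illustrative assumptions, not part of recasepunc, and the right set of rules depends on the crawl data.

```python
import re
import unicodedata

# Hypothetical mapping of typographic punctuation to plain ASCII equivalents.
PUNCT_MAP = {
    "«": '"', "»": '"', "„": '"', "“": '"', "”": '"',
    "‘": "'", "’": "'",
    "—": "-", "–": "-",
    "…": "...",
}
_PUNCT_TABLE = str.maketrans(PUNCT_MAP)


def normalize_punctuation(text: str) -> str:
    """Map punctuation variants to one canonical form and tidy spacing."""
    # Compose characters so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_PUNCT_TABLE)
    # Drop whitespace before sentence punctuation: "слово ," -> "слово,"
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_punctuation("«Привет» ,  мир…")` gives `"Привет", мир...`.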

@nshmyrev

nshmyrev commented Nov 2, 2023

There are already trained models: https://alphacephei.com/vosk/models/vosk-recasepunc-ru-0.22.zip

@Lorenzoncina
Author

@nshmyrev thanks for your reply. I noticed that running the models you mentioned requires more dependencies than the ones reported in this repository. Am I correct?

@Lorenzoncina
Author

Lorenzoncina commented Nov 6, 2023

I'm getting this error when trying to run prediction with this Russian model:

python3 ../../recasepunc/recasepunc.py predict checkpoint < ru-test.txt > output.txt
Traceback (most recent call last):
  File "../../recasepunc/recasepunc.py", line 752, in <module>
    main(config, config.action, config.action_args)
  File "../../recasepunc/recasepunc.py", line 723, in main
    generate_predictions(config, *args)
  File "../../recasepunc/recasepunc.py", line 346, in generate_predictions
    loaded = torch.load(checkpoint_path, map_location=config.device if torch.cuda.is_available() else 'cpu')
  File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/torch/serialization.py", line 875, in find_class
    return super().find_class(mod_name, name)
AttributeError: Can't get attribute 'Trie' on <module 'transformers.tokenization_utils' from '/raid/data/s2t/speech_tools/recasepunc/lib/python3.8/site-packages/transformers/tokenization_utils.py'>

I'm using an environment with all the requirements listed here: https://github.com/benob/recasepunc
The English model, however, works without problems in the same environment.
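The traceback shows the unpickler looking up `Trie` on `transformers.tokenization_utils` and not finding it, which suggests the checkpoint was saved with a transformers release that defines that class while the installed one does not. A small diagnostic sketch can confirm this before loading the checkpoint; the helper name `module_has_attribute` is hypothetical.

```python
import importlib


def module_has_attribute(module_name: str, attribute: str) -> bool:
    """Return True if the module can be imported and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its dependencies) is not installed.
        return False
    return hasattr(module, attribute)


if __name__ == "__main__":
    # False here would match the AttributeError in the traceback above.
    print(module_has_attribute("transformers.tokenization_utils", "Trie"))
```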

@nshmyrev

nshmyrev commented Nov 6, 2023

Hi. I replied to you on alphacep/vosk-api#1459: it needs transformers==4.25.0.
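To check whether the active environment already has that version, a minimal sketch using the standard library's `importlib.metadata` (available from Python 3.8) could look like the following. The `check_pin` helper is a hypothetical name; only the package name and version come from the reply above.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pin(package: str, required: str) -> str:
    """Describe how the installed version of a package compares to a pin."""
    hint = f"try: pip install {package}=={required}"
    try:
        installed = version(package)
    except PackageNotFoundError:
        return f"{package} is not installed; {hint}"
    if installed == required:
        return f"{package} {installed} matches the required version"
    return f"{package} {installed} found, expected {required}; {hint}"


if __name__ == "__main__":
    # Version taken from the reply above.
    print(check_pin("transformers", "4.25.0"))
```

Installing the pinned version in a separate virtual environment keeps it from conflicting with the requirements used for the English model.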
