Next language #2

What language BERT model would you like to be released next?

Comments
Nice work! Swedish: Finnish:
ALBERT in Danish ;-)
Or you could do knowledge distillation on this model. Here's a ton of synopses and links: https://blog.inten.to/speeding-up-bert-5528e18bb4ea
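For anyone curious, here is a minimal sketch of the logit-distillation idea those links describe (generic PyTorch, not code from the post; it assumes you already have masked-LM logits for the same batch from a teacher and a smaller student):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t
```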
Agree that a compressed Danish BERT (ALBERT/DistilBERT/TinyBERT/...) would be great!
Joachim, I was suggesting that @emillykkejensen make one, not BotXO =] ALBERT requires the source material, but any kind of knowledge distillation method can run from the BERT weights posted here. What would be really nice, though, would be if BotXO would post their text preprocessing code here. I can see that they do everything lower-case but don't reduce repeated characters. That's a bit sad in my mind. I don't need a token for every elongated spelling:

```python
import re

def squeeze_strings(s):
    # Pass floats (e.g. pandas NaN cells) through untouched.
    if isinstance(s, float):
        return s
    # Collapse any single character repeated three or more times down to two.
    s = re.sub(r'(?P<rep>.)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    # Do the same for repeated two-character sequences ("hahaha" -> "haha").
    s = re.sub(r'(?P<rep>..)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    return s
```

There are only 25 words in the Danish language that should be affected by it.
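As a quick sanity check of what the two substitutions do (the example strings here are made-up inputs, not from any training data):

```python
>>> squeeze_strings("Neeeeej!!!")
'Neej!!'
>>> squeeze_strings("hahahahaha")
'haha'
```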
I would be surprised if the optimisation described above has any measurable impact on performance. Is it something that's been mentioned in the literature? If so, I've missed it. I don't have permission to open-source the data fetching/preprocessing code (yet), but it's currently quite hacky, cleaning up bad stuff from the internet (there is a surprising amount of NSFW content on Common Crawl). I'm currently in dialogue with a Norwegian professor about training ALBERT models :-) Edit: Oh, and thank you so much for the interest and participation guys, very happy to see that 👍
The blog post that @grofte links to, and the ALBERT paper linked in there (https://openreview.net/pdf?id=H1eA7AEtvS), do state those speedups quite clearly. Fewer parameters of course also means a smaller burden on memory. (And answering @grofte: you're wrong, I'm not at a uni at this very point in time, and even if I were, that wouldn't mean I had lots of time. :) )
@jbingel sorry for being unclear: I was referring to the proposed `squeeze_strings` optimisation.
@mollerhoj Oh, I don't think you would see a difference in performance. But you would probably train faster. The function just strips out nonsense filler with no semantic content. But I'm guessing that you guys do normalization through decomposition, sentence splitting (how, though?), lower-casing, and stripping everything not in ASCII+æøå. The BERT tokenizer should take care of everything else once it has the vocab file. The other stuff you do to prepare the data for pre-training shouldn't be that interesting for anyone employing the model.
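To make that guess concrete, here is roughly what such a cleanup step could look like. This is purely my reading of the comment above, not BotXO's actual (unreleased) pipeline, and it uses NFKC composition rather than decomposition so that æ/ø/å survive the character filter:

```python
import re
import unicodedata

def normalize_for_bert(text: str) -> str:
    """Hypothetical pre-tokenization cleanup: Unicode-normalize, lower-case,
    and drop every character outside printable ASCII plus æ, ø, å."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[^\x20-\x7eæøå]", "", text)
```

Sentence splitting and any deduplication would still have to happen upstream of a step like this.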
Danish cased model!