
Next language #2

Open
mollerhoj opened this issue Dec 3, 2019 · 10 comments

@mollerhoj
Collaborator

mollerhoj commented Dec 3, 2019

What language BERT model would you like to be released next?



@polls polls bot added the Polls label Dec 3, 2019
@ViktorAlm

ViktorAlm commented Jan 9, 2020

Nice work!

Swedish:
https://github.com/af-ai-center/bert

Finnish:
https://github.com/TurkuNLP/FinBERT

@emillykkejensen

ALBERT in Danish ;-)

@grofte

grofte commented Feb 11, 2020

> ALBERT in Danish ;-)

Or you could do knowledge distillation on this model. Here's a ton of synopses and links: https://blog.inten.to/speeding-up-bert-5528e18bb4ea
Huawei's TinyBERT and the single-layer model are both about an order of magnitude faster at inference than the base model.
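
For reference, plain response-based distillation only needs the released BERT weights as the teacher: you match the student's softened output distribution to the teacher's. A minimal PyTorch sketch (the teacher/student models and the temperature of 2.0 are assumptions for illustration, not anyone's actual training setup):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with the temperature, then match them with
    # KL divergence, scaled by T^2 as in Hinton et al.'s distillation paper.
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)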

@jbingel

jbingel commented Feb 11, 2020

Agree that a compressed Danish BERT (ALBERT/DistilBERT/TinyBERT/...) would be great!

@grofte

grofte commented Feb 12, 2020

> Agree that a compressed Danish BERT (ALBERT/DistilBERT/TinyBERT/...) would be great!

Joachim, I was suggesting that @emillykkejensen make one, not BotXO =]
But you could do it too! You're at a uni - you have lots of time.

ALBERT requires the source material but any kind of knowledge distillation method can run from the BERT weights posted here.

What would be really nice, though, is if BotXO posted their text preprocessing code here. I can see that they lower-case everything but don't reduce repeated characters, which is a bit of a shame; I don't need a token for '=======' or whatever it was I saw in there. This function should do it:

import re

def squeeze_strings(s):
    # Pass non-strings (e.g. NaN floats from pandas) through untouched.
    if isinstance(s, float):
        return s
    # Collapse runs of 3+ identical characters to 2, e.g. "=======" -> "==".
    s = re.sub(r'(?P<rep>.)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    # Collapse 3+ repeats of a two-character sequence to 2, e.g. "ererer" -> "erer".
    s = re.sub(r'(?P<rep>..)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    return s

There are only 25 words in the Danish language that should be affected by squeeze_strings(), and in all cases it would just change the conjugation (e.g. from "bortopererer" to "bortoperer", that is, from present tense to imperative).
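
For illustration, a quick check on the two examples mentioned above (assuming the definition as given):

print(squeeze_strings("======="))       # -> "==", the run of seven "=" is collapsed to two
print(squeeze_strings("bortopererer"))  # -> "bortoperer", "ererer" is collapsed to "erer"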

@mollerhoj
Collaborator Author

mollerhoj commented Feb 17, 2020

I would be surprised if the optimisation described above has any measurable impact on performance. Is it something that's been mentioned in the literature? If so, I've missed it.

I don't have permission to open-source the data fetching/preprocessing code (yet), but it's currently quite hacky anyway: it cleans up bad stuff from the internet (there is a surprising amount of NSFW content on Common Crawl).

I'm currently in dialogue with a Norwegian professor about training ALBERT models :-)

Edit: Oh, and thank you so much for the interest and participation guys, very happy to see that 👍

@jbingel

jbingel commented Feb 17, 2020

The blog post that @grofte links to, and the ALBERT paper linked in there (https://openreview.net/pdf?id=H1eA7AEtvS), do state those speedups quite clearly. Fewer parameters of course also mean a smaller burden on memory.

(And answering @grofte -- you're wrong, I'm not at a uni at this very point in time, and even if I was, that wouldn't mean I had lots of time. :) )

@mollerhoj
Collaborator Author

mollerhoj commented Feb 17, 2020

@jbingel sorry for being unclear: I was referring to the proposed squeeze_strings function. I'm well aware of the other improvements made to BERT (RoBERTa, ALBERT, DistilBERT, etc.).

@grofte

grofte commented Feb 19, 2020

@mollerhoj Oh, I don't think you would see a difference in performance. But you would probably train faster. The function is just stripping out nonsense filler with no semantic content.

But I'm guessing that you guys do normalization through decomposition, sentence splitting (how, though?), lower-casing, and stripping of everything not in ASCII+æøå. The BERT tokenizer should take care of everything else once it has the vocab file. The other stuff you do to prepare the data for pre-training shouldn't be that interesting for employing the model.
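
To make that guess concrete, here is a rough sketch of those steps in Python. This is an assumption about the pipeline, not BotXO's actual code; it recomposes with NFC rather than decomposing, so that an "å" written as "a" plus a combining ring survives the character filter, and it leaves out sentence splitting since that's the open question.

import re
import unicodedata

# Keep printable ASCII plus the Danish letters æ, ø and å; drop everything else.
NOT_ALLOWED = re.compile(r"[^\x20-\x7eæøå]")

def normalize_line(text):
    text = unicodedata.normalize("NFC", text)  # recompose combining sequences
    text = text.lower()                        # the published model is uncased
    return NOT_ALLOWED.sub("", text)           # strip everything outside ASCII+æøå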

@VildMedPap

Danish cased model!
