Next language #2

What language BERT model would you like to be released next?

Comments
Nice work! Swedish: Finnish:
ALBERT in Danish ;-)
Or you could do knowledge distillation on this model. Here's a ton of synopses and links: https://blog.inten.to/speeding-up-bert-5528e18bb4ea
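For anyone curious, here is a minimal sketch of the logit-distillation idea those links describe (generic PyTorch, not code from the post; it assumes you already have masked-LM logits for the same batch from a teacher and a smaller student):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t**2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t
```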
Agree that a compressed Danish BERT (ALBERT/DistilBERT/TinyBERT/...) would be great!
Joachim, I was suggesting that @emillykkejensen make one, not BotXO =] ALBERT requires the source material, but any kind of knowledge distillation method can run from the BERT weights posted here. What would be really nice, though, would be if BotXO would post their text preprocessing code here. I can see that they do everything lower-case but don't reduce repeated characters. That's a bit sad in my mind. I don't need a token for every elongated spelling:

```python
import re

def squeeze_strings(s):
    # Pass floats (e.g. pandas NaN cells) through untouched.
    if isinstance(s, float):
        return s
    # Collapse any single character repeated three or more times down to two.
    s = re.sub(r'(?P<rep>.)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    # Do the same for repeated two-character sequences ("hahaha" -> "haha").
    s = re.sub(r'(?P<rep>..)(?P=rep){2,}', r'\g<rep>\g<rep>', s)
    return s
```

There are only 25 words in the Danish language that should be affected by it.
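As a quick sanity check of what the two substitutions do (the example strings here are made-up inputs, not from any training data):

```python
>>> squeeze_strings("Neeeeej!!!")
'Neej!!'
>>> squeeze_strings("hahahahaha")
'haha'
```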
I would be surprised if the optimisation described above has any measurable impact on performance. Is it something that's been mentioned in the literature? If so, I've missed it. I don't have permission to open-source the data fetching/preprocessing code (yet), but it's currently quite hacky, cleaning up bad stuff from the internet (there is a surprising amount of NSFW content on Common Crawl). I'm currently in dialogue with a Norwegian professor about training ALBERT models :-) Edit: Oh, and thank you so much for the interest and participation guys, very happy to see that 👍
The blog post that @grofte links to, and the ALBERT paper linked in there (https://openreview.net/pdf?id=H1eA7AEtvS), do state those speedups quite clearly. Fewer parameters of course also means a smaller burden on memory. (And answering @grofte: you're wrong, I'm not at a uni at this very point in time, and even if I were, that wouldn't mean I had lots of time. :) )
@jbingel sorry for being unclear: I was referring to the proposed `squeeze_strings` optimisation.
@mollerhoj Oh, I don't think you would see a difference in performance. But you would probably train faster. The function just strips out nonsense filler with no semantic content. But I'm guessing that you guys do normalization through decomposition, sentence splitting (how, though?), lower-casing, and stripping everything not in ASCII+æøå. The BERT tokenizer should take care of everything else once it has the vocab file. The other stuff you do to prepare the data for pre-training shouldn't be that interesting for anyone employing the model.
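To make that guess concrete, here is roughly what such a cleanup step could look like. This is purely my reading of the comment above, not BotXO's actual (unreleased) pipeline, and it uses NFKC composition rather than decomposition so that æ/ø/å survive the character filter:

```python
import re
import unicodedata

def normalize_for_bert(text: str) -> str:
    """Hypothetical pre-tokenization cleanup: Unicode-normalize, lower-case,
    and drop every character outside printable ASCII plus æ, ø, å."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[^\x20-\x7eæøå]", "", text)
```

Sentence splitting and any deduplication would still have to happen upstream of a step like this.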
Danish cased model!