Actions: huggingface/datatrove

Showing runs from all workflows
1,401 workflow runs

added stage 4 - filtering to minhash
Lint #57: Commit b480195 pushed by guipenedo
July 19, 2023 14:13 23s minhash
🐛 Fix bug in gopher repetition filter
Lint #56: Commit 647c162 pushed by alexchapeaux
July 19, 2023 13:19 22s filters
🎨 Add examples and time/length stats
Lint #55: Commit e98bfe8 pushed by alexchapeaux
July 19, 2023 12:25 20s exactsubstrings
🎨 Simplify bytearange loading in stage 3
Lint #54: Commit d73a2a6 pushed by alexchapeaux
July 19, 2023 10:47 17s exactsubstrings
🎨 Improve file handling in stage 1,2
Lint #53: Commit 0e3290b pushed by alexchapeaux
July 19, 2023 08:54 18s exactsubstrings
renamed "compressed" to "compression"
Lint #52: Commit 8097bdf pushed by guipenedo
July 18, 2023 17:23 21s main
added zst support to file opening
Lint #51: Commit 13075f7 pushed by guipenedo
July 18, 2023 17:19 18s main
WIP minhash: added stages 1-3
Lint #50: Commit 66169dc pushed by guipenedo
July 18, 2023 17:00 23s minhash
🚧 look for the problem
Lint #48: Commit aa6b60a pushed by alexchapeaux
July 18, 2023 12:19 18s exactsubstrings
Merge pull request #12 from huggingface/deduplication
Lint #47: Commit 980d85e pushed by alexchapeaux
July 17, 2023 09:43 17s main
Deduplication
Lint #46: Pull request #12 opened by alexchapeaux
July 17, 2023 09:42 16s deduplication
🚧 Add WIP version of exact substrings
Lint #44: Commit b2771c6 pushed by alexchapeaux
July 16, 2023 19:40 18s exactsubstrings
added emojis and stats to tokenizer
Lint #43: Commit 0b51530 pushed by guipenedo
July 13, 2023 11:47 20s main
now catching jsondecodeerrors to skip malformed lines
Lint #42: Commit afb6202 pushed by guipenedo
July 13, 2023 10:10 21s main
added tqdm
Lint #41: Commit 9af9499 pushed by guipenedo
July 13, 2023 09:30 20s main
another small tokenization bugfix
Lint #39: Commit cd092ec pushed by guipenedo
July 11, 2023 09:18 21s main
small tokenization bugfixes
Lint #38: Commit 35c3dfd pushed by guipenedo
July 11, 2023 09:09 16s main
Merge pull request #11 from huggingface/stats
Lint #37: Commit 18bc618 pushed by alexchapeaux
July 11, 2023 08:34 18s main
Stats
Lint #36: Pull request #11 opened by alexchapeaux
July 11, 2023 07:29 20s stats
⏪ Remove change in warc reader
Lint #35: Commit 7d18f66 pushed by alexchapeaux
July 11, 2023 07:27 24s stats
✨ Add length stats
Lint #34: Commit e8279a0 pushed by alexchapeaux
July 10, 2023 15:42 21s stats
📝 Add sentence-deduplication example
Lint #33: Commit 7bd4cce pushed by alexchapeaux
July 10, 2023 10:14 18s deduplication