Commit 69f3d32

Merge pull request #51 from CodeWithKyrian/dev-digits-pretokenizer
fix: digits pre-tokenizer returning empty array for text with no digits
2 parents 9aa12f9 + b58a590 commit 69f3d32

3 files changed, +19 -4 lines changed


docs/getting-started.md

+4 -2
@@ -66,11 +66,13 @@ Arguments:
   can
   specify it here. This downloads any additional configuration or data needed for that task.
 - `[options]` (optional): Additional options to customize the download process.
-    - `-cache_dir=<directory>`: Choose where to save the models. If you've got a preferred storage spot, mention it
+    - `--cache-dir=<directory>`: Choose where to save the models. If you've got a preferred storage spot, mention it
       here. Otherwise, it goes to the default cache location. You can use the shorthand `-c` instead of `--cache_dir`.
     - `--quantized=<true|false>`: Decide whether you want the quantized version of the model, which is smaller and
       faster. The default is true, but if for some reason you prefer the full version, you can set this to false. You
       can use the shorthand `-q` instead of `--quantized`. Example: `--quantized=false`, `-q false`.
+    - `--model-filename=<filename>`: Specify the exact model filename to download (without the `.onnx` suffix. Eg. "
+      model" or "model_quantized".

 The `download` command will download the model weights and save them to the cache directory. The next time you use the
 model, TransformersPHP will use the cached weights instead of downloading them again.
@@ -199,7 +201,7 @@ OpenMP is a set of compiler directives and library routines that enable parallel
 programs. TransformersPHP uses OpenMP to enable multithreaded operations in the Tensors, which can improve performance
 on multi-core systems. OpenMP is not required, but it can provide a significant performance boost for some operations.
 Checkout the [OpenMP website](https://www.openmp.org/) for more information on how to install and configure OpenMP on
-your system.
+your system.

 Example: On Ubuntu, you can install OpenMP using the following command:

src/Commands/DownloadModelCommand.php

+9 -1
@@ -46,12 +46,20 @@ protected function configure(): void

         $this->addOption(
             'quantized',
-            null,
+            'q',
             InputOption::VALUE_OPTIONAL,
             'Whether to download the quantized version of the model.',
             true
         );

+        $this->addOption(
+            'model-filename',
+            null,
+            InputOption::VALUE_OPTIONAL,
+            'The filename of the exact model weights version to download.',
+            null
+        );
+
     }

     protected function execute(InputInterface $input, OutputInterface $output): int
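
The two options above reach the command as raw strings (for example `-q false`), and in PHP `(bool) 'false'` evaluates to `true`, so the flag needs explicit parsing. The following is a minimal sketch of how the options could be combined into a file name. `resolveModelFile()` is a hypothetical helper, not part of TransformersPHP, and the precedence rule (an explicit `--model-filename` wins over `--quantized`) is an assumption rather than confirmed library behavior.

```php
<?php

// Hypothetical helper for illustration only; not part of TransformersPHP.
// Maps raw CLI option values to the ONNX file name that would be requested.
function resolveModelFile(?string $modelFilename, string|bool|null $quantized): string
{
    // Parse "true"/"false"/"1"/"0" style strings explicitly.
    // Missing or unparseable values fall back to the documented default (true).
    $useQuantized = filter_var(
        $quantized ?? true,
        FILTER_VALIDATE_BOOLEAN,
        FILTER_NULL_ON_FAILURE
    ) ?? true;

    // Assumption: an explicit --model-filename takes precedence over --quantized.
    $base = $modelFilename ?? ($useQuantized ? 'model_quantized' : 'model');

    // The documented option value omits the ".onnx" suffix, so append it here.
    return $base . '.onnx';
}

echo resolveModelFile(null, 'false'), PHP_EOL; // model.onnx
echo resolveModelFile(null, null), PHP_EOL;    // model_quantized.onnx
echo resolveModelFile('model', 'true'), PHP_EOL; // model.onnx
```

The example file names `model` and `model_quantized` are the ones mentioned in the documentation change; any other name would work the same way in this sketch.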

src/PreTokenizers/DigitsPreTokenizer.php

+6 -1
@@ -9,16 +9,21 @@ class DigitsPreTokenizer extends PreTokenizer
 {

     protected string $pattern;
+
     public function __construct(protected array $config)
     {
         $individualDigits = $this->config['individual_digits'] ? '' : '+';
+
         $digitPattern = "[^\\d]+|\\d$individualDigits";

         $this->pattern = "/$digitPattern/u";

     }
+
     public function preTokenizeText(string|array $text, array $options): array
     {
-        return preg_split($this->pattern, $text, -1, PREG_SPLIT_NO_EMPTY) ?? [];
+        preg_match_all($this->pattern, $text, $matches, PREG_SPLIT_NO_EMPTY);
+
+        return $matches[0] ?? [];
     }
 }
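
To see why the old `preg_split` call came back empty, note that the pattern `[^\d]+|\d+` matches every character of the input. Each match is treated as a delimiter, so only empty strings remain between delimiters, and `PREG_SPLIT_NO_EMPTY` discards them all. The standalone sketch below reproduces both behaviors. `buildPattern()` is an illustrative helper that mirrors the constructor in the diff; it is not a function in the library. The sketch passes `PREG_PATTERN_ORDER` to `preg_match_all`; the committed code passes `PREG_SPLIT_NO_EMPTY`, which has the same integer value (1), so the result is the same.

```php
<?php

// Illustrative helper mirroring the constructor logic shown in the diff.
function buildPattern(bool $individualDigits): string
{
    $quantifier = $individualDigits ? '' : '+';

    return "/[^\\d]+|\\d{$quantifier}/u";
}

$pattern = buildPattern(false);

// Old approach: every character belongs to a match, so the matches act as
// delimiters and only empty pieces remain, which NO_EMPTY then drops.
$split = preg_split($pattern, 'abc123def', -1, PREG_SPLIT_NO_EMPTY);

// New approach: collect the matches themselves.
preg_match_all($pattern, 'abc123def', $matches, PREG_PATTERN_ORDER);
$grouped = $matches[0];

// With individual_digits enabled, each digit becomes its own token.
preg_match_all(buildPattern(true), 'abc123def', $matches, PREG_PATTERN_ORDER);
$individual = $matches[0];

// Text without digits is returned as a single token.
preg_match_all($pattern, 'no digits here', $matches, PREG_PATTERN_ORDER);
$noDigits = $matches[0];

echo json_encode([$split, $grouped, $individual, $noDigits]), PHP_EOL;
// [[],["abc","123","def"],["abc","1","2","3","def"],["no digits here"]]
```

As the first result shows, the old call returned an empty array even for input that contains digits, not only for digit-free text.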
