Lowercase tag names before splitting
gbenson committed May 30, 2024
1 parent 700a4af commit 9da6b9a
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions src/dom_tokenizers/pre_tokenizers/dom_snapshot.py
@@ -132,13 +132,13 @@ def get(
         tokens = cache.get(string_index)
         if tokens is not None:
             return tokens
+        text = self._strings[string_index]
+        if lowercase:
+            text = text.lower()
         tokens = [
             NormalizedString(token)
-            for token in self._splitter.split(self._strings[string_index])
+            for token in self._splitter.split(text)
         ]
-        if lowercase:
-            for token in tokens:
-                token.lowercase()
         cache[string_index] = tokens
         return tokens
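
The reordering only changes results if the splitter is case-sensitive. The sketch below illustrates that under a loud assumption: `split` is a hypothetical stand-in that breaks at camelCase boundaries, not the project's real `self._splitter`, whose behavior is not shown in this commit. The names `get_tokens_old` and `get_tokens_new` are likewise illustrative and use plain `str` in place of `NormalizedString`.

```python
import re

# Hypothetical stand-in for self._splitter (an assumption, not the real one):
# yields runs of lowercase letters/digits, optionally led by one capital,
# or runs of capitals, so "clipPath" splits at the camelCase boundary.
_WORD_RE = re.compile(r"[A-Z]?[a-z0-9]+|[A-Z]+")


def split(text):
    return _WORD_RE.findall(text)


def get_tokens_old(text, lowercase=True):
    # Behavior before the commit: split first, then lowercase each token.
    tokens = split(text)
    if lowercase:
        tokens = [token.lower() for token in tokens]
    return tokens


def get_tokens_new(text, lowercase=True):
    # Behavior after the commit: lowercase the whole string, then split.
    if lowercase:
        text = text.lower()
    return split(text)


print(get_tokens_old("clipPath"))  # ['clip', 'path']
print(get_tokens_new("clipPath"))  # ['clippath']
```

With this stand-in, a mixed-case tag name such as `clipPath` stays a single token once lowercasing happens first, while an all-caps name such as `DIV` gives `['div']` either way. Whether the real splitter shows the same difference depends on its rules.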
