Implement a caching mechanism for tokenized sequences
jshuadvd committed Jul 11, 2024
1 parent 939aa76 commit 287957e
Showing 1 changed file with 2 additions and 1 deletion.
train.py: 3 changes (2 additions & 1 deletion)
@@ -131,7 +131,8 @@ def preprocess_data(data, tokenizer, max_length, overlap):
     while start < len(data):
         end = start + max_length
         chunk = data[start:end]
-        tokenized_chunk = tokenizer.encode(chunk)
+        # tokenized_chunk = tokenizer.encode(chunk)
+        tokenized_chunk = cached_tokenize(chunk, tokenizer)
 
         # Create sliding window sequences from the tokenized chunk
         chunk_sequences = create_sliding_window_chunks(
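Note: the definition of cached_tokenize lies outside the shown hunk. A minimal sketch of what such a helper might look like, assuming a module-level dict memoizing on the raw chunk text (the name _token_cache and this structure are assumptions, not the commit's actual code):

# Hypothetical sketch; the real cached_tokenize in train.py may differ.
_token_cache = {}  # assumed module-level cache: chunk text -> token IDs

def cached_tokenize(chunk, tokenizer):
    # Reuse a previously computed encoding when the same chunk text recurs.
    if chunk not in _token_cache:
        _token_cache[chunk] = tokenizer.encode(chunk)
    return _token_cache[chunk]

Caching along these lines pays off when identical chunk text is tokenized more than once (e.g., across epochs or repeated passages); functools.lru_cache on a text-only helper would be an alternative, since the tokenizer argument itself is typically not hashable.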
