Added a process_dataset function to, well, process data
jshuadvd committed Jul 9, 2024
1 parent 092c15c commit 4b03d93
Showing 1 changed file with 23 additions and 0 deletions.
23 changes: 23 additions & 0 deletions train.py
@@ -125,6 +125,29 @@ def preprocess_data(data, tokenizer, max_length, overlap):
    return sequences


def process_dataset(dataset, tokenizer, max_length, overlap):
    """
    Process a dataset using the preprocess_data function.

    Args:
        dataset: The dataset to process.
        tokenizer: Tokenizer object for encoding the data.
        max_length (int): Maximum sequence length for each chunk.
        overlap (int): Overlap size between consecutive chunks.

    Returns:
        list: List of preprocessed sequences from the entire dataset.
    """
    all_sequences = []
    for item in dataset:
        # Field names vary by dataset; fall back to "content" when "text" is absent.
        text = item["text"] if "text" in item else item["content"]
        sequences = preprocess_data(text, tokenizer, max_length, overlap)
        all_sequences.extend(sequences)
    return all_sequences
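As a standalone illustration, the loop above can be exercised end-to-end with a stub tokenizer and a stand-in `preprocess_data`; both are hypothetical sketches, not the real implementations in `train.py`:

```python
def preprocess_data(data, tokenizer, max_length, overlap):
    # Stand-in for train.py's preprocess_data (assumption): encode the text,
    # then split the token stream into overlapping chunks of max_length tokens.
    tokens = tokenizer.encode(data)
    step = max_length - overlap
    return [tokens[i:i + max_length]
            for i in range(0, max(len(tokens) - overlap, 1), step)]


def process_dataset(dataset, tokenizer, max_length, overlap):
    # Same logic as the diff: pick the text field, chunk it, collect chunks.
    all_sequences = []
    for item in dataset:
        text = item["text"] if "text" in item else item["content"]
        all_sequences.extend(preprocess_data(text, tokenizer, max_length, overlap))
    return all_sequences


class DummyTokenizer:
    # Hypothetical tokenizer: one integer id per whitespace-separated word.
    def encode(self, text):
        return list(range(len(text.split())))


dataset = [{"text": "one two three four five six"},
           {"content": "seven eight nine"}]
sequences = process_dataset(dataset, DummyTokenizer(), max_length=4, overlap=2)
# sequences → [[0, 1, 2, 3], [2, 3, 4, 5], [0, 1, 2]]
```

Note how the overlap of 2 makes tokens 2–3 appear in both chunks of the first item, which is the point of overlapping windows: no context is lost at chunk boundaries.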


def compute_perplexity(loss):
    """Perplexity is the exponential of the cross-entropy loss."""
    return torch.exp(loss)

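The relationship `compute_perplexity` encodes is worth spelling out: perplexity is exp of the mean cross-entropy loss, so a loss of ln(N) corresponds to the model being as uncertain as a uniform choice over N tokens. A minimal sketch with plain `math.exp` (the function in `train.py` uses `torch.exp` on a tensor loss):

```python
import math

# If the cross-entropy loss equals ln(10), the perplexity is 10:
# the model is as uncertain as a uniform guess over 10 tokens.
loss = math.log(10.0)
perplexity = math.exp(loss)  # → 10.0
```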
