Title: Token ID Out of Range & Indexing Assertion Errors During Training
Description:
I'm encountering several issues while training a model using the Meta-Llama-3.1-8B-Instruct tokenizer and dataset processing script. The main issues are as follows:
Token ID Out of Range:
During tokenization, I'm consistently receiving the following log message:
ERROR:__main__:Token ID 128256 out of range, adjusting to 127999
This occurs even after attempting to handle out-of-range token IDs by capping them at the maximum valid token ID (127999). This issue might be affecting the overall model performance and data integrity.
Indexing Assertion Error:
When generating the training split, a CUDA indexing assertion failure (Indexing.cu: srcIndex < srcSelectDimSize) is triggered.
This assertion failure suggests that there is an issue with how indices are being selected during training, potentially due to misaligned tensor dimensions or out-of-range indices.
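For reference, here is a minimal, hypothetical sketch (not taken from my training code) of how a single out-of-range token ID reproduces this class of failure at an embedding lookup:

import torch
import torch.nn as nn

# Hypothetical embedding table sized to the base vocabulary (128000 rows).
embedding = nn.Embedding(num_embeddings=128000, embedding_dim=16)

# A batch containing an ID above the table size, e.g. one produced by an added special token.
input_ids = torch.tensor([[1, 2, 128256]])

try:
    embedding(input_ids)
except IndexError as e:
    # On CPU this raises a plain IndexError; on CUDA the same lookup trips the
    # device-side "srcIndex < srcSelectDimSize" assertion instead.
    print("Out-of-range token ID:", e)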
Code:
Here is the script I'm using for tokenization and dataset processing:
import os
import json
import re
import pandas as pd
from datasets import load_dataset, Dataset, DatasetDict
from transformers import AutoTokenizer
from multiprocessing import Pool, cpu_count
import logging
from tqdm import tqdm
import psutil
from retry import retry
import random
import glob

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Define paths
input_data_dir = './ShardedData/SmallShards'
output_data_dir = './processed_data'
train_dir = os.path.join(output_data_dir, 'train')
test_dir = os.path.join(output_data_dir, 'test')
val_dir = os.path.join(output_data_dir, 'val')
hf_token = '***************************************'

# Create directories if they don't exist
os.makedirs(output_data_dir, exist_ok=True)
os.makedirs(train_dir, exist_ok=True)
os.makedirs(test_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

# Load tokenizer
model_name = 'meta-llama/Meta-Llama-3.1-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})


def clean_text(text):
    # Remove special characters and irregularities
    text = re.sub(r'[^A-Za-z0-9\s]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def split_large_text(text, max_length=4096):
    # Split the text into smaller chunks
    words = text.split()
    chunks = [' '.join(words[i:i + max_length]) for i in range(0, len(words), max_length)]
    return chunks


def tokenize_function(examples):
    try:
        examples["text"] = [clean_text(text) for text in examples["text"]]
        tokenized_output = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)
        # Validate token IDs
        vocab_size = tokenizer.vocab_size
        for token_id_list in tokenized_output['input_ids']:
            for token_id in token_id_list:
                if token_id >= vocab_size:
                    logger.error(f"Token ID {token_id} out of range")
        return tokenized_output
    except Exception as e:
        logger.error(f"Tokenization error: {e}")
        return {"input_ids": [], "attention_mask": []}


def preprocess_data(chunk_data):
    try:
        if isinstance(chunk_data, dict):
            chunk_data['text'] = str(chunk_data.get('text', ''))
        else:
            chunk_data = {"text": str(chunk_data)}
        chunk_data['text'] = clean_text(chunk_data['text'])
        if len(chunk_data['text'].split()) > 4096:
            chunk_data['text'] = split_large_text(chunk_data['text'])
        return chunk_data
    except json.JSONDecodeError as e:
        logger.error(f"JSON decode error: {e}")
        return {"text": ""}


def save_chunk(data, split_dir, chunk_index):
    output_shard = os.path.join(split_dir, f"tokenized_chunk_{chunk_index}.jsonl")
    with open(output_shard, 'a', encoding='utf-8') as f:
        for item in data:
            json_str = json.dumps(item) + "\n"
            f.write(json_str)


def validate_tokenized_data(tokenized_datasets, vocab_size):
    """Validate that all token IDs in the tokenized datasets are within the valid range."""
    for example in tokenized_datasets:
        input_ids = example['input_ids']
        if any(token_id >= vocab_size for token_id in input_ids):
            return False
    return True


def process_chunk(chunk_data, chunk_index, split_dir):
    all_data = [preprocess_data(json.loads(line)) for line in chunk_data]
    dataset = Dataset.from_dict({"text": [d["text"] for d in all_data]})
    tokenized_datasets = dataset.map(tokenize_function, batched=True, batch_size=2048, remove_columns=["text"], num_proc=1)
    # Verify token IDs are within the valid range
    vocab_size = tokenizer.vocab_size
    valid = validate_tokenized_data(tokenized_datasets, vocab_size)
    if not valid:
        logger.error(f"Token IDs out of range in chunk {chunk_index}. Adjusting token IDs.")
        for example in tokenized_datasets:
            input_ids = example['input_ids']
            adjusted_input_ids = []
            for token_id in input_ids:
                if token_id >= vocab_size:
                    logger.warning(f"Token ID {token_id} out of range, adjusting to {vocab_size - 1}")
                    token_id = vocab_size - 1  # Adjust out-of-range token IDs
                adjusted_input_ids.append(token_id)
            example['input_ids'] = adjusted_input_ids[:tokenizer.model_max_length]
            example['attention_mask'] = example['attention_mask'][:tokenizer.model_max_length]
    save_chunk(tokenized_datasets, split_dir, chunk_index)


def load_and_tokenize_in_chunks(file_path, chunk_size=50000):
    chunk_index = 0
    chunk_data = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            chunk_data.append(line)
            if len(chunk_data) >= chunk_size:
                split_dir = select_split_dir()
                process_chunk(chunk_data.copy(), chunk_index, split_dir)
                chunk_data = []  # Reset the buffer
                chunk_index += 1
    # Ensure any remaining data is saved
    if chunk_data:
        split_dir = select_split_dir()
        process_chunk(chunk_data, chunk_index, split_dir)


def select_split_dir():
    """Randomly select a directory (train, test, or val) based on the desired split ratio."""
    rand_num = random.random()
    if rand_num < 0.90:
        return train_dir
    elif rand_num < 0.95:
        return test_dir
    else:
        return val_dir


def process_file(file_path):
    try:
        load_and_tokenize_in_chunks(file_path)
        return file_path
    except Exception as e:
        logger.error(f"Error processing file {file_path}: {e}")
        return None


def main():
    all_files = glob.glob(os.path.join(input_data_dir, "shard_*.jsonl"))

    # Load processed files cache
    processed_files_cache = os.path.join(output_data_dir, 'processed_files_cache.json')
    if os.path.exists(processed_files_cache):
        with open(processed_files_cache, 'r') as f:
            processed_files = set(json.load(f))
    else:
        processed_files = set()

    # Filter out already processed files
    all_files = [f for f in all_files if f not in processed_files]

    # Shuffle the files for random processing
    random.shuffle(all_files)

    # Create a pool of worker processes
    num_workers = min(cpu_count(), 48)  # Use the number of vCPUs or 48, whichever is lower
    with Pool(num_workers) as pool:
        # Use imap_unordered to apply process_file to each file in parallel
        for processed_file in tqdm(pool.imap_unordered(process_file, all_files), total=len(all_files), desc="Processing Files"):
            if processed_file:
                processed_files.add(processed_file)

    with open(processed_files_cache, 'w') as f:
        json.dump(list(processed_files), f)


if __name__ == "__main__":
    main()
Minimal Reproducible Example:
Here is a minimal code example to reproduce the token ID out-of-range issue:
import torch
from transformers import AutoTokenizer

# Your Hugging Face token
hf_token = '********************************'  # Replace with your actual token

# Specify the model name or path
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load the tokenizer without manually setting special tokens
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token, trust_remote_code=True)

# Example text input
text = "What is the capital of France?"

# Tokenize the input text
tokens = tokenizer(text, return_tensors="pt")

# Print the tokenized output
print("Tokenized input:", tokens)

# Decode the tokens back to text (for verification)
decoded_text = tokenizer.decode(tokens['input_ids'][0])
print("Decoded text:", decoded_text)

# Check for out-of-range token IDs
vocab_size = tokenizer.vocab_size
print("Vocabulary Size:", vocab_size)
for i, token_id in enumerate(tokens["input_ids"][0]):
    if token_id >= vocab_size:
        print(f"Token ID {token_id} out of range at position {i} (Token: {tokenizer.decode([token_id])})")
Output:
Tokenized input: {'input_ids': tensor([[128000, 3923, 374, 279, 6864, 315, 9822, 30]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
Decoded text: What is the capital of France?
Vocabulary Size: 128000
Token ID 128000 out of range at position 0 (Token: )
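Note that tokenizer.vocab_size reports only the base vocabulary (128000) and excludes added special tokens, so the check above flags ID 128000 (<|begin_of_text|>) even though the model's embedding table covers it. Below is a small sketch of the same check against len(tokenizer), which includes the added tokens (the token value is a placeholder):

from transformers import AutoTokenizer

hf_token = '********************************'  # Replace with your actual token
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", token=hf_token
)

tokens = tokenizer("What is the capital of France?", return_tensors="pt")

base_vocab = tokenizer.vocab_size  # base BPE vocabulary only (128000)
full_vocab = len(tokenizer)        # base vocabulary + added special tokens
print("Base vocab:", base_vocab)
print("Full vocab:", full_vocab)

# Compare against the full vocabulary, which is what the model's embedding table must cover.
for i, token_id in enumerate(tokens["input_ids"][0].tolist()):
    if token_id >= full_vocab:
        print(f"Token ID {token_id} is genuinely out of range at position {i}")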
Steps to Reproduce:
Use the provided minimal example code to tokenize any input text.
Observe the tokenization process and check the logs for "Token ID out of range" errors.
Run the training script with gradient checkpointing enabled.
Monitor for the Indexing.cu assertion error during the generation of the training split.
Expected Behavior:
Token IDs should be within the valid range after tokenization. The training process should proceed without assertion errors, and there should be no conflicts between gradient checkpointing and caching.
Additional Context:
The data being processed includes a mix of Unicode and non-Unicode characters. The script attempts to clean the data by removing special characters and non-Unicode sequences. Despite these precautions, the issues described above persist.
Any guidance on resolving these issues or insights into potential causes would be greatly appreciated.
I don't know if we faced the same problem, but it looks similar.
Maybe it is caused by the line tokenizer.add_special_tokens({'pad_token': '[PAD]'}).
I used the same method and found that the pad_token's ID is 128001 while the max is 128000. It triggered the assertion srcIndex < srcSelectDimSize failed.
Then I used tokenizer.pad_token = tokenizer.eos_token and the problem was solved.
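For anyone else hitting this, a minimal sketch of the two usual fixes (assuming a causal LM fine-tune; nothing here is taken from the original training setup): either reuse the EOS token as the pad token so no new ID is introduced, or keep a dedicated [PAD] token and resize the model's embedding table so the new ID has a row.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Option 1: reuse an existing special token as the pad token.
# No new ID is created, so the embedding table does not need to change.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Option 2: keep a dedicated [PAD] token, but grow the embedding table
# so the new ID (which lands above the original vocab size) has a valid row.
# tokenizer.add_special_tokens({"pad_token": "[PAD]"})
# model.resize_token_embeddings(len(tokenizer))

Either way, the token IDs fed to the model stay within the range of its embedding table, which is what the srcIndex < srcSelectDimSize assertion is checking.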