Hi, I got an error during sockeye.prepare_data saying that the target sequences are not token-parallel (i.e., they don't have the same length): [[2, 1960], [2, 4, 4, 4, 4]]. For reference, 2 is the <s> token and 4 is the , token. Here's the log:
[2023-09-28:15:10:02:INFO:sockeye.utils:log_sockeye_version] Sockeye: 3.1.29, commit 4dba5a39b3bde, path /home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/__init__.py
[2023-09-28:15:10:02:INFO:sockeye.utils:log_torch_version] PyTorch: 1.10.0 (/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/torch/__init__.py)
[2023-09-28:15:10:02:INFO:sockeye.utils:log_basic_info] Command: /home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py -s /data/mlmt//final.src.mask -t /data/mlmt//final.tgt --source-factors /data/mlmt//final.src.cf --target-factors /data/mlmt//final.tgt.cf --shared-vocab --num-words 120000 --word-min-count 2 --max-seq-len 200 --num-samples-per-shard 15000000 --max-processes 4 -o /home/share/research/mlmd/data_bin
[2023-09-28:15:10:02:INFO:sockeye.utils:log_basic_info] Arguments: Namespace(bucket_scaling=False, bucket_width=8, config=None, loglevel='INFO', loglevel_secondary_workers='INFO', max_processes=4, max_seq_len=(200, 200), min_num_shards=1, no_bucketing=False, no_logfile=False, num_samples_per_shard=15000000, num_words=(120000, 120000), output='/home/share/research/mlmd/data_bin', pad_vocab_to_multiple_of=8, quiet=False, quiet_secondary_workers=False, seed=13, shared_vocab=True, source='/data/mlmt//final.src.mask', source_factor_vocabs=[], source_factors=['/data/mlmt//final.src.cf'], source_factors_use_source_vocab=[], source_vocab=None, target='/data/mlmt//final.tgt', target_factor_vocabs=[], target_factors=['/data/mlmt//final.tgt.cf'], target_factors_use_target_vocab=[], target_vocab=None, word_min_count=(2, 2))
[2023-09-28:15:10:02:INFO:sockeye.utils:seed_rngs] Random seed: 13
[2023-09-28:15:10:02:INFO:sockeye.utils:seed_rngs] PyTorch seed: 13
[2023-09-28:15:10:02:INFO:__main__:prepare_data] Adjusting maximum length to reserve space for a BOS/EOS marker. New maximum length: (201, 201)
[2023-09-28:15:40:05:INFO:__main__:prepare_data] 1997912086 samples will be split into 134 shard(s) (requested samples/shard=15000000, min_num_shards=1).
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] =============================
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] Loading/creating vocabularies
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] =============================
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] (1) Surface form vocabularies (source & target)
[2023-09-28:22:37:39:INFO:sockeye.vocab:build_from_paths] Building vocabulary from dataset(s):
...
Traceback (most recent call last):
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py", line 121, in <module>
    main()
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py", line 32, in main
    prepare_data(args)
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py", line 117, in prepare_data
    keep_tmp_shard_files=keep_tmp_shard_files)
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/data_io.py", line 609, in prepare_data
    length_stats = pool.starmap(analyze_sequence_lengths, stats_args)
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
sockeye.utils.SockeyeError: Target sequences are not token-parallel: [[2, 1960], [2, 4, 4, 4, 4]]
It looks like the error message is reporting that the files for target sequences (factor 0) and target factor sequences (factors 1+) are not token-parallel. If you run word counts for each pair of lines from /data/mlmt//final.tgt and /data/mlmt//final.tgt.cf, are the lengths the same?
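One quick way to check is a short script along these lines. It's a minimal diagnostic sketch, not Sockeye code: it assumes whitespace tokenization (which is what Sockeye's data preparation uses to split lines into tokens) and reuses the file paths from the log above.

from itertools import zip_longest

# Hypothetical diagnostic script: report every line where the target file and
# the target-factor file have a different number of whitespace-separated tokens.
# Sockeye requires each target-factor line to align 1:1 with the target tokens.
target_path = "/data/mlmt//final.tgt"
factor_path = "/data/mlmt//final.tgt.cf"

with open(target_path, encoding="utf-8") as tgt, open(factor_path, encoding="utf-8") as cf:
    for line_no, (tgt_line, cf_line) in enumerate(zip_longest(tgt, cf), start=1):
        # zip_longest also catches the case where one file has fewer lines.
        if tgt_line is None or cf_line is None:
            print(f"line {line_no}: one file ended before the other")
            break
        n_tgt = len(tgt_line.split())
        n_cf = len(cf_line.split())
        if n_tgt != n_cf:
            print(f"line {line_no}: {n_tgt} target tokens vs {n_cf} factor tokens")

Any line numbers this prints are pairs that prepare_data will reject; you would then either regenerate the factor file or drop those lines from all parallel files (target, target factors, and the corresponding source-side files) so the corpus stays aligned.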