Refactor parquet dataloader #867
base: main
Conversation
pandas~=2.0.0
pandas-stubs~=2.2.3
could we remove these explicit deps here and only keep pyarrow as the main dep?
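Roughly, the trimmed extra could then look like this (a sketch of the setup extras, not the final list; whether pandas stays optional is up for discussion):

```python
# hypothetical trimmed "arrow" extra: pyarrow stays the only hard dependency,
# pandas/pandas-stubs are left to the user's environment
extras_require = {
    "arrow": ["pyarrow>=13.0.0"],
}
```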
"arrow": ["pyarrow>=13.0.0", "pandas~=2.0.0"], | ||
"arrow": [ | ||
"pyarrow>=13.0.0", | ||
"joblib~=1.4.2", |
let's see if we could go without it (maybe ThreadPool from multiprocessing will do the job)
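A minimal sketch of what dropping joblib could look like, assuming the parallel step just maps a loading function over fragments (the helper name and arguments below are illustrative):

```python
from multiprocessing.pool import ThreadPool

# hypothetical replacement for a joblib.Parallel call: map a fragment-loading
# function over fragments with a small thread pool (I/O-bound work releases the GIL)
def load_fragments_in_parallel(fragments, load_fn, num_threads=4):
    with ThreadPool(num_threads) as pool:
        # pool.map preserves the input order, like joblib's default behaviour
        return pool.map(load_fn, fragments)
```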
"pyarrow_to_torch_tensor", | ||
"pyarrow_column_to_array", | ||
"split_fragment_in_row_groups", | ||
"table_func_wrap", |
I think this is not needed at all
"_TableWrapper", | ||
"_to_real_object", |
fs2.data now handles pyarrow.Table directly, so there is no more need for that
Contains different options that allow loading only a part of the provided dataset.
"""

columns: Optional[List[str]] = None
I'm not yet sure if it belongs here
"""If ``True``, uses Parquet row groups instead of simple partitions which | ||
are generally smaller. Highly recommended for non-partitioned parquet files.""" | ||
|
||
nb_parallel_fragments: Optional[int] = 5 |
maybe we could keep it with default=None and add an extra config arg (max_tokens) to be used with dynamic bucketing.
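A rough sketch of that shape (the class and field names here are illustrative, not the final API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FragmentLoadingConfig:
    # None lets the loader pick a sensible degree of parallelism;
    # an explicit value still pins the number of fragments read concurrently
    nb_parallel_fragments: Optional[int] = None

    # hypothetical companion knob for dynamic bucketing: when set, batches
    # are built against a token budget rather than a fixed fragment count
    max_tokens: Optional[int] = None
```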

@dataclass
class DataLoadingConfig:
this is a more specific data loading case; I would call it Seq2SeqDataLoading.
But we could also get something like ClassifierDataloading or PairDataloading ...
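One possible layout, purely illustrative (the subclass fields are placeholders):

```python
from dataclasses import dataclass

@dataclass
class DataLoadingConfig:
    # task-agnostic options (shuffle, seed, nb_epochs, ...) stay in the base class
    shuffle: bool = True
    seed: int = 123

@dataclass
class Seq2SeqDataLoadingConfig(DataLoadingConfig):
    # seq2seq-specific options would live in the subclass;
    # the column names below are only examples
    source_column: str = "source"
    target_column: str = "target"
```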
shuffle: bool = True
"""If ``True``, shuffles the dataset samples during the iteration. If ``False``
and ``order_by_length`` is ``None``, the batch samples will be produced in
natural Parquet dataset reading order."""

drop_null: bool = True
"""If ``True``, drops rows containing any null value."""

seed: int = 123
"""The RNG seed value for deterministic behavior."""

nb_epochs: int = 100
"""
Number of passes over the data before iterations stop.
"""
this probably should go to Basic Dataset config (frontend pipeline)
# XXX: this will reinit default aws creds if they were not provided explicitly
# tested on aws cluster only!
remove comments about aws
)
self.fragment = loads(dumps(self.fragment))
fragment_table = self.fragment.to_table(
    columns=fragment_columns, use_threads=False
use_threads should be a parameter here
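For example, something along these lines (a sketch; the function name is made up, but `Fragment.to_table` does accept `columns` and `use_threads`):

```python
from typing import List, Optional

import pyarrow as pa
import pyarrow.dataset as ds

def load_fragment_table(
    fragment: ds.Fragment,
    fragment_columns: Optional[List[str]] = None,
    use_threads: bool = False,
) -> pa.Table:
    # use_threads is caller-controlled instead of being hard-coded to False
    return fragment.to_table(columns=fragment_columns, use_threads=use_threads)
```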
return fragment_table  # type: ignore


def parquet_fragments_to_pipeline_builder(
we could remove this one in favor of list_parquet_fragments?
# Apply filters if specified
if dataset_config.filters is not None or dataloader_config.drop_null:
    pipeline_builder = pipeline_builder.map(
        table_func_wrap(
no need for these wrappers anymore
return replace_table_column(table, column, new_array)


def correct_paragraph_length(
this is too LCM-specific a function, no reason to keep it here
# and take the 'list.min()' and 'list.max()' as needed.
filter_series = df_pl.with_columns(
    (
        (pl.col(column).list.eval(pl.col("").str.len_bytes()).list.min() >= min_len)
here we used len_bytes, but len_chars is probably more relevant
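The change itself would be small; a sketch with polars, assuming `column` holds a list of strings per row (the helper name is illustrative):

```python
import polars as pl

# character-based length check instead of byte-based: len_chars counts Unicode
# code points, so multi-byte characters no longer inflate the measured lengths
def min_length_mask(df_pl: pl.DataFrame, column: str, min_len: int) -> pl.Series:
    return df_pl.with_columns(
        (
            pl.col(column).list.eval(pl.col("").str.len_chars()).list.min() >= min_len
        ).alias("keep")
    )["keep"]
```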
return table


def load_one_fragment(
we have the SafeFragment interface for this now
return np.asarray(length_col, dtype=np.int32)


class _TableWrapper:
to remove
self.table: pa.Table = table


def _to_real_object(x: Union[_TableWrapper, NestedDict]) -> BatchOutputType:
to remove
return x


def table_func_wrap(func: Callable[..., Any]) -> Callable[..., Any]:
to remove
What does this PR do? Please describe:
A first attempt to extract and migrate the generic parquet dataloader from MERES to fairseq2.
Does your PR introduce any breaking changes? If yes, please list them:
N/A
Check list: