
Video support for datatrove #271

Draft · wants to merge 5 commits into main

Conversation

guipenedo (Collaborator):
WIP

guipenedo (Collaborator) left a review:

Not sure how much detail you wanted in the feedback; some of these are nits, but I included them just in case.
The main issue imo is that there is no sharding in the reader, which will be a problem if you want to scale up / use multiple CPUs. It's easily fixable: just pass rank and world_size to your find_triplets function and change the return to return triplets[rank::world_size]. This is consistent across workers because list_files is deterministic (it sorts the file list), so there is no overlap between workers. A sketch of the fix is below.
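A minimal sketch of the fix, assuming find_triplets matches videos to their .json/.vtt companions as elsewhere in this PR (the matching details are assumed from the diff; the actual fix is only the new signature and the final slice):

```python
import os

VIDEO_EXTENSIONS = (".mp4", ".avi", ".mkv", ".mov")  # assumed set, per the discussion below


def find_triplets(self, rank: int = 0, world_size: int = 1) -> list:
    # list_files is deterministic (it sorts the file list), so every worker
    # computes the same ordered triplet list before taking its slice
    paths = set(self.data_folder.list_files(recursive=self.recursive))
    triplets = []
    for path in sorted(paths):
        base_name, ext = os.path.splitext(path)
        if ext in VIDEO_EXTENSIONS:
            metadata, subtitles = base_name + ".json", base_name + ".vtt"
            if metadata in paths and subtitles in paths:
                triplets.append((path, metadata, subtitles))
    # rank::world_size slices are disjoint and jointly cover all triplets
    return triplets[rank::world_size]
```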

"""Filter that uses ffmpeg to detect if a video is static (frozen)."""

name = "🧊 Video-Frozen-filter"
_requires_dependencies = ["ffmpeg"]

This should probably be the following, to show users the correct package name to install:

Suggested change:
- _requires_dependencies = ["ffmpeg"]
+ _requires_dependencies = [("ffmpeg", "ffmpeg-python")]
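For context, the tuple form exists because the import name ("ffmpeg") and the pip package name ("ffmpeg-python") differ. A rough illustration of how such a check typically resolves the tuple (not datatrove's actual code, just the convention):

```python
import importlib


def check_required_dependencies(dependencies):
    missing = []
    for dependency in dependencies:
        # a tuple means (import name, pip package name); a bare string means both match
        module, pip_name = dependency if isinstance(dependency, tuple) else (dependency, dependency)
        try:
            importlib.import_module(module)
        except ImportError:
            missing.append(pip_name)
    if missing:
        raise ImportError(f"Missing dependencies. Install them with: pip install {' '.join(missing)}")
```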


You forgot to accept the actual change here, I think.

Comment on lines 32 to 33
self.data_folder = get_datafolder(data_folder)
self.paths_file = paths_file

You don't need to save self.data_folder and self.paths_file, as these are already saved by the base class.
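A minimal sketch, assuming the reader extends datatrove's BaseDiskReader (the subclass name and argument subset here are illustrative):

```python
from datatrove.pipeline.readers.base import BaseDiskReader


class VideoTripletReader(BaseDiskReader):  # name illustrative
    def __init__(self, data_folder, paths_file=None, recursive=True):
        # BaseDiskReader.__init__ already calls get_datafolder(data_folder)
        # and stores self.data_folder and self.paths_file
        super().__init__(data_folder, paths_file=paths_file, recursive=recursive)
```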

"""Overrides the base run method to handle triplet statistics correctly."""
triplet_count = 0

for triplet in self.find_triplets():

You're currently not sharding here, which will make it hard to parallelize / scale up to many CPUs (see the sharding suggestion in the main review comment).

with self.data_folder.open(self.paths_file, "r") as f:
paths = [line.strip() for line in f]
else:
paths = self.data_folder.list_files(recursive=self.recursive, glob_pattern="*")

If you don't want to use the glob_pattern you can just pass None (or nothing, as that's the default), but then you probably want to remove it from the __init__ args.
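E.g., a sketch of the simplified call (same list_files signature as in the quoted code):

```python
# glob_pattern defaults to None, so it can simply be omitted:
paths = self.data_folder.list_files(recursive=self.recursive)
```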


for path in paths:
base_name, ext = os.path.splitext(path)
if ext in video_extensions:

We could maybe extract the triplet-finding logic into the DataFolder class later on (you'd pass a tuple like ((".mp4", ".avi", ".mkv", ".mov"), ".json", ".vtt")), but it's probably ok for now. A rough sketch of the idea is below.
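Purely hypothetical sketch of such a helper method on DataFolder; neither list_triplets nor this tuple format exists yet:

```python
import os


def list_triplets(self, extensions):
    # extensions: e.g. ((".mp4", ".avi", ".mkv", ".mov"), ".json", ".vtt")
    video_exts, metadata_ext, subtitle_ext = extensions
    paths = set(self.list_files(recursive=True))
    for path in sorted(paths):
        base, ext = os.path.splitext(path)
        # a triplet is a video file plus its metadata and subtitle companions
        if ext in video_exts and base + metadata_ext in paths and base + subtitle_ext in paths:
            yield path, base + metadata_ext, base + subtitle_ext
```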

def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:
"""Overrides the base run method to handle triplet statistics correctly."""
triplet_count = 0


This allows stacking multiple readers together (when you want to load data from multiple sources, for example); see the usage sketch after the suggestion.

Suggested change:
+ if data:
+     yield from data
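With that change a reader first re-emits the documents produced by any earlier pipeline step before yielding its own. A hypothetical usage sketch (the reader name and paths are illustrative):

```python
from datatrove.executor import LocalPipelineExecutor

pipeline = [
    VideoTripletReader("s3://some-bucket/videos-a"),  # first source
    VideoTripletReader("/local/videos-b"),  # re-yields the docs above, then adds its own
    # ... filters, writers ...
]
LocalPipelineExecutor(pipeline=pipeline, tasks=4).run()
```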


return document

def ensure_local_copy(self, video_url: str) -> str:

Good job using all of datatrove's useful IO features here :)

(Three further comments on src/datatrove/pipeline/filters/video_frozen_filter.py are outdated and resolved.)
guipenedo (Collaborator) left a second review:

Looking good, just missing the change to check the Python lib dependency.

"""Filter that uses ffmpeg to detect if a video is static (frozen)."""

name = "🧊 Video-Frozen-filter"
_requires_dependencies = ["ffmpeg"]

You forgot to accept the actual change here, I think.
