`kwcoco split --rng 12345` does not split different files referring to same videos uniformly #6

cameron-a-johnson · 2024-10-31T16:30:19Z

Hey! Paul might share this too: a pipeline produces multiple kwcoco datasets for the same set of videos, with frame-wise activity classification ground truth, pose detection, and some object detections, respectively.

FWIW, we want to kwcoco split each of those by the same random split. I thought kwcoco split --rng 12345 might create the same split on each file but it doesn't.

And that's really no big deal for us! Obviously a ~5-line splitter script that seeds a random split of the video id list will do.

Just thought the use case might be interesting to consider. If so, maybe I could take a stab at a re-implementation of split that would work for this case?

The text was updated successfully, but these errors were encountered:

Erotemic · 2024-11-01T02:23:34Z

I think only a minor modification is needed.

When balance_categories is True the current behavior is correct because the splits depend on the annotations. If you have two datasets with the same video AND annotation distribution you would expect they have the same split, but otherwise it may vary.

However, when balance_categories is False, then a consistent split over video-ids would make much more sense. The problem logic is here:

        # setup
        dset = kwcoco.CocoDataset.coerce(config['src'])
        images = dset.images()
        cids_per_image = images.annots.cids
        gids = images.lookup('id')
        group_ids = images.lookup('video_id')
        ...

        final_group_ids = []
        final_group_gids = []
        final_group_cids = []

        unique_cids = set(ub.flatten(cids_per_image)) | {0}
        distinct_cid = max(unique_cids) + 11

        for group_id, gid, cids in zip(group_ids, gids, cids_per_image):
            if len(cids) == 0:
                final_group_ids.append(group_id)
                final_group_gids.append(gid)
                final_group_cids.append(distinct_cid)
            else:
                final_group_ids.extend([group_id] * len(cids))
                final_group_gids.extend([gid] * len(cids))
                final_group_cids.extend(cids)

        # Balanced category split
        rng = kwarray.ensure_rng(config['rng'])

        shuffle = rng is not None
        factor = config['factor']
        self = util_sklearn.StratifiedGroupKFold(n_splits=factor,
                                                 random_state=rng,
                                                 shuffle=shuffle)

        if config['balance_categories']:
            split_idxs = list(self.split(X=final_group_gids, y=final_group_cids, groups=final_group_ids))
        else:
            split_idxs = list(self.split(X=final_group_gids, y=final_group_gids, groups=final_group_ids))

if not config['balance_categories'], then we should not need to duplicate image-ids for the sklearn splitter. I think you can just take the len(cids) == 0 path in this case, and it should work.

Of course, we should also write a test for this as well.

Even though most development of kwcoco is on gitlab, I think it makes sense to start using github more, as it should encourage more community contributions. Towards this end, I should enable CI on github for this package. So, please submit a PR here if you can help fix this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`kwcoco split --rng 12345` does not split different files referring to same videos uniformly #6

`kwcoco split --rng 12345` does not split different files referring to same videos uniformly #6

cameron-a-johnson commented Oct 31, 2024

Erotemic commented Nov 1, 2024 •

edited

Loading

kwcoco split --rng 12345 does not split different files referring to same videos uniformly #6

kwcoco split --rng 12345 does not split different files referring to same videos uniformly #6

Comments

cameron-a-johnson commented Oct 31, 2024

Erotemic commented Nov 1, 2024 • edited Loading

`kwcoco split --rng 12345` does not split different files referring to same videos uniformly #6

`kwcoco split --rng 12345` does not split different files referring to same videos uniformly #6

Erotemic commented Nov 1, 2024 •

edited

Loading