Proposal for batch.remove_at_index() method to enhance efficiency #9681

leonardcaquot94 · 2024-09-26T09:49:15Z

Hello PyG team!

I’d like to propose adding a batch.remove_at_index() method to PyTorch Geometric. It would take a mask or a list of indices and filter the batch directly, without a round trip through to_data_list() / from_data_list(). The comparison below times both approaches on a small HeteroData batch.

Code Example:

from datetime import datetime
from torch_geometric.data import HeteroData, Batch
import torch

# Create initial HeteroData object and batch
data = HeteroData()
data['paper'].x = torch.randn(5, 16)
batch_size = 50
batch = Batch.from_data_list([data]*batch_size)

def func1(batch):
    idx_to_keep = torch.rand(batch_size) < 0.5
    data_to_keep = batch[idx_to_keep]
    new_batch = Batch.from_data_list(data_to_keep)
    return new_batch

def func2(batch):
    idx_to_keep = torch.rand(batch_size) < 0.5
    nodes_to_keep = idx_to_keep[batch['paper'].batch]
    batch['paper'].x = batch['paper'].x[nodes_to_keep]
    batch_filtered = batch['paper'].batch[nodes_to_keep]
    _, batch['paper'].batch, counts = torch.unique_consecutive(batch_filtered, return_inverse=True, return_counts=True)
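    # Build the leading zero from counts so ptr has the same dtype (int64) as the cumulative sum
    batch['paper'].ptr = torch.cat((counts.new_zeros(1), counts.cumsum(0)))
    # Store a plain int rather than a 0-dim tensor
    batch._num_graphs = int(idx_to_keep.sum())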
    return batch

def copy(batch):
    batch_1 = Batch.from_data_list(batch.to_data_list())
    batch_2 = Batch.from_data_list(batch.to_data_list())
    return batch_1, batch_2

t1 = 0.0
t2 = 0.0
for _ in range(1000):
    batch_1, batch_2 = copy(batch)

    # total_seconds() is the full elapsed time; .microseconds is only the sub-second component
    t = datetime.now()
    x = func1(batch_1)
    t1 += (datetime.now() - t).total_seconds()

    t = datetime.now()
    y = func2(batch_2)
    t2 += (datetime.now() - t).total_seconds()

print(f'func1 : {t1} seconds')
print(f'func2 : {t2} seconds')

Execution Results:

func1 : 0.946913 seconds
func2 : 0.11657 seconds

Explanation:

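func1 filters by indexing the batch, which returns a list of data objects, and then rebuilds it with Batch.from_data_list(). This is the slower path.
func2 filters the node tensor directly and recomputes batch, ptr, and the graph count without converting to or from a list. In the run above this was roughly 8x faster (0.947 s vs. 0.117 s over 1000 iterations).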
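To make the bookkeeping in func2 concrete, here is a dependency-free sketch using plain Python lists instead of tensors. It shows the three things that have to be recomputed when graphs are dropped: the per-node mask, the relabeled batch vector, and ptr. The name filter_batch_vector is made up for illustration and is not part of PyG.

```python
from itertools import accumulate


def filter_batch_vector(batch_vec, keep):
    """Drop graphs from a batch assignment vector.

    batch_vec[i] is the graph index of node i (sorted, as in a PyG batch);
    keep[g] says whether graph g survives. Hypothetical helper, not a PyG API.
    """
    # Per-node mask: a node survives if its graph survives.
    node_mask = [keep[g] for g in batch_vec]

    # Relabel surviving graphs to 0..k-1 in their original order.
    new_index = {}
    for g, kept in enumerate(keep):
        if kept:
            new_index[g] = len(new_index)
    new_batch_vec = [new_index[g] for g in batch_vec if keep[g]]

    # ptr[j] is the offset of the first node of graph j; ptr[-1] is the node count.
    counts = [0] * len(new_index)
    for g in new_batch_vec:
        counts[g] += 1
    ptr = [0] + list(accumulate(counts))

    return node_mask, new_batch_vec, ptr


# Three graphs with 2, 1 and 3 nodes; drop the middle one.
node_mask, new_batch_vec, ptr = filter_batch_vector(
    [0, 0, 1, 2, 2, 2], [True, False, True]
)
print(node_mask)      # [True, True, False, True, True, True]
print(new_batch_vec)  # [0, 0, 1, 1, 1]
print(ptr)            # [0, 2, 5]
```

One difference from func2: because the relabeling here is derived from the mask rather than from the values still present in the filtered vector, a kept graph with zero nodes still gets its own index and a zero-length slot in ptr.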

Additional Suggestions:

To support more than a single hard-coded attribute, the implementation could reuse some of the existing code from the collate function and handle custom node and edge attributes by iterating over them. It might also be useful to offer an option to return the "negative sub-batch" (the elements excluded by the mask) alongside the positive sub-batch.
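As a rough illustration of the positive/negative idea, the sketch below splits one set of node rows into a kept side and a dropped side from a single graph-level mask. It uses plain Python lists so it runs without any dependencies; split_by_graph_mask and the dict layout it returns are hypothetical and only meant to show what each side would need to carry.

```python
def split_by_graph_mask(x, batch_vec, keep):
    """Split node rows into a kept and a dropped sub-batch.

    Hypothetical helper for illustration; not a PyG API.
    x[i] is the feature row of node i, batch_vec[i] its graph index,
    keep[g] whether graph g goes to the positive side.
    """
    def select(flag):
        # Graphs on this side, relabeled 0..k-1 in their original order.
        graphs = [g for g, kept in enumerate(keep) if kept == flag]
        relabel = {g: j for j, g in enumerate(graphs)}
        rows = [row for row, g in zip(x, batch_vec) if g in relabel]
        batch = [relabel[g] for g in batch_vec if g in relabel]
        # Rebuild ptr: count nodes per graph, then take the running sum.
        ptr = [0] * (len(graphs) + 1)
        for g in batch:
            ptr[g + 1] += 1
        for j in range(len(graphs)):
            ptr[j + 1] += ptr[j]
        return {"x": rows, "batch": batch, "ptr": ptr}

    return select(True), select(False)


x = ["a0", "a1", "b0", "c0", "c1", "c2"]
pos, neg = split_by_graph_mask(x, [0, 0, 1, 2, 2, 2], [True, False, True])
print(pos)  # {'x': ['a0', 'a1', 'c0', 'c1', 'c2'], 'batch': [0, 0, 1, 1, 1], 'ptr': [0, 2, 5]}
print(neg)  # {'x': ['b0'], 'batch': [0], 'ptr': [0, 1]}
```

A tensor-based version would presumably apply the same node mask and its negation to each stored attribute, which is where reusing the collate logic would come in.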

Usage:

Currently, filtering with sub_batch = batch[mask] returns a list of data objects. I think it would be more convenient if the result stayed a Batch object. Users could then keep working in the Batch format and call sub_batch.to_data_list() only when they explicitly need a list, which would streamline operations where the batch structure needs to be preserved.

I believe these improvements could greatly enhance performance, especially in batch filtering scenarios.

Let me know what you think of this idea, and if any further clarifications are needed!
