The Segmenter Logic Latency increases 10x between 20 and 30 Fusions #3510

kevinstephano · 2024-12-02T19:39:24Z

Thunder-based repro. Set NVFUSER_DUMP=python_definition to see nvFuser's FusionDefinition; it is too large to include here.

import time

import torch
import thunder

class MySimpleModel(torch.nn.Module):
    """A stack of n_layers Linear(16, 16) layers, each followed by ReLU."""

    def __init__(self, n_layers=10):
        super().__init__()
        self.fcs = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(n_layers)])

    def forward(self, x):
        for fc in self.fcs:
            x = torch.nn.functional.relu(fc(x))

        return x

def get_model_and_args():
    device = 'cuda'
    # 30 layers triggers the slow segmentation; 20 layers does not
    model = MySimpleModel(n_layers=30).to(device)
    args = (torch.randn(128, 16, device=device),)
    kwargs = {}
    return model, args, kwargs

model, args, kwargs = get_model_and_args()

# JIT the model with thunder, with nvFuser linear support enabled
jfun = thunder.jit(model, nv_enable_linear=True)

# Time the first call, which includes tracing, segmentation, and compilation
st = time.time()
expected = jfun(*args, **kwargs)
print("time:", time.time() - st)

There are a couple of issues here:

1. The segmenter time increases to 300 s at 30 layers, up from 300 ms at 20 layers, for reasons not yet understood. [The Largest Issue]
2. NVRTC compilation happens more than once for the same activation kernel.
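
To narrow down where the jump between 20 and 30 layers happens, a sweep over the layer count might help. The following is only a minimal sketch, not part of the original repro: `time_first_call`, `growth_ratios`, and `build_thunder_call` are hypothetical helper names, and the model is rebuilt as a `torch.nn.Sequential` of Linear + ReLU pairs on the assumption that it behaves like `MySimpleModel` above. It measures total first-call time, so it does not separate segmentation from NVRTC compilation.

```python
import time

def time_first_call(build, sizes):
    """Return {size: seconds} for the first call of the callable built for each size."""
    results = {}
    for n in sizes:
        fn = build(n)  # construction is excluded from the measurement
        start = time.perf_counter()
        fn()           # first call: for a jitted model this includes compilation
        results[n] = time.perf_counter() - start
    return results

def growth_ratios(results):
    """Ratio of each timing to the previous size's timing, keyed by (prev, size)."""
    sizes = sorted(results)
    return {
        (a, b): results[b] / results[a]
        for a, b in zip(sizes, sizes[1:])
        if results[a] > 0  # skip pairs where the baseline is too small to measure
    }

def build_thunder_call(n_layers):
    """Hypothetical builder: a thunder-jitted stack of n_layers Linear + ReLU pairs.

    Imports are local so the timing helpers above work without torch or thunder.
    """
    import torch
    import thunder

    layers = []
    for _ in range(n_layers):
        layers += [torch.nn.Linear(16, 16), torch.nn.ReLU()]
    model = torch.nn.Sequential(*layers).to("cuda")
    x = torch.randn(128, 16, device="cuda")
    jfun = thunder.jit(model, nv_enable_linear=True)

    def call():
        out = jfun(x)
        torch.cuda.synchronize()  # wait for the GPU so the timing covers the full call
        return out

    return call
```

On a CUDA machine with thunder installed, something like `growth_ratios(time_first_call(build_thunder_call, [20, 22, 24, 26, 28, 30]))` would show whether the latency grows gradually or jumps at a particular layer count.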


kevinstephano added the Host Latency, Segmentation (Issues related to nvFuser Segmentation), and Thunder labels Dec 2, 2024

kevinstephano mentioned this issue Dec 2, 2024

nvFuser linear fusion leads to notebook example timeout Lightning-AI/lightning-thunder#1490

Open
