
Possibly incorrect gradients from autodiff #8356

Open
chenzhekl opened this issue Sep 28, 2023 · 1 comment
@chenzhekl

Describe the bug

The same algorithm, written in two different forms, produces different gradients.

To Reproduce

import taichi as ti
import torch

ti.init(arch=ti.cuda)


@ti.kernel
def foo(x: ti.types.ndarray(), y: ti.types.ndarray()):
    for i in x:
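        # Equivalent form that accumulates the sum of y into a local first: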
        # a = 0.0
        # for j in y:
        #     a += y[j]
        # x[i] += a
        for j in y:
            x[i] += y[j]


x = torch.tensor(
    [0, 0, 0, 0, 0], dtype=torch.float32, device="cuda", requires_grad=True
)
y = torch.tensor([1, 2, 3], dtype=torch.float32, device="cuda", requires_grad=True)

foo(x, y)
x.grad = torch.ones_like(x)
foo.grad(x, y)
print(x.grad, y.grad)

The above code outputs:

[Taichi] version 1.7.0, llvm 15.0.4, commit aa0619fb, linux, python 3.10.12
[Taichi] Starting on arch=cuda
tensor([1., 1., 1., 1., 1.], device='cuda:0') tensor([1., 1., 1.], device='cuda:0')

while the commented-out variant, which computes the same thing, outputs:

[Taichi] version 1.7.0, llvm 15.0.4, commit aa0619fb, linux, python 3.10.12
[Taichi] Starting on arch=cuda
tensor([1., 1., 1., 1., 1.], device='cuda:0') tensor([5., 5., 5.], device='cuda:0')
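For reference, the expected gradient can be checked with a plain PyTorch equivalent of the kernel (a minimal sketch, not part of the original report; it mirrors "x[i] += sum(y)" via broadcasting):

import torch

x = torch.zeros(5, device="cuda", requires_grad=True)
y = torch.tensor([1.0, 2.0, 3.0], device="cuda", requires_grad=True)

# Each output element receives the full sum of y, so d(out[i])/dy[j] = 1
# for every (i, j), and y.grad[j] = len(x) = 5 after backward with ones.
out = x + y.sum()
out.backward(torch.ones_like(out))
print(x.grad, y.grad)  # tensor([1., 1., 1., 1., 1.]) tensor([5., 5., 5.])

This confirms that [5., 5., 5.] is the correct y.grad, i.e. the nested struct-for variant is the one producing wrong gradients.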


@jim19930609
Contributor

OK, the problem is that the Taichi compiler does not reject the use of a "struct-for" as an inner loop -- the "for j in y" line in the code.

Write something like:

@ti.kernel
def foo(x: ti.types.ndarray(), y: ti.types.ndarray()):
    for i in x:
        for j in range(y.shape[0]): 
            x[i] += y[j]
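With the range-based inner loop, rerunning the driver code from the report should give y.grad == tensor([5., 5., 5.]), matching the commented-out variant above.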

Meanwhile, we should add a guard that emits an error message when a struct-for is used as an inner loop.

@jim19930609 jim19930609 self-assigned this Oct 13, 2023
@jim19930609 jim19930609 moved this from Untriaged to Todo in Taichi Lang Oct 13, 2023