
Performance issue with mindspore.ops.normal #192

Open
fr30 opened this issue Aug 5, 2022 · 0 comments · May be fixed by #193


fr30 commented Aug 5, 2022

Environment

Hardware Environment(Ascend/GPU/CPU):

/device gpu

Software Environment:

  • MindSpore version (source or binary): 1.8.0
  • Python version (e.g., Python 3.7.5): 3.7.10
  • OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • GCC/Compiler version (if compiled from source): 7.5.0

Describe the current behavior

mindspore.ops.normal and mindspore.ops.StandardNormal perform very poorly: generating a single random tensor of size 100x100x100 takes around 6 seconds. The same problem likely affects other random ops as well.

Describe the expected behavior

Random ops should be much faster.

Steps to reproduce the issue

Simply run the code:

from mindspore import Tensor, dtype
from typing import List
from mindspore.ops import normal
import time
import mindspore

mindspore.set_seed(2137)

def run_random_calculation(iters: List[int], shapes: List[tuple], prnt: bool = False):
    assert len(iters) == len(shapes)

    mean = Tensor(0.0, dtype.float32)
    std = Tensor(1.0, dtype.float32)

    for i in range(len(iters)):
        iter_no = iters[i]
        shape = shapes[i]

        for j in range(iter_no):
            x = normal(shape, mean, std)

            if prnt:
                print(x[:2, :2, :1])

warmup_iters = [
    1,
    1,
    0
]
benchmark_iters = [
    0,
    1,
    0
]
shapes = [
    (10, 10, 10),
    (100, 100, 100),
    (500, 500, 100)
]

run_random_calculation(warmup_iters, shapes)

start = time.time()

run_random_calculation(benchmark_iters, shapes)

end = time.time()

print(f'Result \nshapes: {shapes}\niters: {benchmark_iters}\ntime {end - start}')

Related log / screenshot

Result
shapes: [(10, 10, 10), (100, 100, 100), (500, 500, 100)]
iters: [0, 1, 0]
time 5.9976

Special notes for this issue

The problem lies in the file mindspore/ccsrc/plugin/device/gpu/kernel/cuda_impl/cuda_ops/random_op_impl.cu. The kernels for random generation run curand_init() for each iteration, which is an expensive operation. Instead, they could exploit the fact that curand_normal() advances the state passed to it as an argument, so each state only needs to be initialized once.
The problem is described in https://docs.nvidia.com/cuda/curand/device-api-overview.html#performance-notes, along with a snippet that shows how to solve it.
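A minimal sketch of the pattern the cuRAND performance notes recommend (kernel and variable names here are illustrative, not the actual MindSpore kernels): initialize one state per thread once in a setup kernel, then reuse and advance that state across generation launches instead of calling curand_init() again.

```cuda
#include <curand_kernel.h>

// Run once: one expensive curand_init() per thread, stored in global memory.
__global__ void SetupStates(curandState *states, unsigned long long seed) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  curand_init(seed, id, 0, &states[id]);
}

// Run per request: reuse the saved states; curand_normal() advances them.
__global__ void FillNormal(curandState *states, float *out, size_t n,
                           float mean, float stddev) {
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
  curandState local = states[id];  // copy state to registers for speed
  for (size_t i = id; i < n; i += stride) {
    out[i] = mean + stddev * curand_normal(&local);  // advances `local`
  }
  states[id] = local;  // write the advanced state back for the next launch
}
```

With this structure, the per-element cost is just curand_normal(), and curand_init() is paid only once per thread at setup.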

@fr30 fr30 linked a pull request Aug 5, 2022 that will close this issue