broadcast_add is very slow #2499

Open
okpatil4u opened this issue Sep 25, 2024 · 2 comments

Comments

@okpatil4u

Rust code

use candle_core::Tensor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &candle_core::Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &candle_core::Device::Cpu)?;

    let start = std::time::Instant::now();
    for _ in 0..100 {
        let _ = a.broadcast_add(&b)?;
    }
    println!("broadcast add  : {:?}", std::time::Instant::now() - start);

    Ok(())
}

broadcast add  : 3.317046333s

Python Code

import torch
import time

a = torch.rand(32,630,12,32)
c = torch.rand(32, 1, 1, 32)

start = time.time()
for i in range(100):
    out = a+c
print("broadcast add : ",time.time() - start)

broadcast add :  0.06077408790588379

The Python version is around 55 times faster for a simple addition. Is there a path forward to solve this problem?

@xiaoniaoyouhuajiang

I ran a modified version of the code (loop count reduced to 10), together with my Cargo.toml settings and the flamegraph tool, and obtained the flame graph below. A significant portion of the time appears to be iterator overhead in the binary_map function, specifically <alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter (a minimal sketch of this overhead follows the flame graph).

use candle_core::{Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &candle_core::Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &candle_core::Device::Cpu)?;

    let start = std::time::Instant::now();
    for _ in 0..10 {
        let _ = a.broadcast_add(&b)?;
    }
    println!("broadcast add  : {:?}", std::time::Instant::now() - start);

    Ok(())
}

broadcast add : 441.14726ms

[flame graph image: most time spent in binary_map / SpecFromIter::from_iter]
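To make the flame-graph finding concrete, here is a minimal, self-contained sketch in plain Rust (no candle; the index arithmetic only emulates a (32, 1, 1, 32) broadcast and is not candle's actual binary_map). It contrasts collecting the result from a per-element strided iterator, roughly the path the flame graph points at, with a plain loop over two contiguous buffers:

use std::time::Instant;

fn main() {
    // Same shapes as the issue: a is (32, 630, 12, 32), b is (32, 1, 1, 32).
    let (d0, d1, d2, d3) = (32usize, 630, 12, 32);
    let n = d0 * d1 * d2 * d3;
    let a = vec![1.0f32; n];
    let b = vec![2.0f32; d0 * d3];

    // Slow path: recompute b's broadcast index for every element and build the
    // output by collecting an iterator (roughly what the flame graph shows as
    // SpecFromIter::from_iter inside binary_map).
    let start = Instant::now();
    let mut out_strided = Vec::new();
    for _ in 0..10 {
        out_strided = (0..n)
            .map(|i| {
                let i0 = i / (d1 * d2 * d3); // first-dim index
                let i3 = i % d3; // last-dim index
                a[i] + b[i0 * d3 + i3]
            })
            .collect::<Vec<f32>>();
    }
    println!("strided/collect add (10x): {:?}", start.elapsed());

    // Fast path: expand b into a contiguous buffer once, then add slice-wise.
    let start = Instant::now();
    let mut b_full = vec![0.0f32; n];
    for i in 0..n {
        let i0 = i / (d1 * d2 * d3);
        let i3 = i % d3;
        b_full[i] = b[i0 * d3 + i3];
    }
    let mut out_contig = vec![0.0f32; n];
    for _ in 0..10 {
        for ((o, &x), &y) in out_contig.iter_mut().zip(&a).zip(&b_full) {
            *o = x + y;
        }
    }
    println!("contiguous add (10x):      {:?}", start.elapsed());

    assert_eq!(out_strided, out_contig);
}

The contiguous version should be noticeably faster, mainly because it avoids per-element index arithmetic and per-call allocation via collect, and the simple slice loop can be auto-vectorized.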

@xiaoniaoyouhuajiang

xiaoniaoyouhuajiang commented Oct 10, 2024

I changed the broadcast_add loop count in the code to 10 and measured both optimized and unoptimized builds, comparing the results with the Python code.

Experimental Conditions
My PC's CPU:

OS: ubuntu 20.04
Architecture:                    x86_64
CPU Operating Modes:          32-bit, 64-bit
Address sizes:                 43 bits physical, 48 bits virtual
Byte Order:                    Little Endian
CPU:                            12
  Online CPU(s) list:          0-11
Vendor ID:                     AuthenticAMD
  Model Name:                  AMD Ryzen 5 3600 6-Core Processor
    CPU Family:               23
    Model:                     113
    Thread(s) per core:       2
    Core(s) per socket:       6
    Socket(s):                 1
    Stepping:                  0
    Frequency boost:          enabled

Compiler settings (Cargo.toml release profile):

[profile.release]
opt-level = 2
lto = false
debug = true
panic = 'abort'

And these are my results:
optimized: broadcast add (10x loop): 447.87228ms
optimized: broadcast add (100x): 3.983712414s
Python: broadcast add (10x): 0.04471111297607422s (44.71111297607422ms)
Python: broadcast add (100x): 0.4770965576171875s (477.0965576171875ms)

Clearly, the Python code is about 10x faster than Candle.

I used strace to trace the system calls of both implementations and found that the Python version of the broadcast appears to use multi-threading to distribute the tensor operation (in this case, 48 threads). I am not sure whether this mechanism accounts for the difference in computational efficiency. In contrast, the Rust code in this case used only one thread/logical core. I therefore tested whether multi-threading could improve the code's efficiency in this scenario using Rayon.

strace --follow-forks --summary-only python3 broadcast_add.py

strace: Process 66105 attached
strace: Process 66106 attached
strace: Process 66107 attached
strace: Process 66108 attached
strace: Process 66109 attached
strace: Process 66110 attached
strace: Process 66111 attached
strace: Process 66112 attached
strace: Process 66113 attached
strace: Process 66114 attached
...
// modified code
use candle_core::{Result, Tensor};
use rayon::prelude::*;
use std::sync::Arc;
use std::time::Instant;

fn main() -> Result<()> {
    let a = Arc::new(Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &candle_core::Device::Cpu)?);
    let b = Arc::new(Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &candle_core::Device::Cpu)?);

    let start = Instant::now();
    (0..100).into_par_iter().for_each(|_| {
        let a_clone = a.clone();
        let b_clone = b.clone();
        let _ = a_clone.broadcast_add(&b_clone);
    });
    println!("broadcast add with Rayon: {:?}", Instant::now() - start);

    Ok(())
}

Rayon version: broadcast add with Rayon (100x): 781.284023ms
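For contrast with the coarse per-iteration parallelism tested above, PyTorch's CPU backend speeds up a single element-wise op by splitting it across threads (intra-op parallelism). Below is a rough sketch of that idea over plain slices using rayon's par_chunks APIs; it is only an illustration of the mechanism (assuming rayon as a dependency and a right-hand side already expanded to the full shape), not candle's API and not PyTorch's actual implementation:

use rayon::prelude::*;
use std::time::Instant;

fn main() {
    // Same element count as the issue: 32 * 630 * 12 * 32.
    let n = 32 * 630 * 12 * 32;
    let a = vec![1.0f32; n];
    let b = vec![2.0f32; n]; // assume b was already broadcast to a's shape
    let mut out = vec![0.0f32; n];

    let chunk = 64 * 1024; // grain size: each task adds one chunk of elements
    let start = Instant::now();
    for _ in 0..100 {
        out.par_chunks_mut(chunk)
            .zip(a.par_chunks(chunk))
            .zip(b.par_chunks(chunk))
            .for_each(|((o, x), y)| {
                for i in 0..o.len() {
                    o[i] = x[i] + y[i];
                }
            });
    }
    println!("intra-op parallel add (100x): {:?}", start.elapsed());
}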

However, parallelizing at such a coarse, per-iteration level doesn't make practical sense. Could you tell me whether there is an API or another method that would let a Candle Tensor reach the same efficiency as Torch in this scenario, and how Torch implements tensor broadcast_add internally? Thank you very much. @LaurentMazare
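One user-level workaround that may be worth trying (a sketch under the assumption that the per-op cost comes from the strided broadcast path, not a confirmed candle recommendation): materialize the broadcast of b once with broadcast_as + contiguous, so every addition in the loop runs on two contiguous tensors of identical shape.

use candle_core::{Device, Result, Tensor};

fn main() -> Result<()> {
    let a = Tensor::rand(0f32, 1.0, (32, 630, 12, 32), &Device::Cpu)?;
    let b = Tensor::rand(0f32, 1.0, (32, 1, 1, 32), &Device::Cpu)?;

    // One-time cost: expand b to a's shape and force a contiguous copy.
    let b_full = b.broadcast_as(a.shape())?.contiguous()?;

    let start = std::time::Instant::now();
    for _ in 0..100 {
        // Plain add on two same-shape, contiguous tensors.
        let _ = a.add(&b_full)?;
    }
    println!("add on pre-broadcast rhs (100x): {:?}", start.elapsed());

    Ok(())
}

How much this helps depends on whether the contiguous binary kernel is vectorized (and on features such as mkl or accelerate); it mainly removes the per-element strided index computation that the flame graph points at.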
