keldaris
This looks like a nice case study for when you're already using Rust for other reasons and just want to make a bit of numerical code go fast. However, as someone mostly writing C++ and Julia, this does not look promising at all - it's clear that the Julia implementation is both more elegant and faster, and it seems much easier to reproduce that result in C++ (which has no issues with compile time float constants, SIMD, GPU support, etc.) than Rust.

I've written very little Rust myself, but whenever I've tried, I've come away with a similar impression: it's just not a good fit for performant numerical computing, with seemingly basic things (proper SIMD support, const generics without weird restrictions, etc.) treated as afterthoughts. For those more up to speed on Rust development, is this impression accurate, or have I missed something and should reconsider my view?
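To make the const-generics complaint concrete, here's a sketch (stable Rust assumed; `pad` is a hypothetical helper, not from the article): arithmetic on const parameters in types is still gated behind the unstable `generic_const_exprs` feature, so the natural signature is rejected and you end up threading the derived size through as a second parameter.

```rust
// What you'd want to write, rejected on stable Rust without
// #![feature(generic_const_exprs)]:
//
//     fn pad<const N: usize>(v: [f32; N]) -> [f32; N * 2] { ... }
//
// A common stable workaround: pass the derived size as a second const
// parameter and check the relationship at runtime.
fn pad<const N: usize, const M: usize>(v: [f32; N]) -> [f32; M] {
    assert_eq!(M, N * 2, "M must be N * 2");
    let mut out = [0.0f32; M];
    out[..N].copy_from_slice(&v); // copy input into the front half
    out
}

fn main() {
    let padded: [f32; 8] = pad::<4, 8>([1.0, 2.0, 3.0, 4.0]);
    println!("{:?}", padded); // [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
}
```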

jonathrg
Tight loops of SIMD operations seem like something that might be more convenient to just implement directly in assembly? That way you don't need to babysit the compiler like this.
feverzsj
I think most Rust projects still depend on Clang's vector extensions when SIMD is required.
jtrueb
I think scientific computing in Rust is getting a lot of attention lately without much contribution. Many essential language features aren't stabilized: SIMD, generic const exprs, intrinsics, per-function optimization overrides, and reasonable fast-math-style floating point overrides are all a long way from stabilization. To get better perf, code ends up full of informal compiler hints nudging the optimizer toward things like autovectorization or branch elision. The semantics around strict floating point standards are stifling, and intrinsics have become less accessible than they used to be.

Separately, is Julia hitting a different linear-algebra backend? Rust's ndarray with blas-src on Accelerate is pretty fast, but the Rust implementation is a little slower on my MacBook. This is a benchmark of a dot product.

```

    use divan::Bencher;
    use ndarray::Array;

    const M10: usize = 10_000_000;
    #[divan::bench]
    fn ndarray_dot32(b: Bencher) {
        b.with_inputs(|| (Array::from_vec(vec![0f32; M10]), Array::from_vec(vec![0f32; M10])))
            .bench_values(|(a, b)| {
                a.dot(&b)
            });
    }

    #[divan::bench]
    fn chunks_dot32(b: Bencher) {
        b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
            .bench_values(|(a, b)| {
                a.chunks_exact(32)
                    .zip(b.chunks_exact(32))
                    .map(|(a, b)| a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>())
                    .sum::<f32>()
            });
    }

    #[divan::bench]
    fn iter_dot32(b: Bencher) {
        b.with_inputs(|| (vec![0f32; M10], vec![0f32; M10]))
            .bench_values(|(a, b)| {
                a.iter().zip(b.iter()).map(|(a, b)| a * b).sum::<f32>()
            });
    }
    
    ---- Rust ----
    Timer precision: 41 ns (100 samples)
    flops             fast    │ slow    │ median  │ mean
    ├─ chunks_dot32   3.903 ms│ 9.96 ms │ 4.366 ms│ 4.411 ms
    ├─ chunks_dot64   4.697 ms│ 16.29 ms│ 5.472 ms│ 5.516 ms
    ├─ iter_dot32     10.37 ms│ 11.36 ms│ 10.93 ms│ 10.86 ms
    ├─ iter_dot64     11.68 ms│ 13.07 ms│ 12.43 ms│ 12.4 ms
    ├─ ndarray_dot32  1.984 ms│ 2.91 ms │ 2.44 ms │ 2.381 ms
    ╰─ ndarray_dot64  4.021 ms│ 5.718 ms│ 5.141 ms│ 4.965 ms

    ---- Julia ----
    native_dot32:
    Median: 1.623 ms, Mean: 1.633 ms ± 341.705 μs
    Range: 1.275 ms - 12.242 ms

    native_dot64:
    Median: 5.286 ms, Mean: 5.179 ms ± 230.997 μs
    Range: 4.736 ms - 5.617 ms

    simd_dot32:
    Median: 1.818 ms, Mean: 1.830 ms ± 142.826 μs
    Range: 1.558 ms - 2.169 ms

    simd_dot64:
    Median: 3.564 ms, Mean: 3.567 ms ± 586.002 μs
    Range: 3.123 ms - 22.887 ms

    iter_dot32:
    Median: 9.566 ms, Mean: 9.549 ms ± 144.503 μs
    Range: 9.302 ms - 10.941 ms

    iter_dot64:
    Median: 9.666 ms, Mean: 9.640 ms ± 84.481 μs
    Range: 9.310 ms - 9.867 ms

    All: 0 bytes, 0 allocs

```

https://github.com/trueb2/flops-bench
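For anyone reproducing this: getting ndarray to dispatch `dot` to Accelerate via blas-src typically takes something like the following Cargo setup (a sketch; exact version numbers and feature names vary by release, so check the blas-src docs), plus a `use blas_src as _;` somewhere in the crate so the backend actually gets linked.

```toml
[dependencies]
# Enable ndarray's BLAS integration so dot products call into a BLAS backend
ndarray = { version = "0.15", features = ["blas"] }
# Select Apple's Accelerate framework as that backend
blas-src = { version = "0.8", features = ["accelerate"] }
```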

TinkersW
I only clicked through the slides and didn't watch the video... but ugh, all I see is scalar SSE instructions in the assembly output (the ss suffix means scalar; it would be ps if it were packed/vector).

And they are apparently relying on the compiler to generate it...just no.

Use intrinsics, it isn't that hard.
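For what it's worth, the packed-vs-scalar distinction above can be made explicit with `std::arch` intrinsics instead of hoping for autovectorization. A minimal sketch (SSE only, which is baseline on x86_64; unaligned loads, naive horizontal sum, scalar tail; not the code from the talk):

```rust
// Dot product using explicit packed SSE intrinsics (mulps/addps rather
// than the scalar mulss/addss the compiler may emit on its own).
#[cfg(target_arch = "x86_64")]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let chunks = a.len() / 4;
    unsafe {
        let mut acc = _mm_setzero_ps(); // 4-lane packed accumulator
        for i in 0..chunks {
            let va = _mm_loadu_ps(a.as_ptr().add(i * 4));
            let vb = _mm_loadu_ps(b.as_ptr().add(i * 4));
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vb)); // packed: ps, not ss
        }
        // Horizontal sum of the 4 lanes
        let mut lanes = [0f32; 4];
        _mm_storeu_ps(lanes.as_mut_ptr(), acc);
        let mut sum: f32 = lanes.iter().sum();
        // Scalar tail for lengths not divisible by 4
        for i in chunks * 4..a.len() {
            sum += a[i] * b[i];
        }
        sum
    }
}

// Portable fallback so the sketch still compiles off x86_64.
#[cfg(not(target_arch = "x86_64"))]
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a: Vec<f32> = (0..10).map(|i| i as f32).collect();
    let b = vec![2.0f32; 10];
    println!("{}", dot(&a, &b)); // 2 * (0 + 1 + ... + 9) = 90
}
```

A production version would use wider registers (AVX with runtime feature detection) and multiple accumulators to hide FMA latency, but the point stands: with intrinsics the packed instructions are guaranteed, not a compiler mood.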