🐘 @camelcdr@tech.lgbt

Joined March 2024
I feel like overlapping the FP and integer physical register files and ports may be better than overlapping FP with SIMD, because scalar FP needs a higher issue rate and lower latency than SIMD FP.
If you have 128-bit SIMD, the goals are mostly aligned, but for >=256-bit, sharing FP with SIMD becomes less attractive in my mind. This also isn't all that relevant unless your integer execution is something like 6-wide or wider.
Yes, this is good, more of that please. This is a lot better than the alternative of a bunch of people creating their own incompatible subsets.
> Someone looked at the overlapping mess of AVX512 extensions and thought “yes, that is good, more of that please”
@FelixCLC_ I tested the impact of disabling RVC on the SpacemiT X100 4-wide OoO core with clang. Enabling RVC resulted in a roughly 10% performance improvement. Not sure what this actually tells us about RVC exactly, but it's certainly interesting.
The L1I bandwidth of the X100 is 16 bytes/cycle (four uncompressed 4-byte instructions), so it should be able to feed the 4-wide core without compressed instructions. While the X100 supports a handful of fusion pairs, those aren't compressed-only. (bitwise bitwise, mul add, add load/store, slli sr*i)
camel-cdr retweeted
I have added latency, throughput, and port usage data for Emerald Rapids, Meteor Lake, Arrow Lake, and Zen 5 to uops.info/table.html.

The RVP spec is coming along: github.com/riscv/riscv-p-spe… Here is an untested implementation of JPEG upsample in RVP: godbolt.org/z/r5bGGPsj5 This uses the current draft intrinsics. With the overloaded ones this will be less verbose. __riscv_preinterpret is still way too long IMO.

Replying to @FUZxxl
@FUZxxl A similar idea to the fast-path branch: s1first rd, rs1, rs2. Sets rd to 1 if rs1 becomes ready first, or 0 if rs2 becomes ready first; the other dependency is discarded.
This allows you to dynamically load-balance different code paths, which can be especially useful if you have a parallel problem and you want to process the elements with SIMD but one or two simultaneously with scalar.
The problem with mixing scalar and SIMD today is that you need to be conservative with the number of scalar elements processed, because if the scalar iteration is slower than the SIMD one, the SIMD has to wait. With s1first you wouldn't have to wait if one path is faster.
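To make the waiting argument concrete, here is a toy timing model in Python. It is only a sketch under loud assumptions: s1first is a proposed, hypothetical instruction, the function names and all latencies are made up for illustration, and the model ignores everything a real out-of-order core does beyond "each path is busy for a fixed number of cycles per unit of work".

```python
import math

def static_time(n, vl, t_simd, t_scalar, k):
    """Fixed split (assumed model): every iteration runs one SIMD chunk of
    vl elements and k scalar elements side by side, then waits for the
    slower of the two before starting the next iteration."""
    iterations = math.ceil(n / (vl + k))
    return iterations * max(t_simd, k * t_scalar)

def dynamic_time(n, vl, t_simd, t_scalar):
    """Greedy split (assumed model): whichever path becomes free first
    takes the next piece of work, which is the kind of decision an
    s1first-style instruction would expose to software."""
    simd_free = 0      # cycle at which the SIMD path is next idle
    scalar_free = 0    # cycle at which the scalar path is next idle
    remaining = n
    while remaining > 0:
        if simd_free <= scalar_free:
            remaining -= min(vl, remaining)  # SIMD takes up to vl elements
            simd_free += t_simd
        else:
            remaining -= 1                   # scalar takes one element
            scalar_free += t_scalar
    return max(simd_free, scalar_free)
```

With a fixed split, guessing k too high makes every iteration wait on the scalar side; in the greedy version a slow scalar path simply ends up taking fewer elements.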
Here are some unratified extensions:
* red: Zibi (branch immediate)
* green: additional V extensions (zvabd, zvzip, various dot product things)
* blue: current IME and SiFive AME proposal
* yellow: P proposal
This uses a slightly different bit order.
for (i in 1 << popcount(mask))
	assert pdep(i, mask) == x & mask;
	x = (x | ~mask) + 1;
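A minimal Python sketch that checks this loop, assuming the update step is x = (x | ~mask) + 1 and that x starts at 0. The pdep here is a plain software model of parallel bit deposit, not a hardware intrinsic, and check_subset_walk and its width parameter are names introduced only for this illustration (mask is assumed to fit in width bits).

```python
def pdep(value: int, mask: int) -> int:
    """Software model of parallel bit deposit: scatter the low bits of
    value, in order, to the positions of the set bits in mask."""
    result = 0
    src_bit = 0   # index of the next bit to take from value
    pos = 0       # bit position currently examined in mask
    while mask >> pos:
        if (mask >> pos) & 1:
            if (value >> src_bit) & 1:
                result |= 1 << pos
            src_bit += 1
        pos += 1
    return result

def check_subset_walk(mask: int, width: int = 16) -> int:
    """Walk all subsets of mask with the carry trick and assert that the
    i-th subset equals pdep(i, mask). Returns the number of subsets."""
    full = (1 << width) - 1
    count = 1 << bin(mask).count("1")   # 1 << popcount(mask)
    x = 0
    for i in range(count):
        assert pdep(i, mask) == x & mask
        # Setting all non-mask bits lets the +1 carry ripple across the
        # gaps, so only the bits inside mask behave like a counter.
        x = ((x | (~mask & full)) + 1) & full
    return count
```

Calling check_subset_walk(0b1011010) runs the assertion for every subset of that mask and returns how many subsets were visited.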
camel-cdr retweeted
Today at uASC'26 we introduced uops-again.info, a website documenting corner-case behaviours of port assignment on Intel processors. Joint work with Yarin Oziel, Tomer Laor, Shlomi Levy, @BloodyTangerine, Yossi Oren, @ThomasRokicki and Gabriel Scalosub
Another bad simdjson RVV PR... OK, ok, I'll quickly do the VLS port and do the VLA one when I find the time. Having both has the advantage that
I had a bit of trouble getting VLEN=512 to work, because I didn't notice that the generic backend special-cases the LUT size on icelake in one scenario.
PR: github.com/simdjson/simdjson… Codegen looks solid: godbolt.org/z/WY68vEbhh I'm getting a 2-3x speedup on the SpacemiT X60.
And here it is: camel-cdr.github.io/rvv-benc… rvv-bench on the first RVA23 hardware, which runs Ubuntu 26.04 btw. I don't personally have access, but sanderjo ran it for me. Some of the dav1d folks also have access and have started testing their optimizations.

> Fully RVA22 compliant, and “Compliant with RVA23 excluding V extension.” Where is the RVA23 hardware? Is RVV even remotely good enough to justify making it mandatory?