camel-cdr

Tweets

camel-cdr @CamelCdr

Jun 4

I feel like overlapping the FP and integer physical register files and ports may be better than overlapping FP with SIMD, because scalar FP needs a higher issue rate and lower latency than SIMD FP.

camel-cdr @CamelCdr

Jun 4

If you have 128-bit SIMD the goals are mostly aligned, but for >=256-bit sharing FP with SIMD becomes less attractive in my mind. This also isn't all that relevant if your integer execution isn't something like 6 wide or wider.

camel-cdr @CamelCdr

May 20

Yes, this is good, more of that please. This is a lot better than the alternative of a bunch of people creating their own incompatible subsets.

Pete Cawley @corsix

May 20

Someone looked at the overlapping mess of AVX512 extensions and thought “yes, that is good, more of that please”

camel-cdr @CamelCdr

May 10

@FelixCLC_ I tested the impact of disabling RVC on the SpacemiT X100 4-wide OoO core with clang. Enabling RVC resulted in a roughly 10% performance improvement. Not sure what this actually tells us about RVC exactly, but it's certainly interesting.

camel-cdr @CamelCdr

May 10

The L1I bandwidth of the X100 is 16 bytes/cycle, so it should be able to feed the 4-wide core without compressed instructions. While the X100 supports a handful of fusion pairs, those aren't compressed-only. (bitwise bitwise, mul add, add load/store, slli sr*i)
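
The bandwidth argument is plain arithmetic. A minimal sketch, assuming the standard RISC-V encoding sizes (4 bytes uncompressed, 2 bytes for RVC) and the 16 bytes/cycle figure quoted above; the helper name is made up for illustration, and the model ignores fetch-block alignment and taken branches, which can lower the effective rate:

```python
# Back-of-the-envelope fetch bandwidth check (illustrative helper, not from the thread).
def fetch_instrs_per_cycle(fetch_bytes_per_cycle: int, instr_bytes: int) -> int:
    """Upper bound on instructions fetched per cycle, ignoring alignment and branches."""
    return fetch_bytes_per_cycle // instr_bytes


L1I_BYTES_PER_CYCLE = 16  # figure quoted for the X100
CORE_WIDTH = 4            # 4-wide OoO core

uncompressed = fetch_instrs_per_cycle(L1I_BYTES_PER_CYCLE, 4)  # 32-bit encodings only
compressed = fetch_instrs_per_cycle(L1I_BYTES_PER_CYCLE, 2)    # best case: all 16-bit RVC

# If the uncompressed bound already covers the core width, raw L1I bandwidth
# alone should not explain the speedup from enabling RVC.
print(uncompressed, compressed, uncompressed >= CORE_WIDTH)
```

Under these assumptions the uncompressed bound already matches the core width, which is why the measured gain from RVC is hard to attribute to fetch bandwidth alone.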

camel-cdr retweeted

Andreas Abel @uops_info

Mar 29

I have added latency, throughput, and port usage data for Emerald Rapids, Meteor Lake, Arrow Lake, and Zen 5 to uops.info/table.html.

camel-cdr @CamelCdr

Mar 11

The RVP spec is coming along: github.com/riscv/riscv-p-spe… Here is an untested implementation of JPEG upsample in RVP: godbolt.org/z/r5bGGPsj5 This uses the current draft intrinsics. With the overloaded ones this will be less verbose. __riscv_preinterpret is still way too long IMO.

camel-cdr @CamelCdr

Mar 1

Replying to @FUZxxl

@FUZxxl A similar idea to the fast-path branch: s1first rd, rs1, rs2 Sets rd to 1 if rs1 becomes ready first, or 0 if rs2 becomes ready first; the other dependency is discarded.

camel-cdr @CamelCdr

Mar 1

This allows you to dynamically load balance different code paths, which can be especially useful if you have a parallel problem and you want to process the elements with SIMD, but one or two simultaneously with scalar.

camel-cdr @CamelCdr

Mar 1

The problem with mixing scalar and SIMD today is that you need to be conservative with the number of scalar elements processed, because if the scalar iteration is slower than the SIMD one, the SIMD has to wait. With s1first you wouldn't have to wait if one path is faster.
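
A toy timing model can make the difference concrete. This is only a sketch under stated assumptions: s1first is a proposed, hypothetical instruction, the function names and latency numbers below are invented for illustration, and the "dynamic" variant simply hands the next piece of work to whichever path becomes ready first (ties go to SIMD, which is an arbitrary choice here):

```python
from math import ceil


def static_split(n, simd_width, t_simd, scalar_per_iter, t_scalar):
    """Fixed split: every iteration waits for both the SIMD and the scalar part."""
    per_iter = simd_width + scalar_per_iter
    iters = ceil(n / per_iter)
    # The slower of the two paths determines the iteration time.
    return iters * max(t_simd, scalar_per_iter * t_scalar)


def dynamic_split(n, simd_width, t_simd, t_scalar):
    """s1first-like split: the path that becomes ready first takes the next work."""
    simd_free = 0    # time at which the SIMD path is ready again
    scalar_free = 0  # time at which the scalar path is ready again
    remaining = n
    while remaining > 0:
        if simd_free <= scalar_free:   # SIMD ready first (assumed tie-break)
            remaining -= min(simd_width, remaining)
            simd_free += t_simd
        else:                          # scalar ready first
            remaining -= 1
            scalar_free += t_scalar
    return max(simd_free, scalar_free)


# Hypothetical numbers: 8-element SIMD iterations take 4 time units,
# a scalar element unexpectedly takes 6.
static_time = static_split(64, 8, 4, 1, 6)
dynamic_time = dynamic_split(64, 8, 4, 6)
print(static_time, dynamic_time)
```

In the fixed split, a scalar element that turns out slower than the SIMD iteration stalls every iteration; in the dynamic model the SIMD path keeps going and the scalar path only contributes when it is actually free.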

camel-cdr @CamelCdr

Feb 21

gist.github.com/camel-cdr/bd…

Visualizing the RISC-V Instruction Set

camel-cdr @CamelCdr

Feb 21

here are some unratified extensions:
* red: Zibi (branch immediate)
* green: additional V extensions (zvabd, zvzip, various dot product things)
* blue: current IME and SiFive AME proposal
* yellow: P proposal

This uses a slightly different bit order.

camel-cdr @CamelCdr

Feb 19

for (i in 1 << popcount(mask))
    assert pdep(i, mask) == x & mask;
    x = (x | ~mask) + 1;
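
The pseudocode above can be checked in software. A minimal sketch, assuming x starts at 0 and using a hand-written bit-by-bit pdep (a stand-in for the hardware instruction, written here for illustration); x is masked on every step so the Python integer stays non-negative:

```python
def pdep(src: int, mask: int) -> int:
    """Software parallel bit deposit: scatter the low bits of src to the set bits of mask."""
    result = 0
    bit = 0  # index of the next low bit of src to deposit
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of mask
        if (src >> bit) & 1:
            result |= lowest
        mask &= mask - 1           # clear that bit
        bit += 1
    return result


def enumerate_submasks(mask: int) -> list[int]:
    """Walk all subsets of mask in increasing order via the (x | ~mask) + 1 step."""
    out = []
    x = 0
    for _ in range(1 << bin(mask).count("1")):
        out.append(x)
        # Setting all non-mask bits lets the +1 carry ripple to the next mask bit.
        x = ((x | ~mask) + 1) & mask
    return out


mask = 0b1010
for i, x in enumerate(enumerate_submasks(mask)):
    assert pdep(i, mask) == x
print(enumerate_submasks(mask))
```

The increment only needs an OR, an add and an AND, so it visits the same values as pdep(i, mask) without needing a pdep instruction.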

camel-cdr @CamelCdr

Feb 15

escholarship.org/content/qt0…

camel-cdr retweeted

Yossi Oren יוסי אורן @yossioren

Feb 3

Today at uASC'26 we introduced uops-again.info, a website documenting corner-case behaviours of port assignment on Intel processors. Joint work with Yarin Oziel, Tomer Laor, Shlomi Levy, @BloodyTangerine , Yossi Oren, @ThomasRokicki and Gabriel Scalosub

camel-cdr @CamelCdr

Jan 12

Another bad simdjson RVV PR... OK, ok, I'll quickly do the VLS port and do the VLA one when I find the time. Having both has the advantage that

camel-cdr @CamelCdr

Jan 18

I had a bit of trouble getting VLEN=512 to work, because I didn't notice that the generic backend special-cases the LUT size on icelake in one scenario.

camel-cdr @CamelCdr

Jan 18

PR: github.com/simdjson/simdjson… Codegen looks solid: godbolt.org/z/WY68vEbhh I'm getting a 2-3x speedup on the SpacemiT X60.

camel-cdr @CamelCdr

Jan 17

And here it is: camel-cdr.github.io/rvv-benc… rvv-bench on the first RVA23 hardware, which runs Ubuntu 26.04 btw. I don't personally have access, but sanderjo ran it for me. Some of the dav1d folks also have access and have started testing their optimizations.

Longhorn @never_released

Jan 13

> Fully RVA22 compliant, and “Compliant with RVA23 excluding V extension.” Where is the RVA23 hardware? Is RVV even remotely good enough to justify making it mandatory?
