Nerds taming the green dragon with SCALE, our framework for compiling CUDA codebases for AMD GPUs, with support for more accelerated platforms coming soon.
A lot of people assume that going cross-vendor means settling for lowest-common-denominator performance.
@SpectralMichael breaks down why that's wrong in his latest blog post:
Generality is free when your abstraction sits at the right level. Dive into Part 4 of my series, "Why hardware-agnostic isn't the same as lowest-common-denominator": scale-lang.com/posts/2026-06โฆ
๐ธ๐ฌ Spectral is at @superai_conf 2026 โ Marina Bay Sands, Singapore, 10โ11 June.
Singapore is one of the fastest-moving AI hubs in the world, and the conversations here about compute capacity and GPU vendor optionality are exactly the ones we care about.
If you're around, let's talk.
SCALE compiles unmodified CUDA code natively for AMD and NVIDIA. The numbers:
โก Up to 33.8ร faster than HIP on AMD MI300X
โก Up to 9% faster than nvcc on NVIDIA B300
One codebase. Any GPU. No rewrites.
See you in Hamburg ๐
Attendees were especially interested in the performance boost and better-than-native developer experience they can unlock with scale-lang.com - while expanding their compute capacity without being locked to a single vendor.
COMPUTEX always delivers on serendipity: the unplanned hallway chats turn into the most valuable ones. Heading home with a full notebook and a longer list of people to follow up with.
Until next time. ๐
Cross-architecture from a single codebase is exactly why we built SCALE. Thrilled to see @AtlasInference getting this running! More performance optimizations for both @AMD and @nvidia are on the way.
scale-lang.com
Atlas Inference is running Qwen3.6-27B on AMD Strix Halo ๐ฅณ
Using @SpectralCom's SCALE ROCm backend, our CUDA kernels compile and run on RDNAโ๏ธ
Cross-architecture inference from ONE codebase ๐ฃ๏ธ
Thank you @AIatAMD for the gift ๐
POC โ excited to keep tuning performanceโก๏ธ
๐ฆ๐ฝ๐ฒ๐ฐ๐๐ฟ๐ฎ๐น ๐๐ผ๐บ๐ฝ๐๐๐ฒ ๐ถ๐ ๐ป๐ผ๐ ๐ฝ๐ฎ๐ฟ๐ ๐ผ๐ณ ๐๐ต๐ฒ ๐ก๐ฉ๐๐๐๐ ๐๐ป๐ฐ๐ฒ๐ฝ๐๐ถ๐ผ๐ป ๐ฝ๐ฟ๐ผ๐ด๐ฟ๐ฎ๐บ. ๐ฉ
Inception is @nvidia's program for AI startups - a membership that gives access to technical resources, preferred pricing on NVIDIA hardware and software, and exposure to a global network of investors and partners.
CUDA is the de-facto standard for AI developers, and weโre honored to play our part in growing the ecosystem.
And on NVIDIA's B300 (CUDA 13), SCALE lands within a whisker of nvcc on its home turf: ๐ผ๐ป ๐ฝ๐ฎ๐ฟ ๐ผ๐ป ๐ฎ๐๐ฒ๐ฟ๐ฎ๐ด๐ฒ, ๐๐ฝ ๐๐ผ ๐ต% ๐ณ๐ฎ๐๐๐ฒ๐ฟ on individual workloads.
That's native CUDA tooling, matched and occasionally beaten, by a third-party compiler.
Everyone says CUDA can't target TPUs.
What they mean is nobody has written the compiler that raises CUDA code to something a systolic backend can consume.
Those are very different sentences.
Full post โ Part 3 of why @SpectralCom exists: tinyurl.com/mtnxcjsw
@ChrisKitching17, our CTO and co-founder of Spectral Compute, recently presented at the ๐ก๐๐ฅ ๐ฃ๐ฒ๐ฟ๐ณ๐๐ฎ๐ฏ ๐ฆ๐ฒ๐บ๐ถ๐ป๐ฎ๐ฟ hosted by NHR@FAU. The recording is now online: youtu.be/uSLD40GX5nM
The Q&A at the end is worth the watch on its own.
Attendees asked about:
โณ Whether optimized deep learning kernels flow through the same pipeline
โณ Support for Intel GPUs and emerging hardware
โณ How SCALE compares to NVIDIA's own compiler when targeting NVIDIA
โณ Adding custom accelerators (RISC-V came up) and emulating missing functionality
โณ Code-level transformations to improve occupancy on AMD
โณ Plans for OpenACC, CUDA Fortran, and OpenMP
๐จ๐ฝ ๐๐ผ ๐ฎ๐ฑ.๐ณร ๐ณ๐ฎ๐๐๐ฒ๐ฟ. Unmodified CUDA. AMD silicon.
Thanks to the @tensorwave team for benchmarking SCALE on MI355X and publishing the numbers.
Port to AMD used to mean a rewrite. Now it means a recompile.
Spectral Compute (@SpectralCom) used TensorWaveโs AMD-native infrastructure to benchmark CUDA portability and performance on @AMD Instinctโข MI355X GPUs.
See how they did it - tensorwave.com/blog/spectralโฆ
Sort the issues of almost any popular open-source CUDA project by most commented, and you'll inevitably find the exact same unresolved thread: 'Any chance of AMD support?'"
For most maintainers, that thread stays open indefinitely. Porting a complex CUDA codebase is a part-time job, and volunteer contributors rarely have one to spare.
๐ฆ๐๐๐๐ ๐๐๐ฟ๐ป๐ ๐๐ต๐ฎ๐ ๐ถ๐๐๐๐ฒ ๐ถ๐ป๐๐ผ ๐ฎ ๐ฟ๐ฒ๐ฐ๐ผ๐บ๐ฝ๐ถ๐น๐ฒ.
Our compiler toolchain is free for research and evaluation. If you maintain a CUDA project and want to finally close that thread, come say hi: discord.com/invite/KNpgGbTc3โฆ