LLVM 16.0.0 was just released today, and as I did for LLVM 15, I wanted to highlight some of the RISC-V specific changes and improvements. This is very much a tour of a chosen subset of additions rather than an attempt to be exhaustive. If you're interested in RISC-V, you may also want to check out my recent attempt to enumerate the commercially available RISC-V SoCs and if you want to find out what's going on in LLVM as a whole on a week-by-week basis, then I've got the perfect newsletter for you.
In case you're not familiar with LLVM's release schedule, it's worth noting that there are two major LLVM releases a year (i.e. one roughly every 6 months) and these are timed releases as opposed to being cut when a pre-agreed set of feature targets have been met. We're very fortunate to benefit from an active and growing set of contributors working on RISC-V support in LLVM projects, who are responsible for the work I describe below - thank you! I coordinate biweekly sync-up calls for RISC-V LLVM contributors, so if you're working in this area please consider dropping in.
LLVM 16 is the first release featuring a user guide for the RISC-V target (16.0.0 version, current HEAD). This fills a long-standing gap in our documentation, whereby it was difficult to tell at a glance the expected level of support for the various RISC-V instruction set extensions (standard, vendor-specific, and experimental extensions of either type) in a given LLVM release. We've tried to keep it concise but informative, adding a brief note to describe any known limitations that end users should know about. Thanks again to Philip Reames for kicking this off, and to the reviewers and contributors for ensuring it's kept up to date.
LLVM 16 was a big release for vectorisation. As well as a long-running strand of work making incremental improvements (e.g. better cost modelling) and fixes, scalable vectorization was enabled by default. This allows LLVM's loop vectorizer to use scalable vectors when profitable. Follow-on work enabled support for loop vectorization using fixed length vectors and disabled vectorization of epilogue loops. See the talk optimizing code for scalable vector architectures (slides) by Sander de Smalen for more information about scalable vectorization in LLVM and introduction to the RISC-V vector extension by Roger Ferrer Ibáñez for an overview of the vector extension and some of its codegen challenges.
The RISC-V vector intrinsics supported by Clang have changed (to match e.g. this and this) during the 16.x development process in a backwards incompatible way, as the RISC-V Vector Extension Intrinsics specification evolves towards a v1.0. In retrospect, it would have been better to keep the intrinsics behind an experimental flag when the vector codegen and MC layer (assembler/disassembler) support became stable, and this is something we'll be more careful of for future extensions. The good news is that thanks to Yueh-Ting Chen, headers are available that provide the old-style intrinsics mapped to the new version.
I refer to 'experimental' support many times below. See the documentation on experimental extensions within RISC-V LLVM for guidance on what that means. One point to highlight is that the extensions remain experimental until they are ratified, which is why some extensions on the list below are 'experimental' despite the fact the LLVM support needed is trivial. On to the list of newly added instruction set extensions:
These include the compressed floating-point load/store instructions (c.flw, c.flwsp, c.fsw, c.fswsp) and their double-precision counterparts (c.fld, c.fldsp, c.fsd, c.fsdsp).
LLDB has started to become usable for RISC-V in this period due to work by contributor 'Emmer'. As they summarise here, LLDB should be usable for debugging RV64 programs locally but support is lacking for remote debug (e.g. via the gdb server protocol). During the LLVM 16 development window, LLDB gained support for software single stepping on RISC-V, and support in EmulateInstructionRISCV for RV{32,64}I as well as the A and M, C, RV32F and RV64F, and D extensions.
Another improvement that's fun to look more closely at is support for "short forward branch optimisation" for the SiFive 7 series cores. What does this mean? Well, let's start by looking at the problem it's trying to solve. The base RISC-V ISA doesn't include conditional moves or predicated instructions, which can be a downside if your code features unpredictable short forward branches (with the ensuing cost in terms of branch mispredictions and bloating branch predictor state). The ISA spec includes commentary on this decision (page 23), noting some disadvantages of adding such instructions to the specification and noting microarchitectural techniques exist to convert short forward branches into predicated code internally. In the case of the SiFive 7 series, this is achieved using macro-op fusion where a branch over a single ALU instruction is fused and executed as a single conditional instruction.
In the LLVM 16 cycle, compiler optimisations targeting this microarchitectural feature were enabled for conditional move style sequences (i.e. branch over a register move) as well as for other ALU operations. The job of the compiler here is of course to emit a sequence compatible with the micro-architectural optimisation when possible and profitable. I'm not aware of other RISC-V designs implementing a similar optimisation - although there are developments in terms of instructions to support such operations directly in the ISA which would avoid the need for such microarchitectural tricks. See XVentanaCondOps, XTheadCondMov, the previously proposed but now abandoned Zbt extension (part of the earlier bitmanip spec) and more recently the proposed Zicond (integer conditional operations) standard extension.
It's perhaps not surprising that code generation for atomics can be tricky to understand, and the LLVM documentation on atomics codegen and libcalls is actually one of the best references on the topic I've found. A particularly important note in that document is that if a backend supports any inline lock-free atomic operations at a given size, all operations of that size must be supported in a lock-free manner. If targeting a RISC-V CPU without the atomics extension, all atomic operations would usually be lowered to __atomic_* libcalls. But if we know a bit more about the target, it's possible to do better - for instance, a single-core microcontroller could implement an atomic operation in a lock-free manner by disabling interrupts (and conventionally, lock-free implementations of atomics are provided through __sync_* libcalls). This kind of setup is exactly what the +forced-atomics feature enables, where atomic load/store can be lowered to a load/store with appropriate fences (as is supported in the base ISA) while other atomic operations generate a __sync_* libcall.
There's also been a very minor improvement for targets with native atomics support (the 'A' instruction set extension) that I may as well mention while on the topic. As you might know, atomic operations such as compare and swap are lowered to an instruction sequence involving lr.{w,d} (load reserved) and sc.{w,d} (store conditional). There are very specific rules about these instruction sequences that must be met to align with the architectural forward progress guarantee (section 8.3, page 51), which is why we expand to a fixed instruction sequence at a very late stage in compilation (see the original RFC). This means the sequence of instructions implementing the atomic operation is opaque to LLVM's optimisation passes and is treated as a single unit. The obvious disadvantage of avoiding LLVM's optimisations is that sometimes there are optimisations that would be helpful and wouldn't break that forward-progress guarantee. One that came up in real-world code was the lack of branch folding, which would have simplified a branch in the expanded cmpxchg sequence that just targets another branch with the same condition (by just folding in the eventual target). With some relatively simple logic, this suboptimal codegen is resolved.
; Before => After
.loop: => .loop:
lr.w.aqrl a3, (a0) => lr.w.aqrl a3, (a0)
bne a3, a1, .afterloop => bne a3, a1, .loop
sc.w.aqrl a4, a2, (a0) => sc.w.aqrl a4, a2, (a0)
bnez a4, .loop => bnez a4, .loop
.afterloop: =>
bne a3, a1, .loop =>
ret => ret
As you can imagine, there's been a lot of incremental minor improvements over the past ~6 months. I unfortunately only have space (and patience) to highlight a few of them.
A new pre-regalloc pseudo instruction expansion pass was added in order to allow optimising global address access instruction sequences, such as those found in the medany code model (and it was later broadened further). This results in improvements such as the following (note: this transformation was already supported for the medlow code model):
; Before => After
.Lpcrel_hi1: => .Lpcrel_hi1:
auipc a0, %pcrel_hi(ga) => auipc a0, %pcrel_hi(ga+4)
addi a0, a0, %pcrel_lo(.Lpcrel_hi1) =>
lw a0, 4(a0) => lw a0, %pcrel_lo(.Lpcrel_hi1)(a0)
A missing target hook (isUsedByReturnOnly) had been preventing tail calling libcalls in some cases. This was fixed, and later support was added for generating an inlined sequence of instructions for some of the floating point libcalls.
The RISC-V compressed instruction set extension defines a number of 16-bit encodings that map to a longer 32-bit form (with restrictions on addressable registers in the compressed form, of course). The conversion of 32-bit instructions to their 16-bit forms, when possible, happens at a very late stage, after instruction selection. But of course over time, we've introduced more tuning to influence codegen decisions in cases where a choice can be made to produce an instruction that can be compressed, rather than one that can't. A recent addition to this was the RISCVStripWSuffix pass, which for RV64 targets will convert addw and slliw to add or slli respectively when it can be determined that all users of the result only use the lower 32 bits. This is a minor code size saving, as slliw has no matching compressed instruction and c.addw can address a more restricted set of registers than c.add.
At the risk of repeating myself, this has been a selective tour of some additions I thought it would be fun to write about. Apologies if I've missed your favourite new feature or improvement - the LLVM release notes will include some things I haven't had space for here. Thanks again to everyone who has been contributing to make RISC-V support in LLVM even better.
If you have a RISC-V project you think my colleagues and I at Igalia may be able to help with, then do get in touch regarding our services.