
FEX-Emu, which enables the execution of x86 and x86-64 binaries on an AArch64 host, has been updated.



FEX-2303

Read the blog post at FEX-Emu's site!

This month's code changes

With that out of the way, onward to this month's changes.

Optimize REP STOS instruction into inline memset

x86 offers the REP STOS instruction, which behaves similarly to a memory set operation. It differs slightly in that it lets you set the memory by
element size, and you can also choose the direction in which the memory is set. In particular, this instruction tends to get used for
zeroing out memory. The latest x86 CPUs have even optimized this instruction to be as fast as possible. Previously FEX decomposed this instruction
into a complex series of code blocks that was inefficient for our JIT and everything surrounding it. Now we instead convert it to a single IR
operation called MemSet, which exposes the semantics of how the instruction works. This allows our IR to be cleaner and the backend to decompose it in
a more optimal fashion. Currently we emit a fairly trivial loop that handles this memory set operation. ARM has recently announced that future CPUs
are going to support a memory set instruction that is very similar to the 8-bit REP STOS, which will make this implementation even faster!
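
To make the semantics concrete, here is a minimal C++ reference model of what REP STOS does: it stores a number of fixed-size elements, walking forward or backward depending on the direction flag. This is only an illustrative sketch and not FEX's MemSet implementation. The function name `rep_stos_model`, its parameters, and the choice to address memory as a base pointer plus an offset are assumptions made for this example, and it assumes a little-endian host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical reference model of REP STOS (not FEX code).
//   mem:       base of the memory being modelled
//   rdi:       offset of the first element to write (stands in for guest RDI)
//   value:     stands in for guest RAX; only the low elem_size bytes are stored
//   count:     number of elements to store (stands in for guest RCX)
//   elem_size: 1, 2, 4 or 8 bytes (STOSB / STOSW / STOSD / STOSQ)
//   backward:  true when the direction flag (DF) is set
// Returns the offset RDI would hold after the instruction completes.
inline std::ptrdiff_t rep_stos_model(std::uint8_t* mem, std::ptrdiff_t rdi,
                                     std::uint64_t value, std::size_t count,
                                     std::size_t elem_size, bool backward) {
  assert(elem_size == 1 || elem_size == 2 || elem_size == 4 || elem_size == 8);

  const std::ptrdiff_t step = backward ? -static_cast<std::ptrdiff_t>(elem_size)
                                       : static_cast<std::ptrdiff_t>(elem_size);
  for (std::size_t i = 0; i < count; ++i) {
    // Store the low elem_size bytes of value (little-endian host assumed).
    std::memcpy(mem + rdi, &value, elem_size);
    rdi += step;
  }
  return rdi;
}
```

With `elem_size == 1`, `backward == false` and `value == 0`, the loop degenerates into a plain forward byte-wise zeroing, which is the common case mentioned above.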

As seen by this graph, FEX is nowhere near a native implementation. It's important to note that even without writing "optimal" codegen, this change
has still given FEX up to an 11% performance improvement on its implementation. This work was primarily focused on improving the IR, so we can now
optimize the code that the JIT emits significantly more easily! Getting closer to native is likely something to come in the
future.

Add config option to hide hypervisor CPUID bit

We encountered the first game that has anti-virtual-machine code and refuses to run if it thinks it is running in a VM. While FEX isn't a virtual
machine, we expose this CPUID bit so software that cares can use it as a hint to query FEX-specific CPUID information. Now that this game has stumbled
upon this issue, we added a configuration profile to disable this CPUID bit for the game. If any other games also pick up on this issue then we will
need more profiles.
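
For readers unfamiliar with the bit in question: on x86, CPUID leaf 1 reports "hypervisor present" in bit 31 of ECX, and this is what anti-VM checks commonly test. The sketch below shows the idea of masking that bit behind a switch. It is an illustration only; `guest_leaf1_ecx`, `looks_like_vm`, and the `hide_hypervisor` parameter are hypothetical names standing in for FEX's actual option and code, which are not shown here.

```cpp
#include <cassert>
#include <cstdint>

// CPUID leaf 1 reports "hypervisor present" in bit 31 of ECX.
constexpr std::uint32_t kHypervisorBit = 1u << 31;

// Hypothetical helper: computes the leaf 1 ECX value the guest would see.
// hide_hypervisor stands in for the new config option (real name not shown here).
constexpr std::uint32_t guest_leaf1_ecx(std::uint32_t feature_bits, bool hide_hypervisor) {
  return hide_hypervisor ? (feature_bits & ~kHypervisorBit)
                         : (feature_bits | kHypervisorBit);
}

// The kind of test anti-VM code typically performs on the guest side.
constexpr bool looks_like_vm(std::uint32_t leaf1_ecx) {
  return (leaf1_ecx & kHypervisorBit) != 0;
}
```

The trade-off is that software which relies on the bit as a hint to look for emulator-specific CPUID information will no longer find it, which is presumably why this is enabled per game through a profile rather than globally.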

Proton and pressure-vessel startup optimizations

One of this month's efforts has been improving the time it takes for Proton to start up. pressure-vessel is the project that is used to set up the
Proton execution environment, which takes a while overall. One of the hardest things about Proton is that it executes thousands of programs and does an
absolute ton of filesystem accesses. ARM devices typically don't have the highest performance filesystems, which makes one part of this hard, and
FEX's filesystem overlay adds further overhead. Additionally, one of FEX's current shortcomings is that every application must JIT fresh
code every time it starts. Since pressure-vessel starts so many programs, a lot of the time is spent just emitting code to memory. A few
optimizations this month went towards making this faster.

With a couple of optimizations in place we managed to shave a second off of the start-up time, cutting the execution from 9.7 seconds down to 8.7
seconds. On an Apple M1, execution is now down to 7 seconds. Almost all of this time improvement comes from faster syscall
wrapping, and the remaining CPU time is code JIT and execution. It'll only get faster in the future!

Fix a race condition with syscall emulation

While this is a fairly minor change, we fixed a race condition around system calls which would consistently cause crashes when Steam was starting up.
Every piece of work that improves stability just makes the whole emulation experience so much better and needs to be celebrated!

Signal frame improvements!

A significant problem with using FEX is the debugging experience when something breaks. We spent a good amount of time this month improving how FEX
sets up its signal frames when the guest application hits a fault. Since we weren't following traditional signal frame generation, tooling around
backtracing was broken in most cases. We have now reworked this so that libSegFault works under FEX and can give a backtrace of the application's
state when it crashes.

We will be shipping a new rootfs which includes x86 and x86-64 libraries for libSegFault so that if users want to debug a crashing application, they
can try and get a backtrace.

AVX work continues

Another month, another bunch of AVX work that has been implemented.

Instructions implemented

  • VPHSUBSW
  • VHSUBPD/VHSUBPS
  • VPERMILPD/VPERMILPS
  • VPERMD/VPERMPS
  • VPHADDSW
  • VPTEST
  • VMOVSD/VMOVSS
  • VSHUFPD/VSHUFPS
  • VPSHUFD/VPSHUFHW/VPSHUFLW
  • VPSHUFB
  • VPALIGNR
  • VEXTRACTF128/VEXTRACTI128
  • VPBLENDVB/VBLENDVPD/VBLENDVPS
  • VBLENDPD/VPBLENDW

As you can see, a lot of new instructions are now implemented. This leaves us with about thirty more instructions that need to be implemented
before we can start advertising the features on hardware that supports 256-bit SVE2. This is significant as we keep finding more and more games that
require AVX to run.

ARM emitter cleanups

Another change that isn't user facing, but it is always nice to point out janitorial tasks that have been done. When we switched over to using our
own code emitter there were some design choices and implementations that weren't quite optimal. This usually shows up as developer pain when using
the emitter, but it was a necessary evil since we wanted to get rid of VIXL's assembler as fast as possible. @Lioncache
spent some time this month cleaning up a lot of the dirty code in the emitter, in some cases making it slightly faster as well. This is always greatly
appreciated as it reduces maintenance burden when working in the JIT.

They also implemented an absolute ton of new instruction emitter functions which previously didn't exist. While we don't use these yet, we will likely
use them at some point which will make our lives easier in the future.

New development machines for our developers

Just recently a new Snapdragon laptop has gotten working OpenGL and Vulkan drivers up and running! We are gifting each of our developers one of these
great machines in order to ensure we have testing platforms for all the OpenGL 4, DXVK, and VKD3D applications we want to be running! Kudos to all the
developers that worked on bringing this hardware up so quickly!

Raw Changes

  • ARMEmitter
      • Tidy up some assertion handling (e7069f9)
      • Remove predicate implicit conversion operators (41731e2)
      • Make second sxtw parameter a WRegister (e71e3ec)
      • Remove implicit conversions from Register/XRegister/WRegister (378e069)
      • Remove predicate uint32_t conversion operators (e869b2f)
      • Remove most implicit conversion operators for vector register types (0f45318)
      • Make VRegister constructor explicit (21fbcef)
      • Handle sequential registers in lists nicer (ef02083)
      • Simplify size handling Advanced SIMD 3 different group (24904f4)
      • Simplify advanced SIMD copy (e65b429)
      • Centralize handling for unsigned offset load-stores (1832cc8)
      • Handle SVE Integer Compare - Scalars group (fe1faf9)
      • Finish off SVE Predicate Misc group (165db37)
      • Handle SVE partition break categories (4d65521)
      • Handle SVE integer compare with wide elements category (0a8fc2c)
      • Finish off SVE Permute Vector - Predicated group (a4c694f)
      • Handle SVE index generation category (e4488b0)
      • Handle a few more vector permutation categories (9c256bf)
      • Handle SVE2 Accumulate category (8c8b680)
      • Finish off SVE Misc category (2bd64ad)
      • Handle CPY (scalar) and CPY (SIMD&FP, scalar) (3c1ba84)
      • Handle predicated wide shifts (dd2e70e)
      • Handle unpredicated wide shifts and unpredicated shifts by immediates (5fd68b6)
      • Handle SVE2 saturating add/subtract category (cb3cfed)
      • Centralize instruction handling for a few categories (582108a)
      • Fixes some warnings that cropped up. (5da90aa)
      • Handle SVE SQDMULH/SQRDMULH (vector) (1089987)
      • Handle ADDVL/ADDPL and RDVL (d810974)
      • Handle MLA/MLS (vector) and MAD/MSB (347abf0)
      • Handle SVE predicated mul/div and finish off integer reduction category (c0bc5d9)
  • ASIMDOps
      • Amend a few error logs (f7f2dc2)
  • Arm64
      • Fixes a race condition on syscall spilling SRA (f6e2fe1)
  • VectorOps
      • Use SVE only with 256-bit op sizes (2f260ae)
      • Use movprfx with VBSL (e8fd8ef)
  • Arm64Emitter
      • Use bit utils wrapper over __builtin_ffs (9e01730)
  • CPUID
      • Adds an config option to hide hypervisor bit (77fad28)
  • Core
      • Support Data in JIT buffer header (c7c47a8)
  • Dispatcher
      • Fixes crash with misalign stack returning from signal (d688026)
      • Support reconstructing RIP from block entry (66d879f)
      • Fixes guest stack register usage (81a89ab)
      • Minor flags optimization (6047ca9)
  • ELFCodeLoader
      • Adds an option to inject libSegFault (a5762b6)
  • Emitter
  • ALUOps
      • Fix typos in log messages (f2aa002)
  • EmulatedFiles
      • Optimize openat handler (545a216)
  • FEXBash
      • Move to Tools folder (ef6f5d2)
  • FEXCore
      • Removes C wrapper interface (f71f244)
  • FEXRootFSFetcher
      • Update link to rootfs links file (308fa76)
  • FEXServer
      • Change systemd service environment variable key (d2e0adf)
  • FEXServerClient
      • Fixes instance where FEXServer can create a zombie (70aefc9)
  • FileManagement
      • Fixes Proton (2fca207)
      • Skip opening emulated writable files (e310e29)
      • Optimize GetEmulatedFDPath with an FD! (55d3edb)
  • IR
      • Add VDupFromGPR (b329442)
      • Allow specifying register size for AES enc/dec ops and PCLMUL (8689038)
  • JIT
      • Adds a JIT data header and tail. (60b76f5)
  • OpcodeDispatcher
      • Optimize REP STOS to MemSet operation (fc38df2)
      • Restrict partial XMM stores to FPRs in StoreResult_WithOpSize (b39a882)
      • Remove now unused _VDupElement path in LoadSource_WithOpSize (4d25de3)
      • Handle VPHSUBSW (9b23ae9)
      • Share MOVHPD implementation with MOVHPS (f951a40)
      • Handle alignment for MOVAPS a little better (68b2072)
      • Handle VHSUBPD/VHSUBPS (9b12335)
      • Handle register variants of VPERMILPD/VPERMILPS (618f5bb)
      • Handle VPERMD/VPERMPS (645f40b)
      • Handle VPHADDSW (268dedd)
      • Optimize ALUOp handler (65b2da2)
      • Handle VPTEST (a90f536)
      • Use VectorZero over VectorImm in InsertPSOpImpl (ab03e59)
      • Handle VMOVSD/VMOVSS (25f0a03)
      • Handle VPMADDWD (efafe0e)
      • Handle VSHUFPD/VSHUFPS (35746c7)
      • Handle VPSHUFD/VPSHUFHW/VPSHUFLW (3ac7b2c)
      • Handle VPSHUFB (d408129)
      • Handle VPALIGNR (a96ad0f)
      • Handle VEXTRACTF128/VEXTRACTI128 (e6fc159)
      • Handle VPBLENDVB/VBLENDVPD/VBLENDVPS (4ef3066)
      • Handle VBLENDPD/VPBLENDW (e255f1c)
  • Scripts
      • Update fit_native script for X1C/A78C (11c8db5)
  • Syscalls
      • Renamed fstatat64 to fstatat_64 (c4b66b4)
  • VEXTables
      • Remove VPERMIL2PD and VPERMIL2PS entries (5f574fb)
  • VectorOps
      • Remove unnecessary mov in VUShrNI2/VSQXTN2/VSQXTUN2 (b5bc8cd)
      • Only use VBSL 256-bit path if SVE is present (86a6118)
  • Misc
      • Support user supplied signal restorer. (143ef57)
      • Fix SDL2 directfb includes under Alpine Linux (3bc722c)



Release FEX-2303 · FEX-Emu/FEX