
A new version of FEX-Emu, which allows x86 and x86-64 binaries to run on an AArch64 host, has been released. FEX-2506 introduces significant enhancements, including a 25% reduction in JIT time thanks to code buffers that are now shared between threads. Previously the JIT ran independently for each thread of the guest application, which increased memory and CPU usage. The updated system stores all JIT code in a shared code buffer region, so once any single thread has JITed a piece of code, every other thread can reuse it. This lowers memory consumption, reduces the total time spent in the JIT, and prepares FEX to start caching code to the filesystem so that it can be shared across multiple invocations of an application.

The release also includes further JIT improvements: inlining register allocation into the SSA IR, removing the hashmap from the Dead-Code-Elimination pass, constant-folding on the fly, optimizing pairs of stack pushes and pops, optimizing Xor with all -1, adding more cases for zeroing registers, optimizing X87 FTWTag generation with bit manipulation, and optimizing CDQ. It also fixes thunk callbacks that could corrupt registers and a race condition that caused invalid memory tracking.



FEX Release FEX-2506

Welcome to the second half of the year! With this release we have some big changes to talk about, so let's jump right in!

Reduce JIT time by 25% by sharing code buffers between threads

This is an absolute banger of a change from our venerable developer neobrain. They have been working towards this goal for a while, with the tricky nature of the feature making it difficult to land. Before we discuss how this improves performance, it is first necessary to discuss how FEX's JIT worked before this change.

Before this change, our JIT would execute independently for every thread that the guest application makes, without sharing those code buffers with other threads. This meant that if multiple threads executed the same code, they would all JIT it, consuming memory and taking precious CPU cycles. Additionally, if a thread exits, all of that thread's code buffer gets deleted as well and is not reused at all. Not only does this consume more memory, it's actually worse for CPUs: even if we are executing the same x86 code between threads, the ARM code is at different locations in memory, meaning we put even more pressure on our CPU's poor L2/L3 caches. This is becoming an even larger issue for newer games with multi-threaded job-queue systems, where any thread in the pool could execute a job and it becomes random chance whether the same thread ends up executing the same code. Usually this just means every thread in the pool ends up JITing all the code, so the same code gets JITed multiple times.

neobrain's change here is a fundamental shift in how FEX does its code JITing. In particular, all JIT code gets stored into a shared code buffer region, and if one thread has JITed the code, then all threads can reuse it. This means that in the ideal case only one thread ever JITs a given piece of code and all other threads benefit from it. In addition, since code is now shared between threads, the JITed code isn't lost when a thread exits, and a new thread can reuse it.
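
The lookup-or-compile idea described above can be illustrated with a small sketch. This is not FEX's implementation (FEX is written in C++ and its real code buffers are far more involved); the names `SharedCodeCache`, `lookup_or_compile` and `fake_compile` are made up for this example, and holding one lock across the whole compile is a simplification chosen for clarity.

```python
import threading


class SharedCodeCache:
    """Toy stand-in for a code buffer shared by all guest threads."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # hypothetical "JIT" callback
        self._blocks = {}              # guest address -> compiled block
        self._lock = threading.Lock()
        self.compile_count = 0         # how many times we really compiled

    def lookup_or_compile(self, guest_addr):
        # The lock is held across the compile to keep the sketch simple:
        # a second thread asking for the same address waits, then reuses.
        with self._lock:
            block = self._blocks.get(guest_addr)
            if block is None:
                block = self._compile_fn(guest_addr)
                self._blocks[guest_addr] = block
                self.compile_count += 1
            return block


def fake_compile(guest_addr):
    # Placeholder for real code generation.
    return f"host-code-for-{guest_addr:#x}"


def run_workers(cache, addresses, num_threads=8):
    # Every worker asks for every address, like a job pool where any
    # thread may end up running any piece of guest code.
    def worker():
        for addr in addresses:
            cache.lookup_or_compile(addr)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


cache = SharedCodeCache(fake_compile)
run_workers(cache, [0x1000, 0x2000, 0x3000])
print(cache.compile_count)  # 3: each address compiled once, not once per thread
```

With a per-thread cache the same run would compile each address once per thread (24 compiles for 8 threads), and everything a thread compiled would vanish when that thread exits.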

This change has some serious knock-on effects: memory usage is lower, total time spent in the JIT is lower, and it preps FEX to start caching code to the filesystem for sharing between multiple invocations of an application! So not only will memory usage of applications be lower, allowing more games to run on platforms with less RAM, but they should be faster as well since JIT time is lower and L2/L3 cache is hit less aggressively.

A pedantic edge case game called RUINER improved from around 30FPS to 60FPS, because it constantly JITs code as threads are created and destroyed quickly! In some games tested by neobrain, we see significantly less time spent in the JIT.

Go and test some games, people!

More JIT optimizations

Not to be outdone, there were more JIT optimizations this month. This includes making the JIT itself faster, and also faster generated code so performance is improved in-game. Definitely go and look at the pull requests for these to know more, because walking through each individual change would take all day.

  • Inline register-allocation into SSA IR
  • Stop using a hashmap in Dead-Code-Elimination pass
  • Constant-fold on the fly
  • Optimize pairs of stack pushes and pops
  • Optimize Xor with all -1
  • Add more cases for zeroing registers
  • Optimize X87 FTWTag generation using fancy bit-twiddling techniques
  • Optimize CDQ
  • Fix for thunk callbacks potentially corrupting registers
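
To give a flavour of two of the items above (constant folding on the fly, and Xor with all -1), here is a generic sketch of an IR emitter that folds constants at emit time and turns an Xor against an all-ones constant into a single Not. It is only an illustration of the general technique: the `Emitter`, `Const` and opcode names are invented for this example and are not FEX's actual IR.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1  # all ones in a 64-bit register, i.e. -1


@dataclass(frozen=True)
class Const:
    value: int


def const(value):
    # Wrap to 64 bits so that const(-1) is the all-ones pattern.
    return Const(value & MASK64)


class Emitter:
    """Hypothetical IR builder; each emitted op gets an SSA-style id."""

    def __init__(self):
        self.ops = []

    def _emit(self, *op):
        self.ops.append(op)
        return len(self.ops) - 1

    def load_reg(self, name):
        return self._emit("load_reg", name)

    def add(self, a, b):
        # Fold immediately when both inputs are known; no op is emitted.
        if isinstance(a, Const) and isinstance(b, Const):
            return const(a.value + b.value)
        return self._emit("add", a, b)

    def xor(self, a, b):
        if isinstance(a, Const) and isinstance(b, Const):
            return const(a.value ^ b.value)
        if isinstance(a, Const):
            a, b = b, a  # xor is commutative; keep the constant on the right
        if b == Const(MASK64):
            return self._emit("not", a)  # x ^ -1 is the same as ~x
        return self._emit("xor", a, b)


e = Emitter()
folded = e.add(const(2), const(3))  # becomes Const(5) without emitting anything
rax = e.load_reg("rax")
inverted = e.xor(rax, const(-1))    # emitted as a single "not"
print(folded, e.ops)
```

Folding at emit time means no separate pass has to walk the IR later to find these patterns, which is one way a JIT can get both faster compilation and tighter generated code.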

Fix a nasty race condition that causes invalid memory tracking

This was a big nasty bug that landed on our plate this last month. We noticed recently that after Steam shipped an update, it would crash very frequently under FEX. This only seemed to occur while games were downloading. It could technically be worked around by restarting Steam each time it crashed, but if you're downloading a big 100GB game like Spider-Man 2 then you're going to need to restart Steam a lot.

After some investigation we found out that Steam has seemingly updated its memory allocator, or made it more aggressively allocate and deallocate memory. This happens particularly frequently during a game download where each thread is now allocating and deallocating across the whole system.

When any memory syscall gets used under FEX, we need to track this in order to ensure that self-modifying code works correctly. We track the virtual memory regions and keep a map around so that if anything gets overwritten we can invalidate code caches. Turns out we had mutex locking in the wrong location, which gave us a different view of memory from what the kernel had. This showed up when multiple threads perfectly interleaved a munmap and an mmap, and the FEX mutex that tracked them was taken in the wrong order. FEX would end up thinking the mmap came first and then a munmap (at the same address!) came second, but it was actually the other way around.

This completely broke FEX's tracking, resulting in some bad crashes. The core change was to make FEX's tracking mutex also wrap the syscall doing the memory operation. Now everything is sorted and Steam is even more stable than before!
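
The shape of the fix can be sketched as follows. This is a toy model under stated assumptions, not FEX code: `FakeKernel` stands in for the real kernel, `TrackedMemory` for FEX's tracking, and addresses are tracked as a plain set rather than as VMA ranges. The point is only that the "syscall" and the bookkeeping update happen under the same lock, so the order recorded by the tracker cannot differ from the order the kernel saw.

```python
import threading


class FakeKernel:
    """Stand-in for the kernel's own view of which addresses are mapped."""

    def __init__(self):
        self._lock = threading.Lock()
        self.mapped = set()

    def mmap(self, addr):
        with self._lock:
            self.mapped.add(addr)

    def munmap(self, addr):
        with self._lock:
            self.mapped.discard(addr)


class TrackedMemory:
    """Hypothetical tracker: the lock wraps the syscall AND the bookkeeping."""

    def __init__(self, kernel):
        self._kernel = kernel
        self._lock = threading.Lock()
        self.tracked = set()

    def mmap(self, addr):
        with self._lock:
            self._kernel.mmap(addr)   # the "syscall" happens inside the lock
            self.tracked.add(addr)    # so this update is recorded in the same order

    def munmap(self, addr):
        with self._lock:
            self._kernel.munmap(addr)
            self.tracked.discard(addr)


def hammer(tracker, addr, rounds=200):
    # Two mappers and two unmappers fight over the same address.
    def mapper():
        for _ in range(rounds):
            tracker.mmap(addr)

    def unmapper():
        for _ in range(rounds):
            tracker.munmap(addr)

    threads = [threading.Thread(target=f) for f in (mapper, unmapper, mapper, unmapper)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


kernel = FakeKernel()
tracker = TrackedMemory(kernel)
hammer(tracker, 0x7000_0000)
print(tracker.tracked == kernel.mapped)  # True: both views always agree
```

If the lock only covered the `tracked` update, a munmap and an mmap from two threads could reach the kernel in one order and the tracker in the other, leaving the two views permanently out of sync for that address.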

FEXServer fixes

We had a couple of minor bugs come up this month that were fixed in our FEXServer. In some cases, starting an application would cause the server to exit early while FEX was still running, which would cause strange behaviour. These are now fixed, so the server should stay running while applications execute.

Print a warning if an unknown FEX config option is set

Every so often FEX changes config options or old ones get phased out. We didn't have any way to alert the user that they have old config options sitting around, or even if they typo'd an option. Now if you're manually setting config options in the config JSON, FEX will print a warning to alert you to fix the option. Nice little quality of life change.
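
As a rough illustration of this kind of check, the sketch below compares the keys in a JSON config against a set of known option names and warns about anything else. The layout (a top-level "Config" object) and the option names in `KNOWN_OPTIONS` are assumptions made for the example, not an authoritative description of FEX's config format, and the warning text is invented.

```python
import json
import sys

# Illustrative subset only; the real list of options lives in FEX itself.
KNOWN_OPTIONS = {"Multiblock", "TSOEnabled", "RootFS"}


def find_unknown_options(config_text, known=KNOWN_OPTIONS):
    """Return the option names in the config that are not in `known`."""
    data = json.loads(config_text)
    options = data.get("Config", {})  # assumed top-level key
    return sorted(name for name in options if name not in known)


def warn_unknown_options(config_text):
    """Print one warning per unknown option and return the warnings."""
    warnings = [
        f"Unknown config option '{name}'"
        for name in find_unknown_options(config_text)
    ]
    for warning in warnings:
        print(warning, file=sys.stderr)
    return warnings


# A typo'd option name gets flagged; the correctly spelled one does not.
sample = '{"Config": {"Multiblock": "1", "TSOEnabeld": "1"}}'
unknown = find_unknown_options(sample)
warn_unknown_options(sample)
```

The same check catches both cases mentioned above: an option that has been phased out and an option that was simply misspelled.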

Fortification safe long-jump

Last month we fixed a nasty memory leak which required introducing a single long-jump usage inside of FEX. Turns out this broke FEX on some distros that enable fortification build options when compiling FEX. This is now fixed by using a long-jump that is safe against fortifications.

Raw Changes

FEX Release FEX-2506

  • Async

    • Don't destruct on self-moves (5e10336)
  • CMake

    • Generate DATA_DIRECTORY dynamically unless explicitly set (2a713a1)
    • Allow disabling explicit -mcpu usage (b3297d1)
  • Config

    • Use CMAKE_INSTALL_FULL_LIBDIR when templating default values (06541f2)
    • Clean up use of templates (3dc8a3d)
  • ConstProp

    • optimize XOR with all-1 (221ae2d)
  • FEXCore

  • FEXServer

    • Don't time out while clients are still connected (2164d7b)
  • IR

    • Inline registers into the IR (dc9f8aa)
    • drop DestSize inference (debc57e)
  • InstcountCI

    • Adds tests for instructions discovered by #4597 (e81c84e)
  • InstructionCountCI

  • JIT

  • LibraryForwarding

    • Use CMAKE_INSTALL_FULL_LIBDIR instead of constructing the library install paths manually (ba162bb)
    • Change target env to gnu (861ecbe)
  • Linux

    • SMCTracking
      • Fixes nasty race condition causing invalid memory tracking (a08a6ce)
  • LinuxEmulation

    • Minor cleanup by separating VMA definitions (47f1ad6)
    • Implement custom longjump that is fortification safe (89e5041)
    • Fix bad compile time definition check (8b3e731)
  • LinuxSyscalls

    • Reduce code duplication between 32-bit and 64-bit paths (99816a2)
  • OpcodeDispatcher

  • Scripts

  • SyscallsVMATracking

  • TestHarnessRunner

    • Stop setting the guest RSP for ARM test runner. (8e079c1)
  • ThreadPoolAllocator

    • Add support for updating the size of managed data (fec1ffa)
    • Fix assertion in _GLIBCXX_DEBUG builds (3857980)
  • Windows

    • Fix building with llvm-libcxx (2d0e19e)
  • Misc

    • Optimize CDQ (6549b66)
    • Fixes FEXServer path search (2ff9546)
    • Constant fold on the fly (fedad27)
    • Stop using hashmap in DCE (c865eb9)
    • Error when JSON config includes unknown options (5ff9bb3)
    • Reduce JIT time by 25% by sharing code buffers between threads (e794584)
    • Pair push/pop (4928af5)
    • LogManager: Print source location when failing assertions (ded8b32)
    • Fix callee saved floating-point arguments order issue (0147f7a)
    • Use -print-libgcc-file-name to get the compiler-rt file name (7a4fff8)
    • Run gcc target tests with block size 1 (ae9a5b1)
  • unittests

Release FEX Release FEX-2506 · FEX-Emu/FEX