Why is GPU emulation so demanding? [Part 2: API Emulation / Wine]
Disclaimer: I am a former Wine developer, but I never really dealt with any of the lower-level stuff. Much of the Wine-related technical stuff I’m talking about here is just an educated guess which roughly gives a simplified explanation of what’s going on but should be taken with a grain of salt for anything else.
Another topic which was brought up in the thread which inspired this blog series was how Wine compares to Dolphin.
In case you don’t know, Wine (an acronym for “Wine Is Not an Emulator”) is a collection of programs which can be used to run many Microsoft Windows applications within Unix environments like Linux. However, Wine doesn’t need to emulate a whole PC to do this; instead it uses a technique called “API emulation”. The program’s binary code gets executed directly by the host CPU, but whenever a call to a Windows-specific function is made, Wine doesn’t call the original function; it calls its own wrapper function, which provides the same functionality as the original but uses the underlying Unix APIs for the actual implementation. That’s why Wine is often called a compatibility layer. Read on for details.
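To make the wrapper idea a bit more concrete, here is a minimal sketch in C++. It mimics the documented behaviour of the Windows function GetEnvironmentVariableA using the plain C/POSIX getenv call. The name Sketch_GetEnvironmentVariableA and the DWORD alias are made up for this post; this is not how Wine actually implements the function, just an illustration of "same interface, Unix implementation underneath".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

using DWORD = std::uint32_t;  // stand-in for the 32-bit unsigned Windows type

// Hypothetical wrapper (not Wine's real code): behaves like the Windows
// function from the caller's point of view, but asks the Unix environment.
DWORD Sketch_GetEnvironmentVariableA(const char* name, char* buffer, DWORD size) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    if (value == nullptr) {
        return 0;  // variable not found
    }
    const std::size_t length = std::strlen(value);
    if (buffer == nullptr || length + 1 > size) {
        // Buffer too small: report the required size including the terminator.
        return static_cast<DWORD>(length + 1);
    }
    std::memcpy(buffer, value, length + 1);
    // Success: number of characters stored, not counting the terminator.
    return static_cast<DWORD>(length);
}
```

A Windows program calling this would never notice that there is no Windows underneath, as long as the return values and buffer contents match what the original function would have produced.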
In order to understand why what Wine is doing works, you’ll need to know some basics about what Windows program binaries usually look like. Basically, the program binary is just a bunch of x86/x64 machine code which per se isn’t operating system specific at all¹. What is OS specific, however, are calls to functions stored in external system libraries, i.e. DLLs (Dynamic Link Libraries) or, in Unix terminology, SOs (Shared Objects). For example, kernel functions are obviously operating system specific; but this also applies to libraries which strongly depend on the underlying system libraries (like Direct3D) and to many libraries which just aren’t available on Unix platforms.
Now Wine uses some clever tricks to make Windows executables work on Unix anyway. As far as I understand it, Wine doesn’t convert the Windows PE executable into an ELF executable (the format Unix-like systems such as Linux run natively) ahead of time; instead, Wine’s own loader maps the PE image into memory and resolves its DLL imports against Wine’s implementations of those libraries. The machine code itself is left untouched. The little quirks I mentioned above, like calling conventions, are handled on Wine’s side of each call. After that, the program’s code can basically just run like that of any other Unix application.
It’s of course not quite as simple: Wine has to deal with a lot of other things, e.g. providing proper memory management to Windows applications. That stuff is usually handled by the Windows kernel, which obviously isn’t available on Unix platforms. Instead, a separate process called wineserver is started whenever Wine is run. It basically does much of the work that would usually be done by the Windows kernel.
The point is, most Windows applications don’t depend on the underlying hardware they’re running on. Instead, they’re using the “abstraction layer” provided by the system libraries. That’s what allows Wine to run as efficiently and well as it does: It doesn’t need to emulate a complete set of computer hardware but just needs to load the Windows executable and simulate a Windows environment around it. A nontrivial task for sure, but with very good results 😉
¹ Apart from stuff like calling conventions and other quirks, which can be taken care of at the boundary between the program and Wine’s libraries without causing any major problems though.
Why is GPU emulation so demanding? [Part 1: How to emulate a GPU]
Recently scummos from the Dolphin forums created a thread asking why GPU emulation in Dolphin is so demanding. This article series is meant to give some insight into the reasons behind this. Explaining this topic in-depth is very hard if you don’t know how hardware emulation works, but I’ll try to describe the most important concepts.
In order to understand how GPU emulation works, you should first ask yourself the question: How is the GPU even accessed by the currently running application? From an application programmer’s perspective it’s fairly simple because there are multiple Application Programming Interfaces (APIs) which can be used for this task. On PC hardware, you have Direct3D and OpenGL for this. On gaming consoles, the console vendor usually includes its own GPU programming API in the console’s Software Development Kit (SDK). The Gamecube’s/Wii’s GPU API is called GX. These GPU APIs aren’t really special, they’re just a bunch of functions programmers can use to access the GPU; for example an IDirect3DDevice9 object in Direct3D9 has a method called Clear() which can be used to clear the render target’s color/depth/stencil buffers. IDirect3DDevice9::CreateTexture() can be used to create a texture object with the specified parameters. IDirect3DDevice9::DrawIndexedPrimitive() will render some triangles to the screen. Similar calls exist in other GPU APIs (e.g. OpenGL: glClear, glGenTextures, glDrawElements).
However, these functions aren’t the lowest level of access; their implementations are actually just a bunch of C code (or whatever language was used for implementing them) as well. Internally, those functions access the GPU registers in order to directly program the GPU.
Now you probably wonder what GPU registers are. For that, you’ll need to understand the concept of memory-mapped I/O (MMIO). It boils down to using the same address bus for both memory (RAM) and hardware devices (more precisely: I/O devices). That means that a certain part of the whole address range is dedicated to RAM, i.e. when the CPU tries to access that part, the request will transparently be redirected to the RAM. Another part of the whole address range might be dedicated to one of the I/O devices, etc. Take a look at Wikipedia for an example of a memory map. Assuming we have a dedicated chunk of mapped memory for the GPU, we’ll refer to that chunk as the GPU register block. Furthermore the GPU register block can be seen as a group of equally sized registers (Flipper uses 4 bytes for each register). We can refer to a register either by its index or (more commonly) by its byte offset from the beginning of the GPU register block. Each GPU register usually is linked to a specific part of GPU functionality (“execute a clear operation”) or a specific GPU configuration parameter (“color used for clearing the render target”).
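As a rough sketch of the memory map idea, the C++ snippet below decides what an address is mapped to and converts an address inside the GPU register block into a byte offset and a register index. All base addresses and sizes are invented for illustration and do not describe the real Gamecube/Wii memory map; only the 4 bytes per register come from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical memory map: every address and size here is made up.
constexpr std::uint32_t kRamBytes    = 0x01000000;  // RAM starts at address 0
constexpr std::uint32_t kGpuRegBase  = 0x0C000000;  // start of the GPU register block
constexpr std::uint32_t kGpuRegBytes = 0x00000400;  // size of the GPU register block
constexpr std::uint32_t kBytesPerReg = 4;           // equally sized registers

enum class Target { Ram, GpuRegister, Unmapped };

// Decide which device an address on the shared bus belongs to.
Target Decode(std::uint32_t address) {
    if (address < kRamBytes) {
        return Target::Ram;
    }
    if (address >= kGpuRegBase && address - kGpuRegBase < kGpuRegBytes) {
        return Target::GpuRegister;
    }
    return Target::Unmapped;
}

// Byte offset from the beginning of the GPU register block.
std::uint32_t RegisterByteOffset(std::uint32_t address) {
    assert(Decode(address) == Target::GpuRegister);
    return address - kGpuRegBase;
}

// The same register, referred to by its index instead of its byte offset.
std::uint32_t RegisterIndex(std::uint32_t address) {
    return RegisterByteOffset(address) / kBytesPerReg;
}
```

With these made-up numbers, address 0x0C000008 lands in the GPU register block at byte offset 8, which is register index 2.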
Using MMIO, a GPU (or an I/O device in general) may expose certain capabilities to the CPU via the GPU registers. By reading from certain registers, the CPU may request information from the GPU. By writing to certain registers, the CPU may tell the GPU to execute some action (or to set up some parameters). For example, the implementation of a function like IDirect3DDevice9::Clear (see above) will write to some register X to tell the GPU what clear color to use and will write to another register Y to tell the GPU to actually execute the clear operation. The nice thing about this is: glClear (OpenGL) and the clear functionality of GX work the same way internally! Of course, the GPU registers will be different on each GPU (I cheated a bit on this, IDirect3DDevice9::Clear actually sits a few layers above the GPU registers; one of them is the GPU driver which knows the GPU registers of the accessed hardware). In emulation, however, this isn’t an issue because, well, a console always uses the same hardware 😉
Note that this is the lowest-level way to access the GPU, so in theory there’s nothing which can go wrong when emulating at register level (I’ll dedicate another blog post to explaining what can go wrong in practice).
But how do we actually emulate the Gamecube’s GPU (Flipper) or the Wii’s GPU (Hollywood)? First, let’s note that whenever I just say Flipper or Hollywood, I usually refer to both of them (unless noted otherwise). That’s because they both use the same register set (something which is uncommon even for different GPU generations from the same vendor) and the only real differences are the clock rate and the amount of texture memory (neither of which really matters in Dolphin).
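Here is a small sketch of what "write the color to register X, then poke register Y" could look like from the point of view of whoever implements the API function. The register offsets, the function names and the fake register block are all invented; on real hardware the pointer would refer to the mapped GPU register block instead of an ordinary array.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-in for the mapped GPU register block so the sketch runs anywhere.
std::uint32_t g_fake_register_block[16] = {};

// Hypothetical register offsets (in bytes) for an imaginary GPU.
constexpr std::size_t kRegClearColor   = 0x10;  // "register X": clear color
constexpr std::size_t kRegClearTrigger = 0x14;  // "register Y": execute the clear

// Write one 32-bit register; volatile keeps the compiler from dropping or
// reordering the stores, which matters once the target is a real device.
inline void WriteRegister(volatile std::uint32_t* block, std::size_t byte_offset,
                          std::uint32_t value) {
    assert(byte_offset % 4 == 0);
    block[byte_offset / 4] = value;
}

// What a Clear()-style API function might boil down to at the very bottom.
void SketchClear(volatile std::uint32_t* gpu, std::uint32_t color) {
    WriteRegister(gpu, kRegClearColor, color);  // first the parameter ...
    WriteRegister(gpu, kRegClearTrigger, 1);    // ... then the action
}
```

The order of the two writes is the whole point: the parameter has to be in place before the trigger register is touched.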
The way GPU emulation works is fairly simple once you’ve got perfect CPU emulation (ha! :P). Whenever the CPU accesses memory, the CPU emulation code checks what the accessed address is mapped to. If it’s mapped to the GPU, we’ve got a bunch of handler functions which will check what the accessed register is supposed to do and execute the proper action then. In case of an emulator like Dolphin which tries to use the GPU where appropriate for this, this includes translating the GPU registers to the available Direct3D/OpenGL API functions (using these APIs is the only way to access the GPU on PC hardware by the way because of user-mode restrictions, i.e. we can’t directly access the GPU registers from Dolphin; only GPU drivers may do so).
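The following sketch shows that flow in miniature: an emulated 32-bit write either ends up in an array that plays the role of RAM or gets forwarded to a GPU handler function. The class SketchBus, the handler factory and the register offset 0x14 are hypothetical and not taken from Dolphin; the handler only records the name of the host API call it would make instead of calling into OpenGL.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class SketchBus {
public:
    using GpuHandler = std::function<void(std::uint32_t offset, std::uint32_t value)>;

    SketchBus(std::size_t ram_words, std::uint32_t gpu_base, std::uint32_t gpu_bytes,
              GpuHandler handler)
        : ram_(ram_words, 0), gpu_base_(gpu_base), gpu_bytes_(gpu_bytes),
          gpu_handler_(std::move(handler)) {}

    // Called by the (imaginary) CPU emulation for every 32-bit store.
    void Write32(std::uint32_t address, std::uint32_t value) {
        if (address >= gpu_base_ && address - gpu_base_ < gpu_bytes_) {
            gpu_handler_(address - gpu_base_, value);  // mapped to the GPU
            return;
        }
        assert(address % 4 == 0 && address / 4 < ram_.size());
        ram_[address / 4] = value;  // plain RAM
    }

    std::uint32_t ReadRam32(std::uint32_t address) const {
        assert(address % 4 == 0 && address / 4 < ram_.size());
        return ram_[address / 4];
    }

private:
    std::vector<std::uint32_t> ram_;
    std::uint32_t gpu_base_;
    std::uint32_t gpu_bytes_;
    GpuHandler gpu_handler_;
};

// Example handler: translates one made-up trigger register into a host API
// call. A real backend would call glClear here; the sketch just logs it.
SketchBus::GpuHandler MakeLoggingHandler(std::vector<std::string>& host_calls) {
    return [&host_calls](std::uint32_t offset, std::uint32_t value) {
        if (offset == 0x14 && (value & 1u) != 0) {
            host_calls.push_back("glClear");
        }
    };
}
```

Writes to ordinary addresses never reach the handler, so the GPU emulation only pays for the accesses that actually target the register block.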
For example, Flipper has a range of registers called BP memory which describe the current render pipeline configuration (it’s similar to the pixel shader stage in modern GPU APIs or to render and sampler states in D3D9). Dolphin has a list of these BP registers and defines a name for each one for better readability in BPMemory.h. The actual emulation of BP registers happens in BPStructs.cpp in the BPWritten() function which gets called whenever the BP memory has been written to. It has a fairly big “switch” statement which basically just checks what register was written to in order to call the proper handler function; you might notice that many register handlers don’t do anything at all (e.g. BPMEM_LOADTLUT0 which just “break”s) because they’re just general parameters which don’t need any immediate reaction but just affect the behavior of later actions.
Our render target clearing example from above is a bit unfortunately chosen, but I’ll explain it anyway: If there was a GXClear function in the GX API (there isn’t), it would poke the BPMEM_CLEAR_AR and BPMEM_CLEAR_GB registers and write the alpha/red and green/blue components of the color used for clearing the back buffer there. To actually execute the clear operation the function would also poke the BPMEM_TRIGGER_EFB_COPY register. The value written to that register is described by the UPE_Copy struct in BPMemory.h: The 12th bit (named “clear” in the Dolphin code) is set to 1 (true) if a clear operation should be executed. Checking the implementation in BPStructs.cpp under the case BPMEM_TRIGGER_EFB_COPY, you can see that if (PE_copy.clear) is true, the function ClearScreen (implemented in BPFunctions.cpp) is called which does some other fancy stuff and then calls g_renderer->ClearScreen which is the backend specific implementation. Looking at the various Render.cpp files in each video backend’s source directory (e.g. OpenGL) you can see that these methods do nothing other than call glClear, i.e. the respective clear function of the GPU API used for emulation!
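The "store now, use later" pattern of such a switch statement can be sketched like this. The register numbers and names below are hypothetical and deliberately do not match Dolphin’s BPMemory.h, and the 24-bit value check is an assumption of the sketch; the only thing to take away is the difference between registers that are merely remembered and registers that trigger work.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical register numbers, invented for this sketch.
constexpr std::uint8_t kSketchLoadTlut    = 0x01;  // parameter only
constexpr std::uint8_t kSketchScissor     = 0x02;  // parameter only
constexpr std::uint8_t kSketchTriggerCopy = 0x03;  // causes an action

struct SketchBpState {
    std::array<std::uint32_t, 256> regs{};  // last value written to each register
    int copies_executed = 0;                // stand-in for "real work happened"
};

void SketchBpWritten(SketchBpState& state, std::uint8_t reg, std::uint32_t value) {
    assert((value >> 24) == 0);  // assumption of this sketch: values are 24 bits wide
    state.regs[reg] = value;     // always remember the latest value

    switch (reg) {
    case kSketchLoadTlut:
    case kSketchScissor:
        break;  // nothing to do right now; later actions read state.regs
    case kSketchTriggerCopy:
        ++state.copies_executed;  // a real emulator would do the copy/clear here
        break;
    default:
        break;
    }
}
```

A handler that just breaks is therefore not a missing feature: the value is still stored and gets picked up by whichever later action depends on it.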
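To illustrate the bit fiddling involved, here is a small sketch of decoding such values. It assumes that "12th bit" means bit index 11 when counting from zero, and that each color register holds two 8-bit components with alpha and green in the upper byte; both are assumptions of this sketch rather than something to rely on, and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Assumed position of the "clear" flag: 12th bit counted from one.
constexpr std::uint32_t kClearBitIndex = 11;

// True if the value written to the trigger register asks for a clear.
inline bool WantsClear(std::uint32_t trigger_value) {
    return ((trigger_value >> kClearBitIndex) & 1u) != 0;
}

// Combine the two color registers into one 32-bit ARGB value.
// Assumed layout: ar = (alpha << 8) | red, gb = (green << 8) | blue.
inline std::uint32_t ComposeClearColor(std::uint32_t ar, std::uint32_t gb) {
    assert(ar <= 0xFFFFu && gb <= 0xFFFFu);
    return (ar << 16) | gb;
}
```

With these assumptions, an alpha/red value of 0xFF10 and a green/blue value of 0x2030 combine to the clear color 0xFF102030, which is the kind of value a backend could then hand to its own clear function.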
Well, almost, because when you look at Direct3D9/11 you can see that they don’t use those functions – because they differ a bit in functionality from their GX counterpart. This is a limitation of the emulation in our hardware accelerated video backends: We can only emulate the hardware functionality which is actually exposed to us from the GPU APIs used for emulating Flipper. In case of ClearScreen a workaround is possible, but this isn’t possible for all Flipper features. That’s the nice thing about software rasterizer backends: You don’t end up debugging a strange graphics glitch just to find out that it can’t be fixed with the limited functionality exposed by Direct3D9 but you can just go ahead and fix it 😀
I’ll further elaborate on the limitations of Direct3D and OpenGL when emulating Flipper in another blog post, but that’s it for now – all you need to know about how GPU emulation works at the lowest level. Or at least a starting point for further reading 😉