IA32/x86 and GCC’s fPIC

Lately Valve has started using GCC’s fPIC option to compile their Linux binaries, and I remain unconvinced that this is a good idea.

The purpose of fPIC is to generate position independent code, or code that references data positions without the need for code relocation. Instead of referencing data sections by their actual address, you reference them by an offset from the program counter. In and of itself, it’s not a bad idea.

My observation on fPIC is that its usefulness varies depending on the platform. AMD64 has a built-in mechanism for referencing memory as an offset from the program counter. This makes generating PIC code nearly trivial, and can reduce generated code size because you don’t need full 64-bit address references. On the other hand, it can actually complicate relocation. Since the references are 32-bit, the data cannot be relocated more than 2GB away from the code. That’s a minor problem for loaders, but certainly a nastier problem for people implementing detours and the like.

So, what about x86? It has no concept of PC-relative addressing. In fact, it doesn’t even have an instruction to get the program counter (EIP)! Let’s take a simple C++ code snippet, and look at the disassembly portion for modifying g_something:

Select All Code:
int g_something = 0;
int do_something(int x)
    g_something = x;
    return ++g_something;

With GCC flags “-O3″ I get this assembly routine:

Select All Code:
0x080483d7 <_Z12do_somethingi+7>:       mov    ds:0x804960c,eax

With GCC flags “-fPIC -O3″ I get this:

Select All Code:
0x0804849a <__i686.get_pc_thunk.cx+0>:  mov    ecx, [esp]
0x0804849d <__i686.get_pc_thunk.cx+3>:  ret
0x08048441 <_Z12do_somethingi+1>:       call   0x8048496 <__i686.get_pc_thunk.cx>
0x08048446 <_Z12do_somethingi+6>:       add    ecx,0x12b6
0x08048451 <_Z12do_somethingi+17>:      mov    edx,DWORD PTR [ecx-0x8]
0x08048458 <_Z12do_somethingi+24>:      mov    DWORD PTR [edx],eax

The non-PIC version is one instruction. The PIC version is six instructions. As if that couldn’t be any worse, there’s an entire branch added into the fray! Let’s look at what it’s doing:

  • The call instruction calls a routine which simply returns the value at [esp]. The value at [esp] is the return address. This is a fairly inefficient way to get the program counter, but (as far as I know) the only way on x86 while avoiding relocation.
  • A constant offset is added to the EIP. The new address points to the global offset table, or GOT. The GOT is a big table of addresses, each entry being an address to an item in the data section. The entries in this table require relocating patching from the loader (and the code, subsequently, does not).
  • The actual address to the data is computed by looking up the GOT entry.
  • Finally, the value can be stored in the data’s memory.

Meanwhile, let’s look at the AMD64 versions. I apologize for the ugly AT&T syntax; GDB won’t show RIP-addressing on Intel mode.

PIC version:

Select All Code:
0x0000000000400560 <_Z12do_somethingi+0>:       mov    1049513(%rip),%rdx        # 0x500910 <_DYNAMIC+448>
0x000000000040056a <_Z12do_somethingi+10>:      mov    %eax,(%rdx)

Non-PIC version:

Select All Code:
0x0000000000400513 <_Z12do_somethingi+3>:       mov    %eax,1049587(%rip)        # 0x50090c <g_something>

Although there’s still one extra instruction, that’s a lot more reasonable. So, why would anyone generate fPIC code on x86?

Supposedly without any relocations, the operating system can keep one central, unmodified copy of a library’s code in memory. To me, this seems like a pretty meaningless advantage. Unless you’ve got 4MB of memory, chances are you have plenty of it (especially if you’re running Half-Life 1/2 servers). Also, the cost of relocation should be considered a negligible one-time expense. If it wasn’t, it’d mean you were probably doing something silly like loading a shared library quickly and repeatedly.

My thoughts on this matter are shared by the rest of the AlliedModders developers: don’t use GCC’s fPIC. On x86 the generated code is a lot uglier and slower because the processor doesn’t facilitate such addressing. On AMD64 the difference is small, but even so — as far as I know, Microsoft’s compiler doesn’t ever use such a crazy scheme. Microsoft uses absolute addressing on x86 and RIP-relative addressing on AMD64, and at least on x86 (I’m willing to bet on AMD64 as well), they’ve never used a global offset table for data.

Conclusion: Save yourself the run-time expense. Don’t use GCC’s fPIC on x86. If you’ve got a reason explaining otherwise, I’d love to hear it. This issue has been eating at me for a long time.

(Note: Yes, we told Valve about this. Of course, it was ignored, but that no longer bothers me.)

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">