Building a class-dump in 2020

Note: This article has become outdated, with some statements becoming inaccurate. I’ve decided to leave this up as it can still be useful for learning, but take everything with a grain of salt.

Building out a “class-dump”-like introspection tool for Apple platforms has changed considerably since the original class-dump came out. Learning these new (and old) technologies can be quite intimidating due to the steep learning curve and somewhat hard to find documentation.

This article attempts to explain the complete process of programmatically inspecting a Mach-O (Apple) binary to display the compiled Swift types and Objective-C classes by discussing the following:

This writeup takes its sweet time explaining things, but there are a lot of concepts to go through. If you’re brand new to this stuff, I’d recommend going through the sections in chunks over several days and doing the experiments. This is a loooooooong writeup, but if you can get through this whole thing, you’ll have a very good understanding of how a lot of internals work. Yay…

If you know most of this stuff, I’d recommend just jumping to the appropriate section that you need to learn.

For all you “heavyweights” out there (Ghidra, Hopper, IDA, jtool, & friends), I recommend you check out the Swift part as I have some suggestions on how to provide better Swift support for your tool.

And here we go


I. Mach-O File Format


The Mach-O file format is the “table of contents” and layout found right at the beginning of every Mach-O (read Apple) image. An image is any compiled, executable code including (but not limited to) executables, frameworks, dylibs, etc. Understanding the Mach-O format is great if you want to know where stuff is located on disk and where it will be when loaded into memory.

There are many great sources out there that already cover this topic well.

And if you want to pay money for some Mach-O tutorials…

But this stuff won’t stick unless you play around with it and do the experiments. Time to get your hands dirty.


1.1 The Mach-O Header


In the file <mach-o/loader.h>, there exists a C struct called mach_header_64 that is the beginning to all 64-bit compiled Apple images (well, sorta, it actually depends on some things like FAT files, 32-bit architecture, but don’t think about that now).

Note: When referring to C system headers on your OS X machine, you can usually resolve the header location with the following Terminal command: echo $(xcrun --show-sdk-path)/usr/include. This resolves to the base directory search path for C system headers. The resolved filepath of loader.h can be viewed via: cat $(xcrun --show-sdk-path)/usr/include/mach-o/loader.h | less -R

The mach_header_64 struct contains the following:

struct mach_header_64 {
        uint32_t        magic;          /* mach magic number identifier */
        cpu_type_t      cputype;        /* cpu specifier */
        cpu_subtype_t   cpusubtype;     /* machine specifier */
        uint32_t        filetype;       /* type of file */
        uint32_t        ncmds;          /* number of load commands */
        uint32_t        sizeofcmds;     /* the size of all the load commands */
        uint32_t        flags;          /* flags */
        uint32_t        reserved;       /* reserved */
};

Cross reference the above mach_header_64 with any compiled executable. I’ll pick grep, feel free to pick anything else:

lolgrep:~$ xxd -g 4 -e $(which grep) | head -2

The xxd command will dump the raw data of an executable to stdout. The -g 4 says to group all the values into 4 bytes, which is perfect since each member in the mach_header_64 struct is a 4 byte value. The -e option says to format the output in little-endian byte order (i.e. anything after a mac PowerPC).

Note: Now’s as good a time as any to mention type sizes in a programming language. An int in C on a 64-bit system will (typically) take up 32 bits or 4 bytes of storage. An Int in Swift will take up 64 bits or 8 bytes of storage. A pointer to any type of value will take up the full 64 bits in both languages. In Swift, you can use the MemoryLayout type to verify size. In the C language family, use the sizeof operator to determine the size of a type. Check out this Swift link or this generic languages link for a good writeup that dives further into this.

After executing the above Terminal command, this produces the following output on my machine:

00000000: feedfacf 01000007 80000003 00000002  ................
00000010: 00000014 00000798 00200085 00000000  .......... .....

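Going through the mach_header_64 struct members:

magic: 0xfeedfacf is MH_MAGIC_64, the marker for a 64-bit Mach-O image.
cputype: 0x01000007 is CPU_TYPE_X86_64 (CPU_TYPE_X86 with the CPU_ARCH_ABI64 bit set).
cpusubtype: 0x80000003 is CPU_SUBTYPE_X86_64_ALL (3) with the CPU_SUBTYPE_LIB64 bit (0x80000000) set.
filetype: 0x00000002 is MH_EXECUTE, a main executable.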

If you’re new to this stuff, check out the <mach/machine.h> header to make sure I’m telling the truth. I’ll let you figure the remaining struct members out yourself 👍

You can view the mach_header_64 in an alternative way by running otool -h to view the Mach-O header:

lolgrep:~$ otool -h $(which grep)

This produces the following on my machine:

Mach header
      magic cputype cpusubtype  caps    filetype ncmds sizeofcmds      flags
 0xfeedfacf 16777223          3  0x80           2    20       1944 0x00200085

Keep an eye on that ncmds with the value 20; this is what’s going to be discussed next.
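
If you’d rather decode the header in code than by eyeballing xxd, here is a minimal sketch in C. It is not how otool does it, just an illustration: the struct is redeclared with plain uint32_t fields so it compiles without <mach-o/loader.h>, and the names macho_header_64, read_le32 and parse_macho_header are made up for this example. It assumes a thin (non-FAT), little-endian, 64-bit image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as mach_header_64 in <mach-o/loader.h>; redeclared here
 * under a hypothetical name so the sketch builds on any platform. */
struct macho_header_64 {
    uint32_t magic;
    uint32_t cputype;
    uint32_t cpusubtype;
    uint32_t filetype;
    uint32_t ncmds;
    uint32_t sizeofcmds;
    uint32_t flags;
    uint32_t reserved;
};

#define MACHO_MAGIC_64 0xfeedfacfu /* same value as MH_MAGIC_64 */

/* Decode a little-endian 32-bit value regardless of host byte order. */
uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Parse the first 32 bytes of an image into `out`.
 * Returns 0 on success, -1 if the buffer is too small or the magic is wrong. */
int parse_macho_header(const uint8_t *buf, size_t len, struct macho_header_64 *out) {
    assert(buf != NULL && out != NULL);
    if (len < 32 || read_le32(buf) != MACHO_MAGIC_64) {
        return -1;
    }
    out->magic      = read_le32(buf + 0);
    out->cputype    = read_le32(buf + 4);
    out->cpusubtype = read_le32(buf + 8);
    out->filetype   = read_le32(buf + 12);
    out->ncmds      = read_le32(buf + 16);
    out->sizeofcmds = read_le32(buf + 20);
    out->flags      = read_le32(buf + 24);
    out->reserved   = read_le32(buf + 28);
    return 0;
}
```

Feeding it the 32 bytes from the xxd output above should give an ncmds of 20 and a sizeofcmds of 1944, the same values otool -h printed.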


1.2 Mach-O Load Commands


It’s these load commands (whose count is given by the ncmds from the mach_header_64) that can be interesting when exploring a compiled executable.

Each load command’s name begins with LC_, and its description/usage is given in <mach-o/loader.h>.

Use otool’s -l option with grep to display all the load commands in grep:

lolgrep:~$ otool -l $(which grep) | grep LC_ 
      cmd LC_SEGMENT_64
      cmd LC_SEGMENT_64
      cmd LC_SEGMENT_64
      cmd LC_SEGMENT_64
      cmd LC_SEGMENT_64
            cmd LC_DYLD_INFO_ONLY
     cmd LC_SYMTAB
            cmd LC_DYSYMTAB
          cmd LC_LOAD_DYLINKER
     cmd LC_UUID
       cmd LC_BUILD_VERSION
      cmd LC_SOURCE_VERSION
       cmd LC_MAIN
          cmd LC_LOAD_DYLIB
          cmd LC_LOAD_DYLIB
          cmd LC_LOAD_DYLIB
          cmd LC_LOAD_DYLIB
      cmd LC_FUNCTION_STARTS
      cmd LC_DATA_IN_CODE
      cmd LC_CODE_SIGNATURE

Make sure the ncmds count (20 for my version of grep) from the mach_header_64 matches the load command count…

lolgrep:~$ otool -l $(which grep) | grep LC_ | wc -l
20

Each of these commands does something specific. For example, LC_LOAD_DYLIB is an instruction to load a library or framework (like Cocoa or UIKit) into the process’s address space. LC_MAIN specifies where the program’s main function starts. Exhaustively describing all the load commands could be a bit boring to most, so check out <mach-o/loader.h> or other writeups if you’re interested in learning more about load commands.
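
Walking the load commands yourself is mostly bookkeeping: every load command starts with a 4-byte cmd followed by a 4-byte cmdsize, and the next command begins cmdsize bytes later. The sketch below counts the commands of one kind. The function name count_load_commands and the constants are made up for this example (the values match <mach-o/loader.h>), and it assumes a thin, little-endian, 64-bit image whose first load command sits right after the 32-byte header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MACHO_HEADER_64_SIZE 32u /* sizeof(struct mach_header_64) */

/* Decode a little-endian 32-bit value regardless of host byte order. */
uint32_t read_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Walk `ncmds` load commands and count those whose cmd equals `wanted_cmd`.
 * Returns -1 if a command runs past the end of the buffer. */
int count_load_commands(const uint8_t *image, size_t len,
                        uint32_t ncmds, uint32_t wanted_cmd) {
    assert(image != NULL);
    size_t off = MACHO_HEADER_64_SIZE;
    int count = 0;
    for (uint32_t i = 0; i < ncmds; i++) {
        /* Need at least the cmd and cmdsize fields. */
        if (off > len || len - off < 8) {
            return -1;
        }
        uint32_t cmd = read_le32(image + off);
        uint32_t cmdsize = read_le32(image + off + 4);
        if (cmdsize < 8 || cmdsize > len - off) {
            return -1;
        }
        if (cmd == wanted_cmd) {
            count++;
        }
        off += cmdsize; /* the next command starts right after this one */
    }
    return count;
}
```

With the grep binary above, passing 0x19 (the value of LC_SEGMENT_64) as wanted_cmd should report the five LC_SEGMENT_64 commands that otool -l listed.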

Now that you have an overview of the load commands, remove the grep filtering and display the full output of otool -l:

lolgrep:~$ otool -l $(which grep)

otool -l produces a lot of output, but focus on the initial LC_SEGMENT_64 load commands closer to the beginning…

When exploring memory and the load commands, different areas of memory are grouped together. These Mach-O groupings are called segments. Segments will have different memory permissions.

Execute the following Terminal command:

lolgrep:~$ otool -l $(which grep) | grep LC_SEGMENT -A10

The output will display the Mach-O segments contained in grep. This will look something like this…

      cmd LC_SEGMENT_64
  cmdsize 72
  segname __PAGEZERO
   vmaddr 0x0000000000000000
   vmsize 0x0000000100000000
  fileoff 0
 filesize 0
  maxprot 0x00000000
 initprot 0x00000000
   nsects 0
    flags 0x0
--
      cmd LC_SEGMENT_64
  cmdsize 632
  segname __TEXT
   vmaddr 0x0000000100000000
   vmsize 0x0000000000004000
  fileoff 0
 filesize 16384
  maxprot 0x00000005
 initprot 0x00000005
   nsects 7
    flags 0x0
--
...

There’s a Mach-O segment called __PAGEZERO (given by the segname member) which, when loaded into memory, has no memory permissions (see maxprot and initprot). That means you can’t read, write, or execute anything that resides in this memory segment. Hitting this memory is what happens when you fack up a pointer in C or dereference an implicitly unwrapped optional in Swift… it’s a dead zone in memory that’s designed to catch implicitly unwrapped (!)/nil/NULL/nullptr dereference bugs in your code by killing the process.

Below the __PAGEZERO segment, there’s the __TEXT segment. This segment has readable and executable permissions determined from the initprot value of 0x00000005. How can one translate the value 5 to mean readable and executable?

Think of this 5 value in bits…

executable writeable readable
1          0         1

The value 5 is 0b101 in binary, meaning everything except writable.
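
The same bit test is easy to express in code. Here’s a small sketch that turns a maxprot/initprot value into an ls-style "rwx" string. The function name format_prot and the PROT_BIT_* names are made up for this example; the bit values are the same as VM_PROT_READ, VM_PROT_WRITE and VM_PROT_EXECUTE from <mach/vm_prot.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROT_BIT_READ    0x1 /* same value as VM_PROT_READ */
#define PROT_BIT_WRITE   0x2 /* same value as VM_PROT_WRITE */
#define PROT_BIT_EXECUTE 0x4 /* same value as VM_PROT_EXECUTE */

/* Write a NUL-terminated "rwx"-style description of `prot` into `out`. */
void format_prot(int32_t prot, char out[4]) {
    assert(out != NULL);
    out[0] = (prot & PROT_BIT_READ)    ? 'r' : '-';
    out[1] = (prot & PROT_BIT_WRITE)   ? 'w' : '-';
    out[2] = (prot & PROT_BIT_EXECUTE) ? 'x' : '-';
    out[3] = '\0';
}
```

The __TEXT value of 5 comes out as r-x, and the __PAGEZERO value of 0 comes out as ---.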

The next interesting part is the nsects value immediately below the initprot field. Inside a Mach-O Segment, there can be 0 or more “subcomponents” called sections. These sections group certain parts of functionality in an executable. In the above example, inside the __TEXT segment, there are 7 sections for grep.
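
The nsects value also explains the cmdsize values in the otool output: the section headers are stored directly after their segment command, so cmdsize is the size of the segment command plus nsects section headers. The sketch below works through that arithmetic. The struct and function names are made up for this example; the layouts mirror segment_command_64 and section_64 from <mach-o/loader.h> and are redeclared so the code compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors segment_command_64 (72 bytes). */
struct segment_cmd_64 {
    uint32_t cmd;
    uint32_t cmdsize;
    char     segname[16];
    uint64_t vmaddr;
    uint64_t vmsize;
    uint64_t fileoff;
    uint64_t filesize;
    int32_t  maxprot;
    int32_t  initprot;
    uint32_t nsects;
    uint32_t flags;
};

/* Mirrors section_64 (80 bytes). */
struct section_hdr_64 {
    char     sectname[16];
    char     segname[16];
    uint64_t addr;
    uint64_t size;
    uint32_t offset;
    uint32_t align;
    uint32_t reloff;
    uint32_t nreloc;
    uint32_t flags;
    uint32_t reserved1;
    uint32_t reserved2;
    uint32_t reserved3;
};

/* cmdsize expected for a segment command that carries `nsects` sections. */
uint32_t expected_segment_cmdsize(uint32_t nsects) {
    return (uint32_t)(sizeof(struct segment_cmd_64) +
                      (size_t)nsects * sizeof(struct section_hdr_64));
}

/* File offset of section header number `index` (0-based), given the file
 * offset of the segment command that owns it. */
size_t section_header_offset(size_t segment_cmd_offset, uint32_t index,
                             uint32_t nsects) {
    assert(index < nsects);
    return segment_cmd_offset + sizeof(struct segment_cmd_64) +
           (size_t)index * sizeof(struct section_hdr_64);
}
```

For __PAGEZERO (0 sections) this gives 72, and for __TEXT (7 sections) it gives 72 + 7 * 80 = 632, matching the cmdsize values shown earlier.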

Use grep to show only the __TEXT Mach-O sections in grep:

lolgrep:~$ otool -l $(which grep) | grep __TEXT -B2 -A9
      cmd LC_SEGMENT_64
  cmdsize 632
  segname __TEXT
   vmaddr 0x0000000100000000
   vmsize 0x0000000000004000
  fileoff 0
 filesize 16384
  maxprot 0x00000005
 initprot 0x00000005
   nsects 7
    flags 0x0
Section
--
--
Section
  sectname __text
   segname __TEXT
      addr 0x0000000100000c50
      size 0x00000000000028fc
    offset 3152
     align 2^4 (16)
    reloff 0
    nreloc 0
     flags 0x80000400
 reserved1 0
 reserved2 0
--
--
Section
  sectname __stubs
   segname __TEXT
      addr 0x000000010000354c
      size 0x0000000000000192
    offset 13644
     align 2^1 (2)
    reloff 0
    nreloc 0
...

From the above output, the first section inside the __TEXT segment is a section (confusingly) called __text. It’s this section where compiled code resides (unless someone is doing something sneaky).

Note: You’ll often see both the Mach-O segment and section grouped together via a period to specify the exact Mach-O location. For example, using the above paragraph, I could also say all compiled code resides in the __TEXT.__text section. Most tools out there use this methodology.

There are many, many interesting Mach-O sections. One could write a novel on just this topic. Again, check out the Mach-O links above to learn more about the different types of Mach-O sections.


1.3 File Offsets => Virtual Addresses (and back)


The Mach-O segment/section load command info provides a translation between the virtual addresses of stuff loaded into memory and the file offsets on disk for an image.

Consider the following C code (if you’re new to this stuff, code this out with me):

Navigate to your /tmp directory:

lolgrep:~$ cd /tmp
lolgrep:/tmp$

Create ex.c with the following code:

int SomeGlobalInt = 8;
int SecondGlobalInt = 7;
int main() {
  return 0;
}

Compile ex.c:

lolgrep:/tmp$ clang ex.c -o ex

Query the *GlobalInt integer symbols using the nm tool (which displays symbol table information, more on that later)

lolgrep:/tmp$ nm -m ex | grep GlobalInt
0000000100001004 (__DATA,__data) external _SecondGlobalInt
0000000100001000 (__DATA,__data) external _SomeGlobalInt

You’ll (hopefully) see the global integer symbols in the output.

The -m option of the nm command displays the Mach-O section when printing locally implemented symbols in the symbol table. For external symbols, the -m option will display the library the symbol comes from. This will be discussed in much more detail below.

From the output, these global integers are located in the Mach-O __DATA segment inside the __data section, starting at virtual address 0x0000000100001000 (_SomeGlobalInt) and 0x0000000100001004 (_SecondGlobalInt).

One can translate these virtual addresses to the image file offset by consulting the Mach-O section load commands and hunting for the __DATA.__data Mach-O section.

lolgrep:/tmp$ otool -l ex | grep __data -A10
  sectname __data
   segname __DATA
      addr 0x0000000100001000
      size 0x0000000000000008
    offset 4096
     align 2^2 (4)
    reloff 0
    nreloc 0
     flags 0x00000000
 reserved1 0
 reserved2 0

The size member indicates the size of the __DATA.__data section (which is 8 bytes due to the two 4-byte C ints). The virtual address of the __DATA.__data section is at 0x0000000100001000. The offset on disk in the executable to the __DATA.__data section is at 4096.

Verify this with xxd again by dumping the raw bytes at offset 4096 (or equivalent for your compilation):

lolgrep:/tmp$ xxd -g 4 -e -s 4096 -l 8 ex
00001000: 00000008 00000007                    ........

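Breaking the options down:

-g 4: group the output into 4-byte values, same as before.
-e: display the values in little-endian byte order, same as before.
-s 4096: seek to file offset 4096, the offset of the __DATA.__data section, before dumping.
-l 8: stop after dumping 8 bytes, the size of the section.

At offset 4096 (or 0x1000 in hex), the value 8 is followed by the value 7, matching the values assigned to SomeGlobalInt and SecondGlobalInt in the source code.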

This means one can translate virtual load addresses to file offsets (and back) with the following formula:

symbol_offset_address = (virtual_symbol_address - containing_macho_section_virtual_address) + containing_macho_section_file_offset

This trick is used extensively in dsdump to find information in a file. For example, a pointer will reference another area in memory through a virtual address. Using the above method, if you know the virtual address, you can obtain the file offset of what the pointer is pointing to on disk.

All compiled code pointers will reference virtual addresses, not file offsets on disk.
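
The translation formula is short enough to write down as code. This is a minimal sketch, not dsdump’s actual implementation: the struct section_range and both function names are made up for this example, and the three fields correspond to the addr, size and offset values otool -l prints for a section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three section fields needed for translation. */
struct section_range {
    uint64_t addr;   /* virtual address of the section */
    uint64_t size;   /* size of the section in bytes */
    uint32_t offset; /* file offset of the section */
};

/* Virtual address -> file offset.
 * Returns 0 on success, -1 if `vmaddr` is not inside the section. */
int vmaddr_to_fileoff(const struct section_range *sect, uint64_t vmaddr,
                      uint64_t *fileoff) {
    assert(sect != NULL && fileoff != NULL);
    if (vmaddr < sect->addr || vmaddr - sect->addr >= sect->size) {
        return -1;
    }
    *fileoff = (vmaddr - sect->addr) + sect->offset;
    return 0;
}

/* File offset -> virtual address (the same formula run backwards). */
int fileoff_to_vmaddr(const struct section_range *sect, uint64_t fileoff,
                      uint64_t *vmaddr) {
    assert(sect != NULL && vmaddr != NULL);
    if (fileoff < sect->offset || fileoff - sect->offset >= sect->size) {
        return -1;
    }
    *vmaddr = (fileoff - sect->offset) + sect->addr;
    return 0;
}
```

Plugging in the __DATA.__data values from ex (addr 0x100001000, size 8, offset 4096), the address of _SecondGlobalInt (0x100001004) maps to file offset 4100, which is where xxd showed the 7.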


1.4 Virtual Addresses at Runtime


You saw the file offsets on disk; now you’ll check out the virtual addresses of the C ints when they’re loaded into memory. Use Apple’s debugger, LLDB, to inspect the addresses in the ex program you compiled earlier.

Execute the following:

lolgrep:/tmp$  lldb ex -s <(echo -e "b main\n run\n x/2wx 0x00000100001000")

There’s a lot going on in this command. You’re using LLDB to debug the ex executable. The -s option says to perform the following actions from a script. The <() command is a trick called process substitution to make the output of a process be treated as a file. This would be the same thing as making a file with the following contents:

b main
run
x/2wx 0x00000100001000

The above script will set a breakpoint on the main function, then start up the ex program (via run). Program execution will then stop on the main function and will dump out the contents at the address 0x00000100001000 via the x (examine) command.

The x command is an LLDB command whose syntax was cherry picked over from GDB. The x command will examine memory at the provided address. The wx says to format the dereferenced memory in “word” size (4 bytes) and format the dereferenced output in hexadecimal. The 2 says to do it twice, once at address 0x0000000100001000 and once at address 0x0000000100001004.

The output should not be surprising:

(lldb)  x/2wx 0x00000100001000
0x100001000: 0x00000008 0x00000007

These are the runtime values of SomeGlobalInt and SecondGlobalInt.

And what’s the runtime address of these values?

(lldb) p/x &SomeGlobalInt
(void **) $1 = 0x0000000100001000
(lldb) p/x &SecondGlobalInt
(void **) $2 = 0x0000000100001004

These are the virtual addresses that match the virtual addresses given by the nm command.


1.5 PIE


Ohhhh but it gets a bit more confusing than that. In addition to the virtual load address, the OS can shift a loaded image’s virtual addresses at runtime to a different starting base address to help mitigate attacks. This is called Address Space Layout Randomization or simply ASLR.

Since an image can have a different address every time it loads, code can’t reference absolute virtual addresses; it has to reference addresses relative to the current memory address instead. An executable built this way is known as a Position Independent Executable or PIE.

By default, every MH_EXECUTE you compile is position independent (I’m using clang-1100.0.33.8 for this writeup). You can confirm this with the following experiment:

Create the following C file, ex2.c:

#include <stdio.h>
int main() {
  printf("main is at: %p\n", main);
  return 0;
}

Compile it…

lolgrep:/tmp$ clang ex2.c -o ex2

Give it a couple of runs…

lolgrep:/tmp$ ex2
main is at: 0x1025c9f50
lolgrep:/tmp$ ex2
main is at: 0x101eeef50
lolgrep:/tmp$ ex2
main is at: 0x109fcaf50

Notice how the address of main changes from load to load. This is because the executable’s starting base address is changing for each run via ASLR.

Note: Some observant readers might have wondered why the virtual address didn’t change with LLDB in the previous experiment. By default, LLDB will remove ASLR when a program is being debugged. If the ex program were to run without LLDB, the address would have changed around from invocation to invocation.

Observe the virtual address of the main function in the compiled ex2 image through the nm command:

lolgrep:/tmp$ nm ex2 | grep main
0000000100000f50 T _main

On my compilation, main is at 0x0000000100000f50. This means that the virtual address 0x0000000100000f50 is being shifted every time the program is run. You can remove this automatic feature in clang easily enough via the -fno-pie option:

lolgrep:/tmp$ clang ex2.c -fno-pie -o ex2_nopie

Now give it a couple runs:

lolgrep:/tmp$ ex2_nopie
main is at: 0x100000f50
lolgrep:/tmp$ ex2_nopie
main is at: 0x100000f50
lolgrep:/tmp$ ex2_nopie
main is at: 0x100000f50

You can even observe the PIE bit in the Mach-O header. Compare the ex2 and ex2_nopie executables together:

lolgrep:/tmp$ diff -y <(xxd -g 4 -e ex2 | head -2)  <(xxd -g 4 -e ex2_nopie | head -2)
00000000: feedfacf 01000007 80000003 00000002  ..............   00000000: feedfacf 01000007 80000003 00000002  ..............
00000010: 00000010 00000558 00200085 00000000  ....X..... ... | 00000010: 0000000f 00000510 00000085 00000000  ..............

Compare the values in the second-to-last column of the bottom row. You’ll see a 00200085 vs a 00000085. Search the <mach-o/loader.h> header for the phrase PIE:

lolgrep:/tmp$ cat $(xcrun --show-sdk-path)/usr/include/mach-o/loader.h | grep PIE -A3
#define MH_PIE 0x200000     /* When this bit is set, the OS will
             load the main executable at a
             random address.  Only used in
             MH_EXECUTE filetypes.

You’ll see a #define MH_PIE 0x200000, which tells the loading framework (dyld) that it is allowed to slide this program’s base address around to a different value.

Even though a program’s base address might change around when it’s loaded into memory, the virtual addresses that are referenced on disk will never change.
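
Both ideas from this section fit in a few lines of C: testing the MH_PIE bit in the header flags, and working out how far an image was slid by comparing a runtime address against the address recorded on disk. This is only a sketch; the names MACHO_MH_PIE, header_is_pie, aslr_slide and slide_address are made up for this example (the constant has the same value as MH_PIE).

```c
#include <assert.h>
#include <stdint.h>

#define MACHO_MH_PIE 0x200000u /* same value as MH_PIE in <mach-o/loader.h> */

/* Non-zero if the header flags mark the executable as position independent. */
int header_is_pie(uint32_t flags) {
    return (flags & MACHO_MH_PIE) != 0;
}

/* The slide is the difference between where a symbol ended up at runtime
 * and the virtual address recorded for it on disk. */
uint64_t aslr_slide(uint64_t runtime_addr, uint64_t static_addr) {
    assert(runtime_addr >= static_addr);
    return runtime_addr - static_addr;
}

/* Apply a slide to any on-disk virtual address from the same image. */
uint64_t slide_address(uint64_t static_addr, uint64_t slide) {
    return static_addr + slide;
}
```

Using the first ex2 run above, aslr_slide(0x1025c9f50, 0x100000f50) gives 0x25c9000, so that run’s __TEXT segment (0x100000000 on disk) would have been loaded at 0x1025c9000. For ex2_nopie the slide is 0 on every run.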


II. Symbol Table


The symbol table plays an immensely important role in declaring what symbols an image implements as well as what symbols it relies upon to function correctly. All code/variables that survive outside of a local scope are candidates to be put into the symbol table for an image.

For example (no need to code this particular example out):

int someGlobalNumber = 5;
int foo(void) {
  int someNumber = 4;
  static int someStaticNumber = 7;
  return someNumber + someStaticNumber + someGlobalNumber; 
}

In the above example, the foo function along with someStaticNumber and someGlobalNumber could end up in the symbol table, but the someNumber variable inside the function would not (typically) show up in the symbol table. Notice how someStaticNumber and someGlobalNumber survive outside the scope of the foo function, so they could end up in the symbol table. Now, I said “could” because it’s quite possible to hide symbols that are private and shouldn’t be exposed to other modules (via symbol table strip’ing or function attributes), but more on that later.


2.1 Symbol Table Overview


As mentioned, the symbol table references internally implemented and externally referenced symbols.

Write out the following ex3.m Objective-C file to your /tmp/ directory:

@import Foundation;

char *someData = "hello";
void someFunction() {}

int main() {
  printf("some val\n");
  NSLog(@"some different value");
}

Compile ex3.m

lolgrep:/tmp$ clang -fmodules ex3.m -o ex3

Then give it a run…

lolgrep:/tmp$ ex3
some val
2019-10-25 17:27:34.956 ex3[19476:2277376] some different value

You didn’t implement the code for printf nor NSLog. Use nm to look at how that’s being referenced in the symbol table:

lolgrep:/tmp$ nm ex3
                 U _NSLog
                 U ___CFConstantStringClassReference
0000000100002010 d __dyld_private
0000000100000000 T __mh_execute_header
0000000100000f30 T _main
                 U _printf
0000000100002018 D _someData
0000000100000f20 T _someFunction
                 U dyld_stub_binder

You’ll see undefined external symbols preceded by an uppercase ‘U’. It’s dyld’s job to find the corresponding symbol in the appropriate library based upon the symbol table information. For the local symbols, you’ll see a ‘T’ (__TEXT) for someFunction and main and a ‘D’ (__DATA) for someData. We’ll talk more about this a couple paragraphs down…


2.2 Symbol Table Implementation


You’ve got the high-level view; now let’s jump into the weeds to see the symbol table data structures in action.

The location of the symbol table is given by the LC_SYMTAB Mach-O load command. The symbol table is just an array of a C struct called nlist_64 (we’re discussing 64-bit executables only here). This struct can be found in <mach-o/nlist.h> and has the following format:

struct nlist_64 {
    union {
        uint32_t  n_strx; /* index into the string table */
    } n_un;
    uint8_t n_type;        /* type flag, see below */
    uint8_t n_sect;        /* section number or NO_SECT */
    uint16_t n_desc;       /* see <mach-o/stab.h> */
    uint64_t n_value;      /* value of this symbol (or stab offset) */
};

The individual fields will be discussed in detail below.

Find the symbol table offset in the compiled ex3 image:

lolgrep:/tmp$ otool -l ex3 | grep LC_SYMTAB -A5
     cmd LC_SYMTAB
 cmdsize 24
  symoff 12488
   nsyms 9
  stroff 12652
 strsize 136

In this example, the symbol table starts at offset 12488 into the ex3 image.

Dump the raw bytes at the symoff value and do it for nsyms lines:

lolgrep:/tmp$ hexdump -s 12488 -e '1/4 "%08x " 1/1 "%02x " 1/1 "%02x " 1/2 "%04x " 1/8 "%010x" "\n"' ex3 | head -9
00000076 0e 0a 0000 0100002010
00000002 0f 01 0010 0100000000
00000016 0f 01 0000 0100000f30
0000001c 0f 0a 0000 0100002018
00000026 0f 01 0000 0100000f20
00000034 01 00 0300 0000000000
0000003b 01 00 0200 0000000000
0000005d 01 00 0100 0000000000
00000065 01 00 0100 0000000000

You gotta ❤️ the ugly formatting of the hexdump command…

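Start at offset 12488 (or your equivalent on your compilation) with the -s option. After that, there’s the huge formatted output string declared by the -e option. Breaking it down:

1/4 "%08x ": one 4-byte value printed as 8 hex digits, the n_strx field.
1/1 "%02x ": one 1-byte value printed as 2 hex digits, the n_type field.
1/1 "%02x ": another 1-byte value, the n_sect field.
1/2 "%04x ": one 2-byte value printed as 4 hex digits, the n_desc field.
1/8 "%010x": one 8-byte value printed as at least 10 hex digits, the n_value field.
"\n": a newline after each 16-byte nlist_64 entry.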

Cross reference the raw nlist_64 data from hexdump to the much-easier-to-type format from nm:

lolgrep:/tmp$ nm -xp ex3
0000000100002010 0e 0a 0000 00000076 __dyld_private
0000000100000000 0f 01 0010 00000002 __mh_execute_header
0000000100000f30 0f 01 0000 00000016 _main
0000000100002018 0f 0a 0000 0000001c _someData
0000000100000f20 0f 01 0000 00000026 _someFunction
0000000000000000 01 00 0300 00000034 _NSLog
0000000000000000 01 00 0200 0000003b ___CFConstantStringClassReference
0000000000000000 01 00 0100 0000005d _printf
0000000000000000 01 00 0100 00000065 dyld_stub_binder

The -x option will display the raw data of the nlist_64 struct (be aware that the -x output swaps n_strx and n_value around, putting n_value first since it is considered to be the most important). Next up, the -p option will display the symbols in symbol table order (instead of nm’s default sorting by name).
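
Here’s what pulling one of those 16-byte entries out of a file buffer could look like in C. This is a sketch under a few assumptions: a little-endian 64-bit image, symoff and nsyms already read from LC_SYMTAB, and names (symbol_entry, read_le, read_symbol) made up for this example. The struct is a flattened stand-in for nlist_64, not the real declaration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NLIST_64_SIZE 16u /* sizeof(struct nlist_64) */

/* Flattened stand-in for nlist_64 (no union). */
struct symbol_entry {
    uint32_t n_strx;
    uint8_t  n_type;
    uint8_t  n_sect;
    uint16_t n_desc;
    uint64_t n_value;
};

/* Decode an `nbytes`-wide little-endian value (nbytes <= 8). */
uint64_t read_le(const uint8_t *p, unsigned nbytes) {
    uint64_t v = 0;
    for (unsigned i = nbytes; i > 0; i--) {
        v = (v << 8) | p[i - 1];
    }
    return v;
}

/* Read symbol number `index` from the symbol table at file offset `symoff`.
 * Returns 0 on success, -1 if the index or offsets are out of range. */
int read_symbol(const uint8_t *image, size_t len, uint32_t symoff,
                uint32_t nsyms, uint32_t index, struct symbol_entry *out) {
    assert(image != NULL && out != NULL);
    if (index >= nsyms) {
        return -1;
    }
    uint64_t start = (uint64_t)symoff + (uint64_t)index * NLIST_64_SIZE;
    if (start > len || len - start < NLIST_64_SIZE) {
        return -1;
    }
    const uint8_t *p = image + start;
    out->n_strx  = (uint32_t)read_le(p, 4);
    out->n_type  = p[4];
    out->n_sect  = p[5];
    out->n_desc  = (uint16_t)read_le(p + 6, 2);
    out->n_value = read_le(p + 8, 8);
    return 0;
}
```

Run against ex3 with symoff 12488 and nsyms 9, index 2 should come back with the _main row from the hexdump above: n_strx 0x16, n_type 0x0f, n_sect 0x01, n_value 0x100000f30.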

The nm command will also grab the name of the symbol which can be derived from the offset of the n_strx into the blob of characters given by the stroff in the Mach-O LC_SYMTAB obtained earlier, which had the stroff value 12652.

Going after the first symbol, the __dyld_private symbol has an n_strx offset of 0x00000076 (notice I added a 0x to the beginning of the value to ensure it’s interpreted as hex) into the blob of characters pointed at by the stroff value. Use xxd to verify this:

lolgrep:/tmp$ xxd -s $(( 0x00000076 + 12652 )) ex3 | head -1
000031e2: 5f5f 6479 6c64 5f70 7269 7661 7465 0000  __dyld_private..

The bash $(( expr )) syntax (arithmetic expansion) will perform arithmetic on your behalf, adding 0x00000076 to 12652.

And that’s how the nm command can programmatically find the names for these symbols!
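
In code, the name lookup is a bounds check plus pointer arithmetic: the name lives at file offset stroff + n_strx and runs until the first NUL byte. The sketch below is one way to write it, with a made-up function name (symbol_name), and it assumes stroff and strsize were already read from LC_SYMTAB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the NUL-terminated name stored `n_strx` bytes into the string
 * table, or NULL if it falls outside the table or is not terminated. */
const char *symbol_name(const uint8_t *image, size_t len, uint32_t stroff,
                        uint32_t strsize, uint32_t n_strx) {
    assert(image != NULL);
    /* The whole string table must fit inside the file buffer. */
    if (stroff > len || strsize > len - stroff) {
        return NULL;
    }
    if (n_strx >= strsize) {
        return NULL;
    }
    const uint8_t *start = image + stroff + n_strx;
    /* Make sure a terminator exists before the string table ends. */
    if (memchr(start, '\0', strsize - n_strx) == NULL) {
        return NULL;
    }
    return (const char *)start;
}
```

For ex3, calling this with stroff 12652, strsize 136 and n_strx 0x76 reads from file offset 12770 (0x31e2), the same spot the xxd command above printed __dyld_private from.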


2.3 nlist_64 Overview


The symbol table’s nlist_64 array really can give off an impressive amount of information about an image. There’s a lot of info in <mach-o/nlist.h>, but some highlights will be reviewed below.

Looking at the nlist_64 value for the someData and someFunction symbols given by nm above on ex3…

nlist_64 fields: n_value          n_type n_sect  n_desc n_strx   
raw data:        0000000100002018 0f     0a      0000   0000001c _someData
                 0000000100000f20 0f     01      0000   00000026 _someFunction

The n_sect value is a 1-based index into the image’s Mach-O sections, in load command order:

lolgrep:/tmp$ otool -l ex3 | grep sectname
  sectname __text
  sectname __stubs
  sectname __stub_helper
  sectname __cstring
  sectname __unwind_info
  sectname __got
  sectname __cfstring
  sectname __objc_imageinfo
  sectname __la_symbol_ptr
  sectname __data

Note how the someFunction symbol is in the __text Mach-O section (index 1 in n_sect), while someData is in the __data section due to the 0x0a (AKA 10).

Cross reference the data gathered from manually parsing n_sect with the output of nm’s -m option:

lolgrep:/tmp$ nm -m ex3 | grep some
0000000100002018 (__DATA,__data) external _someData
0000000100000f20 (__TEXT,__text) external _someFunction

Now compare that with the undefined symbols:

nlist_64 fields: n_value          n_type n_sect  n_desc n_strx   
raw data:        0000000000000000 01     00      0300   00000034 _NSLog
                 0000000000000000 01     00      0200   0000003b ___CFConstantStringClassReference
                 0000000000000000 01     00      0100   0000005d _printf

Notice how the n_value (first column) is set to 0 for undefined symbols; they’re not implemented locally, so they shouldn’t have a local virtual address. The N_EXT bit is set (0x01) for the n_type, meaning the symbol is a public symbol. For the n_sect value (3rd column), these are all set to 0x00 or NO_SECT, meaning the Mach-O section is undefined (makes sense, they’re external symbols). Now let’s pick up on the remaining nlist_64 members.

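For undefined symbols, the high byte of n_desc is a library ordinal: a 1-based index into the list of libraries the image links against, which otool -L displays:

lolgrep:/tmp$ otool -L ex3 | awk '{print $1}'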
ex3:
/usr/lib/libSystem.B.dylib
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
/System/Library/Frameworks/Foundation.framework/Versions/C/Foundation

From the above, NSLog is expected to be found in Foundation (0x0300), __CFConstantStringClassReference in CoreFoundation (0x0200), and printf in libSystem.B.dylib (0x0100).
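
The mapping from n_desc to a library is a shift and a mask: the ordinal lives in the high byte (this is what the GET_LIBRARY_ORDINAL macro in <mach-o/nlist.h> extracts) and counts from 1 in the order the libraries are listed. Here’s a small sketch of that lookup. The function names are made up for this example, and the special ordinals (0, 0xfe, 0xff) are simply treated as "no library" rather than handled properly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Library ordinal stored in the high byte of n_desc. */
uint8_t library_ordinal(uint16_t n_desc) {
    return (uint8_t)((n_desc >> 8) & 0xff);
}

/* Map a 1-based ordinal onto the list of linked libraries (in load command
 * order). Returns NULL for ordinal 0 or anything past the end of the list,
 * which also covers the special ordinals in this sketch. */
const char *library_for_ordinal(const char *const *libs, size_t count,
                                uint8_t ordinal) {
    assert(libs != NULL);
    if (ordinal == 0 || ordinal > count) {
        return NULL;
    }
    return libs[ordinal - 1];
}
```

With the three libraries from otool -L ex3 in an array, an n_desc of 0x0300 gives ordinal 3, which picks out the Foundation path.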

Again, confirm this is true with nm:

lolgrep:/tmp$ nm -m ex3  | grep undefined
                 (undefined) external _NSLog (from Foundation)
                 (undefined) external ___CFConstantStringClassReference (from CoreFoundation)
                 (undefined) external _printf (from libSystem)
                 (undefined) external dyld_stub_binder (from libSystem)
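
The “external” and “(undefined)” labels that nm -m prints can be derived from the n_type byte alone. The sketch below decodes the three n_type values seen in ex3 (0x0e, 0x0f and 0x01). The NTYPE_* names, the symbol_kind struct and classify_symbol are made up for this example; the mask values are the same as N_STAB, N_TYPE, N_EXT, N_UNDF and N_SECT in <mach-o/nlist.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NTYPE_STAB_MASK 0xe0 /* same value as N_STAB: debugging entry */
#define NTYPE_TYPE_MASK 0x0e /* same value as N_TYPE: mask for the type bits */
#define NTYPE_EXT       0x01 /* same value as N_EXT: external symbol */
#define NTYPE_UNDF      0x00 /* same value as N_UNDF: undefined */
#define NTYPE_SECT      0x0e /* same value as N_SECT: defined in a section */

struct symbol_kind {
    int is_debug;     /* a stab entry; the other fields don't apply */
    int is_external;  /* visible to other images */
    int is_undefined; /* implemented somewhere else */
    int in_section;   /* implemented locally; n_sect says where */
};

/* Break an n_type byte into the flags above. */
void classify_symbol(uint8_t n_type, struct symbol_kind *out) {
    assert(out != NULL);
    uint8_t type_bits = (uint8_t)(n_type & NTYPE_TYPE_MASK);
    out->is_debug     = (n_type & NTYPE_STAB_MASK) != 0;
    out->is_external  = (n_type & NTYPE_EXT) != 0;
    out->is_undefined = !out->is_debug && type_bits == NTYPE_UNDF;
    out->in_section   = !out->is_debug && type_bits == NTYPE_SECT;
}
```

0x0f (_main, _someData) comes out as external and defined in a section, 0x0e (__dyld_private) as defined in a section but not external, which is why nm shows it with a lowercase d, and 0x01 (_NSLog, _printf) as external and undefined.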

As homework, navigate to the location of the Foundation framework (given by the otool -L ex3 command earlier) and confirm a symbol named NSLog exists there that is public for use.


2.4 Symbol Table Stripping


Create the ex4.c file with the following code:

#include <stdio.h>

void someFunction() { printf("yay\n"); }
int main() {
  someFunction();
  return 0;
}

Compile the ex4.c file:

lolgrep:/tmp$ clang ex4.c -o ex4

Check out the symbols with nm

lolgrep:/tmp$ nm ex4
0000000100002008 d __dyld_private
0000000100000000 T __mh_execute_header
0000000100000f70 T _main
                 U _printf
0000000100000f50 T _someFunction
                 U dyld_stub_binder

Now, strip the ex4 image

lolgrep:/tmp$ strip ex4

Stripping the symbol table will remove unneeded symbols. Dump the symbol table:

lolgrep:/tmp$  nm ex4
0000000100000000 T __mh_execute_header
                 U _printf
                 U dyld_stub_binder

Weird, the symbols aren’t there, but will it still run?

lolgrep:/tmp$ ex4
yay

In a MH_EXECUTE type image, C/Objective-C/Swift functions don’t need to be externally available, so that information can be removed from the symbol table. Why is that? A MH_EXECUTE type of file should be run by itself; it shouldn’t be loaded into another address space!

Note: Just because there’s no symbol in the symbol table for some code doesn’t mean that you can’t infer that a function is there. The LC_FUNCTION_STARTS load command points to a list of all the function/method locations (only code, NOT data) that are implemented by an image. This information is encoded as ULEB128 values. This is useful for debuggers and crash analytics.
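
If ULEB128 is new to you: each byte carries 7 bits of the value, lowest bits first, and the high bit says whether another byte follows. Below is a sketch of a decoder plus a walker for function-starts data. The walker assumes the usual layout (each value is a delta from the previous address, the first one is relative to the start of __TEXT, and a 0 ends the list), and the function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one ULEB128 value from [p, end).
 * Returns the number of bytes consumed, or 0 if the data is truncated
 * or too long to fit in 64 bits. */
size_t read_uleb128(const uint8_t *p, const uint8_t *end, uint64_t *value) {
    assert(p != NULL && end != NULL && value != NULL);
    uint64_t result = 0;
    unsigned shift = 0;
    size_t n = 0;
    while (p + n < end) {
        uint8_t byte = p[n++];
        if (shift >= 64) {
            return 0;
        }
        result |= (uint64_t)(byte & 0x7f) << shift; /* low 7 bits are payload */
        if ((byte & 0x80) == 0) {                   /* high bit clear: last byte */
            *value = result;
            return n;
        }
        shift += 7;
    }
    return 0;
}

/* Turn function-starts data into absolute virtual addresses.
 * Returns how many addresses were written to `out`. */
size_t decode_function_starts(const uint8_t *data, size_t len,
                              uint64_t text_vmaddr,
                              uint64_t *out, size_t max_out) {
    assert(data != NULL && out != NULL);
    const uint8_t *p = data;
    const uint8_t *end = data + len;
    uint64_t addr = text_vmaddr;
    size_t count = 0;
    while (p < end && count < max_out) {
        uint64_t delta = 0;
        size_t used = read_uleb128(p, end, &delta);
        if (used == 0 || delta == 0) {
            break; /* malformed data or the terminating zero */
        }
        addr += delta;
        out[count++] = addr;
        p += used;
    }
    return count;
}
```

As a hand-built example (not bytes dumped from a real file), the sequence d0 1e 20 00 with a __TEXT address of 0x100000000 decodes to 0x100000f50 and 0x100000f70, the addresses of someFunction and main in ex4 before stripping.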

What if the above code was compiled as a shared library? What would happen to the symbol table? Compile ex4.c, but now add the -shared option:

lolgrep:/tmp$ clang -shared ex4.c -o ex4.shared

Ensure the ex4.shared file is the correct MH_DYLIB filetype (from mach_header_64’s filetype, remember? 😛) after compiling:

lolgrep:/tmp$ file ex4.shared
ex4.shared: Mach-O 64-bit dynamically linked shared library x86_64

Check out the symbols via nm

lolgrep:/tmp$ nm ex4.shared
0000000000002008 d __dyld_private
0000000000000f70 T _main
                 U _printf
0000000000000f50 T _someFunction
                 U dyld_stub_binder

Then run the strip command and check out the differences. What is the difference compared to the MH_EXECUTE image and why do you think that is?

Public symbols in shared libraries won’t be stripped out because there could be consumers in other images that rely on those symbols, so they need to keep their names intact.


III. Objective-C


Objective-C still plays quite a relevant role, even in Swift. A pure Swift class (i.e. class ASwiftClass {}) will inherit from an Objective-C class called SwiftObject on all Apple platforms (you’ll verify this in a second).

In addition, Swift methods can be stripped out of the symbol table, but Objective-C methods can still be resolved via other ways (as you’ll see shortly). If a Swift class overrides an Objective-C method (i.e. viewDidLoad), there’ll be a compiler generated Objective-C bridging method (called a thunk) which retains and rearranges assembly registers to the Swift calling convention. The thunk method is visible on the Objective-C side, while the actual Swift method can be stripped out. You’ll see at the end of this writeup that you can infer the stripped Swift method by using this Objective-C class knowledge and the Swift reflection type knowledge introduced later.


3.1 Objective-C Class List


Using the Mach-O knowledge you’ve built up earlier, it’s quite easy to hunt for Objective-C classes that are built into an image. All you have to do is look for the __DATA_CONST.__objc_classlist (or __DATA.__objc_classlist if pre iOS-13) Mach-O section in an image.

Build up an executable with Objective-C code and name it ex5.m:

@import Foundation;

@interface AClass : NSObject
@end

@implementation AClass
@end

@interface AnotherClass : AClass
@end

@implementation AnotherClass
@end

int main() { return 0; }

Compile ex5.m with the debugging information flag (-g):

lolgrep:/tmp$ clang ex5.m -fmodules -o ex5 -g

Debug the program with lldb, set a breakpoint on the main function, then run the program:

lolgrep:/tmp$ lldb -s <(echo -e "b main\nrun") ex5

Note: By default, LLDB disables ASLR when executing programs. That means virtual addresses referenced in Mach-O load commands for an MH_EXECUTE image will likely be the same virtual memory addresses at runtime (provided you didn’t override LLDB’s settings).

With ASLR disabled via LLDB, use this knowledge to run otool inside of LLDB to query the runtime location of the __objc_classlist Mach-O section:

(lldb) platform shell otool -l ex5 | grep classlist -A3
  sectname __objc_classlist
   segname __DATA_CONST
      addr 0x0000000100001000
      size 0x0000000000000010

Note: It’s worth noting that LLDB does have the functionality to dump Mach-O sections (via (lldb) image dump section [ImageName]), although I am not very happy with the output of that command.

In my case, the __objc_classlist section starts at 0x0000000100001000 and has a size of 0x10 bytes (the size of 2 pointers). Confirm that this address (0x0000000100001000) is the __DATA_CONST.__objc_classlist Mach-O section loaded at runtime:

(lldb) image lookup -va 0x0000000100001000
      Address: ex5[0x0000000100001000] (ex5.__DATA_CONST.__objc_classlist + 0)
      Summary: (void *)0x0000000100002148: AClass
       Module: file = "/tmp/ex5", arch = "x86_64"

This __DATA_CONST.__objc_classlist section is an array of pointers to Objective-C classes implemented by that image! Confirm this by dereferencing the address at 0x0000000100001000 and 0x0000000100001008:

(lldb) x/2gx 0x0000000100001000
0x100001000: 0x0000000100002148 0x0000000100002198

You should already be familiar with the x LLDB command from earlier. The gx in this command says to read memory in giant words (8 bytes) and to format the output in hexadecimal. The 2 says to do it twice, once at address 0x0000000100001000 and once at address 0x0000000100001008.

The dereferenced values, 0x0000000100002148 and 0x0000000100002198, are pointers to the Objective-C classes. Print both of these classes out:

(lldb) exp -l objc -- (Class)0x0000000100002148
(Class) $2 = AClass
(lldb) exp -l objc -- (Class)0x0000000100002198
(Class) $3 = AnotherClass

Note: As of around clang version clang-1100.0.33.8 (in Xcode 11), the default location of the Objective-C __objc_classlist Mach-O section was moved from the __DATA Mach-O segment to the __DATA_CONST Mach-O segment. This “new” Mach-O segment disables write access to areas that only need to be written upon image loading (via dyld opcodes) and nothing more. Be aware that if you have an older version of clang, you’ll see __objc_classlist in the __DATA Mach-O segment.
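If you’d rather walk the class list from a tool than from LLDB, here’s a minimal C sketch of the idea. It assumes you’ve already copied the raw bytes of the __objc_classlist section into a buffer, that the image is 64-bit, and that your host is little-endian. The helper names objc_classlist_count and objc_classlist_entry are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: number of class pointers in an __objc_classlist
// section of `size` bytes (a 64-bit image stores one 8-byte pointer per class).
static size_t objc_classlist_count(size_t size) {
    return size / sizeof(uint64_t);
}

// Hypothetical helper: read the i-th class pointer out of the raw section
// bytes. memcpy sidesteps unaligned reads; assumes a little-endian host.
static uint64_t objc_classlist_entry(const uint8_t *section, size_t i) {
    uint64_t cls = 0;
    memcpy(&cls, section + i * sizeof(uint64_t), sizeof(cls));
    return cls;
}
```

Feeding it the 0x10 bytes from my run would give a count of 2 and the two class addresses LLDB printed. On disk those values are virtual addresses, so a tool still has to map them back to file offsets using the segment load commands.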


3.2 Objc4


You’ve learned where the Objective-C classes are located in memory and on disk; now it’s time to look at the layout of an Objective-C class. There’s much more info than the <objc/runtime.h> header most developers know about.

This Objective-C class layout can be found on Apple’s opensource site.

The most recent opensource Objective-C class layout (at the time of writing this) can be found in a header named objc4/objc4-756.2/objc-runtime-new.h

Note: You can determine the version of your Objective-C runtime via otool -l /usr/lib/libobjc.A.dylib | grep SOURCE -A2 | tail -1. This value will (hopefully) have a corresponding source code version on Apple’s https://opensource.apple.com/source/objc4/. Sometimes your version will be ahead of Apple’s opensource listings since they take their sweet time publishing it. For this writeup, my Objective-C runtime version (779.1) is ahead of Apple’s opensource listing (756.2).

In this header file, there is a C struct called objc_class. All instances of an Objective-C class will have a pointer to this objc_class struct at offset 0. Below is a simplified layout of the objc_class struct.

struct objc_class {
    struct objc_class* isa;        // size 8 bytes, offset +0x0
    struct objc_class* superclass; // size 8 bytes, offset +0x8
    struct bucket_t *_buckets;     // size 8 bytes, offset +0x10
    mask_t _mask;                  // size 4 bytes, offset +0x18
    mask_t _occupied;              // size 4 bytes, offset +0x1c
    uintptr_t bits;                // size 8 bytes, offset +0x20
};

… Skipping _buckets, _mask, and _occupied because they’re not applicable to this writeup…

If you want to resolve the pointer from the bits value, you’d have to bitwise AND it with 0x00007ffffffffff8UL. This is defined as the FAST_DATA_MASK in the objc-runtime-new.h header.
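To make that masking step concrete, here’s a minimal C sketch. The helper name objc_class_data_pointer is made up for illustration, and the mask is simply the value quoted above; other architectures or runtime versions may use a different mask, so treat it as an assumption.

```c
#include <stdint.h>

// Mask value quoted above from objc-runtime-new.h (assumed 64-bit layout).
#define FAST_DATA_MASK 0x00007ffffffffff8ULL

// Hypothetical helper: strip the flag bits packed into objc_class.bits,
// leaving only the pointer stored in that field.
static uint64_t objc_class_data_pointer(uint64_t bits) {
    return bits & FAST_DATA_MASK;
}
```

For example, a bits value of 0x00000001000020b2 masks down to the pointer 0x00000001000020b0; the low 3 bits are where the runtime packs its flags.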


3.3 How to Disappoint Swift Developers


Earlier, I said all pure Swift classes on Apple platforms are really just Objective-C classes underneath, so let’s prove that.

Create the following Swift file, ex6.swift:

class APureSwiftClass {}

let a = APureSwiftClass()
print("\(a)")

Compile with debugging info:

lolgrep:/tmp$ swiftc ex6.swift -o ex6 -g

Run with LLDB via a breakpoint on main (yes, Swift still has a main, they just hide it from you)

lolgrep:/tmp$ lldb ex6 -s <(echo -e "br set -n main -s ex6\nrun")

And then hunt for the __objc_classlist Mach-O section:

(lldb) platform shell otool -l ex6 | grep classlist -A3
  sectname __objc_classlist
   segname __DATA_CONST
      addr 0x0000000100001018
      size 0x0000000000000008

On this compilation, the class list starts at virtual address 0x0000000100001018 and the size is 8 bytes (due to only one Objective-C class).

Dereference this single value array to get the raw Objective-C data for the APureSwiftClass:

(lldb) x/gx 0x0000000100001018
0x100001018: 0x0000000100002138

Confirm this address contains the class of interest:

(lldb) exp -l objc -- (Class)0x0000000100002138
(Class) $2 = ex6.APureSwiftClass

Note: By executing the above LLDB command, you are initializing the class for runtime. This will essentially change the objc_class’s bits from a class_ro_t to the class_rw_t. But for this particular example, it doesn’t matter, since you’re going after the non-pointer aligned bits.

On my machine, the pointer for APureSwiftClass (AKA an objc_class*, AKA a Class) is 0x0000000100002138. Dump 5 giant words starting at this address to display the raw objc_class data:

(lldb) x/5gx 0x0000000100002138
0x100002138: 0x0000000100002100 (isa)    0x00007fff91b6d478 (superclass)
0x100002148: 0x00007fff63aa1400 (boring) 0x0000000000000000 (boring)
0x100002158: 0x00000001000020b2 (bits!)

I’ve annotated the above output. Looking at the bits of the objc_class, do you see how there is a 2 in the least significant hexadecimal digit?

If you were to consult the objc-runtime-new.h file, you’ll see FAST_IS_SWIFT_STABLE has the following declaration:

#define FAST_IS_SWIFT_STABLE    (1UL<<1)

This is the bit that says whether a class is a Swift class (bit set) or a plain Objective-C class (bit clear).

Now do the same thing for the superclass whose address you obtained in the previous LLDB command. First figure out what it’s called:

(lldb) exp -l objc -- (Class)0x00007fff91b6d478
(Class) $4 = Swift._SwiftObject

This class is called _SwiftObject. And what is it?

(lldb) x/5gx 0x00007fff91b6d478
0x7fff91b6d478: 0x00007fff91b6d4a0 0x0000000000000000
0x7fff91b6d488: 0x00007fff63aa1400 0x0000000000000000
0x7fff91b6d498: 0x00007fff811c6c50

Oh no! There’s no 2 in that 0x00007fff811c6c50 value! All your Swift classes on Apple platforms really just inherit from an Objective-C class underneath. Womp womp.

I anticipate this will change in a couple years, but for now, it’s always fun to bring Swift developers down to my level 😈

Note: This is the exact same trick I use for a custom LLDB script named dclass to print only Swift or Objective-C classes loaded into the process.
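Here’s the same check as a tiny C sketch, in case you want to filter classes in your own tool. The macro mirrors the FAST_IS_SWIFT_STABLE declaration shown earlier; the function name objc_class_is_swift_stable is my own invention.

```c
#include <stdbool.h>
#include <stdint.h>

// Same bit as the FAST_IS_SWIFT_STABLE declaration in objc-runtime-new.h.
#define FAST_IS_SWIFT_STABLE (1ULL << 1)

// Hypothetical helper: true if objc_class.bits marks the class as Swift.
static bool objc_class_is_swift_stable(uint64_t bits) {
    return (bits & FAST_IS_SWIFT_STABLE) != 0;
}
```

With the values from my run, the bits of APureSwiftClass (0x00000001000020b2) pass the check while the bits of _SwiftObject (0x00007fff811c6c50) do not.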


3.4 The bits value revisited (class_ro_t vs class_rw_t)

While LLDB is still paused (if not run it again and break on main), in the main function of ex6, execute the following Swift code:

(lldb) e import Foundation  # Needed to reference the NSClassFromString API in Swift
(lldb) p/x NSClassFromString("ex6.APureSwiftClass") # print the result in hexadecimal
(AnyClass?) $R18 = 0x0000000100002138 ex6.APureSwiftClass

Note how that 0x0000000100002138 (or equivalent on your computer) address matches with the dereferenced value obtained from __objc_classlist Mach-O section you found earlier.

Note: remember, whenever you reference a class, you’re initializing it, and changing the bits value from a class_ro_t* to a class_rw_t*. Executing NSClassFromString, po’ing a class in LLDB, or executing e APureSwiftClass.self in Swift via LLDB will initialize the class. As of right now, the APureSwiftClass should be initialized, meaning the class_rw_t* value will reside in the bits field. The class_ro_t* will point to memory contained in a Mach-O section, while the class_rw_t* will point to heap-allocated memory, since the runtime doesn’t know how many methods there are in the ObjC class (ObjC categories, associated objects, other runtime fun, etc.) until the class is loaded.

Rerun the program with the run command.

(lldb) run

The program should have reset itself to the start of the main function. Since ASLR is disabled, the virtual addresses will remain unchanged from run to run. Dump the address for the APureSwiftClass (mine was 0x0000000100002138) without initializing the class (i.e. no po‘ing).

(lldb) x/5gx 0x0000000100002138
0x100002138: 0x0000000100002100 0x00007fff91b6d478
0x100002148: 0x00007fff63aa1400 0x0000000000000000
0x100002158: 0x00000001000020b2 <- bits, AKA (class_ro_t* | FAST_IS_SWIFT_STABLE)

Keep a record of the bits value as it will change in a second. My value is 0x00000001000020b2

Initialize the APureSwiftClass via Swift:

(lldb) e APureSwiftClass.self
(ex6.APureSwiftClass.Type) $R16 = ex6.APureSwiftClass

Rerun the earlier command and inspect the objc_class struct’s bits value:

(lldb) x/5gx 0x0000000100002138
0x100002138: 0x0000000100002100 0x00007fff91b6d478
0x100002148: 0x00007fff63aa1400 0x0000001800000000
0x100002158: 0x0000000100501682 <- bits, AKA (class_rw_t* | FAST_IS_SWIFT_STABLE)

The bits param has now changed to the class_rw_t* | FAST_IS_SWIFT_STABLE.

If you’re building out an Objective-C runtime introspection tool, and you’re testing the tool on itself, make sure you know the correct struct that resides in the bits value. I burned a lot of hours working with the wrong struct by accidentally initializing Objective-C classes by po’ing them in LLDB 🤦‍♂️

Fortunately, the class_ro_t and the class_rw_t structs both have a uint32_t flags value right at the beginning. The flags member can tell you if the class is initialized by checking whether the top bit (1 << 31, AKA 0x80000000) is set.

In the above example, if I didn’t know if a class was initialized at runtime, I’d start with the bits value (0x0000000100501682). I’d remove the Swift bit packed flags, turning the value into 0x0000000100501680. Then, I’d dereference this value with a size of 32 bits in LLDB

(lldb) x/wx 0x0000000100501680
0x100501680: 0x80080000

The 0x8 in the most significant hex digit means bit 31 is set: the class has already been initialized, meaning the bits value is a class_rw_t*.
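As a sketch, that test boils down to a single mask in C. As far as I can tell, objc-runtime-new.h names this bit RW_REALIZED, so that’s the macro name used here; the function name objc_data_is_class_rw is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// Bit 31 of the leading flags word (named RW_REALIZED in the runtime
// header, to the best of my knowledge).
#define RW_REALIZED (1u << 31)

// Hypothetical helper: given the first 32 bits behind the masked bits
// pointer, guess whether that pointer is a class_rw_t* (true) or class_ro_t*.
static bool objc_data_is_class_rw(uint32_t flags) {
    return (flags & RW_REALIZED) != 0;
}
```

The 0x80080000 value from above returns true, while a flags value read straight from a class_ro_t on disk (0x00000080 for this class) returns false.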


3.5 The class_ro_t struct


The class_ro_t struct is the “key” value to exploring an Objective-C class. It’s the gateway to the class’s name, its methods, its properties, its instance variables, etc. And unlike the class_rw_t struct, this value is located in the __DATA.__objc_const Mach-O section, meaning one can query this information programmatically on disk.

Here’s a simplified class_ro_t layout:

struct class_ro_t {
    uint32_t flags;
    uint32_t instanceStart;
    uint32_t instanceSize;
    uint32_t reserved;

    const uint8_t * ivarLayout;
    const char * name;
    method_list_t * baseMethodList;  // An array for method_t
    protocol_list_t * baseProtocols; // An array for protocol_t
    const ivar_list_t * ivars;       // An array for ivar_t
    const uint8_t * weakIvarLayout;
    property_list_t *baseProperties; // An array for property_t
}

Using this knowledge, find the const char* name of this Swift class. While still in LLDB, start a fresh run of the program

(lldb) run
There is a running process, kill it and restart?: [Y/n] Y

Execution should stop at the main breakpoint you made earlier. This time, query the (same) address you obtained through Mach-O/NSClassFromString, but this time use LLDB’s symbol lookup functionality:

(lldb) image lookup -s "type metadata for ex6.APureSwiftClass"
1 symbols match 'type metadata for ex6.APureSwiftClass' in /tmp/ex6:
        Address: ex6[0x0000000100002138] (ex6.__DATA.__data + 64)
        Summary: type metadata for ex6.APureSwiftClass

The type metadata is Swift’s naming convention for an Objective-C class. More on that later.

You’ll see the familiar 0x0000000100002138 starting address for the Objective-C class called APureSwiftClass. Grab the bits:

(lldb) x/5gx 0x0000000100002138
0x100002138: 0x0000000100002100 0x00007fff91b6d478
0x100002148: 0x00007fff63aa1400 0x0000000000000000
0x100002158: 0x00000001000020b2

Remove the Swift class mask to turn the bits value into the class_ro_t pointer and dereference it 4 times:

(lldb) x/4gx 0x00000001000020b0
0x1000020b0: 0x0000001000000080 0x0000000000000010
0x1000020c0: 0x0000000000000000 0x0000000100000f00 

Cross referencing the class_ro_t struct with the dumped memory:

struct class_ro_t {
    uint32_t flags;             // 0x00000080
    uint32_t instanceStart;     // 0x00000010
    uint32_t instanceSize;      // 0x00000010
    uint32_t reserved;          // 0x00000000
    const uint8_t * ivarLayout; // 0x0000000000000000
    const char * name;          // 0x0000000100000f00
...

The above output has the fields included to make the raw data easier to read.
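The same cross referencing can be sketched in C. This assumes a 64-bit image, a little-endian host, and that you already have the raw bytes behind the class_ro_t pointer. The struct name class_ro_header and the helper read_class_ro_header are made up for this example; they cover only the leading fields of the simplified layout above.

```c
#include <stdint.h>
#include <string.h>

// Leading fields of the simplified 64-bit class_ro_t shown above.
// Pointers are kept as raw 64-bit virtual addresses, not host pointers.
struct class_ro_header {
    uint32_t flags;
    uint32_t instanceStart;
    uint32_t instanceSize;
    uint32_t reserved;
    uint64_t ivarLayout;
    uint64_t name;
};

// Hypothetical helper: copy the header out of raw bytes.
// Assumes the bytes and the host are both little-endian.
static struct class_ro_header read_class_ro_header(const uint8_t *bytes) {
    struct class_ro_header ro;
    memcpy(&ro, bytes, sizeof(ro));
    return ro;
}
```

The name field comes back as a virtual address (0x0000000100000f00 in my run), so a tool reading from disk would translate it to a file offset before reading the C string.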

In my program’s memory, the name of this class is found at address 0x0000000100000f00. Dereference the address using x’s string format.

(lldb) x/s 0x0000000100000f00
0x100000f00: "_TtC3ex615APureSwiftClass"

That’s how one can programmatically traverse every Objective-C class in an image and find its name!

There are many more components to reading the class_ro_t struct, but that can be reviewed on your own time.
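That "_TtC3ex615APureSwiftClass" string is the class name in Swift’s older mangling scheme: a _TtC prefix followed by a length-prefixed module name and a length-prefixed class name. Below is a rough C sketch that splits only this simple top-level case. Nested, generic, or private classes are encoded differently and will be rejected, so a real tool would likely lean on something like xcrun swift-demangle instead. Both function names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Copy one length-prefixed identifier ("3ex6" -> "ex6") into out.
// Returns the position just past it, or NULL on malformed input.
static const char *read_identifier(const char *p, char *out, size_t out_size) {
    if (!isdigit((unsigned char)*p)) return NULL;
    char *end = NULL;
    unsigned long len = strtoul(p, &end, 10);
    if (len == 0 || len >= out_size || strlen(end) < len) return NULL;
    memcpy(out, end, len);
    out[len] = '\0';
    return end + len;
}

// Split a top-level "_TtC<len><module><len><class>" name into its two parts.
// Returns false for anything that doesn't match that exact shape.
static bool split_swift_objc_name(const char *mangled,
                                  char *module, size_t module_size,
                                  char *cls, size_t cls_size) {
    if (strncmp(mangled, "_TtC", 4) != 0) return false;
    const char *p = read_identifier(mangled + 4, module, module_size);
    if (p == NULL) return false;
    p = read_identifier(p, cls, cls_size);
    return p != NULL && *p == '\0';
}
```

Passing in the name from above yields the module ex6 and the class APureSwiftClass, which lines up with the ex6.APureSwiftClass that LLDB printed earlier.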


3.6 Other Mach-O ObjC Sections


There are many other relevant Mach-O sections that pertain to Objective-C. Here’s a list of some of them to explore on your own time:


IV. DYLD Opcodes and Binding


Although this concept might be dry to many, it’s essential knowledge for understanding how Swift types and Objective-C classes interact with each other across different images. How can a class’s superclass be referenced in a different image?

This is the job for dyld opcodes. In the Mach-O LC_DYLD_INFO_ONLY load command, there exists a series of opcodes that tell dyld which symbols to bind and where to bind them. This happens as soon as an image is loaded into memory.


4.1 Finding Opcodes


Build out the following Objective-C file called ex7.m:

@import Foundation;

@interface SubArray : NSArray
@end
@implementation SubArray
@end

int main () { return 0; }

In the above code, SubArray inherits from NSArray, which isn’t implemented in your code, but referenced via the Foundation module. You’ll see dyld bind NSArray to the superclass field of the SubArray class.

Compile ex7.m, make sure to include the -fno-pie option:

lolgrep:/tmp$ clang -fmodules ex7.m  -o ex7 -fno-pie

Query the virtual address location of the SubArray Objective-C class:

lolgrep:/tmp$ nm ex7 | grep SubArray
00000001000010c8 S _OBJC_CLASS_$_SubArray
00000001000010a0 S _OBJC_METACLASS_$_SubArray

The SubArray class starts at 0x000001000010c8 (for me), whose isa will contain the value 0x000001000010a0 (the meta class). But what about its superclass? The address at 0x000001000010d0 (SubArray’s start address + the size of a pointer) needs to be bound to the NSArray class; this is where the opcodes come in.

Note: Frankly speaking, the -opcodes option of Apple’s dyldinfo is a bit disappointing for this information because it only displays the opcodes it interprets; it doesn’t display the realized virtual addresses that it will bind symbols to. That’s why dsdump’s -O option is better suited to showcase this.

Use dsdump’s -O option to display the opcodes:

lolgrep:/tmp$ dsdump -O ex7 | head -5
0x0000 BIND_OPCODE_SET_DYLIB_ORDINAL_IMM (2)
0x0001 BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM (0x00, _OBJC_CLASS_$_NSArray)
0x0018 BIND_OPCODE_SET_TYPE_IMM (1)
0x0019 BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB (2, 0x000000D0) (0x1000010D0)
0x001C BIND_OPCODE_DO_BIND (0x1000010D0, _OBJC_CLASS_$_NSArray)

Here you can see the NSArray class symbol being bound to the resolved address 0x1000010D0.
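If you’re writing your own opcode parser, the two building blocks are splitting each opcode byte into its opcode and immediate halves, and decoding ULEB128 operands. Here’s a minimal C sketch of both. The mask and opcode values are the ones I know from <mach-o/loader.h>, redefined here so the block compiles anywhere; read_uleb128 is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

// High nibble of a bind byte is the opcode, low nibble is an immediate.
// Values as defined in <mach-o/loader.h>.
#define BIND_OPCODE_MASK                        0xF0
#define BIND_IMMEDIATE_MASK                     0x0F
#define BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB 0x70

// Decode one ULEB128 value starting at p (never reading at or past end).
// *consumed receives the number of bytes that were used.
static uint64_t read_uleb128(const uint8_t *p, const uint8_t *end, size_t *consumed) {
    uint64_t result = 0;
    unsigned shift = 0;
    const uint8_t *start = p;
    while (p < end && shift < 64) {
        uint8_t byte = *p++;
        result |= (uint64_t)(byte & 0x7F) << shift; // low 7 bits are payload
        shift += 7;
        if ((byte & 0x80) == 0) break;              // high bit clear: last byte
    }
    *consumed = (size_t)(p - start);
    return result;
}
```

Taking the opcode at 0x0019 above: it occupies 3 bytes, which I’d expect to be 0x72 0xD0 0x01. The first byte splits into the opcode 0x70 and the segment index 2, and the two operand bytes decode to the offset 0xD0. Adding that offset to the segment’s start address (0x100001000, judging by the resolved address) gives the 0x1000010D0 that dsdump printed.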

Note: Jonathan Levin has a great writeup of what each of these opcodes do.

Verify at runtime the 0x1000010D0 (or equivalent) address is being bound via dyld’s debugging environment variables (see man dyld).

Add the DYLD_PRINT_BINDINGS=1 env var to the ex7 executable and observe the output.

lolgrep:/tmp$ DYLD_PRINT_BINDINGS=1 ./ex7

This produced the following on my machine:

dyld: bind: ex7:0x1000010D0 = CoreFoundation:_OBJC_CLASS_$_NSArray, *0x1000010D0 = 0x7FFF895E11C0
dyld: bind: ex7:0x1000010A8 = CoreFoundation:_OBJC_METACLASS_$_NSArray, *0x1000010A8 = 0x7FFF895E1E40
dyld: bind: ex7:0x1000010A0 = libobjc.A.dylib:_OBJC_METACLASS_$_NSObject, *0x1000010A0 = 0x7FFF91AC90F0
dyld: bind: ex7:0x1000010B0 = libobjc.A.dylib:__objc_empty_cache, *0x1000010B0 = 0x7FFF63AA1400
dyld: bind: ex7:0x1000010D8 = libobjc.A.dylib:__objc_empty_cache, *0x1000010D8 = 0x7FFF63AA1400

Again, observe the NSArray class being bound to address 0x1000010D0 (or equivalent) at runtime.

Understanding the binding opcodes is incredibly important for inferring which symbol is being used from another image, since these addresses are zeroed out while on disk. Only at load time do these addresses get bound to an actual address, so dyld opcode knowledge is a must for building a class-dump tool.


V. Swift


Finally! You got to Swift! Unlike all the previous concepts, exploring Swift types is evolving at a rapid pace. So much so that I must say that this information will likely get stale in less than a year.

This information only pertains to Swift 5.x; it will likely be obsolete when Swift 6.x rolls around.


5.1 Swift Types


Unlike Objective-C, where only classes are fair game, Swift can introspect any type: an enum, struct, class, protocol, etc!

To do this, several different Mach-O sections are utilized. By far the most important one is the __TEXT.__swift5_types. The __swift5_types section contains an array of “relative pointers” to all the Swift types found in the image. Think of the __TEXT.__swift5_types section as being sort of like the __DATA_CONST.__objc_classlist section, which is an array of pointers to Objective-C classes. The difference is that this section covers all Swift types and each entry only uses 4 bytes (instead of the 8 bytes of a full pointer).

A relative pointer is a 4-byte signed integer. To resolve it, add its value to the address where the integer itself is stored.

In pseudo-c-ish-code…

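#include <mach-o/getsect.h>
#include <stdint.h>

unsigned long sz = 0;
// getsectdata() returns a char*, so cast it to an array of 4-byte offsets
int32_t *data = (int32_t *)getsectdata("__TEXT", "__swift5_types", &sz);
for (unsigned long i = 0; i < (sz / sizeof(int32_t)); i++) {
  // target = address of the 4-byte slot + the signed offset stored in it
  uintptr_t nominalTypeDesc = (uintptr_t)&data[i] + (intptr_t)data[i];
}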

The relative pointers idea is a really nifty trick by the Swift developers. Not only does each relative pointer only use half the amount of bytes, but dyld doesn’t need to rebase any of the pointers when loaded into memory since it will always be a hardcoded offset to whatever address the image is loaded in.

These relative pointers will point to something called a nominal type descriptor, which is just a fancy way of saying the layout of a type (similar to the objc_class struct you saw earlier). These type descriptors come in varying sizes based upon the type they represent. For example, here’s the structural layout for a nominal type descriptor for a Swift class. A nominal type descriptor for, say, a Swift enum will have a different layout.

If you clicked on the above link, that’s a little hard on the eyes, right? Figuring out the offsets for C++ classes can be a pain in the ass due to inheritance. Fortunately, Scott Knight provides an excellent article with simplified C struct offsets. If you’re interested in the Swift layouts, I’d strongly suggest you read Scott’s work, since Scott does a much better job explaining all the Swift struct layouts. So instead of focusing on all the different structs like Scott, I’ll do a deep dive into one struct layout: the layout for Swift classes.

Here’s the simplified layout for a Swift class in Swift 5:

struct NominalClassDescriptor {
// Inherited From TargetContextDescriptor
    uint32_t Flags; // All nominal type descriptors begin with Flags, used to determine the kind of nominal type descriptor
    int32_t  Parent; // The reference to the parent type descriptor

// Inherited From TargetTypeContextDescriptor
    int32_t  Name; // The name of the type descriptor
    int32_t  AccessFunctionPtr; // A pointer to the metadata access function for this type.
    int32_t  Fields; // A pointer to the field descriptor for type, if any

// Implemented in NominalClassDescriptor
    int32_t  SuperclassType; // The type of the superclass, expressed as a mangled type name
    uint32_t MetadataNegativeSizeInWords; // Ignore for this writeup
    uint32_t MetadataPositiveSizeInWords; // Ignore for this writeup
    uint32_t NumImmediateMembers; // Number of additional members stored after this class (aka NumImmediateMembers * sizeof(void*) payload)
    uint32_t NumFields; // Number of properties stored in this class
    uint32_t FieldOffsetVectorOffset; // The offset of the field offset vector for this struct's stored properties in its metadata
};

You’ll explore this NominalClassDescriptor struct in a Swift program. Create ex8.swift with the following code:

class SomeClass {}

Compile with debugging info:

lolgrep:/tmp$ swiftc -g ex8.swift

Debug the ex8 executable by setting a breakpoint on main

lolgrep:/tmp$ lldb ex8 -s <(echo -e "br set -n main -s ex8\nrun")

While paused in main, query for the __swift5_types Mach-O section:

(lldb) platform shell otool -l ex8 | grep types -A3
  sectname __swift5_types
   segname __TEXT
      addr 0x0000000100000fa4
      size 0x0000000000000004

From the output, the address of interest is found at 0x0000000100000fa4 (on my machine), whose size is only 4 bytes (AKA there’s only one Swift type). Dereference this address to a 32 bit signed integer.

(lldb) x/wd 0x0000000100000fa4
0x100000fa4: -84

The address 0x100000fa4 has the value -84. Add these two values together to get the nominal type descriptor address for the SomeClass type:

(lldb) image lookup -a `0x0000000100000fa4 + -84`
      Address: ex8[0x0000000100000f50] (ex8.__TEXT.__const + 28)
      Summary: ex8`nominal type descriptor for ex8.SomeClass

The backticks are an LLDB trick: the enclosed expression is evaluated first and its result is passed to the surrounding command. The resolved address is 0x0000000100000f50 (for me).
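For a tool that reads the image from disk rather than from a live process, the same arithmetic needs the section’s virtual address plus its raw bytes. Here’s a minimal C sketch under those assumptions (little-endian host, section bytes already loaded into a buffer). The helper name swift5_types_entry is made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: resolve the i-th relative pointer of a __swift5_types
// section. section_addr is the section's virtual address (the addr value
// from otool); section points at the raw bytes of that section.
static uint64_t swift5_types_entry(uint64_t section_addr,
                                   const uint8_t *section, size_t i) {
    int32_t offset = 0;
    memcpy(&offset, section + i * sizeof(int32_t), sizeof(offset));
    uint64_t slot_addr = section_addr + i * sizeof(int32_t);
    // The offset is relative to the address of the slot that stores it.
    return slot_addr + (uint64_t)(int64_t)offset;
}
```

With my numbers, the section lives at 0x100000fa4 and holds the 4 bytes of -84, so entry 0 resolves to 0x100000f50, the same nominal type descriptor address LLDB reported.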

Cross referencing the NominalClassDescriptor layout from above, dump the first three 32-bit fields (Flags, Parent, and Name) to reach the Name field.

(lldb) x/3wd 0x0000000100000f50
0x100000f50: -2147483568
0x100000f54: -28
0x100000f58: -20

The Name string is at offset -20 from address 0x100000f58. Use LLDB to evaluate this address:

(lldb) x/s `0x100000f58 + -20`
0x100000f44: "SomeClass"

BOOM! And that’s Swift reflection in a nutshell!
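One detail worth spelling out: the Flags value (-2147483568, or 0x80000050 in hex) is what tells you this descriptor belongs to a class in the first place. My understanding from Swift 5’s MetadataValues.h is that the kind sits in the low 5 bits, with classes, structs, and enums being 16, 17, and 18. Treat those numbers as assumptions and check them against the Swift version you’re targeting. The names below are hypothetical.

```c
#include <stdint.h>

// Kind values as I understand them from Swift 5's MetadataValues.h
// (an assumption worth verifying against your Swift version).
enum {
    CONTEXT_KIND_CLASS  = 16,
    CONTEXT_KIND_STRUCT = 17,
    CONTEXT_KIND_ENUM   = 18
};

// Hypothetical helper: the descriptor kind lives in the low 5 bits of Flags.
static uint32_t context_descriptor_kind(uint32_t flags) {
    return flags & 0x1F;
}
```

For 0x80000050 the low 5 bits are 0x10 (16), so a tool would know to parse what follows with the class layout rather than the struct or enum layout.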


5.2 Swift Methods in a Class


The NominalClassDescriptor has 11 4-byte members, totalling 44 bytes. Immediately following the NominalClassDescriptor, there exists a varying amount of data. I won’t get into the nitty gritty of this (check out the TrailingObjects.h header if you want to learn more), but the data trailing the NominalClassDescriptor will look like the following (provided the class has implemented some methods):

  // End of NominalClassDescriptor here...
  uint32_t VTableOffset; // The offset to the vtable from the metadata
  uint32_t VTableSize; // The count of methods
  MethodDescriptor[VTableSize]; // Variable size of MethodDescriptor

The VTableSize will indicate an array of MethodDescriptor objects immediately following the VTableSize. The layout of the MethodDescriptor struct looks like this:

struct TargetMethodDescriptor {
  MethodDescriptorFlags Flags; // uint32_t flags describing the Impl
  int32_t Impl; // Relative pointer to the method implementation
};

The MethodDescriptorFlags looks like the following:

class MethodDescriptorFlags {
public:
  typedef uint32_t int_type;
  enum class Kind {
    Method,
    Init,
    Getter,
    Setter,
    ModifyCoroutine,
    ReadCoroutine,
  };

private:
  enum : int_type {
    KindMask = 0x0F,                // 16 kinds should be enough for anybody
    IsInstanceMask = 0x10,
    IsDynamicMask = 0x20,
  };

Note: If you’re building a Swift introspection tool, the MethodDescriptorFlags are absolute gold. The Impl will give you a virtual address, which you can cross reference with the symbol table to (hopefully) get the name of the symbol. As you learned earlier, if the symbol table is stripped, you can’t resolve the name. Fortunately, you can still get a decent idea of the stripped symbol’s function by consulting the Flags field. For example, if the Flags field tells you the method is a Getter, then you can look at the assembly of the function to find the direct field offset value. Once you know that value, you can cross reference the corresponding property (and its offset) to realize that method is the getter of that Swift property!

You will programmatically explore the Swift methods implemented in a Swift class. Build out ex9.swift with the following code:

class AClass {
  let aNumber = 4
  func aFunc() {}
}

Build with debugging enabled, run, then pause on main:

lolgrep:/tmp$ swiftc -g ex9.swift && lldb ex9 -s <(echo -e "br set -n main -s ex9\nrun")

Query the location of the NominalClassDescriptor via LLDB:

(lldb) image lookup -rs type\ descriptor ex9
1 symbols match the regular expression 'type descriptor' in /tmp/ex9:
        Address: ex9[0x0000000100000f18] (ex9.__TEXT.__const + 32)
        Summary: ex9`nominal type descriptor for ex9.AClass

The image lookup -rs command will do a regex search for the symbol “type descriptor” that’s constrained to anything in the ex9 image. This is equivalent to you manually resolving the location of the nominal type descriptor via the relative pointers from __TEXT.__swift5_types array in the earlier example.

For me, the NominalClassDescriptor for AClass is at 0x0000000100000f18. Remember, the NominalClassDescriptor has a size of 0x2c (44) bytes. Resolve this offset via LLDB to grab the VTableOffset and VTableSize immediately following it.

(lldb) x/2wx `0x0000000100000f18 + 44`
0x100000f44: 0x0000000b 0x00000002

The VTableOffset has a value of 0xb, the VTableSize has a size of 2. That means immediately following this address, there’ll be 2 MethodDescriptors (each with 2 int32_t fields).

In LLDB, dump the additional 2 MethodDescriptors:

(lldb) x/6wd `0x0000000100000f18 + 44`
0x100000f44: 11
0x100000f48: 2
0x100000f4c: 16
0x100000f50: -464
0x100000f54: 1
0x100000f58: -440

Address 0x100000f4c holds the MethodDescriptorFlags with a value of 16 (or 0x10). Consulting the above MethodDescriptorFlags struct, one can determine this method is an instance method for the AClass. Prove this by resolving the Impl on the next address in LLDB:

(lldb) image lookup -a `0x100000f50 + -464`
      Address: ex9[0x0000000100000d80] (ex9.__TEXT.__text + 48)
      Summary: ex9`ex9.AClass.aFunc() -> () at ex9.swift:3

Excellent! You were able to resolve this method via Swift metadata to get the address 0x0000000100000d80 for aFunc! Verify this is truly the case via a good ol’ nm:

(lldb) platform shell nm ex9 | grep aFunc | xcrun swift-demangle
0000000100000d80 t ex9.AClass.aFunc() -> ()
0000000100000f4c s method descriptor for ex9.AClass.aFunc() -> ()

Again, both nm and the Swift metadata tell us that ex9.AClass.aFunc() will be found at address 0x00000100000d80.
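The flag decoding you did by eye can be sketched in a few lines of C. The enum order and the masks mirror the MethodDescriptorFlags excerpt above; the C names themselves are made up for this example.

```c
#include <stdbool.h>
#include <stdint.h>

// Same order as the Kind enum in the MethodDescriptorFlags excerpt.
enum method_kind {
    METHOD_KIND_METHOD,
    METHOD_KIND_INIT,
    METHOD_KIND_GETTER,
    METHOD_KIND_SETTER,
    METHOD_KIND_MODIFY_COROUTINE,
    METHOD_KIND_READ_COROUTINE
};

// Hypothetical helpers, one per mask from the excerpt.
static enum method_kind method_descriptor_kind(uint32_t flags) {
    return (enum method_kind)(flags & 0x0F); // KindMask
}

static bool method_descriptor_is_instance(uint32_t flags) {
    return (flags & 0x10) != 0;              // IsInstanceMask
}

static bool method_descriptor_is_dynamic(uint32_t flags) {
    return (flags & 0x20) != 0;              // IsDynamicMask
}
```

The first descriptor’s flags (16) decode to an instance Method, matching aFunc. The second descriptor’s flags (1) decode to a non-instance Init, which is presumably the class’s initializer, though I didn’t resolve that one above.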


5.3 Swift Calling Convention


The calling convention differs a bit in Swift in both ARM and x86 families on Apple platforms. If you’re totally new to this stuff, I’d recommend reading Mike Ash’s writeup or this article, which explains the C and Objective-C x86_64 calling conventions first.

Before we can talk about Swift, let’s briefly recap the calling convention of Objective-C for x86_64 and ARM64:

Using the -[NSString writeToFile:atomically:] method as an example:

[@"test" writeToFile:@"/tmp/some_file.txt" atomically:NO]

This will get translated to the always-fun objc_msgSend family with the following registers:

         @"test" "writeToFile:atomically:" @"/tmp/some_file.txt" NO 
ARM64    X0      X1                        X2                    X3
X86_64   RDI     RSI                       RDX                   RCX

If you’re a deer in the headlights reading this, please read the above link(s) first.

Now onto Swift:

Swift changes the self around to R13 on x86_64 and X20 on ARM64. Since there’s no need for an Objective-C Selector, the RSI/X1 registers can be used for arguments.

This means that all arguments for Swift can start at the “first” register (RDI/X0) and the self argument will be at R13/X20. This has the additional benefit that R13 and X20 are callee-saved registers, so self survives across function calls, i.e. it won’t get lost after returning from a callee.

Note: Again for you Swift introspection people: the MethodDescriptorFlags is great again because it will tell you that a particular (strip’d) function uses the Swift calling convention, so your parsing engine needs to look for self in R13/X20 instead of RDI/X0.


5.4 Introspecting strip’d Swift/ObjC Classes


Let’s bring this all together with actual ARM64 compiled code for an iOS UIViewController Swift subclass.

Write ex10.swift with the following Swift code:

import UIKit
class ViewController : UIViewController {
  var meh: Int = 4
  override func viewDidLoad() {
    super.viewDidLoad()
    print("yayyyyy")
  }
  func swiftFunc() { }
}

Compile ex10.swift for an ARM64 CPU. Since this is iOS, you’ll need to use the appropriate target and SDK:

lolgrep:/tmp$ swiftc ex10.swift -sdk `xcrun --show-sdk-path  -sdk iphoneos` -target arm64-apple-ios99.99.99.99

Use dsdump to dump the Swift code:

lolgrep:/tmp$ dsdump  --swift ex10 --verbose=4 --defined --color
 class ex10.ViewController : UIViewController /System/Library/Frameworks/UIKit.framework/UIKit {

  // Properties
  var meh : Int

  // ObjC -> Swift bridged methods
  0x10000756c  @objc ViewController.viewDidLoad()
  0x100007924  @objc ViewController.init(nibName:bundle:)
  0x100007bc0  @objc ViewController.init(coder:)

  // Swift methods
  0x100007198  func ViewController.meh.getter // getter
  0x10000722c  func ViewController.meh.setter // setter
  0x1000072ec  func ViewController.meh.modify // modifyCoroutine
  0x1000075b4  func ViewController.swiftFunc() // method
 }

Now, strip the ex10 image:

lolgrep:/tmp$ strip ex10

Rerun the previous dsdump command:

lolgrep:/tmp$ dsdump  --swift ex10 --verbose=4 --defined --color

You’ll get output that looks similar to this:

 class ex10.ViewController : UIViewController /System/Library/Frameworks/UIKit.framework/UIKit {

  // Properties
  var meh : Int

  // ObjC -> Swift bridged methods
  0x10000756c  @objc ViewController.viewDidLoad <stripped>
  0x100007924  @objc ViewController.initWithNibName:bundle: <stripped>
  0x100007bc0  @objc ViewController.initWithCoder: <stripped>

  // Swift methods
  0x100007198  func <stripped> // getter
  0x10000722c  func <stripped> // setter
  0x1000072ec  func <stripped> // modifyCoroutine
  0x1000075b4  func <stripped> // method
 }

This part gets interesting for reverse engineering: Swift does not include overridden Objective-C methods in its metadata.

Let me say that again: The following Swift bridged code does not get picked up by the Swift metadata:

override func viewDidLoad() {
  super.viewDidLoad()
  print("yayyyyy")
}

So how does Objective-C know how to call Swift code with a completely different calling convention? The compiler will build a “bridging method” in Objective-C land called a thunk. That’s where this @objc ViewController.viewDidLoad method comes into play. It will rearrange all the registers around and make sure everything is properly retained so no Objective-C references go out of scope when the Swift code is executing. In addition, that’s where this “hidden” Swift method is located!

Dump the Objective-C thunk method at address 0x10000756c, use a perl regex to stop at the return of this method:

otool -tV ex10 | perl -lne 'print if /10000756c/ .. /ret\z/'

This will dump the ARM64 assembly for the “Objective-C viewDidLoad” thunk method. This is the output from my machine:

  000000010000756c  sub sp, sp, #0x40
  0000000100007570  stp x20, x19, [sp, #0x20]
  0000000100007574  stp x29, x30, [sp, #0x30]
  0000000100007578  add x29, sp, #0x30
  000000010000757c  mov x2, x0
* 0000000100007580  str x0, [sp, #0x18]
  0000000100007584  mov x0, x2
  0000000100007588  str x1, [sp, #0x10]
  000000010000758c  bl  0x100007d04 ; symbol stub for: _objc_retain
* 0000000100007590  ldr x20, [sp, #0x18]
  0000000100007594  str x0, [sp, #0x8]
* 0000000100007598  bl  0x100007368
  000000010000759c  ldr x0, [sp, #0x18]
  00000001000075a0  bl  0x100007cf8 ; symbol stub for: _objc_release
  00000001000075a4  ldp x29, x30, [sp, #0x30]
  00000001000075a8  ldp x20, x19, [sp, #0x20]
  00000001000075ac  add sp, sp, #0x40
  00000001000075b0  ret

I’ve added asterisks to the interesting ARM64 assembly instructions. X0 (self) is saved to the stack and retain’d, then self is loaded from the stack into X20, and finally the thunk calls the Swift side of viewDidLoad at address 0x100007368. Again, this method is not visible to the Swift metadata.
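A tool could automate this by scanning the thunk for bl instructions and computing where each one lands. Here’s a minimal C sketch of decoding a single ARM64 bl: the top 6 bits identify the instruction and the low 26 bits hold a signed word offset from the instruction’s own address. The function name arm64_bl_target is hypothetical, and the instruction word in the usage note is one I computed by hand from the addresses above, not something taken from the binary.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: if insn is an ARM64 `bl`, write its destination to
// *target and return true. pc is the address of the instruction itself.
static bool arm64_bl_target(uint32_t insn, uint64_t pc, uint64_t *target) {
    if ((insn & 0xFC000000u) != 0x94000000u) return false; // not a bl
    int64_t imm26 = (int64_t)(insn & 0x03FFFFFFu);
    if (imm26 & 0x02000000) imm26 -= 0x04000000;           // sign-extend 26 bits
    *target = pc + (uint64_t)(imm26 * 4);                  // offset counts 4-byte words
    return true;
}
```

For the bl at 0x100007598, the encoded word should be 0x97FFFF74, which decodes to a target of 0x100007368. The thunk contains three bl instructions, so a tool would still need to discard the ones that land on symbol stubs (objc_retain and objc_release here) to be left with the hidden Swift method.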

For those of you who are introspection tool builders, hopefully you’ll see a window to improve your toolset:

I can’t wait to see what y’all can do with this in the future 🍻

ARM64e Disk Pointers

TODO, I’ll get to this eventually :]

EL FIN

WOW! You read to here!? Here’s a 30 minute video of me explaining the above concepts.