
Performance Aware Programming

I'm taking Casey Muratori's Performance Aware Programming course, and getting a lot out of it. I am doing all the exercises in rust on linux (rather than C++ on Windows). So far that has not been a problem.

In part 1 of the course, I wrote a simulator for the 8086 processor that covered some but not all of the instruction set. I took the opportunity to learn a bit about GUI programming in rust with tui and then with egui. I wrote two versions of the simulator (though the second version is still incomplete), and visualized the state of memory and the instruction stream using tui (for version 1) and egui (for version 2). Perhaps I'll write about or share those at some point.

Now that we're in part 2, we're digging more specifically into performance. To do that, Casey has started talking about the CPU instructions that we're going to use to get relevant information.

rdtsc

The first instruction we're looking at is rdtsc. It used to be the case that rdtsc gave a measure of the actual clock pulses of the CPU. That is -- you could use it to count the actual number of cycles between instructions. It was added to CPUs with Pentium (i586) and had a great effect on how programmers measured performance.

Later, the meaning of rdtsc changed to something like a very accurate wall clock. I don't know the details yet, but apparently it had something to do with multi-core CPUs being designed so that cores could be "boosted" to go fast or throttled to go slow and save power. This meant that, depending on the core your thread was scheduled on at any given time, rdtsc would report different values, and you could not put readings from rdtsc in order, since they might have been taken from different cores. In addition, CPUs apparently do some magic called out-of-order execution, which I don't understand yet. This might also have something to do with the changes.

In any case, rdtsc has changed to be something like a global (across all threads), monotonically-increasing counter that increments at about the speed of the base clock. Since I'm on linux, I can't follow along exactly with Casey's demonstrations. For example, to check the base clock speed of his computer, he calls systeminfo. I don't have systeminfo on my computer, but I think inxi returns similar information:

➜ inxi
CPU: 16-core (8-mt/8-st) 12th Gen Intel Core i9-12900K (-MST AMCP-)
speed/min/max: 1842/800/5300:4000 MHz Kernel: 6.3.8-arch1-1 x86_64 Up: 23h 3m
Mem: 10838.7/64043.4 MiB (16.9%) Storage: 1.82 TiB (44.8% used) Procs: 477
Shell: Zsh inxi: 3.3.27

I think the important part of this output is "4000 MHz". That is -- I think I have a 4GHz "base speed". But I'm not totally sure. Hopefully as I go through the homework, I'll be able to confirm this.

Another reason I'm not just following along directly from Casey's demonstrations is that I'm trying to use rust as much as possible for this course. Therefore, I need to translate the C++ listings to rust. The way I can call rdtsc from rust is:

    unsafe {
        core::arch::x86_64::_rdtsc();
    }

The call will be different depending on the CPU architecture I'm building for. For example if building for x86 instead of x86_64, then it will be:

        core::arch::x86::_rdtsc();
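
Since the module path depends on the target, one way to hide the difference is a small wrapper that is selected at compile time with `#[cfg(target_arch = "...")]`. This is only a sketch: the name `read_tsc` is my own invention (it is not part of `core::arch` or the course material), and the code assumes an x86 or x86_64 target, so it will not compile for anything else.

```rust
/// Hypothetical helper: read the timestamp counter on x86_64.
#[cfg(target_arch = "x86_64")]
fn read_tsc() -> u64 {
    // Assumption: the CPU allows rdtsc from user mode, which is the usual setup.
    unsafe { core::arch::x86_64::_rdtsc() }
}

/// Same helper for 32-bit x86 builds.
#[cfg(target_arch = "x86")]
fn read_tsc() -> u64 {
    unsafe { core::arch::x86::_rdtsc() }
}

fn main() {
    let start = read_tsc();
    let end = read_tsc();
    // wrapping_sub avoids a panic in debug builds if the counter ever wrapped.
    println!("start = {start}, end = {end}, elapsed ticks = {}", end.wrapping_sub(start));
}
```

The rest of the program can then call `read_tsc()` without caring which of the two modules was used.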

Finding a debugger, pwndbg

Casey uses RemedyBG in the course to debug his programs. Unfortunately, RemedyBG is only available on Windows. So, I went looking for something similar that I could use. Luckily another person in the course made some recommendations:

screenshot

I followed the setup instructions for pwndbg, but I didn't really understand what it was or how it was supposed to work. I had to ask ChatGPT for clarification:

Me:

Is `pwndbg` an extension of `gdb`? A plugin? What is it?

ChatGPT:

Yes, pwndbg is indeed an extension of GDB. It's a Python module that improves GDB's functionalities and user interface, making it more convenient and visually appealing to use, especially for debugging binary exploits and reverse-engineering tasks.

It integrates with GDB through GDB's Python API. When you run GDB with pwndbg installed and configured, pwndbg uses this API to customize the GDB interface and add new commands.

In a broader sense, pwndbg can be considered as a "plugin" for GDB because it extends its functionalities, although "plugin" is not a term that's commonly used in the GDB ecosystem. More commonly, extensions like pwndbg are referred to as "GDB scripts" or "GDB Python scripts".

Other popular GDB extensions include GEF (GDB Enhanced Features) and PEDA (Python Exploit Development Assistance for GDB), which provide similar features to pwndbg.

Ok, so pwndbg adds a bunch of commands and changes the default output of gdb. The output is colorful, so rather than copying and pasting from my terminal, I'm going to use screenshots to demonstrate.

Inspecting rdtsc with pwndbg

Starting with this rust program:

fn main() {
    unsafe {
        core::arch::x86_64::_rdtsc();
    }
}

I build a debug build with cargo build. I don't build a release build because I couldn't find the rdtsc instruction in the disassembly, and I think it might have been optimized away.
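
I haven't confirmed that the optimizer was really dropping the instruction. If it was, one thing to try in a release build is to actually use the returned value, for example by passing it through `std::hint::black_box` and printing it, so the compiler has a reason to keep the read. The sketch below assumes an x86_64 target, and `read_tsc_kept` is a name I made up for illustration.

```rust
use std::hint::black_box;

/// Hypothetical helper: read the counter and mark the result as "used".
fn read_tsc_kept() -> u64 {
    // Assumption: x86_64 target where rdtsc is allowed from user mode.
    let tsc = unsafe { core::arch::x86_64::_rdtsc() };
    // black_box is a best-effort hint that asks the optimizer to treat
    // the value as if it were observed somewhere.
    black_box(tsc)
}

fn main() {
    let tsc = read_tsc_kept();
    println!("tsc = {tsc}");
}
```

Note that `black_box` is documented as best-effort only, so this is something to check in the disassembly rather than a guarantee.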

Then I point gdb to my binary (called homework):

gdb target/debug/homework

screenshot

I put a breakpoint in main with break main:

pwndbg> break main
Breakpoint 1 at 0x8640

Since I don't need all of the context that pwndbg exposes, I set the context sections to the ones I actually want:

pwndbg> set context-sections regs disasm
Set which context sections are displayed (controls order) to 'regs disasm'.

I start the program with run, which takes me to the breakpoint at main.

screenshot

From the disassembly output, I can already see the location of the rdtsc instruction. If I didn't, then I would need to step through the program until I did, so that I could find a place to put the breakpoint (assuming I don't want to step through every single instruction). Actually, there's probably a way to specify a breakpoint by giving the line number from the source file, especially in debug builds. But I don't really know how or understand that just yet. So instead, I just set the breakpoint by the address (break *0x55555555c670):

pwndbg> break *0x55555555c670
Breakpoint 2 at 0x55555555c670: file /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/../../stdarch/crates/core_arch/src/x86/rdtsc.rs, line 26.

I continue the program with continue. Then I use nexti a few times and watch as the registers change. I'll include screenshots below, but for clarity I'll show just the most relevant output first:

Before the call to rdtsc.

Disassembly:

screenshot

Registers:

screenshot

*RAX  0x55555559e418 —▸ 0x55555555c630 (core::ptr::drop_in_place<std::rt::lang_start<()>::{{closure}}>::hd4d8780483a01e18) ◂— mov qword ptr [rsp - 8], rdi
*RBX  0x7fffff7fe000
*RCX  0x7fffff7ff000 ◂— 0x7fffff7ff000
*RDX  0x1
*RDI  0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax

After the call to rdtsc (calling nexti to step one instruction).

Disassembly:

screenshot

Registers:

screenshot

*RAX  0x95c927b8
 RBX  0x7fffff7fe000
 RCX  0x7fffff7ff000 ◂— 0x0
*RDX  0xf69a
 RDI  0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax

After the shift left (shl).

Disassembly:

screenshot

Registers:

screenshot

 RAX  0x95c927b8
 RBX  0x7fffff7fe000
 RCX  0x7fffff7ff000 ◂— 0x0
*RDX  0xf69a00000000
 RDI  0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax

After the or.

Disassembly:

screenshot

Registers:

screenshot

*RAX  0xf69a95c927b8
 RBX  0x7fffff7fe000
 RCX  0x7fffff7ff000 ◂— 0x0
 RDX  0xf69a00000000
 RDI  0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax

Conclusion

The timestamp counter is the value that ended up in RAX (0xf69a95c927b8), and is also the u64 returned by the _rdtsc() function in rust code.
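
The register walk-through above boils down to a shift and an or: rdtsc leaves the low 32 bits in RAX and the high 32 bits in RDX, and the code that follows glues them into one 64-bit value. Here is a minimal sketch of the same arithmetic using the register values from my session. The function name `combine_tsc` is my own, not something from the standard library.

```rust
/// Hypothetical helper: combine the high half (from RDX) and the low half
/// (from RAX) into the full 64-bit counter value.
fn combine_tsc(high: u32, low: u32) -> u64 {
    ((high as u64) << 32) | (low as u64)
}

fn main() {
    // Values copied from the register dumps after the rdtsc instruction.
    let rax: u32 = 0x95c9_27b8;
    let rdx: u32 = 0xf69a;

    // The shl step: move the high half into the upper 32 bits.
    let shifted = (rdx as u64) << 32;
    println!("{shifted:#x}"); // prints 0xf69a00000000

    // The or step: merge in the low half.
    let tsc = combine_tsc(rdx, rax);
    println!("{tsc:#x}"); // prints 0xf69a95c927b8
}
```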

Translating this counter to a measurement like milliseconds will require me to know the clock rate that is used by the timestamp counter. That's essentially what the homework assignment of the course is, so I'll be doing that shortly. In the meantime, I feel good about this exercise, because I now know that I won't have any blockers completing the assignment on linux.
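
Once the counter frequency is known, the conversion itself is just a division. The sketch below uses 4 GHz as a placeholder, taken from the "4000 MHz" in the inxi output, which I have not confirmed is the rate the timestamp counter actually runs at. The function name `ticks_to_ms` and the tick count are made up for illustration.

```rust
/// Hypothetical helper: convert a difference between two counter readings
/// into milliseconds, given the counter frequency in Hz.
fn ticks_to_ms(ticks: u64, tsc_hz: u64) -> f64 {
    ticks as f64 * 1000.0 / tsc_hz as f64
}

fn main() {
    // ASSUMPTION: 4 GHz, based on the unconfirmed inxi reading.
    let assumed_tsc_hz: u64 = 4_000_000_000;
    // Made-up difference between two rdtsc readings.
    let elapsed_ticks: u64 = 8_000_000;

    let ms = ticks_to_ms(elapsed_ticks, assumed_tsc_hz);
    println!("{elapsed_ticks} ticks is about {ms} ms at {assumed_tsc_hz} Hz");
}
```

The hard part, and the actual homework, is finding the real value to pass as `tsc_hz`.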

Full screenshots:

Scanning the output to find the relevant information is challenging. I suspect there's a smarter way to be doing this that makes things easier, e.g. by keeping the output in the same part of the screen each time.

screenshot screenshot screenshot screenshot screenshot screenshot