Performance Aware Programming
I'm taking Casey Muratori's Performance Aware Programming course, and getting a lot out of it. I am doing all the exercises in rust on linux (rather than C++ on Windows). So far that has not been a problem.
In part 1 of the course, I wrote a simulator for the 8086 processor that covered some but not all of the instruction set. I took the opportunity to learn a bit about GUI programming in rust, and wrote two versions of the simulator (the second is still incomplete) that visualize the state of memory and the instruction stream: version 1 with tui and version 2 with egui. Perhaps I'll write about or share those at some point.
Now that we're in part 2, we're digging more specifically into performance. To do that, Casey has started talking about the CPU instructions that we're going to use to get relevant information.
rdtsc
The first instruction we're looking at is rdtsc. It used to be the case that rdtsc gave a measure of the actual clock pulses of the CPU. That is, you could use it to count the actual number of cycles between instructions. It was added to CPUs with Pentium (i586) and had a great effect on how programmers measured performance.
Later, the meaning of rdtsc changed to something like a very accurate wall clock. I don't know the details yet, but apparently it had something to do with the fact that, with multiple cores, CPUs were designed so that cores could be "boosted" to go fast or throttled to go slow and save power. This means that, depending on the core your thread was scheduled on at any given time, rdtsc would report different values, and you could not put readings from rdtsc in order, since they might have been taken from different cores. In addition, CPUs apparently do some magic called out-of-order execution, which I don't understand yet. This might also have something to do with the changes.
In any case, rdtsc has changed to be something like a global (across all threads), monotonically-increasing counter that increments at about the speed of the base clock. Since I'm on linux, I can't follow along exactly with Casey's demonstrations. For example, to check the base clock speed of his computer, he calls systeminfo. I don't have systeminfo on my computer, but I think inxi returns similar information:
➜ inxi
CPU: 16-core (8-mt/8-st) 12th Gen Intel Core i9-12900K (-MST AMCP-)
speed/min/max: 1842/800/5300:4000 MHz Kernel: 6.3.8-arch1-1 x86_64 Up: 23h 3m
Mem: 10838.7/64043.4 MiB (16.9%) Storage: 1.82 TiB (44.8% used) Procs: 477
Shell: Zsh inxi: 3.3.27
I think the important part of this output is "4000 MHz". That is -- I think I have a 4GHz "base speed". But I'm not totally sure. Hopefully as I go through the homework, I'll be able to confirm this.
Another reason I'm not just following along directly from Casey's demonstrations is that I'm trying to use rust as much as possible for this course. Therefore I need to translate the C++ listings to rust. The way I can call rdtsc from rust is:
unsafe {
    core::arch::x86_64::_rdtsc();
}
The call will be different depending on the CPU architecture I'm building for. For example, if building for x86 instead of x86_64, then it will be:
core::arch::x86::_rdtsc();
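As a minimal sketch of how the two paths could live in one program, the block below picks the intrinsic with `cfg(target_arch)` and measures how many ticks pass while a closure runs. The helper names `read_tsc` and `elapsed_ticks` are made up for illustration and are not from the course; the sketch only compiles for x86 and x86_64 targets.

```rust
// Pick the intrinsic that matches the target architecture.
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::_rdtsc;
#[cfg(target_arch = "x86")]
use core::arch::x86::_rdtsc;

/// Read the timestamp counter. `read_tsc` is a hypothetical helper name.
fn read_tsc() -> u64 {
    // The post calls the intrinsic inside `unsafe`, so this sketch does too.
    unsafe { _rdtsc() }
}

/// Count how many TSC ticks elapse while `work` runs.
/// `elapsed_ticks` is a hypothetical helper name.
fn elapsed_ticks<F: FnOnce()>(work: F) -> u64 {
    let start = read_tsc();
    work();
    let end = read_tsc();
    // wrapping_sub avoids a panic in debug builds if the counter ever wrapped.
    end.wrapping_sub(start)
}

fn main() {
    let ticks = elapsed_ticks(|| {
        // Some throwaway work to measure.
        let mut sum = 0u64;
        for i in 0..1_000_000u64 {
            sum = sum.wrapping_add(std::hint::black_box(i));
        }
        std::hint::black_box(sum);
    });
    println!("elapsed ticks: {ticks}");
}
```

The number printed is in timestamp-counter ticks, not in seconds or CPU cycles, and it will vary from run to run.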
Finding a debugger, pwndbg
Casey uses RemedyBG in the course to debug his programs. Unfortunately, RemedyBG is only available on Windows. So, I went looking for something similar that I could use. Luckily another person in the course made some recommendations:
I followed the setup instructions for pwndbg, but I didn't really understand what it was or how it was supposed to work. I had to ask ChatGPT for clarification:
Me:
Is `pwndbg` an extension of `gdb`? A plugin? What is it?
ChatGPT:
Yes, pwndbg is indeed an extension of GDB. It's a Python module that improves GDB's functionalities and user interface, making it more convenient and visually appealing to use, especially for debugging binary exploits and reverse-engineering tasks.
It integrates with GDB through GDB's Python API. When you run GDB with pwndbg installed and configured, pwndbg uses this API to customize the GDB interface and add new commands.
In a broader sense, pwndbg can be considered as a "plugin" for GDB because it extends its functionalities, although "plugin" is not a term that's commonly used in the GDB ecosystem. More commonly, extensions like pwndbg are referred to as "GDB scripts" or "GDB Python scripts".
Other popular GDB extensions include GEF (GDB Enhanced Features) and PEDA (Python Exploit Development Assistance for GDB), which provide similar features to pwndbg.
Ok, so pwndbg adds a bunch of commands and changes the default output of gdb. The output is colorful, so rather than copying and pasting from my terminal, I'm going to use screenshots to demonstrate.
Inspecting rdtsc with pwndbg
Starting with this rust program:
fn main() {
    unsafe {
        core::arch::x86_64::_rdtsc();
    }
}
I build a debug build with cargo build. I don't build a release build because I couldn't find the rdtsc instruction in the disassembly and think it might have been optimized away.
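I haven't confirmed whether the optimizer really removes the call or whether it was just inlined somewhere I didn't look. Either way, one common way to keep a value from being treated as unused is to pass it through `std::hint::black_box`. The sketch below assumes an x86_64 target; `read_and_keep` is a made-up helper name.

```rust
use std::hint::black_box;

/// Read the timestamp counter and mark the result as "used"
/// so the optimizer has a reason to keep the read around.
/// `read_and_keep` is a hypothetical helper name.
fn read_and_keep() -> u64 {
    let t = unsafe { core::arch::x86_64::_rdtsc() };
    // black_box is an identity function that the compiler treats as opaque.
    black_box(t)
}

fn main() {
    let t = read_and_keep();
    // Printing the value is another simple way to make sure it is used.
    println!("tsc = {t}");
}
```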
Then I point gdb to my binary (called homework):
gdb target/debug/homework
I put a breakpoint in main with break main:
pwndbg> break main
Breakpoint 1 at 0x8640
Since I don't need all of the context that pwndbg exposes, I set the context sections to the ones I actually want:
pwndbg> set context-sections regs disasm
Set which context sections are displayed (controls order) to 'regs disasm'.
I start the program with run, which takes me to the breakpoint at main.
From the disassembly output, I can already see the location of the rdtsc instruction. If I couldn't, then I would need to step through the program until I did, so that I could find a place to put the breakpoint (assuming I don't want to step through every single instruction). Actually, there's probably a way to specify a breakpoint by giving the line number from the source file, especially in debug builds. But I don't really know how or understand that just yet. So instead, I just set the breakpoint by the address (break *0x55555555c670):
pwndbg> break *0x55555555c670
Breakpoint 2 at 0x55555555c670: file /rustc/90c541806f23a127002de5b4038be731ba1458ca/library/core/src/../../stdarch/crates/core_arch/src/x86/rdtsc.rs, line 26.
I continue the program with continue. Then I use nexti a few times and watch as the registers change. I'll include screenshots below, but for clarity I'll show just the most relevant output first:
Before the call to rdtsc.
Disassembly:
Registers:
*RAX 0x55555559e418 —▸ 0x55555555c630 (core::ptr::drop_in_place<std::rt::lang_start<()>::{{closure}}>::hd4d8780483a01e18) ◂— mov qword ptr [rsp - 8], rdi
*RBX 0x7fffff7fe000
*RCX 0x7fffff7ff000 ◂— 0x7fffff7ff000
*RDX 0x1
*RDI 0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax
After the call to rdtsc (calling nexti to step one instruction).
Disassembly:
Registers:
*RAX 0x95c927b8
RBX 0x7fffff7fe000
RCX 0x7fffff7ff000 ◂— 0x0
*RDX 0xf69a
RDI 0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax
After the shift left shl
Disassembly:
Registers:
RAX 0x95c927b8
RBX 0x7fffff7fe000
RCX 0x7fffff7ff000 ◂— 0x0
*RDX 0xf69a00000000
RDI 0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax
After the or
Disassembly:
Registers:
*RAX 0xf69a95c927b8
RBX 0x7fffff7fe000
RCX 0x7fffff7ff000 ◂— 0x0
RDX 0xf69a00000000
RDI 0x55555555c640 (homework::main::h3b62edde3892ce37) ◂— push rax
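The shl and or steps above are joining two 32-bit halves into one 64-bit value: rdtsc leaves the high half in RDX and the low half in RAX. Here is a small sketch of the same arithmetic in rust, using the register values from this session. `combine_tsc` is a name I made up for illustration; it is not part of the standard library.

```rust
/// Combine the two 32-bit halves that rdtsc leaves in EDX (high) and EAX (low).
/// `combine_tsc` is a hypothetical helper name.
fn combine_tsc(high: u32, low: u32) -> u64 {
    // Same as the disassembly: shift the high half left by 32, then or in the low half.
    ((high as u64) << 32) | (low as u64)
}

fn main() {
    let low: u32 = 0x95c9_27b8; // RAX right after rdtsc
    let high: u32 = 0xf69a; // RDX right after rdtsc
    let tsc = combine_tsc(high, low);
    println!("{tsc:#x}"); // prints 0xf69a95c927b8
}
```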
Conclusion
The timestamp counter is the value that ended up in RAX (0xf69a95c927b8), and is also the u64 returned by the _rdtsc() function in rust code.
Translating this counter to a measurement like milliseconds will require me to know the clock rate that is used by the timestamp counter. That's essentially what the homework assignment of the course is, so I'll be doing that shortly. In the meantime, I feel good about this exercise, because I now know that I won't have any blockers completing the assignment on linux.
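As a rough sketch of what that conversion might look like (not necessarily how the course does it), one option is to count ticks across a known wall-clock interval measured with `std::time::Instant`, and use the ratio as an estimate of the counter's frequency. The names `estimate_tsc_hz` and `ticks_to_millis` are hypothetical, and the sketch assumes an x86_64 target.

```rust
use std::time::{Duration, Instant};

fn read_tsc() -> u64 {
    unsafe { core::arch::x86_64::_rdtsc() }
}

/// Estimate the TSC frequency in Hz by counting ticks over a wall-clock interval.
/// `estimate_tsc_hz` is a hypothetical helper name.
fn estimate_tsc_hz(wait: Duration) -> f64 {
    let wall_start = Instant::now();
    let tsc_start = read_tsc();
    // Busy-wait until the OS clock says the interval has passed.
    while wall_start.elapsed() < wait {
        std::hint::spin_loop();
    }
    let tsc_end = read_tsc();
    let secs = wall_start.elapsed().as_secs_f64();
    tsc_end.wrapping_sub(tsc_start) as f64 / secs
}

/// Convert a tick count to milliseconds, given a frequency in Hz.
/// `ticks_to_millis` is a hypothetical helper name.
fn ticks_to_millis(ticks: u64, tsc_hz: f64) -> f64 {
    ticks as f64 / tsc_hz * 1000.0
}

fn main() {
    let hz = estimate_tsc_hz(Duration::from_millis(100));
    println!("estimated TSC frequency: {:.0} Hz", hz);
    println!("1_000_000 ticks ~ {:.3} ms", ticks_to_millis(1_000_000, hz));
}
```

A longer wait should give a steadier estimate, since the error from reading the two clocks at slightly different moments is spread over more ticks.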
Full screenshots:
Scanning the output to find the relevant information is challenging. I suspect there's a smarter way to be doing this that makes things easier, e.g. by keeping the output in the same part of the screen each time.