Working With Clocks in Linux
In a previous post, I mentioned that I was learning about the rdtsc instruction in x86 assembly. I finally got around to actually using it in a program while doing the homework for Casey Muratori's Performance-Aware Programming course. I built a simple profiler based on rdtsc that allows me to measure the performance of various parts of my Rust programs. Here's how I went about doing it.
Background
Casey's listing is copied below. When compiled for Windows, it uses QueryPerformanceFrequency and QueryPerformanceCounter. When compiled for Linux, it uses a fixed frequency (1000000, i.e. 1 MHz) and gettimeofday:
/* ========================================================================
   (C) Copyright 2023 by Molly Rocket, Inc., All Rights Reserved.

   This software is provided 'as-is', without any express or implied
   warranty. In no event will the authors be held liable for any damages
   arising from the use of this software.

   Please see https://computerenhance.com for more information
   ======================================================================== */

/* ========================================================================
   LISTING 70
   ======================================================================== */

#if _WIN32

#include <intrin.h>
#include <windows.h>

static u64 GetOSTimerFreq(void)
{
    LARGE_INTEGER Freq;
    QueryPerformanceFrequency(&Freq);
    return Freq.QuadPart;
}

static u64 ReadOSTimer(void)
{
    LARGE_INTEGER Value;
    QueryPerformanceCounter(&Value);
    return Value.QuadPart;
}

#else

#include <x86intrin.h>
#include <sys/time.h>

static u64 GetOSTimerFreq(void)
{
    return 1000000;
}

static u64 ReadOSTimer(void)
{
    // NOTE(casey): The "struct" keyword is not necessary here when compiling in C++,
    // but just in case anyone is using this file from C, I include it.
    struct timeval Value;
    gettimeofday(&Value, 0);

    u64 Result = GetOSTimerFreq()*(u64)Value.tv_sec + (u64)Value.tv_usec;
    return Result;
}

#endif

/* NOTE(casey): This does not need to be "inline", it could just be "static"
   because compilers will inline it anyway. But compilers will warn about
   static functions that aren't used. So "inline" is just the simplest way
   to tell them to stop complaining about that. */
inline u64 ReadCPUTimer(void)
{
    // NOTE(casey): If you were on ARM, you would need to replace __rdtsc
    // with one of their performance counter read instructions, depending
    // on which ones are available on your platform.
    return __rdtsc();
}
Before I read or tried to understand the details of the listing, I tried to reproduce what I thought it was meant to do, which was to get the timer frequency and something called the "performance counter".
I read about CLOCK_MONOTONIC, clock_gettime, and clock_getres from the GNU libc manual.
Clock resolution
The clock_getres function is supposed to return the clock's actual resolution:

Function: int clock_getres (clockid_t clock, struct timespec *res)

Get the actual resolution of the clock identified by clock, storing it in *ts. For instance, if the clock hardware for CLOCK_REALTIME uses a quartz crystal that oscillates at 32.768 kHz, then its resolution would be 30.518 microseconds, and clock_getres (CLOCK_REALTIME, &r) would set r.tv_sec to 0 and r.tv_nsec to 30518.

The return value is 0 on success and -1 on failure. The following errno error condition is defined for this function:

EINVAL: The clock identified by clock is not supported.
However, whenever I called it, I seemed to get 1 nanosecond:
use std::mem;
use std::time::Duration;

extern crate libc;

fn clock_getres() -> libc::timespec {
    let mut ts = unsafe { mem::zeroed() };
    let clk_id = libc::CLOCK_MONOTONIC;
    unsafe {
        libc::clock_getres(clk_id, &mut ts);
    }
    ts
}

fn as_duration(ts: libc::timespec) -> Duration {
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}

fn main() {
    let res = as_duration(clock_getres());
    println!("res: {}", res.as_nanos()); // -> "res: 1"
}
Apparently, this is expected behavior. At least, according to this Stack Overflow answer:

Unfortunately, clock_getres() POSIX function ... does not report timers granularity in Linux. It can return only two predefined results: either 1/HZ for low resolution clocks (where HZ is value of CONFIG_HZ macro used when configuring linux kernel, typical values are 100, 300, 1000) or 1 ns for high resolution clocks (hrtimers).
So, rather than relying on clock_getres, I would need to assume a resolution: 1μs if I used gettimeofday, or 1ns if I used clock_gettime.
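Both of the predefined results can be observed by querying a coarse clock alongside a high-resolution one. Here's a minimal sketch of that check (my own, not part of the profiler):

use std::mem;

// Return a clock's reported resolution in nanoseconds.
fn resolution_ns(clk_id: libc::clockid_t) -> i64 {
    let mut ts: libc::timespec = unsafe { mem::zeroed() };
    unsafe {
        libc::clock_getres(clk_id, &mut ts);
    }
    ts.tv_sec * 1_000_000_000 + ts.tv_nsec
}

fn main() {
    // A low resolution clock: should report 1/HZ
    // (e.g. 4000000ns on a CONFIG_HZ=250 kernel).
    println!("coarse: {}ns", resolution_ns(libc::CLOCK_MONOTONIC_COARSE));
    // A high resolution clock: reports 1ns on hrtimer kernels.
    println!("monotonic: {}ns", resolution_ns(libc::CLOCK_MONOTONIC));
}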
clock_gettime or gettimeofday
Casey's listing uses gettimeofday, which the manual describes as the older API for reading the time:

The GNU C Library also provides an older, but still widely used, function [(gettimeofday)] for getting the current time with a resolution of microseconds.

Since clock_getres() suggests that CLOCK_MONOTONIC has a 1ns resolution, I decided to use clock_gettime instead of gettimeofday.
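With that choice, the nanosecond equivalent of the Linux branch of Casey's listing looks roughly like this in Rust (a sketch of my own; the function names just mirror the C ones):

fn get_os_timer_freq() -> u64 {
    1_000_000_000 // clock_gettime reports nanoseconds
}

fn read_os_timer() -> u64 {
    let mut ts: libc::timespec = unsafe { std::mem::zeroed() };
    unsafe {
        libc::clock_gettime(libc::CLOCK_MONOTONIC, &mut ts);
    }
    get_os_timer_freq() * ts.tv_sec as u64 + ts.tv_nsec as u64
}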
Instrumentation
Finally, I wanted to make sure that it was as easy as possible to instrument my code with interval-based timing measurements.
The API I implemented was:
interval_start(&str)
interval_end(&str)
interval_duration(&str) -> std::time::Duration
In retrospect, I could simplify this further to:
type IntervalHandle = usize;
interval_start() -> IntervalHandle;
interval_end(i: IntervalHandle) -> std::time::Duration;
Perhaps I'll do that later.
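Sketched out, that simplified API might look like this (untested; it reuses the rdtsc() helper and RDTSC_TO_MICROSECONDS conversion factor from the full code below):

type IntervalHandle = usize;

lazy_static! {
    static ref INTERVAL_STARTS: Mutex<Vec<u64>> = Mutex::new(Vec::new());
}

pub fn interval_start() -> IntervalHandle {
    let mut starts = INTERVAL_STARTS.lock().unwrap();
    starts.push(rdtsc());
    starts.len() - 1 // the handle is just an index into the Vec
}

pub fn interval_end(i: IntervalHandle) -> Duration {
    let ticks = rdtsc() - INTERVAL_STARTS.lock().unwrap()[i];
    Duration::from_micros((ticks as f64 * *RDTSC_TO_MICROSECONDS) as u64)
}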
The important thing is that this makes it pretty much painless to measure any part of my program, without the use of a profiler macro and without passing an instance of a profiler around the program:
interval_start("parse json");
let json = parser::parse(&json)?;
interval_end("parse json");
Code
The first version of the profiler is copied below. There are a few tricky things happening, which I'd love to remove if I could figure out how:
- First, I want globally available mutable variables. This isn't really a thing in Rust, so instead I use lazy_static! for the "globally available" part and Mutex for the mutable part. I haven't worked with enough Rust code to know whether this is a common pattern. I'd like to not have to make this a Mutex, since this profiler only really makes sense in a single-threaded environment. (As mentioned in a previous post, rdtsc will report very different values on different threads. You don't want to try comparing the raw values against each other.)
- Second, I want to be able to convert rdtsc values to some real-world time I can relate to, like microseconds or nanoseconds. So I take rdtsc measurements before and after one second has passed (using clock_gettime), which tells me how many rdtsc ticks happen per second. I then save that to a static variable. (I think I'm probably losing precision along the way, unfortunately.)
- Third, since the previous step takes time, the user of the library should probably not lazily wait to create the static values. You wouldn't want the 1 second delay in the step above to happen within an interval. So I provide an init function that forces the lazy statics to be created, which makes me think that I must be missing some common pattern in Rust for doing what I'm trying to do (or it's simply not done for reasons I don't know).
- Fourth, there's no error handling (or testing). I don't do any checking for key collisions, or for whether a key exists in the map.
use std::collections::HashMap;
use std::mem;
use std::sync::Mutex;
use std::time::Duration;

#[macro_use]
extern crate lazy_static;
extern crate libc;

lazy_static! {
    static ref TIMESTAMP_START: Mutex<HashMap<String, u64>> = Mutex::new(HashMap::new());
    static ref TIMESTAMP_END: Mutex<HashMap<String, u64>> = Mutex::new(HashMap::new());
    static ref RDTSC_TO_MICROSECONDS: f64 = {
        // Count how many rdtsc ticks elapse while CLOCK_MONOTONIC advances
        // by one second, then invert to get microseconds per tick.
        let rdtsc_start = rdtsc();
        let os_start = clock_gettime();
        let mut os_end = clock_gettime();
        let mut os_elapsed = as_duration(os_end) - as_duration(os_start);
        while os_elapsed < Duration::from_secs(1) {
            os_end = clock_gettime();
            os_elapsed = as_duration(os_end) - as_duration(os_start);
        }
        let rdtsc_end = rdtsc();
        let rdtsc_to_microseconds = 1000000.0 / (rdtsc_end as f64 - rdtsc_start as f64);
        rdtsc_to_microseconds
    };
}

// Force the lazy statics to be created, so the one-second calibration
// above doesn't happen in the middle of a measured interval.
pub fn init() {
    let _start = TIMESTAMP_START.lock().unwrap();
    let _end = TIMESTAMP_END.lock().unwrap();
    let _rdtsc_to_microseconds = *RDTSC_TO_MICROSECONDS;
}

pub fn interval_start(s: &str) {
    TIMESTAMP_START
        .lock()
        .unwrap()
        .insert(s.to_string(), rdtsc());
}

pub fn interval_end(s: &str) {
    TIMESTAMP_END.lock().unwrap().insert(s.to_string(), rdtsc());
}

pub fn interval_duration(s: &str) -> Duration {
    let end_map = TIMESTAMP_END.lock().unwrap();
    let start_map = TIMESTAMP_START.lock().unwrap();
    let end = end_map.get(s).unwrap();
    let start = start_map.get(s).unwrap();
    Duration::from_micros(((end - start) as f64 * *RDTSC_TO_MICROSECONDS) as u64)
}

fn rdtsc() -> u64 {
    unsafe { core::arch::x86_64::_rdtsc() }
}

fn clock_gettime() -> libc::timespec {
    let mut ts = unsafe { mem::zeroed() };
    let clk_id = libc::CLOCK_MONOTONIC;
    unsafe {
        libc::clock_gettime(clk_id, &mut ts);
    }
    ts
}

fn as_duration(ts: libc::timespec) -> Duration {
    Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32)
}
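Using it from a program then looks something like this (a hypothetical example, assuming the code above lives in a crate named profiler):

fn main() {
    profiler::init(); // pay the one-second calibration cost up front

    profiler::interval_start("work");
    do_work(); // some hypothetical workload
    profiler::interval_end("work");

    println!("work took {:?}", profiler::interval_duration("work"));
}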
What's next
All those tricky bits aside, I'm perfectly happy with this code for now, given what I need it to do -- which is just to instrument my "haversine json processor" (a separate homework assignment).
When I did that, I found that computing the haversine distance accounted for only 28% of my program's total time, whereas parsing the json took 69%:
read string:     0.02221 (  25223μs)
parse json:      0.69383 ( 787803μs)
get point pairs: 0.00000 (      0μs)
haversine:       0.28395 ( 322403μs)
all:             1.00000 (1135435μs)