
Working With Clocks in Linux


In a previous post, I mentioned that I was learning about the rdtsc instruction in x86 assembly.

I finally got around to actually using it in a program, as I was doing the homework for Casey Muratori's Performance-Aware Programming course.

I built a simple profiler based on rdtsc that allows me to measure the performance of various parts of my Rust programs. Here's how I went about doing it.

Background

Casey's listing is copied below. When compiled for Windows, it uses QueryPerformanceFrequency and QueryPerformanceCounter; when compiled for Linux, it uses a fixed frequency (1000000, i.e. 1 MHz) and gettimeofday:

/* ========================================================================

   (C) Copyright 2023 by Molly Rocket, Inc., All Rights Reserved.

   This software is provided 'as-is', without any express or implied
   warranty. In no event will the authors be held liable for any damages
   arising from the use of this software.

   Please see https://computerenhance.com for more information

   ======================================================================== */

/* ========================================================================
   LISTING 70
   ======================================================================== */

#if _WIN32

#include <intrin.h>
#include <windows.h>

static u64 GetOSTimerFreq(void)
{
	LARGE_INTEGER Freq;
	QueryPerformanceFrequency(&Freq);
	return Freq.QuadPart;
}

static u64 ReadOSTimer(void)
{
	LARGE_INTEGER Value;
	QueryPerformanceCounter(&Value);
	return Value.QuadPart;
}

#else

#include <x86intrin.h>
#include <sys/time.h>

static u64 GetOSTimerFreq(void)
{
	return 1000000;
}

static u64 ReadOSTimer(void)
{
	// NOTE(casey): The "struct" keyword is not necessary here when compiling in C++,
	// but just in case anyone is using this file from C, I include it.
	struct timeval Value;
	gettimeofday(&Value, 0);

	u64 Result = GetOSTimerFreq()*(u64)Value.tv_sec + (u64)Value.tv_usec;
	return Result;
}

#endif

/* NOTE(casey): This does not need to be "inline", it could just be "static"
   because compilers will inline it anyway. But compilers will warn about
   static functions that aren't used. So "inline" is just the simplest way
   to tell them to stop complaining about that. */
inline u64 ReadCPUTimer(void)
{
	// NOTE(casey): If you were on ARM, you would need to replace __rdtsc
	// with one of their performance counter read instructions, depending
	// on which ones are available on your platform.

	return __rdtsc();
}

Before I read or tried to understand the details of the listing, I tried to reproduce what I thought it was meant to do: get the timer frequency and read something called the "performance counter".

I read about CLOCK_MONOTONIC, clock_gettime, and clock_getres in the GNU libc manual.

Clock resolution

The clock_getres function is supposed to return the clock's actual resolution:

Function: int clock_getres (clockid_t clock, struct timespec *res)

Get the actual resolution of the clock identified by clock, storing it in *ts.

For instance, if the clock hardware for CLOCK_REALTIME uses a quartz crystal that oscillates at 32.768 kHz, then its resolution would be 30.518 microseconds, and clock_getres (CLOCK_REALTIME, &r) would set r.tv_sec to 0 and r.tv_nsec to 30518.

The return value is 0 on success and -1 on failure. The following errno error condition is defined for this function:

EINVAL

The clock identified by clock is not supported.

However, whenever I called it, I seemed to get 1 nanosecond:

use std::mem;
use std::time::Duration;

fn clock_getres() -> libc::timespec {
    let mut ts = unsafe { mem::zeroed() };
    let clk_id = libc::CLOCK_MONOTONIC;
    unsafe {
        libc::clock_getres(clk_id, &mut ts);
    }
    ts
}

fn as_duration(ts: libc::timespec) -> Duration {
    let duration = Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32);
    duration
}

fn main() {
    let res = as_duration(clock_getres());
    println!("res: {}", res.as_nanos()); // ->  "res: 1"
}

Apparently, this is expected behavior, at least according to this Stack Overflow answer:

Unfortunately, clock_getres() POSIX function ... does not report timers granularity in Linux.

It can return only two predefined results: either 1/HZ for low resolution clocks (where HZ is value of CONFIG_HZ macro used when configuring linux kernel, typical values are 100 300 1000) or 1 ns for high resolution clocks (hrtimers).

So clock_getres wasn't going to tell me the true granularity; I would just have to assume the nominal resolution of whichever API I used, either 1μs or 1ns.
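
To see the two cases the Stack Overflow answer describes, you can query a coarse, tick-based clock next to a high-resolution one. Here's a quick sketch (CLOCK_MONOTONIC_COARSE is Linux-specific, and the exact coarse value depends on the kernel's CONFIG_HZ):

use std::mem;

// Return the resolution clock_getres reports for a clock, in nanoseconds.
fn resolution_ns(clk_id: libc::clockid_t) -> i64 {
    let mut ts: libc::timespec = unsafe { mem::zeroed() };
    unsafe {
        libc::clock_getres(clk_id, &mut ts);
    }
    ts.tv_sec * 1_000_000_000 + ts.tv_nsec
}

fn main() {
    // hrtimer-backed clock: reports 1ns
    println!("CLOCK_MONOTONIC:        {}ns", resolution_ns(libc::CLOCK_MONOTONIC));
    // tick-backed clock: reports 1/HZ, e.g. 4000000ns when CONFIG_HZ=250
    println!("CLOCK_MONOTONIC_COARSE: {}ns", resolution_ns(libc::CLOCK_MONOTONIC_COARSE));
}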

clock_gettime or gettimeofday

Casey's listing uses gettimeofday, which the GNU libc manual describes as the older API for reading the current time:

The GNU C Library also provides an older, but still widely used, function [gettimeofday] for getting the current time with a resolution of microseconds.

Since clock_getres() reported a resolution of 1ns for CLOCK_MONOTONIC, I decided to use clock_gettime instead of gettimeofday.
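
For comparison, here's roughly what the two calls look like from Rust via the libc crate (a sketch only; error handling omitted):

use std::mem;

// Microsecond-resolution wall-clock time (the call Casey's Linux path uses).
fn read_gettimeofday_us() -> u64 {
    let mut tv: libc::timeval = unsafe { mem::zeroed() };
    unsafe {
        libc::gettimeofday(&mut tv, std::ptr::null_mut());
    }
    tv.tv_sec as u64 * 1_000_000 + tv.tv_usec as u64
}

// Nanosecond-resolution monotonic time (what I ended up using).
fn read_clock_gettime_ns() -> u64 {
    let mut ts: libc::timespec = unsafe { mem::zeroed() };
    unsafe {
        libc::clock_gettime(libc::CLOCK_MONOTONIC, &mut ts);
    }
    ts.tv_sec as u64 * 1_000_000_000 + ts.tv_nsec as u64
}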

Instrumentation

Finally, I wanted to make sure that it was as easy as possible to instrument my code with interval-based timing measurements.

The API I implemented was:

interval_start(&str)
interval_end(&str)
interval_duration(&str) -> std::time::Duration

In retrospect, I could simplify this further to:

type IntervalHandle = usize;
interval_start() -> IntervalHandle;
interval_end(i: IntervalHandle) -> std::time::Duration;

Perhaps I'll do that later.
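
A rough sketch of what that handle-based version might look like, reusing rdtsc() and RDTSC_TO_MICROSECONDS from the full listing below (untested; just the shape of the API):

type IntervalHandle = usize;

lazy_static! {
    // Start timestamps, indexed by handle.
    static ref INTERVAL_STARTS: Mutex<Vec<u64>> = Mutex::new(Vec::new());
}

pub fn interval_start() -> IntervalHandle {
    let mut starts = INTERVAL_STARTS.lock().unwrap();
    starts.push(rdtsc());
    starts.len() - 1
}

pub fn interval_end(i: IntervalHandle) -> Duration {
    let ticks = rdtsc() - INTERVAL_STARTS.lock().unwrap()[i];
    Duration::from_micros((ticks as f64 * *RDTSC_TO_MICROSECONDS) as u64)
}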

Either way, the important thing is that this approach makes it pretty much painless to measure any part of my program, without a profiler macro and without passing a profiler instance around the program:

    interval_start("parse json");
    let json = parser::parse(&json)?;
    interval_end("parse json");

Code

The first version of the profiler is copied below.

There are a few tricky bits in here that I'd love to remove if I could figure out how.

use std::collections::HashMap;
use std::mem;
use std::sync::Mutex;
use std::time::Duration;
#[macro_use]
extern crate lazy_static;
extern crate libc;

lazy_static! {
    static ref TIMESTAMP_START: Mutex<HashMap<String, u64>> = Mutex::new(HashMap::new());
    static ref TIMESTAMP_END: Mutex<HashMap<String, u64>> = Mutex::new(HashMap::new());
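    // Calibration: spin for roughly one second against CLOCK_MONOTONIC, then
    // work out how many microseconds each rdtsc tick represents. This treats
    // the elapsed OS time as exactly 1,000,000μs and ignores the slight
    // overshoot of the loop.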
    static ref RDTSC_TO_MICROSECONDS: f64 = {
        let rdtsc_start = rdtsc();
        let os_start = clock_gettime();
        let mut os_end = clock_gettime();
        let mut os_elapsed = as_duration(os_end) - as_duration(os_start);
        while os_elapsed < Duration::from_secs(1) {
            os_end = clock_gettime();
            os_elapsed = as_duration(os_end) - as_duration(os_start);
        }
        let rdtsc_end = rdtsc();
        let rdtsc_to_microseconds = 1000000.0 / (rdtsc_end as f64 - rdtsc_start as f64);
        rdtsc_to_microseconds
    };
}

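// Force the lazy statics to initialize up front, so the ~1 second rdtsc
// calibration happens at startup instead of inside the first measured interval.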
pub fn init() {
    let _start = TIMESTAMP_START.lock().unwrap();
    let _end = TIMESTAMP_END.lock().unwrap();
    let _rdtsc_to_microseconds = *RDTSC_TO_MICROSECONDS;
}

pub fn interval_start(s: &str) {
    TIMESTAMP_START
        .lock()
        .unwrap()
        .insert(s.to_string(), rdtsc());
}

pub fn interval_end(s: &str) {
    TIMESTAMP_END.lock().unwrap().insert(s.to_string(), rdtsc());
}

pub fn interval_duration(s: &str) -> Duration {
    let end_map = TIMESTAMP_END.lock().unwrap();
    let start_map = TIMESTAMP_START.lock().unwrap();
    let end = end_map.get(s).unwrap();
    let start = start_map.get(s).unwrap();
    Duration::from_micros(((end - start) as f64 * *RDTSC_TO_MICROSECONDS) as u64)
}

fn rdtsc() -> u64 {
    unsafe { core::arch::x86_64::_rdtsc() }
}

fn clock_gettime() -> libc::timespec {
    let mut ts = unsafe { mem::zeroed() };
    let clk_id = libc::CLOCK_MONOTONIC;
    unsafe {
        libc::clock_gettime(clk_id, &mut ts);
    }
    ts
}

fn as_duration(ts: libc::timespec) -> Duration {
    let duration = Duration::new(ts.tv_sec as u64, ts.tv_nsec as u32);
    duration
}
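
To tie it together, a hypothetical main that uses the API might look like this (assuming the code above lives in a module called profiler; the section name and the work being timed are made up):

fn main() {
    profiler::init();

    profiler::interval_start("all");
    let mut total = 0u64;
    for i in 0..10_000_000u64 {
        total = total.wrapping_add(i);
    }
    profiler::interval_end("all");

    println!("total: {}", total);
    println!("all: {}μs", profiler::interval_duration("all").as_micros());
}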

What's next

All those tricky bits aside, I'm perfectly happy with this code for now, given what I need it to do -- which is just to instrument my "haversine json processor" (a separate homework assignment).

When I did that, I found that computing the haversine distance accounted for only 28% of my program's total run time, whereas parsing the JSON took 69%:

                   read string: 0.02221 (25223μs)
                    parse json: 0.69383 (787803μs)
               get point pairs: 0.00000 (   0μs)
                     haversine: 0.28395 (322403μs)
                           all: 1.00000 (1135435μs)