Timing

smoltcp is left waiting for a response which never arrives: it never times out or retries, because all the times we’re giving it are zero.

Some useful references:

  • Linux clock_gettime(2)
  • Linux timers
  • Linux insides chapter on Timers
  • http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/

There are several clock sources on x86, including the Programmable Interrupt Timer (PIT) which is used in EuraliOS to switch processes (See section 1), the Time Stamp Counter which was used in section 5 to time different frame allocation methods, and the High Precision Event Timer.

The Time Stamp Counter

To get the Time Stamp Counter we’ll move the time_stamp_counter() function used in section 5 into time.rs.
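That function isn’t reproduced here, but a minimal version is just a wrapper around the RDTSC intrinsic in core::arch (a sketch; the function moved from section 5 may differ in detail):

use core::arch::x86_64::_rdtsc;

/// Read the CPU's Time Stamp Counter
/// (sketch of the function moved from section 5)
pub fn time_stamp_counter() -> u64 {
    // Safety: RDTSC only reads the counter and has no other side effects
    unsafe { _rdtsc() }
}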

The CPUID feature flags constant_tsc, nonstop_tsc and rdtscp (visible in /proc/cpuinfo on Linux) indicate whether the TSC ticks at a constant rate regardless of frequency scaling, whether it keeps counting in deep sleep states, and whether the RDTSCP instruction is available.

Because of out-of-order execution a plain RDTSC can be reordered relative to the surrounding instructions; RDTSCP waits for earlier instructions to complete, and RDTSC is often paired with a serialising CPUID for the same reason.
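Where the rdtscp flag is available, a variant of the same function could use the __rdtscp intrinsic instead (again only a sketch, not code from this section):

use core::arch::x86_64::__rdtscp;

/// Read the TSC with RDTSCP, which waits for earlier instructions to finish
/// (sketch only)
pub fn time_stamp_counter_serialised() -> u64 {
    // RDTSCP also returns the IA32_TSC_AUX MSR (the core number on Linux),
    // which we ignore here
    let mut aux: u32 = 0;
    unsafe { __rdtscp(&mut aux as *mut u32) }
}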

Calibrating with the PIT

The PIT works on a 1.193182 MHz clock, which counts down and emits an interrupt every time its counter reaches zero. The count the PIT starts from can be changed, and defaults to 65536 giving an interrupt frequency of about 18 Hz.

To calibrate the TSC against the PIT we will keep track of:

  • PIT ticks per interrupt
  • Sum of PIT ticks
  • Last TSC
  • TSC per PIT tick

In general we might want to vary the PIT ticks per interrupt, for example to adjust the time allocated to different processes, or for power saving, so we should be able to account for these changes.
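Changing the PIT ticks per interrupt means writing a new 16-bit reload value to the PIT. EuraliOS doesn’t do this yet, but as a sketch of what it would involve (using the x86_64 crate’s port types; the port numbers and mode byte are the standard PIT channel 0 programming, not existing EuraliOS code):

use x86_64::instructions::port::Port;

/// Sketch: set the PIT channel 0 reload value (0 means the full 65536)
fn set_pit_reload(count: u16) {
    let mut command: Port<u8> = Port::new(0x43); // PIT mode/command register
    let mut data: Port<u8> = Port::new(0x40);    // Channel 0 data port
    unsafe {
        // Channel 0, access mode lobyte/hibyte, operating mode 2 (rate generator)
        command.write(0x34);
        data.write((count & 0xff) as u8); // low byte first
        data.write((count >> 8) as u8);   // then high byte
    }
    // Anything like this would also have to update the ticks-per-interrupt
    // value used by time.rs so the calibration stays correct
}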

On every PIT interrupt we will (a) add the current number of PIT ticks per interrupt to the cumulative sum, and (b) store the difference between the current and last TSC, and the current TSC itself. When a time is requested we can read the TSC, and use the TSC ticks since the last PIT interrupt together with the number of TSC ticks per PIT tick to calculate a higher precision time than PIT interrupts alone could provide.

Some applications need high resolution timing without the overhead of system calls. The RDTSC instruction can be executed by user programs if the time stamp disable (TSD) flag in register CR4 is clear, but the user program still needs access to other information to turn a TSC value into a time. The Linux virtual Dynamic Shared Object (vDSO) provides such a mechanism, by mapping a region of the user program’s address space to read-only data.
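For the TSD flag, the x86_64 crate provides access to CR4. The following is a sketch (assuming the crate’s Cr4Flags::TIMESTAMP_DISABLE flag name), not code from EuraliOS, showing how a kernel could make sure user programs are allowed to execute RDTSC:

use x86_64::registers::control::{Cr4, Cr4Flags};

/// Sketch: allow ring 3 code to execute RDTSC by clearing CR4.TSD
fn allow_user_rdtsc() {
    let flags = Cr4::read();
    if flags.contains(Cr4Flags::TIMESTAMP_DISABLE) {
        // Only clears the TSD bit, leaving the rest of CR4 unchanged
        unsafe { Cr4::write(flags - Cr4Flags::TIMESTAMP_DISABLE) };
    }
}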

In a new file kernel/src/time.rs we can use an AtomicU64 to store the number of PIT clock ticks. Each time an interrupt occurs, the timer_interrupt() function in interrupts.rs will call this pit_interrupt_notify() function:

use core::sync::atomic::{AtomicU64, Ordering};

/// The Programmable Interrupt Timer frequency divider
const PIT_TICKS_PER_INTERRUPT: u64 = 65536;

/// Cumulative number of PIT ticks since start
static PIT_TICKS: AtomicU64 = AtomicU64::new(0);

pub fn pit_interrupt_notify() { 
  // Increment the number of PIT ticks
  PIT_TICKS.fetch_add(PIT_TICKS_PER_INTERRUPT, Ordering::Relaxed);
  ...
}

In future the number of PIT ticks per interrupt may vary, but for now it’s just hard-wired to the value it has on startup.

To calibrate the TSC we can store the value it had at the last PIT interrupt, and calculate the change in TSC:

/// Previous value of Time Stamp Counter
static LAST_TSC: AtomicU64 = AtomicU64::new(0);

pub fn pit_interrupt_notify() { 
    ...
    let new_tsc = time_stamp_counter();
    let last_tsc = LAST_TSC.swap(new_tsc, Ordering::Relaxed);

    println!("PIT: {:x} -> {:x} : {}", last_tsc, new_tsc, new_tsc - last_tsc);
}

Figure fig-pit shows that there is some variation in the number of TSC counts between PIT interrupts. At least some of this may be because we’re running on a virtual machine (QEMU) rather than a physical machine. That variation would translate into variations in our calibration.

./img/19-01-PIT.png

We can divide the change in TSC by the number of PIT ticks to get a TSC_PER_PIT calibration. We can use a moving average to reduce the variation by averaging the new calibration with the previous:

static TSC_PER_PIT: AtomicU64 = AtomicU64::new(0);

pub fn pit_interrupt_notify() {
    ...
    let new_tsc_per_pit = (new_tsc - last_tsc) / PIT_TICKS_PER_INTERRUPT;
    let ma_tsc_per_pit = (new_tsc_per_pit + TSC_PER_PIT.load(Ordering::Relaxed)) / 2;
    TSC_PER_PIT.store(ma_tsc_per_pit, Ordering::Relaxed);
    println!("TSC per PIT tick: {} => avg {}", new_tsc_per_pit, ma_tsc_per_pit);
}

./img/19-02-TSC-per-PIT.png

We get values around 2275 +/- 5. Since the PIT clock is 1.193182 MHz, that would mean a TSC frequency of about 2275 × 1.193182 MHz ≈ 2714 MHz, or roughly 2.7 GHz. That agrees with the CPU frequency from my laptop’s /proc/cpuinfo when it’s busy:

model name	: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
stepping	: 9
microcode	: 0x84
cpu MHz		: 2700.000

Calculating time

For now we just need a function which calculates the time since the computer restarted, because smoltcp only needs time from an arbitrary starting point. If the CPU ran at a constant frequency then the TSC might be used directly, but clock frequencies can vary over time, and in a multi-core machine each core could have a different TSC. Modern CPUs solve this by having a constant TSC synchronised between cores, but older models do not.
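Whether the TSC is invariant (constant rate, still ticking in sleep states) can be checked with CPUID leaf 0x80000007, EDX bit 8. A sketch (the function name is made up, and a fuller version would first read leaf 0x80000000 to confirm that this extended leaf exists):

use core::arch::x86_64::__cpuid;

/// Sketch: returns true if the CPU reports an invariant TSC
pub fn has_invariant_tsc() -> bool {
    // CPUID.80000007H:EDX[8] is the "Invariant TSC" bit
    let result = unsafe { __cpuid(0x8000_0007) };
    result.edx & (1 << 8) != 0
}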

Instead we’ll try using the PIT ticks to provide a monotonically increasing time, and the number of TSC ticks since the last interrupt to provide an (also monotonic) time offset, calibrated using the previous PIT interval. Fetching the values stored by the pit_interrupt_notify() function:

pub fn microseconds_monotonic() -> u64 {
    // Number of PIT ticks
    let pit = PIT_TICKS.load(Ordering::Relaxed);
    // Number of TSC ticks since last PIT interrupt
    let tsc = time_stamp_counter() - LAST_TSC.load(Ordering::Relaxed);
    // Number of TSC counts per PIT tick
    let tsc_per_pit = TSC_PER_PIT.load(Ordering::Relaxed);
    ...
}

We can combine the pit and tsc counters into an effective TSC count: pit * tsc_per_pit + tsc (note that this is not the same as just reading the TSC directly, because of its rate variation). We then need to convert PIT ticks into microseconds.

The PIT frequency is 1.193181666 MHz, so the interval between ticks is 0.83809534 microseconds. Rather than using floating point, which would be software emulated (the soft-float feature in x86_64-euralios.json), we can write the integer arithmetic ourselves. Integer multiplication can be slow, but division is even slower in general. Dividing by a power of 2 is a simple and fast bit shift, so we can look for a ratio close to 0.83809534 with a power of two in the denominator. One candidate is 878807/(1024*1024): since 0.83809534 × 2^20 ≈ 878806.8, rounding gives 878807, and the resulting ratio of 0.83809566 is out by about one second every 30 days. Putting this together, a simple (naive) way to calculate the number of microseconds is:

((pit * tsc_per_pit + tsc) * 878807) / (1024*1024 * tsc_per_pit)

Unfortunately this has several problems. The most serious is that it will overflow 64-bit integers after about 2 hours, which seems unacceptably short when 2^64 microseconds is over half a million years. This occurs because we calculate an even larger number and then divide by (1024*1024 * tsc_per_pit) which is about 31 bits, so the top 31 of the 64 available bits are not used in the output.
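As a rough check, with a ~2.7 GHz TSC the intermediate value grows at about 2.7×10^9 × 878807 ≈ 2.4×10^15 per second, and 2^64 / 2.4×10^15 ≈ 7800 seconds, a little over two hours.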

To reduce the number of bits which are thrown away there are (at least) two things we can do:

  1. Reduce the amount by which pit is multiplied, which is currently tsc_per_pit. That quantity pit * tsc_per_pit + tsc has sub-nanosecond precision, which is not needed when the output is in microseconds. We can instead choose a scaled TSC rate per PIT tick, for example counting in units of 1/16th of a PIT tick.
  2. We can factorize 878807 into smaller factors e.g. 878807 = 437 * 2011. Fortunately it seems to have smaller factors than nearby numbers. Rather than multiplying/dividing by 878807/1048576, a two-step process of multiplying by 2011/2048 and then 437/512 requires fewer bits.

The code is now:

const scaled_tsc_rate: u64 = 16;
let scaled_tsc = (tsc * scaled_tsc_rate) / tsc_per_pit;
((((pit * scaled_tsc_rate + scaled_tsc) * 2011) / 4096) * 437) / (256 * scaled_tsc_rate)

This still contains one (unavoidable) division by tsc_per_pit, but the other two divisions are by powers of 2 and so should compile to bit shifts. The limiting intermediate value is now the product with 2011: (pit * scaled_tsc_rate + scaled_tsc) grows at about 16 × 1.19 million per second, so after multiplying by 2011 it grows at roughly 3.8×10^10 per second and overflows a 64-bit integer after about 15 years, by which point the 878807/2^20 approximation will have drifted by a few minutes. It’s not perfect, but that’s probably good enough for most applications.
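Putting the pieces together, the rest of microseconds_monotonic() might look something like this (a sketch assembled from the snippets above, sitting in time.rs next to the statics defined earlier; the real code may differ in detail):

pub fn microseconds_monotonic() -> u64 {
    // Cumulative PIT ticks at the last interrupt
    let pit = PIT_TICKS.load(Ordering::Relaxed);
    // TSC ticks elapsed since the last PIT interrupt
    let tsc = time_stamp_counter() - LAST_TSC.load(Ordering::Relaxed);
    // Calibration: TSC counts per PIT tick (assumed non-zero, i.e. at least
    // one PIT interrupt has already run the calibration)
    let tsc_per_pit = TSC_PER_PIT.load(Ordering::Relaxed);

    // Work in units of 1/16th of a PIT tick to limit wasted bits
    const scaled_tsc_rate: u64 = 16;
    let scaled_tsc = (tsc * scaled_tsc_rate) / tsc_per_pit;

    // Convert to microseconds: multiply by 878807/2^20 in two steps
    // (2011 then 437), keeping the divisions as power-of-two shifts;
    // the scaled_tsc_rate in the final divisor cancels the scaling above
    ((((pit * scaled_tsc_rate + scaled_tsc) * 2011) / 4096) * 437)
        / (256 * scaled_tsc_rate)
}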

Making time functions available to user programs

We now have a kernel function to calculate microseconds, but need to make it available to user programs. We could add a syscall, or a kernel thread listening on a Rendezvous, but it might be fun to try a different way and at the same time avoid the overhead of a syscall.

We can now try this out with a user program, timing_test:

#![no_std]
#![no_main]

use euralios_std::{debug_println,
                   syscalls::{self, STDIN},
                   time};

#[no_mangle]
fn main() {
    loop {
        let _ = syscalls::receive(&STDIN);
        debug_println!("[timing_test] TSC: {} microseconds: {}",
                       time::time_stamp_counter(),
                       time::microseconds_monotonic());
        syscalls::thread_yield();
    }
}

which should print the TSC and time in microseconds every time it receives a message, for example a key press. A result is shown in figure fig-timer-test:

./img/19-03-timer-test.png

In most cases there are two TIME: lines for every [timing_test] line. The TIME: lines are printed from the kernel’s interrupts.rs after the character message is sent and before process::schedule_next() is called. A key is pressed and the message sent, a TIME: line is printed, the context is switched to the timing_test program, which prints a [timing_test] line; then the key is released and another TIME: line is printed. It looks like there’s a delay of about 3 or 4 ms for printing the TIME: line, switching context, and receiving the message. Printing the [timing_test] line isn’t included, because that happens after the time is calculated. That seems like a long time, and at some point I’d like to understand what’s causing it: is it the printing or the context switch?