waiting for a response which never
arrives. smoltcp
never times out or retries because all the times
we’re giving it are zero.
Useful background reading:
- The Linux insides chapter on Timers
- Clock sources in Linux: http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/
There are several clock sources on x86, including the Programmable Interval Timer (PIT) which is used in EuraliOS to switch processes (see section 1), the Time Stamp Counter (TSC) which was used in section 5 to time different frame allocation methods, and the High Precision Event Timer (HPET).
To get the Time Stamp Counter we’ll move the time_stamp_counter() function used in section 5 into time.rs.
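For reference, a minimal version of that function can be built on the _rdtsc intrinsic (a sketch; the function from section 5 may differ in detail):

use core::arch::x86_64::_rdtsc;

/// Read the CPU's Time Stamp Counter.
/// Note: a plain RDTSC can be reordered by out-of-order execution;
/// see the notes on RDTSCP below.
pub fn time_stamp_counter() -> u64 {
    unsafe { _rdtsc() }
}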
The CPUID flags constant_tsc, nonstop_tsc and rdtscp tell us how trustworthy the TSC is: constant_tsc means the counter runs at a fixed rate regardless of CPU frequency scaling, nonstop_tsc means it keeps counting in sleep states, and rdtscp means the RDTSCP instruction is available. RDTSCP matters because a plain RDTSC can be reordered by out-of-order execution; RDTSCP (or a serialising instruction such as CPUID) ensures earlier instructions have completed before the counter is read.
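These flags can be queried with the CPUID instruction. For example, the invariant TSC bit (which implies constant_tsc and nonstop_tsc) lives in leaf 0x80000007, EDX bit 8. A sketch using the core intrinsics (not part of the original code):

use core::arch::x86_64::__cpuid;

/// Returns true if the CPU reports an invariant TSC
/// (CPUID leaf 0x80000007, EDX bit 8). Assumes the extended
/// leaf is supported, which it is on any recent x86-64 CPU.
pub fn has_invariant_tsc() -> bool {
    let result = unsafe { __cpuid(0x8000_0007) };
    result.edx & (1 << 8) != 0
}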
The PIT runs on a 1.193182 MHz clock, which counts down and emits an interrupt every time its counter reaches zero. The count the PIT starts from (the reload value) can be changed; it defaults to 65536, giving an interrupt frequency of about 18.2 Hz.
To keep track of time we will therefore store four quantities:
- PIT ticks per interrupt
- Sum of PIT ticks
- Last TSC
- TSC per PIT tick
In general we might want to vary the PIT ticks per interrupt, for example to adjust the time allocated to different processes, or for power saving, so we should be able to account for these changes.
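Changing the interval means reprogramming the PIT's channel 0 reload value through I/O ports 0x43 and 0x40. A sketch, assuming the x86_64 crate's Port type (the function name is hypothetical):

use x86_64::instructions::port::Port;

/// Set the PIT channel 0 reload value (a count of 0 means 65536).
/// Command 0x36 selects channel 0, lobyte/hibyte access, mode 3.
pub fn set_pit_reload(count: u16) {
    let mut command: Port<u8> = Port::new(0x43);
    let mut channel0: Port<u8> = Port::new(0x40);
    unsafe {
        command.write(0x36);
        channel0.write((count & 0xFF) as u8); // low byte
        channel0.write((count >> 8) as u8);   // high byte
    }
}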
On every PIT interrupt we will a) add the current number of PIT ticks per interrupt to the cumulative sum, and b) store the current TSC and its difference from the last value. When a time is requested we can read the TSC, and use the TSC ticks since the last PIT interrupt together with the calibrated number of TSC counts per PIT tick to calculate a higher precision time than PIT interrupts alone could provide.
Some applications need high resolution timing without the overheads of system calls. The RDTSC instruction can be executed by user programs if the time stamp disable (TSD) flag in register CR4 is clear, but the user program needs access to other information to turn the counter into a time. The Linux virtual Dynamic Shared Object (vDSO) provides such a mechanism, by mapping a region of the user program’s address space to read-only data.
In a new file kernel/src/time.rs we can use an AtomicU64 to store the number of PIT clock ticks. Each time an interrupt occurs, the timer_interrupt() function in interrupts.rs will call this pit_interrupt_notify() function:
use core::sync::atomic::{AtomicU64, Ordering};
/// The Programmable Interval Timer frequency divider
const PIT_TICKS_PER_INTERRUPT: u64 = 65536;
/// Cumulative number of PIT ticks since start
static PIT_TICKS: AtomicU64 = AtomicU64::new(0);
pub fn pit_interrupt_notify() {
// Increment the number of PIT ticks
PIT_TICKS.fetch_add(PIT_TICKS_PER_INTERRUPT, Ordering::Relaxed);
...
}
In the future the number of PIT ticks per interrupt may vary, but for now it’s just hard-wired to the value it has on startup.
To calibrate the TSC we can store the value it had at the last PIT interrupt, and calculate the change in TSC:
/// Previous value of Time Stamp Counter
static LAST_TSC: AtomicU64 = AtomicU64::new(0);
pub fn pit_interrupt_notify() {
...
let new_tsc = time_stamp_counter();
let last_tsc = LAST_TSC.swap(new_tsc, Ordering::Relaxed);
println!("PIT: {:x} -> {:x} : {}", last_tsc, new_tsc, new_tsc - last_tsc);
}
Figure fig-pit shows that there is some variation in the number of TSC counts between PIT interrupts. At least some of this may be because we’re running on a virtual machine (QEMU) rather than a physical machine. That variation would translate into variations in our calibration.
We can divide the change in TSC by the number of PIT ticks to get a TSC_PER_PIT calibration. To reduce the variation we can smooth this with a moving average, combining each new calibration with the running value:
static TSC_PER_PIT: AtomicU64 = AtomicU64::new(0);
pub fn pit_interrupt_notify() {
...
let new_tsc_per_pit = (new_tsc - last_tsc) / PIT_TICKS_PER_INTERRUPT;
let ma_tsc_per_pit = (new_tsc_per_pit + TSC_PER_PIT.load(Ordering::Relaxed)) / 2;
TSC_PER_PIT.store(ma_tsc_per_pit, Ordering::Relaxed);
println!("TSC per PIT tick: {} => avg {}", new_tsc_per_pit, ma_tsc_per_pit);
}
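Putting those fragments together (with the debugging println! calls removed), the whole handler reads:

pub fn pit_interrupt_notify() {
    // Accumulate PIT clock ticks
    PIT_TICKS.fetch_add(PIT_TICKS_PER_INTERRUPT, Ordering::Relaxed);

    // Change in the Time Stamp Counter since the last interrupt
    let new_tsc = time_stamp_counter();
    let last_tsc = LAST_TSC.swap(new_tsc, Ordering::Relaxed);

    // Update the TSC-per-PIT-tick calibration with a moving average
    let new_tsc_per_pit = (new_tsc - last_tsc) / PIT_TICKS_PER_INTERRUPT;
    let ma_tsc_per_pit = (new_tsc_per_pit + TSC_PER_PIT.load(Ordering::Relaxed)) / 2;
    TSC_PER_PIT.store(ma_tsc_per_pit, Ordering::Relaxed);
}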
We get values around 2275 +/- 5. Since the PIT clock is 1.193182 MHz, that would mean a TSC frequency of about 2.7 GHz, which agrees with the CPU frequency from my laptop’s /proc/cpuinfo when it’s busy:
model name : Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
stepping : 9
microcode : 0x84
cpu MHz : 2700.000
For now we just need a function which calculates the time since the computer restarted, because smoltcp only needs time from an arbitrary starting point. If the CPU ran at a constant frequency then the TSC could be used directly, but clock frequencies can vary over time, and in a multi-core machine each core could have a different TSC. Modern CPUs solve this with a constant-rate TSC synchronised between cores (the constant_tsc and nonstop_tsc flags mentioned earlier), but older models don’t have this.
Instead we’ll use the PIT ticks to provide a monotonically increasing time, and the number of TSC ticks since the last interrupt to provide an (also monotonic) offset, calibrated using the previous PIT interval. Fetching the values stored by the pit_interrupt_notify() function:
pub fn microseconds_monotonic() -> u64 {
// Number of PIT ticks
let pit = PIT_TICKS.load(Ordering::Relaxed);
// Number of TSC ticks since last PIT interrupt
let tsc = time_stamp_counter() - LAST_TSC.load(Ordering::Relaxed);
// Number of TSC counts per PIT tick
let tsc_per_pit = TSC_PER_PIT.load(Ordering::Relaxed);
...
}
We can combine the pit and tsc counters into an effective TSC count, pit * tsc_per_pit + tsc (note that this is not the same as just taking the TSC directly, because of its rate variation). We then need to convert PIT ticks into microseconds. The PIT frequency is 1.193181666 MHz, so the interval between ticks is 0.83809534 microseconds. Rather than using floating point, which would be software emulated (the soft-float feature in x86_64-euralios.json), we can write the integer arithmetic ourselves. Integer multiplication can be slow, but division is even slower in general. Dividing by a power of 2 is a simple and fast bit shift, so we can find a ratio close to 0.83809534 with a power of two in the denominator. One candidate is 878807/(1024*1024), which has the value 0.83809566, i.e. out by about one second every 30 days. Putting this together, a simple (naive) way to calculate the number of microseconds is:

((pit * tsc_per_pit + tsc) * 878807) / (1024*1024 * tsc_per_pit)
Unfortunately this has several problems. The most serious is that it will overflow 64-bit integers after about 2 hours, which seems unacceptably short when 2^64 microseconds is over half a million years. (At the ~2.7 GHz TSC rate measured above, the numerator grows at about 2.7e9 * 878807 ≈ 2.4e15 per second, overflowing 2^64 ≈ 1.8e19 in roughly 7700 seconds.) This happens because we calculate an even larger number and then divide by (1024*1024 * tsc_per_pit), which is about 31 bits, so the top 31 of the 64 available bits are wasted in the output.
To reduce the number of bits which are thrown away there are (at least) two things we can do:
- Reduce the amount by which pit is multiplied, currently tsc_per_pit. The quantity pit * tsc_per_pit + tsc has sub-nanosecond precision, which is not needed when the output is in microseconds. We can instead use a scaled TSC rate, for example counting in units of 1/16th of a PIT tick.
- Factorise 878807 into smaller factors: 878807 = 437 * 2011 (fortunately it has smaller factors than nearby numbers). Rather than multiplying and dividing by 878807/1048576 in one step, a two-step process of multiplying by 2011/2048 and then by 437/512 needs fewer bits of headroom. (These constants are sanity-checked below.)
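As a quick sanity check of those constants (a throwaway host-side test, not part of the kernel code):

#[test]
fn pit_microsecond_constants() {
    // 878807 factorises as 437 * 2011
    assert_eq!(437u64 * 2011, 878807);
    // 878807 / 2^20 approximates the PIT period in microseconds
    let approx = 878807.0 / (1024.0 * 1024.0);
    let exact = 1.0 / 1.193181666;
    // Relative error ~4e-7, i.e. about one second every 30 days
    assert!((approx - exact).abs() / exact < 1e-6);
}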
The code is now:
const SCALED_TSC_RATE: u64 = 16;
let scaled_tsc = (tsc * SCALED_TSC_RATE) / tsc_per_pit;
((((pit * SCALED_TSC_RATE + scaled_tsc) * 2011) / 4096) * 437) / (256 * SCALED_TSC_RATE)
This still contains one (unavoidable) division by tsc_per_pit, but the other divisions are by powers of 2 and so should compile to bit shifts. The largest single divisor is now 4096, a 12-bit shift, though the multiplication by 2011 costs about 11 bits of headroom of its own: the largest intermediate value grows roughly 2^15 times faster than the microsecond output, so a u64 overflows after about 15 years of uptime, by which point the calibration error alone would add up to a few minutes. It’s not perfect, but that’s probably good enough for most applications.
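Putting the pieces together, microseconds_monotonic() in full might look like this (assembled from the snippets above; note that it divides by tsc_per_pit, so it assumes the calibration has run at least once):

pub fn microseconds_monotonic() -> u64 {
    // Cumulative PIT clock ticks
    let pit = PIT_TICKS.load(Ordering::Relaxed);
    // TSC ticks since the last PIT interrupt
    let tsc = time_stamp_counter() - LAST_TSC.load(Ordering::Relaxed);
    // Calibration: TSC counts per PIT tick
    let tsc_per_pit = TSC_PER_PIT.load(Ordering::Relaxed);

    // Work in units of 1/16th of a PIT tick
    const SCALED_TSC_RATE: u64 = 16;
    let scaled_tsc = (tsc * SCALED_TSC_RATE) / tsc_per_pit;
    // Multiply by 878807/2^20 in two steps (2011/2^11, then 437/2^9),
    // folding the final /16 into the last division to undo the scaling
    ((((pit * SCALED_TSC_RATE + scaled_tsc) * 2011) / 4096) * 437)
        / (256 * SCALED_TSC_RATE)
}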
We now have a kernel function to calculate microseconds, but need to make it available to user programs. We could add a syscall, or a kernel thread to listen on a Rendezvous, but it might be fun to try a different way and at the same time avoid the overhead of a syscall.
We can now try this out with a user program, timing_test:
#![no_std]
#![no_main]
use euralios_std::{debug_println,
syscalls::{self, STDIN},
time};
#[no_mangle]
fn main() {
loop {
let _ = syscalls::receive(&STDIN);
debug_println!("[timing_test] TSC: {} microseconds: {}",
time::time_stamp_counter(),
time::microseconds_monotonic());
syscalls::thread_yield();
}
}
which should print the TSC and time in microseconds every time it receives a message, for example a key press. A result is shown in figure fig-timer-test:
In most cases there are two TIME: lines for every [timing_test] line. The TIME: lines are printed from the kernel’s interrupts.rs after the character message is sent and before process::schedule_next() is called. So a key is pressed, the message is sent, a TIME: line is printed, the context is switched to the timing_test program, which prints a [timing_test] line, and then the key is released and another TIME: line is printed. It looks like there’s about a 3-4 ms delay for printing the TIME: line, switching context, and receiving the message. Printing the [timing_test] line isn’t included because that happens after the time is calculated. That seems like a long time, and at some point I’d like to understand what’s causing the delay: is it the printing or the context switch?