Original Post

It is well known that emulator timings differ widely from actual hardware. There were also numerous open questions about the load/store performance of the different address ranges (WRAM, ROM, VIP, VSU).
I wrote (and attach) a test binary that runs $10000 loop iterations over various instructions and instruction pairs.
From the results I composed this quick overview graph. Timings are given in units of 20us, measured with the hardware timer (the 20us timer mode was broken in mednafen until very recently).
In the snippet below, r10 was preloaded with the different addresses:
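call init_timer
movw $10000, r11 ; a macro for movhi and movea
_loop0: ; loop target, timed body starts here
in.b [r10],r12 ; byte read from the address under test
add -1, r11 ; decrement the loop counter
bne _loop0 ; branch directly after add (flag hazard)
call plot_debug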

A few quick insights:
– reading from VIP memory is significantly slower, and ROM reads are still slower than WRAM reads
– a single full-word store can be faster than a byte or half-word store (with CACHE disabled) because of the required extra fetches.
– we also see the flag hazard after add -1 if we branch right after that instruction.

This is just a first check and more detailed ones will follow here.

  • This topic was modified 2 years, 5 months ago by enthusi.
4 Replies

Read access to ROM costs one extra cycle compared to WRAM (or 2 if WCR is clear).
Read access to the framebuffer costs an additional 4 cycles on top of that, i.e. 5 cycles more than WRAM.
runs = 0x10000
timer = 20e-6
wram = 0x7b9
rom = 0x85c
framebuffer = 0xaeb
mhz = 20e6

(((rom - wram) * timer) * mhz / runs) ≈ 1 cycle per run for ROM vs WRAM
(((framebuffer - wram) * timer) * mhz / runs) ≈ 5 cycles per run for framebuffer vs WRAM
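
For anyone who wants to redo this arithmetic with other readings, here is a minimal Python sketch of the same conversion. The tick counts, the 20us timer unit, the 20 MHz clock and the run count are the values quoted above; the names `cycles_per_run` and `extra_cycles` are just made up for this example.

```python
TICK_SECONDS = 20e-6   # hardware timer unit used in the test (20 us)
CPU_HZ = 20e6          # CPU clock assumed in the calculation above (20 MHz)
RUNS = 0x10000         # loop iterations per test

# Timer readings quoted above, in 20 us ticks.
ticks = {"wram": 0x7B9, "rom": 0x85C, "framebuffer": 0xAEB}


def cycles_per_run(tick_count, runs=RUNS):
    """Convert a timer tick count into CPU cycles per loop iteration."""
    return tick_count * TICK_SECONDS * CPU_HZ / runs


def extra_cycles(region, baseline="wram"):
    """Extra cycles per iteration of `region` relative to `baseline`."""
    return cycles_per_run(ticks[region] - ticks[baseline])


for region in ("rom", "framebuffer"):
    print(f"{region}: {extra_cycles(region):.2f} extra cycles per run vs WRAM")
```

Both results land just below a whole number (slightly under 1 and slightly under 5), which is what you would expect when the timer only resolves 20us steps.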

These are some good numbers!

IIRC, the v810 has a 5-stage pipeline with no branch delay slot. Depending on v810 internals, that may add a cycle per loop iteration for every branch that's taken, because the next insn after the branch may have already been fetched and then has to be discarded. This may happen even if the v810 has branch prediction.

If you have not done so already, is there a program you can write to test the cost of a taken branch and account for it in your test program loops?
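
One possible way to get at that number, sketched in Python: time the same loop twice with the body repeated a different number of times per branch (say 1x and 4x unrolled), then solve for the per-body cost and the fixed per-iteration overhead. This is only a sketch under a linear cost model (no cache or alignment effects), and the overhead it reports is the whole add/bne tail including any flag hazard, not the taken branch alone. The function names and the unrolling scheme are assumptions for illustration, not part of the attached test binary.

```python
CYCLES_PER_TICK = 400  # one 20 us timer tick at a 20 MHz CPU clock


def cycles_per_iteration(ticks, runs=0x10000):
    """Convert a measured tick count into cycles per loop iteration."""
    return ticks * CYCLES_PER_TICK / runs


def split_loop_cost(ticks_a, unroll_a, ticks_b, unroll_b, runs=0x10000):
    """Return (body_cycles, overhead_cycles) from two timed loops.

    Assumes cost per iteration = unroll * body + overhead, where
    overhead covers the counter update and the taken branch.
    """
    if unroll_a == unroll_b:
        raise ValueError("the two loops need different unroll factors")
    cost_a = cycles_per_iteration(ticks_a, runs)
    cost_b = cycles_per_iteration(ticks_b, runs)
    body = (cost_b - cost_a) / (unroll_b - unroll_a)
    overhead = cost_a - unroll_a * body
    return body, overhead
```

Calling `split_loop_cost(t1, 1, t4, 4)` with the tick counts of a 1x and a 4x unrolled loop would then give the two components, and the overhead could be subtracted from the other loop timings.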

Yes this is good info, thanks enthusi.

Since this data shows that reads from RAM are always faster than reads from ROM even when both are set to use 1 wait state, I would like to suggest one additional set of tests alongside your “Cache OFF” and “Cache ON” categories: executing from RAM.

I would expect the time it takes to execute from RAM to fall between the timings for the “Cache OFF” and “Cache ON” states, since opcodes must still be fetched during execution but are fetched more quickly from RAM than from ROM. This could be useful for routines that need a performance boost but are too large to fit inside the insn cache, sort of like using RAM as a “level 2” insn cache.

Yeah, good points, both of you. Executing from RAM without cache should be notably faster than from ROM, but then again, given what we have learned about the cache: it should almost always be on in any time-critical part 🙂
I will happily run more dedicated tests with this framework – at some point 🙂

  • This reply was modified 2 years, 5 months ago by enthusi.

