Original Post

In my quest to know everything, I set my eyes on the Nintendo instructions, and their undocumented cycle counts. We know through the development manual that the MPYHW instruction takes 9 cycles, but what about the others?

What I did was write a routine in assembly that enables the instruction cache and the hardware timer with a 20-microsecond tick interval, then runs a short loop with three consecutive instances of a given instruction. At the end, I stop the timer and see how many timer ticks it took. I then subtract from that number the tick count of another run that didn’t have any instructions in the loop.

I used the ADD instruction as a baseline, since it takes 1 cycle. Using MUL and DIV to verify it was working, I got the correct cycle counts:

Instr   Ticks   Ratio       Cycles
ADD     1966    1.0         1 cycle
MUL     26214   13.33367    13 cycles
DIV     75366   38.33469    38 cycles
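
As a rough sketch of the arithmetic behind the table above: the ratio column is each instruction's net tick count divided by ADD's, and the cycle column is that ratio with the fraction dropped. The function names here are made up for illustration and are not taken from the test program.

```c
#include <stdint.h>

/* Net timer ticks for the 1-cycle baseline (ADD), from the table above. */
#define ADD_TICKS 1966

/* Ratio of an instruction's net ticks to ADD's net ticks. */
double ratio_to_add(int32_t net_ticks) {
    return (double) net_ticks / ADD_TICKS;
}

/* Whole cycle count: the ratio with the fractional part dropped. */
int32_t whole_cycles(int32_t net_ticks) {
    return net_ticks / ADD_TICKS;   /* integer division truncates */
}
```

With the figures above, `whole_cycles(26214)` gives 13 for MUL and `whole_cycles(75366)` gives 38 for DIV.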

Logically, I did the same thing on all the Nintendo instructions to determine their cycle counts. The counts may surprise you.

Please: If you’re a coder, run your own tests to get some more sample data on these. I don’t doubt my methods, but I’d like some validation from other sources.

My findings are as follows:

Instr   Ticks   Ratio       Cycles
MPYHW   18350   9.33367     9 cycles
CLI     23593   12.00051    12 cycles
SEI     23593   12.00051    12 cycles
XB      12452   6.33367     6 cycles
XH      2622    1.33367     1 cycle
REV     43909   22.33418    22 cycles

MPYHW takes 9 cycles as expected.

CLI and SEI both take 12 cycles for some unimaginable reason. Curious is the fact that my program didn’t log an extra third of a cycle like it did for all the other instructions I tested. Someone else’s testing would be appreciated to get a clearer view on this count.

XB looks like it takes 6 cycles. Fair enough. But what caught my eye is that XH only takes 1 cycle. I’d have expected them to be pretty close to each other.

The REV instruction expectedly takes a while to complete. In my tests, it clocked in (pun intended) at 22 cycles.

7 Replies

As long as I’m at it…

The development manual indicates that using any register other than r0 for reg1 in the XB and XH instructions may cause problems, but regardless of which registers I specified or the values in those registers, the instructions performed correctly.

MPYHW, on the other hand, is giving me some mysterious results when bits 16-31 don’t sign-extend bit 15 (just like the developer’s manual says). I’m gonna have to put together a test program just for that instruction to figure out exactly what it’s doing.

MPYHW definitely performs a faithful multiplication, but exactly how it behaves when extending off the left side of the register is something I’m gonna have to investigate further in the morning.

In the meantime, have a ROM! All controls are done with the left D-Pad. Up and down change the value of the current digit, and left and right change the digit.

Hmm… some of those results are a bit surprising. What’s the point of having custom CPU instructions if they’re not much faster than what you could do with a few instructions in software? Sure, they’re a little more convenient and more compact, but that hardly seems worth a CPU customization.

Guy Perfect wrote:
Curious is the fact that my program didn’t log an extra third of a cycle like it did for all the other instructions I tested. Someone else’s testing would be appreciated to get a clearer view on this count.

Guy Perfect wrote:

ADD     1966    1.0         1 cycle

What about ADD? How about running it on a larger chunk of regular V810 instructions to see if it really is an anomaly, or just that some do and some don’t (you might make a connection between them)?

And just out of curiosity… why 3 instructions in a row? Do you get the same results with just 1? How about 10?


Regarding CLI and SEI, do we even know how long the stock V810 methods take? Specifically:

movea 0xEFFF, $0, $10
stsr $PSW, $11
and $10, $11
ldsr $11, $PSW

stsr $PSW, $10
ori 0x1000, $10, $10
ldsr $10, $PSW

I can’t find the duration of LDSR and STSR in the V810 manual.

HorvatM wrote:
Regarding CLI and SEI, do we even know how long the stock V810 methods take?

That’s a good question… I don’t see that listed in the manual. One thing I did notice is that CLI and SEI have the same opcode as EI and DI on the V830 (which is based on the V810 architecture, though not necessarily implemented the same). In the V830 case, they claim to take 4 cycles each. For comparison, LDSR and STSR take 5 cycles on the V830.


I put together a program to answer this once and for all. It tests all register-based instructions with a simple assembly loop:

# s32 CycleTest(s32 arg1, s32 arg2, s32 num);

    # r2 = 0x02000000, base address for hardware control ports
    # r6 = arg1
    # r7 = arg2
    # r8 = num, also used as the loop iterator

    # Configure the hardware timer
    MOVHI 0x0200, r0, r2
    MOV   -1, r1
    ST.B  r1, 0x0018[r2] # Count/reload low  = 0xFF
    ST.B  r1, 0x001C[r2] # Count/reload high = 0xFF

    # Enable and clear the instruction cache
    MOVEA 0x0803, r0, r1
    LDSR  r1, CHCW

    # Enable the timer with 20-microsecond ticks
    MOVEA 0x0011, r0, r1
    ST.B  r1, 0x0020[r2]

    # Execute the instruction 4 times for the given number of iterations
    .Lcycle_loop:
        MOV r6, r9
        MOV r7, r10

        # This comment is located 32 bytes into the function.

        # When the function is not modified, nothing happens in this loop

        # The following bytes are meant to be overwritten in RAM
        BR .Lcycle_end; NOP; NOP; NOP; # Written by 16- and 32-bit instructions
        BR .Lcycle_end; NOP; NOP; NOP; # Written by 32-bit instructions
        BR .Lcycle_end                 # Always present for consistency

        # End-of-loop code for 32-bit instructions (3 16-bit instructions)
        .Lcycle_end:
        ADD -1, r8
    BNZ .Lcycle_loop

    # End-of-loop label

    # Disable the timer and instruction cache
    ST.B r0, 0x0020[r2]
    LDSR r0, CHCW

    # Retrieve and return the number of timer ticks taken
    IN.B 0x0018[r2], r6   # Timer count low
    IN.B 0x001C[r2], r7   # Timer count high
    SHL  8, r7            # r7 = r7 << 8 | r6;
    OR   r6, r7
    MOV  -1, r10          # r10 = -1 - r7 & 0xFFFF;
    SUB  r7, r10
    ANDI 0xFFFF, r10, r10

    JMP [r31]


This function gets copied into RAM at run-time. Those NOPs are dummy bytes that the program replaces with the instruction under test. There are two sets of NOPs to accommodate both 16- and 32-bit instructions. The BR instruction that follows is always present, so the loop overhead is the same number of cycles on every run and only the instructions under test change the total.

The C code that drives this looks like this:

// Gets the number of timer ticks for a loop of 4 instances of an instruction
s32 GetCount(const INST *inst, s32 num) {
    s32 len = (SIZE_CYCLETEST + 3) / 4; // Function size in 32-bit words
    u32 arg1 = 0, arg2 = 0;
    s32 x, y, offset = 32;              // Byte offset of the patch area
    u32 func[len];                      // Word-aligned, holds len words
    u16 bits[2];

    // Copy the function into memory (assumes memcpy32 counts in words)
    memcpy32(func, &CycleTest, len);

    // If we're not overwriting with an instruction, ignore this all
    if (inst != NULL) {

        // Encode the instruction into data bits and get its size
        len = FORMATS[inst->format](inst, bits);

        // Copy the instruction into the function buffer 4 times
        for (x = 0; x < 4; x++) for (y = 0; y < len; y++) {
            *(u16 *)((u8 *) func + offset) = bits[y];
            offset += 2;
        }

        // Grab the instruction's pre-defined operands
        arg1 = inst->val1;
        arg2 = inst->val2;
    }

    // Call the function from the buffer
    return ((s32 (*)(u32, u32, s32)) func)(arg1, arg2, num);
}

My main function calls this function 5 times for each instruction (predefined in a const table at the top of the program) and averages the counts. It then subtracts the count of a null call (no instruction overwritten) from that average, then divides by the count for ADD, which is known to be 1 cycle.
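
The post-processing described above can be sketched roughly as follows. This is not the code from the ROM: the function names are hypothetical, and rounding to the nearest cycle is an assumption, chosen because MPYHW's 0x2CCC ticks against ADD's 0x051E in the hardware output works out to about 8.75 yet is reported as 9 cycles.

```c
#include <stdint.h>

/* Average of n non-negative tick counts, rounded to the nearest integer. */
int32_t average_ticks(const int32_t *runs, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += runs[i];
    return (sum + n / 2) / n;
}

/*
 * Cycle estimate for one instruction:
 *   (average of its runs - average of the null runs) / ADD's net ticks,
 * rounded to the nearest whole cycle (the rounding mode is an assumption).
 */
int32_t estimate_cycles(const int32_t *runs, int n,
                        int32_t null_avg, int32_t add_net) {
    int32_t net = average_ticks(runs, n) - null_avg;
    return (net + add_net / 2) / add_net;
}
```

For example, with a made-up null average of 271 ticks and five raw runs averaging 16000 ticks, the net count is 15729 (0x3D71), and dividing by ADD's 1310 (0x051E) gives 12 cycles.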

The output on the hardware looks like this:

ADD (Immediate)    051E =  1 cycle
ADD (Register)     051E =  1 cycle
ADDF.S             6F5C = 22 cycles
ADDI               051F =  1 cycle
AND                051E =  1 cycle
ANDI               051E =  1 cycle
CLI                3D71 = 12 cycles
CMP (Immediate)    06ED =  1 cycle
CMP (Register)     051E =  1 cycle
CMPF.S             228F =  7 cycles
CVT.SW             4666 = 14 cycles
CVT.WS             27AE =  8 cycles
DIV                C148 = 38 cycles
DIVF.S             DFFF = 44 cycles
DIVU               B70A = 36 cycles
LDSR               28F6 =  8 cycles
MOV (Immediate)    051F =  1 cycle
MOV (Register)     051E =  1 cycle
MOVEA              051F =  1 cycle
MOVHI              051E =  1 cycle
MPYHW              2CCC =  9 cycles
MUL                4148 = 13 cycles
MULF.S             83D7 = 26 cycles
MULU               4147 = 13 cycles
NOT                051E =  1 cycle
OR                 051E =  1 cycle
ORI                051F =  1 cycle
REV                6F5C = 22 cycles
SAR (Immediate)    051E =  1 cycle
SAR (Register)     051E =  1 cycle
SEI                3D70 = 12 cycles
SETF               051E =  1 cycle
SHL (Immediate)    051E =  1 cycle
SHL (Register)     051E =  1 cycle
SHR (Immediate)    051F =  1 cycle
SHR (Register)     051E =  1 cycle
STSR               28F5 =  8 cycles
SUB                051E =  1 cycle
SUBF.S             83D7 = 26 cycles
TRNC.SW            4147 = 13 cycles
XB                 1D70 =  6 cycles
XH                 051F =  1 cycle
XOR                051E =  1 cycle
XORI               051E =  1 cycle

All instructions with documented cycle counts have the correct count, so that's a relief. The floating-point instructions fall within their given range. The undocumented cycle counts? Well, that's why I made this program.

LDSR and STSR are 8 cycles each. I was expecting 1 cycle. This is how we learn things, though. Suddenly the CLI and SEI instructions being 12 cycles don't sound so bad.
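
To put a number on that, here is a back-of-the-envelope tally for the two stock V810 sequences posted earlier in the thread, using the measured counts from the table. It assumes the per-instruction counts simply add up with no pipeline overlap, which these tests did not verify; the names are made up for illustration.

```c
/* Measured cycle counts from the table above. */
enum { CYC_SIMPLE = 1, CYC_LDSR = 8, CYC_STSR = 8 };

/* movea + stsr + and + ldsr: the sequence that clears PSW bit 12. */
int stock_clear_cycles(void) {
    return CYC_SIMPLE + CYC_STSR + CYC_SIMPLE + CYC_LDSR;
}

/* stsr + ori + ldsr: the sequence that sets PSW bit 12. */
int stock_set_cycles(void) {
    return CYC_STSR + CYC_SIMPLE + CYC_LDSR;
}
```

Under that assumption the stock sequences come to 18 and 17 cycles, against 12 for the dedicated instruction.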

MPYHW is 9 cycles as seen before. Likewise for XB at 6 cycles, XH at 1 cycle and REV at 22 cycles.

A ROM of this program is attached to this post. After the test is finished, up and down on the left D-Pad scroll the list of instructions.

I figured out the operation of MPYHW.

* reg1 is treated as a 17-bit integer, sign-extended to 32 bits in size.
* reg2 is treated as a 32-bit, signed integer.
* Multiplication happens normally, storing the result in reg2.
* r30 is not affected as it is in MUL and MULU.


// On an unsigned variable
reg2 *= (reg1 & 0x0001FFFF) | ((reg1 & 0x00010000) ? 0xFFFE0000 : 0);

// On a signed variable
reg2 *= reg1 << 15 >> 15;
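
For anyone who wants to check this against hardware or an emulator, here is a minimal, self-contained C sketch of the behavior described above. The function names are hypothetical, and the sketch models only the register result, not flags or timing.

```c
#include <stdint.h>

/* Sign-extend the low 17 bits of a 32-bit register value. */
int32_t sign_extend_17(uint32_t reg1) {
    int32_t v = (int32_t) (reg1 & 0x0001FFFF);
    if (v & 0x00010000)     /* bit 16 is the sign bit */
        v -= 0x00020000;
    return v;
}

/* Emulate MPYHW as described above; returns the new value of reg2. */
int32_t mpyhw(int32_t reg2, uint32_t reg1) {
    /* Multiply as unsigned so overflow wraps instead of being undefined. */
    uint32_t product = (uint32_t) reg2 * (uint32_t) sign_extend_17(reg1);
    /* Converting back to signed relies on two's complement wrapping. */
    return (int32_t) product;
}
```

Bits 17-31 of reg1 are ignored entirely, so `mpyhw(2, 0xFFFE0001)` multiplies by 1, while `mpyhw(3, 0x0001FFFF)` multiplies by -1.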

