Original Post

My PCM mixer runs fine on its own but slows execution down to a crawl when paired with music and rendering, so I’m looking to optimize it wherever I can, including dropping down and rewriting parts of it in assembly where practical. When examining the output of building with both -Os and -O3 in gccVB 4, though, I noticed the following peculiar pattern when accessing variables kept in WRAM:

	movhi hi(_masterMusVolume),r0,r10
	ld.b lo(_masterMusVolume)[r10],r14
	movhi hi(_noiseVolume),r0,r27
	movhi hi(_musDataStart),r0,r10
	movhi hi(_freeVSUChannelCur),r0,r25
	movhi hi(_noiseVelocity),r0,r26
	movhi hi(_noiseLeft),r0,r29
	movhi hi(_noiseRight),r0,r31
	ld.b lo(_noiseVolume)[r27],r11
	ld.w lo(_musDataStart)[r10],r18
	movhi hi(_vbTranspose),r0,r10
	ld.w lo(_vbTranspose)[r10],r10
	ld.b lo(_freeVSUChannelCur)[r25],r17
	ld.b lo(_noiseVelocity)[r26],r23
	ld.b lo(_noiseLeft)[r29],r22
	ld.b lo(_noiseRight)[r31],r12

Since WRAM on the VB is located at 0x05000000 and therefore aligned on a 64KB boundary, wouldn’t it be more economical to, say, movhi hi(_WRAMStart),r0,r10 just once and then ld lo(_variable)[r10] subsequently for each WRAM access? Why doesn’t the code do this or something similar?
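One detail that affects the single-base idea: the displacement in ld/st is a signed 16-bit value, so hi() has to round the upper half up whenever bit 15 of the address is set. The sketch below models that split in plain C. It assumes gccVB's hi()/lo() follow the usual adjusted-high/signed-low convention and that WRAM spans a full 64KB; hi_part, lo_part and reachable are hypothetical helper names for illustration, not part of any toolchain.

```c
#include <stdint.h>

/* Upper half as loaded by movhi: rounded up when bit 15 of the address
   is set, to cancel the sign extension of the 16-bit displacement.
   (Assumed to match what the assembler's hi() produces.) */
static uint16_t hi_part(uint32_t addr)
{
    return (uint16_t)((addr + 0x8000u) >> 16);
}

/* Lower half interpreted as a signed 16-bit displacement. */
static int32_t lo_part(uint32_t addr)
{
    uint32_t low = addr & 0xFFFFu;
    return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}

/* Non-zero if addr can be reached from base with a single signed
   16-bit displacement, i.e. without reloading the base register. */
static int reachable(uint32_t base, uint32_t addr)
{
    int64_t diff = (int64_t)addr - (int64_t)base;
    return diff >= -32768 && diff <= 32767;
}
```

Under those assumptions, a base of 0x05000000 only reaches the lower 32KB of WRAM, while a base in the middle (0x05008000) reaches all of it, so a shared base register would probably want to point at the midpoint rather than at the start.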

I could rewrite this particular routine in assembly (it runs over 8000 times a second via the timer interrupt, so it needs to be as fast as possible) but this kind of code is generated all over the place whenever WRAM is read or written, so that to me just seems like putting a band-aid over a larger problem. Is this a bug in gccVB or is there a way to coax the compiler into generating more efficient code here?

3 Replies

Thanks M.K., that should work for WRAM accesses.

Upon closer inspection I see that this pattern is also applied to other areas of memory. I found a simple example using hardware registers:

movhi 0x200, r0, r10
movea 0x20, r10, r10
ld.b [r10], r11
mov 5, r12
andi 0xFF, r11, r11
ori 0x10, r11, r11
st.b r11, [r10]
movhi 0x200, r0, r11
movea 0x18, r11, r11
st.b r12, [r11]
movhi 0x200, r0, r11
movea 0x1C, r11, r11
st.b r0, [r11]

This is the assembly generated with -Os for the following C:

HW_REGS[TCR] |= TIMER_20US;
HW_REGS[TLR] = 0x05;
HW_REGS[THR] = 0x00;

The instruction ‘movhi 0x200, r0, ...’ is executed three times (once into r10 and twice into r11) even though all three registers share the same upper half and differ only by a small offset, so a single base register with displacements 0x18, 0x1C and 0x20 would have been enough. Is this something that can be worked around (without writing it by hand in asm) or a bug in GCC/v810?
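One thing that might be worth trying is to route the three accesses through a single pointer, so the compiler has one base value to keep in a register. This is only a sketch: the offsets and the flag value are read off the assembly above, timer_setup is a hypothetical name, and I have not verified that gccVB actually emits a single movhi for it (if the call gets inlined, the constants may well be folded back into separate movhi/movea pairs).

```c
#include <stdint.h>

/* Byte offsets and flag value taken from the assembly listing above. */
#define TLR        0x18
#define THR        0x1C
#define TCR        0x20
#define TIMER_20US 0x10

/* Same three statements as the HW_REGS version, but expressed as
   offsets from one base pointer passed in by the caller. */
static void timer_setup(volatile uint8_t *hw)
{
    hw[TCR] |= TIMER_20US;
    hw[TLR] = 0x05;
    hw[THR] = 0x00;
}

/* On hardware the call would presumably look like:
       timer_setup((volatile uint8_t *)0x02000000);            */
```

Taking the base as a parameter also makes the routine testable on a host machine against an ordinary byte array.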

Took a look tonight at the gcc 4.4.2 patch that’s floating out there, and I think I might have an idea of what’s causing this: in output_move_single…

	return "movhi hi(%1),%.,%0\n\tmovea lo(%1),%0,%0";

That line occurs several times, once for each case where a 32-bit quantity needs to be loaded, and it always emits the two instructions as a fused pair. Because the pair is output as a single string, the compiler never sees the movhi as a separate instruction and has no chance to optimize away the redundant one. It looks to me like either a bug or a case that simply isn’t optimized by design. I’m leaning toward the former, as it’s clearly suboptimal code. Anybody with knowledge of GCC have any ideas how to fix it?

 
