Arduino – AVR GCC multiplication

March 18, 2009 by mekonik

While working on my Arduino magnet levitation (details here), I stumbled upon some problems with the way the compiler, AVR GCC, handles multiplication. I had to implement optimized multibyte multiplication routines because the Arduino in my project performs some digital signal processing with a 20 kHz sampling frequency. Since I couldn’t find those routines online, I posted the code on GitHub at https://github.com/rekka/avrmultiplication.

The Arduino uses an ATmega168 or a similar 8-bit RISC processor. The most serious limitation of these chips is that their instruction set contains mostly instructions working with 8-bit arguments. This limitation becomes more pronounced when one wants to multiply multibyte numbers. When multiplying two k-byte numbers into a 2k-byte result, one has to perform k^2 multiplications and on the order of k^3 additions.

The AVR architecture offers 3 multiplication instructions, mul, muls and mulsu, see the AVR instruction set. All of them take 2 registers as arguments, perform 8bit x 8bit -> 16bit multiplication and store the result in the R1:R0 register pair. mul assumes that both arguments are unsigned, muls assumes that both are signed, and mulsu assumes that the first argument is signed and the second one unsigned.

These distinctions are necessary as negative numbers are stored in two’s complement form. For 8-bit numbers that means that a negative number -x is stored as 2^8 - x. When multiplying 8-bit numbers we get

-x * y = (2^8 - x) * y = 2^8 * y - x * y
-x * -y = (2^8 - x) * (2^8 - y) = 2^16 - 2^8 * x - 2^8 * y + x * y.

The instructions muls and mulsu thus have to remove the extra terms 2^8 * x and 2^8 * y when appropriate (2^16 drops out on its own because only 16 bits of the result are stored). We don’t have to worry about the internal workings of these instructions unless we want to perform multibyte multiplication.
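The effect of the extra terms can be checked on a desktop compiler. The following is a minimal sketch, not code from the library: the function names are made up for illustration, and each function only models what the corresponding instruction leaves in R1:R0 for two given register bytes.

```c
#include <stdint.h>

/* Hypothetical host-side models of the AVR multiply instructions.
   Arguments are raw register bytes; the return value is the 16-bit
   content of R1:R0 after the instruction. */

/* Value of a byte when it is read as a two's complement number. */
int16_t as_signed(uint8_t b) {
    return (b & 0x80) ? (int16_t)(b - 256) : (int16_t)b;
}

/* mul: both bytes are read as unsigned. */
uint16_t model_mul(uint8_t a, uint8_t b) {
    return (uint16_t)((uint16_t)a * b);
}

/* muls: both bytes are read as signed. */
uint16_t model_muls(uint8_t a, uint8_t b) {
    return (uint16_t)(as_signed(a) * as_signed(b));
}

/* mulsu: the first byte is read as signed, the second as unsigned. */
uint16_t model_mulsu(uint8_t a, uint8_t b) {
    return (uint16_t)(as_signed(a) * (int16_t)b);
}
```

With x = 10 stored as 0xF6 and y = 10, model_mul(0xF6, 10) gives 2460, while model_muls(0xF6, 10) and model_mulsu(0xF6, 10) both give 0xFF9C, which is -100 in two's complement. The difference between the two is exactly 2^8 * y = 2560.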

16 bit x 16 bit -> 32 bit multiplication

Suppose that we have two 16 bit numbers x and y. We want to compute x * y. This operation yields a 32 bit result. To implement this operation using 8bit instructions, we have to split x and y into the high byte and low byte, x1, x0 and y1, y0. Then we have

x = 256 * x1 + x0
y = 256 * y1 + y0

First suppose that x and y are unsigned numbers. Multiplication then yields

x * y = 65536 * x1 * y1 + 256 * x1 * y0 + 256 * x0 * y1 + x0 * y0

Thus this can be implemented using 4 mul instructions and some additions with appropriate shifts. When x and y are signed numbers in two’s complement notation, the result becomes more complicated:

-x * y  = (2^16 - 2^8 * x1 - x0) * (2^8 * y1 + y0)
        = -x0*y0 - x1*y0*2^8 - x0*y1*2^8 + y0*2^16 - x1*y1*2^16 + y1*2^24
-x * -y = (2^16 - 2^8 * x1 - x0) * (2^16 - 2^8 * y1 - y0)
        = x0*y0 + x1*y0*2^8 + x0*y1*2^8 - x0*2^16 - y0*2^16
          + x1*y1*2^16 - x1*2^24 - y1*2^24 + 2^32

Figure: signed 16bit x 16bit multiplication.

As you can see, there are extra terms that have to be removed. 2^32 is removed automatically as we store only 32 bits of the result. The rest can be removed by a proper combination of mul, muls and mulsu instructions. There are 4 multiplications and their signature turns out to be quite simple, see the figure on the right. The arrows signify a single 8-bit multiplication, the red byte is to be treated as unsigned, the blue one as signed. The results then have to be added together with appropriate shifts. For instance, the result of multiplication of x1 and y0 must be shifted by 1 + 0 = 1 byte to the left.
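The scheme described above can be sketched in portable C for testing on a desktop machine. This is only an illustration under the assumption that the high bytes x1, y1 are read as signed and the low bytes x0, y0 as unsigned; the function name is hypothetical and does not come from the library.

```c
#include <stdint.h>

/* Hypothetical host-side sketch of signed 16bit x 16bit -> 32bit
   multiplication built from four byte products. */
int32_t mul_s16x16_bytes(int16_t x, int16_t y) {
    /* Low bytes, always in 0..255 (treated as unsigned). */
    uint8_t x0 = (uint8_t)((uint16_t)x & 0xFF);
    uint8_t y0 = (uint8_t)((uint16_t)y & 0xFF);
    /* High bytes, in -128..127 (treated as signed).
       x - x0 is an exact multiple of 256, so the division is exact. */
    int32_t x1 = ((int32_t)x - x0) / 256;
    int32_t y1 = ((int32_t)y - y0) / 256;

    int32_t p00 = (int32_t)x0 * y0; /* unsigned x unsigned, like mul   */
    int32_t p10 = x1 * y0;          /* signed x unsigned, like mulsu   */
    int32_t p01 = y1 * x0;          /* signed x unsigned, like mulsu   */
    int32_t p11 = x1 * y1;          /* signed x signed, like muls      */

    /* Shift each product by (index sum) bytes and add everything up. */
    return p11 * 65536 + (p10 + p01) * 256 + p00;
}
```

For example, x = -300 splits into x1 = -2 and x0 = 212, y = 300 into y1 = 1 and y0 = 44, and the four products add up to -90000.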

Figure: signed 16bit x unsigned 16bit multiplication.

This scheme can also be extended to, for example, signed 32bit x 32bit -> 64bit. In this case, we can simply treat x1, x0, y1 and y0 as the 16bit words of the 32bit operands. In the case of signed 16bit x unsigned 16bit multiplication, one simply changes the blue color of the unsigned operand to red, see the figure for the signed x unsigned case.

You can find more on unsigned multiplication in AVR assembler here.
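For comparison, here is a minimal sketch of the unsigned case in portable C, following the order of operations in the MultiU16X16to32 macro listed at the end of this post. The function name is hypothetical, and the code is meant for checking the arithmetic on a desktop machine, not for use on the AVR.

```c
#include <stdint.h>

/* Hypothetical host-side sketch of unsigned 16bit x 16bit -> 32bit
   multiplication built from four 8bit x 8bit -> 16bit products. */
uint32_t mul_u16x16_bytes(uint16_t x, uint16_t y) {
    uint8_t x0 = (uint8_t)(x & 0xFF), x1 = (uint8_t)(x >> 8);
    uint8_t y0 = (uint8_t)(y & 0xFF), y1 = (uint8_t)(y >> 8);

    uint16_t p00 = (uint16_t)((uint16_t)x0 * y0);
    uint16_t p01 = (uint16_t)((uint16_t)x0 * y1);
    uint16_t p10 = (uint16_t)((uint16_t)x1 * y0);
    uint16_t p11 = (uint16_t)((uint16_t)x1 * y1);

    /* p00 occupies bytes 0-1 and p11 bytes 2-3, so they do not overlap
       and can be placed without any addition (the two movw). */
    uint32_t r = (uint32_t)p00 | ((uint32_t)p11 << 16);

    /* The two mixed products are shifted by one byte and added in;
       carries propagate into the upper bytes (add, adc, adc). */
    r += (uint32_t)p01 << 8;
    r += (uint32_t)p10 << 8;
    return r;
}
```

The result never overflows 32 bits, since the largest product, 0xFFFF * 0xFFFF, is 0xFFFE0001.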

Rounding

Right shifts are often necessary when one is using fixed-point integer math. They are essentially divisions by powers of 2. For example, when we want to multiply x by a number y between 0 and 1, we first multiply y by a power of 2, preferably 2^8 or 2^16 depending on the desired precision, and round it to an integer. Then we multiply x and y and divide the result by the same power of 2. This is done as a right shift. As with ordinary division, rounding the result gives better precision. It is quite clear that 0.9 is better rounded to 1 than to 0. Unfortunately, a regular right shift rounds everything down to the closest integer.

Figure: rounding.

There is a simple solution to this. We can test the most significant bit (MSB) of the part that is shifted out of the result. If this bit is set, we simply add 1 to the result. If it is cleared, we don’t have to do anything. This adds more accuracy to multiplications, especially when the result is an operand for another multiplication.
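The rounding trick can be illustrated with a short sketch in portable C. It assumes an unsigned x and a fraction y scaled by 2^16; the function names and the parameter y_q16 are made up for this example.

```c
#include <stdint.h>

/* y_q16 stands for the fraction y_q16 / 65536, a value in [0, 1). */

/* Plain right shift: the fractional part is simply dropped. */
uint16_t scale_truncate(uint16_t x, uint16_t y_q16) {
    uint32_t p = (uint32_t)x * y_q16;
    return (uint16_t)(p >> 16);
}

/* Right shift with rounding: bit 15 is the MSB of the part that is
   shifted out; if it is set, the dropped part is at least one half. */
uint16_t scale_round(uint16_t x, uint16_t y_q16) {
    uint32_t p = (uint32_t)x * y_q16;
    uint16_t hi = (uint16_t)(p >> 16);
    if (p & 0x8000u) {
        hi++; /* cannot overflow: hi is at most 0xFFFE here */
    }
    return hi;
}
```

Taking x = 3 and y = 0.3, which becomes y_q16 = 19661, the exact product is about 0.9. scale_truncate(3, 19661) returns 0, while scale_round(3, 19661) returns 1.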

AVR GCC issues

AVR GCC handling of multibyte multiplication

Let’s have a look at how AVR GCC handles multiplication. First suppose that we want to multiply two signed 8-bit numbers and get a 16-bit result. We would write something like:

char a = -10;
char b = 10;
int x;

void setup() {
  x = a * b;
}

This produces the following code:

  x = a * b;
  c2:	80 91 01 01 	lds	r24, 0x0101 // load a
  c6:	20 91 00 01 	lds	r18, 0x0100 // load b
  ca:	82 02       	muls	r24, r18
  cc:	c0 01       	movw	r24, r0
  ce:	11 24       	eor	r1, r1
  d0:	90 93 09 01 	sts	0x0109, r25 // store x high byte
  d4:	80 93 08 01 	sts	0x0108, r24 // store x low byte

As you can see, the compiler produces the correct 16-bit result, using only one muls instruction. No need for typecasting. The instruction eor r1, r1 clears the register R1, which is supposed to be 0 by AVR GCC convention. But notice that the movw instruction is unnecessary.

Now let’s see what happens when we want to multiply two 16-bit signed numbers and get a 32-bit result. When we write:

int a = -10;
int b = 10;
long x;

void setup() {
  x = a * b;
}

we get the following code:

x = a * b;
  c2:	20 91 02 01 	lds	r18, 0x0102
...
  d2:	ac 01       	movw	r20, r24
  d4:	24 9f       	mul	r18, r20
  d6:	c0 01       	movw	r24, r0
  d8:	25 9f       	mul	r18, r21
  da:	90 0d       	add	r25, r0
  dc:	34 9f       	mul	r19, r20
  de:	90 0d       	add	r25, r0
  e0:	11 24       	eor	r1, r1
  e2:	aa 27       	eor	r26, r26
  e4:	97 fd       	sbrc	r25, 7
  e6:	a0 95       	com	r26
  e8:	ba 2f       	mov	r27, r26
  ea:	80 93 0a 01 	sts	0x010A, r24
...

As you can see, it performs only 3 multiplications, doing 16bit x 16bit -> 16bit and then extending the result to 32 bits. That’s not what we want since we lose the 2 highest bytes of the product. We have to typecast the ints into longs, writing

 x = (long) a * b;

But this produces:

  x = (long) a * b;
  c2:	60 91 02 01 	lds	r22, 0x0102
  c6:	70 91 03 01 	lds	r23, 0x0103
  ca:	88 27       	eor	r24, r24     // extension to 32 bit
  cc:	77 fd       	sbrc	r23, 7
  ce:	80 95       	com	r24
  d0:	98 2f       	mov	r25, r24
  d2:	20 91 00 01 	lds	r18, 0x0100
  d6:	30 91 01 01 	lds	r19, 0x0101
  da:	44 27       	eor	r20, r20     // extension to 32 bit
  dc:	37 fd       	sbrc	r19, 7
  de:	40 95       	com	r20
  e0:	54 2f       	mov	r21, r20
  e2:	0e 94 fd 01 	call	0x3fa	; 0x3fa <__mulsi3>
  e6:	60 93 0a 01 	sts	0x010A, r22
...

In this case, the compiler extends the operands to 32 bits first and then calls a 32bit x 32bit -> 32bit multiplication routine. This is very wasteful, as the routine performs 10 multiplications instead of the necessary 4, plus other overhead required for full long multiplication. The whole multiplication (together with the memory access) takes 72 cycles instead of the optimized 38 cycles. That makes a difference of more than 2 μs on a single multiplication at a 16 MHz clock. The difference is even bigger when no memory access is necessary, for example when multiplying local variables. Then it is 56 versus 22 cycles, which saves 2 μs out of 3.5 μs.
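The difference between the two C expressions can be reproduced on a desktop machine, where int is usually 32 bits wide, by modelling the 16-bit int explicitly. The sketch below only mimics what the listings above show (a 16-bit product followed by sign extension); the function names are hypothetical.

```c
#include <stdint.h>

/* Models `long x = a * b;` as compiled above for 16-bit ints:
   only the low 16 bits of the product survive, then they are
   sign-extended to 32 bits. */
int32_t product_16bit_then_extend(int16_t a, int16_t b) {
    uint32_t full = (uint32_t)((int32_t)a * b);
    uint16_t low = (uint16_t)(full & 0xFFFFu);
    return (low & 0x8000u) ? (int32_t)low - 65536 : (int32_t)low;
}

/* Models `long x = (long) a * b;`: the operands are widened first,
   so all 32 bits of the product are kept. */
int32_t product_widened(int16_t a, int16_t b) {
    return (int32_t)a * b;
}
```

For small operands such as -10 and 10 both versions agree, which is why the problem is easy to miss. With a = 300 and b = 300, the widened product is 90000, while the 16-bit version yields 24464.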

AVR GCC handling of a multiplication by a constant

The way AVR GCC handles multibyte multiplication follows from C itself, which doesn’t allow for exact specification of the operand and result sizes. As such it is not a bug; it’s a feature that we have to be aware of. But there is a bug in AVR GCC that causes the compiler to produce suboptimal code for multiplications by a constant, see the post at the AVR Freaks forum. The problem is that multiplication by a power of 2 is implemented as shifts even when that is slower than an actual multiplication. Shifts are sometimes slower because AVR offers only 1-bit shifts of 8-bit operands. For instance, the simple code

int a = -10;
int x;

void setup() {
  x = a * 64;
}

gets compiled as

  ca:	36 e0       	ldi	r19, 0x06	; 6
  cc:	88 0f       	add	r24, r24
  ce:	99 1f       	adc	r25, r25
  d0:	3a 95       	dec	r19
  d2:	e1 f7       	brne	.-8

This takes more than 30 cycles, while writing this as a multiplication would take 8. That’s a huge difference.
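The post does not show the multiplication-based instruction sequence, so the following is only a guess at its structure, written as portable C for checking the arithmetic. It assumes that the 16-bit result is assembled from two byte multiplications by 64; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical sketch: 16-bit a * 64 (result truncated to 16 bits)
   from two 8bit x 8bit multiplications instead of six 16-bit shifts.
   The same bit pattern is correct for signed and unsigned a. */
uint16_t times64_by_mul(uint16_t a) {
    uint8_t a0 = (uint8_t)(a & 0xFF);
    uint8_t a1 = (uint8_t)(a >> 8);

    /* Low byte times 64: both result bytes are needed. */
    uint16_t lo = (uint16_t)((uint16_t)a0 * 64);
    /* High byte times 64: only the low byte of this product lands
       inside the 16-bit result, the rest is shifted out. */
    uint8_t hi = (uint8_t)(((uint16_t)a1 * 64) & 0xFF);

    return (uint16_t)(lo + ((uint16_t)hi << 8));
}
```

For a = -10, stored as 0xFFF6, the function returns 0xFD80, which is -640 in two's complement.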

Code

The header files ready to include in your sketch can be downloaded here:

https://github.com/rekka/avrmultiplication

The functions are implemented as macros. This means that you have to call them in a slightly different manner than regular C functions: the result variable is passed as the first argument. For example, a signed 16bit x 16bit -> 32bit multiplication is performed by:

int x = 12;
int y = -32;
long result32;
MultiS16X16to32(result32, x, y);

Also, this means that the whole multiplication code is included at every place that you use a macro. This can be undesirable if you are using one macro many times. Of course, you can write your own stub function, for example:

long FuncMultiS16X16to32(int x, int y) {
     long result32;
     MultiS16X16to32(result32, x, y);
     return result32;
}

The naming of the macros is simple. A name starts with Multi, followed by U, SU or S, depending on the signedness of the arguments, and 16X16, 32X16 or 16X8, depending on the size of the arguments. It ends with to16, to32, toH16, toL16 or toH32, with or without rounding (Round), indicating what part of the result is stored (whole, L for low or H for high).

The library is not complete, it contains only the methods that I needed so far. But it contains all 16×16 methods I hope. I’m gonna try to expand it as soon as I need another version of the methods. Also, I can include additional versions if there is interest. Let me know.

Here is the code of the 16X16 macros:

// longRes = intIn1 * intIn2
#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"mul %B1, %A2 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" \
) 

// intRes = intIn1 * intIn2 >> 16
// uses:
// r26 to store 0
// r27 to store the byte 1 of the 32bit result
#define MultiU16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"mov r27, r1 \n\t" \
"mul %B1, %B2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"mul %B1, %A2 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (intRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" , "r27" \
) 

// intRes = (intIn1 * intIn2 >> 16), rounded using bit 15 of the product
// uses:
// r26 to store 0
// r27 to store the byte 1 of the 32bit result
// 21 cycles
#define MultiU16X16toH16Round(intRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"mov r27, r1 \n\t" \
"mul %B1, %B2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"mul %B1, %A2 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"lsl r27 \n\t" \
"adc %A0, r26 \n\t" \
"adc %B0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (intRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" , "r27" \
) 

// signed16 * signed16
// 22 cycles
#define MultiS16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"muls %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mulsu %B2, %A1 \n\t" \
"sbc %D0, r26 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbc %D0, r26 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" \
)

// signed16 * signed 16 >> 16
#define MultiS16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"mov r27, r1 \n\t" \
"muls %B1, %B2 \n\t" \
"movw %A0, r0 \n\t" \
"mulsu %B2, %A1 \n\t" \
"sbc %B0, r26 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbc %B0, r26 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (intRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26", "r27" \
)

// multiplies a signed and unsigned 16 bit ints with a 32 bit result
#define MultiSU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"movw %A0, r0 \n\t" \
"mulsu %B1, %B2 \n\t" \
"movw %C0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbc %D0, r26 \n\t" \
"add %B0, r0 \n\t" \
"adc %C0, r1 \n\t" \
"adc %D0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (longRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26" \
)

// multiplies signed x unsigned int and returns the highest 16 bits of the result
#define MultiSU16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"mov r27, r1 \n\t" \
"mulsu %B1, %B2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %B2, %A1 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbc %B0, r26 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (intRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26", "r27" \
)

// multiplies signed x unsigned int and returns the highest 16 bits of the result
// rounds the result based on the MSB of the lower 16 bits
// 22 cycles
#define MultiSU16X16toH16Round(intRes, intIn1, intIn2) \
asm volatile ( \
"clr r26 \n\t" \
"mul %A1, %A2 \n\t" \
"mov r27, r1 \n\t" \
"mulsu %B1, %B2 \n\t" \
"movw %A0, r0 \n\t" \
"mul %A1, %B2 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"mulsu %B1, %A2 \n\t" \
"sbc %B0, r26 \n\t" \
"add r27, r0 \n\t" \
"adc %A0, r1 \n\t" \
"adc %B0, r26 \n\t" \
"lsl r27 \n\t" \
"adc %A0, r26 \n\t" \
"adc %B0, r26 \n\t" \
"clr r1 \n\t" \
: \
"=&r" (intRes) \
: \
"a" (intIn1), \
"a" (intIn2) \
: \
"r26", "r27" \
)


4 Responses

on October 13, 2009 at 7:41 am Kenny Root

This would be a good chance for you to fix GCC to use optimized routines in those cases!
on February 3, 2011 at 12:47 pm Greg

Excellent, thanks for the info! I was counting clock cycles based on assembler instructions to get a ballpark figure for program performance then was quite disappointed to see the results in C++.
- on February 22, 2011 at 11:21 am mekonik
  
  Hey Greg, glad you found it useful. Norbert
on August 25, 2011 at 7:23 am Luis

Hi! I’ve been trying to use your macros in my AVR C program, but the compiler generates the following errors:

error: can’t find a register in class ‘SIMPLE_LD_REGS’ while reloading ‘asm’
error: ‘asm’ operand has impossible constraints

Nothing fancy with my program (it’s not an Arduino/Wiring sketch, just plain AVR C), my avr-libc is version 1.6.8-2 and compiling with avr-gcc version 4.3.5-1 for an ATmega644. Can you think of the cause for this issue? Thanks for the great post!
