While working on my Arduino magnet levitation (details here), I stumped upon some problems with the way the compiler, AVR GCC, handles multiplication. I had to implement optimized multibyte multiplication routines because Arduino in my project preforms some digital signal processing with 20kHz sampling frequency. Since I couldn’t find those routines online, ~~I’m posting the code of some of them here~~ I posted the code on GitHub at https://github.com/rekka/avrmultiplication.

Arduino uses ATmega168 or similar 8-bit RISC processor. The most serious limitation of these chips is that their instruction set contains mostly instructions working with 8-bit arguments. This limitation becomes more pronounced when one wants to multiply multibyte numbers. When multiplying two *k*-byte numbers into a 2*k*-byte result, one has to perform *k*^2 multiplications and in the order of *k*^3 additions.

AVR architecture offers 3 multiplication instructions, **mul**, **muls** and **mulsu**, see the AVR instruction set. All of them take 2 registers as arguments, perform 8bit x 8bit -> 16bit multiplication and store the result in R1:R0 register pair. **mul** assumes that both arguments are unsigned, **muls** assumes signed arguments and **mulsu** assumes signed and unsigned argument.

These distinctions are neccessary as negative numbers are stored in two’s complement form. For 8-bit numbers that means that negative number *-x* is stored as 2^8 – *x*. When multiplying 8-bit numbers we get

- -x * y = (2^8 – x) * y = 2^8 * y – x * y
- -x * -y = (2^8 – x) * (2^8 -y) = 2^16 – 2^8 * x – 2^8 * y + x* y.

The instruction **muls** and **mulsu** thus have to remove the terms 2^8 * x and 2^8 * y when appropriate. We don’t have to worry about the internal workings of these instruction unless we want to perform multibyte multiplication.

### 16 bit x 16 bit -> 32 bit multiplication

Suppose that we have two 16 bit numbers *x* and *y*. We want to compute *x *** y*. This operation yields a 32 bit result. To implement this operation using 8bit instructions, we have to split *x* and *y* into the high byte and low byte, *x*1, *x*0 and *y*1, *y*0. Then we have

x = 256 * x1 + x0 y = 256 * y1 + y0

First suppose that *x* and *y* are unsigned numbers. Multiplication then yields

x * y = 65536 * x1 * y1 + 256 * x1 * y0 + 256 * x0 * y1 + x0 * y0

Thus this can be implemented using 4 **mul** instructions and some additions with appropriate shifts. When *x* and *y* are signed numbers in two’s complement notation, the result becomes more complicated:

-x * y = (2^16 - 2^8 * x1 - x0) * (2^8 * y1 + y0) = -x0*y0 - x1*y0*2^8 - x0*y1*2^8 + y0*2^16 - x1*y1*2^16 + y1*2^24 -x * -y = (2^16 - 2^8 * x1 - x0) * (2^16 - 2^8 * y1 - y0) = x0*y0 + x1*y0*2^8 + x0*y1*2^8 - x0*2^16 - y0*2^16 + - x1*y1*2^16 - x1*2^24 - y1*2^24 + 2^32

As you can see, there are extra terms that have to be removed. 2^32 is removed automatically as we store only 32 bits of the result. The rest can be removed by a proper combination of mul, muls and mulsu instructions. There are 4 multiplications and their signature turns out to be quite simple, see the figure on the right. The arrows signify a single 8-bit multiplication, the red byte is to be treated as unsigned, the blue one as signed. The results then have to be added together with appropriate shifts. For instance, the result of multiplication of x1 and y0 must be shifted by 1 + 0 = 1 byte to the left.

This scheme can be also extended to, for example, signed 32bit x 32bit -> 64bit. In this case, we can simply treat x1, x0, y1 and y0 as the 16bit words of the 32bit operands. In the case of signed 16bit x unsigned 16bit multiplication, one simply changes blue color of the unsigned operand to red, see the figure below.

You can find more on unsigned multiplication in AVR assembler here.

### Rounding

Right shifts are often necessary when one is using fixed-point integer math. They are essentially divisions by powers of 2. For example, when we want to multiply *x* by number *y* between 0 and 1, we first multiply *y* by a power of 2, preferably 2^8 or 2^16 depending on the desired precision, so that *y* is an integer. Then we multiply *x* and *y* and divide the result by the same power of 2. This is done as a right shift. And as with the ordinary division, rounding of the result gives a better precision. It is quite clear that 0.9 is better to round to 1 and not to 0. Unfortunately, regular right shift rounds everything down to the closest integer.

There is a simple solution to this. We can test the most significant bit (MSB) of the part that is shifted out of the result. If this bit is set, we simply add 1 to the result. If it is cleared, we don’t have to anything. This adds more accuracy to multiplications, especially when the result is an operand for another multiplication.

## AVR GCC issues

### AVR GCC handling of multibyte multiplication

Let’s have a look at how AVR GCC handles multiplication. First suppose that we want to multiply two signed 8-bit numbers and get a 16-bit result. We would write something like:

char a = -10; char b = 10; int x; void setup() { x = a * b; }

This produces the following code:

x = a * b; c2: 80 91 01 01 lds r24, 0x0101 // load a c6: 20 91 00 01 lds r18, 0x0100 // load b ca: 82 02 muls r24, r18 cc: c0 01 movw r24, r0 ce: 11 24 eor r1, r1 d0: 90 93 09 01 sts 0x0109, r25 // store x high byte d4: 80 93 08 01 sts 0x0108, r24 // store x low byte

As you can see, the compiler produces the corect 16-bit result, using only one **muls** instruction. No need for typecasting. The instruction **eor r1, r1** clears the register R1, that is supposed to be 0 by AVR GCC convention. But you can notice that the **movw** instruction is unnecessary.

Now let’s see what happens when we want to multiply two 16-bit signed numbers and get a 32-bit result. When we write:

int a = -10; int b = 10; long x; void setup() { x = a * b; }

we get the following code:

x = a * b; c2: 20 91 02 01 lds r18, 0x0102 ... d2: ac 01 movw r20, r24 d4: 24 9f mul r18, r20 d6: c0 01 movw r24, r0 d8: 25 9f mul r18, r21 da: 90 0d add r25, r0 dc: 34 9f mul r19, r20 de: 90 0d add r25, r0 e0: 11 24 eor r1, r1 e2: aa 27 eor r26, r26 e4: 97 fd sbrc r25, 7 e6: a0 95 com r26 e8: ba 2f mov r27, r26 ea: 80 93 0a 01 sts 0x010A, r24 ...

As you can see, it performs only 3 multiplications, doing 16bit x 16bit -> 16bit and then extending the result to 32 bits. That’s not what we want since we lose the 2 highes bytes of the multiplication. We have to typecast ints into longs, writing

x = (long) a * b;

But this produces:

x = (long) a * b; c2: 60 91 02 01 lds r22, 0x0102 c6: 70 91 03 01 lds r23, 0x0103 ca: 88 27 eor r24, r24 // extension to 32 bit cc: 77 fd sbrc r23, 7 ce: 80 95 com r24 d0: 98 2f mov r25, r24 d2: 20 91 00 01 lds r18, 0x0100 d6: 30 91 01 01 lds r19, 0x0101 da: 44 27 eor r20, r20 // extension to 32 bit dc: 37 fd sbrc r19, 7 de: 40 95 com r20 e0: 54 2f mov r21, r20 e2: 0e 94 fd 01 call 0x3fa ; 0x3fa <__mulsi3> e6: 60 93 0a 01 sts 0x010A, r22 ...

In this case, the compiler extends the operants to 32bit first, and then calls a 32bit x 32bit -> 32bit multiplication routine. But this is very wastful as the routine performs 10 multiplications instead of the necessary 4 and other overhead that is required for full long multiplication. The whole multiplication (together with the memory access) takes **72** cycles, instead of the optimized **38** cycles. That makes a difference of more than **2μs** on a single multiplication instruction. It is even a bigger difference when no memory access is neccessary, for example when multiplying local variables. Then it is 56 versus 22 cycles. Which saves 2μs out of 3,5μs.

### AVR GCC handling of a multiplication by a constant

The way AVR GCC handles multibyte multiplication is caused by writing the code in C that doesn’t allow for exact specifications of the operand and result sizes and as such it is not a bug, it’s a feature that we have to be aware of. But there is a bug in AVR GCC that causes the compiler to produce a suboptimal code for multiplications by a constant, see forum post at AVR Freaks forum. The problem is that multiplication by a power of 2 is interpreted as shifts even when it is worse than actual multiplication. Shifts are sometimes worse because AVR offers only 1-bit shifts of 8-bit operands. For instance, the simple code

int a = -10; int x; void setup() { x = a * 64; }

gets compiled as

ca: 36 e0 ldi r19, 0x06 ; 6 cc: 88 0f add r24, r24 ce: 99 1f adc r25, r25 d0: 3a 95 dec r19 d2: e1 f7 brne .-8

This takes more then 30 cycles, while writing this as a multiplication would take 8. That’s a huge difference.

## Code

The header files ready to include in your sketch can be downloaded here:

https://github.com/rekka/avrmultiplication

The functions are implemented as macros. This means that you have to call them in a little bit different manner than regular C functions. For example, using signed 16bit x 16bit -> 32bit multiplication is performed by:

int x = 12; int y = -32; long result32; MultiS16X16to32(result32, x, y);

Also, this means that the whole multiplication code is included at every place that you use a macro. This can be undesirable if you are using one macro many times. Of course, you can write your own stub function, for example:

long FuncMultiS16X16to32(int x, int y) { long result32; MultiS16X16to32(result32, x, y); return result32; }

The notation of macros is simple. It starts with Multi, followed by U, SU or S, depending on the signature of arguments and 16X16, 32X16 or 16X8 depending on the size of arguments. It is finished by to16, to H16, toL16 or toH32 with or without rounding (Round), indicating what part of the result is stored (whole, L for Low or H for High).

The library is not complete, it contains only the methods that I needed so far. But it contains all 16×16 methods I hope. I’m gonna try to expand it as soon as I need another version of the methods. Also, I can include additional versions if there is interest. Let me know.

Here are the 16X16 codes:

// longRes = intIn1 * intIn2 #define MultiU16X16to32(longRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "movw %A0, r0 \n\t" \ "mul %B1, %B2 \n\t" \ "movw %C0, r0 \n\t" \ "mul %B2, %A1 \n\t" \ "add %B0, r0 \n\t" \ "adc %C0, r1 \n\t" \ "adc %D0, r26 \n\t" \ "mul %B1, %A2 \n\t" \ "add %B0, r0 \n\t" \ "adc %C0, r1 \n\t" \ "adc %D0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (longRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26" \ ) // intRes = intIn1 * intIn2 >> 16 // uses: // r26 to store 0 // r27 to store the byte 1 of the 32bit result #define MultiU16X16toH16(intRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "mov r27, r1 \n\t" \ "mul %B1, %B2 \n\t" \ "movw %A0, r0 \n\t" \ "mul %B2, %A1 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "mul %B1, %A2 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (intRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26" , "r27" \ ) // intRes = intIn1 * intIn2 >> 16 + round // uses: // r26 to store 0 // r27 to store the byte 1 of the 32bit result // 21 cycles #define MultiU16X16toH16Round(intRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "mov r27, r1 \n\t" \ "mul %B1, %B2 \n\t" \ "movw %A0, r0 \n\t" \ "mul %B2, %A1 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "mul %B1, %A2 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "lsl r27 \n\t" \ "adc %A0, r26 \n\t" \ "adc %B0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (intRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26" , "r27" \ ) // signed16 * signed16 // 22 cycles #define MultiS16X16to32(longRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "movw %A0, r0 \n\t" \ "muls %B1, %B2 \n\t" \ "movw %C0, r0 \n\t" \ "mulsu %B2, %A1 \n\t" \ "sbc %D0, r26 \n\t" \ "add %B0, r0 \n\t" \ "adc %C0, r1 \n\t" \ "adc %D0, r26 \n\t" \ "mulsu %B1, %A2 \n\t" \ "sbc %D0, r26 \n\t" \ "add %B0, r0 \n\t" \ "adc %C0, r1 \n\t" \ "adc %D0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (longRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26" \ ) // signed16 * signed 16 >> 16 #define MultiS16X16toH16(intRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "mov r27, r1 \n\t" \ "muls %B1, %B2 \n\t" \ "movw %A0, r0 \n\t" \ "mulsu %B2, %A1 \n\t" \ "sbc %B0, r26 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "mulsu %B1, %A2 \n\t" \ "sbc %B0, r26 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (intRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26", "r27" \ ) // multiplies a signed and unsigned 16 bit ints with a 32 bit result #define MultiSU16X16to32(longRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "movw %A0, r0 \n\t" \ "mulsu %B1, %B2 \n\t" \ "movw %C0, r0 \n\t" \ "mul %B2, %A1 \n\t" \ "add %B0, r0 \n\t" \ "adc %C0, r1 \n\t" \ "adc %D0, r26 \n\t" \ "mulsu %B1, %A2 \n\t" \ "sbc %D0, r26 \n\t" \ "add %B0, r0 \n\t" \ "adc %C0, r1 \n\t" \ "adc %D0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (longRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26" \ ) // multiplies signed x unsigned int and returns the highest 16 bits of the result #define MultiSU16X16toH16(intRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "mov r27, r1 \n\t" \ "mulsu %B1, %B2 \n\t" \ "movw %A0, r0 \n\t" \ "mul %B2, %A1 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "mulsu %B1, %A2 \n\t" \ "sbc %B0, r26 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (intRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26", "r27" \ ) // multiplies signed x unsigned int and returns the highest 16 bits of the result // rounds the result based on the MSB of the lower 16 bits // 22 cycles #define MultiSU16X16toH16Round(intRes, intIn1, intIn2) \ asm volatile ( \ "clr r26 \n\t" \ "mul %A1, %A2 \n\t" \ "mov r27, r1 \n\t" \ "mulsu %B1, %B2 \n\t" \ "movw %A0, r0 \n\t" \ "mul %A1, %B2 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "mulsu %B1, %A2 \n\t" \ "sbc %B0, r26 \n\t" \ "add r27, r0 \n\t" \ "adc %A0, r1 \n\t" \ "adc %B0, r26 \n\t" \ "lsl r27 \n\t" \ "adc %A0, r26 \n\t" \ "adc %B0, r26 \n\t" \ "clr r1 \n\t" \ : \ "=&r" (intRes) \ : \ "a" (intIn1), \ "a" (intIn2) \ : \ "r26", "r27" \ )

on October 13, 2009 at 7:41 amKenny RootThis would be a good chance for you to fix GCC to use optimized routines in those cases!

on February 3, 2011 at 12:47 pmGregExcellent, thanks for the info! I was counting clock cycles based on assembler instructions to get a ballpark figure for program performance then was quite disappointed to see the results in C++.

on February 22, 2011 at 11:21 ammekonikHey Greg, glad you found it useful. Norbert

on August 25, 2011 at 7:23 amLuisHi! I’ve been trying to use your macros in my AVR C program, but the compiler generates the following errors:

error: can’t find a register in class ‘SIMPLE_LD_REGS’ while reloading ‘asm’ error: ‘asm’ operand has impossible constraints

Nothing fancy with my program (it’s not an Arduino/Wiring sketch, just plain AVR C), my avr-libc is version 1.6.8-2 and compiling with avr-gcc version 4.3.5-1 for an ATmega644. Can you think of the cause for this issue? Thanks for the great post!