The usual implementation starts with a random string of 28 characters. In each generation it makes, say, 100 copies of the string, with a small chance (say 5%) per character that the character is replaced by a random one. It then selects the copy that is closest to the target sequence (the highest number of correct characters in correct positions)

METHINKS IT IS LIKE A WEASEL

and repeats the process until one of the copies exactly matches the target sequence.

The program is not supposed to model any real biological system. But when one allows for insertions and deletions of random letters, this simple modification moves the algorithm somewhat closer to how DNA mutates. It turns out that an implementation of such a modification is straightforward in Mathematica 6.

alphabet = CharacterRange["a", "z"]~Join~{" "};
source = Characters[""];
target = Characters["methinks it is like a weasel"];
numindividuals = 50;
maxgenerations = 500;
(* Mutation rates: {deletion, insertion, replacement} *)
mutationrates = {0.02, 0.02, 0.02};

With[{mr = Join[#, {1 - Total[#]}] &[mutationrates]},
 mutate[l_] :=
  Flatten[
   If[RandomReal[] < mr[[2]], {RandomChoice[alphabet], #}, #] &[
    RandomChoice[
       Unevaluated[
        mr -> {{}, {#, RandomChoice[alphabet]},
          RandomChoice[alphabet], #}]] & /@ l
   ]
  ]
 ]

Timing[
 res = NestWhileList[
    SortBy[Table[mutate[#], {numindividuals}],
      EditDistance[target, #] &][[1]] &,
    source, (# != target) &, 1, maxgenerations];
 ]

(* Printout *)
Column[StringJoin /@ res[[;; ;; 10]]]
ListPlot[EditDistance[target, #] & /@ res, PlotRange -> All,
 Filling -> Axis, AxesOrigin -> {0, 0}]

In this code, we start with an empty string “”. In each generation, we produce a pool of 50 individuals (copies). While creating a copy, each letter in the parent string has a 2% chance to be erased, a 2% chance to be changed to a random letter, and a 2% chance that a random letter will be inserted next to it. Then we sort the pool by the Levenshtein edit distance from the target string, select the string with the smallest distance, and use it as the parent string for the copies in the next generation. We repeat the process until one of the copies exactly matches the target string or we reach the maximal number of generations, whichever comes first.

The following is a sample output of the program:

h ee il eal
es i lie weael
etn i s lie aweael
eth i is lie aweael
ethns i is lie a eael
ethns i is lie a eael
ethns i is lie a weael
ethns i is lie a weael
ethnk i is lie a weasel
ethnks i is lie a weasel
ethnks i is lie a weasel
ethnks i is like a weasel
ethnks i is like a weasel
ethnks i is like a weasel
ethnks i is like a weasel
ethnks i is like a weasel
ethnks i is like a weasel
methnks i is like a weasel
methnks i is like a weasel
methnks i is like a weasel
methnks i is like a weasel
methnks i is like a weasel
methiks i is like a weasel
methiks it is like a weasel
methiks it is like a weasel
methiks it is like a weasel
methiks it is like a weasel
methiks it is like a weasel

Note that the printout shows only every 10th generation. You can see that there is no locking of correct letters; for example, the w was removed and replaced by a space from line 5 to line 6, even though it was in the correct spot. The following figure shows the evolution of the edit distance of the best string in each generation:

And this is a sample output for an initial string “anything is possible for this weasel code”:

anything is possible for this weasel code
aptaink zs fukr t wasel
taink s k wasel
think s k wasel
thinks is lk easel
mhinks is ike weasel
meinki is ike a weasel
meink i is ike a weasel
meink i is ike a weasel
meink i i like a weasel
meink i i like a weasel
meink i is like a weasel
meink i is like a weasel
meink i is like a weasel
meink i is like a weasel
meink i is like a weasel
meink i is like a weasel
meink i is like a weasel
mehnk i is like a weasel
methnk it is like a weasel
methnk it is like a weasel
methnk it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel
methink it is like a weasel

During my attempt to learn Japanese, I had to find a good way to learn the *kanji* characters. Even though I learned hiragana and katakana in a day or two and came to read them quite automatically in Japanese text, kanji turned out to be a different story. At first I just tried remembering the shapes in the hope that they would stick in my memory, but that proved futile for me. As somebody who grew up with the Latin alphabet, I am terrible at remembering shapes, not to mention stroke order.

So I tried memorizing the shapes using *drills*, pretty much the way Japanese schoolkids learn in elementary school. It didn’t take long for me to realize how boring this method is. Also, one kanji a day is not exactly the speed I was aiming for.

I looked around a bit and found two books that offer *mnemonic methods*, somewhat similar to each other: *Guide to Remembering Japanese Characters* by Kenneth G. Henshall and *Remembering the Kanji* by James W. Heisig. Both books contain about 2000 kanji and provide mnemonics for rapid learning. I eventually chose Heisig’s book, since I think it is a more systematic way of learning, and its mnemonics focus more on the meaning than on the shape of the kanji.

I think that, for me, the book did what it promised to do. I’m able to recognize most of the book’s kanji, and I can write the kanji in the correct stroke order (I think this is the strength of Heisig’s method compared to drills). I could even impress my Japanese friends by correcting their stroke order mistakes.

The main criticism of the book that you’ll find online is the ordering of the characters. Almost all other systems teach the most frequent characters first. Heisig chooses a different approach: he groups characters according to the *primitive elements* they contain. Primitive elements are parts of a kanji that are shared among multiple characters; many of them are characters themselves. Learning the characters that serve as primitive elements, followed in sequence by the characters containing them, reinforces the learning process. It worked great for me. The drawback, however, is that you are learning a mix of common and rare kanji. Heisig is honest about this, and he states clearly in the introduction that the book is intended for self-study by those who want to master all 2000 most common characters.

A neat way to compare the ordering of the characters in the two books, and also the order in which the characters are learned in Japanese elementary schools, is to plot the frequency of use versus the order in which they are learned. All the data is available online thanks to the awesome work of Jim Breen et al. on the KANJIDIC2 project. The data can be downloaded in XML format, which makes it easy to work with. I used Wolfram’s Mathematica with its built-in ability to process XML data.

The kanjidic2.xml file contains an entry for each Japanese kanji. When available, the entry contains information about the grade in which the kanji is learned, and the Henshall and Heisig indexes that represent the order in which the character is learned in each of these books. Moreover, the 2,500 most-used characters carry a frequency-of-use ranking. The kanjidic2.xml file description states:

The 2,500 most-used characters have a ranking; … The frequency is a number from 1 to 2,500 that expresses the relative frequency of occurrence of a character in modern Japanese. This is based on a survey in *newspapers*, so it is *biased* towards kanji used in newspaper articles. The discrimination between the less frequently used kanji is not strong. (emphasis mine)

So even though it has limitations, the ranking can still provide a useful comparison. As for the grade information, the kanjidic2.xml description says:

1 through 6 indicate the grade in which the kanji is taught in Japanese schools. 8 indicates it is one of the remaining Jouyou Kanji to be learned in junior high school, and 9 indicates it is a Jinmeiyou (for use in names) kanji.

I produced 3 plots, showing the dependence of the frequency-of-use ranking on Heisig’s index, Henshall’s index and the grade, respectively. I also included the characters that don’t appear in those learning systems but have a frequency-of-use ranking, to see which “common” kanji are omitted. The plots are below. The points to the right of the dashed line represent characters that have a frequency-of-use ranking but don’t appear in the learning system. I added random noise to the horizontal position of each of these points so that the frequency-of-use ranking distribution is apparent. I also added horizontal noise to the grade plot, as the grades are integer values, which would again hide the ranking distribution.

The first plot confirms that the criticism of Heisig’s book is valid. There is no obvious correlation between the order in which Heisig presents the characters in his book and the frequency of use. I would recommend his book only to those who want to learn all 2042 kanji that it contains. On the other hand, Heisig’s order makes for more entertaining and effective learning, and I can’t stress enough how important that is for learning success.

Henshall’s book shows a much better correlation. The first half obviously presents characters in frequency-of-use order, and so does the Japanese elementary school system. It is apparent to me that Henshall based his order on the grade system, presenting the 6 grades in the first half of his book and the junior high school characters (grades 8 and 9 in the kanjidic2 data) in the second half.

Interestingly, grade 1 contains some fairly uncommon characters such as 犬, 耳, 虫, 糸 and 貝, all with a frequency-of-use ranking above 1300. But this is obviously an artifact and a limitation of the frequency ranking source, *newspapers*. In fact, these characters are quite useful in basic Japanese.

The Arduino uses an ATmega168 or a similar 8-bit RISC processor. The most serious limitation of these chips is that their instruction set mostly contains instructions working with 8-bit arguments. This limitation becomes more pronounced when one wants to multiply multibyte numbers. When multiplying two *k*-byte numbers into a 2*k*-byte result, one has to perform *k*^2 multiplications and on the order of *k*^2 additions.

The AVR architecture offers 3 multiplication instructions, **mul**, **muls** and **mulsu**; see the AVR instruction set. All of them take 2 registers as arguments, perform an 8bit x 8bit -> 16bit multiplication and store the result in the R1:R0 register pair. **mul** assumes that both arguments are unsigned, **muls** assumes signed arguments and **mulsu** assumes a signed and an unsigned argument.

These distinctions are necessary because negative numbers are stored in two’s complement form. For 8-bit numbers, that means that a negative number *-x* is stored as 2^8 – *x*. When multiplying 8-bit numbers we get

- -x * y = (2^8 – x) * y = 2^8 * y – x * y
- -x * -y = (2^8 – x) * (2^8 – y) = 2^16 – 2^8 * x – 2^8 * y + x * y.

The instructions **muls** and **mulsu** thus have to remove the terms 2^8 * x and 2^8 * y when appropriate. We don’t have to worry about the internal workings of these instructions unless we want to perform multibyte multiplication.

Suppose that we have two 16-bit numbers *x* and *y*, and we want to compute *x* * *y*. This operation yields a 32-bit result. To implement it using 8-bit instructions, we have to split *x* and *y* into high and low bytes, *x*1, *x*0 and *y*1, *y*0. Then we have

x = 256 * x1 + x0
y = 256 * y1 + y0

First suppose that *x* and *y* are unsigned numbers. Multiplication then yields

x * y = 65536 * x1 * y1 + 256 * x1 * y0 + 256 * x0 * y1 + x0 * y0

Thus this can be implemented using 4 **mul** instructions and some additions with appropriate shifts. When *x* and *y* are signed numbers in two’s complement notation, the result becomes more complicated:

-x * y = (2^16 - 2^8 * x1 - x0) * (2^8 * y1 + y0)
       = -x0*y0 - x1*y0*2^8 - x0*y1*2^8 + y0*2^16 - x1*y1*2^16 + y1*2^24
-x * -y = (2^16 - 2^8 * x1 - x0) * (2^16 - 2^8 * y1 - y0)
        = x0*y0 + x1*y0*2^8 + x0*y1*2^8 - x0*2^16 - y0*2^16 + x1*y1*2^16 - x1*2^24 - y1*2^24 + 2^32

As you can see, there are extra terms that have to be removed. The 2^32 term is removed automatically, as we store only 32 bits of the result. The rest can be removed by a proper combination of **mul**, **muls** and **mulsu** instructions. There are 4 multiplications, and their signatures turn out to be quite simple; see the figure on the right. The arrows signify a single 8-bit multiplication; the red byte is to be treated as unsigned, the blue one as signed. The results then have to be added together with appropriate shifts. For instance, the result of the multiplication of x1 and y0 must be shifted by 1 + 0 = 1 byte to the left.

This scheme can also be extended to, for example, signed 32bit x 32bit -> 64bit. In this case, we simply treat x1, x0, y1 and y0 as the 16-bit words of the 32-bit operands. In the case of signed 16bit x unsigned 16bit multiplication, one simply changes the blue color of the unsigned operand to red; see the figure below.

You can find more on unsigned multiplication in AVR assembler here.

Right shifts are often necessary when one is using fixed-point integer math. They are essentially divisions by powers of 2. For example, when we want to multiply *x* by a number *y* between 0 and 1, we first multiply *y* by a power of 2, preferably 2^8 or 2^16 depending on the desired precision, so that *y* becomes an integer. Then we multiply *x* and *y* and divide the result by the same power of 2. This is done as a right shift. And as with ordinary division, rounding the result gives better precision. It is quite clear that 0.9 is better rounded to 1 than to 0. Unfortunately, a plain right shift rounds everything down to the closest integer.

There is a simple solution to this. We can test the most significant bit (MSB) of the part that is shifted out of the result. If this bit is set, we simply add 1 to the result. If it is cleared, we don’t have to do anything. This adds more accuracy to multiplications, especially when the result is an operand for another multiplication.

Let’s have a look at how AVR GCC handles multiplication. First suppose that we want to multiply two signed 8-bit numbers and get a 16-bit result. We would write something like:

char a = -10;
char b = 10;
int x;

void setup() {
  x = a * b;
}

This produces the following code:

x = a * b;
  c2: 80 91 01 01  lds r24, 0x0101 // load a
  c6: 20 91 00 01  lds r18, 0x0100 // load b
  ca: 82 02        muls r24, r18
  cc: c0 01        movw r24, r0
  ce: 11 24        eor r1, r1
  d0: 90 93 09 01  sts 0x0109, r25 // store x high byte
  d4: 80 93 08 01  sts 0x0108, r24 // store x low byte

As you can see, the compiler produces the correct 16-bit result using only one **muls** instruction. No need for typecasting. The instruction **eor r1, r1** clears the register R1, which is supposed to be 0 by AVR GCC convention. You may notice, though, that the **movw** instruction is unnecessary.

Now let’s see what happens when we want to multiply two 16-bit signed numbers and get a 32-bit result. When we write:

int a = -10;
int b = 10;
long x;

void setup() {
  x = a * b;
}

we get the following code:

x = a * b;
  c2: 20 91 02 01  lds r18, 0x0102
  ...
  d2: ac 01        movw r20, r24
  d4: 24 9f        mul r18, r20
  d6: c0 01        movw r24, r0
  d8: 25 9f        mul r18, r21
  da: 90 0d        add r25, r0
  dc: 34 9f        mul r19, r20
  de: 90 0d        add r25, r0
  e0: 11 24        eor r1, r1
  e2: aa 27        eor r26, r26
  e4: 97 fd        sbrc r25, 7
  e6: a0 95        com r26
  e8: ba 2f        mov r27, r26
  ea: 80 93 0a 01  sts 0x010A, r24
  ...

As you can see, it performs only 3 multiplications, doing 16bit x 16bit -> 16bit and then extending the result to 32 bits. That’s not what we want, since we lose the 2 highest bytes of the multiplication. We have to typecast the ints into longs, writing

x = (long) a * b;

But this produces:

x = (long) a * b;
  c2: 60 91 02 01  lds r22, 0x0102
  c6: 70 91 03 01  lds r23, 0x0103
  ca: 88 27        eor r24, r24 // extension to 32 bit
  cc: 77 fd        sbrc r23, 7
  ce: 80 95        com r24
  d0: 98 2f        mov r25, r24
  d2: 20 91 00 01  lds r18, 0x0100
  d6: 30 91 01 01  lds r19, 0x0101
  da: 44 27        eor r20, r20 // extension to 32 bit
  dc: 37 fd        sbrc r19, 7
  de: 40 95        com r20
  e0: 54 2f        mov r21, r20
  e2: 0e 94 fd 01  call 0x3fa ; 0x3fa <__mulsi3>
  e6: 60 93 0a 01  sts 0x010A, r22
  ...

In this case, the compiler extends the operands to 32 bits first and then calls a 32bit x 32bit -> 32bit multiplication routine. But this is very wasteful, as the routine performs 10 multiplications instead of the necessary 4, plus other overhead required for a full long multiplication. The whole multiplication (together with the memory access) takes **72** cycles instead of the optimized **38** cycles. That makes a difference of more than **2μs** on a single multiplication. The difference is even bigger when no memory access is necessary, for example when multiplying local variables: then it is 56 versus 22 cycles, which saves 2μs out of 3.5μs.

The way AVR GCC handles multibyte multiplication stems from the fact that C does not allow exact specification of the operand and result sizes; as such, it is not a bug but a feature we have to be aware of. But there is a bug in AVR GCC that causes the compiler to produce suboptimal code for multiplications by a constant; see the forum post at AVR Freaks. The problem is that a multiplication by a power of 2 is compiled as shifts even when that is worse than an actual multiplication. Shifts are sometimes worse because AVR offers only 1-bit shifts of 8-bit operands. For instance, the simple code

int a = -10;
int x;

void setup() {
  x = a * 64;
}

gets compiled as

  ca: 36 e0  ldi r19, 0x06 ; 6
  cc: 88 0f  add r24, r24
  ce: 99 1f  adc r25, r25
  d0: 3a 95  dec r19
  d2: e1 f7  brne .-8

This takes more than 30 cycles, while writing it as a multiplication would take 8. That’s a huge difference.

The header files ready to include in your sketch can be downloaded here:

https://github.com/rekka/avrmultiplication

The functions are implemented as macros. This means that you have to call them in a slightly different manner than regular C functions. For example, a signed 16bit x 16bit -> 32bit multiplication is performed by:

int x = 12;
int y = -32;
long result32;

MultiS16X16to32(result32, x, y);

Also, this means that the whole multiplication code is included at every place that you use a macro. This can be undesirable if you are using one macro many times. Of course, you can write your own stub function, for example:

long FuncMultiS16X16to32(int x, int y) {
  long result32;
  MultiS16X16to32(result32, x, y);
  return result32;
}

The notation of the macros is simple. A name starts with Multi, followed by U, SU or S, depending on the signature of the arguments, and 16X16, 32X16 or 16X8, depending on the size of the arguments. It ends with to16, toH16, toL16 or toH32, with or without rounding (Round), indicating what part of the result is stored (whole, L for Low or H for High).

The library is not complete; it contains only the methods that I have needed so far, though I believe it covers all the 16×16 variants. I will expand it as soon as I need another version of the methods. I can also include additional versions if there is interest. Let me know.

Here are the 16X16 codes:

// longRes = intIn1 * intIn2
#define MultiU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mul %B1, %B2 \n\t" \
  "movw %C0, r0 \n\t" \
  "mul %B2, %A1 \n\t" \
  "add %B0, r0 \n\t" \
  "adc %C0, r1 \n\t" \
  "adc %D0, r26 \n\t" \
  "mul %B1, %A2 \n\t" \
  "add %B0, r0 \n\t" \
  "adc %C0, r1 \n\t" \
  "adc %D0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (longRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26" \
)

// intRes = intIn1 * intIn2 >> 16
// uses:
// r26 to store 0
// r27 to store the byte 1 of the 32bit result
#define MultiU16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "mov r27, r1 \n\t" \
  "mul %B1, %B2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mul %B2, %A1 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "mul %B1, %A2 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (intRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26", "r27" \
)

// intRes = intIn1 * intIn2 >> 16 + round
// uses:
// r26 to store 0
// r27 to store the byte 1 of the 32bit result
// 21 cycles
#define MultiU16X16toH16Round(intRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "mov r27, r1 \n\t" \
  "mul %B1, %B2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mul %B2, %A1 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "mul %B1, %A2 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "lsl r27 \n\t" \
  "adc %A0, r26 \n\t" \
  "adc %B0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (intRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26", "r27" \
)

// signed16 * signed16
// 22 cycles
#define MultiS16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "movw %A0, r0 \n\t" \
  "muls %B1, %B2 \n\t" \
  "movw %C0, r0 \n\t" \
  "mulsu %B2, %A1 \n\t" \
  "sbc %D0, r26 \n\t" \
  "add %B0, r0 \n\t" \
  "adc %C0, r1 \n\t" \
  "adc %D0, r26 \n\t" \
  "mulsu %B1, %A2 \n\t" \
  "sbc %D0, r26 \n\t" \
  "add %B0, r0 \n\t" \
  "adc %C0, r1 \n\t" \
  "adc %D0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (longRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26" \
)

// signed16 * signed 16 >> 16
#define MultiS16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "mov r27, r1 \n\t" \
  "muls %B1, %B2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mulsu %B2, %A1 \n\t" \
  "sbc %B0, r26 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "mulsu %B1, %A2 \n\t" \
  "sbc %B0, r26 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (intRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26", "r27" \
)

// multiplies a signed and unsigned 16 bit ints with a 32 bit result
#define MultiSU16X16to32(longRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mulsu %B1, %B2 \n\t" \
  "movw %C0, r0 \n\t" \
  "mul %B2, %A1 \n\t" \
  "add %B0, r0 \n\t" \
  "adc %C0, r1 \n\t" \
  "adc %D0, r26 \n\t" \
  "mulsu %B1, %A2 \n\t" \
  "sbc %D0, r26 \n\t" \
  "add %B0, r0 \n\t" \
  "adc %C0, r1 \n\t" \
  "adc %D0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (longRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26" \
)

// multiplies signed x unsigned int and returns the highest 16 bits of the result
#define MultiSU16X16toH16(intRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "mov r27, r1 \n\t" \
  "mulsu %B1, %B2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mul %B2, %A1 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "mulsu %B1, %A2 \n\t" \
  "sbc %B0, r26 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (intRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26", "r27" \
)

// multiplies signed x unsigned int and returns the highest 16 bits of the result
// rounds the result based on the MSB of the lower 16 bits
// 22 cycles
#define MultiSU16X16toH16Round(intRes, intIn1, intIn2) \
asm volatile ( \
  "clr r26 \n\t" \
  "mul %A1, %A2 \n\t" \
  "mov r27, r1 \n\t" \
  "mulsu %B1, %B2 \n\t" \
  "movw %A0, r0 \n\t" \
  "mul %A1, %B2 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "mulsu %B1, %A2 \n\t" \
  "sbc %B0, r26 \n\t" \
  "add r27, r0 \n\t" \
  "adc %A0, r1 \n\t" \
  "adc %B0, r26 \n\t" \
  "lsl r27 \n\t" \
  "adc %A0, r26 \n\t" \
  "adc %B0, r26 \n\t" \
  "clr r1 \n\t" \
  : "=&r" (intRes) \
  : "a" (intIn1), "a" (intIn2) \
  : "r26", "r27" \
)

**WARNING: This is an advanced project that requires quite a substantial amount of tweaking and understanding of the software internals to make it work. Please consider this post just a description of what I did. Furthermore, it is more than 3 years old and therefore hopelessly outdated; the software most likely needs nontrivial changes.**

*This is my first attempt to describe something that I don’t have much experience with. If you find an error, if there is something that is not clear or if there is something I could improve, please, leave a comment.*

The general setup is quite straightforward. The main part is an electromagnet consisting of a coil on an iron core. The current through my coil from a push-type solenoid is 300mA at 12V. At the bottom of the electromagnet, there is a Hall effect sensor directly on the iron core, positioned in such a way that the sensor detection axis is aligned with the core axis (figure on the right).

- Arduino Duemilanove
- Switched power supply 12V
- Linear Hall effect sensor Honeywell SS19
- Norton operational amplifier MC3401P
- NPN transistor MPSA06
- Rectifier 1N4001
- Electromagnet (I used a coil from a 12V push-type solenoid)
- Resistors 2x 1k, 5k6, 47k, 68k, 330k, 4x 1M
- Capacitors 2x 1μ

Since I don’t have much experience, I tried to keep the circuit as simple as possible. The project consists of two quite independent parts.

The first part is the coil driver. I used a small transistor to turn the coil on or off, added a reverse-biased diode across the coil to protect the transistor against fly-back currents, and put a capacitor across the supply for noise reduction. The base of the transistor is connected to an Arduino digital output through a 1k resistor, with an extra LED to indicate the pin state.

The second part is a bit more involved. I used the linear Hall effect sensor SS19 from Honeywell. It is a tiny black box with 3 pins: 2 are connected to GND and +5V, and the 3rd is the output. The sensor translates the component of the magnetic field perpendicular to its two largest faces into a voltage on the output. In my case, it was 2.15V with no field and 3.0V with the maximal field (coil on and magnet nearby). Thus I attached 2 Norton operational amplifiers (out of the 4 in an MC3401). The first stage subtracts ~1.5V while the second amplifies the difference by a factor of ~3. That gives a signal in the range 1.8 — 4.5V, within the working range of the amplifier. This amplified signal then connects to an Arduino analog input pin. Note that I used a *Norton* op amp. It amplifies a current difference, unlike the usual op amp that amplifies a voltage difference. That’s why the wiring is slightly different. I also had to add a 5k6 load resistor on the sensor output to make it work properly.

The magnetic field needed to keep a magnet from falling changes with the distance from the electromagnet. In our case, the position of the magnet is monitored by the Hall effect sensor. The closer the magnet is to the sensor, the higher the voltage on the output. And the closer the magnet is, the smaller the field required to keep it from falling. The situation is shown in the following figure:

The purple straight line represents the magnetic field required to keep a magnet from falling, as a function of the reading on the sensor. In reality it is a nonlinear curve, but that is not important for us. We need to produce a field that will keep the magnet at a specific position. For that we will modulate the coil’s power so that the field produced is given by the blue line in the previous plot. The two extreme values are the maximal field of the coil and the field when the coil is off. If we are successful, the magnet should be stable at the intersection of the two curves in the middle of the plot. If the magnet gets too close to the iron core, the sheer attractive force between the two is enough to pull it in; that is represented by the rightmost intersection. We need to prevent this situation.

That sounds quite easy, doesn’t it? But there is a small complication. The reading on the sensor includes the magnetic field of the coil. In fact, because the sensor is right on the core this component will probably be bigger than the component we are interested in, the field of the magnet itself. Fortunately for us, we know what signal we use to drive the coil. However, if we plot the driving signal and the magnetic field of the coil, we get the following plot:

This behavior is caused by the large inductance of the coil. When the transistor is open, there is +12V across the coil. But the changing current induces a magnetic field that resists this change, and it takes some time for the full current to flow. When the transistor is closed, the current continues to flow through the diode in parallel. The resistance of the coil and the diode dissipates the energy and the current decreases to zero. In my case, it takes around **5ms** for the coil to energize or deenergize. That is way too long for us to ignore. But we can model this behavior. A coil can be approximated as a resistance *R* and inductance *L* in series. The circuit can be simplified like this:

The differential equation for the current *I* through the coil then reads

L * dI/dt + R * I = V

where *V* is the voltage on the coil. In our case, it is +12V when the driving signal is 1 and -0.6V (the diode voltage) when the signal is 0. The magnetic field is proportional to the current. For simplicity, we rewrite the equation for a dimensionless field power *y* and applied voltage *x*, *x* being -0.05 = -0.6/12 or 1 and *y* being in the range 0 — 1. That yields the equation

dy/dt = λ * (x – y)

*λ* is a parameter depending on the coil. But this is an equation for a low-pass filter. This continuous form can be discretized, see Low-pass filter on Wikipedia. Assuming constant timesteps, Δ*t* = 1, we get the final formula

y[n] = y[n-1] + α * (x[n] – y[n-1])

The parameter *α* depends on the coil and on the timestep. It can be found using

α = 1 – e^(–1/T) ≈ 1/T

where *T* is the number of timesteps it takes for the coil to energize from 0 to 0.632 of its maximal field. This has to be found by some experimentation. Generally, *α* is different for the energization and deenergization phases, since the diode adds extra resistance to the circuit.

Now with some theoretical background, we can proceed to describe the algorithm that controls the output signal. First, we have to find the following parameters (**field** shall denote the reading on the sensor):

- **sample frequency** = how often to sample **field** and adjust the output signal; I use 10kHz
- **baseline** = **field** with no magnetic fields
- **coilMag** = the maximal strength of the coil, i.e. the difference between **field** with the coil on and **baseline**
- **alphaInc** = the constant *α* from above for energization of the coil
- **alphaDec** = the constant *α* from above for deenergization

We have to keep track of the following variables:

- **signal** = the last signal output, 0 or 1
- **filter** = a value in the range 0 to 1; this is the field of the coil in the last step, the result of the discretized formula.

With these parameters, we perform the following steps with the **sample frequency**:

- Update the **filter**:
  - If **signal == 1**, use **filter += alphaInc * (1 – filter)**,
  - else **filter -= alphaDec * (filter + 0.05)**.
- Read the sensor input, **field**. Based on this value, find the value of the permanent magnet’s field alone, **mag = field – coilMag * filter – baseline**.
- Using the value **mag**, which represents the distance of the magnet from the sensor, estimate the coil power required to keep the magnet from falling, **power**. This is estimated based on the simple model shown in Figure 1.
- Produce a new **signal**:
  - If **power > filter**, set **signal = 1** (need to energize),
  - else set **signal = 0** (need to deenergize).

I should make some remarks on my software implementation:

To achieve constant sampling and output rates, some constant time base is necessary. The standard millis() method doesn’t allow for the desired precision. One option is to use the built-in timers, as millis() does, and change their resolution. But the simplest way is to use the A/D converter itself for timing purposes. The time it takes the ADC to perform a reading is constant and can be changed via the ADC prescaler settings. It is given by the formula

clock = 16,000,000
prescaler = 2, 4, 8, 16, 32, 64 or 128
conversionCycles = 13
samplingRate = clock / prescaler / conversionCycles;

The default prescaler value is 128, which gives a maximal conversion rate of about 9.6kHz. I’m using 64, which leads to a 19230.8Hz sampling rate. The ADC clock value, clock/prescaler, affects the ADC accuracy. ATMEL recommends the range 50-200kHz for maximal accuracy, with rates up to 1MHz if less accuracy is needed. My choice gives a 250kHz ADC clock, which is quite close to the recommendation.

Now for the synchronization. When a conversion is finished, an interrupt flag in the ADC registers is set. Waiting for this flag provides the necessary synchronization. To ensure that the next conversion is started as soon as one is finished, the ADC is configured to work in a free running mode with an auto-trigger enabled. The functions **analogSetup** and **analogStart** configure the ADC and start the first conversion while **analogNext** waits for a conversion to finish and thus provides the synchronization for the program.

The coil simulation is done with fixed-point integer math. That means that the values of **filter**, **alphaDec** and **alphaInc**, which lie in the range 0 to 1, are multiplied by 2^16 = 65536. When a multiplication such as **coilMag * filter** is performed, the actual code is

result = coilMag * filter >> 16;

Since AVR GCC adds unnecessary extra instructions for multibyte multiplications like this, I implemented my own assembler routines. This is not really required for this simple code, but I also experiment with additional digital signal processing to achieve greater stability of the magnet, and this processing is quite costly in terms of multiplications. Also, my own multiplication routine allows me to round the result of >> 16, which adds extra precision to the computation.

The Arduino performs all the computations, but the code is written in such a way that it must be connected to the computer. The levitation routine is initiated when the Arduino receives 32 bytes from the computer. These 32 bytes must contain 16 integers and are written into the array **ap**. The significance of the individual fields is

ap[0] = alphaInc * 2^16
ap[2] = alphaDec * 2^16
ap[3] = 0.05 * 2^16
ap[4] = lowestMag
ap[5] = powerDecay
ap[15] = counter

**alphaInc** and **alphaDec** are the constants from above; **0.05** = 0.6/12 is the diode voltage drop (0.6V) relative to the 12V supply. **lowestMag** and **powerDecay** are the two parameters of the linear function used to compute the power of the coil’s magnetic field using the formula

power = 255 + (ap[4] - (mag)) * ap[5];

The levitation routine is stopped when a byte 0 is received by Arduino.

Since the constants alphaInc, alphaDec and the diode voltage are essentially fixed, I enter them by hand. To find their values, I simply guess; it is not that hard. To control the Arduino code and to monitor the performance, I use a simple Mathematica program. It looks like this:

The plots allow me to see the sensor reading, the signal output and the computed field of the permanent magnet (the three lines from top to bottom). To find a value for **alphaDec**, for example, I guess some value and set the counter (ap[15]) to 10. This causes the signal to be a simple square wave. Then I take a look at the computed magnet’s field. With no permanent magnet around, it should be constant and the plots should look like this:

If I guess the value too low or too high, I see some bumps like:

Finally, here is the code:

https://github.com/rekka/levitation

To make it work with the 1Mbaud communication, you will also have to ~~download~~ check my optimized Serial library,

~~wiring_serial.c~~ (the library is too outdated now. Fortunately, the official Arduino libraries have been updated and the issue has been fixed, it seems.)

With the current implementation, the project faces a serious problem: instability. It works well for levitating a magnet for a short time, minutes at most. After a while, oscillations develop and the magnet eventually falls. The reason for this is the lack of energy dissipation, or damping. The magnet under an electromagnet can be approximated by the equation
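In symbols (a sketch in my notation, with *k* a positive constant):

```latex
\ddot{y} = k\,y
```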

where *y* is the distance from the equilibrium. In words: the higher the magnet is, the closer it is to the electromagnet’s core and the stronger it is attracted; the lower it is, the further it is from the core and the less attraction there is. This system is unstable; the solutions of the equation diverge. By applying power to the electromagnet, we are trying to modify the equation to have the form
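That is, something of the form:

```latex
\ddot{y} = -c\,y
```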

where *c* is some positive constant. This equation is stable; it is a harmonic oscillator. If there is an oscillation, it will be preserved, but it won’t grow. The problem appears when there is a delay between reading the sensor and producing new output, and such a delay is present in every real system. This translates into the appearance of a new term in the equation
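In symbols, the delayed feedback adds a velocity term with the destabilizing sign (again, a sketch in my notation):

```latex
\ddot{y} = -c\,y + a\,\dot{y}
```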

Again, *a* is a positive constant. Since –*a* indicates how much oscillations are damped, this term will amplify the oscillations. The question is how to fight it. I experimented with a couple of ideas, like differentiation or integration of the signal, or using other, more involved filters. So far, I haven’t been able to produce consistent oscillation damping. But research continues.

- http://de.sevenload.com/sendungen/Computerclub2/folgen/N7wXqna-Folge-20-Computer-club2 – video from a German podcast (in German), levitation starts at 16:25, they use ATmega8 and a Hall effect sensor in the base
- http://bea.st/sight/levitation/ – combined with a wireless energy transfer, uses ATtiny26, two Hall sensors on the coil, one on each side to compensate for the coil’s field. Also, check out his wirelessly powered levitating bulb.

- http://amasci.com/maglev/maglev.html – magnetic cradle; this is levitation *above* electromagnets. Uses an array of coils with Hall effect sensors to mimic a superconductor.
- http://sites.google.com/site/simerlab/ – professional magnetic levitation devices, above electromagnets.

**Update: detailed description here**

It took only a few days to figure out all the problems and my magnet floating device was born:

Here’s a video:

And another, shorter one:

A small cylindrical magnet can float as well:

As you can see from the blurred edges, the tiny magnet oscillates a bit. After a good calibration, the oscillations can be kept very small and the magnet can keep hovering for minutes. The big dart is much more stable and can float pretty much indefinitely. The device uses a small Hall effect sensor (SS19 from Honeywell, available for $0.50 from AllElectronics) to sense the field of the permanent magnet and uses that information to modulate the magnetic field of the electromagnet. Since the sensor is on the electromagnet, the reading on it is the sum of the fields of the floating magnet and the electromagnet. The greatest challenge was separating these two and extracting the floating magnet’s field alone. After some theoretical research into inductors and Ampère’s law, and some experimentation, I achieved pretty good stability of a hovering magnet or a magnetic dart or whatever. The result is not completely perfect; some small oscillations are still noticeable. I think I have reached the limits of the Arduino A/D converter; there is always some noise to be expected. I will post more details, together with the source code for both the Arduino and my Mathematica 6 control center, when I find just a little more time. I even created this blog because I wanted to share this beauty with the world.

I also made some discoveries:

- It is really hard to buy a small, cheap electromagnet. I simply couldn’t find any. If you know how to buy one for say $4, let me know. I had to do a terrible thing, buy a solenoid from AllElectronics for $3.85 and tear it apart to get the coil from it. It works pretty well for me.
- The Arduino core library, Serial, for communication through the serial interface, is very slow because it is not optimized for the microcontroller it runs on, the ATmega168. This is an 8-bit RISC processor without an instruction for division, but the library uses 16-bit variables and division for no reason. The library also doesn’t offer an output buffer, so the Serial.print methods lock the program, waiting for one byte to be sent before they can send the next one. That can cost precious time when a fast loop is being executed. Therefore I modified the wiring_serial.c file to optimize it for the 8-bit RISC and added an output buffer (which can be completely disabled as well). See my other blog post for the file download.

The result looks pretty cool. Actually, it looks awesome. It is, however, only a device that keeps the magnet flying by *pulling* it up. My ultimate goal is the real “antigravity” device: a magnet flying *above* electromagnets. That is gonna take much more time but is hopefully feasible with this kind of simple circuitry, as demonstrated by the maglev cradle and some neat levitating contraptions from SimerLab.

A couple more pics at the end. I will post a video too if I feel like it. But it seems that YouTube is full of crappy videos anyway.

~~You can download the modified file here.~~ (See below.) Instructions on how to use the file follow below.

I first started interfacing with the Arduino from Java, but soon ran into a problem with standard vs. nonstandard baud rates. The standard rates of 57600 and 115200 baud are not a good match for the Arduino clock of 16MHz. While 57600 is quite reliable, there are many errors in the data stream from the PC to the Arduino at 115200. And 57600 was just too slow for my purposes. The solution is to use baud rates that can be derived from the 16MHz clock using the formula

baud = 16,000,000 / (16 * (n + 1));

where n = 0, 1, … is an integer. So I thought I would use 250, 500 or 1000 kilobaud. I was relieved when I found that the FT232RL chip, the USB/serial interface for the Arduino board, supports all of these. The RXTX Java library, however, does not support nonstandard baud rates. Since this is the only Java serial library for Windows I know of, I had to move on and use a different tool.

Mathematica 6 with the new dynamic functionality and the handy NETLink package was the perfect alternative. Also the built-in plotting functions came in handy.

Soon I had a 1 Mbaud link up and running. Everything worked fine when sending data from the Arduino to the PC. But bytes were dropped when sending even short commands to the Arduino. I thought it was a limitation of the ATmega168 microcontroller, that it was simply not capable of processing the incoming bytes. But looking at the Serial library, in the file wiring_serial.c, I found that the library was simply not optimized.

The biggest problem was the use of division in the interrupt routine that reads bytes from the USART controller and stores them into a buffer. Since the ATmega168 doesn’t have an instruction for division, the division takes around 200 clock cycles. That makes the whole routine run for around 15μs (microseconds). But with 1Mbaud serial communication, there is only 10μs to process one byte. With the help of westfw on the Arduino forum, we found that there is a simple fix for this.

The library uses a ring buffer that is defined in the wiring_serial.c file as

#define RX_BUFFER_SIZE 128
unsigned char rx_buffer[RX_BUFFER_SIZE];
int rx_buffer_head = 0;
int rx_buffer_tail = 0;

The head and tail mark the locations of the first and last byte of data in the buffer. When data is inserted or removed, the head or tail is incremented, respectively. To make sure that they wrap around and stay in the range 0 to RX_BUFFER_SIZE – 1, the following code is used

rx_buffer_tail = (rx_buffer_tail + 1) % RX_BUFFER_SIZE;

The problem is that % gets compiled as a real modulo operation, instead of the much faster bitwise AND, & 127. The difference is around 200 clock cycles, some 13μs, compared to a single cycle. Also, int is not necessary, since I don’t expect anyone to use buffer sizes above 256 when there is only 1kB of SRAM on the ATmega168. The solution has two steps:

- define the index variables as unsigned char:

unsigned char rx_buffer_head = 0;
unsigned char rx_buffer_tail = 0;

- use the following code to wrap the values:

rx_buffer_tail = rx_buffer_tail + 1;
rx_buffer_tail %= RX_BUFFER_SIZE;

Modifying all places with the % operation in the wiring_serial.c file results in a much improved function. I did some tests and found:

- An empty sketch (empty setup() and loop()) with the standard library takes 976 bytes. With the optimization it is only 852 bytes. That’s a difference of **124 bytes** for everybody, whether or not they use Serial (the reason is the interrupt routine, which is always linked). The other Serial functions also get somewhat shorter.
- The interrupt routine is heavily utilized when receiving data, and with the current library it takes around 250 cycles just to read one byte. That’s 15μs. When reading 10 kB/s, 15% of the processor power is spent on that, and at 1Mbaud the routine is not fast enough to keep up, so incoming bytes are dropped. The standard practice of having **if (Serial.available() > 0)** in loop() costs as much as the interrupt itself. The modified routine happily works at **1 Mbaud (1,000,000 baud)**: the interrupt routine takes only 4μs to execute, which leaves plenty of time to process incoming bytes arriving every 10μs. Serial.available() (14 cycles) and Serial.read() (18 cycles) also run much faster.

Since these optimizations don’t change anything except substantially improving the speed and code size, it would be nice to modify the wiring_serial.c file in the core Arduino library accordingly.

Another nice feature would be a buffer for outgoing data. The functions Serial.print and Serial.println send all bytes at once and lock the program until all bytes are sent. It can take a couple of milliseconds to send just a few bytes on slower connections like 9600 or 19200 baud, but it makes a lot of difference for faster connections as well. This locking can be avoided by buffering the sent data.

~~Until those features are included in the Arduino libraries, I posted the modified version of wiring_serial.c here. Just download the file wiring_serial.c and overwrite the old version in the directory ARDUINODIR\hardware\cores\arduino.~~ Unfortunately, I lost the file while moving to a different server. Also, it is very outdated, so you are safer to download the updated official version.

The size of the outgoing buffer can be set by modifying the code

#define TX_BUFFER_SIZE 32

at the beginning of the file. Just replace 32 with your desired buffer size. But make sure to use a power of 2, i.e. one of the values 4, 8, 16, 32, 64, 128 or 256. With other values, the compiler can’t optimize the % statements and you are back to the old speeds… Setting the buffer size to 0 restores the original unbuffered output; that saves memory and the code used by the buffered write routines, in case you don’t need buffered output.

I tested and use this version on my ATmega168. The file *should* work on the ATmega8 as well; it compiles, but I couldn’t test it. If you find any problems or have a suggestion, drop a comment.

Enjoy.
