Not all Freaks are the same
Updated: Nov 27, 2020
This is a story about unreproducible bugs and the man who decided to solve them.
I have been working on firmware v2.5 of the Freak module. Some of the new features required changes that, even when minimal, cost a few extra CPU cycles.
To simulate the virtual analog filters, the Freak module needs to solve nonlinear differential equations efficiently. The ARM microcontroller (STM32F405) used in the module has very decent performance, but it is nowhere near as powerful as a PC. Most of my filter algorithms run near the limit of the microcontroller. I have to do a lot of manual optimizations and tricks in order to reduce the number of operations performed every sample.
One may think that one multiplication or addition in a line of code does not have a big impact on performance. But sometimes it does. When solving nonlinear equations, a single line of code is executed many times as part of an iterative algorithm. The most common method to solve a nonlinear equation is Newton's method https://en.wikipedia.org/wiki/Newton%27s_method
One big issue with a method like this is that the number of iterations required to solve the equation is not known in advance. Newton's method consists of making successive approximations until the error is small enough that we accept the solution as valid. This is not ideal for real-time systems because there is no guarantee that a solution will be reached within a fixed time budget. One sample may take 1 or 2 iterations to find a solution; another can take 15.
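To make the problem concrete, here is a minimal Python sketch of Newton's method with an iteration cap. It is not the Freak firmware: the equation y + g*tanh(y) = u, the gain g, the tolerance, and the names newton_solve and make_problem are all assumptions chosen for illustration. The point is only that the iteration count depends on the input, and that a hard cap means the solver can run out of budget before it converges.

```python
import math

def newton_solve(f, df, x0, tol=1e-9, max_iter=20):
    """Newton's method with a hard iteration cap.

    Returns (x, iterations_used, converged).
    """
    x = x0
    for i in range(1, max_iter + 1):
        step = f(x) / df(x)   # Newton update: x_new = x - f(x) / f'(x)
        x -= step
        if abs(step) < tol:   # the correction is small enough, accept x
            return x, i, True
    return x, max_iter, False  # budget exhausted before converging

def make_problem(u, g=2.0):
    """Hypothetical nonlinear equation: y + g*tanh(y) - u = 0."""
    f = lambda y: y + g * math.tanh(y) - u
    df = lambda y: 1.0 + g * (1.0 - math.tanh(y) ** 2)
    return f, df

# The same solver needs a different number of iterations per input.
for u in (0.0, 0.5, 10.0):
    f, df = make_problem(u)
    y, n, ok = newton_solve(f, df, x0=0.0)
    print(f"u={u}: y={y:.6f} after {n} iterations, converged={ok}")
```

In a real-time setting the interesting case is the last return: when the cap is hit, the code has to decide what to do with a solution that is not yet accurate.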
The models in the Freak filter have to solve such equations. By analyzing the equations and their behavior, it is possible to optimize the code and develop a strategy to speed up the convergence to a solution.
I usually develop (and test) the Freak firmware on the original board, the one that I manually assembled when I created the first module.
This board has served me so well. I have abused it and re-flashed it so many times. But a few days ago my friend Omri Cohen helped me discover a big problem that threatened the release of the new firmware.
Omri was having some strange bugs when running the beta version of the firmware. The module would hang, sometimes text didn't show, or he couldn't switch filters. All these problems were very odd since I was running exactly the same firmware. I tried to reproduce them by changing compiler versions, moving back and forth through the firmware revisions, and connecting the module to my worst power supply, but my module always worked fine.
The only clue I had was that, in the past, I had faced similar issues when one of the nonlinear algorithms failed to converge within the available time. These problems could only happen if Omri's module ran slightly slower than mine.
As I mentioned before, I'm running the module at its limits. The way I measure it is rudimentary but reliable. Before starting to compute the algorithm, a pin of the microcontroller goes high. When the algorithm is done, the pin goes low. This provides me with a CPU measurement like this:
In the image above we can see that the pulse has a duty cycle of 70%, which means that the other 30% of the CPU is available for background tasks like drawing graphics, handling the screen, processing buttons, etc. At the bottom we can see the "Persistence" graph that accumulates the results and displays them with a probability scale. We can see that sometimes the computations take more than the current 70% and sometimes less. This is because of the aforementioned nonlinear solvers. As long as I don't get too close to 100% it should be fine.
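The arithmetic behind that reading is simple: the CPU load is the time the pin stays high divided by the length of the processing period. The Python sketch below only illustrates that calculation. The pulse widths and the 100 µs period are made-up numbers, not measurements from the module, and cpu_load is a hypothetical helper name.

```python
def cpu_load(compute_times_us, period_us):
    """Return (average, worst) fraction of the period spent computing."""
    loads = [t / period_us for t in compute_times_us]
    return sum(loads) / len(loads), max(loads)

# Hypothetical pin-high times in microseconds, for a hypothetical
# processing period of 100 us. Real values come from the oscilloscope.
avg, worst = cpu_load([70.0, 65.0, 80.0, 95.0], period_us=100.0)

# Whatever is left in the worst case is all the background tasks get.
headroom = 1.0 - worst
print(f"average load {avg:.1%}, worst case {worst:.1%}, headroom {headroom:.1%}")
```

The worst case matters more than the average here, because a single period that runs over is enough to starve the background tasks.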
In some of the other computationally heavy filters the CPU load can get closer to 100%, leaving very little time for the background tasks. When the limit is exceeded, the module starts behaving strangely.
How could Omri's module run slower? Maybe the crystal oscillator or the capacitors are defective.
In order to test that hypothesis I took one board that I had in a box along with other broken boards. This board was from the same batch as Omri's board, but I had damaged a pin while doing some other experiments. After flashing the board and installing it in my test rig I found that this board had exactly the issues that Omri described.
All the components on the board look exactly the same, with the exception of the microcontroller.
It turns out that the board I use has an STM32F405 revision "Y" while the production boards have a revision "2". This means that the microcontrollers are slightly different. One of the two is more recent; I don't know exactly which. The result of this difference is that the revision "2" boards seem to perform some operations slightly slower. I'm not sure what exactly the problem is, but in an algorithm that uses 80% of the CPU on average, the load sometimes gets so close to 100% that it leaves no room for the background tasks.
I have heard stories in the past about how a bug in a microcontroller can make you hit your head against the wall until you read the Errata document. But this is the first time I have faced a problem like this: the microcontroller works, but it does something different and slightly slower.
The solution: make the code run faster. After checking the code of the algorithms that were affected, I optimized a few lines of code by doing some algebra using Mathematica. I removed a few multiplications here and there by factoring or grouping terms. I added a few lookup tables created within the Vult language and the results were satisfactory. The code runs smoothly again on all boards. 🤟🏼
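For readers unfamiliar with these two tricks, here is a generic Python sketch of each. It is not the code from the firmware: the cubic polynomial, the choice of tanh, the table range and size, and all the names are assumptions picked for illustration. The first part shows how grouping terms (Horner's form) cuts a cubic from six multiplications to three. The second replaces a transcendental function with a precomputed table and linear interpolation.

```python
import math

def poly_naive(x, a, b, c, d):
    # a*x^3 + b*x^2 + c*x + d as written: 6 multiplications
    return a * x * x * x + b * x * x + c * x + d

def poly_horner(x, a, b, c, d):
    # Same polynomial with the terms grouped: 3 multiplications
    return ((a * x + b) * x + c) * x + d

# Lookup table for tanh on [-4, 4]; range and size are arbitrary choices.
TABLE_MIN, TABLE_MAX, TABLE_SIZE = -4.0, 4.0, 257
STEP = (TABLE_MAX - TABLE_MIN) / (TABLE_SIZE - 1)
TANH_TABLE = [math.tanh(TABLE_MIN + i * STEP) for i in range(TABLE_SIZE)]

def tanh_lut(x):
    """Approximate tanh(x) by linear interpolation between table entries."""
    x = min(max(x, TABLE_MIN), TABLE_MAX)   # clamp to the table range
    pos = (x - TABLE_MIN) / STEP            # fractional table index
    i = min(int(pos), TABLE_SIZE - 2)       # keep i + 1 inside the table
    frac = pos - i
    return TANH_TABLE[i] + frac * (TANH_TABLE[i + 1] - TANH_TABLE[i])

print(poly_naive(0.3, 1.0, 2.0, 3.0, 4.0), poly_horner(0.3, 1.0, 2.0, 3.0, 4.0))
print(math.tanh(0.7), tanh_lut(0.7))
```

The table trades memory and a small approximation error for speed. Outside the table range this sketch simply clamps the input, which is acceptable for tanh because the function is nearly flat there.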
From now on, I will switch my main development board. I have to say goodbye to the Freak #1 board and start testing all the new firmware on a revision "2" microcontroller. That way all future updates will work on all the Freak filters that have been shipped.
I will keep adding features to the Freaks until it is not possible anymore.