How to Go Out of Business
One of the quickest ways to go out of business is to ship a product that requires a technician to come to the home and replace it, aka a “Truck Roll”. One of these infamous failures is the Wi-Fi door lock maker Lockstate. In 2017 Lockstate put out a firmware update that caused a percentage of their locks to “brick”, requiring the lock to be removed and returned to the manufacturer. Lockstate didn’t go out of business, but they did have to rebrand as RemoteLock due to the damage to their reputation. Fortunately for them only a small percentage of the locks bricked so the company did survive, barely. The small percentage of failures means the update had passed normal Quality Assurance (QA) testing. There is never enough testing!
Every IoT device has bugs. Today’s devices have many thousands of lines of code and so many hardware features that there are plenty of bugs hiding in every device. Bugs are in your product. I know it, and you know it. I propose that the best solution for these bugs is to “harden” your firmware to make it more resilient and keep on truckin’ if something bad happens. Below are my Tips and Tricks for hardening Z-Wave firmware to survive a failure. These ideas are for the Silicon Labs SDK, but similar techniques apply to Trident IoT, obviously with some different details.
I’m only discussing resilient firmware and not talking about making a product more secure, which is also called “hardening”. Security is an important topic which must be planned before coding begins. The most basic security measure is to set the Debug Lock on all production units. Debug Lock disables the debugger port of the chip, making it difficult to read out the flash image. There are several more layers of security in the Z-Wave chips that can be enabled, but at a bare minimum be sure to set Debug Lock.
Seven Tips to Harden Z-Wave Firmware
- Assume Everything is Broken
- Use FOR Instead of WHILE
- Replace Default_Handler
- Enable the Other Watchdog
- Reboot if no Comms in a Day
- Enable Stack Overflow Checking in FreeRTOS
- Run Static Analysis Tools
1. Everything is Broken
This is a philosophical idea you need to keep in the back of your mind with every line of code you write. Assume everything is broken all the time: hardware never goes “ready”, a queue is always full, a mutex is never released, an impossible state occurs, and similar kinds of failures. The most common coding technique is to always check for error conditions from any function that returns a value. Always check inputs for validity.
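As a minimal sketch of that habit, the fragment below validates its inputs and reports failure rather than assuming success. The queue type, the status codes and the function names are all hypothetical and exist only for illustration; they are not part of any Z-Wave SDK.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EVENT_QUEUE_DEPTH 4U   /* hypothetical depth, sized for the example */

typedef enum { EVT_OK = 0, EVT_BAD_ARG, EVT_QUEUE_FULL } evt_status_t;

typedef struct {
  uint8_t  items[EVENT_QUEUE_DEPTH];
  uint32_t count;
} event_queue_t;

/* Validate every input, then report failure instead of assuming success. */
evt_status_t event_queue_push(event_queue_t *q, uint8_t event)
{
  if (q == NULL) {
    return EVT_BAD_ARG;                 /* caller passed an invalid pointer */
  }
  if (q->count >= EVENT_QUEUE_DEPTH) {  /* >= also catches a corrupted count */
    return EVT_QUEUE_FULL;
  }
  q->items[q->count] = event;
  q->count++;
  return EVT_OK;
}

/* The caller checks the result and counts drops, so a stuck consumer
   shows up in diagnostics instead of silently losing events. */
bool post_event(event_queue_t *q, uint8_t event, uint32_t *dropped)
{
  evt_status_t status = event_queue_push(q, event);
  if (status != EVT_OK) {
    if (dropped != NULL) {
      (*dropped)++;
    }
    return false;
  }
  return true;
}
```

The `>=` comparison on the count is deliberate: if RAM corruption ever pushes the count past the depth, the push is refused instead of writing outside the array.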
The most insidious failures are stack overflows. When the stack overflows, parts of RAM are overwritten, and I’ve found it amazing that the code can keep running even after trashing potentially hundreds of memory locations. The problem is there is no way to predict what might happen. All sorts of things that “can never happen” absolutely will happen when the stack overflows. Due to the limited RAM on Z-Wave chips, it’s easy to overflow the stack.
Another common impossible condition is when the power supply sags just enough to flip bits in ways that are technically impossible. Strong magnetic fields from nearby motors or even cosmic radiation can flip bits in impossible ways. There is really nothing that “can’t happen”. Thus, always code with the thought that the impossible can happen because eventually it will.
Wireless IoT devices using Z-Wave are often wired directly to mains power. They cannot be easily rebooted like you do with your computer when it freezes. If a device bricks, it’s dead potentially for months or even years before a power failure brings it back online. Once a device is wired in, it’s usually there for many years, and with Z-Wave it can be decades. Ensuring the firmware is resilient when (not if!) the impossible occurs will keep customers happy since they never knew the device rebooted: it kept on truckin’.
2. Use FOR instead of WHILE
The Silicon Labs SDK, including the bootloader, has many while(hardware_busy) loops that will wait forever and can cause the device to brick. For example, in the Silicon Labs SDK, if you enable the LFXO (32 kHz crystal oscillator) but don’t have the crystal wired up, the startup code waits for the LFXO to be “ready” with a while loop. While this is obvious when debugging firmware, it will cause a device in the field to brick if for some reason the crystal stops working. The simple solution is to use a FOR loop with a timeout, enabling the code to continue.
Example in em_cmu.c:
Replace: while ((LFXO->STATUS & _LFXO_STATUS_ENS_MASK) != 0U) { }
With: for (int i=0; (i<1000)&&((LFXO->STATUS & _LFXO_STATUS_ENS_MASK) != 0U); i++) {__NOP();}
Note the __NOP() is necessary to prevent the compiler from optimizing the loop and removing it. The timeout value (1000 in this case) must be chosen based on testing. I usually set it to 10X the typical value. Following the FOR should be an assert to check that the timeout didn’t occur. Note the use of < and not an exact-match comparison for the timeout check. If the impossible were to happen, there is a chance “i” could skip past exactly 1000, and since it is a 32-bit number, the loop would then wait a long time for the 32-bit number to wrap all the way around. This is another defensive coding technique where in the back of my mind I’m thinking of the impossible and coding to be resilient even when the impossible happens.
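The same pattern can be wrapped in a small helper that returns whether the wait succeeded, so the caller can assert or recover. This is only a sketch: `fake_status`, `FAKE_STATUS_BUSY_MASK`, `wait_while_busy` and `start_oscillator` are hypothetical names, and the variable stands in for a real memory-mapped register so the logic can be exercised off-target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped status register. */
static volatile uint32_t fake_status;
#define FAKE_STATUS_BUSY_MASK 0x1UL

/* Poll until the busy bit clears or the poll budget runs out.
   Returns true if the hardware became ready, false on timeout. */
bool wait_while_busy(volatile const uint32_t *status, uint32_t mask,
                     uint32_t max_polls)
{
  uint32_t i;
  for (i = 0; (i < max_polls) && ((*status & mask) != 0U); i++) {
    /* On target, a __NOP() could go here. */
  }
  /* Re-read the register so a ready on the very last poll still counts. */
  return ((*status & mask) == 0U);
}

/* The caller decides how to recover instead of hanging forever. */
bool start_oscillator(void)
{
  if (!wait_while_busy(&fake_status, FAKE_STATUS_BUSY_MASK, 1000U)) {
    return false;   /* log it, fall back to another clock, or reboot */
  }
  return true;
}
```

Returning a status rather than asserting inside the helper keeps the policy (assert in development, recover in production) in the calling code.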
3. Default_Handler
Segger has a great article on debugging the various “fault handlers” in the Cortex-M processors. The article provides code for many different handlers to help debug the fault and make the code more resilient.
Default_Handler is in startup_
At a minimum, put the Segger recommended code in for at least some of the exception handlers to make debugging easier. It’s often a good idea to light an LED (preferably red) or some other external indicator to help during debug. Another idea is to log the address and condition that caused the exception and store it in the User Data Page/NVM, where it can be read out from production units returned from the field by angry customers. Then perform forensic analysis to identify the cause and release a firmware update that solves the problem.
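A fault record for that kind of forensic logging might look like the sketch below. It assumes the handler has already obtained a pointer to the 8-word frame the Cortex-M pushes on exception entry (R0, R1, R2, R3, R12, LR, PC, xPSR) and the fault status register value. The struct layout, magic number and function names are hypothetical, and the write to the User Data Page/NVM is SDK specific and not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define FAULT_RECORD_MAGIC 0xFA17C0DEUL   /* arbitrary marker for a valid record */

typedef struct {
  uint32_t magic;     /* FAULT_RECORD_MAGIC when the record holds data */
  uint32_t pc;        /* instruction executing when the fault hit */
  uint32_t lr;        /* return address, i.e. the caller */
  uint32_t psr;       /* program status register at the fault */
  uint32_t cfsr;      /* fault status value supplied by the handler */
  uint32_t checksum;  /* XOR of the fields above */
} fault_record_t;

/* Indices into the hardware-stacked frame: R0, R1, R2, R3, R12, LR, PC, xPSR. */
enum { FRAME_LR = 5, FRAME_PC = 6, FRAME_PSR = 7 };

static uint32_t fault_record_checksum(const fault_record_t *r)
{
  return r->magic ^ r->pc ^ r->lr ^ r->psr ^ r->cfsr;
}

/* Build the record from the stacked frame; the caller then stores it. */
void fault_record_fill(fault_record_t *r, const uint32_t *frame, uint32_t cfsr)
{
  r->magic    = FAULT_RECORD_MAGIC;
  r->pc       = frame[FRAME_PC];
  r->lr       = frame[FRAME_LR];
  r->psr      = frame[FRAME_PSR];
  r->cfsr     = cfsr;
  r->checksum = fault_record_checksum(r);
}

/* Reject erased flash and partially written records when reading back. */
bool fault_record_is_valid(const fault_record_t *r)
{
  return (r->magic == FAULT_RECORD_MAGIC) &&
         (r->checksum == fault_record_checksum(r));
}
```

The magic value and checksum matter because the record may be read months later from a returned unit: erased flash or a write interrupted by the reset should not be mistaken for a real fault.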
4. Watchdogs
Watchdog timers are crucial for reliable 24x7x365 operation of an IoT device. A watchdog timer is a timer that slowly counts down. Every now and then, the firmware “feeds” the watchdog by resetting the counter to a high value. If the counter reaches zero, a full reset of the chip is triggered, which reboots the chip and hopefully resolves the error condition. The watchdog timer typically has to be starved for a couple of seconds before the reset so that it doesn’t reset falsely. The trick to a resilient watchdog is deciding when to feed it and, more importantly, when not to. I wrote a blog post on watchdog timer best practices back in the 500 series days which still applies.
There may be opportunities to further align the current 700/800 series watchdog timer implementation in the Z-Wave SDK with established best practices. The watchdog is fed every time the FreeRTOS idle task is executed. The only two ways the watchdog will fire are if FreeRTOS crashes and stops servicing the idle task, or if a task sits in a tight loop long enough. There are many other failure mechanisms that can brick the device while the idle task continues to execute. The most common failure is a full queue: the code skips writing to the queue and waits until later, but the task draining the queue has crashed. Even when the watchdog is rebooting the chip, if the failure is not resolved by a reboot, then the device is still bricked since it’s stuck in a reset loop.
The idle task is typically compiled into a library, which makes changing it impossible. Trident has the complete source code for the SDK available, making it possible to improve the code to follow the best practices. The main effort in improving the watchdog is identifying everything that can cause the code to lock up. Check that all queues, mutexes, state machines and perhaps even peripherals are idle before feeding the watchdog. The other key is to disable the watchdog during development. Watchdog resets are notoriously good at hiding faults: the chip seems to briefly stop, but when you send the command again everything is working because the chip rebooted and is hiding the bug! The Z-Wave chips have a second watchdog timer, so you can create your own robust watchdog following the best practices.
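One common way to structure such a watchdog is to have every task check in and to feed the hardware only when all of them have, along with whatever queue and state-machine checks the application can make. The sketch below assumes two hypothetical tasks and a placeholder `hardware_watchdog_feed()`; on target that placeholder would call the vendor’s feed function for the second watchdog, and the flag update would need a critical section or atomic operation.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per task that must prove it is alive (hypothetical task list). */
#define ALIVE_APP_TASK    (1UL << 0)
#define ALIVE_RADIO_TASK  (1UL << 1)
#define ALIVE_ALL         (ALIVE_APP_TASK | ALIVE_RADIO_TASK)

static volatile uint32_t alive_flags;
static uint32_t feed_count;   /* stands in for the real hardware feed */

/* Each task calls this from its main loop after doing useful work.
   On an RTOS, protect this read-modify-write with a critical section. */
void watchdog_checkin(uint32_t task_bit)
{
  alive_flags |= task_bit;
}

/* Placeholder: on target this would call the vendor's watchdog feed. */
static void hardware_watchdog_feed(void)
{
  feed_count++;
}

/* Called periodically, well inside the watchdog timeout. system_ok is the
   caller's verdict on queues, mutexes, state machines and peripherals. */
bool watchdog_service(bool system_ok)
{
  if (((alive_flags & ALIVE_ALL) == ALIVE_ALL) && system_ok) {
    alive_flags = 0U;           /* every task must check in again */
    hardware_watchdog_feed();
    return true;
  }
  return false;                 /* starve the watchdog so the chip resets */
}
```

Because the flags are cleared on every feed, a task that crashes after checking in once still causes a reset on the following period.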
5. Reboot When no Communication for a Day
Hello, is anyone listening? The concept here is basically a long-duration watchdog timer. If the controller hasn’t sent a frame and/or hasn’t acknowledged the receipt of a frame in twenty-four hours, maybe a reboot will clear things up. I’ve seen this in the 500 series where on rare occasions reading the HomeID from the external NVM would fail. As a result, the device would forget the HomeID and assume some random number. This random number would then be stuck in the device for days or weeks or even years until the device rebooted for some reason. This was a classic Impossible Condition that seemed to happen on a fairly regular basis once several tens of thousands of Z-Wave devices had been running for several months. The only solution for the end-customer was to rip the bricked device out of the wall! Or, more commonly, factory reset it and rejoin the network, which also resulted in one-star reviews.
The solution is simple: set up a 24-hour software timer, check the RX/TX statistics, and if nothing has made it through, reboot! This fixes rare impossible conditions in a way that most customers wouldn’t even notice. This check is only needed for always-on or FLiRS (LSEN) devices, as deep sleeping devices reboot every time they wake up. Below is an implementation you can drop into your application.
// Usually this is in app.c or your own application files
#include
#define ONCE_PER_DAY (1000*60*60*24UL)
static SSwTimer CheckTxStatsTimer;

void CheckTxStatsCallback(SSwTimer *pTimer){
  // pointer type assumed from the zpal radio API; check your SDK headers
  const zpal_radio_network_stats_t *pStats = ZAF_getNetworkStatistics();
  if ((pStats->tx_frames == 0) && (pStats->rx_frames == 0)) {
    while(true); // reboot if no frames have been transmitted or received in the last 24hrs
  }
  zpal_radio_clear_network_stats(); // zero the stats.
}

// in ApplicationTask just before the FOR loop
AppTimerInit(EAPPLICATIONEVENT_TIMER, xTaskGetCurrentTaskHandle());
AppTimerRegister(&CheckTxStatsTimer, true, CheckTxStatsCallback); // auto reloads
TimerStart(&CheckTxStatsTimer, ONCE_PER_DAY);
6. Stack Overflow Checking
A real-time operating system adds complexity, but FreeRTOS has a feature to check for a stack overflow which is enabled by default. The macro configCHECK_FOR_STACK_OVERFLOW is set to 2 by default in FreeRTOSConfig.h. This enables some checking and fills the stack space with 0xA5s, which are then checked on each task switch. Inspecting RAM in the debugger after running the code for some time can show how close the stack has come to overflowing so far. The check calls vApplicationStackOverflowHook if there is a failure, but there is only an assert in the weak function. My recommendation is to add a breakpoint here during testing and consider rebooting in the released code. Note that the FreeRTOS documentation recommends stack overflow checking only during development and testing due to the additional overhead.
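The RAM inspection described above can be automated with a scan for the fill pattern. The sketch below works on a plain byte buffer so it can run anywhere; the function names are hypothetical, and the layout assumption is a stack that grows downward as on Cortex-M. On a real FreeRTOS target, uxTaskGetStackHighWaterMark() reports the equivalent figure per task when it is enabled in the configuration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL_BYTE 0xA5U

/* Paint a stack region with the fill pattern before the task starts. */
void stack_paint(uint8_t *stack, size_t size)
{
  memset(stack, STACK_FILL_BYTE, size);
}

/* Count untouched bytes starting at the lowest address. With a stack that
   grows downward, these are the bytes execution has never reached. */
size_t stack_unused_bytes(const uint8_t *stack, size_t size)
{
  size_t unused;
  for (unused = 0; (unused < size) && (stack[unused] == STACK_FILL_BYTE);
       unused++) {
    /* keep scanning while the fill pattern is intact */
  }
  return unused;
}
```

The result is an estimate: data that happens to equal the fill byte at the deepest point of use makes the stack look slightly less used than it was, so leave some margin on top of the measured high-water mark.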
The insidious problem with stack overflows is that they often require several things to go wrong at the same time: a task switch, an interrupt, the radio sending or receiving data and having to find a new mesh route, code allocating sizable temporary buffers, and maybe even more code that uses up the limited stack space. As mentioned above, I’ve observed the stack overflowing and trashing many dozens of memory locations while the code keeps running, but eventually there is an impossible condition or, more often, a hardfault exception. As a result, the failure is often ignored since it only happened that “one time”, but in reality it happens a lot. Getting the failure to happen repeatedly in a controlled environment is often very difficult. I’ve had dozens of units set up testing a particular failure case which can take all weekend to finally trigger. Then not having enough data on the unit that failed makes it even more exasperating.
7. Static Code Analysis
Use Claude, Codex or other static code analysis tools to review all firmware. AI continues to improve at an exponential rate at grading code quality. Often AI can recommend changes to fix the code, but I would carefully check over the suggestions. AI can hallucinate or simply start making things up out of nowhere. The GCC compiler has a -fanalyzer option that will find a lot of interesting things. Coverity is an industry leader in this field but is expensive. What tools have you used?
Resources
I’m not the only one to make the siren call to make your code resilient. Jack Ganssle wrote the Embedded Muse newsletter for 27 years on embedded programming. He features a failure in every issue, with plenty of interesting stories and many tips on resilient embedded C coding. Michael Barr’s C Coding Standard should be on every firmware engineer’s bookshelf or bookmarked. The book is packed with practical advice on coding for resilience and maintainability. Maybe I should write a book on embedded coding based on this Journey?
Next Steps
Part 6 of the Z-Wave Developer’s Journey discusses hardware best practices. I present my tips and tricks for making low-cost, easy to debug and manufacture Z-Wave products from my 25 plus years of Z-Wave experience. As we continue along the Z-Wave Developer’s Journey, I welcome your comments and questions. Please feel free to reach out to me directly via email.
About the Author
Eric Ryherd has been at the forefront of Z-Wave innovation since 2003, beginning as a consultant and later serving as a Field Application Engineer at Silicon Labs. Over the course of his career, he has contributed to the design and development of a wide range of Z-Wave products, including sensors, remote controls, motorized window shades, and in-wall dimmers, many of which are on the market today.
Although he “retired” in 2022, Eric remains deeply engaged in embedded systems and Z-Wave development through his blog, DrZWave.blog, and ongoing IoT consulting projects. He is also a familiar face at Z-Wave Alliance Unplug Fests, where he frequently serves as the lead coordinator, supporting interoperability and developer collaboration.







