Root Cause Analysis For Electronics Primer

Introduction

Root Cause Analysis (RCA) is the general term for a range of methodologies for finding the root cause of problems. The term ‘root cause’ distinguishes the fundamental underlying reason for a problem from other ‘causes’, which may only be contributory, and from ‘symptoms’, which are how the impacts of a root cause are observed. This distinction recognises that complex problems often have many interrelated contributing factors. This is especially true for electronics.

Common RCA Methodologies

Most RCA methodologies originate in manufacturing engineering and are rarely taught to development engineers. For electronics, these methodologies give engineers a more structured approach than relying on individual experience and intuition alone.

Ishikawa (Fishbone diagrams) – This is a structured brainstorming approach that benefits most from multi-skilled teams approaching problems with potentially diverse root causes. However, where problems are beyond the experience or conception of the RCA team it can be an ineffective approach. The 6Ms are common starting points for possible root causes – Machine, Material, Manpower, Method, Measurement, Mother Nature (this was developed in the 1960s). Read more here: What is a Fishbone Diagram?

5 Whys – This is an iterative process of finding the immediate cause of each symptom in turn until the root cause is reached. Strictly, this is a single-track method, so although it is very effective at finding causes quickly, it takes rigour and discipline to reach the root cause reliably, and it can be ineffective for problems that have multiple root causes. Read more here: The Five Why Analysis

Fault Tree Analysis – This is a thorough methodology that, although it can be very complex in its academic form, is extremely effective at resolving complex problems with multiple root causes. Seasoned users will also be familiar with translating completed trees into Reliability Block Diagrams, customer support sequences and other business processes. Read more here: What is Fault Tree Analysis?

DMAIC – Define, Measure, Analyse, Improve, Control. This methodology is more commonly associated with process control for Lean-Six Sigma practitioners. It is an iterative process to continually define the desired state of a system and to measure deviations from it. Effective instrumentation, telemetry or measurement is essential to delivering improvement. Read more here: DMAIC for Beginners
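As an illustration of how Fault Tree Analysis exposes problems with multiple root causes, the following is a minimal Python sketch that reduces a small tree of AND/OR gates to its minimal cut sets, that is, the smallest combinations of basic events that are sufficient to produce the top-level fault. The tree, the event names and the helper names (Basic, Gate, cut_sets) are hypothetical and chosen only for illustration; real trees are usually built and analysed with dedicated tooling.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, List


@dataclass(frozen=True)
class Basic:
    """A basic event: a candidate cause that is not broken down further."""
    name: str


@dataclass(frozen=True)
class Gate:
    """An intermediate or top event combining its inputs with AND or OR."""
    name: str
    kind: str      # "AND" or "OR"
    inputs: tuple  # child Gate or Basic nodes


def cut_sets(node) -> List[FrozenSet[str]]:
    """Return the minimal cut sets of the tree rooted at node."""
    if isinstance(node, Basic):
        return [frozenset([node.name])]
    child_sets = [cut_sets(child) for child in node.inputs]
    if node.kind == "OR":
        # Any single child's cut set is enough to trigger an OR gate.
        combined = [cs for sets in child_sets for cs in sets]
    elif node.kind == "AND":
        # An AND gate needs one cut set from every child at the same time.
        combined = [frozenset().union(*combo) for combo in product(*child_sets)]
    else:
        raise ValueError(f"unknown gate kind: {node.kind}")
    # Discard any set that strictly contains another (it is not minimal).
    unique = set(combined)
    minimal = [s for s in unique if not any(other < s for other in unique)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Hypothetical tree: the fault appears if there is a ground loop, OR if a
# nearby EMI source AND an unshielded cable are present together.
top = Gate(
    "signal distortion",
    "OR",
    (
        Gate("radiated pickup", "AND",
             (Basic("nearby EMI source"), Basic("unshielded cable"))),
        Basic("ground loop"),
    ),
)

for cs in cut_sets(top):
    print(sorted(cs))
```

Each cut set is one complete explanation of the fault. A cut set with more than one member is the formal version of a problem that only appears when several causes coincide.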

Fault Replication

Most RCA methodologies begin by asking for a clear problem definition. In electronics, especially when dealing with EMC or embedded software issues, faults are rarely easily reproducible, so this can be difficult to achieve. Because complex electronics regularly operate at the limits of data rate, signal integrity, crosstalk, EMI and more, problems rarely have a single root cause.

More likely is that problems are caused by the summation of a range of inter-related causes. Given that these circumstances only occur in representative environments and in assembled products, in-circuit testing becomes extremely important. Technologies like boundary scan (through a JTAG chain) can give engineers access in-circuit to monitor and replicate problems effectively.
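The idea that a fault results from the summation of several causes can be illustrated with a simple noise budget. The sketch below assumes the contributions are uncorrelated, so that they combine as a root-sum-square, and then checks which single contributions, if removed, would bring the total back under the tolerance. The contributor names, the millivolt values and the tolerance are all made-up illustrative figures, not measurements from any real system.

```python
import math


def total_noise(contributions):
    """Root-sum-square of uncorrelated noise contributions (same units)."""
    return math.sqrt(sum(v * v for v in contributions.values()))


def tipping_causes(contributions, tolerance):
    """Return the causes whose removal alone brings the total under tolerance."""
    if total_noise(contributions) <= tolerance:
        return []  # nothing to fix: the budget is already met
    result = []
    for name in contributions:
        rest = {k: v for k, v in contributions.items() if k != name}
        if total_noise(rest) <= tolerance:
            result.append(name)
    return result


# Hypothetical budget in mV RMS; every figure here is an assumption.
budget = {"crosstalk": 5.0, "supply ripple": 6.0, "radiated EMI": 7.0}
TOLERANCE_MV = 10.0

print(f"total = {total_noise(budget):.2f} mV against {TOLERANCE_MV} mV")
print("removing any one of these would be enough:",
      tipping_causes(budget, TOLERANCE_MV))
```

With these illustrative values the total is about 10.5 mV against a 10 mV tolerance, and removing any one of the three contributions brings the system back within budget. No single contribution is uniquely the root cause, which is why such faults are hard to pin down by intuition alone.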

Fault Analysis

Even when problems occur spontaneously and can’t be triggered predictably, with suitable measurement it will be possible to collect data related to the problem. Here, unavoidably, an engineer’s intuition is important in understanding what to measure and in interpreting those data to find likely sub-causes. Following an RCA methodology helps to structure thought and find these sub-causes more quickly.

Where problems are specific to individual pieces of hardware, testing and experimentation must not damage the units under test. Often, Design For Test is a low priority in the development of electronics, so functional testing can be an ineffective tool for RCA. This is doubly true because problems cannot always be foreseen at the design phase, when test access could have been provided. JTAG boundary scan allows valuable measurement and control in-circuit without needing design changes.

Fault Correction & Prevention

Once a root cause is found, or a cause whose contribution to a problem is sufficient to make the difference between the symptoms occurring or not, correction and prevention follow. Correction is short-term remediation that fixes the problem for individual units. Prevention is the implementation of process or design changes to permanently resolve the issue. This is the outline of what is known as a Corrective Action & Preventive Action (CAPA) process.

In order to resolve problems as early as possible, it is common for a mix of corrective and preventive actions to be required. For example, before a design change can be made it may be necessary to implement both corrective screening on a production line and preventive service interventions.

This necessitates communication between different arms of a business. This is another benefit of following an RCA methodology because, if used throughout a business, it provides a common language for different stakeholders to communicate and resolve these problems. The alternative is engineers explaining their own circuitous troubleshooting to non-technical colleagues or third parties.

Risk Management

Problems need to be resolved in the interests of the business, and it is always worth considering whether an electronics problem merits root cause analysis. Root cause analysis, especially when conducted by a team, is both time and resource intensive, and the impact of problems should be understood not just at the point of correction & prevention but at the point of replication. This requires engineers to understand how their business manages risk.

In software vernacular, bugs are triaged. It is good practice to do the same in electronics and gain an understanding of the risk problems pose before engaging in RCA. Electronics engineers will find many parallels with their own work in this article: Best practices for triaging software defects and bugs
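One simple way to triage, sketched below in Python, is to score each open problem by severity multiplied by likelihood and only start RCA on those above a threshold. The 1 to 5 scales, the threshold, the problem names and the scores are all hypothetical; a real business would substitute its own risk matrix.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    name: str
    severity: int    # assumed scale: 1 (cosmetic) .. 5 (safety or total loss of function)
    likelihood: int  # assumed scale: 1 (rare) .. 5 (seen on most units)

    @property
    def risk(self) -> int:
        """Simple risk score: severity multiplied by likelihood."""
        return self.severity * self.likelihood


def triage(problems, threshold):
    """Rank problems by risk, highest first, keeping those at or above threshold."""
    ranked = sorted(problems, key=lambda p: p.risk, reverse=True)
    return [p for p in ranked if p.risk >= threshold]


# Hypothetical backlog; names and scores are invented for illustration.
backlog = [
    Problem("silk-screen typo", severity=1, likelihood=5),
    Problem("display flicker", severity=3, likelihood=3),
    Problem("intermittent reset", severity=5, likelihood=2),
]

for p in triage(backlog, threshold=8):
    print(f"{p.name}: risk {p.risk}")
```

Re-running the ranking as problems are resolved or as more units reach the field shows how a previously tolerated issue can rise to the top of the list.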

Example – Video Feeds

As an example, an analogue camera on a medical instrument had a metres-long signal path back to a video processing card, from which it was converted to CoaXPress and relayed to output terminals. In use, occasional but significant distortion was seen on the video feeds at the output terminals. Although the symptom at the output terminals made it relatively clear that the major issue was analogue distortion, it was extremely difficult to determine specifically which causes were tipping the system over its noise tolerance.

With EMI as a prime candidate for the issue, it took time to understand the likely causes of interference in the system’s use environment. Engineers visited customers to understand the operating environment and replicate it in the lab.

A small number of immediate corrective actions were possible by minimising antenna effects in the cabling and in the system as a whole. The preventive actions were significant design improvements, including shielding and an improved signal-to-noise ratio in the Power over Coax camera feed.

With the benefit of hindsight, it was supposed that this problem had always occurred but was only actioned as a result of both the increasing number of products in the field and the resolution of other more critical problems making this rise to the top of the triage.

Example – Spontaneous Reboots

As a purely digital example of the complexity of electronics RCA, a small subset of SoCs running Linux on finished products out in the field started experiencing spontaneous reboots. Replication of the problem proved challenging due to its spontaneous and infrequent nature. The expected in-field occurrence of this problem, even among a population of several hundred, was less than one a month. However, its impact was severe and it needed to be resolved urgently.
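A rough calculation shows why replication is so hard at this failure rate. The Python sketch below treats failures as a Poisson process with a constant rate per unit, which is a simplifying assumption. It takes 300 units to stand in for "several hundred" and one failure a month as an upper bound; both figures, and the ten-unit lab bench, are assumptions made for illustration only.

```python
import math


def p_at_least_one(rate_per_unit_month, units, months):
    """Chance of seeing one or more failures, assuming a Poisson process."""
    expected = rate_per_unit_month * units * months
    return 1.0 - math.exp(-expected)


def unit_months_needed(rate_per_unit_month, confidence):
    """Unit-months of testing needed to see a failure with given confidence."""
    return -math.log(1.0 - confidence) / rate_per_unit_month


FLEET_SIZE = 300          # assumption: stands in for "several hundred"
FAILURES_PER_MONTH = 1.0  # assumption: upper bound on "less than one a month"
rate = FAILURES_PER_MONTH / FLEET_SIZE  # failures per unit per month

lab_chance = p_at_least_one(rate, units=10, months=1)
needed = unit_months_needed(rate, confidence=0.95)

print(f"10 units for 1 month: {lab_chance:.1%} chance of seeing a failure")
print(f"unit-months for a 95% chance: {needed:.0f}")
```

Under these assumed figures a ten-unit bench run for a month would see a failure only about 3% of the time, and roughly 900 unit-months of unaccelerated testing would be needed for a 95% chance of observing one. This is the kind of arithmetic that motivates stress-accelerated testing rather than waiting for the fault to appear.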

There was no correlation found between specific batches of the SoC or with versions of software used. Preventive actions were taken to review and resolve an on-boot race condition and also improper registry access causing unpredictable behaviour. This reduced the problem frequency in test units in the lab but did not fully prevent it.

The corrective solution was to implement Highly Accelerated Stress Screening (HASS) on the production line, because it had been found in the lab that some units behaved differently in otherwise identical situations. This stress screening deliberately overclocked the SoCs to expose latent problems before delivery to customers.

A number of potential root cause preventive actions lay with the Original Equipment Manufacturer (OEM) of the SoCs being used. Here, the documented root cause analysis evidence helped to get their buy-in to resolving the problem. However, because some of the preventive measures were outside the business’s direct control, the corrective HASS remained in place for an extended time.

Conclusion

Electronics, whether in prototyping, development, production or service, are fundamentally complex and benefit from rigorous approaches to root cause analysis. The benefits are seen in upskilling and enabling more engineers to resolve problems that they might not otherwise have the experience or skill to resolve, in more focussed and more easily implemented corrective and preventive actions, and in better communication within an organisation. As a secondary benefit, RCA also allows the incidental discovery of other likely problems related to those observed, even if their effects have not yet been felt.