The Cost Of Quality is the bane of my working life. I imagine similar thoughts are going through the heads of the engineers at Thales, Belfast after their huge success in landing this deal: “Thales awarded £1.16bn to produce missiles for Ukraine”. Scaling production is really hard, and making it happen quickly is even harder.

If you haven’t come across the Cost of Quality before, it is how we attribute a cost to poor quality in our finished product. This lets us make sensible decisions and plan ahead for development and production. It’s plain common sense to catch a quality problem early in development rather than when it’s out with customers and you need to recall it. Cost of Quality puts that idea in the language your budget-holders will understand.

I worked at a smart meter factory shortly after the announcement of the UK government’s smart meter rollout back in 2017, and I learnt some valuable lessons there that are paying dividends now.

What even is good product?

I’d like to focus on an electronics example and talk about what quality actually means. If you want to produce good ol’ fashioned whatsits you can probably just take a look at them as they roll off the production line. For electronics, a visual inspection isn’t going to tell you everything that you need to know. In fact, you are faced with three distinct problems:

- You want to test all of the functionality of the finished product, even if you wouldn’t normally use that functionality. A good example of this is whether your phone can call an ambulance when you don’t have signal on your network. This is known as test coverage.

- You want to know that your tests are giving you a true result. Your test equipment needs to give you confidence that when you get a test fail, it is really the product at fault, and not your testing. This is known as test recall.

- You want to know that your tests accurately assess the true state of your product. Remember that some production problems which you can detect might not cause a fault now but might be indicators of a fault which will develop later or, even more frustratingly, the fault might be intermittent. This is known as test repeatability.

Whether you’re in Thales, or medical devices or submarines, your definition of quality would rightly be very different to that of our friendly whatsit maker.

Test, Test, Test!

Where Cost of Quality comes into its own is when we begin to talk not just about the cost of bad quality, but the cost of quality that is too good.

Let’s talk in terms of production cycle time (or Takt time if you’re a lean six sigma fan) and just a little bit of arithmetic. You’re producing 10,000 missiles in a year on a single 16/5 (hours a day/days per week) production line. That’s the same as one every 2.5 hours. Missiles really need to work, and you’re working in low volumes, so we’re doing mostly manual testing at the end of the production line to make sure everything works. You might call this SOAK testing (Shakedown Of All Kit – although I think this might be a backronym). An operator can check that the rollerons work and shine an IR light at the homer and make sure the debug readouts are correct.

For a complex product like a missile, as soon as that SOAK test takes more than 2.5 hours you need another station in parallel to get the same line throughput. That means more people, more space on the factory floor, more test equipment, more cost of quality.
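The station count in that argument is a simple ceiling division. Here is a minimal sketch that takes the 2.5 hour Takt time above as given; the function name and the SOAK test durations are illustrative assumptions, not figures from a real line.

```python
import math


def stations_needed(test_hours: float, takt_hours: float) -> int:
    """Parallel test stations needed for testing to keep pace with the line."""
    if test_hours <= 0 or takt_hours <= 0:
        raise ValueError("durations must be positive")
    # One station finishes a unit every test_hours, so N stations finish one
    # every test_hours / N. That must be no longer than the Takt time.
    return math.ceil(test_hours / takt_hours)


TAKT_HOURS = 2.5  # Takt time from the example above

for test_hours in (2.0, 2.5, 3.0, 6.0):  # illustrative SOAK durations (assumed)
    count = stations_needed(test_hours, TAKT_HOURS)
    print(f"{test_hours} h SOAK test: {count} station(s)")
```

The step change is the point: a test that creeps from 2.5 to 3 hours doesn't cost you 20% more, it costs you a whole extra station.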

More testing is not always proportionate. You should ask yourself if the latest test you’re introducing significantly reduces the risk of your customer getting bad product and also if you can reduce that same risk more effectively in another way.

The Traceability Fallacy

Herein lies the traceability fallacy. Just as for testing, more traceability is not always proportionate. Of course if you happen to work under AS9100 or ISO 13485 or in some other regulated industry then traceability might be a requirement, probably for good reason.

Before you introduce a burdensome traceability process for your part whether at component, subsystem, system or product level you should ask yourself:

- Is a significant portion of the product risk in the manufacturing inputs as opposed to design, processes or use?

- Does the level of traceability you are considering align with the expected variation in your inputs or processes?

- Are you actually going to have the time to root cause analyse and track the faults?

If your answer to any of these is no then providing traceability is not going to significantly help you with any quality problems, but will come at a cost.

For our missile example, it’s a single-use item once it’s with the customer, and the manufacturer is maxed out ramping up production capacity. Maybe this is a controversial opinion, but I don’t think they need traceability.

Rework

Let’s be honest, low volume production is normally a long, long way from six sigma levels of process control.

When you test your finished product on your production line it is not always going to pass. You might throw the test failures in the bin or, in the case of high value products like missiles, you are more likely to send them for rework. I once worked for a rapidly scaling medical devices company where quality was really important. The scary statistic we arrived at was that we ended up building every finished system more than twice. If you haven’t worked on safety-critical or mission-critical products this might sound absurd, but it is the reality of a focus on quality.

What this means for your cost of quality is not only do you need twice the SOAK testing capacity at the end of your production line, but twice the capacity all the way through production. In essence, whether you will increase capacity by serialising or parallelising production, our example missile manufacturer needs to cut their Takt time from 2.5 hours to 1.25 hours to achieve the same manufacturing output. More cost!
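As a rough sketch of that arithmetic: the Takt time the line really has to run at is the target Takt time divided by the average number of builds per good unit. The function names are illustrative, and the link between pass rate and builds assumes every attempt, first build or rework, passes with the same probability and goes back through the whole line, which is a simplification.

```python
def required_takt(target_takt_hours: float, builds_per_good_unit: float) -> float:
    """Takt time the line must run at to ship one good unit per target Takt."""
    if builds_per_good_unit < 1:
        raise ValueError("every good unit is built at least once")
    return target_takt_hours / builds_per_good_unit


def expected_builds(pass_rate: float) -> float:
    """Average builds per good unit.

    ASSUMPTION: each attempt passes independently with the same probability,
    so the number of attempts is geometric with mean 1 / pass_rate.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return 1 / pass_rate


# Building every unit twice on a 2.5 hour Takt, as in the example above.
print(required_takt(2.5, builds_per_good_unit=2.0))

# Under the assumption above, "built twice" corresponds to a 50% pass rate.
print(expected_builds(0.5))
```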

Early Testing

SOAK testing can only be done at the end of the production line but there are many other forms of testing you can use to ensure quality further up the line.

This can take many forms, from incoming quality control of parts in the supply chain to in-circuit testing of electronic subsystems to random sampling of production batches for in-detail optical inspection.

In the world of electronics and complex control systems in missiles, faults are often pesky and evasive. To reduce the cost of quality don’t just blindly implement every test you can think of, and don’t always test early in production as a universal substitute for a well-planned testing regime.

Designing a well-tested production line from scratch is complex, so let’s start with a simple example and say that since our missile manufacturer has ramped up production quantities they start to see more quality problems. The fastest and most thorough knee-jerk reaction to a new problem is to add another step at your end-of-line SOAK testing (cue the eternal struggle between manufacturing and quality engineers). Job done? No.

To reduce the cost of quality in implementing this testing you need to identify the root causes, or possible root causes if you can’t be sure. Let’s say that our missile manufacturer has had to onboard a second supplier of critical electronics components and has also had to set up more soldering stations in order to assemble and populate the PCBs and wiring looms. JTAG boundary scan testing could be a great way of testing your PCBs, and it would pick up the faults, but you don’t want to rely on it alone: you might be able to catch the fault closer to its source by optical inspection of your components on arrival. Job done? Still no.

Keep both tests until you can prove that your new test further up the line is reducing your failure rate at SOAK. Only then can you prove your testing is effective and relax your SOAK testing and remove any other containment test procedures you’ve implemented. The cost of quality will get worse before it gets better, but this is how you develop an effective, and minimal, testing solution for identified problems. Think of it as red-tagging but for your test procedures.
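What counts as “prove” is a judgement call. One possible way to put a number on it (my choice of method for illustration, and certainly not the only one) is a one-sided two-proportion z-test on the SOAK failure rate before and after the upstream test went in. The counts below are made up.

```python
import math


def soak_improvement_p_value(
    fails_before: int, units_before: int, fails_after: int, units_after: int
) -> float:
    """One-sided p-value for 'the SOAK failure rate has gone down'.

    Uses a pooled two-proportion z-test (normal approximation), so it is only
    reasonable when both samples contain a decent number of failures.
    """
    rate_before = fails_before / units_before
    rate_after = fails_after / units_after
    pooled = (fails_before + fails_after) / (units_before + units_after)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / units_before + 1 / units_after))
    if std_err == 0:
        return 1.0  # no failures (or all failures) in both samples: no evidence
    z = (rate_before - rate_after) / std_err
    # P(Z >= z) for a standard normal variable
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical counts: 40 SOAK failures in 400 units before the new upstream
# test, 15 in 400 after it.
p = soak_improvement_p_value(40, 400, 15, 400)
print(f"p-value: {p:.5f}")
```

A small p-value says the drop is unlikely to be luck. It doesn't tell you the drop was caused by your new test, so check that nothing else changed on the line at the same time.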

Diagnosis

So, a missile has come off the line with a fault, and as a manufacturing engineer it is your job to diagnose the fault and deal with it. The goals are clear. You need to:

- Repair this missile and get it out the door to the customer (corrective action)

- Identify any problems which might reflect problems in the manufacturing process or the product (preventative action)

You break out your engineering toolkit and start building Ishikawa diagrams, fault trees and doing 5 Whys exercises with your team. Hang on a minute. What does this really achieve?

Management is desperately trying to ramp up production. Your process engineers are trying to standardise the production line but have barely got past Sort (caution - 5S reference). The design team is already working on the product after the product after the product you are making on the line. Answer two questions for me honestly:

- Is the cost of you root cause analysing and making this missile good worth it? It will take hours of your time and you could probably very quickly recover a significant portion of the value in parts without having to get to the fifth Why.

- Can the line realistically afford a stop, or even a delay, to implement an improvement you’ve identified? Is taking the line offline for a day really worth it to resolve a problem that you only expect to happen 1 in every 1,000 units?
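The second question can be roughed out as a break-even calculation. This sketch uses the 2.5 hour Takt time, the 16 hour production day and the 1 in 1,000 fault rate from the text; the margin per unit and the cost of dealing with each fault are hypothetical placeholders, and treating a day's stop as lost (rather than merely delayed) margin is also an assumption.

```python
def units_to_break_even(
    stop_hours: float,
    takt_hours: float,
    margin_per_unit: float,
    fault_rate: float,
    cost_per_fault: float,
) -> float:
    """Units you must build after the fix before the line stop has paid for itself."""
    units_not_built = stop_hours / takt_hours
    cost_of_stop = units_not_built * margin_per_unit
    saving_per_unit_built = fault_rate * cost_per_fault
    return cost_of_stop / saving_per_unit_built


break_even = units_to_break_even(
    stop_hours=16,            # one day on a 16/5 line
    takt_hours=2.5,           # from the example
    margin_per_unit=200_000,  # HYPOTHETICAL placeholder, in pounds
    fault_rate=1 / 1_000,     # "1 in every 1,000 units"
    cost_per_fault=20_000,    # HYPOTHETICAL cost to rework one faulty unit
)
print(f"Break-even after about {break_even:,.0f} units")
```

With these made-up figures the fix takes years of production to pay back. Plug in your own numbers before you reach for the Ishikawa diagram.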

Yes, we all want to make good product, but the Cost of Quality goes both ways. If we are too fixated on good product we can spend even more money going too far the other way. This is an especially difficult and heart-wrenching argument to make to a problem-solving, analysis-loving engineer such as yourself.

Design Like You Mean It

This is where DFT (Design For Test) really comes into its own. When you are developing your product, don’t just think about how you can test your entire system, but how you can control the product quality as early in the manufacturing process as possible, long before it becomes a complete product.

As a rule of thumb:

- If you are probably going to be reworking rejects off your production line then the added value in your production step before the product is next tested in some way should be the same order of magnitude as the cost in parts, people, time and equipment to rework it.

- If you are going to be discarding rejects then the added value in your production step before the product is next tested in some way should be the same order of magnitude as the cost in parts, people, time and equipment to replace it up to that point in the production line.

This rule can steer you towards the right cost of quality – walking the line between too much testing and too many rejects.

We are now talking about a production line where testing is done at many stages. Given there will be rejects, however amazingly your product is designed and your processes are controlled, this means that your flow rate through the line will be different at different stages. Line balancing is not the same as achieving a uniform Takt time all the way through your production line.

Testing & Simulation

So far we have been talking about testing on the production line to ensure proportionate quality of our finished product. Product faults are caused by the design as much as by the manufacturing process. This is why Design Failure Modes and Effects Analyses (DFMEAs) exist as well as their Process cousins (PFMEAs).

In principle, the optimal solutions for verifying your designs during the early stages of the development cycle and for ensuring quality on your production line are very different. In practice, as a great engineer with DFT at the forefront of your mind, you might want one solution for both.

Simulation is a theoretical exercise to create a representation of your product and evaluate its performance without needing the physical item in your hand. For electronics this can range from the FIDES model (which, as co-creators, Thales will be very familiar with), where Bills of Material are assessed to produce a reliability estimate, to SPICE simulation and even into the world of Finite Element Analysis.

Testing means getting stuck into your actual product, whether software or hardware. On the production line it needs to be testing not simulation but again you have multiple choices. Do you do functional testing or do you do some form of in-circuit (again, we’re talking electronics) test?

JTAG is one of the technologies that can span both development and production, allowing DFT best practice to be used throughout the product lifecycle with less implementation effort and so a lower cost of quality.

Muda, Muri, Mura

It’s time to add yet another factor into the Cost of Quality debate. If you’ve come across the Toyota Production System before then you will definitely have heard of the three Ms. I want to extend this to talking about the Cost of Quality.

Here is a beautiful convoy of lorries, each full to the brim, delivering their cargoes as efficiently as possible. The problem with quality issues is that they’re hard to predict and things might end up looking a little different.

Here is another convoy. This one is suffering from Muda. Because some of the cargo has been rejected on quality grounds the first two lorries left only part-full to stick to their delivery schedule. To make up the order a third lorry was needed. When talking about a production line quality rejects reduce the output from your line. To keep the same output you will be reducing your Takt time or adding parallel lines. This leads to part-utilising equipment and people, extending shifts and filling your buffers with Work-In-Progress.

This convoy has a Muri issue. The Muri problem is the one we have already covered where the same equipment needs to be run harder to deliver the same output when you have quality problems. When this impacts testing on the line itself, it can lead to a vicious cycle of overburdened equipment, stressed operators and more quality escapes.

This last one suffers from Mura. Quality issues on the line can lead to line stops, desperate games of catch-up and unevenness in order to achieve the same result. At an equipment level this means your line isn’t being used evenly: tools and parts will be used, or sit idle, unevenly, all making your six sigma goal just that little bit harder. At a process level this means you need to build in buffers so that your systems are not close-coupled. This all contributes to the Cost of Quality.

WIP

The more testing you do and the higher your quality standards the more Work-In-Progress (WIP) you will inevitably have. This can be in one of three places:

- Quality rejects being reworked at Tier 2 engineering stations

- Quality rejects queuing to re-enter the production line at the relevant station

- Product being held in buffers between stations to decouple the production line stations and better handle the Mura problem from above

Every piece of WIP has a value and that is money not in the company’s bank account. Our example missile manufacturer has 10,000 to produce a year. Their Takt time may only be 2.5 hours but over 250 person hours are needed to manufacture each missile. Once you’ve added in stock held in buffers, in goods-in and goods-out, a product like this might be inside the factory for well over a month. The BBC’s ‘Inside The Factory’ has a rather stylised representation of Just In Time manufacturing with their timer, and the real world isn’t like that!

What does that actually mean? Let’s suppose this WIP has a value halfway between the parts cost of £100k and a finished product price of £300k. So that is, approximately, 1,000 missiles somewhere in the factory at a value of £200k each. How much cash does the business need to fill that production line with WIP? £200M.

This could easily be enough to cripple a business if you don’t watch out for it!
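The missile count in that estimate is Little’s Law at work: average WIP equals throughput multiplied by the time each unit spends in the factory. Here is a rough sketch of the arithmetic. The 1.2 months in the factory is my assumption, picked to reproduce the figure of roughly 1,000 missiles; the text above only says “well over a month”.

```python
def units_in_factory(units_per_year: float, months_in_factory: float) -> float:
    """Little's Law: average WIP = throughput x time spent in the system."""
    return units_per_year * months_in_factory / 12


def wip_cash(units: float, parts_cost: float, sale_price: float) -> float:
    """Cash tied up, valuing each unit halfway between parts cost and sale price."""
    return units * (parts_cost + sale_price) / 2


units = units_in_factory(units_per_year=10_000, months_in_factory=1.2)  # 1.2 is assumed
cash = wip_cash(units, parts_cost=100_000, sale_price=300_000)
print(f"{units:.0f} missiles in the factory, tying up £{cash / 1e6:.0f}M")
```

The useful thing about writing it this way is that it shows the two levers you have: every week you shave off the time in the factory releases cash in direct proportion.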

Money Is For The Accountants

Every single quality reject (and some of those rejects will be test failures not product failures), every single product in rework, every single product held in a buffer is taking up £200k in cashflow. Imagine if the business could invest that money in growing production, or even just put it in the bank.

To go back to our definition of what makes a good product: test recall and test repeatability could not be more important in reducing WIP. Only implement a test on a production line if:

- You know what you are testing for

- You can take corrective action (rework) or write off the value fast

- It is implemented as early as possible in the production process

- You are not rejecting good product, and you are continuously re-evaluating to make sure this remains true

As an engineer, whether you are designing products or setting up production lines, minimising WIP is really, really important. If the business doesn’t have cash it can’t invest, it can’t grow and it might not even be able to pay the bills.

It’s Not All About You

Back at the smart metering factory I first worked in, I remember lorries full of parts queuing to get into goods-in because there wasn’t space in the factory because of all of the WIP stacked in every corner.

Although you typically have larger buffers at the goods-in and goods-out of your own production facility the real system flows all the way from raw materials provided by second and third tier suppliers and then through to your customers. The closer to Just In Time you run and the leaner your line runs the more easily you will have knock-ons to your upstream and downstream supply chains. If you had a perfectly lean facility then even Paddy McGuinness and Cherry Healey taking a single loaf off the line for filming ‘Inside The Factory’ could cause a line stop. Beware of the film crews.

But more seriously, we have all seen crates upon crates of parts sat in warehouses waiting for inbound quality inspection. This is no less a Cost of Quality than WIP on the production line itself. Let’s suppose our missile manufacturer has a Bill of Material that is thousands of lines long and to ensure quality every inbound shipment has 100% inspection. This means to produce a missile every 2.5 hours (10,000 a year) we will need to be inspecting an inbound part every few seconds. Now let’s say that your test recall is not 100% and in fact you are falsely rejecting even a tiny fraction of parts. This is enough to cause not only continual shortages and stops on your line but also get those lorries queuing outside the gates.
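To see how quickly this bites, here is a small sketch. The text only says the Bill of Material is “thousands of lines long”, so the 3,000 lines, the one part per line and the 0.1% false reject rate are all assumptions for illustration, as is treating each inspection as independent.

```python
def seconds_per_inspection(takt_hours: float, bom_lines: int) -> float:
    """Time available per inspection if every BOM line is inspected once per unit."""
    return takt_hours * 3600 / bom_lines


def chance_kit_is_short(bom_lines: int, false_reject_rate: float) -> float:
    """Probability that at least one good part in a unit's kit is falsely rejected.

    ASSUMPTION: one part per BOM line, each inspected independently.
    """
    return 1 - (1 - false_reject_rate) ** bom_lines


BOM_LINES = 3_000          # assumed: "thousands of lines long"
FALSE_REJECT_RATE = 0.001  # assumed: 1 good part in 1,000 wrongly rejected

print(f"One inspection every {seconds_per_inspection(2.5, BOM_LINES):.1f} s")
print(f"Chance a kit is missing a part: {chance_kit_is_short(BOM_LINES, FALSE_REJECT_RATE):.0%}")
```

Even a false reject rate that sounds negligible per part adds up across a long Bill of Material, which is why nearly every kit ends up waiting on something.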

Seek And You Will Find

I will bet you any amount of money that if you give a product to an engineer and tell them something is wrong with it, that given enough time, they will find a problem even if there wasn’t one to begin with. You might find a bug in some code, you might find a slightly off solder joint or you might find a hairline fracture in a casing that might one day grow. This is known as overtesting.

Testing is not about finding everything that could possibly be wrong, it is about proportionately managing risk to ensure you have a safe and functional product. The General Product Safety Regulation (in the UK and EU) doesn’t require that products are faultless, it requires that they are safe. Absolutely, I am an advocate for safety first, and my tolerance to safety risks in products I work with is very low but every decision you make needs to be proportionate. To quote the Engineering Council’s UK-SPEC on what is expected of a Chartered Engineer:

“Developing and implementing appropriate hazard identification and risk management systems and culture”

The operative word here is “appropriate”: no knee-jerking allowed!

Finding The Balance

This article has been long, too long probably. The Cost of Quality is tricky so here is a summary of the warning signs to help you tread the fine line:

Put more effort into testing if:

- You have tangible evidence that your products are not safe

- Your testing methodology doesn’t let you test early

- You don’t have a defined test regime

- Your rework volume has a significant impact on line throughput

- You have high value products and capacity to rework

- Your test strategies between development and production have nothing in common

- Customers are complaining about quality

- You haven’t heard of DFT before

Put less effort into testing if:

- You have WIP coming out of your ears

- End-of-line tests are finding faults that earlier tests should have found

- Your tests are repeatably finding problems that won’t realistically have an impact within the life of the product

- You are unbalancing the line to the point that it is causing Muda, Muri or Mura

- You have traceability of things you don’t, or never realistically will, understand

- You don’t have the resource to take preventative action on faults

- Your testing is mitigating FMEA items that didn’t need further mitigation

- Your supply chain manager is paying extra for storage and rearranging deliveries

If not many of these apply to you, you have probably got it about right!

Oh COQ!