Wednesday, November 28, 2012

Enumeration hell

tl;dr some bugs are beyond imagining

"Rational people don't count bugs."

There's a rash statement. Let's say that rational people who do count bugs ought to count other, more meaningful things, too.

Bugs* are rotten to count. There are plenty of posts** about this, and I won't go over the same ground here. Counting bugs is a bit like counting holes – superficially obvious until someone takes a shovel to your cheese.

But the big problem with a bug count is that it summarises a potentially useful collection of information into a number that is mostly meaningless. A single nasty that makes the wheels fall off is worth any number of niggles that make the horn too loud. Unless you're driving a clown car.

In our idealised model, we're counting surprises because it's interesting to see how many are left. None is none on any scale, and if there's none, we're done. We're still not done if we've got one left, because that one might be a stinker.

You've noticed that I've only given you one knob to twiddle*** on these toys. You only get to change the budget – you don't get to change the context****. This is a cheap manipulation on my part, because I've been asking you to concentrate on where you might set that budget to feel reasonably confident that the thing is tested.

So far, we've not considered bug stink in our model. It's time that changed.

In the same way that our model gives each bug a chance of being found, it gives each bug a quality I'll call cost. That's probably not the best word, but it's the one I've chosen for now*****. I'll give it a local meaning. Cost is the amount by which the value of the system goes down when it contains the bug. Quality is value to someone. Trouble makes that value go down. Cost, here, is not cost of fixing the bug. It's the cost of leaving it in, and it's the cost to the end users.

Bugs aren't made equal, so we'll need to consider a distribution again, this time of (our local definition of) cost. Experience leads me to believe that most bugs have low cost, some bugs have higher cost, and a very few (so few that they might not exist in a given system) have astronomically large costs that outweigh the value of the system.

In earlier examples, each bug had the same cost. The distribution I've chosen to use in this model, to match my experience, is called a "power law" distribution. Power law distributions fit lots of things observed in the real world, such as city sizes, distribution of wealth, and the initial mass of stars. Power law maths underlies the Pareto Principle (aka the 80:20 rule), and Taylor's Law****** (and, more incomprehensibly, phase changes). If you want to dive into this, set your head up with this handy note comparing the similarities of Power/Zipf/Pareto in a real (if rather antique) context.

Why have I picked this distribution? Because it feels right. Instinct is no justification, so you can expect that we'll have a look at other distributions later. For now, though, here's a fourth assumption:

4        The cost of a bug to (all the end users over the life of a product) has a power law distribution.

Enough of the hand-waving. Let's play.

Below you should find an identical machine to last time's closing toy, but with costs set to match a Pareto-style distribution. You'll quickly see that there are two "stuff found" numbers, and that the size of the yellow dot is related to the cost. Run this a few times.

Don't be surprised if, occasionally, you see a simply huge yellow dot. Try hovering over the top right of the square set of 400 circles, and click on the ? you see to reveal a god-like understanding of how much trouble this system is hiding. Know that, generally, you'll see the total trouble is around 1000*******. If you see around 2000, expect that one of the bugs has a cost of 1000. If you happen to see around 11000, you've probably got a fat 10K bug hiding away.
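If you'd like to poke at the cost side away from the toy, here's a minimal sketch in Python. It is not the code behind the machine above. The shape parameter `alpha`, the minimum cost of 1 and the function names are all my assumptions for illustration (an `alpha` near 1.16 is roughly the shape that goes with 80:20), so don't expect the totals to match the toy's.

```python
import random


def sample_costs(bugs=400, alpha=1.16, seed=None):
    """Draw one cost per bug from a Pareto (power law) distribution.

    Assumed for illustration: alpha=1.16 and a minimum cost of 1.
    """
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(bugs)]


def share_of_total(costs, top_fraction=0.2):
    """Fraction of the total cost carried by the costliest bugs."""
    ranked = sorted(costs, reverse=True)
    top_n = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


if __name__ == "__main__":
    # Run it a few times without a seed: the total swings wildly,
    # usually because of one or two very large bugs.
    for run in range(5):
        costs = sample_costs()
        print(f"run {run}: total {sum(costs):9.0f}, "
              f"biggest {max(costs):9.0f}, "
              f"top 20% carry {share_of_total(costs):.0%}")
```

Most runs look tame. Every so often one bug outweighs the rest put together, which is the huge yellow dot in numbers.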

In our most recent outing, I hope you got a feel for why it's hard to use a bug rate to say that you're done testing. If you play with the models in this posting, you may get an idea for how 'not done' feels in terms of the cost of what you've left behind.

I hope you're still considering where your omnicognisant self would set a reasonable budget so you could say with confidence that you'd done enough. Have a look at the left-hand graph of what's been found. It's still very front-loaded, but you'll see the occasional big spike as a particularly troublesome bug is revealed.

Let's rack up the difficulty another notch. I set up the model above so that the budget and the bug distribution meant that you got to find most of the bugs in a relatively brief exercise. Of course, that's no use at all. Here's another; more bugs, smaller budget. Crucially though, in this model plenty of the bugs are very hard to find indeed. You're not going to find the lot, so that's what this model looks like.

Hopeless, isn't it? If the real world looks anything like our model, how can anyone be bothered to give a sensible answer when asked to set out a budget?

Next time, all being well, we'll approach these frustrations sideways on. We won't find clarity, but we may find perspective.

* I'm not going to define "bug", because it's a vague word, and therein lies its power. But if there's a scale that runs through vague to countable, then I suggest these two ideas are at opposite ends.
** Try Michael Bolton's Another Silly Quantitative Model and Elisabeth Hendrickson's What Metrics do you use in Agile.
*** there's lots more interactivity to come. For now though, mull on how it must feel to be a leader whose only effective control is over budget-setting, then be nicer to your poor distant Chief Money Officer next time.
**** suggestions accepted, but without any guarantee they'll be used.
***** "Law" appears to be used by some scientists in a similarly-imprecise way to the way some lawyers use "Proof". Business people naturally choose to use both words with abandon. I would treat the word "Law" here with as much scepticism as you might treat it in Moore's Law. They're empirical laws, and describe, rather than necessarily account for, system behaviour.
******* 1000 what? I don't care. Stop your whining and go count the number of things in this list.

Monday, November 19, 2012

Models, lies and approximations

tl;dr – Some of these bugs are not like the others

Here's hoping you've enjoyed playing with trucks and bowls and your imaginations. If we're going to be able to use our model as an illustration of much value, we have to recognise that in relation to software testing it contains a useful approximation, and a misleading lie.

There's a limited collection of things to find. This can be a useful approximation for exploration in software testing – if one assumes that a system has a finite collection of possible behaviours, then the set of possible but undesirable behaviours is limited too (as compared with the vast set of rubbish things that it doesn't actually do). This is good to consider when told "there's always a bug" by an idiot*.

You might further refine this by adjusting your view from the large mass of observably rotten behaviour to the smaller selection of practical changes that make the system more desirable. You'll also recognise that the collection, while limited, is only fixed if everything else is fixed. In our model, the collection of bugs is fixed – so we need to be clear that the approximation and the model assume that, just for now, no one's changing stuff**.

The rate of finding things falls, as the number of things that can be found falls. This is obviously true, but is perversely also a misleading lie***. Idiots (sometimes, the same idiots who believe "there's always a bug") assume, because of the statement's obvious truth, that when the rate of finding bugs drops, the system is close to clean. Bonnnnng.

Sometimes, and it feels like often, it's because the people testing it have run out of imagination. While we may understand more as we reveal more, and while a system may become cleaner as it gets fixed, a dropping bug rate certainly does not imply you've found all the bugs.

Some testing is done by an unbending list of dull asserts, which run clear and green when they've not managed to either trigger or observe any trouble. Michael Bolton reasonably calls these "checks" rather than tests. Some testers, limited by corporate focus or personal blandness, don't do much better, demonstrating simply**** that a system meets expectations.

As any fule kno, some bugs are harder to find than others. If you find you've run out of bugs, it's likely you've run out of bugs that you're set up to find. Sometimes, that's OK. But sometimes, a bug that's hard for you to find is easy for someone else to find. If that someone else isn't a paid tester, but is, heaven forfend, a paying customer, we get the "why didn't you find that" conversation.

So, then. A couple of approximations for this model.

1        Some bugs are harder to find than others.

I'll model this by giving some bugs a high chance of being found, and others a low chance. The way that easy-to-hard works amongst the bugs is called a probability distribution. We can pick a distribution. Our earlier example, the trucks and bowls, would be a fixed, or flat distribution, where everything has the same chance, because we assume that trucks/bowls are effectively equal within the model. That's the stupid but easy assumption that lies under the misleading lie. Bugs are different.

2        We don't have a find-fix-retest cycle in our model. Nothing is being changed.

This makes the model easier to understand, because we're not looking at two things going on at the same time. Of course it's inaccurate. The trick is to use the imagination to wonder how that inaccuracy might appear. Models should be visceral, so go with your emotion if you don't want to be analytical.

Finally, a wrinkle. Exploring is about learning. As we discover more, we get better at discovering, not worse. We start out rubbish, and one trick that distinguishes good testers is how quickly they get better (not how good they start). This leads us to

3        Everything's harder to find early on.

In our model, we have a tester. The chance of them finding any given bug starts at a particular value (0, say) and increases. In this model, it increases over time, and it's much easier to go from nothing to halfway good than it is to go from halfway good to perfect. There are lots of different ways of modelling this – again, use your imagination to think how the model might change.

So – here's a model of a tester discovering bugs. I've fixed it so that there are 400 bugs to find, but some are harder than others. The tester gets better over time.
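For anyone who'd rather read the three assumptions as code, here's a rough Python sketch. It is not the machine itself. The spread of "findability" (a squared uniform draw, capped at 0.1 per tick), the learning curve (a saturating exponential with a made-up `learning_time`) and every name in it are my own guesses, picked only so that few bugs are easy, many are hard, and the tester goes from nothing to halfway good faster than from halfway good to perfect.

```python
import math
import random


def simulate_tester(bugs=400, budget=200, learning_time=50.0, seed=None):
    """Return how many bugs are newly found on each tick of the budget."""
    rng = random.Random(seed)
    # Assumption 1: some bugs are harder to find than others.
    # Squaring a uniform draw gives a few easy bugs and many hard ones.
    findability = [0.1 * rng.random() ** 2 for _ in range(bugs)]
    found_per_tick = []
    for tick in range(budget):
        # Assumption 3: skill starts at 0, climbs quickly, then levels off.
        skill = 1.0 - math.exp(-tick / learning_time)
        still_hidden = []
        found = 0
        for chance in findability:
            if rng.random() < chance * skill:
                found += 1
            else:
                still_hidden.append(chance)
        # Assumption 2: nothing gets fixed or changed. A found bug
        # simply leaves the pool of things left to find.
        findability = still_hidden
        found_per_tick.append(found)
    return found_per_tick


if __name__ == "__main__":
    per_tick = simulate_tester(seed=1)
    print("found in total:", sum(per_tick), "of 400")
    print("found in the last 20 ticks:", sum(per_tick[-20:]))
```

Watch the last line when you change the budget. The find rate can fall to a trickle while a good share of the 400 is still hiding.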

* Not that they'll listen. Indeed, that's close-on the definition of an idiot, which is apparently from a Latin word meaning "ignorant person". Clearly they are, if they're ignoring you.
** I'm aware that this is impossible and in many ways undesirable in the real world. My model, my rules. Just making them explicit.
*** something to have in mind whenever someone says "obviously"
**** but oh, in such a complicated way

Friday, November 02, 2012

Broken Trucks

tl;dr – you still need your imagination, even with real-life examples

Temporary note – the truck graphic has gone, the graphs are back. I'll remove this note when I restore the graphics...

Magic pennies? Pshaw.

Let me put this another way.

This problem has been put to me frequently in my testing life. Here's one close-to-actual situation.

My client has a hundred trucks. Each has a bit of kit, and I've been told that the bit of kit needs to be replaced occasionally. Actually, not so occasionally – it's new kit, and I'm told that it's likely to fail at least once in the first hundred days' use.

So, how many trucks will experience that failure in their first hundred days? All of them? Also, how long should we test for? How many rigs should we use? How reliable is that suspiciously-round 1 in 100 figure?

As it happens, there's a bit of maths one can do. If the chance of a truck failing on any given day is 1%, then the chance of it not failing that day is 99%. The chance of it not failing for 2 days in a row is 99% * 99% (just over 98%). For 3 days, 99% * 99% * 99% (a tad over 97%).

Can you see where I'm going? The chance of a truck not failing for 10 days in a row is 99% * [99% another 9 times]. That's 99%^10.

For 100 days in a row, it's 99% ^ 100. Which is about 37%*.

So after a hundred days, I'm likely to still have 37 trucks, more or less, that haven't failed yet.

Which makes around 63 trucks that I need to go and mend**.
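The sums above fit in a few lines of Python, if you want to try other fleets or other failure rates. The function names are mine, and the sketch leans on the same assumption as the maths: every truck has the same 1% chance of failing each day, independent of the day before.

```python
def chance_still_working(daily_failure_chance, days):
    """Chance that one truck gets through `days` days with no failure."""
    return (1.0 - daily_failure_chance) ** days


def expected_split(trucks, daily_failure_chance, days):
    """Expected (never failed, failed at least once) counts for a fleet."""
    working = trucks * chance_still_working(daily_failure_chance, days)
    return working, trucks - working


if __name__ == "__main__":
    working, failed = expected_split(trucks=100, daily_failure_chance=0.01, days=100)
    print(f"never failed: {working:.1f}, failed at least once: {failed:.1f}")
```

These are averages. Any real hundred days will land a few trucks either side.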

The maths is satisfying, but it doesn't tell me any more than the question I was first asked. Nonetheless, we know that all good testers have a practically unlimited supply of extra questions to ask, so we're probably not completely satisfied.

However, if I go grab my hi-viz jacket and get to work on the trucks, I'll get a better idea of what happens. I'll find that some days everything works as well as it did yesterday, and occasionally three new trucks phone in failed. I'll get an idea that I'll see more failures when there are more things that work – so as the period goes on, I'll see fewer and fewer. Some trucks could go on for ages (I'm sure that you've all heard of immortal lightbulbs, too. Survivorship bias – mostly.)

Working on the trucks allows a visceral, complex experience. It takes a while to get, it's not terribly transferrable, and it's hard to forget. You know it deeply and in many different ways. You are "experienced". The maths approach is different; the result is ephemeral, and you may remember the method more easily. To imagine its implications, you'll have to think hard. You are "expert"***, and because you can remember the method, you might be able to re-apply it in a different context.

In between these two, there are models and simulations. Models aren't reality, but neither are they primarily symbolic (at least, not on the outside). I hope that the right model might engender something between experience and expertise. For what it's worth, I think that asking "How long should I test for to be confident that I'm not going to see problem X much in real life" is a fair question, and I think that "It depends" is a rotten answer without some idea on what "it" might depend.

I've given you three machines below. 10 trucks, 100 trucks, 1000 trucks. I've knocked out various noisy bits, but it's otherwise the same simulation. Have a play. You can change the budgets. Think about what the frequency of failure tells you, especially over time. While you play, just have in the back of your mind the ways that this kind of failure differs from the failures that we discover when exploring...
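If the machines won't load, a stripped-down stand-in is easy enough to write. The Python below is my own sketch, not the code behind them. It assumes a flat 1% daily chance for every truck and only counts a truck the first time it fails. The function name and its parameters are made up for the purpose.

```python
import random


def first_failures_per_day(trucks, days, daily_failure_chance=0.01, seed=None):
    """Count, for each day, the trucks that fail for the first time."""
    rng = random.Random(seed)
    still_working = trucks
    new_failures = []
    for _day in range(days):
        failed_today = 0
        for _truck in range(still_working):
            if rng.random() < daily_failure_chance:
                failed_today += 1
        # Only trucks that have never failed stay in the pool.
        still_working -= failed_today
        new_failures.append(failed_today)
    return new_failures


if __name__ == "__main__":
    for fleet in (10, 100, 1000):
        per_day = first_failures_per_day(fleet, days=100)
        print(f"{fleet:5d} trucks: {sum(per_day):4d} failed at least once; "
              f"first 10 days {sum(per_day[:10])}, last 10 days {sum(per_day[-10:])}")
```

Run it several times. The small fleet jumps about from run to run, the big one settles near the maths, and in both the new failures thin out as fewer trucks are left to fail for the first time.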

Right now, I'm posting this from EuroSTAR – it's looking good! Follow @esconfs on twitter, or watch for the #esconfs hashtag. And @TheTestLab, of course.

 * We're assuming here that a once-broken truck is no more likely (or less likely) to break down again. We're also assuming that the non-broken trucks are at no greater chance of breaking. In one of the cases I'm thinking of, the "broken" truck was entirely functional as far as most people were concerned, so the broken trucks didn't get less use, and the working trucks didn't get more use. If you're thinking of an un-enlargeable fleet of trucks with broken axles, we've got different models.
** If I'm swift to mend, some of these probably will have needed to be mended more than once.
*** Nobody said that being experienced and being expert were mutually exclusive. You can be both, you can be either, most of us are neither outside our fields of interest.

Thursday, November 01, 2012

An experiment with probability

It's been a busy day.

So, just for now, here's a very abstract experiment (and I'll give you the kit to play with the experiment)

Imagine you've got a hundred bowls in front of you.

In each bowl, you've put a hundred pennies – 99 dull ones, and one magic one.

Every day, you get to look at one random penny from each bowl. You drop the penny back in its bowl when you've looked.

If you had a hundred days, how many magic pennies might you see?

If you wanted to see all the magic pennies, how long would you plan to spend to be reasonably confident? How confident is reasonably?

If you want to work this out, do. You could find out empirically and viscerally, too, but you don't necessarily need a hundred quid in coppers and a couple of seasons: Play with the thing below. There are a hundred purple circles, that go yellow with a chance of 1:100 every tick of the red number. You can change the red number before you start. Press the arrow in the circle to set things going. Treat "work done" as "days spent" for now – no one thinks finding magic pennies is a real job.
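For those who do want to work it out, here's one way to set up the sum in Python. It assumes what the experiment describes: each bowl gives a 1 in 100 chance per day, and the bowls don't affect each other. The function names, and the choice of 95% as one possible meaning of "reasonably confident", are mine.

```python
def chance_all_seen(bowls, days, daily_chance=0.01):
    """Chance that every bowl has shown its magic penny within `days` days."""
    seen_in_one_bowl = 1.0 - (1.0 - daily_chance) ** days
    return seen_in_one_bowl ** bowls


def days_for_confidence(bowls, confidence, daily_chance=0.01):
    """Smallest number of days giving at least `confidence` of seeing them all."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if not 0.0 < daily_chance <= 1.0:
        raise ValueError("daily_chance must be above 0 and at most 1")
    days = 0
    while chance_all_seen(bowls, days, daily_chance) < confidence:
        days += 1
    return days


if __name__ == "__main__":
    for confidence in (0.5, 0.9, 0.95, 0.99):
        days = days_for_confidence(bowls=100, confidence=confidence)
        print(f"{confidence:.0%} confident of seeing all 100: {days} days")
```

Compare the answers with the hundred days you started with, then ask yourself again how confident "reasonably" is.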

Cheers -