Friday, December 14, 2012

Modelling super powers

tl; dr it's not just the tactics that matter

In this scenario, I've modelled five testers. Each has a super power.

  • One logs bugs more easily than the others – effectively, they log as many bugs as they can see. The others can only log 10 bugs for each bit of budget they consume.
  • One only logs big bugs – bugs with a cost of 10 or more. The others log any bug they find.
  • One learns three times more effectively than the others.
  • One switches tactic twice as often.
  • One finds it easier to retain their skills after switching tactic.
What difference might each of these qualities make?

Run the exercise a few times. Make some changes. You may find it easier fullscreen, and of course the XML is available to play with. Does the model match your experience?

More to the point, perhaps, how are you comparing the different testers in the model? How do you compare real testers on your team?

Friday, December 07, 2012

Diversity matters, and here's why

tl;dr – It ain't what you do, it's the way that you do it

We've got a model which tells us we have a hopeless problem. I promised some perspective.

Let's try throwing people at our problem. In the exercise below, we're using five testers. If a bug has a 1:100 chance of being found by one tester in one cycle, surely five testers should have a better chance.*

How much better? Run the thing and find out.

Less than impressed? That's because hard-to-find bugs are still hard to find, even with a few more people. Your one-in-five-million shot is not going to jump into your lap if you've only managed to make the chance of finding it one-in-a-million.
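For those who'd rather see the sum than run the toy, here's a minimal sketch of the arithmetic. It assumes the five testers are identical and independent, which is my simplification for this illustration; the function name is made up, and this is not the code inside the machine.

```python
def chance_any_finds(p_single: float, testers: int) -> float:
    """Chance that at least one of `testers` independent, identical testers
    finds a given bug in one cycle, if each has chance `p_single`."""
    return 1 - (1 - p_single) ** testers


# The 1:100 bug from the text, and a one-in-five-million shot.
for p in (0.01, 1 / 5_000_000):
    print(f"one tester: {p:.7%}   five testers: {chance_any_finds(p, 5):.7%}")
```

Five testers take the 1:100 bug to a little under 1:20, and the one-in-five-million bug to roughly one-in-a-million. Better, but still not something to plan a budget around.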

There's a key quality I've not changed in this model. We've said that some bugs are harder to find than others. We've not yet mentioned, or modelled, that my Mum merrily finds problems that have eluded me. The way that you don't see my bugs on your machine. The way that performance testing jiggles bugs by the bucketload out of systems which seemed to be working just fine, or the way that unit testing and usability studies find (mostly) entirely different collections of bugs.

Our model should reflect that any individual bug** should have lots of different likelihoods of being found. For this model, we're going to make the choice of likelihood depend on the tactic that is in use. Indeed, for this model, that's what differentiates and defines "tactic" – changing tactic changes the distribution of likelihoods across the collection of bugs.

Below, you'll find an exercise which again has five testers working in parallel. This time, each tester has their own individual profile of chances, and a bug that one finds easily may be much less likely to be found by another.

In the model, we do this by setting up tactics. Each tester has one tactic, which they use exclusively. Each tactic is set up in the same way across the full population of bugs – it's just a distribution of probabilities. If you were to look at one bug, you'd find it has five separate probabilities of being found. Have a play.

The difference is clear.

Diversity matters***. In this model, it matters a lot; more than budget, more than the number of testers.

For those of you who prefer analysis over play, it's also clear if you think about the chances of finding an individual bug. Tactic 1's million-to-one chance bug may be a billion-to-one for tactic 2, too, but tactic 3 might well see it as a hundred-to-one. Ultimately, the no-chance-for-that-bug tactic would continue to have no chance whatever your budget (or patience), but by having many tactics, one increases the chance of having a technique that will find that particular bug easily.
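The same argument as a rough sketch, using the three odds mentioned above. The function name and the 100-cycle budget are assumptions for illustration, and I'm treating every look at the bug as independent.

```python
from math import prod


def chance_found(per_tactic_chances, cycles=1):
    """Chance that a single bug is found at least once when every listed
    tactic gets one independent attempt at it in each cycle."""
    miss_in_one_cycle = prod(1 - p for p in per_tactic_chances)
    return 1 - miss_in_one_cycle ** cycles


# Three testers all using tactic 1: million-to-one each, every cycle.
monotonic = [1e-6, 1e-6, 1e-6]
# Three testers, three tactics: million-, billion- and hundred-to-one.
diverse = [1e-6, 1e-9, 1e-2]

for label, chances in (("monotonic", monotonic), ("diverse", diverse)):
    print(f"{label:9s} after 100 cycles: {chance_found(chances, 100):.4%}")
```

The monotonic team is still staring at a tiny fraction of a percent after 100 cycles. The diverse team is better than evens, and nearly all of that comes from the one tactic that happens to suit the bug.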

QED – but I hope the toys help the demonstrandum at least as much as the argument.

Note that a key assumption hidden in this model of diverse approaches is that the different tactics are utterly different. In the real world, that's hard. There's plenty of refinement to do to our model to make it a more accurate reflection of the world. However, the central idea remains: in this model of discovery, you get much more from changing your variety than from changing your effort.

This then is the perspective – in this exploratory environment, persistence is what leads to hopelessness. Variety gets you closer. Just for fun, here's a model with the five tactics, and just one tester – but this tester can switch tactics. I'll be mean, so they switch randomly, and each time they switch, their skill slides backwards. Look at the poor beggar ticking away; hardly ever gets over 50%.

See how well this works with just one tester.

One random tester does better***** than five monotonic testers? You're surprised by that conclusion? Enough with the rhetoricals: I have (metaphorical) knobs for you to play with.

The sharp-eyed will notice an extra button – I've finally given you a reset. Indeed, this is a rather more interactive machine than you've had so far – you can change the number of bugs and the cost model. You can also give (not entirely reliably) the machine a (not entirely reliable) "seed" to start from as it builds the model, which lets you replay scenarios. Be aware that I've not sorted out a fully-intuitive workflow around set/start/stop/change/reset, nor have I tested this well (it's mine, and I'm too close to do a job to be proud of). I'd appreciate any feedback – be aware that behaviours may change in the near future.

If you want to dig deeper into the model, I've made a change that allows you to play with the machine offline. Download the .swf, and the Exercise.xml file from the same directory. Bung them in the same folder on your own machine, and the .swf will pick up your local copy. Have a play with Exercise.xml and see what you can come up with. I'll share interesting exercises and conclusions here, and you're welcome to post them on your own site. I'd like to hear about your postings, because then I'll know who to send updated machines to. I'll open-source this sometime.

There's lots further one can go with this model, and over the next few posts, we'll explore and illustrate the effects of some common constraints and balances.

It looks like I'll be teaching exploratory testing in Amsterdam early next year. I'm just about to set the dates. If you want 30% off the price for simply telling me you're interested, you've got a couple of days to catch the opportunity.

Cheers -


* maths people will know; (1- (1-0.01) ^ 5) ~ 4.9%, which is just a tad more unlikely than a 1:20 chance.
** for the purposes of this explanation, let's assume we can identify at least one bug.
*** this panders directly to my prejudices, so I'm pleased to reach this conclusion, and find it hard to argue against effectively. I'd be grateful**** if you felt able to disagree.
**** through gritted teeth, but still grateful.
***** better?

Wednesday, November 28, 2012

Enumeration hell

tl;dr some bugs are beyond imagining

"Rational people don't count bugs."

There's a rash statement. Let's say that rational people who do count bugs ought to count other, less pointless more meaningful things, too.

Bugs* are rotten to count. There are plenty of posts** about this, and I won't go over the same ground here. Counting bugs is a bit like counting holes – superficially obvious until someone takes a shovel to your cheese.

But the big problem with a bug count is that it summarises a potentially useful collection of information into a number that is mostly meaningless. A single nasty that makes the wheels fall off is worth any number of niggles that make the horn too loud. Unless you're driving a clown car.

In our idealised model, we're counting surprises because it's interesting to see how many are left. None is none on any scale, and if there's none, we're done. We're still not done if we've got one left, because that one might be a stinker.

You've noticed that I've only given you one knob to twiddle*** on these toys. You only get to change the budget – you don't get to change the context****. This is a cheap manipulation on my part, because I've been asking you to concentrate on where you might set that budget to feel reasonably confident that the thing is tested.

So far, we've not considered bug stink in our model. It's time that changed.

In the same way that our model gives each bug a chance of being found, it gives each bug a quality I'll call cost. That's probably not the best word, but it's the one I've chosen for now*****. I'll give it a local meaning. Cost is the amount by which the value of the system goes down when it contains the bug. Quality is value to someone. Trouble makes that value go down. Cost, here, is not cost of fixing the bug. It's the cost of leaving it in, and it's the cost to the end users.

Bugs aren't made equal, so we'll need to consider a distribution again, this time of (our local definition of) cost. Experience leads me to believe that most bugs have low cost, some bugs have higher cost, and a very few (so few that they might not exist in a given system) have astronomically large costs that outweigh the value of the system.

In earlier examples, each bug had the same cost. The distribution I've chosen to use in this model, to match my experience, is called a "power law" distribution. Power law distributions fit lots of things observed in the real world, such as city sizes, distribution of wealth, and the initial mass of stars. Power law maths underlie the Pareto Principle (aka the 80:20 rule), and Taylor's Law****** (and, more incomprehensibly, phase changes). If you want to dive into this, set your head up with this handy note comparing the similarities of Power/Zipf/Pareto in a real (if rather antique) context.

Why have I picked this distribution? Because it feels right. Instinct is no justification, so you can expect that we'll have a look at other distributions later. For now, though, here's a fourth assumption:

4        The cost of a bug to (all the end users over the life of a product) has a power law distribution.
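If you'd like a feel for what assumption 4 does to a pile of bugs, here's a small sketch that draws costs from a Pareto distribution using Python's standard library. The shape parameter `alpha`, the seed and the function name are all my assumptions for this illustration; they are not the settings the machine uses.

```python
import random


def sample_costs(n_bugs, alpha=1.2, seed=None):
    """Draw one cost per bug from a Pareto (power law) distribution.
    `alpha` is a hypothetical shape parameter: smaller means a fatter tail."""
    rng = random.Random(seed)
    return [rng.paretovariate(alpha) for _ in range(n_bugs)]


costs = sorted(sample_costs(400, seed=2012), reverse=True)
total = sum(costs)
top_four_share = sum(costs[:4]) / total  # the costliest 1% of the bugs

print(f"total cost of 400 bugs: {total:.0f}")
print(f"biggest single bug:     {costs[0]:.0f}")
print(f"median bug:             {costs[len(costs) // 2]:.2f}")
print(f"share held by top 1%:   {top_four_share:.1%}")
```

Change the seed and run it a few times. Most runs look tame, and then occasionally one bug turns up that outweighs everything else put together.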

Enough of the hand-waving. Let's play.

Below you should find an identical machine to last time's closing toy, but with costs set to match a pareto-style distribution. You'll quickly see that there are two "stuff found" numbers, and that the size of the yellow dot is related to the cost. Run this a few times.

Don't be surprised if, occasionally, you see a simply huge yellow dot. Try hovering over the top right of the square set of 400 circles, and click on the ? you see to reveal a god-like understanding of how much trouble this system is hiding. Know that, generally, you'll see the total trouble is around 1000*******. If you see around 2000, expect that one of the bugs has a cost of 1000. If you happen to see around 11000, you've probably got a fat 10K bug hiding away.

In our most recent outing, I hope you got a feel for why it's hard to use a bug rate to say that you're done testing. If you play with the models in this posting, you may get an idea for how 'not done' feels in terms of the cost of what you've left behind.

I hope you're still considering where your omnicognisant self would set a reasonable budget so you could say with confidence that you'd done enough. Have a look at the left-hand graph of what's been found. It's still very front-loaded, but you'll see the occasional big spike as a particularly troublesome bug is revealed.

Let's rack up the difficulty another notch. I set up the model above so that the budget and the bug distribution meant that you got to find most of the bugs in a relatively brief exercise. Of course, that's no use at all. Here's another; more bugs, smaller budget. Crucially though, in this model plenty of the bugs are very hard to find indeed. You're not going to find the lot, so that's what this model looks like.

Hopeless, isn't it? If the real world looks anything like our model, how can anyone be bothered to give a sensible answer when asked to set out a budget?

Next time, all being well, we'll approach these frustrations sideways on. We won't find clarity, but we may find perspective.

* I'm not going to define "bug", because it's a vague word, and therein lies its power. But if there's a scale that runs through vague to countable, then I suggest these two ideas are at opposite ends.
** Try Michael Bolton's Another Silly Quantitative Model and Elisabeth Hendrickson's What Metrics do you use in Agile.
*** there's lots more interactivity to come. For now though, mull on how it must feel to be a leader whose only effective control is over budget-setting, then be nicer to your poor distant Chief Money Officer next time.
**** suggestions accepted, but without any guarantee they'll be used.
***** "Law" appears to be used by some scientists in a similarly-imprecise way to the way some lawyers use "Proof". Business people naturally choose to use both words with abandon. I would treat the word "Law" here with as much scepticism as you might treat it in Moore's Law. They're empirical laws, and describe, rather than necessarily account for, system behaviour.
******* 1000 what? I don't care. Stop your whining and go count the number of things in this list.

Monday, November 19, 2012

Models, lies and approximations

tl;dr – Some of these bugs are not like the others

Here's hoping you've enjoyed playing with trucks and bowls and your imaginations. If we're going to be able to use our model as an illustration of much value, we have to recognise that in relation to software testing it contains a useful approximation, and a misleading lie.

There's a limited collection of things to find. This can be a useful approximation for exploration in software testing – if one assumes that a system has a finite collection of possible behaviours, then the set of possible but undesirable behaviours is limited too (as compared with the vast set of rubbish things that it doesn't actually do). This is good to consider when told "there's always a bug" by an idiot*.

You might further refine this by adjusting your view from the large mass of observably rotten behaviour to the smaller selection of practical changes that make the system more desirable. You'll also recognise that the collection, while limited, is only fixed if everything else is fixed. In our model, the collection of bugs is fixed – so we need to be clear that the approximation and the model assume that, just for now, no one's changing stuff**.

The rate of finding things falls, as the number of things that can be found falls. This is obviously true, but is perversely also a misleading lie***. Idiots (sometimes, the same idiots who believe "there's always a bug") assume, because of the statement's obvious truth, that when the rate of finding bugs drops, the system is close to clean. Bonnnnng.

Sometimes, and it feels like often, it's because the people testing it have run out of imagination. While we may understand more as we reveal more, and while a system may become cleaner as it gets fixed, a dropping bug rate certainly does not imply you've found all the bugs.

Some testing is done by an unbending list of dull asserts, which run clear and green when they've not managed to either trigger or observe any trouble. Michael Bolton reasonably calls these "checks" rather than tests. Some testers, limited by corporate focus or personal blandness, don't do much better, demonstrating simply**** that a system meets expectations.

As any fule kno, some bugs are harder to find than others. If you find you've run out of bugs, it's likely you've run out of bugs that you're set up to find. Sometimes, that's OK. But sometimes, a bug that's hard for you to find is easy for someone else to find. If that someone else isn't a paid tester, but is, heaven forfend, a paying customer, we get the "why didn't you find that" conversation.

So, then. A couple of approximations for this model.

1        Some bugs are harder to find than others.

I'll model this by giving some bugs a high chance of being found, and others a low chance. The way that easy-to-hard works amongst the bugs is called a probability distribution. We can pick a distribution. Our earlier example, the trucks and bowls, would be a fixed, or flat distribution, where everything has the same chance, because we assume that trucks/bowls are effectively equal within the model. That's the stupid but easy assumption that lies under the misleading lie. Bugs are different.

2        We don't have a find-fix-retest cycle in our model. Nothing is being changed.

This makes the model easier to understand, because we're not looking at two things going on at the same time. Of course it's inaccurate. The trick is to use the imagination to wonder how that inaccuracy might appear. Models should be visceral, so go with your emotion if you don't want to be analytical.

Finally, a wrinkle. Exploring is about learning. As we discover more, we get better at discovering, not worse. We start out rubbish, and one trick that distinguishes good testers is how quickly they get better (not how good they start). This leads us to

3        Everything's harder to find early on.

In our model, we have a tester. The chance of them finding any given bug starts at a particular value (0, say) and increases. In this model, it increases over time, and it's much easier to go from nothing to halfway good than it is to go from halfway good to perfect. There are lots of different ways of modelling this – again, use your imagination to think how the model might change.

So – here's a model of a tester discovering bugs. I've fixed it so that there are 400 bugs to find, but some are harder than others. The tester gets better over time.
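Here's one way the three approximations might be sketched in code. Everything specific is an assumption made for this illustration: the shape of the easy-to-hard distribution, the learning curve and the names are mine, and the real machine may well do it differently.

```python
import math
import random


def run_exercise(n_bugs=400, budget=200, learning_time=50.0, seed=None):
    """Simulate one tester exploring a fixed set of bugs.
    Returns the number of new bugs found on each tick of the budget."""
    rng = random.Random(seed)
    # Approximation 1: some bugs are harder to find than others.
    # Hypothetical distribution: cubing a uniform number makes most bugs hard.
    chances = [0.2 * rng.random() ** 3 for _ in range(n_bugs)]
    found = [False] * n_bugs  # Approximation 2: nothing gets fixed or changed.
    found_per_tick = []
    for tick in range(budget):
        # Approximation 3: skill starts at 0 and climbs towards 1, quickly
        # at first and then ever more slowly.
        skill = 1 - math.exp(-tick / learning_time)
        new_finds = 0
        for i, chance in enumerate(chances):
            if not found[i] and rng.random() < chance * skill:
                found[i] = True
                new_finds += 1
        found_per_tick.append(new_finds)
    return found_per_tick


history = run_exercise(seed=1)
print("found in total:", sum(history), "of 400")
print("finds per block of 20 ticks:",
      [sum(history[i:i + 20]) for i in range(0, len(history), 20)])
```

Watch the finds per block. The rate falls away long before the bugs run out, which is the misleading lie in action.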

* Not that they'll listen. Indeed, that's close-on the definition of an idiot, which is apparently from a Latin word meaning "ignorant person". Clearly they are, if they're ignoring you.
** I'm aware that this is impossible and in many ways undesirable in the real world. My model, my rules. Just making them explicit.
*** something to have in mind whenever someone says "obviously"
**** but oh, in such a complicated way

Friday, November 02, 2012

Broken Trucks

tl;dr – you still need your imagination, even with real-life examples

Temporary note – the truck graphic has gone, the graphs are back. I'll remove this note when I restore the graphics...

Magic pennies? Pshaw.

Let me put this another way.

This problem has been put to me frequently in my testing life. Here's one close-to-actual situation.

My client has a hundred trucks. Each has a bit of kit, and I've been told that the bit of kit needs to be replaced occasionally. Actually, not so occasionally – it's new kit, and I'm told that it's likely to fail at least once in the first hundred days' use.

So, how many trucks will experience that failure in their first hundred days? All of them? Also, how long should we test for? How many rigs should we use? How reliable is that suspiciously-round 1 in 100 figure?

As it happens, there's a bit of maths one can do. If the chance of a truck failing on any given day is 1%, then the chance of it not failing that day is 99%. The chance of it not failing for 2 days in a row is 99% * 99% (just over 98%). For 3 days, 99% * 99% * 99% (a tad over 97%).

Can you see where I'm going? The chance of a truck not failing for 10 days in a row is 99% * [99% another 9 times]. That's 99%^10.

For 100 days in a row, it's 99% ^ 100. Which is about 37%*.

So after a hundred days, I'm likely to still have 37 trucks, more or less, that haven't failed yet.

Which makes around 63 trucks that I need to go and mend**.
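If you'd like to check the sum, here's a minimal sketch. It assumes, as the footnote does, that each day is independent and that every truck has the same 1% daily chance; the function name is made up for the illustration.

```python
def chance_still_working(daily_failure_chance, days):
    """Chance that one truck gets through `days` days in a row without
    failing, if each day is independent of the others."""
    return (1 - daily_failure_chance) ** days


fleet = 100
for days in (1, 2, 3, 10, 100):
    survive = chance_still_working(0.01, days)
    print(f"{days:3d} days: {survive:6.1%} chance of no failure, "
          f"so about {round(fleet * survive)} of {fleet} trucks untouched")
```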

The maths is satisfying, but it doesn't tell me any more than the question I was first asked. Nonetheless, we know that all good testers have a practically unlimited supply of extra questions to ask, so we're probably not completely satisfied.

However, if I go grab my hi-viz jacket and get to work on the trucks, I'll get a better idea of what happens. I'll find that some days everything works as well as it did yesterday, and occasionally three new trucks phone in failed. I'll get an idea that I'll see more failures when there are more things that work – so as the period goes on, I'll see fewer and fewer. Some trucks could go on for ages (I'm sure that you've all heard of immortal lightbulbs, too. Survivorship bias – mostly.)

Working on the trucks allows a visceral, complex experience. It takes a while to get, it's not terribly transferrable, and it's hard to forget. You know it deeply and in many different ways. You are "experienced". The maths approach is different; the result is ephemeral, and you may remember the method more easily. To imagine its implications, you'll have to think hard. You are "expert"***, and because you can remember the method, you might be able to re-apply it in a different context.

In between these two, there are models and simulations. Models aren't reality, but neither are they primarily symbolic (at least, not on the outside). I hope that the right model might engender something between experience and expertise. For what it's worth, I think that asking "How long should I test for to be confident that I'm not going to see problem X much in real life" is a fair question, and I think that "It depends" is a rotten answer without some idea on what "it" might depend.

I've given you three machines below. 10 trucks, 100 trucks, 1000 trucks. I've knocked out various noisy bits, but it's otherwise the same simulation. Have a play. You can change the budgets. Think about what the frequency of failure tells you, especially over time. While you play, just have in the back of your mind the ways that this kind of failure differs from the failures that we discover when exploring...

Right now, I'm posting this from EuroSTAR – it's looking good! Follow @esconfs on twitter, or watch for the #esconfs hashtag. And @TheTestLab, of course.

 * We're assuming here that a once-broken truck is no more likely (or less likely) to break down again. We're also assuming that the non-broken trucks are at no greater chance of breaking. In one of the cases I'm thinking of, the "broken" truck was entirely functional as far as most people were concerned, so the broken trucks didn't get less use, and the working trucks didn't get more use. If you're thinking of an un-enlargeable fleet of trucks with broken axles, we've got different models.
** If I'm swift to mend, some of these probably will have needed to be mended more than once.
*** Nobody said that being experienced and being expert were mutually exclusive. You can be both, you can be either, most of us are neither outside our fields of interest.

Thursday, November 01, 2012

An experiment with probability

It's been a busy day.

So, just for now, here's a very abstract experiment (and I'll give you the kit to play with the experiment)

Imagine you've got a hundred bowls in front of you.

In each bowl, you've put a hundred pennies – 99 dull ones, and one magic one.

Every day, you get to look at one random penny from each bowl. You drop the penny back in its bowl when you've looked.

If you had a hundred days, how many magic pennies might you see?

If you wanted to see all the magic pennies, how long would you plan to spend to be reasonably confident? How confident is reasonably?

If you want to work this out, do. You could find out empirically and viscerally, too, but you don't necessarily need a hundred quid in coppers and a couple of seasons: Play with the thing below. There are a hundred purple circles, that go yellow with a chance of 1:100 every tick of the red number. You can change the red number before you start. Press the arrow in the circle to set things going. Treat "work done" as "days spent" for now – no one thinks finding magic pennies is a real job.
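For those who do want to work it out, here's one rough way in. It assumes each bowl is independent, and that "seeing" a bowl's magic penny once is enough. The function names and the 95% figure are my choices for illustration, so pick your own idea of reasonable.

```python
def chance_all_seen(days, bowls=100, p=0.01):
    """Chance that every bowl has shown its magic penny at least once
    after `days` daily looks, each look having chance `p` of magic."""
    return (1 - (1 - p) ** days) ** bowls


def days_for_confidence(confidence, bowls=100, p=0.01):
    """Smallest number of days for which chance_all_seen reaches `confidence`."""
    days = 0
    while chance_all_seen(days, bowls, p) < confidence:
        days += 1
    return days


expected_after_100 = 100 * (1 - 0.99 ** 100)
print(f"magic pennies expected after 100 days: about {expected_after_100:.0f}")
print(f"chance of having seen all 100 by then: {chance_all_seen(100):.1e}")
print(f"days needed to be 95% confident:       {days_for_confidence(0.95)}")
```

Note how far apart the answers are. Seeing most of the magic pennies takes a season; being confident you've seen the lot takes years.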

Cheers -


Wednesday, October 31, 2012


...not in the sense of putting a new veneer over an old worktop, but in the sense of one's head breaking into fresh air after a long, deep, dive.

There may be a bit of bobbing up and down and gasping for a while.

I have a daughter. A new one. The first one, for me. Indeed, the first child of any flavour for either my wife or me. Very new. Very lovely. We're all happy and healthy, all's good. Those of you who want to know more may know that I have pictures on tumblr, and some of you already have the password. Thank you for all the good wishes. Buzz me (directly, not here) if you want access. I'm trying to keep specific details off the internet until the whole identity/privacy thing shakes itself down. Or until she's old enough to vote for herself. Whichever comes first.

Anyway, excuses aside, adult life comes back in with a crunch next week.

I'm on the program committee for EuroSTAR. Along with the other committee members*, I'll need to be visible and available and therefore I'll need to be in Amsterdam from Monday to Thursday. That'll be a shock. One of the reasons I'm here is to get the dust off the testing neurones – I've spent a few weeks mainly thinking about babystuff.

We were very enthusiastic about the program in Galway in March. The EuroSTAR elves have been working like crazy to put together the actual conference, and it's going to be excellent to see it happen. These days, I tend to spend my time at conferences in @TheTestLab. I can't do two things, so Bart is running it with Martin Janson (Martin is one of the TestEye bloggers, worked with us on the EuroSTAR TestLab in 2010, and helmed a fantastic TestLab at Let's Test last May). I can't think of better hands for it to be in. The TestLab will be, we're told, central to the conference, and (at last!) easy to find. I'll be spending time there, if you want to come and find me, but I'll also be going to tracks and sessions, having conversations in corridors and bars, and basically making a nuisance of myself.

It's worth noting that, after the conference theme "Innovate: Renovate" was announced, both Shmuel and I told our colleagues that our wives were due in the weeks before the event. Indeed, I think they had the same due date, so both of us knew while we set the theme, but neither of us could say. Shmuel's done the Dad thing rather more (hats off to him) so I'll be the one looking more startled. Of course, the other difference is that he has a beard. It's harder to look startled with a beard.

Then, for those interested in my life in a furry hat, I charge back to London and change costume and language. After singing on one of this week's top-ten soundtracks (we're on Halo 4) on Friday 9th the London Bulgarian Choir will hit the ground running for their big gig of 2012. It's also the first gig for the choir after their leader had a baby** and we have a whole new way of doing a show. Stories throughout the first half, a wedding in the second half, presumably breastfeeding in the interval. We're at a gorgeous 500-seat venue in central London. In 2010, we sold it out to the last chair. This time, who knows... Anyway, here's a facebook event (one of at least four doing the rounds). Invite yourself.

Last thing on my list – I'm playing with systems again. I should have something a bit special for you, starting tomorrow. For now, let's see if I can get flash to embed here...

Cheers -


* Zeger van Hese is our program chair, and Julian Harty and Shmuel Gershon are the other committee people. Here, look.
** Same baby as my baby? Same baby. Our baby. Blimey.

Wednesday, August 15, 2012

Dog Show Quality

tl; dr – I'm ranting about arbitrary judgement

I've been enjoying the occasional horse ballet this Olympics. Back in my dim and distant, I learnt to distinguish some of the more obvious of the tiny communications and balances necessary to keep half a tonne of twitchy muscle in shape. At this remove, and especially when watching experts through my lo-res crystal bucket, I now see very little, but one can only admire the quality of the riding, the riders, and the ridden.

Dressage is exacting, it gives joy, and it's an extraordinary skill gained through dedication and talent. Once upon a time there might have been a working reason to get your horse to trot in goosestep, but in the arena those practicalities are subverted into competition. Criteria are set out, judges judge, competitors are measured and ranked. Everyone's an adult, everyone's there by choice, the horses love it – so what's the harm?

None at all, at a guess.

It's not quite the same in all competitions. For instance, competition requirements which set out strict parameters for dog breeds have led to a variety of unpleasant canine complaints as certain breeds of dog have become caricatures of their ideals, or as unmeasured and unexpected emergent properties have popped up (or out, as the case may be). Preset breed standards make it relatively straightforward to judge objectively – but the social act of judging the quality of a dog against those criteria can drive the breed as a whole, indeed the community of breeders as a whole, into unexpected and unwanted unpleasantness. But hey, if that's your bag, go to it.

My interest here is in software and systems and the people who make them. I'm using the situations above as a leaping-off point for an extended metaphor concerning software development and development processes. Be forewarned: I'm not about to respond to commentary from lovers of dressage or of dogs.

Occasionally, I'll be on site, or in a lecture, or talking with a colleague, or reading a paper, and the words "Dog Show Quality"* will pop into my head.

Dog Show Quality is where "Quality" is its own goal. A measured characteristic, where the measurement is made against an arbitrary set of criteria which don't have any necessary relevance outside the narrow limits the criteria themselves describe. You can measure it objectively, and you can use it as a goal for incremental change – but that goal may worsen your situation over time.

Something that is judged to have good Dog Show Quality is not necessarily rubbish. Indeed, it may in itself be best of breed. It's the measurement that is rubbish, and in being rubbish, leads to rubbish. It's not a bad thing to say "Every line of this software has a test", but "Every line of all our software must have a test" is no guarantee of goodness, and defining "Quality" as "Every line has a test" is just a small and noxious step away.

I tend to think Dog Show Quality most often when confronted with ossified software development process. Where signoff matters more than assessment, where knowing where the documents are is more important than what's in them, where spelling is more important than sense. When I talk to the Divisional Champion of Quality Assurance, and they care far more about adherence to process than whether their rotten product creaks, crashes, destroys data and chews the faces off their users' children, I'll think Dog Show Quality.

Dog Show Quality is mostly pointless and sometimes harmful. Making working systems is an exacting and skilled pleasure, but it's also done for money and directly affects real people. If, in the pursuit of quality, your judgement relies on measurements and you continue to measure the relatively unimportant, you're indulging yourself. Get over it, or get a green blazer with brass buttons and hand out rosettes. Your choice.

* There's nothing original under the sun, and this phrase has popped into my head unbidden for years. You may have read or written something similar. If you think I've simply written a cover version of your work without attribution, then I'm really sorry. I've searched, and I've not found your stuff. Please let me know, and if I've stood upon your shoulders to write this, I'll make a clear acknowledgement and link to your ideas.

Monday, August 13, 2012

Session timer

tl;dr - here's a thing to help you explore within the limits you set yourself

@eviltester* just asked me where I keep my session timer.

I don't appear to keep it anywhere.

Sorry about that. Fixed now.

If you want to use it**, try this page:

I use it*** to keep me on the rails. Here's how:

If I've decided that an exploration is worth 60 minutes of my life, I load it up, change the number in the circle to 60, and hit start. If the phone goes or someone arrives, I hit pause. I want to see how much of my gamble budget I've used, because that knowledge changes the game. Proportion matters more than minutes. Various numbers show up for reasons useful to me – elapsed time, excess time. There are two (pretty much identical) timers, because I run distractions as mini sessions.

I don't want to change the timer while it's going, and I don't trust my hands to avoid the error, so I've made it so one can't (easily) change the timer while it's going. It doesn't go bing because 1) I don't care that much if I go a bit long and 2) I have a timer, so I already know. If you change the system clock, it won't complain, but it won't be much use either, and midnight's not well handled****.
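
For anyone who would rather rebuild the thing than run Flash, here is a minimal sketch of the behaviour described above. It is written in Python, not the original AS1, and is not the timer's actual code. The class and method names are made up for illustration. One deliberate difference is that it reads a monotonic clock, so a changed system clock or a session that runs past midnight should not upset it.

```python
import time


class SessionTimer:
    """A pausable timer that reports how much of a time budget has been used."""

    def __init__(self, budget_minutes, clock=time.monotonic):
        self.budget_seconds = budget_minutes * 60
        self._clock = clock        # injectable, so the timer can be tested without waiting
        self._started_at = None    # None while paused or not yet started
        self._banked = 0.0         # seconds accumulated before the most recent pause

    @property
    def running(self):
        return self._started_at is not None

    def start(self):
        if not self.running:
            self._started_at = self._clock()

    def pause(self):
        if self.running:
            self._banked += self._clock() - self._started_at
            self._started_at = None

    def set_budget(self, budget_minutes):
        # Mirrors the post: the budget can't be changed while the timer is going.
        if self.running:
            raise RuntimeError("pause the timer before changing the budget")
        self.budget_seconds = budget_minutes * 60

    def elapsed(self):
        """Seconds of running time, excluding any paused stretches."""
        live = self._clock() - self._started_at if self.running else 0.0
        return self._banked + live

    def proportion_used(self):
        """Fraction of the budget consumed; goes above 1.0 once over budget."""
        return self.elapsed() / self.budget_seconds

    def excess(self):
        """Seconds spent beyond the budget, or 0.0 if still inside it."""
        return max(0.0, self.elapsed() - self.budget_seconds)
```

With a 60-minute budget, fifteen minutes of running time gives a `proportion_used()` of 0.25, and time spent paused is not counted. A second instance would cover the "distractions as mini sessions" case.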

If you work in a similar way to me, it may be handy. If not, it won't.

The timer is ten years old, and can be a bit of a pain with CPU as it's written in AS1 and published for Flash 5. The world and his dog uses iOS now, or HTML5/JS, or if stuck in flashland (as I am) AS3 and FP10+. Use as you wish. Feedback always gratefully received.

* also and elsewhere if you dig a little
** a more-useful-for-work version than the one I deploy in my workshops
*** if I use it. I don't use sessions for testing so much. I do use timed sessions for exploring documentation, investigating ideas, writing, prototyping, rehearsing and sketching. And (non-IT use) cooking, of course. For that, I use a kitchen timer.
**** Hint: it may be time to go to bed.

Monday, August 06, 2012

Contextual spelling

tl;dr simplicity can hide depth

Some people expect that, as they discover more, they get closer to finished. But exploration can instead take you further from where you started.

Got a Mac*? Try this...

I open a new TextEdit document. I type "contact sant". On my machine, the word "contact" is highlighted as a spelling problem. I note that if I change "sant" to "Sant" or "san", "contact" is no longer highlighted. Perhaps one gives more context, one less. If so, I've bracketed a sweet spot, which is good to know. I might come back to this.

Because I'm in TextEdit, I suspect I'm actually using Apple's Cocoa text. The same problem shows up in MacJournal. However, it does not show up in Mail or Evernote. I'll not follow this particular path of enquiry for now. Maybe someone else can inform me.

That's two paths ignored. What I want to do is to dig deeper into spelling. If I flip open the Edit:Spelling:"Spelling and Grammar" panel, the alternatives to "contact" offered are "kontakt" and "kontant". If I change "sant" to "santos", I'm offered** "contacta", "contacte", "contacto", "contacté" and "contactó".

I note that a drop-down box says "Automatic by Language", and to my mostly-monoglot eye, the first suggestions look more Germanic, the second more Latin. Though neither TextEdit nor MacJournal allows me to set the language of a text fragment, it's clear that the suggested spellings are from two different non-English dictionaries***, and that the choice of one excludes others.

Could it be that Cocoa text decides what language a text fragment is in before it goes off to get spelling suggestions? Is it really deciding from four letters?

Over to you.


* I'm still on Snow Leopard, v 10.6.8. You may not be. You'll know better than I.
** actually, nothing changes immediately. I need to close and re-open the "Spelling and Grammar" panel to see the new suggestions. Yeah, I'm not exploring that, either.
*** so that's three in total, smart-alec.

Tuesday, May 29, 2012

Looking for Trouble in Big Data

tl;dr - basing your decisions on aggregates? Tested that?

I was at another "Data Science" meetup yesterday – this one with a focus on the music business*. I was feeling overwhelmingly awkward, and once I'd managed to chip a front tooth (on a chip, of all things), I kept my head down. There was some interesting stuff, and some marketing guff. I made notes, and bits of it stuck.

Before I go further, here's information that'll make you smile if you're interested in the effect of piracy: The music industry can predict 90 days of legitimate sales from BitTorrent traffic in a given artist/work (Greg Mead, MusicMedia). And something to, perhaps, make you frown: Playing obscure, "long-tail" music on the radio really doesn't affect the degree to which most of us listen, mostly, to the most-played 1000 artists (Mark Levy). I may be wrong on these, so verify your facts before relying on my half-baked bloggery.

One of my sets of notes was on pathologies, some specific to "big data" (ie volume, changes in data format, data dirtiness, geographical and temporal reach), and I thought readers of this blog might find the list interesting. I guess you'll recognise some from preparing large-scale UAT data. Others you might recognise from the particular needs of aggregation. Clearly, I've not sanitised** my notes so they're all formed as pathologies; some are warnings, others solutions, but I think you'll be able to judge the underlying pathologies for yourself.

  • Aggregating timestamped data from global sources: how will you deal with different timestamps? To convert everything to universal time, you'll need to capture location – and even then does it make sense to compare trends following an event that happens at 11am London time with figures from LA at 3am, or Tokyo at 7pm?
  • What if a data source changes information retrospectively (ie deleted tweets)?
  • Trying to fit the real world into hierarchies where an entity has a single parent may be a lost cause. Think of tracks re-released in box sets, of the line-up of the SugaBabes (or, come to that, the Supremes), of a singer who performs at a festival on his own, then with his band, then as a surprise guest drummer. Specific one-to-many and many-to-one relationships may not serve your demographic.
  • Duplication; an artist, single, album, album track and best-of might all have the same name. Living in a Box, anyone?***
  • De-duplication; The London Bulgarian Choir and London Bulgarian Choir are the same motley crew, though clearly the name data isn't precisely the same. It's not just the obscurely-choral edge cases - happens to Smashing Pumpkins too. And Sean Puff-Diddlydee.
  • Data from the past may be utterly different from that captured today.
  • Neat, dense, obscure, meaningful codes are unique and apparently useful. But they're unreadable by the humans who enter them, and therefore subject to a whole bunch of pathologies; finger trouble, un-spotted mistakes, misuse and misunderstanding. We were told an apocryphal tale of a country who used the same 10 codes, over and over again, for all their releases.
  • Mapping location to IP address is not consistent if one is consuming data spanning years.
  • If you're going to visualise, you're going to graph. If you're going to graph, you're going to count, so you'd better make sure you're counting countable things. Turning real interactions into countable stuff generally involves some machine learning or emergent-behaviour tuning, so you're into a whole new barrel of pathologies there.
  • Visualisation is no fun without the facility to compare, or (less so) to give context.
  • Playing with data is bad. One needs a purpose.
  • Not playing with data is bad. One does not know what one may find.
  • Batch, hourly summaries are too slow – if you don't design your architecture for real-time, you may be left out in the cold.
  • Thinking that individual items of data have value, when the real information is in the relationship between data.
  • There's no single registry of meta-data, so there are clearly meta-duplication troubles.
  • Don't assume that the same comment means the same thing on Facebook, MySpace**** and so on. Different channels have different semantics.
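
The timestamp point at the top of the list can be made concrete with a short sketch. It assumes fixed summer-time offsets and an arbitrary date, both chosen for illustration; a real pipeline would more likely use the IANA time zone database. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed summer-time offsets, assumed for illustration only.
BST = timezone(timedelta(hours=1), "BST")    # London, summer
PDT = timezone(timedelta(hours=-7), "PDT")   # Los Angeles, summer
JST = timezone(timedelta(hours=9), "JST")    # Tokyo


def to_utc(local_naive, zone):
    """Attach the source's zone to a naive timestamp, then convert to UTC."""
    return local_naive.replace(tzinfo=zone).astimezone(timezone.utc)


def local_views(instant_utc, zones):
    """Show a single UTC instant as wall-clock time in each zone."""
    return {
        zone.tzname(None): instant_utc.astimezone(zone).strftime("%H:%M")
        for zone in zones
    }


# An event logged at 11am London time (the date is arbitrary).
event = to_utc(datetime(2012, 5, 28, 11, 0), BST)
views = local_views(event, [BST, PDT, JST])
```

For this instant, `views` holds 11:00 for London, 03:00 for Los Angeles and 19:00 for Tokyo, the same spread the bullet describes. Converting to universal time makes the rows line up, but it does nothing to make 3am behaviour comparable with 11am behaviour.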
* Emphasis ultimately more on business than music.
** monotonised
*** No? Just me? I can still hum it, but I've never owned a copy nor knowingly chosen it to listen to.
**** It's still there - and actually saw an upturn in fortunes last year. As far as anyone can tell. Some bits of the industry seem happy to be 18 months behind in their analysis.

Thursday, May 17, 2012

LEWT 11 is on Saturday (that's 19 May 2012)

LEWT is the London Exploratory Workshop on Testing. For those of you who don't know, it's a peer conference. Details are at

LEWT11 is on Saturday, and we have two spaces available if you'd like to take part. As ever, you'll need to prepare a very short talk / topic for discussion. If you have contacts or colleagues who would be interested, they should contact me swiftly.

We also have two further spaces that we keep open for people with less than two years' experience in testing. Neither is taken. If you know someone who would find it useful to take part, they should also get in touch, but there's no requirement to bring a talk.

[Aside: remarkably, the last LEWT was in March, and I ran a similar thing at Let's Test. Peer workshops à go-go...]

Sunday, May 13, 2012

Apple's Keynote has synchronisation problems - here's a short diagnosis

EuroSTAR virtual is coming up, and I'm not keen on presenting live without an actual audience. Brrr.

So I decided to pre-record my talk. Turns out the audience want slides, not a talk, and I assumed that I would put the slides together in Keynote, and click along to the pre-recorded sound of my own voice. However, the client wants PowerPoint, which I don't have. I could play it live on the day, but I was doubtful whether gotomeeting would take audio from a prerecorded source along with live video, and I certainly didn't want to sort all this out on the day and to deadline. I decided to export the audio and slides to a quicktime movie. Keynote says it can, so I assumed it could. And we all know where that leads.

If you've got a pre-recorded backing track, Keynote gives you a one-shot go at recording all the slide transitions as it plays. The talk is around 30 minutes long, so I needed to put the rest of my world on hold and tap along to the sound of my own voice for half an hour. If I made a mistake at the end, I'd need to do it again from the start. This is clearly not ideal*, and I wondered how well the supporting technology might work.

For the sake of this posting, let's ignore the ever-yawning pit of despair as I listened to my stumbling and ill-chosen words over and over again, but I challenge you to listen to a talk of your own and not cringe. Let's also ignore Keynote's habit of telling you the recording is 'out of sync' whenever you change anything, at all, and not withdrawing its warning when you undo the changes. Let's just say that, some time later, I had a recording of clicks to go with the recording of my voice, and that Keynote wasn't being neurotic about the sync.

I exported to a Quicktime movie. It took, for a 30-minute movie, about an hour. Bear in mind this is mostly still, white slide-deck - and even my grimy old kit converts HDTV to 720p H.264 Quicktime at double speed. On playback, the slides ended well before the words, but I didn't watch the thing through to see where it had gone out - it was correct at the start, way out at the end, and it seemed to get worse as the deck went on. Putting that one down to experience, I decided that a plausible cause was that I was asking my kit to do too much, or that a visually-heavyweight regular request (I prefer to dissolve from one slide to another) caused it to go out of sync by a little, often. I took out most of the dissolve transitions, turned off everything else the machine was doing, re-recorded all my clicks, tried exporting again.

And it was a bit better, but not right.

Now I could have gone on incrementally changing, but I'm a tester, and I don't believe my own hypotheses. I exported again, immediately. Another hour. I went for a bike ride, came back and cooked. The second was differently out, and it was worse. I watched, as far as I could bear it, flipping back and forth using a binary search approach to isolate moments where things went way out. In the first, a particular and simple transition did not complete. It ended late – subsequent slides tried to catch up, but went far too early. In the second, a short video that should play as soon as a slide was shown started after a long pause. I made a few changes to those bits, ignored the now meaningless** sync warning, and exported again, just to see. The third export was rubbish, as you'd expect, but differently rubbish; the problem transitions worked this time, but subsequent slide transitions were more banjaxed than before. Oddly, the sync wasn't as far out at the end.

Noticing that you can get Keynote to simply play back automatically without export, I re-recorded (a familiar process by now) and played back. Remember, I was using no export – just automatic transitions as if a ghostly machine hand was pressing buttons, not me. The presentation was borked in the first three minutes; one slide hung on far too long. I didn't stick around to see what else was going to go wrong.

I felt I should have tried this earlier. Clearly, no matter what was recorded, the playback itself was a problem – though not one related to CPU usage or contemporaneous export. And not one related to the rate of slides; Keynote happily works for me, but not for pre-scripted changes. The problem didn't consistently make the slides faster than the audio, or vice-versa.

I flipped open the source*** – the recorded timestamps were fine. Indeed, I've used them as a script for whichever poor AV person at the client gets the job of clicking through my deck. This information confirmed my hypothesis.

My diagnostic process, such as it was, was hampered by how long it took to set up an experiment and get the results (around two hours for each). I didn't bother trying to find a minimum viable set as I was trying to find a workaround, not trying to work out what was going wrong. I did try to eliminate some plausible problems, and I got good information while trying to get beyond the emergent behaviour, and into the internals of what was going on.

If I wanted to carry on, I could create a selection of Keynote files to see whether playback is a problem in all circumstances, or just some. I could consider a disconfirming hypothesis (maybe it's not playback, maybe it's my machine and playback, or maybe there's another set of corrupted timestamps that I've not looked for). I could waste more days of my life trying to get to the bottom of it. Apple, if you're out there, buzz me and pay me and I might just take it on.

But for now, I'm done.

I've taken the timings as a script. I've exported the slides to an obscure Quicktime format that contains unstoppable audio and all the transitions. It plays automatically, and happily moves on at a human-driven keypress. It's the right quality, the right audio, the right collection of transitions in the right order. I hope it suits the client – but I'll wait to see.

We already know that PowerPoint is no fun to work in. However, if you need to work on a self-playing prerecorded presentation, I suggest that Keynote is as much use as a banana lightsaber.

For an inspiring alternative, watch the video for Bob Dylan's Subterranean Homesick Blues. True fact, trivia fiends: It was recorded in London behind the Savoy Hotel. La di bloody da.

The EuroSTAR virtual conference is on Wednesday. Be there, or be somewhere else. Sod that - be there, and be somewhere else. I shall be.

* I've checked the manual, and there is a way of pausing, but no straightforward way of undoing or moving a mistaken slide change.
** If it's wrong when it doesn't warn you, why pay attention when it does?
*** deep breath: I'm not using packages (it's a silently-changed option in Keynote 09), so the saved file is obscured. For Keynote, thankfully, there's a way out: change the file extension to .zip, and unzip it to a folder. Inside the folder, I looked into the ugliest large-ish file (index.apxl, here), recognised it as not-really-XML (the file chewed up textmate and spat it out), found timestamps tagged with narration-event that seem to correspond to my expectations, clipped that bit out and fiddled with it until the events, slides, builds and timestamps were unfogged and human-readable. I may be a Mac user, but that doesn't mean I'm so much of a numpty I can't get down to business.
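
The unzip-and-look step in that last footnote could be scripted roughly as follows. This is a sketch, assuming a single-file Keynote '09 document that opens as a zip archive and contains `index.apxl`, as described above. The function name and the `context` parameter are invented for illustration. It assumes nothing about how the timestamps are laid out around `narration-event`, so it does a plain text search, not XML parsing.

```python
import zipfile


def find_marker_snippets(keynote_path, member="index.apxl",
                         marker="narration-event", context=80):
    """Return the text surrounding each occurrence of `marker` inside `member`."""
    # zipfile goes by the file's contents, so no rename to .zip is needed here.
    with zipfile.ZipFile(keynote_path) as archive:
        raw = archive.read(member)

    # Decode leniently: the post describes the file as not-really-XML.
    text = raw.decode("utf-8", errors="replace")

    snippets = []
    position = text.find(marker)
    while position != -1:
        start = max(0, position - context)
        end = position + len(marker) + context
        snippets.append(text[start:end])
        position = text.find(marker, position + len(marker))
    return snippets
```

Each snippet is a small window of raw text, which is enough to eyeball the recorded timings without loading the whole file into an editor. Turning those windows into a clean click-script would still be hand work, as it was in the post.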

Monday, April 02, 2012

Peer workshop linked to Let's Test

Would you like to come to a LEWT-ish peer workshop on the Sunday before Let's Test, in Runö?

Contact me if you're interested.

Better yet, go here to find out more and to sign up:

Tuesday, March 27, 2012

Bit of a gap?

A few of you have gently but reasonably taken me to task for apparently dropping the "How to Manage Exploratory Testing" series right in the middle.

Well, this is a blog; it's more of a resting place for the half-baked than fertile soil for the half grown. You may be relieved to know that there is plenty more to come*.

Let's see, however, what the readership want. I have stuff on

  • The joy of reviewing 407 abstracts for EuroSTAR – tools, pathologies, the lessons I took away about how I might write a better abstract.
  • How Fear influences Strategy, and other hand-wavy ideas
  • Lessons from the Stanford AI course, and what a (moderately) sapient tool based in current technologies might help us to do
  • More Black Box machines (these ones record stuff, I hope)
  • Something very interactive, very testing, very systems, but nothing like the black box things.
  • How to test Spreadsheets (vital, often overlooked, perhaps now getting attention. My potential contribution has been half-baked for years now, so help me push it over the edge)
  • What it felt like to read the Weekend Testing sessions on something I wrote.
  • Opportunity vs control, secrets vs lies, and the extension of the imagination. Will involve my hands waving fast enough to tingle.
  • Peer relationships in peer conferences, and how I feel about LEWT
  • The next in the Managing ET series. I'm particularly looking forward to the ones on SBT, the Enquiring Metricator, and Lenfle. 

Which of these float your boat?

* but let's hope it's not as poorly-written as the wretched 'off-piste' entry. Or, indeed, the preceding metaphor in the § above. Blogs may be hardly-read, but that's no reason for them to be awful to read.

Tuesday, January 31, 2012

Comedy bug, and upcoming events

So I saw this, and I thought you should, too: Just an Ordinary Day in Skyrim. Particularly funny bug at 20s or so in.

While I'm here*, the Exploratory Testing workshops in Oxford went very well, thank you. Those of you in the know will already know that I'm organising a 2-day workshop on exploratory testing techniques in Amsterdam on 12 and 13 April. Contact me if you want to be kept up-to-date (and to get the earliest-bird discount). Details to hit the web soon. There's been a bit of a hiatus in the 'lots of ways to manage ET' series, but the rest are in the works. Expect to see one at most this week, none at all next week. So much for writing a new post daily; turns out day-to-day stuff gets in the way.

If I was around, I'd be setting off to one of the following low-cost local events;
Paul Gerrard's UK Test Manager Forum is next week, in London. 2 days (tutes, talks and dinner) for just under £300 inc VAT. Maybe, on reflection, that's not a particularly astonishingly low-cost event. The TMF's track format has a keen eye for conversation rather than lecture, which is all to the good, and so it's hard to pick out a particular recommendation. In the tutorials, look out for Dave Evans' hands-on workshop 'Specification By Example, in Practice'.

Chris Ambler is organising TestFest in Brighton on Feb 22. Tickets are £60. As he describes it, it will have not one, but two parallel interactive elements demonstrating test tools and approaches. Twin TestLabs anyone? The event has a clear focus on Brighton (and the UK's) gaming industry, and is as far as I'm aware the first time that a testing conference has paid the large and creative community much attention beyond a couple of track sessions. There's lots to discover from game testing (see bug above), and it looks great. Anyone with an interest in where we've come from will want to know that Geoff Quentin** will be in evidence.

++Missed this in the initial posting... ++
TCL is running a Zappers event in London, also on February 22nd. Zappers is a great opportunity to meet and talk and test. The event is free, and TCL don't simply lay on food and beer, they also set up a software/system target or two, and award prizes for finding bugs. TCL organise regular Zappers events all over the world, and huge kudos to them for doing so.

++Missed this too... ++
Tony Bruce and Nathan Bain set up the Agile Testers meetup a few years ago, and Tony's persistence has made it a monthly must-do. The next is on Feb 27th. Breaking the mould, (the meetup is generally in a pub, and is more about talking to each other than being spoken to) it's at SkillsMatter in London, and Chris Guest from Microsoft will be talking about Powershell.

++And another...++
The Agile Tester's Meetup is back to its usual format a couple of days later. From 5:30 on Wednesday 29 Feb (leap evening, so perhaps there will be a tale to tell about bugs of the day from iPhone owners) at the Shooting Star opposite Liverpool Street station. Free entry, good chat. Sign up on LinkedIn or (my preference) Meetup.

The SIGiST is doing its thing on 7th March for £132. Thankfully, this time you've got Lee Copeland, Julian Harty and Lloyd Roden, all of whom are worth your time. As a special treat, the usually-fascinating Allan Kelly gets a 15 minute slot too.

Ministry of Testing (the ever-present STC in ninja disguise) are running TestBash in Cambridge on March 23rd. It's a day long, and costs £99-£150 depending on how fast you move. Among the excellent selection of speakers, Steve Green and Alan Richardson are two of the best practicing and practical testers I've worked*** with. Not only this, but both have a keen focus on the exploratory end of the spectrum, and both have novel and well-thought through things to say.

* I'm not going to be nerd-sniped into diagnosis, but it seems that you won't have seen this posting until Monday 6th, though it's dated and was written January 31st. Dave's probably already packed his bag for the tutorial.
** Geoff played a vital part in initiating a number of things that now seem a necessary part of our industry (the SIGiST, EuroSTAR, BS7925 and the awful exam, various training organisations and approaches). If that list seems a little old-school, remember that's what initiating means. One reason for the longevity of these bodies is that Geoff built their sustaining communities by seeking out participants whose views were at odds with his own, but who in some sense matched his keenness to communicate and engage with other testers. If the SIGiST, exam and so on now seem moribund, that is perhaps because of the lack of diversity of their current participants. Personally, Geoff has been a crucial and much valued influence – and we taught a class together that was built in part around our fundamental disagreements.
*** Worked with as in: found real bugs in real stuff for real money, working with real people for months at a time. That's what they do.

Tuesday, January 17, 2012

Known ways of managing ET #05 - Off-Piste Testing

tl;dr – scripted guides may not help exploration

Team leaders tell me ‘My testers use manual testing scripts*, but I want them to do more than just plod through – I want them to find problems’. This seems a reasonable idea, but is fundamentally incoherent; ‘Finding a problem’ is not necessarily a complementary goal to the action of ‘following a script’. However, it happens. Let’s look at two common approaches. I’ll call them Iron Script, and (by way of contrast) Marshmallow Script.

Iron Script
The scripts remain authoritative. Testers are expected to deviate at agreed points, typically by changing their data within reasonable limits. Adding new events or changing the path is frowned upon; in extremes, a bug that is found using a non-standard path may be rejected, and the tester asked to reproduce the bug on the accepted path. If you can get some kind of pre-approval for diversions taken through error states and validation checks, you’ll open up a bunch of interesting routes whose valuable information might otherwise be used against you by project pedants.

It’s my experience that scripts in these situations frequently mirror end-user activity, and that the steps focus on triggers at the expense of observations. If your testers must run through the script, then they must, but don’t let them get dull. Remember that you are a team of testers, not users, and that you can still get unassailably-helpful information from querying the database, watching the CPU, intercepting transactions, using a diff tool on the filesystem, or any other neat trick that takes your fancy. Constraints breed creativity.

Marshmallow Script
The scripts are a guide, a collection of hints or waypoints. Testers can deviate wherever they want, using the scripts to get to interesting points, or as a checklist to verify their expectations. The scripts act as fat charters, and by giving their names to paths, help testers to talk about what they’ve done and what they’ve found. This isn’t bad, as far as it goes.

However, the approach puts happy paths – reliable routes that demonstrate limited capability – at the core of decisions about what to test and how to test it. This emphasis can be a terrible drag on the swift revelations that might be desired from unfettered ET. It can wastefully restrict your testers’ imaginations, and seems to reinforce manual testing to the detriment of small cheap tools.

I tend to find that these approaches exist in parallel, but may not be acknowledged as such. It is powerful – sometimes, too powerful – to wonder out loud whether the team as a whole is looking to move away from their scripts or to add to their scripts. This can turn out to be emotive enough to be ignored in polite company; bringing it up in public can make people very impolite indeed.

One might question why the team is writing scripts at all. Scripts are expensive to make and hard to maintain. If they exist to give the testers a framework to explore requirements and product use while writing them, other frameworks might work just as well. If they are primarily a guide for novices or a crutch for the feeble, then perhaps one needs to rethink one’s approach to learning, and possibly to hiring. If they are primarily a way of recording work, then why not record the work with something more unambiguous, or more searchable? If they exist because an environment is hard to automate, then I would wonder if everything scripted is quite so hard to automate. If they exist to keep testers on a leash, then I have no further questions.

These are, however, rationalisations of a generally irrational position. I think the answer lies not in conscious choice, but in habit. The approach seems common in situations where budgets are only available for work that can be measured with reference to requirements and scripts, yet where the test team and its decision makers know that their survival-in-current-form requires fast information about emerging trouble. Maybe it’s endowment bias; if no one wants to chuck away all those scripts they’ve been working on so hard, then the scripts will remain central to the work no matter what the work is. In the first, future plans don’t match current practice. In the second, neither does the past. I often see both. Is it any wonder that the team lead’s goals might not match their means?

As a skier**, I’m drawn to the term ‘Off Piste’, and the term ‘Off-Piste Testing’*** seems a popular metaphor for this approach. Between the mountain top and the snug chalet in the valley floor, there are many paths: some groomed, marked and filled with tourists; others quiet or exposed, with cliffs and bears. There is an implication that off-piste is only for the most skilled, keen and well-equipped skier. The implied kudos is eagerly lapped-up by testers, and off-piste testing can be used as a motivator with two caveats; status should be earned through good work, and good information can be gained from diverse approaches. Whatever the rules of the mountain might be, it is perilous to restrict off-piste testing to your elite.

More importantly, off-piste is still downhill. Scripts, whether used as hard or soft guides, bias testers towards a set of activities that most typically follow whatever actions are expected to be valuable to the user, system or owner. These activities are not the only ways to find problems. Those who manage exploratory testing by running after scripts will handicap their team.

* For this blog post, script means a step by step set of instructions to be read through and followed manually by a tester. Some of you may be aghast that such things are still in use. Some of you may see no alternative. For each of you, please believe that each position exists.
** Note to prospective clients – if you're near a properly-skiable mountain and book me to come to you close to a weekend during the season, I may have a seriously-tasty winter offer for you.
*** ‘piste-off testing’, anyone? Just me? Hey ho.

Tuesday, January 10, 2012

Uncommon Ways of Managing ET #04 - Post-partum Labelling

tl;dr – tl;dr your ET notes to see where you've been

I’ve worked with plenty of testers who don’t timebox their time, don’t set out a charter before testing, and don’t do formal debriefs. Clearly, they’re not following session-based testing, but that doesn’t mean they’re necessarily doing bad exploratory work. Indeed, some of the most brilliant exploratory testers I’ve worked with are fully able to do all these things yet choose not to for much of their exploratory testing.

Personally, I almost always have a timebox, and find I prefer my results (but not my activity) if I make good notes. I can find charters trivial or restrictive, and debriefs can lead me to remember only the edited highlights of my exploration – so if my debrief sucks, my memory can be less useful than if I’d not debriefed at all.

Charters, timeboxes, notes and debriefs have a value to the team and the project as well as to the tester. If the team habitually relies on them, but an individual works best without them, then you’re faced with a choice of whether to force that tester towards an ineffective discipline, or whether to damage team performance. Which is no fun.

This then is a simple and obvious alternative, but I’m not aware of much that has been written to describe it. Nonetheless, I’m sure that many readers will recognise the activities below, and I can’t claim that this is in any way novel. Perhaps if no one’s written about it, it doesn’t seem legitimate, so no one writes about it. Perhaps I’ve just forgotten what I’ve read. Anyway, the following is a collection, and to that extent an imaginary extension, of stuff that has worked for me. I’m going to call it Post-partum labelling*. If you’ve got a better name, or know where someone else has named it, super. Comment away.

After a chunk** of testing, the exploratory tester describes their work in a sentence or two. They write this up in public. For example:
8 Jan – 60 minutes: Ed used a javascript profiling tool to analyse the web app.
8 Jan – 120 minutes: Sue spent 2 hours exploring reported instabilities related to switching sort order while autosaving.
9 Jan – 180 minutes: Brin and Rudi spent 30 minutes watching two users interact with the demo app for the first time, and spent the next 60 minutes reviewing and annotating video of the sessions.
10 Jan – 180 minutes: Jonno spent 3 hours on batch input, generating 3088 files that together contained all possible orderings of 5-step transactions.

This works well if you’ve set aside time for experienced and self-directed explorers to test. If you’re expecting a terse diurnal list like the one above, you might find it to be a good fit with daily news. It’s perhaps not such a good fit if you’ve got testers who have problems with focus, or if your test approach means that your list grows by more than half a dozen lines a day.

The list won’t help you know where testing is going, but it’s great to help you know where it’s been. Everyone in the team can see who explored what and when, so you know who to talk to, you know what’s been hit recently, and your memory and understanding of the system’s recent history has enough to help fill in the blanks. The team knows what it is paying attention to, and knows where individual interests lie. I think this is generally more useful than having an obscured testing genius bringing the system to its knees in interminably unfathomable ways.

Writing a post-partum label helps me put my most recent test activity behind me, and allows me to think diversely as I enter the next round. I like knowing that I’ll need to write a public one-line summary of my hours of discovery; it helps maintain focus.

While I like a timebox, you might not. I wouldn’t insist on timeboxes if I was doing post-partum labelling. The people in the team know the budget, and they’re already trusted. The exploration is done when it’s done; forcing a timebox is a silly micromanagement. However, if people on your team are prone to pissing away their time and don’t embrace timeboxes or similar tools as part of their personal discipline, they’re probably not the best people to be doing post-partum labelling.

It’s time to change approach when your post-partum labels turn into “looked at login, again” or “checked out last week’s bugfixes”. If your label can be made before exploring, then it probably should be. Post-partum labels arrive after, and may not fit what you would have expected at the start. If you’re exploring, this is often a good thing.

Please, don’t get the impression that the label is an adequate substitute for notes. Sometimes, awfully, unfortunately, that’s what it is. Try to avoid this.

I’ve used similar approaches when I’ve been the exploratory addition to a team that has been relying solely on scripted or massive and confirmatory tests. I found it helpful when we had more test ideas than we could easily manage, and yet had target pathologies, observations and triggers that urgently called for our attention. Post-partum labelling helped me fit my work with other explorers and the rest of the team, helped me gain trust by offering visibility, and acted as a trigger and conduit for other people to bring me ideas. It let my team spin very swiftly back through a couple of weeks of exploration, identifying which set of notes might hold relevance. It gave explorers who weren’t happy with SBT a way to fit into a team that was trying to gain the disciplines of SBT. It wasn’t much good for assessing coverage. It didn’t link to requirements. It was rubbish for steering. But I liked it.

I’m very tempted to extend the idea further. I want to capture the information electronically. I want to add tags, to allow me to analyse where we’ve been spending time. I’m keen to describe problems found. I’d like to try using Stefan Butlin’s interesting TestPad web app (and I shall, it’s neat). However, these adjustments change the emphasis of the list. Have a look:
8 Jan – 60 minutes: Ed used a javascript profiling tool to analyse the web app. [code, performance, UX] We’re spending plenty of time inside check_constraints(), which looks recursive.
8 Jan – 120 minutes: Sue spent 2 hours exploring reported instabilities related to switching sort order while autosaving. [instability, UX, autosave] She found a reproducible crashing bug, logged a couple of UX issues, and identified potential exploitation.
9 Jan – 180 minutes: Brin and Rudi spent 30 minutes watching two users interact with the demo app for the first time, and spent the next 60 minutes reviewing and annotating video of the sessions. [UX] We identified and logged UX issues around context menus, the hiding menu bar, and error messages.
10 Jan – 180 minutes: Jonno spent 3 hours on batch input, generating 3088 files that together contained all possible orderings of 5-step transactions. [batch, instability] The system correctly accepted 182, correctly rejected 2900, but hung on 2 that it should have rejected. No bugs logged yet, as we think this may be to do with a mistake in the configuration data in the test system.

Do you find yourself skipping over stuff now? I do. It’s as if it’s all too much to hold together. You’ll be keeping this information somewhere else, too, I expect – and I think that’s where it should stay. Keep the list single-purpose. You’ll find it lives in people’s heads more easily and more consistently, becoming part of the shared consciousness of the test team. And how cool is that?
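
If you do keep the tagged detail somewhere else, asking "where have we been spending time?" is a small job. Here's a rough sketch in Python, assuming entries shaped like the tagged list above. The entry text is abbreviated, and the line format, the regular expression and the function name are my own illustration rather than anything a tool provides. Note that an entry's minutes are counted in full against every one of its tags, so the totals overlap.

```python
import re
from collections import Counter

# Hypothetical entries, abbreviated from the tagged list above:
# "date – minutes: summary [tag, tag, ...]"
ENTRIES = [
    "8 Jan – 60 minutes: Ed used a javascript profiling tool. [code, performance, UX]",
    "8 Jan – 120 minutes: Sue explored instabilities while autosaving. [instability, UX, autosave]",
    "9 Jan – 180 minutes: Brin and Rudi watched two users, then reviewed video. [UX]",
    "10 Jan – 180 minutes: Jonno spent 3 hours on batch input. [batch, instability]",
]

# Assumed line shape; anything that doesn't match is skipped, not guessed at.
LINE = re.compile(r"^(?P<date>.+?) – (?P<minutes>\d+) minutes: .*?\[(?P<tags>[^\]]*)\]")


def minutes_per_tag(entries):
    """Sum the logged minutes against each tag an entry carries."""
    totals = Counter()
    for entry in entries:
        match = LINE.match(entry)
        if match is None:
            continue  # untagged or free-form line
        minutes = int(match.group("minutes"))
        for tag in match.group("tags").split(","):
            totals[tag.strip()] += minutes
    return totals


if __name__ == "__main__":
    for tag, minutes in minutes_per_tag(ENTRIES).most_common():
        print(f"{tag}: {minutes} minutes")
```

The tally belongs with the detail, not on the public list; the list itself stays a sentence or two per chunk.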

* made-up name. Obviously. Post-partum is a Latin term used to refer to the mother after giving birth (as opposed to post-natal, which apparently applies to the baby). You know what a label is. I want to get across the idea of a tester giving their work a unique and meaningful title, once it’s done.
** a chunk? What’s a chunk? I find that my mind merrily organises memory and activity, and groups the similar and temporally-close. If you have control over your interruptions, you’ve come to the limits of your chunk when you choose to change task. Sometimes, you don’t choose consciously. It’s still a chunk. My chunks of time testing are often hours. Writing, just minut… hey! A squirrel***!
*** I can see six, right now, in the evergreen oak outside my window. No, seven. Five. A parrot!

Thursday, January 05, 2012

Known ways of managing ET #04 - Set Aside Time

tl; dr – scheduling ET changes the game. This is how to cope.

The team decides to budget a fixed amount of time for exploratory testing. Of course, that’s not the end of the story. This post is about what happens next.

First some background and disclosure: This sort of decision has been the trigger for a fair proportion of my client engagements since around 1998*. So I see this more often than I might. Generally someone on the team has eloquently persuaded a budget-holder that the project will find value in time spent exploring**, and I get to work with fresh, enthusiastic and newly empowered testers. So I find the situation delightful. Who wouldn’t? I’m sure these two complementary perspectives colour my experiences, and I expect they have coloured my writing too.

Budgeting a chunk of time to explore what you’ve made is a fine idea. As a management approach, however, it’s a bit hands-off. Sometimes, neither sponsor nor enthusiast has worked out the details of what is expected from whom and by when. More particularly, although there’s often a great sense of adventure, there’s not much consideration about the strategies for coping with inevitable change. Here then are some of those changes, a few related pathologies, and some strategic and tactical tweaks that have worked for me.

Dealing with lots of ideas

There will be a monstrous growth in the number of testing ideas; test teams have no problem coming up with new ideas about what and how to test. The practical problems lie in picking the best, dropping the newly-redundant, classifying and slicing and managing the set. Picture people trying to stuff an octopus*** into a rubber sack. This is a natural part of testing; one of the characteristics of wicked problems is that you can’t pin down the solution set.

As with all monstrous growths, the quantity of test ideas will be limited by external factors. If you’re keeping ideas on sticky notes, you’ll run out of wall space – which is perhaps better than putting everything in an expandable database that no one bothers to read. The most usual limits**** are to do with patience and attention span. When working within the set, the team will learn crucial filtering and throwing away skills, but will also run up against endowment bias; it’s hard to let go of something you own, and harder to let go of something you’ve worked hard to gain. There is likely to be conflict or denial – all the more so if the team has to date been under the consensual hallucination of potential completeness which underpins script-only approaches.

The growth in quantity may not be matched by a growth in diversity or quality of ideas. This is only made worse by a testing guru (sometimes me) who sees it as his or her job to come up with new ideas. A good way to defuse some of this problem is to encourage the team to not only add to the ideas themselves, but to challenge ideas and remove them from play. If you make your influential tester the champion for diversity or quality in the collection, that can help too. I’ve often seen teams hamstrung by an inappropriate attraction to combinatorial collections; given a set of test ideas, someone draws up a table of (say) data type against input, and works through, left to right, top to bottom. Stop this in its tracks; tables and ordered progressions indicate a single real idea, one which is ripe for heavy optimising with a tool. If you can’t automate, decide which combinations are most important to do right now. If you can’t prioritise or optimise, hit the combinations randomly and learn from them as you go.
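
Hitting the combinations randomly needs very little machinery. As a minimal sketch, assuming a table of data type against input like the one described above (the axis values and the function name are made up for illustration), this builds every pairing once and hands them back in shuffled order, so you can take the first few and learn from them rather than marching left to right, top to bottom:

```python
import itertools
import random

# Hypothetical table axes; substitute whatever your own table holds.
DATA_TYPES = ["empty", "integer", "unicode text", "very long text"]
INPUTS = ["login form", "search box", "batch upload"]


def shuffled_combinations(data_types, inputs, seed=None):
    """Return every (data_type, input) pair exactly once, in random order.

    Passing a seed makes the order repeatable, which helps if you want to
    pick up the same walk through the table after an interruption.
    """
    combos = list(itertools.product(data_types, inputs))
    random.Random(seed).shuffle(combos)
    return combos


if __name__ == "__main__":
    # Take a handful from the front; stop when you stop learning.
    for data_type, target in shuffled_combinations(DATA_TYPES, INPUTS, seed=2012)[:5]:
        print(f"try {data_type!r} in {target!r}")
```

Nothing here decides which combinations matter; it only stops the order of the table from deciding for you.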

I like to constrain the numbers of test ideas in play by saying how much time an idea might need, and keeping the total amount of time needed under some reasonable limit. Although I can get as sucked-in by combinatorials as the next tester, I find that I tend to prefer diversity over depth. I try to temper this bias by paying attention to the agreed strategy and the current context – which means it’s good to have talked about what is valuable to the project. If I find myself pouring out ideas in some consultantish denial-of-service attack on the capabilities of the team, I’ll find a corner and write until I’ve calmed down, then see if my list triggers people to put their own ideas up, rather than fill the wall with my collection.
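
The time-limit constraint can be sketched in a few lines. This is an illustration under my own assumptions, not a description of any tool: the Idea type, the estimates and the 300-minute limit are all hypothetical. Ideas are walked in whatever priority order the team has agreed, and an idea goes into play only if its estimate still fits under the limit; the rest are parked.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    title: str
    estimate_minutes: int  # how much time the team thinks the idea might need


def ideas_in_play(ideas, limit_minutes):
    """Split ideas into those that fit under the time limit and those parked.

    Ideas are considered in the order given (assumed to be priority order).
    Returns (in_play, parked, minutes_used).
    """
    in_play, parked, used = [], [], 0
    for idea in ideas:
        if used + idea.estimate_minutes <= limit_minutes:
            in_play.append(idea)
            used += idea.estimate_minutes
        else:
            parked.append(idea)
    return in_play, parked, used


if __name__ == "__main__":
    wall = [
        Idea("sort order while autosaving", 120),
        Idea("profile the web app", 60),
        Idea("batch input orderings", 180),
        Idea("first-time user observation", 90),
    ]
    in_play, parked, used = ideas_in_play(wall, limit_minutes=300)
    print(f"{used} minutes in play:", [idea.title for idea in in_play])
    print("parked:", [idea.title for idea in parked])
```

A parked idea isn't thrown away; it's simply not competing for attention until something else leaves play.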

Dealing with fast feedback

Exploratory testing makes some feedback much faster, so a commitment to exploratory testing will change the feedback characteristics of the project. If there is a personal connection between the designers, coders and testers to support this feedback, the consequences can be excellent. Problems get fixed quickly, the team can develop a better understanding of what quality means on the project, and testers swiftly find out what they need to supply in a useful bug report. I’ve often seen palpable increases in respect and communication, which leads to trust and an overall greasing of the machinery of a development team.

Teams who have built massive confirmatory automated tests are used to fast feedback, but of a very different flavour from that provided by exploratory testing. Feedback on novel modes of behaviour and unanticipated failures can deeply challenge a team who thought their design was watertight. I’ve been told that bugs aren’t bugs unless they’re found by a customer, and that any bug without an immediately-obvious fix is a feature. I see both of these reactions as denial, and an indication that I should have prepared the ground better for the kind of feedback I can offer from exploring the product. The situation is made much easier if you are working with a real customer or user, rather than acting as proxy. The cry of pain might also indicate that your testing priorities don’t match the priorities of the people constructing the system – it’s your call whether you need to re-align, or whether you should embrace the difference. I’ve written more about the correspondences and conflicts of exploratory testing on agile teams in Testing in an Agile Environment.

More problematically, some people don’t want feedback when they’re in the throes of making stuff. I’m one; it gets in the way of the fragile extension of the imagination that is at the heart of how I make new things. Some testers are remarkably insensitive to this, others imagine that they need to somehow sweeten the pill. When I have results or questions, I prefer to make it known that I have feedback, and to give that feedback either to a group at a time when everyone is tuned in, or to an individual on invitation. Of course, it’s great to be able to march up to someone’s desk and get an immediate response, but what’s good for you might not be good for your colleague. Decide whether it's the right time to ask before you get between momma bear and her cubs.

As a test team gets closer to its audience, some will imagine that the test team risks losing its independence, and will resist – for instance – exchanging design information or locating the testers a shout away from the coders. Isolating the testers is an obvious but stupid way of encouraging independence of thought. You’ll find more in The Irrational Tester.

Conversely, test teams who have no existing connection with their designers and coders throw their feedback into a void. Swift, accurate and targeted information might seem valuable, but is worthless to the builders if it is delayed by procedure and snowed under by noise. The feedback in this case is primarily for the users (and sometimes the sellers) of the software. It’s crucial to understand your audience.

Some legally-minded people (and some sellers) don’t want information about new failures and will restrict or censor feedback. Some need plausible deniability, and don’t want to even look. If you have this situation as a test manager, messing about with ET won’t fix your problems.

Dealing with decisions and organisational oversight

Groups that are new to ET tend to see a large expansion in the number of problems found before they see a reduction in the number of problems made. More than once, when I’ve been dropped into an agile team as an exploratory tester and customer representative, I’ve had to take the judgement about whether to horribly disrupt the vendor team’s workflow by filling it with bugs. Clearly, if relations are good, blowing the workflow is bad – even if it is an indication of a crisis ignored. So far, I’ve managed to avoid purposefully blocking the flow. However, although it is a decisive and disruptive step, new exploratory testing groups can fatally disrupt the flow easily, unconsciously and even gleefully (which is nauseating, but happens). When bringing ET into a project, it’s vital to have awareness of this dire ability throughout the project team. If the workflow is king but the quality poor, decision makers will need to prepare for compromises on all fronts.

Once exploratory testing is chugging along, you hope to reach a point where fewer bugs are being made. I’ve had complaints from metrics people that fewer bugs are being found. This is a fine demonstration of failure demand, and I find it easier to set it as a desired goal at the outset, rather than have to explain it as a side effect. I’ve found it useful to put this in terms of ‘we will not necessarily find more problems, but we will find more useful problems and find them faster’. Similarly, some metrics people count a day when lots of problems have been found as a bad day; it’s easier to help them deal with their pain if you’ve already had a chat about how much worse it would be if all that trouble was revealed later.

A decision to be hands-off can make some managers feel insecure. This feeling may lead them back towards the test team with unexpected needs for oversight and control*****. To avoid this, any team that has been given a degree of autonomy needs voluntarily to help their sponsor feel secure. I find that it helps to make a clear agreement not only about responsibilities, but about material deliverables. For instance: “We will keep a steady pace of exploration, shown by a public counter of time spent. We will display our test ideas and will be able to discuss them at any time with anyone on the project. We will make a visual distinction of those ideas which we have explored, those we are about to do, those we will probably do, those which we have recently eliminated, and those which have recently arrived. All time spent exploratory testing will be documented, and the documentation kept at <link/location>. All bugs logged from exploratory testing will be cross-referenced to their discovery documentation. Where we cannot keep to these commitments, we will make a collected note of the exceptions. We will come to you at the start, end and at regular intervals throughout significant testing efforts to keep you up-to-date. We will come to you with requests for help with obstacles we cannot overcome ourselves and with decisions about changes to budget and staff, but not with decisions about test direction and prioritisation. You will allow time for our regular reports and requests, and will champion our autonomy as set out here. If you are unable to give us the help we ask for, you will let us know swiftly and with reason.”

Budgets change. Sometimes a project wants more exploration, sometimes there’s less available to spend on it. While the test team may have started out with a fixed exploration budget, and may be comfortable cutting or expanding its testing to suit, there may be questions around how it would drive a change and require more (or less) from its sponsors. This is to misunderstand testing as a service – the people to whom one provides a service will be the people who ask for more, who want less, who adjust the balance. Clearly, the test team will be engaged in the negotiation, but I would question the motivation of a test team that prefers its own budgetary decisions over the informed decisions of its customers and sponsors.

Lots of teams seem scared of putting exploratory testing in front of auditors. I’m not sure why; the auditors I’ve met seem to do a lot of exploration in their work, and I’ve always found it helpful to ask the appropriate auditors about what they might expect to see from exploratory testing before we start exploring. If there is, for instance, an unadjustable regulation that stipulates that all tests must be planned, the auditors are not only most likely to know about it, but to be able to give you an indication about what they might accept (e.g. charter+timebox in plan post-hoc). I understand that session-based testing was developed in part to allow exploratory testing to be audited. If auditors have a place in your organisation, then it’s better to expect them****** than to hide; talk to your auditors and negotiate what they need for assurance. I wrote a note about this here on the blog in June 2011: How to Assure Exploratory Testing.

Crisis reveals underlying truths. I can't recall a project that has identified every necessary task, or given each task the time that was needed or budgeted. Testing, especially when considered with a potentially-unlimited discovery element, is eminently squashable and so is usually squashed – which tends to reveal uncomfortable truths about how the overall organisation understands testing. If exploratory testing is squashed out of existence when testing as a whole is squashed, your decision makers see ET as a luxury. If exploratory testing takes the whole pie when (but only when) testing is squashed, decision makers see ET as a short cut. Both these positions are pathologies. You might be able to spot them early by indulging in a spot of scenario planning, or you might trust your instinct. I work from the position that testing is a service – mostly optional, generally valuable – which I find both reasonable and benign, but my position could be a pathology in your organisation.

As a team grows into exploration, it will develop a library of tools. By tool, I don’t mean a downloadable executable, but something that arises from the combination of mindless machinery with data, configuration and conscious application by the minds of the team. A chainsaw needs a lumberjack*******. Some tools arise as testers automate their manual, brain-engaged testing – and as the automation takes over, the tool will change the way a tester triggers problems and their ability to observe surprises, not always for the better. Other tools arise because they enable a whole new class of tests, and using the tool even briefly exposes a new perspective with its own easily accessible bugs. A tool armoury is one of the core assets of an exploratory testing team; exploratory testing without tools is weak and slow. As with any library, it needs to be organised to be useful. If I can, I keep a public list of tools, techniques, tricks and test data, perhaps tagged with general useful areas and knowledgeable users. I encourage people to name and share their approaches. I try to get individuals to pair on tool use, or to hold swift training workshops.

One of the strengths of session-based testing is the way that it uses brief and frequent retrospectives to move skills through the team. Any exploration has learning at its heart, otherwise discovery builds nothing. Apart from skills in tools and test approaches, a test team needs to build knowledge of the system they are testing. We all know the truism that by the end of a project, the testers know a system better than anyone else. Exploratory test teams also build their awareness of how the system is used (and abused), and have broad connections throughout the users of their system and across the various sponsors and stakeholders. The edges of the team blur as people are seconded in and out; not all exploratory testers will identify themselves as testers. I have occasionally found easy acceptance of exploratory testing way outside the test team, which can give a neatly circular confirmation that the team made a good decision to set time aside for exploration.

In conclusion…

Test teams setting out to explore need to have a conversation with the overall project that acknowledges that this discovery activity cannot be completed, and seeks to find out what the project most values so that the test team can align its work to the overall desires of its customer and the goals of the project. It’s good to have a one-page strategy outlining goals and responsibilities. Internally, the team will need to work out how to make its work visible and trustable, and how to support its exploration and learning. It will need to organise and constantly refine a collection of tools and test ideas, drawing inspiration from diverse sources. As exploration becomes more understood, used and valued, the test team will broaden its skills and blur its edges, bringing its services closer to the coders and the customers.

* Not that anyone in my circles at that time saw – or named – exploratory testing as a distinct thing.
** I’ve mentioned one way that teams arrive at this point in Known ways of managing ET #03 - The Gamble
*** You’re here? Good for you. Welcome, friend. I know that you are interested in details and diversions. As am I. Between us, I don’t mean an octopus. Imagine Cthulhu.
**** because they’re the smallest, and so are arrived at first.
***** which sounds to me like Marshall McLuhan’s reversal effect. You’ll want to read Michael Bolton’s article for more.
****** It’s not like they’re the Spanish Inquisition.
******* and transport infrastructure, logistics plan, pulp mill, petroleum industry...