Monday, February 25, 2013

New Black Box machine (Puzzle 6)

I've cobbled together a new black box machine, for your amusement.

If you email or DM me your simple-as-possible description of what it's doing, and if it's right, I'll send out a kudos tweet to the world. I'll stop after a few.

Report bugs widely, but please include me on their distribution. I know of at least three: one with fonts, another on resize, and the differing UI on the logo/? buttons.


Here's a big one, if you prefer.

I think I may have finally got far enough with the transition to AS3 (which may make it simpler to port to HTML5).

If you like this, get the last seat on my Amsterdam workshop. Orange button, top left of the blog.

Cheers - James

P.S. Apologies for the formatting problems which I assume will turn up here. MacJournal's decided to change how it blogs, so I'm using Blogger directly. Yikes. Formatting problems, default behaviour troubles, "can't save" warnings. Shocker.

Friday, December 14, 2012

Modelling super powers

tl;dr it's not just the tactics that matter


In this scenario, I've modelled five testers. Each has a super power.

  • One logs bugs more easily than the others – effectively, they log as many bugs as they can see. The others can only log 10 bugs for each bit of budget they consume.
  • One only logs big bugs – bugs with a cost of 10 or more. The others log any bug they find.
  • One learns three times more effectively than the others.
  • One switches tactic twice as often.
  • One finds it easier to retain their skills after switching tactic.
What difference might each of these qualities make?
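If you'd like one way those powers could be wired up, here's a minimal sketch in Python. It's my own toy, not the XML behind the machine: the parameter names, the numbers and the shape of the loop are all assumptions, there only to make the five qualities concrete.

    import random
    from dataclasses import dataclass

    @dataclass
    class Tester:
        name: str
        log_cap: float = 10      # bugs loggable per bit of budget (one cycle, here)
        min_cost: float = 0      # only log bugs at least this costly
        learn_rate: float = 1.0  # how quickly skill grows per cycle
        switch_every: int = 20   # cycles between tactic switches
        retention: float = 0.5   # fraction of skill kept on switching tactic

    testers = [
        Tester("logs freely",    log_cap=float("inf")),
        Tester("big bugs only",  min_cost=10),
        Tester("fast learner",   learn_rate=3.0),
        Tester("switches often", switch_every=10),
        Tester("retains skill",  retention=0.9),
    ]

    def run(tester, bugs, cycles=100):
        """One toy run: skill scales each bug's chance; switching dents skill."""
        skill, logged = 0.1, []
        for cycle in range(1, cycles + 1):
            if cycle % tester.switch_every == 0:
                skill *= tester.retention
            skill = min(1.0, skill + 0.01 * tester.learn_rate)
            logged_this_cycle = 0
            for bug in bugs:
                if not bug["found"] and random.random() < bug["chance"] * skill:
                    bug["found"] = True
                    if bug["cost"] >= tester.min_cost and logged_this_cycle < tester.log_cap:
                        logged.append(bug)
                        logged_this_cycle += 1
        return logged

    for t in testers:
        bugs = [{"found": False, "chance": random.random() * 0.05,
                 "cost": random.paretovariate(1.5)} for _ in range(400)]
        found = run(t, bugs)
        print(f"{t.name:14s} logged {len(found):3d} bugs, total cost {sum(b['cost'] for b in found):7.1f}")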

Run the exercise a few times. Make some changes. You may find it easier fullscreen, and of course the XML is available to play with. Does the model match your experience?

More to the point, perhaps, how are you comparing the different testers in the model? How do you compare real testers on your team?

Friday, December 07, 2012

Diversity matters, and here's why

tl;dr – It ain't what you do, it's the way that you do it


We've got a model which tells us we have a hopeless problem. I promised some perspective.

Let's try throwing people at our problem. In the exercise below, we're using five testers. If a bug has a 1:100 chance of being found by one tester in one cycle, surely five testers should have a better chance.*

How much better? Run the thing and find out.


Less than impressed? That's because hard-to-find bugs are still hard to find, even with a few more people. Your one-in-five-million shot is not going to jump into your lap if you've only managed to make the chance of finding it one-in-a-million.
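For the analytically minded, the arithmetic behind that deflating paragraph (and the footnote) is a one-liner. The figures below are just the worked examples from the text.

    # Chance that at least one of n independent testers finds a bug whose
    # per-tester, per-cycle chance is p.
    def combined(p, n):
        return 1 - (1 - p) ** n

    print(combined(0.01, 5))          # ~0.049: five testers on the 1:100 bug
    print(combined(1 / 5_000_000, 5)) # ~0.000001: the one-in-five-million bug, still hiding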

There's a key quality I've not changed in this model. We've said that some bugs are harder to find than others. We've not yet mentioned, or modelled, that my Mum merrily finds problems that have eluded me. The way that you don't see my bugs on your machine. The way that performance testing jiggles bugs by the bucketload out of systems which seemed to be working just fine, or the way that unit testing and usability studies find (mostly) entirely different collections of bugs.

Our model should reflect that any individual bug** has lots of different likelihoods of being found. For this model, we're going to make the choice of likelihood depend on the tactic that is in use. Indeed, for this model, that's what differentiates and defines "tactic" – changing tactic changes the distribution of likelihoods across the collection of bugs.

Below, you'll find an exercise which again has five testers working in parallel. This time, each tester has their own individual profile of chances, and a bug that one finds easily may be much less likely to be found by another.

In the model, we do this by setting up tactics. Each tester has one tactic, which they use exclusively. Each tactic is set up in the same way across the full population of bugs – it's just a distribution of probabilities. If you were to look at one bug, you'd find it has five separate probabilities of being found. Have a play.
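Here's a sketch of that set-up in Python rather than Flash: one table of chances, one column per tactic, so every bug carries five separate probabilities. The lognormal shape and every number in it are my guesses, not the machine's actual distribution; it's there to show the mechanism, not to reproduce the toy's figures.

    import random

    random.seed(1)
    N_BUGS, N_TACTICS = 400, 5

    # One chance per bug per tactic: easy under one tactic, near-impossible under another.
    chances = [[min(1.0, random.lognormvariate(-6, 2)) for _ in range(N_TACTICS)]
               for _ in range(N_BUGS)]

    def run(tactic_of_tester, cycles=500):
        """Testers work in parallel, each welded to one tactic."""
        found = [False] * N_BUGS
        for _ in range(cycles):
            for tactic in tactic_of_tester:
                for b in range(N_BUGS):
                    if not found[b] and random.random() < chances[b][tactic]:
                        found[b] = True
        return sum(found)

    print("five testers, one shared tactic:", run([0, 0, 0, 0, 0]))
    print("five testers, five tactics     :", run([0, 1, 2, 3, 4]))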


The difference is clear.

Diversity matters***. In this model, it matters a lot; more than budget, more than the number of testers.

For those of you who prefer analysis over play, it's also clear if you think about the chances of finding an individual bug. Tactic 1's million-to-one chance bug may be a billion-to-one for tactic 2, too, but tactic 3 might well see it as a hundred-to-one. Ultimately, the no-chance-for-that-bug tactic would continue to have no chance whatever your budget (or patience), but by having many tactics, one increases the chance of having a tactic that will find that particular bug easily.

QED – but I hope the toys help the demonstrandum at least as much as the argument.

Note that a key assumption hidden in this model of diverse approaches is that the different tactics are utterly different. In the real world, that's hard. There's plenty of refinement to do to our model to make it a more accurate reflection of the world. However, the central idea remains: in this model of discovery, you get much more from changing your variety than from changing your effort.

This then is the perspective – in this exploratory environment, persistence is what leads to hopelessness. Variety gets you closer. Just for fun, here's a model with the five tactics, and just one tester – but this tester can switch tactics. I'll be mean, so they switch randomly, and each time they switch, their skill slides backwards. Look at the poor beggar ticking away; hardly ever gets over 50%.

See how well this works with just one tester.


One random tester does better***** than five monotonic testers? You're surprised by that conclusion? Enough with the rhetoricals: I have (metaphorical) knobs for you to play with.
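For completeness, here's how the switching tester might look in the same sketch, reusing the chances table from the code above. Whether the switcher beats the monotonic five in this toy depends entirely on the numbers I've picked for learning and sliding, so treat it as an illustration of the mechanism rather than a proof of the conclusion.

    def run_switcher(cycles=500, slide=0.5):
        """One tester, switching tactic at random; skill slides back on each switch."""
        found = [False] * N_BUGS
        skill, tactic = 0.0, 0
        for _ in range(cycles):
            if random.random() < 0.05:               # switch roughly every 20 cycles
                tactic = random.randrange(N_TACTICS)
                skill *= slide                       # the mean bit
            skill = min(1.0, skill + 0.02)           # learning
            for b in range(N_BUGS):
                if not found[b] and random.random() < chances[b][tactic] * skill:
                    found[b] = True
        return sum(found)

    print("one random switcher            :", run_switcher())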

The sharp-eyed will notice an extra button – I've finally given you a reset. Indeed, this is a rather more interactive machine than you've had so far – you can change the number of bugs and the cost model. You can also give the machine a (not entirely reliable) "seed" to start from as it builds the model, which lets you replay scenarios. Be aware that I've not sorted out a fully-intuitive workflow around set/start/stop/change/reset, nor have I tested this well (it's mine, and I'm too close to do a job to be proud of). I'd appreciate any feedback – be aware that behaviours may change in the near future.

If you want to dig deeper into the model, I've made a change that allows you to play with the machine offline. Download the .swf, and the Exercise.xml file from the same directory. Bung them in the same folder on your own machine, and the .swf will pick up your local copy. Have a play with Exercise.xml and see what you can come up with. I'll share interesting exercises and conclusions here, and you're welcome to post them on your own site. I'd like to hear about your postings, because then I'll know who to send updated machines to. I'll open-source this sometime.

There's lots further one can go with this model, and over the next few posts, we'll explore and illustrate the effects of some common constraints and balances.

It looks like I'll be teaching exploratory testing in Amsterdam early next year. I'm just about to set the dates. If you want 30% off the price for simply telling me you're interested, you've got a couple of days to catch the opportunity.

Cheers -

James



* maths people will know: 1 − (1 − 0.01)^5 ≈ 4.9%, which is just a tad more unlikely than a 1:20 chance.
** for the purposes of this explanation, let's assume we can identify at least one bug.
*** this panders directly to my prejudices, so I'm pleased to reach this conclusion, and find it hard to argue against effectively. I'd be grateful**** if you felt able to disagree.
**** through gritted teeth, but still grateful.
***** better?

Wednesday, November 28, 2012

Enumeration hell

tl;dr some bugs are beyond imagining


"Rational people don't count bugs."

There's a rash statement. Let's say that rational people who do count bugs ought to count other, less pointless, more meaningful things, too.

Bugs* are rotten to count. There are plenty of posts** about this, and I won't go over the same ground here. Counting bugs is a bit like counting holes – superficially obvious until someone takes a shovel to your cheese.

But the big problem with a bug count is that it summarises a potentially useful collection of information into a number that is mostly meaningless. A single nasty that makes the wheels fall off is worth any number of niggles that make the horn too loud. Unless you're driving a clown car.

In our idealised model, we're counting surprises because it's interesting to see how many are left. None is none on any scale, and if there's none, we're done. We're still not done if we've got one left, because that one might be a stinker.

You've noticed that I've only given you one knob to twiddle*** on these toys. You only get to change the budget – you don't get to change the context****. This is a cheap manipulation on my part, because I've been asking you to concentrate on where you might set that budget to feel reasonably confident that the thing is tested.

So far, we've not considered bug stink in our model. It's time that changed.

In the same way that our model gives each bug a chance of being found, it gives each bug a quality I'll call cost. That's probably not the best word, but it's the one I've chosen for now*****. I'll give it a local meaning. Cost is the amount by which the value of the system goes down when it contains the bug. Quality is value to someone. Trouble makes that value go down. Cost, here, is not cost of fixing the bug. It's the cost of leaving it in, and it's the cost to the end users.

Bugs aren't made equal, so we'll need to consider a distribution again, this time of (our local definition of) cost. Experience leads me to believe that most bugs have low cost, some bugs have higher cost, and a very few (so few that they might not exist in a given system) have astronomically large costs that outweigh the value of the system.

In earlier examples, each bug had the same cost. The distribution I've chosen to use in this model, to match my experience, is called a "power law" distribution. Power law distributions fit lots of things observed in the real world, such as city sizes, distribution of wealth, and the initial mass of stars. Power law maths underlies the Pareto Principle (aka the 80:20 rule), and Taylor's Law****** (and, more incomprehensibly, phase changes). If you want to dive into this, set your head up with this handy note comparing the similarities of Power/Zipf/Pareto in a real (if rather antique) context.

Why have I picked this distribution? Because it feels right. Instinct is no justification, so you can expect that we'll have a look at other distributions later. For now, though, here's a fourth assumption:

4        The cost of a bug to (all the end users over the life of a product) has a power law distribution.
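To make assumption 4 concrete, here's a small sketch that draws bug costs from a Pareto distribution. The shape parameter and the bug count are my own choices, so the totals won't match the machine's; the point is the shape – most bugs cheap, the odd one enormous.

    import random

    random.seed(7)

    def bug_costs(n=400, alpha=1.2):
        """Power-law-ish costs via a Pareto draw (alpha is a guess, not the machine's)."""
        return [random.paretovariate(alpha) for _ in range(n)]

    costs = sorted(bug_costs(), reverse=True)
    print("total cost    :", round(sum(costs)))
    print("five worst    :", [round(c, 1) for c in costs[:5]])
    print("median bug    :", round(costs[len(costs) // 2], 2))
    print("top 20% carry :", round(100 * sum(costs[:80]) / sum(costs)), "% of the total")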

Enough of the hand-waving. Let's play.

Below you should find a machine identical to last time's closing toy, but with costs set to match a Pareto-style distribution. You'll quickly see that there are two "stuff found" numbers, and that the size of the yellow dot is related to the cost. Run this a few times.


Don't be surprised if, occasionally, you see a simply huge yellow dot. Try hovering over the top right of the square set of 400 circles, and click on the ? you see to reveal a god-like understanding of how much trouble this system is hiding. Know that, generally, you'll see the total trouble is around 1000*******. If you see around 2000, expect that one of the bugs has a cost of 1000. If you happen to see around 11000, you've probably got a fat 10K bug hiding away.

In our most recent outing, I hope you got a feel for why it's hard to use a bug rate to say that you're done testing. If you play with the models in this posting, you may get an idea for how 'not done' feels in terms of the cost of what you've left behind.

I hope you're still considering where your omnicognisant self would set a reasonable budget so you could say with confidence that you'd done enough. Have a look at the left-hand graph of what's been found. It's still very front-loaded, but you'll see the occasional big spike as a particularly troublesome bug is revealed.

Let's rack up the difficulty another notch. I set up the model above so that the budget and the bug distribution meant that you got to find most of the bugs in a relatively brief exercise. Of course, that's no use at all. Here's another; more bugs, smaller budget. Crucially though, in this model plenty of the bugs are very hard to find indeed. You're not going to find the lot, so that's what this model looks like.


Hopeless, isn't it? If the real world looks anything like our model, how can anyone be bothered to give a sensible answer when asked to set out a budget?

Next time, all being well, we'll approach these frustrations sideways on. We won't find clarity, but we may find perspective.


* I'm not going to define "bug", because it's a vague word, and therein lies its power. But if there's a scale that runs through vague to countable, then I suggest these two ideas are at opposite ends.
** Try Michael Bolton's Another Silly Quantitative Model and Elisabeth Hendrickson's What Metrics do you use in Agile.
*** there's lots more interactivity to come. For now though, mull on how it must feel to be a leader whose only effective control is over budget-setting, then be nicer to your poor distant Chief Money Officer next time.
**** suggestions accepted, but without any guarantee they'll be used.
***** "Law" appears to be used by some scientists in a similarly-imprecise way to the way some lawyers use "Proof". Business people naturally choose to use both words with abandon. I would treat the word "Law" here with as much scepticism as you might treat it in Moore's Law. They're empirical laws, and describe, rather than necessarily account for, system behaviour.
******* 1000 what? I don't care. Stop your whining and go count the number of things in this list.

Monday, November 19, 2012

Models, lies and approximations

tl;dr – Some of these bugs are not like the others


Here's hoping you've enjoyed playing with trucks and bowls and your imaginations. If we're going to be able to use our model as an illustration of much value, we have to recognise that in relation to software testing it contains a useful approximation, and a misleading lie.

There's a limited collection of things to find. This can be a useful approximation for exploration in software testing – if one assumes that a system has a finite collection of possible behaviours, then the set of possible but undesirable behaviours is limited too (as compared with the vast set of rubbish things that it doesn't actually do). This is good to consider when told "there's always a bug" by an idiot*.

You might further refine this by adjusting your view from the large mass of observably rotten behaviour to the smaller selection of practical changes that make the system more desirable. You'll also recognise that the collection, while limited, is only fixed if everything else is fixed. In our model, the collection of bugs is fixed – so we need to be clear that the approximation and the model assume that, just for now, no one's changing stuff**.

The rate of finding things falls, as the number of things that can be found falls. This is obviously true, but is perversely also a misleading lie***. Idiots (sometimes, the same idiots who believe "there's always a bug") assume, because of the statement's obvious truth, that when the rate of finding bugs drops, the system is close to clean. Bonnnnng.

Sometimes, and it feels like often, it's because the people testing it have run out of imagination. While we may understand more as we reveal more, and while a system may become cleaner as it gets fixed, a dropping bug rate certainly does not imply you've found all the bugs.

Some testing is done by an unbending list of dull asserts, which run clear and green when they've not managed to either trigger or observe any trouble. Michael Bolton reasonably calls these "checks" rather than tests. Some testers, limited by corporate focus or personal blandness, don't do much better, demonstrating simply**** that a system meets expectations.

As any fule kno, some bugs are harder to find than others. If you find you've run out of bugs, it's likely you've run out of bugs that you're set up to find. Sometimes, that's OK. But sometimes, a bug that's hard for you to find is easy for someone else to find. If that someone else isn't a paid tester, but is, heaven forfend, a paying customer, we get the "why didn't you find that" conversation.

So, then. A couple of approximations for this model.

1        Some bugs are harder to find than others.

I'll model this by giving some bugs a high chance of being found, and others a low chance. The way that easy-to-hard works amongst the bugs is called a probability distribution. We can pick a distribution. Our earlier example, the trucks and bowls, would be a fixed, or flat, distribution, where everything has the same chance, because we assume that trucks/bowls are effectively equal within the model. That's the stupid but easy assumption that lies under the misleading lie. Bugs are different.

2        We don't have a find-fix-retest cycle in our model. Nothing is being changed.

This makes the model easier to understand, because we're not looking at two things going on at the same time. Of course it's inaccurate. The trick is to use the imagination to wonder how that inaccuracy might appear. Models should be visceral, so go with your emotion if you don't want to be analytical.

Finally, a wrinkle. Exploring is about learning. As we discover more, we get better at discovering, not worse. We start out rubbish, and one trick that distinguishes good testers is how quickly they get better (not how good they start). This leads us to

3        Everything's harder to find early on.

In our model, we have a tester. The chance of them finding any given bug starts at a particular value (0, say) and increases. In this model, it increases over time, and it's much easier to go from nothing to halfway good than it is to go from halfway good to perfect. There are lots of different ways of modelling this – again, use your imagination to think how the model might change.

So – here's a model of a tester discovering bugs. I've fixed it so that there are 400 bugs to find, but some are harder than others. The tester gets better over time.
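Here's a bare-bones version of that model as a sketch: 400 bugs, each with its own chance of being found per tick, and a skill curve that climbs quickly to halfway and slowly thereafter. The exponential spread of chances and the particular skill curve are stand-ins of mine, not the machine's internals.

    import random

    random.seed(3)
    N_BUGS = 400

    # Assumption 1: each bug has its own chance per tick (some far harder than others).
    chance = [min(1.0, random.expovariate(50)) for _ in range(N_BUGS)]

    # Assumption 3: quick to get halfway good, slow to get perfect.
    def skill(t, halfway=20.0):
        return t / (t + halfway)   # starts at 0, creeps towards 1

    def run(budget=500):
        # Assumption 2: nothing gets fixed, so the list of bugs never changes.
        found, history = [False] * N_BUGS, []
        for t in range(budget):
            s = skill(t)
            for b in range(N_BUGS):
                if not found[b] and random.random() < chance[b] * s:
                    found[b] = True
            history.append(sum(found))
        return history

    history = run()
    print("found after  50 ticks:", history[49])
    print("found after 500 ticks:", history[-1])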


* Not that they'll listen. Indeed, that's close-on the definition of an idiot, which is apparently from a Latin word meaning "ignorant person". Clearly they are, if they're ignoring you.
** I'm aware that this is impossible and in many ways undesirable in the real world. My model, my rules. Just making them explicit.
*** something to have in mind whenever someone says "obviously"
**** but oh, in such a complicated way

Friday, November 02, 2012

Broken Trucks

tl;dr – you still need your imagination, even with real-life examples


Temporary note – the truck graphic has gone, the graphs are back. I'll remove this note when I restore the graphics...

Magic pennies? Pshaw.

Let me put this another way.

This problem has been put to me frequently in my testing life. Here's one close-to-actual situation.

My client has a hundred trucks. Each has a bit of kit, and I've been told that the bit of kit needs to be replaced occasionally. Actually, not so occasionally – it's new kit, and I'm told that it's likely to fail at least once in the first hundred days' use.

So, how many trucks will experience that failure in their first hundred days? All of them? Also, how long should we test for? How many rigs should we use? How reliable is that suspiciously-round 1 in 100 figure?

As it happens, there's a bit of maths one can do. If the chance of a truck failing on any given day is 1%, then the chance of it not failing is 99%. The chance of it not failing for 2 days in a row is 99% * 99% (just over 98%). For 3 days, 99% * 99% * 99% (a tad over 97%).

Can you see where I'm going? The chance of a truck not failing for 10 days in a row is 99% * [99% another 9 times]. That's 99%^10.

For 100 days in a row, it's 99% ^ 100. Which is about 37%*.

So after a hundred days, I'm likely to still have 37 trucks, more or less, that haven't failed yet.

Which makes around 63 trucks that I need to go and mend**.
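If you'd rather check that arithmetic by brute force, here's a two-minute simulation. It assumes, as the footnotes do, that a mend doesn't change a truck's chances of failing again.

    import random

    random.seed(42)
    TRUCKS, DAYS, P_FAIL = 100, 100, 0.01

    print("analytic : about", round(100 * (1 - 0.99 ** 100)), "trucks fail at least once")

    failures = [sum(random.random() < P_FAIL for _ in range(DAYS)) for _ in range(TRUCKS)]
    print("simulated:", sum(f > 0 for f in failures), "failed at least once;",
          sum(f > 1 for f in failures), "more than once")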

The maths is satisfying, but it doesn't tell me any more than the answer to the question I was first asked. Nonetheless, we know that all good testers have a practically unlimited supply of extra questions to ask, so we're probably not completely satisfied.

However, if I go grab my hi-viz jacket and get to work on the trucks, I'll get a better idea of what happens. I'll find that some days everything works as well as it did yesterday, and occasionally three new trucks phone in failed. I'll get an idea that I'll see more failures when there are more things that work – so as the period goes on, I'll see fewer and fewer. Some trucks could go on for ages (I'm sure that you've all heard of immortal lightbulbs, too. Survivorship bias – mostly.)

Working on the trucks allows a visceral, complex experience. It takes a while to get, it's not terribly transferrable, and it's hard to forget. You know it deeply and in many different ways. You are "experienced". The maths approach is different; the result is ephemeral, and you may remember the method more easily. To imagine its implications, you'll have to think hard. You are "expert"***, and because you can remember the method, you might be able to re-apply it in a different context.

In between these two, there are models and simulations. Models aren't reality, but neither are they primarily symbolic (at least, not on the outside). I hope that the right model might engender something between experience and expertise. For what it's worth, I think that asking "How long should I test for to be confident that I'm not going to see problem X much in real life" is a fair question, and I think that "It depends" is a rotten answer without some idea on what "it" might depend.

I've given you three machines below. 10 trucks, 100 trucks, 1000 trucks. I've knocked out various noisy bits, but it's otherwise the same simulation. Have a play. You can change the budgets. Think about what the frequency of failure tells you, especially over time. While you play, just have in the back of your mind the ways that this kind of failure differs from the failures that we discover when exploring...

Right now, I'm posting this from EuroSTAR – it's looking good! Follow @esconfs on twitter, or watch for the #esconfs hashtag. And @TheTestLab, of course.


 
 * We're assuming here that a once-broken truck is no more likely (or less likely) to break down again. We're also assuming that the non-broken trucks are at no greater chance of breaking. In one of the cases I'm thinking of, the "broken" truck was entirely functional as far as most people were concerned, so the broken trucks didn't get less use, and the working trucks didn't get more use. If you're thinking of an un-enlargeable fleet of trucks with broken axles, we've got different models.
** If I'm swift to mend, some of these probably will have needed to be mended more than once.
*** Nobody said that being experienced and being expert were mutually exclusive. You can be both, you can be either, most of us are neither outside our fields of interest.



Thursday, November 01, 2012

An experiment with probability

It's been a busy day.

So, just for now, here's a very abstract experiment (and I'll give you the kit to play with it).

Imagine you've got a hundred bowls in front of you.

In each bowl, you've put a hundred pennies – 99 dull ones, and one magic one.

Every day, you get to look at one random penny from each bowl. You drop the penny back in its bowl when you've looked.

If you had a hundred days, how many magic pennies might you see?

If you wanted to see all the magic pennies, how long would you plan to spend to be reasonably confident? How confident is reasonably?

If you want to work this out, do. You could find out empirically and viscerally, too, but you don't necessarily need a hundred quid in coppers and a couple of seasons: play with the thing below. There are a hundred purple circles that go yellow with a chance of 1:100 on every tick of the red number. You can change the red number before you start. Press the arrow in the circle to set things going. Treat "work done" as "days spent" for now – no one thinks finding magic pennies is a real job.
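If you'd rather do the sum than watch the circles, here's the back-of-envelope version. It assumes exactly what the toy assumes: every look is independent, with a fixed 1:100 chance per bowl per day. The budgets in the list are just sample values of mine.

    # Per bowl: chance of having seen its magic penny within d days.
    def seen(d, p=0.01):
        return 1 - (1 - p) ** d

    for days in (100, 230, 460, 690, 920):
        expected = 100 * seen(days)
        chance_of_all = seen(days) ** 100
        print(f"{days:4d} days: expect {expected:5.1f} magic pennies; "
              f"chance of seeing all 100 = {chance_of_all:.0%}")

By this reckoning, a hundred days shows you about two-thirds of them, and being reasonably confident of seeing all hundred is a matter of years rather than months.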

Cheers -

James

Wednesday, October 31, 2012

Resurfacing...

...not in the sense of putting a new veneer over an old worktop, but in the sense of one's head breaking into fresh air after a long, deep, dive.

There may be a bit of bobbing up and down and gasping for a while.

I have a daughter. A new one. The first one, for me. Indeed, the first child of any flavour for either my wife or me. Very new. Very lovely. We're all happy and healthy, all's good. Those of you who want to know more may know that I have pictures on tumblr, and some of you already have the password. Thank you for all the good wishes. Buzz me (directly, not here) if you want access. I'm trying to keep specific details off the internet until the whole identity/privacy thing shakes itself down. Or until she's old enough to vote for herself. Whichever comes first.

Anyway, excuses aside, adult life comes back in with a crunch next week.

I'm on the program committee for EuroSTAR. Along with the other committee members*, I'll need to be visible and available and therefore I'll need to be in Amsterdam from Monday to Thursday. That'll be a shock. One of the reasons I'm here is to get the dust off the testing neurones – I've spent a few weeks mainly thinking about babystuff.

We were very enthusiastic about the program in Galway in March. The EuroSTAR elves have been working like crazy to put together the actual conference, and it's going to be excellent to see it happen. These days, I tend to spend my time at conferences in @TheTestLab. I can't do two things, so Bart is running it with Martin Janson (Martin is one of the TestEye bloggers, worked with us on the EuroSTAR TestLab in 2010, and helmed a fantastic TestLab at Let's Test last May). I can't think of better hands for it to be in. The TestLab will be, we're told, central to the conference, and (at last!) easy to find. I'll be spending time there, if you want to come and find me, but I'll also be going to tracks and sessions, having conversations in corridors and bars, and basically making a nuisance of myself.

It's worth noting that, after the conference theme "Innovate: Renovate" was announced, both Shmuel and I told our colleagues that our wives were due in the weeks before the event. Indeed, I think they had the same due date, so both of us knew while we set the theme, but neither of us could say. Shmuel's done the Dad thing rather more (hats off to him) so I'll be the one looking more startled. Of course, the other difference is that he has a beard. It's harder to look startled with a beard.

Then, for those interested in my life in a furry hat, I charge back to London and change costume and language. After singing on one of this week's top-ten soundtracks (we're on Halo 4), on Friday 9th the London Bulgarian Choir will hit the ground running for their big gig of 2012. It's also the first gig for the choir after their leader had a baby** and we have a whole new way of doing a show. Stories throughout the first half, a wedding in the second half, presumably breastfeeding in the interval. We're at a gorgeous 500-seat venue in central London. In 2010, we sold it out to the last chair. This time, who knows... Anyway, here's a Facebook event (one of at least four doing the rounds). Invite yourself.

Last thing on my list – I'm playing with systems again. I should have something a bit special for you, starting tomorrow. For now, let's see if I can get flash to embed here...

Cheers -

James





* Zeger van Hese is our program chair, and Julian Harty and Shmuel Gershon are the other committee people. Here, look.
** Same baby as my baby? Same baby. Our baby. Blimey.

Wednesday, August 15, 2012

Dog Show Quality

tl;dr – I'm ranting about arbitrary judgement

I've been enjoying the occasional horse ballet this Olympics. Back in my dim and distant, I learnt to distinguish some of the more obvious of the tiny communications and balances necessary to keep half a tonne of twitchy muscle in shape. At this remove, and especially when watching experts through my lo-res crystal bucket, I now see very little, but one can only admire the quality of the riding, the riders, and the ridden.

Dressage is exacting, it gives joy, and it's an extraordinary skill gained through dedication and talent. Once upon a time there might have been a working reason to get your horse to trot in goosestep, but in the arena those practicalities are subverted into competition. Criteria are set out, judges judge, competitors are measured and ranked. Everyone's an adult, everyone's there by choice, the horses love it – so what's the harm?

None at all, at a guess.

It's not quite the same in all competitions. For instance, competition requirements which set out strict parameters for dog breeds have led to a variety of unpleasant canine complaints as certain breeds of dog have become caricatures of their ideals, or as unmeasured and unexpected emergent properties have popped up (or out, as the case may be). Preset breed standards make it relatively straightforward to judge objectively – but the social act of judging the quality of a dog against those criteria can drive the breed as a whole, indeed the community of breeders as a whole, into unexpected and unwanted unpleasantness. But hey, if that's your bag, go to it.


My interest here is in software and systems and the people who make them. I'm using the situations above as a leaping-off point for an extended metaphor concerning software development and development processes. Be forewarned: I'm not about to respond to commentary from lovers of dressage or of dogs.


Occasionally, I'll be on site, or in a lecture, or talking with a colleague, or reading a paper, and the words "Dog Show Quality"* will pop into my head.

Dog Show Quality is where "Quality" is its own goal. A measured characteristic, where the measurement is made against an arbitrary set of criteria which don't have any necessary relevance outside the narrow limits the criteria themselves describe. You can measure it objectively, and you can use it as a goal for incremental change – but that goal may worsen your situation over time.

Something that is judged to have good Dog Show Quality is not necessarily rubbish. Indeed, it may in itself be best of breed. It's the measurement that is rubbish, and in being rubbish, leads to rubbish. It's not a bad thing to say "Every line of this software has a test", but "Every line of all our software must have a test" is no guarantee of goodness, and defining "Quality" as "Every line has a test" is just a small and noxious step away.

I tend to think Dog Show Quality most often when confronted with ossified software development process. Where signoff matters more than assessment, where knowing where the documents are is more important than what's in them, where spelling is more important than sense. When I talk to the Divisional Champion of Quality Assurance, and they care far more about adherence to process than whether their rotten product creaks, crashes, destroys data and chews the faces off their users' children, I'll think Dog Show Quality.

Dog Show Quality is mostly pointless and sometimes harmful. Making working systems is an exacting and skilled pleasure, but it's also done for money and directly affects real people. If, in the pursuit of quality, your judgement relies on measurements and you continue to measure the relatively unimportant, you're indulging yourself. Get over it, or get a green blazer with brass buttons and hand out rosettes. Your choice.



* There's nothing original under the sun, and this phrase has popped into my head unbidden for years. You may have read or written something similar. If you think I've simply written a cover version of your work without attribution, then I'm really sorry. I've searched, and I've not found your stuff. Please let me know, and if I've stood upon your shoulders to write this, I'll make a clear acknowledgement and link to your ideas.

Monday, August 13, 2012

Session timer

tl;dr - here's a thing to help you explore within the limits you set yourself


@eviltester* just asked me where I keep my session timer.

I don't appear to keep it anywhere.

Sorry about that. Fixed now.

If you want to use it**, try this page: http://www.workroom-productions.com/2timer.html

I use it*** to keep me on the rails. Here's how:

If I've decided that an exploration is worth 60 minutes of my life, I load it up, change the number in the circle to 60, and hit start. If the phone goes or someone arrives, I hit pause. I want to see how much of my gamble budget I've used, because that knowledge changes the game. Proportion matters more than minutes. Various numbers show up for reasons useful to me – elapsed time, excess time. There are two (pretty much identical) timers, because I run distractions as mini sessions.

I don't want to change the timer while it's going, and I don't trust my hands to avoid the error, so I've made it so one can't (easily) change the timer while it's going. It doesn't go bing because 1) I don't care that much if I go a bit long and 2) I have a timer, so I already know. If you change the system clock, it won't complain, but it won't be much use either, and midnight's not well handled****.
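For the curious, the whole mechanism fits in a few lines. This is a sketch of the workflow described above, not the Flash timer's actual code; the class and its method names are mine.

    import time

    class SessionTimer:
        def __init__(self, budget_minutes):
            self.budget = budget_minutes * 60   # seconds
            self.elapsed = 0.0
            self._started_at = None

        def start(self):
            if self._started_at is None:
                self._started_at = time.monotonic()

        def pause(self):                        # phone goes, someone arrives
            if self._started_at is not None:
                self.elapsed += time.monotonic() - self._started_at
                self._started_at = None

        def used_fraction(self):                # proportion matters more than minutes
            running = 0.0 if self._started_at is None else time.monotonic() - self._started_at
            return (self.elapsed + running) / self.budget

        def excess_seconds(self):               # time over budget, if any
            return max(0.0, self.used_fraction() - 1.0) * self.budget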

If you work in a similar way to me, it may be handy. If not, it won't.

The timer is ten years old, and can be a bit of a pain with CPU as it's written in AS1 and published for Flash 5. The world and his dog uses iOS now, or HTML5/JS, or if stuck in flashland (as I am) AS3 and FP10+. Use as you wish. Feedback always gratefully received.


* also eviltester.com and elsewhere if you dig a little
** a more-useful-for-work version than the one I deploy in my workshops
*** if I use it. I don't use sessions for testing so much. I do use timed sessions for exploring documentation, investigating ideas, writing, prototyping, rehearsing and sketching. And (non-IT use) cooking, of course. For that, I use a kitchen timer.
**** Hint: it may be time to go to bed.



Monday, August 06, 2012

Contextual spelling

tl;dr simplicity can hide depth


Some people expect that, as they discover more, they get closer to finished. But exploration can instead take you further from where you started.

Got a Mac*? Try this...

I open a new TextEdit document. I type "contact sant". On my machine, the word "contact" is highlighted as a spelling problem. I note that if I change "sant" to "Sant" or "san", "contact" is no longer highlighted. Perhaps one gives more context, one less. If so, I've bracketed a sweet spot, which is good to know. I might come back to this.

Because I'm in TextEdit, I suspect I'm actually using Apple's Cocoa text. The same problem shows up in MacJournal. However, it does not show up in Mail or Evernote. I'll not follow this particular path of enquiry for now. Maybe someone else can inform me.

That's two paths ignored. What I want to do is to dig deeper into spelling. If I flip open the Edit:Spelling:"Spelling and Grammar" panel, the alternatives to "contact" offered are "kontakt" and "kontant". If I change "sant" to "santos", I'm offered** "contacta", "contacte", "contacto", "contacté" and "contactó".

I note that a drop-down box says "Automatic by Language", and to my mostly-monoglot eye, the first suggestions look more Germanic, the second more Latin. Though neither TextEdit nor MacJournal allows me to set the language of a text fragment, it's clear that the suggested spellings are from two different non-English dictionaries***, and that the choice of one excludes others.

Could it be that Cocoa text decides what language a text fragment is in before it goes off to get spelling suggestions? Is it really deciding from four letters?

Over to you.

(Screenshot: the "Spelling and Grammar" panel with its suggested alternatives.)


* I'm still on Snow Leopard, v 10.6.8. You may not be. You'll know better than I.
** actually, nothing changes immediately. I need to close and re-open the "Spelling and Grammar" panel to see the new suggestions. Yeah, I'm not exploring that, either.
*** so that's three in total, smart-alec.

Tuesday, May 29, 2012

Looking for Trouble in Big Data

tl;dr - basing your decisions on aggregates? Tested that?

I was at another "Data Science" meetup yesterday – this one with a focus on the music business*. I was feeling overwhelmingly awkward, and once I'd managed to chip a front tooth (on a chip, of all things), I kept my head down. There was some interesting stuff, and some marketing guff. I made notes, and bits of it stuck.

Before I go further, here's information that'll make you smile if you're interested in the effect of piracy: The music industry can predict 90 days of legitimate sales from BitTorrent traffic in a given artist/work (Greg Mead, MusicMedia). And something to, perhaps, make you frown: Playing obscure, "long-tail" music on the radio really doesn't affect the degree to which most of us listen, mostly, to the most-played 1000 artists (Mark Levy, last.fm). I may be wrong on these, so verify your facts before relying on my half-baked bloggery.

One of my sets of notes was on pathologies, some specific to "big data" (i.e. volume, changes in data format, data dirtiness, geographical and temporal reach), and I thought readers of this blog might find the list interesting. I guess you'll recognise some from preparing large-scale UAT data. Others you might recognise from the particular needs of aggregation. Clearly, I've not sanitised** my notes so they're all formed as pathologies; some are warnings, others solutions, but I think you'll be able to judge the underlying pathologies for yourself.

  • Aggregating timestamped data from global sources: how will you deal with the different time zones? To convert everything to universal time, you'll need to capture location – and even then, does it make sense to compare trends following an event that happens at 11am London time with figures from LA at 3am, or Tokyo at 7pm? (There's a small sketch of this after the list.)
  • What if a data source changes information retrospectively (e.g. deleted tweets)?
  • Trying to fit the real world into hierarchies where an entity has a single parent may be a lost cause. Think of tracks re-released in box sets, of the line-up of the Sugababes (or, come to that, the Supremes), of a singer who performs at a festival on his own, then with his band, then as a surprise guest drummer. Specific one-to-many and many-to-one relationships may not serve your demographic.
  • Duplication; an artist, single, album, album track and best-of might all have the same name. Living in a Box, anyone?***
  • De-duplication; The London Bulgarian Choir and London Bulgarian Choir are the same motley crew, though clearly the name data isn't precisely the same. It's not just the obscurely-choral edge cases - happens to Smashing Pumpkins too. And Sean Puff-Diddlydee.
  • Data from the past may be utterly different from that captured today.
  • Neat, dense, obscure, meaningful codes are unique and apparently useful. But they're unreadable by the humans who enter them, and therefore subject to a whole bunch of pathologies: finger trouble, un-spotted mistakes, misuse and misunderstanding. We were told an apocryphal tale of a country that used the same 10 codes, over and over again, for all their releases.
  • Mapping location to IP address is not consistent if one is consuming data spanning years.
  • If you're going to visualise, you're going to graph. If you're going to graph, you're going to count, so you'd better make sure you're counting countable things. Turning real interactions into countable stuff generally involves some machine learning or emergent-behaviour tuning, so you're into a whole new barrel of pathologies there.
  • Visualisation is no fun without the facility to compare, or (less so) to give context.
  • Playing with data is bad. One needs a purpose.
  • Not playing with data is bad. One does not know what one may find.
  • Batch, hourly summaries are too slow – if you don't design your architecture for real-time, you may be left out in the cold.
  • Thinking that individual items of data have value, when the real information is in the relationship between data.
  • There's no single registry of meta-data, so there are clearly meta-duplication troubles.
  • Don't assume that the same comment on Facebook / MySpace**** / Last.fm etc. means the same thing in each place. Different channels have different semantics.
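Here's the tiny sketch promised in the first bullet, using zoneinfo from Python's standard library. Keep the instant in UTC for ordering and aggregation, keep the source zone for sense-making; the date and the zones are just example values.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # One event at 11am London time, seen from three places.
    event = datetime(2012, 5, 29, 11, 0, tzinfo=ZoneInfo("Europe/London"))

    for zone in ("UTC", "America/Los_Angeles", "Asia/Tokyo"):
        print(f"{zone:20s} {event.astimezone(ZoneInfo(zone)):%H:%M} local")

Same instant, three very different points in the listening day – which is why "convert everything to UTC" doesn't, on its own, settle the question.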
* Emphasis ultimately more on business than music.
** monotonised
*** No? Just me? I can still hum it, but I've never owned a copy nor knowingly chosen it to listen to.
**** It's still there - and actually saw an upturn in fortunes last year. As far as anyone can tell. Some bits of the industry seem happy to be 18 months behind in their analysis.