Tuesday, May 29, 2012

Looking for Trouble in Big Data

tl;dr - basing your decisions on aggregates? Tested that?

I was at another "Data Science" meetup yesterday – this one with a focus on the music business*. I was feeling overwhelmingly awkward, and once I'd managed to chip a front tooth (on a chip, of all things), I kept my head down. There was some interesting stuff, and some marketing guff. I made notes, and bits of it stuck.

Before I go further, here's information that'll make you smile if you're interested in the effect of piracy: The music industry can predict 90 days of legitimate sales from BitTorrent traffic in a given artist/work (Greg Mead, MusicMedia). And something to, perhaps, make you frown: Playing obscure, "long-tail" music on the radio really doesn't affect the degree to which most of us listen, mostly, to the most-played 1000 artists (Mark Levy, last.fm). I may be wrong on these, so verify your facts before relying on my half-baked bloggery.

One of my sets of notes was on pathologies, some specific to "big data" (ie volume, changes in data format, data dirtiness, geographical and temporal reach), and I thought readers of this blog might find the list interesting. I guess you'll recognise some from preparing large-scale UAT data. Others you might recognise from the particular needs of aggregation. Clearly, I've not sanitised** my notes so they're all formed as pathologies; some are warnings, others solutions, but I think you'll be able to judge the underlying pathologies for yourself.

  • Aggregating timestamped data from global sources: how will you deal with different timestamps? To convert everything to universal time, you'll need to capture location – and even then does it make sense to compare trends following an event that happens at 11am London time with figures from LA at 3am, or Tokyo at 7pm?
  • What if a data source changes information retrospectively (ie deleted tweets)?
  • Trying to fit the real world into hierarchies where an entity has a single parent may be a lost cause. Think of tracks re-released in box sets, of the line-up of the SugaBabes (or, come to that, the Supremes), of a singer who performs at a festival on his own, then with his band, then as a surprise guest drummer. Specific one-to-many and many-to-one relationships may not serve your demographic.
  • Duplication; an artist, single, album, album track and best-of might all have the same name. Living in a Box, anyone?***
  • De-duplication; The London Bulgarian Choir and London Bulgarian Choir are the same motley crew, though clearly the name data isn't precisely the same. It's not just the obscurely-choral edge cases - happens to Smashing Pumpkins too. And Sean Puff-Diddlydee.
  • Data from the past may be utterly different from that captured today.
  • Neat, dense, obscure, meaningful codes are unique and apparently useful. But they're unreadable by the humans who enter them, and therefore subject to a whole bunch of pathologies: finger trouble, un-spotted mistakes, misuse and misunderstanding. We were told an apocryphal tale of a country that used the same 10 codes, over and over again, for all its releases.
  • Mapping location to IP address is not consistent if one is consuming data spanning years.
  • If you're going to visualise, you're going to graph. If you're going to graph, you're going to count, so you'd better make sure you're counting countable things. Turning real interactions into countable stuff generally involves some machine learning or emergent-behaviour tuning, so you're into a whole new barrel of pathologies there.
  • Visualisation is no fun without the facility to compare, or (less so) to give context.
  • Playing with data is bad. One needs a purpose.
  • Not playing with data is bad. One does not know what one may find.
  • Batch, hourly summaries are too slow – if you don't design your architecture for real-time, you may be left out in the cold.
  • Thinking that individual items of data have value, when the real information is in the relationship between data.
  • There's no single registry of meta-data, so there are clearly meta-duplication troubles.
  • Don't assume that the same comment means the same thing on Facebook / MySpace**** / Last.fm etc. Different channels have different semantics.
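
To make the de-duplication point concrete, here's a minimal sketch of how one might flag names that are probably the same act. It isn't anything the speakers showed; the function names (name_key, group_candidates) and the normalisation rules are my own assumptions, and real catalogues need far more care.

```python
import re
import unicodedata
from collections import defaultdict


def name_key(name: str) -> str:
    """Reduce an artist name to a crude comparison key (illustrative rules only)."""
    # Fold accents and case so differently-typed spellings compare equal.
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch)).lower()
    # Replace punctuation with spaces and trim the ends.
    folded = re.sub(r"[^a-z0-9]+", " ", folded).strip()
    # Drop a leading article: "the smashing pumpkins" -> "smashing pumpkins".
    return re.sub(r"^the\s+", "", folded)


def group_candidates(names):
    """Group names sharing a key; only groups with several spellings are returned."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return {key: sorted(spellings)
            for key, spellings in groups.items() if len(spellings) > 1}


if __name__ == "__main__":
    print(group_candidates([
        "The London Bulgarian Choir",
        "London Bulgarian Choir",
        "Smashing Pumpkins",
        "The Smashing Pumpkins",
        "Living in a Box",
    ]))
```

This only catches spelling-level variation. It throws away anything outside a-z and 0-9, so it is useless for names in non-Latin scripts, and it won't help with genuine aliases (the Sean Puff-Diddlydee problem), which need a lookup table. It also makes the duplication pathology worse: an artist, single and album called Living in a Box all get the same key.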
* Emphasis ultimately more on business than music.
** monotonised
*** No? Just me? I can still hum it, but I've never owned a copy nor knowingly chosen it to listen to.
**** It's still there - and actually saw an upturn in fortunes last year. As far as anyone can tell. Some bits of the industry seem happy to be 18 months behind in their analysis.

Thursday, May 17, 2012

LEWT 11 is on Saturday (that's 19 May 2012)

LEWT is the London Exploratory Workshop on Testing. For those of you who don't know, it's a peer conference. Details are at http://www.workroom-productions.com/LEWT.html

LEWT11 is on Saturday, and we have two spaces available if you'd like to take part. As ever, you'll need to prepare a very short talk / topic for discussion. If you have contacts or colleagues who would be interested, they should contact me swiftly.

We also have two further spaces that we keep open for people with less than two years' experience in testing. Neither is taken. If you know someone who would find it useful to take part, they should also get in touch, but there's no requirement to bring a talk.

[Aside: remarkably, the last LEWT was in March, and I ran a similar thing at Let's Test. Peer workshops à go-go...]

Sunday, May 13, 2012

Apple's Keynote has synchronisation problems - here's a short diagnosis

EuroSTAR virtual is coming up, and I'm not keen on presenting live without an actual audience. Brrr.

So I decided to pre-record my talk. Turns out the audience want slides, not a talk, and I assumed that I would put the slides together in Keynote, and click along to the pre-recorded sound of my own voice. However, the client wants PowerPoint, which I don't have. I could play it live on the day, but I was doubtful whether gotomeeting would take audio from a prerecorded source along with live video, and I certainly didn't want to sort all this out on the day and to deadline. I decided to export the audio and slides to a quicktime movie. Keynote says it can, so I assumed it could. And we all know where that leads.

If you've got a pre-recorded backing track, Keynote gives you a one-shot go at recording all the slide transitions as it plays. The talk is around 30 minutes long, so I needed to put the rest of my world on hold and tap along to the sound of my own voice for half an hour. If I made a mistake at the end, I'd need to do it again from the start. This is clearly not ideal*, and I wondered how well the supporting technology might work.

For the sake of this posting, let's ignore the ever-yawning pit of despair as I listened to my stumbling and ill-chosen words over and over again, but I challenge you to listen to a talk of your own and not cringe. Let's also ignore Keynote's habit of telling you the recording is 'out of sync' whenever you change anything, at all, and not withdrawing its warning when you undo the changes. Let's just say that, some time later, I had a recording of clicks to go with the recording of my voice, and that Keynote wasn't being neurotic about the sync.

I exported to a Quicktime movie. It took, for a 30-minute movie, about an hour. Bear in mind this is a mostly still, white slide deck - and even my grimy old kit converts HDTV to 720p H.264 Quicktime at double speed. On playback, the slides ended well before the words, but I didn't watch the thing through to see where it had gone out - it was correct at the start, way out at the end, and it seemed to get worse as the deck went on. Putting that one down to experience, I decided that a plausible cause was that I was asking my kit to do too much, or that a visually-heavyweight regular request (I prefer to dissolve from one slide to another) caused it to go out of sync by a little, often. I took out most of the dissolve transitions, turned off everything else the machine was doing, re-recorded all my clicks, and tried exporting again.

And it was a bit better, but not right.

Now I could have gone on incrementally changing, but I'm a tester, and I don't believe my own hypotheses. I exported again, immediately. Another hour. I went for a bike ride, came back and cooked. The second was differently out, and it was worse. I watched, as far as I could bear it, flipping back and forth using a binary search approach to isolate moments where things went way out. In the first, a particular and simple transition did not complete. It ended late – subsequent slides tried to catch up, but went far too early. In the second, a short video that should play as soon as a slide was shown started after a long pause. I made a few changes to those bits, ignored the now meaningless** sync warning, and exported again, just to see. The third export was rubbish, as you'd expect, but differently rubbish; the problem transitions worked this time, but subsequent slide transitions were more banjaxed than before. Oddly, the sync wasn't as far out at the end.
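
For the curious, the flipping back and forth amounts to something like the sketch below. I did it by eye with a scrub bar, not with code; the function name and the is_in_sync check are hypothetical, and the whole thing assumes that sync, once lost, stays lost, which was only roughly true of my exports.

```python
def first_bad_moment(is_in_sync, duration_s, tolerance_s=1.0):
    """Binary-search for the earliest time (in seconds) at which sync is lost.

    Assumes is_in_sync(0) is True, is_in_sync(duration_s) is False,
    and that sync never recovers once it has been lost.
    """
    lo, hi = 0.0, float(duration_s)  # invariant: in sync at lo, out of sync at hi
    while hi - lo > tolerance_s:
        mid = (lo + hi) / 2
        if is_in_sync(mid):
            lo = mid   # still fine here, so the trouble starts later
        else:
            hi = mid   # already out, so the trouble starts here or earlier
    return hi


if __name__ == "__main__":
    # Stand-in for a human watching the movie: pretend sync is lost at 754 s.
    print(first_bad_moment(lambda t: t < 754.0, duration_s=1800))
```

For a 30-minute movie and a one-second tolerance that is 11 spot checks, rather than half an hour of listening to myself. Where sync wobbles in and out, this finds one point where it goes wrong, not necessarily the first.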

Noticing that you can get Keynote to simply play back automatically without export, I re-recorded (a familiar process by now) and played back. Remember, I was using no export – just automatic transitions as if a ghostly machine hand was pressing buttons, not me. The presentation was borked in the first three minutes; one slide hung on far too long. I didn't stick around to see what else was going to go wrong.

I felt I should have tried this earlier. Clearly, no matter what was recorded, the playback itself was a problem – though not one related to CPU usage or contemporaneous export. And not one related to the rate of slides; Keynote happily keeps up when I'm clicking through by hand, but not with pre-scripted changes. The problem didn't consistently make the slides faster than the audio, or vice-versa.

I flipped open the source*** – the recorded timestamps were fine. Indeed, I've used them as a script for whichever poor AV person at the client gets the job of clicking through my deck. This information confirmed my hypothesis.

My diagnostic process, such as it was, was hampered by how long it took to set up an experiment and get the results (around two hours for each). I didn't bother trying to find a minimum viable set as I was trying to find a workaround, not trying to work out what was going wrong. I did try to eliminate some plausible problems, and I got good information while trying to get beyond the emergent behaviour, and into the internals of what was going on.

If I wanted to carry on, I could create a selection of Keynote files to see whether playback is a problem in all circumstances, or just some. I could consider a disconfirming hypothesis (maybe it's not playback, maybe it's my machine and playback, or maybe there's another set of corrupted timestamps that I've not looked for). I could waste more days of my life trying to get to the bottom of it. Apple, if you're out there, buzz me and pay me and I might just take it on.

But for now, I'm done.

I've taken the timings as a script. I've exported the slides to an obscure Quicktime format that contains unstoppable audio and all the transitions. It plays automatically, and happily moves on at a human-driven keypress. It's the right quality, the right audio, the right collection of transitions in the right order. I hope it suits the client – but I'll wait to see.

We already know that PowerPoint is no fun to work in. However, if you need to work on a self-playing prerecorded presentation, I suggest that Keynote is as much use as a banana lightsaber.

For an inspiring alternative, watch the video for Bob Dylan's Subterranean Homesick Blues. True fact, trivia fiends: It was recorded in London behind the Savoy Hotel. La di bloody da.

The EuroSTAR virtual conference is on Wednesday. Be there, or be somewhere else. Sod that - be there, and be somewhere else. I shall be.

* I've checked the manual, and there is a way of pausing, but no straightforward way of undoing or moving a mistaken slide change.
** If it's wrong when it doesn't warn you, why pay attention when it does?
*** deep breath: I'm not using packages (it's a silently-changed option in Keynote 09), so the saved file is obscured. For Keynote, thankfully, there's a way out: change the file extension to .zip, and unzip it to a folder. Inside the folder, I looked into the ugliest large-ish file (index.apxl, here), recognised it as not-really-XML (the file chewed up textmate and spat it out), found timestamps tagged with narration-event that seem to correspond to my expectations, clipped that bit out and fiddled with it until the events, slides, builds and timestamps were unfogged and human-readable. I may be a Mac user, but that doesn't mean I'm so much of a numpty I can't get down to business.
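
If you'd rather not rename and unzip by hand, the first part of that can be sketched in a few lines. This is an after-the-fact illustration, not what I ran: the function name is made up, and it assumes only what I saw in my own file, namely a zipped single-file Keynote document containing index.apxl with tags that mention narration-event. It makes no guesses about attribute names and simply hands back the raw tags for you to squint at.

```python
import re
import zipfile


def narration_event_tags(key_file, member="index.apxl"):
    """Return the raw opening tags that mention 'narration-event'.

    key_file: path (or file-like object) of a zipped Keynote document.
    member:   name of the XML-ish file inside the archive (an assumption
              based on one file; yours may differ).
    """
    with zipfile.ZipFile(key_file) as archive:
        text = archive.read(member).decode("utf-8", errors="replace")
    # Opening or self-closing tags only; closing tags are skipped.
    return re.findall(r"<(?!/)[^<>]*narration-event[^<>]*>", text)


if __name__ == "__main__":
    import sys
    for tag in narration_event_tags(sys.argv[1]):
        print(tag)
```

Using a regular expression rather than an XML parser is deliberate: the file didn't behave like well-mannered XML for me, and a dumb text match doesn't care. Turning the tags into a readable script of slides and timings still took fiddling.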