tl;dr - basing your decisions on aggregates? Tested that?
I was at another "Data Science" meetup yesterday – this one with a focus on the music business*. I was feeling overwhelmingly awkward, and once I'd managed to chip a front tooth (on a chip, of all things), I kept my head down. There was some interesting stuff, and some marketing guff. I made notes, and bits of it stuck.
Before I go further, here's information that'll make you smile if you're interested in the effect of piracy: The music industry can predict 90 days of legitimate sales from BitTorrent traffic for a given artist/work (Greg Mead, MusicMedia). And something to, perhaps, make you frown: Playing obscure, "long-tail" music on the radio really doesn't affect the degree to which most of us listen, mostly, to the most-played 1000 artists (Mark Levy, last.fm). I may be wrong on these, so verify your facts before relying on my half-baked bloggery.
One of my sets of notes was on pathologies, some specific to "big data" (i.e. volume, changes in data format, data dirtiness, geographical and temporal reach), and I thought readers of this blog might find the list interesting. I guess you'll recognise some from preparing large-scale UAT data. Others you might recognise from the particular needs of aggregation. Clearly, I've not sanitised** my notes so they're all formed as pathologies; some are warnings, others solutions, but I think you'll be able to judge the underlying pathologies for yourself.
- Aggregating timestamped data from global sources: how will you deal with timestamps recorded in different local times? To convert everything to universal time, you'll need to capture location – and even then does it make sense to compare trends following an event that happens at 11am London time with figures from LA at 3am, or Tokyo at 7pm?
- What if a data source changes information retrospectively (e.g. deleted tweets)?
- Trying to fit the real world into hierarchies where an entity has a single parent may be a lost cause. Think of tracks re-released in box sets, of the line-up of the Sugababes (or, come to that, the Supremes), of a singer who performs at a festival on his own, then with his band, then as a surprise guest drummer. Specific one-to-many and many-to-one relationships may not serve your demographic.
- Duplication; an artist, single, album, album track and best-of might all have the same name. Living in a Box, anyone?***
- De-duplication; The London Bulgarian Choir and London Bulgarian Choir are the same motley crew, though clearly the name data isn't precisely the same. It's not just the obscurely-choral edge cases - happens to Smashing Pumpkins too. And Sean Puff-Diddlydee.
- Data from the past may be utterly different from that captured today.
- Neat, dense, obscure, meaningful codes are unique and apparently useful. But they're unreadable by the humans who enter them, and therefore subject to a whole bunch of pathologies: finger trouble, un-spotted mistakes, misuse and misunderstanding. We were told an apocryphal tale of a country that used the same 10 codes, over and over again, for all their releases.
- Mapping IP address to location is not consistent if one is consuming data spanning years.
- If you're going to visualise, you're going to graph. If you're going to graph, you're going to count, so you'd better make sure you're counting countable things. Turning real interactions into countable stuff generally involves some machine learning or emergent-behaviour tuning, so you're into a whole new barrel of pathologies there.
- Visualisation is no fun without the facility to compare, or (less so) to give context.
- Playing with data is bad. One needs a purpose.
- Not playing with data is bad. One does not know what one may find.
- Batch, hourly summaries are too slow – if you don't design your architecture for real-time, you may be left out in the cold.
- Thinking that individual items of data have value, when the real information is in the relationship between data.
- There's no single registry of meta-data, so there are clearly meta-duplication troubles.
- Don't assume that the same comment on Facebook / MySpace**** / Last.fm etc. means the same thing on each. Different channels have different semantics.
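Two of these, the timestamps and the de-duplication, are easy enough to poke at in a few lines, so here's a rough Python sketch. It's mine, not anything shown at the meetup, and it leans on assumptions: that each source records a UTC offset with its timestamp (instead of you having to work it out from location), and that a crude normalised key is good enough to spot near-identical names. The function names `to_utc` and `name_key`, the date and the sample data are all made up for illustration.

```python
import re
import unicodedata
from datetime import datetime, timezone


def to_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries its own UTC offset; return it in UTC."""
    parsed = datetime.fromisoformat(stamp)
    if parsed.tzinfo is None:
        # Assumption: a timestamp with no offset can't be placed on a global timeline.
        raise ValueError(f"no UTC offset on {stamp!r}")
    return parsed.astimezone(timezone.utc)


def name_key(name: str) -> str:
    """Crude matching key: fold case and accents, squash punctuation, drop a leading 'the'."""
    folded = unicodedata.normalize("NFKD", name).casefold()
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    folded = re.sub(r"[^a-z0-9]+", " ", folded).strip()
    return re.sub(r"^the ", "", folded)


# Made-up sample: one event, as three sources might log it in local time.
events = {
    "London": "2013-06-12T11:00:00+01:00",
    "Los Angeles": "2013-06-12T03:00:00-07:00",
    "Tokyo": "2013-06-12T19:00:00+09:00",
}
instants = {to_utc(stamp) for stamp in events.values()}
local_hours = sorted(datetime.fromisoformat(stamp).hour for stamp in events.values())

# Made-up sample: the same acts, entered slightly differently.
names = [
    "The London Bulgarian Choir",
    "London Bulgarian Choir",
    "Smashing Pumpkins",
    "The Smashing Pumpkins",
]
groups: dict[str, list[str]] = {}
for name in names:
    groups.setdefault(name_key(name), []).append(name)

print(len(instants), local_hours)  # one shared instant, three different local hours
print(groups)
```

Converting to UTC tells you the three rows are the same moment; it doesn't tell you whether 3am in LA is comparable with 7pm in Tokyo, which is the harder half of the problem. And the name key is deliberately blunt: it would cheerfully merge an artist, single and album that share a name, so in practice you'd want to key on entity type as well.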
** monotonised
*** No? Just me? I can still hum it, but I've never owned a copy nor knowingly chosen it to listen to.
**** It's still there - and actually saw an upturn in fortunes last year. As far as anyone can tell. Some bits of the industry seem happy to be 18 months behind in their analysis.