For a lot of us, when we think data science we think pocket protectors and lab coats. But you think data science is a creative endeavour. How so?
Generally there is some problem to solve and there’s a ton of creativity in how you construct your analysis to address that problem.
A lot of what a data scientist does is communication. We take some fairly complicated answers and simplify them, and we tell someone who is not involved in the analysis what the conclusions are and how to understand them. That’s where the storytelling comes in.
Do you think there’s an opportunity for scientists to partner with journalists or marketers to tell great stories?
Some of the best collaborations I’ve seen have been a pair of people, one of whom comes from a mathematics background and one of whom comes from a creative background.
The team that wrote the popular OkCupid data blog, for example, consisted of a comedian and a data scientist.
One project I’m in love with right now tracks cicadas. The team built a series of hardware sensors, made an open-source kit so anyone can build their own, and is collecting the largely community-generated sensor data from around New York City.
They have this cool interactive map where you can see where the cicadas are projected to emerge, based on temperature and soil.
I love this project because it blends so many different things – it’s great data journalism, it’s community data gathering and it’s open-source hardware.
It’s almost like Revenge of the Nerds, where data collection is all of a sudden a sexy profession. At the same time, big data has become a buzzword. Is this a good thing?
There is certainly a lot of hype around the notion of big data. Generally my attitude is, you should ignore the hype and find the real value. There is plenty of real value to demonstrate but it’s not as easy as a lot of the hype has led us to believe.
Whenever we talk about big data, inevitably it becomes a dystopian conversation about privacy. You have said privacy is the wrong word. What is the right word?
I don’t think we have the right word yet for the conversation, but it’s not what we would traditionally call privacy. It’s really about the friction in how far my data can spread and how to control it.
When we talk about privacy we use words like violation. That’s not the case here. We are voluntarily sharing data; we just don’t necessarily see the consequences of that. I think we need to come up with a different label for this particular discussion.
You work as a data scientist in the private sector. Do you see more interesting research coming out of branded environments rather than universities?
I see interesting research coming from both, independently and in collaborations between the two. The academic side is actually in a kind of unfortunate position, because it’s very hard for people in academic positions to get access to the data they have the knowledge and skills to do good work with.
If you’re a startup you have data but you’re usually so busy you don’t really have the luxury of doing proper research.
bitly tries to give data to as many academic projects as possible. There are public data sets available, and if someone comes to bitly with a specific project, bitly will put together a data set and ask the academic institution to sign an agreement that lets them publish whatever they like from the data as long as they don’t profit from it.
Some really great research has come out of it. One example is a report called “Blogs and Bullets” from a group at George Washington University and the U.S. Institute of Peace that studied the effect of social media use during the Arab Spring. This was a topic bitly was fascinated by but did not have the domain knowledge to address.
Where does bitly get its data and how has bitly used that data internally?
bitly gets its data from all the bitly links shared and all the clicks on those links. The people at bitly look at what the click distributions look like geographically, by social network and by device. Then the content at the other end of those links is analyzed.
bitly is always finding unexpected things. The brand can see a really big distinction between what people will share publicly versus what they’re actually reading.
What people share publicly is a highly curated subset of the content they consume. It’s all stuff designed to make their identity look great, to make their lives look good and to make them look intellectual, whereas they will read celebrity gossip and sports scores. Seeing that so clearly in the data was actually a big surprise to me.
It shows that you have to choose which side to optimize for – consumption or sharing.