Knowledge Bridge

Global Intelligence for the Digital Transition

Jeremy Wagstaff / October 15, 2018

As the Big Data Era Arrives, It Pays To Remember What Data Journalism Is

Data and journalism are natural bedfellows: without information we’d be lost. But has the creation of a sub-discipline that calls itself ‘data journalism’ helped or hindered the profession’s embrace of the digital era?

In researching data journalism in the era of big data, I have found myself trying to define what “data” means in this context. Of course when we refer to data we usually envisage significant amounts of numerical information, but as data sets get unimaginably large this itself becomes problematic. So I decided to take a step back and look at a successful data-driven story closer to home.

I chose a cluster of stories from Manila by the Reuters reporting team of Clare Baldwin, Andrew Marshall and Manny Mogato, who covered Philippine president Rodrigo Duterte’s war on drug suspects. (Transparency alert: I was, until earlier this year, an employee of Thomson Reuters.) The three of them won a Pulitzer in the International Reporting category for this coverage. I want to focus on just a couple of these stories because I think they help define what data journalism is. This is how their story of June 29, 2017 described how Philippine police were using hospitals to hide drug war killings:

  • An analysis of crime data from two of Metro Manila’s five police districts and interviews with doctors, law enforcement officials and victims’ families point to one answer: Police were sending corpses to hospitals to destroy evidence at crime scenes and hide the fact that they were executing drug suspects.
  • Thousands of people have been killed since President Rodrigo Duterte took office on June 30 last year and declared war on what he called “the drug menace.” Among them were the seven victims from Old Balara who were declared dead on arrival at hospital.
  • A Reuters analysis of police reports covering the first eight months of the drug war reveals hundreds of cases like those in Old Balara. In Quezon City Police District and neighboring Manila Police District, 301 victims were taken to hospital after police drug operations. Only two survived. The rest were dead on arrival.
  • The data also shows a sharp increase in the number of drug suspects declared dead on arrival in these two districts each month. There were 10 cases at the start of the drug war in July 2016, representing 13 percent of police drug shooting deaths. By January 2017, the tally had risen to 51 cases or 85 percent. The totals grew along with international and domestic condemnation of Duterte’s campaign.
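The arithmetic behind figures like these is not complicated, which is part of the point. As a rough sketch only: assuming each police report has been reduced to a month and an outcome (the field names, the outcome labels and the sample rows below are invented for illustration and are not the Reuters data or method), the monthly share could be tallied like this:

```python
from collections import defaultdict


def doa_share_by_month(rows):
    """Tally, per month, how many shooting deaths were logged as dead on arrival.

    rows: iterable of (month, outcome) pairs, e.g. ("2016-07", "dead_on_arrival").
    Returns {month: (doa_count, total_deaths, percent_doa)}.
    """
    totals = defaultdict(int)
    doa = defaultdict(int)
    for month, outcome in rows:
        totals[month] += 1
        if outcome == "dead_on_arrival":
            doa[month] += 1
    return {
        month: (doa[month], totals[month], round(100 * doa[month] / totals[month]))
        for month in sorted(totals)
    }


def implied_total(cases, percent):
    """Back out the rough total that a published count and percentage imply."""
    return round(cases * 100 / percent)


# Invented rows for illustration only; these are NOT the Reuters records.
sample_rows = [
    ("2016-07", "dead_on_arrival"),
    ("2016-07", "died_at_scene"),
    ("2016-07", "died_at_scene"),
    ("2016-07", "died_at_scene"),
    ("2017-01", "dead_on_arrival"),
    ("2017-01", "dead_on_arrival"),
    ("2017-01", "dead_on_arrival"),
    ("2017-01", "died_at_scene"),
]

for month, (doa_count, total, pct) in doa_share_by_month(sample_rows).items():
    print(f"{month}: {doa_count} of {total} deaths dead on arrival ({pct}%)")

# Figures quoted in the story: 10 cases at 13 percent, 51 cases at 85 percent.
print(implied_total(10, 13), implied_total(51, 85))
```

Run against the figures quoted in the story, the second helper suggests roughly 77 police shooting deaths in the two districts in July 2016 and about 60 in January 2017. Those totals are only approximate, because the published percentages are rounded.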

This is data journalism at its best. At its most raw. The simple process of finding a story, finding the data that supports and illustrates it, and then writing that story and using the findings to illuminate and prove it. Of course, the data set we’re talking about here is smaller than in other data-driven stories, but it’s still the point of the story, the difference between a story and a much lesser one.

But how did they come by that data? Andrew tells me it was done the old-fashioned way: first, they got the tip, the anecdotal evidence that police were covering up deaths by using ambulances to carry away the dead. Then they went looking for proof: for police reports, which are public information in the Philippines and so can, in theory, be obtained legally. These they found, because they looked early and persistently. Then it was a question of assembling these, and cleaning them up. In some cases, it meant taking photos of barely legible police blotters at a station entrance.

All their stories, Andrew told me, were driven by the reporters already having a sense of what the story was, and then looking for proof. That means already knowing enough about the topic to have formed an opinion about what to be looking for: about what may have happened, about what angle you’re hoping to be able to prove, about what fresh evidence you believe the data will unearth for you. A data-driven story doesn’t always mean wandering around the data without a clear idea of what you’re looking for. In fact, it’s better to already know. “The key thing,” he told me, “is that this all grew out of street reporting. We wouldn’t have thought to look for it if we hadn’t heard.”

That’s the first lesson from their experience. Data is something that is there that helps you prove — or refute — something you have already established to be likely from sources elsewhere. 

This is where I think sometimes data journalism can come adrift. By focusing too much on the “data” part of it, we lose sense of the “journalism” part of it. “It’s the blend of street reporting and data analysis that paid the great dividend,” Andrew said.

A definition of data journalism should probably start somewhere there, but it tends not to. Instead we tend to get data journalism as a “set of tools”, or “locating the outliers and identifying trends that are not just statistically significant but relevant”, or “a way to tell richer stories” (various recent definitions). These are all good, but I’m not sure they’re enough to help us define how best to use data for journalism.

By emphasizing data over journalism we risk removing and rarefying the craft, creating a barrier where it doesn’t need to exist. As the Philippines example shows, data is not always something that sits in databases, servers, libraries or archives. Nor is it always something that you have to ask for. It’s something you use to tell better stories and to reinforce the facts in your coverage. A study by Google last year trumpeted that more than half of newsrooms surveyed had a dedicated data journalist.

Aren’t we all, or shouldn’t we all consider ourselves, data journalists? Shouldn’t we all be looking for data to enrich — if not prove the thesis that underlies — our stories? 

Back to Andrew’s example. For the team it was something of a no-brainer to work on obtaining this data. The story would have been unthinkable without it. This might not be part of every journalist’s instinct, but it’s telling that in this example the data became central to their story and took weeks, even months, to assemble.

The place to start was with the local police and hospitals. Getting the data from them was legal. But it wasn’t easy, and became increasingly less so as the work developed. Clare Baldwin was greeted at one station by homicide detectives who shouted and lifted their shirts to display their guns. Later, Andrew told me, it became much more difficult to get access to this information as the Duterte government realized what it was being used for.

The lesson from this is that data is not necessarily something that is easy to get, or easily given up. Or that arrives in pristine form. It requires some major work in verifying, identifying and compiling. It is more akin to the example of Bellingcat, the crowdsourcing website created by journalist Eliot Higgins, which conducts what it calls open-source investigations of data sources, ranging from social media photographs to online databases.

Of course, not all stories are going to be like this, and not all data is going to be like this. But all journalism, data or otherwise, requires thinking that starts from a similar place: a strong knowledge of what the story might be, and where to find it; a sense of whether there might be data that could help; knowing where to find that data; not being daunted in obtaining it, or by the condition it is in; understanding the context of the data that you have; and knowing what to do with it. And finally, in Andrew’s words, “to use that data quickly, not just to sit on it”.

The team’s stories stand on their own. As another example, the Reuters graphics team, led by Simon Scarr, also did some extraordinary visualizations which helped readers understand the stories better and gave them additional impact. Visualization and data journalism are obvious bedfellows.

This isn’t to say that the idea for a story can’t sometimes lie in the data itself. Data journalism can mean taking data as inspiration to explore and write a story, rather than beginning the process by talking to sources. At its most basic this could be a simple story about a company’s results, or a country’s quarterly trade figures: data-driven stories where the journalist reports the new numbers, compares them with the earlier numbers, and then adds some comment.

But when there is overemphasis on data journalism as a separate part of the news process it can pose problems. There’s been quite a lot written about a backlash against ‘nerd journalists’ and an exodus of those computer-literate staff in newsrooms who are sick of the skepticism and relatively low salaries. I’ve not witnessed this firsthand, but I have seen how little interest there is in learning more about the ‘techie’ side of journalism that might help reporters wrestle with data beyond their familiar charts and tables. Editors are partly to blame: stories that involve dirty or larger data-sets do take longer and so are often unwelcome, unless they fall into a special category. So reporters quickly figure out they’re better off not being overly ambitious when it comes to collecting data.

Data journalism tends to be limited to a handful of really strong players. In my neck of the woods in South and Southeast Asia there’s an impressive array of indigenous outfits (i.e. not one of the big multinationals): Malaysiakini are almost old hands at this process now. Their sub-editor, Aun Qi Koh, told me that as it gets easier in terms of knowing which tools to use and how to use them, so it gets harder because “we want to push ourselves to do more, and to better…and of course, there’s the challenge of trying to streamline a process that often involves a team of journalists, graphic designers, programmers, social media marketers and translators.”

This is impressive, and demonstrates what is possible. News organizations are making the most of governments’ gradual commitment to opening up their data, and leveraging issues that the public care about. In the Philippines, Rappler has been making waves, and won an award for its #SaferRoadsPH campaign, which compiled and visualized statistics on road crash incidents and has led to local police drawing pedestrian lanes outside schools.

These kinds of initiatives are tailor-made for visual data journalism. Not least because journalists don’t have to rely on government data that might be absent, incomplete or wrong. Or, in some cases, just unreliable. Malaysiakini’s Aun Qi Koh said that the data in a government portal set up in 2014 was neither “organized properly nor updated regularly.” That seems to be par for the course. And while staff everywhere need better training, those who do have the necessary training tend to get snapped up by, and attracted to, private sector companies rather than relatively low-paying journalist positions, according to Andrew Thornley, a development consultant in Jakarta.

I’m impressed by all these projects, especially those doing great journalism on a shoestring. But I hope it doesn’t sound churlish if I say I still think this is scratching the surface of what is possible, and that we may not be preparing ourselves as well as we could for the era of big data.

Take this story as an example: Isao Matsunami of Tokyo Shimbun is quoted in the Data Journalism Handbook as talking about his experience after the earthquake and Fukushima nuclear disaster in 2011: “We were at a loss when the government and experts had no credible data about the damage,” he wrote. “When officials hid SPEEDI data (predicted diffusion of radioactive materials) from the public, we were not prepared to decode it even if it were leaked. Volunteers began to collect radioactive data by using their own devices but we were not armed with the knowledge of statistics, interpolation, visualization and so on. Journalists need to have access to raw data, and to learn not to rely on official interpretations of it.”

The data he’s talking about was created by Safecast, an NGO based in Japan which started building its own devices and deploying its own volunteers when it realised that there was no reliable and granular government data on the radiation around Fukushima. Now it produces its own open source hardware and has one of the largest such data-sets in the world, which also includes air quality readings and covers sizeable chunks of the globe.

The future of data journalism lies, I believe, in exactly this: building early, strong relationships with outside groups — perhaps even funding them. More routinely, journalists should find their own sources of raw data where it’s relevant and practical, and fold the mindset, tools and approach of data journalism into their daily workflows the rest of the time. You can already see evidence of the latter on sites like Medium and Bloomberg Gadfly, where journalists are encouraged to incorporate data and charts into their stories and to build an argument. Much of this is already happening: Google’s survey last year found that 42% of reporters use data to tell stories twice or more per week.

But the kind of data being used may be open to question. Data is no more a journalist’s friend than any source — it has an agenda, it’s fallible, and it can often be misquoted or quoted out of context. As journalists we tend to trust statistics, and interpretation of those statistics, a little too readily.

For the sake of balance, here’s a Reuters story from 2014, still online, that quotes an academic study (“Anti-gay communities linked to shorter lives”) despite the fact that in February this year a considerable correction was posted to the original study. (“Once the error was corrected, there was no longer a significant association between structural stigma and mortality risk among the sample of 914 sexual minorities.”) We are not, as journalists, usually given to expressing skepticism about data provided by academics and similar but maybe we should. (And I suppose we should be better at policing our stories, even if the correction is required years after the story first appeared.)

Tony Nash, founder of one of the biggest single troves of economic and trade data online, believes journalists tend to let their guard down when it comes to data: “The biggest problem with data journalism is that data is taken at face value. No credible journalist would just print a press release but they’ll republish data without serious probing and validation. Statistics agencies, information services firms, polling firms, etc. all laugh at this.”

Day-to-day journalism, then, could benefit from being both more skeptical and more ambitious about the data it uses. Tony says he’s tried in vain to interest journalists in using his service to mash stories together, so instead writes his own newsletter, often ‘breaking’ stories long before the media: “In July 2017 I showed that Mexico and China are trade competitors but journos always believe China has an upper hand in trade. For all of 2017, Mexico exported more TVs to the US than China. For the first time. It was not a surprise to us. Most journos still have not woken up to that,” he told me recently.

This could be coupled with tools that make it easier to work visuals into stories: Datawrapper, a chart-making tool, for example, has launched an extension called River, which makes it easier for journalists to identify stories or add data to breaking stories.

But this is just the start. We are in the era of big data and we are only at the beginning of that era. The Internet of Things (IoT) is a fancy term to cover the trend of devices being connected to the internet (rather than people through their devices, as it were). There will be sensors on everything, but there will also be light switches, washing machines, pacemakers, weather-vanes, even cartons of milk, telling us whether they’re on or off, full or empty, fresh or sour. All will give off data. Right now only about 10% of that data is being captured. But that will change. According to IDC, a technology consultancy, more than 90 percent of this IoT data will be available on the cloud, meaning that it will be analyzed by governments, by companies, and possibly by journalists. The market for all this big data, according to IDC, will grow from $130.1 billion in 2016 to over $203 billion in 2020. This market will primarily be one about decision making: a cultural shift to “data-driven decision making”.

You can see some of this in familiar patterns already: most of it is being used to better understand users. Think Amazon and Netflix getting a better handle on what you want to buy or watch next. But that’s pretty easy. How about harder stuff, such as taking huge disparate data sets (the entire firehose of Twitter, say, along with Google searches and Facebook usage, all anonymized of course) to be able to slice target audiences very thinly? One Singapore-based company I spoke to has been able to build a very granular, as they call it, picture of who they want to target, down to the particular tower block, their food preference (pizza), music (goth) and entertainment (English Premier League). Not only does this make advertisers happy that they’re going after the right people, it makes it much cheaper.

But this is just the beginning of big data. Everything will be spitting out data: sensors in cars, satellites, people, buildings; everything we do, say, write etc. Knowing what data there is will be key. Another Reuters graphics story, which won the award for Data visualization of the year at the Data Journalism Awards 2018, involved realising the value of a data-set of GPS coordinates of infrastructure, gathered by aid agencies working on the ground at a Rohingya refugee camp in Cox’s Bazar, to analyze the health risks of locating water pumps too close to makeshift toilets. And then there’s knowing whether there might be other data hiding within the data: BuzzFeed’s application of machine learning to Flightradar aircraft data to single out the clues that revealed hidden surveillance flights also won a Data Journalism Award.

These are small glimpses of the future of the kinds of data journalism we might see.

In the future it will be second nature to journalists not only to know what kind of data is being collected and to turn it to their own uses, but to try to collect it pre-emptively. This will require lateral thinking. Journalists have been using satellite imagery for several years as part of their investigations but this is likely to become even easier, cheaper, and more varied. One entrepreneur I spoke to recently is launching dozens of micro-satellites to monitor methane emissions: data of interest to oil and gas companies worried about gas leaks, governments enforcing greenhouse gas regulations, as well as hedge funds looking for exclusive economic indicators. Imagine if a journalist were able to peruse that data and uncover military activity from heat emissions even before governments know about it.

This is just the tip of the iceberg, and while journalists may not be at the front of the queue for this kind of data, it’s going to be important to know what kind of data is out there. Already the notion of what a “leading indicator” is has begun to change — an investor in China is much more likely to be trawling through data from Baidu than government statistics to get a sense of what is going on, and smart journalists already know that.

The future of data journalism, if it is successful, will still be journalism. And data will still be data. But as the world of data gets bigger, it pays to remember that the relationship between ‘data’ and journalism is still about thinking and acting creatively and quickly to uncover stories others may not want us to tell. 

