Does This Exposed Chinese Database Pose a Security Threat?
Unless There's More to It, Shenzhen Zhenhua Data Appears to Be Scraped Public Data
A leaked database compiled by a Chinese company has suddenly become the focus of multiple media reports warning that it could be used as an espionage instrument by Beijing. But on closer examination, the data appears to be public information that's been scraped, largely from social media sites and other public sources.
On Monday, multiple news outlets - including Australia's ABC and Financial Review - released a coordinated scoop about a leaked database from China that includes massive amounts of data on U.S. military members, as well as prominent members of Australian society, including many politicians.
The database contains details on at least 2.4 million people, including 35,000 Australians, among them Prime Minister Scott Morrison and many business people.
Breathless reporting about the database has stoked fears of how Beijing may be collecting data on politicians, military officers and other prominent people to potentially target them via future intelligence operations. But while it's easy to spin up a furor over anything involving China and cybersecurity, this data exposure deserves a more precise examination.
The database comes from a company called Zhenhua Data. According to Christopher Balding, an American academic in Vietnam, a source in China passed him the data, putting the source "at risk" from the Chinese Communist Party.
"The individual who provided the Shenzhen Zhenhua database by putting themselves at risk to get this data out has done an enormous service and is proof that many inside China are concerned about CCP authoritarianism and surveillance," Balding writes in a Monday blog post.
What is the OKIDB?
The database is called the Overseas Key Information Database, or OKIDB. As I read the reports about it on Monday, I thought it sounded familiar - and that I might have seen it before. In fact, I had, but the version I'd received included a misspelling, making it the "Oversea Key Information Database."
By virtue of being on the cybersecurity beat, I often receive - and welcome - tips about leaks, and I've amassed files filled with random leaked data, adding up to a bucket of leaks, many of which remain unconfirmed. For me, the OKIDB information had remained in that bucket.
Based on the reports I'd seen, I started posting screenshots from the OKIDB - the version that I'd received - to Twitter on Monday morning, flagging the posts to Balding and Robert Potter, the co-founder of a Canberra-based company called Internet 2.0. Balding had shared the database with Potter so it could be put into a more digestible format, because the version Balding had received was corrupted.
I called Potter on Monday morning, and it became clear that the OKIDB that I have is the same database Balding and Potter possess. In response, some people have rightly asked me why I didn't write about this sooner. Here's the skinny.
(1/6) The China database is causing a fair amount of stir in Australia, but before we get too spun up about China-spying-targeting-etc., there are a few important points to keep in mind. - Jeremy Kirk (@Jeremy_Kirk), September 14, 2020
The database was brought to my attention in late December 2019 or early January 2020 by a computer security researcher. The database had been left on the internet, open for anyone to access, presumably by mistake. In more precise computer security parlance, it was an unsecured Elasticsearch cluster.
Elasticsearch is an open-source platform for storing and querying data. By default, Elasticsearch clusters are not publicly accessible. But the clusters can be rolled out in a misconfigured manner, leaving the stored data internet-accessible to others. Often, it's possible to hunt out misconfigured Elasticsearch instances using device-focused search engines such as Shodan.io.
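For readers who run Elasticsearch themselves, here is a minimal sketch of how that kind of exposure can be checked from the outside. It assumes the cluster's HTTP interface is reachable at a URL you supply and relies on the fact that an Elasticsearch root endpoint answers with a JSON document containing a "cluster_name" field. The function names and result labels are mine, invented for illustration; this is not the method the researcher used to find the OKIDB, and it should only be pointed at systems you are authorized to test.

```python
import json
import urllib.error
import urllib.request


def classify_response(status, body):
    """Classify an HTTP reply from a suspected Elasticsearch root endpoint.

    The labels "open", "protected" and "unknown" are illustrative names,
    not Elasticsearch terminology.
    """
    if status in (401, 403):
        # The server demanded credentials or refused the anonymous request.
        return "protected"
    if status == 200:
        try:
            info = json.loads(body)
        except ValueError:
            return "unknown"
        # An Elasticsearch root endpoint returns a JSON object with "cluster_name".
        if isinstance(info, dict) and "cluster_name" in info:
            return "open"
    return "unknown"


def check_cluster(base_url, timeout=5.0):
    """Send one unauthenticated GET to base_url and classify the reply."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify_response(resp.status, body)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx replies; classify by status code.
        return classify_response(err.code, "")
    except OSError:
        # Covers connection refused, DNS failures and timeouts.
        return "unreachable"


# Example (assumed address; 9200 is Elasticsearch's default HTTP port):
# print(check_cluster("http://localhost:9200"))
```

A result of "open" means anyone who can reach that address can read the cluster's banner without logging in, and very likely its indices too. "Protected" only shows that anonymous requests are rejected at the root endpoint; it is not a full security audit.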
When I reviewed the data stored in the OKIDB, it appeared impressive mostly for its size - hundreds and hundreds of gigabytes - but otherwise the data didn't appear to be sensitive. All of it seemed to be public. For example, there were bits from U.S. Navy press releases, announcing deployments of ships, some of which had been translated into Mandarin.
One of the indices contained a list of U.S. Air Force personnel. It included names and addresses but no birth dates. Those listings contained a couple of interesting fields, such as "airmenID" and "medicalExpirationDate." Entries for U.S. Navy officers often included links to public biographies that have been posted on Navy websites.
Other indices contained what appeared to be research papers from think tanks. Copious amounts of data had been copied from sources including Crunchbase and EveryPolitician. On the whole, however, I didn't see anything that would have raised alarms.
Social Media Scraping
The data collection has been tied to a domain, aggso[dot]com, which appeared to belong to a commercial company that specialized in aggregating public data. The front page of its now shuttered website - okidb.aggso[dot]com - mentioned numerous data sources, including LinkedIn, Facebook, Instagram, YouTube, Twitter and Medium. It appeared quite similar to U.S. companies such as Spokeo or Pipl, which mine a variety of data sources and link them together.
Early views of aggso[dot]com on the Wayback Machine - from around 2012 - show that it started as a social media management system called the Weiju Social Media Management System.
Over time, the company changed how it marketed itself. The Australian Financial Review reports that Zhenhua Data was marketing the data it holds as the "Internet Big Data Military Intelligence System." While the company's website is now offline, it had listed such customers as the People's Liberation Army and the Communist Party, according to the same report.
After reviewing the data set, and not knowing that the company that had amassed it was called Zhenhua Data, I didn't see much to merit a story.
To be sure, it contained a huge amount of data, some of which had obvious ties to China, but nothing appeared to be overtly nefarious. I also tried to contact the registrant for aggso[dot]com but received no reply. Hence OKIDB joined the long list of other data exposures that I have learned about but not seen fit to report on.
Risky Data Collection?
I asked Potter this key question: What kind of non-public data is in the database? Because if there is any, it might give more weight to suggestions that the collected data might pose a risk.
Potter responded that "it depends on how you define open source," and that "there seemed to be a fair amount in there that had been pinched from other platforms, which in and of itself wasn't open source as a method it was ingested in."
When I asked Potter to define exactly what that meant, he told me that there seemed to be data that was "not classified but they're not public sources." He mentioned data from Factiva, the news-monitoring and research tool from Dow Jones. That's not sensitive, but rather subscriber-only content.
To be sure, there are reasons to be worried about China's cyber activity. U.S. prosecutors have pinned on China some of the largest and most worrisome hacks in memory, including those of the U.S. Office of Personnel Management, Equifax and health insurance giant Anthem. The data from those hacks has never publicly surfaced. If Zhenhua's repository had that kind of data, this would be a much more significant finding.
Caution: I have seen only a very small slice of the data. There could very well be material in there that is highly sensitive. But if that is true, then I call on anyone who's making this out to be a significant national security concern to describe that highly sensitive data more fully.
Because when I see, for example, the Australian Financial Review's headline that this is a "social media warfare database," I cringe. Anyone who posts material to social media sites or the internet in general should expect to see that data get scraped by marketing agencies and others. By this point in the internet's history, everyone should have gotten fair warning that this is the current state of affairs. Be careful what you expose.
Zhenhua Data looks like a company that has done what countless Western companies have done in the age in which data is the new oil: Collect it and sell it. The company wasn't trying to hide. Neither was it very good at securing its own data.