Social Data, a company that sells data on social media influencers to marketers, has exposed a database of nearly 235 million social media profiles on the web without a password or any other authentication required to access it, according to a new report from Comparitech researchers. The exposed data included names, contact info, personal info, images, and statistics about followers.
The profiles were taken from publicly viewable social media pages on YouTube, TikTok, and Instagram. Security researcher Bob Diachenko, who leads Comparitech’s cybersecurity research team, uncovered three identical copies of the exposed data on August 1.
Evidence suggests that much of the data originally came from a now-defunct company: Deep Social. The names of the Instagram datasets (accounts-deepsocial-90 and accounts-deepsocial-91) hint at the data’s origin. Based on this, Diachenko first contacted Deep Social using the email address listed on its website to disclose the exposure. The administrators of Deep Social forwarded the disclosure to Social Data. The CTO of Social Data acknowledged the exposure, and the servers hosting the data were taken down about three hours later.
Facebook and Instagram banned Deep Social from their marketing APIs in 2018 and threatened legal action against it if it continued to scrape data from their users’ profiles. Deep Social then announced it would wind down operations and has since shut down its original service.
Social Data denies any connection between itself and Deep Social.
A spokesperson from Social Data told Diachenko in an email, “Please, note that the negative connotation that the data has been hacked implies that the information was obtained surreptitiously. This is simply not true, all of the data is available freely to ANYONE with Internet access. I would appreciate it if you could ensure that this is made clear. Anyone could phish or contact any person that indicates telephone and email on his social network profile description in the same way even without the existence of the database. […] Social networks themselves expose the data to outsiders – that is their business – open public networks and profiles. Those users who do not wish to provide information, make their accounts private. [sic]”
Facebook company spokesperson Stephanie Otway told Comparitech in an email, “Scraping people’s information from Instagram is a clear violation of our policies. We revoked Deep Social’s access to our platform in June 2018 and sent a legal notice prohibiting any further data collection.”
Timeline of the exposure
We do not know how long the data was exposed before we discovered it on August 1. We also do not know whether any unauthorized parties accessed it during the exposure. Our honeypot experiments show that hackers can find and attack unsecured databases within hours of their exposure.
The database was shut down about three hours after we sent our initial disclosure.
What data was exposed?
Three identical copies of the data were hosted at three separate IPv6 addresses. Each copy stored data on about 235 million social media profiles. Here is a breakdown of the largest datasets:
- 96,714,241 records scraped from Instagram
- 95,678,713 records scraped from Instagram
- 42,129,799 records scraped from TikTok
- 3,955,892 records scraped from YouTube
Each record contains some or all of the following info:
- Profile name
- Full real name
- Profile photo
- Account description
- Whether the profile belongs to a business or has advertisements
- Statistics about follower engagement, including:
  - Number of followers
  - Engagement rate
  - Follower growth rate
  - Audience gender
  - Audience age
  - Audience location
- Last post timestamp
Based on samples we collected, about one in five records contained either a phone number or email address.
Dangers of exposed data
The information stored in this database could be used for spam marketing and phishing campaigns. Users of Instagram and TikTok should be on the lookout for scams and phishing messages either sent directly or posted in comments. Even though the information is publicly available, the size and scope of an aggregated database make it more useful for mass attacks than the same profiles would be in isolation.
The images and other profile data could be used by scammers to create imitation accounts. These accounts lure in followers and then promote scams or misinformation.
The images could also be used without the owners’ permission for face recognition purposes.
Facebook and other social networks have employed both legal and technological solutions to stem web scraping of their users’ profiles, but the practice hasn’t ceased. Scrapers are difficult for automated systems to distinguish from normal website users. The most prominent example is Clearview.ai, which scraped profiles for images to be used in mass-marketed face recognition technology.
About Social Data and Deep Social
Deep Social described itself as “a freemium influencer ranking, discovery and AI-driven analytics platform […] providing its 44,817 customers with in-depth insights into demographic and psychographic data of influencers and their audience.”
According to its website, Deep Social was used by a range of big-name brands including Samsung, Heineken, L’Oreal, Unilever, Walmart, Amazon, Disney, and Booking.com. It claimed to be “GDPR compliant”.
Deep Social shut down in 2018 after Facebook reportedly banned it from its marketing API and threatened legal action.
Social Data launched in August 2019, according to Hong Kong business directories. Its website says it “helps your business to find Influencers and get in-depth insights into demographic and psychographic data of influencers and their audience throughout different types of social media on the web.”
Social Data is incorporated in Hong Kong, according to its terms of service (PDF) and its .hk top-level domain.
Why we reported this data incident
Comparitech researchers regularly scan the web for unprotected servers containing personal data. Upon discovering an unsecured database, we promptly begin an investigation to determine who is responsible for it, who is impacted, and what the potential ramifications could be if a malicious party obtains the data.
As soon as we determine who the owner is, we send a disclosure so the database can be secured. We then publish an article like this one to raise awareness and curb potential harm to end users.