Fact-Checking Claims By Cybernews: The 16 Billion Record Data Breach That Wasn't
Disclaimer: This post is quite long, as we tried to include most of the information we gathered that we haven't seen discussed online. You are free to use any of the information here as long as you credit this post and author.

On June 18th, Cybernews published an article titled "The 16-billion-record data breach that no one's ever heard of."

The initial post didn't provide much information on the servers they were using for the article; there were no IP addresses or index names, just counts. What the article did provide, however, was enough information to know that this wasn't even a data breach. It was a collection of leaks and breaches that had nothing in common other than that they were added into a spreadsheet by the same researcher.

No wonder no one had ever heard of the 16-billion-record data breach. It never existed.

I knew I would be able to get information about at least some of the data because I monitor ElasticSearch as well, but data in these servers gets wiped frequently by wiperware, and since not every dataset is static and records are being imported at times, searching just with counts isn't exactly foolproof.

When Cybernews updated the post to show the index names, making it easier to identify and match servers, I messaged my friend Scary, who also researches exposed data, and together on a call we spent several hours digging through our logs. We managed to match multiple entries in our logs with data mentioned in the article.

Some of our findings are described in the remainder of this article.

Some of the names published by Cybernews may not have been the actual indices, but we managed to match and look at samples for multiple servers.

We only looked for obvious matches between our logs and Cybernews' datasets, and we likely missed servers, but the matches we did find are sufficient to disprove, or at least to expect retractions or corrections of, Cybernews' claims.

Any dates and timelines of exposure are based solely on what we verified ourselves with our logs; it's likely some of
them were exposed for longer, as we don't monitor ElasticSearch closely.

As noted above, the issues with the article start with the title itself. Through every iteration of the title as Cybernews kept updating it, one phrase was always present: "record-breaking data breach."

That's their first false claim. It's false because there was no record-breaking 16-billion-record data breach.

Throwing 30 different and unrelated datasets or dumps into one article doesn't make the report about one huge data breach. It makes it about 30 datasets that may or may not have some common sources or features.

Moreover, the current title of the article reads "16 billion passwords exposed in record-breaking data breach: what does it mean for you?"

What we think it means is that they didn't bother to check the accuracy of their analyses and reporting. This leads us into the second false claim.

Bob Diachenko, the researcher who claims everything in the Cybernews article went through him, has stated on LinkedIn that the article was "not about numbers but the scale." Cybernews must have missed his note about that, because the current Cybernews article mentions "16 billion" a total of 18 times throughout the post, as if it is about the numbers.

And 16 billion what? In its first version, the Cybernews article talked about 16 billion records, all with credentials. When they were challenged about that statement, they later revised it to "passwords" instead. That, too, was blatantly incorrect.

Inspection of samples of the available datasets in our logs reveals that, despite claiming that all records had credentials and then revising that to claim all records had passwords, multiple leaks did not contain login credentials or passwords at all.

Two entries reported by Cybernews were NPDData1 and NPDData2. They contained a total of 743,619,650 records from the National Public Data breach, which has been dumped and leaked multiple times after NPD refused to pay the hackers' extortion demands in 2024.

The actual name of the index for NPDData1 was
npd, and the count matches exactly what they mention. NPDData2 might be related to another index exposed on the same server, called npdrecords, but the records do not match the number that we flagged. This data was wiped and reimported at some point.

Neither the original NPD leak nor the leak mentioned in the article as NPDData1 / NPDData2 includes any login credentials.

One of the 30 datasets in the article, socialprofiles, shows 75,582,294 records exposed; this data was flagged as exposed for 43 days and contained social media data scrapes.

An index in our logs named socialprofilesv2 matches the record count from the article, and a sample reveals this too was a dataset that had no login credentials or passwords. It contained job information, names, and social media profiles, mostly LinkedIn.

Two entries under peoplestablev3 contained a total of 3,873,145,486 records combined.

We weren't able to identify any data in our logs matching those records, or any mention of "v3," but we identified a server in late 2024 with the index name peoplestable. It contained billions of records and was eventually wiped out by wiperware.

The data exposed in late 2024 shows that peoplestable contained data from multiple Russian websites, comprising multiple dumps with fields such as full name, DOB, phone number, email, SNILS, and taxpayer number. We did not see any login credentials or passwords, though we did not examine all of the dumps in the collection.

But even if we exclude the peoplestable datasets, it is clear that Cybernews' claim that ALL records contained login credentials or passwords was patently inaccurate.

But we are not done pointing out their false claims.

Cybernews claimed the data their researchers found is not simply old data but instead new and previously unseen leaks: "What's especially concerning is the structure and recency of these datasets; these aren't just old breaches being recycled. This is fresh, weaponizable intelligence at scale," researchers said.

As we noted under False Claim 2, some of the datasets
were not new at all. As but one example, the NPD data has been leaked multiple times on different forums. Their claim of "fresh" and "new" is also refuted by the fact that some of the servers that did contain infostealer records also contained clear links to "txtbase" leaks from Telegram. In other words, much of the data in their article that they claimed was fresh and weaponizable was already itself a repackaging of logs extracted from Telegram channels.

Compilations and txtbase leaks from Telegram aren't exactly new or even interesting datasets, but don't take my word for it; check what the researcher responsible for the article said on LinkedIn in a reply to a question about what's so special about the scale of this.

Diachenko states: "Btw, txtbases on Telegram are mostly junked versions of x100 times processed/traded/sold logs."

So why were they used in this article? The article somehow does not mention this at all; instead, when it mentions Telegram, it's to claim that cybercriminals are actively shifting off it:

"The increased number of exposed infostealer datasets in the form of centralized, traditional databases, like the ones found by the Cybernews research team, may be a sign that cybercriminals are actively shifting from previously popular alternatives such as Telegram groups, which were previously the go-to place for obtaining data collected by infostealer malware," Nazarovas said.

The above quote is another example of how little effort was put into researching any of this. Let's take a look at some of the indices with infostealer logs.

This index name is the default name set by an open-source tool; every time it shows up in our logs, we chuckle a bit.

The tool does not set a password to protect the ElasticSearch cluster by default. A Google search of the index name turns up the public repo linked to the tool.

Oh no, they shifted off Telegram just to go back to Telegram.

The data the tool inserts into ElasticSearch, which ends up leaked, also proves it comes from txtbase leaks, with a field
showing the Telegram channel it came from; you wouldn't even need to know how to use Google search to validate this.

The article lists two entries with the names breachfiles and stealerlogs, totaling over 1.2 billion records.

Both servers also showed direct connections to txtbases and repackaged data.

"The largest one, probably linked to Portuguese-speaking users, contained more than 3.5 billion records."

I mentioned this server being wiped in a post I did about infostealer logs back in February. Back then, I checked some of the data exposed on this server, so I was surprised to read the quote above rather than a factual description of what was exposed.

The reason they wrote "Portuguese-speaking users" is likely because one of the indices, with over 2.1 billion records, was named tuga, a slang term used to refer to people from Portugal, not Portuguese speakers in general. If they had looked at the "country" field present in the data, they could have seen that the entries showed "br" for Brazil.

But even that might have been inaccurate: back then, I saw emails for .in, .id, and other countries in both indices, not a link to "Portuguese-speaking users"; Portuguese, for the record, is the official language of over 10 countries.

Cybernews also seems to contradict its own reporting of the datasets being "fresh" by adding a quote to the article saying that recycled old leaks were present:

"Researchers say most of the leaked data comes from a mix of infostealer malware, credential stuffing sets, and recycled old leaks." This also leads us to the next false claim.

Besides the quote above, one of the "Key takeaways" in the article is: "The data most likely comes from various infostealers."

This is later reinforced in the article with statements such as: "There's another interesting aspect to this topic. It is a fact that all information comes from infostealers, an incredibly prevalent threat."

Cybernews is entitled to its own opinions. They are not entitled to their own facts. Their own screenshot of index names disproves their so-called "fact," with indices
like NPDData, peoplestable, and socialprofiles.

As Cybernews claimed in its first report:

"The only silver lining here is that all of the datasets were exposed only briefly: long enough for researchers to uncover them, but not long enough to find who was controlling vast amounts of data. Most of the datasets were temporarily accessible through unsecured Elasticsearch or object storage instances."

Although some datasets cited by Cybernews were only briefly exposed from the time they were first detected, mostly because they got wiped soon after being exposed, according to our logs other datasets were exposed for months.

The longest exposure we verified was over 5 months; that exposure was for the same server used by a coworker of Diachenko to write an article about 184 million infostealer logs in late May.

Claiming all of the datasets were exposed "only briefly" is seriously misleading, at best.

We found other concerns, some related to the original article and others related to different articles.

The article does not mention any attempts to close any of the servers they found. Did they even report each of these exposed datasets to the hosts? For how many of the datasets did they make any effort to get the data secured or removed?

Because the server running the stealerlogs index was still exposed at the time their article came out, when I searched for the counts initially listed I got a match for this server and noticed the log entry was from earlier that day. I sent an email to the abuse contacts for the IP address not long after the article was published, and access was blocked within a few hours.

logins was also exposed, but had been wiped and left with only a ransom note the day before the article was published. Eventually, around June 21st, the owner noticed the data was gone and shut it down.

Given the interest in their article, numerous people wanted to know where to find the leaks and how to get them. If Cybernews reported on data that was discoverable and unsecured, then have they
put people at risk of fraud or other harms?

Publishing about things they don't bother closing is nothing new for Cybernews, either. Both DataBreaches.net and I have sent them comments in the past about their failure to report whether they had even tried to get still-unsecured data locked down.

In an article about a leak involving 1.1M files exposed by beWanted that was found by the Cybernews team, they mention: "The team believes that a data leak involving over a million files, with each one likely representing a single person, represents a critical security incident for beWanted."

I had the file listing for this server, which had been exposed at least since July 2023 according to my logs, so I ran a simple command to list the files per directory.

A simple directory listing and count, and I could see that there were over 560,000 files in cvfiles and over 500,000 in profileimages.

What kind of research was done here, if the team couldn't identify that half the files were profile pictures, and not likely one file per person?

Cybernews added an FAQ to their article in one of their updates, addressing some of the concerns others raised with their initial reporting. The FAQ also addresses some of the issues raised in this post, but simply adding an FAQ to the article without correcting any wrong or misleading information within it has little to no meaning.

Diachenko is not sure why the article made by the other co-founder of his company, about the 184M leak that according to him was only exposed for a couple of weeks, gained so much attention.

I'm not either, but I'll leave you with some food for thought and tell you to take a look at the order in which servers were added to the sheet by Diachenko, and my dates, and then the dates around the entry related to the 184M leak, eventually closed in late May.

Maybe more attention should be given to that article now that both Cybernews' post and mine have been published.

One last reminder to everyone: it's not about the numbers but the scale. Now say "16 billion" and
"record-breaking" one more time with me.