While the large language models (LLMs) that power ChatGPT and Google Bard were trained on data from the open web, DarkBERT was trained exclusively on data from the dark web. Yes, you read that correctly: this new AI model was trained using data from hackers, cybercriminals and other scammers.
A team of South Korean researchers has released a paper (PDF) detailing how it built DarkBERT using data from the Tor network, which is often used to access the dark web. By crawling the dark web and then filtering the raw data, the researchers were able to create a dark web database that they used to train DarkBERT.
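The paper doesn't publish the crawling pipeline itself, but the filtering step it describes — cleaning raw crawled pages before they become training data — can be sketched in plain Python. Everything below (function names, the word-count threshold, the deduplication scheme) is an illustrative assumption, not the researchers' actual code:

```python
import hashlib

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-identical pages hash alike."""
    return " ".join(text.lower().split())

def filter_pages(pages, min_words=50):
    """Drop duplicate and very short pages from a raw crawl.

    `min_words` is an illustrative cutoff, not a value from the DarkBERT
    paper; real pipelines typically also strip markup and boilerplate.
    """
    seen = set()
    kept = []
    for page in pages:
        norm = normalize(page)
        if len(norm.split()) < min_words:
            continue  # too little text to be useful training data
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a page already kept
        seen.add(digest)
        kept.append(page)
    return kept
```

Hashing a normalized copy of each page keeps the original text intact while still catching duplicates that differ only in casing or spacing — a common first pass before heavier near-duplicate detection.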
Surprisingly, DarkBERT has already managed to outperform other large language models, despite the unlikely source of its training data.
Giving an old AI architecture new life
Although DarkBERT is a new AI model, it’s actually based on the RoBERTa architecture, an AI approach developed back in 2019 by researchers at Facebook, according to our sister site Tom’s Hardware.
In a research paper detailing the inner workings of RoBERTa, Meta AI explains that it is a “robustly optimized method for pretraining natural language processing (NLP) systems” that improves upon BERT (Bidirectional Encoder Representations from Transformers), which was released by Google back in 2018. Because the search giant made BERT open source, Facebook’s researchers were able to improve its performance in a replication study.
Thanks to that optimized method, Facebook’s RoBERTa was able to produce state-of-the-art results on the General Language Understanding Evaluation (GLUE) NLP benchmark.
Now though, the South Korean researchers behind DarkBERT have shown that RoBERTa is able to do even more, as it was undertrained when it was initially released. By feeding RoBERTa data from the dark web over the course of almost 16 days, across two data sets (one raw and the other preprocessed), the researchers were able to create DarkBERT.
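RoBERTa (and therefore DarkBERT) is pretrained with a masked language modeling objective: a portion of tokens — roughly 15% — is hidden, and the model learns to predict them from context. One of RoBERTa's key changes over BERT was dynamic masking, sampling a fresh mask each time a sequence is seen rather than fixing the mask once during preprocessing. A simplified sketch of that masking step (the real recipe also replaces some chosen tokens with random tokens or leaves them unchanged, which is omitted here):

```python
import random

MASK = "[MASK]"

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Randomly mask ~15% of tokens, returning (masked_tokens, labels).

    labels[i] holds the original token wherever masking occurred, else None.
    Because the mask is resampled on every call, repeated passes over the
    same sequence see different masks -- RoBERTa's dynamic masking idea.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)  # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the prediction loss
    return masked, labels
```

Calling `dynamic_mask` twice on the same token list yields two different masking patterns, which is exactly the property that distinguishes RoBERTa's pretraining from BERT's static, preprocessed masks.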
Fortunately, the researchers don’t have any plans to release DarkBERT to the public. However, they are accepting requests to use it for academic purposes, according to Dexerto. Still, DarkBERT will likely provide law enforcement and researchers with a much better understanding of the dark web as a whole.
How to stay safe when using AI chatbots
Just like with any other software or online service, you need to be careful when using AI chatbots: you could pick up a malware infection from fake ChatGPT apps, or even expose sensitive data, as employees at Samsung recently did.
This is why you want to ensure you’re actually going to the correct website when using these popular AI chatbots. If you’re looking for a ChatGPT, Bing Chat or Google Bard app, you won’t find one yet, as OpenAI, Microsoft and Google have yet to release official apps for their AI chatbots.
Likewise, you don’t want to click on any links in suspicious emails claiming to take you to an AI chatbot or to help you get access right away. Scammers are well aware of the current AI chatbot craze and are taking advantage of it in their attacks right now. At the same time, you should also avoid ads about AI chatbots, as cybercriminals often abuse Google Ads and other ad services to take unsuspecting users to phishing sites.
For extra protection when experimenting with AI chatbots, you should be using the best antivirus software on your PC, the best Mac antivirus software on your Mac and one of the best Android antivirus apps on your smartphone. This way, if a link to an AI chatbot does lead to malware, your antivirus will catch it before your devices can get infected.
DarkBERT could represent the future of AI models that are trained on data from one specific domain to make them much more specialized. Given the interest it has generated so far, we wouldn’t be surprised to see similar AI models developed in this way going forward.