DarkBERT

DarkBERT is the world's first Dark Web-specialized AI language model. A language model is an AI model that understands human language; its extensive pre-trained knowledge makes it highly capable of solving a wide range of natural language processing tasks. DarkBERT in particular excels at processing and analyzing the unstructured data found on the Dark Web. Whereas similar encoder language models struggle with the diverse vocabulary and page structures found on the Dark Web, DarkBERT has been trained specifically to understand the illicit content published there. Concretely, DarkBERT was built by further training a RoBERTa model with Masked Language Modeling (MLM) on text collected from the Dark Web.
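As a rough illustration of how MLM training data is constructed, the sketch below corrupts a token sequence by masking a random subset of positions and records the original tokens as the targets the model must recover. The sentence, masking rate, and seed are illustrative, not drawn from DarkBERT's actual corpus or configuration:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random subset of tokens with [MASK], returning the
    corrupted sequence and the original tokens at the masked positions."""
    rng = rng or random.Random(0)
    corrupted, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            labels[i] = tok  # training target at this position
        else:
            corrupted.append(tok)
    return corrupted, labels

# Illustrative sentence in the style of Dark Web forum text.
tokens = "the vendor sells stolen credentials on the onion forum".split()
corrupted, labels = mask_tokens(tokens, mask_prob=0.3, rng=random.Random(42))
print(corrupted)
print(labels)
```

During training, the model sees only the corrupted sequence and is optimized to predict the hidden tokens, which forces it to learn the vocabulary and phrasing of the domain.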

Collecting the corpus is a fundamental challenge in training DarkBERT. S2W, which specializes in collecting and analyzing data from the Dark Web (including its doppelganger sites), has accumulated a substantial Dark Web text corpus suitable for training. Even after refinement — removing duplicates and low-information pages — the corpus amounts to 5.83 GB.
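The refinement steps described above (duplicate removal and filtering of low-information pages) can be sketched as follows; the hashing scheme and the uniqueness threshold are simplifying assumptions for illustration, not S2W's actual pipeline:

```python
import hashlib

def refine_corpus(pages, min_unique_ratio=0.3):
    """Drop exact duplicate pages and pages whose text is mostly repetition."""
    seen, kept = set(), []
    for text in pages:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a page already kept
        words = text.split()
        if words and len(set(words)) / len(words) < min_unique_ratio:
            continue  # low-information page (e.g. keyword stuffing)
        seen.add(digest)
        kept.append(text)
    return kept

pages = [
    "welcome to the market, new listings below",
    "welcome to the market, new listings below",  # duplicate mirror page
    "buy buy buy buy buy buy buy buy buy buy",    # low-information page
]
print(refine_corpus(pages))
```

A production pipeline would additionally use near-duplicate detection (e.g. shingling or MinHash) rather than exact hashes, since mirrored Dark Web pages often differ by a few bytes.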

DarkBERT utilized existing large-scale language models and further underwent post-training by incorporating specific domain data. It excels in handling unstructured data typically found on the anonymous web, where extraction can be challenging, and it is adept at inferring context. Additionally, DarkBERT can be employed for detecting/classifying various criminal activities occurring on the anonymous web and extracting crucial threat information.

How to Utilize DarkBERT?
  • Dark Web Page Classification
    The Dark Web contains numerous pages filled with various types of cybercrime-related content. Automatically classifying pages based on their content within this massive amount of unstructured data is crucial for effective Dark Web intelligence. DarkBERT excels in tasks such as classifying web page content related to topics like pornography, hacking, violence, and more. Detailed information about the page classification system can be found in the paper "Shedding New Light on the Language of the Dark Web" by S2W.
  • Ransomware Leakage Site Detection
    Ransomware attackers often operate "leakage sites" where they publish confidential data stolen from victim companies that refuse to comply with ransom negotiations. Swiftly detecting such websites is essential for gathering intelligence on high-risk ransomware groups. DarkBERT demonstrates excellent performance in automatically detecting leakage sites.
  • Detection of Key Threads
    Dark Web forums serve as platforms for sharing information and conducting transactions related to various illegal activities. Since forums allow numerous users to freely create posts, monitoring specific topics can be challenging. Filtering posts to identify key threads (such as sharing confidential information or malicious hacking tools) is essential for effective monitoring. DarkBERT excels in the automatic detection of key forum threads.
  • Inference of Threat Keywords
    Even everyday words can take on entirely different meanings on the Dark Web. DarkBERT is trained on the jargon and language used by cybercriminals, allowing it to infer how a word is being used from its surrounding context.
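The detection tasks above share the same inference shape: a fine-tuned encoder's classification head scores each page or post, and items whose probability for the target class exceeds a threshold are flagged. A minimal sketch of that thresholding step — the label names and logits below are invented for demonstration, standing in for real model outputs:

```python
import math

def softmax(logits):
    """Convert raw classification-head logits into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def detect(pages_logits, labels, target="leak-site", threshold=0.8):
    """Flag pages whose probability for the target class exceeds the threshold."""
    flagged = []
    for page, logits in pages_logits.items():
        probs = dict(zip(labels, softmax(logits)))
        if probs[target] >= threshold:
            flagged.append(page)
    return flagged

labels = ["benign", "leak-site"]
pages_logits = {
    "forum-post-17": [2.1, -0.4],  # benign-looking page
    "onion-page-03": [-1.5, 3.2],  # strong leak-site signal
}
print(detect(pages_logits, labels))
```

The threshold trades precision against recall: raising it reduces false alarms at the cost of missing borderline pages, which matters when analysts must triage every flagged site.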
DarkBERT Use Cases
Customer-Specific Fine-Tuning and Classification

DarkBERT can be customized and fine-tuned to meet users' specific needs. It can process vast amounts of both internal and external unstructured data, filtering and refining only the desired information from extensive datasets according to user preferences.

Customer A (Industry: Construction)

[Pain point]
There is a wealth of diverse language data available on the web that is crucial for corporate decision-making. However, many companies face challenges in directly scraping and analyzing this data due to insufficient internal infrastructure, especially a lack of expertise in processing unstructured language data. Even when a company possesses language processing expertise, handling domain-specific data can be highly challenging, requiring specialized tuning techniques.
(Example: Creating a tuned DarkBERT model for the Dark Web)

[Challenge]
The need arose to classify specific data or extract insights for decision-making from the vast amount of unstructured language data generated internally within the company. However, this data is highly domain-specific, making it exceptionally difficult to process effectively with general-purpose technologies.

[Result of Adoption]
By using a domain-specific language model, users can significantly reduce the time spent on data refinement: meaningful data is pre-selected automatically when extracting insights from large datasets. Likewise, when extracting statistics, pre-refining the data with such a model improves the reliability of the results. This classification and refinement of domain-specific data plays a crucial role in enabling companies to make effective, data-driven decisions.
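The pre-selection workflow described here reduces to a filter-then-aggregate step: classify each document, keep only the relevant ones, and compute statistics over that subset. The stand-in classifier and label name below are purely illustrative, taking the place of a fine-tuned encoder:

```python
def extract_stats(documents, classify, wanted_label="defect-report"):
    """Pre-filter documents with a domain-tuned classifier, then compute
    statistics over the retained subset only."""
    relevant = [d for d in documents if classify(d) == wanted_label]
    return {
        "total": len(documents),
        "relevant": len(relevant),
        "avg_length": sum(len(d.split()) for d in relevant) / max(len(relevant), 1),
    }

def classify(doc):
    # Stand-in for a fine-tuned encoder: keys on a trivial lexical cue.
    return "defect-report" if "crack" in doc else "other"

# Illustrative construction-industry documents.
docs = [
    "hairline crack observed in the east retaining wall",
    "weekly staffing schedule for site B",
    "crack widening near expansion joint, monitor daily",
]
print(extract_stats(docs, classify))
```

Because the statistics are computed only over documents the classifier retained, their reliability depends directly on classification quality — which is why domain-specific tuning matters.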

Integration with Open LLM

DarkBERT plays a crucial role in enterprise adoption of Large Language Models (LLMs) such as OpenAI's ChatGPT. Companies increasingly seek to converse over their own internal and external datasets, with an LLM generating responses grounded in that data. To achieve this, "Retrieval-Augmented Generation" (RAG), which grounds generated answers in retrieved documents, has gained significant attention. However, the sheer volume of data, strong domain-specific characteristics (including domain-specific terminology), and the presence of irrelevant data reduce search efficiency and accuracy.

DarkBERT, as a "domain-specific encoder model," can address these issues in two key aspects:

(1) Domain-Specific Data Refinement and Classification:
DarkBERT utilizes models tuned to match the characteristics of a company's data. This allows it to automatically classify essential data relevant to decision-making according to the specific data features of the enterprise. Consequently, it enhances search accuracy and improves the quality of LLM responses.

(2) Domain-Specific Embedding:
One critical element of RAG is performing meaning-based searches, which necessitates appropriately embedding documents. General language models often lack an adequate understanding of data with strong domain-specific characteristics, making it challenging to generate embeddings that reflect the correct meaning. Models like DarkBERT, which undergo domain-specific tuning, enable the creation of high-quality embeddings. This, in turn, significantly boosts search accuracy for user queries.
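The retrieval step that these embeddings feed into can be sketched as nearest-neighbor search under cosine similarity; the 3-dimensional vectors below are toy values standing in for real encoder embeddings, and the document names are invented:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, doc_vecs, k=1):
    """Return the k document ids whose embeddings are closest to the query.
    In a real RAG setup the vectors would come from a domain-tuned encoder;
    here they are illustrative toy values."""
    ranked = sorted(doc_vecs.items(), key=lambda kv: -cosine(query_vec, kv[1]))
    return [doc_id for doc_id, _ in ranked[:k]]

doc_vecs = {
    "leak-announcement": [0.9, 0.1, 0.0],
    "marketplace-faq":   [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.2, 0.05]  # embedding of the user's question
print(retrieve(query_vec, doc_vecs, k=1))
```

If the encoder misunderstands domain jargon, semantically related query and document vectors end up far apart, and the wrong documents are passed to the LLM — which is precisely the failure mode domain-specific embedding addresses.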

Dark Web Specialized Generative AI

DarkCHAT is a specialized generative AI model built into XARVIS, a Dark Web monitoring solution. XARVIS required an effective way to refine and present the information users seek on the Dark Web. DarkCHAT enables users to obtain threat intelligence related to their areas of interest: leveraging the collected data, it derives new intelligence and gives users access to the data they want with a single command.

Unlike commercially available language models, which cannot directly access the Dark Web and rely on curated Dark Web news from surface web sources, DarkCHAT stands apart as a real-time generative AI specialized for the Dark Web. It provides vivid, up-to-the-minute Dark Web information based on data collected from the Dark Web.


* Generative Artificial Intelligence is a technology that generates new data based on given data or inputs. It is a branch of deep learning, and such systems are also referred to as generative models. Generative AI can create various types of data, including text, images, audio, and video.