The following covers the content of the paper on "CyBERTuned," the AI language model specialized in cybersecurity that S2W presented at "NAACL (North American Chapter of the Association for Computational Linguistics)," one of the world's top-tier AI conferences.
Ignore Me But Don't Replace Me: Utilizing Non-Linguistic Elements for Pretraining on the Cybersecurity Domain (NAACL 2024)
To summarize the paper in one line, it is about utilizing non-linguistic elements for pretraining in the cybersecurity domain.
- 1. Language Model
A language model is a model that takes human language (natural language) as input and enables a computer to understand its context. Among the various kinds, we used a BERT-style encoder model. (Unlike conversational models such as ChatGPT, an encoder model converts language into a vector that represents its meaning. DarkBERT is also an encoder model.) The technical paper on DarkBERT, S2W's first language model and the world's first dark web language model, can be found at the link below.
> https://s2w.inc/en/resource/detail/278
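To make "converting language into a vector" concrete, here is a minimal sketch of how a BERT-style encoder produces a sentence vector. It uses the public `bert-base-uncased` checkpoint from the Hugging Face `transformers` library purely for illustration; it is not DarkBERT or CyBERTuned itself.

```python
# A minimal sketch (not S2W's actual code): a BERT-style encoder turning a
# sentence into a vector. "bert-base-uncased" is a generic public checkpoint
# used only for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Ransomware operators leaked the stolen data.", return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

# One common choice: take the hidden state of the [CLS] token as the sentence vector.
sentence_vector = outputs.last_hidden_state[:, 0, :]
print(sentence_vector.shape)  # torch.Size([1, 768])
```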
- 2. Pretraining
Pretraining means giving the language model large amounts of text and letting it learn on its own; fine-tuning then follows to impart specific abilities. In other words, pretraining is the self-supervised learning stage that uses only text, before any fine-tuning takes place.
2-1. MLM
The typical encoder pretraining method is Masked Language Modeling (MLM). It masks a few words in a given sentence and trains the model to predict those words. By predicting the masked words, the model is expected to learn the relationships between words.
For example, in the sentence 'The capital of France is Paris,' random words are replaced with [MASK] to create 'The capital of France is [MASK],' and the model learns that 'Paris' should fill the [MASK]. To predict the masked word, the model has to understand the combined meaning of the surrounding words and know that the capital of France is Paris. In this way, the model becomes smarter through repeated self-supervised learning.
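The same idea can be tried directly with a public checkpoint. The sketch below uses the `fill-mask` pipeline from the Hugging Face `transformers` library with `bert-base-uncased`; it is only an illustration of MLM, not the CyBERTuned training code.

```python
# Illustration of masked language modeling with a generic public checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model restores the masked word from the surrounding context.
for pred in fill_mask("The capital of France is [MASK]."):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
```

During actual pretraining, masking is applied automatically to randomly chosen tokens (for example, via `DataCollatorForLanguageModeling` in the same library), and the model is updated to reduce its prediction error at those positions.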
- 3. Cybersecurity-Specific Model
While such pretraining methods can effectively teach a model language, the language used in general contexts differs from the language used in specialized contexts. To train a model with specialized cybersecurity knowledge, it needs to be trained on cybersecurity data. Building domain-specific models in this way has been a popular approach (e.g., BioBERT, LegalBERT). DarkBERT, presented at ACL last year, was likewise pretrained on dark web data.
To create a model specialized in the language used in cybersecurity materials, S2W's AI team collected a variety of cybersecurity data, as mentioned above.
- 4. Overcoming Limitations of Cybersecurity Language Models
To create a cybersecurity-specific language model, pretraining must be conducted on cybersecurity documents, as explained in section 2-1. Multiple attempts have already been made to build cybersecurity-specific models this way, and CyBERTuned overcomes the limitations of those existing models.
4.1 Limitations of Existing Pretraining Methods
In cybersecurity documents, elements such as URLs and SHA hashes often appear alongside natural language; the paper refers to these as non-linguistic elements (NLEs). Masking a non-linguistic element and having the model restore it carries no linguistic meaning. In other words, while it makes sense to train the model to predict 'Paris' in 'The capital of France is [MASK],' it does not make sense to train it to predict 'd8e93252f41' from 'd8[MASK]93252f41…'.
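To make the distinction concrete, here is a toy regex-based tagger for a few non-linguistic element types that commonly appear in threat reports. The element types, patterns, and example strings below are illustrative assumptions; the paper's actual NLE categories and detection rules may differ.

```python
# Toy illustration: spotting a few non-linguistic element (NLE) types in text.
import re

NLE_PATTERNS = {
    "URL":    re.compile(r"https?://\S+"),
    "SHA256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
    "MD5":    re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "CVE":    re.compile(r"\bCVE-\d{4}-\d{4,7}\b"),
    "IP":     re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def tag_nles(text):
    """Return (nle_type, matched_span) pairs found in the text."""
    found = []
    for nle_type, pattern in NLE_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((nle_type, match.group()))
    return found

report = ("The loader was fetched from http://malicious.example/payload.bin and "
          "drops a file with MD5 d41d8cd98f00b204e9800998ecf8427e, "
          "exploiting CVE-2021-44228.")
print(tag_nles(report))
# Predicting masked-out characters inside the MD5 value is not a meaningful
# language task, unlike predicting a masked word such as 'Paris'.
```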
4.2 Proposing a New Pretraining Method
We proposed several modifications to the existing pretraining method. Among the various settings we tried, the following two changes showed the best performance (a rough sketch of both ideas follows the list):
- 1. Adjust the masking strategy according to the type of non-linguistic element.
- 2. In addition to masking, train the model to predict the type of each non-linguistic element.
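Below is a rough PyTorch sketch of what these two modifications could look like. It is not the paper's implementation: the assumption that tokens already carry an NLE type id, the choice of which types remain maskable, the number of NLE types, and the auxiliary head design are all placeholders for illustration.

```python
# Rough sketch of the two ideas above (illustrative, not the paper's code).
import torch
import torch.nn as nn

MASKABLE_NLE_TYPES = {0}   # assumption: 0 = ordinary text; only text tokens get masked
NUM_NLE_TYPES = 6          # e.g., text + URL/SHA256/MD5/CVE/IP, purely for illustration

def selective_mlm_mask(input_ids, nle_type_ids, mask_token_id, mask_prob=0.15):
    """Idea 1: adjust masking by NLE type -- here, skip non-linguistic tokens."""
    labels = input_ids.clone()
    candidates = torch.isin(nle_type_ids, torch.tensor(sorted(MASKABLE_NLE_TYPES)))
    mask = (torch.rand(input_ids.shape) < mask_prob) & candidates
    labels[~mask] = -100                 # positions outside the mask are ignored by the loss
    masked_input = input_ids.clone()
    masked_input[mask] = mask_token_id
    return masked_input, labels

class NLETypeHead(nn.Module):
    """Idea 2: an auxiliary head that predicts each token's NLE type."""
    def __init__(self, hidden_size=768, num_types=NUM_NLE_TYPES):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_types)

    def forward(self, hidden_states, nle_type_ids):
        logits = self.classifier(hidden_states)          # (batch, seq_len, num_types)
        loss = nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), nle_type_ids.view(-1))
        return loss                                      # added to the MLM loss during pretraining
```

In this sketch, the MLM loss computed on the selectively masked inputs and the NLE-type loss from the auxiliary head would be summed into a single pretraining objective.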
Using the validated pretraining methods, we trained the language model on all the data. The completed model, CyBERTuned, demonstrated superior performance compared to other cybersecurity-specific models and was presented in Mexico City on June 19, 2024.
A summary of the paper is attached as a PDF file. (See the preview image below.)
If you have any questions, please don't hesitate to contact us!