S2W Identifies LLM Tokenization Vulnerability: Paper Accepted at EMNLP 2025
A joint research team from S2W and KAIST has identified a fundamental vulnerability in LLM tokenization, and their paper has been accepted at EMNLP 2025, one of the world’s most prestigious AI conferences. EMNLP stands alongside ACL and NAACL as one of the top three global conferences in NLP. With this achievement, S2W has now had research accepted at leading AI conferences for four consecutive years.
The paper shows that the tokenizer, the LLM component that splits input text into tokens before the model processes it, can induce hallucinations. It further finds that tokenizer-induced hallucinations occur more frequently in non-English languages, suggesting that response quality may degrade more for non-English users. The findings also offer meaningful insights for ongoing discussions of 'Sovereign AI'.
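As a rough illustration of why non-English text is more exposed, the sketch below uses plain UTF-8 encoding in Python. It is not the paper's method and does not use any real tokenizer. The helper names and the cut position are hypothetical, chosen only to show that a character spanning several bytes can be split into fragments that are not valid text on their own, which is the kind of "incomplete token" a byte-level tokenizer can produce.

```python
from typing import Tuple


def utf8_byte_count(ch: str) -> int:
    """Return how many bytes a string occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def is_complete_utf8(fragment: bytes) -> bool:
    """Return True if the byte fragment decodes to valid text on its own."""
    try:
        fragment.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def split_bytes(text: str, cut: int) -> Tuple[bytes, bytes]:
    """Split the UTF-8 bytes of `text` at an arbitrary byte offset.

    The offset stands in for a token boundary; a real byte-level
    tokenizer chooses boundaries from its learned vocabulary.
    """
    raw = text.encode("utf-8")
    return raw[:cut], raw[cut:]


# An ASCII letter fits in one byte, so it cannot be split mid-character.
ascii_bytes = utf8_byte_count("a")

# A Korean syllable takes several bytes, so a boundary can fall inside it.
hangul_bytes = utf8_byte_count("한")
left, right = split_bytes("한", 1)

print(ascii_bytes, hangul_bytes)
print(is_complete_utf8(left), is_complete_utf8(right))
print((left + right).decode("utf-8"))
```

Neither fragment decodes by itself, but the two together restore the original character. A token that ends or begins mid-character in this way only makes sense next to a matching neighbor.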
S2W will continue producing pioneering research to build trustworthy AI and contribute to the advancement of the global AI ecosystem.
🔗 Read the full article:
https://bit.ly/3LyALp7
✅ Read More:
Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers