ailia Tech BLOG

Released ailia Tokenizer 1.3

We have released ailia Tokenizer 1.3, which enables mutual conversion between text and tokens. We have also introduced a new Python API and applied it to ailia MODELS.


Overview

ailia Tokenizer is a library that converts text to tokens and vice versa. When performing natural language processing with AI, it’s necessary to use a tokenizer to convert the text into tokens that the AI can process. Traditionally, this task was handled by Transformers, but since Transformers only offer a Python API, it was challenging to use them from C++, Unity, or Flutter. ailia Tokenizer addresses this issue by providing a tokenizer available across multiple platforms.
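To make the text-to-token round trip concrete, here is a deliberately minimal sketch: each character's Unicode code point stands in for its "token ID". Real tokenizers use learned subword vocabularies, but the encode/decode contract is the same.

```python
# Toy illustration of text <-> token conversion: each character's code
# point serves as its "token ID". Real tokenizers map learned subwords
# to IDs instead, but the round-trip contract is identical.
def encode(text):
    return [ord(c) for c in text]

def decode(ids):
    return "".join(chr(i) for i in ids)

ids = encode("ailia")
print(ids)          # [97, 105, 108, 105, 97]
print(decode(ids))  # ailia
```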

New features

Addition of new tokenizers

We have added support for new tokenizers, including GPT2 and LLAMA. The GPT2 tokenizer is used by the GPT2, MSCLAP, and BLIP2 models, while the LLAMA tokenizer is used by llava.

Performance Optimization

We have optimized the BPE logic for Whisper and CLIP, resulting in a significant gain in processing speed.
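For readers unfamiliar with BPE (byte-pair encoding), the core of such a tokenizer is a loop that repeatedly merges the adjacent symbol pair with the best learned rank. The sketch below shows that loop in its naive form; the merge table here is made up for illustration (real models ship learned merge ranks), and the release does not detail which parts of this logic were optimized.

```python
# Minimal sketch of the BPE merge loop that tokenizers such as GPT-2's
# and CLIP's are built on. The merge table is illustrative only.
def bpe(word, merges):
    """Repeatedly merge the adjacent pair with the best (lowest) rank."""
    symbols = list(word)
    while len(symbols) > 1:
        # Score every adjacent pair; unknown pairs get infinite rank.
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no known merges left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe("lower", merges))  # ['low', 'er']
```

This naive version rescans every pair on each iteration, which is quadratic in the word length; production tokenizers avoid exactly this kind of rescanning, which is where optimizations like the one above typically pay off.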

Python API support

A Transformers-compatible API has been added, allowing ailia Tokenizer to be called directly from Python. Because Transformers uses TensorFlow or PyTorch as a backend, it pulls in heavy dependencies and its behavior can change between versions.

ailia Tokenizer resolves these issues by providing a stable tokenizer with minimal dependencies.
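To illustrate what "Transformers-compatible" means in practice: code written against the familiar `from_pretrained` / `encode` / `decode` call shape keeps working when the backend is swapped. The class below is a hypothetical stand-in written for this post, not the actual ailia Tokenizer implementation; only the method names mirror the Transformers convention.

```python
# Hypothetical stand-in showing the Transformers-style call shape a
# compatible tokenizer exposes. The class body is illustrative; only
# the from_pretrained/encode/decode interface mirrors the convention.
class WordTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab
        self.inv = {i: w for w, i in vocab.items()}

    @classmethod
    def from_pretrained(cls, path):
        # A real implementation would load vocab/merge files from `path`.
        return cls({"hello": 0, "tokenizer": 1, "<unk>": 2})

    def encode(self, text):
        return [self.vocab.get(w, self.vocab["<unk>"]) for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)

tok = WordTokenizer.from_pretrained("model-dir")
print(tok.encode("hello tokenizer"))  # [0, 1]
print(tok.decode([0, 1]))             # hello tokenizer
```

Because the call shape is the same, switching a script from Transformers to a compatible tokenizer is largely a matter of changing the import and construction, not the downstream code.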

Usage with ailia MODELS

With the provision of the ailia Tokenizer Python API, the models in ailia MODELS that previously relied on Transformers now use ailia Tokenizer by default.

GitHub — ailia-ai/ailia-models: The collection of pre-trained, state-of-the-art AI models for ailia SDK (github.com)

Out of the 336 models in ailia MODELS at the time of writing, the following 39 models utilize ailia Tokenizer.

audio_processing/clap  
audio_processing/distil-whisper  
audio_processing/msclap  
audio_processing/kotoba-whisper  
diffusion/latent-diffusion-txt2img  
diffusion/stable-diffusion-txt2img  
diffusion/control_net  
diffusion/riffusion  
diffusion/marigold  
image_captioning/blip2  
image_classification/japanese-stable-clip-vit-l-16  
image_classification/japanese-clip  
large_language_model/llava  
natural_language_processing/bert  
natural_language_processing/bert_insert_punctuation  
natural_language_processing/bert_maskedlm  
natural_language_processing/bert_ner  
natural_language_processing/bert_sentiment_analysis  
natural_language_processing/bert_tweet_sentiment  
natural_language_processing/bertjsc  
natural_language_processing/cross_encoder_mmarco  
natural_language_processing/fugumt-en-ja  
natural_language_processing/fugumt-ja-en  
natural_language_processing/multilingual-e5  
natural_language_processing/sentence_transformers_japanese  
natural_language_processing/t5_base_japanese_title_generation  
natural_language_processing/bert_sum_ext  
natural_language_processing/bert_zero_shot_classification  
natural_language_processing/t5_base_japanese_summarization  
natural_language_processing/t5_whisper_medical  
natural_language_processing/gpt2  
natural_language_processing/rinna  
natural_language_processing/bert_question_answering  
natural_language_processing/glucose  
natural_language_processing/bert_maskedlm_proofreeding  
natural_language_processing/soundchoice-g2p  
network_intrusion_detection/bert-network-packet-flow-header-payload  
network_intrusion_detection/falcon-adapter-network-packet  
object_detection/glip

You can still use Transformers as before by passing the option below.

--disable_ailia_tokenizer
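As a sketch of how a flag like this typically gates the backend choice inside a sample script: the flag name comes from ailia MODELS, but the surrounding code is illustrative, not taken from the repository.

```python
import argparse

# Illustrative sketch of gating the tokenizer backend on a CLI flag.
# Only the flag name comes from ailia MODELS; the rest is hypothetical.
parser = argparse.ArgumentParser()
parser.add_argument("--disable_ailia_tokenizer", action="store_true",
                    help="fall back to the Transformers tokenizer")
args = parser.parse_args(["--disable_ailia_tokenizer"])

backend = "transformers" if args.disable_ailia_tokenizer else "ailia_tokenizer"
print(backend)  # transformers
```

With `store_true`, the flag defaults to `False`, so scripts keep using ailia Tokenizer unless the user opts out explicitly.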

More info on ailia Tokenizer

For more information on ailia Tokenizer, please refer to the article below.

ailia Tokenizer: NLP Tokenizer for Unity and C++ (medium.com)


ailia Inc. has developed ailia SDK, which enables fast, GPU-accelerated inference across platforms.

ailia Inc. provides a wide range of services, from consulting and model creation to the development of AI-based applications and SDKs. Feel free to contact us with any inquiries.