FuguMT: Machine Learning Model for English to Japanese Translation

This is an introduction to "FuguMT", a machine learning model that can be used with ailia SDK. You can easily use this model, along with many other ready-to-use ailia MODELS, to create AI applications with ailia SDK.

Overview

FuguMT is a machine translation model built with Marian-NMT, a neural machine translation framework developed mainly by the Microsoft Translator team. It translates English text into Japanese and is released under the CC BY-SA 4.0 license.

FuguMT

staka/fugumt-en-ja · Hugging Face: This is a translation model using Marian-NMT. For more details, please see my repository. source… (huggingface.co)

GitHub: s-taka/fugumt (github.com)

Dataset

The FuguMT training data is described in the blog post below (Japanese only). The dataset contains about 6.6 million bilingual pairs (Japanese: 690 MB, English: 610 MB, about 100 million words). Training took about 30 hours using Marian-NMT + SentencePiece on an AWS p3.2xlarge instance.

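ぷるーふおぶこんせぷと (Proof of Concept): I have released a neural machine translation model that translates English into Japanese under CC BY-SA 4.0. It was built last November using the method introduced in a previous article, and its performance is decent (discussed later). The construction method is full-fledged (using Marian-NMT[1]… (staka.jp)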

According to the evaluation in the post below, FuguMT's BLEU (BiLingual Evaluation Understudy) score is 31.65, higher than GPT-3.5's 27.04 and GPT-4's 29.66.

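ぷるーふおぶこんせぷと (Proof of Concept): I quantitatively[1] measured the translation performance of GPT-4 using text (Japanese/English) from the Ministry of Foreign Affairs website. … (staka.jp)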
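To give a feel for what a BLEU score measures, here is a minimal sketch of a single-reference, sentence-level BLEU in plain Python. It is an illustration only and not the procedure behind the scores quoted above: it assumes whitespace tokenization, uses no smoothing, and scores one sentence pair rather than a corpus. The function names `ngrams` and `bleu` are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU on pre-tokenized sentences, scaled to 0-100."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the n-gram precisions, times the brevity penalty.
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
exact = bleu(reference, reference)
close = bleu("the cat sat on a mat".split(), reference)
print(round(exact, 2), round(close, 2))
```

An exact match scores 100, while changing a single word already lowers the score noticeably, because the change breaks several overlapping n-grams at once.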

Architecture

FuguMT is a transformer-based sequence-to-sequence model. The output is obtained one token at a time by running the decoder iteratively.

The decoder takes input_ids, attention_mask, decoder_input_ids, and past_key_values[25] as inputs. input_ids is the input token sequence, attention_mask is a vector of 1s, decoder_input_ids holds the token IDs from the previous iteration (initially pad=32000), and past_key_values are internal states of shape (beam_size, 8, 0, 64), where the dimension shown as 0 grows with each inference step.
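The iteration described above can be sketched as follows. This is a simplified illustration, not the ailia implementation: `fake_decoder` is a hypothetical stand-in that tracks only the shape of the cache and replays the output token IDs shown in the C++ sample later in this article. It uses greedy selection (equivalent to a beam size of 1) to stay short, and it assumes that token ID 0 marks the end of a sequence, since the sample token sequences end with 0.

```python
PAD_ID = 32000          # initial decoder_input_ids value, per the text above
EOS_ID = 0              # assumption: sample token sequences end with 0
VOCAB_SIZE = 32001
NUM_HEADS, HEAD_DIM = 8, 64


def fake_decoder(input_ids, attention_mask, decoder_input_ids, past_shape):
    """Hypothetical stand-in for the real decoder.

    Returns (logits, new_past_shape). Only the cache shape is tracked;
    the real model would return the cached tensors themselves.
    """
    beam, heads, cached_len, dim = past_shape
    script = [517, 6044, 68, EOS_ID]  # IDs copied from the C++ sample output
    logits = [0.0] * VOCAB_SIZE
    logits[script[min(cached_len, len(script) - 1)]] = 1.0
    # The cache grows by one position per decoding step.
    return logits, (beam, heads, cached_len + 1, dim)


def greedy_decode(input_ids, max_steps=16):
    attention_mask = [1] * len(input_ids)      # a vector of 1s
    decoder_input_ids = [PAD_ID]               # first step starts from pad
    past_shape = (1, NUM_HEADS, 0, HEAD_DIM)   # empty cache
    output, shapes = [], [past_shape]
    for _ in range(max_steps):
        logits, past_shape = fake_decoder(
            input_ids, attention_mask, decoder_input_ids, past_shape)
        shapes.append(past_shape)
        next_id = max(range(VOCAB_SIZE), key=logits.__getitem__)  # argmax
        output.append(next_id)
        if next_id == EOS_ID:
            break
        decoder_input_ids = [next_id]          # feed the new token back in
    return output, shapes


tokens, shapes = greedy_decode([183, 30, 15, 11126, 4, 0])
print(tokens)
print(shapes[0], "->", shapes[-1])
```

The point of the cache is that each step feeds only the newest token to the decoder; the keys and values of earlier tokens are reused from past_key_values instead of being recomputed.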

The decoder outputs logits and past_key_values[25]. The logits are 32001-dimensional and hold a score for each token in the vocabulary, which a softmax converts into a probability. The output text is determined by beam search over the logits. In the Python version, the default beam_size is 12.
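Beam search over logits can be sketched in a few lines of plain Python. This is a generic textbook version under stated assumptions, not the code used by the ailia sample: `toy_logits` is a hypothetical four-token model standing in for the 32001-dimensional decoder output, hypotheses are ranked by the sum of log-probabilities, and no length penalty is applied.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def beam_search(step_logits, beam_size, eos_id, max_len):
    """step_logits(prefix) -> logits for the next token (hypothetical API)."""
    beams = [([], 0.0)]   # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = log_softmax(step_logits(prefix))
            for token_id, lp in enumerate(log_probs):
                candidates.append((prefix + [token_id], score + lp))
        # Keep only the beam_size best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])


def toy_logits(prefix):
    """Hypothetical model: fixed logits that depend only on the step index."""
    table = {0: [0.0, 3.0, 1.0, 0.0], 1: [0.0, 0.0, 3.0, 0.5]}
    return table.get(len(prefix), [5.0, 0.0, 0.0, 0.0])


best_tokens, best_score = beam_search(toy_logits, beam_size=2, eos_id=0, max_len=8)
print(best_tokens, round(best_score, 3))
```

A larger beam explores more alternatives per step at a proportional compute cost, which is the trade-off behind the beam_size setting.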

The tokenizer is MarianTokenizer, which is built on SentencePiece. It uses an English source model for the input and a Japanese target model for the output.

Usage (Python)

To use FuguMT with ailia SDK in Python, run the following command:

$ python3 fugumt-en-ja.py --input "This is a cat."

The output translation would be:

translation_text: これは猫です。

ailia-models/natural_language_processing/fugumt-en-ja at master · ailia-ai/ailia-models (github.com)

Usage (C++)

A C++ sample that uses FuguMT with ailia Tokenizer is also available at the link below. The build steps and a sample run are as follows.

cd fugumt  
export AILIA_LIBRARY_PATH=../ailia/library  
export AILIA_TOKENIZER_PATH=../ailia_tokenizer/library  
cmake .  
make  
./fugumt.sh

env_id : 0 type : 0 name : CPU  
env_id : 1 type : 1 name : CPU-AppleAccelerate  
env_id : 2 type : 2 name : MPSDNN-Apple M1 Max (Warning : FP16 backend is not worked this model)  
you can select environment using -e option  
selected env name : CPU-AppleAccelerate  
Input : This is a cat.  
Input Tokens :  
183 30 15 11126 4 0   
Output : これは猫です  
Output Tokens :  
517 6044 68 0   
Program finished successfully.

ailia-models-cpp/fugumt at master · ailia-ai/ailia-models-cpp: C++ version of ailia models repository. (github.com)

ailia Inc. has developed ailia SDK, which enables cross-platform, GPU-based rapid inference.

ailia Inc. provides a wide range of services, from consulting and model creation to the development of AI-based applications and SDKs. Feel free to contact us with any inquiries.