DeepSpeech2: A Machine Learning Model for Speech Recognition

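This article introduces DeepSpeech2, a machine learning model that can be used with ailia SDK. By combining ailia SDK, an inference framework for edge devices, with the machine learning models published in ailia MODELS, you can easily implement AI features in your applications.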

Overview of DeepSpeech2

DeepSpeech2 is an end-to-end speech recognition model proposed in December 2015. It takes audio as input and outputs English text.

SeanNaren/deepspeech.pytorch: Implementation of DeepSpeech2 for PyTorch. The repo supports training/testing and inference using the DeepSpeech2… (github.com)

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese… (arxiv.org)

Architecture of DeepSpeech2

DeepSpeech2 converts the input audio into a spectrogram, applies a CNN followed by an RNN, and finally produces text using CTC.

Source: https://arxiv.org/abs/1512.02595
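The first stage of this pipeline, turning a waveform into a time-frequency representation, can be sketched with NumPy alone. This is a minimal illustration that computes a plain log-magnitude STFT spectrogram without a mel filterbank. The 20 ms window, 10 ms stride, Hamming window, and the function name log_spectrogram are assumptions made for the example, not settings confirmed for the ailia model.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, window_sec=0.02, stride_sec=0.01):
    """Return a (freq_bins, frames) log-magnitude spectrogram of a 1-D signal.

    Window and stride lengths are assumed values chosen for illustration.
    """
    win_len = int(round(sample_rate * window_sec))   # 320 samples at 16 kHz
    hop_len = int(round(sample_rate * stride_sec))   # 160 samples at 16 kHz
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop_len
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop_len : i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    # Magnitude of the real FFT of each frame: shape (frames, win_len // 2 + 1).
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # log1p compresses the dynamic range; transpose to (freq_bins, frames).
    return np.log1p(magnitude).T

# One second of a 440 Hz tone as a stand-in for real speech.
t = np.arange(16000) / 16000.0
spec = log_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)
```

The resulting 2-D array is the kind of input a CNN front end can treat like an image, with frequency on one axis and time on the other.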

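CTC (Connectionist Temporal Classification) is a technique widely used in character recognition and speech recognition, in combination with an LSTM or other RNN. In these tasks, the width of a single character or the duration of a single phoneme is variable. CTC handles this variability on the decoder side by collapsing consecutive occurrences of the same character.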

CTC Loss Function: CTC Loss (loss function) (Connectionist Temporal… (www.thothchildren.com)
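The collapsing step can be sketched as greedy (best-path) CTC decoding: pick the most likely symbol in each frame, merge consecutive repeats, and drop the blank symbol that CTC uses to separate genuine double letters. The label set, the blank index 0, and the function name ctc_greedy_decode are hypothetical choices for this example.

```python
import numpy as np

def ctc_greedy_decode(frame_scores, labels, blank_id=0):
    """Greedy CTC decoding of a (frames, num_labels) score matrix."""
    best = np.argmax(frame_scores, axis=1)   # most likely symbol per frame
    chars = []
    prev = None
    for idx in best:
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank_id:
            chars.append(labels[idx])
        prev = idx
    return "".join(chars)

# Hypothetical label set: index 0 is the CTC blank, written "_".
labels = "_helo"
# Frame-by-frame best path: h h _ e l l _ l o o
path = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]
scores = np.eye(len(labels))[path]           # one-hot scores for the demo
decoded = ctc_greedy_decode(scores, labels)
print(decoded)
```

In the demo path, the two runs of "l" are separated by a blank, so both survive the merge and the double letter is preserved.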

Using a language model

Correcting the CTC output with a language model yields more natural sentences. The following library is used for CTC decoding with a language model.

parlance/ctcdecode: ctcdecode is an implementation of CTC (Connectionist Temporal Classification) beam search decoding for PyTorch. C++… (github.com)

The language model itself can be downloaded from the following page.

openslr.org: Identifier: SLR11. Summary: Language modelling resources, for use with the LibriSpeech ASR corpus. Category: Text… (www.openslr.org)

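CTC decoding with a language model computes the probability of a word as the sum over all possible patterns that produce it. Dynamic programming is used to perform this computation efficiently.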
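The dynamic-programming idea can be illustrated with the standard CTC forward algorithm, which sums the probability of every frame-level alignment that collapses to a given label sequence. This is a sketch of the general technique, not the ctcdecode implementation, which additionally performs beam search with a language model. The function names and the random toy probabilities are hypothetical, and brute_force_prob exists only to check the result by enumerating every path.

```python
import itertools
import numpy as np

def ctc_sequence_prob(probs, target, blank=0):
    """Probability of `target` (label ids, no blanks) given (T, V) frame probabilities."""
    T = probs.shape[0]
    # Interleave blanks: [a, b] becomes [blank, a, blank, b, blank].
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]                  # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1, s - 1]         # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * probs[t, ext[s]]
    result = alpha[T - 1, S - 1]                     # end on the final blank
    if S > 1:
        result += alpha[T - 1, S - 2]                # or on the final label
    return float(result)

def collapse(path, blank=0):
    """Merge consecutive repeats, then drop blanks."""
    out, prev = [], None
    for p in path:
        if p != prev and p != blank:
            out.append(p)
        prev = p
    return out

def brute_force_prob(probs, target, blank=0):
    """Same quantity by enumerating all V**T paths (tiny inputs only)."""
    T, V = probs.shape
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path, blank) == list(target):
            total += float(np.prod([probs[t, p] for t, p in enumerate(path)]))
    return total

rng = np.random.default_rng(0)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)            # each frame sums to 1
dp = ctc_sequence_prob(probs, [1, 2])
bf = brute_force_prob(probs, [1, 2])
print(dp, bf)
```

The table alpha has T × (2L + 1) entries for a target of length L, so the cost grows linearly with the number of frames, whereas the brute-force enumeration grows exponentially.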

Datasets for DeepSpeech2

DeepSpeech2 is trained on AN4, Librispeech, and TEDLIUM.

AN4: a small 16 kHz dataset created by CMU in 1991

CMU Sphinx Group - Audio Databases (www.speech.cs.cmu.edu)

Librispeech: 1,000 hours of 16 kHz speech taken from audiobooks

openslr.org: Identifier: SLR12. Summary: Large-scale (1000 hours) corpus of read English speech. Category: Speech. License: CC BY 4.0… (www.openslr.org)

TEDLIUM: about 118 hours of 16 kHz speech from TED talks

tedlium | TensorFlow Datasets (www.tensorflow.org)
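Since all three corpora are sampled at 16 kHz, it may be worth checking the sample rate of a WAV file before passing it to the model. The sketch below uses only Python's standard wave module. That the model expects 16 kHz input is an assumption inferred from the training data, and the helper names and the demo file are hypothetical.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16000  # assumed from the 16 kHz training corpora

def wav_sample_rate(path):
    """Return the sample rate in Hz stored in a WAV file header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()

def is_expected_rate(path, expected=EXPECTED_RATE):
    """True if the file's sample rate matches the expected rate."""
    return wav_sample_rate(path) == expected

# Demo: write one second of 16-bit mono silence, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)               # 2 bytes per sample = 16-bit PCM
    wav.setframerate(EXPECTED_RATE)
    wav.writeframes(b"\x00\x00" * EXPECTED_RATE)

demo_ok = is_expected_rate(demo_path)
print(demo_ok)
```

A file that fails this check would need to be resampled with a separate audio tool before use.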

How to use DeepSpeech2

Use the following command to run DeepSpeech2 with ailia SDK.

python3 deepspeech2.py -i input.wav

Use the -d option to enable the language model. Before doing so, you need to install the ctcdecode library and download the language model file 3-gram.pruned.3e-7.arpa.

python3 deepspeech2.py -i input.wav -d

Here is an example run.

ailia-ai/ailia-models: audio file (16kHz), LibriSpeech ASR corpus http://www.openslr.org/12 1221-135766-0000.wav texts… (github.com)

Example output of DeepSpeech2

The following speech test material is used.

With the language model:

what somebody decides to break it be careful that you keep angular coverage but look for places to save money ninety is taking longer to get things squared away than the banker's expected during the life for once company may win her taxied retirement and count de bust telle but inadequate new self to seeming rags or hurriedly tolson the two naked bone to want o discussion cannons thou when the title of this type of than is in question or to dying or waxing or gassing tete debrett may be personalized known by a clays leather horn lace work on a flat surface and smooth out a simples tinto separate system uses a single self contained in it the old chap an ad still hold a good mechanic is usually a bad but so figures would do her in lady years we make beautiful chares canet chesnel's etcher'

Without the language model:

wha i somebody decides to break it he careful that you keep anquhaod coverage but look for places to save monyniete its taking longer to get things squired away than the bankers expected liring the life for once comnpany my win her taxited retireent and comnt debouse ta telple but inadequate new self to seeming rags ore hurridly tos on the two naked bone to want o discussion cannins shou when the title of this type of thol is in questions ors o dying or waxing orgassingtete dibrualight may be persoaaised known by o clays leather horne lace work on a flat surface and smooth out a siples tiing to separate system useas a single sof contained un it the old chup an ad still hold a good mechanic is usually a bad bot fo figures would no her in lady years o make beautiful chaires camnets ches dol houses ed cheter

Using the language model produces more natural text.

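ailia Inc. is a company dedicated to putting AI into practical use and develops ailia SDK, which enables fast, cross-platform, GPU-accelerated inference. ailia Inc. provides a total solution for AI, from consulting and model creation to SDK provision, development of AI-based applications and systems, and support, so please feel free to contact us.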