Multi-Input Speech Emotion Recognition Model Using Mel Spectrogram and GeMAPS

Toyoshima, Itsuki; 豊島, 依槻; Okada, Yoshifumi; 岡田, 吉史; Ishimaru, Momoko; 石丸, 桃子; Uchiyama, Ryunosuke; 内山, 竜之介; Tada, Mayu; 多田, 真悠

doi:10.3390/s23031743

http://hdl.handle.net/10258/0002000052

Name / File	License	Action
sensors-23-01743-v2.pdf (1 MB)

Item type

Academic journal article / Journal Article.(1)

Publication date

2023-09-28

Title

Multi-Input Speech Emotion Recognition Model Using Mel Spectrogram and GeMAPS

Language

eng

Keywords

Subject	Subject scheme
multi-input deep neural network	Other
speech emotion recognition	Other
mel spectrogram	Other
GeMAPS	Other
focal loss function	Other

Resource type

Resource type identifier

http://purl.org/coar/resource_type/c_6501

Resource type

journal article

Access rights

open access

Access rights URI

http://purl.org/coar/access_right/c_abf2

Authors

豊島, 依槻

岡田, 吉史

en	Okada, Yoshifumi
ja	岡田, 吉史

Abstract

Description type

Abstract

Description

The existing research on emotion recognition commonly uses mel spectrogram (MelSpec) and Geneva minimalistic acoustic parameter set (GeMAPS) as acoustic parameters to learn the audio features. MelSpec can represent the time-series variations of each frequency but cannot manage multiple types of audio features. On the other hand, GeMAPS can handle multiple audio features but fails to provide information on their time-series variations. Thus, this study proposes a speech emotion recognition model based on a multi-input deep neural network that simultaneously learns these two audio features. The proposed model comprises three parts, specifically, for learning MelSpec in image format, learning GeMAPS in vector format, and integrating them to predict the emotion. Additionally, a focal loss function is introduced to address the imbalanced data problem among the emotion classes. The results of the recognition experiments demonstrate weighted and unweighted accuracies of 0.6657 and 0.6149, respectively, which are higher than or comparable to those of the existing state-of-the-art methods. Overall, the proposed model significantly improves the recognition accuracy of the emotion “happiness”, which has been difficult to identify in previous studies owing to limited data. Therefore, the proposed model can effectively recognize emotions from speech and can be applied for practical purposes with future development.
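The abstract mentions a focal loss function for the class-imbalance problem but does not give its form or settings. The sketch below shows the commonly used multi-class focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), in plain Python. The function names, the default `gamma=2.0`, and the optional per-class weights `alpha` are assumptions made for illustration; they are not taken from the paper.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for a single sample.

    probs  : list of predicted class probabilities (expected to sum to 1)
    target : index of the true class
    gamma  : focusing parameter (assumed default); gamma = 0 gives cross-entropy
    alpha  : optional list of per-class weights (assumed, not from the source)
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    weight = 1.0 if alpha is None else alpha[target]
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified samples.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(batch_probs, targets, gamma=2.0, alpha=None):
    """Average focal loss over a batch of samples."""
    losses = [focal_loss(p, t, gamma, alpha) for p, t in zip(batch_probs, targets)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical 4-class predictions; the true class is index 0 in both cases.
    easy = [0.9, 0.05, 0.03, 0.02]  # confident and correct
    hard = [0.2, 0.5, 0.2, 0.1]     # true class poorly predicted
    for name, probs in (("easy", easy), ("hard", hard)):
        ce = focal_loss(probs, 0, gamma=0.0)
        fl = focal_loss(probs, 0, gamma=2.0)
        print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f}")
```

With `gamma > 0`, a confidently correct sample contributes far less than a poorly predicted one, so training is driven more by hard samples, which often come from under-represented classes such as "happiness" in the abstract's description.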

Bibliographic information

en : Sensors

Volume 23, Issue 3, p. 1743, 11 pages, issue date 2023-02-03

Publisher

MDPI

Link to publisher version

10.3390/s23031743

https://doi.org/10.3390/s23031743

DOI

Versions

Ver.1

2023-09-28 03:34:50.354323
