They often do not naturally incorporate and simulate emotions.
Supervised, unsupervised, semi-supervised, and representation transfer learning.
The paper considered multiple databases including IEEE Xplore, Springer, Elsevier, and Google Scholar.
Due to relatively small emotional speech datasets, unsupervised methods may not learn useful representations and can ignore emotional attributes.
The paper highlights the importance of deep representation learning for SER, popular DL models, and various representation learning techniques used in the literature.
It presents a visual data-guided self-supervised framework for speech representation learning, achieving state-of-the-art results in emotion recognition.
High.
MFCCs are used as the principal set of features for SER and other speech analysis tasks.
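As a hedged illustration of how MFCCs are typically derived (framing, windowing, power spectrum, mel filterbank, log compression, DCT), the standard pipeline can be sketched in NumPy/SciPy; the parameter values below (16 kHz audio, 512-point FFT, 26 mel bands, 13 coefficients) are common defaults and not prescribed by the paper:

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    # Frame the signal and apply a Hamming window.
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hamming(n_fft)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank spanning 0 Hz to the Nyquist frequency.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log mel energies, then DCT to decorrelate -> MFCCs.
    logmel = np.log(power @ fbank.T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]

# Toy input: one second of a 440 Hz tone standing in for speech.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feats = mfcc(sig)
print(feats.shape)  # (frames, 13)
```

In practice a library such as librosa is normally used instead of hand-rolling this pipeline; the sketch only makes the individual stages explicit.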
It allows a single neural encoder to solve different self-supervised tasks, improving results for speaker identification, phoneme recognition, and the detection of emotional cues.
Unsupervised Representation Learning.
It can exploit both labelled and unlabelled data to improve performance.
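A minimal, hedged sketch of exploiting labelled and unlabelled data together: scikit-learn's LabelSpreading propagates two seed labels across synthetic two-cluster "utterance embeddings" (all data here is invented for illustration; -1 marks unlabelled samples):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Hypothetical 2-D embeddings of utterances: two emotion clusters,
# with only one labelled example per cluster.
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])
y = np.full(100, -1)        # -1 = unlabelled
y[0], y[50] = 0, 1          # the only two labelled samples

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
pred = model.transduction_  # labels inferred for all 100 points
print((pred[:50] == 0).mean(), (pred[50:] == 1).mean())
```

With well-separated clusters, the two seed labels spread over the neighbourhood graph and label nearly all unlabelled points correctly.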
Preparing labels for auxiliary tasks is expensive and time-consuming.
Background noise and poor recording quality can contaminate speech signals, affecting the performance of emotion recognition algorithms.
Acoustic features handcrafted manually through feature engineering.
They provide a game-theoretical framework useful for data generation and can learn disentangled representations suitable for SER.
Feature engineering involves manual design of features using domain knowledge, while representation learning automatically transforms input data to yield useful representations.
They are often purpose-built, with emotions acted out by professional actors.
Investigation of multi-modal representation.
Users may unintentionally leak personal information such as gender, ethnicity, and emotional state.
CNNs, LSTM/GRU RNNs, and CNN-LSTM/GRU-RNNs.
Attacks that exploit vulnerabilities in deep models, misleading SER classifiers with imperceptible perturbations.
The unavailability of labelled data.
They are powerful unsupervised models that encode emotional speech data in sparse and compressed representations.
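A hedged, toy illustration of the compression idea: a linear autoencoder trained by gradient descent to squeeze 8-D synthetic features through a 2-D bottleneck. Sparsity would additionally require a sparsity penalty, omitted here; all data and sizes are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples near a 2-D subspace of an 8-D space,
# standing in for frame-level acoustic features (hypothetical data).
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8)) + 0.05 * rng.normal(size=(200, 8))

# Linear autoencoder: encode 8 -> 2, decode 2 -> 8.
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def loss(X, W_enc, W_dec):
    return np.mean((X - X @ W_enc @ W_dec) ** 2)

lr = 0.05
initial = loss(X, W_enc, W_dec)
for _ in range(500):
    Z = X @ W_enc               # compressed code (the learned representation)
    R = Z @ W_dec               # reconstruction
    G = 2 * (R - X) / X.size    # gradient of the MSE w.r.t. R
    W_dec -= lr * Z.T @ G
    W_enc -= lr * X.T @ (G @ W_dec.T)
final = loss(X, W_enc, W_dec)
print(final < initial)          # reconstruction error decreases
```

Real autoencoders for speech stack non-linear layers, but the principle is the same: the bottleneck code is the compressed representation.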
The paper covers deep representation learning techniques for speech emotion recognition (SER) and compares them to traditional methods and handcrafted features.
The LogMel spectrum is a popular feature used to train deep learning networks in the speech domain, designed to index affective physiological changes in voice production.
A shift from hand-engineered acoustic features to deep representation learning.
Training is complex due to the need to disentangle emotional representations from other attributes in high-dimensional input manifolds.
Emotional representations learned by very deep architectures are found to be robust against adversarial attacks.
They remain largely under-explored.
The ability to automatically learn an intermediate representation of the input signal without manual feature engineering.
It provides a vast array of acoustic features that can reliably indicate the emotional state of the speaker.
They can learn a representation from incomplete data.
DRL-based methods need to be explored for emotional representation learning.
Transformers are used to apply self-supervised multi-modal representation, improving emotion recognition performance.
By facilitating exploration while learning through interaction with the environment.
Principal Component Analysis (PCA).
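As a hedged sketch, PCA can compress a correlated feature matrix into a few uncorrelated components; the data below is a hypothetical stand-in for frame-level acoustic features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 300 frames of 20 correlated feature dimensions generated
# from 3 latent factors (invented data for illustration).
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(300, 20))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (300, 3)
print(pca.explained_variance_ratio_.sum())  # near 1.0 for this near-3-D data
```

Because the data truly lies near a 3-D subspace, three components retain almost all the variance; as a purely linear method, PCA illustrates what the deep, non-linear techniques in the survey go beyond.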
They are good for sequential modeling and can learn temporal structures from speech suitable for emotion classification.
Static representation learning methods.
Limited size labelled emotional data.
DRL combines deep learning and reinforcement learning principles to create systems that learn by interacting with their environment.
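The interaction loop can be illustrated, in heavily simplified tabular form, with Q-learning on a toy five-state chain; DRL replaces this Q-table with a deep network, and the environment and constants here are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy environment: a 5-state chain; the agent starts at state 0 and is
# rewarded only upon reaching state 4.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == n_states - 1 else 0.0)

for _ in range(500):                            # episodes of interaction
    s = 0
    for _ in range(30):
        a = int(rng.integers(n_actions))        # random exploration; Q-learning is off-policy
        s_next, r = step(s, a)
        # Temporal-difference update toward the Bellman target.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        if s_next == n_states - 1:
            break
        s = s_next

print(np.argmax(Q[:4], axis=1))  # greedy policy for states 0-3 prefers "right"
```

The learned greedy policy moves right toward the reward, showing how value estimates emerge purely from interaction with the environment.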
Creating accurate synthetic speech or features in different emotions.
Shallow and deep learning algorithms.
Learning both low-level and high-level representations from emotional speech.
Deep learning techniques to learn representations of input data through non-linear transformations.
Training complexity.
Generative models like VAEs and GANs demonstrate superior performance in representation learning compared to classical methods.
Few shot learning can be used to adapt SER systems with a few samples of target language data.
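One common few-shot recipe is prototypical-network-style adaptation: average a few labelled embeddings per class into prototypes, then classify by nearest prototype. The sketch below is hedged throughout; the "encoder embeddings" are simulated rather than produced by a real SER model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical: pretend a pretrained SER encoder maps utterances to 16-D
# embeddings. Simulate 3 emotion classes in the target language with 5
# labelled "support" examples each (the few-shot data).
n_classes, n_support, dim = 3, 5, 16
class_means = rng.normal(scale=3.0, size=(n_classes, dim))
support = class_means[:, None, :] + rng.normal(size=(n_classes, n_support, dim))

# Prototypical adaptation: one prototype per class, no gradient steps needed.
prototypes = support.mean(axis=1)

def classify(x):
    # Nearest prototype under Euclidean distance.
    return int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))

query = class_means[2] + rng.normal(size=dim)  # an utterance from class 2
print(classify(query))
```

Only the handful of support embeddings is needed to adapt, which is why this style of method suits low-resource target languages.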
A technique where multiple devices collaboratively learn a shared model without revealing local data.
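The core aggregation step, federated averaging (FedAvg), can be sketched in a few lines; the clients, sizes, and weights below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 4 clients each hold locally trained weights for the
# same 10-parameter model. Only weights are shared, never the speech data.
client_weights = [rng.normal(size=10) for _ in range(4)]
client_sizes = np.array([100, 50, 200, 150])  # local dataset sizes

# FedAvg: average client models, weighted by how much data each holds.
global_weights = np.average(client_weights, axis=0, weights=client_sizes)

print(global_weights.shape)  # (10,)
```

In a full system this aggregation runs on a server each round, and the result is broadcast back to clients for further local training.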
They are suitable for capturing emotional attributes in a supervised manner.
The successful training of deep models for representation learning by Hinton and Salakhutdinov.
The first comprehensive survey on the topic, highlighting techniques, challenges, and future research areas.
Because they do not generalize to real-life, natural emotions.
Performance drops significantly if test samples deviate from the training data distribution.
The paper is organized into sections discussing background concepts, deep representation learning for SER, challenges, discussions, and future directions.
GANs, AE-based models, and other discriminative architectures.
Representation learning is less time-consuming, requires minimal human domain knowledge, and does not need extra efforts to design features for new tasks.
They learn representations that often lead to better performance compared to hand-designed representations.
Discrete emotions (e.g., angry, sad) and dimensional emotions (e.g., arousal, valence).
They utilize self-attention mechanisms for learning temporal correlations with less computational complexity.
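A hedged NumPy sketch of scaled dot-product self-attention, the core mechanism: every time frame attends to every other frame in one parallel step, rather than sequentially as in an RNN (sizes and weights here are toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the input frames into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scaled dot-product scores between all pairs of time frames.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 6, 8                              # 6 time frames, 8-D features (toy sizes)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, w = self_attention(X, Wq, Wk, Wv)
print(out.shape)                         # (6, 8)
```

Real Transformers add multiple heads, residual connections, and positional encodings on top of this single-head sketch.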
GANs suffer from convergence issues, making them difficult to train effectively on the available emotional data.
It contains only a small number of non-linear operations and struggles to model complex, high-dimensional, and noisy real-world data.
By facilitating deep representation learning where hierarchical representations are automatically learned in a data-driven manner.
Most SER corpora are biased and may not represent real-life human emotions, leading to erroneous algorithm behavior.
They can better capture temporal contexts compared to RNNs.
They require significant manual effort, impeding generalisability and slowing innovation.
Supervised, unsupervised, semi-supervised, and transfer learning techniques.