Multimodal emotion recognition using two models:
- HuBERT + Wav2Vec2: processes raw audio waveforms with a transformer architecture
- Attention Fusion: processes mel-spectrograms with a CNN architecture
Both models extract visual features with MobileNetV2 and fuse the audio and visual streams via cross-modal attention.
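The cross-modal attention step can be sketched as scaled dot-product attention in which one modality queries the other. This is a minimal NumPy illustration, not the project's actual implementation: the function name, shapes, and the choice of audio-as-query / visual-as-key-value are assumptions for the example.

```python
import numpy as np

def cross_modal_attention(audio_feats, visual_feats):
    """Audio queries attend to visual keys/values (hypothetical sketch).

    audio_feats:  (T_a, d) array of audio-frame embeddings
    visual_feats: (T_v, d) array of visual-frame embeddings
    Returns a (T_a, d) array: each audio step gets a weighted
    summary of the visual sequence.
    """
    d = audio_feats.shape[-1]
    # Similarity of every audio step to every visual step, scaled by sqrt(d)
    scores = audio_feats @ visual_feats.T / np.sqrt(d)        # (T_a, T_v)
    # Softmax over the visual axis (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Attended visual representation for each audio step
    return weights @ visual_feats                              # (T_a, d)
```

In practice a trained model would use learned query/key/value projections (e.g. `torch.nn.MultiheadAttention`), but the weighting mechanism is the same.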
Emotion classes (8): neutral, calm, happy, sad, angry, fearful, disgust, surprised
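For classification, the eight labels map to integer class indices. The ordering below simply follows the list above; the project's actual index assignment may differ.

```python
# Label mapping for the 8 emotion classes.
# Index order is an assumption (taken from the list above).
EMOTIONS = [
    "neutral", "calm", "happy", "sad",
    "angry", "fearful", "disgust", "surprised",
]
ID2LABEL = {i: label for i, label in enumerate(EMOTIONS)}
LABEL2ID = {label: i for i, label in ID2LABEL.items()}
```

A model head would then output logits of shape `(batch, 8)`, with `argmax` indexing into `ID2LABEL`.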