In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and on fusion techniques to combine the extracted features for predicting affect. Yet, these typically ignore the correlation between multiple modality inputs as well as the correlation between temporal inputs (i.e., sequential features). To explore these correlations, we propose a neural network architecture, namely AttendAffectNet (AAN), which uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions, expressed in terms of valence and arousal.

We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way to the features extracted from the different modalities of a whole movie (including video, audio, and movie subtitles), thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account: self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. The Mixed AAN combines the strong points of the Feature AAN and the Temporal AAN by applying self-attention first to the vectors of features obtained from the different modalities in each movie segment, and then to the feature representations of the subsequent (temporal) movie segments.

We extensively trained and validated our proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments demonstrate that audio features play a more influential role than those extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. Models that use all visual, audio, and text features simultaneously as their inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.

VGGish model

The VGGish neural network, with parameters pretrained on the AudioSet dataset for sound classification, is used to extract audio features. In the preprocessing step, the audio from each movie clip is first split into non-overlapping 0.96-s frames. A spectrogram is then computed for each 0.96-s frame using the short-time Fourier transform, with a window size of 25 ms and a hop size of 10 ms. Each spectrogram is mapped to 64 Mel bins, and a logarithmic operation is applied to obtain a log Mel spectrogram of size 96 × 64 per segment. The log Mel spectrogram is passed to the pretrained VGGish model, which comprises six convolutional layers followed by two fully connected layers. This results in a 128-dimensional audio feature vector for each 0.96-s audio segment. Finally, the 128-dimensional vectors extracted from all audio segments of a movie excerpt are element-wise averaged to obtain a single 128-dimensional audio feature vector per excerpt.
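The sketch below illustrates this audio pipeline under a few assumptions: librosa is used to approximate the log Mel computation (the exact VGGish Mel parameters may differ slightly), and `vggish_embed` is a hypothetical handle to an AudioSet-pretrained VGGish port that maps 96 × 64 log Mel patches to 128-dimensional embeddings; neither name comes from the paper.

```python
import numpy as np
import librosa

SR = 16000                    # VGGish operates on 16 kHz mono audio
FRAME_SEC = 0.96              # non-overlapping 0.96-s segments
N_FFT = int(0.025 * SR)       # 25-ms STFT window
HOP = int(0.010 * SR)         # 10-ms hop
N_MELS = 64                   # 64 Mel bins -> 96 x 64 log-mel patches

def log_mel_patches(wav_path):
    """Split the audio into 0.96-s frames and compute a 96 x 64 log Mel patch per frame."""
    y, _ = librosa.load(wav_path, sr=SR, mono=True)
    samples_per_frame = int(FRAME_SEC * SR)
    n_frames = len(y) // samples_per_frame
    patches = []
    for i in range(n_frames):
        seg = y[i * samples_per_frame:(i + 1) * samples_per_frame]
        mel = librosa.feature.melspectrogram(
            y=seg, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
        log_mel = np.log(mel + 1e-6).T      # (time, mel) ~ (97, 64)
        patches.append(log_mel[:96])        # keep exactly 96 time frames
    return np.stack(patches)                # (n_segments, 96, 64)

def clip_level_audio_feature(wav_path, vggish_embed):
    """`vggish_embed` is assumed to map (n, 96, 64) log Mel patches to (n, 128)
    AudioSet-pretrained VGGish embeddings; the per-segment embeddings are then
    element-wise averaged into one 128-d vector for the movie excerpt."""
    embeddings = vggish_embed(log_mel_patches(wav_path))  # (n_segments, 128)
    return embeddings.mean(axis=0)                        # (128,)
```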
In the Feature AAN, each feature vector v_f extracted from a movie excerpt (where f ∈ F, the set of all feature types mentioned in Section 3) is fed to an eight-neuron fully connected layer to obtain a dimension-reduced feature vector v̂_f. The sets of extracted feature vectors and the corresponding dimension-reduced ones are denoted as V and V̂, respectively. The set V̂ is then fed to N identical layers, each of which includes two sub-layers: a multi-head self-attention layer followed by a feed-forward layer. Each of these sub-layers is enclosed by a residual connection accompanied by layer normalization. The number of layers and heads used in this work is discussed in Section 5.2. Note that the order of the vectors v̂_f (f ∈ F) is not considered in this model.
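A minimal PyTorch sketch of this encoder follows. The class name, the mean pooling over the set, the two-output valence/arousal head, and the default layer/head counts are placeholder assumptions (the actual values are discussed in Section 5.2); the 8-dimensional projections, the stack of self-attention and feed-forward sub-layers with residual connections and layer normalization, and the absence of positional encoding follow the description above.

```python
import torch
import torch.nn as nn

class FeatureAANEncoder(nn.Module):
    """Sketch of the Feature AAN encoder: one 8-neuron FC layer per feature
    type f in F, then N identical layers of multi-head self-attention plus
    feed-forward, each wrapped in a residual connection and layer norm.
    No positional encoding is added, so the order of the vectors is ignored."""

    def __init__(self, feature_dims, d_model=8, n_layers=2, n_heads=2):
        super().__init__()
        # One fully connected reduction per feature type (dims are assumptions)
        self.reduce = nn.ModuleList([nn.Linear(d, d_model) for d in feature_dims])
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 2)   # assumed valence/arousal regressor

    def forward(self, features):
        # features: list of tensors, one per feature type, each (batch, dim_f)
        v_hat = torch.stack(
            [fc(v) for fc, v in zip(self.reduce, features)], dim=1)  # (B, |F|, 8)
        encoded = self.encoder(v_hat)                                # (B, |F|, 8)
        return self.head(encoded.mean(dim=1))                        # (B, 2)
```

For example, with the 128-dimensional VGGish audio feature and hypothetical 2048-dimensional visual and 768-dimensional text embeddings, the encoder would be built as `FeatureAANEncoder([128, 2048, 768])`.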