Predicting 360° Video Saliency: A ConvLSTM Encoder-Decoder Network With Spatio-Temporal Consistency

Detailed bibliography
Published in: IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Volume 14, Issue 2, pp. 311-322
Main authors: Wan, Zhaolin; Qin, Han; Xiong, Ruiqin; Li, Zhiyang; Fan, Xiaopeng; Zhao, Debin
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2024
ISSN:2156-3357, 2156-3365
Description
Summary: 360° videos have been widely used with the development of virtual reality technology and have triggered a demand to determine the most visually attractive objects in them, also known as 360° video saliency prediction (VSP). While generative models, e.g., variational autoencoders or autoregressive models, have proved their effectiveness in handling spatio-temporal data, utilizing them in 360° VSP is still challenging due to severe distortion and feature alignment inconsistency. In this study, we propose a novel spatio-temporal consistency generative network for 360° VSP. A dual-stream encoder-decoder architecture is adopted to process the forward and backward frame sequences of 360° videos simultaneously. Moreover, a deep autoregressive module termed axial-attention based spherical ConvLSTM is designed in the encoder to memorize features with global-range spatial and temporal dependencies. Finally, motivated by the bias phenomenon in human viewing behavior, a temporal-convolutional Gaussian prior module is introduced to further improve the accuracy of the saliency prediction. Extensive experiments are conducted to evaluate our model and state-of-the-art competitors, demonstrating that our model achieves the best performance on the PVS-HM and VR-Eyetracking databases.
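The abstract names two building blocks that have generic counterparts: a ConvLSTM recurrence over video frames and a learned Gaussian prior reflecting viewing bias. The following minimal PyTorch sketch is an illustration only, not the authors' implementation: it shows a plain ConvLSTM cell and a learnable Gaussian center-bias map, while the paper's axial attention, spherical convolution, dual-stream decoding, and temporal-convolutional prior are omitted. All class and parameter names (ConvLSTMCell, GaussianPrior, hidden_channels) are hypothetical.

# Illustrative sketch of a ConvLSTM recurrence and a learnable Gaussian prior.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all four gates come from one 2D convolution."""

    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        self.hidden_channels = hidden_channels
        self.gates = nn.Conv2d(
            in_channels + hidden_channels, 4 * hidden_channels,
            kernel_size, padding=kernel_size // 2,
        )

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)          # update cell memory
        h = o * torch.tanh(c)                  # new hidden state
        return h, c

    def init_state(self, batch, height, width, device):
        zeros = torch.zeros(batch, self.hidden_channels, height, width, device=device)
        return zeros, zeros


class GaussianPrior(nn.Module):
    """Learnable Gaussian bias map added to the saliency logits."""

    def __init__(self):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(2))         # learned center of the bias
        self.log_sigma = nn.Parameter(torch.zeros(2))  # learned spread of the bias

    def forward(self, height, width, device):
        ys = torch.linspace(-1.0, 1.0, height, device=device)
        xs = torch.linspace(-1.0, 1.0, width, device=device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        sigma = self.log_sigma.exp()
        prior = torch.exp(
            -((xx - self.mu[0]) ** 2) / (2 * sigma[0] ** 2)
            - ((yy - self.mu[1]) ** 2) / (2 * sigma[1] ** 2)
        )
        return prior.unsqueeze(0).unsqueeze(0)         # shape (1, 1, H, W)


if __name__ == "__main__":
    cell = ConvLSTMCell(in_channels=3, hidden_channels=16)
    prior = GaussianPrior()
    frames = torch.randn(2, 8, 3, 64, 128)             # toy clip: (batch, time, C, H, W)
    h, c = cell.init_state(2, 64, 128, frames.device)
    for t in range(frames.shape[1]):                    # recurrent pass over the frames
        h, c = cell(frames[:, t], (h, c))
    saliency = torch.sigmoid(h.mean(dim=1, keepdim=True) + prior(64, 128, frames.device))
    print(saliency.shape)                               # torch.Size([2, 1, 64, 128])

In this toy usage, the recurrent state after the last frame is collapsed to a single-channel map and combined with the learned prior; the actual model instead decodes the states of forward and backward streams under a spatio-temporal consistency objective.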
DOI:10.1109/JETCAS.2024.3377096