Order ID: 89JHGSJE83839 | Style: APA/MLA/Harvard/Chicago | Pages: 5-10
Instructions:
Abstract Cross-Media Data Representation Discussion
Multiple kernel visual-auditory representation learning for retrieval
Hong Zhang1,2 & Wenping Zhang1 & Wenhe Liu3 & Xin Xu1 & Hehe Fan4
Received: 4 October 2015 / Revised: 3 December 2015 / Accepted: 21 January 2016 / Published online: 23 February 2016. © Springer Science+Business Media New York 2016
Abstract Cross-media data representation, which focuses on semantic understanding of multimedia data in different modalities, is a rising hot topic in web media data analysis. The most challenging issues for cross-media data representation are how to find underlying content-level data correlations and how to use such correlations in the representation model.
Most traditional web media data analysis works are based on single-modality data sources, such as Flickr images or YouTube videos, leaving cross-media data representation and semantics understanding wide open. In this paper, we propose a multiple kernel visual-auditory representation learning approach, which learns cross-media correlations from visual and auditory feature spaces with multiple kernel strategies. In addition, we define a cross-media distance measure for image-audio retrieval in the mutual subspace of co-occurrence. Experimental results on the collected image-audio database are encouraging, and show from multiple perspectives that our approach is effective.
Keywords Multiple kernel learning · Visual-auditory data representation · Cross-media retrieval
1 Introduction
Multimedia representation learning has drawn tremendous research attention in the past decades, in areas such as Content-based Image Retrieval (CBIR) [9, 19, 31], multimedia data
Multimed Tools Appl (2016) 75:9169–9184 DOI 10.1007/s11042-016-3294-5
Hong Zhang zhanghong_wust@163.com
1 College of Computer Science & Technology, Wuhan University of Science & Technology, Wuhan 430081, China
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan, China
3 The Centre for Quantum Computation & Intelligent Systems, the University of Technology, Sydney (UTS), Sydney, Australia
4 Baidu, Beijing, China
clustering and classification [12, 28, 38], face and motion recognition [11, 24], and event detection [1–3]. Abundant representation learning methods have been proposed to explore a semantic-level data representation model that could be used to better understand underlying data correlations. For example, in CBIR research, subspace learning is frequently used to bridge the gap between low-level visual features and high-level image semantics so as to build the semantic data representation. However, most of these works have focused on multimedia data of a single modality, such as image or audio, and cross-media data representation learning is mostly ignored [32, 35]. It is interesting and challenging to retrieve multimedia data of different modalities at the same time, especially since nowadays different kinds of multimedia data usually coexist in web sources representing similar semantics.
Content features are the carriers of multimedia semantics. The main challenge for cross-media data representation lies in two aspects: how to find underlying feature correlations among multimedia data of different modalities, and how to use such correlations in cross-media data representation. Considering these two issues, some researchers have proposed representation learning models under certain cross-media data environments. For example, Yi Yang et al. [33] proposed a multi-feature fusion algorithm based on hierarchical regression to learn general multimodal semantics, and it was verified on a multimodal document database containing text, image and audio. Paper [36] proposed a cross-media representation learning framework, which explored inherent feature correlations and discovered external useful knowledge based on nonlinear low-level feature analysis. Paper [40] learned a uniform cross-media correlation graph, in which different kinds of multimedia objects are represented in exactly the same way.
Most of these works explored underlying cross-media correlations and built multimodal data representations with the help of prior knowledge, such as page links, user comments and tagging. However, the underlying cross-media correlation among heterogeneous low-level content features is mostly ignored or underestimated. Experimental evidence has shown that different kinds of multimedia data each contribute to high-level semantics, so that the presence of one modality usually has a "complementary effect" with the other [33]. Our previous work on cross-media data analysis also showed that such complementary information can be explored and utilized to improve multimedia semantics understanding results [37, 39].
However, it is difficult to learn an effective cross-media content representation because multimedia data of different modalities originally reside in heterogeneous low-level feature spaces. Although image and audio data may represent similar semantics, such as an image of a bird and an audio clip of bird song, it is challenging to find a unified representation for both bird images and audio clips. In this paper, we propose a Multiple Kernel Visual-Auditory Representation Learning (MKVARL) method for retrieval.
Our framework is formulated based on two typical modalities, i.e., image and audio. In preprocessing, considering that audio is a kind of time-series data while image data is static, we use the fuzzy clustering method proposed in our previous work to obtain audio indexes so that all audio data are represented with the same dimensionality. Then, inspired by a recent multiple kernel learning algorithm for visual search [27], we propose multiple kernel visual-auditory learning.
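The paper's own fuzzy clustering indexing is defined in prior work not reproduced here; as a rough illustration of the idea only (turning a variable-length sequence of audio frames into a fixed-dimension index), a generic fuzzy c-means sketch in NumPy might look like the following. The cluster count `c`, fuzzifier `m`, and the use of the mean fuzzy membership as the clip descriptor are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def fuzzy_cmeans(frames, c=8, m=2.0, iters=50, seed=0):
    # Generic fuzzy c-means over per-frame audio features (n_frames x n_dims).
    # NOTE: a stand-in for the paper's (unspecified) fuzzy clustering step.
    rng = np.random.default_rng(seed)
    n = frames.shape[0]
    U = rng.random((n, c))
    U /= U.sum(1, keepdims=True)                      # fuzzy memberships
    for _ in range(iters):
        Um = U ** m
        centers = (Um.T @ frames) / Um.sum(0)[:, None]
        d = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(1, keepdims=True)           # standard FCM update
    return U, centers

def audio_index(frames, centers, m=2.0):
    # Fixed-length clip descriptor: mean fuzzy membership over the c centers,
    # so clips of any duration map to the same dimensionality.
    d = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(1, keepdims=True)
    return U.mean(0)                                  # length-c vector, sums to 1
```

With centers fitted on training frames, clips of 120 and 75 frames both yield a length-`c` descriptor, which is the property the preprocessing step needs.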
Specifically, we first map the low-level image feature matrix and audio feature matrix into high-dimensional kernel spaces with multiple kernel functions in order to better explore underlying cross-media correlations; second, we compute visual-auditory canonical correlations between each pair of kernel spaces, and maximize this correlation when mapping the kernel spaces into the low-dimensional Isomorphic Visual-Auditory Subspace (IVA-Subspace). With multiple kernel learning, cross-media data correlations are
analyzed from different aspects, and more useful information can be explored in the high-dimensional kernel spaces than in the original visual and auditory feature spaces. Furthermore, we discuss how to apply our MKVARL method to cross-media retrieval between image and audio. Experiments and comparisons verify the validity, superiority and applicability of our approach from different aspects.
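As a hedged sketch of the general pipeline this paragraph describes (kernelize both modalities, then maximize the canonical correlation between the kernel spaces to obtain a shared subspace), the following NumPy fragment implements a standard regularized kernel CCA with a weighted combination of base kernels. The kernel choices, weights, and regularizer are illustrative assumptions; the paper's actual MKVARL formulation may differ.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Gaussian (RBF) kernel matrix between row-sample matrices X and Y.
    d2 = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * d2)

def combine_kernels(kernels, weights):
    # "Multiple kernel" aspect: convex combination of base kernel matrices.
    return sum(w * K for w, K in zip(weights, kernels))

def center_kernel(K):
    # Center the kernel matrix in feature space.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_cca(Kx, Ky, n_components=2, reg=1e-3):
    # Regularized kernel CCA: dual coefficients (alpha, beta) maximizing
    # the correlation between projections of the visual and auditory
    # kernel spaces (a stand-in for the paper's IVA-Subspace mapping).
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    I, Z = np.eye(n), np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Kx @ Kx + reg * I, Z], [Z, Ky @ Ky + reg * I]])
    vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
    order = np.argsort(-vals.real)[:n_components]
    W = vecs[:, order].real
    return W[:n], W[n:]   # image-side and audio-side dual coefficients
```

Co-occurring image-audio pairs are then projected as `center_kernel(Kx) @ alpha` and `center_kernel(Ky) @ beta`, and cross-media retrieval can rank candidates by (for instance) Euclidean distance between these projections in the shared subspace.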
The rest of this paper is organized as follows. Section 2 discusses related work from two aspects. Section 3 presents multiple kernel visual-auditory representation learning based on image and audio samples, and describes how to enable flexible cross-media retrieval between image and audio datasets. Section 4 presents the experimental results and comparisons. We give concluding remarks in Section 5.
RUBRIC

| Criterion | Excellent Quality (95-100%) | Average Score (50-85%) | Poor Quality (0-45%) |
|---|---|---|---|
| Introduction | 45-41 points. The background and significance of the problem and a clear statement of the research purpose are provided. The search history is mentioned. | 40-38 points. More depth/detail for the background and significance is needed, or the research detail is not clear. No search history information is provided. | 37-1 points. The background and/or significance are missing. No search history information is provided. |
| Literature Support | 91-84 points. The background and significance of the problem and a clear statement of the research purpose are provided. The search history is mentioned. | 83-76 points. Review of relevant theoretical literature is evident, but there is little integration of studies into concepts related to the problem. The review is partially focused and organized. Supporting and opposing research are included. A summary of the information presented is included. The conclusion may not contain a biblical integration. | 75-1 points. Review of relevant theoretical literature is evident, but there is no integration of studies into concepts related to the problem. The review is partially focused and organized. Supporting and opposing research are not included in the summary of information presented. The conclusion does not contain a biblical integration. |
| Methodology | 58-53 points. Content is well organized with headings for each slide and bulleted lists to group related material as needed. Use of font, color, graphics, effects, etc. to enhance readability and presentation content is excellent. The length requirement of 10 slides/pages or less is met. | 52-49 points. Content is somewhat organized, but no structure is apparent. The use of font, color, graphics, effects, etc. is occasionally detracting from the presentation content. Length requirements may not be met. | 48-1 points. There is no clear or logical organizational structure. No logical sequence is apparent. The use of font, color, graphics, effects, etc. is often detracting from the presentation content. Length requirements may not be met. |