Mapping New Media Objects into the IVA-Subspace
clips from each category as training data, and the rest are used as new media objects to test the performance of mapping new media objects into the IVA-Subspace.
The extracted visual features include a color histogram (in HSV space), an edge histogram, a texture feature based on the gray-level co-occurrence matrix, Speeded Up Robust Features (SURF), and GIST. Auditory features consist of spectral centroid, spectral rolloff, spectral flux, and root mean square (RMS) energy.
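As an illustration of how such features might be computed (a minimal sketch, not the authors' extraction pipeline; the bin counts, sample rate, and rolloff percentage are our assumptions), an HSV color histogram and three of the auditory descriptors could look like:

```python
import numpy as np

def hsv_color_histogram(hsv_image, bins=(8, 4, 4)):
    """Color histogram over the H, S, V channels of an image given
    as an (H, W, 3) array with channel values in [0, 1]."""
    hist, _ = np.histogramdd(hsv_image.reshape(-1, 3), bins=bins,
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / hist.sum()  # normalize to a distribution

def audio_frame_features(frames, sr=22050, rolloff_pct=0.85):
    """Per-frame auditory features: spectral centroid, spectral rolloff,
    and RMS. `frames` is an (n_frames, frame_len) array of samples."""
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
    power = spectrum.sum(axis=1) + 1e-12
    centroid = (spectrum * freqs).sum(axis=1) / power
    cum = np.cumsum(spectrum, axis=1)
    rolloff_idx = (cum >= rolloff_pct * power[:, None]).argmax(axis=1)
    rolloff = freqs[rolloff_idx]
    rms = np.sqrt((frames ** 2).mean(axis=1))
    return np.stack([centroid, rolloff, rms], axis=1)
```

Spectral flux (the frame-to-frame change in spectrum) would be computed analogously from successive rows of `spectrum`.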
We concatenate the different visual features into high-dimensional vectors as input. Since audio is a kind of time-series data, the dimensionalities of the auditory feature vectors are inconsistent; we therefore employ fuzzy clustering on the auditory features during preprocessing to obtain isomorphic audio feature indexes [39]. As described in Section 3, we use multiple kinds of kernels for visual-auditory correlation analysis: the radial basis function in (12), the polynomial kernel function in (13), and the sigmoid function in (14).
$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{\gamma \sigma^2}\right) \quad (12)$$

$$k(x, y) = \left(\gamma \langle x, y \rangle + c\right)^n \quad (13)$$

$$k(x, y) = \tanh\!\left(\gamma \langle x, y \rangle + c\right) \quad (14)$$
where we choose empirical optimal values of γ = 2, σ = 2.4 in (12), γ = 1, c = 1, n = 4.2 in (13), and γ = 0.6, c = 1.9 in (14); the combination weights in (11) are set to η = (0.35, 0.2, 0.45).
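With these parameter choices, the three kernels of Eqs. (12)-(14) are straightforward to implement. The sketch below is ours, not the paper's code; it also combines the kernels with the weights η from (11):

```python
import numpy as np

def rbf_kernel(x, y, gamma=2.0, sigma=2.4):
    """Radial basis function kernel, Eq. (12)."""
    return np.exp(-np.sum((x - y) ** 2) / (gamma * sigma ** 2))

def poly_kernel(x, y, gamma=1.0, c=1.0, n=4.2):
    """Polynomial kernel, Eq. (13). Since n is non-integer,
    gamma * <x, y> + c must be non-negative."""
    return (gamma * np.dot(x, y) + c) ** n

def sigmoid_kernel(x, y, gamma=0.6, c=1.9):
    """Sigmoid kernel, Eq. (14)."""
    return np.tanh(gamma * np.dot(x, y) + c)

def combined_kernel(x, y, eta=(0.35, 0.2, 0.45)):
    """Weighted multiple-kernel similarity as in Eq. (11)."""
    return (eta[0] * rbf_kernel(x, y)
            + eta[1] * poly_kernel(x, y)
            + eta[2] * sigmoid_kernel(x, y))
```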
4.2 Performance comparison results
To evaluate the efficacy of the proposed algorithm, we compare the image-audio retrieval performance of the proposed MKVARL approach with the PCA [25], CCA [17], and KCCA [14] methods. When a user submits an image query example from the training set, relevant audio clips are retrieved and returned, and vice versa. In our experiments, a returned result is regarded as correct if it belongs to the same semantic category as the query example, and precision is defined as the percentage of correctly retrieved samples among the top-k returned results.
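The precision measure defined here can be sketched as follows (a hypothetical helper, with category labels standing in for semantic relevance):

```python
def precision_at_k(returned_labels, query_label, k):
    """Fraction of the top-k returned results whose semantic
    category matches the query example's category."""
    top_k = returned_labels[:k]
    return sum(1 for label in top_k if label == query_label) / k
```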
Figure 2 shows the Mean Average Precision (MAP) of the different algorithms, and Fig. 3 shows the comparison results for recall. In Figs. 2 and 3, the MAP and recall values are averaged over 10 queries in each semantic category: 5 image queries using audio examples and 5 audio queries using image examples, with query examples selected randomly. From Figs. 2 and 3 we can see that CCA, KCCA, and MKVARL perform much better than PCA.
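For reference, the MAP reported in Fig. 2 is the mean over queries of per-query average precision. A minimal sketch under binary relevance (our formulation, not the authors' evaluation script):

```python
def average_precision(ranked_relevant):
    """Average precision for one query. `ranked_relevant` is a list of
    booleans, True where the i-th returned result is relevant."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # precision at each relevant hit
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """MAP: mean of average precision over all queries."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)
```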
Meanwhile, KCCA outperforms CCA, while our proposed MKVARL algorithm achieves the best performance. These results can likely be explained as follows: (1) the projection vectors of CCA, KCCA, and MKVARL are computed from the potential relevance between image features and audio features, which better reflects high-level semantics; (2) the use of kernel functions in KCCA makes it more appropriate for nonlinear correlations; (3) different kernels correspond to different notions of similarity between two data samples. In particular, in a high-dimensional feature space, no single kernel is optimal for all datasets: a single type of kernel function may fail to exploit all potential correlations, whereas multiple types of kernel functions can explore them more fully, which validates the importance of the proposed method. Our approach generally returns more relevant results, verifying its effectiveness.
Multimed Tools Appl (2016) 75:9169–9184 9177
Figure 4 shows a specific example of image-audio retrieval. The query example is a 5-s audio clip from the violin category. We compute the similarity score between the query audio and the images in the database, and return the top-15 relevant images. The numbers below the returned images are the correlation values between the images and the audio query example. It can be seen from Fig. 4 that 12 of the top 15 returned results are violin images.

Fig. 2 MAP performance comparison results of image-audio retrieval

Fig. 3 Recall performance comparison results of image-audio retrieval

Fig. 4 An example of image-audio retrieval

4.3 Performance evaluation of new media objects

To test image-audio retrieval performance when query examples are outside the training set, we first use the method from our previous work [39] to estimate their coordinates in the IVA-Subspace, and then use the cosine distance metric to compute the cross-media correlation scores.

Fig. 5 Querying image by new audio

Figures 5 and 6 show the experimental results with new query examples, including querying images by new audio clips and querying audio by new images. From Figs. 5 and 6 we observe that the overall retrieval performance with new multimedia data remains good. When querying images by a new audio example, there are on average 8.58 correct results among the top 20 returns. The performance of querying audio by a new image is similar to that of querying images by new audio.
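The ranking step described above (cosine similarity over estimated IVA-Subspace coordinates) can be sketched as follows; the function names and the small epsilon guard against zero vectors are our assumptions:

```python
import numpy as np

def cosine_scores(query_coord, db_coords):
    """Cosine similarity between one query's IVA-Subspace coordinates
    and every database object's coordinates (rows of `db_coords`)."""
    q = query_coord / (np.linalg.norm(query_coord) + 1e-12)
    db = db_coords / (np.linalg.norm(db_coords, axis=1, keepdims=True) + 1e-12)
    return db @ q

def top_k_retrieval(query_coord, db_coords, k=20):
    """Return the indices and scores of the k most similar objects."""
    scores = cosine_scores(query_coord, db_coords)
    order = np.argsort(-scores)  # descending similarity
    return order[:k], scores[order[:k]]
```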
5 Conclusions
Different from most existing multimedia representation learning methods, this paper proposes a multiple kernel visual-auditory representation learning framework, which learns a general representation model from the visual and auditory feature spaces by explicitly learning statistical cross-media correlations in high-dimensional kernel spaces. In addition, we design a distance metric learning strategy in the mutual subspace.
The performance of our approach is tested with cross-media retrieval between image and audio data. Experiments and comparisons verify the validity, superiority, and applicability of our approach from different aspects. The main limitation is that the image-audio database is comparatively small (many web image galleries are unusable because it is difficult to find suitable audio). Future work includes further study on large-scale social media datasets.
Acknowledgments This research is supported by the National Natural Science Foundation of China (No. 61003127, No. 61373109, No. 61440016) and the China Scholarship Council (201508420248).
References
1. Chang X, Yang Y, Hauptmann AG, Xing E, Yu Y (2015) Semantic concept discovery for large-scale zero-shot event detection. International Joint Conference on Artificial Intelligence (IJCAI)
2. Chang X, Yang Y, Xing E, Yu Y (2015) Complex event detection using semantic saliency and nearly-isotonic SVM. International Conference on Machine Learning (ICML)
3. Chang X, Yu Y, Yang Y, Hauptmann A (2015) Searching persuasively: joint event detection and evidence justification with limited supervision. ACM MM
Fig. 6 Querying audio by new image
Learn Res 13:519–547
7. Jain A, Vishwanathan SVN, Varma M (2012) SPG-GMKL: generalized multiple kernel learning with a million kernels. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining
8. Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semi-definite programming. J Mach Learn Res 5:27–72
9. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl 2(1):1–19
10. Liu Y, Wu F, Zhuang Y, Xiao J (2008) Active post-refined multimodality video semantic concept detection with tensor representation. ACM International Conference on Multimedia, pp 91–100
11. Liu G, Yan Y, Gao C, Tong W, Hauptmann AG, Sebe N (2014) The mystery of faces: investigating face contribution for multimedia event detection. ICMR
12. Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
13. Ma Q, Akiyo N, Katsumi T (2006) Complementary information retrieval for cross-media news content. Inf Syst 31(7):659–678
14. Melzer T, Reiter M, Bischof H (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 36:1961–1971
15. Shen H, Yan Y, Xu S, Ballas N, Chen W (2015) Evaluation of semi-supervised learning method on action recognition. Multimed Tools Appl 74(2):523–542
16. Sonnenburg S, Rätsch G, Schäfer C, Schölkopf B (2006) Large scale multiple kernel learning. J Mach Learn Res 7:1531–1565
17. Sun T, Chen S (2007) Locality preserving CCA with applications to data visualization and pose estimation. Image Vis Comput 25:531–543
18. Thomas M, Michael R, Horst B (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 27(2):1–8
19. Tolias G, Bursuc A, Furon T, Jégou H (2015) Rotation and translation covariant match kernels for image retrieval. Comp Vis Image Underst 140:9–20
20. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. ACM International Conference on Multimedia, pp 107–118
21. Vapnik V (1997) The nature of statistical learning theory. IEEE Trans Neural Netw 8(6)
22. Varma M, Babu BR (2009) More generality in efficient multiple kernel learning. In: Proceedings of the International Conference on Machine Learning, pp 1065–1072
23. Vishwanathan SVN, Sun Z, Ampornpunt N, Varma M (2010) Multiple kernel learning and the SMO algorithm. In: NIPS, pp 2361–2369
24. Wang D, Hoi SC, He Y, Zhu J, Mei T, Luo J (2014) Retrieval-based face annotation by weak label regularized local coordinate coding. IEEE Trans Pattern Anal Mach Intell 36(3):550–563
25. Wu Y, Chang EY, Chang KCC, Smith JR (2004) Optimal multimodal fusion for multi-media data analysis. In: ACM Multimedia Conference, pp 572–579
26. Wu Y, Chang EY, Chen-Chuan Chang K, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. ACM International Conference on Multimedia, pp 572–579
27. Xia H, Hoi SC, Jin R, Zhao P (2012) Online multiple kernel similarity learning for visual search. IEEE Trans Pattern Anal Mach Intell 1(1)
28. Yan Y, Ricci E, Liu G, Sebe N (2015) Egocentric daily activity recognition via multitask clustering. IEEE Trans Image Process 24(10):2984–2995
29. Yan Y, Ricci E, Subramanian R, Liu G, Lanz O, Sebe N. A multi-task learning framework for head pose estimation under target motion. IEEE Trans Pattern Anal Mach Intell, in press
Hong Zhang (corresponding author) received the BS degree from Wuhan University of Technology, China, in 2001, the MS degree from Wuhan University of Technology, China, in 2004, and the PhD degree from Zhejiang University, China, in 2007. She is currently a professor in the College of Computer Science and Technology, Wuhan University of Science and Technology, China. Her research interests include content-based multimedia analysis, machine learning, and cross-media retrieval.
Wenping Zhang is currently a master student in the college of computer science and technology, Wuhan University of Science and Technology, China. He received his BS degree from Huazhong Agricultural University Chutian College, China, in 2014. His research interests include machine learning and data mining.
Wenhe Liu received the MS degree in Artificial Intelligence from the University of Edinburgh, United Kingdom, in 2012. He is now a PhD student with the Centre for Quantum Computation & Intelligent Systems (QCIS), University of Technology Sydney (UTS), Sydney, Australia. His research interests include machine learning and its applications to multimedia and computer vision.
Xin Xu received the Ph.D. degree in computer science and engineering from Shanghai Jiao Tong University, China. He is a lecturer in the School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China. His current research interests include computer vision, pattern recognition, and visual surveillance.
Hehe Fan received the Master degree of Computer Architecture from Huazhong University of Science and Technology, China, in 2015. His research interests include distributed computing, parallel processing and machine learning. Hehe Fan is currently a Research and Development Engineer in Baidu Inc.
Multimedia Tools & Applications is a copyright of Springer, 2016. All Rights Reserved.
Multiple kernel visual-auditory representation learning for retrieval