Towards an adaptive multimodal information access and retrieval paradigm

Topic Description

Multimodal information access and retrieval (MIAR) over the Web or other repositories is becoming a major way of acquiring, aggregating and interacting with information. By “multimodal”, we mean that the information content can be accessed in different modalities (e.g. text, tags, images, videos, sketches, 3D objects), each of which can be represented by one or more computable features. In this proposal, we focus on textual and visual content. Suppose an architecture student wants to find information about historical buildings that appear in some scenery photos. Using these photos as query examples, she can search the Web to gather relevant information in different modalities, such as articles, photos, design sketches and satellite images. Online information gathering and publishing services, if enhanced by an effective MIAR mechanism, would allow users to intelligently build their own online galleries or magazines from sources of their choice. In addition, the rapidly growing image/video content analysis business requires extracting, correlating and searching with a number of different textual and visual features. The results can then be refined continuously, e.g. by allowing the user to provide relevance feedback. We also need to deal with multimodal information available locally. An example is the medical domain (NHS), where text and various types of 2D and 3D images of patients are currently accessed via conventional mono-modal techniques, which have proved insufficient. The proposed adaptive MIAR framework (AdaptMIAR) would enable doctors to utilise these rich data sources more effectively, in a coordinated and iterative way, for better diagnosis and treatment.

Thus there is a growing need for a new MIAR paradigm that adapts to the modality preferences expressed in the content and context. This would involve modelling the multimodal content and the user's evolving multimodal context, and developing a contextualised, adaptive retrieval function. These aspects interact with each other in complex ways. To date, limited progress has been made towards realising the full potential of MIAR, and established methods have proved inadequate to address the above issues. This PhD project is therefore timely, tackling a challenge that is recognised but largely unsolved by industry and the research community. The proposed Adaptive MIAR framework is not about simply pooling together multiple individual retrieval systems. Instead, it will use deep learning to combine features intelligently and to adapt ranking functions during the contextual retrieval process, by exploiting the interactions between different modalities. This requires investigating and developing novel approaches across the disciplines of information retrieval, computer vision, machine learning and human-computer interaction. We will also implement a prototype system, and apply and evaluate it in selected real-world application domains. This project will be carried out in collaboration with The Robert Gordon University, NHS Greater Glasgow and Clyde, and various other industry partners.

Skills Required:

Applicants must have a high-quality Honours degree (preferably First Class) or a Master's qualification (preferably with distinction) in a relevant discipline; knowledge of probability theory and statistics; programming skills; research experience and publications in information retrieval and image processing are desirable.

Background Reading:

Schraefel, M.C., Wilson, M., Russell, D. and Smith, D.A. (2006). mSpace: Improving information access to multimedia domains with multimodal exploratory search. CACM, 47-49.
Rahman, M.M., Bhattacharya, P. and Desai, B.C. (2009). A unified image retrieval framework on local visual and semantic concept-based feature spaces. J. Visual Communication and Image Representation, 20(7), 450-462.
Wang, J., Song, D. and Kaliciak, L. (2010). Tensor Product of Correlated Text and Visual Features: A Quantum Theory Inspired Image Retrieval Framework. QI10, 109-116.
Wang, L., Song, D. and Elyan, E. (2011). Words-of-Interest Selection based on Temporal Motion Coherence for Video Retrieval. SIGIR11, 24-28.
Srivastava, N. and Salakhutdinov, R. (2012). Multimodal Learning with Deep Boltzmann Machines. NIPS12, 2231-2239.
Wang, L., Song, D. and Elyan, E. (2012). Improving Bag-of-visual-Words Model with Spatial-temporal Correlation for Video Retrieval. CIKM12, 1303-1312.
Kaliciak, L., Song, D., Wiratunga, N. and Pan, J. (2013). Combining Visual and Textual Systems within the Context of User Feedback. MMM13, 445-455.
Wang, L., Elyan, E. and Song, D. (2014). Rebuilding Visual Vocabulary via Spatial-Temporal Context Similarity for Video Retrieval. MMM14, 74-85.

Dawei Song


Shailey Minocha