Incorporating domain (collection) features to improve information retrieval performance

Topic Description

Information Retrieval (IR) techniques that underpin today's search engines have transformed the ways in which people seek, interpret and work with information, and play a central role in our digital economy. It has been recognised that there are complex interdependencies between the performance of an IR technique and the salient properties of the document collection on which it is deployed. Collection properties (e.g., linguistic, statistical and domain coverage features) can vary significantly across domains (e.g., legal vs patent vs biomedical documents). Scientifically, therefore, findings about an IR technique on one collection cannot be automatically generalised to other collections. Practically, search solutions should be tailored to the profile of the underlying document collection, instead of simply adopting an off-the-shelf Web search engine for everything. However, it is currently not well understood what the relevant collection properties are and how they are connected to the performance of IR techniques.

This project aims to address this fundamental gap by developing, implementing and evaluating a comprehensive document collection profiling infrastructure that characterises document collections in relation to the performance of a number of typical IR techniques, together with methods for adapting an IR technique to a specific document collection to achieve optimal performance. The project will be carried out in collaboration with the University of Essex.

Skills Required:

Applicants must have a high-quality Honours degree (preferably First Class) or a Master's qualification (preferably with distinction) in a relevant discipline; knowledge of probability theory and statistics; programming skills; research experience and publications in information retrieval and natural language processing are desirable.

Background Reading:

De Roeck, A. N., Sarkar, A., and Garthwaite, P. (2004). Frequent Term Distribution Measures for Dataset Profiling. In The 4th International Conference on Language Resources and Evaluation (LREC2004), 1647–1650.
De Roeck, A., Song, D., Kruschwitz, U., and Azzopardi, L. (Eds.) (2009). Corpus Profiling for IR and NLP. BCS eWiC.
Hawking, D. and Robertson, S. (2003). On Collection Size and Retrieval Effectiveness. Information Retrieval, 6(2003), 99–150.
Koolen, M. and Kamps, J. (2010). The Importance of Anchor Text for ad hoc Search Revisited. In SIGIR2010, 122–129.
Sanderson, M. and van Rijsbergen, C.J. (1999). The Impact on Retrieval Effectiveness of Skewed Frequency Distributions. ACM Transactions on Information Systems (TOIS), 17(4), 440–465.
Yan, X., Lau, R.Y.K., Song, D., Li, X., and Ma, J. (2011). Towards a Semantic Granularity Model for Domain Specific Information Retrieval. ACM Transactions on Information Systems (TOIS), 29(3), 15:1–15:46.
Fang, H., Tao, T., and Zhai, C. (2011). Diagnostic Evaluation of Information Retrieval Models. ACM Transactions on Information Systems (TOIS), 29(2), 7.
