G. Ramirez

Structural Features in XML Retrieval

SUMMARY

Retrieval systems help us find information in digital data collections by retrieving documents that might be relevant to our search query. Unfortunately, scanning through the retrieved documents in search of the precise piece of information we are looking for can still be time-consuming, especially if the documents are long. In these situations, it would be of great help if retrieval systems provided access to the relevant parts of documents instead of the complete documents.

This thesis addresses this problem in the domain of XML documents, that is, documents that have been marked up with XML, the Extensible Markup Language. In this domain, the task of providing access to specific parts of documents is known as XML element retrieval. In particular, we investigate whether the structural characteristics of XML documents (such as the markup and the metadata) can help retrieval systems search more effectively.

We first propose a retrieval framework in which evidence from four different types of XML element representations can be combined: the element content, the element context, the element metadata, and the document metadata. We then use the proposed framework to investigate the potential of different structural features for retrieval in two scenarios: 1) the ad-hoc retrieval of XML elements, where we show that using the relationships between XML elements can improve retrieval effectiveness, and 2) relevance feedback, where we show that knowledge of the structural characteristics of the relevant elements can help to find structurally similar ones and improve retrieval effectiveness. Finally, we look at the potential of contextual information in this domain. We present an analysis of an interactive user study in which we investigate the correlations between different contextual features of the information need and the structural characteristics of the relevant XML elements.

The work presented in this dissertation contributes to the understanding of the use of structural features for XML element retrieval. It identifies and analyzes the potential of different structural features for retrieval and proposes new ways to exploit them.