[Thesis]. Manchester, UK: The University of Manchester; 2016.
With the growing demand for information across domains, sharing information
from heterogeneous data sources is now a necessity. Data integration approaches promise
to combine data from these different sources and present the user with a single, unified
view of these data. However, although these approaches offer high-quality services
for managing and integrating data, they come at a high cost, because a great deal
of manual effort is needed to form relationships across data sources when setting up
the data integration system. A newer variant of data integration, known as dataspaces,
aims to spread this large up-front manual effort across the rest of the system's
lifetime. This is achieved by soliciting feedback from the user on a chosen artefact
of a dataspace, either explicitly or implicitly. This practice is known as pay-as-you-go:
the user continually pays into the data integration system by providing feedback,
and in return gains improvements in the quality of data integration. This PhD
addresses two challenges in data integration
by using pay-as-you-go approaches. The first is to identify instances relevant to
a user's information need, which calls for close consideration of semantic mappings.
Our contribution is a technique that ranks mappings with the help of implicit user
feedback (i.e., terms found in query logs). Our evaluation shows that our technique
does not require large query logs to produce stable rankings, and that the generated
ranking responds satisfactorily to the proportion of terms inclined towards
a particular data source, a property we describe as skew. The second challenge that
we address is the identification of duplicate instances from disparate data sources.
We contribute a strategy that uses explicitly-obtained user feedback to drive an evolutionary
search algorithm to find suitable parameters for an underlying clustering algorithm.
Our experiments show that optimising the algorithm's parameters and introducing attribute
weights produces fitter clusters than clustering alone. However, our strategy for improving
integration quality can be quite expensive. Therefore, we propose a pruning technique
that selects only informative records from a dataset. Our experiments show that
on most of the datasets, our pruner produces comparably fit clusters with more feedback