Thesis: Scalable Entity Resolution in RDF Data, performed at the University of Bonn

Entity resolution is the task of identifying entities in different datasets that refer to the same real-world object.

The naive approach to this task compares every entity with every other entity, which implies O(N^2) complexity. Existing approaches to entity disambiguation in RDF data try to improve efficiency with a block-building strategy, in which only entities within the same block are compared. This reduces the complexity to O(m^2 * |B|), where m is the number of entities in a block and |B| is the number of blocks. After the block-building stage, the blocks are processed using a learning-based or non-learning-based approach. Further problems arise from the heterogeneity of the data, which comes from a variety of sources and keeps growing in volume. The data is also noisy: it contains inconsistencies and missing values.
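
To make the complexity argument concrete, the following minimal sketch counts pairwise comparisons with and without blocking. It is an illustration only, written in Python for brevity rather than in the Scala used by the thesis. The entity count, the block size, and the function names `naive_pair_count` and `blocked_pair_count` are assumptions chosen for the example.

```python
def naive_pair_count(n: int) -> int:
    """Unordered pairs among n entities: n * (n - 1) / 2, which grows as O(n^2)."""
    return n * (n - 1) // 2


def blocked_pair_count(blocks: list[list[str]]) -> int:
    """Pairs compared when only entities inside the same block are compared."""
    return sum(naive_pair_count(len(block)) for block in blocks)


# Hypothetical data: 1000 entities split evenly into 10 blocks of 100.
entities = [f"e{i}" for i in range(1000)]
blocks = [entities[i:i + 100] for i in range(0, len(entities), 100)]

naive = naive_pair_count(len(entities))
blocked = blocked_pair_count(blocks)
print(naive, blocked)
```

With these assumed sizes, blocking cuts the number of comparisons by a factor of about ten, at the risk of missing matches whose entities land in different blocks.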

Our thesis work focuses on:

Improving the efficiency and effectiveness of existing approaches by introducing scalability.
Dealing with the heterogeneity of the data and with missing values.
Avoiding the repeated iterations and labelling effort that learning-based approaches need to obtain sufficient training data. We want to introduce an approach that can process entities directly in one go.
Removing the block-building stage completely.

This thesis implements two approaches for performing entity resolution effectively, efficiently and at scale. We propose using Locality-Sensitive Hashing (LSH) to detect similar entities in semantic web data, in the following way.
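
As a rough illustration of how LSH finds candidate matches without comparing all pairs, the sketch below implements a small MinHash signature with banding in plain Python. It is not the thesis code, which runs on Spark and Scala. The function names, the MD5-based hash family, the parameters `num_hashes` and `bands`, and the sample token sets are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token, parameterised by a seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: set[str], num_hashes: int = 32) -> list[int]:
    """One minimum per seeded hash function; similar sets share many minima."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_candidate_pairs(token_sets: dict[str, set[str]],
                        num_hashes: int = 32,
                        bands: int = 16) -> set[tuple[str, str]]:
    """Entities whose signatures agree on at least one band become candidates."""
    rows = num_hashes // bands
    buckets: dict[tuple, set[str]] = defaultdict(set)
    for entity_id, tokens in token_sets.items():
        if not tokens:  # an entity without tokens cannot be hashed
            continue
        signature = minhash_signature(tokens, num_hashes)
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(entity_id)
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical token sets, e.g. built from a title attribute.
token_sets = {
    "dblp:1": {"scalable", "entity", "resolution"},
    "acm:9": {"scalable", "entity", "resolution"},
    "acm:7": {"query", "optimization", "survey"},
}
candidates = lsh_candidate_pairs(token_sets)
print(candidates)
```

More bands with fewer rows per band make the filter more permissive; fewer bands with more rows make it stricter. The values used here are arbitrary.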

Approach-1:- Utilise all available knowledge. This does not work because the data is heterogeneous and contains missing values and inconsistencies.

Approach-2:- Select only 1 or 2 attributes. Datasets:- DBLP-ACM, DBLP-Scholar, Abt-Buy (available in the resource folder). These are benchmark datasets. We compare the accuracy and validity of our approach with state-of-the-art results. (https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution)

Approach-3:- Apply LSH to the subjects in the RDF data -> For the matching entities only, compare their predicates using a Jaccard similarity threshold and find the intersecting predicates -> For the entity matches found, compare the objects of only the intersecting predicates to retrieve the true matches. Datasets:- DBpedia medium (Infobox 3.0rc and Infobox 3.4), DBpedia large (Dbpedia 3.0rc and Infobox 3.4) (http://downloads.dbpedia.org/)
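
The two filtering steps that follow the LSH stage in Approach-3 can be sketched as below. This is a simplified illustration in Python, not the Spark and Scala implementation. It assumes each entity is a mapping from predicate to a single literal object, and the function names, the token-based object comparison, and both threshold values are hypothetical choices for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def verify_candidate(e1: dict[str, str],
                     e2: dict[str, str],
                     predicate_threshold: float = 0.5,
                     object_threshold: float = 0.5) -> bool:
    """Check one candidate pair: predicates first, then objects of shared predicates."""
    p1, p2 = set(e1), set(e2)
    # Step 1: the predicate sets must be similar enough.
    if jaccard(p1, p2) < predicate_threshold:
        return False
    # Step 2: compare objects only for the intersecting predicates.
    shared = p1 & p2
    tokens1 = {tok for p in shared for tok in e1[p].lower().split()}
    tokens2 = {tok for p in shared for tok in e2[p].lower().split()}
    return jaccard(tokens1, tokens2) >= object_threshold


# Hypothetical entities, keyed by predicate.
bonn_a = {"name": "Bonn", "country": "Germany", "population": "330000"}
bonn_b = {"name": "Bonn", "country": "Germany", "areaCode": "0228"}
paris = {"name": "Paris", "country": "France", "areaCode": "01"}
paper = {"title": "Scalable Entity Resolution"}

print(verify_candidate(bonn_a, bonn_b))  # same predicates overlap, same objects
print(verify_candidate(bonn_b, paris))   # same predicates, different objects
print(verify_candidate(bonn_a, paper))   # no shared predicates
```

Restricting the object comparison to shared predicates means that a predicate missing from one entity does not count against the pair.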

The implementation is built on top of SANSA-Stack, using Apache Spark and Scala. We use HDFS to store large datasets.

