Data Wrangling

Duplicate Identification in restaurants.tsv data set

About

The Python script reads the provided .tsv file, identifies and filters out duplicate records, compares its results to the gold standard, and saves the cleaned data set to a new .tsv file.
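The workflow described above can be sketched roughly as follows. This is only an illustration, not the script's actual implementation: the sample records, the column names (`id`, `name`, `address`, `city`), the matching rule (identical normalized address and city), and all function names are assumptions made for the example. The gold standard is assumed to be a set of duplicate id pairs.

```python
import csv
import io
import re
from itertools import combinations

# Invented sample standing in for restaurants.tsv (column names are assumptions).
SAMPLE_TSV = (
    "id\tname\taddress\tcity\n"
    "1\tBlue Door Cafe\t12 Main St.\tSpringfield\n"
    "2\tBlue Door Café\t12 main st\tSpringfield\n"
    "3\tCafe Luna\t5 Oak Avenue\tRiverton\n"
    "4\tCafe Luna\t5 Oak Ave.\tRiverton\n"
)

# Invented gold standard: pairs of ids that refer to the same restaurant.
GOLD_PAIRS = {(1, 2), (3, 4)}


def normalize(text):
    """Lowercase the text and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_duplicate_pairs(rows):
    """Group rows by normalized (address, city) and pair up ids in each group."""
    groups = {}
    for row in rows:
        key = (normalize(row["address"]), normalize(row["city"]))
        groups.setdefault(key, []).append(int(row["id"]))
    pairs = set()
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def evaluate(found, gold):
    """Return (precision, recall) of the found pairs against the gold pairs."""
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def drop_duplicates(rows, pairs):
    """Keep the lower id of each detected pair and drop the higher one."""
    to_drop = {higher for _, higher in pairs}
    return [row for row in rows if int(row["id"]) not in to_drop]


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
found = find_duplicate_pairs(rows)
precision, recall = evaluate(found, GOLD_PAIRS)
cleaned = drop_duplicates(rows, found)

# Write the cleaned rows as TSV (to memory here; a real run would write a file).
output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=rows[0].keys(), delimiter="\t")
writer.writeheader()
writer.writerows(cleaned)

print(f"precision={precision:.2f} recall={recall:.2f}")
print(output.getvalue())
```

The exact-match rule deliberately misses the pair (3, 4), because "Avenue" and "Ave." normalize differently. That is the kind of gap a comparison against the gold standard makes visible, and a fuzzier similarity measure would be needed to close it.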

Usage

To run the script, place the "restaurants.tsv" and "restaurants_DPL.tsv" files in the same directory as the script.

restaurants.tsv: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/Restaurants/mdedup/restaurants.tsv

restaurants_DPL.tsv: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/Restaurants/restaurants_DPL.tsv
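A small pre-flight check for the two input files might look like the sketch below. The helper names `missing_inputs` and `download_missing` are hypothetical and not part of the script. The URLs are the ones listed above, and downloading relies on the standard-library `urllib.request.urlretrieve`.

```python
import os
import urllib.request

# Required input files and the download locations given in this README.
SOURCES = {
    "restaurants.tsv": (
        "https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/"
        "repeatability/Restaurants/mdedup/restaurants.tsv"
    ),
    "restaurants_DPL.tsv": (
        "https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/"
        "repeatability/Restaurants/restaurants_DPL.tsv"
    ),
}


def missing_inputs(directory="."):
    """Return the names of required input files not found in `directory`."""
    return [
        name for name in SOURCES
        if not os.path.isfile(os.path.join(directory, name))
    ]


def download_missing(directory="."):
    """Fetch every missing input file into `directory` (needs network access)."""
    for name in missing_inputs(directory):
        urllib.request.urlretrieve(SOURCES[name], os.path.join(directory, name))
```

Calling `missing_inputs()` before starting the script reports which files still need to be fetched. `download_missing()` is optional and only runs when called explicitly.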