herrhamilton/Data_Wrangling

Data Wrangling

Duplicate Identification in restaurants.tsv data set

About

The Python script reads the provided .tsv file, identifies and filters out duplicate records, compares its results to the gold standard, and saves the cleaned data set to a new .tsv file.
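The steps described above can be illustrated with a minimal sketch. It is not the script from this repository: the column names (id, name, phone), the sample rows, the exact-match rule on normalized fields, and every function name are assumptions made for illustration only, and the real restaurants.tsv may be laid out differently.

```python
import csv
import io
from itertools import combinations

# Hypothetical sample data; the real restaurants.tsv may use other columns.
SAMPLE_TSV = (
    "id\tname\tphone\n"
    "1\tArt's Deli\t818-762-1221\n"
    "2\tArts Deli\t818/762-1221\n"
    "3\tCafe Bizou\t818-788-3536\n"
)


def load_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def normalize(value):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def find_duplicate_pairs(rows, key_fields=("name", "phone")):
    """Return the set of (id_a, id_b) pairs whose normalized key fields match."""
    groups = {}
    for row in rows:
        key = tuple(normalize(row[field]) for field in key_fields)
        groups.setdefault(key, []).append(row["id"])
    pairs = set()
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def drop_duplicates(rows, pairs):
    """Keep one record per duplicate group by dropping the second id of each pair."""
    to_drop = {second for _, second in pairs}
    return [row for row in rows if row["id"] not in to_drop]


def precision_recall(found, gold):
    """Compare detected pairs with gold-standard pairs."""
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


rows = load_rows(SAMPLE_TSV)
found = find_duplicate_pairs(rows)
cleaned = drop_duplicates(rows, found)
print(found, len(cleaned))
```

Exact matching on normalized fields is the simplest possible rule; a real run would likely need a fuzzier similarity measure, and the gold-standard pairs would be read from a file rather than written by hand.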

Usage

To run the script, the "restaurants.tsv" and "restaurants_DPL.tsv" files must both be present in the same directory as the script. They can be downloaded from:

restaurants.tsv: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/Restaurants/mdedup/restaurants.tsv

restaurants_DPL.tsv: https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/repeatability/Restaurants/restaurants_DPL.tsv
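Before running the script, it may help to confirm that both input files are in place. The following is a sketch only: the helper names missing_inputs and fetch_missing are hypothetical and not part of this repository, and the download step assumes the URLs listed above are still reachable.

```python
from pathlib import Path
from urllib.request import urlretrieve

# File names and URLs as listed in this README.
SOURCES = {
    "restaurants.tsv": (
        "https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/"
        "repeatability/Restaurants/mdedup/restaurants.tsv"
    ),
    "restaurants_DPL.tsv": (
        "https://hpi.de/fileadmin/user_upload/fachgebiete/naumann/projekte/"
        "repeatability/Restaurants/restaurants_DPL.tsv"
    ),
}


def missing_inputs(directory="."):
    """Return the names of required input files not found in the directory."""
    base = Path(directory)
    return [name for name in SOURCES if not (base / name).is_file()]


def fetch_missing(directory="."):
    """Download any required file that is not yet present (needs network access)."""
    base = Path(directory)
    for name in missing_inputs(directory):
        urlretrieve(SOURCES[name], str(base / name))
```

Calling missing_inputs() from the script's directory returns an empty list when both files are present; fetch_missing() is only needed if that list is not empty.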

About

Final project of a course based on "Data Wrangling with MongoDB"
