Replication file for "Searching for Elected Officials: Google’s Prioritization of Political Information"

Introduction

This repository contains replication data and code for "Searching for Elected Officials: Google’s Prioritization of Political Information."

Directory Structure

  • /scripts: scripts to load the data and produce the final figures, tables, and metrics
  • /data: the cleaned and aggregated data files described below

Data

We provide all scripts required to reproduce our analyses. Because the raw data files are large, we publicly provide only the cleaned and aggregated data (produced in Steps 1-4 of the pipeline described below) needed to reproduce the analyses, figures, and tables in Step 5. The specific files we provide are listed below, followed by a short loading sketch:

  • qry_info_house.csv Search-query-level dataset with information about all members of the 116th House of Representatives, including party, the district each represents, relevance score, and whether the member is included in the data.
  • qry_domain_rank_joined.csv Number of results at each domain-query-rank level. All domain-level classifications (the six main types, politician control, partisan audience scores, local vs. national news, and reliable vs. unreliable news) are joined here.
  • qry_domain_rank_si.csv Number of results at each domain-query-rank level using an alternative search result parsing method explored in the supplementary information of the paper.
  • rbo_result_r.parquet All sampled location pairs with calculated rank-biased overlap, Jaccard index, number of URLs in each search result, and set differences. The columns with the original lists of URLs are removed to reduce file size.
  • gtrends_final_20200109_20201231.csv Google Trends data (relative search volume) of all members included in our analysis.
  • qry_day_url.parquet Number of results at the query-day-URL level.
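
A minimal Python sketch of loading these files (assuming pandas plus a parquet engine such as pyarrow are installed, and that the files sit in /data as described above):

```python
import pandas as pd

# Query-level information about House members (party, district,
# relevance score, inclusion flag).
qry_info = pd.read_csv("data/qry_info_house.csv")

# Result counts at the domain-query-rank level, with all
# domain-level classifications joined in.
qry_domain_rank = pd.read_csv("data/qry_domain_rank_joined.csv")

# Location-pair comparisons (rank-biased overlap, Jaccard index,
# URL counts, set differences). Reading parquet requires an engine
# such as pyarrow or fastparquet.
rbo_pairs = pd.read_parquet("data/rbo_result_r.parquet")
```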

Data Processing Pipeline

1. Code relevance score and create final query-level dataset

  • Get all unique URLs (results) and some metadata for each search query. These query-URL pairs are sampled for hand-coding.
    File: get_unique_results.py

  • Create the final search-query-level dataset. Add a column with the relevance score, and a column (house_analysis_include) that dummy-codes whether the member is included as a search query. Members who represent one of the five territories (American Samoa, Guam, Puerto Rico, the Northern Mariana Islands, and the Virgin Islands), have a relevance score below 3, or were no longer in office by 9/1/2020 are excluded (see the sketch after this list).
    File: create_final_qry_info.r
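
A minimal pandas sketch of this inclusion rule (the repository implements it in R in create_final_qry_info.r; the column names below are hypothetical, not the dataset's actual names):

```python
import pandas as pd

TERRITORIES = {"American Samoa", "Guam", "Puerto Rico",
               "Northern Mariana Islands", "Virgin Islands"}

members = pd.read_csv("members.csv")  # hypothetical input file

# Dummy-code inclusion: exclude territorial delegates, members with a
# relevance score below 3, and members out of office by 9/1/2020.
members["house_analysis_include"] = (
    ~members["state"].isin(TERRITORIES)        # hypothetical column
    & (members["relevance_score"] >= 3)        # hypothetical column
    & members["in_office_2020_09_01"]          # hypothetical boolean column
).astype(int)
```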

2. Parsing and Summarizing Search Results

  • Remove NA URLs and excluded component types.
    File: drop_na_add_domain folder containing Snakefile

  • Group the cleaned data from drop_na_add_domain into different levels of aggregation and save the output; this makes subsequent analyses more efficient (see the sketch after this list).
    File: generate_analysis_summary folder containing Snakefile

  • Repeat with an alternative parsing method that counts grouped links as one result (explored in SI section B).
    File: si_preprocess folder containing sub-level drop_na_add_domain_si and generate_analysis_summary_si folders
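
An illustrative pandas sketch of the grouping step (the actual work is done by the Snakemake workflows above; the file and column names here are hypothetical):

```python
import pandas as pd

# Cleaned result-level data produced by drop_na_add_domain
# (hypothetical file and column names).
results = pd.read_parquet("cleaned_results.parquet")

# Aggregate to the domain-query-rank level, as in
# qry_domain_rank_joined.csv.
qry_domain_rank = (
    results.groupby(["qry", "domain", "rank"])
           .size()
           .reset_index(name="n_results")
)

# Aggregate to the query-day-URL level, as in qry_day_url.parquet.
qry_day_url = (
    results.groupby(["qry", "day", "url"])
           .size()
           .reset_index(name="n_results")
)
```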

Note: to run the Snakefiles, call snakemake from within each folder.
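For example, cd into drop_na_add_domain and run `snakemake --cores 1` (the core count is an arbitrary choice here; recent Snakemake releases require --cores to be set explicitly).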

3. Domain-level classification

  • Categorize domains into six main types (Campaign/Personal, Education, Government, News, Other Third Party, or Social Media) by matching to external data or using keyword classification (see the sketch after this list).
    File: label_domains.r

  • Pull all unique search query-URL pairs for all social media domains and house.gov for hand-coding as politician-controlled.
    File: prepare_poli_control_code.r

  • Merge all domain-level classifications (the six main types, politician control, partisan audience scores, local vs. national news, and reliable vs. unreliable news) into one dataset at the search query-domain-rank level. Create the coverage rates table (SI Table S3).
    File: post_process_domains.r
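
A minimal sketch of the keyword-classification idea from the first bullet above (label_domains.r implements the real rules, which also match against external data; the keyword lists here are illustrative assumptions):

```python
# Illustrative keyword rules only; the actual classifier also uses
# external data sources (e.g., lists of campaign and news sites).
KEYWORD_RULES = [
    ("Government", (".gov",)),
    ("Education", (".edu",)),
    ("Social Media", ("facebook.com", "twitter.com", "youtube.com")),
]

def classify_domain(domain: str) -> str:
    """Assign a domain to one of the six main types via keywords."""
    domain = domain.lower()
    for label, keywords in KEYWORD_RULES:
        if any(kw in domain for kw in keywords):
            return label
    return "Other Third Party"  # fallback when no keyword matches

print(classify_domain("house.gov"))     # Government
print(classify_domain("twitter.com"))   # Social Media
```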

4. Analysis for variation by search result location

  • Sample location pairs.
    File: get_rbo_sim_sample.r

  • Pull URLs from the data generated in Step 2 and compute rank-biased overlap for each pair (see the sketch after this list).
    File: get_rbo_sim.py

  • Calculate the Jaccard index and additional metadata for each pair.
    File: postprocess_rbo_sim.py
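
A self-contained sketch of the two similarity measures computed in this step (standard definitions; the truncation depth and the choice p=0.9 below are assumptions, not necessarily the paper's settings):

```python
def rbo(list_a, list_b, p=0.9):
    """Rank-biased overlap (Webber et al., 2010), truncated at the
    depth of the shorter list; p < 1 weights agreement toward the
    top of the rankings."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        score += (p ** (d - 1)) * agreement
    return (1 - p) * score

def jaccard(list_a, list_b):
    """Jaccard index: set overlap between two result lists, ignoring rank."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Two hypothetical result lists from different locations:
left = ["a.com", "b.com", "c.com"]
right = ["a.com", "c.com", "d.com"]
print(rbo(left, right))      # higher when top ranks agree
print(jaccard(left, right))  # 0.5
```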

5. Figures and Tables

  • Figure 2 and SI Table S2 (Concentration, Composition, and Ranking)
    File: plots_fig2.r

  • Figure 3 (Distribution of search results by politician control, local or national news, or unreliable or reliable news)
    File: plots_fig3.r

  • Figure 4 (Partisan differences in search results)
    File: plots_fig4.r

  • Figure 5 and SI Figure S6 (Variation by location)
    File: plots_rbo.r

  • SI Figure S1 and calculations for SI Section B (Comparison to Alternative Methods for Parsing Search Results)
    File: plots_si_cmpt_rank.r

  • SI Figures S2 to S5, SI Tables S4 to S6 (Variation over time and by search query)
    File: plots_si.r
