This repository contains replication data and scripts for "Searching for Elected Officials: Google’s Prioritization of Political Information."
- /scripts: for scripts to load the data and produce the final figures and metrics
- /data: for data
We provide all scripts required to reproduce our analyses. Because the raw data files are large, we publicly provide only the cleaned and aggregated data (produced in Steps 1-4 of the pipeline described below) needed to reproduce the analyses, figures, and tables in Step 5. The specific data we provide are:
- qry_info_house.csv: Search query level dataset with information about all members of the 116th House of Representatives, including party, the districts they represent, relevance score, and whether they are included in the data.
- qry_domain_rank_joined.csv: Number of results at each domain-query-rank level. All domain level classifications (six main types, politician-controlled, partisan audience scores, local or national news, unreliable or reliable news) are joined here.
- qry_domain_rank_si.csv: Number of results at each domain-query-rank level using an alternative search result parsing method explored in the supplementary information of the paper.
- rbo_result_r.parquet: All sampled location pairs with calculated rank-biased overlap, Jaccard index, number of URLs in each search result, and set differences. The columns with the original lists of URLs are removed to reduce file sizes.
- gtrends_final_20200109_20201231.csv: Google Trends data (relative search volume) for all members included in our analysis.
- qry_day_url.parquet: Number of results at the query-day-URL level.
Step 1:
- Get all unique URLs (results) and some metadata for each search query. These query-URL pairs are sampled for handcoding.
  File: get_unique_results.py
- Create the final search query level dataset. Add a column with the relevance score and a column (house_analysis_include) that dummy codes whether the member is included as a search query. Members who represent the 5 territories (American Samoa, Guam, Puerto Rico, Northern Mariana Islands, and the Virgin Islands), have a relevance score less than 3, or were no longer in office by 9/1/2020 are excluded.
  File: create_final_qry_info.r
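
The exclusion rule for house_analysis_include can be sketched as follows. This is only an illustration in Python (the repository's own script is create_final_qry_info.r); the argument names `state`, `relevance_score`, and `left_office` are hypothetical, and treating a departure on 9/1/2020 itself as "no longer in office" is an assumption.

```python
from datetime import date

# Territories named in the exclusion rule above.
TERRITORIES = {
    "American Samoa",
    "Guam",
    "Puerto Rico",
    "Northern Mariana Islands",
    "Virgin Islands",
}
CUTOFF = date(2020, 9, 1)


def house_analysis_include(state, relevance_score, left_office=None):
    """Return 1 if the member is kept as a search query, else 0.

    All argument names are hypothetical; left_office is None for members
    who were still in office.
    """
    if state in TERRITORIES:
        return 0
    if relevance_score < 3:
        return 0
    # Assumption: leaving on the cutoff date counts as no longer in office.
    if left_office is not None and left_office <= CUTOFF:
        return 0
    return 1
```

A member from a state with a relevance score of 3 or more who was still in office is coded 1; any single failing condition codes the member 0.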
Step 2:
- Remove NA URLs and excluded component types.
  File: drop_na_add_domain folder containing Snakefile
- Group cleaned data from drop_na_add_domain into different levels and save. This makes subsequent analyses more efficient.
  File: generate_analysis_summary folder containing Snakefile
- Repeat with an alternative parsing method that counts grouped links as one result (explored in SI section B).
  File: si_preprocess folder containing sub-level drop_na_add_domain_si and generate_analysis_summary_si folders

Note: call snakemake from within each folder to run the Snakefiles.
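
As a rough illustration of the cleaning and grouping described above, the sketch below drops rows without a URL, derives a domain, and counts results at the query-domain-rank level. It is not the repository's Snakemake workflow: the row keys `qry`, `url`, `rank`, and `type` are hypothetical, and taking the host name minus a leading "www." as the domain is a simplifying assumption.

```python
from collections import Counter
from urllib.parse import urlparse


def add_domain(url):
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def count_by_query_domain_rank(rows, excluded_types=frozenset()):
    """Count results per (query, domain, rank), skipping unusable rows.

    rows is an iterable of dicts with hypothetical keys
    'qry', 'url', 'rank', and optionally 'type'.
    """
    counts = Counter()
    for row in rows:
        url = row.get("url")
        if not url:  # drop NA or empty URLs
            continue
        if row.get("type") in excluded_types:  # drop excluded component types
            continue
        counts[(row["qry"], add_domain(url), row["rank"])] += 1
    return counts
```

Storing counts at this aggregated level rather than one row per result is what keeps later analyses small enough to run quickly.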
Step 3:
- Categorize domains into six main types (Campaign/Personal, Education, Government, News, Other Third Party, or Social Media) by matching to external data or using keyword classification.
  File: label_domains.r
- Pull all unique search query-URL pairs for all social media domains and house.gov for handcoding as politician-controlled.
  File: prepare_poli_control_code.r
- Merge all domain level classifications (six types, politician-controlled, partisan audience score, local or national news, reliable or unreliable news) into one dataset at the search query-domain-rank level. Create the coverage rates table (SI Table S3).
  File: post_process_domains.r
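
The two-stage labeling idea (match against external data first, then fall back to keyword rules) might look like the sketch below. Everything specific here is an assumption made for illustration: the `SOCIAL_MEDIA` set, the suffix rules, and the `external_labels` mapping are hypothetical and are not taken from label_domains.r.

```python
# Illustrative list only; the actual list of social media domains is not given here.
SOCIAL_MEDIA = {"twitter.com", "facebook.com", "youtube.com", "instagram.com"}


def label_domain(domain, external_labels=None):
    """Assign one of the six main types to a domain.

    external_labels is a hypothetical dict mapping domains to types,
    standing in for the external data used for matching.
    """
    external_labels = external_labels or {}
    domain = domain.lower()
    # Stage 1: an exact match in the external data takes precedence.
    if domain in external_labels:
        return external_labels[domain]
    # Stage 2: simple keyword and suffix rules (assumed for illustration).
    if domain in SOCIAL_MEDIA:
        return "Social Media"
    if domain.endswith(".gov"):
        return "Government"
    if domain.endswith(".edu"):
        return "Education"
    return "Other Third Party"
```

For example, `label_domain("nytimes.com", {"nytimes.com": "News"})` returns "News" through the external match, while `label_domain("house.gov")` falls through to the suffix rule and returns "Government".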
Step 4:
- Sample location pairs.
  File: get_rbo_sim_sample.r
- Pull URLs from data generated in Step 2 and compute rank-biased overlap for each pair.
  File: get_rbo_sim.py
- Calculate the Jaccard index and additional metadata for each pair.
  File: postprocess_rbo_sim.py
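
For readers unfamiliar with the two similarity measures, the sketch below computes them for one pair of ranked URL lists. It is not the code in get_rbo_sim.py or postprocess_rbo_sim.py: it implements the plain truncated rank-biased overlap sum, (1 - p) * sum over depths d of p^(d-1) * A_d, where A_d is the share of overlapping URLs in the top d of both lists. The persistence value p = 0.9, the absence of extrapolation, and the assumption that each list contains no duplicate URLs are all choices made for this example.

```python
def jaccard_index(urls_a, urls_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(urls_a), set(urls_b)
    union = set_a | set_b
    if not union:
        return 1.0  # assumption: two empty result lists count as identical
    return len(set_a & set_b) / len(union)


def rbo_truncated(urls_a, urls_b, p=0.9):
    """Truncated rank-biased overlap of two ranked lists.

    Agreement at depth d is the overlap of the two top-d prefixes divided
    by d; deeper ranks are down-weighted by p ** (d - 1).
    """
    depth = max(len(urls_a), len(urls_b))
    seen_a, seen_b = set(), set()
    total = 0.0
    for d in range(1, depth + 1):
        if d <= len(urls_a):
            seen_a.add(urls_a[d - 1])
        if d <= len(urls_b):
            seen_b.add(urls_b[d - 1])
        agreement = len(seen_a & seen_b) / d
        total += p ** (d - 1) * agreement
    return (1 - p) * total
```

Because the sum is truncated, two identical lists of length k score 1 - p^k rather than 1. The Jaccard index ignores order entirely, which is why the two measures are reported side by side.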
Step 5:
- Figure 2 and SI Table S2 (Concentration, Composition, and Ranking)
  File: plots_fig2.r
- Figure 3 (Distribution of search results by politician control, local or national news, or unreliable or reliable news)
  File: plots_fig3.r
- Figure 4 (Partisan differences in search results)
  File: plots_fig4.r
- Figure 5 and SI Figure S6 (Variation by location)
  File: plots_rbo.r
- SI Figure S1 and calculations for SI Section B (Comparison to Alternative Methods for Parsing Search Results)
  File: plots_si_cmpt_rank.r
- SI Figures S2 to S5, SI Tables S4 to S6 (Variation over time and by search query)
  File: plots_si.r