This repository contains data files and scripts to reproduce the analyses and results presented in our paper.
All relevant scripts listed below can be found within the protein/scripts folder.
- collect_coding_seqs2/run_ccs2_function.sh - BLAST wrapper to extract orthologous protein-coding sequences from genomes. If you want to run the BLAST search using our scripts, you will need to download the genomes listed in the table protein/data/TRNP1_source_genomes.csv
- processingL_final.R - process our own primate sequence assemblies from targeted re-sequencing
- collect_coding_seqs2/Ferret_transcr_assembly_steps.sh and collect_coding_seqs2/Ferret_cutContig_makeFa.R - process the re-sequenced ferret sequence
- collect_coding_seqs.R - gather the orthologous TRNP1 protein-coding sequences from all included sources, intersect them with the available trait data, and save sequences and traits for the downstream analyses
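The intersection step described for collect_coding_seqs.R can be illustrated with a minimal sketch. This is not the repository's R code: the function name, the dictionary layout, and the toy values are all hypothetical and only show the idea of keeping species that have both a sequence and trait data.

```python
def intersect_with_traits(sequences, traits):
    """Keep only the species present in both inputs.

    sequences: dict mapping species name -> coding sequence (hypothetical layout)
    traits:    dict mapping species name -> dict of trait values (hypothetical layout)
    """
    shared = sorted(set(sequences) & set(traits))
    kept_sequences = {species: sequences[species] for species in shared}
    kept_traits = {species: traits[species] for species in shared}
    return kept_sequences, kept_traits


# Toy inputs invented for illustration; they are not data from the paper.
sequences = {
    "species_A": "ATGGCC",
    "species_B": "ATGGCA",
    "species_C": "ATGGCG",
}
traits = {
    "species_A": {"trait_value": 1.0},
    "species_B": {"trait_value": 2.0},
}

seqs_kept, traits_kept = intersect_with_traits(sequences, traits)
print(sorted(seqs_kept))
```

Here species_C is dropped because it has a sequence but no trait entry, so the downstream analyses see matching sets of species.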
Multiple Alignments with PRANK (v150803)
align_with_prank.sh - protein-coding sequence alignment
PAML (v4.8)
First, run the PAML site models as described in the README in the PAML folder.
select_sign_sites_PAML_M8.R - pull out the identified sites under positive selection
COEVOL (v1.4)
- run_coevol.sh - wrapper to run Coevol; finish_coevol.sh - wrapper to stop Coevol and generate summaries
- summarize_coevol_output_TRNP1.R - access the estimated omega, correlations and posterior probabilities
Scripts for the evolutionary analysis of control proteins can be found in the separate folder other_protein_alignments, which has its own README.
proliferation_analysis.R - gather proliferation assay data, estimate proliferation rates using logistic regression, infer association with brain size and GI using PGLS
All relevant scripts listed below can be found within the regulation/scripts folder.
- MPRA/MPRA_sequences.R - identify and collect orthologous TRNP1 CRE sequences across mammals from our sequenced data as well as published genomes
- MPRA/MPRA_oligolib_construction.R - MPRA design: using a sliding window, construct enhancer tiles from the orthologous CRE sequences collected by the previous script, to be tested in the MPRA assay
- MPRA/preprocessing - MPRA count pre-processing. Extract reporter gene expression counts for each included enhancer tile. This folder contains a README with further details
- MPRA/collect_MPRA_fastas.R - separate and save the relevant sequences from each of the 7 CRE regions, align using MAFFT (v7.407)
- MPRA/MPRA_analysis.R - filter and summarize CRE activities. Plug into PGLS and compare to brain mass and gyrification
- MPRA/combine_dnds_intron.R - combine TRNP1 protein evolution rates inferred using Coevol with the intron activity across catarrhines within the same model
- TFs/download_motifs_JASPAR2020.R - download PWMs and motif clustering from JASPAR 2020, transform PWMs for Cluster-Buster
- TFs/MPRNAseq_NPC.yaml - zUMIs (v2.5.4) yaml file for mapping RNA-seq reads from NPCs. Input raw data for this processing can be accessed under E-MTAB-9951
- TFs/TF_expression_analysis.R - find the expressed transcription factors in our NPCs (from bulk RNA-seq data). Run Cluster-Buster (Jun 13 2019) on the intron sequences including only the PWMs of the expressed TFs to identify overrepresented motifs
- TFs/PGLS_motifs.R - investigate the association of binding scores with intron CRE activity and GI among the 22 most abundant motifs on the intron sequence using PGLS
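The sliding-window tile construction mentioned for MPRA/MPRA_oligolib_construction.R can be sketched as follows. This is an illustration only, not the repository's R implementation: the function name, the tile length, the step size, and the choice to add a final tile flush with the sequence end are all assumptions, since the actual parameters are not stated here.

```python
def sliding_window_tiles(sequence, tile_length, step):
    """Cut a sequence into overlapping tiles of fixed length.

    Returns a list of (start, tile) pairs with 0-based start positions.
    tile_length and step are hypothetical parameters chosen by the caller.
    """
    if tile_length <= 0 or step <= 0:
        raise ValueError("tile_length and step must be positive")
    if len(sequence) <= tile_length:
        # A short sequence is returned as a single tile.
        return [(0, sequence)]

    last_start = len(sequence) - tile_length
    starts = list(range(0, last_start + 1, step))
    # Assumption: add one extra tile so that the sequence end is always covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(start, sequence[start:start + tile_length]) for start in starts]


# Toy sequence of 10 characters, tiles of length 4, step of 3.
tiles = sliding_window_tiles("ABCDEFGHIJ", tile_length=4, step=3)
for start, tile in tiles:
    print(start, tile)
```

A step smaller than the tile length gives overlapping tiles, so every position of a CRE sequence is covered by at least one tile.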
Tree construction: regulation/scripts/MPRA/tree_construction.R
Throughout the workflow, we use the job scheduling system Slurm (v0.4.3).
Primer sequences for the resequencing of putative Trnp1 cis-regulatory elements as well as for the MPRA can be found in oligo_sequences/. For more information on the different tables, see the README at oligo_sequences/README.