The project has the following folder tree:
├── coding_task_description.txt
├── README.md
└── project_files
    ├── find_fastq.py
    ├── frequent_sequences.py
    ├── gtf_annotate_tsv.py
    ├── annotated_coordinates.tsv (created after gtf_annotate_tsv is run or testfile is run)
    ├── testfile.py
    └── sample_files
        ├── optional_gtf
        │   └── gene_annotations.gtf
        ├── optional_tsv
        │   └── coordinates_to_annotate.tsv
        ├── fasta
        │   ├── sample1.fasta
        │   └── sample2.fasta
        └── fastq
            ├── otherreads
            │   └── (Arbitrarily nested reads: Sample_R3.fastq, Sample_R4.fastq, Sample_R5.fastq, Sample_R6.fastq)
            ├── read1
            │   └── Sample_R1.fastq
            └── read2
                └── Sample_R2.fastq
There are 5 main files/directories:
find_fastq.py : Problem 1
frequent_sequences.py : Problem 2
gtf_annotate_tsv.py : Optional Problem
testfile.py : Used to test all files at once
sample_files : Contains FASTQ, FASTA, GTF, and TSV files for testing purposes
Description: Please run the Python scripts from the directory in which they are stored (project_files), because the file paths are built automatically from the current working directory plus a relative path. Each Python file runs its own tests when executed as the main file, or all of the Python files can be tested at once by running testfile.py.
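The snippet below is a minimal sketch of why the working directory matters. It assumes the scripts join the current working directory with a relative path under sample_files; the helper name `sample_path` is hypothetical and is not taken from the project code.

```python
import os


def sample_path(*parts):
    """Build a path under sample_files relative to the current working directory.

    Hypothetical helper: the real scripts may build their paths differently.
    """
    return os.path.join(os.getcwd(), "sample_files", *parts)


# Resolves correctly only when the script is started from project_files.
fastq_dir = sample_path("fastq")
```

If a script is started from another directory, a path built this way points at a sample_files folder that does not exist there, which is why the scripts should be run from project_files.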
find_fastq.py finds all the FASTQ files in a directory, recursively searching subdirectories, and returns each FASTQ file name with the percent of sequences in that file that are longer than 30 nucleotides. frequent_sequences.py finds the ten most frequent DNA sequences in a FASTA file and returns each sequence along with its frequency. gtf_annotate_tsv.py uses a GTF annotation file to annotate a TSV file of chromosome and position coordinates: the matching gene is added in a new column and a new TSV file is written. The sample_files directory contains all the files used in the tests.
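The following is a rough sketch of how the first two tasks could be approached, not the code in find_fastq.py or frequent_sequences.py. It assumes FASTQ records are exactly four lines each (no wrapped sequences), that FASTQ files end in `.fastq`, and that "longer than 30" is a strict comparison; the function names are hypothetical.

```python
import os
from collections import Counter


def fastq_percent_over(path, min_len=30):
    """Return the percent of reads in one FASTQ file longer than min_len."""
    total = longer = 0
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # assumption: the sequence is line 2 of each 4-line record
                total += 1
                if len(line.strip()) > min_len:
                    longer += 1
    return 100.0 * longer / total if total else 0.0


def find_fastq(root, min_len=30):
    """Walk root recursively and map each FASTQ file path to its percent."""
    results = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".fastq"):
                full = os.path.join(dirpath, name)
                results[full] = fastq_percent_over(full, min_len)
    return results


def top_sequences(fasta_path, n=10):
    """Return the n most frequent sequences in a FASTA file with their counts."""
    counts = Counter()
    chunks = []  # pieces of the current record, which may span several lines
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if chunks:
                    counts["".join(chunks)] += 1
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:  # flush the final record
        counts["".join(chunks)] += 1
    return counts.most_common(n)
```

Under these assumptions, `find_fastq(os.path.join("sample_files", "fastq"))` would return one entry per FASTQ file found at any nesting depth, and `top_sequences` would return a list of `(sequence, count)` pairs ordered from most to least frequent.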