diff --git a/project1.ipynb b/project1.ipynb index 8b9ec909..d8637f69 100644 --- a/project1.ipynb +++ b/project1.ipynb @@ -2,6 +2,7 @@ "cells": [ { "cell_type": "markdown", + "id": "208473e7", "metadata": {}, "source": [ "# Project 1: Assembling Genes" @@ -9,6 +10,7 @@ }, { "cell_type": "markdown", + "id": "e849dc7d", "metadata": {}, "source": [ "
\n", @@ -29,15 +31,16 @@ }, { "cell_type": "markdown", + "id": "ece892a4", "metadata": {}, "source": [ "**Team submitting this assignment:** \n", "
\n", " list each member of your team here, including both your name and UVA computing id\n", "\n", - "Team Members (Names): \n", + "Team Members (Names): Meesha Vullikanti, Emily Kao \n", "\n", - "Team Member UVA Computing IDs:\n", + "Team Member UVA Computing IDs: rv6cun, eck3pxj\n", "\n", "
\n", "\n", @@ -52,6 +55,7 @@ }, { "cell_type": "markdown", + "id": "3c621f1d", "metadata": {}, "source": [ "In this project, we will explore genome assembly—the process of determining the order of nucleotides in DNA from fragmented reads. As you might have studied in the reading assignments, genome assembly can get quite complicated, as problems such as full sequence coverage, finding a good length for reads (the $k$ in $k$-mer), and sequencing errors present challenges for sequencing analysis and accuracy. You can assume perfect coverage for all parts of the assignment and no read errors for the first two questions.\n", @@ -62,6 +66,7 @@ }, { "cell_type": "markdown", + "id": "89e80f32", "metadata": {}, "source": [ "## Install basic required packages." @@ -69,6 +74,7 @@ }, { "cell_type": "markdown", + "id": "5bf1b0de", "metadata": {}, "source": [ "- Install basic required packages, should be run only once. You may need to restart the kernel after this stage.\n", @@ -81,6 +87,7 @@ { "cell_type": "code", "execution_count": null, + "id": "a4c221cd", "metadata": {}, "outputs": [], "source": [ @@ -89,7 +96,8 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, + "id": "4de4b69f", "metadata": {}, "outputs": [], "source": [ @@ -99,6 +107,7 @@ }, { "cell_type": "markdown", + "id": "e81f8e3f", "metadata": {}, "source": [ "## Genome Assembly\n", @@ -110,7 +119,8 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 2, + "id": "eb730eb7", "metadata": {}, "outputs": [], "source": [ @@ -123,6 +133,7 @@ }, { "cell_type": "markdown", + "id": "0578bc2b", "metadata": {}, "source": [ "#### Question 1.1.1 GC-content\n", @@ -138,19 +149,42 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "execution_count": 38, + "id": "0f09bbe7", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "GC percent: 0.47952218430034127\n", + "temperature: 
84.34709897610922\n" + ] + } + ], "source": [ "# Read sequence reads (error-free) from file\n", "sequence_reads, qualities = utils.read_fastq('TeleTubby.fastq')\n", "\n", "# Calculate %GC content\n", - "# Print out temperature in Celsius" + "gc = 0\n", + "total = len(sequence_reads[0]) * len(sequence_reads)\n", + "for x in range(len(sequence_reads)):\n", + " for y in sequence_reads[x]:\n", + " if(y == 'G'):\n", + " gc = gc + 1\n", + " if(y == 'C'):\n", + " gc = gc + 1\n", + "gc_percent = gc/total\n", + "print(\"GC percent: \" , gc_percent)\n", + "# Print out temperature in Celsius\n", + "temp = 64.9 + (0.41*gc_percent*100) - (500/total)\n", + "print(\"temperature: \", temp)" ] }, { "cell_type": "markdown", + "id": "7266da16", "metadata": {}, "source": [ "#### Question 1.1.2 Interpreting quality scores" @@ -158,6 +192,7 @@ }, { "cell_type": "markdown", + "id": "91c69f2f", "metadata": {}, "source": [ "Phred33 quality scores are represented as the character with an ASCII code equal to its value + 33 (to make them easy to print alongside genome sequences). List the top 5 most frequent scores in ASCII symbol as well as their Phred33 scores in TeleTubby.fastq.
You can refer to the [official Illumina website](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) to reference the scoring encoding.\n", @@ -168,6 +203,7 @@ { "cell_type": "code", "execution_count": null, + "id": "5a1b6607", "metadata": {}, "outputs": [], "source": [ @@ -176,6 +212,7 @@ }, { "cell_type": "markdown", + "id": "f4a7bb06", "metadata": {}, "source": [ "#### Question 1.1.3 Frequency analysis\n", @@ -187,15 +224,35 @@ }, { "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Find and print out the three most repeated k-mers and their frequencies" + "execution_count": 48, + "id": "fe2b67e2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Most frequent k-mers and their frequencies: \n", + "[('GCTATCGC', 3), ('CGCTATCG', 2), ('TATCGCAA', 2)]\n" + ] + } + ], + "source": [ + "# https://www.geeksforgeeks.org/python-find-most-frequent-element-in-a-list/\n", + "# Find and print out the three most repeated k-mers and their frequencies\n", + "from collections import Counter\n", + " \n", + "def most_frequent(List):\n", + " occurence_count = Counter(List)\n", + " return occurence_count.most_common(3)\n", + " \n", + "print(\"Most frequent k-mers and their frequencies: \")\n", + "print(most_frequent(sequence_reads))\n" ] }, { "cell_type": "markdown", + "id": "7b35151a", "metadata": {}, "source": [ "### Question 1.2. Greedy approach" @@ -203,6 +260,7 @@ }, { "cell_type": "markdown", + "id": "d08e0e97", "metadata": {}, "source": [ "One of the approaches to assemble the genome from the given reads is a greedy algorithm. Have a look at the greedy algorithm described on [Wikipedia](https://en.wikipedia.org/wiki/Sequence_assembly#Greedy_algorithm) and answer the following." 
@@ -210,6 +268,7 @@ }, { "cell_type": "markdown", + "id": "07b85ecb", "metadata": {}, "source": [ "#### Question 1.2.1 What would the runtime be of this algorithm, given $n$ $k$-mer reads?" @@ -217,13 +276,15 @@ }, { "cell_type": "markdown", + "id": "fbaf002e", "metadata": {}, "source": [ - "Answer:" + "Answer: " ] }, { "cell_type": "markdown", + "id": "df0f8437", "metadata": {}, "source": [ "#### Question 1.2.2 Would this algorithm always yield a unique solution?" @@ -231,6 +292,7 @@ }, { "cell_type": "markdown", + "id": "c527c303", "metadata": {}, "source": [ "Answer:" @@ -238,6 +300,7 @@ }, { "cell_type": "markdown", + "id": "e0c5f6de", "metadata": {}, "source": [ "#### Question 1.2.3 Would this algorithm always yield the right solution?" @@ -245,13 +308,15 @@ }, { "cell_type": "markdown", + "id": "3defee58", "metadata": {}, "source": [ - "Answer:" + "Answer: It may not yield the optimal solution. " ] }, { "cell_type": "markdown", + "id": "c81a6be4", "metadata": {}, "source": [ "### Question 1.3 Graph-based approaches" @@ -259,6 +324,7 @@ }, { "cell_type": "markdown", + "id": "f5865fe7", "metadata": {}, "source": [ "Graphs for genome assembly can be constructed in two ways:\n", @@ -276,6 +342,7 @@ { "cell_type": "code", "execution_count": null, + "id": "2d014ba8", "metadata": {}, "outputs": [], "source": [ @@ -293,6 +360,7 @@ { "cell_type": "code", "execution_count": null, + "id": "a84824f5", "metadata": {}, "outputs": [], "source": [ @@ -302,6 +370,7 @@ { "cell_type": "code", "execution_count": null, + "id": "ed167a9e", "metadata": {}, "outputs": [], "source": [ @@ -311,6 +380,7 @@ { "cell_type": "code", "execution_count": null, + "id": "6c3a38a5", "metadata": {}, "outputs": [], "source": [ @@ -321,6 +391,7 @@ { "cell_type": "code", "execution_count": null, + "id": "3016638e", "metadata": {}, "outputs": [], "source": [ @@ -331,6 +402,7 @@ { "cell_type": "code", "execution_count": null, + "id": "41777325", "metadata": {}, "outputs": [], "source": [ @@ -345,6 
+417,7 @@ { "cell_type": "code", "execution_count": null, + "id": "009925b5", "metadata": {}, "outputs": [], "source": [ @@ -355,6 +428,7 @@ }, { "cell_type": "markdown", + "id": "dc953fe2", "metadata": {}, "source": [ "## Question 2 - Sequencing SARS-CoV-2 virus" @@ -362,6 +436,7 @@ }, { "cell_type": "markdown", + "id": "52e3153f", "metadata": {}, "source": [ "Let's move on from TeleTubbies to real-world organisms. Let's start small- with a variant of the SARS-CoV-2 virus. You're given reads from actual genome sequencing runs in the SARS-CoV2.fastq file provided.\n", @@ -372,6 +447,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4b6beb7f", "metadata": {}, "outputs": [], "source": [ @@ -382,6 +458,7 @@ { "cell_type": "code", "execution_count": null, + "id": "f6229ace", "metadata": {}, "outputs": [], "source": [ @@ -392,6 +469,7 @@ { "cell_type": "code", "execution_count": null, + "id": "6a668724", "metadata": {}, "outputs": [], "source": [ @@ -402,6 +480,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4ab0e369", "metadata": {}, "outputs": [], "source": [ @@ -414,6 +493,7 @@ }, { "cell_type": "markdown", + "id": "23b41b79", "metadata": {}, "source": [ "# Question 3- Error-Aware Assembly (Extra Credit)" @@ -421,6 +501,7 @@ }, { "cell_type": "markdown", + "id": "a6ae36a0", "metadata": {}, "source": [ "In the parts above, we assumed error-free reads while assembling $k$-mers. As much as we'd like that, actual reads can (and do) have errors, captured by their Phred scores. For this question, you're given raw, actual reads from sequencing runs (download reads here: https://sra-pub-sars-cov2.s3.amazonaws.com/sra-src/SRR11528307/ABS2-LN-R1_cleaned_paired.fastq.gz). 
Given these reads and their Phred33 scores, can you assemble the genome?\n", @@ -433,6 +514,7 @@ { "cell_type": "code", "execution_count": null, + "id": "a810efcb", "metadata": {}, "outputs": [], "source": [] @@ -440,6 +522,7 @@ { "cell_type": "code", "execution_count": null, + "id": "4a761bc6", "metadata": {}, "outputs": [], "source": [] @@ -447,7 +530,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -461,7 +544,12 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.1" + "version": "3.9.7" + }, + "vscode": { + "interpreter": { + "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49" + } } }, "nbformat": 4,