From f96edb700b2581dd468ac3a4555c970f17854974 Mon Sep 17 00:00:00 2001
From: Emily Kao <54684530+eckao@users.noreply.github.com>
Date: Tue, 30 Aug 2022 12:03:27 -0400
Subject: [PATCH 1/3] Test
---
project1.ipynb | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/project1.ipynb b/project1.ipynb
index 8b9ec909..423d70eb 100644
--- a/project1.ipynb
+++ b/project1.ipynb
@@ -35,9 +35,9 @@
"
\n",
" list each member of your team here, including both your name and UVA computing id\n",
"\n",
- "Team Members (Names): \n",
+ "Team Members (Names): Meesha Vullikanti, Emily Kao \n",
"\n",
- "Team Member UVA Computing IDs:\n",
+ "Team Member UVA Computing IDs: rv6cun, eck3pxj\n",
"\n",
"
\n",
"\n",
@@ -447,7 +447,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3",
+ "display_name": "Python 3.10.0 64-bit",
"language": "python",
"name": "python3"
},
@@ -461,7 +461,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.7.1"
+ "version": "3.10.0"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
+ }
}
},
"nbformat": 4,
From 868eeb9b9512ea854391ac0ac2d9d7db372f023a Mon Sep 17 00:00:00 2001
From: Emily Kao <54684530+eckao@users.noreply.github.com>
Date: Tue, 30 Aug 2022 12:05:50 -0400
Subject: [PATCH 2/3] test 2
---
project1.ipynb | 55 +++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 50 insertions(+), 5 deletions(-)
diff --git a/project1.ipynb b/project1.ipynb
index 423d70eb..9069baf8 100644
--- a/project1.ipynb
+++ b/project1.ipynb
@@ -2,6 +2,7 @@
"cells": [
{
"cell_type": "markdown",
+ "id": "208473e7",
"metadata": {},
"source": [
"# Project 1: Assembling Genes"
@@ -9,6 +10,7 @@
},
{
"cell_type": "markdown",
+ "id": "e849dc7d",
"metadata": {},
"source": [
" \n",
@@ -29,6 +31,7 @@
},
{
"cell_type": "markdown",
+ "id": "ece892a4",
"metadata": {},
"source": [
"**Team submitting this assignment:** \n",
@@ -47,11 +50,13 @@
" \n",
"External Resources Used:\n",
"\n",
- "
"
+ "\n",
+ "testing my ability to remember git :/"
]
},
{
"cell_type": "markdown",
+ "id": "3c621f1d",
"metadata": {},
"source": [
"In this project, we will explore genome assembly—the process of determining the order of nucleotides in DNA from fragmented reads. As you might have studied in the reading assignments, genome assembly can get quite complicated, as problems such as full sequence coverage, finding a good length for reads (the $k$ in $k$-mer), and sequencing errors present challenges for sequencing analysis and accuracy. You can assume perfect coverage for all parts of the assignment and no read errors for the first two questions.\n",
@@ -62,6 +67,7 @@
},
{
"cell_type": "markdown",
+ "id": "89e80f32",
"metadata": {},
"source": [
"## Install basic required packages."
@@ -69,6 +75,7 @@
},
{
"cell_type": "markdown",
+ "id": "5bf1b0de",
"metadata": {},
"source": [
"- Install basic required packages, should be run only once. You may need to restart the kernel after this stage.\n",
@@ -81,6 +88,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "a4c221cd",
"metadata": {},
"outputs": [],
"source": [
@@ -89,7 +97,8 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 1,
+ "id": "4de4b69f",
"metadata": {},
"outputs": [],
"source": [
@@ -99,6 +108,7 @@
},
{
"cell_type": "markdown",
+ "id": "e81f8e3f",
"metadata": {},
"source": [
"## Genome Assembly\n",
@@ -110,7 +120,8 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 2,
+ "id": "eb730eb7",
"metadata": {},
"outputs": [],
"source": [
@@ -123,6 +134,7 @@
},
{
"cell_type": "markdown",
+ "id": "0578bc2b",
"metadata": {},
"source": [
"#### Question 1.1.1 GC-content\n",
@@ -139,6 +151,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "0f09bbe7",
"metadata": {},
"outputs": [],
"source": [
@@ -151,6 +164,7 @@
},
{
"cell_type": "markdown",
+ "id": "7266da16",
"metadata": {},
"source": [
"#### Question 1.1.2 Interpreting quality scores"
@@ -158,6 +172,7 @@
},
{
"cell_type": "markdown",
+ "id": "91c69f2f",
"metadata": {},
"source": [
"Phred33 quality scores are represented as the character with an ASCII code equal to its value + 33 (to make them easy to print alongside genome sequences). List the top 5 most frequent scores as ASCII symbols as well as their Phred33 scores in TeleTubby.fastq. You can refer to the [official Illumina website](https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm) for the quality score encoding.\n",
@@ -168,6 +183,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "5a1b6607",
"metadata": {},
"outputs": [],
"source": [
@@ -176,6 +192,7 @@
},
{
"cell_type": "markdown",
+ "id": "f4a7bb06",
"metadata": {},
"source": [
"#### Question 1.1.3 Frequency analysis\n",
@@ -188,6 +205,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "fe2b67e2",
"metadata": {},
"outputs": [],
"source": [
@@ -196,6 +214,7 @@
},
{
"cell_type": "markdown",
+ "id": "7b35151a",
"metadata": {},
"source": [
"### Question 1.2. Greedy approach"
@@ -203,6 +222,7 @@
},
{
"cell_type": "markdown",
+ "id": "d08e0e97",
"metadata": {},
"source": [
"One of the approaches to assemble the genome from the given reads is a greedy algorithm. Have a look at the greedy algorithm described on [Wikipedia](https://en.wikipedia.org/wiki/Sequence_assembly#Greedy_algorithm) and answer the following."
@@ -210,6 +230,7 @@
},
{
"cell_type": "markdown",
+ "id": "07b85ecb",
"metadata": {},
"source": [
"#### Question 1.2.1 What would the runtime be of this algorithm, given $n$ $k$-mer reads?"
@@ -217,6 +238,7 @@
},
{
"cell_type": "markdown",
+ "id": "fbaf002e",
"metadata": {},
"source": [
"Answer:"
@@ -224,6 +246,7 @@
},
{
"cell_type": "markdown",
+ "id": "df0f8437",
"metadata": {},
"source": [
"#### Question 1.2.2 Would this algorithm always yield a unique solution?"
@@ -231,6 +254,7 @@
},
{
"cell_type": "markdown",
+ "id": "c527c303",
"metadata": {},
"source": [
"Answer:"
@@ -238,6 +262,7 @@
},
{
"cell_type": "markdown",
+ "id": "e0c5f6de",
"metadata": {},
"source": [
"#### Question 1.2.3 Would this algorithm always yield the right solution?"
@@ -245,6 +270,7 @@
},
{
"cell_type": "markdown",
+ "id": "3defee58",
"metadata": {},
"source": [
"Answer:"
@@ -252,6 +278,7 @@
},
{
"cell_type": "markdown",
+ "id": "c81a6be4",
"metadata": {},
"source": [
"### Question 1.3 Graph-based approaches"
@@ -259,6 +286,7 @@
},
{
"cell_type": "markdown",
+ "id": "f5865fe7",
"metadata": {},
"source": [
"Graphs for genome assembly can be constructed in two ways:\n",
@@ -276,6 +304,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "2d014ba8",
"metadata": {},
"outputs": [],
"source": [
@@ -293,6 +322,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "a84824f5",
"metadata": {},
"outputs": [],
"source": [
@@ -302,6 +332,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "ed167a9e",
"metadata": {},
"outputs": [],
"source": [
@@ -311,6 +342,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "6c3a38a5",
"metadata": {},
"outputs": [],
"source": [
@@ -321,6 +353,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "3016638e",
"metadata": {},
"outputs": [],
"source": [
@@ -331,6 +364,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "41777325",
"metadata": {},
"outputs": [],
"source": [
@@ -345,6 +379,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "009925b5",
"metadata": {},
"outputs": [],
"source": [
@@ -355,6 +390,7 @@
},
{
"cell_type": "markdown",
+ "id": "dc953fe2",
"metadata": {},
"source": [
"## Question 2 - Sequencing SARS-CoV-2 virus"
@@ -362,6 +398,7 @@
},
{
"cell_type": "markdown",
+ "id": "52e3153f",
"metadata": {},
"source": [
"Let's move on from TeleTubbies to real-world organisms. Let's start small, with a variant of the SARS-CoV-2 virus. You're given reads from actual genome sequencing runs in the SARS-CoV2.fastq file provided.\n",
@@ -372,6 +409,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "4b6beb7f",
"metadata": {},
"outputs": [],
"source": [
@@ -382,6 +420,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "f6229ace",
"metadata": {},
"outputs": [],
"source": [
@@ -392,6 +431,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "6a668724",
"metadata": {},
"outputs": [],
"source": [
@@ -402,6 +442,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "4ab0e369",
"metadata": {},
"outputs": [],
"source": [
@@ -414,6 +455,7 @@
},
{
"cell_type": "markdown",
+ "id": "23b41b79",
"metadata": {},
"source": [
"# Question 3- Error-Aware Assembly (Extra Credit)"
@@ -421,6 +463,7 @@
},
{
"cell_type": "markdown",
+ "id": "a6ae36a0",
"metadata": {},
"source": [
"In the parts above, we assumed error-free reads while assembling $k$-mers. As much as we'd like that, actual reads can (and do) have errors, captured by their Phred scores. For this question, you're given raw, actual reads from sequencing runs (download reads here: https://sra-pub-sars-cov2.s3.amazonaws.com/sra-src/SRR11528307/ABS2-LN-R1_cleaned_paired.fastq.gz). Given these reads and their Phred33 scores, can you assemble the genome?\n",
@@ -433,6 +476,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "a810efcb",
"metadata": {},
"outputs": [],
"source": []
@@ -440,6 +484,7 @@
{
"cell_type": "code",
"execution_count": null,
+ "id": "4a761bc6",
"metadata": {},
"outputs": [],
"source": []
@@ -447,7 +492,7 @@
],
"metadata": {
"kernelspec": {
- "display_name": "Python 3.10.0 64-bit",
+ "display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -461,7 +506,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
- "version": "3.10.0"
+ "version": "3.9.7"
},
"vscode": {
"interpreter": {
From cb270203c3fc115fc1d0c5f690444fdb391f8ae8 Mon Sep 17 00:00:00 2001
From: Emily Kao <54684530+eckao@users.noreply.github.com>
Date: Thu, 1 Sep 2022 21:51:45 -0400
Subject: [PATCH 3/3] Questions 1.1.1 & 1.1.3
---
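Reviewer note (not part of the commit): the GC-content cell in this patch can be
sketched more compactly as below. This is only a minimal sketch, assuming the reads
are plain strings over A/C/G/T. The function names `gc_fraction` and
`melting_temp_celsius` are hypothetical, and the temperature formula is simply the
one the patch itself uses (64.9 + 0.41 * %GC - 500 / total length).

```python
def gc_fraction(reads):
    """Return the fraction (0..1) of G and C bases across all reads."""
    total = sum(len(read) for read in reads)
    if total == 0:
        raise ValueError("no bases to count")
    gc = sum(read.count("G") + read.count("C") for read in reads)
    return gc / total


def melting_temp_celsius(reads):
    """Apply the patch's formula: 64.9 + 0.41 * %GC - 500 / total length."""
    total = sum(len(read) for read in reads)
    return 64.9 + 0.41 * (gc_fraction(reads) * 100) - 500 / total


# Illustrative reads only; the notebook loads them with utils.read_fastq.
example_reads = ["GCTATCGC", "CGCTATCG", "TATCGCAA"]
print("GC fraction:", gc_fraction(example_reads))
print("temperature:", melting_temp_celsius(example_reads))
```

Summing the read lengths avoids assuming that every read has the same length as
the first one.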
project1.ipynb | 60 +++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 49 insertions(+), 11 deletions(-)
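Reviewer note (not part of the commit): the frequency-analysis cell counts whole
reads with `collections.Counter`, which works when each read is already a single
k-mer. If reads were longer than k, a sliding window would be needed. The sketch
below shows that variant under that assumption; `kmer_counts` and
`most_frequent_kmers` are hypothetical helper names, not part of `utils`.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def most_frequent_kmers(reads, k, n=3):
    """Return the n most common k-mers as (k-mer, frequency) pairs."""
    return kmer_counts(reads, k).most_common(n)


# Illustrative input only.
print(most_frequent_kmers(["ACGACG"], k=3, n=1))
```

When every read has length exactly k, this reduces to counting the reads
themselves, which is what the cell in the patch does.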
diff --git a/project1.ipynb b/project1.ipynb
index 9069baf8..d8637f69 100644
--- a/project1.ipynb
+++ b/project1.ipynb
@@ -50,8 +50,7 @@
" \n",
"External Resources Used:\n",
"\n",
- "\n",
- "testing my ability to remember git :/"
+ ""
]
},
{
@@ -150,16 +149,37 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 38,
"id": "0f09bbe7",
"metadata": {},
- "outputs": [],
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "GC percent: 0.47952218430034127\n",
+ "temperature: 84.34709897610922\n"
+ ]
+ }
+ ],
"source": [
"# Read sequence reads (error-free) from file\n",
"sequence_reads, qualities = utils.read_fastq('TeleTubby.fastq')\n",
"\n",
"# Calculate %GC content\n",
- "# Print out temperature in Celsius"
+ "gc = 0\n",
+ "total = len(sequence_reads[0]) * len(sequence_reads)\n",
+ "for x in range(len(sequence_reads)):\n",
+ "    for y in sequence_reads[x]:\n",
+ "        if(y == 'G'):\n",
+ "            gc = gc + 1\n",
+ "        if(y == 'C'):\n",
+ "            gc = gc + 1\n",
+ "gc_percent = gc/total\n",
+ "print(\"GC percent: \" , gc_percent)\n",
+ "# Print out temperature in Celsius\n",
+ "temp = 64.9 + (0.41*gc_percent*100) - (500/total)\n",
+ "print(\"temperature: \", temp)"
]
},
{
@@ -204,12 +224,30 @@
},
{
"cell_type": "code",
- "execution_count": null,
+ "execution_count": 48,
"id": "fe2b67e2",
"metadata": {},
- "outputs": [],
- "source": [
- "# Find and print out the three most repeated k-mers and their frequencies"
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Most frequent k-mers and their frequencies: \n",
+ "[('GCTATCGC', 3), ('CGCTATCG', 2), ('TATCGCAA', 2)]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# https://www.geeksforgeeks.org/python-find-most-frequent-element-in-a-list/\n",
+ "# Find and print out the three most repeated k-mers and their frequencies\n",
+ "from collections import Counter\n",
+ " \n",
+ "def most_frequent(items):\n",
+ "    occurrence_count = Counter(items)\n",
+ "    return occurrence_count.most_common(3)\n",
+ " \n",
+ "print(\"Most frequent k-mers and their frequencies: \")\n",
+ "print(most_frequent(sequence_reads))\n"
]
},
{
@@ -241,7 +279,7 @@
"id": "fbaf002e",
"metadata": {},
"source": [
- "Answer:"
+ "Answer: "
]
},
{
@@ -273,7 +311,7 @@
"id": "3defee58",
"metadata": {},
"source": [
- "Answer:"
+ "Answer: No. The greedy algorithm may not yield the optimal solution."
]
},
{