DNA Sequence Database using Python Assignment Solution

June 21, 2024

Dr. Nicholas

🇯🇵 Japan

Python

Dr. Nicholas Scott, a distinguished Computer Science expert, holds a PhD from Tokyo University, Japan. With an impressive 8 years of experience, he has completed over 600 Python assignments, showcasing his unparalleled expertise and dedication to the field.

Hire me to do Your Python Assignment

Python

Key Topics

DNA Fragments Database

Submit Your Python Assignment

Get a FREE Quote

Tip of the day

Always normalize your database to at least the third normal form (3NF) to eliminate redundancy and improve efficiency. Use indexing wisely to speed up queries, and always test SQL commands on sample data before executing them on the main database to avoid errors and data loss.

News

In 2025, the programming landscape has been enriched with advanced libraries like TensorFlow 3.0 and PyTorch 2.0, offering enhanced performance and user-friendly interfaces for machine learning applications.

DNA Fragments Database

You are maintaining a database for a study into drug resistance in hospitals. During the study, you will receive bacterial DNA collected from patients. Each bacterial DNA sample contains a drug-resistant gene sequence at the start.

The researchers will need to query your database for various different drug-resistant gene sequences over the course of the study (details below), but you will also need to be able to update the contents of the database with new bacterial DNA samples.

To get help with python assignment problem, you will need to create a class SequenceDatabase, to represent the; database. This class will need to have two methods, addSequence(s) and query(q). As usual, you are welcome to create other functions/methods.

Note: DNA is typically represented using A, T, C, and G. For the purpose of easy coding, we will use A, B, C, and D (since they have adjacent ASCII values).

;Input

The input to addSequence is a single non-empty string of uppercase letters in uppercase [A-D]. The input to query is a single (possibly empty) string of uppercase letters in uppercase [A-D].

Output

addSequence(s) should not return anything. It should appropriately store s into the database represented by the instance of SequenceDatabase. We define the frequency of a particular string to be the number of times it has been added to the database in this way. query(q) should return a string with the following properties:

It must have q as a prefix
It should have a higher frequency in the database than any other string with q as a prefix
If two or more strings with prefix q are tied for most frequent, return the lexicographically least of them

If no such string exists, the query should return None.

Example:

Note that the name "db" in the above example is arbitrary (i.e. the instance of SequenceDatabase

could be called anything)

Open reading frames

In Molecular Genetics, there is a notion of an Open Reading Frame (ORF). An ORF is a portion of DNA that is used as the blueprint for a protein. All ORFs start with a particular sequence and end with a particular sequence.

In this task, we wish to find all sections of a genome which start with a given sequence of characters and end with a (possibly) different given sequence of characters.

To solve this problem, you will need to create a class OrfFinder. This class will need a method, find(start, end). Also note that the constructor for this class takes a string genome as a parameter, unlike the class in Problem 1 (shown in the example below).

Input

genome is a single non-empty string consisting only of uppercase [A-D]. genome is passed as an argument to the init method of OrfFinder (i.e. it gets used when creating an instance of the class).

start and end are each a single non-empty string consisting of only uppercase [A-D].

Output

find returns a list of strings. This list contains all the substrings of the genome which have started as a prefix and ended as a suffix. There is no particular requirement for the order of these strings. start and end must not overlap in the substring (see the last two cases of the example below).

Example

Solution

#node structure for the trie_tree class node: def __init__(self, ch): self.ch = ch self.end = False self.count = 0 self.children = [] #the tree object class trie_tree: def __init__(self): self.root = node("") #insterts a node into the trie after the revelvent node if only the root node is in the tree the node will be insterted after that def ins(self, word): current_node = self.root for ch in word: inChild = False for n in current_node.children: if n.ch == ch: current_node = n inChild = True break if not inChild: newNode = node(ch) current_node.children.append(newNode) current_node = newNode current_node.end = True current_node.count += 1 # recursive function for first call node should be root #searches the depth of the tree for a given prefixs, as the tree structure will be quite linear due to the size of the alaphbet used #we need to check when an end node is reached if it has any further children def depth_first_search(self, node, prefix): if node.end: if len(node.children) > 0: for n in node.children: self.depth_first_search(n, prefix+node.ch) self.output.append([(prefix+node.ch), node.count, node]) else: for n in node.children: self.depth_first_search(n, prefix+node.ch) #finds the prefix using dfs, starts by checking if the prefix is in the tree structure and if it is not it will return [], if it is #in the tree structure it will start the dfs and eventually return an array containing the sequence with the prefix and the count #of the sequence in the following form [sequence, count] def find_prefix(self, prefix): self.output = [] current_node = self.root for ch in prefix: charFound = False for n in current_node.children: if n.ch == ch: charFound = True current_node = n break if not charFound: return [] self.depth_first_search(current_node,prefix[:-1]) trueOutput = [] for n in self.output: trueOutput.append([n[0], n[1]]) return trueOutput class SequenceDatabase: def __init__(self): self.data = trie_tree() # adds the sequence s to the tree that stores data def addSequence(self, s): if s == "": return else: self.data.ins(s) #compares two strings based on there lexical values and returns the least of them def find_lexLeast(self, a,b): alphabet = ["A","B","C","D"] a_count = 0 b_count = 0 for n in a: for m in range(0,len(alphabet)): if n == alphabet[m]: a_count += (m+1) for n in b: for m in range(0,len(alphabet)): if n == alphabet[m]: b_count += (m+1) if a <= b: return a else: return b #searchs throught the data tree for sequences with the prefix specifed by q, it will return the sequence with the highest frequency #if the highest frequency is tied then the lexical least sequence with the tied highest frequency is returned # the query function has a complexity of o(len(s)) unless the database contains no values in which case in runs in o(1). #this function usually has a complexity of o(len(s)) as it only contains one for loop which iterataion range if from 0 to len(s) def query(self, q): vals = self.data.find_prefix(q) if len(vals) == 0: return else: current_highest_freqence = 0 bestInd = 0 for n in range(0, len(vals)): if vals[n][1] > current_highest_freqence: current_highest_freqence = vals[n][1] bestInd = n elif vals[n][1] == current_highest_freqence: lexComp = self.find_lexLeast(vals[n][0], vals[bestInd][0]) if lexComp == vals[n][0]: current_highest_freqence = vals[n][1] bestInd = n return vals[bestInd][0] class OrfFinder: #initiated by adding each substring of the genome to a tree for later used def __init__(self, genome): self.genome = genome self.tree = trie_tree() for n in range(0, len(self.genome)): for m in range(n+1, len(self.genome)+1): self.tree.ins(self.genome[n:m]) #starts by finding substrings of genome with the relivent prefix, then adds the reversed version of these values to a second tree # it then searches this tree using the reversed version of the end paramter adding the orignal version of that value to a results # array #this function has a complexity of o(len(start)+len(end)+U) as it contains two for loops that run sequentially for tree structures #generated for both the start and end repectivly, the addition of U comes from the loops within these loops which have to account for #the frequency of substrings within the sequence def find(self, start, end): prefixed = self.tree.find_prefix(start) print(prefixed) secondTree = trie_tree() result = [] for n in prefixed: for p in range(0, n[1]): secondTree.ins(n[0][::-1]) suffixed = secondTree.find_prefix(end[::-1]) print(suffixed) for m in suffixed: if not (len(m[0]) < len(start) + len(end)): for l in range(0,m[1]): result.append(m[0][::-1]) return result genome1 = OrfFinder("AAABBBCCC") print(genome1.find("AAA","BB"))

Similar Samples

Explore our extensive collection of programming assignment samples at ProgrammingHomeworkHelp.com. These samples exemplify our proficiency in diverse programming languages and topics, showcasing clear and effective solutions. Whether it's Java, Python, or database management, our examples demonstrate our commitment to delivering top-notch academic support. Discover how our samples can guide and inspire you in mastering programming concepts and achieving academic success.

See All Samples

Prime Number Check, Sum of Even Numbers, Guessing Game, and Dice Simulation in Python

Python

Word Count

4091 Words

Writer Name:Walter Parkes

Total Orders:2387

Satisfaction rate:

Python Assignment Sample: Analyzing Stock Market Data with Pandas

Python

Word Count

2184 Words