LSI-M-2: Bioinformatics 🧬 Algorithms and Data Structures (SoSe 2020)
Kursthemen
-
This course will be held by me, Gökçe
📡 Class room
like inf2
Class Recordings
like inf2
The first recording from 2020-03-26 is not listed on aydos.de/binfroom
Minutes & Code examples from the Class
https://mygit.th-deg.de/lsi/binf-notes
Meeting room
like inf2
Class
Thursdays 14:00 to 15:30
Flipped class like inf2
Textbook
We will use the book Bioinformatics - Algorithms and Data Structures from Phillip Compeau & Pavel Pevzner, the third edition. The first chapters are available at the book website. The author said he will open up all the chapters in the next weeks.
Interactive Online Course
There is an interactive online course which contains all the book contents and interactive coding challenges
Bioinformatics Algorithms: An Active Learning Approach @ Stepik
The first chapter is free:
Bioinformatics I: Finding Hidden Messages in DNA
If you do not want to. I think the online version of the textbook should be sufficient. The textbook seems to contain all the content in the print edition and the code challenges are also available at Rosalind.
Another free interactive course based on the book is available at Coursera. I only briefly audited this course, but it seems to be a subset of the textbook. The advantage is that there are peer-reviewed assignments.
Videos
https://www.bioinformaticsalgorithms.org/lecture-videos
Slides
If you need the slides, please contact me.
Exam
written exam on 30.07.2020, open book & internet & computer
Assignments
like inf2
Critical Awareness
like inf2
Moodleoverflow
like inf2
-
Prep
- Please finish the following feedback form
- Finish the following subchapters on the website or in the free Stepik course:
- 1.1 (A Journey of a Thousand Miles...) to including 1.4 (An Explosion of Hidden Messages)
- take the detours if you can and please try to do the coding challenges (on Rosalind or Stepik) using Python or command line
- 1.1 (A Journey of a Thousand Miles...) to including 1.4 (An Explosion of Hidden Messages)
- We will discuss your questions on Thursday at 2pm in the online meeting room
Agenda
- Testing microphone
- Introduction new students
- course start feedback
- course details
- poll: could you finish the chapters?
- meaning of 5 ECTS
- 1 ECTS ~= 28h
- 5 ECTS ~= 140h
- time for class preparation and assignments = 140h - 24h (classes ~1.5h x 16 weeks) - 20h (exam prep) = ~96h
- prep time per week = ~6h
- poll: which language did you use?
- questions
- It has not escaped our... — meaning
- James Watson & Francis Crick — biologists, postulate about how the DNA replicates
- confirmed by Matthew Melsin & Franklin Stahl 1958 — https://en.wikipedia.org/wiki/Meselson%E2%80%93Stahl_experiment
- centrifuge principle
-
Theodosius Dobzhansky — biology
- hidden messages
- https://en.wikipedia.org/wiki/The_Gold-Bug
- https://en.wikipedia.org/wiki/Caesar_cipher
- Pseudocode
- exercise: character count in a string
- one minute paper
- How useful was the online class today? Please give details.
- How useful was the online class today? Please give details.
-
Anzeigen Feedback einreichen
-
Prep
- until including Chapter 1.7
Assignments to Submit
- character count in a string as pseudo code
  input: string of length n, character c
  output: number of c's in the string
- Implementation of PatternCount() in Python (from Chapter 1.2)
- Please submit them in one archive (e.g., zip)
Agenda
- Feedback
- Real-time tracking of pathogen evolution
- learning goal
- understand what pseudo code is and be able to write a basic problem in pseudo code
- dictionary vs list
- assignment discussion
- code example
- questions
- Pseudocode
- group exercise:
- input: k-mer kmer, DNA-sequence seq
- output: number of the k-mers kmer in the DNA-sequence seq
- group exercise:
- frequentWords algorithm
- discussion in groups
- what if we search for all possible k-mers in one singe iteration?
- discussion in groups
- dict vs list (vs set)
- list: order important
- dict: faster to access the elements
- https://docs.python.org/3/tutorial/datastructures.html
-
Anzeigen Abgabe einreichen
-
Prep
- until including Chapter 1.8
Agenda
- feedback
- Hydra — a small, fresh-water organism
- they do not appear to die of old age, or indeed to age at all.
- discovered by Daniel Martinez
- Gerontology
- Strategies for Engineered Negligible Senescence
- coined by biomedical gerontologist Aubrey de Grey (who studied computer science)
- Alcor Life Extension Foundation
- Strategies for Engineered Negligible Senescence
- Would you like to live forever?
- The Reluctant Immortalist - about what Daniel Martinez nowadays does. He discovered the immortality of Hydra
- Homo Deus — Yuval Harari
- assignment discussion
- questions
- creating a Python library
- Python style guide
- evaluating the runtime
- big-O notation
- examples
- patternCount
- frequentWords
- O(1)
- O( n)
- O(n^2)
- O(2^n)
- What is the probability that two randomly generated strings of length 9 in the DNA alphabet ({A,C,G,T}) are identical?
-
- Pseudocode
- Chapter 1.3 Reverse Complement Problem
- Python codes for the following coding challenges
- Chapter 1.2 frequent_words problem
- Chapter 1.3 Reverse Complement Problem
- Chapter 1.3 Pattern Matching Problem
- Pseudocode
-
Prep
- until including Chapter 2.3
Agenda
- Aging research
- Harvard-Professor David Sinclair: Wie bleibt man fĂĽr immer jung? - SZ Magazin
- sports and fasting good for cells
- he tries to revert aging with medicine
- in the future life sciences will be important, old age is classified as a disease, which should be fight against. This means drugs against aging will be allowed. (MG2A in ICD-11 —International classification of diseases 11th version by WHO which will be official in Jan. 2022)
- Do we a clock gene? (Chapter 2)
- Harvard-Professor David Sinclair: Wie bleibt man fĂĽr immer jung? - SZ Magazin
- feedback
- grading
- final project
- based on e.g., implementing an algorithm described in a paper
- final project
- learning goals
- how to use the big-O notation and what is it good for?
- what are iterative and recursive functions?
- assignment feedback
- Big-O Notation
-
A function T(N) is O(F(N)) if for some constant c and for values of N greater than some value n0: T(N) <= c * F(N)
- how to determine complexities
-
- iterative vs recursive functions
- Hanoi Towers
- neighbors() function
- runtime differences
- one minute paper
-
- consider the following function
- input: array of integers
- output: minimum of these numbers
- input: array of integers
- write the pseudocode for your algorithm
- implement it in Python
- calculate the runtime
- best case
- worst case
- is your function iterative or recursive?
- how would you convert it to an iterative or recursive function?
- consider the following function
- until including Chapter 2.3
-
Prep
- until including Chapter 2.4
Agenda
- feedback
- questions
- Hanoi towers
- can you find a formula for T(n) that does not require recursion?
- brute force algorithms
- entropy
-
Prep
- until including Chapter 2.7
Agenda
- feedback
- grading
- idea: to have multiple grades instead of one single exam
- waiting for feedback from faculty
- where are we?
- (regulatory) Motif finding => searching for genes which are circadian
- but the motifs we find are mutated
- (regulatory) Motif finding => searching for genes which are circadian
- where are you?
- until which chapter did you prepare?
- questions
- slides into 2nd chapter
- What is the expected number of occurrences of a 9-mer in 500 random DNA strings, each of length 1000? Assume that the sequences are formed by selecting each nucleotide (A,C,G,T) with the same probability (0.25). (Chapter 2.2.2)
- ~1.8
- https://brilliant.org/wiki/linearity-of-expectation/
-
Prep
- until including Chapter 2.8
Agenda
- feedback
- grading
- graded assignments + 90 min exam
- not allowed by the exam commission
- in which chapter are you?
- do you use the lecture videos for prep?
- questions
- what is heuristic
- e.g., chess
- e.g., local search heuristic
-
Prep
- until including Chapter 3.7
Agenda
- feedback
- in which chapter are you?
- your questions
- we calculated the probability for
- 10 randomly selected 15-mers from the 10 600-nuc-long dna_strings capture *at least* one impl anted 15-mer
- we found ~.017
- what does it mean for us in case of randomizedMotifSearch()
- do you use ipython?
- reloading a module
- import importlib
- importlib.reload(your_library)
- howto implement random() in Gibbs Sampling using random_integer(int)
- until including Chapter 3.7
-
đź’¤
-
Prep
- until including Chapter 3.7
- in which code challenges did you have problems?
Agenda
- feedback
- discussions of code challenges in chapter 3
- until including Chapter 3.7
-
Prep
- until including Chapter 3.8 (do not forget the coding challenges)
Agenda
- feedback
- in which chapter are you?
- discussion from last session
- does information get lost when we convert an overlap graph to a de Bruijn graph?
- it depends if we are working on the composition (exploded newspaper particles) or the overlap graph constructed directly from the string
- example ATGATGGATG
- the overlap graph constructed from the string does not equal to the overlap graph built from the composition
- de Bruijn graph is the same for both
- but our problem is based on exploded newspaper anyway
- https://aydos.de/jhub
- assignments ba3e, ba3f
- Midterm Eval:
- http://tinyurl.com/yblqfbdv
- also give feedback: do you profit from the teaching style (preparation + discussing only assignments & key points & your questions)
- until including Chapter 3.8 (do not forget the coding challenges)
-
Prep
- Chapter 3.9 (Assembling Genomes from Read-pairs)
Agenda
- feedback (midterm eval)
- questions
- Why is the read length limited?
- chapter 3.9 discussion
-
Prep
- 3.10 Epilogue: Genome Assembly Faces Real Sequencing Data
- pre-class assignment ↴
Agenda
feedback
- The self-preparation of the chapters is not sufficient I think. Maybe we could just sum up what we prepared together at the beginning of the lecture
- Generally, the online course spends much time with developing the ideas behind algorithms. It starts with "let's try it..." and then finds out certain weaknesses of the tried approach to afterwards improve it iteratively. I think we should spend more time on explaining the final algorithm than on developing approaches that we will find out they're not good enough anyway
- Also, with regard to preparation for the exam, it is hard to judge which of the contents is relevant and which is not so relevant.
pre-class assignment discussion
- questions
-
Prep
This is a Jupyter Notebook file. Either:
use aydos.de/jhub
- login with your university credentials
- go to
Assignments
tab fetch
motifs assignment- suffer or have fun with it & complete what you can
- you can run individual cells using
CTRL + Enter
- the tests for coding assignments are included (assertions)
- the answers for questions are not tested, we will discuss them in class
- you can turn your solutions in via
submit
button, then I can browse your answers during class
download it and run it on your own Jupyter Notebook server and turn your solutions in via https://aydos.de/jhub
-
semester overview & summary
- what is computational thinking
- how do biologists search, how do computer scientists search?
- how do we search for patterns?
- how to write your approach as pseudocode?
- which algorithms are better than the others?
- how do we calculate the runtime complexity of algorithms?
- what is a DnaA box?
- what is the motivation behind skew diagrams?
- how to write an algorithm iteratively and recursively?
- what are motifs?
- how do we search for motifs?
- what is a profile matrix?
- what is the difference between count- and entropy-based score?
- what is a median string?
- what are the four algorithms we used for motif search? What are the differences?
- what is the analogy between an exploded newspaper and genome assembly?
- how do we use graphs to assemble a genome?
- what are the differences between overlap and de Bruijn graphs?
- what is an Eulerian graph?
- what is en Eulerian cycle?
- what is the advantage of read-pairs?
- what are the challenges of real sequencing data? How can we overcome them?