Applied Machine Learning

WHERE
Arnimallee 7, Room SR E.31/A7

WHEN

Thursday 12-14

PREREQUISITES

Previous participation in the 'Statistics' or 'Applied Machine Learning' Master course. Basic knowledge of machine learning techniques will be helpful, as the seminar is meant to extend the basic concepts covered so far in the Master courses.

Note: The number of participants is limited to 12. If demand exceeds the seminar capacity, students from the Master's programme in Bioinformatics (FU) will have priority.

DESCRIPTION AND GOALS
The term machine learning refers to the development and evaluation of algorithms for pattern recognition, classification and prediction based on models derived from observable data. One of the main challenges in computational biology is to make use of the growing amount of available biological data, in order to select and extract useful knowledge.
In this seminar we will review the main machine learning methods used in the bioinformatics field and look at various prediction problems in several biological domains, ranging from gene recognition to cancer classification and regulatory feature prediction (e.g. functional SNPs, enhancers, promoters, transcription factor binding sites). The selected papers cover classical supervised and unsupervised approaches, issues with feature selection and model evaluation, and some applications of more advanced methods, such as semi-supervised learning, deep learning and active learning, to computational biology problems.

GUIDELINES
Some additional guidelines for the seminar are given below. Helpful material on how to prepare a good scientific presentation can be found here.

Completing the seminar

The language of the seminar is English. To pass the seminar, you need to do the following:
- attend 80% of the classes
- give an oral presentation about a paper of your choice selected from the proposed list below. Students can propose a topic or paper that is not in the list, but they should discuss their choice with me first, in order to ensure the relevance and scientific value of the selected paper. More details about the presentation format and duration can be found here.
- participate actively in the discussion following the presentation by asking at least two questions regarding the presented topic and reviewing the other students' work (e.g. feedback on the talk and the quality of the presentation)

SCHEDULE
link to the Doodle

INTRODUCTION

slides

TOPICS WITH ASSOCIATED ARTICLES
The following survey article will be useful to everybody as a starting point, regardless of the chosen topic, and I strongly recommend reading it before you start working on your own presentation:
"Machine learning in Bioinformatics", Brief. Bioinform. 7:86-112

Review of the main machine learning concepts and their applications to Bioinformatics

1) Assessing the accuracy of prediction algorithms for classification: an overview
2) An introduction to ROC analysis
3) A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection

Feature Selection Problem

1) A review of feature selection techniques in bioinformatics
2) Novel unsupervised feature filtering of biological data

Unsupervised learning and its applications to Bioinformatics

1) Cluster analysis of gene expression data: A Survey
2) Biclustering algorithms for Biological data analysis: A Survey

Random forests and their applications to Bioinformatics

1) Simple decision rules for classifying human cancers from gene expression profiles

2) Prediction of protein-protein interactions using random decision forest framework

3) Detection and interpretation of expression quantitative trait loci (eQTL)

4) RFECS: A Random-Forest Based Algorithm for Enhancer Identification from Chromatin State

Structured logistic regression

Large scale identification and categorization of protein sequences using structured logistic regression

Classification with Support Vector Machines and different kernels

1) The spectrum kernel: A string kernel for SVM protein classification
2) Kernel-based machine learning protocol for predicting DNA-binding proteins
3) A boosting approach for motif modeling using ChIP-chip data

Neural networks and their applications to Bioinformatics

1) Gene prediction in metagenomic fragments: A large scale machine learning approach
2) Beyond the ‘best’ match: machine learning annotation of protein sequences by integration of different sources of information

3) Deep learning of the tissue-regulated splicing code

Active learning and its applications to Bioinformatics

1) Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning
2) Active Learning with Support Vector Machine applied to Gene Expression Data for Cancer Classification

Semi-supervised learning and its applications to Bioinformatics

1) Semi-supervised learning improves gene expression-based prediction of cancer recurrence
2) Matching experiments across species using expression values and textual information (Co-training)

Multi-task learning and its applications to Bioinformatics

1) Probabilistic multi-class multi-kernel learning: on protein fold recognition and remote homology detection
2) Integrating sequence, expression and interaction data to determine condition-specific miRNA regulation