# Genome-wide association studies, Autumn 2015

**Teacher:** Matti Pirinen, FIMM (matti.pirinen at helsinki.fi)

**Scope:** 5 cr (exercises and exam give 3 cr, optional project work additional 2 cr)

**Type:** Intermediate / advanced studies in statistics

**Teaching:** Lectures and computer class work

**Topics:**

We are in the middle of a revolution in genomic science. Recent technologies make it possible to read genomes so quickly and cheaply that numerous genomics-related services will be offered to us in the future, for example in health care for personalized risk prediction and the choice of therapies and medication. This requires quantitative data analysis in which statistics has a lead role. This course is about the statistics used in modern genome analyses. In particular, we will consider genome-wide association studies (GWAS), which have been a discovery engine for the field for the last 10 years.

- Heritability of complex diseases and traits
- Data: Genotyping technologies - quality control
- Statistical concepts: Significance and power, probability of association
- Linear regression, logistic regression, covariates, PCA of genetic structure, meta-analysis
- Linear mixed models, missing heritability, contributions from common and rare variants
- Mendelian randomization
- Linkage disequilibrium, fine-mapping, statistical imputation

**Prerequisites:** Studies in probability, statistical inference, linear models, and basic data analysis with R. The course is also expected to be useful for people who have been or will be doing GWAS in practice but do not necessarily have a strong background in statistics.

## Teaching schedule

II period, 14-17 in C128, Tuesdays 3.11, 10.11, 17.11, 24.11.

## Course material

Slides: 1 (3.11), 2 (10.11), 3 (17.11), 4 (24.11)

Handouts: Significance thresholds (3.11), PCA (10.11), Height GWAS press release (17.11), LD (24.11)

Practicals: 1 (3.11), 2 (10.11), 3 (17.11), 4 (24.11)

Articles:

- Pearson & Manolio (2008). How to interpret a GWAS.
- Slatkin (2008). Linkage disequilibrium - understanding the evolutionary past and mapping the medical future.
- Sham & Purcell (2014). Statistical power and significance testing in large-scale genetic studies.
- Price et al. (2010). New approaches to population stratification in genome-wide association studies.
- Gibson (2012). Rare and common variants: Twenty arguments.
- Zuk et al. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability.
- Lawlor et al. (2008). Mendelian randomization: Using genes as instruments for making causal inferences in epidemiology.

## Assignments

Return the assignments of each week as a single PDF file that contains the R scripts, the output of the scripts, and the figures. For example, you can use MS Word and save or export as PDF. The assignments are returned through Moodle. (If you do not have a UH student account for the Moodle system, you can email your answers as a PDF to matti.pirinen'at'helsinki.fi.)

An example assignment (assignment0.txt) and an example answer (answer0.pdf).

- 3.11. assignment1.txt
- 10.11. assignment2.txt
- 17.11. assignment3.txt
- 24.11. assignment4.txt

## Home exam

The answers must be returned by 14.15 on Tue 8.12. Return your answers as a single PDF through Moodle. (If you do not have a UH student account for the Moodle system, you can email your answers as a PDF to matti.pirinen'at'helsinki.fi.) Sufficient material for complete exam answers can be found above in the 'Course material' and 'Assignments' sections.

How to answer the exam questions?

- Guiding principle: The idea of the exam is to verify that you yourself have the knowledge of GWAS as taught in this course. Therefore you should understand and have processed everything that is included in your answers.
- Formulate the answers in your own words rather than copy-pasting from somewhere.
- You can use short definition-like pieces of text from the source materials on this webpage or from elsewhere you have discovered yourself.
- If you copy anything other than definition-like text, you should mark clearly where the material is taken from. The only reason you would ever want to use such longer citations would be to give examples of the topic from the literature. In general, there is no need for long citations in this exam.
- You are free to make figures that mimic figures you have seen during the course or elsewhere. If a figure is an almost direct copy of the original, add a note "Adapted from --reference--" to the figure legend.
- Read each question carefully and answer the question being asked. It does not help to include irrelevant pieces of information in the answers, no matter how good an answer they would be to some other question.
- IMPORTANT: Every student does the home exam alone: do not share your answers with others, do not include any material that you haven't processed yourself.

## Passing the course

The course is passed when a student has at least half of the exercise points and at least half of the exam points. The course will be graded from 1 to 5.

For students who also complete the project work (see below), there will still be a single grade for the course. Excellent project work can increase the grade determined by the exam and assignments.

## Project work

After successful completion of the lecture course (home assignments and home exam), students have the option to do project work worth 2 cr. Return your project report as a single PDF through Moodle. (If you do not have a UH student account for the Moodle system, you can email your report as a PDF to matti.pirinen'at'helsinki.fi.) The deadline is Sun 31.1.2016 at 23.59.

**Structure of the report:**

- Start with a compact **Abstract** that tells what is included in the report and why it is important. In practice, the Abstract may be the last thing you write or finish for the report, after you know exactly what is included.
- Have a short **Introduction** that puts the topic in its context with respect to GWA studies. With more statistical topics, you may also refer to a more general formulation of the problem in statistics or to other fields of science that tackle similar problems.
- Use your own judgment on how best to present the main content of the project from your own angle. If you use R or other software to demonstrate your topic, you should include the code at the end of the report as an **Appendix**. Short and compact pieces of code (a few lines in easily readable form) can also be included in the main report. The choice is yours; think about what is clearest.
- End the report with a compact **Conclusion** section that describes your own conclusion on the topic.
- Add **References** at the end, for example, by numbering them and referring to them with ('number') in the text. Alternatively, you can refer to them by ('Surname, year') in the text, in which case use alphabetical order in the reference section.
- You may include **Appendices**.

As a guideline, the amount of work expected for the credits is to read at least two scientific publications carefully and with good understanding, and to report what you have learned in your own words. Note that in some projects you may spend most of your time on simulations and data analysis rather than on reading papers, and that is completely OK.

Possible topics include the following. Proposing your own topic is not only possible but encouraged!

**GWAS essay.** Write a report on what a GWAS is by going carefully through one recent large disease GWAS and one quantitative trait GWAS and by reporting how each step of the GWAS process was completed in those two studies. For each GWAS step, also include a general summary of why that step is important in GWAS, in addition to an explanation of how it was conducted in your example GWASs.

Quantitative traits: Height, BMI, Lipids.

Diseases: Schizophrenia, Coronary artery disease, Multiple sclerosis, Inflammatory bowel disease, Rheumatoid arthritis, Type 2 Diabetes.

**GWAS data.** Generate some realistic GWAS data by using public databases of genome variation (HapMap or 1000 Genomes) and ready-made software for data generation (e.g. HAPGEN2). Run a GWAS using software such as SNPTEST or PLINK and report the results. You can generate population structure effects by simulating data from several populations and generating phenotypes that depend on the population, and you can then attempt to correct for those effects in the analysis using PCs as covariates (PCA can be done by shellfish, or you can use PLINK's 'multidimensional scaling', which is equivalent to PCA). You can also generate several true effects at SNPs in LD with each other that you can separate from each other only by conditional analysis (a command line option in SNPTEST and PLINK). There are two types of file formats you might need: GEN files (HAPGEN2, SNPTEST, shellfish) and PLINK's PED files. You can convert between the two using GTOOL. This project is likely to require a UNIX or Mac system to run this software. You can look at section 2.4 of practicals2.R for an example of how to start.

**Statistical imputation in GWAS.** Explain how hidden Markov models are used for genotype imputation and how the multivariate normal distribution is used for z-score imputation in GWAS data.

Marchini & Howie (2010) Genotype imputation for genome-wide association studies.

Pasaniuc et al. (2014) Fast and accurate imputation of summary statistics enhances evidence of functional enrichment.

**Phenotype prediction.** Explain the current methods for predicting an individual's phenotype from his/her genome data and how well they work.

Vilhjalmsson et al. (2015) Modeling Linkage Disequilibrium Increases Accuracy of Polygenic Risk Scores.

Speed & Balding (2014) MultiBLUP: improved SNP-based prediction for complex traits.

**Bayes factors in GWAS.** Explain the difference between the Bayesian approach and traditional P-value-based association testing. Demonstrate the differences using R and simulated data. See also the 'Bayes factor' part of practicals1.R.

Stephens and Balding (2009) Bayesian statistical methods for genetic association studies.

Wakefield (2009) Bayes Factors for Genome-Wide Association Studies: Comparison with P-values.

**Covariates in logistic regression.** Explain the surprising effects of covariate adjustment in case-control studies and how covariates should be used in those studies. Demonstrate with R.

Mefford & Witte (2012) The Covariate's Dilemma.

Pirinen et al. (2012) Including known covariates can reduce power to detect genetic effects in case-control studies.

Zaitlen et al. (2012) Informed Conditioning on Clinical Covariates Increases Power in Case-Control Association Studies.

**Significance levels in GWAS.** Explain the simulation experiments that have been carried out to determine a relevant significance threshold in GWAS. Also explain, more generally, the connection between significance thresholds, statistical power, and the probability that a statistical association is not a false positive.

Sham & Purcell. (2014) Statistical power and significance testing in large-scale genetic studies.

Pe'er et al. (2008) Estimation of the multiple testing burden for genomewide association studies of nearly all common variants.

Dudbridge & Gusnanto (2008) Estimation of significance thresholds for genomewide association scans.

Hoggart et al. (2008) Genome-Wide Significance for Dense SNP and Resequencing Data.
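As a starting point for the 'Significance levels in GWAS' topic, the link between a significance threshold, power, and the probability that a significant association is real can be sketched with Bayes' rule. The sketch below is not taken from the course material. It is written in Python for illustration only (the course itself uses R). The function names and the values chosen for `prior` and `power` are hypothetical. Power is held fixed across thresholds for simplicity, although in practice it decreases as the threshold becomes stricter.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate
    at most family_alpha across n_tests tests (Bonferroni correction)."""
    return family_alpha / n_tests


def prob_true_association(alpha, power, prior):
    """P(variant is truly associated | test is significant at level alpha).

    alpha: significance threshold (false positive rate under the null)
    power: P(significant | variant is truly associated)
    prior: assumed proportion of tested variants that are truly associated
    """
    true_positive = power * prior
    false_positive = alpha * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the calculation.
    prior = 1e-5   # assume 1 in 100,000 tested variants is truly associated
    power = 0.5    # assume 50% power, kept fixed (a simplification)

    # 0.05 spread over one million tests gives the familiar 5e-8 level.
    genome_wide = bonferroni_threshold(0.05, 1_000_000)

    for alpha in (0.05, 1e-4, genome_wide):
        p = prob_true_association(alpha, power, prior)
        print(f"alpha = {alpha:.0e}: P(true association | significant) = {p:.4f}")
```

Under these assumed inputs, a significant result at the 0.05 level is almost certainly a false positive, whereas a result that passes the genome-wide threshold is most likely real. A project report could extend this by letting power depend on the threshold, sample size, and effect size.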

## Extra material

- Article series on GWAS in Nature Reviews Genetics.
- Anderson et al. (2010). Data quality control in genetic case-control association studies.
- Vukcevic. (2009). Bayesian and Frequentist Methods and Analyses of Genome-Wide Association Studies
- Pirinen et al. (2012) Including known covariates can reduce power to detect genetic effects in case-control studies
- Yang et al. (2010) Common SNPs explain a large proportion of the heritability for human height.
- Lääperi (2015) MSc thesis. Linear mixed models for estimating heritability and testing genetic association in family data
- Lee et al. (2014) Rare-Variant Association Analysis: Study Designs and Statistical Tests.
- Voight et al. (2012) Plasma HDL cholesterol and risk of myocardial infarction: a mendelian randomisation study
- Do et al. (2013) Common variants associated with plasma triglycerides and risk for coronary artery disease
- Video: "Genome-Wide Association Studies - Karen Mohlke (2012)"
- Video: "Understanding and Interpreting GWAS Data" by Paul de Bakker.

## Registration

Did you forget to register? What to do?

## Course feedback

Course feedback can be given at any point during the course.