Introduction to Open Data Science, spring 2017

Last modified by kvehkala@helsinki_fi on 2024/02/07 06:37


Motivation: Our era of data - larger than ever and complex like chaos - requires several skills from statisticians and other data scientists. We must discover the patterns hidden behind numbers in matrices and arrays. We are not afraid of coding, recoding, programming, or modelling. We want to visualize, analyze, interpret, understand, and communicate. These are the core themes of Open Data Science (Open Data - Open Science - Data Science). And this course is THE course for learning these skills.

The above text is modified from: https://www.crcpress.com/Correspondence-Analysis-in-Practice-Third-Edition/Greenacre/p/book/9781498731775?tab=rev

Teacher

Kimmo Vehkalahti, University Lecturer, Adj.Prof., D.Soc.Sci (Statistics)
Fellow of the Teachers' Academy

Assistant teachers

Emma Kämäräinen, Tuomo Nieminen, Petteri Mäntymaa (students of Statistics/Data Science)

 

Thursday 19 January 2017


THE COURSE HAS STARTED, AND THIS Wiki PAGE WILL NOT BE NEEDED OR UPDATED ANYMORE.

 

See our video "Welcome to the course!"

and check the newest information (published just before the course started) from:

 

https://courses.helsinki.fi/78995/115961424 (those pages replace these Wiki pages)

 

You may enroll until 25 January 2017.

If you study at the University of Helsinki, register for the course in WebOodi.
Otherwise, just enroll on the MOOC platform (see the information at the above link).

 

 

*******************************************************************************************************************************************


General learning objective

After completing this course you will understand the principles and advantages of using open research tools with open data, and the possibilities of reproducible research. You will know how to use R, RStudio, R Markdown, and GitHub for these tasks, and how to learn more about these open software tools. You will also know how to apply certain statistical methods of data science, that is, data-driven statistics.

Practical info

  • WebOodi: 78995 (5 credits), language: English, campus: City Centre.
  • Period III, starting 19 Jan 2017, ending 2 Mar 2017.
  • Weekly workshop: Thu 8-10, Unioninkatu 35, lecture room
  • We recommend that you bring a laptop computer (Mac/Windows/Linux) to the workshops.
  • Please be prepared to work hard for several hours each week.

New course for everyone interested in Open Data Science!

  • Primarily intended for doctoral students in the (Computational) Social Sciences and (Digital) Humanities.
  • Master's students are also welcome, and the course is suitable even for Bachelor's studies (at least in Statistics).
  • We learn to use open software tools of Data Science and to analyze openly available data sets.
  • R, RStudio, R Markdown, and GitHub will be learned and used throughout the course.

Contents

The course consists of 7 chapters, one for each week of the teaching period.

Chapter 1 introduces the tools (DataCamp, R, RStudio, GitHub) and the weekly working methods (reports, peer reviews, strict deadlines) of the course.

Chapters 2-6 introduce various topics to be worked on with the tools and methods of Chapter 1, using different data sets.

Chapter 7 introduces a special assignment for wrapping up the course after the six-week phase of weekly work.

1 Tools and methods for open and reproducible research

  • R
  • RStudio
  • R Markdown
  • GitHub

2 Regression and model validation

  • Simple regression

  • Multiple regression

  • Regression diagnostics

3 Logistic regression

  • Logistic regression
  • Cross-validation: training set and test set

4 Clustering and classification

  • K-means clustering (KMC)

  • Discriminant analysis (DA)

5 Dimensionality reduction techniques

  • Principal component analysis (PCA)

  • Correspondence analysis (CA, MCA)

6 Multivariate statistical modelling

  • Confirmatory factor analysis (CFA)

  • Structural equation models (SEM)

7 Final assignment

  • Doctoral/Master's level: using a new data set (perhaps your own)
  • Bachelor level: using a data set that has been used earlier on the course

Register for the course

 

Statistics – it’s not what you think it is.
