Introduction to Data Science (for physicists)

Faculty of Physics, Astronomy and Applied Computer Science,

Jagiellonian University in Kraków


Academic year 2019/2020



Office hours: Tuesday, 13:00 - 14:00; room 2-D-11.

Exam (theory part, written): 31.01.2020, 14:00-16:00, A-1-13
List of topics: link


Lectures & Assignments:

-----------------------------------------------------


Date
Lecture slides
Python scripts with assignments
Datasets
Tutorials

   8.10.2019
Introduction
Data exploration

assignement-0-python
assignement-0-numpy
assignement-0-numpy-matplotlib
assignement-0-pandas

assignment-1

kc_house_data.csv.zip
info on kc_house

HowToStart

DS-cheatsheet_numpy.pdf
DS_cheatsheet_matplotlib.pdf
DS_cheatsheet_jupyter_notebook.pdf
DS_cheatsheet_pandas.pdf
15.10.2019
Regression I
assignment-2


S. Raschka
python-machine-learning-book


rashka_ch04.ipynb
rashka_ch10.ipynb
22.10.2019
Regression II
assignment-3


numpy tutorial
matrix algebra tutorial
29.10.2019
Classification I
assignment-4

amazon_baby.csv.zip
scikit-learn: LogisticRegression
5.11.2019
Classification II




 


12.11.2019
Retrieval&clustering I
assignment-5
or
PCA_Clustering
slides
people_wiki.cvs.zip

mnist_train.csv.gz
clusters.gif.gz
pca.gif.gz



http://cs229.stanford.edu/notes/cs229-notes10.pdf

19.11.2019
Retrieval&clustering II


Switching to physics

Material for reading:
=> G. Cowan, "Statistical Data Analysis".
=> F. James, "Statistical Methods in Experimental Physics".
=> I. Narsky, F. Porter, "Statistical Analysis Techniques in Particle Physics".
=> K. Cranmer, "Practical Statistics for the LHC".



Advanced projects: (Root based, physics)
BDTs and TMVA
by Y. Coadou,
IN2P3 School on Statistics, 2018
slides
Apply.C  Train.C
dataSchachbrett.root
follow exercises there

root-numpy,
examples: write.py, read.py
HowToActivate

Interval Estimation and Hypotheses Testing
by T. Dorigo, IN2P3 School on Statistics, 2018
slides
follow exercises there



Unfolding
S. Schmitt, D. Britzger, DESY School 2014
root based: RooUnfold
slides

python based: pynfold
follow exercises there



Higgs signal at LHC
by I. van Vulpen, Terascale Statistics School, DESY 2018
slides,
exercises doc
DesyCode2018.tgz
follow exercises there



RooFit and RooStats
by W. Verkerke, Terascale Statistics School, DESY 201
slides, part I
exercises, part I
macros, part I
slides, part II
exercises, part II
macros, part II
follow exercises there


Assignments of your choice (complete 3):
(Python based)
Principal Component Analysis
slides
PCA_Clustering.ipynb


Monte Carlo methods
by M. Chrzaszcz, ETH Zurich
MC.ipynb



Bayesian inference
Bayesian_Inference.ipynb



Unfolding
Unfolding.ipynb



Non-parametric inference
Non_parametric.ipynb



Gaussian processes
Gaussian_processes.ipynb
monthly_in_situ_co2_mlo.csv


26.11.2019
Statistics and Data Analysis-part 1
exercises-part_1


  3.12.2019
Statistics and Data Analysis-part 2
Statistics and Data Analysis-part 3
exercises-part_2
exercises-part_3


10.12.2019
Statistics and Data Analysis-part 4
exercises-part_4


16.12.2019
Class cancelled



  7.01.2020
Modeling, simulation, Monte Carlo methods
MC.ipynb

14.01.2020
Class cancelled



21.01.2020
Machine Learning and Multivariate analyses
Statistical methods for LHC


28.01.2020
Unfolding algorithms and RooUnfold




The first part of the course (lectures until 19.11) is based on materials from Coursera:


Dr. Mine Çetinkaya-Rundel    
"Data Analysis and Statistical Inference"
C. Guestrin and E. Fox, "Machine Learning Specialisation"
        Foundation: link
        Regression: link
        Classification: link
        Clustering and Retrieval: link

Related interesting material from Coursera
R. D. Peng, J. Leek and B. Caffo, "Exploratory Data Analysis"
J. Leskovec, A. Rajaraman and J. Ullman, "Mining Massive Datasets"
B. Caffo, R. D. Peng and J. Leek, "Regression Models"
B. Chopard et al., "Simulation and modeling of natural processes"

Data Science applications in physics:
B. Nachman, "Advanced Machine Learning for Classification, Regression, and Generation in Jet Physics"
M. Stoye, "ML applications in CMS"
ML techniques in HEP,  Workshop, Berkeley Laboratory, 11 - 13 December 2018
https://indico.physics.lbl.gov/indico/event/546/

Collection of datasets

http://mlr.cs.umass.edu/ml/datasets.html
http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
http://snap.stanford.edu/data/amazon/


Useful links:
https://turi.com/download/install-graphlab-create-aws-coursera.html
https://turi.com/download/academic.html
https://github.com/turi-code/SFrame

Clustering
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Boosting
https://turi.com/learn/userguide/supervised-learning/boosted_trees_classifier.html
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
-----------------
[1] https://class.coursera.org/statistics
[2] http://www.openintro.org/stat/textbook.php
[3] https://class.coursera.org/exdata-006
[4] https://class.coursera.org/mmds
[5] http://www.mmds.org/
[6] http://www.cs.cmu.edu/~awm/tutorials.html

Additional materials
http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
http://www.youtube.com/watch?v=wQhVWUcXM0A

Link to lectures given in 2017
http://th-www.if.uj.edu.pl/~erichter/dydaktyka/Dydaktyka2017/AiSAD-2017/index.html 


Link to lectures given in 2014
http://th-www.if.uj.edu.pl/~erichter/dydaktyka/Dydaktyka2014/AiSAD-2014/index.html 
 

Last modified: 8 November 2019

Elzbieta Richter-Was

