Introduction to Data Science (for physicists)

Faculty of Physics, Astronomy and Applied Computer Science,

Jagiellonian University in Kraków


Academic year 2020/2021



Office hours: Tuesday, 15:00 - 16:00; room G-0-10.


Exam (theory part, written):
Monday, 1.02.2021, 14.00 - 16.00; details will be announced later
Monday, 22.02.2021, 14.00 - 16.00
Topics for assessment: link

                                            

Lectures

---------
Recommended books/articles for reading:

=> G. Cowan, "Statistical Data Analysis"
=> F. James, "Statistical Methods in Experimental Physics"
=> J. A. Rice, "Mathematical Statistics and Data Analysis"
=> I. Narsky, F. C. Porter, "Statistical Analysis Techniques in Particle Physics"

=> K. Cranmer, "Practical Statistics for the LHC"



Date         Lecture slides                                   Additional material

Statistics and Data Analysis
Statistical methods for LHC (advanced)
13.10.2020   Introduction, StatAnal-lecture-1                 Efficiency uncertainties
20.10.2020   StatAnal-lecture-2, StatAnal-lecture-3
27.10.2020   StatAnal-lecture-3, StatAnal-lecture-4           EW measurements at LEP2
 3.11.2020   LHCStatAnal-lecture-1, LHCStatAnal-lecture-2     https://arxiv.org/pdf/1007.1727.pdf
10.11.2020   LHCStatAnal-lecture-2 cont.
17.11.2020   LHCStatAnal-lecture-3                            HistFactory, Pyhf

 Multivariate Techniques and Machine Learning

24.11.2020   MVandML-lecture-1    https://arxiv.org/pdf/1506.02169.pdf
 1.12.2020   MVandML-lecture-2    https://arxiv.org/pdf/1812.09722.pdf
 8.12.2020   Unfolding-lecture    https://arxiv.org/pdf/1910.14654.pdf, F. Spano-Proc14-02/P52.pdf

Physics Modeling, Simulation and Monte Carlo Methods

15.12.2020   PhysModelAndMC-lecture    https://www.coursera.org/learn/modeling-simulation-natural-processes


Regression, Classification, Clustering and Retrieval

20.12.2020 DataScience-lecture-1: Regression

12.01.2021 DataScience-lecture-2: Classification

19.01.2021 DataScience-lecture-3: Clustering

26.01.2021 DataScience-lecture-4: Retrieval



Assignments:

-------------------------

Date    Topic    ROOT/C++ or PyROOT (datasets/tutorials)    Python + Anaconda (datasets/tutorials)

   13.10.2020
Getting organised with the framework

Data exploration


Introduction-labs

Select a few examples from this link: histograms,
e.g.:
h1draw
fibonacci
fillrandom
ratioplot1

or from the PyROOT examples at this link: PyRoot
HowToStart

assignement-0-python
assignement-0-numpy
assignement-0-numpy-matplotlib
assignement-0-pandas

HowToStart

DS-cheatsheet_numpy.pdf
DS_cheatsheet_matplotlib.pdf
DS_cheatsheet_jupyter_notebook.pdf
DS_cheatsheet_pandas.pdf

numpy tutorial
matrix algebra tutorial

Statistics and Data Analysis




Scripts in K. Cranmer, Statistics and Data Science

pyhf: Python-based fitting/limit-setting/interval estimation

20.10.2020 StatAnal_labs-lecture-1.txt
StatAnal_labs-lecture-2.txt

27.10.2020 StatAnal_labs-lecture-3.txt
StatAnal_labs-lecture-4.txt

3.11-29.11.2020

StatAnal-project:
select one of the suggested topics or propose your own,

OR

solve more theoretical/conceptual problems: link



1) Modeling tools: RooFit, RooStats and HistFactory
 LHCStatAnal-labs-4-Root
follow exercises there

2) Interval Estimation and Hypotheses Testing
by T. Dorigo, IN2P3 School on Statistics, 2018

slides
follow exercises there


3) Higgs signal at LHC
by I. van Vulpen, Terascale Statistics School, DESY 2018  slides,
exercises doc
DesyCode2018.tgz
follow exercises there

4) RooFit and RooStats
by W. Verkerke, Terascale Statistics School, DESY 2011

slides, part I
exercises, part I
macros, part I
slides, part II
exercises, part II
macros, part II
follow exercises there



1) Bayesian inference
Bayesian_inference.ipynb

2) Non-parametric inference
Non_parametric inference.ipynb

3) Gaussian processes
Gaussian_processes.ipynb
monthly_in_situ_co2_mlo.csv

4) Try out the pyhf tool on the exercises proposed for RooFit/RooStats

pyhf: Python-based fitting/limit-setting/interval estimation




Multivariate techniques and Machine Learning





24.11.2020

1.12.2020

8.12.2020-10.01.2021 MVandML-project

1) BDTs and TMVA
by Y. Coadou,
IN2P3 School on Statistics, 2018
slides
Apply.C  Train.C
dataSchachbrett.root
follow exercises there

2) Unfolding
RooUnfold
slides
follow exercises there

3) Analysis of ATLAS open data with MV or ML methods

4) Analysis of ATLAS data
optimising electron identification


1) Principal Component Analysis
slides
PCA_Clustering.ipynb

2) Unfolding.ipynb

3) Analysis of ATLAS open data
BDT example for H->4l
infofile.py

4) Unfolding with Gaussian processes
https://github.com/adambozson/gp-unfold


Physics Modeling, Simulation and Monte Carlo Methods





15.12.2020
PhysModel_lab.txt

Monte Carlo methods
by M. Chrzaszcz, ETH Zurich
follow a few exercises there


MC.ipynb


Regression, Classification, Clustering and Retrieval





20.12.2020-26.01.2021 Select one topic; you may start from the linked scripts and build them into a bigger project.




Assignment_1_without_code.ipynb
Assignment_2_without_code.ipynb
kc_house_data.csv.zip, info on kc_house

Assignment_4_without_code.ipynb
amazon_baby.csv.zip

Assignment_5_without_code.ipynb
people_wiki.csv.zip
Other datasets
lending-club-data.csv.zip
http://snap.stanford.edu/data/amazon/
http://mlr.cs.umass.edu/ml/datasets.html
https://data.world/









Statistical Analysis in HEP Physics

N. Berger,
Foundation of Statistics, Lectures at CERN Summer School 2019
link1, link2, link3

Statistics and Data Science

K. Cranmer, Course at NYU Physics,  Fall 2020, link

Machine learning applications in HEP physics:

B. Nachman,
"Advanced Machine Learning for Classification, Regression, and Generation in Jet Physics"

M. Stoye,
"ML applications in CMS"

ML techniques in HEP,  Workshop, Berkeley Laboratory, 11 - 13 December 2018
https://indico.physics.lbl.gov/indico/event/546/

A. Castaneda,
"ML and Big data tools at HEP",
LHCP conference, Puebla, Mexico, 2019

The last part of the course will be based on materials from Coursera:


Dr. Mine Çetinkaya-Rundel    
"Data Analysis and Statistical Inference"
C. Guestrin and E. Fox, "Machine Learning Specialisation"
        Foundation: link
        Regression: link
        Classification: link
        Clustering and Retrieval: link

Related interesting material from Coursera
R. D. Peng, J. Leek and B. Caffo, "Exploratory Data Analysis"
J. Leskovec, A. Rajaraman and J. Ullman, "Mining Massive Datasets"
B. Caffo, R. D. Peng and J. Leek, "Regression Models"
B. Chopard et al., "Simulation and modeling of natural processes"

Collection of datasets

http://mlr.cs.umass.edu/ml/datasets.html
http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
http://snap.stanford.edu/data/amazon/


Useful links:
https://turi.com/download/install-graphlab-create-aws-coursera.html
https://turi.com/download/academic.html
https://github.com/turi-code/SFrame

Clustering
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Boosting
https://turi.com/learn/userguide/supervised-learning/boosted_trees_classifier.html
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
-----------------
[1] https://class.coursera.org/statistics
[2] http://www.openintro.org/stat/textbook.php
[3] https://class.coursera.org/exdata-006
[4] https://class.coursera.org/mmds
[5] http://www.mmds.org/
[6] http://www.cs.cmu.edu/~awm/tutorials.html

Additional materials
http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
http://www.youtube.com/watch?v=wQhVWUcXM0A


Last modified: 22 October 2020

Elzbieta Richter-Was

