Advanced Methods for Data Analysis

Faculty of Physics, Astronomy and Applied Computer Science,

Jagiellonian University in Kraków


Academic year 2021/2022



Office hours: Tuesday, 15:00 - 16:00; room G-0-10.


Exam (written), winter session:
1.02.2022, 9:00-11:00, online via MS Teams
2.02.2022, 13:30-16:00, online via MS Teams
   

List of topics


Lectures

---------
Recommended books/articles for reading:

=> G. Cowan, "Statistical Data Analysis"
=> F. James, "Statistical Methods in Experimental Physics"
=> J. A. Rice, "Mathematical Statistics and Data Analysis"
=> I. Narsky, F. C. Porter, "Statistical Analysis Techniques in Particle Physics"

=> K. Cranmer, "Practical statistics for LHC"



Date
Lecture slides
Additional material

Statistics and Data Analysis
Statistical methods for LHC (advanced)
5.10.2021 Introduction,   StatAnal-lecture-1
Efficiency uncertainties
 12.10.2021
StatAnal-lecture-2,    StatAnal-lecture-3,
19.10.2021
StatAnal-lecture-3,    StatAnal-lecture-4,
EW measurements at LEP2
 26.10.2021
LHCStatAnal-lecture-1, LHCStatAnal-lecture-2 https://arxiv.org/pdf/1007.1727.pdf
2.11.2021
Dean's hours (no lecture)

9.11.2021
LHCStatAnal-lecture-2 cont.,
LHCStatAnal-lecture-3
https://arxiv.org/pdf/1609.04150.pdf
https://arxiv.org/pdf/1807.05996.pdf
https://arxiv.org/pdf/2101.06944.pdf
HistFactory
pyhf
https://arxiv.org/pdf/2109.04981.pdf

Physics Modeling, Simulation and Monte Carlo Methods
16.11.2021 PhysModelAndMC-lecture
online via MS Teams
https://www.coursera.org/learn/modeling-simulation-natural-processes

 Multivariate Techniques and Machine Learning

23.11.2021
Unfolding-lecture, online via MS Teams
https://arxiv.org/pdf/1910.14654.pdf
F. Spano-Proc14-02/P52.pdf
https://arxiv.org/pdf/1611.01927.pdf

30.11.2021 MVandML-lecture-1             on-line with MSTeams https://arxiv.org/pdf/1506.02169.pdf
P.Bhat, Multivariate_Analysis_Methods_in_Particle_Physics
7.12.2021
MVandML-lecture-1a           on-line with MSTeams
812.09722.pdf
14.12.2021 MVandML-lecture-1b, MVandML-lecture-1c
https://iopscience.iop.org/article/10.1088/1748-0221/11/01/P01019/pdf
21.12.2021
MVandML-lecture-2, online by order of the Rector
https://arxiv.org/pdf/1812.09722.pdf
ATL-PHYS-PUB-2019-033.pdf
ATL-PHYS-PUB-2020-018.pdf

Regression, Classification, Clustering and Retrieval

4.01.2022 DataScience-lecture-1: Regression, online by order of the Rector

11.01.2022 DataScience-lecture-2: Classification

18.01.2022 DataScience-lecture-3: Clustering

25.01.2022 DataScience-lecture-4: Retrieval, online by order of the Dean

     

Assignments:

-------------------------

Date
Topic

ROOT/C++ or PyROOT
Datasets/Tutorials

Python + Anaconda
Datasets/Tutorials

   5.10.2021
Getting organised with the framework

Data exploration


Introduction-labs

Select a few examples from this link: histograms
e.g.:
h1draw
fibonacci
fillrandom
ratioplot1

or from the PyROOT examples at this link: PyRoot
HowToStart

assignment-0-python
assignment-0-numpy
assignment-0-numpy-matplotlib
assignment-0-pandas

HowToStart

DS-cheatsheet_numpy.pdf
DS_cheatsheet_matplotlib.pdf
DS_cheatsheet_jupyter_notebook.pdf
DS_cheatsheet_pandas.pdf

numpy tutorial
matrix algebra tutorial

Statistics and Data Analysis

12.10.2021 StatAnal_labs-lecture-1.txt
StatAnal_labs-lecture-2.txt




Scripts in K. Cranmer, Statistics and Data Science

pyhf: Python-based fitting, limit setting and interval estimation

19.10.2021 StatAnal_labs-lecture-3.txt
StatAnal_labs-lecture-4.txt

26.10-30.11.2021

StatAnal-project:
Select one of the suggested topics or propose your own.

OR

Solve more theoretical/conceptual problems: link

OR

Prepare a ~20 min presentation explaining the application of a statistical method to a physics measurement, based on an article published in a HEP journal.



1) Modeling tools: RooFit, RooStats and HistFactory
 LHCStatAnal-labs-4-Root
follow exercises there

2) Interval Estimation and Hypotheses Testing
by T. Dorigo, IN2P3 School on Statistics, 2018

slides
follow exercises there


3) Higgs signal at LHC
by I. van Vulpen, Terascale Statistics School, DESY 2018  slides,
exercises doc
DesyCode2018.tgz
follow exercises there

4) RooFit and RooStats
by W. Verkerke, Terascale Statistics School, DESY 2011

slides, part I
exercises, part I
macros, part I
slides, part II
exercises, part II
macros, part II
follow exercises there



1) Bayesian inference
Bayesian_inference.ipynb

2) Non-parametric inference
Non_parametric inference.ipynb

3) Gaussian processes
Gaussian_processes.ipynb
monthly_in_situ_co2_mlo.csv

4) Try out the pyhf tool on the exercises proposed for RooFit/RooStats

pyhf: Python-based fitting, limit setting and interval estimation




Multivariate techniques and Machine Learning

30.11.2021 - 11.01.2022

MVandML-project:
Select one of the suggested topics or propose your own.

OR

Prepare a ~20 min presentation explaining the application of an ML method to a physics measurement, based on an article published in a HEP journal.

1) BDTs and TMVA
by Y. Coadou,
IN2P3 School of Statistics, 2018
slides
Apply.C  Train.C
dataSchachbrett.root
follow exercises there

2) Unfolding
RooUnfold
slides
follow exercises there

3) Analysis of ATLAS open data with MV or ML methods

4) Analysis of ATLAS data
optimising electron identification


1) Principal Component Analysis
slides
PCA_Clustering.ipynb

2) Unfolding
Unfolding.ipynb

3) Analysis of ATLAS open data
BDT example for H->4l
infofile.py

4) Unfolding with Gaussian processes
https://github.com/adambozson/gp-unfold


Regression, Classification, Clustering and Retrieval





11.01.2022 - 30.01.2022
Select one topic; you may start from the linked scripts and build them into a bigger project.

OR

Choose a second StatAnal-project or MVandML-project.




Assignment_1_without_code.ipynb
Assignment_2_without_code.ipynb
kc_house_data.csv.zip, info on kc_house

Assignment_4_without_code.ipynb
amazon_baby.csv.zip

Assignment_5_without_code.ipynb
people_wiki.cvs.zip
Other datasets
lending-club-data.csv.zip
http://snap.stanford.edu/data/amazon/
http://mlr.cs.umass.edu/ml/datasets.html
https://data.world/

Statistical Analysis in HEP Physics

N. Berger,
Foundation of Statistics, Lectures at CERN Summer School 2019
link1, link2, link3

Statistics and Data Science

K. Cranmer, Course at NYU Physics,  Fall 2020, link

Machine learning applications in HEP physics:

B. Nachman,

"Advanced Machine Learning for Classification, Regression, and Generation in Jet Physics"

M. Stoye,

"ML applications in CMS"

ML techniques in HEP,  Workshop, Berkeley Laboratory, 11 - 13 December 2018
https://indico.physics.lbl.gov/indico/event/546/

A. Castaneda,

"ML and Big data tools at HEP", LHCP conference, Puebla, Mexico, 2019

The last part of the course will be based on materials from Coursera:


Dr. Mine Çetinkaya-Rundel    
"Data Analysis and Statistical Inference"
C. Guestrin and E. Fox, "Machine Learning Specialisation"
        Foundation: link
        Regression: link
        Classification: link
        Clustering and Retrieval: link

Related interesting material from Coursera
R. D. Peng, J. Leek and B. Caffo, "Exploratory Data Analysis"
J. Leskovec, A. Rajaraman and J. Ullman, "Mining Massive Datasets"
B. Caffo, R. D. Peng and J. Leek, "Regression Models"
B. Chopard et al., "Simulation and modeling of natural processes"

Collection of datasets

http://mlr.cs.umass.edu/ml/datasets.html
http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
http://snap.stanford.edu/data/amazon/


Useful links:
https://turi.com/download/install-graphlab-create-aws-coursera.html
https://turi.com/download/academic.html
https://github.com/turi-code/SFrame

Clustering
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Boosting
https://turi.com/learn/userguide/supervised-learning/boosted_trees_classifier.html
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
-----------------
[1] https://class.coursera.org/statistics
[2] http://www.openintro.org/stat/textbook.php
[3] https://class.coursera.org/exdata-006
[4] https://class.coursera.org/mmds
[5] http://www.mmds.org/
[6] http://www.cs.cmu.edu/~awm/tutorials.html

Additional materials
http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
http://www.youtube.com/watch?v=wQhVWUcXM0A


Last modified: 1 October 2021

Elzbieta Richter-Was

