Advanced Methods for Data Analysis

Faculty of Physics, Astronomy and Applied Computer Science,

Jagiellonian University in Kraków


Academic year 2023/2024



Office hours: Thursday, 10:00-11:00; room G-0-10.


Exams:

     31.01.2024, 12:00-13:30, room A-1-08

     23.02.2024, 12:00-13:30, room A-2-07

Topics for assessment.



Lectures
---------
Recommended books/articles for reading:

=> G. Cowan, "Statistical Data Analysis"
      recorded lectures at CERN Summer School 2023: link
=> F. James, "Statistical Methods in Experimental Physics"
=> I. Narsky, F. C. Porter, "Statistical Analysis Techniques in Particle Physics"
=> J. A. Rice, "Mathematical Statistics and Data Analysis"

=> K. Cranmer, "Practical Statistics for the LHC"



Date
Lecture slides
Additional material

Statistics and Data Analysis
Statistical methods for LHC (advanced)
5.10.2023 Introduction,   StatAnal-lecture-1
Efficiency uncertainties
12.10.2023 StatAnal-lecture-2, StatAnal-lecture-3
19.10.2023 Cancelled

26.10.2023 StatAnal-lecture-3, StatAnal-lecture-4, EW measurements at LEP2
9.11.2023 StatAnal-lecture-4 cont., LHCStatAnal-lecture-1 https://arxiv.org/pdf/1007.1727.pdf
16.11.2023 LHCStatAnal-lecture-2, LHCStatAnal-lecture-3 https://arxiv.org/pdf/1609.04150.pdf
https://arxiv.org/pdf/1807.05996.pdf
https://arxiv.org/pdf/2101.06944.pdf
HistFactory
pyhf
https://arxiv.org/pdf/2109.04981.pdf

 Multivariate Techniques and Machine Learning

23.11.2023 Unfolding-lecture

https://arxiv.org/pdf/1910.14654.pdf
F. Spano-Proc14-02/P52.pdf
https://arxiv.org/pdf/1611.01927.pdf

30.11.2023 MVandML-lecture-1          
https://arxiv.org/pdf/1506.02169.pdf
P.Bhat, Multivariate_Analysis_Methods_in_Particle_Physics
Understanding Deep Learning
https://arxiv.org/pdf/1806.11484.pdf
7.12.2023 MVandML-lecture-1a   MVandML-lecture-1b   MVandML-lecture-1c
 
MVandML-lecture-2 
https://iopscience.iop.org/article/10.1088/1748-0221/11/01/P01019/pdf
https://arxiv.org/pdf/1812.09722.pdf
ATL-PHYS-PUB-2019-033.pdf
ATL-PHYS-PUB-2020-018.pdf
GNN in ATLAS flavour tagging
14.12.2023 Cancelled


Physics Modeling, Simulation and Monte Carlo Methods
21.12.2023 PhysModelAndMC-lecture
(on-line)
https://www.coursera.org/learn/modeling-simulation-natural-processes

Regression, Classification, Clustering and Retrieval

 11.01.2024 DataScience-lecture-1: Regression
https://hastie.su.domains/ElemStatLearn/
https://hastie.su.domains/ISLP/ISLP_website.pdf
https://probml.github.io/pml-book/
https://www.deeplearningbook.org/
18.01.2024 DataScience-lecture-2: Classification

25.01.2024 DataScience-lecture-3: Clustering
DataScience-lecture-4: Retrieval

     

Assignments:

-------------------------

Date
Topic

ROOT/C++ or PyROOT
Datasets/Tutorials

Python + Anaconda
Datasets/Tutorials

   5.10.2023
Getting organised with the framework

Data exploration


Introduction-labs

Select a few examples from this link: histograms
e.g.:
h1draw
fibonacci
fillrandom
ratioplot1

or from the PyROOT examples at this link: PyROOT
HowToStart

assignment-0-python
assignment-0-numpy
assignment-0-numpy-matplotlib
assignment-0-pandas

HowToStart

DS-cheatsheet_numpy.pdf
DS_cheatsheet_matplotlib.pdf
DS_cheatsheet_jupyter_notebook.pdf
DS_cheatsheet_pandas.pdf

numpy tutorial
matrix algebra tutorial

Statistics and Data Analysis






12.10.2023 StatAnal_labs-lecture-1.txt
StatAnal_labs-lecture-2.txt




scripts in K. Cranmer, "Statistics and Data Science"

pyhf: Python-based fitting/limit-setting/interval estimation

26.10.2023 StatAnal_labs-lecture-3.txt
StatAnal_labs-lecture-4.txt

9.11-30.11.2023

StatAnal-project:
select one of the suggested topics or propose your own.

OR

Prepare a ~20 min presentation explaining the application of a statistical method to a physics measurement, based on an article published in a HEP journal.



1) Modeling tools: RooFit, RooStats and HistFactory
 LHCStatAnal-labs-4-Root
follow exercises there

2) Interval Estimation and Hypotheses Testing
by T. Dorigo, IN2P3 School on Statistics, 2018

slides
follow exercises there


3) Higgs signal at LHC
by I. van Vulpen, Terascale Statistics School, DESY 2018  slides,
exercises doc
DesyCode2018.tgz
follow exercises there

4) RooFit and RooStats
by W. Verkerke, Terascale Statistics School, DESY 2011

slides, part I
exercises, part I
macros, part I
slides, part II
exercises, part II
macros, part II
follow exercises there



1) Bayesian inference
Bayesian_inference.ipynb

2) Non-parametric inference
Non_parametric inference.ipynb

3) Gaussian processes
Gaussian_processes.ipynb
monthly_in_situ_co2_mlo.csv

4) Try out the pyhf tool on the exercises proposed for RooFit/RooStats

pyhf: Python-based fitting/limit-setting/interval estimation




Multivariate Techniques and Machine Learning

30.11.2023 - 11.01.2024

MVandML-project:
select one of the suggested topics or propose your own

OR

Prepare a ~20 min presentation explaining the application of an ML method to a physics measurement, based on an article published in a HEP journal.

1) BDTs and TMVA
by Y. Coadou,
IN2P3 School on Statistics, 2018
slides
Apply.C  Train.C
dataSchachbrett.root
follow exercises there

2) Unfolding
RooUnfold
slides
follow exercises there

3) Analysis of ATLAS open data with MV or ML methods

4) Analysis of ATLAS data
optimising electron identification


1) Principal Component Analysis
slides
PCA_Clustering.ipynb

2) Unfolding
Unfolding.ipynb

3) Analysis of ATLAS open data
BDT example for H->4l
infofile.py

4) Unfolding with Gaussian processes
https://github.com/adambozson/gp-unfold


Regression, Classification, Clustering and Retrieval





11.01.2024 - 30.01.2024
Select one topic; you may start from the linked scripts and build them into a bigger project.

OR

choose a second StatAnal-project or MVandML-project




Assignment_1_without_code.ipynb
Assignment_2_without_code.ipynb
kc_house_data.csv.zip, info on kc_house

Assignment_4_without_code.ipynb
amazon_baby.csv.zip

Assignment_5_without_code.ipynb
people_wiki.cvs.zip
Other datasets
lending-club-data.csv.zip
http://snap.stanford.edu/data/amazon/
http://mlr.cs.umass.edu/ml/datasets.html
https://data.world/

Aachen Online Statistics School 2023
https://indico.desy.de/event/37562/timetable/

Statistical Analysis in HEP Physics

N. Berger,
Foundation of Statistics, Lectures at CERN Summer School 2019
link1, link2, link3

Statistics and Data Science

K. Cranmer, Course at NYU Physics,  Fall 2020, link

Machine learning applications in HEP physics:

B. Nachman,

"Advanced Machine Learning for Classification, Regression, and Generation in Jet Physics"

M. Stoye,

"ML applications in CMS"

ML techniques in HEP,  Workshop, Berkeley Laboratory, 11 - 13 December 2018
https://indico.physics.lbl.gov/indico/event/546/

A. Castaneda,
LHCP conference, Puebla, Mexico, 2019
ML and Big data tools at HEP,

The last part of the course will be based on materials from Coursera:


Dr. Mine Çetinkaya-Rundel    
"Data Analysis and Statistical Inference"
C. Guestrin and E. Fox, "Machine Learning Specialisation"
        Foundation: link
        Regression: link
        Classification: link
        Clustering and Retrieval: link

Related interesting material from Coursera
D. Peng, J. Leek and B. Caffo, "Exploratory Data Analysis"
J. Leskovec, A. Rajaraman and J. Ullman, "Mining Massive Datasets"
B. Caffo, R. D. Peng and J. Leek, "Regression Models"
B. Chopard et al., "Simulation and modeling of natural processes"

Collection of datasets

http://mlr.cs.umass.edu/ml/datasets.html
http://faculty.marshall.usc.edu/gareth-james/ISL/data.html
http://snap.stanford.edu/data/amazon/


Useful links:
https://turi.com/download/install-graphlab-create-aws-coursera.html
https://turi.com/download/academic.html
https://github.com/turi-code/SFrame

Clustering
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Boosting
https://turi.com/learn/userguide/supervised-learning/boosted_trees_classifier.html
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
-----------------
[1] https://class.coursera.org/statistics
[2] http://www.openintro.org/stat/textbook.php
[3] https://class.coursera.org/exdata-006
[4] https://class.coursera.org/mmds
[5] http://www.mmds.org/
[6] http://www.cs.cmu.edu/~awm/tutorials.html

Additional materials
http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
http://www.youtube.com/watch?v=wQhVWUcXM0A


Last modified: 12 November 2023

Elzbieta Richter-Was

