Trip Dates: 15 – 30 Nov 2015
The purpose of our visit to Brazil was to teach a modern course on statistics and data analysis in the context of high-energy physics to students and faculty of Rio de Janeiro State University (UERJ), the Federal University of Rio de Janeiro (UFRJ) and Amazonas State University (UEA).
Organization: Yara Coutinho, a UFRJ faculty member and member of the ATLAS collaboration, and Alberto Santoro, head of the UERJ CMS group, organized the course, which was taught by two experts from CERN, Sergei Gleyzer and Lorenzo Moneta.
The main goal of the course was to provide a solid theoretical foundation in modern statistics together with practical tools for conducting modern data analysis. In particular, the course concentrated on the fundamentals of statistical data analysis and machine learning, the science of algorithms that improve their performance with experience.
Recently, machine learning has seen an explosion of applications in many fields, in particular in high-energy physics. Like other technical disciplines, high-energy physics has fostered developments and breakthroughs in this area by producing novel algorithms, customized software and solutions for challenges that are unique to particle physics.
Using machine-learning techniques, particle physicists have been able to increase the sensitivity of analyses looking for new physics, to help discover new particles and to improve the performance of expensive particle physics detectors. This course focused on the fundamentals of this important research area and aimed to enhance participants’ skills in applying machine-learning techniques to everyday research.
Participants: the course was held on the UERJ campus and was attended by 30 students, post-docs, researchers and professors from UERJ and UFRJ, and by researchers from UEA, who connected by video from Manaus, Brazil. One third of the participants were professors, while the other two thirds were students and staff.
Preparation: a client-server web application (Jupyter) was set up at UERJ and used in the hands-on sessions for sharing live code and equations. It will remain available for further educational use at UERJ.
Course Structure: the two-week visit was organized in two distinct parts. During the first week, a series of lectures was presented on statistical issues relevant to analyzing Large Hadron Collider data with ROOT, covering probability theory, Bayesian and Frequentist statistics, parameter estimation, fitting, confidence intervals and discovery significance. The following lectures were presented:
- Introduction to ROOT. Basic functionality of ROOT, simple examples with ROOT prompt, plotting functions and data points.
- Introduction to C++. A quick review of the basic C++ features needed to work with ROOT: classes, objects and pointers.
- Working with histograms. Main functionality of the ROOT histogram classes for analyzing and visualizing data.
- ROOT I/O and Trees. Binary format storage using the serialization capabilities of the ROOT framework. Using the TTree class to store and analyze data.
- Introduction to Statistics. Probability, Bayesian and Frequentist statistics, parameter estimation and hypothesis tests.
- Fitting in ROOT and Introduction to RooFit. Fitting and regression analysis in ROOT, parameter estimation with uncertainties, and modeling the expected distribution of events in a physics analysis with the dedicated tool RooFit.
- Advanced statistical analysis with ROOT. Estimating confidence intervals and discovery significance with RooStats.
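The discovery-significance topic from the last lecture can be illustrated with a small counting-experiment sketch. It is not RooStats code and is not taken from the course materials. It assumes a single-bin experiment with expected signal `s` and an exactly known expected background `b`, and the function names are hypothetical. It compares the widely used asymptotic formula Z = sqrt(2((s + b) ln(1 + s/b) - s)) with the simpler approximation s/sqrt(b).

```python
import math


def asymptotic_significance(s, b):
    """Median expected discovery significance for a counting experiment.

    Assumes expected signal s >= 0 and exactly known background b > 0
    (no systematic uncertainty on b). Hypothetical helper, not a ROOT API.
    """
    if b <= 0:
        raise ValueError("background b must be positive")
    if s < 0:
        raise ValueError("signal s must be non-negative")
    if s == 0:
        return 0.0
    return math.sqrt(2.0 * ((s + b) * math.log(1.0 + s / b) - s))


def simple_significance(s, b):
    """Approximation s / sqrt(b), reasonable only when s is much smaller than b."""
    if b <= 0:
        raise ValueError("background b must be positive")
    return s / math.sqrt(b)


if __name__ == "__main__":
    # Illustrative numbers only, not results from any analysis.
    for s, b in [(10.0, 100.0), (10.0, 1.0)]:
        z_full = asymptotic_significance(s, b)
        z_simple = simple_significance(s, b)
        print(f"s={s}, b={b}: asymptotic Z={z_full:.3f}, s/sqrt(b)={z_simple:.3f}")
```

The two expressions agree closely when the signal is small compared with the background, while s/sqrt(b) overestimates the significance when the background is small.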
The second part of the course focused on machine-learning theory and fundamentals, covering topics such as supervised and unsupervised learning, methods and algorithms, ensemble learning, boosting, feature selection, classification and regression, as well as more advanced topics such as deep learning.
Examples from published analyses and concrete machine-learning applications in high-energy physics were described in detail. Students were encouraged to first try standard machine-learning examples based on the classic C4.5 package, and then progress to more advanced examples and tutorials using the ROOT-integrated Toolkit for Multivariate Data Analysis (TMVA).
The following lectures on machine-learning topics were presented:
- Introduction to Machine Learning. Fundamentals of Machine-Learning theory, high-energy physics examples and applications.
- Machine Learning 2: Classification theory, feature selection, linear and quadratic discriminants, decision trees, decision rules, pruning, overtraining.
- Machine Learning 3: Shallow neural networks, ensemble methods, boosting, bagging, cross-validation, support vector machines.
- Machine Learning 4: Random Forests, function estimation, regression, multi-objective regression, feature engineering, Bayesian machine learning, genetic algorithms.
- Machine Learning 5: Deep Learning.
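As an illustration of the boosting and decision-tree topics listed above, the following is a minimal sketch of AdaBoost with one-level decision trees (stumps) on one-dimensional data. It is not TMVA code and is not taken from the course exercises. The function names and the toy dataset are assumptions made for this example only.

```python
import math


def stump_predict(x, threshold, polarity):
    """One-level decision tree: returns polarity if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Train an ensemble of weighted stumps; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        # Clamp the error to avoid division by zero or log(0).
        err = min(max(err, 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data (invented for illustration): no single threshold separates the classes.
XS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
YS = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

if __name__ == "__main__":
    for rounds in (1, 3):
        model = adaboost(XS, YS, rounds)
        errors = sum(predict(model, x) != y for x, y in zip(XS, YS))
        print(f"rounds={rounds}: training errors={errors}")
```

Each round reweights the training points so that the next stump concentrates on the examples the previous ones misclassified, which is how a combination of weak classifiers can fit data that no single stump can.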
The course also discussed how to get involved in machine-learning activities at CERN via the Inter-experimental LHC Machine Learning Working Group (IML).
Each lecture was complemented with a series of hands-on exercises, presented in the form of ROOT macros and ROOT notebooks. This was the first time that this new notebook technology was used with ROOT tutorials. Most of the students used laptops to run ROOT interactively on a server, while several students used their smartphones.
Course Materials: all the materials for the course, including lectures, exercises and tutorials, are available on the course event and TWiki pages: https://indico.cern.ch/event/402660/
Evaluation: overall, the course was found to be productive and useful, and it was very well attended. The course setup allowed various active-learning techniques to be applied with a high degree of success. Many questions were asked during the course, triggering active and interesting discussions and resulting in excellent student engagement and participation. In particular, participants were enthusiastic about learning machine learning. The feedback from students after using ROOT with the Jupyter server technology was very positive.
Impact and follow-up collaboration: by presenting this course at an external institute, we increased the visibility of work performed at CERN and expanded the user base of ROOT and its related machine-learning tools (TMVA). The course also highlighted the use of modern machine-learning techniques and gave us an opportunity to introduce new users to the latest machine-learning knowledge.
Further collaboration with the Brazilian scientists who participated in the course, and their involvement in CERN machine-learning activities, is planned. In particular, the following ideas and projects have been discussed, with some already in progress:
1) Connecting matrix element analysis with deep learning:
S. Gleyzer, A. Sznajder (UERJ), S. Fonseca (UERJ) and students from UERJ
2) Evaluation of deep learning software in TMVA on Higgs Data:
S. Gleyzer, L. Moneta, O. Zapata Mesa, A. Sznajder (UERJ), P. Rebello Telles (CBPF)
3) Tuning of Monte Carlo generators with machine learning techniques:
P. Rebello Telles (CBPF), S. Gleyzer
4) Organization of a special session dedicated to machine learning during the 2017 CMS week in Rio de Janeiro:
A. Santoro (UERJ), S. Gleyzer
Jupyter Notebook is a web-based interactive computing platform that combines code, equations, text and visualizations. A ROOT notebook is a ROOT-integrated Jupyter notebook.