CERN-SFT GSoC 2017 ideas page

CERN-SFT has successfully participated in the Google Summer of Code program since 2011, and we are taking part again in 2017! This year, we are applying under the umbrella organization of the HEP Software Foundation (HSF). This page lists the project ideas for CERN-SFT; information about other HSF-related projects can be found at the new HSF GSoC website.

At CERN-SFT, we intend to offer three options. Please pick one of the project ideas grouped into categories, take a look at the section dedicated to 'Blue sky' projects, or propose your own great idea for this summer. We are looking forward to hearing about your proposals. A list of our mentors (and their areas of expertise) can be found here.

We encourage students who plan to apply to contact us about their interests and explain their project ideas as early as possible. Our experience from past GSoCs is that initial student applications either need to be reworked in close collaboration with a future mentor, or at least benefit from feedback and discussion. Please do not forget to provide us with some relevant information about yourself (for example a CV, past projects, personal page or blog, LinkedIn profile, GitHub account, etc.).

Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC Web page before April 3, 2017.

Project ideas

ROOT Projects

Big Data Tools for Physics Analysis

Spark is an open-source framework for large-scale data processing on clusters. While it has become mainstream in industry, its adoption in the field of physics is still in its infancy. This project intends to explore the use of Spark for physics analysis at CERN, and in particular its interplay with two technologies: (i) ROOT, a software toolkit widely used for high-energy physics analysis, and (ii) Jupyter notebooks, a well-known interface for interactive analysis.

The main development of this project will focus on making it easier to manage Spark computations from a Jupyter notebook. A plugin will be developed so that notebook users can monitor the status of a Spark job submitted from a notebook cell, and even cancel it if necessary. The main use case of the plugin will be a parallel physics analysis with ROOT and Spark, with a possible second use case in distributed machine learning. The plugin can then be integrated into the SWAN notebook pilot service at CERN.

Task ideas:  

  • Creation of a testbed for submission of Spark jobs from a Jupyter notebook
  • Design and implementation of a plugin to monitor Spark jobs from a notebook
    • Display information such as progress bars, task statistics and errors
    • Allow cancellation of ongoing Spark jobs
  • Use cases: apply the plugin to a couple of distributed Spark applications
    • ROOT physics analysis
    • Machine learning
  • Tests on CERN IT infrastructure (Spark clusters)
  • Integration of the plugin into the SWAN notebook service at CERN
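
The monitoring loop at the heart of such a plugin can be sketched in pure Python: a background thread polls a status source and records progress until the job completes or is cancelled. The `poll_status`/`cancel_job` callables below are hypothetical stand-ins for Spark's monitoring and cancellation interfaces, not real Spark APIs; the actual plugin would also need a JavaScript frontend for display.

```python
import threading
import time

class JobMonitor:
    """Background polling loop, as the notebook plugin would need.

    `poll_status` and `cancel_job` are hypothetical callables standing in
    for Spark's monitoring/cancellation interfaces.
    """

    def __init__(self, poll_status, cancel_job, interval=0.01):
        self.poll_status = poll_status
        self.cancel_job = cancel_job
        self.interval = interval
        self.progress = []                  # recorded (done, total) snapshots
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def _run(self):
        while not self._stop.is_set():
            done, total = self.poll_status()
            self.progress.append((done, total))
            if done >= total:               # job finished: stop polling
                break
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def cancel(self):                       # user pressed "cancel" in the UI
        self._stop.set()
        self.cancel_job()
        self._thread.join()

    def wait(self):
        self._thread.join()

# Simulate a job that completes 10 tasks, 2 per poll.
state = {"done": 0}

def poll():
    state["done"] = min(state["done"] + 2, 10)
    return state["done"], 10

monitor = JobMonitor(poll, cancel_job=lambda: None)
monitor.start()
monitor.wait()
```

The recorded snapshots are what the frontend would render as progress bars and task statistics.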

Expected results: working implementation of the notebook plugin to manage ROOT-Spark jobs

Requirements: Spark, Python, JavaScript, Jupyter notebooks

Mentors: Enric Tejedor (etejedor@cern.ch), Danilo Piparo (danilo.piparo@cern.ch), Prasanth Kothuri (prasanth.kothuri@cern.ch), Kacper Surdy (kacper.surdy@cern.ch)

Links:
http://root.cern
http://spark.apache.org
http://jupyter.org

Improvements in vectorization and parallelization of ROOT Math libraries

HEP software applications require a large amount of computing resources, and their computing performance is an important issue, in particular to satisfy their ever-increasing requirements. Since 2005, we no longer benefit from the automatic gains due to the increase in processor clock frequency. The growth in the number of transistors on a chip now translates into an increase in the number of cores rather than an improvement in the performance of each core. To tackle these challenges, the ROOT project has been undertaking a re-engineering to adapt its Math libraries to run in multiple concurrent threads and make an efficient use of the vector units (SIMD).

The chosen candidate will continue the re-engineering of the vectorization of the mathematical function interfaces and of the fitting functions, as well as the parallelization of the latter.

Task ideas:

  • Completion of parallelization and vectorization of all the fitting methods available in ROOT. 
  • Adapt gradient function interfaces for thread-based parallelization and vectorization.
  • Vectorization of TFormula and “predefined ROOT functions”.
  • Vectorization of most used mathematical and statistical functions in ROOT::Math and TMath.

Expected results: For each one of these areas the student will be expected to provide tests, reliable benchmarks and speed-up results. At the end of GSoC, ROOT should be capable of fitting in parallel and making use of vectorization to evaluate both user-specified and predefined formulas.
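
As a rough illustration of the thread-based parallelism involved, the chi-square objective of a fit can be evaluated by splitting the data into partitions and summing partial results. This is a minimal pure-Python sketch of the idea only; in ROOT the same pattern is implemented in C++ with real threads and SIMD-vectorized function evaluation.

```python
from concurrent.futures import ThreadPoolExecutor

def model(x, a, b):
    # A TF1-like fit model: straight line.
    return a * x + b

def chi2_chunk(xs, ys, a, b):
    # Partial chi-square over one data partition (unit errors assumed).
    return sum((y - model(x, a, b)) ** 2 for x, y in zip(xs, ys))

def chi2_parallel(xs, ys, a, b, nthreads=4):
    # Split the data into one partition per thread and sum partial results.
    step = (len(xs) + nthreads - 1) // nthreads
    chunks = [(xs[i:i + step], ys[i:i + step]) for i in range(0, len(xs), step)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = pool.map(lambda c: chi2_chunk(c[0], c[1], a, b), chunks)
    return sum(parts)

xs = list(range(100))
ys = [2.0 * x + 1.0 for x in xs]
```

A minimizer would call `chi2_parallel` at each parameter point; the partitioned sum is exactly what becomes a parallel reduction in the C++ fitting code.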

Requirements: Strong knowledge of C++11; being able to produce clean, reliable code; No need for background in math, although basic understanding of equations is expected. Basic notions of vectorization are a plus.

Mentors: Xavier Valls, Lorenzo Moneta

Links: 
ROOT
VecCore
Vc

Enhance C-Reduce to work with ROOT

C-Reduce is a tool which aims to reduce bug reports. It transforms users' source files to make them as minimal as possible. Minimal bug reproducers are easy to debug and convert into regression tests. C-Reduce is fairly mature and well adapted to minimizing crashes in compilers. The project will mainly focus on making C-Reduce easier to use with ROOT and its interactive C++ interpreter, Cling.
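
C-Reduce's core idea is delta debugging: repeatedly delete parts of the input while an "interestingness" predicate confirms the bug still reproduces. A toy sketch of that reduction loop (C-Reduce itself applies many more transformations, down to the Clang AST level):

```python
def reduce_input(lines, still_fails):
    """Greedy delta reduction: drop ever-smaller chunks of lines as long
    as `still_fails` confirms the bug still reproduces."""
    chunk = len(lines) // 2 or 1
    while chunk >= 1:
        reduced = True
        while reduced:
            reduced = False
            i = 0
            while i < len(lines):
                candidate = lines[:i] + lines[i + chunk:]
                if candidate and still_fails(candidate):
                    lines = candidate       # keep the smaller reproducer
                    reduced = True
                else:
                    i += chunk
        chunk //= 2
    return lines

# Toy 'bug': the reproducer fails whenever it still contains the bad line.
program = ["int a;", "int b;", "BAD;", "int c;", "int d;"]
minimal = reduce_input(program, lambda ls: "BAD;" in ls)
```

For ROOT, the predicate would run the candidate file through Cling and check that the original crash still occurs.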

Task ideas and expected results:  

  • Extend C-Reduce to easily reduce ROOT bug reports.
  • Optionally extend C-Reduce to minimize ROOT's data files.
  • Implement tests for all the implemented functionality.
  • Prepare a final poster of the work and be ready to present it.

Requirements: Intermediate level of C++, some experience with Clang

Mentors: Vassil Vassilev

Message Passing Interface for ROOT (ROOTMpi)

Project Description: By standardizing the way different machines communicate during a running process, we can analyze bigger chunks of data in less time. ROOT MPI allows communicating ROOT's native objects on top of the C/C++ raw data types. ROOT's serialization methods and the optimal design of the new C++ standard let the user focus on the algorithm instead of the low-level syntax.
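
The underlying idea — shipping structured objects as raw byte messages — can be sketched with Python's pickle standing in for ROOT's serialization and a pipe standing in for an MPI send/receive pair (illustration only; ROOT MPI itself uses ROOT streamers over MPI calls):

```python
import pickle
from multiprocessing import Pipe

# A structured object standing in for a ROOT object (e.g. a histogram).
histogram = {"name": "h1", "bins": [0, 3, 7, 2], "entries": 12}

sender, receiver = Pipe()

# "Serialize and send": the step a message wrapper hides from the user.
sender.send_bytes(pickle.dumps(histogram))

# "Receive and deserialize" on the other rank.
received = pickle.loads(receiver.recv_bytes())
```

The received object is a faithful copy, not a reference, which is exactly the contract a communicated ROOT object must satisfy.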

Task ideas:  

  • Extend existing communication schemas
  • Write support for MPI files (possibly considering a design to integrate it with TFile)
  • Write memory window, shared memory and RMA support
  • Checkpoint support, at least in Open MPI and MPICH
  • Improve the error handler system and the messages in the output
  • User tools:
    • Write a profiling tool
    • Write a benchmarking module
    • Integrate the valgrind command into the ROOTMpi command for debugging

Expected results:  
  • Working implementation with tests and documentation
  • A machine learning example integrated with TMVA that uses ROOTMpi
  • Performance comparison of a basic example between ROOTMpi and PROOF

Requirements: advanced skills in C/C++, experience in parallel programming with MPI

Mentors: Omar Zapata, Lorenzo Moneta

Source code: ROOT Mpi

Integration of Apache Parquet reading capabilities in ROOT data analyses

ROOT is a C++ software toolkit widely used for high-energy physics (HEP) analyses. It specializes in reading and writing large amounts of data efficiently, using a custom storage format known as ROOT files. TDataFrame is a new tool that allows ROOT users to quickly and efficiently set up data analyses through a high-level declarative syntax (i.e. functional chains).
The aim of this project is to seamlessly interface TDataFrame with common storage formats other than ROOT files, such as Apache Parquet. Users of TDataFrame will be able to transparently read data stored in these formats with minimal or no changes required to the analysis code.
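
The flavour of such a declarative chain can be mocked in a few lines of Python. The `Filter`/`Define`/`Count` names mirror the TDataFrame style, but this eager toy is purely illustrative (the real TDataFrame is lazy, written in C++, and runs a single event loop for all registered actions):

```python
class DataFrame:
    """Eager toy functional chain; names mimic the TDataFrame style but
    this is a hypothetical illustration, not the real interface."""

    def __init__(self, rows):
        self.rows = list(rows)

    def Filter(self, pred):
        return DataFrame(r for r in self.rows if pred(r))

    def Define(self, name, func):
        return DataFrame({**r, name: func(r)} for r in self.rows)

    def Count(self):
        return len(self.rows)

# Rows could come from a ROOT file today, or a Parquet file after this project.
rows = [{"pt": pt} for pt in (5, 15, 25, 35)]
n = (DataFrame(rows)
     .Define("pt2", lambda r: r["pt"] ** 2)
     .Filter(lambda r: r["pt2"] > 400)
     .Count())
```

The point of the project is that the data source feeding such a chain becomes pluggable, so the analysis code above would not change when the input format does.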

Task ideas and expected results:  

  • design of new internal facilities for ROOT to read from third-party file formats such as Apache Parquet
  • design of a user-transparent interface to access these facilities through TDataFrame
  • implementation, integration and testing of this interface in ROOT's TDataFrame, to make this feature readily available to users

Requirements: strong C++ and C++11 skills, familiarity with the Apache Parquet file format. Familiarity with the ROOT software library is highly desirable.

Mentors: Enrico Guiraud​, Danilo Piparo


Machine Learning Projects

Machine Learning Project 1. Convolutional Deep Neural Networks on GPUs for Particle Physics Applications

Project Description: Toolkit for Multivariate Analysis (TMVA) is a multi-purpose machine learning toolkit integrated into the ROOT scientific software framework, used in many particle physics data analyses and applications. Last year we expanded TMVA's capabilities to include a feed-forward deep learning library (DNN) that supports interactive training on GPUs. This summer we would like to expand the toolkit with optimized convolutional deep neural networks (CNNs). CNNs have very promising applications in particle physics, such as particle and event classification, imaging calorimetry and particle tracking, allowing physicists to use new techniques to identify particles and search for new physics.

Task ideas and expected results:

  • Production-ready convolutional deep learning library
  • Design of first-stage filters
  • Support for GPUs for training
    • integration with existing low-level interface designed for the DNN library

Requirements: strong C++ skills, solid knowledge of deep learning, understanding of convolutional networks, familiarity with GPU interfaces a plus

Mentors: Sergei Gleyzer​ and Lorenzo Moneta

Web page:  TMVA

Links:
http://root.cern
http://root.cern/tmva


Source code: DNN

Machine Learning Project 2. Multi-target Regression for Particle Physics

Project Description: Toolkit for Multivariate Analysis (TMVA) is a multi-purpose machine learning toolkit integrated into the ROOT scientific software framework, used in many particle physics data analyses and applications. We have recently expanded the regression (continuous function estimation) capabilities of the toolkit to include multiple loss functions. One important open area of machine learning research is multi-target (or multi-objective) regression, where the goal is to estimate multiple continuous target functions at once. This summer we would like to expand the toolkit's capabilities in multi-target function estimation. This has promising applications in particle physics research, such as particle transformations, fast detector simulation and many more, enabling particle physicists to make faster discoveries.

Task ideas and expected results:

  • Extension of existing single-target estimation/regression capabilities to multiple targets
  • Primarily targeting Boosted Decision Trees and Deep Neural Networks
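
A minimal picture of what multi-target regression means, in the simplest possible setting: fit one linear model per continuous target by gradient descent on the summed squared loss. The toolkit's actual BDT/DNN implementations would share structure between targets rather than fit them independently; this sketch only fixes the terminology.

```python
# Toy multi-target setting: two continuous targets of one feature,
# each fitted with its own weight by gradient descent on squared loss.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [[2.0 * x for x in xs],    # target 0: y = 2x
           [-1.0 * x for x in xs]]   # target 1: y = -x

w = [0.0, 0.0]                        # one weight per target
lr = 0.01
for _ in range(2000):
    for k in range(len(targets)):
        # Gradient of sum_i (w*x_i - y_i)^2 with respect to w.
        grad = 2.0 * sum(x * (w[k] * x - y) for x, y in zip(xs, targets[k]))
        w[k] -= lr * grad
```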

Requirements: strong C++ skills, understanding of machine learning and in particular, its application to function estimation/regression

Mentors: Sergei Gleyzer​, Lorenzo Moneta, Omar Zapata

Web page:  TMVA

Links:
http://root.cern
http://root.cern/tmva


Source code: TMVA

Additional Machine Learning Project Ideas and Areas of Interest

Project Description: Toolkit for Multivariate Analysis (TMVA) is a multi-purpose machine learning toolkit integrated into the ROOT scientific software framework, used in many particle physics data analyses and applications. The following are also areas of interest with impactful applications to particle physics.

Task ideas and expected results:

  • Unsupervised learning: deep auto-encoders, restricted Boltzmann machines (RBMs)
  • Deep learning: recurrent neural networks (RNNs), LSTMs, complex-valued neural networks

Requirements: strong C++ skills, good understanding of machine learning algorithms

Mentors: Sergei Gleyzer​, Lorenzo Moneta​, Omar Zapata​, Stefan Wunsch​, Enrico Guiraud

Web page:  TMVA

Links:
http://root.cern
http://root.cern/tmva


Source code: TMVA

Sixtrack Numerical Accelerator Simulation​ Projects

SixTrack is a software package for simulating and analysing the trajectories of high-energy particles in accelerators. It has been used in the design and optimization of the LHC and is now being used to design the High-Luminosity LHC (HL-LHC) upgrade that will be installed in the next decade. SixTrack has been adapted to take advantage of the large-scale volunteer computing resources provided by the LHC@Home project. It has been engineered to give the exact same results after millions of operations on several, very different computer platforms. The source code is written in Fortran and is pre-processed by two programs that assemble the code blocks and provide automatic differentiation of the equations of motion. The code relies on the crlibm library, careful arrangement of parentheses, dedicated input/output and selected compilation flags for the most common compilers to provide identical results on different platforms and operating systems. An option enables the use of the BOINC library for volunteer computing. A running environment, SixDesk, is used to generate input files, split simulations for LHC@Home or the CERN cluster, and collect the results for the user. SixTrack is licensed under LGPLv2.1.

A strong background in computer science and programming languages, as well as an interest in understanding the computational physics methods implemented in the code, is sought. The unique challenge offered is to work with a high-performance production code that is used for the highest-energy accelerator in the world - and thus the code's reliability and backward compatibility cannot be compromised. There will be the opportunity to learn about methods used in simulating the motion of particles in accelerators.

Optimize and Integrate Standalone Tracking Library (SixTrackLib)

Description: Optimize the data structures and source code of a standalone tracking library written in C in order to take advantage of the automatic vectorization offered by GCC and Clang for CPUs and of explicit vectorization with OpenCL and CUDA for GPUs. The code must then be linked with the legacy SixTrack code and replace the inner loop.

Expected results: Demonstrate that the C code can run existing simulations at the same or faster speed on CPUs and can saturate the computing capacity of GPUs, while keeping bit-level identical results.

Mentors: Riccardo De Maria, Giovanni Iadarola

Requirements: Experience with Fortran, C, OpenCL, CUDA, vectorization techniques.

Web Page: cern.ch/sixtrack

Source Code: github.com/SixTrack/SixTrack

New physics models

Description: Implement, test and put in production new solvers for generic RF multipoles, combined-function magnets and radiation effects.

Expected results: The user can build accelerator simulations with more accurate models for low energy machines and/or machines with radiation effects.

Mentors: Riccardo De Maria, Kyrre Sjobak

Requirements: Fortran, calculus, accelerator physics.

Web Page: cern.ch/sixtrack

Source Code: github.com/SixTrack/SixTrack

Code Benchmark and Consolidation

Description: Benchmark simulation size against result throughput on different architectures and identify I/O, memory and CPU bottlenecks. Evaluate the impact of merging branches on these performance figures.

Expected results: An analysis of result throughput depending on the architecture, identification of bottlenecks in the source code, and an assessment of the impact of branches.

Mentors: Riccardo De Maria, Kyrre Sjobak

Requirements: Fortran, calculus, accelerator physics.

Web Page: cern.ch/sixtrack

Source Code: github.com/SixTrack/SixTrack

Continuous integration and test coverage with CDash and Coverity

Description: The SixTrack build system has recently been updated to use CMake and integrated with CDash. This enables us to implement best practices in testing by improving test coverage, test speed and nightly test cycles.

Expected results: Approach 100% code coverage with tests while reducing test running time. Implement automatic triggers that launch the test suite and update CDash on each commit.

Mentors: Riccardo De Maria, Kyrre Sjobak

Requirements: Fortran, calculus, accelerator physics.

Web Page: cern.ch/sixtrack

Source Code: github.com/SixTrack/SixTrack

Detector simulation projects

Optimisation of GeantV HPC workload balancing by use of unsupervised machine learning

Description:

GeantV is a project aiming to deploy massively parallel detector simulation workloads from workstations to distributed computing resources, including HPC clusters. The GeantV concurrency model [1] optimally exploits a single node using threads and features such as vectorization and machine topology discovery. We are developing a model to adapt GeantV to also make efficient use of HPC resources, minimizing the processing tails due to the inhomogeneity of distributed environments. The goal of the summer project is to test and optimize the HPC deployment model, using the idea of optimized job scheduling based on evolutionary tuning of GeantV job parameters and on spectral clustering [3] of HPC resources into optimal subregions based on a graph representation of them.

 

Task ideas:

  • Implement multi-tier master-worker communication layer using MPI or alternative communication protocol (ZeroMQ or others).
  • Design and implement multi-layer graph workflow[2] for HPC layer, integrate spectral clustering [3] in job processing schema.
  • Deploy and test the code on a cluster, adding an evolutionary tuning module, and deploy spectral clustering [4].

Expected deliverables: Deployment of GeantV in a cluster, demonstrating improvement in workload balancing compared to naive workload deployment.

Optimization based on ML techniques demonstrating improvements with respect to the initial heuristic-based implementation.
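
A naive baseline for the workload-balancing comparison could be a greedy longest-processing-time schedule; the ML-based tuning would have to beat heuristics of this kind. Sketch only, not GeantV code:

```python
import heapq

def lpt_schedule(job_costs, n_workers):
    """Longest-processing-time first: give each job, largest first, to the
    currently least-loaded worker. A classic baseline balancing heuristic."""
    loads = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(loads)
    assignment = {w: [] for w in range(n_workers)}
    for cost in sorted(job_costs, reverse=True):
        load, w = heapq.heappop(loads)      # least-loaded worker
        assignment[w].append(cost)
        heapq.heappush(loads, (load + cost, w))
    return assignment

# Six jobs with known costs, two workers: LPT splits the 24 units as 12/12.
plan = lpt_schedule([7, 5, 4, 3, 3, 2], 2)
```

The makespan (maximum worker load) of such a plan is the figure the optimized scheduler would try to reduce on inhomogeneous resources, where per-job costs are not known in advance.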

 

Desired programming experience: C/C++

 

Desired knowledge: HPC, linear algebra, machine learning algorithms


Link to relevant websites: geant.web.cern.ch

Mentors: Andrei Gheata, Oksana Shadura

References:

  1. Apostolakis, J and Bandieramonte, M and Bitzes, G and Brun, R and Canal, P and Carminati, F and Licht, JC De Fine and Duhem, L and Elvira, VD and Gheata, A and others, Adaptive track scheduling to optimize concurrency and vectorization in GeantV, Journal of Physics: Conference Series, 608, 1, 012003, 2015, IOP Publishing

  2. Mikko Kivelä, Alex Arenas, Marc Barthelemy, James P. Gleeson, Yamir Moreno, Mason A. Porter; Multilayer networks. J Complex Netw 2014; 2 (3): 203-271. doi: 10.1093/comnet/cnu016

  3. Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. 2001. On spectral clustering: analysis and an algorithm. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS'01), T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.). MIT Press, Cambridge, MA, USA, 849-856.

  4. Xiaowen Dong and Pascal Frossard and Pierre Vandergheynst and Nikolai Nefedov, Clustering with Multi-Layer Graphs: A Spectral Perspective, CoRR, abs/1106.2233, 2011, http://arxiv.org/abs/1106.2233

New error control methods for integration of trajectories

Geant4 and GeantV use Runge-Kutta methods to integrate the motion of charged particles in a non-uniform electromagnetic field. Methods must provide good integration accuracy at a minimum cost in computation time. Integration is used to identify the intersection point between the curved track and the volume boundaries. Due to the large number of steps and the cost of the field evaluations, integration and intersection are a performance-critical part of detector simulation. Recent work has introduced new RK methods which reduce the number of field evaluations required and have the potential to decrease the computation time.

Task ideas:

  • Introduce new methods for the control of integration error of existing integration Runge Kutta methods
  • Create heuristics for choosing an appropriate method (helix, RK of a particular order, Bulirsch-Stoer), using knowledge of the accuracy requirements and the variation of the derivative in the proposed step.

Expected results: Working implementation of a) improved alternative error control methods for RK integration, and/or b) integration methods which combine different methods, e.g. RK methods of different order, for improved performance.
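
The classic step-doubling scheme illustrates what an RK error controller does: take one full step and two half steps, and use their difference as a local error estimate that can drive step-size adaptation. A minimal scalar sketch, not the Geant4 implementation:

```python
import math

def rk4_step(f, y, t, h):
    # One classical 4th-order Runge-Kutta step for y' = f(t, y).
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2 * k1)
    k3 = f(t + h/2, y + h/2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)

def step_with_error(f, y, t, h):
    """Step-doubling: compare one full step against two half steps;
    their difference estimates the local truncation error."""
    big = rk4_step(f, y, t, h)
    half = rk4_step(f, y, t, h/2)
    small = rk4_step(f, half, t + h/2, h/2)
    return small, abs(small - big)

# Integrate y' = y from t=0 to 1 with fixed h; y(1) should approach e.
f = lambda t, y: y
y, t, h = 1.0, 0.0, 0.1
while t < 1.0 - 1e-12:
    y, err = step_with_error(f, y, t, h)
    t += h
```

An adaptive stepper would compare `err` against the requested tolerance and grow or shrink `h` accordingly; the project's cheaper error estimators aim to obtain `err` without the extra function evaluations step-doubling costs.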

Requirements: This project requires prior exposure to Numerical Analysis and familiarity with either C++, C or Java programming.  Exposure to either numerical methods for solving Ordinary Differential equations (ODEs) or tools for analysing data such as ROOT or R will be valuable. Both programming skill and knowledge of numerical methods for ODEs will be improved by undertaking this project.

Mentor: John Apostolakis
Web site: http://cern.ch/geant4
Source code: https://github.com/jonapost/field_propagation

Using pseudo-random numbers repeatably in a fine-grained multithreaded simulation

Description:

Particle transport Monte Carlo simulations are a key tool for High Energy Physics experiments, including the LHC experiments at CERN. All Monte Carlo (MC) simulations depend vitally on the Pseudo-Random Number Generators (PRNGs) used to sample many distributions. Each LHC experiment generates 5-10 billion sampled physics events a year, using around 10^18 sampled values from PRNGs, and PRNGs take 1-5% of CPU time. The PRNGs used must possess very large periods, fast execution, the ability to create a large number of streams, and very good correlation properties between streams and between sampled values. It must be possible to reproduce a simulated event at any time, for re-examination or debugging.

Task ideas and expected results:  

  • Create a prototype that changes the use of random number generation in a Geant simulation so that the PRNG state is associated with each track:
    • When a secondary is created, attach to it either a seed or a new state for the PRNG, for use in that track.
    • Participate in extending the interfaces of different PRNGs to provide different ways to choose the state of the PRNG for a secondary.
    • Create implementations including MCRGs, counter-based PRNGs and novel PRNGs with guarantees of stream independence.
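
The per-track PRNG idea can be sketched as follows: each track owns its generator state, and a secondary's state is derived deterministically from the parent's seed and the secondary's index, so replaying an event reproduces every sample regardless of thread scheduling. The seeding scheme here is purely illustrative; the real designs mentioned above would use counter-based or provably independent streams.

```python
import random

class Track:
    """A track owns its PRNG state, so its simulation is reproducible
    independently of thread scheduling (sketch, not Geant4 code)."""

    def __init__(self, seed):
        self.seed = seed
        self.rng = random.Random(seed)

    def make_secondary(self, index):
        # Derive the child's seed deterministically from (parent seed, index).
        # Illustrative only: real designs need provably independent streams.
        return Track(hash((self.seed, index)))

    def sample(self, n):
        return [self.rng.random() for _ in range(n)]

parent = Track(12345)
s1 = parent.make_secondary(0).sample(3)
s2 = Track(12345).make_secondary(0).sample(3)   # replaying the event
```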

Requirements: programming in C/C++ or Java at least for class projects is required. Use of Linux/Unix, cmake and/or a coding IDE (Eclipse, Xcode) will be assets.
Knowledge gained: use of Pseudo-Random Number Generators and state of the art of PRNGs.

Mentors: John Apostolakis​, Sandro Wenzel

Web page:  Geant4

Source code: https://github.com/Geant4/geant4/


Pseudo-Random Number Generator Projects

Extend the TestU01 suite for PRNGs

Description
The TestU01 suite is an established suite of empirical tests for pseudorandom number generators. It has been used since 200x as a litmus test for potential PRNG algorithms and implementations.  Given the power of current computers, the length of the tests only probes number sequences which can be generated in tens of minutes on the fastest machines. The idea of this project is to select a larger set of configurations, improve the implementations and create a new 'standard' for tests of PRNGs.

Potential project topics:
1. Create a 64-bit version of the software, so that more than 32 bits of the output values are tested.
2. Use faster algorithms to improve the implementation of certain tests.
3. Design and implement specific tests for parallel RNGs.
4. Construct a parallel version of TestU01 that can run on parallel processors.
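
For orientation, the simplest empirical test in this family bins uniform samples and compares the counts to the flat expectation with a chi-square statistic; TestU01's batteries apply many far more sensitive tests at much larger sample sizes. A pure-Python sketch:

```python
import random

def equidistribution_chi2(rng, n_samples, n_bins):
    """Bin uniform samples and return the chi-square statistic against the
    flat expectation (rng only needs a .random() method)."""
    counts = [0] * n_bins
    for _ in range(n_samples):
        counts[int(rng.random() * n_bins)] += 1
    expected = n_samples / n_bins
    return sum((c - expected) ** 2 / expected for c in counts)

# A good generator gives a statistic near the chi-square mean (n_bins - 1).
stat = equidistribution_chi2(random.Random(42), 100000, 10)

class Constant:
    """A deliberately broken 'generator' that always returns 0.5."""
    def random(self):
        return 0.5

bad = equidistribution_chi2(Constant(), 100000, 10)  # huge statistic
```

Scaling `n_samples` far beyond what fits in tens of minutes on one machine is exactly where the 64-bit and parallel versions above come in.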

References: 
[1] P. L'Ecuyer and R. Simard. TestU01: A C library for empirical testing of random number generators. ACM Transactions on Mathematical Software, 33(4): Article 22, August 2007.
Mentors: John Apostolakis, Pierre L'Ecuyer

Web page: http://simul.iro.umontreal.ca/testu01/tu01.html
Source code:  http://simul.iro.umontreal.ca/testu01/tu01.html

 

Other Projects


Extend clad - the Automatic Differentiation Library

In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative to symbolic differentiation and numerical differentiation (the method of finite differences). Clad is based on Clang, which provides the necessary facilities for code transformation. The AD library is able to differentiate non-trivial functions, to find a partial derivative for trivial cases, and has good unit test coverage. There was a proof-of-concept implementation for computation offload using OpenCL.
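
Forward-mode AD can be illustrated with dual numbers: carrying a (value, derivative) pair through arithmetic yields exact derivatives. Clad achieves the same mathematics by source-to-source transformation of the Clang AST rather than by operator overloading; this Python sketch only shows the underlying principle.

```python
class Dual:
    """A (value, derivative) pair propagated through arithmetic."""

    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    def _coerce(self, o):
        return o if isinstance(o, Dual) else Dual(float(o))

    def __add__(self, o):
        o = self._coerce(o)
        return Dual(self.val + o.val, self.der + o.der)

    __radd__ = __add__

    def __mul__(self, o):
        o = self._coerce(o)
        # Product rule: (fg)' = f'g + fg'
        return Dual(self.val * o.val, self.der * o.val + self.val * o.der)

    __rmul__ = __mul__

def derivative(f, x):
    # Seed der=1.0 to differentiate with respect to x.
    return f(Dual(x, 1.0)).der

# d/dx (3x^2 + 2x) = 6x + 2, so the derivative at x = 4 is 26.
d = derivative(lambda x: 3 * x * x + 2 * x, 4.0)
```

Because each intermediate value carries its derivative independently, the generated derivative code is a natural candidate for the OpenCL/CUDA offload the project targets.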

Task ideas and expected results:  

  • The student should teach clad to generate OpenCL/CUDA code automatically for a given derivative.
  • The implementation should be very well tested and documented. Prepare a final poster of the work and be ready to present it.

Requirements: Advanced C++, Clang abstract syntax tree (AST), CUDA/OpenCL basic math

Mentors: Vassil Vassilev
 

'Blue sky' ideas

 

Mentors

Here is the list of our mentors and their areas of expertise:

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

Tags

CERN-SFT GSoC 2016 ideas page

We have participated successfully in the Google Summer of Code each year since 2011, so we are applying again in 2016!
We intend to offer three options as soon as the project ideas are ready: pick one of the project ideas, which are grouped into categories, have a look at the section dedicated to 'Blue sky' ideas, or propose your own great idea for this summer. We are looking forward to hearing about your proposals. A list of our mentors (and their areas of expertise) can be found here.

We encourage students who plan to apply to contact us about their interests and explain their project ideas as early as possible. Our experience from our previous GSoC participation was that frequently an initial student application either needs to be reworked in close collaboration with a future mentor, or at least can benefit from feedback and discussion. Please do not forget to provide us with some relevant information about yourself (for example a CV, past projects, personal page or blog, LinkedIn profile, GitHub account, etc.).

Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC Web page before the 25th of March.

Project ideas

Geant Vector Prototype and Geant4 Simulation Toolkit

Every LHC experiment is a large-scale user of simulation, involving the elementary particles created by the LHC beam collisions and their paths through its detectors. Almost half the CPU resources of each experiment (around 100,000 cores) are used constantly to produce simulated events. This is now done using the Geant4 toolkit, the current production detector simulation toolkit for High Energy Physics (HEP) experiments. R&D is being undertaken in the GeantV project into a new generation of detector simulation, seeking to use existing 'large core' hardware more efficiently in order to meet the rapidly rising experiment simulation needs, and to be well adapted to other existing and planned architectures, including CPUs with larger vector registers and accelerators (such as the Intel Xeon Phi and GPUs). The code required to model diverse types of particles and interactions, and to model the complex geometries of detectors, is large. Due to this it overwhelms the caches of current CPUs, significantly reducing the efficiency of utilisation on today's hardware. This effort is focused on identifying ways to reorganise the work so that more data (e.g. multiple particles or rays) is processed by each function. By reusing the code on 'nearby' data we aim to make better use of the memory architectures of today's hardware. At the same time we prepare the way to obtain good performance on tomorrow's hardware.

Geant4 and GeantV are open source, developed by physicists and engineers from laboratories and universities around the world.  Developments are ongoing to improve its computing and physics precision, its scope of application, and to better utilize current and emerging computer architectures. The simulation programs of the LHC experiments are in constant large scale use, and the total number of simulated events produced is becoming a growing limitation in the analysis potential for some interesting types of new physics.

As a result, the goal of the project is to explore different ways to reduce the execution time on today's complex commodity CPUs, and to prototype how to use them efficiently on the many-core hardware of the future (tens, hundreds of cores, threads or 'warps'). The code required in Geant4 to model diverse types of particles and interactions, and to model the complex geometries of detectors, spans hundreds of classes and tens to hundreds of thousands of lines of code. Due to this it overwhelms the caches of current CPUs, significantly reducing the efficiency of utilisation on today's hardware. GeantV addresses these challenges by transporting bunches of particles together, in order to profit from the structure and parallelism in modern hardware, exploiting caches, vector instructions and multiple cores using multi-threading (MT).

Implementation of task-based transport for GeantV

Description: The current parallelism model of GeantV is data-oriented: a static set of threads handles the work, fetching one "basket" of tracks and transporting each track for a step. Threads are fed with work from a common concurrent input queue. After an injection of a set of initial tracks, further work is generated by the transport threads, which create new baskets at the end of each step. Each step can also generate a trace (or detector 'hit'), which is proto-data for output. Other work also goes on, summarising the detector hits into detector summaries ('digits') and performing their output. We are seeking to adapt the steering of this work to a task-based approach, preferably using Threading Building Blocks (TBB), to profit from the flexibility of this approach and from the facilities provided by its powerful library. A first TBB implementation exists, based on an initial version of the GeantV scheduler, which serves as a starting point and potential inspiration.
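
The task-based shape of the problem can be sketched in a few lines: each basket-step is an independent task, and baskets that still hold live tracks are rescheduled. This toy uses Python's thread pool purely to illustrate the scheduling pattern a TBB version would express with tasks:

```python
from concurrent.futures import ThreadPoolExecutor

def transport_step(basket):
    """Advance every track in a basket by one step; tracks are labelled by
    their remaining steps, and exhausted tracks are dropped."""
    return [steps - 1 for steps in basket if steps - 1 > 0]

def run_task_based(baskets, workers=4):
    done_steps = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = baskets
        while pending:
            # Each basket-step is an independent task; surviving baskets
            # are rescheduled for the next round.
            results = list(pool.map(transport_step, pending))
            done_steps += sum(len(b) for b in pending)
            pending = [b for b in results if b]
    return done_steps

# Three baskets; each number is a track's remaining steps (9 steps in total).
total = run_task_based([[3, 1], [2, 2], [1]])
```

In the real scheduler the per-round barrier above is one of the bottlenecks a task-based design avoids: finished baskets can spawn follow-up tasks immediately instead of waiting for the whole round.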

Task ideas and expected results:  

  • Evaluate the existing TBB-based implementation by identifying and understanding its current bottlenecks, and comparing it to the 'static' thread approach. Expected outcome is a refined task-based version, with improved performance from addressing the most important bottlenecks.
  • Evaluate different approaches for structuring the tasks. Innovative ideas will be welcome. Expected outcome is a design and partial implementation of a second task-based implementation targeting improved performance.

Requirements: Strong C++ skills, experience with parallel programming required. Experience in using TBB or other task-based threading libraries will be considered a strong advantage. Knowledge in the field of physics is a plus, but not a requirement.

Mentor: Andrei Gheata

Web page:     http://geant.web.cern.ch

Source code:  https://gitlab.cern.ch/GeantV/geant/tree/master

New methods for integrating trajectories in a field

Geant4 and GeantV use Runge-Kutta methods to integrate the motion of charged particles in a non-uniform electromagnetic field. Methods must provide good integration accuracy at a minimum cost in computation time. Integration is used to identify the intersection point between the curved track and the volume boundaries. Due to the large number of steps and the cost of the field evaluations, integration and intersection are a performance-critical part of detector simulation. Recent work has introduced new RK methods which reduce the number of field evaluations required and have the potential to decrease the computation time.

Task ideas:

  • Introduce multi-step integration methods and compare their performance with the available RK methods, or
  • Introduce heuristics for choosing an appropriate RK method, using knowledge of the accuracy requirements and the length of the current step, and the characteristics of the magnetic field.

Expected results: Working implementation of a) one or more multi-step integration methods inside Geant4 and/or Geant4 module for tracking in field, or b) integration methods which combine RK methods of different order for improved performance.
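To make the setting concrete, here is a minimal sketch (not Geant4/GeantV code) of a classical RK4 integrator for a charged particle in a uniform magnetic field, in normalized units where q/m = 1. Each RK4 step costs four right-hand-side (field) evaluations, which is exactly the cost that the proposed methods aim to reduce:

```python
# Conceptual sketch: one classical RK4 step for a charged particle in a
# uniform magnetic field B = (0, 0, Bz), with q/m = 1, so omega = Bz.
# State is (x, y, vx, vy); the exact solution is a circle, which makes it
# easy to check the integrator against the analytic trajectory.
import math

def lorentz_rhs(state, omega):
    """Equations of motion: x' = v, v' = omega * (vy, -vx)."""
    x, y, vx, vy = state
    return (vx, vy, omega * vy, -omega * vx)

def rk4_step(state, h, omega):
    def add(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))
    k1 = lorentz_rhs(state, omega)                 # four field evaluations
    k2 = lorentz_rhs(add(state, k1, h / 2), omega) # per step: this is the
    k3 = lorentz_rhs(add(state, k2, h / 2), omega) # cost that cheaper RK
    k4 = lorentz_rhs(add(state, k3, h), omega)     # methods try to reduce
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def track(state, h, omega, n_steps):
    for _ in range(n_steps):
        state = rk4_step(state, h, omega)
    return state

# After integrating over one full gyration period, the particle returns
# (numerically) to its starting point and its speed is conserved.
```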

Requirements: This project requires prior exposure to Numerical Analysis and familiarity with either C++, C or Java programming.  Exposure to either numerical methods for solving Ordinary Differential equations (ODEs) or tools for analysing data such as ROOT or R will be valuable. Both programming skill and knowledge of numerical methods for ODEs will be improved by undertaking this project.

Mentor: John Apostolakis​
Web site: http://cern.ch/geant4
Source code
https://github.com/jonapost/field_propagation


 

ROOT 

 

The ROOT system (root.cern.ch) provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way. Having the data defined as a set of objects, specialized storage methods are used to get direct access to the separate attributes of the selected objects, without having to touch the bulk of the data. Included are histogramming methods in an arbitrary number of dimensions, curve fitting, function evaluation, minimization, graphics and visualization classes to allow the easy setup of an analysis system that can query and process the data interactively or in batch mode, as well as a general parallel processing framework, PROOF, that can considerably speed up an analysis.

Thanks to the built-in C++ interpreter, the command language, the scripting (or macro) language and the programming language are all C++. The interpreter allows for fast prototyping of the macros since it removes the time-consuming compile/link cycle. It also provides a good environment to learn C++. If more performance is needed, the interactively developed macros can be compiled. ROOT's C++11 standard-compliant interpreter is Cling, an interpreter built on top of the Clang (www.clang.llvm.org) and LLVM (www.llvm.org) compiler infrastructure. Cling is being developed at CERN as a standalone project and is integrated into the ROOT data analysis framework, giving ROOT access to a C++11 standard-compliant interpreter.

ROOT is an open system that can be dynamically extended by linking external libraries. This makes ROOT a premier platform on which to build data acquisition, simulation and data analysis systems. ROOT is the de-facto standard data storage and processing system for all High Energy Physics labs and experiments worldwide. It is also being used in other fields of science and beyond (e.g. finance, insurance, etc.).

Enhance C-Reduce to work with ROOT

 

Description:

C-Reduce (https://github.com/csmith-project/creduce) is a tool which aims to reduce bug reports. It transforms a user's source files to make them as minimal as possible. Minimal bug reproducers are easy to debug and convert into regression tests. C-Reduce is fairly mature and well adapted to minimizing crashes in compilers. The project will be mainly focused on making C-Reduce easier to use with ROOT and its interactive C++ interpreter, Cling.

 

Expected results: Extend C-Reduce so that it can easily reduce ROOT bug reports. Optionally, extend C-Reduce to minimize ROOT's data files. Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.

Required knowledge: Intermediate level of C++, some experience with Clang

Mentor: Vassil Vassilev

Extend clad - the Automatic Differentiation library

Description: In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative to symbolic differentiation and numerical differentiation (the method of finite differences). Clad (https://github.com/vgvassilev/clad) is based on Clang, which provides the necessary facilities for code transformation. The AD library is able to differentiate non-trivial functions, to find a partial derivative for trivial cases, and has good unit test coverage. There was a proof-of-concept implementation for computation offload using OpenCL.
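Clad itself works by transforming the Clang AST of C++ code, but the underlying idea of forward-mode AD can be sketched in a few lines with dual numbers, where every value carries its derivative through each operation (illustrative only, not how clad is implemented):

```python
# Conceptual sketch of forward-mode automatic differentiation with dual
# numbers (NOT clad's source-transformation approach): each Dual carries
# (value, derivative), and arithmetic applies the chain/product rules.
class Dual:
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.der + o.der)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val,
                    self.der * o.val + self.val * o.der)  # product rule
    __rmul__ = __mul__

def derivative(f, x):
    """d f / d x at x: seed the derivative slot with 1 and evaluate f."""
    return f(Dual(x, 1.0)).der

# derivative(lambda x: x * x + 3 * x, 2.0) == 7.0
```

Source transformation, as done by clad, produces an explicit derivative function instead, which is what makes further transformations (such as emitting OpenCL/CUDA kernels for the derivative) possible.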

Expected results: The student should teach clad to generate OpenCL/CUDA code automatically for a given derivative. The implementation should be very well tested and documented. Prepare a final poster of the work and be ready to present it.

Required knowledge: Advanced C++, Clang abstract syntax tree (AST), CUDA/OpenCL basic math.

Mentor: Vassil Vassilev

Interactive features for the JSROOT web geometry viewer

Description: The ROOT project is developing a JavaScript library for reading and rendering ROOT objects (1D, 2D, 3D histograms, 1D & 2D graphs, multi-graphs, 3D geometries) in any modern web browser. One part of this task is to render 3D objects (like detector geometries) using the THREE.js library. This is advancing rapidly and most of the shapes have been implemented. Nevertheless, many interactive features are still missing, such as navigation/selection of volumes (picking) and introducing/moving cutting planes; these have to be implemented. Moreover, while the current code can handle and display several hundred volumes, typical detector geometries can be composed of several million volumes. The code should be able to deal with such big geometries, both memory- and performance-wise. See the JSROOT page and actual status here.

Expected results:

  • Provide more interactive features to web geometry viewer
  • Performance optimization for large geometries

 

Required knowledge: Good knowledge of JavaScript. Experience with 3D computer graphics. Knowledge of C++ would be an asset.

Mentors: Bertrand Bellenot and Sergei Linev

 

TMVA Project in Machine Learning

Description: The Toolkit for Multivariate Analysis (TMVA) is a machine-learning framework integrated into the ROOT software framework. It contains ML packages for classification and regression that are frequently used by high-energy physicists in searches for new particles, for example in the discovery of the Higgs boson. Recently, TMVA has been undergoing a significant makeover in performance, features and functionality.

There are a number of possible areas of contribution, for example:

Task ideas

  • Improvement of memory management and data-handling for parallel running

  • GPU support for intensive deep learning training applications

  • Interfaces to other machine-learning tools

  • Support for multi-objective regression

  • Support for feature engineering


Expected Results: working implementation of these features in TMVA leading to improved software performance

Requirements: Strong C++ background is desired, strong machine learning knowledge is a plus. 

GitHub repository: TMVA

Mentors: Sergei Gleyzer​, Lorenzo Moneta
 

Integrating Machine Learning in Jupyter Notebooks


Description: Improving user experience with ROOTbooks and TMVA. A ROOTbook is a ROOT-integrated Jupyter notebook. ROOT is a software framework for data processing, analysis, storage and visualization. Toolkit for Multivariate Analysis (TMVA) is a ROOT-integrated package of Machine Learning tools. Jupyter notebook is a web-based interactive computing platform that combines code, equations, text and visualizations.

Task ideas:

  • Integrate a list of features, currently available in the TMVA Graphical User Interface, into the ROOTbook environment. This includes Receiver Operating Characteristic (ROC) curves, feature correlations, overtraining checks and classifier visualizations.
  • Extend the ROOT-Python binding (or PyROOT) for the use of TMVA in ROOTbooks. This includes simplifying parameter specification for booking and training classifiers, improving output readability and code clarity.
  • Implement interactive training mode in the ROOTbook environment.
  • Interactive feature engineering with JavaScript visualization.
  • Interactive deep learning optimization.
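As an example of the first task, a ROC curve and its area can be computed from classifier scores in a few lines (a conceptual sketch; the notebook integration would of course work from actual TMVA output and use proper plotting):

```python
# Conceptual sketch of a ROC curve: sweep the decision threshold from
# high to low over the classifier scores and record (FPR, TPR) points,
# then integrate the area under the curve with the trapezoidal rule.
def roc_curve(scores, labels):
    """Return (FPR, TPR) points ordered by decreasing threshold."""
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

In a ROOTbook, such points would be rendered interactively (e.g. with JavaScript visualization) rather than returned as a plain list.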


Expected results: working implementation of the TMVA-ROOTbooks integration layer.

Requirements: Python, C++, JavaScript, machine learning; familiarity with notebook technology is a plus

Links: TMVA, ROOTbooks

Mentors: Sergei Gleyzer​ and  Enric Tejedor

 

Reflection-based Python-C++ language bindings: cppyy

cppyy is a fully automated, run-time language bridge between C++ and Python. It forms the underpinnings of PyROOT, the Python bindings to ROOT, the main persistency and analysis framework in High Energy Physics (HEP). It is used to drive the frameworks of several HEP experiments and is the analysis environment of choice for many HEP physicists. cppyy is the only Python-C++ bindings technology that can handle the scale, complexity, and heterogeneity of HEP codes. There are two implementations, one for CPython and one for PyPy.

Source codes, documentation, and downloads: https://root.cern.ch/  and  https://pypy.org/

Both the CPython and PyPy implementations support the CINT and Reflex reflection systems; the CPython version also supports Cling, which is based on Clang/LLVM. The goal is to move both to Cling, on a code base that is shared as much as possible.

 

Integrate the Cling backend into PyPy/cppyy

Description: Cling, being based on Clang/LLVM, can parse the latest C++ standard (C++11/C++14). A Cling backend exists for CPython/cppyy, but not yet for PyPy/cppyy. A common backend could serve both projects and would reduce the cost of new features, making them available much more quickly.

Expected results: Implement a Cling backend on libCling directly, using the CPython implementation as a starting point, for use by both CPython and PyPy. Package this backend for distribution. Design and implement a method for distribution of Clang modules with the standard Python distribution tools.

Requirements: Working knowledge of C++, good knowledge of Python

Mentor: Wim Lavrijsen

Sixtrack numerical accelerator simulation​

SixTrack is a software package for simulating and analysing the trajectories of high-energy particles in accelerators. It has been used in the design and optimization of the LHC and is now being used to design the High-Luminosity LHC (HL-LHC) upgrade that will be installed in the next decade. SixTrack has been adapted to take advantage of large-scale volunteer computing resources provided by the LHC@Home project. It has been engineered to give exactly the same results after millions of operations on several, very different computer platforms. The source code is written in Fortran and is pre-processed by two programs that assemble the code blocks and provide automatic differentiation of the equations of motion. The code relies on the crlibm library, careful arrangement of parentheses, dedicated input/output and selected compilation flags for the most common compilers to provide identical results on different platforms and operating systems. An option enables the use of the BOINC library for volunteer computing. A running environment, SixDesk, is used to generate input files, split simulations for LHC@Home or the CERN cluster, and collect the results for the user. SixTrack is licensed under LGPLv2.1.

A strong background in computer science and programming languages as well the interest to understand computational physics methods implemented in the code are sought. The unique challenge will be offered to work with a high-performance production code that is used for the highest energy accelerator in the world - and thus the code's reliability and backward compatibility cannot be compromised. There will be the opportunity to learn about methods used in simulating the motion of particles in accelerators.

 

Optimize and Integrate Standalone Tracking Library (SixTrackLib)

Description: Benchmark a standalone tracking library written in C, targeting both CPUs and GPUs, and integrate it in SixTrack. The inner loop uses a simple data structure based on contiguous arrays that can be generated by SixTrack or external programs and can be hosted in CPU or GPU main memory. For the GPU case, the ideal number of particles per core (even one, such that coordinates do not leave internal registers) and the kernel size should be evaluated for speed.

Expected results: Running code which relies only on the newly rewritten library to perform tracking simulations, and a test suite that proves that the old and new implementations produce identical results.

Mentors: Ricardo De Maria

Requirements: Experience with Fortran, C, OpenCL, calculus and a background in physics. 
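The contiguous-array ("structure of arrays") layout mentioned above can be sketched as follows (illustrative Python, not the actual SixTrackLib data structures; a simple field-free drift stands in for a real tracking map):

```python
# Conceptual sketch of a structure-of-arrays particle container: one
# contiguous array per coordinate, which is the memory layout that
# vectorizes well on CPU SIMD units and maps naturally onto GPU kernels.
import array

def make_beam(n):
    """Flat, contiguous per-coordinate storage for n particles."""
    return {name: array.array("d", [0.0] * n)
            for name in ("x", "px", "y", "py")}

def drift(beam, length):
    """Field-free drift: the hot loop a GPU kernel would replace."""
    x, px, y, py = beam["x"], beam["px"], beam["y"], beam["py"]
    for i in range(len(x)):
        x[i] += length * px[i]
        y[i] += length * py[i]

beam = make_beam(3)
beam["px"][0] = 1e-3
drift(beam, 10.0)   # beam["x"][0] is now ~0.01
```

In the C library, each such map would be one kernel over the arrays; identity of results between old and new code can then be tested coordinate by coordinate.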

New physics models

Description: Implement, test and put in production new solvers for exact bending dipoles, combined-function magnets and radiation effects.

Expected results: The user can build accelerator simulations with more accurate models for low energy machines and/or machines with radiation effects.

Mentors: Ricardo De Maria

Requirements: Fortran, calculus, accelerator physics.

BLonD

The CERN Beam Longitudinal Dynamics code, BLonD, is used to simulate the dynamics of particle bunches in synchrotrons. It contains a vast range of physics features to model multiple-harmonic RF systems, feedback loops, and collective effects, and has been applied in many studies and for several machines inside and outside of CERN. Whether the goal is to understand previously unexplained observations or to predict and optimize parameters for future machines, simulations often require multi-bunch modelling with millions of particles and calculations of collective effects (in frequency or time domain), sometimes over millions of iterations, which can make them computationally very expensive.

BLonD Code Optimisation

Description: The code was originally written in Python. In order to significantly reduce the runtime, it will be translated to C++ and its algorithms will be optimized by a BLonD developer during the coming year. This will require not only a complete rewriting, but also a major restructuring of the code, where creativity, initiative, and the latest technologies will be needed. As a GSoC student, you could explore different parallelization techniques on CPUs and GPUs for different parts of the code, as well as different data structures and overall software architecture options to increase computational efficiency.
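To give an idea of the kind of inner loops involved, here is a much simplified sketch of a per-turn longitudinal "kick"/"drift" tracking loop over plain arrays (illustrative constants and maps, not BLonD's actual physics):

```python
# Conceptual sketch of longitudinal tracking (NOT BLonD code): per turn,
# an RF "kick" updates each particle's energy offset from its phase, and
# a "drift" advances the phase from the energy offset.  These loops over
# millions of particles are the parallelization targets.
import math

def kick(d_energy, phase, voltage, charge=1.0):
    """RF kick: energy offset changes with the particle's RF phase."""
    for i in range(len(d_energy)):          # candidate loop for SIMD/GPU
        d_energy[i] += charge * voltage * math.sin(phase[i])

def drift(phase, d_energy, slippage=1e-3):
    """Drift: the phase advances according to the energy offset."""
    for i in range(len(phase)):
        phase[i] += slippage * d_energy[i]

def track(d_energy, phase, n_turns, voltage):
    for _ in range(n_turns):
        kick(d_energy, phase, voltage)
        drift(phase, d_energy)
```

Because each particle is independent within a turn, both loops parallelize trivially; the interesting architecture questions arise once collective effects couple the particles through histograms and FFTs.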

Expected results: Determine the best architecture and parallelization option(s) for the BLonD code.

Requirements: Strong skills in C++ and parallelization techniques. Some experience with Python would be an advantage, as well as a minimal physics background that allows for understanding the underlying equations.

Mentors: Helga Timko and Danilo Piparo

Website (with links to source code & documentation): http://blond.web.cern.ch/

 

'Blue sky' ideas

Improvement of the VDT Mathematical Library

The VDT mathematical library is a collection of optimised, inline and vectorisable mathematical functions. Its adoption made it possible to remarkably reduce the runtime of the data processing workflows of the CMS experiment at the LHC.

This project aims to further expand the functionality of the VDT mathematical library. Two main areas can be explored, namely:

1. Integration with OMP4 and support for simd vectors

The VDT functions can be enhanced to support the OpenMP4 programming interface for vectorisation. In addition, by templating the VDT functions, explicit vectorisation can be achieved through the usage of array types such as the gcc and clang built-in vector types or the Vc array types.

2. Integration of existing limited precision/domain function implementations

Often the usage of a certain mathematical function requires only a limited domain or a limited precision. This activity aims to complement the existing VDT function implementations with others characterised by a reduced precision or input range. An appropriate formulation of the interfaces of these functions has to be adopted, for example following generic programming principles through the usage of templates.
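As a toy example of the limited-domain/limited-precision idea (not a VDT implementation), a sine restricted to [-pi/4, pi/4] needs only a short odd polynomial and no range reduction:

```python
# Conceptual sketch of a limited-domain function: a sine valid only on
# [-pi/4, pi/4], via a truncated odd polynomial evaluated in Horner form.
# Fewer terms and no range reduction, at the price of domain and accuracy.
def fast_sin(x):
    """Approximately 1e-6 accurate on [-pi/4, pi/4]; no range reduction."""
    x2 = x * x
    # truncated Taylor series: x - x^3/3! + x^5/5! - x^7/7!
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 - x2 / 5040.0)))
```

A templated interface would let callers pick such reduced-domain variants explicitly, while the full-range VDT implementation remains the default.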

Mentors: Vincenzo Innocente, Danilo Piparo

Mentors

Here is the list of our mentors and their areas of expertise:

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

Tags

CERN SFT Gsoc 2015 ideas page

We have participated in the Google Summer of Code with success each year since 2011, so we are applying again in 2015!
We intend to offer three options - pick one of the project ideas, which are grouped into categories, have a look at the section dedicated to 'Blue sky' projects, or propose your own great idea for this summer: we are looking forward to hearing about your proposals. A list of our mentors (and their areas of expertise) can be found here.

We encourage students who plan to apply to contact us about their interests and explain their project ideas as early as possible. Our experience from our previous GSoC participation was that frequently an initial student application either needs to be reworked in close collaboration with a future mentor, or at least can benefit from feedback and discussion. Please do not forget to provide us with some relevant information about yourself (for example CV, past projects, personal page or blog, LinkedIn profile, GitHub account, etc.).

Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC webapp before the 27th of March (19:00 UTC).

Project ideas

CernVM File System

The CernVM File System (CernVM-FS) is a read-only file system that is optimized for the distribution of software to world-wide distributed computing infrastructures.  It is used in the context of large-scale scientific computing applications that use tens of thousands of individual computers to process very large data sets in reasonable time.  All these computers need to access specific data processing software.  The access to the software is provided by CernVM-FS, a global and versioning file system that uses HTTP for data transfer.  The file system content is installed on a central web server from where it can be mirrored and cached by other web servers and web proxies.  File system clients download data and meta-data on demand and cache them locally.  Data integrity and authenticity are ensured by the use of cryptographic hashes and digital signatures.  CernVM-FS is used, among others, by the Large Hadron Collider (LHC) experiments for the distribution of hundreds of millions of files and directories of experiment software onto tens of thousands of worldwide distributed nodes.

Source code: https://github.com/cvmfs/cvmfs
Documentation: http://cernvm.cern.ch/portal/filesystem/techinformation, Doxygen
Downloads: http://cernvm.cern.ch/portal/filesystem/downloads

HTTP/2 Support

Description: CernVM-FS uses HTTP for all data transfers.  This includes clients requesting files from servers as well as server-to-server copies for replication.  Since the file system's workload is primarily characterized by a large number of small files, reducing latency is paramount for performance.  Here, the upcoming HTTP/2 brings some interesting new features, in particular connection multiplexing and the elimination of head-of-line blocking, which should resolve the current problems with HTTP/1.1 pipelining.  This project should add support for HTTP/2 in the CernVM-FS network code and re-arrange the code for data replication to use a multiplexed connection instead of parallel connections.  The HTTP/2 code itself will probably be taken from libcurl.  Optionally, this project can be extended to develop a data prefetcher using multiplexed connections.

Expected results: Extend the CernVM-FS download manager by a vectorized interface, so that a number of download jobs can be sent and received at once.  Develop the necessary code to automatically decide if parallel connections (HTTP/1.1 on the other end) or multiplexed connections (HTTP/2) should be used.  Extend the CernVM-FS replication module to use the vectorized interface of the download manager.  Extend the CernVM-FS file system code by a pre-fetch thread, that is able to prefetch a given list of data chunks in the background.  Extend the file system's meta-data by a table that can store prefetch hints.

Mentor: Jakob Blomer, Gerardo Ganis

Requirements: Good knowledge of C++, good knowledge of HTTP and TCP/IP.  Experience with libcurl would be an advantage.

Key-Value Store Interface for the Client Cache

Description: CernVM-FS stores downloaded files in a local cache.  This is necessary for performance and to keep the load on the web servers and proxies sufficiently low.  At the moment, the local cache needs to reside on a POSIX file system.  In order to perform collaborative caching in a cluster (i.e. node A can cache a file for node B and vice versa), the dependency of the local cache on a POSIX file system should be relaxed to a pure PUT/GET interface.  That would allow the local cache to be stored in a key-value store such as RAMCloud or Riak.  At the conceptual level, this project includes finding a way to prevent many clients in a cluster from downloading the same file at the same time (i.e. a way to collapse concurrent requests for the same file).  The cache eviction requires special treatment, too, in case files marked for deletion are currently used by a client.  Both problems could be tackled, for instance, by a locking/lease table that is maintained in the key-value store.

Expected results: Based on the current cache manager implementation, separate the cache manager interface (open, close, read, exists, delete) from its implementation.  Plug in the current implementation as a POSIX cache manager.  Develop a cache manager implementation that uses a key-value store as backing storage.  Develop locking primitives in the key-value store that can be used to collaboratively update the cache from multiple nodes.
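The interface split asked for above can be sketched as follows (illustrative Python; the dict-backed store and the lease primitive stand in for RAMCloud/Riak and for a real distributed locking scheme):

```python
# Conceptual sketch (NOT CernVM-FS code): a cache manager that talks only
# PUT/GET (plus a lease primitive) to a backing store.  The lease table
# collapses concurrent requests: only one node downloads a given object.
class KeyValueStore:
    def __init__(self):
        self._data, self._leases = {}, set()
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)
    def try_lease(self, key):
        """Acquire the right to fetch `key`; False if someone holds it."""
        if key in self._leases:
            return False
        self._leases.add(key)
        return True
    def release(self, key):
        self._leases.discard(key)

class CacheManager:
    def __init__(self, store, fetch):
        self.store, self.fetch = store, fetch   # fetch: download via HTTP
    def open(self, path):
        cached = self.store.get(path)
        if cached is not None:                  # cache hit: no download
            return cached
        if self.store.try_lease(path):          # we become the downloader
            try:
                data = self.fetch(path)
                self.store.put(path, data)
            finally:
                self.store.release(path)
            return data
        return self.fetch(path)                 # another node is caching it
```

A real implementation would need the lease acquisition to be an atomic operation of the key-value store itself, and would have to handle eviction of objects still in use.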

Mentor: Jakob Blomer, Gerardo Ganis

Requirements: Good knowledge of C++, good knowledge of distributed systems (e.g. CAP theorem, distributed hash tables, failure modes in networks).  Experience with distributed key-value stores would be an advantage.

Geant 4 Simulation Toolkit and Geant Vector Prototype

The Geant4 toolkit simulates the interactions of elementary particles and radiation with matter. It is used to simulate the detectors of the LHC and other High Energy Physics (HEP) experiments. It finds application in other areas, from assessing the effects of radiation on the electronics of satellites to improving the design of medical detectors using domain-specific tools such as the GATE and TOPAS applications. LHC experiments use Geant4 to compare the signatures of rare events from new physics (such as the Higgs boson) to those coming from known interactions. The open source toolkit is developed by the Geant4 collaboration, which brings together 90 physicists and engineers from laboratories and universities around the world. Developments are ongoing to improve its precision and scope of application, and to better utilise current and emerging computer architectures.

The simulation programs of the LHC experiments use the Geant4 simulation toolkit to produce simulated LHC events, running continuously on about 100,000 CPU cores. The achievable statistics remain a limitation in the analysis potential for some interesting types of new physics. The goal of this project is therefore to explore different ways to reduce the execution time on today's complex commodity CPUs, and to prototype how to use Geant efficiently on the many-core hardware of the future (tens, hundreds of cores, threads or 'warps'). The code required to model the diverse types of particles and interactions, and the complex geometries of detectors, is large. As a result it overwhelms the caches of current CPUs, significantly reducing the efficiency of utilisation on today's hardware. This effort is focused on identifying ways to reorganise the work so that more data (e.g. multiple particles or rays) is processed by each function. By reusing the code on 'nearby' data we aim to use the memory architectures of today's hardware better, and at the same time prepare the way to obtain good performance on tomorrow's hardware. The key activity in this effort involves the Geant-Vector prototype (GeantV), which aims to evolve the simulation toolkit into a framework that is better adapted to the critical role of memory caches, and makes optimal use of SIMD vector instructions and accelerator devices.

VecGeom

VecGeom is a novel templated C++ library to describe three-dimensional detector geometries and to offer "particle tracing" APIs (much like in ray-tracing or game engines). VecGeom goes beyond previously existing libraries (in Geant4 or ROOT) by offering strong support for single-instruction multiple-data (SIMD) processing of multiple particles, and also offers SIMD-enhanced scalar algorithms for appropriate shapes. VecGeom is also built to be used as the geometry tracing library on GPUs and other accelerators. VecGeom is still a very young open source project, and there are several project ideas a GSoC student could contribute to:

Description:

  • "IO:" Study and implement support for native serialization of the VecGeom geometries using ROOT6/Cling object serialization. Implement a module to export and import a geometry description using the GDML format (an XML format for the description of geometries). Implement an interface to the DD4hep geometry descriptions.
  • "Testing and performance monitoring infrastructure:" Review and systematically extend the testing infrastructure of the project. Develop comprehensive unit tests. Develop a performance monitoring system that monitors performance evolution and displays results in a web-dashboard or in jenkins.
  • "Generalized GPU solid:" Extend the list of implemented shape primitives by a generalized vectorised/GPU-capable solid which can be used in place of the most popular existing solids (box, tube, conical & spherical shells), especially on the GPU, inspired by the approach of James Tickner.

 

Requirements: Very good C++ knowledge. For the "testing" project: ctest, database and frontend web development experience; for the "GPU solid" project: some experience with the SIMD/SIMT paradigm and vectorized programming.

Mentor: Sandro Wenzel

Additions to the Vc library for vectorization

Vc is a free software library to ease explicit SIMD vectorization of C++ code. It has an intuitive API and provides portability between different compilers and compiler versions as well as portability between different vector instruction sets. This is achieved by offering C++ vector types that hide the platform specific vector assembly instructions. Vc is used as the primary abstraction layer to achieve vectorized code in the GeantV simulation prototype and in many other applications in High-Energy Physics and beyond. Vc is a relatively young project and offers many opportunities for further development in order to keep up with the evolution of computing platforms.

Task ideas:

  • Add support for 8-bit and 64-bit integer SIMD vectors (in particular for AVX2). A thorough investigation of SIMD vectors of char may be interesting for vectorization of string processing.
  • Add support for more SIMD (micro)architectures (AVX2, AVX-512, AltiVec)
  • Support for GPUs. There are several possibilities for supporting the vector types programming model on GPUs. One approach makes Vc code portable to GPUs by supporting Vc types in CUDA kernels as scalar values; thus, the data-parallelism is only expressed via the SIMT programming model. A second approach instructs the compiler to generate native GPU code, which executes a GPU thread per entry in the vector type.

Requirements: Working on Vc requires (very) good knowledge of C++ and preferably previous exposure to SIMD assembly or intrinsics programming. In the course of the project your C++ skills will very likely improve. It is also possible to gain insight into the work on SIMD parallelization for the upcoming C++ standards.

Mentor: Matthias Kretz

New methods for integrating trajectory in field

Geant4 and GeantV use Runge-Kutta methods to integrate the motion of charged particles in a non-uniform electromagnetic field. Two aspects of this use are important: obtaining good accuracy for the integration and taking a minimum of computation time. Sometimes a step remains in the same volume, and the integration only needs to estimate the endpoint and the momentum.  But when a track crosses a boundary, integration is also used to identify the intersection point between the curved track and the volume boundaries.  Due to the large number of steps and the cost of the field evaluations, the integration and intersection are a performance-critical part of detector simulation. Adopting RK methods which reduce the number of field evaluations has the potential to measurably decrease the computation time in applications as diverse as the simulation of detectors at the LHC and the development of improved radiation therapy systems.

Task ideas:

  • Introduce RK methods which re-use the last field evaluation of one step as the first evaluation of the next step (FSAL, 'first same as last')
  • Introduce RK methods which can provide an evaluation at an intermediate point of an interval, using an interpolation method based on the values calculated by the integration.
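The FSAL idea can be illustrated with the Bogacki-Shampine 3(2) method, a standard FSAL pair (the sketch below is generic ODE code, not Geant4's field propagation):

```python
# Conceptual sketch of FSAL ("first same as last"): in the
# Bogacki-Shampine 3(2) method the fourth stage is evaluated at the new
# point, so it can be reused as the first stage of the next step, saving
# one right-hand-side (field) evaluation per step: 3n + 1 instead of 4n.
def bs23_fsal(f, t, y, h, n_steps):
    evals = [0]
    def rhs(t, y):
        evals[0] += 1           # count field evaluations
        return f(t, y)
    k1 = rhs(t, y)              # evaluated once, then always reused
    for _ in range(n_steps):
        k2 = rhs(t + h / 2, y + h / 2 * k1)
        k3 = rhs(t + 3 * h / 4, y + 3 * h / 4 * k2)
        y_new = y + h * (2 * k1 + 3 * k2 + 4 * k3) / 9
        k4 = rhs(t + h, y_new)          # stage at the new point...
        t, y, k1 = t + h, y_new, k4     # ...reused as k1 of the next step
    return y, evals[0]
```

In the production pair, k4 also feeds an embedded second-order estimate used for step-size control, which the sketch omits.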

Requirements: This project requires prior exposure to Numerical Analysis and some familiarity with C++, C or Java programming.  It would be helpful to have had either exposure to numerical methods for solving Ordinary Differential Equations (ODEs) or else to have used a computer algebra program (Maxima, Maple, Mathematica, Sage or similar), which would enable easier prototyping. Both programming skill and knowledge of numerical methods for ODEs will be improved by undertaking this project.

Mentor: John Apostolakis


ROOT

 

The ROOT system is described above.

Extension and optimisation of the ROOT6 autoloading capabilities

Description
The flexible plug-in architecture of ROOT relies on a powerful mechanism of "autoloading", i.e. the possibility to automatically load shared libraries only when needed, for example upon the usage of a certain class, without the need to link the libraries. The present infrastructure is performant and stable enough to be leveraged by huge software systems such as the LHCb, ATLAS or CMS software stacks. On the other hand, the integration of ROOT6 with the software of the LHC experiments made it possible to identify functionality extensions and performance improvements. Examples are:

  • The extension of the autoloading mechanism to cover the usage of functions, and not only classes, namespaces, variables and enumerators.
  • The usage of more memory and cache efficient data structures for the storage of the data necessary for autoload (e.g. exploiting tries or implicit sharing)
  • The optimisation of the rootmap files, i.e. the autogenerated catalogues containing the data to feed to the autoloading system
  • The improvement of the ROOT startup time with clever preprocessing of the information to be passed to the ROOT interpreter. If time allows, the same strategies identified for autoloading can be applied to a similar procedure, called autoparsing.
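To give a flavour of the second point, the sketch below shows a prefix trie mapping class names to the libraries providing them, so that shared prefixes ("TH1F", "TH1D", ...) are stored only once. This is purely illustrative plain C++, not ROOT's actual autoload registry:

```cpp
#include <map>
#include <memory>
#include <string>

// Illustrative prefix trie for autoload keys: class name -> library name.
// Common prefixes among keys are shared, which is the kind of memory
// saving suggested for the autoload key registry.
class AutoloadTrie {
  struct Node {
    std::map<char, std::unique_ptr<Node>> next;
    std::string library;  // non-empty only on terminal nodes
  };
  Node fRoot;
public:
  void Insert(const std::string& cls, const std::string& lib) {
    Node* n = &fRoot;
    for (char c : cls) {
      auto& slot = n->next[c];
      if (!slot) slot = std::make_unique<Node>();
      n = slot.get();
    }
    n->library = lib;
  }
  // Returns the library providing 'cls', or "" if unknown.
  std::string LibraryFor(const std::string& cls) const {
    const Node* n = &fRoot;
    for (char c : cls) {
      auto it = n->next.find(c);
      if (it == n->next.end()) return "";
      n = it->second.get();
    }
    return n->library;
  }
};
```

A production registry would also need to weigh cache behaviour of node layout, which a benchmark against the current implementation would have to establish.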

All the new features and changes will be integrated in ROOT by the candidate, together with a complete set of tests.
Expected Results

  • Implement the autoloading of shared libraries upon usage of functions
  • Replace the current implementation of the autoload keys registry with one which is faster and with a smaller memory footprint
  • Provide input about the performance of the rootmap format with runtime benchmarks and, if needed, provide a new format
  • A program/script for merging rootmaps avoiding duplication of information and a runtime procedure to format the rootmaps code which minimises the time needed by ROOT to interpret it

Requirements
Strong C++ skills, knowledge in the field of Physics and HEP computation and/or experience with ROOT are certainly a plus.

Mentor: Danilo Piparo

SAS: A tool for static analysis and coding rules checking

Description
Code maintenance is very much facilitated if coding and style rules are followed. Commercial tools can be used to check the code for rule violations and to present the results in a form that is easy for developers to act on. Unfortunately, such commercial solutions might not be available to all developers, especially in the field of research, and adding new rules can become a real problem. SAS (https://github.com/dpiparo/SAS) is a tool which attempts to make coding-rule checking and Clang-based static analysis easy. This tool provides:

  • A simple way to build a Clang plugin providing checks which can be loaded by the scan-build utility or directly by Clang.
  • A tool to check formatting coding conventions taking advantage of clang-format
  • A set of scripts to wrap the compiler calls in order to plug static analysis and coding rules checking in any development workflow/continuous integration system.

The tool is in beta phase and needs consolidation and expansion of functionality. This project aims to provide:

  • The implementation via static analyser checkers of the ROOT coding conventions (https://root.cern.ch/root/nightly/codecheck/rules.html)
  • The consolidation of the existing static analyser checkers ensuring thread safety, and the potential development of new ones, for example imposing the absence of thread-unsafe operations in methods which are const qualified.
  • The integration of clang-modernize, targeting suggestions about possible improvements of existing code bases
  • The consolidation of the SAS CMake based build system
  • The development of a ctest based testing system
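To give a flavour of the first point, two of the ROOT naming conventions can be stated as simple predicates. The sketch below is illustrative only (plain string checks, not SAS code); a real checker would implement these as Clang static-analyzer checks over the AST:

```cpp
#include <cctype>
#include <string>

// Two ROOT naming conventions as plain predicates (illustrative only):
// class names start with 'T' and data members start with 'f', each
// followed by an upper-case letter.
bool IsValidClassName(const std::string& name) {
  return name.size() >= 2 && name[0] == 'T' &&
         std::isupper(static_cast<unsigned char>(name[1]));
}

bool IsValidDataMemberName(const std::string& name) {
  return name.size() >= 2 && name[0] == 'f' &&
         std::isupper(static_cast<unsigned char>(name[1]));
}
```

In a Clang checker the same predicates would be applied to `CXXRecordDecl` and `FieldDecl` names visited in the AST, with diagnostics emitted at the declaration location.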

Requirements
C++, Clang, Cmake/CTest and Python

Mentor: Danilo Piparo

Interface between Paraview and ROOT

Description: Paraview is a powerful visualisation system, based on VTK, providing a wide range of high-level visualisation techniques that ROOT does not provide. Paraview is already interfaced to many kinds of data formats. The goal of this project is to write a ROOT plugin for Paraview that converts ROOT trees into Paraview data structures in order to visualise them.

Requirements: Knowledge of C++

Mentor: Olivier Couet, Joachim Pouderoux (Kitware)

Extend ROOT-R Interface

Description: Using the ROOT-R interface, many of the statistical packages available in R, such as those performing multi-variate analysis, can be used in ROOT. The idea of the project is to facilitate the usage of some of the R packages by creating corresponding interface classes in the ROOT system. These classes should implement the interfaces required by the statistical packages of ROOT, and they could be loaded at run-time by the ROOT plugin manager system.
Expected Results: Develop interface classes for some of the most used multi-variate statistical tools in R for classification and regression. These interface classes should be designed to be used by the TMVA package of ROOT.

Requirements: Knowledge of C++, ROOT and R. Knowledge of computational statistics would be an advantage.
Mentors: Lorenzo Moneta, Sergei Gleyzer
 

Prototype of TTreeFormula

Description: TTreeFormula is a class used to parse and interpret the expression strings used to query and make selections on the ROOT TTree class with the TTree::Draw function. A new TFormula class has been integrated in ROOT, using the Just-In-Time compilation provided by Cling to compile the mathematical expression and evaluate the formula. A similar class needs to be developed for the expressions used to analyse ROOT TTrees.
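To make the task concrete, a formula class must first parse an expression string such as "2*x + 1" before anything can be compiled. The sketch below is a minimal, illustrative recursive-descent evaluator for one hypothetical branch variable 'x'; the real TFormula hands the parsed expression to Cling for JIT compilation instead of interpreting it like this:

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Minimal illustrative expression evaluator: +, -, *, /, parentheses,
// numeric literals, and a single variable 'x'.
class TinyFormula {
  const char* p;
  double x;
  void skip() { while (std::isspace(static_cast<unsigned char>(*p))) ++p; }
  double factor() {
    skip();
    if (*p == '(') { ++p; double v = expr(); skip(); ++p; return v; }  // consume ')'
    if (*p == 'x') { ++p; return x; }
    char* end; double v = std::strtod(p, &end); p = end; return v;
  }
  double term() {
    double v = factor();
    for (skip(); *p == '*' || *p == '/'; skip())
      v = (*p++ == '*') ? v * factor() : v / factor();
    return v;
  }
  double expr() {
    double v = term();
    for (skip(); *p == '+' || *p == '-'; skip())
      v = (*p++ == '+') ? v + term() : v - term();
    return v;
  }
public:
  double Eval(const std::string& s, double xval) {
    p = s.c_str(); x = xval; return expr();
  }
};
```

A TTreeFormula replacement would extend the variable lookup to TTree branch names and emit C++ source for Cling rather than evaluating on the fly.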

Expected Results: Develop a new prototype TTreeFormula class to be used as an alternative to the existing one and provide some running use case examples.

Required knowledge: Advanced C++, basic knowledge of Cling and C++11

Mentors: Lorenzo Moneta, Danilo Piparo

LINQ 4 ROOT and Cling

Description: LINQ originally referred to the SQL-like syntax in C# and VB, but over time it has come to mean the way you manipulate lists using the higher-order functions provided by System.Linq. Working with lists using higher-order functions has been available to functional and Smalltalk developers since the 70s, but has recently been popularized in mainstream languages such as C#, VB, JavaScript, Python and Ruby. There are a few libraries that bring this style into C++11. We would like to investigate their strengths and weaknesses, adopt one of them in our C++11 interpreter, and provide thorough real-world examples of its use. A good starting point would be https://github.com/vgvassilev/RxCpp.
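As a taste of the style in question, here is a small illustrative chainable wrapper over std::vector offering Where/Select/Sum, the higher-order operations most such libraries provide first (this is a sketch, not the API of any of the candidate libraries):

```cpp
#include <numeric>
#include <vector>

// Illustrative LINQ-style chainable wrapper over std::vector.
template <class T>
class Query {
  std::vector<T> data;
public:
  explicit Query(std::vector<T> v) : data(std::move(v)) {}
  // Keep only elements satisfying the predicate.
  template <class Pred>
  Query Where(Pred pred) const {
    std::vector<T> out;
    for (const T& v : data)
      if (pred(v)) out.push_back(v);
    return Query(std::move(out));
  }
  // Map each element through fn (T -> T for simplicity of the sketch).
  template <class Fn>
  Query Select(Fn fn) const {
    std::vector<T> out;
    for (const T& v : data) out.push_back(fn(v));
    return Query(std::move(out));
  }
  T Sum() const { return std::accumulate(data.begin(), data.end(), T{}); }
};
```

Usage reads left to right, e.g. `Query<int>({1,2,3,4,5}).Where(even).Select(square).Sum()`; a production library would make Where/Select lazy rather than materializing each intermediate vector.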

Expected results: Adoption in Cling meeting Cling's QA requirements. Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.

Required knowledge: Advanced C++, basic knowledge of Cling and C++11

Mentor: Vasil Vasiliev

Extend clad - the Automatic Differentiation tool

Description: In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative to symbolic differentiation and numerical differentiation (the method of finite differences). Clad (https://github.com/vgvassilev/clad) is based on Clang, which provides the necessary facilities for code transformation. The AD library is able to differentiate non-trivial functions, to find a partial derivative for trivial cases, and has good unit-test coverage. There was a proof-of-concept implementation for computation offload using OpenCL.
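The core idea of AD can be shown in a few lines with "dual numbers": each value carries its derivative, and the chain rule falls out of operator overloads. Note this is forward-mode AD by operator overloading, shown only for intuition; clad itself works by source transformation on the Clang AST:

```cpp
// Forward-mode automatic differentiation with dual numbers (illustrative).
struct Dual {
  double val;  // f(x)
  double der;  // f'(x)
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }
Dual operator-(Dual a, Dual b) { return {a.val - b.val, a.der - b.der}; }
Dual operator*(Dual a, Dual b) {
  // product rule: (fg)' = f'g + fg'
  return {a.val * b.val, a.der * b.val + a.val * b.der};
}

// Evaluate f(x) = x*x + 3*x and f'(x) = 2*x + 3 in one pass.
Dual f(Dual x) {
  Dual three{3.0, 0.0};  // constants have zero derivative
  return x * x + three * x;
}
```

Seeding the input with derivative 1 (i.e. dx/dx = 1) yields both the value and the exact derivative, with no finite-difference truncation error.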

Expected results: The student should teach clad how to generate OpenCL/CUDA code automatically for a given derivative. The implementation should be very well tested and documented. Prepare a final poster of the work and be ready to present it.

Required knowledge: Advanced C++, Clang abstract syntax tree (AST), CUDA/OpenCL basic math.

Mentor: Vasil Vasiliev

Extension of ROOT I/O customization framework

Description
ROOT includes an extensive, flexible, and performant framework to automatically serialize C++ objects into a platform-independent binary format. One of the core strengths of this framework is the ability to support evolution of the user's data schema and to read files containing older versions of this schema. The framework includes several automatic transformations, including from any C++ standard collection to another (for example changing from a vector<UserObject> to a list<UserObject>) and from any numerical type to another. More complex transformations are also supported via I/O customization rules. While the core functionality is in place, some of the intended but more complex features (see the end of http://indico.cern.ch/event/35523/session/59/material/slides/3?contribId=210) have not yet been addressed.
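The collection-to-collection conversion mentioned above can be sketched as a free template that rebuilds one container from another's element range; the real framework does this element by element during deserialization, with full schema-evolution bookkeeping, but the shape of the operation is the same (illustrative code, not ROOT's API):

```cpp
#include <list>
#include <vector>

// Illustrative sketch: read data written as one collection type back into
// another, e.g. a stored vector<UserObject> materialized as list<UserObject>.
template <class To, class From>
To ConvertCollection(const From& src) {
  // Any pair of containers with compatible element types and a
  // range constructor can be converted this way.
  return To(src.begin(), src.end());
}
```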

Expected Results

  • Implement support for I/O rules for nested objects.
  • Implement support for just-in-time compilation of I/O rules.
  • Complete the functionality supported by the I/O customization rules.

Requirements
Strong C++ skills, knowledge in the field of Physics and HEP computation and/or experience with ROOT are certainly a plus.

Mentor: Philippe Canal

Modernization of C++ object streaming framework.

Description
ROOT includes an extensive, flexible, and performant framework to automatically serialize C++ objects into a platform-independent binary format. This framework is undergoing a modernization, replacing older constructs with newer and more performant techniques made possible by the newest versions of the C++ standard and compilers. This modernization has been partially applied to the routines reading the C++ objects out of the platform-independent binary format and still needs to be completed. The same style of modernization needs to be applied to the routines writing the C++ objects into the platform-independent binary format. One of the major challenges is not only to transform the code into the new style and infrastructure, but also to improve the algorithms to take advantage of the new facilities and significantly improve performance.

Expected Results

  • Complete the modernization of the reading routines.
  • Implement the modernization of the writing routines.

Requirements
Strong C++ skills, knowledge in the field of Physics and HEP computation and/or experience with ROOT are certainly a plus.

Mentor: Philippe Canal

Implement a tool to 'warn' user of inefficient (for I/O) construct in data model

Description
ROOT includes an extensive, flexible, and performant framework to automatically serialize C++ objects into a platform-independent binary format. One of the major strengths of this framework is the ability to support almost all C++ constructs. The downside of this flexibility is that the user can select from a very wide range of constructs and schemas which have very different I/O performance characteristics. Some of the recommendations for designing an I/O-efficient data schema are straightforward, for example it is better to use a sorted vector of pairs than a map, and some are much more complex, like the effect of a deeper class hierarchy or the order of data members. Creating a tool that can analyze a given data model and give clear recommendations on how to improve its performance would be very beneficial and improve the productivity of data model designers.
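To illustrate the sorted-vector-of-pairs recommendation: a sorted vector supports the same key lookups as a map via binary search, but its elements are stored contiguously, which streams and reads back faster than a node-based map. A minimal illustrative lookup:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<int, std::string>;

// Binary-search lookup in a vector<pair> kept sorted by key: the
// map-like access pattern on the I/O-friendlier layout.
const std::string* Lookup(const std::vector<KV>& sorted, int key) {
  auto it = std::lower_bound(sorted.begin(), sorted.end(), key,
                             [](const KV& kv, int k) { return kv.first < k; });
  return (it != sorted.end() && it->first == key) ? &it->second : nullptr;
}
```

The trade-off, which such a tool would also have to report, is that insertion into the middle of a sorted vector is O(n), so the layout suits write-once/read-many data best.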

Expected Results

  • Implement a prototype scanning a user data model and giving simple recommendations.
  • Review and expand the list of recommendations to improve I/O efficiency.

Requirements
Strong C++ skills, knowledge in the field of Physics and HEP computation and/or experience with ROOT are certainly a plus.

Mentor: Philippe Canal


IgProf

IgProf (https://igprof.org) is a lightweight performance profiling and analysis tool. It can be run in one of three modes: as a performance profiler, as a memory profiler, or in instrumentation mode. When used as a performance profiler it provides statistical-sampling-based performance profiles of the application. In the memory profiling mode it can be used to obtain information about the total number of dynamic memory allocations, profiles of the "live" memory allocations in the heap at any given time, and information about memory leaks. Memory profiling is particularly important for C/C++ programs, where large amounts of dynamic memory allocation can affect performance and where very complex memory footprints need to be understood. In nearly all cases no code changes are needed to obtain profiles. IgProf currently supports Linux on the x86, x86-64, ARMv7 and ARMv8 CPU architectures. It correctly handles dynamically loaded shared libraries, threaded applications and subprocesses. It can be run entirely in user space without special privileges and produces fully navigable call stacks. The profile reports can be visualised in one of several ways. A simple text report can be produced from the raw data saved during the profiling run. Alternatively, a web-browsable profile report can be produced which allows easy navigation of the call stack. Both allow one to see profile data ordered by symbol in terms of "cumulative" cost (function plus children) and "self" cost (time in the function itself), as well as a full call-graph view showing the functions which called, and which were called by, any given function. An important feature of the web-navigable reports is the ability to point via URL at particular places in the call graph. This facilitates collaboration between individuals at different locations.
While there are obviously many profilers out there, the goal of IgProf is to provide a reliable profiler for large applications like those found in HEP, and to tackle the challenges posed by heterogeneous computing architectures.
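The memory-profiling mode can be illustrated with a toy allocation counter: intercept every dynamic allocation and keep running totals. IgProf itself hooks malloc and records full call stacks per allocation; this sketch only replaces the global C++ allocation operators:

```cpp
#include <atomic>
#include <cstdlib>
#include <new>

// Toy allocation profiler (illustrative): count every operator new call
// and the total number of bytes requested.
std::atomic<std::size_t> gAllocCount{0};
std::atomic<std::size_t> gAllocBytes{0};

void* operator new(std::size_t n) {
  gAllocCount.fetch_add(1, std::memory_order_relaxed);
  gAllocBytes.fetch_add(n, std::memory_order_relaxed);
  if (void* p = std::malloc(n)) return p;
  throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }
```

A real profiler additionally records the call stack at each allocation (e.g. via libunwind) so that "live" memory can be attributed to the code that allocated it.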

 

Improve Performance Counters support in IgProf

Description: IgProf has initial support for reading performance counters via the PAPI API. This is currently used to read and profile energy consumption related counters. The goal of this project is to extend the current implementation to allow profiling of generic HW counters, providing the user a simple command line based interface to select the different kind of counters.

Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required.

Profiling mixed python / C++ programs

Description: IgProf currently supports profiling native applications, most likely written in C/C++. Profiling scripted applications which invoke native code (e.g. a Python framework which invokes C++ plugins) is supported in the form of a raw profile of the Python interpreter itself, which eventually calls the native code functions. While this kind of profile can already provide insights, it is desirable to be able to profile and visualize the script stacktrace and the native-code one together, in order to have a global picture of the connections between the scripted and native code.

Expected results: The first objective is to identify and instrument the parts of the Python interpreter which are responsible for allocating, deallocating and executing Python stack frames, eventually extending IgProf's instrumentation capabilities to deal with peculiarities of the Python interpreter. The second objective is to collect enough information via the above-mentioned instrumentation to be able to show mixed Python / C / C++ profiles, first for the performance profiler and subsequently for the memory profiler.

Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required. Understanding of Python interpreter internals is a plus.

Mentors: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

Support for CUDA / OpenCL profiling

Description: Extend IgProf to gather information from the profiling APIs of one (or more) of the above-mentioned packages. The goal is to be able to measure the performance cost of mixed CPU/GPU applications, taking into account their asynchronous behaviour and making it possible to track the source of GPU workloads.

Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required. Knowledge of the above-mentioned toolkits is a plus.

Mentor: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

Enhanced support for POWER, x32 and MacOSX architectures

Description: Depending on the specific knowledge of the candidate, the task is to improve support for profiling Power Architecture (POWER7 and eventually POWER8) applications or, as an alternative, to extend the x86 support to include x32 (the 32bit pointers, 64bit data ABI for Intel compatible processors). An additional task would be to resurrect OSX support (IgProf used to work on PPC based OSX).

Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required. Knowledge of at least one of ARM and x86 assembly language. Knowledge of MacOSX system programming is a plus.

Mentors: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

Sixtrack numerical accelerator simulation​

SixTrack is software for simulating and analysing the trajectories of high-energy particles in accelerators. It has been used in the design and optimization of the LHC and is now being used to design its upgrade, to be installed in the next decade: the High-Luminosity LHC (HL-LHC). SixTrack has been adapted to take advantage of the large-scale volunteer computing resources provided by the LHC@Home project. It has been engineered to give exactly the same results after millions of operations on several, very different computer platforms. The source code is written in Fortran and is pre-processed by two programs that assemble the code blocks and provide automatic differentiation of the equations of motion. The code relies on the crlibm library, careful arrangement of parentheses, dedicated input/output and selected compilation flags for the most common compilers to provide identical results on different platforms and operating systems. An option enables the use of the BOINC library for volunteer computing. A running environment, SixDesk, is used to generate input files, split simulations for LHC@Home or the CERN cluster, and collect the results for the user. SixTrack is licensed under LGPLv2.1.

A strong background in computer science and programming languages, as well as an interest in understanding the computational physics methods implemented in the code, is sought. The unique challenge will be to work with a high-performance production code that is used for the highest-energy accelerator in the world - and thus the code's reliability and backward compatibility cannot be compromised. There will be the opportunity to learn about methods used in simulating the motion of particles in accelerators.

Create a Standalone Tracking Library

Description: Complete, test and deploy a standalone tracking library in C to replace the present inner tracking loop, written in Fortran, with a C one that can target both CPU and GPU. The inner loop uses a simple array-based contiguous data structure that can be generated by SixTrack or external programs and can be resident in CPU or GPU main memory. For GPUs, the ideal number of particles per core (even one, such that coordinates do not leave internal registers) should be evaluated for speed.
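A hedged sketch of the array-based layout described above: a plain structure of arrays that could live equally well in CPU or GPU memory, with the inner loop applying a map per particle. The names and the field-free drift map are illustrative, not SixTrack's actual data model:

```cpp
#include <cstddef>

// Contiguous structure-of-arrays particle container (illustrative).
struct ParticleArrays {
  std::size_t n;
  double* x;   // horizontal position
  double* px;  // horizontal slope
  double* y;   // vertical position
  double* py;  // vertical slope
};

// Field-free drift of length 'len': the simplest inner-loop map.
// On a GPU each iteration would become one thread; the layout is unchanged.
void TrackDrift(ParticleArrays& p, double len) {
  for (std::size_t i = 0; i < p.n; ++i) {
    p.x[i] += len * p.px[i];
    p.y[i] += len * p.py[i];
  }
}
```

Bit-for-bit reproducibility between the Fortran and C implementations, as required by LHC@Home, additionally constrains rounding behaviour (crlibm, parenthesization, compiler flags), which a sketch like this does not address.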

Expected results: Running code which relies only on the newly rewritten library to perform tracking simulations, and a test suite that proves that the old and new implementations produce identical results.

Mentors: Ricardo De Maria, Eric McIntosh

Requirements: Experience with Fortran, C, calculus and a background of physics are important.

New physics models

Description: Implement, test and put in production a new solver for exact bending dipoles, combined-function magnets and radiation effects, and for tracking the total time.

Expected results: The user can build accelerator simulations with more accurate models for low energy machines.

Mentors: Ricardo De Maria, Eric McIntosh

Requirements: Fortran, calculus, accelerator physics.

 

Methodical Accelerator Design

MAD is a tool with a long-standing history, used to design, model and optimise the beam physics of particle accelerators. The development of a completely new version of MAD (GPLv3) started recently, with the aim of supporting a wider range of topologies and physics, and of providing more accurate models with better performance and flexibility. It relies heavily on recent improvements to LuaJIT, an implementation of the Lua programming language embedding an extremely efficient JIT compiler. Low-level mathematics and physics modules are still implemented in C/C++, with high-level wrappers written in Lua using the native FFI of LuaJIT. Supported platforms are Linux, MacOSX and Windows.

Integration of MAD

Description: As the development evolves, it becomes more and more important to integrate all the different parts of the new MAD into a single standalone application with no external dependencies. The work will consist of embedding LuaJIT, the MAD scripts (thousands of lines of Lua), the MAD libraries (thousands of lines of C/C++) and third-party libraries (C/C++, Fortran) into a single application in a portable manner, while still allowing external modules (user scripts, shared libraries) to be loaded at runtime. The application must be able to run in interactive and batch mode with multiple internal Lua environments connected either in parallel or in series. A clear policy will have to be enforced for error handling and memory management through the different layers and technologies of the application.

Expected results: A standalone application without external dependencies, embedding all the components and aforementioned features of MAD, and able to run in interactive and batch mode in a terminal on Linux, MacOSX and Windows.

Requirements: Good knowledge of C, Lua and OS tools. Good experience with portable integration and deployment of software. Knowledge of LuaJIT and FFI would be preferable.

Mentor: Laurent Deniau

Extension of LuaJIT for MAD

Description: Deferred expressions are heavily used in MAD to describe accelerator lattices while postponing business-logic decisions (e.g. circuits) to a later design or optimisation stage. Lua supports lambda functions as a generalisation of deferred expressions, but the syntax is too verbose to be useful. The work will consist of extending the parser, and possibly the semantics, of LuaJIT through patches to support a compact syntax for lambda functions and deferred expressions (lambdas without formal parameters), following the approach of GSL Shell. The extension of the semantics will tag lambdas as special LuaJIT functions that have to be evaluated instead of referenced when used in expressions without call semantics (i.e. without parentheses). This automatic dereferencing will allow efficient emulation of the deferred expressions mandatory for the new MAD.
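The semantics being asked for can be illustrated outside Lua: a deferred value stores an expression and re-evaluates it on every access, so changing an input later changes the result. The C++ analogue below is purely illustrative (the actual project patches LuaJIT's parser, not C++):

```cpp
#include <functional>

// Illustrative deferred expression: the stored callable is evaluated at
// each use, not at definition time.
template <class T>
class Deferred {
  std::function<T()> expr;
public:
  explicit Deferred(std::function<T()> e) : expr(std::move(e)) {}
  operator T() const { return expr(); }  // automatic "dereferencing" on use
};
```

This mirrors the lattice use case: an element parameter defined as a deferred expression of a circuit setting tracks later changes to that setting without being redefined.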

Expected results: Set of patches to apply to LuaJIT (C code) after each update. The patches must be easy to maintain separately as LuaJIT evolves. A set of test cases to be run after each update to ensure no regression.

Requirements: Good knowledge of C. Good experience with parsers and interpreters. Knowledge of LuaJIT would be preferable.

Mentor: Laurent Deniau

Reflection-based Python-C++ language bindings: cppyy

cppyy is a fully automated, run-time language bridge between C++ and Python. It forms the underpinnings of PyROOT, the Python bindings to ROOT, the main persistency and analysis framework in High Energy Physics (HEP). It is used to drive the frameworks of several HEP experiments and is the environment of choice for analysis for many HEP physicists. cppyy is the only Python-C++ bindings technology that can handle the scale, complexity, and heterogeneity of HEP codes. There are two implementations, one for CPython and one for PyPy.

Source codes, documentation, and downloads: https://root.cern.ch/  and  https://pypy.org/

Both the CPython and PyPy implementations support the CINT and Reflex reflection systems; the CPython version also supports Cling, which is based on Clang/LLVM.
There are two proposed projects that can proceed independently.

 

Pythonization API

Description: Full automation provides language bindings that cover the vast majority of use cases, especially since the reflection information can be used to properly treat common patterns such as smart pointers, STL iterators and so on. We call these treatments "pythonizations", as the strictly bound C++ code is turned into bound code that has a Python "feel". However, there are always a few corner cases that can be improved with manual intervention. Currently this is done with helpers or wrapper code on the C++ or Python side, but a well-designed API would standardize these manual tasks, improving scalability and interoperability.

Expected results: Design and implement a "pythonization" API that covers all the common cases, such as different memory ownership models, hand-off of the Global Interpreter Lock (GIL), and user-defined callbacks, handling common patterns as has been done for STL. Implement this for both CPython (in C++) as well as for PyPy (in (R)Python). The API should scale up and be able to deal with conflicts.

Extra: some patterns, such as return by reference of builtin values, cannot be described in Python code and need C++ wrappers. Extend the wrapper generator used by Cling to provide such "pythonizations" automatically.

Requirements: Good knowledge of C++, excellent knowledge of Python

Mentor: Wim Lavrijsen

Integrate the Cling backend into PyPy/cppyy

Description: Cling, being based on Clang/LLVM, can parse the latest C++ standards (C++11/C++14). A Cling backend exists for CPython/cppyy, but not yet for PyPy/cppyy. A common backend could serve both projects and would reduce the cost of new features, making them available much more quickly.

Expected results: Implement a Cling backend on libCling directly, using the CPython implementation as a starting point, for use by both CPython and PyPy. Package this backend for distribution. Design and implement a method for distribution of Clang modules with the standard Python distribution tools.

Requirements: Working knowledge of C++, good knowledge of Python

Mentor: Wim Lavrijsen

Xrootd​

Access to extremely large data volumes is a key aspect of many "Big Data" problems in the sciences and for particle physics in particular. The international character of such projects also leads to the need to access data located in many geographical locations. The Xrootd project (http://xrootd.org) aims at giving high-performance, scalable, fault-tolerant access to data repositories of many kinds. It is primarily used to provide access to file-system storage over the local or wide area network; however, it is also used as a building block for larger storage systems such as CERN EOS (http://eos.cern.ch) or the large-scale global data federations. The Xrootd protocol has many high-performance features (such as asynchronous requests and request pipelining) which are analogous to those added to HTTP/2. The Xrootd client is used throughout High Energy Physics and is well integrated in the ROOT data analysis package.

Implement multi-sourcing interface with the Xrootd client

Description: The Xrootd client often has several potential data sources available. Each source may be of varying quality, and quality may change over time. To speed up data access - and to decrease the impact of poor source selection - we would like to investigate the implementation of a multi-source reading algorithm (reading from several sources in parallel). This would be implemented as a plug-in to the existing client and would expose identical interfaces; a sample algorithm is already available from other sources.

Requirements: Excellent knowledge of C++ programming and understanding of basic network and storage architectures.

Mentors : Matevz Tadel, Brian Bockelman, Andy Hanushevsky

Implement a client-side caching library

Description: Caching data locally on the client side can be very important for achieving performance when network latencies are high. This project involves the implementation of a client-side plugin library to temporarily cache data being read to disk.
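One natural core for such a plugin is a fixed-capacity LRU map from block offset to block data. The in-memory sketch below shows only the eviction policy; a real plugin would key on (file, offset), persist blocks to local disk, and sit behind the Xrootd client plug-in interface (all names here are illustrative):

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>

// Illustrative fixed-capacity LRU cache of data blocks keyed by offset.
class BlockCache {
  std::size_t capacity;
  std::list<std::pair<long, std::string>> lru;               // front = most recent
  std::unordered_map<long, decltype(lru)::iterator> index;   // offset -> list node
public:
  explicit BlockCache(std::size_t cap) : capacity(cap) {}

  // Returns the cached block, or nullptr on a miss.
  const std::string* Get(long offset) {
    auto it = index.find(offset);
    if (it == index.end()) return nullptr;
    lru.splice(lru.begin(), lru, it->second);  // mark as recently used
    return &it->second->second;
  }

  void Put(long offset, std::string data) {
    if (auto it = index.find(offset); it != index.end()) {
      it->second->second = std::move(data);
      lru.splice(lru.begin(), lru, it->second);
      return;
    }
    if (lru.size() == capacity) {              // evict least recently used
      index.erase(lru.back().first);
      lru.pop_back();
    }
    lru.emplace_front(offset, std::move(data));
    index[offset] = lru.begin();
  }
};
```

Both Get and Put are O(1); the splice call reorders the recency list without invalidating the iterators stored in the index.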

Requirements: Excellent knowledge of C++ programming and understanding of basic network and storage architectures.

Mentors : Matevz Tadel, Brian Bockelman, Andy Hanushevsky

'Blue sky' ideas

Binary code browser and tester

Description: This is a potential project on tool development for the benefit of "low-level" developers in various other projects. It proposes the development of a GUI-based (web) application with which one is able to browse through a binary file (executable, object file, library file) and inspect the assembly code at various levels (functions). It could also offer an API to perform unit tests at the assembly-code level. In the first instance, this project would be a convenience tool offering a good alternative to objdump and similar tools, and extending their capabilities. Use cases are: quickly seeing if a certain function contains the right vectorized code, or getting a list of functions that are called from a certain function. The project can benefit from the portable ParseAPI layer, which provides the API to retrieve data from executable files. Initial ideas for the tool include:

  • It should be a GUI tool. Possibilities include: plugin for eclipse and/or a web application interface
  • Minimal functionality: list all functions in an executable file/library; browse through functions; list the disassembly of a selected function; provide histograms on assembly code; provide a configurable dashboard; provide configurable styling of assembly code; be extensible via a plugin mechanism.
  • Further ideas: provide a static call graph (like in valgrind); provide a high-level interface to unit-test assembly code ...

Requirements: Object-oriented analysis; (Java) GUI development ...

Mentor: Sandro Wenzel

Mentors

Here is the list of our mentors and their areas of expertise:

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

Tags

CERN SFT Gsoc 2012 ideas page

After experiencing great Google Summer of Code in 2011 we have decided to apply again. This year we will be widening the scope of our involvement in GSoC by offering our students to work not only on CernVM related projects, but also Geant4 simulation toolkit, and ROOT Object-Oriented data analysis framework. (that is the reason why we have decided to change the organization name from 'CERN Virtual Machine' to 'CERN SFT').

Same like last year, the project ideas are grouped into categories (ideas related to CERN Virtual Machine, CernVM File System-related ideas, CernVM Co-Pilot related ideas, Geant 4 related ideas, and ROOT related ideas). We also have a so-called 'Blue sky' ideas which are rather raw and must be worked on further before they become projects. The list of ideas is by no means complete, so if you think you have have a great idea regarding our projects we are certainly certainly looking forward to hear from you. The list of our mentors (and their areas of expertise) can be found below.

We encourage students who are planning to apply to do that (and contact us) as early as possible because we have learned from our previous GSoC experience that an initial student application often needs to be reworked in close collaboration with a future mentor. Please do not forget to provide us with some relevant information about yourself (for example CV, past projects, personal page or blog, linkedin profile, github account, etc.).  

Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC webapp before the 6th of April (19:00 UTC). The application template can be found here.

 

Project ideas

 

CERN Virtual Machine related projects

CernVM is a Virtual Software Appliance designed to provide a complete and portable environment for developing and running the data analysis of the CERN Large Hadron Collider (LHC) on any end-user computer (laptop, desktop) as well as on the Grid and on Cloud resources, independently of host operating systems. This "thin" appliance contains only a minimal operating system required to bootstrap and run the experiment software. The experiment software is delivered using the CernVM File System (CernVM-FS) that decouples the operating system from the experiment software life cycle.

 

CernVM Virtual Machine Lifecycle Management

Description: At CernVM we were looking for a complete solution for the maintenance of Virtual Machines (VM). After trying several commercial solutions to drive this process, none of which provided a common and coherent framework allowing full control of every step, we decided to combine existing open-source tools into an extensible framework that could serve as an end-to-end solution for our VM lifecycle management.
Having this basic framework in place, the task consists of creating and interconnecting the required agents to form a complete solution that implements the usual lifecycle of a VM: tuning the configuration, building the virtual disk images, testing the final product, publishing the release, instantiating machines on demand, contextualizing them, monitoring their health and sending back feedback (http://cernvm.cern.ch/portal/ibuilder). Part of the task will also be to create the required web-based front-ends that ease the management of the whole procedure.

Mentors: Predrag Buncic
Requirements: Good knowledge of Perl (preferred) or Python (for the agents) and good knowledge of Javascript and/or Objective-J (for the web front-ends). Experience with virtualization and virtual machines will be an asset.

 

CernVM File System related projects

The CernVM File System is a Fuse file system developed to deliver High Energy Physics (HEP) software stacks onto (virtual) worker nodes. It is used for instance by LHC experiments on their world-wide distributed computing infrastructure. HEP software is quite complex, with tens of gigabytes per release and 1-2 releases per week, while at the same time almost all the individual files are very small, resulting in hundreds of thousands of files per release. CernVM-FS uses content-addressable storage (like the Git version control system). For distribution it uses HTTP. File metadata are organized in trees of SQLite file catalogs. Files and file metadata are downloaded on demand and aggressively cached. The CernVM File System is part of the CernVM appliance but it compiles and runs on physical Linux/OS X boxes as well. It is mostly BSD licensed with small GPL parts. CernVM-FS source code of the current development branch can be downloaded from here, the documentation is available here.
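The content-addressable storage scheme can be illustrated with a short Python sketch (a simplified model of the idea, not CernVM-FS's actual on-disk layout; the fan-out into two-character subdirectories mimics what Git does):

```python
import hashlib
import os

def store(data: bytes, store_dir: str) -> str:
    """Store a blob under the hash of its content; return its address."""
    digest = hashlib.sha1(data).hexdigest()
    # Fan out into subdirectories (ab/cdef...) to keep directories small.
    subdir = os.path.join(store_dir, digest[:2])
    os.makedirs(subdir, exist_ok=True)
    path = os.path.join(subdir, digest[2:])
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as f:
            f.write(data)
    return digest

def load(digest: str, store_dir: str) -> bytes:
    """Retrieve a blob by its content address."""
    with open(os.path.join(store_dir, digest[:2], digest[2:]), "rb") as f:
        return f.read()
```

Because the address is derived from the content, files duplicated across releases are stored only once, and any blob fetched over HTTP can be verified against its own name.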

 

Dodgy ad-hoc network services for CernVM-FS

Description: The aim of this project is not to build things that work but to build things that break—in a predictable way.  CernVM-FS runs in a variety of different environments, such as end-user laptops, virtual machines on Amazon EC2 or other clouds, or managed servers in a computing center.  In order to function properly, it needs to connect to network services, for instance a DNS service, a web proxy and a web server.  All of these components can (and will!) be misconfigured, operate in unexpected ways, or simply fail at some point.

The task is to develop an extensible framework that allows for the ad-hoc deployment of an ensemble of distributed stub services.  Stub services for DNS, HTTP, and HTTP proxy have to be implemented using this framework.  These stub services should support reproducible error conditions, such as network glitches, lack of disk space, HTTP errors, data corruptions, unsolicited DNS redirections, etc. The framework should be integrated with the CernVM-FS test suite.

Mentors: Jakob Blomer
Requirements: Good knowledge of Linux environment, experience with web servers and DNS servers, experience with scripting languages (Perl/Python/bash).
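To make the notion of a predictably breaking stub service concrete, here is a minimal sketch in Python (the handler class, failure rate and seed are invented for the example): a tiny HTTP server that injects 503 errors in a reproducible sequence thanks to a fixed random seed.

```python
import http.server
import random

class FaultyHandler(http.server.BaseHTTPRequestHandler):
    """Stub HTTP service that fails in a configurable, reproducible way."""
    failure_rate = 0.5
    rng = random.Random(42)  # fixed seed: the same failure sequence on every run

    def do_GET(self):
        if self.rng.random() < self.failure_rate:
            self.send_error(503, "injected failure")  # simulated outage
        else:
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

# To run standalone:
#   http.server.HTTPServer(("127.0.0.1", 8080), FaultyHandler).serve_forever()
```

The same pattern extends to the other error conditions listed above (truncated responses for data corruption, delayed replies for network glitches, and so on), while the seed keeps every test run reproducible.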

 

Repository archival storage

Description: A CernVM-FS repository typically contains the entire software release history of an experiment collaboration.  For large experiments (say the ATLAS experiment at the LHC), such a release history can grow to hundreds of releases with tens of millions of files altogether.  While for the sake of reproducibility all the releases need to stay accessible, the older releases don't change anymore.  They retire and could be archived in larger container files in order to relieve the storage backend of a large fraction of its many small files.

The first step of the task is to come up with a suitable container format.  This might be a well-known format such as zip or a qcow2 hard disk image, or alternatively something simple and handcrafted.  The CernVM-FS file catalogs have to be extended in order to store information about the container contents.  Chunks from the container have to be served in a transparent way, which can be done for instance by an Apache module.  In order to archive a live repository and to unarchive a retired repository, conversion tools are required.

Mentors: Jakob Blomer
Requirements: Good knowledge of C++, file systems, HTTP and web servers

 

Continuous cartography of public network services

Description: CernVM-FS clients pop up in many places of the world.  In order to provide clients a low-latency access to files, there are several mirror servers deployed in different geographical regions. Clients can be redirected to a close server using Geo-DNS (i.e. the DNS reply depends on the network block of the source IP address). As long as the database of available mirror servers and their location is managed manually, however, any change is a cumbersome process that requires a remapping of network blocks. Ideally, the mapping of network blocks to mirror servers is computed automatically.

The main task is to implement an algorithm that pre-processes a set of servers and their location as well as a Geo-IP database (e.g. the GeoLite City database).  The pre-processing should result in buckets of network blocks representing networks that are close to a specific server.  Options to do so are, for instance, a Voronoi diagram on the globe or the use of network coordinates.  The maintenance of the system requires web services / pages for registering and unregistering servers as well as a visualization of the current status of the map.

Mentors: Carlos Aguado Sanchez and Jakob Blomer
Requirements: Good knowledge of TCP/IP, experience building web applications, and scripting languages (Perl/Python/bash).  Knowledge of map services (Google Maps, OpenStreetMap) is a plus.
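The core geometric step can be sketched in a few lines of Python (the mirror list and coordinates are made up for the illustration): assigning a geolocated network block to its nearest mirror by great-circle distance. A real implementation would pre-compute buckets of network blocks rather than evaluate this per query.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Hypothetical mirror list: name -> (latitude, longitude)
MIRRORS = {
    "cern":     (46.23, 6.05),
    "fermilab": (41.83, -88.25),
}

def closest_mirror(lat, lon):
    """Map a geolocated network block to its nearest mirror server."""
    return min(MIRRORS, key=lambda m: haversine_km(lat, lon, *MIRRORS[m]))
```

Pre-computing such assignments for all network blocks in the Geo-IP database yields exactly the buckets the Geo-DNS configuration needs.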

 

CernVM Co-Pilot related projects

CernVM Co-Pilot is a framework which allows one to instantiate a ready-to-use computing infrastructure on top of distributed computing resources. Such resources include enterprise cloud computing infrastructures (e.g. Amazon EC2), scientific computing clouds (e.g. Nimbus), and, last but not least, resources provided by users participating in volunteer computing projects (e.g. those powered by the BOINC volunteer computing platform). Co-Pilot has been used to build the LHC@home 2.0 volunteer computing cloud, which has about 12,000 registered users (and about 15,000 registered hosts) from all over the world. CernVM Co-Pilot is distributed as part of CernVM. Co-Pilot source code is available here and the documentation is available here.

 

CernVM Co-Pilot automated deployment and testing

Description: Create a framework for the automated deployment of CernVM Co-Pilot service and agent nodes, and develop a set of unit tests for the functionality of Co-Pilot. The framework should be able to automatically deploy all services needed for running Co-Pilot (such as Jabber, Redis, and Graphite), install the Co-Pilot components, configure them on the fly, run the test cases, and generate reports. The framework will be used to validate new developments and features in an automated way, making use of currently deployed tools (Tapper, Graphite) to generate reports.

Mentors: Artem Harutyunyan
Requirements: Good knowledge of Perl (or Python) and Bash.

 

CernVM Co-Pilot Agent on mobile

Description: Port the CernVM Co-Pilot Agent to a mobile platform. It can later be used either for building a framework on top (similar to the NASA application) to communicate news and interesting information to the users of the app, or as a base platform for deploying and running volunteer thinking applications.

Mentors: Artem Harutyunyan
Requirements: Experience with mobile development (either Android or iOS), knowledge of XMPP desirable.

 

CernVM Co-Pilot log analysis and visualization

Description: CernVM Co-Pilot services generate a lot of log data. Most of it is currently stored in log files. The data related to the overall status of server-side services (queue length, number of executed jobs, etc.) are sent to the Co-Pilot Monitoring component, which stores them using Graphite. The monitoring system, however, does not provide information about the Co-Pilot Agents, so in many cases we need to dig into log files to find things out (e.g. to see when a particular user last successfully executed a job).
The task of this project is to improve this by extending the Co-Pilot Monitoring system. The aim is to represent the log data related to Co-Pilot Agents in an intuitive, visual, searchable and easily accessible manner.

Mentors: Artem Harutyunyan
Requirements: Good knowledge of scripting languages, experience with log data management (and visualization) will be an asset.
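As a toy illustration of the kind of log digestion involved, the following Python sketch extracts the last successful job per user (the log format here is invented for the example; real Co-Pilot logs differ):

```python
import re
from datetime import datetime

# Hypothetical log format; a real implementation would match the actual
# Co-Pilot Agent log lines.
LINE_RE = re.compile(r"(\S+ \S+) agent user=(\S+) status=(\S+)")

def last_success_per_user(lines):
    """Return {user: timestamp of that user's last successfully executed job}."""
    last = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group(3) == "job_done":
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            user = m.group(2)
            if user not in last or ts > last[user]:
                last[user] = ts
    return last
```

Feeding such aggregates into the existing Graphite-based monitoring, instead of grepping log files by hand, is the point of the project.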

 

Geant 4 Simulation Toolkit related projects

The Geant4 toolkit simulates the interactions of elementary particles and radiation with matter. It is used to simulate the detectors of the LHC and other High Energy Physics (HEP) experiments. It finds application in other areas, from assessing the effects of radiation on the electronics of satellites to improving the design of medical detectors with the custom GATE application. LHC experiments use Geant4 to compare the signatures of rare events from new physics (such as the Higgs boson) to those coming from known interactions. The open source toolkit is developed by the Geant4 collaboration, which brings together 90 physicists and engineers from laboratories and universities around the world. Developments are ongoing to improve its precision and scope of application, and to better utilise current and emerging computer architectures.
The simulation programs of the LHC experiments use the Geant4 simulation toolkit to produce simulated LHC events, running continuously on about 100,000 CPU cores throughout the year. The achievable statistics remain a limitation on the analysis potential for some interesting types of new physics. The goal of this project is therefore to explore different ways to reduce the execution time of Geant4 on today’s complex commodity CPUs, and to prototype how to use it efficiently on the many-core hardware of the future (tens, hundreds of cores, threads or ‘warps’).
The code required to model diverse types of particles and interactions, and the complex geometries of detectors, is large. It therefore overwhelms the caches of current CPUs, significantly reducing the efficiency with which today’s hardware is utilised. This effort focuses on identifying ways to spread the work between threads and/or to reorganise it. By using less code in each core we aim to make better use of the memory architectures of today’s hardware, while preparing the way to obtain good performance on tomorrow’s hardware.

 

Expand Geant4 multi-threading to the particle track-level

Description: Build a prototype of Geant4 which uses a separate thread to simulate the interactions of each particle track. First simulate all interactions from one primary track in one thread, following all its ‘secondary’ particles, the spin-off particles produced by interactions. Build on and extend the existing multi-threaded Geant4 (G4MT) prototype; G4MT currently handles in the same thread the full set of particle tracks which comprise a collision event. In a second stage, farm out some secondary tracks to other threads, using different criteria. A key goal is to ensure that all results remain independent of the number of threads used, thus keeping simulations reproducible.

Mentors: John Apostolakis
Requirements: Good knowledge of C++, object-oriented programming and parallel programming using threads is needed. The ability to quickly understand the key classes of a large existing toolkit, and familiarity with any type of Monte Carlo simulation or basic knowledge of atomic physics, will be beneficial.
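The reproducibility requirement can be illustrated with a toy sketch in Python (Geant4 itself is C++; the 'physics' here is a placeholder): if each track owns a random-number generator seeded by its track ID, the results are identical regardless of how many threads run the tracks.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def simulate_track(track_id: int) -> float:
    """Placeholder for simulating one primary track and all its secondaries.

    The per-track RNG is seeded by the track ID, so the result does not
    depend on which thread runs the track or on how many threads exist.
    """
    rng = random.Random(track_id)
    return sum(rng.random() for _ in range(100))  # stand-in 'energy deposit'

def simulate_event(n_tracks: int, n_threads: int) -> list:
    """Farm the tracks of one event out to a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(simulate_track, range(n_tracks)))
```

Decoupling each track's random-number stream from the scheduling order is the standard way to make a parallel Monte Carlo simulation thread-count independent.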

 

Reorder Execution to Improve Throughput

Description: Identify hot spots of Geant4-MT (and sequential version) where instruction or data fetching or memory allocation are a significant limiting factor. Investigate ways to rearrange the execution in order to speedup the program. Review the management of key objects (tracks, touchable volumes) seeking to reduce cache misses and reduce execution time. Examine locality of objects, effects of allocation strategy for high-use lightweight objects.

Mentors: John Apostolakis, Gabriele Cosmo
Requirements: Good knowledge of C++ and Object-Oriented Programming. Familiarity with profiling tools will be an asset.

 

Evaluate tools and strategies for parallelization, to guide implementation

Description: Use the sequential and multi-thread versions of Geant4 as a benchmark for the tools that assess parallelisation potential of applications. Evaluate tools which allow annotation of an application to estimate the benefits of different approaches, without making code changes. The project will validate the estimates of such tools, comparing with the results of the choices made in the existing multi-threaded prototype Geant4-MT. In addition it will evaluate additional avenues for parallelisation of Geant4, and examine alternative design choices. Identify a promising way, alternative or complementary, to improve throughput and implement it.

Mentors: John Apostolakis, Andrzej Nowak
Requirements: Good knowledge of C++ and Object-Oriented Programming, and the ability to understand quickly the key classes of a large, existing toolkit. Familiarity with any type of Monte Carlo simulation or basic knowledge of atomic physics will be beneficial.

 

ROOT Object-Oriented data analysis framework

The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way. Having the data defined as a set of objects, specialized storage methods are used to get direct access to the separate attributes of the selected objects, without having to touch the bulk of the data. Included are histogramming methods in an arbitrary number of dimensions, curve fitting, function evaluation, minimization, graphics and visualization classes to allow the easy setup of an analysis system that can query and process the data interactively or in batch mode, as well as a general parallel processing framework, PROOF, that can considerably speed up an analysis.
Thanks to the built-in CINT C++ interpreter, the command language, the scripting (or macro) language and the programming language are all C++. The interpreter allows for fast prototyping of macros since it removes the time-consuming compile/link cycle. It also provides a good environment to learn C++. If more performance is needed, the interactively developed macros can be compiled using a C++ compiler via a machine-independent transparent compiler interface called ACLiC.
ROOT is an open system that can be dynamically extended by linking external libraries. This makes ROOT a premier platform on which to build data acquisition, simulation and data analysis systems. ROOT is the de-facto standard data storage and processing system for all High Energy Physics labs and experiments worldwide. It is also being used in other fields of science and beyond (e.g. finance, insurance, etc).

 

Scientific Visualization in JavaScript

Description: The ROOT project is developing a JavaScript ROOT I/O library to be able to read ROOT files from any modern web browser without having to install anything on the client or server side. This project is advancing rapidly (see here), and we are now evaluating several open source JavaScript visualization libraries to display the different graphical objects that can be read from those ROOT files (1D, 2D, 3D histograms, 1D & 2D graphs, multi-graphs, see here and here).
As none of the existing JavaScript visualization libraries offers the required scientific graphing functionality, it should be implemented in one of them (e.g. d3 or Highcharts). The developments will be donated back to the original library project.

Mentors: Bertrand Bellenot
Requirements: Good knowledge of JavaScript. Experience with scientific visualization would be an asset.

 

'Blue sky' ideas

 

  • Reliable and scalable storage backend for CernVM-FS. Currently CernVM-FS uses ZFS as its storage backend. The idea is to come up with an architecture for the backend storage system of CernVM-FS. The storage should be reliable and scalable, support snapshots and rollbacks, and be optimized for hosting many (~10^8) small (~10 kB) files. The storage will be used to transform large directory trees into content-addressable storage; the task would also involve distributing the transformation among a couple (or potentially many) worker and storage nodes.

Mentor: Jakob Blomer, Predrag Buncic, Artem Harutyunyan
Requirements: Good knowledge of existing storage technologies (e.g. distributed file systems), experience with parallel/distributed programming.

  • Camlistore storage backend for Co-Pilot. The idea is to replace the current Chirp storage backend with Camlistore.

Mentor: Artem Harutyunyan
Requirements: Experience with Linux, experience with Bash and Perl (or Python), network programming experience, familiarity with Camlistore.

  • LHCb adapter for Co-Pilot. The idea is to implement a Co-Pilot Job Manager which will make it possible to get jobs from the DIRAC system.

Mentor: DIRAC expert + Artem Harutyunyan
Requirements: Experience with Perl or Python, network programming experience, experience with LHCb DIRAC.

  • CMS adapter for Co-Pilot. The idea is to implement a plugin (aka Adapter) for the Co-Pilot system which will make it possible to get jobs from the CRAB system.

Mentor: CRAB expert + Artem Harutyunyan
Requirements: Experience with Perl or Python, network programming experience, experience with CMS CRAB.

  • Tier 3 Grid site out of the box. Grid middleware and the application software stacks of scientific collaborations are not easy to deploy and maintain. We think that virtualisation technologies can greatly facilitate the job of Grid site administrators. The task would be to implement a workflow manager (e.g. based on Taverna) which can be used to ease Grid site deployment and maintenance. This is a very important problem for relatively small universities and institutions which need to fulfill their commitments to providing Grid resources with limited manpower.

Mentor: Predrag Buncic
Requirements: Experience with Perl (or Python) and Bash; experience with Taverna and/or another workflow management tool will be a plus.

  • Geant 4 on GPUs. Create a small prototype of Geant4 on GPUs for one particle type (gammas or electrons), combining a physics algorithm with geometry navigation for a small selection of volume types. Can start from the geometry prototype of Otto Seiskari (OpenCL, CUDA).

Mentor: John Apostolakis
Requirements: Experience with C/C++, OpenCL/CUDA

  • Geant 4 vectorized solid. Create a vectorised/GPU-capable solid that can be used in place of the most popular existing solids (box, tube, conical & spherical shells). Inspired by the approach of James Tickner (see the article in Computer Physics Communications, Vol. 181, Issue 11, Nov 2010, pages 1821-1832, available at http://dx.doi.org/10.1016/j.cpc.2010.07.001).

Mentor: Gabriele Cosmo
Requirements: Experience with C/C++, vectorization

  • New simulation engine. Create a prototype simulation engine for one particle type with a limited geometry choice (1-2 volume types) using a high-productivity parallel language (Chapel, X10, TBB) or a parallel extension of C++ (Cilk, SplitC). Benchmark this against existing solutions for the same problem. Document the development effort and the pitfalls of the language, tools and implementation (potential for a scientific report on the coding experience, in addition to the results).

Mentor: John Apostolakis
Requirements: Experience with C/C++, Chapel/X10/TBB, and/or Cilk/SplitC

 

Mentors

Here is the list of our mentors and their areas of expertise:

 

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

  • SFT GSoC mailing list: sft-gsoc-AT-cern-DOT-ch (no subscription needed).
  • SFT GSoC Jabber/XMPP chat room: gsoc-sft@conference.jabber.org . We noticed that Gmail/Gtalk XMPP accounts have problems posting messages to the chat room, so please register an XMPP account on some other server (it takes no time to register an account at http://www.jabber.org).
  • IRC is restricted at CERN so please use Jabber instead.
Tags

CERN SFT Gsoc 2014 ideas page

Following the great experiences of Google Summer of Code in 2011, 2012 and 2013, we are applying again in 2014!
Project ideas are grouped into categories (ideas related to the SixTrack accelerator simulation engine used in the LHC@Home application, to the Cling interpreter inside ROOT, to the performant and versatile IgProf profiler, to the CERN Virtual Machine, and to the Geant 4 simulation tools). We also have so-called 'Blue sky' ideas, which are rather raw and must be worked on further before they become projects. And we are open to other great ideas, and are looking forward to hearing from students with new perspectives on our projects. The list of our mentors (and their areas of expertise) can be found below.

LHC Experiment Software Stack

Our projects are all related to code used for the LHC accelerator and its experiments. They are almost as diverse as the software stack of the LHC experiments, ranging from adapting the CernVM file system to precisely tracking hundreds of protons over millions of turns around a model of the LHC accelerator.  The LHC experiments have software frameworks that make use of common scientific software libraries for high-energy physics (HEP) and many other open source tools.  CernVM and CernVM-FS provide a uniform runtime environment based on Scientific Linux.  The software stack is fully open sourced; many parts of it are used outside the world of particle physics, for example in simulating detectors for medical imaging or estimating the dose deposited in the sensitive electronics of satellites as they fly through the Earth's radiation belts.

 

We encourage students who plan to apply to contact us about their interests and explain their project ideas as early as possible. Our experience from previous GSoC participation was that an initial student application frequently either needs to be reworked in close collaboration with a future mentor, or at least can benefit from feedback and discussion. Please do not forget to provide us with some relevant information about yourself (for example CV, past projects, personal page or blog, LinkedIn profile, GitHub account, etc.).

 

Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC webapp before the 21st of March (19:00 UTC).

 

 

Project ideas

 

ROOT Object-Oriented data analysis framework


The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way. Having the data defined as a set of objects, specialized storage methods are used to get direct access to the separate attributes of the selected objects, without having to touch the bulk of the data. Included are histogramming methods in an arbitrary number of dimensions, curve fitting, function evaluation, minimization, graphics and visualization classes to allow the easy setup of an analysis system that can query and process the data interactively or in batch mode, as well as a general parallel processing framework, PROOF, that can considerably speed up an analysis. Thanks to the built-in C++ interpreter, the command language, the scripting (or macro) language and the programming language are all C++. The interpreter allows for fast prototyping of macros since it removes the time-consuming compile/link cycle. It also provides a good environment to learn C++. If more performance is needed, the interactively developed macros can be compiled.

 

ROOT's new C++11 standard-compliant interpreter is Cling, an interpreter built on top of the Clang (www.clang.llvm.org) and LLVM (www.llvm.org) compiler infrastructure. Cling is being developed at CERN as a standalone project. It is being integrated into the ROOT data analysis (root.cern.ch) framework, giving ROOT access to a C++11 standards-compliant interpreter.

ROOT is an open system that can be dynamically extended by linking external libraries. This makes ROOT a premier platform on which to build data acquisition, simulation and data analysis systems. ROOT is the de-facto standard data storage and processing system for all High Energy Physics labs and experiments worldwide. It is also being used in other fields of science and beyond (e.g. finance, insurance, etc).

Complete the ROOT-R interface

Description: Complete the interface in ROOT to call R functions using the R C++ interface (Rcpp, see http://dirk.eddelbuettel.com/code/rcpp.html). Make available in ROOT many of the statistical packages available in R, such as those performing regression and/or multivariate analysis. Developing this interface opens the possibility of using from ROOT the very large set of mathematical and statistical tools provided by R.

Expected Results: Implement class(es) following some of the existing algorithm interfaces in ROOT, so that the R functionality can be used directly from the ROOT classes for statistical studies (e.g. multivariate analysis) without knowing the low-level details. Another objective will be to package the ROOT-R interface in a library which can be optionally distributed with ROOT.

Mentor: Lorenzo Moneta

Requirements: Basic knowledge of C++.  Knowledge of ROOT and/or R would be an advantage.

 

Coding rules and style checker based on the Clang Static Analyzer

Description: Code maintenance is very much facilitated if coding and style rules are followed. This is the case for the ROOT project, for which a number of rules have been defined since the early days of the project. A commercial tool was used to check the code for rule violations and to present the results in an easy form for developers to act on. With a commercial solution, adding new rules has become a real problem. The idea is to re-implement the existing rules with an open-source tool, which can be extended and adapted to also fulfill the needs of other software development projects.

Expected Results: Develop a new C++ code checker tool, possibly based on the Clang Static Analyzer, which initially implements the set of ROOT coding rules and is extensible to other sets of rules for different projects.  The coding rules that will need to be implemented are listed at http://root.cern.ch/root/nightly/codecheck/rules.html, and the results of the analysis should be presented in an easy and accessible way for developers to identify which rules have been violated by the latest commit to the repository. The existing (commercial) tool produces the following table: http://root.cern.ch/root/nightly/codecheck/codecheck.html

Mentor: Pere Mato, Olivier Couet

Requirements: Basic knowledge of C++, basic knowledge of Clang/Clang Static Analyzer

 

Code copy/paste detection

Description: Copy/paste is a common programming practice. Most programmers start from a code snippet that already exists in the system and modify it to match their needs. Easily, some code snippets end up being copied dozens of times, which leads to worse maintainability, understandability and logical design. Clang and Clang's static analyzer provide all the building blocks to build a generic C/C++ copy/paste detector.
Expected results: Build a standalone tool or Clang plugin able to detect copy/pasted code. Lay the foundations for the detection of slightly modified code (semantic analysis required). Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.
Required knowledge: Advanced C++, basic knowledge of Clang/Clang Static Analyzer.

Mentor: Vasil Vasiliev
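As a rough illustration of the principle (a real tool would compare normalized Clang ASTs or token streams, not raw text), duplicate code can be found by hashing a sliding window of lines:

```python
import hashlib
from collections import defaultdict

def find_clones(source: str, window: int = 4):
    """Report groups of 1-based line numbers where the same window-line
    chunk of code repeats.

    Lines are stripped of surrounding whitespace, a crude stand-in for the
    token-level normalization a Clang-based tool would perform.
    """
    lines = [ln.strip() for ln in source.splitlines()]
    seen = defaultdict(list)
    for i in range(len(lines) - window + 1):
        chunk = "\n".join(lines[i:i + window])
        if chunk.strip():  # skip all-blank windows
            digest = hashlib.md5(chunk.encode()).hexdigest()
            seen[digest].append(i + 1)
    return [starts for starts in seen.values() if len(starts) > 1]
```

Detecting "slightly modified" copies, as the project asks, then amounts to making the normalization step stronger (renaming identifiers, ignoring literals), which is where the Clang AST becomes essential.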

 

Cling projects

 

ROOT's new C++11 standard-compliant interpreter is Cling, an interpreter built on top of Clang (www.clang.llvm.org) and LLVM (www.llvm.org) compiler infrastructure. Cling is being developed at CERN as a standalone project. It is being integrated into the ROOT data analysis (root.cern.ch) framework, giving ROOT access to an C++11 standards compliant interpreter.

 

Cling bundle for most popular platforms

Description: Standalone Cling is in a fairly stable state. We'd like to ship it to end users as a package through apt-get, yum and so on (Windows installer included).
Expected results: An automatic tool (script) wrapping the newest version of Cling into an installable package and registering it in the repositories. Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.
Required knowledge: Basic knowledge of package managers, Shell, Bash, Python

Mentor: Vasil Vasiliev

 

Cling name autodetection and library autoloading

Description: We propose to improve Cling's interactive prompt to provide hints about where names come from. For example, typing std::vector a; at the [cling] prompt when the header has not been included should produce an error such as: std::vector is missing. Please type #include <vector>. Alongside this, we would also like a hint about which library needs to be loaded.
Expected results: Be able to detect a missing include and its corresponding library. Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.
Required knowledge: Advanced C++, intermediate knowledge of compilers/interpreters, basic knowledge of Clang and Cling.

Mentor: Vasil Vasiliev
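The essence of the hint mechanism can be sketched as a lookup from an undeclared name to its header and library (the table below is hand-written for the example; a real implementation would build such an index from the system headers and libraries):

```python
# Hypothetical lookup table: undeclared name -> (header, library).
KNOWN_NAMES = {
    "std::vector": ("<vector>", "libstdc++"),
    "printf":      ("<cstdio>", "libc"),
}

def suggest(name: str):
    """Suggest the #include and library for an undeclared name, if known."""
    if name in KNOWN_NAMES:
        header, library = KNOWN_NAMES[name]
        return "'%s' is missing. Please type #include %s (in %s)" % (
            name, header, library)
    return None
```

In Cling itself, the lookup would be triggered from Clang's diagnostics for unresolved names, and the library half of the answer feeds the autoloading machinery.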

 

LINQ 4 ROOT and Cling

Description: LINQ originally referred to the SQL-like syntax in C# and VB, but over time it has come to mean the way you manipulate lists using the higher-order functions provided by System.Linq. Working with lists using higher-order functions has been available to functional and Smalltalk developers since the 70s, but has recently been popularized in mainstream languages such as C#, VB, JavaScript, Python, Ruby and so on. There are a few libraries that bring this style into C++11. We'd like to investigate their strengths and weaknesses, adopt them in our C++11 interpreter, and provide thorough real-world examples of their use.
Expected results: Thorough analysis of the pros and cons of the available C++ LINQ libraries. Adoption in Cling, meeting Cling's QA requirements. Implement tests for all the realized functionality. Prepare a final poster of the work and be ready to present it.
Required knowledge: Advanced C++, basic knowledge of Cling and C++11

Mentor: Vassil Vassilev
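As an illustration of the style the project targets (sketched here in Python rather than C++, with a hypothetical Enumerable wrapper), a LINQ-like query is simply a left-to-right chain of higher-order functions:

```python
class Enumerable:
    """Tiny LINQ-like wrapper: each method returns a new Enumerable, so
    queries compose left-to-right the way System.Linq pipelines do."""
    def __init__(self, items):
        self._items = list(items)
    def where(self, pred):
        return Enumerable(x for x in self._items if pred(x))
    def select(self, fn):
        return Enumerable(fn(x) for x in self._items)
    def to_list(self):
        return list(self._items)

# "Squares of the even numbers below 10", written as a pipeline
squares_of_evens = Enumerable(range(10)) \
    .where(lambda x: x % 2 == 0) \
    .select(lambda x: x * x) \
    .to_list()
# squares_of_evens == [0, 4, 16, 36, 64]
```

The candidate C++ libraries provide the same chaining on top of iterators or ranges; part of the analysis would be how well each integrates with Cling's interactive evaluation.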

 

Cling language support

Description: Clang (the underlying compiler library) supports various programming languages besides C and C++: it offers OpenCL, CUDA, Objective-C and Objective-C++. Cling's design and implementation are not bound to support only C/C++. We propose to improve the inherited language support in the interpreter by supporting as many languages as the underlying compiler library (Clang) supports.
Expected results: adjustment of Cling's custom AST transformation passes, implementing loading of the corresponding language runtime environment and adding corresponding test cases. Prepare a final poster of the work and be ready to present it.
Required knowledge: intermediate C++, Clang abstract syntax tree (AST), basic Objective-C/C++.

Mentor: Vassil Vassilev

 

Implement Automatic Differentiation library using Cling

Description: In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative technique to symbolic differentiation and numerical differentiation (the method of finite differences). The automatic differentiation library is to be based on Cling, which provides the necessary facilities for code transformation.
Expected results: The AD library is able to differentiate non-trivial functions, to find a partial derivative for trivial cases, and has good unit test coverage.
Required knowledge: Advanced C++, Clang abstract syntax tree (AST), basic math.

Mentor: Vassil Vassilev
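Forward-mode AD, the simplest of these techniques, can be sketched with dual numbers. The Python below only illustrates the principle; the proposed library would instead transform C++ code through Cling.

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0: carrying (value, derivative)
    through arithmetic yields the exact derivative of the computation."""
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__
    def __mul__(self, other):
        # Product rule: (uv)' = u v' + u' v
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def derivative(f, x):
    """Evaluate f'(x) by seeding the derivative slot with 1."""
    return f(Dual(x, 1.0)).deriv

# d/dx (x^2 + 3x) at x = 2 is 2*2 + 3 = 7
assert derivative(lambda x: x * x + 3 * x, 2.0) == 7.0
```

A Cling-based library would achieve the same effect by rewriting the AST of the target function rather than overloading operators at run time.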

 

 

CernVM & CernVM-FS


CernVM is a Virtual Software Appliance designed to provide a complete and portable environment for developing and running the data analysis of the CERN Large Hadron Collider (LHC) on any end-user computer (laptop, desktop) as well as on the Grid and on Cloud resources, independently of host operating systems. This "thin" appliance contains only a minimal operating system required to bootstrap and run the experiment software. The experiment software is delivered using the CernVM File System (CernVM-FS),  a Fuse file system developed to deliver High Energy Physics (HEP) software stacks onto (virtual) worker nodes.

 

CernVM-FS is used, for instance, by the LHC experiments on their world-wide distributed computing infrastructure. HEP software is quite complex, with tens of gigabytes per release and 1-2 releases per week while, at the same time, almost all of the individual files are very small, resulting in hundreds of thousands of files per release. CernVM-FS uses content-addressable storage (like the Git version control system). For distribution it uses HTTP. File metadata are organized in trees of SQLite file catalogs. Files and file metadata are downloaded on demand and aggressively cached. The CernVM File System is part of the CernVM appliance, but it compiles and runs on physical Linux/OS X boxes as well. It is mostly BSD licensed, with small GPL parts. The CernVM-FS source code of the current development branch can be downloaded from here; the documentation is available here.
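The content-addressable scheme can be sketched in a few lines of Python. This is an illustration only: the hash function and in-memory store below are stand-ins for CernVM-FS's actual storage back-end.

```python
import hashlib

def content_address(data: bytes) -> str:
    """The content hash serves as the storage key: identical content
    deduplicates automatically, modified content becomes a new object."""
    return hashlib.sha1(data).hexdigest()

store = {}

def put(data: bytes) -> str:
    key = content_address(data)
    store[key] = data          # writing the same bytes twice is a no-op
    return key

old = put(b"release 1.0")
new = put(b"release 1.1")      # a modified file is stored as a *new* object
assert old != new and store[old] == b"release 1.0"
```

Because old objects are never overwritten, previous file system revisions stay addressable, which is exactly the versioning property described above.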

 

Streamline CernVM Contextualization Plug-ins

Description: CernVM is a virtual appliance that can be used by the four LHC experiments to run simulation and data processing applications in the cloud. Unlike standard virtual machine images, CernVM provides a uniform configuration interface across the most relevant cloud infrastructures (Amazon EC2, Google Compute Engine, CERN OpenStack, Nimbus ScienceCloud, ...). This is achieved by so-called "contextualization plugins", light-weight agents inside the virtual machine that detect the infrastructure at hand and dynamically adapt the image. With the transition of most cloud infrastructures from an early exploitation phase to a production service, the contextualization plugins also need to be evolved, streamlined, and optimized. For this project, the student develops or evolves a common framework for contextualization plugins, and measures and optimizes the virtual machine boot-time delay due to these plugins. The student will also get in touch with the computing groups of the LHC experiments to ensure CernVM fits within their distributed computing environments.

Expected Result: A common framework for contextualization plug-ins, with measured and reduced virtual machine boot-time overhead.
Mentors: Gerardo Ganis 
Requirements:  Good knowledge of Linux/Unix, experience with scripting languages (Perl/Python/BASH). Experience with virtualization technology is a plus.

 

Geant 4 Simulation Toolkit and Geant Vector Prototype


The Geant4 toolkit simulates the interactions of elementary particles and radiation with matter. It is used to simulate the detectors of the LHC and other High Energy Physics (HEP) experiments. It finds application in other areas, from assessing the effects of radiation on the electronics of satellites to improving the design of medical detectors with the custom GATE application. LHC experiments use Geant4 to compare the signatures of rare events from new physics (such as the Higgs boson) to those coming from known interactions. The open source toolkit is developed by the Geant4 collaboration, which brings together 90 physicists and engineers from laboratories and universities around the world.

Developments are ongoing to improve its precision and scope of application, and to better utilise current and emerging computer architectures. The simulation programs of the LHC experiments use the Geant4 simulation toolkit to produce simulated LHC events, running continuously on about 100,000 CPU cores throughout the year. Even these statistics remain a limitation in the analysis potential for some interesting types of new physics. The goal of the project is therefore to explore different ways to reduce the execution time of Geant4 on today's complex commodity CPUs, and to prototype how to use it efficiently on the many-core hardware of the future (tens, hundreds of cores, threads or 'warps').

The code required to model the diverse types of particles and interactions, and the complex geometries of detectors, is large. As a result it overwhelms the caches of current CPUs, significantly reducing the efficiency of utilisation of today's hardware. This effort is focused on identifying ways to spread the work between threads and/or to reorganise it. By using less code in each core we aim to make better use of the memory architectures of today's hardware, and at the same time prepare the way to obtain good performance on tomorrow's hardware.

 

Reengineer Propagation of Charged Tracks in a Magnetic Field for Vector and GPU

Description: A significant part of the CPU time in simulations of large detectors is taken in integrating the motion of particles in an electromagnetic field, using numerical integration techniques.  Our idea is to redesign the classes used in propagation, to avoid virtual function calls and aid optimization including vectorisation. We target common code which can be used in several modes: in sequential simulation for a single particle, in vector mode for a set of tracks or on a GPU for a single track.

Expected Result: Created new implementations using template techniques and vectorization that can be used across vector and non-vector CPUs and GPUs, and benchmark the speed for a vector of particles propagating in a magnetic field against the existing sequential version.

Mentors: John Apostolakis, Sandro Wenzel 
Requirements: Good knowledge of C++ and Object-Oriented Programming and experience with parallel programming for vectors or on GPUs are essential. Knowledge of the numerical solution of ordinary differential equations will be beneficial.
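For concreteness, here is a scalar Python sketch of the numerical core: one classical 4th-order Runge-Kutta step for the (non-relativistic) equations of motion dr/dt = v, dv/dt = (q/m) v x B. The names and structure are illustrative only, not the actual Geant4 classes; the project would template the equivalent C++ over the field type and evaluate many tracks per call.

```python
import math

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def lorentz_rhs(state, qm, B):
    """Right-hand side in a pure magnetic field: dr/dt = v, dv/dt = qm * v x B."""
    r, v = state
    return (v, tuple(qm * c for c in cross(v, B)))

def rk4_step(state, h, qm, B):
    """One classical Runge-Kutta step for state = (position, velocity)."""
    def add(s, k, f):   # s + f*k, componentwise for position and velocity
        return (tuple(x + f*dx for x, dx in zip(s[0], k[0])),
                tuple(x + f*dx for x, dx in zip(s[1], k[1])))
    k1 = lorentz_rhs(state, qm, B)
    k2 = lorentz_rhs(add(state, k1, h/2), qm, B)
    k3 = lorentz_rhs(add(state, k2, h/2), qm, B)
    k4 = lorentz_rhs(add(state, k3, h), qm, B)
    def comb(i):
        return tuple(s + h/6*(a + 2*b + 2*c + d)
                     for s, a, b, c, d in zip(state[i], k1[i], k2[i], k3[i], k4[i]))
    return (comb(0), comb(1))

# The magnetic force does no work, so the speed should stay (almost) constant.
step = rk4_step(((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), 0.01, 1.0, (0.0, 0.0, 1.0))
```

Removing virtual dispatch from the inner functions above is what allows the compiler to vectorise this loop over a whole bundle of tracks.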

Sixtrack numerical accelerator simulation


SixTrack is a simulation tool for the trajectory of high-energy particles in accelerators. It has been used in the design and optimization of the LHC and is now being used to design its upgrade that will be installed in the next decade, the High-Luminosity LHC (HL-LHC). SixTrack has been adapted to take advantage of large-scale volunteer computing resources provided by the LHC@Home project. It has been engineered to give the exact same results after millions of operations on several, very different computer platforms. The source code is written in Fortran and is pre-processed by two programs that assemble the code blocks and provide automatic differentiation of the equations of motion. The code relies on the crlibm library, careful arrangement of parentheses, dedicated input/output and selected compilation flags for the most common compilers to provide identical results on different platforms and operating systems. An option enables the use of the BOINC library for volunteer computing. A running environment, SixDesk, is used to generate input files, split simulations for LHC@Home or the CERN cluster, and collect the results for the user. SixTrack is licensed under LGPLv2.1.

 

A strong background in computer science and programming languages, as well as an interest in understanding the computational physics methods implemented in the code, are sought. The student will face the unique challenge of working with a high-performance production code that is used for the highest-energy accelerator in the world, whose reliability and backward compatibility cannot be compromised. There will be the opportunity to learn about methods used in simulating the motion of particles in accelerators.

 

Simulating time dependent functions

Description: Implement, test and put in production the ability to change the strength or misalignment of any element by applying a function composed of predefined branches such as linear, parabolic, sinusoidal, white noise and colored noise.

Expected results: The user will have the option to define magnet strength as a function of time from the input files.

Mentors: Ricardo De Maria, Eric McIntosh

Requirements: Fortran, calculus.
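A minimal sketch of the requested functionality, in Python for brevity (the SixTrack implementation would be in Fortran, and all names here are hypothetical): a strength function is assembled from predefined branches, each valid on a time interval.

```python
import math

def linear(t0, t1, k0, k1):
    """Linear ramp branch from strength k0 at t0 to k1 at t1."""
    return lambda t: k0 + (k1 - k0) * (t - t0) / (t1 - t0)

def sinusoidal(amplitude, frequency, offset=0.0):
    """Sinusoidal branch; a noise branch would be one more such callable."""
    return lambda t: offset + amplitude * math.sin(2 * math.pi * frequency * t)

def piecewise(branches):
    """branches: list of (t_start, t_end, f); evaluate the branch whose
    interval contains t."""
    def strength(t):
        for t0, t1, f in branches:
            if t0 <= t < t1:
                return f(t)
        raise ValueError(f"no branch defined at t={t}")
    return strength

# Ramp a magnet from 0 to 2 over one second, then oscillate around 2.
k = piecewise([(0.0, 1.0, linear(0.0, 1.0, 0.0, 2.0)),
               (1.0, 5.0, sinusoidal(0.5, 1.0, offset=2.0))])
```

The input-file syntax would map directly onto such a branch list, with the tracking loop evaluating k(t) once per turn.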

 

New physics models

Description: Implement, test and put in production a new solver for exact bending dipoles, include per-particle mass and charge state, and track the total elapsed time.

Expected results: The user can build accelerator simulations with more accurate models for low energy machines.

Mentors: Ricardo De Maria, Eric McIntosh

Requirements: Fortran, calculus, accelerator physics.

 

Database infrastructure for large scale simulation

Description: Develop a database interface to collect the results sent from volunteers or computer clusters and provide them to the users. The code will allow splitting a study into smaller units, submitting the study, querying its state, re-submitting missing or invalid results, and collecting the data in the user's SQLite database.

Expected results: The user will have a complete set of tools to submit, follow and collect results from distributed computer resources.

Mentors: Ricardo De Maria, Eric McIntosh

Requirements: SQL, python, C/C++, Linux.
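A minimal Python/SQLite sketch of the kind of bookkeeping involved; the schema and function names are hypothetical, purely to illustrate the split/submit/collect cycle.

```python
import sqlite3

# Hypothetical minimal schema: a study split into units, each tracking its
# state so that missing or invalid results can be re-submitted.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE units (
                study   TEXT,
                unit_id INTEGER,
                state   TEXT,
                result  BLOB,
                PRIMARY KEY (study, unit_id))""")

def split_study(study, n_units):
    db.executemany("INSERT INTO units VALUES (?, ?, 'submitted', NULL)",
                   [(study, i) for i in range(n_units)])

def store_result(study, unit_id, result):
    db.execute("UPDATE units SET state='done', result=? "
               "WHERE study=? AND unit_id=?", (result, study, unit_id))

def pending(study):
    """Units that still need (re-)submission."""
    rows = db.execute("SELECT unit_id FROM units "
                      "WHERE study=? AND state!='done' ORDER BY unit_id",
                      (study,))
    return [r[0] for r in rows]

split_study("lhc-scan", 4)
store_result("lhc-scan", 2, b"tracking output")
```

The production system would replace the in-memory database with the user's SQLite file and add validation of returned results before marking a unit done.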

 

Create a Standalone Tracking Library 

Description: Define an API for particle tracking, implement the existing model as a standalone module in C, and re-factor the existing code to use it. The API and the module should open the way to GPU calculations.

Expected results:  Test runs which rely only on the newly rewritten library to perform tracking simulations.

Mentors: Ricardo De Maria, Eric McIntosh

Requirements: Experience with Fortran, C, calculus and a background of physics are important.
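The flavour of such an API can be sketched in Python (the real module would be in C). The thin-lens maps below are textbook single-plane approximations and all names are hypothetical; the point is that a track is advanced element-by-element through a lattice, which maps naturally onto one GPU thread per particle.

```python
def drift(state, length):
    """Drift space: transverse position advances by angle * length."""
    x, xp = state
    return (x + xp * length, xp)

def thin_quad(state, kl):
    """Thin quadrupole kick of integrated strength kl (1/m)."""
    x, xp = state
    return (x, xp - kl * x)

def track(state, lattice):
    """Advance one particle through a list of (element, argument) pairs."""
    for element, arg in lattice:
        state = element(state, arg)
    return state

# One FODO-like cell, horizontal plane only, starting 1 mm off-axis.
lattice = [(thin_quad, 0.5), (drift, 1.0), (thin_quad, -0.5), (drift, 1.0)]
x, xp = track((0.001, 0.0), lattice)
```

In the C version the element functions would operate on arrays of particle states, so the same code serves the sequential, vector and GPU back-ends.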

 

Performance Monitoring


IgProf (https://igprof.org) is a lightweight performance profiling and analysis tool. It can be run in one of three modes: as a performance profiler, as a memory profiler, or in instrumentation mode. When used as a performance profiler it provides statistical-sampling-based performance profiles of the application. In the memory profiling mode it can be used to obtain information about the total number of dynamic memory allocations, profiles of the "live" memory allocations in the heap at any given time, and information about memory leaks. Memory profiling is particularly important for C/C++ programs, where large amounts of dynamic memory allocation can affect performance and where very complex memory footprints need to be understood. In nearly all cases no code changes are needed to obtain profiles.

IgProf currently supports Linux and the x86/x86-64 architectures, and provides initial support for ARMv7. It correctly handles dynamically loaded shared libraries, threaded applications and subprocesses. It can be run entirely in user space without special privileges and produces fully navigable call stacks.

The profile reports can be visualized in one of several ways. A simple text report can be produced from the raw data saved during the profiling run. Alternatively, a web-browsable profile report can be produced which allows easy navigation of the call stack. Both allow one to see profile data ordered by symbol in terms of "cumulative" cost (function plus children) and "self" cost (time in the function itself), as well as a full call graph view showing the functions which called, and which were called by, any given function. An important feature of the web-navigable reports is the ability to point via URL at particular places in the call graph, which facilitates collaboration between individuals at different locations.

While there are obviously many profilers out there, the goal of IgProf is to provide a reliable profiler for large applications like the ones found in HEP, and to tackle the challenges posed by heterogeneous computing architectures.

Profiling mixed python / C++ programs

Description: IgProf currently supports profiling native applications, typically written in C/C++. Profiling scripted applications which invoke native code (e.g. a Python framework which invokes C++ plugins) is supported only in the form of a raw profile of the Python interpreter itself, which eventually calls the native code functions. While this kind of profile can already provide insights, it is desirable to be able to profile and visualize the script stack trace and the native one together, in order to have a global picture of the connections between scripted and native code.
Expected results: The first objective is to identify and instrument the parts of the Python interpreter which are responsible for allocating, deallocating and executing Python stack frames, eventually extending IgProf's instrumentation capabilities to deal with the peculiarities of the Python interpreter. The second objective is to collect enough information via the above-mentioned instrumentation to be able to show mixed Python / C / C++ profiles, first for the performance profiler and subsequently for the memory profiler.
Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required. Understanding of Python interpreter internals is a plus.

Mentors: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

 

Data driven profiling

Description: Most low-overhead profilers, including IgProf, treat the code being profiled as always behaving in the same way, regardless of the input data it processes. While this is often the case, sometimes it is of interest to associate the cost of a given function with the input parameters and/or data sections passed to the code being profiled, in order to identify not only hot code sections but hot data patterns too.
Expected results: The first objective is to extend IgProf so that different profiling contexts can be defined, with each context associated with some interesting data processing. This could be done initially by code instrumentation (i.e. calling a function which identifies the data being processed) and then in a more automated manner, e.g. by monitoring the arguments of user-specified functions. The second objective is to extend the profile visualization to take the information so collected into account.
Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required.

Mentor: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

Support for CUDA / OpenCL profiling

Description: Extend IgProf to gather information from the profiling APIs of one (or more) of the above-mentioned packages. The goal is to be able to measure the performance cost of mixed CPU / GPU applications, taking into account the asynchronous behaviour of those applications and making it possible to track the source of GPU workloads.
Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required. Knowledge of the above-mentioned toolkits is a plus.

Mentor: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

 

Enhanced support for ARM, x32 and MacOSX architectures.

Description: Depending on the specific knowledge of the candidate, the task is to improve support for profiling ARMv7 applications and introduce initial support for 64-bit ARMv8 ones or, as an alternative, to extend the x86 support to include x32 (the 32-bit-pointer, 64-bit-data ABI for Intel-compatible processors). An additional task would be to resurrect OS X support (IgProf used to work on PPC-based OS X).
Required knowledge: Excellent knowledge of C/C++ programming and understanding of Linux system programming and software execution in Linux are required, as is knowledge of at least one of ARM and x86 assembly language. Knowledge of Mac OS X system programming is a plus.

Mentor: Giulio Eulisse, Peter Elmer, Vincenzo Innocente

 

 

'Blue sky' ideas


 

 

  • General GPU-vectorized solid: Create a vectorised/GPU-capable solid that can be used in place of the most popular existing solids (box, tube, conical & spherical shells), for use in Geant4-GPU and vector simulation. Inspired by the approach of James Tickner (see article in Computer Physics Communications, Vol 181, Issue 11, Nov 2010, pages 1821–1832 available at http://dx.doi.org/10.1016/j.cpc.2010.07.001 ).

Mentor: John Apostolakis, Gabriele Cosmo
Requirements: Experience with C/C++, vectorization

  • New simulation engine: Create a prototype geometry navigation for one particle type with a limited geometry choice (1-2 volume types) using a high-productivity parallel language (Chapel, X10). Benchmark this against existing solutions for the same problem. Document the development effort and the pitfalls of the language, tools and implementation (potential for a scientific report on the coding experience, in addition to the results).

Mentor: John Apostolakis
Requirements: Experience with C/C++ and either Chapel, X10, Cilk+, SplitC or a similar language is required.

 

 

 

Mentors

Here is the list of our mentors and their areas of expertise:

 

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

  • SFT GSoC mailing list: sft-gsoc-AT-cern-DOT-ch (no subscription needed).
  • SFT GSoC Jabber/XMPP chat room: gsoc-sft@conference.jabber.org . We noticed that Gmail/Gtalk XMPP accounts have problems posting messages to the chat room, so please register yourself an XMPP account on some other server (it takes no time to register an account at http://www.jabber.org).
  • IRC is restricted at CERN - please use Jabber instead.
Tags

CERN SFT GSoC 2013 ideas page

After a great Google Summer of Code in 2011 and 2012, we have decided to apply again!

As last year, the project ideas are grouped into categories (ideas related to the CERN Virtual Machine, Geant4-related ideas, and ROOT-related ideas). We also have so-called 'Blue sky' ideas, which are rather raw and must be worked on further before they become projects. The list of ideas is by no means complete, so if you think you have a great idea regarding our projects, we are certainly looking forward to hearing from you. The list of our mentors (and their areas of expertise) can be found below.

LHC Experiment Software Stack

The project ideas are (almost) as diverse as the software stack of the LHC experiments, ranging from twisting the Linux file system layer to tracking an electron through a silicon detector. All experiments have software frameworks that themselves make use of common scientific software libraries for high-energy physics (HEP). CernVM and CernVM-FS provide a uniform runtime environment based on Scientific Linux. The software stack is fully open-sourced and many of its bits and pieces are used outside the world of particle physics.

 

We encourage students who are planning to apply to do so (and to contact us) as early as possible, because we have learned from our previous GSoC experience that an initial student application often needs to be reworked in close collaboration with a future mentor. Please do not forget to provide us with some relevant information about yourself (for example a CV, past projects, personal page or blog, LinkedIn profile, GitHub account, etc.).

 

Before submitting an application please consult the official GSoC FAQ where you can find some good advice on writing a successful application. The application should be submitted through the GSoC webapp before the 3rd of May (19:00 UTC).

Project ideas

ROOT Object-Oriented data analysis framework

The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way. Having the data defined as a set of objects, specialized storage methods are used to get direct access to the separate attributes of the selected objects, without having to touch the bulk of the data. Included are histogramming methods in an arbitrary number of dimensions, curve fitting, function evaluation, minimization, graphics and visualization classes to allow the easy setup of an analysis system that can query and process the data interactively or in batch mode, as well as a general parallel processing framework, PROOF, that can considerably speed up an analysis. Thanks to the built-in C++ interpreter, the command language, the scripting (or macro) language and the programming language are all C++. The interpreter allows for fast prototyping of macros, since it removes the time-consuming compile/link cycle. It also provides a good environment to learn C++. If more performance is needed, the interactively developed macros can be compiled.

ROOT's new C++11 standards-compliant interpreter is Cling, an interpreter built on top of the Clang (www.clang.llvm.org) and LLVM (www.llvm.org) compiler infrastructure. Cling is being developed at CERN as a standalone project. It is being integrated into the ROOT data analysis framework (root.cern.ch), giving ROOT access to a C++11 standards-compliant interpreter.

ROOT is an open system that can be dynamically extended by linking external libraries. This makes ROOT a premier platform on which to build data acquisition, simulation and data analysis systems. ROOT is the de-facto standard data storage and processing system for all High Energy Physics labs and experiments worldwide. It is also being used in other fields of science and beyond (e.g. finance, insurance, etc).

 

Extend Cling's Language Support

Description: Clang (the underlying compiler library) supports various programming languages besides C and C++: it offers OpenCL, CUDA, Objective-C and Objective-C++. Cling's design and implementation are not bound to support only C/C++. We propose to improve the inherited language support in the interpreter by supporting as many languages as the underlying compiler library (Clang) supports.
Expected results: adjustment of cling's custom AST transform passes, implementing loading of the corresponding language runtime environment and adding corresponding test cases.
Required knowledge: intermediate C++, Clang abstract syntax tree (AST), basic ObjectiveC/C++.

Mentor: Fons Rademakers

 

Implement Pre-run Verification In Cling

Description: The cling interpreter's interactive prompt allows a user to type statements and see the results of their execution. We propose to enhance Cling's execution engine with pre-run verification, including the detection of a dereference of a null pointer.

Expected results: the execution of a NULL pointer dereference does not cause a crash.

Mentor: Vassil Vassilev

Required knowledge: intermediate C++,  knowledge of Clang AST and LLVM IR.

 

ROOT  R interface

Description: Develop an interface in ROOT to call R functions using the R C++ interface (Rcpp, see http://dirk.eddelbuettel.com/code/rcpp.html). As a proof of concept, implement the Minimizer interface of ROOT to use an R package for minimization. Developing this interface opens the possibility in ROOT of using the very large set of mathematical and statistical tools provided by R.

Implement C++ class(es) to perform the conversion of ROOT C++ objects to R objects and to transform the returned R objects back into ROOT C++ objects.

Expected Results: One or more C++ classes which perform the conversion of ROOT C++ object to R objects, able to call R functions and transform the returned R objects in ROOT C++ objects. The class(es) should implement some of the existing algorithm interface in ROOT, so that the R functionality can be used directly for fitting or statistical studies in ROOT.

Mentor: Lorenzo Moneta

Requirements: Basic knowledge of C++. Knowledge of ROOT and/or R would be an advantage.

 

New Formula class

Description: Develop a new parallel version of the TFormula class in ROOT using the capabilities of the new Cling interpreter. Having such a class will allow the user to introduce more complex code, even based on C++11, in defining a formula for a function class.

Expected Results: Provide a complete new TFormula class capable of creating a function with and without parameters from C++ code which can be parsed and compiled on the fly using Cling.

Mentor: Lorenzo Moneta

Requirements: Basic knowledge of C++.

 

Implement Automatic Differentiation library using Cling

Description: In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative technique to symbolic differentiation and numerical differentiation (the method of finite differences). The automatic differentiation library is to be based on Cling, which provides the necessary facilities for code transformation.
Expected Result: The AD library is able to differentiate non-trivial functions, to find a partial derivative for trivial cases, and has good unit test coverage.
Mentors: Vassil Vassilev, Lorenzo Moneta.
Required knowledge: advanced C++, Clang abstract syntax tree (AST), basic math.
 

CernVM & CernVM-FS

CernVM is a Virtual Software Appliance designed to provide a complete and portable environment for developing and running the data analysis of the CERN Large Hadron Collider (LHC) on any end-user computer (laptop, desktop) as well as on the Grid and on Cloud resources, independently of host operating systems. This "thin" appliance contains only a minimal operating system required to bootstrap and run the experiment software. The experiment software is delivered using the CernVM File System (CernVM-FS),  a Fuse file system developed to deliver High Energy Physics (HEP) software stacks onto (virtual) worker nodes.

CernVM-FS is used, for instance, by the LHC experiments on their world-wide distributed computing infrastructure. HEP software is quite complex, with tens of gigabytes per release and 1-2 releases per week while, at the same time, almost all of the individual files are very small, resulting in hundreds of thousands of files per release. CernVM-FS uses content-addressable storage (like the Git version control system). For distribution it uses HTTP. File metadata are organized in trees of SQLite file catalogs. Files and file metadata are downloaded on demand and aggressively cached. The CernVM File System is part of the CernVM appliance, but it compiles and runs on physical Linux/OS X boxes as well. It is mostly BSD licensed, with small GPL parts. The CernVM-FS source code of the current development branch can be downloaded from here; the documentation is available here.

 

Marketplace for VM Contextualization Artifacts

Description: The versatility required to run many different tasks using the very same virtual machine hard disk image is provided by "contextualization".  By contextualizing a CernVM, one can for instance specify that a specific virtual machine instance should run simulation jobs for the ATLAS experiment, or it should connect to the analysis task queue of the CMS experiment, or it should act as a storage proxy for other CernVMs.  The contextualization is typically a brief, textual specification of key value pairs (environment variables), kernel settings, services to start and scriptlets to run at boot.
One can easily imagine that one quickly needs to deal with very many contextualization artifacts, in particular given that not only a single CernVM can be contextualized but ensembles of virtual machines can also be defined by contextualization. Often, the same work has to be done multiple times by multiple users. It would be better to consolidate the contextualization artifacts on a common marketplace, where they could be described, rated, exchanged, and indexed, similar to an app store, to the recipe repository on Puppet Forge, or to the collection of virtual machines provided by TurnKey.

Expected Result: Extension of the cernvm-online.cern.ch web application by functionality to index, exchange, rate, and find contextualization artifacts.
Mentors: Jakob Blomer, Predrag Buncic

Requirements: Good knowledge of web technologies. Experience with virtualization and virtual machines will be an asset.

 

CernVM Zookeeper

A virtual analysis facility (VAF) is an ensemble of virtual machines running on IaaS resources which, as a whole, offers a data analysis cluster to physicists. Such a virtual analysis facility comprises several virtual machine types, e.g. a storage proxy, a job server, and worker nodes that do the actual number crunching. While such an ensemble of virtual machines can be bootstrapped relatively easily using the right contextualization artifacts, it is more difficult to maintain and monitor the cluster, in particular as every physicist might have several VAFs running in parallel on several independent IaaS clouds. On top of that, not all virtual machines have public IP addresses.
Description: This project should extend CernVM with an XMPP component that, upon boot, connects to an XMPP server and joins a group chat. This group chat can then be used by physicists to control and monitor their virtual machines, possibly through a web interface but also through their instant messaging clients. (This project can benefit from previous work on the CernVM Co-Pilot system, a job server for CernVM based on XMPP.)

Mentors: Jakob Blomer, Predrag Buncic
Requirements: Good knowledge of XMPP and scripting languages (Perl, Python). Experience with virtualization and virtual machines will be an asset.

 

Garbage Collection for the CernVM File System

Context: In the CernVM-FS internal storage, every file is addressed by its cryptographic content hash.  Whenever a file is modified, a new file with a new content is stored internally.  Naturally, this design results in a versioning file system, i.e. previous versions of files remain accessible.  In many cases, this can be considered a desirable feature.
In some cases, however, it would be better to automatically remove previous file system versions. Such a case is the hosting of nightly snapshots of an experiment software development tree on CernVM-FS. The problem of gathering and removing older revisions of the file system is amplified by the fact that CernVM-FS repositories can grow to a large number of files (tens or hundreds of millions) and that any automatic garbage collection needs to be performed online. Possible approaches are the implementation of reference counting or of a mark-and-sweep algorithm on the CernVM-FS storage.

Expected Result: Extension of the CernVM-FS server toolkit (C++ and bash) that is able to detect garbage generation during regular file system updates and incrementally removes garbage.
Mentors: Jakob Blomer
Requirements: Good knowledge of C++, file systems, and storage technologies.
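To make the mark-and-sweep idea concrete, here is a toy sketch over a content-addressed store. It assumes a deliberately simplified model (a hash-to-blob dictionary and per-revision reference sets); the actual CernVM-FS server works on catalogs and object packs, and its names differ from the ones invented here.

```python
def sweep(store, catalogs, live):
    """Mark-and-sweep over a content-addressed store.

    store:    dict mapping content hash -> blob
    catalogs: dict mapping repository revision -> set of referenced hashes
    live:     set of revisions that must survive
    """
    # Mark: collect every hash reachable from a live revision.
    marked = set()
    for revision in live:
        marked |= catalogs[revision]
    # Sweep: drop unreferenced objects, return the freed hashes.
    garbage = set(store) - marked
    for h in garbage:
        del store[h]
    return garbage

store = {"h1": b"old", "h2": b"new", "h3": b"shared"}
catalogs = {1: {"h1", "h3"}, 2: {"h2", "h3"}}
freed = sweep(store, catalogs, live={2})
print(sorted(freed))   # ['h1']
print(sorted(store))   # ['h2', 'h3']
```

Note how "h3" survives because it is shared between revisions; this sharing is exactly what makes naive per-revision deletion incorrect and an explicit mark phase (or reference counts) necessary.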
 

 

The Geant4 toolkit simulates the interactions of elementary particles and radiation with matter. It is used to simulate the detectors of the LHC and other High Energy Physics (HEP) experiments. It also finds application in other areas, from assessing the effects of radiation on the electronics of satellites to improving the design of medical detectors with the custom GATE application. LHC experiments use Geant4 to compare the signatures of rare events from new physics (such as the Higgs boson) to those coming from known interactions. The open source toolkit is developed by the Geant4 collaboration, which brings together 90 physicists and engineers from laboratories and universities around the world. Developments are ongoing to improve its precision and scope of application, and to better utilise current and emerging computer architectures.

The simulation programs of the LHC experiments use the Geant4 toolkit to produce simulated LHC events, running continuously on about 100,000 CPU cores throughout the year. Even so, the available statistics remain a limitation on the analysis potential for some interesting types of new physics. The goal of this project is therefore to explore different ways to reduce the execution time of Geant4 on today's complex commodity CPUs, and to prototype how to use it efficiently on the many-core hardware of the future (tens or hundreds of cores, threads or 'warps').

The code required to model the diverse types of particles and interactions, and the complex geometries of detectors, is large. As a result it overwhelms the caches of current CPUs, significantly reducing the efficiency with which today's hardware is utilised. This effort is focused on identifying ways to spread the work between threads and/or to reorganise it. By running less code on each core we aim to make better use of the memory architectures of today's hardware, while preparing the way to obtain good performance on tomorrow's.


Improve Geometry Navigation in Sequential and GPU versions

Description: Locating the volume which corresponds to a position, and finding the next boundary along a ray, are computationally expensive operations in a complex geometry.  Geant4 uses a voxelisation technique which precomputes the volumes contained in 3-D slices of a setup. The first objective of this project is to investigate how to speed up the sequential Geant4 navigation. The second objective is to refine the algorithms used for navigation in a voxelised geometry in the GPU version of Geant4.

Mentors: John Apostolakis, Gabriele Cosmo 
Requirements: Good knowledge of C++, object-oriented programming and parallel programming using GPUs is needed. Knowledge of ray tracing algorithms will be very beneficial.
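The benefit of voxelisation can be sketched in a few lines. The toy model below is one-dimensional with interval "volumes" (the real Geant4 navigator voxelises 3-D solids); the point is that locating a position only tests the few candidates stored in its voxel instead of every volume in the setup. All names here are illustrative, not Geant4 API.

```python
def build_voxels(volumes, world_min, world_max, n_voxels):
    """volumes: list of (name, lo, hi) 1-D intervals, for simplicity."""
    width = (world_max - world_min) / n_voxels
    voxels = [[] for _ in range(n_voxels)]
    for name, lo, hi in volumes:
        # Register the volume in every voxel it overlaps.
        first = max(0, int((lo - world_min) / width))
        last = min(n_voxels - 1, int((hi - world_min) / width))
        for i in range(first, last + 1):
            voxels[i].append((name, lo, hi))
    return voxels, width

def locate(x, voxels, world_min, width):
    idx = int((x - world_min) / width)
    for name, lo, hi in voxels[idx]:   # only the local candidates
        if lo <= x <= hi:
            return name
    return "world"

volumes = [("tracker", 0.0, 2.0), ("calorimeter", 2.0, 6.0)]
voxels, width = build_voxels(volumes, 0.0, 10.0, 10)
print(locate(1.5, voxels, 0.0, width))   # tracker
print(locate(4.0, voxels, 0.0, width))   # calorimeter
print(locate(8.0, voxels, 0.0, width))   # world
```

On a GPU the same structure helps keep divergence low: threads handling nearby points read the same small candidate lists, which is one reason the voxelised navigation algorithms are worth refining there.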

 

Evaluate language or annotation options for Simulation kernels

Description: Port key kernels of simulation code to a parallel C++ variant (Cilk+, SplitC) and/or OpenACC/OpenMP, and benchmark the effort required and the computing speedup obtained.
Mentors: John Apostolakis

Requirements: Good knowledge of C++ and exposure to parallel/vector programming are needed. Experience with OpenMP or similar technologies, or with a parallel C++ variant, will be beneficial.
 

Vectorised implementations of Solids

Description: Create vectorised versions of the implementations in the new library of solids (USolids), which is the future library of solids for Geant4 and ROOT.

Expected result: Vector versions of the UBox and UTubs solids.

Mentors: John Apostolakis, Gabriele Cosmo

Requirements: Knowledge of C++ and vectorisation.
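The essence of the task is turning a scalar per-point solid interface into a batch interface that a compiler (or SIMD intrinsics) can vectorise. The sketch below contrasts the two shapes for a box-like solid; the real work targets C++ in USolids, and the method names here are only illustrative.

```python
class Box:
    """Toy axis-aligned box solid with half-lengths (dx, dy, dz)."""

    def __init__(self, dx, dy, dz):
        self.half = (dx, dy, dz)

    def inside(self, x, y, z):
        """Scalar test: one call per point."""
        hx, hy, hz = self.half
        return abs(x) <= hx and abs(y) <= hy and abs(z) <= hz

    def inside_v(self, xs, ys, zs):
        """Batch test: one call, many points (structure-of-arrays layout,
        amenable to SIMD vectorisation in a C++ implementation)."""
        hx, hy, hz = self.half
        return [abs(x) <= hx and abs(y) <= hy and abs(z) <= hz
                for x, y, z in zip(xs, ys, zs)]

box = Box(1.0, 1.0, 2.0)
print(box.inside(0.5, 0.5, 1.5))                         # True
print(box.inside_v([0.5, 3.0], [0.5, 0.0], [1.5, 0.0]))  # [True, False]
```

The same split would apply to the distance-to-boundary methods of UBox and UTubs: the batch variant removes per-point call overhead and lets the inner loop run over contiguous coordinate arrays.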

 

Performance Monitoring

Linux-perf is a performance tuning component of the Linux kernel. It is the go-to system for performance tuning activities on Linux, covering everything from software events such as OS context switches down to the hardware counter level (instructions executed, cache misses, etc.). CERN, like numerous other large organizations and companies, makes extensive use of such performance monitoring features in its daily work. Large portions of the software powering Large Hadron Collider science go through systems such as linux-perf every day.

The main components of a setup with linux-perf are a kernel part and a userspace tool called “perf”, supported by the libpfm library. Work on the project will provide world-class experience in performance tuning, Linux software architecture and computer architecture. The student(s) can also find out how CERN uses its computers to further large-scale scientific goals.

Performance Self Monitoring

Description: While external tools for tuning exist today, some applications need to have performance monitoring facilities built in. That means that the application - by itself – can report on its usage of various hardware features of the platform it runs on. This project will involve a design and implementation of a C/C++ API to linux-perf and to libpfm that will allow applications to monitor their own behavior. Software developments will be preceded by performance studies of the linux-perf system itself, to establish practical bounds of such improvements. All work will involve and impact real-world High Energy Physics software.
Mentors: Vincenzo Innocente, Andrzej Nowak
Requirements: Knowledge of C/C++ programming and understanding of Linux programming and software execution in Linux are required. Experience with perf and Python is beneficial.
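The shape of such a self-monitoring API can be illustrated with a small sketch. Here stdlib probes (wall time and getrusage) stand in for the hardware counters that the proposed C/C++ binding to linux-perf and libpfm would expose; the class name and fields are hypothetical.

```python
import resource
import time

class SelfMonitor:
    """Context manager letting an application bracket and measure a
    region of its own code, the pattern the project's API would follow."""

    def __enter__(self):
        self.t0 = time.perf_counter()
        self.ru0 = resource.getrusage(resource.RUSAGE_SELF)
        return self

    def __exit__(self, *exc):
        self.wall = time.perf_counter() - self.t0
        ru1 = resource.getrusage(resource.RUSAGE_SELF)
        self.user_cpu = ru1.ru_utime - self.ru0.ru_utime
        return False

with SelfMonitor() as mon:
    total = sum(i * i for i in range(100000))

print(f"wall: {mon.wall:.4f}s, user cpu: {mon.user_cpu:.4f}s")
```

A perf-based version would open counters (instructions retired, cache misses, ...) in `__enter__` and read them back in `__exit__`, giving the application the same bracketed view but at the hardware level.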

Performance tuning: organizing performance data

Description: The goal of this project is to design and develop efficient mechanisms to organize performance data. One problem is that collected performance data can only be referenced by its location in the code, but cannot be described by user-defined names or in any other way. Another issue is that data is collected on the whole program rather than only some pre-selected routines – improvements on this will help focus performance studies. The work will involve developing software solutions to these challenges, in close collaboration with CERN experts. All work will focus on and impact real-world High Energy Physics software.
Mentors: Vincenzo Innocente, Andrzej Nowak

Requirements: Knowledge of C/C++ programming and understanding of Linux programming and software execution in Linux are required. Experience with perf and Python is beneficial.
 

Miscellaneous

CERN app on Android

Description: A prototype iOS app was created by a GSoC student in 2012 (see http://ph-news.web.cern.ch/content/new-ios-application-cern ), which provides a wide range of (real-time) information on CERN: the accelerators, the detectors, the physics, the history, the latest news, etc. This project aims to create a similar app for Android. One goal is to make it extensible, so that additional aspects relevant to CERN and particle physics can be added in the future, including displays of LHC events, mini-simulations or science quizzes. There is potential for tens to hundreds of thousands of downloads as the public follows the restart of the upgraded LHC in 2015.

Mentors: Fons Rademakers, Timur Pocheptsov

Requirements: Experience with Android development.

 

Parallelisation of Core Services for High Energy Physics Data Processing

The software frameworks for High Energy Physics (HEP) data processing are C++ programs of millions of lines of code which transform the electric signals coming from the particle detectors into the building blocks of exciting discoveries. Frameworks such as Gaudi (cern.ch/proj-gaudi/) provide a variety of services: from the connection to databases and the management of the execution of physicists' code, to the interaction with accelerators and coprocessors.
With multiple petabytes of experiment data recorded each year, the overall performance of such frameworks is essential. The partially interdependent services mentioned above are rather complex entities, which need considerable time for their initialization, before the actual physics data processing can start.

Description: As the parallelisation of the physics processing improves, the overhead due to sequential startup becomes an important issue. To address this we propose to parallelise the service initialization in GaudiHive (concurrency.web.cern.ch/GaudiHive), the multi-threaded extension of the widely used cross-experiment HEP framework Gaudi.
The task should be accomplished using the functionalities offered by the C++ language (C++11 standard) and taking advantage of highly optimised, lock free algorithms and data structures.

Mentors: Danillo Piparo, Benedikt Hegner

Requirements: Good knowledge of C++ and parallel programming. Experience with Python is a plus.
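The scheduling idea behind parallel service initialization can be sketched compactly. The target is C++11 (std::thread/std::async with lock-free structures); Python with concurrent.futures is used here only to show the logic, and the service names and dependency graph are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def init_services(services, deps, init_fn):
    """Initialise services wave by wave: every service whose dependencies
    are already done starts concurrently with the rest of its wave.

    services: list of service names
    deps:     name -> set of prerequisite names
    init_fn:  callable performing the (expensive) initialization
    """
    done, order = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(services):
            ready = [s for s in services
                     if s not in done and deps.get(s, set()) <= done]
            if not ready:
                raise ValueError("dependency cycle among services")
            list(pool.map(init_fn, ready))  # whole wave runs in parallel
            done |= set(ready)
            order.append(sorted(ready))
    return order

services = ["MessageSvc", "DbSvc", "GeometrySvc", "EventLoopMgr"]
deps = {"DbSvc": {"MessageSvc"}, "GeometrySvc": {"MessageSvc"},
        "EventLoopMgr": {"DbSvc", "GeometrySvc"}}
waves = init_services(services, deps, init_fn=lambda s: None)
print(waves)  # [['MessageSvc'], ['DbSvc', 'GeometrySvc'], ['EventLoopMgr']]
```

A more refined scheme would start each service as soon as its last dependency finishes rather than in whole waves; that refinement, done with C++11 primitives, is where the lock-free data structures come in.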

 

Extending Indico

Indico (http://indico-software.org) is a web-based event lifecycle management system, room booking application and collaboration hub (video conference, chat, webcast and others). It is used at many High Energy Physics laboratories for long-term archival of documents and metadata related to all kinds of events. The CERN installation (http://indico.cern.ch) hosts more than 210,000 events and more than 1 million uploaded files, and receives around 14,000 visitors per day. Indico is installed in more than 100 institutes world-wide (http://indico-software.org/wiki/IndicoWorldWide). This generates a need for permanent improvement and development.

The student will gain expertise in web development, acquiring skills in good software engineering practices as well as in front-end development (using modern JavaScript-based web technologies and libraries) and back-end development (the Python programming language, Object Oriented Database (ZODB), Flask, WSGI, ...). Indico's GitHub address is https://github.com/indico/indico.

 

Indico: Abstract editor

Description: Indico is mainly used for events by the physics community. Physicists, like some other scientists, often need to write very complex formulas in their paper abstracts, as well as adding images and other types of content. Indico currently provides only a simple web form to submit plain-text abstracts for a conference. The project is to enable users to write in different formats (Markdown + MathJax for formulas, LaTeX, etc.), allowing the generation of beautiful conference abstracts which can be published in a Book of Abstracts.

Mentors: Jose Benito Gonzalez Lopez, Pedro Ferreira

Requirements: Knowledge of Web development and Javascript (jQuery) technologies are required. Knowledge of Python, git, Markdown, MathJax for formulas, and/or LaTeX would be beneficial.

 

Indico: On-line Proceedings

Description: The final step for almost every big conference is the publication of the conference's proceedings. Basically, this means a big book with all the accepted papers of a conference. This final step is the only one that Indico is missing in order to provide the full functionality needed to complete the lifecycle of conference organization. We do not intend to be able to publish a real book but instead to create digital proceedings. The goal would be to be able to retrieve all the papers of a given conference and build a very user-friendly interface with all the publications as well as many different indexes, filters and search capabilities by content.

Mentors: Jose Benito Gonzalez Lopez, Pedro Ferreira

Requirements: Knowledge of Web development and Javascript (jQuery) technologies are required. 

 

'Blue sky' ideas

  • "HEPunion" filesystem implementation as a Linux module: Adapt and extend a union file system to address the requirements of a HEP experiment's dedicated farm.
    Mentors: Pierre Schweitzer, Loic Brada
    Requirements: Good knowledge of C, Linux and (Linux) Kernel development.

  • Extend Geant 4 on GPUs: Extend the Geant4 GPU prototype's geometry or physics.  Geometry navigation can be extended by developing additional types of volumes, starting from the geometry developed by Otto Seiskari (2010) and Dhruva Tirumala (GSoC 2012) in OpenCL/CUDA.  The physics can be extended by creating new processes or integrating models from other authors.

Mentor: John Apostolakis
Requirements: Experience with C/C++, OpenCL/Cuda

  • General GPU-vectorized solid: Create a vectorised/GPU-capable solid that can be used in place of the most popular existing solids (box, tube, conical & spherical shells), for use in Geant4-GPU and vector simulation. Inspired by the approach of James Tickner (see article in Computer Physics Communications, Vol 181, Issue 11, Nov 2010, pages 1821–1832 available at http://dx.doi.org/10.1016/j.cpc.2010.07.001).

Mentors: John Apostolakis, Gabriele Cosmo
Requirements: Experience with C/C++, vectorization

  • New simulation engine: Create a prototype geometry navigation for one particle type with a limited geometry choice (1-2 volume types) using a high-productivity parallel language (Chapel, X10). Benchmark this against existing solutions for the same problem. Document the development effort and the pitfalls of the language, tools and implementation (potential for a scientific report on the coding experience, in addition to the results).

Mentor: John Apostolakis
Requirements: Experience with C/C++ and either Chapel, X10, Cilk+, SplitC or a similar language is required.

 

Mentors

Here is the list of our mentors and their areas of expertise:

 

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

  • SFT GSoC mailing list: sft-gsoc-AT-cern-DOT-ch (no subscription needed).
  • SFT GSoC Jabber/XMPP chat room: gsoc-sft@conference.jabber.org . We noticed that Gmail/Gtalk XMPP accounts have problems posting messages to the chat room, so please register an XMPP account on some other server (it takes no time to register one at http://www.jabber.org).
  • IRC is restricted at CERN - please use Jabber instead.
Tags

CERN Summer Thrills Gsoc 2012

For the physics software development group at CERN, our second year of Google Summer of Code couldn’t have come at a better time. Motivated by CernVM's awesome experience in 2011, our colleagues from the Geant4 and ROOT software projects joined us as mentors this summer. And while physicists around the world snatched the first evidence of a long-sought Higgs boson from the Large Hadron Collider (LHC), our seven Google Summer of Code students worked on core parts of the open source software engine that makes LHC data processing possible.

Two of our students worked with the Geant4 team at CERN. Geant4 is a toolkit for simulating the response of a material when high-energy particles pass through it. Geant4 can be used to model a gas detector, a gamma-ray telescope, an electronic device next to an accelerator or the inside of a satellite. In order to keep up with the rate of real data coming from the LHC detectors, Geant4 has to be both accurate and fast.
 

  • Stathis Kamperis improved the speed of Geant4 by re-ordering the simulation of particles according to particle type. By simulating, for instance, first all electrons, then all photons, and so on, the number of instruction cache misses decreases. In the course of this work, Stathis also ported Geant4 to Solaris, which gives us access to the very powerful DTrace profiling machinery.
  • Dhruva Tirumala Bukkapatnam contemplated Geant4 pointers and data structures. He developed code for a particle navigation algorithm optimized for use on GPU architectures.


Two more students were working together with the ROOT team. The ROOT framework is the main workhorse for LHC experiments to store, analyze, and visualize their data.
 

  • Omar Zapata Mesa worked on an MPI interface for ROOT. On a cluster of machines, the MPI interface enables ROOT to toss around its C++ objects from node to node while also integrating with ROOT's C++ interpreter.
  • Eamon Ford worked on the CERN iOS app. The app brings CERN news and information to an iPad or iPhone. In case you can’t sleep at night, you can now peek at the live display of particle collisions from inside the LHC.

For the CernVM base technology, we had three more students working with us this summer. CernVM provides a virtual appliance used to develop and run LHC data processing applications on the distributed and heterogeneous computing infrastructure that is provided by hundreds of physics institutes and research labs around the world.
 

  • Josip Lisec, back for his second Google Summer of Code, worked on the log analysis and visualization of CernVM Co-Pilot, the job distribution system which powers the LHC@home Test4Theory volunteer computing project. Want to see the world map of active volunteers from the 19th of November at 3:07pm?  Check out the Co-Pilot dashboard.
  • Francesco Ruvolo worked on simulating broken network services, such as misconfigured DNS or HTTP servers. Breaking such services in a controlled way comes in handy when simulating the behavior of a CernVM running on a hotel WiFi.
  • Rachee Singh programmed maintenance tools for the content distribution network that is used by the CernVM File System to distribute terabytes of experiment software to all the worker nodes. All the proxy servers of the content distribution network can now be plotted on a map and every CernVM can automatically find a close proxy by means of a Voronoi diagram produced by Rachee's code.


Overall, we were very glad to see so much interest and enthusiasm from the student programmers in LHC software tools. We'd like to congratulate all of our students on their hard work and on successfully completing the program!

By Jacob Blomer, CERN Organization Administrator
(original article on Google Open Source Blog)

Tags

CernVM’s fruitful summer Gsoc 2011

This was the first year CERN participated in Google Summer of Code, and it turned out to be an amazing experience for us! We were given four students to mentor, all of whom proved to be very skilled developers. The students quickly familiarized themselves with our code base and managed to make valuable contributions within the three-month time frame of Google Summer of Code. Our students were very open and willing to learn and spent a considerable amount of their time researching tools, libraries, and the latest technological developments. As a result, all four students were able to solve their problems and come up with interesting ideas for future development. The code and the documentation they produced are available here. The specific problems (projects) that we suggested to our students spanned several domains, ranging from consistent replication of terabytes of data across several remote sites to automated testing of virtual machine releases.
Josip Lisec was working on the development of the monitoring system for the CernVM Co-Pilot framework, which is mainly used as a distributed computing platform within the LHC@home 2.0 volunteer computing project. The LHC@home 2.0 project currently has more than 9,000 registered users who contribute their spare CPU cycles for the simulation of the particle collision events in CERN's Large Hadron Collider (LHC). After some research, Josip decided to integrate existing tools with the Co-Pilot as opposed to trying to reinvent the wheel by rewriting everything from scratch. This resulted in a nicely engineered monitoring framework, parts of which were put into production while the Google Summer of Code was still going on (Josip's developments have now been fully integrated after completion of the program). Since this was Josip's first encounter with Perl, he has been seen adding support for 'my' keyword to every other major programming language since the Google Summer of Code concluded.

The goal of Yin Qiu's project was to devise a mechanism for a consistent replication of changes made to the central repository of CernVM File System (CernVM-FS) to a globally distributed set of mirror servers. CernVM-FS is used to host and distribute the application software of CERN LHC experiments to hundreds of Grid sites, as well as the laptops and workstations of users worldwide. As such, it is currently one of the central components of the distributed computing infrastructures on which CERN ATLAS and LHCb experiments rely. Yin's approach was to organize CernVM-FS mirrors into a Paxos-managed replication network and to enforce state machine version transitions on them. Following the suggestion of Jakob, his mentor, Yin implemented a messaging framework which is used to orchestrate the replication process and facilitates the implementation of new features. He also managed to implement a couple of Python plugins which ensure the consistency of data across replicas. The project is currently in the state of a working prototype.

Jesse Williamson took up the challenge of designing a new library for CernVM-FS to consolidate support for various cryptographic hashing algorithms. The first task was to survey the implementation of CernVM-FS and establish a list of requirements. Next, quite a bit of effort was spent on designing the library specifically so that it would be easy to use, comparatively simple to extend, and robust enough to support extensions like a streaming interface and compression. Since CernVM-FS is heavily used in production, it has been very important to make sure that the new developments do not break anything. Jesse has developed a set of unit tests which ensure that all the existing features and properties were maintained.

The design of the new C++ libraries was certainly an improvement, but it also became clear late in the cycle that a further abstraction, fully separating digests from hash functions, will be necessary to avoid memory fragmentation issues and to ensure stronger const-correctness.

Jonathan Gillet worked on implementing a solution for automating the testing of CernVM virtual machine images on multiple hypervisors and operating systems. The solution, which is a ready to use testing infrastructure for CernVM, was developed in collaboration with other open source projects such as AMD Tapper (used for the reports and web interface), libvirt (interaction with hypervisors), and Homebrew (OS X support).  The main goals of the project were accomplished with support for all major hypervisors running on Linux and OS X platforms. The framework automates the task of downloading and configuring the CernVM images on the fly, and executing a series of thorough tests which check various features of CernVM images before release. Documentation was also an important goal of the project; in total there are now over two hundred pages of documentation which cover everything from setting up the testing infrastructure and virtual machines to a complete API reference.

We certainly enjoyed Google Summer of Code 2011, and we sincerely congratulate all of our students and mentors for successfully completing the program!

By Artem Harutyunyan, Senior Fellow, CernVM Project (CERN) and Google Summer of Code Mentor
(original article on Google Open Source Blog)

Tags