CERN SFT GSoC 2012 ideas page


After a great Google Summer of Code experience in 2011, we have decided to apply again. This year we are widening the scope of our involvement in GSoC by offering students the opportunity to work not only on CernVM-related projects, but also on the Geant4 simulation toolkit and the ROOT object-oriented data analysis framework (which is why we have changed the organization name from 'CERN Virtual Machine' to 'CERN SFT').

Like last year, the project ideas are grouped into categories (ideas related to the CERN Virtual Machine, the CernVM File System, CernVM Co-Pilot, Geant4, and ROOT). We also have so-called 'Blue sky' ideas, which are rather raw and need further work before they become full projects. The list of ideas is by no means complete, so if you think you have a great idea regarding our projects, we are certainly looking forward to hearing from you. The list of our mentors (and their areas of expertise) can be found below.

We encourage students who are planning to apply to do so (and to contact us) as early as possible, because we have learned from our previous GSoC experience that an initial student application often needs to be reworked in close collaboration with a future mentor. Please do not forget to provide us with some relevant information about yourself (for example a CV, past projects, a personal page or blog, LinkedIn profile, GitHub account, etc.).

Before submitting an application, please consult the official GSoC FAQ, where you can find good advice on writing a successful application. Applications must be submitted through the GSoC webapp before the 6th of April (19:00 UTC). The application template can be found here.

 

Project ideas

 

CERN Virtual Machine related projects

CernVM is a Virtual Software Appliance designed to provide a complete and portable environment for developing and running the data analysis of the CERN Large Hadron Collider (LHC) on any end-user computer (laptop, desktop) as well as on the Grid and on Cloud resources, independently of host operating systems. This "thin" appliance contains only a minimal operating system required to bootstrap and run the experiment software. The experiment software is delivered using the CernVM File System (CernVM-FS) that decouples the operating system from the experiment software life cycle.

 

CernVM Virtual Machine Lifecycle Management

Description: At CernVM we were looking for a complete solution for the maintenance of Virtual Machines (VMs). After trying several commercial solutions to drive this process, none of which provided a common and coherent framework allowing full control of every step, we decided to combine existing open-source tools into an extensible framework that could serve as an end-to-end solution for our VM lifecycle management.
With this basic framework in place, the task consists of creating and interconnecting the required agents to form a complete solution that implements the usual lifecycle of a VM: tuning the configuration, building the virtual disk images, testing the final product, publishing the release, instantiating machines on demand, contextualizing them, monitoring their health and sending back feedback (http://cernvm.cern.ch/portal/ibuilder). Part of the task will also be to create the required web-based front-ends that will ease the management of the whole procedure.
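To illustrate the kind of framework we have in mind, here is a minimal Python sketch of agents chained into a lifecycle pipeline. All class and method names are hypothetical (and the real agents may well be written in Perl); the real framework would also need error handling, state persistence and feedback between steps.

    # Hypothetical agent pipeline: each agent performs one lifecycle step.
    class Agent:
        def run(self, release):
            raise NotImplementedError

    class BuildAgent(Agent):
        def run(self, release):
            print("building disk images for", release)

    class TestAgent(Agent):
        def run(self, release):
            print("running validation tests on", release)

    class PublishAgent(Agent):
        def run(self, release):
            print("publishing", release)

    def lifecycle(release, agents):
        # An exception raised by any agent aborts the remaining steps.
        for agent in agents:
            agent.run(release)

    lifecycle("cernvm-2.5.x", [BuildAgent(), TestAgent(), PublishAgent()])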

Mentors: Predrag Buncic
Requirements: Good knowledge of Perl (preferred) or Python (for the agents) and good knowledge of JavaScript and/or Objective-J (for the web front-ends). Experience with virtualization and virtual machines will be an asset.

 

CernVM File System related projects

The CernVM File System is a Fuse file system developed to deliver High Energy Physics (HEP) software stacks onto (virtual) worker nodes. It is used, for instance, by LHC experiments on their world-wide distributed computing infrastructure. HEP software is quite complex, with tens of gigabytes per release and 1-2 releases per week, while, at the same time, almost all the individual files are very small, resulting in hundreds of thousands of files per release. CernVM-FS uses content-addressable storage (like the Git version control system). For distribution it uses HTTP. File metadata are organized in trees of SQLite file catalogs. Files and file metadata are downloaded on demand and aggressively cached. The CernVM File System is part of the CernVM appliance, but it compiles and runs on physical Linux/OS X boxes as well. It is mostly BSD licensed with small GPL parts. The CernVM-FS source code of the current development branch can be downloaded from here; the documentation is available here.
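For illustration, here is a minimal Python sketch of the content-addressable storage idea (not actual CernVM-FS code): a file is stored under the hash of its content, so identical files are stored only once and the data becomes self-verifying.

    import hashlib, os, shutil

    def store(path, storage_dir):
        # Hash the file content in chunks.
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digest = h.hexdigest()
        # Git/CernVM-FS-like layout: first two hex digits form a subdirectory.
        dest = os.path.join(storage_dir, digest[:2], digest[2:])
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        if not os.path.exists(dest):      # deduplication happens here
            shutil.copyfile(path, dest)
        return digest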

 

Dodgy ad-hoc network services for CernVM-FS

Description: The aim of this project is not to build things that work but to build things that break—in a predictable way.  CernVM-FS runs in a variety of different environments, such as end-user laptops, virtual machines on Amazon EC2 or other clouds, or managed servers in a computing center.  In order to function properly, it needs to connect to network services, for instance a DNS service, a web proxy and a web server.  All of these components can (and will!) be misconfigured, operate in unexpected ways, or simply fail at some point.

The task is to develop an extensible framework that allows for the ad-hoc deployment of an ensemble of distributed stub services.  Stub services for DNS, HTTP, and HTTP proxy have to be implemented using this framework.  These stub services should support reproducible error conditions, such as network glitches, lack of disk space, HTTP errors, data corruptions, unsolicited DNS redirections, etc. The framework should be integrated with the CernVM-FS test suite.
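As an illustration of what a stub service could look like, here is a minimal sketch of a fault-injecting HTTP server in Python. The error rates and payloads are made up, and the real framework would also need DNS and proxy stubs plus a proper configuration interface; note how a fixed random seed makes the error sequence reproducible.

    import random
    from http.server import BaseHTTPRequestHandler, HTTPServer

    random.seed(42)        # fixed seed -> reproducible error schedule
    ERROR_RATE = 0.3

    class FaultyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            roll = random.random()
            if roll < ERROR_RATE / 2:
                self.send_error(500, "injected server error")
            elif roll < ERROR_RATE:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"garbage")    # injected data corruption
            else:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"expected payload")

    HTTPServer(("localhost", 8080), FaultyHandler).serve_forever()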

Mentors: Jakob Blomer
Requirements: Good knowledge of Linux environment, experience with web servers and DNS servers, experience with scripting languages (Perl/Python/bash).

 

Repository archival storage

Description: A CernVM-FS repository typically contains the entire software release history of an experiment collaboration.  For large experiments (say the ATLAS experiment at the LHC), such a release history can grow to hundreds of releases with tens of millions of files altogether.  While for the sake of reproducibility all the releases need to stay accessible, the older releases don't change anymore.  They retire and could be archived in larger container files in order to relieve the storage backend from a large fraction of its many small files.

The first step of the task is to come up with a suitable container format.  This might be a well-known format such as zip or a qcow2 hard disk image, or alternatively something simple and handcrafted.  The CernVM-FS file catalogs have to be extended in order to store information about the container contents.  Chunks from the container have to be served in a transparent way, which can be done for instance by an Apache module.  In order to archive a live repository and to unarchive a retired repository, conversion tools are required.
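To make the 'transparent serving' idea concrete, here is a minimal sketch in Python (rather than an Apache module, and with a hypothetical archive name) that serves individual files out of a zip container over HTTP:

    import zipfile
    from wsgiref.simple_server import make_server

    ARCHIVE = "retired-release.zip"    # hypothetical archived release

    def app(environ, start_response):
        member = environ["PATH_INFO"].lstrip("/")
        with zipfile.ZipFile(ARCHIVE) as z:
            try:
                data = z.read(member)      # extract just this member
            except KeyError:
                start_response("404 Not Found",
                               [("Content-Type", "text/plain")])
                return [b"no such file in archive"]
        start_response("200 OK", [("Content-Length", str(len(data)))])
        return [data]

    make_server("localhost", 8000, app).serve_forever()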

Mentors: Jakob Blomer
Requirements: Good knowledge of C++, file systems, HTTP and web servers

 

Continuous cartography of public network services

Description: CernVM-FS clients pop up in many places around the world.  In order to provide clients with low-latency access to files, several mirror servers are deployed in different geographical regions. Clients can be redirected to a nearby server using Geo-DNS (i.e. the DNS reply depends on the network block of the source IP address). As long as the database of available mirror servers and their locations is managed manually, however, any change is a cumbersome process that requires a remapping of network blocks. Ideally, the mapping of network blocks to mirror servers is computed automatically.

The main task is to implement an algorithm that pre-processes a set of servers and their location as well as a Geo-IP database (e.g. the GeoLite City database).  The pre-processing should result in buckets of network blocks representing networks that are close to a specific server.  Options to do so are, for instance, a Voronoi diagram on the globe or the use of network coordinates.  The maintenance of the system requires web services / pages for registering and unregistering servers as well as a visualization of the current status of the map.
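As a starting point, here is a minimal Python sketch of the core geometric step: assigning a coordinate obtained from a Geo-IP lookup to the closest mirror by great-circle distance. The mirror list and coordinates are made up for illustration.

    from math import radians, sin, cos, asin, sqrt

    # Hypothetical mirrors: name -> (latitude, longitude)
    MIRRORS = {"cern": (46.2, 6.1), "bnl": (40.9, -72.9), "asgc": (25.0, 121.5)}

    def haversine(lat1, lon1, lat2, lon2):
        # Great-circle distance in km between two points on the globe.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = (sin((lat2 - lat1) / 2) ** 2
             + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * asin(sqrt(a))

    def closest_mirror(lat, lon):
        return min(MIRRORS, key=lambda m: haversine(lat, lon, *MIRRORS[m]))

    print(closest_mirror(48.1, 11.6))   # a client near Munich -> "cern"

The actual project would pre-compute such assignments for whole network blocks and aggregate them into buckets, rather than answering per-client queries.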

Mentors: Carlos Aguado Sanchez and Jakob Blomer
Requirements: Good knowledge of TCP/IP, experience building web applications, and scripting languages (Perl/Python/bash).  Knowledge of map services (Google Maps, OpenStreetMap) is a plus.

 

CernVM Co-Pilot related projects

CernVM Co-Pilot is a framework that allows one to instantiate a ready-to-use computing infrastructure on top of distributed computing resources. Such resources include enterprise cloud computing infrastructures (e.g. Amazon EC2), scientific computing clouds (e.g. Nimbus), and, last but not least, resources provided by users participating in volunteer computing projects (e.g. those powered by the BOINC volunteer computing platform). Co-Pilot has been used to build the LHC@home 2.0 volunteer computing cloud, which has about 12,000 registered users (and about 15,000 registered hosts) from all over the world. CernVM Co-Pilot is distributed as part of CernVM. The Co-Pilot source code is available here and the documentation is available here.

 

CernVM Co-Pilot automated deployment and testing

Description: Create a framework for the automated deployment of CernVM Co-Pilot service and agent nodes, and develop a set of unit tests for testing the functionality of Co-Pilot. The framework should be able to automatically deploy all services needed for running Co-Pilot (such as Jabber, Redis, and Graphite), install the Co-Pilot components, configure them on the fly, run the test cases, and generate reports. The framework will be used to validate new developments and features in an automated way, and should make use of currently deployed tools (Tapper, Graphite) to generate reports.
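As a flavour of what the framework could automate, here is a minimal Python smoke-test sketch that checks whether the deployed services accept TCP connections before the functional tests start. Host and port values are illustrative (standard XMPP, Redis and Graphite plaintext ports).

    import socket

    SERVICES = {"jabber": ("localhost", 5222),
                "redis": ("localhost", 6379),
                "graphite": ("localhost", 2003)}

    def check(host, port, timeout=3):
        # A service counts as "up" if it accepts a TCP connection.
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except OSError:
            return False

    report = {name: check(*addr) for name, addr in SERVICES.items()}
    print(report)   # feed into the real report generator (e.g. Tapper)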

Mentors: Artem Harutyunyan
Requirements: Good knowledge of Perl (or Python) and Bash.

 

CernVM Co-Pilot Agent on mobile

Description: Port the CernVM Co-Pilot Agent to a mobile platform. The port can later be used either for building a framework on top of it (similar to the NASA application) to communicate news and interesting information to the users of the app, or as a base platform for deploying and running volunteer thinking applications.

Mentors: Artem Harutyunyan
Requirements: Experience with mobile development (either Android or iOS), knowledge of XMPP desirable.

 

CernVM Co-Pilot log analysis and visualization

Description: CernVM Co-Pilot services generate a lot of log data. Most of the data are currently stored in log files. The data related to the overall status of server-side services (queue length, number of executed jobs, etc.) are sent to the Co-Pilot Monitoring component, which stores them using Graphite. The monitoring system, however, does not provide information about the Co-Pilot Agents, so there are many cases where we need to dig into log files to find things out (e.g. to see when a particular user last successfully executed a job).
The task of this project is to improve this situation by extending the Co-Pilot Monitoring system. The aim is to represent the log data related to Co-Pilot Agents in an intuitive, visual, searchable and easily accessible manner.
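As a hint of one possible direction, here is a minimal Python sketch that extracts per-agent events from a log file (the log format shown is hypothetical) and feeds them to Graphite via its plaintext protocol ("path value timestamp" on TCP port 2003):

    import re, socket, time

    def push_metric(path, value, host="localhost", port=2003):
        # One connection per metric keeps the sketch simple; a real
        # feeder would batch metrics over a persistent connection.
        line = "%s %s %d\n" % (path, value, time.time())
        with socket.create_connection((host, port)) as s:
            s.sendall(line.encode())

    for line in open("copilot-jobmanager.log"):       # hypothetical file
        m = re.search(r"agent (\S+) finished job", line)  # hypothetical format
        if m:
            push_metric("copilot.agents.%s.jobs_done" % m.group(1), 1)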

Mentors: Artem Harutyunyan
Requirements: Good knowledge of scripting languages, experience with log data management (and visualization) will be an asset.

 

Geant4 Simulation Toolkit related projects

The Geant4 toolkit simulates the interactions of elementary particles and radiation with matter. It is used to simulate the detectors of the LHC and other High Energy Physics (HEP) experiments. It finds application in other areas, from assessing the effects of radiation on the electronics of satellites to improving the design of medical detectors with the custom GATE application. LHC experiments use Geant4 to compare the signatures of rare events from new physics (such as the Higgs boson) to those coming from known interactions. The open source toolkit is developed by the Geant4 collaboration, which brings together 90 physicists and engineers from laboratories and universities around the world. Developments are ongoing to improve its precision and scope of application, and to better utilise current and emerging computer architectures.
The simulation programs of the LHC experiments use the Geant4 simulation toolkit to produce simulated LHC events, running continuously on about 100,000 CPU cores throughout the year. Even so, the available statistics remain a limitation on the analysis potential for some interesting types of new physics. The goal of the project is therefore to explore different ways to reduce the execution time of Geant4 on today’s complex commodity CPUs, and to prototype how to use it efficiently on the many-core hardware of the future (tens, hundreds of cores, threads or ‘warps’).
The code required to model diverse types of particles and interactions, and the complex geometries of detectors, is large. As a result it overwhelms the caches of current CPUs, significantly reducing the efficiency with which today’s hardware is utilised. This effort is focused on identifying ways to spread the work between threads and/or to reorganise it. By using less code in each core we aim to make better use of the memory architectures of today’s hardware. At the same time we prepare the way to obtain good performance on tomorrow’s hardware.

 

Expand Geant4 multi-threading to the particle track-level

Description: Build a prototype of Geant4 which uses a separate thread to simulate the interactions of each particle track. First simulate all interactions from one primary track in one thread, following all its ‘secondary’ particles, the spin-off particles produced by interactions. Build on and extend the existing multi-threaded Geant4 (G4MT) prototype; G4MT currently handles in the same thread the whole set of particle tracks which comprise a collision event. In a second stage, farm out some secondary tracks to other threads, using different criteria. A key goal is to ensure that all results remain independent of the number of threads used, thus keeping the simulation reproducible.
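The following is a minimal illustration, in Python rather than C++ and not based on G4MT, of the track-level parallelism and reproducibility idea: a shared queue of tracks, workers that simulate one track at a time, secondaries farmed back onto the queue, and a per-track random seed so results do not depend on the number of threads. The 'physics' is obviously fake.

    import queue, random, threading

    tracks = queue.Queue()
    tracks.put(("primary", 1234))       # (track id, per-track RNG seed)

    def simulate(track_id, seed):
        # Seeding per track (not per thread) keeps results reproducible
        # regardless of how tracks are distributed over threads.
        rng = random.Random(seed)
        # Fake physics: each track spawns 0-2 secondaries.
        return [("%s.%d" % (track_id, i), rng.randrange(1 << 30))
                for i in range(rng.randrange(3))]

    def worker():
        while True:
            try:
                track_id, seed = tracks.get(timeout=1)
            except queue.Empty:
                return
            for secondary in simulate(track_id, seed):
                tracks.put(secondary)   # farm secondaries back out
            print("simulated", track_id)
            tracks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()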

Mentors: John Apostolakis
Requirements: Good knowledge of C++, Object-Oriented Programming, and parallel programming using threads is required. The ability to quickly understand the key classes of a large, existing toolkit, and familiarity with any type of Monte Carlo simulation or basic knowledge of atomic physics will be beneficial.

 

Reorder Execution to Improve Throughput

Description: Identify hot spots of Geant4-MT (and the sequential version) where instruction or data fetching or memory allocation is a significant limiting factor. Investigate ways to rearrange the execution in order to speed up the program. Review the management of key objects (tracks, touchable volumes), seeking to reduce cache misses and execution time. Examine the locality of objects and the effects of the allocation strategy for high-use lightweight objects.
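One allocation strategy worth examining is a free list / object pool for high-use lightweight objects, which Geant4 implements in C++ as G4Allocator. Here is a toy Python sketch of the idea (class and attribute names are illustrative):

    class Track:
        __slots__ = ("energy", "position")   # lightweight, fixed layout

    class TrackPool:
        # Recycle objects through a free list instead of allocating and
        # freeing each one, reducing allocator pressure and potentially
        # improving locality.
        def __init__(self):
            self._free = []

        def acquire(self):
            return self._free.pop() if self._free else Track()

        def release(self, track):
            self._free.append(track)

    pool = TrackPool()
    t = pool.acquire()      # reuses a recycled Track when one is available
    t.energy, t.position = 1.0, (0.0, 0.0, 0.0)
    pool.release(t)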

Mentors: John Apostolakis, Gabriele Cosmo
Requirements: Good knowledge of C++ and Object-Oriented Programming. Familiarity with profiling tools will be an asset.

 

Evaluate tools and strategies for parallelization, to guide implementation

Description: Use the sequential and multi-threaded versions of Geant4 as a benchmark for tools that assess the parallelisation potential of applications. Evaluate tools which allow annotation of an application to estimate the benefits of different approaches without making code changes. The project will validate the estimates of such tools by comparing them with the results of the choices made in the existing multi-threaded prototype Geant4-MT. In addition it will evaluate further avenues for the parallelisation of Geant4 and examine alternative design choices. Identify a promising approach, alternative or complementary, to improving throughput, and implement it.

Mentors: John Apostolakis, Andrzej Nowak
Requirements: Good knowledge of C++ and Object-Oriented Programming, and the ability to understand quickly the key classes of a large, existing toolkit. Familiarity with any type of Monte Carlo simulation or basic knowledge of atomic physics will be beneficial.

 

ROOT Object-Oriented data analysis framework

The ROOT system provides a set of OO frameworks with all the functionality needed to handle and analyze large amounts of data in a very efficient way. Having the data defined as a set of objects, specialized storage methods are used to get direct access to the separate attributes of the selected objects, without having to touch the bulk of the data. Included are histogramming methods in an arbitrary number of dimensions, curve fitting, function evaluation, minimization, graphics and visualization classes to allow the easy setup of an analysis system that can query and process the data interactively or in batch mode, as well as a general parallel processing framework, PROOF, that can considerably speed up an analysis.
Thanks to the built-in CINT C++ interpreter, the command language, the scripting (or macro) language and the programming language are all C++. The interpreter allows for fast prototyping of macros since it removes the time-consuming compile/link cycle. It also provides a good environment to learn C++. If more performance is needed, interactively developed macros can be compiled using a C++ compiler via a machine-independent, transparent compiler interface called ACLiC.
ROOT is an open system that can be dynamically extended by linking external libraries. This makes ROOT a premier platform on which to build data acquisition, simulation and data analysis systems. ROOT is the de-facto standard data storage and processing system for all High Energy Physics labs and experiments worldwide. It is also used in other fields of science and beyond (e.g. finance, insurance, etc.).
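To give a taste of the ROOT workflow, here is a small example using the Python bindings (PyROOT), assuming a ROOT installation is available: book a histogram, fill it with random data, fit it, and write it to a ROOT file.

    import ROOT

    f = ROOT.TFile("demo.root", "RECREATE")
    h = ROOT.TH1F("h", "Gaussian sample;x;entries", 100, -5, 5)
    h.FillRandom("gaus", 10000)      # fill from a built-in gaussian
    h.Fit("gaus")                    # fit the histogram with a gaussian
    h.Write()                        # persist the histogram to the file
    f.Close()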

 

Scientific Visualization in JavaScript

Description: The ROOT project is developing a JavaScript ROOT I/O library to be able to read ROOT files from any modern web browser without having to install anything on the client or server side. This project is advancing rapidly (see here), and we are now evaluating several open source JavaScript visualization libraries to display the different graphical objects that can be read from those ROOT files (1D, 2D, 3D histograms, 1D & 2D graphs, multi-graphs, see here and here).
As none of the existing JavaScript visualization libraries offers the required scientific graphing functionality, it will have to be implemented in one of those libraries (e.g. d3 or Highcharts). The developments will be donated back to the original library project.

Mentors: Bertrand Bellenot
Requirements: Good knowledge of JavaScript. Experience with scientific visualization would be an asset.

 

'Blue sky' ideas

 

  • Reliable and scalable storage backend for CernVM-FS. Currently CernVM-FS uses ZFS for the storage backend. The idea is to come up with an architecture for implementing the backend storage system for CernVM-FS. The storage should be reliable and scalable, it should support snapshots and rollbacks, and it should be optimized for hosting many (~10^8) small (~10 kB) files. The storage will be used to transform large directory trees into content-addressable storage; the task would also involve distributing the transformation among a few (or potentially many) worker and storage nodes.

Mentor: Jakob Blomer, Predrag Buncic, Artem Harutyunyan
Requirements: Good knowledge of existing storage technologies (e.g. distributed file systems), experience with parallel/distributed programming.

  • Camlistore storage backend for Co-Pilot. The idea is to replace the current Chirp storage backend with Camlistore.

Mentor: Artem Harutyunyan
Requirements: Experience with Linux, experience with Bash and Perl (or Python), network programming experience, familiarity with Camlistore.

  • LHCb adapter for Co-Pilot. The idea is to implement a Co-Pilot Job Manager which will make it possible to get jobs from the DIRAC system.

Mentor: DIRAC expert + Artem Harutyunyan
Requirements: Experience with Perl or Python, network programming experience, experience with LHCb DIRAC.

  • CMS adapter for Co-Pilot. The idea is to implement a plugin (aka Adapter) for the Co-Pilot system which will make it possible to get jobs from the CRAB system.

Mentor: CRAB expert + Artem Harutyunyan
Requirements: Experience with Perl or Python, network programming experience, experience with CMS CRAB.

  • Tier 3 Grid site out of the box. Grid middleware and the application software stacks of scientific collaborations are not easy to deploy and maintain. We think that virtualisation technologies can greatly facilitate the job of Grid site administrators. The task would be to implement a workflow manager (e.g. based on Taverna) which can be used to ease Grid site deployment and maintenance. This is a very important problem for relatively small universities and institutions which need to fulfill their commitments to providing Grid resources with limited manpower.

Mentor: Predrag Buncic
Requirements: Experience with Perl (or Python) and Bash; experience with Taverna and/or other workflow management tools will be a plus.

  • Geant4 on GPUs. Create a small prototype of Geant4 on GPUs for one particle type (gammas or electrons), combining a physics algorithm with geometry navigation for a small selection of volume types. Work can start from Otto Seiskari's geometry prototype (OpenCL, CUDA).

Mentor: John Apostolakis
Requirements: Experience with C/C++, OpenCL/CUDA.

  • Geant4 vectorized solid. Create a vectorised/GPU-capable solid that can be used in place of the most popular existing solids (box, tube, conical & spherical shells). Inspired by the approach of James Tickner (see the article in Computer Physics Communications, Vol. 181, Issue 11, Nov 2010, pages 1821–1832, available at http://dx.doi.org/10.1016/j.cpc.2010.07.001).

Mentor: Gabriele Cosmo
Requirements: Experience with C/C++, vectorization

  • New simulation engine. Create a prototype simulation engine for one particle type with a limited geometry choice (1-2 volume types) using a high-productivity parallel language (Chapel, X10, TBB) or a parallel extension of C++ (Cilk, SplitC). Benchmark this against existing solutions for the same problem. Document the development effort and the pitfalls of the language, tools and implementation (there is potential for a scientific report on the coding experience, in addition to the results).

Mentor: John Apostolakis
Requirements: Experience with C/C++, Chapel/X10/TBB, and/or Cilk/SplitC

 

Mentors

Here is the list of our mentors and their areas of expertise:

 

Contact information

Please do not hesitate to contact us if you are planning to apply for any of the above projects:

  • SFT GSoC mailing list: sft-gsoc-AT-cern-DOT-ch (no subscription needed).
  • SFT GSoC Jabber/XMPP chat room: gsoc-sft@conference.jabber.org . We have noticed that Gmail/Google Talk XMPP accounts have problems posting messages to the chat room, so please register an XMPP account on some other server (it takes no time to register an account at http://www.jabber.org).
  • IRC is restricted at CERN so please use Jabber instead.