
Exploiting big data and machine learning techniques for LHC experiments

Periodic Reporting for period 1 - LHCBIGDATA (Exploiting big data and machine learning techniques for LHC experiments)

Reporting period: 2018-07-02 to 2020-07-01

The aim of the LHCBIGDATA project is to provide the Large Hadron Collider (LHC) community with the necessary tools to deploy Machine Learning (ML) solutions. The tools under development are experiment-independent, to promote the exchange of common solutions among the various LHC communities. The benefits of this approach are being demonstrated on a real-world use case: the optimisation of computing operations for the CMS experiment. To set the scale, a typical LHC experiment manages a computing infrastructure of more than 100,000 cores spread over more than one hundred computing centres around the world. The two biggest experiments (ATLAS and CMS) have collected and produced around 1 EB of data since the LHC started operations. Operating such an infrastructure still carries a very large human cost, of the order of 50-100 FTEs per year per experiment.
The following objectives have been identified for this project:
• Development of a scalable ML framework and integration of several architectures (CPUs, GPUs, FPGAs);
• Promotion of ML techniques within the LHC community and beyond, by creating links with the local scientific community, and within international collaborations in the field of High Energy Physics (HEP);
• Application to the use case of the optimisation of CMS computing operations.
With reference to the above identified objectives, the following work was done:
• Development of a scalable ML framework and integration of several architectures (CPUs, GPUs, FPGAs).
The Experienced Researcher (ER), together with the Turin group, has deployed and tested the ML framework, which is available on opportunistic resources of the Turin INFN computing centre. The framework is used by students of the course "Big Data and Machine Learning", held by the ER for doctoral students, and by researchers of the University of Turin for a variety of applications (analysis of MRI data, fast silicon sensors, log files). The computing applications run on virtual clusters deployed on top of the physical infrastructure. Task scheduling is managed by an orchestration layer (Kubernetes), leveraging Docker containers to define and isolate the runtime environment. The virtual clusters developed to execute ML workflows are accessed through a web interface based on JupyterHub. When a user authenticates on the Hub, a notebook server is created as a containerised application. A set of libraries and helper functions is provided to execute a parallelised ML task by automatically deploying a Spark driver and several Spark execution nodes as Docker containers, as illustrated in the sketch below. This solution automates the delivery of the software stack required by a typical ML workflow, and enables scalability by allowing ML tasks, including training, to be executed over commodity (i.e. CPU) or high-performance (i.e. GPU) resources distributed over different hosts across a network.
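As an illustration, the following minimal sketch (in Python, not the project's actual helper functions) shows how a notebook user might obtain Spark execution nodes as containers by pointing Spark at the Kubernetes API; the API endpoint, container image name, and resource settings are illustrative assumptions.

```python
# Minimal sketch of launching a parallelised Spark workload on Kubernetes.
# All endpoint, image, and sizing values below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")  # cluster API endpoint (assumed)
    .appName("ml-training-example")
    .config("spark.executor.instances", "4")             # executor containers to spawn
    .config("spark.kubernetes.container.image",
            "example/spark-ml:latest")                   # hypothetical runtime image
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Any Spark job now runs distributed over the executor containers,
# e.g. data preparation or the training step of an ML pipeline.
print(spark.sparkContext.parallelize(range(1000)).sum())
spark.stop()
```

In the deployed framework this boilerplate is hidden behind the provided helper functions, so that users only interact with a notebook while the driver and executors are scheduled as Docker containers.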
• Promotion of ML techniques within the LHC community and beyond, by creating links with the local scientific community, and within international collaborations in the field of High Energy Physics (HEP).
In this context, the ER is coordinating a cross-experiment activity, named Operational Intelligence (OI), aimed at developing and collecting tools for the application of artificial intelligence (AI) to computing operations for large scientific collaborations. The ER has started collaborations with several groups at the University of Turin interested in ML applications for earth sciences and detector development. She teaches a course on Big Data Science and Machine Learning for Ph.D. students.
• Application to the use case of the optimisation of CMS computing operations.
The ER is coordinating all monitoring activities for the CMS computing system: managing the infrastructure that collects all logging information, and providing access and support to the various groups involved. This infrastructure is vital to give the various teams all the information necessary to steer computing operations. An automatic alert system is in place to notify the relevant teams of subsystem failures. An intelligent system providing combination, silencing, and grouping of alerts coming from different sources is currently under development; a toy sketch of this logic is given below.
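As a toy illustration of the grouping and silencing idea (a minimal sketch under assumed data structures, not the system actually under development), alerts from different sources can be bucketed by affected subsystem and suppressed when a silence is active, so that a team is notified once per problem rather than once per alert:

```python
# Minimal sketch: group alerts by subsystem and honour active silences.
# The Alert fields and the example sources/subsystems are hypothetical.
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Alert:
    source: str     # e.g. a metrics system or a log parser (assumed)
    subsystem: str  # e.g. "transfers", "submission"
    message: str

def group_and_silence(alerts, silenced_subsystems):
    """Group alerts by subsystem, dropping those under an active silence."""
    grouped = defaultdict(list)
    for alert in alerts:
        if alert.subsystem in silenced_subsystems:
            continue  # silenced: do not notify
        grouped[alert.subsystem].append(alert)
    return grouped

# Usage with toy alerts: "submission" is silenced, "transfers" is grouped.
alerts = [
    Alert("metrics", "transfers", "transfer failure rate above threshold"),
    Alert("log-parser", "transfers", "repeated timeouts in transfer logs"),
    Alert("metrics", "submission", "job submission backlog growing"),
]
for subsystem, items in group_and_silence(alerts, {"submission"}).items():
    print(f"[{subsystem}] {len(items)} related alert(s) -> notify team once")
```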

A list of dissemination and communication activities is given below:
• The following conferences were attended by the ER:
◦ CHEP (Computing in High Energy Physics) 2018 in Sofia in July 2018;
◦ CHEP 2019 in Adelaide (Australia) in November 2019;
◦ 3rd Rucio Community Workshop at FNAL in March 2020.
• The following publications were prepared:
◦ The CMS monitoring infrastructure and applications, submitted to Computing and Software for Big Science;
◦ Operational Intelligence for Distributed Computing Systems for Exascale Science, accepted for publication in EPJ Web of Conferences;
◦ Big data solutions for CMS computing monitoring and analytics, accepted for publication in EPJ Web of Conferences;
◦ Delivering a machine learning course on HPC resources, submitted for publication in EPJ Web of Conferences.
• Outreach activities:
◦ Speaker at the premiere of the movie “Almost Nothing”;
◦ Shifter at the INFN stand at the Turin Book Fair (Salone del Libro);
◦ Participation in “Pint of Science” nights in Turin.
Several activities were started during this project: the Monitoring and Analytics working group in CMS, the Operational Intelligence effort, and the ML course and MLaaS cluster at the University of Turin. Deliverables of these activities consist of documentation, projects, code, and data. By fostering the adoption of industry-standard technologies, this project is building bridges among different scientific communities that have for decades been developing ad-hoc solutions. Clear advantages of this approach are: people from different experiments working together on common projects; growth of young researchers' skill sets in a way that is recognisable by both industry and research; and compliance with funding agency policies. In particular, this project contributed to:
◦ the creation of a complete monitoring and analytics framework for the CMS distributed computing community, including tools for visualization, data mining, and alert rules;
◦ the creation of a cross-experiment community sharing experiences, effort, and code to develop intelligent systems to increase the level of automation in distributed computing operations;
◦ the start of several ML-related projects within the University of Turin, fostered by the MLaaS cluster and the course for PhD students. These projects include: detector design studies for fast silicon trackers, analysis of log files from computing services, and analysis of magnetic resonance data for soil composition.
The following results have been achieved:
◦ built ties with more communities/experiments, both internationally (within CMS, and with other LHC/HEP experiments) and locally (with the Turin scientific community from both the University and the Technical University, and with external research foundations such as Links). These collaborations are evolving into proposals for further projects;
◦ started a new working group, Monitoring and Analytics, in CMS;
◦ started a new cross-experiment project, Operational Intelligence;
◦ started the course “Big Data Science and Machine Learning”, and the supervision of several students in ML-related projects.
[Image: Logo of the Operational Intelligence effort]