Italy

Researcher (scientific/technical/engineering)

Date of the expedition

From 15/03/2024 to 15/09/2024

Selected Track

Challenges

Project title

PROJECT 2 (TNA)

Host Organization

University of Colorado, Boulder

Media

Biography

I received my master’s degree in ICT and Internet Engineering from the University of Rome Tor Vergata (Italy) in February 2020. I worked as a researcher at Tor Vergata with a CNIT scholarship from January 2020 until October 2020, when I started my Ph.D. in Electrical Engineering. I will complete my degree at the end of September 2024. My research interests include networked systems, in-band network monitoring and processing, eBPF and Linux networking, and machine learning.

Project Summary

Machine learning uses substantial amounts of computing resources in datacenters. The training process can be very costly because of the hardware needed and the power consumed. When a large model is trained, the load is shared among multiple servers, which use the network to synchronize. Every network fault therefore has a cost, and keeping downtime to a minimum is beneficial. Existing solutions do not consider faults that happen at the network edge. By deploying a fault-tolerant network solution, we can reduce the cost, computation, power and time needed by the machine learning task. The solution requires no additional hardware, so it can be deployed in existing datacenters, and it is transparent to the applications, which do not need to be modified. It is also useful when no unexpected fault occurs but a network device needs to be powered down for maintenance.

Key Result

The results obtained so far are promising and show good tolerance to network faults, such as switch or link failures. The time and computing resources needed for training are lowered as a result. It is not easy to provide precise numbers, as these depend on the specific model being trained; a few examples will be provided in the final report. What we have already achieved is a fast network recovery that allows the training process to continue without interruption. Without a fault tolerance system in place, if a failure occurs, the training must resume from the last saved “checkpoint”, wasting all the resources used since that point in time.
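
To illustrate the cost of resuming from a checkpoint, the following is a minimal sketch, not a result of the project. It assumes checkpoints are saved at a fixed interval starting at time zero and that every server repeats the lost work. The function names and all numbers are hypothetical and chosen only for the example.

```python
def progress_lost_s(failure_time_s: float, checkpoint_interval_s: float) -> float:
    """Seconds of training progress discarded when rolling back to the
    last checkpoint (assumption: checkpoints at 0, T, 2T, ...)."""
    if failure_time_s < 0 or checkpoint_interval_s <= 0:
        raise ValueError("failure time must be >= 0 and interval must be > 0")
    return failure_time_s % checkpoint_interval_s


def wasted_server_seconds(failure_time_s: float,
                          checkpoint_interval_s: float,
                          num_servers: int) -> float:
    """Total server time wasted, assuming all servers redo the lost work."""
    return progress_lost_s(failure_time_s, checkpoint_interval_s) * num_servers


if __name__ == "__main__":
    # Hypothetical scenario: checkpoint every 30 minutes, failure after
    # 5000 s of training, 8 servers sharing the load.
    lost = progress_lost_s(5000, 1800)
    wasted = wasted_server_seconds(5000, 1800, 8)
    print(f"progress lost: {lost} s, wasted: {wasted} server-seconds")
```

In this hypothetical scenario the last checkpoint was taken at 3600 s, so 1400 s of progress are repeated on each of the 8 servers. A network recovery that lets training continue without a rollback avoids this repeated work.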

Impact of the Fellowship

So far, the fellowship has produced some preliminary results. The solution is already being tested as a prototype, and the final results will be reported in a scientific publication. The current testbed reproduces a small datacenter, with host machines connected via network switches. All the devices run Linux and all the software is open source. The source code will be published at the end of the project together with the conference article. The work aims to improve the state of the art. Its outcome can positively impact future research and be adopted in commercial deployments. The research is being carried out in collaboration with a group of researchers led by the host organization.