28.- Summer School on High-Performance Interconnection Networks for HPC and Data Centers Towards the Exascale and Big-Data Era (HiPINEB).
July 10 - July 11
Francisco J. Alfaro.-University of Castilla-La Mancha, Spain
Jesús Escudero-Sahuquillo.-University of Castilla-La Mancha, Spain
Dr. Isidro Ramos Salavert, Full Professor of the Department of Computer Systems and Computation, Technical University of Valencia, Spain
The course lessons and talks will be in English.
Graduate, Master, PhD students, professors, engineers, researchers, managers, and other people both from industry and academia are encouraged to participate.
By the year 2023, High-Performance Computing (HPC) systems are expected to break the Exaflop performance barrier (10^18 FLOPS) while their power consumption is kept at current levels (or increases only marginally), which is known as the Exascale challenge. In addition, HPC clusters and datacenters are required to offer more storage capacity and faster data access in order to manage and store the huge amounts of data produced by software applications, which is known as the Big-Data challenge. Indeed, both the Exascale and Big-Data challenges are driving the technological revolution of this decade, motivating major research and development efforts from industry and academia. In this context, the interconnection network plays an essential role in the architecture of HPC systems and datacenters, as the number of processing or storage nodes to be interconnected in these systems is very likely to grow significantly to meet the higher computing and storage demands. Besides, the capacity of the network links is expected to grow, as the roadmaps of several interconnect standards forecast. Therefore, the interconnection network should provide high communication bandwidth and low latency; otherwise, the network becomes the bottleneck of the entire system. In that regard, many design aspects are considered when it comes to improving interconnection network performance, such as topology, routing algorithm, power consumption, reliability and fault tolerance, congestion control, programming models and control software.
June 30th
Monday, July 10
Room: Salón de Actos ESII
9:00 – 9:30 – Registration and Welcome
9:30 – 10:00 – Opening
10:00 – 11:30 – Shared Pool of Virtualized Accelerators: A Key Architectural Innovation for Power-Efficient Clusters. Jose Duato, Technical University of Valencia, Spain.
Abstract: Power consumption has become a critical issue in HPC clusters and datacenters, and is currently the main obstacle in the Exaflop race. A well-known strategy to increase power efficiency consists of incorporating computing accelerators into the system architecture. Currently, many HPC systems include GPUs, but future systems will likely use a larger variety of accelerators: FPGAs, wavefront arrays of fixed-point units for AI, and ultimately, quantum accelerators. However, the average utilization of GPUs in current HPC systems is quite low, and more specialized accelerators are likely to exhibit even lower utilization.
In this talk, we present a technique to drastically increase the utilization of computing accelerators. It consists of a combination of virtualizing those accelerators to enable concurrent access from different application threads, and providing support for remote access through the system’s high-speed interconnect. The result is a shared pool of virtualized accelerators that will increase resource utilization, application execution speed, and system throughput while reducing energy consumption. A key component to efficiently implement remote accelerator access is a high-speed interconnection network.
We will present performance results for a particular implementation of this architectural innovation, based on enabling access to remote NVIDIA GPUs through an InfiniBand network for unmodified CUDA-accelerated applications. We will also introduce some alternative approaches to implementing remote access to accelerators, as well as some enhancements for this class of techniques.
12:00 – 13:30 – Congestion Management for Current and Future Interconnection Networks: Challenges and Solutions. Pedro Javier Garcia, University of Castilla-La Mancha (UCLM), Spain.
Abstract: Congestion appears in interconnection networks when intense traffic clogs internal paths, thus slowing down traffic and degrading network performance. This keynote offers an overview of current strategies to avoid, reduce or eliminate network congestion and/or its negative effects, analyzing their suitability for future Exascale systems.
16:00 – 18:30 – Practical Lab: Hands-on configuring HPC networks
This session is intended to be a fully hands-on session in which attendees will learn several methods for configuring high-performance interconnection networks. Specifically, we will work with the CELLIA cluster of the RAAP research group at the Albacete Research Institute (I3A) of the UCLM.
19:00 – 20:30 – Many-core processors interfacing interconnection networks: lessons learned and possible future directions. Holger Fröning, Ruprecht-Karls University of Heidelberg, Germany.
Abstract: Concurrency galore currently defines computing at all levels, leading to a vast amount of parallelism even for small computing systems. Technology constraints prohibit a reversal of this trend, and the still unsatisfied need for more computing power has led to a pervasive use of accelerators to speed up computations. In spite of this pervasive use, accelerators are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers, as CPUs are responsible for communication tasks. This talk will briefly introduce the current state of the art for accelerator-centric communication, and review our observations and insights from experimenting with accelerators sourcing and sinking network traffic. We will discuss current options and limitations, as well as implications for interconnection networks and tool stacks like MPI. In particular, we will learn why specialized processors require specialized communication models. Finally, the talk will offer some opinions on anticipated research problems.
Tuesday, July 11
Room: Salón de Actos ESII
9:00 – 10:30 – High Performance Input/Output in large Fabrics. Bernard Metzler, IBM Zurich Research Laboratory, Zurich, Switzerland.
11:00 – 12:30 – New trends in Data Center and HPC networks. Eitan Zahavi, Mellanox Technologies, Israel.
Abstract: In this session we will inspect two future trends in HPC and data center networks. In recent years, demand for bare-metal provisioning for Machine Learning as well as HPC clouds has grown. This need stems from the requirement for predictable performance, and thus isolation, that some modern applications rely on. As a result, cloud network isolation is no longer just VLAN-based but should actually provide some real guarantees. With the ever-increasing demand for East-West bandwidth, DCNs are turning to optical technologies to sustain exponential bandwidth growth, so Optical Data Center Networks (ODCNs) are on the rise. We will review their promise and fundamental limitations via the introduction of NEPHELE, a TDMA ODCN.
12:30 – 14:00 – Intel Omni-Path Fabric: Architecture and technology overview. Gaspar Mora, Intel Corp., Santa Clara, USA.
Abstract: Wondering about the new interconnect solution with 28 systems on the current TOP500 list, including the fastest 100 Gbps cluster? This is an introduction to the recent Intel Omni-Path Architecture (Intel OPA), an end-to-end fabric solution that delivers the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens of thousands of nodes. Special focus will be on the micro-architecture of the 48-port, 4.8 Tbps switch ASIC that powers the Intel Omni-Path Edge switches used to build large multitier fabrics for HPC systems.
16:00 – 17:30 – Exascale fabric administration tools – BXI Software solutions. Ben Bratu, Atos/BULL.
Abstract: For HPC systems, and in the context of a leap to exascale, the scalability of system management solutions will be a mandatory feature. At the same time, HPC administrators will be asked to install and manage a plethora of systems of varying size and network complexity. The management solutions should also be flexible enough to be easily configured, installed and used on different HPC systems, from teraflop to exaflop systems. With Bull BXI Software, we provide a series of tools that allow administrators to install, configure and operate the entire system or individual network devices. At the same time, the proposed solutions enable administrators to become aware of and pinpoint potential causes of service degradation much more rapidly than is possible today.
18:00 – 19:30 – Panel Session. The Future of the Interconnect Technology in Moderator: Pedro Javier Garcia, University of Castilla-La Mancha (UCLM), Spain.
Application demands for computing power and data processing are increasing dramatically during this decade, and these requirements clearly influence the design of the computing infrastructure. This panel will gather experts from both academia and industry to share their opinions on the technology advances for current and future high-performance interconnection networks.
19:30 – 20:00 – Farewell