Machine Learning Accelerators: From Cloud to Edge by Luca Benini & Francesco Conti
Machine learning (ML), and more specifically deep learning (DL), in both training and inference, has rapidly become the key workload for a wide range of computing systems: high-performance supercomputers, cloud data centers, small clusters and servers, embedded computers, and even mobile and IoT devices. As a consequence, industry and academia have been working with unprecedented focus to squeeze out energy efficiency and performance by tuning and specializing systems and architectures for machine learning workloads. This monumental effort has produced, in less than five years, an enormous proliferation of ML (DL) accelerator architectures, with a number of exciting new ideas and designs. The goal of this lecture is to give practical knowledge of the main architectural patterns used in the design of ML accelerators and to analyze their hardware and software embodiments. The lecture will also offer a deep dive into ultra-low-power accelerators for edge devices.
Cache and memory capacity have a significant impact on performance, energy consumption and cost in today's computers, ranging from smartphones and laptops/desktops to server systems in data centers. One promising approach to improving the utilization of a given amount of cache or main memory is to compress the data contained in it. However, a compressed cache or memory design involves several challenges: accessing compressed data quickly, choosing a compression algorithm, and locating, compressing and recompressing data. This course offers an overview of state-of-the-art techniques for cache and memory compression and goes into detail on some of the recent advances in this area.
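The abstract does not name a particular compression algorithm. As one hedged illustration of the trade-offs it mentions, the sketch below implements a simplified base-plus-delta scheme: a 64-byte line is stored as its first 8-byte word plus one small signed delta per word, and falls back to uncompressed storage when any delta is too large. The line size, word size, function names and one-byte delta width are all assumptions chosen for illustration, not details taken from the course.

```python
import struct

LINE_BYTES = 64   # assumed cache-line size
WORD_BYTES = 8    # assumed word granularity
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def compress_line(line, delta_bytes=1):
    """Return (base, deltas) if the line is compressible, else None.

    The line is compressible when every 8-byte word differs from the
    first word by a value that fits in a signed delta_bytes-wide integer.
    """
    if len(line) != LINE_BYTES:
        raise ValueError("expected a full cache line")
    words = struct.unpack("<%dQ" % WORDS_PER_LINE, line)
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None  # caller stores the line uncompressed


def decompress_line(base, deltas):
    """Rebuild the original 64-byte line from the base and the deltas."""
    return struct.pack("<%dQ" % WORDS_PER_LINE, *(base + d for d in deltas))


def compressed_size(delta_bytes=1):
    """Bytes needed for one base word plus one delta per word."""
    return WORD_BYTES + WORDS_PER_LINE * delta_bytes
```

A line holding eight nearby pointers, such as `0x1000, 0x1008, ...`, shrinks from 64 to 16 bytes under these assumptions, while a line of unrelated values is left uncompressed. A write that pushes any delta out of range forces the line to be recompressed or expanded, which is the kind of cost the abstract alludes to.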
For the last 40 years, Process Technology and Computer Architecture have been orchestrating the magnificent growth in computing performance; Process Technology was the main locomotive, while Computer Architecture contributed only about 1/3 of the performance outcome. It seems that we have reached a major turning point: Moore’s law is reaching its end and Dennard scaling has already ended, while performance requirements continue to soar for many new exciting applications. The combination of “new” killer applications (ML) and the trend towards heterogeneous computing provides a new thrust in computer architecture. In this session we will present this change in environment and the development of an analytical model (MultiAmdahl) that provides a basis for optimally using limited resources (e.g. memory, area) to achieve a target goal (e.g. maximum performance, minimum energy). We will apply the MultiAmdahl model to a specific Neural Network implementation.
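To make the idea of dividing a limited resource among accelerators concrete, here is a minimal sketch of one way such a constrained optimization can be framed. It is not the session's actual formulation: the workload is assumed to be split into tasks with time fractions t_i, each task is assumed to run on its own unit of area a_i with run time t_i / sqrt(a_i) (a hypothetical power-law speedup), and the areas must sum to a fixed budget A. Under those assumptions a Lagrange-multiplier argument gives the closed form a_i proportional to t_i^(2/3).

```python
def allocate_area(time_fractions, total_area):
    """Split total_area across tasks to minimize total run time.

    Assumes each task's run time is t_i / sqrt(a_i). Setting the
    derivative of sum(t_i / sqrt(a_i)) + lam * (sum(a_i) - A) to zero
    gives a_i proportional to t_i ** (2/3).
    """
    weights = [t ** (2.0 / 3.0) for t in time_fractions]
    scale = total_area / sum(weights)
    return [w * scale for w in weights]


def execution_time(time_fractions, areas):
    """Total run time under the assumed 1 / sqrt(area) speedup model."""
    return sum(t / a ** 0.5 for t, a in zip(time_fractions, areas))


# Hypothetical workload: three tasks taking 60%, 30% and 10% of the time.
fractions = [0.6, 0.3, 0.1]
budget = 12.0
optimal = allocate_area(fractions, budget)
equal = [budget / len(fractions)] * len(fractions)
```

With these made-up numbers, the optimized split gives a lower total run time than dividing the area equally, and it gives the most area to the task that dominates execution time. Other speedup functions or resources (memory, energy) would change the closed form but not the overall approach.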
Design concepts and their application: In the near future, the computer architecture landscape is going to exhibit a convergence of targets between HPC and AI embedded processing for IoT. The lecture will analyse how the requirements of computing speed, power efficiency and memory demand translate into processor design in such a converging scenario, taking into account the interrelations between the circuit level and the micro-architecture level. Two examples, an HPC vector processor and an AI-oriented IoT embedded processor, will be illustrated and compared. The hands-on session will include RTL design exercises showing architecture/circuit parameter exploration.
Prof. Labarta will start his seminar by addressing how architectural evolution and multicores have impacted the way we program our machines, and his vision of how we should proceed with the objective of ensuring productivity and performance. In this context, programming models, their runtime implementation and the architectural support play a key role in the success of our efforts towards exascale. The seminar will then continue by presenting task-based programming (with emphasis on OpenMP and its OmpSs forerunner, developed at BSC), its hybridization with the MPI message passing interface (TAMPI, Task-aware MPI) and support for Dynamic Load Balancing through the DLB library. During his lecture, Prof. Labarta will discuss some examples showing how future runtime-aware architectures can provide the required support to the parallel runtime system so that it can make the appropriate decisions.
Disaggregated Computing refers to a computer organization model in which resources are not exposed as pre-configured computers; instead, pools of resources are packaged together (memory-dense nodes, storage-dense nodes, accelerator-dense nodes), interconnected with very fast network fabrics, and supervised by a global system manager that can dynamically interconnect those resources to assemble general-purpose nodes on demand. In this lecture, we will explore the technologies used to enable resource pooling and disaggregation, the changes needed in scheduling policies, and the applications that can benefit from this paradigm.