Keynotes & Courses
Keynotes
Building Data-Intensive Systems that Care
given by Sihem Amer-Yahia
On Optimizing the Optimizer
given by Wolfgang Lehner
The traditional cost-based query optimizer enumerates different execution plans for each individual query, assigns costs to each plan, and selects the plan that promises the lowest cost for execution. However, as we all know, the optimal execution plan is not always selected. To steer the optimizer in the right direction, many database systems provide optimizer hints. These hints can be set for workloads, individual queries, or even query fragments. In this talk, we first show the potential of optimizer hinting by presenting the results of a comprehensive and in-depth evaluation using three benchmarks and two different versions of the open-source database system PostgreSQL. Subsequently, we highlight that query optimizer hinting is a nontrivial challenge and present two potential solutions. On the one hand, we propose TONIC, a novel cardinality-estimation-free extension for generic SPJ query optimizers. TONIC follows a learning-based approach and revises operator decisions for arbitrary join paths based on learned query feedback. To continuously capture and reuse optimal operator selections, we introduce a lightweight yet powerful Query Execution Plan Synopsis (QEP-S). On the other hand, we provide insights into FASTgres, a context-aware classification strategy for a more holistic hint-set prediction. In end-to-end evaluations, both strategies yield significant reductions in benchmark runtimes.
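To give a flavour of what such hints look like in practice, the sketch below issues a hinted query against PostgreSQL from Python via the pg_hint_plan extension; the table names and the particular hints are purely illustrative and are not part of TONIC or FASTgres, which select such decisions automatically.

    # Minimal sketch: steering PostgreSQL's optimizer with pg_hint_plan.
    # Table names and hints are illustrative only (assumes pg_hint_plan is installed).
    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()
    cur.execute("LOAD 'pg_hint_plan';")  # enable hint parsing for this session

    # The leading comment block steers the optimizer: hash-join the two tables
    # and start the join order with customers.
    hinted_query = """
    /*+ HashJoin(c o) Leading((c o)) */
    SELECT c.name, COUNT(*)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name;
    """
    cur.execute("EXPLAIN " + hinted_query)  # inspect the plan the hints produce
    for (line,) in cur.fetchall():
        print(line)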
Mosaics in Big Data
given by Dr. Volker Markl
Data management systems research focuses on improving human and technical efficiency for performing data analysis tasks. In this presentation, I describe selected research contributions towards that goal. I first highlight work on using query feedback to improve the cardinality model of a relational query optimizer. I will then discuss how the research vision of the Stratosphere research project at TU Berlin led to the creation of the data stream processing system Apache Flink. As a third contribution, I will discuss how fractal space-filling curves can be used to efficiently process multidimensional range queries. I will conclude by giving an outlook on NebulaStream, a novel data processing system that handles massively distributed data streams on heterogeneous devices.
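For the space-filling-curve part, the following small, generic sketch illustrates the underlying idea with a plain Z-order (Morton) encoding; it is not the specific curve or implementation discussed in the talk.

    # Generic illustration: map 2-D points to a Z-order (Morton) key by bit
    # interleaving, so that points that are close in space tend to be close on
    # the resulting 1-D curve and a box query becomes a filtered key-range scan.
    def morton_key_2d(x: int, y: int, bits: int = 16) -> int:
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
            key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
        return key

    points = [(3, 7), (8, 2), (9, 9), (4, 5)]
    indexed = sorted((morton_key_2d(x, y), (x, y)) for x, y in points)

    # Range query [2..9] x [4..9]: scan the (over-approximated) key interval
    # between the keys of the box corners, then filter exactly.
    lo, hi = morton_key_2d(2, 4), morton_key_2d(9, 9)
    hits = [(x, y) for k, (x, y) in indexed
            if lo <= k <= hi and 2 <= x <= 9 and 4 <= y <= 9]
    print(hits)  # [(3, 7), (4, 5), (9, 9)]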
Courses
Graph Databases: Foundations and Data Science Applications
given by Stefania Dumbrava
Network Data Science
given by Panayiotis Tsaparas
Sustainable Machine Learning
given by Bettina Kemme
Machine learning has become increasingly data- and processing-hungry. A recent report from the International Energy Agency projects that the electricity demand of data centers specialized in AI will more than quadruple by 2030. It has therefore become a pressing need to include energy awareness and environmental sustainability in the machine learning life cycle, and a considerable amount of research has indeed been devoted to this direction in recent years.
The first part of this tutorial will discuss various mechanisms to assess the environmental impact of machine learning, from power and energy consumption to carbon footprint. This will be put in relation to more traditional performance metrics used in the research literature, from the “goodness” of an ML solution, measured by metrics such as accuracy, to systems performance metrics such as runtime, throughput, and scalability. From there, the tutorial will present several concrete research efforts for a quantitative analysis of the environmental footprint of various ML tasks.
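As a minimal, hedged illustration of such measurements, the sketch below wraps a small training run with the open-source codecarbon tracker (one of several tools of this kind) and reports the estimated footprint next to a traditional quality metric; the model and data are placeholders.

    # Minimal sketch: estimate the energy/carbon footprint of a training run
    # with the open-source codecarbon library; the tiny scikit-learn model and
    # synthetic data stand in for a real ML workload.
    from codecarbon import EmissionsTracker
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

    tracker = EmissionsTracker()   # samples CPU/GPU/RAM power while running
    tracker.start()
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    emissions_kg = tracker.stop()  # estimated kg CO2-equivalent for this run

    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
    print(f"Training accuracy:   {model.score(X, y):.3f}")  # traditional metric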
The second part of the tutorial will outline recent solutions to tackle the huge energy consumption of modern ML. For instance, a growing number of research efforts aim to make both the learning and the inference tasks more efficient while providing similar performance in terms of traditional ML metrics such as accuracy. A further line of research focuses on adjusting the infrastructure or the execution of ML tasks to be more energy-aware, e.g., through scheduling approaches.
From Data Quality to Data Reduction for a Sustainable Future
given by Yannis Velegrakis
Due to the rate at which we are currently collecting and storing data, organisations often run out of space and need to discard part of the data they hold in order to accommodate new data. The main and most challenging task in doing so is to decide the value of the different parts of the data, so that the parts with the lowest value can be discarded. In this lecture we will discuss the different methods that can be used for this purpose and investigate what to use and when.
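To make value-based data reduction concrete, here is a small, purely illustrative sketch in which each data partition receives a value score (the scoring function and its weights are hypothetical, not the methods covered in the lecture) and the lowest-valued partitions are dropped until a storage budget is met.

    # Purely illustrative: score partitions by how often and how recently they
    # are used, keep the most valuable ones within the budget, discard the rest.
    from dataclasses import dataclass

    @dataclass
    class Partition:
        name: str
        size_gb: float
        accesses_last_month: int
        days_since_last_access: int

    def value(p: Partition) -> float:
        # Hypothetical scoring: frequently and recently used data is more valuable.
        return p.accesses_last_month / (1 + p.days_since_last_access)

    def reduce_to_budget(partitions, budget_gb):
        kept, total = [], 0.0
        for p in sorted(partitions, key=value, reverse=True):  # most valuable first
            if total + p.size_gb <= budget_gb:
                kept.append(p)
                total += p.size_gb
        return kept  # everything not returned would be discarded or archived

    parts = [Partition("clickstream-2021", 400, 2, 300),
             Partition("orders-2024", 120, 900, 1),
             Partition("sensor-raw", 700, 50, 14)]
    print([p.name for p in reduce_to_budget(parts, budget_gb=900)])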
Lab coordinators: Yannis Velegrakis, Ramon Rico
NebulaStream – Data Stream Processing for the Edge-Cloud-Continuum
given by Dr. Volker Markl
Modern data-driven applications arising in domains such as smart manufacturing, healthcare, and the Internet of Things pose new challenges to data processing systems. Traditional stream processing systems, such as Flink, Spark, and Kafka Streams, are ill-suited to cope with the massive scale of distribution, the heterogeneous computing landscape, and requirements such as timely processing and actuation. Classical approaches, such as managed runtimes, interpretation-based query processing, and the optimization of single queries in a way that neglects their interactions, greatly limit throughput, latency, energy efficiency, and the general usability of these systems for emerging applications involving distributed data processing at scale in a sensor-edge-cloud environment.
To overcome these limitations, we are researching and building NebulaStream, a novel open-source data stream processing system for massively distributed, heterogeneous environments. NebulaStream supports (potentially resource-constrained) heterogeneous devices, a hierarchical topology (with the distribution of computation and data flow in a cloud-edge continuum), and the sharing of computations and data across multiple concurrent queries. This presentation discusses the design goals and core concepts of NebulaStream and looks back at inspirations drawn from our prior work on Stratosphere and Apache Flink, among others.
Lab: A Hands-On Tutorial on NebulaStream
given by Dr. Volker Markl, Aljoscha Lepping
NebulaStream is a novel, open-source data stream processing system for distributed, heterogeneous data streams in the cloud-edge continuum. It adheres to the design goals of ease of use, extensibility, and efficiency to provide a framework for users and developers to implement diverse Internet of Things (IoT) use cases. Equipped with essential built-in functionality, NebulaStream allows users to customize the system easily while ensuring efficient execution even on low-end devices. In this tutorial, we showcase NebulaStream’s extensibility capabilities with a specific focus on integrating and processing multi-modal data. Participants will learn how to extend NebulaStream by implementing functions, sources (data ingestion), sinks (data export), and data types operating on multi-modal data. After the tutorial, participants should be able to extend NebulaStream on their own, e.g., creating a new function, without needing to modify or even understand the rest of the codebase.
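To give a flavour of what such an extension involves, the schematic sketch below shows a custom source, a user-defined function, and a sink as they appear in many stream-processing frameworks; all class and method names here are hypothetical and deliberately not NebulaStream’s actual API, which the lab walks through.

    # Schematic, framework-agnostic sketch of custom source/function/sink
    # extension points. Names are hypothetical, NOT NebulaStream's real API.
    import json, time
    from typing import Dict, Iterator

    class TemperatureSource:
        """Hypothetical source: ingests readings from a sensor feed."""
        def records(self) -> Iterator[Dict]:
            for i in range(5):  # stand-in for an unbounded sensor stream
                yield {"sensor": "s1", "ts": time.time(), "temp_c": 20.0 + i}

    class JsonFileSink:
        """Hypothetical sink: exports each result record as one JSON line."""
        def __init__(self, path: str):
            self.path = path
        def write(self, record: Dict) -> None:
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")

    def too_hot(record: Dict) -> bool:  # user-defined function, e.g. a filter
        return record["temp_c"] > 22.0

    sink = JsonFileSink("alerts.jsonl")
    for rec in TemperatureSource().records():
        if too_hot(rec):
            sink.write(rec)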