Keynotes & Courses
Keynotes
Building Data-Intensive Systems that Care
given by Sihem Amer-Yahia
On Optimizing the Optimizer
given by Wolfgang Lehner
The traditional cost-based query optimizer enumerates different execution plans for each individual query, assigns costs to each plan, and selects the plan that promises the lowest cost for execution. However, as we all know, the optimal execution plan is not always selected. To steer the optimizer in the right direction, many database systems provide optimizer hints. These hints can be set for workloads, individual queries, or even query fragments. In this talk, we first show the potential of optimizer hinting by presenting the results of a comprehensive and in-depth evaluation using three benchmarks and two different versions of the open-source database system PostgreSQL. Subsequently, we highlight that query optimizer hinting is a nontrivial challenge and present two potential solutions. On the one hand, we propose TONIC, a novel cardinality-estimation-free extension for generic SPJ query optimizers. TONIC follows a learning-based approach and revises operator decisions for arbitrary join paths based on learned query feedback. To continuously capture and reuse optimal operator selections, we introduce a lightweight yet powerful Query Execution Plan Synopsis (QEP-S). On the other hand, we provide insights into FASTgres, a context-aware classification strategy for a more holistic hint-set prediction. In end-to-end evaluations, both strategies yield significant reductions in benchmark runtimes.
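To give a flavour of what such hints look like in practice, the sketch below issues a hinted query against PostgreSQL from Python via the pg_hint_plan extension; the table names and the particular hints are purely illustrative and are not part of TONIC or FASTgres, which select such decisions automatically.

    # Minimal sketch: steering PostgreSQL's optimizer with pg_hint_plan.
    # Table names and hints are illustrative only (assumes pg_hint_plan is installed).
    import psycopg2

    conn = psycopg2.connect("dbname=demo")
    cur = conn.cursor()
    cur.execute("LOAD 'pg_hint_plan';")  # enable hint parsing for this session

    # The leading comment block steers the optimizer: hash-join the two tables
    # and start the join order with customers.
    hinted_query = """
    /*+ HashJoin(c o) Leading((c o)) */
    SELECT c.name, COUNT(*)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name;
    """
    cur.execute("EXPLAIN " + hinted_query)  # inspect the plan the hints produce
    for (line,) in cur.fetchall():
        print(line)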
Mosaics in Big Data
given by Dr. Volker Markl
Data management systems research focuses on improving human and technical efficiency for performing data analysis tasks. In this presentation, I describe selected research contributions towards that goal. I first highlight work on using query feedback to improve the cardinality model of a relational query optimizer. I will then discuss how the research vision of the Stratosphere research project at TU Berlin led to the creation of the data stream processing system Apache Flink. As a third contribution, I will discuss how fractal space-filling curves can be used to efficiently process multidimensional range queries. I will conclude by giving an outlook on NebulaStream, a novel data processing system that handles massively distributed data streams on heterogeneous devices.
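For the space-filling-curve part, the following small, generic sketch illustrates the underlying idea with a plain Z-order (Morton) encoding; it is not the specific curve or implementation discussed in the talk.

    # Generic illustration: map 2-D points to a Z-order (Morton) key by bit
    # interleaving, so that points that are close in space tend to be close on
    # the resulting 1-D curve and a box query becomes a filtered key-range scan.
    def morton_key_2d(x: int, y: int, bits: int = 16) -> int:
        key = 0
        for i in range(bits):
            key |= ((x >> i) & 1) << (2 * i)      # x bits at even positions
            key |= ((y >> i) & 1) << (2 * i + 1)  # y bits at odd positions
        return key

    points = [(3, 7), (8, 2), (9, 9), (4, 5)]
    indexed = sorted((morton_key_2d(x, y), (x, y)) for x, y in points)

    # Range query [2..9] x [4..9]: scan the (over-approximated) key interval
    # between the keys of the box corners, then filter exactly.
    lo, hi = morton_key_2d(2, 4), morton_key_2d(9, 9)
    hits = [(x, y) for k, (x, y) in indexed
            if lo <= k <= hi and 2 <= x <= 9 and 4 <= y <= 9]
    print(hits)  # [(3, 7), (4, 5), (9, 9)]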
Courses
Graph Databases: Foundations and Data Science Applications
given by Stefania Dumbrava
Network Data Science
given by Panayiotis Tsaparas
Sustainable Machine Learning
given by Bettina Kemme
Machine learning has become increasingly data- and processing-hungry. A recent report from the International Energy Agency projects that the electricity demand of data centers specialized in AI will more than quadruple by 2030. It has therefore become a pressing need to include energy awareness and environmental sustainability in the machine learning life cycle, and a considerable amount of research has indeed been devoted to this direction in recent years.
The first part of this tutorial will discuss various mechanisms to assess the environmental impact of machine learning, from power and energy consumption to carbon footprint. This will be put in relation to more traditional performance metrics used in the research literature, from the “goodness” of an ML solution, measured by metrics such as accuracy, to systems performance metrics such as runtime, throughput, and scalability. From there, the tutorial will present several concrete research efforts for a quantitative analysis of the environmental footprint of various ML tasks.
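As a minimal, hedged illustration of such measurements, the sketch below wraps a small training run with the open-source codecarbon tracker (one of several tools of this kind) and reports the estimated footprint next to a traditional quality metric; the model and data are placeholders.

    # Minimal sketch: estimate the energy/carbon footprint of a training run
    # with the open-source codecarbon library; the tiny scikit-learn model and
    # synthetic data stand in for a real ML workload.
    from codecarbon import EmissionsTracker
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

    tracker = EmissionsTracker()   # samples CPU/GPU/RAM power while running
    tracker.start()
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    emissions_kg = tracker.stop()  # estimated kg CO2-equivalent for this run

    print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
    print(f"Training accuracy:   {model.score(X, y):.3f}")  # traditional metric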
The second part of the tutorial will outline recent solutions to tackle the huge energy consumption of modern ML. For instance, a growing number of research efforts aim to make both the learning and the inference tasks more efficient while providing similar performance in terms of traditional ML metrics such as accuracy. A further line of research focuses on adjusting the infrastructure or the execution of ML tasks to be more energy-aware, e.g., through scheduling approaches.
From Data Quality to Data Reduction for a Sustainable Future
given by Yannis Velegrakis
Due to the rate at which we are currently collecting and storing data, organisations often run out of space and need to discard part of the data they hold in order to accommodate new data. The main and most challenging task in doing so is to decide the value of the different parts of the data, so that the parts with the lowest value can be discarded. In this lecture we will discuss the different methods that can be used for this purpose and investigate what to use and when.
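To make value-based data reduction concrete, here is a small, purely illustrative sketch in which each data partition receives a value score (the scoring function and its weights are hypothetical, not the methods covered in the lecture) and the lowest-valued partitions are dropped until a storage budget is met.

    # Purely illustrative: score partitions by how often and how recently they
    # are used, keep the most valuable ones within the budget, discard the rest.
    from dataclasses import dataclass

    @dataclass
    class Partition:
        name: str
        size_gb: float
        accesses_last_month: int
        days_since_last_access: int

    def value(p: Partition) -> float:
        # Hypothetical scoring: frequently and recently used data is more valuable.
        return p.accesses_last_month / (1 + p.days_since_last_access)

    def reduce_to_budget(partitions, budget_gb):
        kept, total = [], 0.0
        for p in sorted(partitions, key=value, reverse=True):  # most valuable first
            if total + p.size_gb <= budget_gb:
                kept.append(p)
                total += p.size_gb
        return kept  # everything not returned would be discarded or archived

    parts = [Partition("clickstream-2021", 400, 2, 300),
             Partition("orders-2024", 120, 900, 1),
             Partition("sensor-raw", 700, 50, 14)]
    print([p.name for p in reduce_to_budget(parts, budget_gb=900)])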
Lab coordinators: Yannis Velegrakis, Ramon Rico
NebulaStream – Data Stream Processing for the Edge-Cloud-Continuum
given by Dr. Volker Markl
Modern data-driven applications arising in domains such as smart manufacturing, healthcare, and the Internet of Things pose new challenges to data processing systems. Traditional stream processing systems, such as Flink, Spark, and Kafka Streams, are ill-suited to cope with the massive scale of distribution, the heterogeneous computing landscape, and requirements such as timely processing and actuation. Classical approaches, such as managed runtimes, interpretation-based query processing, and the optimization of single queries in a way that neglects their interactions, greatly limit throughput, latency, energy efficiency, and the general usability of these systems for emerging applications involving distributed data processing at scale in a sensor-edge-cloud environment.
To overcome these limitations, we are researching and building NebulaStream, a novel open-source data stream processing system for massively distributed, heterogeneous environments. NebulaStream supports (potentially resource-constrained) heterogeneous devices, a hierarchical topology (with the distribution of computation and data flow in a cloud-edge continuum), and the sharing of computations and data across multiple concurrent queries. This presentation discusses the design goals and core concepts of NebulaStream and looks back at inspirations drawn from our prior work on Stratosphere and Apache Flink, among others.
Lab: A Hands-On Tutorial on NebulaStream
given by Dr. Volker Markl, Aljoscha Lepping
NebulaStream is a novel, open-source data stream processing system for distributed, heterogeneous data streams in the cloud-edge continuum. It adheres to the design goals of ease of use, extensibility, and efficiency to provide a framework for users and developers to implement diverse Internet of Things (IoT) use cases. Equipped with essential built-in functionality, NebulaStream allows users to customize the system easily while ensuring efficient execution even on low-end devices. In this tutorial, we showcase NebulaStream’s extensibility capabilities with a specific focus on integrating and processing multi-modal data. Participants will learn how to extend NebulaStream by implementing functions, sources (data ingestion), sinks (data export), and data types operating on multi-modal data. After the tutorial, participants should be able to extend NebulaStream on their own, e.g., creating a new function, without needing to modify or even understand the rest of the codebase.
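To give a flavour of what such an extension involves, the schematic sketch below shows a custom source, a user-defined function, and a sink as they appear in many stream-processing frameworks; all class and method names here are hypothetical and deliberately not NebulaStream’s actual API, which the lab walks through.

    # Schematic, framework-agnostic sketch of custom source/function/sink
    # extension points. Names are hypothetical, NOT NebulaStream's real API.
    import json, time
    from typing import Dict, Iterator

    class TemperatureSource:
        """Hypothetical source: ingests readings from a sensor feed."""
        def records(self) -> Iterator[Dict]:
            for i in range(5):  # stand-in for an unbounded sensor stream
                yield {"sensor": "s1", "ts": time.time(), "temp_c": 20.0 + i}

    class JsonFileSink:
        """Hypothetical sink: exports each result record as one JSON line."""
        def __init__(self, path: str):
            self.path = path
        def write(self, record: Dict) -> None:
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")

    def too_hot(record: Dict) -> bool:  # user-defined function, e.g. a filter
        return record["temp_c"] > 22.0

    sink = JsonFileSink("alerts.jsonl")
    for rec in TemperatureSource().records():
        if too_hot(rec):
            sink.write(rec)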