Keynotes & Courses
Keynotes
Data Science in the Era of Heterogeneity
given by Gustavo Alonso
Computing platforms are evolving rapidly along many dimensions: processors, specialization, disaggregation, acceleration, smart memory and storage, etc. Many of these developments are driven by data science, but they also arise from the need to make cloud computing more efficient. From a practical perspective, the result we see today is a deluge of possible configurations and deployment options, most of them too new for their performance implications to be well understood, and lacking proper support in the form of tools and platforms that can manage the underlying diversity. The growing heterogeneity opens up many opportunities but also raises significant challenges. In this talk I will describe the trend towards specialization at all layers of the architecture and the possibilities it opens up, and demonstrate with real examples how to take advantage of heterogeneous computing platforms. I will also discuss opportunities for systems research in the context of data science, on the software as well as the hardware side.
Re-configuring data practices for Intelligent, Reliable and Responsible decision-making systems
given by Timos Sellis
In this talk we will focus on how data management practices need to be re-configured in order to support Intelligent, Reliable and Responsible decision-making systems. The appetite for effective use of information assets has been steadily rising in both public and private sector organisations. However, whether the information is used for social good or commercial gain, there is a growing recognition of the complex socio-technical challenges associated with balancing the diverse demands of regulatory compliance and data privacy, social expectations and ethical use, business process agility and value creation, and scarcity of data science talent. We highlight these interconnected challenges and introduce Information Resilience as a scaffold within which the competing requirements of responsible and agile approaches to information use can be positioned. The aim is to develop and present a manifesto for Information Resilience that can serve as a reference for future research and development in relevant areas of Responsible Data Management.
The Data Systems Grammar: Self-designing Systems for the era of AI
given by Stratos Idreos
Data systems are everywhere. A data system is a collection of data structures and algorithms working together to achieve complex data processing tasks. For example, with data systems that utilize the right data structure design for the problem at hand, we can reduce the monthly bill of large-scale data applications on the cloud by hundreds of thousands of dollars. We can accelerate data science tasks by dramatically speeding up the computation of statistics over large amounts of data. We can train drastically more neural networks within a given time budget, improving accuracy. However, knowing the right data system design for any given scenario is a notoriously hard problem: there is a massive space of possible designs, and no single design is perfect across all data, queries, and hardware contexts. In addition, building a new system may take several years for any given (fixed) design. As a result, modern data-driven applications incur massive cloud costs and development delays.
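To make the design-space idea concrete, here is a minimal Python sketch of choosing a data structure from a tiny candidate set with a toy cost model; the candidate designs, cost formulas, and workload parameters are invented for illustration and are not the grammar presented in the talk.

```python
# Toy illustration of design-space search: given a workload mix, pick the
# candidate structure with the lowest modeled cost. The designs and cost
# formulas below are invented for illustration only.
import math

def lookup_cost(design, n):
    # Hypothetical point-lookup cost (in abstract "page accesses") for n keys.
    return {
        "b-tree":     math.log(n, 100),          # one access per 100-fanout level
        "lsm-tree":   3 * math.log(n, 10) / 10,  # several levels, Bloom-filtered
        "hash-table": 1.0,                       # single bucket probe
    }[design]

def insert_cost(design, n):
    # Hypothetical insert cost for the same designs.
    return {
        "b-tree":     math.log(n, 100) + 1,      # traverse, then write a leaf
        "lsm-tree":   0.1,                       # buffered, amortized compaction
        "hash-table": 1.0,
    }[design]

def best_design(n, read_fraction):
    """Return the design minimizing expected cost per operation."""
    designs = ["b-tree", "lsm-tree", "hash-table"]
    cost = lambda d: (read_fraction * lookup_cost(d, n)
                      + (1 - read_fraction) * insert_cost(d, n))
    return min(designs, key=cost)

# Write-heavy workloads favor the LSM-style design; read-heavy ones do not.
print(best_design(n=10**9, read_fraction=0.1))   # lsm-tree
print(best_design(n=10**9, read_fraction=0.95))  # hash-table
```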
We will discuss our quest for the first principles of data system design. We will show that it is possible to reason about this massive design space. This allows us to automatically create self-designing data systems that can take drastically different shapes to optimize for the workload, hardware, and available cloud budget, using a grammar for data systems. These shapes include data structure, algorithm, and overall system designs which are discovered automatically and do not (always) exist in the literature or industry, yet can be more than 10x faster. We will show performance examples of up to 1000x faster NoSQL processing and up to 10x faster neural network training.
Courses
Querying Graph Databases
given by Angela Bonifati
Graph data modeling and querying arise in many practical application domains, such as social, biological, and fraud detection networks, where the primary focus is on concepts and their relationships and on complex graph patterns involving multiple labels and lightweight recursion. In this lecture, I present a concise unified view of the current challenges that arise over the complete life cycle of formulating and processing queries on graph databases. To that purpose, I present all major concepts relevant to this life cycle, formulated in terms of a common and unifying ground: the property graph data model, the predominant data model adopted by modern graph database systems. I also introduce property graph schemas and graph indexing techniques for label-constrained reachability queries on graph databases. Practical work will follow, with a focus on query-driven pangenomic analysis using an open-source graph database.
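As a small illustration of the kind of query the lecture targets, the following Python sketch evaluates a label-constrained reachability query over a toy property graph by breadth-first search restricted to a set of allowed edge labels; the graph, labels, and function names are invented for this example, and a real graph database would answer such queries through a query language and dedicated indexes.

```python
# A minimal property-graph sketch with a label-constrained reachability
# query: "is there a path from src to dst using only edges whose labels
# are in `allowed`?" The tiny graph below is invented for illustration.
from collections import deque

# Property graph: node -> list of (edge_label, neighbor) pairs.
graph = {
    "alice": [("follows", "bob"), ("works_at", "acme")],
    "bob":   [("follows", "carol")],
    "carol": [("works_at", "acme")],
    "acme":  [],
}

def reachable(graph, src, dst, allowed):
    """BFS that only traverses edges whose label is in `allowed`."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for label, nxt in graph.get(node, []):
            if label in allowed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# alice reaches carol through `follows` edges only...
print(reachable(graph, "alice", "carol", {"follows"}))   # True
# ...but not through `works_at` edges alone.
print(reachable(graph, "alice", "carol", {"works_at"}))  # False
```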
The power of graph neural networks
given by Floris Geerts
Graph neural networks (GNNs) have become a prominent technique for graph learning tasks such as vertex and graph classification, link prediction, and graph regression. It was recently shown that classical GNNs have limited expressive power, which resulted in the proposal of a plenitude of new, more expressive graph learning architectures. In this course we will present a systematic investigation into the expressive power of these different architectures, using techniques from areas such as graph algorithms, logic, and query languages. The goal is to introduce various ways of boosting the expressive power of GNNs and to provide techniques for estimating the expressive power of GNNs. The conceptual part of the course is complemented with practical coding sessions showing how theory and practice compare.
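A useful reference point for this expressiveness analysis is the 1-dimensional Weisfeiler-Leman (1-WL) test: classical message-passing GNNs can distinguish two graphs only if 1-WL does. Below is a minimal, illustrative Python implementation of 1-WL color refinement (not course material), together with a classic pair of graphs it fails to separate: two disjoint triangles versus a single six-cycle.

```python
# 1-WL color refinement: repeatedly replace each vertex's color with a new
# color derived from its old color and the multiset of its neighbors' colors.
# Graphs are plain adjacency lists; the initial coloring is uniform.

def wl_colors(adj, rounds=3):
    """Return the sorted multiset of vertex colors after `rounds` refinements."""
    colors = {v: 0 for v in adj}  # uniform initial coloring
    for _ in range(rounds):
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
        # Re-index signatures to small integers (the new colors).
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        colors = {v: palette[signatures[v]] for v in adj}
    return sorted(colors.values())

two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
six_cycle     = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

# Same color histogram: 1-WL (and hence a classical message-passing GNN)
# cannot tell these apart, even though one is connected and one is not.
print(wl_colors(two_triangles) == wl_colors(six_cycle))  # True

# Non-regular graphs are separated immediately, since degrees already differ.
path     = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(wl_colors(path) == wl_colors(triangle))  # False
```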
Spatial and multi-dimensional indexing and data analytics
given by Nikos Mamoulis
Smart telecommunication and IoT devices have become a commodity and have made huge volumes of spatial and multi-dimensional data available, rendering their search and analysis affordable for small companies and even for personal use. In this course, we will study the most fundamental spatial and multi-dimensional access methods and the most popular search and analysis tasks that they support; a toy indexing sketch follows the outline below.
Outline:
Part 1. Fundamental spatial access methods and search operations for spatial and low-dimensional data
Spatial data types, relationships, and queries. Spatial data analytics. Multi-dimensional access methods for points. Spatial access methods for non-point data. Evaluation of spatial queries and data analytics tasks.
Part 2. Access methods and similarity search for multi-dimensional data and metric spaces
Distance and similarity. The curse of dimensionality. Similarity search in multi-dimensional metric spaces. Multi-dimensional data analytics tasks.
Part 3. Scalable spatial access methods
Scalable in-memory spatial indexing. Parallel and distributed spatial data management. Big spatial data analytics.
Part 4. Learned and adaptive spatial and multi-dimensional indexing
Learned indexes for multi-dimensional data. Adaptive indexes for spatial and multi-dimensional data.
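As a warm-up for Part 1, here is a toy Python sketch of one of the simplest multi-dimensional access methods, a uniform grid index over 2D points with a rectangular range query; the cell size and data are invented for illustration, and production systems would use more sophisticated structures such as R-trees.

```python
# A toy uniform-grid index for 2D points supporting a rectangular range query.
from collections import defaultdict

class GridIndex:
    def __init__(self, cell=10.0):
        self.cell = cell
        self.cells = defaultdict(list)  # (cx, cy) -> points in that cell

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y):
        self.cells[self._key(x, y)].append((x, y))

    def range_query(self, x1, y1, x2, y2):
        """Report all points inside the axis-aligned rectangle [x1,x2] x [y1,y2]."""
        out = []
        cx1, cy1 = self._key(x1, y1)
        cx2, cy2 = self._key(x2, y2)
        for cx in range(cx1, cx2 + 1):          # visit only overlapping cells
            for cy in range(cy1, cy2 + 1):
                for (x, y) in self.cells.get((cx, cy), []):
                    if x1 <= x <= x2 and y1 <= y <= y2:
                        out.append((x, y))
        return out

idx = GridIndex(cell=10.0)
for p in [(3, 4), (12, 7), (25, 30), (14, 16)]:
    idx.insert(*p)
print(idx.range_query(0, 0, 15, 20))  # [(3, 4), (12, 7), (14, 16)]
```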
Computational Methods to Counter Online Misinformation
given by Paolo Papotti
Misinformation is an important problem, but those fighting it are overwhelmed by the amount of false content produced online every day. To assist human experts in their efforts, several projects are proposing computational methods that aim at supporting the detection of malicious content online. In the first part of the lecture, we will overview the different approaches, spanning from solutions involving humans and a crowd of users to fully automated approaches. In the second part, we will focus our attention on data-driven verification for computational fact checking. We will review methods that combine solutions from the ML and NLP literature to build data-driven verification, such as those that translate text claims into SQL queries on relational databases. We will also cover how the rich semantics in knowledge graphs and pre-trained language models can be used to verify claims and produce explanations, a key requirement in this space. Better access to data and new algorithms are pushing computational fact checking forward, with experimental results showing that verification methods enable effective labeling of claims, both in simulations and in real-world efforts. However, while fact checkers are starting to adopt some of the resulting tools, the misinformation fight is far from won. In the last part of the lecture, we will cover the opportunities and limitations of computational methods and their role in fighting misinformation.
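To make the text-to-SQL idea above concrete, here is a minimal Python sketch in which a (hand-translated) claim is verified against a small relational table with sqlite3; the table, data, and claim are invented, and real systems learn the translation from text to queries rather than hard-coding it.

```python
# A minimal sketch of data-driven claim verification: a natural-language
# claim is mapped (manually, here) to a SQL query whose result confirms
# or refutes it against a relational table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_pop (city TEXT, population INTEGER)")
conn.executemany("INSERT INTO city_pop VALUES (?, ?)",
                 [("Paris", 2_100_000), ("Lyon", 520_000), ("Nice", 340_000)])

claim = "Lyon is the second most populous city in the table"

# The claim translates to: what rank does Lyon hold when ordered by population?
rank = conn.execute("""
    SELECT COUNT(*) + 1
    FROM city_pop
    WHERE population > (SELECT population FROM city_pop WHERE city = 'Lyon')
""").fetchone()[0]

verdict = "SUPPORTED" if rank == 2 else "REFUTED"
print(f"{claim!r}: {verdict}")  # SUPPORTED
```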
Data Processing on Modern Hardware
given by Jens Teubner
Even today, most database engines essentially build on a system model that reflects the state of hardware several decades ago. Modern systems feature large main memories, high degrees of hardware parallelism, and modern instruction sets; networks and storage media have become fast, and processing units heterogeneous.
I will start with a recap of classical implementation techniques. We will then look at different aspects of hardware evolution, and I will discuss how to implement data processing operations in hardware-conscious ways. This includes, for instance, cache-aware data layouts and algorithms, scalable multi-core parallelism, and vectorization. If time permits, I will also give an introduction to GPU- and/or FPGA-accelerated data processing.
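As a small taste of hardware-conscious design, the following Python sketch contrasts tuple-at-a-time processing over a row layout with a vectorized scan over a columnar layout, using NumPy as a stand-in for SIMD execution; the data is synthetic, and this is an illustration of the principle rather than material from the course.

```python
# Row layout vs. columnar layout for a simple selection (price < threshold).
import numpy as np

N = 1_000_000
rng = np.random.default_rng(42)
prices = rng.integers(0, 100, N)

# Row store: one Python tuple per record (id, price).
rows = list(zip(range(N), prices.tolist()))

# Column store: one contiguous array per attribute.
ids = np.arange(N)

def select_rows(rows, threshold):
    # Tuple-at-a-time: poor locality, one interpreted step per record.
    return [rid for rid, price in rows if price < threshold]

def select_columns(ids, prices, threshold):
    # Vectorized: a tight loop over a contiguous array, SIMD-friendly.
    return ids[prices < threshold]

# Both evaluate the same predicate; the columnar version scans contiguous
# memory and applies the comparison in bulk.
assert len(select_rows(rows, 10)) == len(select_columns(ids, prices, 10))
```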