---
robots: noindex, nofollow, noarchive
---

# Teaching and Research Talks on "Big Data Engineering"

Each **20-minute teaching talk** on "Scalable Data Science: An Example of Big Data Systems for Machine Learning" is followed by a **30-minute research talk** on one of the following topics:

## Overview

**Times adjusted due to an erroneous invitation.**

[toc]

## Provenance for Big Data Engineering

Prof. Dr. Melanie Herschel, Universität Stuttgart

### Thursday, 25 February 2021, 8:00–9:00 sharp (via video conference)

With the increasing volume of data, novel systems to process and analyze these data have emerged. These include, for instance, visual analytics tools such as Tableau and Big Data analytics systems such as Apache Spark or Apache Flink. While such systems unquestionably make it possible to conduct interesting and meaningful data science, they typically do not keep track of how data were processed to produce results. However, such information, commonly referred to as provenance, is relevant to a wide variety of applications, including, for example, the debugging of data processing pipelines, the reproducibility of analytical results, and sense-making. This talk will present contributions in modelling, collecting, and leveraging provenance. After a brief introduction to provenance, I will focus on two specific data science settings for which we devised data engineering techniques to leverage provenance. The first setting introduces provenance to visual data exploration; the collected provenance is used to recommend interesting data, and visualizations thereof, for users to explore. Second, I describe our extension of Apache Spark that collects provenance in a scalable way and then uses it to debug analytical data processing pipelines.

## Data Engineering for Evolving NoSQL Datasets

Prof. Dr. Meike Klettke, Universität Rostock

### Thursday, 25 February 2021, 10:15–11:15 sharp (via video conference)

Dataset structures are often subject to change over time. If such datasets are to be used for analyses, one of the data engineering tasks is to identify the different schema versions and to migrate the datasets into the latest structural version. For NoSQL databases without an explicit schema, methods are shown that extract either a schema overview or the different schema versions together with the associated evolution operations. In the latter case, we can use these evolution operations for an automatic migration of all datasets into the latest structural version. Several efficient methods for performing such a data migration will be presented: an eager strategy that updates the data immediately after a new schema version is introduced, a lazy strategy that migrates data only on demand, and different hybrid data migration strategies. In the talk, we will see that a combination of all these subtasks enables a fully automatic, on-demand update of legacy NoSQL datasets.

## Building Data Equity Systems

Prof. Dr. Julia Stoyanovich, New York University, USA

### Thursday, 25 February 2021, 15:15–16:15 sharp (via video conference)

Equity as a social concept (treating people differently depending on their endowments and needs to provide equality of outcome rather than equality of treatment) provides a unifying framework for ongoing work to operationalize ethical considerations across technology, law, and society.
In my talk, I will present a vision for Data Equity Systems (DES): data-intensive systems that consider equity, trust, and oversight as essential facets of data-intensive cyberinfrastructure. I will discuss recent technical progress towards building such systems, including methods that support introspection and intervention on equity during system design, and methods that enable public disclosure post-deployment. I will conclude by broadening the scope of DES to include educational activities, policy engagement, and outreach.

## All you need is knowledge … and graph data management

Prof. Dr. Katja Hose, Aalborg University, Denmark

### Friday, 26 February 2021, 9:00–10:00 sharp (via video conference)

Graphs are a popular way of representing knowledge in a machine-interpretable format, potentially distributed over the whole Web in the form of Linked Data. They are applied in diverse use cases ranging from natural language processing, fact checking, and recommender systems to semantic data lakes and semantic data warehousing over statistical knowledge graphs. Whereas the versatility of graph structures is an advantage when creating and publishing data, it naturally leads to challenges for scalable graph data management and processing along the value chain, including classic ETL aspects as well as provenance, indexing, search, querying, and scalable analytics. In this talk, I will provide an overview of this vital field of research and highlight challenges and solutions at the crossroads of scalable knowledge engineering and graph data management.

## Accelerating the Data Science Process through Accessible Data Preparation Techniques

Prof. Dr. Ziawasch Abedjan, Leibniz-Universität Hannover

### Friday, 26 February 2021, 11:15–12:15 sharp (via video conference)

Data preparation is the most time-consuming and tedious part of data-driven projects and data science pipelines. Typically, it entails data profiling, extraction, matching, and cleaning. Because of the sheer diversity of datasets and their applications, users end up hard-coding dataset-specific preparation operations. This approach is inflexible and requires tedious re-organization until a preparation sequence fits the desired data science task. Recent example-driven paradigms enable systems to adapt to particular datasets and applications with only a handful of data annotations. Furthermore, the large body of existing pipelines motivates reuse and meta-learning to ease the process of data science pipeline creation. In this talk, I present my work on example-driven cleaning, one of the most relevant tasks in data preparation, and discuss how the proposed techniques can be extended to other steps of the pipeline. Our work on holistic cleaning systems that internally ensemble and configure individual cleaning techniques significantly reduces the amount of required user interaction by leveraging label propagation techniques and meta-learning. I will conclude my talk by shedding light on my vision for application-specific pipeline generation.

## Towards Instance-Optimized Data Systems

Prof. Dr. Tim Kraska, MIT, USA

### Friday, 26 February 2021, 13:45–14:45 sharp (via video conference)

Recently, there has been a lot of excitement around ML-enhanced (or learned) algorithms and data structures. For example, there has been work on applying machine learning to improve query optimization, indexing, storage layouts, scheduling, log-structured merge trees, sorting, compression, and sketches, among many other things.
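To give a flavor of the idea, the following is a minimal learned-index sketch (a toy illustration of the general concept, not the actual index structures developed in this line of work): a linear model approximates the cumulative distribution of the keys to predict a key's position in a sorted array, and the model's worst-case prediction error bounds a corrective local search.

```python
import numpy as np

class ToyLearnedIndex:
    """Toy learned index: a linear model approximates the cumulative key
    distribution, mapping a key to its approximate position in the sorted
    array; the worst-case model error bounds a corrective local search."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys))
        positions = np.arange(len(self.keys))
        # Least-squares fit of: position ~ slope * key + intercept.
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        predicted = self.slope * self.keys + self.intercept
        # The maximum prediction error determines the search window size.
        self.max_err = int(np.ceil(np.abs(predicted - positions).max()))

    def lookup(self, key):
        guess = int(self.slope * key + self.intercept)
        lo = max(0, guess - self.max_err)
        hi = min(len(self.keys), guess + self.max_err + 1)
        # Binary search restricted to the model's error window.
        pos = lo + int(np.searchsorted(self.keys[lo:hi], key))
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None

keys = np.unique(np.random.randint(0, 10**9, size=100_000))
index = ToyLearnedIndex(keys)
assert index.lookup(int(keys[4242])) == 4242
```

Lookups stay correct because the search window is bounded by the observed worst-case error; the better the model captures the data distribution, the smaller the window and the faster the lookup.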
Arguably, the motivation behind these techniques is similar: machine learning is used to model the data and/or the workload in order to derive a more efficient algorithm or data structure. Ultimately, what these techniques will allow us to build are "instance-optimized" systems: systems that self-adjust to a given workload and data distribution to provide unprecedented performance and avoid the need for tuning by an administrator. In this talk, I will provide an overview of the opportunities and limitations of the learned index structures, storage layouts, and query optimization techniques we have been developing in my group, and of how we are integrating these techniques to build a first instance-optimized database system.

## Towards Democratizing Data Science

Prof. Dr. Carsten Binnig, TU Darmstadt

### Wednesday, 3 March 2021, 12:00–13:00 sharp (via video conference)

Technology has been the key enabler of the current Big Data movement. Without open-source tools like TensorFlow and Spark, as well as the advent of cheap, abundant computing and storage in the cloud, the trend toward the datafication of almost every field in research and industry could never have happened. However, the current Big Data tool set is ill-suited for efficient knowledge discovery by domain experts with only limited IT skills and thus represents a major bottleneck in our data-driven society. In this talk, I will present an overview of my current research efforts to revisit the Big Data stack, from the user interface down to the underlying hardware, in order to make Big Data tools more efficient and easier to use for domain experts and thus enable the democratization of data science.

## Compressed Linear Algebra for Large-Scale Machine Learning

Prof. Dr. Matthias Boehm, TU Graz, Austria

### Wednesday, 3 March 2021, 14:15–15:15 sharp (via video conference)

Declarative, large-scale machine learning (ML) aims to simplify the development and usage of large-scale ML algorithms. In Apache SystemDS, data scientists specify end-to-end ML pipelines and new ML algorithms in a high-level language with R-like syntax, and the system automatically generates hybrid runtime execution plans composed of local, in-memory operations and distributed operations on Spark. This talk gives a brief overview of SystemDS and then discusses Compressed Linear Algebra (CLA) as a selected runtime technique with high performance impact. Large-scale ML algorithms are often iterative, using repeated read-only data access and I/O-bound matrix-vector multiplications to converge to an optimal model. For this reason, it is crucial for performance to fit the data into memory and to enable fast matrix-vector operations. General-purpose heavyweight and lightweight compression techniques struggle to achieve both good compression ratios and decompression speeds fast enough to enable block-wise uncompressed operations. Therefore, we initiated work, inspired by database compression and sparse matrix formats, on value-based compressed linear algebra, in which heterogeneous, lightweight database compression techniques are applied to matrices, and linear algebra operations such as matrix-vector multiplication are then executed directly on the compressed representation. Our experiments show that CLA achieves in-memory operation performance close to the uncompressed case along with good compression ratios, which makes it possible to fit substantially larger datasets into the available memory.
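The core trick can be illustrated with a minimal sketch (a toy illustration in Python/NumPy under simplified assumptions, not SystemDS code): each column is dictionary-encoded as its distinct values plus the row positions at which each value occurs, and a matrix-vector multiplication is executed directly on this representation, with one scalar multiplication per distinct value per column instead of one per cell.

```python
import numpy as np

def compress_columns(X):
    """Dictionary-encode each column as (distinct values, row-position lists),
    a toy stand-in for CLA's lightweight column encodings."""
    cols = []
    for j in range(X.shape[1]):
        values, codes = np.unique(X[:, j], return_inverse=True)
        positions = [np.flatnonzero(codes == k) for k in range(len(values))]
        cols.append((values, positions))
    return cols

def compressed_matvec(cols, v, n_rows):
    """Compute X @ v directly on the compressed columns: one scalar
    multiplication per (column, distinct value) pair instead of per cell."""
    result = np.zeros(n_rows)
    for (values, positions), vj in zip(cols, v):
        for val, pos in zip(values, positions):
            result[pos] += val * vj  # scatter-add into the rows holding `val`
    return result

rng = np.random.default_rng(0)
X = rng.choice([0.0, 1.0, 2.5], size=(1000, 8))  # few distinct values per column
v = rng.standard_normal(8)
assert np.allclose(compressed_matvec(compress_columns(X), v, X.shape[0]), X @ v)
```

The fewer distinct values a column contains, the greater both the space savings and the arithmetic savings; the actual CLA encodings refine this basic scheme, for example with offset-list and run-length encodings and co-coded column groups.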
Finally, we also outline recent work on workload-aware compression and pushing compression into pre-processing steps such as feature transformations.
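As a hypothetical illustration of that last point (my sketch, not the actual SystemDS implementation): a one-hot feature transformation naturally produces dictionary-compressed output, because each category corresponds to one 0/1 column whose non-zero rows are exactly the rows holding that category, so the compressed form can be built directly without ever materializing the dense one-hot matrix.

```python
import numpy as np

def onehot_compressed(column):
    """One-hot encode a categorical column directly into compressed form
    (distinct values plus row-position lists), skipping the dense 0/1 matrix."""
    values, codes = np.unique(column, return_inverse=True)
    positions = [np.flatnonzero(codes == k) for k in range(len(values))]
    # Each distinct category is one one-hot output column whose only
    # non-zero value is 1.0, occurring at exactly these row positions.
    return values, positions

col = np.array(["red", "green", "red", "blue", "green", "red"])
values, positions = onehot_compressed(col)
dense = (col[:, None] == values[None, :]).astype(float)  # for comparison only
for k, pos in enumerate(positions):
    assert np.array_equal(np.flatnonzero(dense[:, k]), pos)
```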