Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions held on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Session Overview
2A - Data Management
Monday, 05/July/2021:
6:30pm - 8:00pm

Session Chair: Christopher Maronga
Zoom Host: Linda Jazmín Cabrera Orellana
Replacement Zoom Host: s gwynn sturdevant
Virtual location: The Lounge #talk_data_management

Session Topics:
Databases / Data management

Session Sponsor: cynkra

6:30pm - 6:50pm
ID: 235 / ses-02-A: 1
Regular Talk
Topics: Algorithms
Keywords: applications, case studies

Solving Big Data Problems with Apache Arrow

Neal Richardson

Ursa Computing

As distributed computing platforms and data warehouses become more prevalent, R users face new challenges in working with the data they generate and store. From R, we may need to analyze a dataset that is too big to fit into memory, is split across many files, is hosted on cloud storage, or was produced by systems using other languages. The {arrow} package helps to solve these integration problems, allowing R users to employ familiar, idiomatic R code to work with larger-than-memory datasets on their own workstations. This presentation will discuss several of these common challenges people face when working with bigger data, and using case studies from the community, it will demonstrate how Arrow can help solve these problems.
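A minimal sketch of the workflow described above, using the public {arrow} and {dplyr} APIs (`write_dataset()`, `open_dataset()`, `collect()`); the partitioned copy of `mtcars` is only a stand-in for a real multi-file, larger-than-memory dataset:

```r
library(arrow)
library(dplyr)

# Write a small data frame as a partitioned, multi-file Parquet dataset,
# standing in for data produced and stored by another system.
path <- file.path(tempdir(), "mtcars_ds")
write_dataset(mtcars, path, partitioning = "cyl")

# open_dataset() scans the files lazily; nothing is loaded into memory yet.
ds <- open_dataset(path)

# Familiar dplyr verbs are translated to Arrow compute; collect() pulls
# only the (small) aggregated result into an R data frame.
result <- ds %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp)) %>%
  collect()

print(result)
```

Because evaluation is deferred until `collect()`, the filtering and aggregation run over the files without ever materializing the full dataset in R's memory.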

6:50pm - 7:10pm
ID: 292 / ses-02-A: 2
Regular Talk
Topics: Bioinformatics / Biomedical or health informatics
Keywords: simulation, relational data, data analysis, computing, biostatistics, CDISC, clinical trials

respectables: A framework for creating relational data with applications

Gabriel Becker1, Adrian Waddell2

1Clindata Insights, United States of America; 2F. Hoffmann-La Roche, Switzerland

Synthetic data provides a privacy-safe mechanism for developing, benchmarking, testing, and showcasing analysis plans and data processing pipelines. Existing tools in R focus primarily on creating or manipulating individual rectangular datasets (dplyr) or combining multiple already-existing rectangular datasets (dm). Many crucial types of data, however, involve inter-related rectangular sets of data, with columns in one table acting as keys or lookups within another. An example of this is the CDISC data standard for clinical trial data, which, given an overall set of patients, has some rectangular datasets containing exactly one row per patient and other datasets where some patients have data represented in multiple rows while other patients are entirely absent (e.g., because they did not have any adverse events).

The respectables package defines a recipe-based framework which allows for the specification of data to be synthesized at three distinct levels. First, we provide an intuitive recipe mechanism for specifying the creation - via sampling, synthesis or both - of a rectangular dataset. These recipes support the definition of both conditional and joint behaviors between sets of variables. Second, we define the concept of a scaffolding join recipe which specifies the creation of a rectangular dataset with a particular foreign-key style relationship with another dataset. Finally, we combine these two types of recipes to create recipe books which specify, a priori, the creation and construction of full database-like cohorts of inter-related datasets, either based on existing starting data or from whole cloth.

The package provides respectables recipes for the creation of synthetic clinical trial readout data that adheres to the CDISC standard.
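Since the abstract does not show the respectables API itself, here is a base-R illustration of the kind of relational structure the recipes target: a parent table with exactly one row per patient and a child table keyed on it, where some patients contribute multiple rows and others none, as in CDISC adverse-event data. The column names follow CDISC conventions (`USUBJID`, `AETERM`), but the generation logic is a hypothetical sketch, not respectables code:

```r
set.seed(42)

# Parent table: exactly one row per patient
# (like a CDISC subject-level dataset).
patients <- data.frame(
  USUBJID = sprintf("PAT-%03d", 1:10),
  ARM     = sample(c("Treatment", "Placebo"), 10, replace = TRUE)
)

# Child table with a foreign-key relationship to the parent: each patient
# contributes 0 or more adverse-event rows, so some patients appear
# several times while others are absent entirely.
n_events <- rpois(nrow(patients), lambda = 1.5)
adverse_events <- data.frame(
  USUBJID = rep(patients$USUBJID, times = n_events),
  AETERM  = sample(c("Headache", "Nausea", "Fatigue"),
                   sum(n_events), replace = TRUE)
)

# Referential integrity holds by construction: every event key
# resolves to exactly one patient in the parent table.
stopifnot(all(adverse_events$USUBJID %in% patients$USUBJID))
```

A recipe-based framework abstracts exactly this pattern: the scaffolding join recipe owns the foreign-key relationship, so entire cohorts of inter-related tables can be declared a priori rather than generated ad hoc.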

respectables will be released open source (approval was granted prior to abstract submission) and will be available on GitHub at the time of the presentation.

7:10pm - 7:30pm
ID: 165 / ses-02-A: 3
Regular Talk
Topics: Databases / Data management
Keywords: big data

Going Big and Fast - {kafkaesque} for Kafka Access

Peter Meißner


Kafka is a big data technology that allows for high-throughput, low-latency stream processing, storage, and distribution of data. Kafka is written in Java and has become an infrastructure industry standard for real-time, near-time, microservice, and distributed applications.

This talk introduces {kafkaesque}, a package that allows R users to integrate their code and models with Kafka, e.g., to use it as a distributed message queue or to access and process data fed into Kafka by other systems. Besides presenting core concepts and how to use the package, the talk will also cover the development process and the pros and cons of using Java code in R packages.


7:30pm - 7:50pm
ID: 148 / ses-02-A: 4
Regular Talk
Topics: Databases / Data management
Keywords: API

Scaling R for Enterprise Data

Mark Hornick


While R has made significant gains in performance with each new release, sometimes the data itself is the bottleneck, since it must be moved from one environment to another. With enterprise data stored in Oracle databases, the integration of R with Oracle Database enables using R more efficiently on larger volumes of data. Oracle Machine Learning for R (OML4R) – the R interface to in-database machine learning and R script deployment from Oracle – enables you to work with database tables and views using familiar R syntax and functions. For scalable and performant data exploration, data preparation, and machine learning, R users leverage Oracle Database as a high-performance compute engine and build machine learning models using parallelized in-database algorithms with an R formula-based specification. Further, deployment of user-defined R functions from SQL facilitates application and dashboard development, where R engines are dynamically spawned and controlled by Oracle Database. Users can even run user-defined R functions in a data-parallel and task-parallel manner.