Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Session Overview
6B - R in Production 1
Thursday, 08/July/2021:
5:30am - 7:00am

Session Chair: Emi Tanaka
Zoom Host: Jyoti Bhogal
Replacement Zoom Host: Nick Spyrison
Virtual location: The Lounge #talk_r_production_1

Session Topics:
R in production

Session Sponsored by: cynkra
Session Slides

5:30am - 5:50am
ID: 350 / ses-06-B: 1
Sponsored Talk
Topics: Big / High dimensional data
Keywords: memory, big data

Big Memory for R

Jingchao Sun1, Chris Kang3, Austin Gutierrez2

1MemVerge; 2The Translational Genomics Research Institute (TGen); 3Analytical Biosciences

As we step into the big data era, R and programs written in R face various new challenges. First, the data to be processed is growing exponentially, resulting in large memory usage when running R programs; memory is becoming one of the bottlenecks for large-scale data processing. Second, processing large data dramatically increases both the processing time and the risk of program crashes; scientists and developers can lose hours or days to a crash with no chance to save their data. Third, iterative analysis of large data is a pain point because of the long time it takes to load the data from disk. Fourth, a large amount of legacy R code does not support multi-threading, which was introduced to R only recently. This leads to sequential execution of R code and wastes the CPU's multi-core capability.
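The disk-loading pain point is easy to demonstrate in a few lines of plain base R (this sketch is not MemVerge-specific; the object name, size, and file path are hypothetical):

```r
# Build a moderately large object and serialise it to disk.
big <- data.frame(x = rnorm(1e6), y = rnorm(1e6))
path <- tempfile(fileext = ".rds")
saveRDS(big, path)

# Disk-bound iteration: every run pays the full deserialisation cost.
t_disk <- system.time(for (i in 1:3) {
  d <- readRDS(path)
  m <- mean(d$x)
})

# In-memory iteration: the data stays resident between runs.
t_mem <- system.time(for (i in 1:3) {
  m <- mean(big$x)
})

t_disk["elapsed"]  # typically noticeably larger than t_mem["elapsed"]
```

For truly large objects the gap widens: the reload cost dominates each iteration, which is exactly what instant snapshot/restore aims to remove.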

To tackle these challenges, MemVerge developed Memory Machine software which supports Intel Optane Persistent Memory. With the help of Memory Machine, R programs can use up to 8 TB of memory on a single server with a 30-50% cost saving. R users can take snapshots of their workload at any time within 1 second to get data persistence without writing data to disk, and restore the workload within 1 second without loading data from disk. Moreover, with the help of instant restore, R users can easily try different parameters multiple times for their workloads. Memory Machine can also restore the workload into different namespaces to enable parallel processing and greatly reduce program execution time.

This talk will provide an overview of Big Memory Computing consisting of Intel Optane Persistent Memory and memory virtualization software working together. R users from Analytical Biosciences and The Translational Genomics Research Institute (TGen) will provide overviews of their implementations of Big Memory.

5:50am - 6:10am
ID: 282 / ses-06-B: 2
Regular Talk
Topics: R in production
Keywords: DevOps, Agile, Production, Docker

Bridging the Unproductive Valley: Building Data Products Strictly Without Magic

Maximilian Held, Najko Jahn

State- and University Library Goettingen, Germany

Between GUI-based reports and scripted data science lies an unproductive valley that combines the worst of both worlds: poor scalability *and* high overhead.

To avoid getting stuck there, small and medium-sized teams must 1) build strategic data products (not one-off scripts), 2) adopt software development best practices (not hacks) and 3) concentrate on business value (not infrastructure).

1) Strategic data products focus on the ETL pipelines, common visualisations and other modules that are central to the mission. These Unix-style building blocks can then be recombined into various reports.

2) These modules are designed "as-if-for-CRAN" and written as type/length-stable, unit-tested and exported functions.
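A minimal sketch of what an "as-if-for-CRAN" module function might look like (the function name and example data are hypothetical): type- and length-stable via `vapply()`, documented, and covered by a testthat-style unit test.

```r
#' Share of missing values per column.
#' Type/length-stable: always returns a named double vector,
#' one entry per column of the input data frame.
na_share <- function(df) {
  stopifnot(is.data.frame(df))
  vapply(df, function(col) mean(is.na(col)), FUN.VALUE = numeric(1))
}

# A matching unit test, as it might live in tests/testthat/:
# test_that("na_share is type- and length-stable", {
#   res <- na_share(data.frame(a = c(1, NA), b = c("x", "y")))
#   expect_type(res, "double")
#   expect_length(res, 2)
#   expect_equal(unname(res), c(0.5, 0))
# })
```

`vapply()` (unlike `sapply()`) enforces the declared return type and length, which is what makes downstream reports safe to build on top of such modules.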

3) If something is not related to our mission, we rely on industry standards (Docker) and CaaS/DBaaS (Azure, GCP).

{muggle}'s opinionated DevOps provides some technical scaffolding to help with this transition.

It standardises the compute environment across development, testing and deployment using a multi-stage `Dockerfile` with `ONBUILD` triggers for lightweight target images, and leverages public cloud services (RSPM, GitHub Actions, GitHub Packages).
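Schematically, the pattern looks like this (a simplified sketch, not {muggle}'s actual Dockerfile; image tags and paths are illustrative):

```dockerfile
# Base/builder image: carries the heavy toolchain. The ONBUILD
# instructions run only when a downstream image uses this one in its
# own FROM line, so each data product's Dockerfile stays minimal.
FROM rocker/r-ver:4.1.0 AS builder
ONBUILD COPY DESCRIPTION .
ONBUILD RUN Rscript -e 'remotes::install_deps(".")'

# Lightweight target stage: ships only the installed R library,
# not the build tooling.
FROM rocker/r-ver:4.1.0
COPY --from=builder /usr/local/lib/R/site-library /usr/local/lib/R/site-library
```

The same environment definition then serves development, CI testing, and deployment, which is what keeps the three from drifting apart.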

In contrast to some existing approaches, {muggle} never infers developer intent and has a minimal git footprint.

Success also requires a cultural shift. Development may still be agile, but it must not build prototype code. Fancy plots and reports are good, but reproducibility is more important.

We believe this is a necessary change to ensure value generation and, thereby, the future of democratic, open-source data science.


6:10am - 6:30am
ID: 237 / ses-06-B: 3
Regular Talk
Topics: R in production
Keywords: API

Data science serverless-style with R and OpenFaaS

Peter Solymos

Analythium Solutions Inc.

R is well suited for data science due to its diverse tooling and its ability to leverage and integrate with other languages and solutions. In production, R is often just one piece of a much larger puzzle, providing API endpoints via e.g. plumber, RestRserve, or a similar web framework. Managing many API endpoints can lead to problems as dependency requirements shift or more recent additions break older code. The common solution is to use Docker containers to isolate these components. However, managing containers at scale is not trivial, and managing serverless infrastructure is often outsourced to public cloud providers. Providers differ in their approaches, leading to independent R integrations and repeated effort. The OpenFaaS project was born to mitigate these problems and to avoid vendor lock-in. OpenFaaS is an open-source framework for deploying functions and microservices anywhere (local cluster, public cloud, edge devices) and at any scale (including zero), with an emphasis on Kubernetes. It provides auto-scaling, metrics, and an API gateway, and is language-agnostic. In this talk, I introduce R templates for OpenFaaS. The templates support different Docker R base images (Debian, Ubuntu, Alpine Linux) and six different frameworks, including plumber. I explain the development life cycle with OpenFaaS using an example cloud function for time series forecasting on daily updated epidemiological data. I end with a review of production use cases where R can truly shine in the multilingual serverless landscape.
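The kind of handler such a template wraps can be sketched in a few lines of plumber (a hypothetical endpoint with toy data; the OpenFaaS-specific scaffolding and real epidemiological feed are omitted):

```r
# plumber.R -- a tiny forecasting endpoint, as might be deployed as an
# OpenFaaS function. Uses only base R's stats::arima for illustration.
library(plumber)

#* Forecast the next `h` points of a toy daily case-count series.
#* @param h:int forecast horizon in days
#* @get /forecast
function(h = 7) {
  cases <- ts(c(12, 15, 14, 18, 21, 25, 24, 30, 33, 38, 41, 47))  # toy data
  fit <- arima(cases, order = c(1, 1, 0))
  as.numeric(predict(fit, n.ahead = as.integer(h))$pred)
}

# Run locally with:
# plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
```

Behind the OpenFaaS gateway, this function can then scale with demand (down to zero) like any other deployed function, regardless of the language it is written in.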
