Conference Agenda

Overview and details of the sessions of this conference.

Session Overview
Elevator Pitches 2
Tuesday, 06/July/2021:
1:00pm - 2:30pm

Virtual location: The Lounge #elevator_pitches

ID: 195 / ep-02: 1
Elevator Pitch
Topics: Reproducibility
Keywords: DOI, reproducibility, credibility, Open Science, Zenodo, data science

Make Your Computational Analysis Citable

Batool Almarzouq1,2,3

1University of Liverpool, United Kingdom; 2Open Science Community Saudi Arabia; 3King Abdullah International Medical Research Center (KAIMRC), Saudi Arabia

Although there are abundant resources on licensing and citation for R software packages, far less attention is paid to making non-package (data science) R code citable. Academics and researchers who want to embrace Open Science practices are often unaware of how to make their R code citable before publishing in academic journals, and of what kind of licence they may use to protect the intellectual property of their work. This lightning talk will highlight aspects important to data scientists, including generating persistent DOIs, metadata, tracking of data re-use, licensing, access control and long-term availability. It will start by introducing the `zen4R` package, which will be used to generate a Digital Object Identifier (DOI) for any R code from RStudio. This package provides an interface to the API of the Zenodo e-infrastructure, a general-purpose open-access repository developed under the European OpenAIRE programme and operated by CERN. Then, I'll show how you can add metadata and track your data/code re-use. Finally, to protect a project's intellectual property, several types of licences applicable to non-package (data science) R code will be described and applied using the `usethis` package.
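As a flavour of the workflow the talk describes, a deposit with `zen4R` might look like the following. This is a hedged sketch, not the speaker's code: the token placeholder, metadata values and file name are illustrative, and the method names should be checked against the current `zen4R` documentation.

```r
# Minimal sketch: deposit R code on Zenodo with zen4R to mint a DOI.
library(zen4R)

zenodo <- ZenodoManager$new(token = "<your-zenodo-api-token>")

rec <- ZenodoRecord$new()
rec$setTitle("My computational analysis")
rec$setDescription("R scripts for the analysis described in this talk.")
rec$setUploadType("software")
rec$addCreator(firstname = "Jane", lastname = "Doe")  # hypothetical author

rec <- zenodo$depositRecord(rec)            # creates the draft deposit
zenodo$uploadFile("analysis.R", record = rec)  # attach the code file
# zenodo$publishRecord(rec$id)              # publishing mints the DOI
```

Publishing is left commented out because it is irreversible on Zenodo; a draft deposit can be inspected and edited first.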

By the end of the talk, academics and researchers who use R frequently will have the tools necessary to publish the entire research life cycle of a project while protecting the intellectual property of their work. This will increase efficiency and bring benefits to the broader scientific community by increasing reproducibility and credibility.


ID: 291 / ep-02: 2
Elevator Pitch
Topics: Ecology
Keywords: structural connectivity, landscape ecology, landscape metrics, principal component analysis, wetland forests

Structural connectivity in the Lower Uruguay River Forest

Adriana Rojas1, Mariel Bazzalo2, Natalia Morandeira1,3


In recent decades, a process of agricultural expansion has taken place in the wetlands of the lower Uruguay River, leading to fragmentation of the landscape. We aimed to estimate the structural connectivity of the hydrophilic forest and the open forest in the basins of the main tributaries of the Lower Uruguay River for the years 1985, 2002 and 2017. Our inputs were land-cover classifications previously generated by the authors from Landsat imagery. For each date, the study area (339,000 km2) was subdivided into 49,800 cells of 1 km2. Connectivity was estimated by calculating 14 landscape metrics in each of the 49,800 cells, for each date. The spatial representation of the connectivity indices was processed using the sf, tidyverse and dplyr packages. Subsequently, we performed a PCA to reduce the dimensionality of the connectivity analysis and to propose a simpler connectivity index without redundant variables, using the stats, FactoMineR and factoextra packages. The variables with the highest scores on components 1 and 2 of the PCA (which explain the greatest variability) are represented graphically for one of the cells. Our proposed index is based on four landscape metrics: class area, number of patches, landscape shape, and area-weighted mean patch area. Based on this index, we identified areas with low/high forest connectivity and trends in connectivity changes during the study period.
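The dimensionality-reduction step described above can be sketched with FactoMineR and factoextra, the packages named in the abstract. The per-cell metric values below are simulated and the column names are hypothetical; this only illustrates the shape of the analysis, not the study's data.

```r
# Sketch: PCA on per-cell landscape metrics, as in the abstract.
library(FactoMineR)
library(factoextra)

set.seed(1)
metrics <- data.frame(           # one row per 1 km2 cell (simulated)
  class_area     = runif(100),
  n_patches      = rpois(100, 10),
  shape_index    = rnorm(100, 1.5, 0.2),
  awm_patch_area = runif(100)
)

pca <- PCA(metrics, scale.unit = TRUE, graph = FALSE)
pca$eig            # variance explained by each component
fviz_pca_var(pca)  # variable loadings on components 1 and 2
```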

ID: 222 / ep-02: 3
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: epidemiology

Fitting the beta distribution for the intra-apiary dynamics study of the infestation rate with Varroa destructor in honey bee colonies

Camila Miotti, Ana Molineri, Adriana Pacini, Emanuel Orellano, Marcelo Signorini, Agostina Giacobino

Instituto de Investigación de la Cadena Láctea (IDICAL-CONICET-INTA)

The aim of this study was to estimate the infestation level of honey bee colonies with V. destructor mites as a function of the autumn-winter parasitic dynamics. A total of six apiaries (with five colonies each), distributed within a 30 km radius and with a minimum distance of 2 km between them, were set up. All colonies were set up with sister queens, and the apiaries were balanced according to adult bee population size. The following experimental conditions were established: a) two apiaries in a circular arrangement, each with one colony infested with Varroa mites (donor colony); b) four apiaries in a lineal arrangement, two of them each with a donor colony located at the edge of the line and two each with a donor colony located in the middle of the line. All colonies except the donor colonies were treated against V. destructor during autumn with amitraz (Amivar 500®) to reduce the infestation level of the receiver colonies (four within each apiary) to 0%. Samples to diagnose phoretic Varroa infestation (PV) were taken 45 days after treatment (mid-May) and monthly from June to September. The PV mite infestation (estimated as no. of Varroa / no. of bees) was evaluated as a function of colony disposition (circular / lineal-middle / lineal-edge) and the initial PV mite level of the donor colony. A generalized linear mixed model with a Beta distribution and logit link was fitted using the glmmTMB function (glmmTMB package), including the colony as a random effect. After the descriptive analysis, a cubic model was fitted. The colony disposition effect and the initial PV mite level were statistically significant (P=0.0126 and P=0.0314, respectively). This result suggests that the PV temporal dynamics within each colony differ according to the initial PV of neighboring colonies, and that the infestation probability is higher for the lineal-middle disposition of the colonies.
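The model described above can be sketched with glmmTMB as follows. The data frame and column names are hypothetical placeholders, not the study's data; the response must lie strictly in (0, 1) for the Beta family.

```r
# Sketch: Beta GLMM with logit link for the phoretic Varroa infestation
# rate, with colony as a random effect, as described in the abstract.
library(glmmTMB)

fit <- glmmTMB(
  pv_rate ~ disposition + initial_pv + (1 | colony),  # hypothetical names
  family = beta_family(link = "logit"),
  data   = varroa                                     # hypothetical data
)
summary(fit)
```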

ID: 274 / ep-02: 4
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: NEET; C50; Classification trees; Imbalanced data; SDGs.

C50 Classification of young Moroccan men and women not in employment, education or training (NEET).


High Commission for Planning, Morocco

Within the framework of the 2030 Agenda for Sustainable Development, the proportion of youth not in employment, education or training (NEET) is to be substantially reduced. In this context, and in order to draw a clearer picture for targeting-policy designers, the present study investigates the composition of Moroccan young NEET men and women aged 15 to 29 by elaborating two classification trees, one for NEET men and one for NEET women, using predictors previously shown to be relevant (disability status, marital status, age, level of education, economic activity of the head of household). The study compares classification trees built with different implementations in R and Python (R: C50; Python: scikit-learn, Orange3) and presents the optimal trees that best split the data. It should be noted that youth with NEET status form a population characterized by great gender-related heterogeneity with respect to economic activity status: most NEET women are housewives (76.7%) or unemployed (13%), while NEET men are mostly unemployed with no work experience (51.6%), unemployed having worked previously (25.4%), or economically inactive (23%). Consequently, the imbalanced-classes problem in the target variable first had to be addressed by applying the SMOTE, oversampling and SMOTE-ENN methods.
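The R side of such a pipeline, rebalancing the target with SMOTE and then fitting a C5.0 tree, can be sketched as below. The simulated data set and variable names are illustrative only and do not reflect the study's survey data.

```r
# Sketch: SMOTE rebalancing (smotefamily) followed by a C5.0 tree (C50).
library(C50)
library(smotefamily)

set.seed(42)
df <- data.frame(                      # simulated, imbalanced toy data
  age  = sample(15:29, 500, replace = TRUE),
  educ = sample(0:3, 500, replace = TRUE),
  neet_type = sample(c("unemployed", "inactive"), 500,
                     replace = TRUE, prob = c(0.85, 0.15))
)

# SMOTE expects a numeric feature matrix and a target vector
balanced <- SMOTE(df[, c("age", "educ")], df$neet_type, K = 5)$data
balanced$class <- factor(balanced$class)

tree <- C5.0(x = balanced[, c("age", "educ")], y = balanced$class)
summary(tree)                          # tree structure and error rates
```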



ID: 257 / ep-02: 5
Elevator Pitch
Topics: R in production
Keywords: DataOps, Banks, Regulation, Government, Reproducibility

Decision support using R and DataOps at a European Union bank regulator

Jonas Bergstrom, Nicolas Pochet

Single Resolution Board, Belgium

We describe how the Single Resolution Board (SRB) created an environment for efficient and reproducible decision support using R and DataOps principles.

The SRB is the Resolution Authority for the EU Banking Union. Its mission is to manage failures of large EU banking groups while protecting financial stability and minimizing the impact on public finances. The SRB develops quantitative models to simulate interbank contagion and impacts on the financial system. The SRB uses these models as a basis for decisions, both as part of its day-to-day work and in crisis management situations. In a bank crisis, it is important that the SRB is able to respond immediately to changing data. Furthermore, decisions taken by the SRB regarding a failing bank can be subject to legal proceedings, and it is crucial that the SRB can reproduce and justify its decisions regarding the affected bank(s).

The constraints imposed on SRB make the case for code-driven data analysis using R and DataOps principles to ensure reproducibility, correctness and the ability to quickly deploy new models in production. Working side-by-side, SRB IT operations engineers and Data Scientists have created an R-based infrastructure where models, packages and dashboards are automatically built, tested and deployed in reproducible environments. Finally, models in production deliver automated feedback which is used to improve future models. The end result is improved quality and reduced time to production.

In conclusion, we make the case that using R and DevOps, public authorities can deliver better quality decisions more quickly and with lower cost to taxpayers.

ID: 199 / ep-02: 6
Elevator Pitch
Topics: Bayesian models
Keywords: Model-Based Clustering, Finite Mixture Models, Infinite Mixture Models

fipp: a bridge between domain knowledge and model specification in Dirichlet Process Mixtures and Mixtures of Finite Mixtures

Jan Greve, Bettina Grün, Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter

WU Vienna University of Economics and Business, Austria

Bayesian methods have established a firm foothold in unsupervised learning, particularly in the area of clustering. The probabilistic and generative nature of the Bayesian paradigm offers a rich inference framework for clustering that has been successfully applied in science and industry, for example in natural language processing, computer vision and volatility modeling. The fipp package aims to enhance the use of the most popular and successful Bayesian methodologies in this area: Dirichlet Process Mixtures (DPMs) and their parametric counterpart, Mixtures of Finite Mixtures (MFMs). A major source of uncertainty when applying these models in practice is how to incorporate domain-specific knowledge into the prior distributions and hyperparameters. For example, a practitioner may have a rough idea of the number of clusters to expect, or of the unevenness of the partition structure, which should be translated appropriately into the prior and hyperparameter specification. Bridging this gap between statistical formulation and domain knowledge is what the functionalities implemented in the fipp package do. Specifically, the package allows users to evaluate the prior distribution of the number of clusters and to compute functionals over the prior partitions in a computationally efficient manner. This enables efficient experimentation with various prior and hyperparameter settings. The suggested use of this package is in combination with R packages that fit these models to real data, such as PReMiuM and dirichletprocess.

ID: 266 / ep-02: 7
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data science teaching, learnr, shiny, bookdown, learning management system

An integrated teaching environment for R with {learnitdown}

Philippe Grosjean, Guyliann Engels

Numerical ecology department, Complexys and InforTech Institutes, University of Mons, Belgium

Many R resources exist for teaching R and data science: {bookdown}, {blogdown} or {distill} for textbook material, {learnr} and {gradethis} for tutorials with interactive exercises, {shiny} applications for interactive demonstrations, R/exams for exam generation and administration... However, as far as we know, there is still no integrated system that manages all these tools, together with common ones like Moodle or H5P, in a coherent teaching platform. The {learnitdown} R package brings all these tools together into a small LMS (learning management system) dedicated to teaching with R and R Markdown.

Student authentication from Moodle or WordPress allows individual activity in the H5P, {learnr} or {shiny} exercises to be tracked in a centralized database. A list of exercises is built automatically for each {bookdown} chapter, and an auto-generated progress report helps students manage their exercises more easily. Data gathered from these exercises can be pseudonymized and analyzed. The {learnitdown} system has been used to teach data science to biology students at the University of Mons, Belgium, since 2018, with great satisfaction.

ID: 120 / ep-02: 8
Elevator Pitch
Topics: Data visualisation
Keywords: applications, case studies

Visualization of flexible one-way ANOVA tests with {doexplot}

Mustafa Cavus

Eskisehir Technical University, Department of Statistics

It is not always easy to interpret the output of statistical tests. Visualization methods can make this easier. The {ggbetweenstats} package provides tools for visualizing and reporting the output of ANOVA tests under normality. However, violations of the assumptions are a commonly faced problem in ANOVA. The {doex} package provides several one-way ANOVA tests for heteroscedastic and non-normally distributed data. In this study, the {doexplot} package is introduced to visualize the output of the one-way ANOVA tests provided in the {doex} package. In this way, it becomes easier for researchers to interpret and report the results of flexible ANOVA methods when the assumptions are violated.

ID: 129 / ep-02: 9
Elevator Pitch
Topics: Ecology
Keywords: biology

TrackJR: a new R-package using Julia language for tracking tiny insects

Gerardo de la Vega1,2, Federico Triñanes2, Andres Gonzalez Ritzel2


Here we present the trackJR package, a tool to analyze tiny-insect behaviour in bioassays where the most important variable is the position of the insect (for example, an olfactometer bioassay or another orientation experiment). The package works with tiny objects, understood as an individual representing ~1% of the frame, so it can also be used with species other than insects. It was written in Julia and R as a common tool for biologists, with a user-friendly Shiny widget for a broad audience. The package therefore allows biologists to use a script written in Julia with only basic knowledge of R. The results can also easily be merged with other R objects (e.g., data frames, matrices or lists).

ID: 232 / ep-02: 10
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: javascript

shiny.fluent and shiny.react: Build Beautiful Shiny Apps Using Microsoft's Fluent UI

Marek Rogala


In this talk we will present the functionality and ideas behind a new open source package we have developed called shiny.fluent.

UI plays a huge role in the success of Shiny projects. shiny.fluent enables you to build Shiny apps in a novel way using Microsoft’s Fluent UI as the UI foundation. It gives your app a beautiful, professional look and a rich set of components while retaining the speed of development that Shiny is famous for.

Fluent UI is based on the JavaScript library React, so it's a challenging task to make it work with Shiny. We have put the parts responsible for making this possible in a separate package called shiny.react, which enables you to port other React-based components and UI libraries so that they work in Shiny.

During the talk, we will demonstrate how to use shiny.fluent to build your own Shiny apps, and explain how we solved the main challenges in integrating React and Shiny.


ID: 231 / ep-02: 11
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: API

Conducting Effective User Tests for Shiny Dashboards

Maria Grycuk


User tests are a crucial part of development, yet we frequently skip over them or conduct them too late in the process. Involving users early on allows us to verify if the tool we want to build will be used by them or will be forgotten in the next few months. Another risk that increases significantly when we don’t show the product to end users before going live is that we will build something unintuitive and difficult to use. When you are working with a product for a few months and you know every button and feature by heart, it is hard to take a step back and think about usability. In this talk, I would like to share a few tips on how to perform an excellent user interview, based on my experience working with Fortune 500 clients on Shiny dashboards. I will show why conducting effective user tests is so critical, and explain how to ask the right questions to gain the most from the interview.

ID: 124 / ep-02: 12
Elevator Pitch
Topics: R in production
Keywords: business, industry

NNcompare: An R package supporting the peer programming process in clinical studies

Mette Bendtsen, Steffen Falgreen Larsen, Frederik Vandvig Heinen, Claus Dethlefsen

Novo Nordisk A/S, Alfred Nobels Vej 27, DK-9220 Aalborg Øst, Denmark

Analysing and reporting data from clinical studies requires a high level of quality in the entire process from data collection to the final clinical study report (CSR). At Novo Nordisk, part of the quality assurance is ‘peer programming’ of important data derivations, complex combinations, and statistical analyses included in data sets and TFLs (tables, figures, and listings) for the CSR. In this context, peer programming involves two people solving a specific programming task: the programmer and the reviewer. The programmer creates a program that solves the task, and the reviewer creates a ‘peer program’ that reviews/validates the programmer’s work. To avoid being influenced by the programmer’s code, the reviewer should not read it until after preparing the peer program. NNcompare is an R package that supports this peer programming process at Novo Nordisk. The package builds on the comparedf() function from the ‘arsenal’ package, which provides functionality for comparing two data frames and reporting the results of the comparison. To support the peer programming process, NNcompare adds functionality for exporting comparison reports to various formats using R Markdown, and for creating summary reports across multiple peer programs to give an overview of the status of all peer programs for a given trial. Furthermore, the package includes functionality for comparing PNG files pixel-wise and marking differences in the plot. Future development will include comparisons of other file types and comparisons of multiple data frames with one function call.
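The underlying arsenal comparison that NNcompare builds on can be illustrated in a few lines; the toy data frames below are invented for the example.

```r
# Sketch: compare two data frames with arsenal::comparedf(), the
# function that NNcompare extends for peer-programming reports.
library(arsenal)

a <- data.frame(id = 1:3, value = c(1.0, 2.0, 3.0))
b <- data.frame(id = 1:3, value = c(1.0, 2.5, 3.0))

cmp <- comparedf(a, b, by = "id")
summary(cmp)   # reports the differing value for id 2
```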

ID: 253 / ep-02: 13
Elevator Pitch
Topics: Statistical models

The evolution of the dependencies of CRAN packages

Clement Lee

Lancaster University, United Kingdom

The number of CRAN packages has been growing steadily over the years. In this talk, we examine two aspects of package dependencies. First, we look at a snapshot of the dependency network and apply statistical network models to study its properties, including the degree distribution and the different clusters of packages. Second, we study the evolution of the network over the last year and how the number of reverse dependencies grows for a typical package. This allows us to examine the extent to which the preferential attachment model (the rich-get-richer effect) is valid.
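A snapshot of the kind used here can be obtained from base R alone; the sketch below (which queries CRAN over the network, so it needs an internet connection) builds a crude count of how often each package is listed in Depends/Imports. This illustrates the data source, not the talk's actual analysis.

```r
# Sketch: a dependency snapshot from CRAN metadata via base R.
db <- tools::CRAN_package_db()

# flatten the Depends and Imports fields into one vector of names
deps <- unlist(strsplit(paste(db$Depends, db$Imports, sep = ", "), ",\\s*"))
deps <- sub("\\s*\\(.*\\)$", "", trimws(deps))   # strip version constraints
deps <- deps[!deps %in% c("NA", "R", "")]        # drop non-package entries

head(sort(table(deps), decreasing = TRUE))       # most depended-upon packages
```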


ID: 249 / ep-02: 14
Elevator Pitch
Topics: Algorithms
Keywords: Resampling, Linear mixed-effect models, Bootstrap, Nested data

Bootstrapping Multilevel Models in R using lmeresampler

Adam Loy

Carleton College, United States of America

Linear mixed-effects (LME) models are commonly used to analyze clustered data, such as split-plot experiments, longitudinal studies, and stratified samples. In R, there are two primary packages for fitting LME models: nlme and lme4. In this talk, we present an extension of the nlme and lme4 packages to include methods for bootstrapping model fits. The lmeresampler package implements several bootstrap methods for LME models with nested dependence structures using a unified framework: the cases bootstrap resamples entire clusters or observations within clusters (or both); the parametric bootstrap simulates data from the model fit; the residual bootstrap resamples both the predicted random effects and the predicted error terms; and the random effect block bootstrap utilizes the marginal residuals to calculate nonparametric predicted random effects as part of the resampling process. We will discuss and demonstrate the implementation of these bootstrap procedures, and outline plans for future development.

lmeresampler is available on CRAN.
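A typical call, sketched here against the sleepstudy data that ships with lme4 (the small B is only to keep the example quick; argument details should be checked against the lmeresampler documentation):

```r
# Sketch: parametric bootstrap of an lme4 fit with lmeresampler.
library(lme4)
library(lmeresampler)

fit  <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
boot <- bootstrap(fit, .f = fixef, type = "parametric", B = 200)
confint(boot)   # bootstrap confidence intervals for the fixed effects
```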

ID: 299 / ep-02: 15
Elevator Pitch
Topics: Other
Keywords: IDE

RCode, a new IDE for R

Nicolas Baradel, William Jouot

PGM Solutions, France

RCode is a new and modern IDE for R. It includes the usual features such as code highlighting, an environment pane for R variables, an execution history, etc. It also provides extra features such as an Excel-like data grid in which data.frames are directly editable. RCode is multiplatform and available in several languages.

ID: 295 / ep-02: 16
Elevator Pitch
Topics: R in production
Keywords: pharma, validation, verification, qualification

R Package Validation and {valtools}

Ellis Hughes

Fred Hutch Cancer Research Center, United States of America

The R Package Validation Framework offers a clear, easy to follow guide to automate the creation of validated R packages for use in regulated industries. By combining many of the package development tools and philosophies already in existence in the R ecosystem, the framework minimizes overhead while improving the quality of both the package and validation.

{valtools} is the implementation of this framework as an R package. Much like {usethis}, {valtools} automates the creation of the validation infrastructure and eventual validation report so users can focus on what matters: writing the R package.

By the end of this talk, listeners will know the basics to implement the R Package Validation Framework using the {valtools} package.

ID: 151 / ep-02: 17
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: algorithms

NetCoupler: Inferring causal pathways between high-dimensional metabolomics data and external factors

Luke W. Johnston1, Clemens Wittenbecher2, Fabian Eichelmann3

1Steno Diabetes Center Aarhus; 2Harvard T.H. Chan School of Public Health; 3Department of Molecular Epidemiology, German Institute of Human Nutrition and German Center for Diabetes Research

High-dimensional metabolomics data are highly intercorrelated, implying that associations with lifestyle and other exposures, or with disease outcomes, generally propagate across sets of co-varying metabolites. When inferring biological pathways from metabolomics studies, it is often crucial to detect direct exposure-metabolite or metabolite-outcome relationships instead of associations that can be explained by correlations with other metabolites. To tackle this challenge, we have developed the NetCoupler-algorithm R package. NetCoupler builds on evidence showing that data-driven networks recover biological dependencies from metabolomics data and that, based on causal inference theory, adjustment for at least one subset of direct neighbors is sufficient to block all confounding influences within a conditional dependency network. NetCoupler estimates a conditional dependency network from metabolomics data and then uses a multi-model approach to adjust for all possible subsets of direct neighbors in the network in order to identify exposure-affected metabolites or metabolites that have direct effects on disease outcomes. We demonstrate using simulated data that NetCoupler correctly identifies direct exposure-metabolite and metabolite-outcome effects, and provide an example of its application in a prospective cohort study to integrate information on food consumption habits, metabolomics profiles, and type 2 diabetes incidence. While NetCoupler was developed from a need to process and analyze data from metabolomics studies, it can also be applied to detect direct links between other external variables and network types.


ID: 236 / ep-02: 18
Elevator Pitch
Topics: R in production
Keywords: CI/CD

Continuously expanding Techguides: An open source project based on bookdown using CI/CD pipelines from GitHub Actions

Peter Schmid

Mirai Solutions

A data scientist's work is often about solving unfamiliar problems. Online resources are a blessing in this regard, with the community providing answers to virtually any problem. However, it can be difficult to find a working solution in an ocean of more or less useful suggestions. Therefore, we at Mirai Solutions have started to gather solutions to some of these issues in an open source project: techguides. This initiative is meant to give a bit of our know-how back to the community. It resulted in a public repository that elegantly puts together several R Markdown files and renders them as a bookdown website served on GitHub Pages.

In this talk, I would like to show how we continuously expand our techguides in a flexible way, based on an automated continuous integration and deployment workflow using GitHub Actions. As GitHub Actions is fairly new and not yet trivial to set up, we hope that our explanations can help and inspire others to consider using CI/CD.


ID: 261 / ep-02: 19
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: k-mer, prediction, protein, functional analysis

R as an environment for the functional analysis of proteins

Michał Burdukiewicz

Medical University of Białystok, Poland

The functional analysis of proteins, i.e., the development of models associating a protein sequence with its function, has always been one of the cornerstones of bioinformatics. Like every other application of machine learning, it is prone to issues with reproducibility and benchmarking. Moreover, as the potential users are mostly biologists, these models should be accessible without any coding. Unfortunately, the resources necessary to build and share such models in accordance with CRAN/Bioconductor guidelines and the requirements of reproducible science are still scattered.

During my presentation, I sum up my experience developing several tools for the functional analysis of proteins (AmyloGram, SignalHsmm, AmpGram, and CancerGram). I show the advantages of the R ecosystem during model development (tidysq, mlr3) and deployment (R packages, Shiny web servers, and Electron-based standalone apps). As sharing very large (>10 MB) predictive models on CRAN is not straightforward, I show how to do it in a way that satisfies the submission requirements.

ID: 139 / ep-02: 20
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: Bioimaging, R workflow, high dimensional data

Statistical Workflows in R for Imaging Mass Spectrometry Data

Hoang Tran, Valeriia Sherina, Fang Xie

GlaxoSmithKline, United States of America

Matrix-assisted laser desorption/ionization (MALDI) imaging mass spectrometry (IMS) is a technique that can reveal powerful insights into the correlation between molecular distributions and histological features. Due to their high-dimensional, hierarchical and spatial nature, MALDI IMS datasets present numerous statistical challenges. In collaboration with the bioimaging team at GlaxoSmithKline (GSK), we have developed special purpose statistical workflows in R that provide end-to-end support for the entire MALDI IMS analysis pipeline, from study design and assay quantification to functional pharmacology. These applications leverage numerous R packages, with a particular focus on the “tidyverse” and “tidymodels” ecosystems due to their modularity and interconnectedness (to protect GSK’s intellectual property, we are currently unable to share our code). Our workflows include robust smoothing and estimation of calibration curves; non-trivial animal and tissue sample size calculations via in silico experiments; and AI/ML implementations for prediction of drug effects from the high-dimensional molecular space. These solutions addressed unique biological and quantitative challenges, and yielded actionable insights for GSK’s bioimaging team.

ID: 258 / ep-02: 21
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: Shiny, NLP, Human-computer interaction, Chatbot, AI&Society

Hi, Let’s Talk About Data Science! - Customize Your Personal Data Science Assistant Bot.

Livia Eichenberger, Oliver Guggenbühl

STATWORX GmbH, Switzerland

In June 2020, OpenAI released their newest NLP model, GPT-3, and thus set a new standard for language understanding and generation. GPT-3 is an autoregressive language model, enabling the generation of human-like text. Sample use cases are chatbots, Q&A systems and text summarization. Due to the complexity of GPT-3, it is difficult for non-technical specialists to experience both the strengths but also the shortcomings of this technology. A fundamental challenge faced today is educating society about the potentials and risks of AI and not leaving anyone behind.

To approach this task, R’s Shiny framework can be leveraged to lower the barrier of entry for interaction with AI models. Specifically, GPT-3 can be instructed to incorporate different types of chatbots by supplying it with a precise description of how it should behave during a conversation. We provide an interface to chat with a Data Science bot, where various parameters of the bot’s behaviour can be selected on the fly. Examples are the preferred language and the user’s knowledge level. A mockup of our interface is attached.

Shiny is the preferred framework for this application because it comes packaged with all the necessary tools for interacting with a customizable chatbot based on GPT-3. With Shiny’s input widgets the user can then manipulate various parameters to influence the pre-defined chatbot’s personality. The chatbot will immediately adjust their behaviour and finetune their personality, allowing the user to experience their input on GPT-3 in real-time. All this will be done in a clearly laid out interface where users need no prior experience with R coding or creating Shiny apps.

We present how we use Shiny to lower the barrier to interact with AI models with little overhead and thus to tackle one of today’s most important problems: AI education of the broader population.


ID: 133 / ep-02: 22
Elevator Pitch
Topics: R in production
Keywords: business, industry

NNSampleSize: A tool for communicating, determining and documenting sample size in clinical trials

Claus Dethlefsen1, Steffen Falgreen Larsen1, Anders Ellern Bilgrau2, Nynne Holdt-Caspersen1, Maika Lindkvist Jensen1

1Novo Nordisk A/S; 2Seluxit

Determination of sample size in clinical studies is an iterative process involving many stakeholders and leading to many decisions. When data from other studies become available, assumptions may be revised or other scenarios for study design may be considered. Assumptions also feed into a decision-guiding framework aimed at determining whether the sample size is adequate to make a decision about the future development of the product. At Novo Nordisk, we have developed an R Shiny application that assists us in this process. In the application, several sample size scenarios can be explored for a given study. The application has a documentation module for keeping track of decisions using R Markdown, as well as facilities for programming and reviewing the final determination of sample size. When finalized, Word files can be downloaded, ready for archiving in a documentation system.

ID: 250 / ep-02: 23
Elevator Pitch
Topics: Multivariate analysis
Keywords: True Discovery Proportion, Permutation Test, Multiple Testing, Selective Inference, fMRI Cluster Analysis

pARI package: valid double-dipping via permutation-based All Resolutions Inference

Angela Andreella1, Jelle Goeman2, Livio Finos3, Wouter Weeda4, Jesse Hemerik5

1Department of Statistical Sciences, University of Padova; 2Biomedical Data Sciences, Leiden University Medical Center; 3Department of Developmental Psychology and Socialization, University of Padua; 4Methodology and Statistics Unit, Department of Psychology, Leiden University; 5Biometris, Wageningen University and Research

Cluster extent-based thresholding in functional Magnetic Resonance Imaging (fMRI) is a popular method for finding neural activation associated with a stimulus. However, it suffers from the spatial specificity paradox: we only know that a specific cluster of voxels is significant under the null hypothesis of no activation. We cannot determine the number of truly active voxels inside that cluster without falling into the double-dipping problem. To address this, Rosenblatt et al. (2018) developed All-Resolutions Inference (ARI), which provides a lower bound on the number of truly active voxels in each cluster. However, ARI can lose power if the data are strongly correlated, as fMRI data typically are. We therefore re-phrased ARI in terms of permutation theory and developed the package pARI. The main function, pARIbrain, takes as input a list of contrast maps, one for each subject, produced by neuroimaging tools. The user can then supply a cluster map, and pARIbrain returns the lower bound of true discoveries for each cluster in that map. Although the package was developed for the fMRI scenario, we also provide the general-purpose function pARI. It takes the permutation null distribution of the p-values and the indices of the hypotheses of interest as inputs, and returns the lower bound on the number of true discoveries inside the specified set of hypotheses. The permutation null distribution for two-sample and one-sample t-tests can be computed with the permTest and signTest functions. The set of hypotheses can be re-specified as often as the user wants, and pARI still controls the familywise error rate (FWER).

ID: 233 / ep-02: 24
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: biostatistics

Data Access and dynamic Visualization for Clinical Insights (DaVinci)

Matthias Trampisch, Julia Igel, Andre Haugg

Boehringer Ingelheim

This talk introduces the Boehringer Ingelheim initiative on Data Access and dynamic Visualization for Clinical Insights (DaVinci). It is named after Leonardo da Vinci, one of the most diversely talented individuals ever to have lived. The main objective of the DaVinci project is to reflect this diversity by creating a modular framework based on Shiny, which gives end-users direct access to clinical data via advanced visualization during clinical development.

DaVinci consists of a collection of shiny-based modules to review, aggregate and visualize data to develop and deliver safe and effective treatments for patients. Based on harmonized data concepts (SDTM/ADaM), DaVinci provides and maintains GCP compliant modules for data review and analysis, which can easily be combined and customized into trial-specific dashboards by the end-user.

The talk outlines our approach, including the module manager and highly flexible, custom-designed modules, which together provide an individual and customizable app experience. The main advantages of this approach are that the individual modules can be validated separately and used flexibly in a joint Shiny application, which permits straightforward validation with respect to GDPR, GxP and 21 CFR Part 11. The approach also supports trial-, project- or substance-specific needs to get the most value out of the data.

Deployment of these apps is done via a CI/CD pipeline using the Atlassian Stack and Jenkins, resulting in dockerized shiny server instances, which can easily scale up to the application needs.

ID: 179 / ep-02: 25
Elevator Pitch
Topics: Environmental sciences
Keywords: Environmental research; Big data; Reproducibility; Data visualisation

Reproducibility and dissemination in the research: a case of study of the bioaerosol dynamics

Jesús Rojo1, Antonio Picornell2, Jeroen Buters3, Jose Oteros4

1Department of Pharmacology, Pharmacognosy and Botany, Complutense University. Madrid (Spain); 2Department of Botany and Plant Physiology. University of Malaga. Malaga (Spain); 3Center of Allergy & Environment (ZAUM), Technische Universität München/Helmholtz Center Munich. Munich (Germany); 4Department of Botany, Ecology and Plant Physiology. University of Cordoba. Cordoba (Spain)

Environmental databases are constantly growing and require computational tools to be managed efficiently. This experience is an example of the procedure followed to manage the aerobiological databases used in the publication led by Rojo et al. [Environ Res,174:160-169; doi:10.1016/j.envres.2019.04.027] on the effect of height on pollen exposure. While the analysis of pollen time-series at a local scale may provide unclear or even contradictory findings across study areas, a global study provides robust results, avoiding biases or local factors masking the true patterns in bioaerosol dynamics. We analysed about 2,000,000 daily pollen concentrations from 59 monitoring stations in Europe, North America and Australia, using R and 'AeRobiology', a specific package in this field [Rojo et al., Methods Ecol Evol,10:1371-1376; doi:10.1111/2041-210X.13203]. Due to the large number of data contributors involved, we first carried out exhaustive filtering and quality control of the data to make the datasets standardized and comparable between sites. This quality control applied basic rules for removing uncertain or missing data, as well as scientific criteria based on optimising parameters such as the distance or degree of similarity between sites. The pollen rate between paired stations was then used to study the effect of height on pollen concentrations, which constituted the second step (data analysis) and yielded the main scientific findings. One of the key benefits of computational tools is the automation of processes. In this case, the processing and analysis systems made it possible to dynamically incorporate pollen data from new stations, obtaining an automatic update of the statistical analysis.
Finally, since reproducibility and dissemination are both important principles of scientific research, we designed a Shiny application where users may interpret the results and generate the graphs by selecting specific scientific criteria themselves. Link to the Shiny application:

ID: 210 / ep-02: 26
Elevator Pitch
Topics: Environmental sciences
Keywords: environmental sciences

R in the aiR!

Adithi R. Upadhya1, Pratyush Agrawal1, Sreekanth Vakacherla2, Meenakshi Kushwaha1

1ILK Labs, Bengaluru, India; 2Center for Study of Science, Technology and Policy, Bengaluru, India

R is a powerful tool for analysing air-quality data. With the ever-increasing global measurements of air pollutants (through stationary, mobile, low-cost, and satellite monitoring), the amount of data being collected is huge and necessitates the use of management platforms. In an effort to address this issue, we developed two Shiny applications to analyse and visualise air-pollution data.

‘mmaqshiny’, now on CRAN, is aimed at handling, calibrating, integrating, and visualising spatially and temporally acquired air-pollution data from mobile monitoring campaigns. Currently, the application caters to data collected using specific instruments. With just the click of a button, even non-programmers can generate summary statistics, time series, and spatial maps. The application is capable of handling high-resolution data from multiple instruments and formats. Moreover, it allows users to visualize data in near real time and helps keep a tab on data quality and instrument health.

Our second Shiny application (currently in development) is specific to India and allows users to handle open-source air-quality datasets available from OpenAQ, CPCB, and AirNow. Users can visualize data, perform basic statistical operations, and generate a variety of publication-ready plots. It also provides outlier detection and replacement of fill/negative values. We have also integrated the popular openair package into this application.

ID: 108 / ep-02: 27
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics

segmenter: A Wrapper for JAVA ChromHMM

Mahmoud Ahmed, Deok Ryong Kim

Gyeongsang National University

Chromatin segmentation analysis transforms ChIP-seq data into signals over the genome. These represent the observed states in a multivariate Markov model used to predict the chromatin's underlying (hidden) states. ChromHMM, written in Java, integrates histone modification datasets to learn the chromatin states de novo. We developed an R package around this program to leverage the existing R/Bioconductor tools and data structures in the context of segmentation analysis. segmenter wraps the Java modules to call ChromHMM and captures the output in an S4 R object. This allows for iterating with different parameters, which are given in R syntax. Capturing the output in R makes it easier to work with the results and to integrate them into downstream analyses. Finally, segmenter provides additional tools to test, select and visualize the models. In summary, we developed an R package that wraps a popular chromatin segmentation tool and captures the output in R for testing and visualization.


ID: 122 / ep-02: 28
Elevator Pitch
Topics: Efficient programming
Keywords: recursion, list, nested, efficient programming, C

Efficient list recursion in R with rrapply

Joris Chau

Open Analytics

The little-used R function rapply() applies a function to all elements of a list recursively and provides control over how the result is structured. Although occasionally useful due to its simplicity, rapply() is not sufficiently flexible to solve many common list recursion tasks. In such cases, the usual solution is to write custom list recursion code, which can quickly become hard to follow or reason about, making it time-consuming and error-prone to update or modify. The rrapply() function in the rrapply package is an attempt to enhance and extend base rapply() to make it more generally applicable for efficient list recursion in R. For instance: i) rapply() only allows a function f to be applied to list elements of certain classes; rrapply() generalizes this through an arbitrary condition function; ii) rrapply() allows additional flexibility in structuring the result, e.g. by pruning or unnesting list elements; iii) with rapply() there is no convenient way to access the name or location of the list element under evaluation; rrapply() provides a number of special arguments to overcome this limitation. The rrapply() function aims at efficiency by building on rapply()’s native C implementation and does not require any external R package dependencies. The rrapply package is available on CRAN, and several vignettes illustrating its use can be found online.
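
A small example of the points above (assuming the rrapply package is installed; the data are invented):

```r
library(rrapply)

# A small nested list
x <- list(
  a = list(score = 10, label = "low"),
  b = list(score = 25, label = "high")
)

# i) + ii): apply f only where a general condition holds, and prune
# non-matching elements from the result
rrapply(
  x,
  condition = function(x) is.numeric(x),
  f = function(x) x * 2,
  how = "prune"
)

# iii): the special argument .xname exposes the name of the element
# under evaluation, which base rapply() cannot do
rrapply(
  x,
  condition = function(x, .xname) .xname == "score",
  f = function(x) x + 1,
  how = "prune"
)
```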

ID: 263 / ep-02: 29
Elevator Pitch
Topics: Mathematical models
Keywords: flow chart, flow diagram, model diagram, ggplot2, visualization

An R package to flexibly generate simulation model flow diagrams

Andreas Handel1, Andrew Tredennick2

1University of Georgia; 2Western EcoSystems Technology, Inc.

We recently developed an R package that allows users to quickly generate ggplot2-based flow diagrams of the compartmental simulation models commonly used in infectious disease modeling and many other areas of science and engineering. The package allows users to create publication-quality diagrams in a user-friendly manner. Full access to the ggplot2 code that generates the diagram means advanced users can further customize the final diagram as needed. In this talk, we will provide a brief overview and introduction to the package.


ID: 272 / ep-02: 30
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: Markdown, automation, trend epidemiology, daily report, metrics

Using R Markdown to Automate COVID-19 Reporting

Farzad Islam, Michael Elten, Najmus Saqib

Public Health Agency of Canada, Canada

The COVID-19 pandemic has impacted the operational needs of the Public Health Agency of Canada (PHAC), and consequently the day-to-day responsibilities of its employees. Emergency surveillance in light of the pandemic requires around-the-clock monitoring seven days a week. To accompany this surveillance, daily reporting processes were developed to keep the Office of the Chief Public Health Officer (OCPHO) informed of nationwide trends that would ultimately help inform public policy decisions and shape communication strategies. Because these needs arose abruptly, the solutions initially devised were labour-intensive and inefficient. Epidemiologists were using the same datasets across different teams, writing scripts in various languages and maintaining them in silos.

The Center for Data Management, Innovation and Analytics at PHAC was responsible for taking these functions and improving them so that they a) became standardized, b) reduced the need for manual labour, and c) eliminated the risk of human error. As a result, R was used to automate the reporting functions, which moved to the back-end, and the outputs of the scripts were generated as PowerPoint decks. This included the use of various plots (ggplot2), tables (flextable, officer), and cross-functionality with Python (reticulate). The data ingestion systems were also improved by using Google Sheets, reading public data directly from websites, and applying web-scraping techniques to pull data reported daily.
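
The core of such a pipeline can be sketched in a few lines (file names and parameters here are illustrative, not PHAC's actual setup):

```r
library(rmarkdown)

# Render a parameterized R Markdown report straight to a PowerPoint deck.
# Scheduling this call (e.g. via cron or taskscheduleR) replaces the manual
# daily assembly of slides.
render(
  input         = "daily_covid_report.Rmd",            # illustrative file name
  output_format = powerpoint_presentation(),
  output_file   = sprintf("covid_report_%s.pptx", Sys.Date()),
  params        = list(data_date = Sys.Date())         # passed to the Rmd
)
```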

As a result of these efforts, daily reporting needs which could take hours to accomplish were reduced to the click of a button and five minutes of processing.

ID: 186 / ep-02: 31
Elevator Pitch
Topics: Statistical models
Keywords: statistics, Cumulative Link Mixed-effects Models, Ordinal response variable

Cumulative Link Mixed-effects Models (CLMMs) as a tool to model ordinal response variables and incorporate random effects

Christophe Bousquet

Lyon Neuroscience Research Center, France

Ordinal response variables are frequent in various scientific domains, including ecology, ethology and psychology. However, researchers often analyse these data with methods suited to non-ordinal response variables. The R package ‘ordinal’ has been developed specifically to model ordinal response variables and also offers the possibility to incorporate random effects. In this elevator pitch, I will present how to approach this kind of analysis, from the integration of random effects to the production of visualisations that communicate the results. The dataset is based on experiments in behavioural biology, specifically on leadership in mallards. The code to access the data and analysis is available on GitHub and may help other researchers learn analysis techniques for ordinal data.
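
A minimal sketch of fitting a CLMM with the ‘ordinal’ package, using the wine dataset shipped with the package rather than the mallard data:

```r
library(ordinal)

# 'rating' is an ordered factor (bitterness of wine); 'judge' is a
# grouping factor suitable for a random intercept
data(wine)

# Cumulative link mixed model: fixed effects for temperature and contact,
# random intercept per judge
fit <- clmm(rating ~ temp + contact + (1 | judge), data = wine)
summary(fit)
```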

ID: 242 / ep-02: 32
Elevator Pitch
Topics: Data visualisation
Keywords: ggplot2

High dimensional data visualization in ggplot2

Zehao Xu, Wayne Oldford

University of Waterloo

Package 'ggmulti' extends the 'ggplot2' package with high-dimensional visualization functionality such as serialaxes coordinates (e.g., parallel, ...) and multivariate scatterplot glyphs (e.g. encoding many variables in radial axes or a star glyph).

Much more general glyphs (e.g., polygons, images) are also now possible as point symbols in a scatterplot and can provide more evocative pictures for each point (e.g. an airplane for flight data or a team’s logo for sports data).

As its name suggests, serialaxes coordinates arrange variable axes in series (radially for stars, in parallel for parallel coordinates) and can be used as a standalone plot or as a glyph. These are extended to continuous curve representations (e.g., Andrews curves) through function transformations (e.g. Fourier series). The parallel coordinates work in the ggplot pipeline, allowing histograms, densities, etc. to be overlaid on the axes.

In this talk, an overview of ggmulti will be given, largely by example.


ID: 211 / ep-02: 33
Elevator Pitch
Topics: Data visualisation
Keywords: API

Charting Covid with the DatawRappr-Package

Benedict Witzenberger

Süddeutsche Zeitung / TUM

Covid-19 swept across the world like a huge, sudden wave. Data journalists all around the globe had a brand new beat to cover from one moment to the next. Many newsrooms used the available data to start automated, regularly updated visualizations or dashboards. One tool often used for creating charts, maps or dashboard-like tables in journalism (and corporate communications) is Datawrapper.

I created an R API package that combines the power of R code for analysing data with the many options Datawrapper offers for creating interactive and responsive visualizations.

I would like to show some examples and best practices for useful automated visualizations in Datawrapper - created in R.
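
The workflow can be sketched roughly as follows (function names are from DatawRappr as I understand its API; the chart type, API key, and data are illustrative):

```r
library(DatawRappr)

# Authenticate once with a Datawrapper API token (placeholder key)
datawrapper_auth(api_key = "XXXX")

# Data prepared in R (illustrative)
covid_df <- data.frame(date = Sys.Date() - 2:0, cases = c(100, 120, 90))

# Create a line chart, push the data frame to it, and publish
chart <- dw_create_chart(title = "Daily Covid-19 cases", type = "d3-lines")
dw_data_to_chart(covid_df, chart_id = chart)
dw_publish_chart(chart)
```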


ID: 154 / ep-02: 34
Elevator Pitch
Topics: Efficient programming
Keywords: C++, AutoDiff, packages

Bringing AutoDiff to R packages

Michael Komodromos

Imperial College London

We demonstrate the use of a C++ automatic differentiation (AD) library and show how it can be used with R to solve problems in optimization, MCMC and beyond. In particular, we show how gradients produced with AD can be used with R's built-in optimization routines. We hope such integrations will enable package developers to produce robust, efficient code by removing the need to hand-write gradient functions.
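
As a base-R illustration of the point about optimization routines, optim() accepts an externally supplied gradient via its gr argument; here the gradient of the Rosenbrock function is hand-coded as a stand-in for one produced by an AD library:

```r
# Rosenbrock function and its analytic gradient (the gradient plays the
# role of AD-generated output)
rosenbrock <- function(p) {
  (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
}
rosenbrock_grad <- function(p) {
  c(-2 * (1 - p[1]) - 400 * p[1] * (p[2] - p[1]^2),
    200 * (p[2] - p[1]^2))
}

# BFGS uses the supplied gradient instead of numerical differences
fit <- optim(c(-1.2, 1), fn = rosenbrock, gr = rosenbrock_grad,
             method = "BFGS")
fit$par  # close to the true minimum at (1, 1)
```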


ID: 218 / ep-02: 35
Elevator Pitch
Topics: Community and Outreach
Keywords: interface, community, education, workflow

Healthier & Happier Hands: Software and Hardware Solutions for More Ergonomic Typing

John Paul Helveston

George Washington University

Most R users spend multiple hours every day typing on a keyboard, which can lead to serious injuries such as Repetitive Strain Injury (RSI) and Carpal Tunnel Syndrome. This talk discusses a variety of software and hardware tools to improve the ergonomics of typing. I will discuss a wide range of solutions, from implementing software tools for remapping keys to using a split mechanical keyboard for improved hand and arm positioning. Each solution involves a trade-off between the time and effort required to learn and implement it and the benefits in terms of health and typing improvements, like speed and accuracy. I will also showcase some specific applications of how these solutions can improve the experience of working with R. No one solution will work for everyone, but my goal is that by introducing a broad overview of solutions, many will leave inspired to try (and eventually adopt) some and end up with healthier and happier hands.

ID: 144 / ep-02: 36
Elevator Pitch
Topics: Algorithms
Keywords: high-dimensional data

High Dimensional Penalized Generalized Linear Mixed Models: The glmmPen R Package

Hillary M. Heiling1, Naim U. Rashid1,2, Quefeng Li1, Joseph G. Ibrahim1

1University of North Carolina at Chapel Hill; 2UNC Lineberger Comprehensive Cancer Center

Generalized linear mixed models (GLMMs) are popular for their flexibility and their ability to estimate population-level effects while accounting for between-unit heterogeneity. While GLMMs are very versatile, the specification of fixed and random effects is a critical part of the modeling process. Historically, variable selection in GLMMs has been restricted to a search over a limited set of candidate models or has required selection criteria that are computationally difficult to compute for GLMMs, limiting variable selection in GLMMs to lower dimensional models. To address this, we developed the R package glmmPen, which simultaneously selects fixed and random effects from high dimensional penalized generalized linear mixed models (pGLMMs). Model parameters are estimated using a Monte Carlo Expectation Conditional Maximization (MCECM) algorithm, which leverages Stan and RcppArmadillo to increase computational efficiency. Our package supports the penalty functions MCP, SCAD, and Lasso, and the distributional families Binomial, Gaussian, and Poisson. Tools available in the package include automated tuning parameter selection and automated initialization of the random effect variance. Optimal tuning parameters are selected using BIC-ICQ or other BIC selection criteria; the marginal log-likelihoods used for the BIC criteria calculation are estimated using a corrected arithmetic mean estimator. The package can also be used to fit traditional generalized linear mixed models without penalization, and provides a user interface that is similar to the popular lme4 R package.
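
Based on the description above and the stated lme4-like interface, a call might look roughly as follows; the argument names and the data frame are assumptions, not verified against the package:

```r
# Hypothetical sketch only: 'df' and all argument names are illustrative
library(glmmPen)

fit <- glmmPen(
  y ~ x1 + x2 + (x1 + x2 | group),  # candidate fixed and random effects
  data    = df,
  family  = "binomial",
  penalty = "MCP"                   # MCP, SCAD, and lasso are supported
)
```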


ID: 140 / ep-02: 37
Elevator Pitch
Topics: Reproducibility
Keywords: R markdown

trackdown: collaborative writing and editing your R Markdown and Sweave documents in Google Drive

Filippo Gambarota1, Claudio Zandonella Callegher1, Janosch Linkersdörfer2, Mathew Ling3, Emily Kothe3

1University of Padova; 2University of California, San Diego; 3Misinformation Lab, Deakin University

The advantages of using literate programming that combines plain text and code chunks (e.g., R Markdown and Sweave) are well recognized: it allows the creation of rich, high-quality, and reproducible documents. However, collaborative writing and editing have always been a bottleneck. Distributed version control systems like git are recommended for collaborative code editing but are far from ideal when working with prose. In the latter case, other software (e.g., Microsoft Word or Google Docs) offers a more fluent experience, tracking document changes in a simple and intuitive way. When you further consider that collaborators often do not have the same level of programming competence, there does not appear to be an optimal collaborative workflow for writing reproducible documents.

trackdown (formerly rmdrive) overcomes this issue by offering a simple solution to collaborative writing and editing of reproducible documents. Using trackdown, the local R Markdown or Sweave document is uploaded as plain-text in Google Drive allowing other colleagues to contribute to the prose using convenient features like tracking changes and comments. After integrating all authors’ contributions, the edited document is downloaded and rendered locally. This smooth workflow allows taking advantage of the easily readable Markdown and LaTeX plain-text combined with the optimal and well-known text editing experience offered by Google Docs.

In this contribution, we will present the package and its main features. trackdown aims to promote good scientific practices that enhance overall work quality and reproducibility, allowing collaborators with no or limited R knowledge to contribute to literate programming workflows.
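
The round-trip workflow can be sketched as follows (the file name is illustrative; Google authentication is handled interactively by the package):

```r
library(trackdown)

# 1. Upload the local R Markdown document as plain text to Google Drive
upload_file(file = "analysis.Rmd")

# ... collaborators edit and comment in Google Docs ...

# 2. Pull the edited document back and render it locally
download_file(file = "analysis.Rmd")
rmarkdown::render("analysis.Rmd")
```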


ID: 289 / ep-02: 38
Elevator Pitch
Topics: Statistical models
Keywords: multivariate functional data, outlier detection, functional classification, clustering, machine learning

Multivariate functional data analysis

Manuel Oviedo-de la Fuente1, Manuel Febrero-Bande2

1University of Coruña, Spain; 2University of Santiago de Compostela, Spain

This talk proposes new tools for working with multivariate functional data (MFD) in R. To handle multivariate functional data, the class "mfdata" is proposed; to handle complex data (scalar, multivariate, directional, images, and functional), the class "ldata". These new classes are useful in problems such as i) visualizing centrality and detecting outliers in MFD, ii) extending supervised classification algorithms in machine learning, and iii) unsupervised algorithms such as hierarchical and k-means procedures.


ID: 220 / ep-02: 39
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: big data

Multivariate functional principal component analysis on high dimensional gait data

Sajal Kaur Minhas1, Morgan Sangeux3, Julia Polak2, Michelle Carey1

1University College Dublin; 2School of Mathematics and Statistics, University of Melbourne, Melbourne, Australia; 3Murdoch Childrens Research Institute, Melbourne, Australia

A typical gait analysis requires analysing the kinematics of five joints (trunk, pelvis, hip, knee and ankle/foot) in three planes. It also requires summarizing how much a subject’s gait deviates from an average normal profile as a single number. Such a number can quantify the overall severity of a condition affecting walking, monitor progress, or evaluate the outcome of an intervention prescribed to improve the gait pattern. The Gait Deviation Index (GDI) and Gait Profile Score (GPS) are the standard indices for measuring gait abnormality and work well on common gait pathologies such as cerebral palsy. The GDI is easy to interpret and is normally distributed, allowing for parametric statistical testing. The GPS can decompose scores by individual joints/planes and produce altered indices without the need for a large control database, but it is not normally distributed. Neither index accounts for the potential co-variation between the kinematic variables for an individual subject, i.e. the motions of one joint affect the motions of adjacent joints. Additionally, the intrinsic smoothness of the gait movement in each kinematic variable is not accounted for, i.e. the position of a joint at one instant affects its positions at later instants. The aim of this work is to use techniques from multivariate functional principal component analysis, via the R package MFPCA, to create an index that combines the advantages of the existing GDI and GPS: an index that is easy to interpret, is normally distributed, can decompose scores by individual joints and planes, and is easily adaptable, while also accounting for the intrinsic smoothness of the gait movement in each kinematic variable and the potential co-variation between the kinematic variables. The functional gait deviation index is implemented in R and provides a computationally efficient and easily administered metric to quantify gait impairment.


ID: 184 / ep-02: 40
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: teaching, lecture, introduction, programming

Teaching an introductory programming course with R

Reto Stauffer1,2, Joanna Chimiak-Opoka1, Luis M Rodriguez-R1,3, Achim Zeileis2

1Digital Science Center, Universität Innsbruck, Austria; 2Department of Statistics, Universität Innsbruck, Austria; 3Department of Microbiology, Universität Innsbruck, Austria

As part of a large digitalization initiative, Universität Innsbruck established a Digital Science Center that aims to foster both interdisciplinary research and modern education using digital and data-driven methods. Specifically, the center offers a package of elective courses that can be taken by all students and that covers programming, data management, data analysis, and further aspects of digitalization.

The first course within this package is a general introduction to programming for novices, offered in two tracks, using either Python or R. The focus is on teaching data types including object classes, writing and testing functions, control flow, etc. While some basic data management and data analysis is touched upon, these topics are mainly deferred to subsequent courses.

As this design differs from most introductory R materials, which emphasize data analysis early on, we developed new course materials centered around an online textbook. Our course follows the flipped classroom design, allowing the diverse group of participants to learn at their own pace. In class, open questions are resolved before students work jointly on non-mandatory programming tasks with guidance and feedback from the instructors. Assessment is based on short weekly randomized online quizzes, generated with the R/exams package and graded automatically, as well as manually graded mid-term and final exams. The concept of the course has turned out to work well both in person and in virtual teaching.
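
A sketch of how such randomized quizzes are generated with R/exams (the exercise file names are illustrative):

```r
library(exams)

# Each element is an exercise template; n random replications of each
# are drawn per quiz
quiz <- c("boxplots.Rmd", "functions.Rmd", "loops.Rmd")

# Export to Moodle XML for automatic grading in the learning platform
exams2moodle(quiz, n = 10, name = "week03_quiz")
```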


ID: 255 / ep-02: 41
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: XAI, DALEX, iml, flashlight, shap, Interpretable Artificial Intelligence

Landscape of R packages for eXplainable Artificial Intelligence

Szymon Maksymiuk, Alicja Gosiewska, Przemysław Biecek

Warsaw University of Technology, Poland

The growing availability of data and computing power is fueling the development of predictive models. To ensure the safe and effective functioning of such models, we need methods for exploration, debugging, and validation. New methods and tools for this purpose are being developed within the eXplainable Artificial Intelligence (XAI) subdomain of machine learning. In this lightning talk, we present our taxonomy of model-explanation methods, show which methods are included in the most popular R XAI packages, and highlight trends in recent developments.
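
As a concrete taste of the explanation workflow these packages share, here is a minimal example with DALEX (one of the surveyed packages), using its bundled titanic_imputed data:

```r
library(DALEX)

# Any fitted model can be wrapped into a uniform 'explainer'
model <- glm(survived ~ gender + age + fare, data = titanic_imputed,
             family = "binomial")

explainer <- explain(model,
                     data = titanic_imputed[, c("gender", "age", "fare")],
                     y    = titanic_imputed$survived)

# Permutation-based variable importance, one family of explanation methods
vip <- model_parts(explainer)
plot(vip)
```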


ID: 227 / ep-02: 42
Elevator Pitch
Topics: R in production
Keywords: interactive visualization

Reactive PK/PD: An R shiny application simplifying the PK/PD review process

Kristoffer Segerstrøm Mørk, Steffen Falgreen Larsen

Novo Nordisk

In phase 1 of clinical drug development there is great interest in the pharmacokinetics (PK) and pharmacodynamics (PD) of a drug. PK describes what the body does to the drug; PD describes what the drug does to the body. Due to the limitations and uncertainties of the procedures used to assess the PK and PD of a drug, there is a need to review PK and PD data at the patient level. Such a review is usually conducted in a smaller group of people from different skill areas.

In this elevator pitch you will see how we at Novo Nordisk have simplified and automated many of the tasks related to a PK/PD review using R Shiny. We have developed an application that automatically generates the figures needed to conduct a review. The app enables users to comment on the data through the autogenerated figures, and the comments are instantly shared with other users. Once a review has been conducted, minutes can be downloaded in Word format, including the added comments.

ID: 121 / ep-02: 43
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data processing

r-cubed: Guiding the overwhelmed scientist from random wrangling to Reproducible Research in R

Hannah Chatwin1, Luke W. Johnston2, Helene Baek Juel3, Bettina Lengger4, Daniel R. Witte2,5, Malene Revsbech Christiansen3, Anders Aasted Isaksen5

1University of Southern Denmark; 2Steno Diabetes Center Aarhus; 3University of Copenhagen; 4Technical University of Denmark; 5Aarhus University

The volume of biological data increases yearly, driven largely by technologies like high-throughput omics, real-time monitoring, and high-resolution imaging, as well as by greater access to routine administrative data and larger study populations. This presents operational challenges and requires considerable knowledge and skills to manage, process, and analyze this data. Along with the growing open science movement, research is also increasingly expected to be open, transparent, and reproducible. Training in modern computational skills has not yet kept pace, particularly in biomedical research where training often focuses on clinical, experimental, or wet lab competencies. We developed a computational learning module, r-cubed, designed with biomedical researchers in mind, that introduces and improves skills in R, reproducibility, and open science. The r-cubed learning module is structured as a three-day workshop with five submodules. Over the five submodules, we use a combination of code-alongs, exercises, lectures, and a group project to cover skills in collaboration with Git and GitHub, project management, data wrangling, reproducible document writing, and data visualization. We have specifically designed the module as an open educational resource that instructors can use directly or modify for their own lessons, and that learners can use independently or as a reference during and after participating in the workshop. All content is available for re-use under CC-BY and MIT Licenses. The course website is found at and the repository with the source material is at


ID: 128 / ep-02: 44
Elevator Pitch
Topics: Databases / Data management
Keywords: databases

Validate observations stored in a DB

Edwin de Jonge

Statistics Netherlands / CBS

Data cleaning is an important step before analyzing your data. Often it is wise to check the validity of your observations before running your statistical methods on the data. Validation checks embody real-world knowledge about your observations, e.g. age cannot be negative or over 150 years.

The R package `validate` allows for formulating validation checks in R syntax and running these checks on a `data.frame`. `validatedb` brings `validate` to the database: it allows for running the validation checks on (potentially very large) database tables, offering the same benefits as `validate`, namely a clean, documented set of validation rules, but checked on a database. The presentation will go into the details of the implementation, describe the output of the validation checks, and also discuss an alternative sparse format for describing errors in your data.
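A minimal sketch of the workflow described above, first with `validate` on a `data.frame` and then the same rules against a database table via `validatedb`; the database connection is a placeholder:

```r
library(validate)

# Real-world knowledge encoded as validation rules
rules <- validator(
  age >= 0,
  age <= 150,
  height > 0
)

d <- data.frame(age = c(25, -3, 200), height = c(1.80, 1.65, 1.75))

# Per-rule counts of passing and failing records
summary(confront(d, rules))

# The same rules checked on a database table with validatedb
# (assuming 'con' is an open DBI connection and the table exists):
# library(validatedb)
# tbl_people <- dplyr::tbl(con, "people")
# confront(tbl_people, rules)
```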

ID: 208 / ep-02: 45
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data science class, flipped classroom, learnr, gradethis

Teaching Biology students to code smoothly with learnR and gradethis

Guyliann Engels, Philippe Grosjean

Numerical Ecology Department, Complexys and InforTec Institutes, University of Mons, Belgium

R is taught in a biology curriculum at the University of Mons, Belgium, in the context of five data science courses spanning from the 2nd Bachelor to the last Master classes. Since 2018, a flipped classroom approach has been used. Three levels of exercises of increasing difficulty are offered. First, students read a {bookdown} with integrated interactive exercises written in H5P or {Shiny}. Then, they practice R using {learnr} tutorials. Finally, they apply the new concepts on real datasets in individual or group projects managed with GitHub and GitHub Classroom.

{learnr} is a useful tool to bridge the gap between theory and practice in R learning. Students can self-assess their skills and get immediate feedback thanks to {gradethis}. All the exercises generate xAPI events that are recorded in a MongoDB database (more than 300,000 events recorded so far for a total of 182 students over three academic years). These data allow us to quantify and visualize progression (individual progress reports as {Shiny} applications). Thanks to the detailed visualization of their own progression, students are more motivated to complete the exercises. Whether {learnr} is used alone, or in combination with {gradethis} for immediate feedback on the answers, determines students' behavior: they spend more time on each exercise and try harder to find the right answer when {gradethis} is used.
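A minimal sketch of a {learnr} exercise with {gradethis} feedback, of the kind described above; the exercise content is a made-up example, written as R Markdown tutorial chunks:

````markdown
```{r setup, include = FALSE}
library(learnr)
library(gradethis)
gradethis::gradethis_setup()
```

Compute the mean of the built-in `mtcars$mpg` column:

```{r mean-ex, exercise = TRUE}

```

```{r mean-ex-check}
grade_this({
  pass_if_equal(mean(mtcars$mpg), "Well done!")
  fail("Not quite -- try calling `mean()` on `mtcars$mpg`.")
})
```
````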


ID: 138 / ep-02: 46
Elevator Pitch
Topics: Statistical models
Keywords: algorithms

Partial Least Squares Regression for Beta Regression Models

Frederic Bertrand, Myriam Maumy

European University of Technology - Troyes Technology University

Many responses, for instance, experimental results, yields or economic indices, can be naturally expressed as rates or proportions whose values must lie between zero and one or between any two given values.

The Beta regression often allows modelling these data accurately since the shapes of the densities of Beta laws are very versatile.

Yet, like any of the usual regression models, it cannot be applied safely in the case of multicollinearity, and not at all when the model matrix is rectangular. These situations are frequently encountered in fields ranging from chemistry to medicine, through economics and marketing.

To circumvent this difficulty, we derived an extension of PLS regression to Beta regression models: Bertrand, F., [...], Maumy-Bertrand, M. (2013). “Régression Bêta PLS” [PLS Beta regression; in French]. JSFDS, 154(3):143–159.

The plsRbeta package provides partial least squares regression for (weighted) beta regression models and k-fold cross-validation using various criteria. It allows for missing data in the explanatory variables. Bootstrap confidence interval construction is also available. Parallel computing (CPU and GPU) support is currently being implemented.
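A hedged sketch of fitting such a model on simulated data with a rectangular model matrix; the call follows the formula-interface conventions of the plsRglm family, so treat the exact argument names (`nt`, `modele = "pls-beta"`) as assumptions to check against the package manual:

```r
library(plsRbeta)

# Simulated example: rectangular model matrix (p > n) with a (0, 1) response,
# the situation where ordinary beta regression cannot be applied
set.seed(1)
n <- 20; p <- 40
X <- matrix(rnorm(n * p), n, p)
y <- plogis(X[, 1] - X[, 2] + rnorm(n))  # rates strictly between 0 and 1
d <- data.frame(y = y, X)

# PLS beta regression with (assumed) 3 components
fit <- plsRbeta(y ~ ., data = d, nt = 3, modele = "pls-beta")
```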

ID: 194 / ep-02: 47
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: interpretability, machine learning, explainability

Simpler is Better: Lifting Interpretability-Performance Trade-off via Automated Feature Engineering

Alicja Gosiewska1, Anna Kozak1, Przemysław Biecek1,2

1Warsaw University of Technology, Poland; 2University of Warsaw, Poland

Machine learning generates useful predictive models that can and should support decision-makers in many areas. The availability of AutoML tools makes it possible to quickly create an effective but complex predictive model. However, the complexity of such models is often a major obstacle in applications, especially for high-stakes decisions. We are experiencing a growing number of examples where the use of black boxes leads to decisions that are harmful, unfair, or simply wrong. Here, we show that very often we can simplify complex models without compromising their performance, while gaining much-needed transparency.

We propose a framework that uses elastic black boxes as supervisor models to create simpler, less opaque, yet still accurate and interpretable glass box models. The new models were created using newly engineered features extracted with the help of a supervisor model.

We support the analysis with a large-scale benchmark on several tabular data sets from the OpenML database. There are three main results: 1) we show that extracting information from complex models may improve the performance of simpler models; 2) we question the common myth that complex predictive models outperform simpler ones; 3) we present a real-life application of the proposed method.

The proposed method is available as the R package rSAFE.
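A hedged sketch of the workflow as I understand it from the package description: a black box is wrapped in a DALEX explainer, new features are extracted from it, and a glass-box model is fit on the transformed data. The rSAFE function names used here (`safe_extraction()`, `safely_transform_data()`) are assumptions to verify against the package manual:

```r
library(DALEX)
library(rSAFE)
library(randomForest)

# 1. Fit an elastic black box (here a random forest) and wrap it in an explainer;
#    'apartments' is a built-in DALEX data set with response m2.price
model_bb  <- randomForest(m2.price ~ ., data = apartments)
explainer <- explain(model_bb,
                     data = apartments[, colnames(apartments) != "m2.price"],
                     y    = apartments$m2.price)

# 2. Use the black box as a supervisor model to engineer new features
#    (assumed function name)
extractor <- safe_extraction(explainer)

# 3. Transform the data and fit a simple, interpretable glass-box model
apartments_new <- safely_transform_data(extractor, apartments)
model_glass    <- lm(m2.price ~ ., data = apartments_new)
```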


ID: 162 / ep-02: 48
Elevator Pitch
Topics: Bayesian models
Keywords: Bayesian analysis

State of the Market - Infinite State Hidden Markov Models

Dean Markwick


The stock market is either in a bull or a bear market at any given time. In a bull market, on average prices increase. In a bear market, prices decrease on average. In this talk I will build a non-parametric Bayesian model that can classify the stock market into these different states.

This model is a practical application of my dirichletprocess R package and will serve as an introduction to both the package and non-parametric Bayesian models. I use free stock data and take you through the full quantitative modelling process, showing how to prepare the data, build the model, and analyze the model output. The model is able to highlight the dot-com crash of the early 2000s, the credit crisis of 2008, and the more recent COVID turmoil in the market. As it is a Bayesian model, I am also able to highlight the uncertainty around these market states without having to do any extra work. Overall, this talk will provide a practical example of, and introduction to, how R can be used in quantitative finance.
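A minimal sketch of the kind of nonparametric mixture underlying the talk, using the dirichletprocess package on simulated returns; the real analysis uses market data, and the hidden Markov layer on top of the Dirichlet process is omitted here:

```r
library(dirichletprocess)

# Stand-in for standardized daily returns with two regimes:
# a low-volatility "bull"-like regime and a high-volatility "bear"-like one
set.seed(42)
returns <- c(rnorm(200, mean =  0.05, sd = 1),
             rnorm(200, mean = -0.05, sd = 2))
returns <- as.numeric(scale(returns))

# Nonparametric mixture of Gaussians: the number of states is inferred
dp <- DirichletProcessGaussian(returns)
dp <- Fit(dp, its = 500)

# Inspect how many observations fall in each inferred component
table(dp$clusterLabels)
```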

ID: 164 / ep-02: 49
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: algorithms

networkABC: Network Reverse Engineering with Approximate Bayesian Computation

Myriam Maumy, Frederic Bertrand

European Technology University - Troyes Technology University

We developed an inference tool based on approximate Bayesian computation to decipher network data and assess the strength of the inferred links between the network's actors.

It is a new multi-level approximate Bayesian computation (ABC) approach. At the first level, the method captures the global properties of the network, such as scale-freeness and clustering coefficients, whereas the second level targets local properties, including the probability of each pair of genes being linked.

Up to now, ABC algorithms have been scarcely used in this setting and, due to the computational overhead, their application was limited to a small number of genes. By contrast, our algorithm was designed to cope with this issue and has a low computational cost. It can be used, for instance, for elucidating gene regulatory networks, an important step towards understanding normal cell physiology and complex pathological phenotypes.

Reverse-engineering consists of using gene expressions over time or over different experimental conditions to discover the structure of the gene network in a targeted cellular process.

ID: 187 / ep-02: 50
Elevator Pitch
Topics: Economics / Finance / Insurance
Keywords: Tidymodels, Tidyverse, actuarial science, actuarial claim cost analysis

Navigating Insurance Claim Data Through Tidymodels Universe

Jun Haur Lok, Tin Seong Kam

Singapore Management University, Singapore

The increasing ability to store and analyze data, due to advances in technology, has given actuaries opportunities to optimize the capital held by insurance companies. Often, optimizing capital lowers a company's cost of capital. This can translate into higher profit from the lower cost incurred, or into greater competitiveness through lower premiums for its insurance plans. In this analysis, the tidyverse and tidymodels packages are used to demonstrate how modern data science R packages can assist actuaries in predicting the ultimate claim cost once claims are reported. These packages' conformity with tidy data concepts has flattened the learning curve for applying different machine learning techniques to complement conventional actuarial analysis, effectively allowing actuaries to build various machine learning models in a tidier and more efficient manner. The packages also enable users to harness the power of data science to mine the “gold” in unstructured data, such as claim descriptions and item descriptions. Ultimately, this would enable companies to hold less reserve through more accurate claim estimation while not compromising solvency, allowing the capital to be re-deployed for other purposes.
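A minimal sketch of a tidymodels workflow of the kind described, on hypothetical claims data; the column names, data-generating process, and choice of a plain linear model are placeholders:

```r
library(tidymodels)

# Hypothetical claims data: ultimate cost plus a few features known at reporting
set.seed(2021)
claims <- tibble(
  ultimate_cost = rexp(500, rate = 1 / 10000),
  reported_cost = ultimate_cost * runif(500, 0.5, 1),
  claim_type    = sample(c("motor", "property"), 500, replace = TRUE)
)

split <- initial_split(claims, prop = 0.8)

# Preprocessing as a recipe: dummy-code nominal predictors, normalize numerics
rec <- recipe(ultimate_cost ~ ., data = training(split)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

# Bundle preprocessing and model into one workflow, then fit and predict
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm"))

fit_wf <- fit(wf, data = training(split))
predict(fit_wf, testing(split))
```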

ID: 146 / ep-02: 51
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: regression, mixed-effects model, grouped data, correlated outcomes, transfromation model

tramME: Mixed-Effects Transformation Models Using Template Model Builder

Balint Tamasi, Torsten Hothorn

Epidemiology, Biostatistics and Prevention Institute (EBPI), University of Zurich, Switzerland

Statistical models that allow for departures from strong distributional assumptions on the outcome and accommodate correlated data structures are essential in many applied regression settings. Our technical note presents the R package tramME, which implements the mixed-effects extension of linear transformation models. The model is appealing because it directly parameterizes the (conditional) distribution function and estimates the necessary transformation of the outcome in a data-driven way. As a result, transformation models represent a general and flexible approach to regression modeling of discrete and continuous outcomes. The package tramME builds on existing implementations of transformation models (the mlt and tram packages) as well as the Laplace approximation and automatic differentiation (using the TMB package) to perform fast and efficient likelihood-based estimation and inference in mixed-effects transformation models. The resulting framework can be readily applied to a wide range of regression problems with grouped data structures. Two examples are presented, which demonstrate how the model can be used for modeling correlated outcomes without strict distributional assumptions: 1) a mixed-effects continuous outcome logistic regression for longitudinal data with a bounded response; 2) a flexible parametric proportional hazards model for time-to-event data from a multi-center trial.
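A hedged sketch of the first example type, a mixed-effects continuous outcome logistic regression, on simulated longitudinal data. I assume tramME exposes `ColrME()` as the mixed-effects counterpart of tram's `Colr()`; check the exact function name and arguments against the package vignette:

```r
library(tramME)

# Simulated longitudinal data: a bounded response observed repeatedly per subject
set.seed(1)
d <- data.frame(
  id   = factor(rep(1:50, each = 4)),
  time = rep(1:4, times = 50)
)
subj_effect <- rnorm(50)
d$y <- plogis(0.3 * d$time + subj_effect[as.integer(d$id)] + rnorm(200))

# Continuous outcome logistic regression with a subject-level random intercept,
# using lme4-style random-effects notation
fit <- ColrME(y ~ time + (1 | id), data = d)
summary(fit)
```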



ID: 119 / ep-02: 52
Elevator Pitch
Topics: Statistical models
Keywords: big data

The one-step estimation procedure in R

Alexandre Brouste1, Christophe Dutang2

1Le Mans Université; 2Université Paris-Dauphine

In finite-dimensional parameter estimation, the Le Cam one-step procedure is based on an initial guess estimator and a Fisher scoring step on the log-likelihood function. For an initial $\sqrt{n}$-consistent guess estimator, the one-step estimation procedure is asymptotically efficient. When the guess estimator is available in closed form, it can also be computed faster than the maximum likelihood estimator. More recently, it has been shown that this procedure can be extended to an initial guess estimator with a slower speed of convergence. Based on this result, we propose in the OneStep package (available on CRAN) a procedure to compute the one-step estimator, which is faster than the MLE for large datasets in any situation. Monte Carlo simulations are carried out for several examples of statistical experiments generated by i.i.d. observation samples (discrete and continuous probability distributions). Thereby, we exhibit the performance of Le Cam's one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size. A real application and future package developments will also be discussed.
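A minimal sketch of the procedure with the OneStep package, compared against the classical MLE from fitdistrplus; the gamma distribution and sample size are illustrative:

```r
library(OneStep)       # Le Cam one-step estimation
library(fitdistrplus)  # classical MLE, for comparison

set.seed(123)
x <- rgamma(1e5, shape = 2, rate = 3)

# One-step estimator: closed-form initial guess + one Fisher scoring step
le_cam <- onestep(x, distr = "gamma")

# Maximum likelihood estimator, typically slower on large samples
mle <- mledist(x, distr = "gamma")

le_cam$estimate
mle$estimate
```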