Conference Agenda

Elevator Pitches 1
Tuesday, 06/July/2021:
12:45am - 2:15am

Virtual location: The Lounge #elevator_pitches

ID: 264 / ep-01: 1
Elevator Pitch
Topics: Social sciences
Keywords: national identification number, demographic data, generator, privacy, Finland

Hetu-package: Validating and Extracting Information from Finnish National Identification Numbers

Pyry Kantanen1, Måns Magnusson2, Jussi Paananen3, Leo Lahti1

1University of Turku, Finland; 2Uppsala University, Sweden; 3University of Eastern Finland

The need to uniquely identify citizens has been critical for efficient governance in the modern era. Novel techniques, such as iris scans, fingerprints and other biometric information have only recently begun to supplement the tried-and-true method of assigning each individual a unique identifier, a national identification number.

In Nordic countries national identification numbers are not random strings but contain information about the person’s birth date, sex and, in the case of Swedish personal identity numbers, place of birth. In addition, most identification numbers contain control characters that make them robust against input errors, ensuring data integrity. Datasets that lack this demographic information can be supplemented with data extracted from national identification numbers, and existing demographic data can be validated by comparing it with the extracted data.

Validating and extracting information from identification numbers by hand is simple in principle, but in practice it becomes infeasible for datasets larger than a few dozen observations. Hetu-package provides easy-to-use tools for programmatic handling of Finnish personal identity codes (henkilötunnus) and Business ID codes (y-tunnus). Hetu-package utilizes R’s efficient vectorized operations and is able to generate and validate over 5 million Finnish personal identity codes or Business Identity Codes in under 10 minutes. This covers the practical upper limit set by the current population of Finland (5.5 million people) and also provides adequate headroom for handling large registry datasets.
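To make the validation and extraction step concrete, here is a minimal sketch in Python of how a Finnish personal identity code can be checked and decoded. It is an illustration of the general scheme, not the package's R interface: the function name `parse_hetu` is hypothetical, and the sketch assumes only the three century markers `+`, `-` and `A`.

```python
from datetime import date

# Control characters, indexed by (nine-digit number) mod 31.
CHECK_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY"
# Century markers covered by this sketch (assumption: no other markers occur).
CENTURIES = {"+": 1800, "-": 1900, "A": 2000}


def parse_hetu(code):
    """Return (birth_date, sex) for a valid code, or None if it is invalid."""
    if len(code) != 11 or code[6] not in CENTURIES:
        return None
    ddmmyy, individual, check = code[:6], code[7:10], code[10]
    digits = ddmmyy + individual
    if not (digits.isascii() and digits.isdigit()):
        return None
    # The control character guards against typing errors.
    if CHECK_CHARS[int(digits) % 31] != check:
        return None
    try:
        year = CENTURIES[code[6]] + int(ddmmyy[4:])
        birth = date(year, int(ddmmyy[2:4]), int(ddmmyy[:2]))
    except ValueError:
        return None  # e.g. day 31 in a 30-day month
    # Odd individual numbers are assigned to males, even ones to females.
    sex = "male" if int(individual) % 2 else "female"
    return birth, sex
```

Applied to the widely used artificial example code `131052-308T`, the sketch yields a birth date of 13 October 1952 and the sex "female"; changing the final character makes the code fail validation.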

Privacy concerns can push Finland and other Nordic countries towards redesigning their national identification numbers to omit the embedded personal information sometime in the future, but policy changes will be closely monitored and, if necessary, the package functions will be adjusted accordingly.

ID: 150 / ep-01: 2
Elevator Pitch
Topics: Data visualisation
Keywords: feature selection

Visualising variable importance and variable interaction effects in machine learning models.

Alan Inglis1, Andrew Parnell1, Catherine Hurley2

1Hamilton Institute, Maynooth University; 2Dept. of Mathematics and Statistics, Maynooth University

Variable importance, interaction measures and partial dependence plots are important summaries in the interpretation of statistical and machine learning models. In our R package vivid (variable importance and variable interaction displays) we create new visualisation techniques for exploring these model summaries. We construct heatmap and graph-based displays showing variable importance and interaction jointly, which are carefully designed to highlight important aspects of the fit. We also construct a new matrix-type layout showing all single and bivariate partial dependence plots, and an alternative layout based on graph Eulerians focusing on key subsets. Our new visualisations are model-agnostic and are applicable to regression and classification supervised learning settings. They enhance interpretation even in situations where the number of variables is large and the interaction structure complex. In this work we demonstrate our visualisation techniques on a data set and explore and interpret the relationships provided by these important summaries.

ID: 279 / ep-01: 3
Elevator Pitch
Topics: Ecology
Keywords: landscape ecology, spatial data, remote sensing, wetland ecology, wildfire

Obtaining reproducible reports on satellite hotspot data during a wildfire disaster

Natalia Soledad Morandeira1,2

1University of San Martín, Environmental Research and Engineering Institute, Argentine Republic; 2CONICET (National Scientific and Technical Research Council, Argentina)

Wildfires can be monitored and analyzed using thermal hotspot records derived from satellite data. In 2020, the Paraná River floodplain (Argentina) suffered from a severe drought, and thousands of hotspots associated with active fires were reported by the Fire Information for Resource Management System (FIRMS-NASA). FIRMS-NASA products are provided as spatial objects (shapefiles), including recent and archive records from several sensors (VIIRS and MODIS). I aimed to handle these data, analyze the number of hotspots during 2020, and compare the disaster with the situation in previous years. Using sf, tidyverse, janitor, stringr, spdplyr, ggplot2 and R Markdown, I imported and pre-processed the spatial objects, generated plots, and obtained reproducible reports. I used R to handle satellite data, monitor the number of active fires, and detect which wetland areas were being affected: this allowed me to quickly respond to peers and journalists about how the wildfires were evolving.

As a case study, I summarize the 2020 outputs for my study area, the Paraná River Delta (19,300 km²). A total of 39,821 VIIRS thermal hotspots were detected, with August (winter in the Southern Hemisphere) accounting for 39.8% of the whole year’s hotspots. While VIIRS data (resolution: 375 m) is available from 2012, MODIS data is available from 2001. However, MODIS resolution is 1 km, so fewer hotspots are reported and each hotspot corresponds to a greater area. The cumulative MODIS hotspots recorded during 2020 were 8,673, the highest number of hotspots of the last 11 years. However, MODIS hotspots detected in 2020 were 62.9% of those recorded during 2008. All the plots were produced in English and Spanish versions, showing daily and cumulative hotspots, monthly summaries, and a comparison with hotspots detected in previous years. My workflow can be used to analyze thermal hotspot data in any other area of interest.
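The daily, cumulative and monthly summaries mentioned above can be sketched in a few lines. The example below is a simplified Python illustration, not the author's R workflow: the dates are made-up toy data, and the function names are hypothetical.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Hypothetical acquisition dates, one entry per detected hotspot (toy data).
hotspot_dates = [
    date(2020, 7, 30),
    date(2020, 8, 1),
    date(2020, 8, 1),
    date(2020, 8, 15),
    date(2020, 9, 2),
]


def daily_and_cumulative(dates):
    """Return (day, hotspots that day, running total) in chronological order."""
    daily = sorted(Counter(dates).items())
    totals = accumulate(count for _, count in daily)
    return [(day, count, total) for (day, count), total in zip(daily, totals)]


def monthly_share(dates):
    """Return the percentage of all hotspots that fall in each month."""
    per_month = Counter(d.month for d in dates)
    return {m: 100 * c / len(dates) for m, c in sorted(per_month.items())}
```

With real data, the dates would come from the acquisition-date field of the hotspot records, and the monthly shares are the kind of figure reported above for August.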

ID: 117 / ep-01: 4
Elevator Pitch
Topics: Statistical models
Keywords: clustering

Modeling spatio-temporal point processes with nphawkes package

Peter Boyd, Dr. James Molyneux

Oregon State University

As the literature on Hawkes processes grows, the use of such models continues to expand, encompassing a wide array of applications such as earthquakes, disease spread, social networks, neuron activity, and mass shootings. As new implementations are explored, correctly parameterizing the model is difficult with a dearth of field-specific research on parameter values, thus creating the need for nonparametric models. The model independent stochastic declustering (MISD) algorithm accomplishes this task through a complex, computationally expensive algorithm. In the package nphawkes, I have employed Rcpp functionalities to create a quick and user-friendly approach to MISD. The nphawkes R package allows users to analyze data in time or space-time, with or without a mark covariate, such as the magnitude of an earthquake. We demonstrate the use of such models on an earthquake catalog and highlight some features of the package such as using stationary/nonstationary background rates, model fitting, visualizations, and model diagnostics.
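For readers unfamiliar with the model class, the purely temporal case can be written down compactly. The notation below is the standard textbook form and an assumption of this sketch (constant background rate $\mu$, triggering function $g$); it is not taken from the package documentation.

```latex
% Conditional intensity of a temporal Hawkes process:
% background rate plus contributions triggered by earlier events.
\lambda(t \mid \mathcal{H}_t) = \mu + \sum_{i \,:\, t_i < t} g(t - t_i)

% Stochastic declustering: probability that event j was triggered by an
% earlier event i, and probability that event j is a background event.
p_{ij} = \frac{g(t_j - t_i)}{\lambda(t_j \mid \mathcal{H}_{t_j})} \quad (i < j),
\qquad
p_{jj} = \frac{\mu}{\lambda(t_j \mid \mathcal{H}_{t_j})}
```

In a nonparametric approach, $g$ is not given a parametric form; it is typically estimated as a step function by alternating between updating the probabilities $p_{ij}$ and re-estimating $\mu$ and $g$ from them.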

ID: 204 / ep-01: 5
Elevator Pitch
Topics: Spatial analysis
Keywords: Functional Programming, Spatial Point Pattern Analysis, Parallelization, tidy code

Use Case: Functional Programming and Parallelization in Spatial Point Pattern Analysis

Clara Chua, Tin Seong Kam

Singapore Management University, Singapore

Performing Spatial Point Pattern Analysis (SPPA) can be computationally intensive for larger data sets, or data with non-uniform observation windows. It can take a day or more to run a dataset of 7,000 points. There is also often a need to repeatedly apply the same method to different cuts of data (e.g. running the same tests for different regions, subtypes), or when mapping and visualising the results of the analysis.

Part of my project looks at SPPA of Airbnb listings in Singapore, using an envelope simulation of Ripley’s K-function test from the spatstat package to determine if there is clustering in specific subregions.

In my talk I will briefly explain the K-test and compare the performance of a for-loop implementation with a functional-programming implementation using purrr, as well as the performance of the standard K-test and the K-test using the Fast Fourier Transform. I show that using functionals helps to break the analysis down into tidier chunks, which results in tidier code and better reproducibility down the road. Computation time may sometimes be shorter with functional programming.

Despite this, there is still a need for parallelization for spatial analysis of larger datasets. There are no in-built parallelization methods in spatstat. Parallelization also depends on the OS: functions such as `mclapply` from the base parallel package work on Mac and Linux, but not on Windows. Hence, I will also share my efforts to parallelize the envelope simulations in an OS-agnostic way.
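As a rough illustration of the functional style described in this abstract, the Python sketch below computes a naive estimate of Ripley's K (no edge correction, which is a simplifying assumption) and maps it over several subregions. It is not the spatstat implementation; the function names and the `regions` layout are hypothetical.

```python
from math import dist


def ripley_k(points, area, r):
    """Naive K estimate: area * (ordered pairs within distance r) / (n * (n - 1))."""
    n = len(points)
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and dist(p, q) <= r
    )
    return area * close_pairs / (n * (n - 1))


def k_by_region(regions, r):
    """Apply the same estimator to every subregion with map instead of a loop."""
    names = list(regions)
    values = map(lambda name: ripley_k(regions[name]["points"],
                                       regions[name]["area"], r), names)
    return dict(zip(names, values))
```

Because each subregion is handled by one independent function call, the `map` step is the natural place to swap in a parallel map when the datasets grow, without touching the estimator itself.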

ID: 135 / ep-01: 6
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: Clinical trials, Bayesian modeling, Shiny

Predicting the COVID-19 Pandemic Impact on Clinical Trial Recruitment at GSK

Valeriia Sherina, Nicky Best, Graeme Archer, Jack Euesden, Dave Lunn, Inna Perevozskaya, Doug Thompson, Magda Zwierzyna


The COVID-19 pandemic required an unprecedented response from the pharmaceutical industry, both in terms of developing new antiviral medicines and vaccines as rapidly and safely as possible, and in continuing to deliver its existing portfolio of important new medicines. As many countries went into lockdown to slow the spread of the disease, sponsors faced the twin dilemma of replanning study delivery on the fly while rebalancing their portfolios to meet the emergent medical need. Our multidisciplinary team at GSK tackled the problem of delayed recruitment due to the pandemic. We aggregated external data on the pandemic across 42 countries into epidemiological forecasts and designed a novel Bayesian hierarchical model with 3 levels (site initiation, patient screening, and patient randomization) to link classical recruitment predictions with epidemiological COVID-19 predictions. We obtained COVID-adjusted estimates of time to achieve key recruitment milestones via forward sampling from posterior distributions of the model parameters. The results of this exercise were summarized and deployed in a user-friendly Shiny application to assist study teams with recruitment planning in the face of the pandemic. Here we showcase the results of the effective collaboration of statisticians and data scientists and how it fits into the decision-making framework in clinical operations.

ID: 112 / ep-01: 7
Elevator Pitch
Topics: Databases / Data management
Keywords: agriculture

The Grammar of Experimental Design

Emi Tanaka

Monash University

The critical role of data collection is well captured in the expression "garbage in, garbage out": if the collected data is rubbish, then no analysis, however complex it may be, can make something out of it. The gold standard for data collection is through well-designed experiments. Re-running an experiment is generally expensive, whereas re-doing a statistical analysis is generally low-cost, so the stakes of getting an experimental design wrong are higher. But how do we design experiments in R? In this talk, I present my R-package edibble that implements a novel framework, which I refer to as the "grammar of experimental design", to facilitate the data collection and design of an experiment. The grammar builds the experimental design by describing the fundamental components of the experiment. Because the grammar resembles a natural language, there is greater clarity about the experimental structure, and the grammar also covers considerations beyond the construction of the experimental layout. I will reconstruct some experimental layouts using edibble and compare it with other popular R-packages.

ID: 243 / ep-01: 8
Elevator Pitch
Topics: Spatial analysis
Keywords: open source

rspatialdata: a collection of data sources and tutorials on downloading and visualising spatial data using R

Varsha Ujjini Vijay Kumar1,2, Dilinie Seimon1,2, Paula Moraga2

1Faculty of Business and Economics, Monash University, Australia; 2Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia

Open and reliable data, analytical tools and collaborative research are crucial for solving global challenges and achieving sustainable development. Spatial and spatio-temporal data are used in a wide range of fields including health, social and environmental disciplines to improve the evaluation and monitoring of goals both within and across countries.

To meet the increasing need for easy access to reliable spatial and spatio-temporal data from R, many R packages have recently been developed as clients for various spatial databases and repositories. While documentation and many open source repositories on how to use these packages exist, there is an increased need for a one-stop repository for this information.

In this talk, we present rspatialdata, a website that provides a collection of data sources and tutorials on downloading and visualising spatial data using R. The tutorials cover a wide range of datasets, including administrative boundaries of countries, Open Street Map data, population, elevation, temperature, vegetation and malaria data.

The website can be considered a useful resource for individuals working with problems that require spatial data analysis and visualisation, such as estimating air pollution, quantifying disease burdens, predicting species occurrences, and evaluating and monitoring the UN Sustainable Development Goals.

ID: 361 / ep-01: 9
Elevator Pitch
Topics: Web Applications (Shiny/Dash)

Moved to Session 2

ID: 157 / ep-01: 10
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: teaching, machine learning, data science, undergraduate

Teaching Advanced Data Science in R: Successes and Failures

Lisa Lendway

Macalester College, United States of America

I am about to embark on teaching a new course I named Advanced Data Science in R. This is a capstone course being taught to undergraduates at Macalester College, a small liberal arts college in the US. The course expands on what students learned in an Intro Data Science course focused on data visualization and wrangling using the tidyverse and a Statistical Machine Learning course focused on machine learning algorithms implemented via the caret package.

Students will review machine learning algorithms while learning the tidymodels packages. They will become familiar with the larger machine learning process by using data from a database within RStudio and implementing the basics of putting a model into production. Other components of the class include practicing reproducible research by using Git and GitHub in RStudio; creating a website using R Markdown, distill, or blogdown; and building shiny apps. In order for the students to become more comfortable learning new R packages on their own, in groups of three, they will teach the class about a package or set of functions of their choosing. They will create materials that students can refer back to and will also write homework problems and solutions. The last few weeks of the course will be dedicated to working on group projects where they will be expected to implement the skills they learned in the course.

Since this course hasn’t started yet, I do not yet know what the successes and failures will be. I will share the course website and will reflect on how it went and the changes I will make for the future.

ID: 234 / ep-01: 11
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: annotation

rRbiMs, r tools for Reconstructing bin Metabolisms.

Mirna Vázquez Rosas Landa, Valerie de Anda Torres, Sahil Shan, Brett Baker

The University of Texas at Austin

Although microorganisms play a crucial role in biogeochemical cycles and ecosystems, most of their diversity is still unknown. Understanding the metabolic potential of novel microbial genomes is an essential step to access this diversity. Most annotation pipelines use predefined databases to add the functional annotation. However, they do not allow the user to add metadata information, making it challenging to explore the metabolism based on a specific scientific question. Our motivation to build the package rRbiMs is to create a workflow that helps researchers to explore metabolism data in a reproducible manner. rRbiMs reads different database outputs and includes a new custom-curated database that the user can easily access. Its module-based design allows the user to choose between running the whole pipeline or just part of it. Finally, a key feature is that it facilitates the incorporation of metadata such as taxonomy and sampling data, allowing the user to answer specific questions of biological relevance. rRbiMs is a user-friendly R workflow that allows performing reproducible and accurate microbial metabolism analyses. We are still developing the package and look forward to submitting it to R/Bioconductor and making it available to the research community.

ID: 202 / ep-01: 12
Elevator Pitch
Topics: Statistical models
Keywords: relative weights analysis, Key Drivers Analysis, residualization, nonlinear, main effect, interaction

ResidualRWA: Detecting relevant variables using relative weight analysis with residualization

Maikol Solís, Carlos Pasquier

Universidad de Costa Rica, Escuela de Matemática, Centro de Investigación en Matemática Pura y Aplicada, Costa Rica

In statistical models, determining the most influential variables is a recurring task. A common technique is relative weights analysis (RWA), also known as Key Drivers Analysis. The method, described in Tonidandel & LeBreton (2015), uses an orthonormal projection of the data to determine which variables have the most impact on the model. Packages like “rwa” or “flipRegression” handle the situation where the model contains multiple linear primary effects.

To extend those packages, we present the novel package “ResidualRWA”. This new package implements relative weight analysis with the possibility to handle complex models with nonlinear effects (via restricted splines) and nonlinear interactions. The package residualizes each interaction appropriately so that the main effects and the pure interaction effects are reported correctly.

The interactions are residualized against the primary effects as described in LeBreton et al. (2013). This step is necessary to remove the influence of the main variables on the interactions. This way, we can separate the effect of the main variables of a model from the pure interaction effect due to the true synergy of the variables.
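For two predictors, the residualization step can be written as follows. This is a sketch of the general idea for a simple product interaction, with notation chosen here for illustration; the package's treatment of spline terms is not covered.

```latex
% Regress the product term on its constituent main effects ...
x_1 x_2 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + e

% ... and use the fitted residual as the "pure" interaction predictor.
\tilde{x}_{12} = x_1 x_2 - \left(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2\right)
```

By construction, $\tilde{x}_{12}$ is uncorrelated with $x_1$ and $x_2$ in the sample, so any weight assigned to it cannot be attributed to the main effects.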

The package “ResidualRWA” handles the fit of the model, the residualization of interactions and the relative weight analysis estimation. It also reports the results to the user through easy-to-read tables and graphics.

In this presentation we will show the capabilities of this package and test it with some simulated and real data.

* References *

LeBreton, J. M., Tonidandel, S., & Krasikova, D. V. (2013). Residualized Relative Importance Analysis. Organizational Research Methods, 16(3), 449–473.

Tonidandel, S., & LeBreton, J. M. (2015). RWA Web: A Free, Comprehensive, Web-Based, and User-Friendly Tool for Relative Weight Analyses. Journal of Business and Psychology, 30(2), 207–216.

ID: 203 / ep-01: 13
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: survey, serology, serological-survey, srvyr, tidyverse

serosurvey R package: Serological Survey Analysis For Prevalence Estimation Under Misclassification

Andree Valle-Campos

Centro Nacional de Epidemiología Prevención Control Enfermedades CDC Perú, Peru

Population-based serological surveys are fundamental to quantify how many people have been infected by a certain pathogen and where we are in the epidemic curve. Various methods exist to estimate prevalence while accounting for misclassification due to an imperfect diagnostic test whose sensitivity and specificity are known with certainty. However, during the first months of an outbreak of a novel pathogen like the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), diagnostic test performance is “unknown” or “uncertain” because few validation studies are available. During the pandemic, the increased demand for prevalence estimates and the absence of standardized procedures to address this issue went along with the use of different methods and non-reproducible workflows for SARS-CoV-2 serological surveys within proprietary software or non-public R code repositories. Given this scenario, we created the serosurvey R package to gather Serological Survey Analysis functions and workflow templates for Prevalence Estimation Under Misclassification. We provide functions to calculate single prevalences using the srvyr package environment and generate tidy outputs for a hierarchical Bayesian approach that incorporates the unknown test performance from Larremore et al. [bioRxiv, May 2020]. We applied them in a reproducible workflow with the purrr and furrr R packages to efficiently iterate and parallelize this step for multiple prevalences. We tested it by simulating the use of an imperfect test to measure an outcome within a multi-stage sampling survey design, incorporating the test uncertainty into the sampling design uncertainty. The serosurvey R package could therefore facilitate the generation of prevalence estimates for the current pandemic and improve our preparedness for future ones. In conclusion, the serosurvey R package reduces the reproducibility gap of serological survey analysis using diagnostic tests with unknown performance within a free software environment.
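For the simpler case mentioned at the start of this abstract, where sensitivity and specificity are treated as known, one classical correction is the Rogan-Gladen estimator. The Python sketch below illustrates that estimator only; it is not the package's interface, it does not implement the hierarchical Bayesian approach for unknown test performance, and the function name is hypothetical.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct an apparent (test-positive) prevalence for misclassification.

    Assumes sensitivity and specificity are known exactly.
    """
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    adjusted = (apparent_prevalence + specificity - 1) / denominator
    # The raw estimate can fall outside [0, 1]; truncate it to a valid proportion.
    return min(1.0, max(0.0, adjusted))
```

For example, an apparent prevalence of 10% measured with a test of 90% sensitivity and 95% specificity corresponds to an adjusted prevalence of about 5.9%, because roughly half of the positives are expected to be false positives.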

ID: 191 / ep-01: 14
Elevator Pitch
Topics: Interfaces with other programming languages
Keywords: HTTP, API, reprex

httrex: Help Debugging HTTP API Clients

Greg Daniel Freedman Ellis, United States of America

HTTP APIs are a fantastic way to collaborate across programming language divides. It does not matter what language the server's infrastructure is written in; an R user can write code to bring the data into R. There are many packages on CRAN, and presumably even more internally developed packages in organizations, that are wrappers of web APIs, abstracting over some of the details of APIs to make a friendlier interface for R users. When everything is working as intended, this is the ideal way to work, with users not needing to worry about the inner workings of APIs on a day-to-day basis. But when there's a bug, it can be hard to figure out the source because of the layers of abstraction and potentially the programming language barrier between the user and API owner. The httrex package provides tools to debug R code based on HTTP APIs and communicate what the code is doing in a language-agnostic way.

The httrex package explores two approaches for this problem. The first is a shiny app that runs alongside your current R session and tracks the code you run and the API calls that the code makes and visualizes each step of that process. The second creates a document inspired by the reprex package that reruns code in a new environment and includes the HTTP API calls interspersed with the R code and output. With either of these two approaches users can better understand the HTTP requests that their code is making and share with API owners in a way that doesn't require them to understand R.

ID: 213 / ep-01: 15
Elevator Pitch
Topics: Efficient programming
Keywords: CLI (command line interface)

Reading, Combining, and Pre-Filtering Files with R and AWK

David Shilane, Mayur Bansal, Chung Woo Lee

Columbia University

The authors have been developing a software package named awkreader that provides an R interface to utilize AWK for the purpose of combined and pre-filtered file reading.

The proposed software is designed to solve a number of problems. For one or more files of the same column structure, we will demonstrate a method to read and bind the data. A vector of names can be supplied to limit the data to selected columns. Pre-filtering the data can then be achieved either through a specification of patterns to match or through logical inclusion criteria. The program then performs a translation to build a corresponding coding statement in AWK. Importantly, the logical criteria can be directly specified in syntax that is familiar to R’s users. In addition to reading in the data, the code can also return its translations to AWK. Combined data sets can also include a column that identifies the source file.

The awkreader package creates a variety of novel capabilities. It reduces the computational complexity of filtering and binding data sets. Targeted queries can search a wider range of data than what might otherwise be loaded into R because the filters are applied in the reading process. A single statement can replace the labor and code of reading, binding, and filtering multiple files. Users can benefit from AWK for file reading without having to learn its syntax. Those who are interested in learning AWK will be able to generate working examples that correspond to their more familiar setting of programming in R.

The presentation will showcase the benefits of the awkreader package. We will provide some details on the translation process, particularly for logical subsetting operators such as %in% and %nin%. We will also demonstrate examples of the code in data processing applications.
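To give a feel for what such a translation might look like, the Python sketch below turns a membership test in the spirit of `%in%` (or its negation, `%nin%`) into an AWK condition on a numbered field. This is a hypothetical illustration of the idea, not the awkreader implementation: the function names are invented here, values are assumed to contain no double quotes, and files are assumed to be plain comma-separated with a shared header row.

```python
def in_to_awk(column, values, header, negate=False):
    """Translate `column %in% values` (or %nin% when negate=True) to AWK."""
    field = header.index(column) + 1  # AWK fields are numbered from 1
    tests = " || ".join('${} == "{}"'.format(field, v) for v in values)
    condition = "({})".format(tests)
    return "!" + condition if negate else condition


def awk_program(condition):
    """Keep the header of the first file only, then rows matching the condition."""
    return "FNR == 1 { if (NR == 1) print; next } " + condition
```

For a header `id,state,amount`, the filter `state %in% c("NY", "NJ")` becomes `($2 == "NY" || $2 == "NJ")`, and the resulting program can be passed to `awk -F,` followed by one or more file names so that filtering happens while the files are read.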

ID: 131 / ep-01: 16
Elevator Pitch
Topics: Social sciences
Keywords: data management


Roberto Delgado Castro


FODESAF, administered by DESAF (part of the Ministry of Labor and Social Security), is Costa Rica's and Latin America's largest public social investment fund. It transfers around US$1,000 million per year (2% of local GDP) to a wide variety of social programs nationwide.

Local employers and the Ministry of Treasury (Government) fund it through monthly financial contributions.

Along with the monthly financial transfers, a large database with specific information about contributors is attached. Since 1978, the year FODESAF was established, DESAF has not had the opportunity to classify and analyze this crucial data.

A data science project was developed in SQL and RStudio to implement a data pipeline, in which all data was loaded to break down its key elements, in order to improve the decision-making capabilities of DESAF's authorities.

As inputs, annual databases from 2003 to 2019 (17 years) were loaded into the pipeline separately (250,000 records per year, 30.8 million in total). The results were automatic R Markdown reports with brand-new data frames and visualizations for each year, which helped authorities visualize and analyze elements that had not been seen in 43 years of DESAF's history.

After its implementation, the local government now has a unique, recurrent data science tool to improve the management of financial contributions to the fight against poverty. As key learnings, data-taming skills were strengthened, project questions were defined as the strategy for structuring the project's code, a brand-new Contributors Mass Report was developed, and the project has been used as a follow-up instrument for economic recovery in the post-pandemic era.

Keywords: Data Pipeline, data-taming, visualizations, automatic-reports, poverty.

ID: 114 / ep-01: 17
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: sleep, chronobiology, mctq, data wrangling, reproducibility

mctq: An R Package for the Munich ChronoType Questionnaire

Daniel Vartanian, Ana Amélia Benedito-Silva, Mario Pedrazzoli

School of Arts, Sciences and Humanities (EACH), University of Sao Paulo (USP), Sao Paulo, Brazil

mctq is an R package that provides a complete and consistent toolkit to process the Munich ChronoType Questionnaire (MCTQ), a quantitative and validated method to assess people's sleep behavior presented by Till Roenneberg, Anna Wirz-Justice, and Martha Merrow in 2003. The aim of mctq is to facilitate the work of sleep and chronobiology scientists with MCTQ data while also helping with research reproducibility.

Although it may look like a simple questionnaire, MCTQ requires a lot of date/time manipulation. This poses a challenge for many scientists, since most people have difficulties with date/time data, especially when dealing with an extensive dataset. The mctq package addresses this issue. mctq can handle the processing tasks for the three MCTQ versions (standard, micro, and shift) with few dependencies, relying for much of its functionality on the lubridate and hms packages from the tidyverse. We also designed mctq with the user experience in mind, by creating an interface that resembles the way the questionnaire data is shown in MCTQ publications, and by providing extensive and detailed documentation about each computation proposed by the MCTQ authors. The package also includes several utility tools to deal with different time representations (e.g., decimal hours, radians) and time arithmetic issues, along with fictional datasets for testing and learning purposes.
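The date/time difficulty is easiest to see with sleep that crosses midnight. The Python sketch below illustrates three basic MCTQ-style computations (sleep onset, sleep duration, and mid-sleep) using clock times represented as offsets from midnight. It is an illustration of the arithmetic only, with hypothetical function names, and is not the mctq package's R interface.

```python
from datetime import timedelta

DAY = timedelta(hours=24)


def sleep_onset(sleep_prep, sleep_latency):
    """Clock time of falling asleep: preparing-to-sleep time plus latency."""
    return (sleep_prep + sleep_latency) % DAY


def sleep_duration(onset, sleep_end):
    """Time asleep; the modulo handles nights that cross midnight."""
    return (sleep_end - onset) % DAY


def midsleep(onset, duration):
    """Clock time halfway between sleep onset and sleep end."""
    return (onset + duration / 2) % DAY
```

For someone who gets ready to sleep at 23:15, needs 15 minutes to fall asleep and wakes at 07:30, this gives a sleep onset of 23:30, a sleep duration of 8 hours and a mid-sleep time of 03:30. A naive subtraction of the two clock times would instead give a negative duration.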

The first stable version of mctq is available for download on GitHub. The package is currently under a software peer-review by the rOpenSci initiative. We plan to submit it to CRAN soon after the review process ends.

ID: 107 / ep-01: 18
Elevator Pitch
Topics: Ecology
Keywords: flow cytometry, cytometric diversity, microbial ecology, {flowDiv} workflow

{flowDiv} workflow: reproducible cytometric diversity estimates

María Victoria Quiroga1, Bruno M. S. Wanderley2, André M. Amado3, Fernando Unrein1

1Instituto Tecnológico de Chascomús (INTECH, UNSAM-CONICET), Argentina; 2Departamento de Oceanografia e Limnologia, Universidade Federal do Rio Grande do Norte, Brazil; 3Departamento de Biologia, Universidade Federal de Juiz de Fora, Brazil

Flow cytometry is widely used in life sciences, as it records data on the morphological or physiological state of thousands of single cells within minutes. Hence, each sample has a characteristic cytometric pattern that can be studied through diversity indexes: evenness, alpha- and beta-diversity. Applying this approach to microbial ecology research frequently involves intricate handling of data recorded with different instrumental settings or sample dilutions. The {flowDiv} package overcomes this through data normalization and volume correction steps before estimating cytometric diversity. Here, we share a reproducible {flowDiv} workflow, hoping to help researchers to streamline flow cytometry data processing.
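As an illustration of what a diversity index measures in this setting, the Python sketch below computes the Shannon index and Pielou's evenness from counts of cells per cytometric bin. Treating bins as the "categories" and using these two particular indexes are assumptions made for the example; the exact indexes and binning used by {flowDiv} may differ, and the function names are hypothetical.

```python
from math import log


def shannon_index(counts):
    """Shannon diversity H = -sum(p * ln p) over non-empty bins."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """Evenness J = H / ln(S), where S is the number of non-empty bins."""
    occupied = sum(1 for c in counts if c > 0)
    if occupied < 2:
        return 0.0  # evenness is not meaningful for a single occupied bin
    return shannon_index(counts) / log(occupied)
```

A sample whose cells are spread equally over four bins reaches the maximum evenness of 1, whereas a sample concentrated in one bin has a Shannon index of 0. This also shows why counts must be comparable across samples (hence the normalization and volume correction) before such indexes are compared.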

ID: 118 / ep-01: 19
Elevator Pitch
Topics: R in production
Keywords: data management

From Sheets to Success: Reliable, Reproducible, yet Flexible Production Pipelines

Katrina Brock


At our growth-stage startup, our data team was challenged to produce accurate and reliable forecasts despite ever-changing product lines. To solve this, we used yaml-based config to specify inputs and models in a standardized way, the validate package to check inputs before running any custom transformations, and the drake package to isolate the business logic for each product line’s forecast. We transformed a process that previously was a patchwork of SQL, R, and hundreds of spreadsheets into a set of robust pipelines that use a standardized set of steps. The reliability frees our team from constant troubleshooting and allows us to focus on building and tuning models to ever-increasing accuracy.

ID: 281 / ep-01: 20
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: survey design, multiwave sampling, Neyman allocation

Efficient multi-wave sampling with the R package optimall

Jasper B. Yang1, Bryan E. Shepherd2, Thomas Lumley3, Pamela A. Shaw1

1Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, U.S.A; 2Department of Biostatistics, Vanderbilt University School of Medicine, Vanderbilt University, Nashville, TN, U.S.A; 3Department of Statistics, University of Auckland, Auckland, New Zealand

When a study population is composed of heterogeneous subpopulations, stratified random sampling techniques are often employed to obtain more precise estimates of population characteristics. Efficiently allocating samples to strata under this method is a crucial step in the study design process, especially when data are expensive to collect. One common approach in epidemiological studies is a two-phase sampling design, where inexpensive variables collected on all sampling units are used to inform the sampling scheme for collecting the expensive variables on a subsample in the second phase. Recent studies have demonstrated that even more precise estimates can be obtained when the second phase is conducted over a series of adaptive waves. Unlike simpler sampling schemes, executing multi-phase and multi-wave designs requires careful management of many moving parts over repetitive steps, which can be cumbersome and error-prone. We present the R package optimall, which offers a collection of functions that efficiently streamline the design process of survey sampling, ranging from simple to complex. The package’s main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum in order to minimize the variance of the target sample mean using Neyman allocation, and select specific IDs to sample based on a stratified sampling design. As the survey is performed, optimall provides a framework for every aspect of the sampling process, including the data and metadata, to be stored in a single object. Although it is particularly tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.
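As an illustration of the Neyman allocation step mentioned above, the following is a minimal Python sketch, not optimall's R interface. It assigns samples to strata in proportion to N_h * S_h (stratum size times within-stratum standard deviation) and uses largest-remainder rounding so the allocations sum to the requested total. The function name, the rounding rule, and the example numbers are assumptions made for illustration; the sketch also ignores practical constraints such as an allocation exceeding a stratum's size.

```python
from math import fsum


def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Allocate n_total samples across strata proportionally to N_h * S_h.

    Hypothetical helper for illustration only; it does not cap an
    allocation at the stratum size.
    """
    if len(stratum_sizes) != len(stratum_sds):
        raise ValueError("sizes and standard deviations must align")
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = fsum(weights)
    if total_weight <= 0:
        raise ValueError("at least one stratum needs positive size and sd")

    # Ideal (fractional) allocation for each stratum.
    raw = [n_total * w / total_weight for w in weights]

    # Largest-remainder rounding: floor everything, then hand the leftover
    # samples to the strata with the largest fractional parts.
    alloc = [int(r) for r in raw]
    leftover = n_total - sum(alloc)
    by_fraction = sorted(
        range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True
    )
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


# Made-up example: three strata with sizes 500, 300, 200 and
# standard deviations 10, 20, 5, and 120 samples to allocate.
example = neyman_allocation([500, 300, 200], [10, 20, 5], 120)
```

In this made-up example the second stratum receives the most samples even though it is not the largest, because its variability is highest.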

Link to package or code repository.

ID: 185 / ep-01: 21
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: explainable machine learning, predictive modeling, interactive visualization, dashboards

Open the Machine Learning Black-Box with modelStudio & Arena

Hubert Baniecki, Piotr Piątyszek

Warsaw University of Technology, Poland

Complex machine learning predictive models, aka black-boxes, demonstrate high efficiency in a rapidly increasing number of applications. Simultaneously, there is a growing awareness among machine learning practitioners that we need more comprehensive tools for model explainability. Responsible machine learning will require continuous model monitoring, validation, and black-box transparency. These challenges can be met with novel frameworks that add automation and interactivity to explainable machine learning pipelines.

In this talk, we present the modelStudio and arenar packages which, at their core, automatically generate interactive and customizable dashboards that allow users to "open the black-box". These tools build upon the DALEX package and are model-agnostic, that is, compatible with most of the predictive models and frameworks in R. The focus is on lowering the entry threshold for crucial parts of today's MLOps practice. We showcase how little coding is needed to produce a powerful dashboard consisting of model explanations and data exploration visualizations. The output can be saved and shared with anyone, further promoting reproducibility and explainability in machine learning practice. Finally, we highlight the features of the Arena dashboard, which specifically aims to compare various predictive models.

ID: 247 / ep-01: 22
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: classification

Classifying Student’s Learning Pattern using R Sequence Analysis Packages: The Impact of Procrastination on Performance

Teck Kiang Tan

National University of Singapore

Sequence analysis is an analytical technique for categorical longitudinal data that incorporates classification procedures for grouping categorical sequences. A framework of sequence analysis will be provided to give an overview of this approach.

This talk shares the findings of a study based on data extracted from the National University of Singapore Learning Management System (LMS), which provides online video learning materials to students. The time spent learning and watching the online videos of a statistics course that runs for 35 days is extracted from the LMS to form sequences, with each student forming a learning sequence of 35 states. Sequence analysis was carried out to classify students’ time-spent learning patterns, and the outcomes of the sequence analysis are linked to students’ performance.

The R package TraMineR and a few cluster analysis packages were used for carrying out sequence analysis and classifying sequences. Fifteen sequence distances were computed and their goodness-of-fit indices were determined to select the best distance metric for classifying students. Four sequence complexity measures were examined to quantify whether students vary their time-use learning patterns as a study strategy. These four measures are sequence entropy, turbulence, complexity, and precarity index. The usefulness of graphing sequence analysis results using the state distribution plot, sequence frequency plot, transversal entropy plot, sequence modal state plot, and representative sequence plot will be shared to point out their relevance in explaining the results of classifying students’ learning sequences. Finally, regression findings will be shared to show how the classification resulting from the sequence analysis, which captures students’ learning procrastination, can be used to determine the degree to which procrastination influences students’ performance.
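To make the first of these complexity measures concrete, here is a minimal Python sketch of within-sequence Shannon entropy, normalized by the logarithm of the alphabet size so that it lies between 0 and 1. It is an illustration under assumptions rather than TraMineR's implementation: the function name and the three state labels ("none", "low", "high") are hypothetical, and only the sequence length of 35 comes from the abstract.

```python
from collections import Counter
from math import log


def sequence_entropy(sequence, alphabet):
    """Normalized Shannon entropy of the state distribution in one sequence.

    Returns 0.0 when a single state fills the whole sequence and 1.0 when
    all states of the alphabet occur equally often.
    """
    n = len(sequence)
    if n == 0 or len(alphabet) < 2:
        return 0.0
    counts = Counter(sequence)
    # Shannon entropy of the observed state proportions.
    h = sum(-(c / n) * log(c / n) for c in counts.values())
    # Divide by the maximum possible entropy for this alphabet.
    return h / log(len(alphabet))


# Hypothetical daily time-spent categories for a 35-day course.
states = ["none", "low", "high"]

# A student who does nothing for 30 days and crams in the last 5.
procrastinator = ["none"] * 30 + ["high"] * 5

# A student who spends a low amount of time every day.
steady = ["low"] * 35

entropy_procrastinator = sequence_entropy(procrastinator, states)
entropy_steady = sequence_entropy(steady, states)
```

Entropy depends only on how much time a sequence spends in each state, not on the order of the states, which is one reason order-sensitive measures such as turbulence are examined alongside it.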

ID: 200 / ep-01: 23
Elevator Pitch
Topics: Data visualisation
Keywords: outliers, dataviz, ggplot2, data communication, blogdown

Outlier redemption

Violeta Roizman

Self-employed, France

In Data Science, outliers are usually perceived negatively, as a problem to solve prior to the data analysis. Many tutorials, blog posts and papers are written to offer tips about how to get rid of them. There is even the whole area of Robust Statistics focused on techniques that try to ignore outliers. However, I’ve discovered that outliers can be a very exciting part of the data. Outliers are quite often fun facts about the data. And I love fun facts. I love them so much that I collect all these fun-data-facts in a website called outlier redemption, where I only publish short data-driven stories about them. For each story, I publish the data and the R code used to produce the graphics and simulations. In this short talk, I will start by introducing the positive side of outliers. After that, I will present different visual ways in which we can spot and highlight outliers with R. I will finish with some of the fun outlier stories that I encountered while analyzing data.

ID: 115 / ep-01: 24
Elevator Pitch
Topics: Reproducibility

The Canyon of Success: Scaling Best Practices in R with Internal R packages

Malcolm Barrett

Teladoc Health

At useR! 2016, Hadley Wickham said that what he sought to design in the tidyverse was (after Jeff Atwood) a pit of success: it should be easy for users to succeed. They shouldn’t have to trudge uphill to use your tools effectively. They should fall right into the pit of success--no guard rails! In this talk, I’ll discuss how we use internal R packages to scale best practices across our data science and analytics teams at Teladoc Health. Using R packages has allowed new team members to quickly onboard, set up their work environments, and create reproducible work that aligns with our standards. Our sets of packages include opinionated, usethis-style workflows, database connections, reporting, and more. Designing these tools to make it easy to succeed has become a keystone in our design approach, allowing us to scale our practices with less intervention.

ID: 102 / ep-01: 25
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: Cancer, simulation, mathematical modeling

Modeling tumor evolutionary dynamics with SITH

Phillip Nicol1, Amir Asiaee2

1Harvard University, United States of America; 2Ohio State University, United States of America

A tumor of clinically detectable size is the result of a decades-long evolutionary process of cell replication and mutation. Mathematical models of cancer growth are becoming increasingly valuable for situations where high-quality clinical data is unavailable. We designed "SITH," a CRAN package that implements a stochastic spatial model of tumor growth and mutation. The goal of "SITH" is to provide fast simulations with a convenient interface for researchers interested in cancer evolution. The core simulation algorithm is written in C++ and integrated into R using the "Rcpp" framework. 3D interactive visualizations of the simulated tumor (using the "rgl" package) allow users to investigate the spatial location of various subpopulations. "SITH" also provides functions for analyzing the amount of heterogeneity present in the tumor. Finally, "SITH" can create simulated DNA sequencing datasets. This feature may be helpful for researchers interested in understanding how spatial heterogeneity can bias observed data.

Link to package or code repository.

ID: 225 / ep-01: 26
Elevator Pitch
Topics: Reproducibility
Keywords: ecology

A Glimpse into the Reproducibility of Scientific Papers published in Movement Ecology: How are we doing?

Jenicca Poongavanan, Rocio Joo Arakawa, Mathieu Basille

University of Florida

Reproducibility is the hallmark of science, and thus of Movement Ecology as well. However, studies in disciplines such as biology and geosciences have shown that published work is rarely reproducible. Ensuring reproducibility is not a mandatory part of the research process, and thus there are no clear procedures in place to assess the reproducibility of scientific articles. In this study we put forward a reproducibility workflow scoring sheet based on six criteria that lead to successfully reproducible papers. The reproducibility workflow can be used by authors to evaluate the reproducibility of their studies before publication and by reviewers to evaluate the reproducibility of scientific papers. To assess the state of reproducibility in Movement Ecology, we attempted to reproduce the results from Movement Ecology papers that use behavioral pattern identification methods. We selected 75 papers published in several journals from 2010-2020. According to our proposed reproducibility workflow, sixteen studies reflected at least some reproducibility (scores ≥ 4). In particular, we were only able to obtain the data for 16 out of 75 papers. Of these, a minority also provided code with the data (6 out of the 16 studies). Of the 6 studies that made both data and code available, only four reflected a high level of reproducibility (scores ≥ 9), owing to good code annotation and execution. Based on our findings, we propose guidelines for authors, journals and academic institutions to enhance the state of reproducibility in Movement Ecology.

ID: 180 / ep-01: 27
Elevator Pitch
Topics: Time series
Keywords: Time series, Forecasting, Business/Industry, Economics, R programming

Forecasting under COVID: when simple works - and when it doesn’t

Maryam Shobeirinejad1, Steph Stammel2

1Transurban, Australia; 2Transurban, Australia

The worldwide impact of COVID-19 represented a global change point across almost all areas of life. As a result, forecasting is presenting new challenges for us to manage. Fit-for-purpose time series forecasts can range from deceptively simple but highly useful models to eye-wateringly complex ones that require expertise to implement correctly and usefully. In addition, many of the exogenous variables used to improve forecast models (such as macroeconomic data) are compromised by the same global events. This talk discusses how we have dealt with these heightened challenges in corporate forecasting, balancing highly interpretable solutions with few assumptions (for example seasonal decomposition models) against complex modelling that explicitly manages a rapidly changing environment (e.g. state space switching models). The balancing of different forecast model features (accuracy, interpretability, required assumptions etc.) with the needs of stakeholders is a subject of considerable empirical review within our team. In a corporate context, this is balanced with tight time frames, a need to rapidly fit, estimate and compare large numbers of models, and a need to come to a solution that meets the needs of a complex array of users.

This presentation will discuss how the tidyverts ecosystem (tsibble, feasts, etc.) has been used to test and iterate on batches of models to achieve robust solutions to time-sensitive business problems. We will discuss some of our findings in the context of a cost-benefit framework as we traded off the key features of a forecast model in search of the ‘fit for purpose’ solution.

ID: 209 / ep-01: 28
Elevator Pitch
Topics: R in the wild (unusual applications of R)
Keywords: GIS

gatpkg: Developing a geographic aggregation tool in R for non-programmers

Abigail Stamm

New York State Department of Health Bureau of Environmental and Occupational Epidemiology

The Geographic Aggregation Tool (GAT) was developed in R to simplify and standardize the development of small geographic areas that meet the minimum thresholds required to provide stable and meaningful population measures. To improve usability, we converted GAT to an R package and designed it to be accessible to non-programmers with little to no experience in R. To run GAT, users need only install the package and run one line of R code. GAT’s user-friendly interface offers a series of dialogs for the user to select their options and saves all files without requiring additional code. The package includes documentation on the files created and guidance on interpreting them and on addressing aggregation issues. It also includes several examples using embedded shapefiles, a tutorial, and detailed documentation for advanced users interested in modifying or enhancing the tool. This talk will provide an overview of the tool, how it works, and the planning that informed its development, keeping accessibility and reproducibility in mind.

Link to package or code repository.

ID: 278 / ep-01: 29
Elevator Pitch
Topics: R in production
Keywords: HIV/AIDS, package, analysis, automation

tidyndr: An R package for analysis of the Nigeria HIV National Data Repository

Stephen Taiye Balogun, Scholastica Olanrewaju, Oluwaseun Okunuga, Temitope Kolade, Geraldine Chizoba Abone, Fati Murtala-Ibrahim, Helen Omuh

Institute of Human Virology Nigeria, Nigeria

Nigeria, which has the fourth-largest HIV epidemic in the world, is central to achieving the UNAIDS target of epidemiologic control of HIV/AIDS by 2030. Data on 1.3 million HIV-positive patients on treatment in the country are stored centrally in the National Data Repository (NDR). Using access levels, these data are accessible to the Government of Nigeria, donor agencies, implementing partners and other stakeholders to track progress and improve HIV programming. To achieve this, the data must be cleaned, processed, summarized, and communicated for easy recognition of progress and identification of gaps for tailored intervention. The analysis is traditionally conducted in Microsoft Excel using a file downloaded from the NDR. This means that Excel must be installed on the user’s computer, and the user must be familiar with the formulas for calculating the various indicators. This approach lacks reproducibility and is error-prone, with errors occasionally going unnoticed. Performing the same analysis periodically can also be quite tedious and time-consuming.

The tidyndr package eliminates these bottlenecks by improving user-friendliness and automating routine analysis, saving several man-hours in the process while eliminating individual errors. The functions are grouped into four categories: importing, treatment, supporting, and summary functions. Together, these ensure that patient-level data are consistently imported into R, subset based on specific indicators, and summarized in tables, both aggregated and disaggregated, in line with the national requirements.

The output from this process is very useful to improve program performance and help achieve epidemiologic control of the virus at local, state, and national levels. With continued national efforts to provide patient-level information for HIV prevention and other services, the package can be scaled-up to support the analysis of these data. Finally, it provides a foundation upon which other relevant program applications can be built.

Link to package or code repository.

ID: 259 / ep-01: 30
Elevator Pitch
Topics: Data visualisation
Keywords: color accessibility, data visualization, data organization, cvd accessible colors, microbiome

Do you see what I see? Introducing microshades: An R package for improving color accessibility and organization of complex data

Lisa Karstens, Erin Dahl, Emory Neer

Oregon Health & Science University, United States of America

Approximately 300 million people in the world have Color Vision Deficiency (CVD). When creating figures and graphics with color, it is important to consider that individuals with CVD will interact with this material and may incorrectly perceive information associated with color. Multiple CVD-friendly color palettes are available in R; however, they are limited to 8 different colors. When working with complex data, such as microbiome data, this is insufficient. To overcome this limitation, we created the microshades R package, designed to provide custom color shading palettes that improve accessibility and data organization.

The microshades package includes two crafted color palettes, microshades_cvd_palettes and microshades_palettes. Each color palette contains six base colors with five incremental light to dark shades, for a total of 30 available colors per palette type that can be directly applied to any plot. The microshades_cvd_palettes contain colors that are universally CVD friendly. The individual microshades_palettes are CVD friendly, but when used in conjunction with multiple microshades_palettes, are not universally accessible.

The microshades package also contains functions to aid in data visualization, such as creating stacked bar plots organized by a data-driven hierarchy. To further assist users with data storytelling, there are functions to sort data both vertically and horizontally based on ranked abundance or user specification. The accessibility and advanced color organization features help data reviewers and consumers notice visual patterns and trends in data more easily. Examples of microshades in action are available on our website, for both microbiome and other datasets.

Link to package or code repository.