# Conference Agenda

Overview and details of the conference sessions. Select a date or location to show only the sessions on that day or at that location. Select a single session for a detailed view (with abstracts and downloads, where available).

Please note that all times are shown in the time zone of the conference.


## Session Overview
### Date: Sunday, 04/July/2021

**10:15pm - 11:15pm: Welcome and Navigation Guide (PDT)**
Virtual location: The Lounge #announcements
Session Chairs: Rocío Joo, Heather Turner, Yanina Noemi Bellini Saibene
Zoom Host: Marcela Alfaro Cordoba; Replacement Zoom Host: Andrea Sánchez-Tapia
A warm welcome for our first timezone, along with a quick guide on how to navigate the conference.

**11:15pm - 11:59pm: My first UseR! (PDT)**
Location: The Lounge #new_to_UseR!
Session Chair: Batool Almarzouq
Zoom Host: Andrea Sánchez-Tapia; Replacement Zoom Host: Marcela Alfaro Cordoba
A session for newcomers to useR! This pre-conference session introduces newcomers to useR! and how to engage with the wider R community. The session will be interactive, with open discussion and a chance for networking.

11:15pm - 12:15am. ID: 360 (Social Events): My first UseR! (PDT). Batool Almarzouq.
### Date: Tuesday, 06/July/2021

**12:00am - 12:30am: RechaRge 1**
Session Chair: Marcela Alfaro Cordoba
Yoga intro and breathing exercises: Jana will teach us how to breathe and will lead a yoga class for all audiences. Come join us! Beginners are welcome.

**12:30am - 12:45am: Break**
Virtual location: The Lounge #lobby

**12:45am - 2:15am: Elevator Pitches 1**
Virtual location: The Lounge #elevator_pitches

**ID: 264 / ep-01: 1 (Elevator Pitch)**
Topics: Social sciences
Keywords: national identification number, demographic data, generator, privacy, Finland

*Hetu-package: Validating and Extracting Information from Finnish National Identification Numbers*
Pyry Kantanen (1), Måns Magnusson (2), Jussi Paananen (3), Leo Lahti (1)
1: University of Turku, Finland; 2: Uppsala University, Sweden; 3: University of Eastern Finland

The need to uniquely identify citizens has been critical for efficient governance in the modern era. Novel techniques, such as iris scans, fingerprints and other biometric information, have only recently begun to supplement the tried-and-true method of assigning each individual a unique identifier: a national identification number. In the Nordic countries, national identification numbers are not random strings but contain information about the person's birth date and sex and, in the case of Swedish personal identity numbers, place of birth. In addition, most identification numbers contain control characters that make them robust against input errors, ensuring data integrity. Datasets that lack such demographic information can be appended with data extracted from national identification numbers, and existing demographic data can be validated by comparing it against the extracted values. Validating and extracting this information is simple in principle and can be done by hand, but in practice it becomes infeasible for datasets larger than a few dozen observations.
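The control-character check mentioned above can be sketched in a few lines of R. This illustrates the standard algorithm for Finnish personal identity codes (the nine digits of the code, taken modulo 31, index a fixed character table); it is an illustration only, not the hetu package's own implementation:

```r
# Characters used for the Finnish identity-code check character
# (31 symbols; G, I, O, Q and Z are excluded).
check_chars <- strsplit("0123456789ABCDEFHJKLMNPRSTUVWXY", "")[[1]]

# A code has the form DDMMYYCZZZQ, where Q is the check character.
validate_pin <- function(pin) {
  digits    <- paste0(substr(pin, 1, 6), substr(pin, 8, 10))  # DDMMYY + ZZZ
  remainder <- as.numeric(digits) %% 31
  substr(pin, 11, 11) == check_chars[remainder + 1]
}

validate_pin("010101-0101")  # TRUE: 10101010 %% 31 == 1, table entry "1"
```

Because every step here is vectorized, the same function validates millions of codes in a single call, which is the efficiency property the abstract describes.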
The hetu package provides easy-to-use tools for programmatic handling of Finnish personal identity codes (henkilötunnus) and Business ID codes (y-tunnus). It exploits R's efficient vectorized operations and can generate and validate over 5 million Finnish personal identity codes or Business ID codes in under 10 minutes. This covers the practical upper limit set by the current population of Finland (5.5 million people) and leaves adequate headroom for handling large registry datasets. Privacy concerns may push Finland and other Nordic countries towards redesigning their national identification numbers to omit the embedded personal information; policy changes will be monitored closely and, if necessary, the package functions will be adjusted accordingly.

Link to package or code repository.

**ID: 150 / ep-01: 2 (Elevator Pitch)**
Topics: Data visualisation
Keywords: feature selection

*Visualising variable importance and variable interaction effects in machine learning models*
Alan Inglis (1), Andrew Parnell (1), Catherine Hurley (2)
1: Hamilton Institute, Maynooth University; 2: Dept. of Mathematics and Statistics, Maynooth University

Variable importance, interaction measures and partial dependence plots are important summaries in the interpretation of statistical and machine learning models. In our R package vivid (variable importance and variable interaction displays) we create new visualisation techniques for exploring these model summaries. We construct heatmap and graph-based displays showing variable importance and interaction jointly, carefully designed to highlight important aspects of the fit. We also construct a new matrix-type layout showing all single and bivariate partial dependence plots, and an alternative layout based on graph Eulerians focusing on key subsets. Our new visualisations are model-agnostic and are applicable to regression and classification supervised learning settings.
They enhance interpretation even in situations where the number of variables is large and the interaction structure is complex. In this work we demonstrate our visualisation techniques on a data set and explore and interpret the relationships provided by these important summaries.

Link to package or code repository: https://github.com/AlanInglis/vivid

**ID: 279 / ep-01: 3 (Elevator Pitch)**
Topics: Ecology
Keywords: landscape ecology, spatial data, remote sensing, wetland ecology, wildfire

*Obtaining reproducible reports on satellite hotspot data during a wildfire disaster*
Natalia Soledad Morandeira (1, 2)
1: University of San Martín, Environmental Research and Engineering Institute, Argentine Republic; 2: CONICET (National Scientific and Technical Research Council, Argentina)

Wildfires can be monitored and analyzed using thermal hotspot records derived from satellite data. In 2020, the Paraná River floodplain (Argentina) suffered a severe drought, and thousands of hotspots, associated with active fires, were reported by the Fire Information for Resource Management System (FIRMS-NASA). FIRMS-NASA products are provided as spatial objects (shapefiles), including recent and archived records from several sensors (VIIRS and MODIS). I aimed to handle these data, analyze the number of hotspots during 2020, and compare the disaster with the situation in previous years. Using sf, tidyverse, janitor, stringr, spdplyr, ggplot2 and R Markdown, I imported and pre-processed the spatial objects, generated plots, and obtained reproducible reports. I used R to handle satellite data, monitor the number of active fires, and detect which wetland areas were being affected: this allowed me to respond quickly to peers and journalists about how the wildfires were evolving. As a case study, I summarize the 2020 outputs for my study area, the Paraná River Delta (19,300 km²).
A total of 39,821 VIIRS thermal hotspots were detected, with August (winter in the Southern Hemisphere) accounting for 39.8% of the whole year's hotspots. While VIIRS data (resolution: 375 m) are available from 2012, MODIS data are available from 2001; however, MODIS resolution is 1 km, so fewer hotspots are reported and each hotspot corresponds to a larger area. The cumulative MODIS hotspots recorded during 2020 numbered 8,673, the highest count of the last 11 years, although the MODIS hotspots detected in 2020 were 62.9% of those recorded during 2008. All plots were produced in English and Spanish versions, showing daily and cumulative hotspots, monthly summaries, and a comparison with hotspots detected in previous years. My workflow can be used to analyze thermal hotspot data in any other area of interest.

Link to package or code repository: https://github.com/nmorandeira/Fires_ParanaRiverDelta. Three main dissemination articles and interviews (full list in the repository): in English: https://www.reuters.com/article/us-argentina-environment/argentinas-wetlands-under-assault-by-worst-fires-in-more-than-a-decade-idUSKBN25T35V; in Spanish (1/2): http://www.unsam.edu.ar/tss/el-delta-en-llamas/; in Spanish (2/2): http://www.hamartia.com.ar/2020/08/10/rodeados-fuego/

**ID: 117 / ep-01: 4 (Elevator Pitch)**
Topics: Statistical models
Keywords: clustering

*Modeling spatio-temporal point processes with the nphawkes package*
Peter Boyd, James Molyneux
Oregon State University

As the literature on Hawkes processes grows, the use of such models continues to expand, encompassing a wide array of applications such as earthquakes, disease spread, social networks, neuron activity, and mass shootings. As new applications are explored, correctly parameterizing the model is difficult given the dearth of field-specific research on parameter values, which creates the need for nonparametric models.
The model-independent stochastic declustering (MISD) algorithm accomplishes this task through a complex, computationally expensive procedure. In the nphawkes package, I have used Rcpp to create a fast and user-friendly implementation of MISD. The nphawkes R package allows users to analyze data in time or space-time, with or without a mark covariate such as the magnitude of an earthquake. We demonstrate the use of such models on an earthquake catalog and highlight features of the package, including stationary/nonstationary background rates, model fitting, visualizations, and model diagnostics.

Link to package or code repository: https://github.com/boydpe/nphawkes

**ID: 204 / ep-01: 5 (Elevator Pitch)**
Topics: Spatial analysis
Keywords: Functional Programming, Spatial Point Pattern Analysis, Parallelization, tidy code

*Use Case: Functional Programming and Parallelization in Spatial Point Pattern Analysis*
Clara Chua, Tin Seong Kam
Singapore Management University, Singapore

Performing Spatial Point Pattern Analysis (SPPA) can be computationally intensive for larger data sets, or for data with non-uniform observation windows: a dataset of 7,000 points can take a day or more to run. There is also often a need to apply the same method repeatedly to different cuts of the data (e.g. running the same tests for different regions or subtypes), or when mapping and visualising the results of the analysis. Part of my project looks at SPPA of Airbnb listings in Singapore, using an envelope simulation of Ripley's K-function test from the spatstat package to determine whether there is clustering in specific subregions. In my talk I will briefly explain the K-test and compare the performance of a for-loop implementation against a functional-programming implementation using purrr, as well as the performance of the standard K-test against the K-test using the Fast Fourier Transform.
I show that using functionals helps break the analysis into tidier chunks, which results in tidier code and better reproducibility down the road. Computation time may sometimes be quicker with functional programming. Even so, parallelization is still needed for spatial analysis of larger datasets. spatstat has no built-in parallelization methods, and parallelisation is OS-dependent: functions such as mclapply from the base parallel package work on Mac and Linux, but not on Windows. Hence, I will also share my efforts to parallelize the envelope simulations in an OS-agnostic way.

Link to package or code repository: https://github.com/clarachua/capstone-proj

**ID: 135 / ep-01: 6 (Elevator Pitch)**
Topics: Biostatistics and Epidemiology
Keywords: Clinical trials, Bayesian modeling, Shiny

*Predicting the COVID-19 Pandemic Impact on Clinical Trial Recruitment at GSK*
Valeriia Sherina, Nicky Best, Graeme Archer, Jack Euesden, Dave Lunn, Inna Perevozskaya, Doug Thompson, Magda Zwierzyna
GlaxoSmithKline

The COVID-19 pandemic required an unprecedented response from the pharmaceutical industry, both in developing new antiviral medicines and vaccines as rapidly and safely as possible, and in continuing to deliver its existing portfolio of important new medicines. As many countries went into lockdown to slow the spread of the disease, sponsors faced the twin dilemma of replanning study delivery on the fly while rebalancing their portfolios to meet the emergent medical need. Our multidisciplinary team at GSK tackled the problem of delayed recruitment due to the pandemic. We aggregated external data on the pandemic across 42 countries into epidemiological forecasts and designed a novel Bayesian hierarchical model with three levels (site initiation, patient screening, and patient randomization) to link classical recruitment predictions with epidemiological COVID-19 predictions.
We obtained COVID-adjusted estimates of the time to achieve key recruitment milestones via forward sampling from the posterior distributions of the model parameters. The results were summarized and deployed in a user-friendly Shiny application to assist study teams with recruitment planning in the face of the pandemic. Here we showcase the results of an effective collaboration between statisticians and data scientists, and how it fits into the decision-making framework of clinical operations.

**ID: 112 / ep-01: 7 (Elevator Pitch)**
Topics: Databases / Data management
Keywords: agriculture

*The Grammar of Experimental Design*
Emi Tanaka
Monash University

The critical role of data collection is well captured in the expression "garbage in, garbage out": if the collected data are rubbish, then no analysis, however complex, can make something of them. The gold standard for data collection is a well-designed experiment. Re-running an experiment is generally expensive, whereas re-doing a statistical analysis is generally cheap, so the stakes of getting the experimental design wrong are higher. But how do we design experiments in R? In this talk, I present my R package edibble, which implements a novel framework I refer to as the "grammar of experimental design" to facilitate the design of an experiment and its data collection. The grammar builds the experimental design by describing the fundamental components of the experiment. Because the grammar resembles a natural language, there is greater clarity about the experimental structure, and it includes considerations beyond the construction of the experimental layout. I will reconstruct some experimental layouts using edibble, with comparisons to other popular R packages.
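As a sketch of the kind of declarative specification this grammar enables: the function names below follow the edibble documentation, but the exact API may differ between package versions, so treat this as an assumption rather than a definitive example.

```r
library(edibble)

# Describe the experiment by its components (units, treatments, and
# how treatments are allotted to units) rather than by a layout matrix.
des <- design("fertilizer trial") |>
  set_units(block = 4,
            plot  = nested_in(block, 2)) |>  # 2 plots per block
  set_trts(fertilizer = c("A", "B")) |>
  allot_trts(fertilizer ~ plot) |>
  assign_trts("random", seed = 2021) |>
  serve_table()                              # realise the layout as a table
```

Because the specification reads almost as a sentence ("plots nested in blocks, fertilizer allotted to plots"), the experimental structure stays visible in the code itself.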
Link to package or code repository: https://github.com/emitanaka/edibble/

**ID: 243 / ep-01: 8 (Elevator Pitch)**
Topics: Spatial analysis
Keywords: open source

*rspatialdata: a collection of data sources and tutorials on downloading and visualising spatial data using R*
Varsha Ujjini Vijay Kumar (1, 2), Dilinie Seimon (1, 2), Paula Moraga (2)
1: Faculty of Business and Economics, Monash University, Australia; 2: Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia

Open and reliable data, analytical tools and collaborative research are crucial for solving global challenges and achieving sustainable development. Spatial and spatio-temporal data are used in a wide range of fields, including health, social and environmental disciplines, to improve the evaluation and monitoring of goals both within and across countries. To meet the increasing need to easily access reliable spatial and spatio-temporal data in R, many R packages have recently been developed as clients for various spatial databases and repositories. While documentation and many open-source repositories explain how to use these packages, there is a growing need for a one-stop repository for this information. In this talk, we present rspatialdata, a website that provides a collection of data sources and tutorials on downloading and visualising spatial data using R, covering a wide range of datasets including administrative boundaries of countries, OpenStreetMap data, population, elevation, temperature, vegetation and malaria data.
The website is a useful resource for anyone working on problems that require spatial data analysis and visualisation, such as estimating air pollution, quantifying disease burdens, predicting species occurrences, and evaluating and monitoring the UN Sustainable Development Goals.

Link to package or code repository.

**ID: 361 / ep-01: 9 (Elevator Pitch)**
Topics: Web Applications (Shiny/Dash)
Moved to Session 2.

**ID: 157 / ep-01: 10 (Elevator Pitch)**
Topics: Teaching R/R in Teaching
Keywords: teaching, machine learning, data science, undergraduate

*Teaching Advanced Data Science in R: Successes and Failures*
Lisa Lendway
Macalester College, United States of America

I am about to embark on teaching a new course I have named Advanced Data Science in R. This is a capstone course for undergraduates at Macalester College, a small liberal arts college in the US. The course expands on what students learned in an Intro Data Science course focused on data visualization and wrangling with the tidyverse, and in a Statistical Machine Learning course focused on machine learning algorithms implemented via the caret package. Students will review machine learning algorithms while learning the tidymodels packages. They will become familiar with the larger machine-learning process by using data from a database within RStudio and implementing the basics of putting a model into production. Other components of the class include practicing reproducible research using Git and GitHub in RStudio; creating a website with R Markdown, distill, or blogdown; and building Shiny apps. So that students become more comfortable learning new R packages on their own, they will, in groups of three, teach the class about a package or set of functions of their choosing; they will create materials that classmates can refer back to, and will also write homework problems and solutions.
The last few weeks of the course will be dedicated to group projects in which students are expected to apply the skills they learned in the course. Since the course has not yet started, I do not yet know what the successes and failures will be; I will share the course website and reflect on how it went and on the changes I will make for the future.

Link to package or code repository: https://advanced-ds-in-r.netlify.app/

**ID: 234 / ep-01: 11 (Elevator Pitch)**
Topics: Bioinformatics / Biomedical or health informatics
Keywords: annotation

*rRbiMs, R tools for Reconstructing bin Metabolisms*
Mirna Vázquez Rosas Landa, Valerie de Anda Torres, Sahil Shan, Brett Baker
The University of Texas at Austin

Although microorganisms play a crucial role in biogeochemical cycles and ecosystems, most of their diversity is still unknown. Understanding the metabolic potential of novel microbial genomes is an essential step in accessing this diversity. Most annotation pipelines use predefined databases to add functional annotation, but they do not allow the user to add metadata, making it challenging to explore metabolism in the light of a specific scientific question. Our motivation in building rRbiMs is to create a workflow that helps researchers explore metabolism data in a reproducible manner. rRbiMs reads different database outputs and includes a new custom-curated database that the user can easily access. Its module-based design lets the user run the whole pipeline or just part of it. Finally, a key feature is that it facilitates the incorporation of metadata such as taxonomy and sampling data, allowing users to answer specific questions of biological relevance. rRbiMs is a user-friendly R workflow for reproducible and accurate microbial metabolism analyses. We are preparing the package for submission to R/Bioconductor to make it available to the research community.
Link to package or code repository: https://github.com/mirnavazquez/RbiMs

**ID: 202 / ep-01: 12 (Elevator Pitch)**
Topics: Statistical models
Keywords: relative weights analysis, Key Drivers Analysis, residualization, nonlinear, main effect, interaction

*ResidualRWA: Detecting relevant variables using relative weights analysis with residualization*
Maikol Solís, Carlos Pasquier
Universidad de Costa Rica, Escuela de Matemática, Centro de Investigación en Matemática Pura y Aplicada, Costa Rica

Determining the most influential variables in a statistical model is a recurring task. A common technique is relative weights analysis (RWA), also known as Key Drivers Analysis. The method, described in Tonidandel & LeBreton (2015), uses an orthonormal projection of the data to determine which variables have the most impact on the model. Packages like rwa (https://CRAN.R-project.org/package=rwa) and flipRegression (https://github.com/Displayr/flipRegression) handle the case of multiple linear primary effects in the model. Extending those packages, we present the new package ResidualRWA (https://github.com/maikol-solis/residualrwa), which implements relative weights analysis with the ability to handle complex models with nonlinear effects (via restricted splines) and nonlinear interactions. The package residualizes each interaction appropriately so as to report the main effects and the pure interaction effects correctly. The interactions are residualized against the primary effects as described in LeBreton et al. (2013); this step removes the influence of the main variables from the interactions, so that we can separate the effect of a main variable from the pure interaction effect due to the true synergy of the variables. ResidualRWA handles the model fit, the residualization of interactions, and the relative weights analysis estimation.
It also reports the results to the user through easy-to-read tables and graphics. In this presentation we will show the capabilities of this package and test it on simulated and real data.

References:
- LeBreton, J. M., Tonidandel, S., & Krasikova, D. V. (2013). Residualized Relative Importance Analysis. Organizational Research Methods, 16(3), 449-473. https://doi.org/10.1177/1094428113481065
- Tonidandel, S., & LeBreton, J. M. (2015). RWA Web: A Free, Comprehensive, Web-Based, and User-Friendly Tool for Relative Weight Analyses. Journal of Business and Psychology, 30(2), 207-216. https://doi.org/10.1007/s10869-014-9351-z

Link to package or code repository: https://github.com/maikol-solis/residualrwa

**ID: 203 / ep-01: 13 (Elevator Pitch)**
Topics: Biostatistics and Epidemiology
Keywords: survey, serology, serological-survey, srvyr, tidyverse

*serosurvey R package: Serological Survey Analysis For Prevalence Estimation Under Misclassification*
Andree Valle-Campos
Centro Nacional de Epidemiología Prevención Control Enfermedades, CDC Perú, Peru

Population-based serological surveys are fundamental to quantifying how many people have been infected by a given pathogen and where we are on the epidemic curve. Various methods exist to estimate prevalence in the presence of misclassification due to an imperfect diagnostic test whose sensitivity and specificity are known with certainty. However, during the first months of an outbreak of a novel pathogen such as the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), diagnostic test performance is unknown or uncertain, given the limited validation studies available. During the pandemic, the increased demand for prevalence estimates and the absence of standardized procedures to address this issue led to a proliferation of methods and non-reproducible workflows for SARS-CoV-2 serological surveys, in proprietary software or non-public R code repositories.
Given this scenario, we created the serosurvey R package to gather serological survey analysis functions and workflow templates for prevalence estimation under misclassification. We provide functions to calculate single prevalences using the srvyr package environment and to generate tidy outputs for a hierarchical Bayesian approach that incorporates unknown test performance, following Larremore et al. [bioRxiv, May 2020]. We applied them in a reproducible workflow with the purrr and furrr packages to efficiently iterate and parallelize this step across multiple prevalences, and tested it by simulating the use of an imperfect test to measure an outcome within a multi-stage sampling survey design, incorporating the test uncertainty into the sampling-design uncertainty. The serosurvey R package could therefore facilitate the generation of prevalence estimates for the current pandemic and improve our preparedness for the next one. In conclusion, the serosurvey R package reduces the reproducibility gap in serological survey analysis with diagnostic tests of unknown performance, within a free software environment.

Link to package or code repository.

**ID: 191 / ep-01: 14 (Elevator Pitch)**
Topics: Interfaces with other programming languages
Keywords: HTTP, API, reprex

*httrex: Help Debugging HTTP API Clients*
Greg Daniel Freedman Ellis
Crunch.io, United States of America

HTTP APIs are a fantastic way to collaborate across programming-language divides: it does not matter what language the server's infrastructure is written in, an R user can write code to bring the data into R. There are many packages on CRAN, and presumably even more internally developed packages in organizations, that wrap web APIs, abstracting over some of their details to offer a friendlier interface for R users. When everything works as intended, this is the ideal way to work, with users not needing to worry about the inner workings of the API on a day-to-day basis.
But when there is a bug, it can be hard to find the source because of the layers of abstraction, and potentially because of the programming-language barrier between the user and the API owner. The httrex package provides tools to debug R code built on HTTP APIs and to communicate what the code is doing in a language-agnostic way. It explores two approaches to this problem. The first is a Shiny app that runs alongside your current R session, tracks the code you run and the API calls that code makes, and visualizes each step of the process. The second creates a document, inspired by the reprex package, that reruns code in a new environment and interleaves the HTTP API calls with the R code and output. With either approach, users can better understand the HTTP requests their code is making and share them with API owners in a way that does not require the owners to understand R.

Link to package or code repository: https://github.com/gergness/httrex

**ID: 213 / ep-01: 15 (Elevator Pitch)**
Topics: Efficient programming
Keywords: CLI (command line interface)

*Reading, Combining, and Pre-Filtering Files with R and AWK*
David Shilane, Mayur Bansal, Chung Woo Lee
Columbia University

The authors have been developing a software package named awkreader that provides an R interface to AWK for combined and pre-filtered file reading. The package is designed to solve several problems. For one or more files with the same column structure, we will demonstrate a method to read and bind the data. A vector of names can be supplied to limit the data to selected columns. The data can then be pre-filtered, either by specifying patterns to match or through logical inclusion criteria. The program translates these specifications into a corresponding AWK statement. Importantly, the logical criteria can be specified directly in syntax that is familiar to R users.
In addition to reading in the data, the code can also return its AWK translations, and combined data sets can include a column identifying the source file. The awkreader package creates a variety of novel capabilities. It reduces the computational complexity of filtering and binding data sets. Targeted queries can search a wider range of data than could otherwise be loaded into R, because the filters are applied while the data are read. A single statement can replace the labor and code of reading, binding, and filtering multiple files. Users can benefit from AWK for file reading without having to learn its syntax, and those interested in learning AWK can generate working examples that correspond to the more familiar setting of programming in R. The presentation will showcase the benefits of the awkreader package. We will provide some details of the translation process, particularly for logical subsetting operators such as %in% and %nin%, and will also demonstrate examples of the code in data-processing applications.

Link to package or code repository: https://github.com/MB4511/awkreader

**ID: 131 / ep-01: 16 (Elevator Pitch)**
Topics: Social sciences
Keywords: data management

*Data Pipeline: Improving Management of Financial Contributions to the Fight Against Poverty in Costa Rica*
Roberto Delgado Castro
Direccion General de Desarrollo Social y Asignaciones Familiares (DESAF)

FODESAF, administered by DESAF (part of the Ministry of Labor and Social Security), is Costa Rica's, and Latin America's, largest public social-investment fund. It transfers around US$1,000 million per year (2% of local GDP) to a wide variety of social programs nationwide. Its funding comes from monthly financial contributions by local employers and the Ministry of Treasury (the Government). Along with the monthly financial transfers, a large database with specific information on contributors is attached.
Since 1978, FODESAF's establishment year, DESAF had not had the opportunity to classify and analyze such crucial data. A data science project was developed in SQL and RStudio to implement a data pipeline, in which all data was loaded to break down its key elements and improve DESAF authorities' decision-making capabilities. As inputs, annual databases from 2003 to 2019 (17 years) were loaded into the pipeline separately (250,000 records per year, 30.8 million in total). The results were automatic R Markdown reports with brand-new dataframes and visualizations for each year, which helped authorities visualize and analyze elements that had not been seen in 43 years of DESAF's history. After its implementation, the local government now has a unique, recurrent data science tool to improve management of financial contributions to the fight against poverty. As key learnings: data-taming skills were strengthened, project questions were defined as the project's coding-structure strategy, a brand-new Contributors Mass Report was developed, and the project has been used as an economic-recovery follow-up instrument in the post-pandemic era. Keywords: Data Pipeline, data-taming, visualizations, automatic-reports, poverty. Link to package or code repository.https://github.com/ROBERTODELGADOCASTRO/UseR_2021_Conference ID: 114 / ep-01: 17 Elevator Pitch Topics: Bioinformatics / Biomedical or health informaticsKeywords: sleep, chronobiology, mctq, data wrangling, reproducibility mctq: An R Package for the Munich ChronoType Questionnaire Daniel Vartanian, Ana Amélia Benedito-Silva, Mario Pedrazzoli School of Arts, Sciences and Humanities (EACH), University of Sao Paulo (USP), Sao Paulo, Brazil mctq is an R package that provides a complete and consistent toolkit to process the Munich ChronoType Questionnaire (MCTQ), a quantitative and validated method to assess people's sleep behavior, presented by Till Roenneberg, Anna Wirz-Justice, and Martha Merrow in 2003.
The aim of mctq is to facilitate the work of sleep and chronobiology scientists with MCTQ data while also helping with research reproducibility. Although it may look like a simple questionnaire, the MCTQ requires a lot of date/time manipulation. This poses a challenge for many scientists, since most people have difficulties with date/time data, especially when dealing with an extensive set of data. The mctq package addresses this issue. mctq can handle the processing tasks for the three MCTQ versions (standard, micro, and shift) with few dependencies, relying for much of its functionality on the lubridate and hms packages from the tidyverse. We also designed mctq with the user experience in mind, by creating an interface that resembles the way the questionnaire data is shown in MCTQ publications, and by providing extensive and detailed documentation about each computation proposed by the MCTQ authors. The package also includes several utility tools to deal with different time representations (e.g., decimal hours, radians) and time arithmetic issues, along with fictional datasets for testing and learning purposes. The first stable version of mctq is available for download on GitHub. The package is currently under software peer review by the rOpenSci initiative. We plan to submit it to CRAN soon after the review process ends. Link to package or code repository. ID: 107 / ep-01: 18 Elevator Pitch Topics: EcologyKeywords: flow cytometry, cytometric diversity, microbial ecology, {flowDiv} workflow {flowDiv} workflow: reproducible cytometric diversity estimates María Victoria Quiroga1, Bruno M. S. Wanderley2, André M.
Amado3, Fernando Unrein1 1Instituto Tecnológico de Chascomús (INTECH, UNSAM-CONICET), Argentina; 2Departamento de Oceanografia e Limnologia, Universidade Federal do Rio Grande do Norte, Brazil; 3Departamento de Biologia, Universidade Federal de Juiz de Fora, Brazil Flow cytometry is widely used in life sciences, as it records thousands of single-cell data related to their morphological or physiological state within minutes. Hence, each sample has a characteristic cytometric pattern that can be studied through diversity indexes: evenness, alpha- and beta-diversity. Applying this approach to microbial ecology research frequently involves intricate handling of data recorded with different instrumental settings or sample dilutions. The {flowDiv} package overcomes this through data normalization and volume correction steps before estimating cytometric diversity. Here, we share a reproducible {flowDiv} workflow, hoping to help researchers to streamline flow cytometry data processing. Link to package or code repository. ID: 118 / ep-01: 19 Elevator Pitch Topics: R in productionKeywords: data management From Sheets to Success: Reliable, Reproducible, yet Flexible Production Pipelines Katrina Brock Sunbasket At our growth-stage startup, our data team was challenged to produce accurate and reliable forecasts despite ever-changing product lines. To solve this, we used yaml-based config to specify inputs and models in a standardized way, the validate package to check inputs before running any custom transformations, and the drake package to isolate the business logic for each product line’s forecast. We transformed a process that previously was a patchwork of SQL, R, and hundreds of spreadsheets into a set of robust pipelines that use a standardized set of steps. The reliability frees our team from constant troubleshooting and allows us to focus on building and tuning models to ever-increasing accuracy. 
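The pattern just described (yaml-driven inputs, validate checks up front, drake targets isolating each product line's logic) can be sketched as follows. This is a minimal illustration under assumptions, not Sunbasket's pipeline: the file name, column names, and the lm model are placeholders.

```r
library(drake)
library(validate)

# Toy input standing in for the real extract (hypothetical columns)
write.csv(data.frame(week = 1:10, units = 2 * (1:10), product_line = "a"),
          "orders.csv", row.names = FALSE)

# Input checks run before any custom transformation
rules <- validator(units >= 0, !is.na(product_line))

plan <- drake_plan(
  raw     = read.csv(file_in("orders.csv")),
  checked = {
    stopifnot(all(confront(raw, rules), na.rm = TRUE))  # fail fast on bad input
    raw
  },
  model    = lm(units ~ week, data = checked),           # per-line business logic
  forecast = predict(model,
                     newdata = data.frame(week = max(checked$week) + 1:4))
)
make(plan)  # drake rebuilds only targets whose inputs changed
```

Because each target is declared explicitly, swapping a product line's model means touching one target, and drake's caching keeps the rest of the pipeline untouched.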
ID: 281 / ep-01: 20 Elevator Pitch Topics: Biostatistics and EpidemiologyKeywords: survey design, multiwave sampling, Neyman allocation Efficient multi-wave sampling with the R package optimall Jasper B. Yang1, Bryan E. Shepherd2, Thomas Lumley3, Pamela A. Shaw1 1Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, U.S.A; 2Department of Biostatistics, Vanderbilt University School of Medicine, Vanderbilt University, Nashville, TN, U.S.A; 3Department of Statistics, University of Auckland, Auckland, New Zealand When a study population is composed of heterogeneous subpopulations, stratified random sampling techniques are often employed to obtain more precise estimates of population characteristics. Efficiently allocating samples to strata under this method is a crucial step in the study design process, especially when data are expensive to collect. One common approach in epidemiological studies is a two-phase sampling design, where inexpensive variables collected on all sampling units are used to inform the sampling scheme for collecting the expensive variables on a subsample in the second phase. Recent studies have demonstrated that even more precise estimates can be obtained when the second phase is conducted over a series of adaptive waves. Unlike simpler sampling schemes, executing multi-phase and multi-wave designs requires careful management of many moving parts over repetitive steps, which can be cumbersome and error-prone. We present the R package optimall, which offers a collection of functions that efficiently streamline the design process of survey sampling, ranging from simple to complex. 
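One building block of such designs, the Neyman allocation named in the keywords above, can be illustrated in a few lines of base R. This is a generic statement of the allocation rule, not optimall's interface: a total sample of size n is split across strata in proportion to stratum size times stratum standard deviation, which minimizes the variance of the stratified mean.

```r
# Neyman allocation: n_h = n * N_h * S_h / sum(N_h * S_h)
# N_h = stratum sizes, S_h = stratum standard deviations, n = total sample
neyman_alloc <- function(N_h, S_h, n) {
  w <- N_h * S_h
  round(n * w / sum(w))
}
neyman_alloc(N_h = c(5000, 3000, 2000), S_h = c(1.0, 2.5, 4.0), n = 300)
# → 73 110 117  (more variable strata receive more of the sample)
```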
The package’s main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum in order to minimize the variance of the target sample mean using Neyman allocation, and select specific IDs to sample based on a stratified sampling design. As the survey is performed, optimall provides a framework for every aspect of the sampling process, including the data and metadata, to be stored in a single object. Although it is particularly tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey. Link to package or code repository.https://github.com/yangjasp/optimall ID: 185 / ep-01: 21 Elevator Pitch Topics: Data mining / Machine learning / Deep Learning and AIKeywords: explainable machine learning, predictive modeling, interactive visualization, dashboards Open the Machine Learning Black-Box with modelStudio & Arena Hubert Baniecki, Piotr Piątyszek Warsaw University of Technology, Poland Complex machine learning predictive models, also known as black boxes, demonstrate high efficiency in a rapidly increasing number of applications. Simultaneously, there is a growing awareness among machine learning practitioners that we need more comprehensive tools for model explainability. Responsible machine learning will require continuous model monitoring, validation, and black-box transparency. These challenges can be met by novel frameworks that add automation and interactivity to explainable machine learning pipelines. In this talk, we present the modelStudio and arenar packages which, at their core, automatically generate interactive and customizable dashboards allowing users to "open the black box". These tools build upon the DALEX package and are model-agnostic - compatible with most predictive models and frameworks in R.
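The typical DALEX-based workflow behind this can be sketched in a few lines: fit any model, wrap it in a model-agnostic explainer, and hand the explainer to modelStudio. A minimal sketch using a plain linear model and DALEX's bundled apartments data; a real analysis would substitute the black-box model of interest.

```r
library(DALEX)
library(modelStudio)

# Any predictive model works; lm() keeps the sketch dependency-free
model <- lm(m2.price ~ ., data = apartments)

# The explainer is the model-agnostic wrapper everything else builds on
explainer <- explain(model,
                     data = apartments[, -1],   # features only
                     y = apartments$m2.price,
                     verbose = FALSE)

modelStudio(explainer)  # generates the standalone interactive dashboard
```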
The effort goes into lowering the entry threshold for crucial parts of today's MLOps practice. We showcase how little coding is needed to produce a powerful dashboard consisting of model explanations and data exploration visualizations. The output can be saved and shared with anyone, further promoting reproducibility and explainability in machine learning practice. Finally, we highlight the Arena dashboard's features - it specifically aims to compare various predictive models. Link to package or code repository. ID: 247 / ep-01: 22 Elevator Pitch Topics: Data mining / Machine learning / Deep Learning and AIKeywords: classification Classifying Student’s Learning Pattern using R Sequence Analysis Packages: The Impact of Procrastination on Performance Teck Kiang Tan National University of Singapore Sequence analysis is an analytical technique for analyzing categorical longitudinal data, incorporating classification procedures for categorizing categorical sequences. A framework of sequence analysis will be provided to give an overview of this approach. This talk shares the findings of a study extracted from the National University of Singapore Learning Management System (LMS), which provides online video learning materials to students. The time spent learning and watching the online videos of a statistics course that runs for 35 days was extracted from the LMS to form sequences for all the students, with each student forming a learning sequence of 35 states. Sequence analysis was carried out to classify students' patterns of time spent learning, and the outcomes of the sequence analysis are linked to students' performance. The R package TraMineR and a few cluster analysis packages were used to carry out the sequence analysis and classify the sequences. Fifteen sequence distances were computed and their goodness-of-fit indices determined for the selection of the best distance metric for classifying students.
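The distance-then-cluster step described here can be sketched with TraMineR. This is a toy illustration with three short sequences and made-up state names, not the study's data; the study uses 35 daily states per student.

```r
library(TraMineR)

# Toy daily-state sequences: one row per student, one column per day
acts <- data.frame(
  d1 = c("watch", "idle", "watch"),
  d2 = c("watch", "idle", "idle"),
  d3 = c("idle",  "idle", "watch")
)
seqs <- seqdef(acts)                   # build a state-sequence object
d <- seqdist(seqs, method = "OM",      # optimal-matching distances
             indel = 1, sm = "CONSTANT")
hc <- hclust(as.dist(d), method = "ward.D2")
cutree(hc, k = 2)                      # assign each student to a cluster
```

In the study this classification step is repeated for each of the fifteen candidate distance metrics before selecting the best-fitting one.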
Four sequence complexity measures were examined to quantify whether the students vary their time-use learning patterns as a studying strategy. These four measures are sequence entropy, turbulence, complexity, and the precarity index. The usefulness of graphing sequence analysis using the state distribution plot, sequence frequency plot, transversal entropy plot, sequence modal state plot, and representative sequence plot will be shared to point out their relevance in explaining the results of classifying students' learning sequences. Finally, inferential regression findings will be shared, showing how the classification resulting from the sequence analysis can be used to determine the degree to which students' learning procrastination affects their performance. ID: 200 / ep-01: 23 Elevator Pitch Topics: Data visualisationKeywords: outliers, dataviz, ggplot2, data communication, blogdown Outlier redemption Violeta Roizman Self-employed, France In data science, outliers are usually perceived negatively, as a problem to solve prior to the data analysis. Many tutorials, blog posts and papers are written to offer tips about how to get rid of them. There is even a whole area, robust statistics, focused on techniques that try to ignore outliers. However, I've discovered that outliers can be a very exciting part of the data. Outliers are quite often fun facts about the data. And I love fun facts. I love them so much that I collect all these fun data facts on a website called outlier redemption, where I only publish short data-driven stories about them. For each story, I publish the data and the R code used to produce the graphics and simulations. In this short talk, I will start by introducing the positive side of outliers. After that, I will present different visual ways in which we can spot and highlight outliers with R.
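One simple way to spot and highlight outliers, of the kind this talk surveys, is to flag points far from the median (in median-absolute-deviation units) and color them with ggplot2. A generic sketch on simulated data; the 3-MAD cutoff is a common rule of thumb, not the talk's specific method.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(x = 1:100, y = c(rnorm(98), 8, -7))  # two injected outliers

# Flag points more than 3 median absolute deviations from the median
df$outlier <- abs(df$y - median(df$y)) / mad(df$y) > 3

ggplot(df, aes(x, y, color = outlier)) +
  geom_point() +
  scale_color_manual(values = c(`FALSE` = "grey60", `TRUE` = "red")) +
  theme(legend.position = "none")
```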
I will finish with some of the fun outlier stories that I encountered while analyzing data. Link to package or code repository. ID: 115 / ep-01: 24 Elevator Pitch Topics: ReproducibilityThe Canyon of Success: Scaling Best Practices in R with Internal R packages Malcolm Barrett Teladoc Health At useR! 2016, Hadley Wickham said that what he sought to design in the tidyverse was (after Jeff Atwood) a pit of success: it should be easy for users to succeed. They shouldn’t have to trudge uphill to use your tools effectively. They should fall right into the pit of success--no guard rails! In this talk, I’ll discuss how we use internal R packages to scale best practices across our data science and analytics teams at Teladoc Health. Using R packages has allowed new team members to quickly onboard, set up their work environments, and create reproducible work that aligns with our standards. Our sets of packages include opinionated, usethis-style workflows, database connections, reporting, and more. Designing these tools to make it easy to succeed has become a keystone in our design approach, allowing us to scale our practices with less intervention. withdrawnID: 102 / ep-01: 25 Elevator Pitch Topics: Bioinformatics / Biomedical or health informaticsKeywords: Cancer, simulation, mathematical modeling Modeling tumor evolutionary dynamics with SITH Phillip Nicol1, Amir Asiaee2 1Harvard University, United States of America; 2Ohio State University, United States of America A tumor of clinically detectable size is the result of a decades-long evolutionary process of cell replication and mutation. Mathematical models of cancer growth are becoming increasingly valuable for situations where high-quality clinical data is unavailable. We designed "SITH," a CRAN package that implements a stochastic spatial model of tumor growth and mutation. The goal of "SITH" is to provide fast simulations with a convenient interface for researchers interested in cancer evolution. 
The core simulation algorithm is written in C++ and integrated into R using the "Rcpp" framework. 3D interactive visualizations of the simulated tumor (using the "rgl" package) allow users to investigate the spatial location of various subpopulations. "SITH" also provides functions for analyzing the amount of heterogeneity present in the tumor. Finally, "SITH" can create simulated DNA sequencing datasets. This feature may be helpful for researchers interested in understanding how spatial heterogeneity can bias observed data. Link to package or code repository.https://CRAN.R-project.org/package=SITH withdrawn ID: 225 / ep-01: 26 Elevator Pitch Topics: ReproducibilityKeywords: ecology A Glimpse into the Reproducibility of Scientific Papers Published in Movement Ecology: How Are We Doing? Jenicca Poongavanan, Rocio Joo Arakawa, Mathieu Basille University of Florida Reproducibility is the hallmark of science, and thus of movement ecology as well. However, studies in disciplines such as biology and the geosciences have shown that published work is rarely reproducible. Ensuring reproducibility is not a mandatory part of the research process, and thus there are no clear procedures in place to assess the reproducibility of scientific articles. In this study, we put forward a reproducibility workflow scoring sheet based on six criteria that lead to successfully reproducible papers. The reproducibility workflow can be used by authors to evaluate the reproducibility of their studies before publication and by reviewers to evaluate the reproducibility of scientific papers. To assess the state of reproducibility in movement ecology, we attempted to reproduce the results from movement ecology papers that use behavioral pattern identification methods. We selected 75 papers published in several journals between 2010 and 2020. According to our proposed reproducibility workflow, sixteen studies reflected at least some reproducibility (scores ≥ 4).
In particular, we were only able to obtain the data for 16 out of 75 papers. Of these, a minority also provided code with the data (6 out of the 16 studies). Of the 6 studies that made both data and code available, only four reflected a high level of reproducibility (scores ≥ 9), owing to good code annotation and execution. Based on our findings, we proposed guidelines for authors, journals and academic institutions to enhance the state of reproducibility in movement ecology. Talk-VideoID: 180 / ep-01: 27 Elevator Pitch Topics: Time seriesKeywords: Time series, Forecasting, Business/Industry, Economics, R programming Forecasting under COVID: when simple works - and when it doesn’t Maryam Shobeirinejad1, Steph Stammel2 1Transurban, Australia; 2Transurban, Australia The worldwide impact of COVID-19 represented a global change point across almost all areas of life. As a result, forecasting presents new challenges for us to manage. Fit-for-purpose time series forecasts can range from deceptively simple but highly useful models to eye-wateringly complex ones that require expertise to implement correctly and usefully. In addition, COVID-19 has meant that many of the exogenous variables used to improve forecast models (such as macroeconomic data) are also compromised by the same global events. This talk discusses how we have dealt with these heightened challenges in corporate forecasting: balancing highly interpretable solutions with few assumptions (for example, seasonal decomposition models) against complex modelling that explicitly manages a rapidly changing environment (e.g. state-space switching models). The balancing of different forecast model features (accuracy, interpretability, required assumptions, etc.) with the needs of stakeholders is a subject of considerable empirical review within our team.
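The kind of batch fit-and-compare workflow this involves can be sketched with the tidyverts packages: several candidate models are fitted to a series in one pass and their fit statistics compared side by side. A minimal sketch on toy monthly data; the series, models, and horizon are placeholders, not Transurban's.

```r
library(tsibble)
library(fable)

set.seed(42)
# Toy monthly series standing in for real demand data
ts <- tsibble(month  = yearmonth("2018 Jan") + 0:35,
              demand = 100 + 1:36 + rnorm(36, sd = 5),
              index  = month)

# Fit simple and complex candidates in a single call
fits <- ts |>
  model(snaive = SNAIVE(demand),   # highly interpretable baseline
        ets    = ETS(demand),      # exponential smoothing
        arima  = ARIMA(demand))    # automatically selected ARIMA

glance(fits)                 # compare fit statistics across candidates
forecast(fits, h = "6 months")
```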
In a corporate context, this is balanced with tight time frames and a need to rapidly fit, estimate and compare large numbers of models and come to a solution that meets the needs of a complex array of users. This presentation will discuss how the tidyverts ecosystem (tsibble, feasts, etc.) has been used to test and iterate batch sets of models to achieve robust solutions to time-sensitive business problems. We will discuss some of our findings in the context of a cost-benefit framework as we traded off the key features of a forecast model in search of the ‘fit for purpose’ solution. ID: 209 / ep-01: 28 Elevator Pitch Topics: R in the wild (unusual applications of R)Keywords: GIS gatpkg: Developing a geographic aggregation tool in R for non-programmers Abigail Stamm New York State Department of Health, Bureau of Environmental and Occupational Epidemiology The Geographic Aggregation Tool (GAT) was developed in R to simplify and standardize the development of small geographic areas that meet the minimum thresholds required to provide stable and meaningful population measures. To improve usability, we converted GAT to an R package and designed it to be accessible to non-programmers with little to no experience in R. To run GAT, users are only required to install the package and run one line of R code. GAT's user-friendly interface offers a series of dialogs for the user to select their options and saves all files without requiring additional code. The package includes documentation on the files created, guidance on interpreting them, and advice on how to address aggregation issues. It also includes several examples using embedded shapefiles, a tutorial, and detailed documentation for advanced users interested in modifying or enhancing the tool. This talk will provide an overview of the tool, how it works, and the planning that informed its development, keeping accessibility and reproducibility in mind.
Link to package or code repository.https://github.com/ajstamm/gatpkg ID: 278 / ep-01: 29 Elevator Pitch Topics: R in productionKeywords: HIV/AIDS, package, analysis, automation tidyndr: An R package for analysis of the Nigeria HIV National Data Repository Stephen Taiye Balogun, Scholastica Olanrewaju, Oluwaseun Okunuga, Temitope Kolade, Geraldine Chizoba Abone, Fati Murtala-Ibrahim, Helen Omuh Institute of Human Virology Nigeria, Nigeria Nigeria, home to the fourth-largest HIV epidemic in the world, is central to achieving the UNAIDS target of epidemiologic control of HIV/AIDS by 2030. Data on 1.3 million HIV-positive patients on treatment in the country are stored centrally in the National Data Repository (NDR). Using access levels, these data are accessible to the Government of Nigeria, donor agencies, implementing partners and other stakeholders to track progress and improve HIV programming. To achieve this, the data must be cleaned, processed, summarized, and communicated for easy recognition of progress and identification of gaps for tailored intervention. The analysis is traditionally conducted in Microsoft Excel using a file downloaded from the NDR. This means that the Excel software must be installed on the user's computer, and the user must be familiar with the formulas for calculating the various indicators. The process lacks reproducibility and is error-prone, with errors occasionally going unnoticed. Performing the same analysis periodically can also be quite tedious and time-consuming. The tidyndr package eliminates these bottlenecks by improving user-friendliness and automating routine analysis, saving many person-hours while eliminating individual errors. The functions are grouped into four categories: importing, treatment, supporting, and summary functions.
Together, these ensure that patient-level data are consistently imported into R, subset the data based on specific indicators, and provide summary tables, both aggregated and disaggregated, in line with the national requirements. The output from this process is very useful for improving program performance and helping achieve epidemiologic control of the virus at local, state, and national levels. With continued national efforts to provide patient-level information for HIV prevention and other services, the package can be scaled up to support the analysis of these data. Finally, it provides a foundation upon which other relevant program applications can be built. Link to package or code repository.https://github.com/stephenbalogun/tidyndr ID: 259 / ep-01: 30 Elevator Pitch Topics: Data visualisationKeywords: color accessibility, data visualization, data organization, cvd accessible colors, microbiome Do you see what I see? Introducing microshades: An R package for improving color accessibility and organization of complex data Lisa Karstens, Erin Dahl, Emory Neer Oregon Health & Science University, United States of America Approximately 300 million people in the world have Color Vision Deficiency (CVD). When creating figures and graphics with color, it is important to consider that individuals with CVD will interact with this material and may incorrectly perceive information associated with color. Multiple CVD-friendly color palettes are available in R; however, they are limited to 8 different colors. When working with complex data, such as microbiome data, this is insufficient. To overcome this limitation, we created the microshades R package, designed to provide custom color shading palettes that improve accessibility and data organization. The microshades package includes two crafted color palette types, microshades_cvd_palettes and microshades_palettes.
Each color palette contains six base colors with five incremental light-to-dark shades, for a total of 30 available colors per palette type that can be directly applied to any plot. The microshades_cvd_palettes contain colors that are universally CVD friendly. The individual microshades_palettes are CVD friendly, but when used in conjunction with multiple microshades_palettes, are not universally accessible. The microshades package also contains functions to aid in data visualization, such as creating stacked bar plots organized by a data-driven hierarchy. To further assist users with data storytelling, there are functions to sort data both vertically and horizontally based on ranked abundance or user specification. The accessibility and advanced color organization features help data reviewers and consumers notice visual patterns and trends in data more easily. Examples of microshades in action are available on our website, for both microbiome and other datasets. Link to package or code repository.https://karstenslab.github.io/microshades 2:15am - 3:15am mixR!Music, networking channel and raffles. To end the day in a relaxing way 6:30am - 7:30am CreationLabSession Chair: Marcela Alfaro CordobaDraw cartoons and tell storiesWant to learn how to draw cartoons on your tablet or with pen and pencil? Come join us! Analí will teach the basics of how to make cartoons and tell stories with drawings. 7:30am - 8:30am Keynote: Can we do this in R? - Answering questions about air quality one code at a timeVirtual location: The Lounge #key_kushwahaSession Chair: Adithi R. UpadhyaZoom Host: Rachel HeyardReplacement Zoom Host: Nasrin Fathollahzadeh Attar Speaker Slides ID: 356 / [Single Presentation of ID 356]: 1 Keynote Talk Topics: Environmental sciences Meenakshi Kushwaha ILK Labs, Bangalore We are a young team of environmental health researchers, geospatial analysts, and air quality researchers using innovative solutions for air quality monitoring in low-resource settings.
Every time we encounter a large dataset, a new modelling approach, a new statistical technique, or a new visualization challenge, we ask ourselves, "Can we do this in R?", and for the past four years (since we started this work), the answer has been a resounding "yes". I will share how we use R not just for data analysis and visualization but also as a great open-source tool for collaboration and engagement. 8:30am - 8:45am BreakVirtual location: The Lounge #lobby 8:45am - 10:15am 3A - Machine Learning and Data ManagementLocation: The Lounge #talk_ml_dmSession Chair: Young-suk LeeZoom Host: Nasrin Fathollahzadeh AttarReplacement Zoom Host: Dorothea Hug PeterSession Sponsor: MemVerge Session Slides 8:45am - 9:05amTalk-LiveID: 188 / ses-03-A: 1 Regular Talk Topics: Data mining / Machine learning / Deep Learning and AIKeywords: Automated Machine Learning, R package, Hyperband mlr3automl - Automated Machine Learning in R Alexander Bernd Hanf Ludwig-Maximilians-Universität Munich We introduce mlr3automl, an open-source framework for Automated Machine Learning in R. Based on the mlr3 machine learning package, mlr3automl builds robust and accurate classification and regression models for tabular data. mlr3automl provides automatic preprocessing, which guarantees stable performance in the presence of missing data, categorical and high-cardinality features, and large data sets. Preprocessing and model building are solved through a flexible pipeline implemented with mlr3pipelines. This allows mlr3automl to jointly optimize preprocessing, model selection and model hyperparameters using Hyperband. mlr3automl shows strong performance and stability on a benchmark consisting of 39 challenging classification tasks. mlr3automl successfully completed every task in the benchmark within the strict time budget, which three out of five other state-of-the-art AutoML systems failed to achieve.
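The intended usage is very compact. The sketch below follows the pattern shown in the package's README at the time of writing; the exact method names are an assumption and may differ between versions.

```r
library(mlr3)
library(mlr3automl)  # GitHub-only package: github.com/a-hanf/mlr3automl

# Build an AutoML model for a built-in classification task
task  <- tsk("sonar")
model <- AutoML(task)
model$train()          # tunes preprocessing + model choice via Hyperband
model$predict(task)    # predictions from the tuned pipeline
```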
Link to package or code repository.https://github.com/a-hanf/mlr3automl 9:05am - 9:25amTalk-VideoID: 168 / ses-03-A: 2 Regular Talk Topics: Data mining / Machine learning / Deep Learning and AIKeywords: exploratory data analysis Triplot: model agnostic measures and visualisations for variable importance in predictive models that take into account the hierarchical correlation structure Katarzyna Pękala, Katarzyna Woźnica, Przemysław Biecek MI2 Data Lab, Warsaw University of Technology One of the key elements of the explanatory analysis of a predictive model is to assess the importance of the individual variables. The rapid development of the area of predictive model exploration (also called explainable artificial intelligence or interpretable machine learning) has led to the popularization of local (instance-level) and global (dataset-level) methods, such as Permutational Variable Importance, Shapley Values (SHAP), Local Interpretable Model Explanations (LIME), Break Down, and so on. However, these methods do not use information about the correlation between features, which significantly reduces the explainability of the model's behaviour. In this work, we propose new methods to support model analysis by exploiting the information about the correlation between variables. The dataset-level aspect importance measure is inspired by the block permutation procedure, while the instance-level aspect importance measure is inspired by the LIME method. We show how to analyse groups of variables (aspects) both when they are proposed by the user and when they should be determined automatically based on the hierarchical structure of correlations between variables. Additionally, we present a new type of model visualisation, triplot, that exploits a hierarchical structure of variable grouping to produce a high information density model visualisation. This visualisation provides a consistent illustration for either local or global model and data exploration.
We also show an example of real-world data with 5k instances and 37 features in which a significant correlation between variables affects the interpretation of the effect of variable importance. The proposed method is, to our knowledge, the first to allow direct use of the correlation between variables in exploratory model analysis. The triplot package for R is developed under the open-source GPL-3 licence and is available in the GitHub repository at https://github.com/ModelOriented/triplot. Link to package or code repository.https://github.com/ModelOriented/triplot 9:25am - 9:45amTalk-VideoID: 252 / ses-03-A: 3 Regular Talk Topics: Data mining / Machine learning / Deep Learning and AIKeywords: networks, embeddings, machine learning, algorithms Getting sprung in R: Introduction to the rsetse package for embedding feature-rich networks Jonathan Bourne UCL, United Kingdom The Strain Elevation Tension Spring embedding algorithm (SETSe) is a deterministic method for embedding feature-rich networks. The algorithm uses simple Newtonian equations of motion and Hooke's law to embed the network onto a locally Euclidean manifold. To create the embedding, SETSe converts node attributes into forces and edge attributes into springs. SETSe finds an equilibrium position when the forces of the springs balance the forces of the nodes. The algorithm has a time complexity of O(2) and linear memory complexity; this means the algorithm avoids issues faced by other physics-based embedding methods and can be used to embed graphs with tens of thousands of nodes and more than a million edges. Some applications of SETSe are: analysing social networks; understanding the robustness of power grids; geographical analysis; predicting node features; understanding power dynamics between individuals and organisations; and analysis of molecular structures. This presentation will provide both a brief technical discussion of the algorithm and its implementation, as well as several use cases.
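The force-balance idea can be illustrated in a few lines of base R: each node carries a force derived from an attribute, each edge acts as a Hooke's-law spring, and node "elevations" are relaxed until the spring forces balance the node forces. This is a toy illustration of the physical principle only, not rsetse's implementation or interface.

```r
# Toy spring relaxation on a 3-node triangle graph
edges <- rbind(c(1, 2), c(2, 3), c(1, 3))
force <- c(1, -0.5, -0.5)     # node attribute converted to force; sums to zero
k <- 1                        # spring constant
z <- numeric(3)               # node elevations, start flat

for (iter in 1:5000) {
  net <- force
  for (e in seq_len(nrow(edges))) {
    i <- edges[e, 1]; j <- edges[e, 2]
    f <- k * (z[j] - z[i])    # spring pulls nodes toward equal elevation
    net[i] <- net[i] + f
    net[j] <- net[j] - f
  }
  z <- z + 0.01 * net         # small damped step toward equilibrium
}
round(z, 3)
# → 0.333 -0.167 -0.167  (node 1, pushed hardest, sits above the other two)
```

At equilibrium the elevations solve the graph-Laplacian system L z = force, and it is these equilibrium strains and tensions that SETSe uses as embedding coordinates.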
The use cases describe how to embed a network and then how to interpret that embedding. There are very few options for graph embeddings using R, and this is something that rsetse seeks to address; the algorithm has been implemented in the package rsetse and is available on CRAN. Link to package or code repository. 9:45am - 10:05amTalk-VideoID: 137 / ses-03-A: 4 Regular Talk Topics: Data mining / Machine learning / Deep Learning and AIKeywords: data envelopment analysis An R package for the implementation of Efficiency Analysis Trees and the estimation of technical efficiency Miriam Esteve, Victor J. España, Juan Aparicio, Xavier Barber Miguel Hernández University of Elche EAT is a new R package that includes functions to estimate production frontiers and technical efficiency measures using non-parametric techniques based on CART regression trees. The package implements the main algorithms associated with a new technique introduced to estimate the efficiency of a set of decision-making units in Economics and Engineering through machine learning techniques, called Efficiency Analysis Trees (Esteve et al., 2020). It encompasses the estimation of radial measures, oriented Russell efficiency measures, the directional distance function, the weighted additive model, graphical representations of the production frontier using tree-shaped structures and the classification of input variable importance. In addition, it incorporates code to carry out an adaptation of the Random Forest algorithm to estimate technical efficiency. This work describes the methodology and application of the functions.
Link to package or code repository.https://github.com/MiriamEsteve/EAT 8:45am - 10:15am 3B - Spatial AnalysisLocation: The Lounge #talk_spatial_analysisSession Chair: Inger Fabris-RotelliZoom Host: Tuli AmutenyaReplacement Zoom Host: Rachel Heyard 8:45am - 9:05amTalk-VideoID: 192 / ses-03-B: 1 Regular Talk Topics: Spatial analysisKeywords: spatial-analysis, graph-analysis, simple-features, tidygraph, spatial-networks Tidy Geospatial Networks in R Lucas van der Meer1, Lorena Abad1, Andrea Gilardi2, Robin Lovelace3 1University of Salzburg, Austria; 2University of Milano - Bicocca, Italy; 3University of Leeds, England Geospatial networks are graphs embedded in geographical space. That means that both the nodes and edges in the graph can be represented as geographic features: the nodes most commonly as points, and the edges as linestrings. They play an important role in many different domains, ranging from transportation planning and logistics to ecology and epidemiology. The structure and characteristics of geospatial networks go beyond standard graph topology, and therefore it is crucial to explicitly take space into account when analyzing them. The sfnetworks R package is created to facilitate such an integrated workflow. It brings together the sf package for spatial data science and the tidygraph package for standard graph analysis. The core of the package is a data structure that can be provided as input to both the graph analytical functions of tidygraph as well as the spatial analytical functions of sf, without the need for conversion. Additionally, it offers a set of geospatial network specific functions, such as routines for shortest path calculation, network cleaning and topology modification. The package is designed as a general-purpose package suitable for usage across different application domains, and can be seamlessly integrated in "tidy" workflows that use the tidyverse packages for data science. Link to package or code repository. 
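The integrated sfnetworks workflow described above can be sketched in a few lines, using the Roxel demo dataset shipped with the package (a minimal sketch based on the package vignettes):

```r
library(sfnetworks)
library(tidygraph)
library(sf)

# Build a spatial network from the bundled Roxel road data (sf linestrings)
net <- as_sfnetwork(roxel, directed = FALSE)

# tidygraph verbs and sfnetworks' spatial edge measures work on the
# same object, without conversion
net <- net %>%
  activate("edges") %>%
  mutate(weight = edge_length())

# ... as do geospatial network routines such as shortest-path calculation
paths <- st_network_paths(net, from = 1, to = 10)
```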
9:05am - 9:25amTalk-LiveID: 290 / ses-03-B: 2 Regular Talk Topics: Spatial analysisKeywords: slopes, gradient slopes: a package for calculating slopes of roads, rivers and other linear (simple) features Robin Lovelace1, Rosa Félix2 1University of Leeds, United Kingdom; 2University of Lisbon, Portugal The goal of the slopes package is to enable reproducible calculation of slopes for urban, transport, ecological applications and research projects using free and open source software. We have developed the package to be fast, accurate and user friendly, calculating the longitudinal steepness of linear features such as roads and rivers based on open access datasets such as road geometries and digital elevation models (DEMs). The package has a few unique features, including the ability to calculate slopes based on multiple input classes for raster data, and the ability to download and use DEM data on-the-fly in places where users lack DEM data. The package is work in progress but has already been used to produce road steepness maps of cities in Portugal and Brazil. Integrating with other packages such as sf and sfnetworks, the package should provide a strong foundation for research into the impacts of vertical gradient profiles on phenomena ranging from aquatic migration patterns to flooding and walking and cycling potential. In the talk we will present both the package and some of the research questions we have used it to explore and will ask the audience: how steep a hill would you be willing to walk or cycle up? We will conclude by discussing limitations of the package and future directions of development.
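A minimal sketch of the slopes workflow just described, using the sample Lisbon data shipped with the package. The function, argument and dataset names are taken from the package README as of writing and may have changed; treat this as illustrative rather than definitive.

```r
library(slopes)

# Sample sf linestrings and a matching DEM bundled with the package
routes <- lisbon_road_segments

# Longitudinal steepness of each segment from the raster DEM
# (argument names are an assumption from the README)
routes$slope <- slope_raster(routes, dem = dem_lisbon_raster)
summary(routes$slope)

# Where no DEM is available, elevations can reportedly be added
# on-the-fly from a hosted source before computing slopes:
# routes <- elevation_add(routes)
```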
Link to package or code repository.https://github.com/ITSLeeds/slopes 9:25am - 9:45amTalk-LiveID: 142 / ses-03-B: 3 Regular Talk Topics: Spatial analysisKeywords: cartography, maps, spatial analysis mapsf, a New Package for Thematic Mapping Timothée Giraud UMS RIATE - CNRS {mapsf} helps to design various cartographic representations such as proportional symbols, choropleth or typology maps. It also offers several functions to display layout elements that improve the graphic presentation of maps. The aim of {mapsf} is to obtain thematic maps with the visual quality of those built with a classical mapping or GIS software while being lightweight, versatile and user-friendly. To achieve this goal, the package takes advantage of the features offered by {sf} and provides a limited number of simple mapping functions. {mapsf} is the successor of {cartography}; it offers the same core features but is simpler and more robust. Unlike other popular cartographic packages, it does not use a grammar of graphics, it depends on a limited number of packages and displays georeferenced plots using base R graphics. The main function of the package, mf_map(), gives access to 9 map types: base maps, proportional or graduated symbols, choropleth maps, typology maps and various combinations of symbology. Many parameters are available to fine tune the cartographic representations. These parameters are the common ones found in GIS and automatic cartography tools (e.g. classification, color palettes, symbol sizes, legend layout...). Some additional functions are dedicated to layout design (graphic themes, legends, scale bar, north arrow, title, credits…), map insets or map exports. The development of {mapsf} follows the current best practices of the R ecosystem (CI/CD, coverage tests) and its documentation is enhanced by a vignette and a website. Link to package or code repository.
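The mf_map() workflow described above, sketched with the Martinique sample dataset shipped with {mapsf} (a minimal sketch following the package documentation):

```r
library(mapsf)

mtq <- mf_get_mtq()                      # sample municipalities (sf object)

mf_map(mtq)                              # base map
mf_map(mtq, var = "POP", type = "prop")  # proportional symbols on top

# Dedicated layout functions: title, scale bar, north arrow, credits
mf_title("Population in Martinique")
mf_scale()
mf_arrow()
mf_credits("Sources: INSEE & IGN")
```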
9:45am - 10:05amTalk-VideoID: 141 / ses-03-B: 4 Regular Talk Topics: Spatial analysisKeywords: data management osmextract: An R package to download, convert, and import large OpenStreetMap datasets Andrea Gilardi1, Robin Lovelace2 1University of Milano - Bicocca; 2University of Leeds OpenStreetMap (OSM) is an online database that provides open-access geographic and rich attribute data worldwide, representing a wide range of physical and human features, including roads, rivers, and political boundaries. OSM is the world’s largest open-access source of geographic vector data, comprising nodes (points), ways (lines and polygons) and relations (describing a wide range of entities). Practical applications include disaster response, transport planning, and service location. OSM datasets can be manually downloaded from the project’s servers directly or via the R package osmdata, which uses the Overpass API. Large 'extracts' are also available from external providers (such as geofabrik.de) in a compressed binary format based on protocol buffers. The aim of osmextract is to enable processing and import of such OSM extracts. The package is composed of three main functions that can be used to 1) match an input location with one of the OSM extracts, either via spatial matching or approximate string distance; 2) download the chosen file; 3) convert the compressed data to Geopackage format. The main function, named oe_get(), returns sf objects. This workflow is effective for importing OSM extracts covering large geographical areas. Furthermore, the conversion process is based on GDAL routines, enabling customized spatial filters or SQL-like queries, further boosting import performance. 
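The three-step workflow (match, download, convert) described above is wrapped by the main function oe_get(); a minimal sketch following the package README:

```r
library(osmextract)

# Matches the place name to an OSM extract, downloads it, converts the
# compressed data to GeoPackage and reads the requested layer as sf
iow <- oe_get("Isle of Wight", layer = "lines", quiet = TRUE)

# SQL-like queries can be pushed down to the GDAL read step,
# filtering before the data reaches R
major <- oe_get("Isle of Wight",
                query = "SELECT * FROM lines WHERE highway = 'primary'")
```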
Link to package or code repository.Repository: https://github.com/ropensci/osmextract; Website: https://docs.ropensci.org/osmextract/ 8:45am - 10:15am 3C - R Packages 2Location: The Lounge #talk_packages_2Session Chair: Maëlle SalmonZoom Host: Maryam AlizadehReplacement Zoom Host: Faith Musili Session Slide 8:45am - 9:05amTalk-VideoID: 169 / ses-03-C: 1 Regular Talk Topics: Efficient programmingKeywords: API autotest: Automatic testing of R packages Mark Padgham rOpenSci The 'autotest' package has been developed by rOpenSci to automatically test the robustness of R packages to unexpected inputs. We hope that its usage will enable and encourage software to reach the highest possible quality prior to our peer-review process. 'autotest' implements a form of mutation testing which identifies expected or permitted forms for each parameter, and examines how each function responds to mutations of those inputs. Many software bugs are uncovered by packages being used in ways that developers themselves may not have anticipated, yet no developer can anticipate all potential ways software may be used. 'autotest' eases the task of making software robust to "unexpected" usage by testing and reporting any points at which mutations to inputs generate unexpected results. The package also matches expectations with textual descriptions provided by function documentation, and ensures that descriptions of input and output parameters are sufficient for users to understand ranges of admissible inputs, and of returned values. Application of 'autotest' to a package should thus ensure that forms and ranges of every parameter of every function are clearly described, and that all functions respond consistently to as many diverse forms of input as possible. Finally, 'autotest' can also be used to automatically generate a package test suite.
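The mutation-testing procedure described here is exposed through a single entry point (name per the package README; the path below is a placeholder for a local package source):

```r
library(autotest)

# Mutate the permitted inputs of every exported function of a package
# and report any points at which mutations yield unexpected results
x <- autotest_package(package = "/path/to/my/package")
summary(x)
```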
Although highly variable, applying the test to a package consisting primarily of numeric algorithms can "automatically" generate a test suite covering well over 50% of code. Link to package or code repository.https://github.com/ropenscilabs/autotest 9:05am - 9:25amTalk-VideoID: 111 / ses-03-C: 2 Regular Talk Topics: Efficient programmingKeywords: algorithms A fresh look at unit testing with tinytest Mark van der Loo Statistics Netherlands The tinytest package[1,2] implements a lightweight and flexible framework for unit testing R packages. In spite of its young age, tinytest has become quite popular: since it was released on CRAN in the spring of 2019, more than 140 packages on CRAN and Bioconductor have started to use tinytest for automatic unit testing. This includes influential packages such as Rcpp. Tinytest has a few unique features that set it apart from other testing frameworks, such as parallelization and tracking of side-effects. Side effects such as changes in environment variables are important, for example when working with locale-sensitive operations such as sorting, or date-time conversions. Tinytest also makes it easy to temporarily manipulate the testing environment during the run of the test, for example by changing environment variables. In tinytest, test results are just another type of data. They can be easily translated to a data frame layout, which makes it easy to investigate results or export them from an automated build environment. Moreover, tests are installed with the package so package authors can ask their users to run tests on the user's infrastructure. Using tinytest is easy, as test scripts require no special code: tinytest automatically collects and organizes test results that are created by any unit test expectation occurring in the script. As the name suggests, tinytest is a small package, built in less than 1200 lines of code with no dependencies other than two R-base packages that come with any R installation. [1] M van der Loo (2017).
tinytest: R package version 1.2.4. https://cran.r-project.org/package=tinytest [2] MPJ van der Loo (2020) A method for deriving information from running R code. R Journal (Accepted) https://arxiv.org/abs/2002.07472 Link to package or code repository.https://cran.r-project.org/package=tinytest 9:25am - 9:45amTalk-LiveID: 239 / ses-03-C: 3 Regular Talk Topics: R in productionKeywords: R markdown {fusen}: Create a package from Rmd files Sébastien Rochette ThinkR You know how to build an R Markdown file for reproducibility, you were told to (or would like to) put your work in an R package, but you think this is too much work? You do not understand where to put what, and when? What if writing an Rmd was the same as writing a package? Let {fusen} help you with this task. When you write an R Markdown file (or a vignette), you create documentation for your analysis (or package). Inside, you write some functions, you apply your functions to examples and you maybe write some unit tests to verify the outputs. This is even more true if you follow this guide: ['Rmd first': When development starts with documentation](https://rtask.thinkr.fr/blog/rmd-first-when-development-starts-with-documentation/). Why not transform this workflow into a documented, tested and maintainable R package to ensure the sustainability of your analyses? To do so, you will need to move your functions and scripts to the correct place. Let {fusen} do this transformation for you! {fusen} is first addressed to people who have never written a package before but know how to write an R Markdown file. Understanding package infrastructure and correctly setting it up can be frightening. This package may help them take the first step! {fusen} is also addressed to more advanced developers who are fed up with switching between R files, test files and vignettes. In particular, when changing the arguments of a function, we need to change examples and unit tests in multiple places. Here, you can do it in one place. No risk of forgetting one.
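The {fusen} workflow reduces to two calls; a minimal sketch, with function and argument names taken from the package documentation at the time of writing (the API was still evolving around this release, so check the current reference):

```r
library(fusen)

# 1. Create a flat Rmd template containing function, example and
#    test chunks side by side
add_flat_template(template = "full", flat_name = "my_analysis")

# 2. "Inflate" it: functions are moved to R/, tests to the test
#    directory, and the narrative becomes a vignette
inflate(flat_file = "dev/flat_my_analysis.Rmd",
        vignette_name = "My analysis")
```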
Package {fusen} is itself built with itself, from Rmd template files stored in the appropriate place. Link to package or code repository. 10:15am - 11:15am RechaRge 2Session Chair: Marcela Alfaro CordobaZoom Host: Rachel HeyardReplacement Zoom Host: Maryam AlizadehReplacement Zoom Host 2: Nasrin Fathollahzadeh AttarYoga for Calm + MeditationJana will teach us how to relax with yoga and will lead a couple of meditation techniques. Come join us! Beginners are welcome. 10:15am - 11:15am RLadies MeetingLocation: The Lounge #R-LadiesSession Chair: Sara MortaraZoom Host: Tuli AmutenyaReplacement Zoom Host: Faith Musili 11:15am - 12:45pm 4A - Trends, Markets and ModelsLocation: The Lounge #talk_trends_markets_modelsSession Chair: Mouna BelaidZoom Host: Faith MusiliReplacement Zoom Host: Maryam Alizadeh Session Slide 11:15am - 11:35amTalk-LiveID: 217 / ses-04-A: 1 Regular Talk Topics: Social sciencesKeywords: business/industry, internationalization, Google Trends Let Me Google That for You – Measuring global trends using Google Trends Harald Puhr, Jakob Müllner WU Vienna We present the globaltrends package as a flexible and user-friendly means to analyze data from Google Trends. Google offers public access to global search volumes from its search engine through the Google Trends portal. Users select keywords for which they want to obtain search volumes and specify the time period and location (global, country, state) of interest. For these combinations of keywords, periods, and locations, Google Trends provides search volumes that indicate the number of search queries submitted to the Google search engine. However, Google constrains users to batches of five keywords and normalizes results for each batch. Thereby, large-scale analysis and comparability across batches are impaired. By re-normalizing results to a set of user-defined baseline keywords, the globaltrends package overcomes these limitations.
This gives users the opportunity to download and measure search scores, i.e. volumes set to a common baseline, for several keywords across or within locations. In addition, users can visualize distributions, developments, and out-of-the-ordinary changes in global search scores or for specific locations. The package allows researchers and analysts to use these search scores to investigate global trends based on patterns within them. This offers insights such as the degree of internationalization of firms and organizations or the dissemination of political, social, or technological trends across the globe or within single countries. Link to package or code repository.https://github.com/ha-pu/globaltrends 11:35am - 11:55amTalk-LiveID: 230 / ses-04-A: 2 Regular Talk Topics: Economics / Finance / InsuranceKeywords: economics, finance, behaviour, package Computing Disposition Effect on Financial Market Data Lorenzo Mazzucchelli1, Marco Zanotti2 1University of Milan; 2T-Voice - Triboo Group In recent years, an irrational phenomenon in financial markets has been grabbing the attention of behavioral economists: the disposition effect. First documented by H. Shefrin and M. Statman (1985), the disposition effect consists in the observation that investors are more likely to sell an asset when it is gaining value compared to when it is losing value. The phenomenon is closely related to sunk cost bias, diminishing sensitivity, and loss aversion. From 1985 until now, the disposition effect has been documented in US retail stock investors as well as in foreign retail investors and even among professionals and institutions. By now, it is a well-established fact that the disposition effect is a real behavioral anomaly that strongly influences the final profits (or losses) of investors. Furthermore, being able to correctly capture these irrational behaviors in a timely manner is even more important in periods of high financial volatility, as nowadays.
The presentation focuses on the new dispositionEffect R package that makes it possible to quickly evaluate the presence of disposition-effect behaviors in an investor based solely on their transactions and the market prices of the traded assets. A simple step-by-step practical guide is presented to understand how to effectively use all the implemented functionalities. Finally, since financial data may be potentially huge in size, efficiency concerns are discussed and the parallelized versions of the functions are shown. Link to package or code repository.https://github.com/marcozanotti/dispositionEffect;https://marcozanotti.github.io/dispositionEffect/ 11:55am - 12:15pmTalk-LiveID: 152 / ses-04-A: 3 Regular Talk Topics: Economics / Finance / InsuranceKeywords: benchmarking The R Package diseq: Estimation Methods for Markets in Equilibrium and Disequilibrium Pantelis Karapanagiotis Goethe University Frankfurt Market models constitute a major cornerstone of empirical research in industrial organization and macroeconomics. Previous literature in these fields has proposed a variety of estimation methods both for markets in equilibrium, which typically entail a market-clearing condition, and in disequilibrium, in which the primary identification condition comes from the short-side rule. Although methodologically attractive, the estimation of such models, in particular of the disequilibrium models, is computationally demanding, and software providing simple, out-of-the-box methods for estimating them is scarce. Econometricians, therefore, mostly rely on their own implementations for estimating these models. This talk presents the R package diseq, which provides functionality to simplify the estimation of models for markets in equilibrium and disequilibrium using full information maximum likelihood methods. The basic functionality of the package is presented based on the data and the classic analysis originally performed by Fair & Jaffee (1972).
The talk also gives an overview of the design of the package, presents the post-estimation analysis capabilities that accompany it, and provides statistical evidence of the computational performance of its functionality gathered via large-scale benchmarking simulations. Diseq is free software that is distributed under the MIT license as part of the R software project. It comprises a set of estimation tools, which are to a large extent not available from either alternative R packages or other statistical software projects. Link to package or code repository.https://github.com/pi-kappa-devel/diseq 11:15am - 12:45pm 4B - Data viz and Spatial ApplicationsLocation: The Lounge #talk_dataviz_spatialSession Chair: Natalia Soledad MorandeiraZoom Host: Rachel HeyardReplacement Zoom Host: Tuli Amutenya Session Slide 11:15am - 11:35amTalk-LiveID: 216 / ses-04-B: 1 Regular Talk Topics: Data visualisationKeywords: high-dimensional data Visual Diagnostics for Constrained Optimisation with Application to Guided Tours H. Sherry Zhang1, Dianne Cook1, Ursula Laa2, Nicolas Langrené3, Patricia Menéndez1 1Monash University; 2University of Natural Resources and Life Sciences; 3CSIRO Data61 The guided tour searches for interesting low-dimensional views of high-dimensional data via optimising a projection pursuit index function. The first paper on projection pursuit, by Friedman and Tukey (1974), stated that “the technique used for maximising the projection index strongly influences both the statistical and the computational aspects of the procedure.” While much work has been done in proposing indices in the literature, less has been done on evaluating the performance of the optimisers. In this paper, we implement a data collection object in the optimisation of the projection pursuit guided tour and introduce visual diagnostics based on the data object collected. These diagnostics and this workflow can be applied to a broad class of optimisers, to assess their performance.
An R package, ferrn, has been created to implement the diagnostics. Link to package or code repository.https://github.com/huizezhang-sherry/ferrn 11:35am - 11:55amTalk-VideoID: 181 / ses-04-B: 2 Regular Talk Topics: Spatial analysisKeywords: open data, spatial data, data visualization, spatial analysis geofi-package: Facilitating the access to key spatial datasets in Finland Markus Kainu National Social Insurance Institution of Finland (KELA), Finland There is a growing demand for presenting statistical data on maps. COVID-19 launched a race across the internet in spatial data visualization, where aesthetics, usability and real-timeliness are of high value. The demand for real-time data favours solutions that can be scripted, automated and refactored quickly, and for that purpose we developed the geofi R package to facilitate access to key Finnish geospatial datasets. geofi combines resources of two Statistics Finland APIs: the regional classification API and the spatial data API. Time-series of regional classifications are shipped as on-board data, and larger spatial data is fetched through a WFS API consisting of administrative borders, zipcodes and both population and statistical grids at various resolutions. The package aims to be an onboarding technology into the R ecosystem, with clear and concrete vignettes covering the basics of spatial data manipulation, working with attribute data and step-by-step instructions for creating maps for both static and interactive applications. This talk describes the main functions and design principles of the package and makes comparisons with similar packages such as geobr for Brazil and geouy for Uruguay. Link to package or code repository.https://github.com/rOpenGov/geofi 11:55am - 12:15pmTalk-VideoID: 159 / ses-04-B: 3 Regular Talk Topics: Data visualisationKeywords: environmental sciences Virtual Environments: Using R as a Frontend for 3D Rendering of Digital Landscapes Michael J. Mahoney1, Colin M. Beier2, Aidan C.
Ackerman3 1Graduate Program in Environmental Science, State University of New York College of Environmental Science and Forestry; 2Department of Sustainable Resources Management, State University of New York College of Environmental Science and Forestry; 3Department of Landscape Architecture, State University of New York College of Environmental Science and Forestry This talk discusses a new approach to using R to create 3D landscape visualizations, which relies on external tooling designed specifically for detailed 3D rendering and interactive exploration. By using R as a frontend for high-performance rendering engines, users are able to quickly create data-defined renders which can then be interactively explored and manipulated. Two of the most promising engines for this approach are the (proprietary, source-available) Unity rendering engine, which excels at visualizing large swaths of land, and the (free and open-source) Blender engine, which is well-adapted for visualizing smaller settings. Our new {terrainr} package (available from CRAN) helps users quickly produce terrain surfaces from real-world data in Unity, visualizing environmental patterns and processes across large scales. Two new experimental packages, {mvdf} and {forthetrees}, focus on creating smaller-scale renders in Blender. Taken together, these packages suggest a way for users to create data-defined 3D renderings within R, using their preexisting coding abilities in the place of complex user interfaces to control powerful rendering engines. Our approach makes it possible for users to create renderings from data in these engines faster and easier than has been historically possible. Link to package or code repository.
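The data-retrieval step of the {terrainr} workflow described above can be sketched as follows (function names per the CRAN documentation; the coordinates are an arbitrary placeholder):

```r
library(terrainr)
library(sf)

# A point of interest, expanded to a square bounding box of 8 km sides
pt  <- st_sf(geometry = st_sfc(st_point(c(-73.9, 44.1)), crs = 4326))
aoi <- set_bbox_side_length(pt, 8000)

# Download elevation and orthoimagery tiles from the USGS National Map;
# the resulting rasters can then be exported for import into Unity
tiles <- get_tiles(aoi, services = c("elevation", "ortho"))
```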
12:45pm - 1:00pm BreakVirtual location: The Lounge #lobby 1:00pm - 2:30pm Elevator Pitches 2Virtual location: The Lounge #elevator_pitches ID: 195 / ep-02: 1 Elevator Pitch Topics: ReproducibilityKeywords: DOI, reproducibility, credibility, Open Science, Zenodo, data science Make Your Computational Analysis Citable Batool Almarzouq1,2,3 1University of Liverpool, United Kingdom; 2Open Science Community Saudi Arabia; 3King Abdullah International Medical Research Center (KAIMRC), Saudi Arabia Although there are overwhelming resources about licencing and citation for R software packages, less attention is paid to making non-package (data science) code in R citable. Academics and researchers who want to embrace Open Science practices are mostly unaware of how to make their R code citable before publishing in academic journals and what kind of licence they may use to protect the intellectual property of their work. This lightning talk will highlight aspects important to data scientists, including generating persistent DOIs, metadata, tracking of data re-use, licensing, access control and long-term availability. It'll start by introducing the zen4R package, which will be used to generate a Digital Object Identifier (DOI) for any R code from RStudio. This R package provides an interface to the Zenodo e-infrastructure API, which is a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN. Then, I'll show how you can add metadata and track your data/code re-use. Also, to protect the project's intellectual property, several types of licenses applicable to non-package (data science) code in R will be described and applied using the usethis package. By the end of the talk, academics and researchers who use R frequently will be provided with the necessary tools to publish the entire research life cycle of their projects while protecting the intellectual property of their work.
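Minting a DOI from R with zen4R, as described in this pitch, might look roughly like the sketch below. The R6 method names follow the zen4R documentation at the time of writing and may differ in current releases; the token, names and file are placeholders.

```r
library(zen4R)

# Connect to Zenodo with a personal access token (placeholder)
zenodo <- ZenodoManager$new(token = "<zenodo-token>")

# Describe the code deposit with minimal metadata
rec <- ZenodoRecord$new()
rec$setTitle("My computational analysis")
rec$setUploadType("software")
rec$addCreator(firstname = "Jane", lastname = "Doe")

# Deposit the record (this reserves a DOI) and attach the code file
rec <- zenodo$depositRecord(rec)
zenodo$uploadFile("analysis.R", record = rec)
```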
This will increase efficiency and bring benefits to the broader scientific community by increasing reproducibility and credibility. Link to package or code repository.This is based on several blog posts I'll be publishing in: https://batool-blabber.netlify.app/posts/2021-06-23-make-your-computational-analysis-citable/ ID: 291 / ep-02: 2 Elevator Pitch Topics: EcologyKeywords: structural connectivity, landscape ecology, landscape metrics, principal component analysis, wetland forests Structural connectivity in the Lower Uruguay River Forest Adriana Rojas1, Mariel Bazzalo2, Natalia Morandeira1,3 13iA-UNSAM; 2CARU; 3CONICET In recent decades, a process of agricultural expansion took place in the wetlands of the lower Uruguay River that led to the fragmentation of the landscape. We aimed to estimate the structural connectivity of the hydrophilic forest and the open forest in the basins of the main tributaries of the Lower Uruguay River for the years 1985, 2002 and 2017. Our inputs were landcover classifications previously generated by the authors with Landsat imagery. For each date, the study area (339.000 km2) was subdivided into 49800 cells of 1 km2. Connectivity was estimated by calculating 14 landscape metrics in each of the 49800 cells, for each of the dates. The spatial representation of the connectivity indices was processed using the sf, tidyverse and dplyr packages. Subsequently, we performed a PCA to reduce the dimensionality of the connectivity analysis and propose a simpler connectivity index without redundant variables; the stats, FactoMineR and factoextra packages were used. The variables with the highest scores in components 1 and 2 of the PCA (which explain the greater variability) are graphically represented for one of the cells. Our proposed index is based on four landscape metrics: class area, number of patches, landscape shape, and area-weighted mean patch area.
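The dimensionality-reduction step used in the connectivity study above can be sketched with FactoMineR and factoextra; the metric values below are simulated placeholders standing in for the per-cell landscape metrics.

```r
library(FactoMineR)
library(factoextra)

# One row per grid cell, one column per landscape metric (simulated)
set.seed(1)
metrics <- data.frame(class_area     = runif(100),
                      n_patches      = rpois(100, 5),
                      landscape_shape = runif(100),
                      awm_patch_area = runif(100))

# Standardized PCA of the metrics
res <- PCA(metrics, scale.unit = TRUE, graph = FALSE)

fviz_screeplot(res)       # variance explained per component
res$var$contrib[, 1:2]    # metric contributions to components 1 and 2
```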
Based on this index, we identified areas with low/high connectivity of the forests, and trends in connectivity changes during the study period. ID: 222 / ep-02: 3 Elevator Pitch Topics: Biostatistics and EpidemiologyKeywords: epidemiology Fitting the beta distribution for the intra-apiary dynamics study of the infestation rate with Varroa Destructor in honey bee colonies Camila Miotti, Ana Molineri, Adriana Pacini, Emanuel Orellano, Marcelo Signorini, Agostina Giacobino Instituto de Investigación de la Cadena Láctea (IDICAL-CONICET-INTA) The aim of this study was to estimate the infestation level with V. destructor mites of honey bee colonies as a function of the autumn-winter parasitic dynamics. A total of six apiaries (with five colonies each) distributed within a 30 km radius with a minimum distance of 2 km between them were set up. All colonies were set up with sister queens and the apiaries were balanced according to the adult bee population size. The following experimental conditions were established: a) two apiaries in a circular arrangement with one colony each infested with Varroa mites (donor colony), b) four apiaries in a lineal arrangement, two of them with a donor colony each located at the edge of the line and two apiaries with a donor colony each located in the middle of the line. All colonies were treated during autumn against V. destructor (except for the donor colonies) with amitraz (Amivar 500®) to reduce the infestation level of the receiver colonies (four within each apiary) to 0%. Samples for diagnosing the phoretic Varroa infestation (PV) were taken 45 days after treatment (mid-May) and monthly from June to September. The PV mite infestation (estimated as N° Varroa / N° bees) was evaluated as a function of the colonies' disposition (circular / lineal-middle / lineal-edge) and the initial PV mite level of the donor colony.
A generalized linear mixed model with Beta distribution and Logit link was fitted using the glmmTMB function (glmmTMB package), including the colony as a random effect. After the descriptive analysis, a cubic model was fitted. The colony disposition effect and the initial PV mite level were statistically significant (P=0.0126 and P=0.0314, respectively). This result suggested that the PV temporal dynamics within each colony differ according to the initial PV of neighbor colonies and that the infestation probability is higher for the lineal-middle disposition of the colonies. withdrawnID: 274 / ep-02: 4 Elevator Pitch Topics: Data mining / Machine learning / Deep Learning and AIKeywords: NEET; C50; Classification trees; Imbalanced data; SDGs. C50 Classification of young Moroccan men and women not in employment, education or training (NEET). Salima MANSOURI, Hafsa EL HAFYANI, Ichark LAFRAM High Commission for Planning, Morocco Within the 2030 Agenda for Sustainable Development framework, the proportion of youth not in employment, education or training (NEET) has to be substantially reduced; in this context, in order to draw a clearer picture for targeting-policy designers, the present study aims to investigate the composition of Moroccan young NEET men and women aged from 15 to 29 years old by elaborating two classification trees (one for NEET men and one for NEET women) using predictors previously shown to be potentially relevant (handicap situation, matrimonial status, age, level of education, economic activity of the head of household). This study provides a comparison of different classification trees by implementing some algorithms in R and Python (R: C50; Python: scikit, Orange3) and presents the optimal trees that split the data best.
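The beta GLMM described in the Varroa study (ID 222) above can be sketched with glmmTMB; the data frame and column names here are illustrative, not the study's own.

```r
library(glmmTMB)

# Beta-distributed infestation rate (proportion in (0, 1)) modelled
# as a function of colony disposition and initial donor infestation,
# with colony as a random effect and a logit link
fit <- glmmTMB(
  pv_rate ~ disposition + initial_pv + (1 | colony),
  family = beta_family(link = "logit"),
  data   = varroa   # hypothetical data frame of colony observations
)
summary(fit)
```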
Besides, it is to be noted that youth with NEET status form a population characterized by great gender-related heterogeneity with respect to economic activity status: most NEET women are housewives (76.7%) or unemployed (13%), while NEET men are mostly unemployed with no previous work experience (51.6%), unemployed having worked previously (25.4%) or economically inactive (23%). Consequently, the imbalanced-classes problem in the target variable first had to be solved by applying the SMOTE, over-sampling and SMOTE-ENN methods.
Keywords: NEET; Classification trees; Imbalanced data; R; Python; SMOTE.
Link to package or code repository: https://2019.isiproceedings.org/Files/8.Contributed-Paper-Session(CPS)-Volume-2.pdf (Contributed Paper Session (CPS), Volume 2, page 59); see also https://2019.isiproceedings.org/

ID: 257 / ep-02: 5 Elevator Pitch
Topics: R in production
Keywords: DataOps, Banks, Regulation, Government, Reproducibility
Decision support using R and DataOps at a European Union bank regulator
Jonas Bergstrom, Nicolas Pochet
Single Resolution Board, Belgium
We describe how the Single Resolution Board (SRB) created an environment for efficient and reproducible decision support using R and DataOps principles. The SRB is the Resolution Authority for the EU Banking Union. Its mission is to manage failures of large EU banking groups while protecting financial stability and minimizing the impact on public finances. The SRB develops quantitative models to simulate interbank contagion and impacts on the financial system. The SRB uses these models as a basis for decisions, both as part of its day-to-day work and in crisis management situations. In a bank crisis, it is important that the SRB is able to respond immediately to changing data. Furthermore, decisions taken by the SRB regarding a failing bank can be subject to legal proceedings, and it is crucial that the SRB can reproduce and justify its decisions regarding the affected bank(s).
The constraints imposed on the SRB make the case for code-driven data analysis using R and DataOps principles to ensure reproducibility, correctness and the ability to quickly deploy new models in production. Working side by side, SRB IT operations engineers and data scientists have created an R-based infrastructure where models, packages and dashboards are automatically built, tested and deployed in reproducible environments. Finally, models in production deliver automated feedback, which is used to improve future models. The end result is improved quality and reduced time to production. In conclusion, we make the case that using R and DevOps, public authorities can deliver better quality decisions more quickly and at lower cost to taxpayers.

ID: 199 / ep-02: 6 Elevator Pitch
Topics: Bayesian models
Keywords: Model-Based Clustering, Finite Mixture Models, Infinite Mixture Models
fipp: a bridge between domain knowledge and model specification in Dirichlet Process Mixtures and Mixtures of Finite Mixtures
Jan Greve, Bettina Grün, Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter
WU Vienna University of Economics and Business, Austria
Bayesian methods have established a firm foothold in unsupervised learning, particularly in the area of clustering. The probabilistic and generative nature of the Bayesian paradigm offers a rich inference framework for clustering that has been successfully applied in various areas of science and industry, such as natural language processing, computer vision and volatility modeling, to name a few. The fipp package aims to enhance the use of the most popular and successful Bayesian methodologies in this area: Dirichlet Process Mixtures (DPMs) and their parametric counterpart, Mixtures of Finite Mixtures (MFMs). A major source of uncertainty when implementing these models in practice is how to incorporate domain-specific knowledge into the prior distributions and hyperparameters.
For example, a practitioner may have a rough idea of the number of clusters to be expected, or of the unevenness of the partition structure, which should be translated appropriately into the prior and hyperparameter specification. Bridging this gap between statistical formulation and domain knowledge is what the functionalities implemented in the fipp package do. Specifically, the package allows users to evaluate the prior distribution of the number of clusters and to compute functionals over the prior partitions in a computationally efficient manner. This enables efficient experimentation with various prior and hyperparameter settings. The suggested use of this package is to combine it with R packages aimed at fitting the aforementioned models to real data, such as PReMiuM and dirichletprocess.

ID: 266 / ep-02: 7 Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data science teaching, learnr, shiny, bookdown, learning management system
An integrated teaching environment for R with {learnitdown}
Philippe Grosjean, Guyliann Engels
Numerical ecology department, Complexys and InforTech Institutes, University of Mons, Belgium
Many R resources exist for teaching R and data science, like {bookdown}, {blogdown} or {distill} for textbook material, {learnr} and {gradethis} for tutorials with interactive exercises, {shiny} applications for interactive demonstrations, and R/exams for exam generation and administration. However, as far as we know, there is still no integrated system that manages all these tools, and other common ones like Moodle or H5P, in a coherent teaching platform. The {learnitdown} R package (https://github.com/SciViews/learnitdown) brings all these tools together into a small LMS (learning management system) dedicated to teaching with R and R Markdown. Student authentication from Moodle or WordPress allows tracking individual activity in the H5P, {learnr} or {shiny} exercises in a centralized database.
A list of exercises is built automatically for each {bookdown} chapter, and an auto-generated progress report helps students manage their exercises more easily. Data gathered from these exercises can be pseudonymized and analyzed. The {learnitdown} system has been used to teach data science to biology students at the University of Mons, Belgium, since 2018 with great satisfaction; see https://wp.sciviews.org (in French) and https://github.com/BioDataScience-Course.

withdrawn
ID: 120 / ep-02: 8 Elevator Pitch
Topics: Data visualisation
Keywords: applications, case studies
Visualization of one-way flexible ANOVA tests using {doexplot}
Mustafa Cavus
Eskisehir Technical University, Department of Statistics
It is not always easy to interpret the output of statistical tests. This task can be made easier with visualization methods. The {ggbetweenstats} package provides tools for visualizing and reporting the output of ANOVA tests under normality. However, violations of the assumptions are a commonly faced problem in ANOVA. The {doex} package provides several one-way ANOVA tests for heteroscedastic and non-normally distributed data. In this study, the {doexplot} package is used to visualize the output of the one-way ANOVA tests provided in the {doex} package. In this way, it becomes easier for researchers to interpret and report the results of flexible ANOVA methods when the assumptions are violated.

ID: 129 / ep-02: 9 Elevator Pitch
Topics: Ecology
Keywords: biology
TrackJR: a new R package using the Julia language for tracking tiny insects
Gerardo de la Vega1,2, Federico Triñanes2, Andres Gonzalez Ritzel2
1CONICET (IFAB-INTA) ARGENTINA; 2LEQ (UDELAR) URUGUAY
Here we present the trackJR package, a tool to analyze tiny-insect behaviour in bioassays where the variable of main interest is the position of the insect (for example, an olfactometer bioassay or another orientation experiment).
The package allows working with tiny objects, understood as an individual representing ~1% of the frame, so it can also be used with species other than insects. It was written in Julia and R as a common tool for biologists, with a user-friendly Shiny widget for a broad audience. The package therefore allows biologists to use a script written in the Julia language with only basic knowledge of R. Also, the results can easily be merged with other R objects (e.g. data frames, matrices or lists).

ID: 232 / ep-02: 10 Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: javascript
shiny.fluent and shiny.react: Build Beautiful Shiny Apps Using Microsoft's Fluent UI
Marek Rogala
Appsilon
In this talk we will present the functionality and ideas behind a new open source package we have developed called shiny.fluent. UI plays a huge role in the success of Shiny projects. shiny.fluent enables you to build Shiny apps in a novel way using Microsoft’s Fluent UI as the UI foundation. It gives your app a beautiful, professional look and a rich set of components, while retaining the speed of development that Shiny is famous for. Fluent UI is based on the JavaScript library React, so it is a challenging task to make it work with Shiny. We have put the parts responsible for making this possible in a separate package called shiny.react, which enables you to port other React-based components and UI libraries so that they work in Shiny. During the talk, we will demonstrate how to use shiny.fluent to build your own Shiny apps, and explain how we solved the main challenges in integrating React and Shiny.
Link to package or code repository: https://github.com/Appsilon/shiny.fluent

ID: 231 / ep-02: 11 Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: API
Conducting Effective User Tests for Shiny Dashboards
Maria Grycuk
Appsilon
User tests are a crucial part of development, yet we frequently skip over them or conduct them too late in the process.
Involving users early on allows us to verify whether the tool we want to build will actually be used, or will be forgotten within the next few months. Another risk that increases significantly when we don’t show the product to end users before going live is that we will build something unintuitive and difficult to use. When you have been working with a product for a few months and know every button and feature by heart, it is hard to take a step back and think about usability. In this talk, I would like to share a few tips on how to perform an excellent user interview, based on my experience working with Fortune 500 clients on Shiny dashboards. I will show why conducting effective user tests is so critical, and explain how to ask the right questions to gain the most from the interview.

ID: 124 / ep-02: 12 Elevator Pitch
Topics: R in production
Keywords: business, industry
NNcompare: An R package supporting the peer programming process in clinical studies
Mette Bendtsen, Steffen Falgreen Larsen, Frederik Vandvig Heinen, Claus Dethlefsen
Novo Nordisk A/S, Alfred Nobels Vej 27, DK-9220 Aalborg Øst, Denmark
Analysing and reporting data from clinical studies require a high level of quality in the entire process, from data collection to the final clinical study report (CSR). At Novo Nordisk, part of the quality assurance is ‘peer programming’ of important data derivations, complex combinations, and statistical analyses included in data sets and TFL (tables, figures, and listings) for the CSR. In this context, peer programming involves two people solving a specific programming task: the programmer and the reviewer. The programmer creates a program that solves the task, and the reviewer creates a ‘peer program’ that reviews/validates the programmer’s work. To avoid being influenced by the programmer’s code, the reviewer should not read it until after preparing the peer program. NNcompare is an R package that supports this peer programming process at Novo Nordisk.
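The core comparison step in such a double-programming workflow can be sketched with arsenal::comparedf(), the function NNcompare builds on. The data frames below are invented for illustration only; they stand in for the programmer's and the reviewer's independent derivations of the same task.

```r
library(arsenal)

## Invented example: two independent derivations of the same analysis data set
programmer <- data.frame(subjid = 1:4, ldl = c(2.1, 3.4, 2.8, 3.0))
reviewer   <- data.frame(subjid = 1:4, ldl = c(2.1, 3.4, 2.9, 3.0))

## Compare by subject identifier; summary() lists any differing values
cmp <- comparedf(programmer, reviewer, by = "subjid")
summary(cmp)
```

A mismatch (here, the ldl value for one subject) surfaces in the comparison report, which is exactly the kind of output a reviewer would inspect before signing off.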
The package builds on the comparedf() function from the ‘arsenal’ package, which essentially provides functionality for comparing two data frames and reporting the results of the comparison. To support the peer programming process at Novo Nordisk, the NNcompare package provides additional functionality for exporting comparison reports to various formats using R Markdown, and for creating summary reports across multiple peer programs to give an overview of the status of all peer programs for a given trial. Furthermore, the package includes functionality for comparing png files using pixel-wise comparisons and marking differences in the plot. Future development will include comparisons of other file types and comparisons of multiple data frames with one function call.

ID: 253 / ep-02: 13 Elevator Pitch
Topics: Statistical models
The evolution of the dependencies of CRAN packages
Clement Lee
Lancaster University, United Kingdom
The number of CRAN packages has been growing steadily over the years. In this talk, we examine two aspects of the package dependencies. First, we look at a snapshot of the dependency network, and apply statistical network models to study its properties, including the degree distribution and the different clusters of packages. Second, we study the evolution of the network over the last year and how the number of reverse dependencies grows for a typical package. This allows us to examine the extent to which the preferential attachment model (or rich-get-richer effect) is valid.
Link to package or code repository: https://cran.r-project.org/package=crandep

ID: 249 / ep-02: 14 Elevator Pitch
Topics: Algorithms
Keywords: Resampling, Linear mixed-effect models, Bootstrap, Nested data
Bootstrapping Multilevel Models in R using lmeresampler
Adam Loy
Carleton College, United States of America
Linear mixed-effects (LME) models are commonly used to analyze clustered data, such as split-plot experiments, longitudinal studies, and stratified samples. In R, there are two primary packages for fitting LME models: nlme and lme4. In this talk, we present an extension of the nlme and lme4 packages that adds methods for bootstrapping model fits. The lmeresampler package implements several bootstrap methods for LME models with nested dependence structures using a unified framework: the cases bootstrap resamples entire clusters or observations within clusters (or both); the parametric bootstrap simulates data from the model fit; the residual bootstrap resamples both the predicted random effects and the predicted error terms; and the random effect block bootstrap utilizes the marginal residuals to calculate nonparametric predicted random effects as part of the resampling process. We will discuss and demonstrate the implementation of these bootstrap procedures, and outline plans for future development. lmeresampler is available on CRAN.

ID: 299 / ep-02: 15 Elevator Pitch
Topics: Other
Keywords: IDE
RCode, a new IDE for R
Nicolas Baradel, William Jouot
PGM Solutions, France
RCode is a new and modern IDE for R. It includes the usual features, such as code highlighting, an environment pane for R variables, and execution history. It also provides extra features such as an Excel-like data grid in which data.frames are directly editable. RCode is cross-platform and available in several languages.
ID: 295 / ep-02: 16 Elevator Pitch
Topics: R in production
Keywords: pharma, validation, verification, qualification
R Package Validation and {valtools}
Ellis Hughes
Fred Hutch Cancer Research Center, United States of America
The R Package Validation Framework offers a clear, easy-to-follow guide to automating the creation of validated R packages for use in regulated industries. By combining many of the package development tools and philosophies already in existence in the R ecosystem, the framework minimizes overhead while improving the quality of both the package and the validation. {valtools} is the implementation of this framework as an R package. Much like {usethis}, {valtools} automates the creation of the validation infrastructure and the eventual validation report, so users can focus on what matters: writing the R package. By the end of this talk, listeners will know the basics of implementing the R Package Validation Framework using the {valtools} package.

ID: 151 / ep-02: 17 Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: algorithms
NetCoupler: Inferring causal pathways between high-dimensional metabolomics data and external factors
Luke W. Johnston1, Clemens Wittenbecher2, Fabian Eichelmann3
1Steno Diabetes Center Aarhus; 2Harvard T.H. Chan School of Public Health; 3Department of Molecular Epidemiology, German Institute of Human Nutrition and German Center for Diabetes Research
High-dimensional metabolomics data are highly intercorrelated, implying that associations with lifestyle and other exposures, or with disease outcomes, generally propagate across sets of co-varying metabolites. When inferring biological pathways from metabolomics studies, it is often crucial to detect direct exposure-metabolite or metabolite-outcome relationships instead of associations that can be explained by correlations with other metabolites.
To tackle this challenge, we have developed the NetCoupler algorithm as an R package (found at github.com/NetCoupler). NetCoupler builds on evidence showing that data-driven networks recover biological dependencies from metabolomics data and that, based on causal inference theory, adjustment for at least one subset of direct neighbors is sufficient to block all confounding influences within a conditional dependency network. NetCoupler estimates a conditional dependency network from metabolomics data and then uses a multi-model approach that adjusts for all possible subsets of direct neighbors in the network in order to identify exposure-affected metabolites or metabolites that have direct effects on disease outcomes. Using simulated data, we demonstrate that NetCoupler correctly identifies direct exposure-metabolite and metabolite-outcome effects, and we provide an example of its application in a prospective cohort study to integrate information on food consumption habits, metabolomics profiles, and type 2 diabetes incidence. While NetCoupler was developed from a need to process and analyze data from metabolomics studies, it can also be applied to detect direct links between other external variables and network types.
Link to package or code repository: https://github.com/NetCoupler/NetCoupler

ID: 236 / ep-02: 18 Elevator Pitch
Topics: R in production
Keywords: CI/CD
Continuously expanding Techguides: An open source project based on bookdown using CI/CD pipelines from GitHub Actions
Peter Schmid
Mirai Solutions
A data scientist's work is often about solving unfamiliar problems. Online resources are a blessing in this regard, with the community providing answers to virtually any problem. However, it can be difficult to find the working solution in an ocean of more or less useful suggestions. Therefore, we at Mirai Solutions have started to gather solutions to some of these issues in an open source project: techguides.
This initiative is meant to give back to the community a bit of our know-how. It resulted in a public repository that elegantly puts together several rmarkdown files and renders them as a bookdown website served on GitHub Pages. In this talk, I would like to show how we are continuously expanding our techguides in a flexible way, based on an automated continuous integration and deployment workflow using GitHub Actions. As GitHub Actions is fairly new and not yet trivial to set up, we hope that our explanations can help and inspire others to consider using CI/CD.
Link to package or code repository: https://github.com/miraisolutions/techguides

ID: 261 / ep-02: 19 Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: k-mer, prediction, protein, functional analysis
R as an environment for the functional analysis of proteins
Michał Burdukiewicz
Medical University of Białystok, Poland
The functional analysis of proteins, i.e. the development of models associating a protein sequence with its function, has always been one of the cornerstones of bioinformatics. Like every other application of machine learning, it is prone to issues with reproducibility and benchmarking. Moreover, as the potential users are mostly biologists, these models should be accessible without any coding. Unfortunately, the resources necessary to build and share such models in accordance with CRAN/Bioconductor guidelines and the requirements of reproducible science are still scattered. In my presentation, I sum up my experience of developing several tools for the functional analysis of proteins (AmyloGram, SignalHsmm, AmpGram, and CancerGram). I show the advantages of the R ecosystem during development of the models (tidysq, mlr3) and deployment (R packages, Shiny web servers, and Electron-based standalone apps). As sharing very large (>10 MB) predictive models on CRAN is not intuitive, I show how to do it in a way that satisfies the submission requirements.
ID: 139 / ep-02: 20 Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: Bioimaging, R workflow, high dimensional data
Statistical Workflows in R for Imaging Mass Spectrometry Data
Hoang Tran, Valeriia Sherina, Fang Xie
GlaxoSmithKline, United States of America
Matrix-assisted laser desorption/ionization (MALDI) imaging mass spectrometry (IMS) is a technique that can reveal powerful insights into the correlation between molecular distributions and histological features. Due to their high-dimensional, hierarchical and spatial nature, MALDI IMS datasets present numerous statistical challenges. In collaboration with the bioimaging team at GlaxoSmithKline (GSK), we have developed special-purpose statistical workflows in R that provide end-to-end support for the entire MALDI IMS analysis pipeline, from study design and assay quantification to functional pharmacology. These applications leverage numerous R packages, with a particular focus on the “tidyverse” and “tidymodels” ecosystems due to their modularity and interconnectedness (to protect GSK’s intellectual property, we are currently unable to share our code). Our workflows include robust smoothing and estimation of calibration curves; non-trivial animal and tissue sample size calculations via in silico experiments; and AI/ML implementations for prediction of drug effects from the high-dimensional molecular space. These solutions addressed unique biological and quantitative challenges, and yielded actionable insights for GSK’s bioimaging team.

ID: 258 / ep-02: 21 Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: Shiny, NLP, Human-computer interaction, Chatbot, AI&Society
Hi, Let’s Talk About Data Science! - Customize Your Personal Data Science Assistant Bot
Livia Eichenberger, Oliver Guggenbühl
STATWORX GmbH, Switzerland
In June 2020, OpenAI released their newest NLP model, GPT-3, and thus set a new standard for language understanding and generation. GPT-3 is an autoregressive language model enabling the generation of human-like text. Sample use cases are chatbots, Q&A systems and text summarization. Due to the complexity of GPT-3, it is difficult for non-technical specialists to experience both the strengths and the shortcomings of this technology. A fundamental challenge faced today is educating society about the potential and risks of AI without leaving anyone behind. To approach this task, R’s Shiny framework can be leveraged to lower the barrier of entry for interaction with AI models. Specifically, GPT-3 can be instructed to behave as different types of chatbots by supplying it with a precise description of how it should act during a conversation. We provide an interface for chatting with a Data Science bot, where various parameters of the bot’s behaviour can be selected on the fly; examples are the preferred language and the user’s knowledge level. A mockup of our interface is attached. Shiny is the preferred framework for this application because it comes packaged with all the necessary tools for interacting with a customizable chatbot based on GPT-3. With Shiny’s input widgets, the user can manipulate various parameters to influence the pre-defined chatbot’s personality. The chatbot will immediately adjust its behaviour and fine-tune its personality, allowing the user to experience the effect of their input on GPT-3 in real time. All this is done in a clearly laid-out interface where users need no prior experience with R coding or creating Shiny apps. We present how we use Shiny to lower the barrier to interacting with AI models with little overhead, and thus to tackle one of today’s most important problems: AI education of the broader population.
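A minimal sketch of how such bot parameters could be wired up with Shiny input widgets follows. No real GPT-3 call is made, and all widget IDs and choices are hypothetical; the real app sends the assembled prompt to the GPT-3 API instead of just displaying it.

```r
library(shiny)

## Hypothetical parameter widgets for a customizable chatbot
ui <- fluidPage(
  selectInput("language", "Preferred language", c("English", "German")),
  selectInput("level", "Knowledge level", c("beginner", "expert")),
  textInput("msg", "Your message"),
  verbatimTextOutput("prompt")
)

server <- function(input, output, session) {
  output$prompt <- renderText({
    ## In the real app, this prompt would be sent to the GPT-3 API
    sprintf("Answer in %s, at %s level: %s",
            input$language, input$level, input$msg)
  })
}

# shinyApp(ui, server)   # uncomment to launch the app locally
```

Because the prompt is recomputed reactively, changing any widget immediately changes the instruction the bot would receive, which is what lets users experience the effect of their input in real time.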
Link to package or code repository: http://files.statworx.com/datascience-assistant.jpg

ID: 133 / ep-02: 22 Elevator Pitch
Topics: R in production
Keywords: business, industry
NNSampleSize: A tool for communicating, determining and documenting sample size in clinical trials
Claus Dethlefsen1, Steffen Falgreen Larsen1, Anders Ellern Bilgrau2, Nynne Holdt-Caspersen1, Maika Lindkvist Jensen1
1Novo Nordisk A/S; 2Seluxit
Determination of sample size in clinical studies is an iterative process involving many stakeholders and leading to many decisions. When data from other studies become available, assumptions may be revised or other scenarios for the study design may be considered. Assumptions also feed into a decision-guiding framework aimed at determining whether the sample size is adequate for making a decision about the future development of the product. At Novo Nordisk, we have developed an R Shiny application that assists us in this process. In the application, several sample size scenarios can be explored for a given study. The application has a documentation module for keeping track of decisions using R Markdown, as well as facilities for programming and reviewing the final determination of the sample size. When finalized, the idea is to download Word files ready for archiving in a documentation system.
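The documentation-to-Word step described above can be sketched with rmarkdown. This is not NNSampleSize itself; the template content, file names and the `scenario` parameter are all invented for illustration, and rendering to .docx additionally requires pandoc to be installed.

```r
library(rmarkdown)

## Write a tiny hypothetical parameterized template (invented content)
writeLines(c(
  "---",
  "title: Sample size decision log",
  "params:",
  "  scenario: base case",
  "---",
  "",
  "Scenario under review: `r params$scenario`"
), "sample_size_decision.Rmd")

## Render the decision log to a Word file ready for archiving
render("sample_size_decision.Rmd",
       output_format = word_document(),
       output_file   = "sample_size_decision.docx",
       params        = list(scenario = "revised assumptions"))
```

Passing a `params` list at render time is what makes one template reusable across the many sample-size scenarios a study goes through.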
ID: 250 / ep-02: 23 Elevator Pitch
Topics: Multivariate analysis
Keywords: True Discovery Proportion, Permutation Test, Multiple Testing, Selective Inference, fMRI Cluster Analysis
pARI package: valid double-dipping via permutation-based All-Resolutions Inference
Angela Andreella1, Jelle Goeman2, Livio Finos3, Wouter Weeda4, Jesse Hemerik5
1Department of Statistical Sciences, University of Padova; 2Biomedical Data Sciences, Leiden University Medical Center; 3Department of Developmental Psychology and Socialization, University of Padua; 4Methodology and Statistics Unit, Department of Psychology, Leiden University; 5Biometris, Wageningen University and Research
Cluster extent-based thresholding in functional magnetic resonance imaging (fMRI) is popular for finding neural activation associated with some stimulus. However, it suffers from the spatial specificity paradox: we only know that a specific cluster of voxels is significant under the null hypothesis of no activation. We cannot determine the number of truly active voxels inside that cluster without falling into the double-dipping problem. For that reason, Rosenblatt et al. (2018) developed All-Resolutions Inference (ARI), which provides a lower bound on the number of truly active voxels for each cluster. However, ARI can lose power if the data are strongly correlated, as fMRI data are. We therefore re-phrase it using permutation theory, developing the pARI package. The main function, pARIbrain, takes as input a list of contrast maps, one per subject, produced by neuroimaging tools. The user can then supply a cluster map, and pARIbrain returns the lower bound of true discoveries for each cluster in that map. The package was developed for the fMRI scenario; however, we also provide the function pARI, which takes the permutation null distribution of the p-values and the indices of the hypotheses of interest as inputs and returns the lower bound on the number of true discoveries inside the specified set of hypotheses.
The user can compute the permutation null distribution for two-sample and one-sample t-tests with the permTest and signTest functions. The set of hypotheses can be specified as often as the user wants, and pARI still controls the FWER.

ID: 233 / ep-02: 24 Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: biostatistics
Data Access and dynamic Visualization for Clinical Insights (DaVinci)
Matthias Trampisch, Julia Igel, Andre Haugg
Boehringer Ingelheim
This talk introduces the Boehringer Ingelheim initiative on Data Access and dynamic Visualization for Clinical Insights (DaVinci). It is named after Leonardo da Vinci, one of the most diversely talented individuals ever to have lived. The main objective of the DaVinci project is to reflect this diversity by creating a modular framework based on Shiny, which gives end users direct access to clinical data via advanced visualization during clinical development. DaVinci consists of a collection of Shiny-based modules to review, aggregate and visualize data in order to develop and deliver safe and effective treatments for patients. Based on harmonized data concepts (SDTM/ADaM), DaVinci provides and maintains GCP-compliant modules for data review and analysis, which can easily be combined and customized into trial-specific dashboards by the end user. The talk outlines the approach, including the module manager and the highly flexible, custom-designed modules, which together lead to an individual, customizable app experience. The main advantages of this approach are that the individual modules can be validated separately and used flexibly in a joint Shiny application, which permits easy validation with respect to GDPR, GxP and 21 CFR Part 11. The approach also supports trial-, project- or substance-specific needs to get the most value out of the data.
Deployment of these apps is done via a CI/CD pipeline using the Atlassian stack and Jenkins, resulting in Dockerized Shiny Server instances that can easily scale up to the application's needs.

ID: 179 / ep-02: 25 Elevator Pitch
Topics: Environmental sciences
Keywords: Environmental research; Big data; Reproducibility; Data visualisation
Reproducibility and dissemination in research: a case study of bioaerosol dynamics
Jesús Rojo1, Antonio Picornell2, Jeroen Buters3, Jose Oteros4
1Department of Pharmacology, Pharmacognosy and Botany, Complutense University. Madrid (Spain); 2Department of Botany and Plant Physiology. University of Malaga. Malaga (Spain); 3Center of Allergy & Environment (ZAUM), Technische Universität München/Helmholtz Center Munich. Munich (Germany); 4Department of Botany, Ecology and Plant Physiology. University of Cordoba. Cordoba (Spain)
Environmental databases are constantly growing, and they require computational tools to be managed efficiently. This experience is an example of the procedure followed to manage the aerobiological databases used in the publication led by Rojo et al. [Environ Res, 174:160-169; doi:10.1016/j.envres.2019.04.027] on the effect of height on pollen exposure. While the analysis of pollen time series at a local scale may provide no clear, or even contradictory, findings from different study areas, a global study provides robust results, avoiding biases and the effect of local factors masking the true patterns in bioaerosol dynamics. We analysed about 2,000,000 daily pollen concentrations from 59 monitoring stations in Europe, North America and Australia, using R and 'AeRobiology', a specific package in this field [Rojo et al., Methods Ecol Evol, 10:1371-1376; doi:10.1111/2041-210X.13203]. Due to the huge number of data contributors involved, we first conducted exhaustive filtering and quality control of the data to make the datasets standard and comparable between sites.
This quality control required basic rules for removing uncertain or missing data, but also scientific criteria based on the optimisation of parameters such as distance or degree of similarity between sites. The pollen rate between paired stations was used to study the effect of height on pollen concentrations, which constituted the second step (data analysis) and produced the main scientific findings. One of the key benefits of computational tools is the automation of processes. In this case, the processing and analysis systems made it possible to dynamically incorporate pollen data from new stations, obtaining an automatic update of the statistical analysis. Finally, since reproducibility and dissemination are both very important principles of scientific research, we designed a Shiny application where users may interpret the results and generate the graphs, selecting specific scientific criteria themselves. Link to the Shiny application: https://modeling-jesus-rojo.shinyapps.io/result_app2/
Link to package or code repository: https://cran.r-project.org/web/packages/AeRobiology/index.html

ID: 210 / ep-02: 26 Elevator Pitch
Topics: Environmental sciences
Keywords: environmental sciences
R in the aiR!
Adithi R. Upadhya1, Pratyush Agrawal1, Sreekanth Vakacherla2, Meenakshi Kushwaha1
1ILK Labs, Bengaluru, India; 2Center for Study of Science, Technology and Policy, Bengaluru, India
R is a powerful tool for analysing air-quality data. With ever-increasing global measurements of air pollutants (through stationary, mobile, low-cost, and satellite monitoring), the amount of data being collected is huge and necessitates the use of management platforms. In an effort to address this issue, we developed two Shiny applications to analyse and visualise air-pollution data. ‘mmaqshiny’, now on CRAN, is aimed at handling, calibrating, integrating, and visualising spatially and temporally acquired air-pollution data from mobile monitoring campaigns.
Currently, the application caters to data collected using specific instruments. With just the click of a button, even non-programmers can generate summary statistics, time series, and spatial maps. The application is capable of handling high-resolution data from multiple instruments and formats. Moreover, it allows users to visualize data in near-real time and helps keep tabs on data quality and instrument health. Our second Shiny application (currently in development) is specific to India and allows users to handle open-source air-quality datasets available from OpenAQ (https://openaq.org/#/countries/IN?_k=5ecycz), CPCB (https://app.cpcbccr.com/ccr/#/caaqm-dashboard-all/caaqm-landing), and AirNow (https://www.airnow.gov/international/us-embassies-and-consulates/#India). Users can visualize data, perform basic statistical operations, and generate a variety of publication-ready plots. It also provides outlier detection and replacement of fill/negative values. We have also integrated the popular openair package in this application. Link to package or code repository.

ID: 108 / ep-02: 27 Elevator Pitch Topics: Bioinformatics / Biomedical or health informatics segmenter: A Wrapper for JAVA ChromHMM Mahmoud Ahmed, Deok Ryong Kim Gyeongsang National University Chromatin segmentation analysis transforms ChIP-seq data into signals over the genome. The latter represent the observed states in a multivariate Markov model used to predict the chromatin's underlying (hidden) states. ChromHMM, written in Java, integrates histone modification datasets to learn the chromatin states de novo. We developed an R package around this program to leverage the existing R/Bioconductor tools and data structures in the context of segmentation analysis. segmenter wraps the Java modules to call ChromHMM and captures the output in an S4 R object. This allows for iterating with different parameters, which are given in R syntax.
Capturing the output in R makes it easier to work with the results and to integrate them into downstream analyses. Finally, segmenter provides additional tools to test, select and visualize the models. To sum up, we developed an R package that wraps a popular chromatin segmentation tool and captures the output in R for testing and visualization. Link to package or code repository. https://github.com/MahShaaban/segmenter

ID: 122 / ep-02: 28 Elevator Pitch Topics: Efficient programming Keywords: recursion, list, nested, efficient programming, C Efficient list recursion in R with rrapply Joris Chau Open Analytics The little-used R function rapply() applies a function to all elements of a list recursively and provides control over structuring the result. Although occasionally useful due to its simplicity, rapply() is not sufficiently flexible to solve many common list recursion tasks. In such cases, the solution is to write custom list recursion code, which can quickly become hard to follow or reason about, making it time-consuming and error-prone to update or modify. The rrapply() function in the rrapply package is an attempt to enhance and extend base rapply() to make it more generally applicable for efficient list recursion in R. For instance: i) rapply() only allows a function f to be applied to list elements of certain classes; rrapply() generalizes this concept through a general condition function; ii) rrapply() allows additional flexibility in structuring the result, e.g. by pruning or unnesting list elements; iii) rapply() provides no convenient way to access the name or location of the list element under evaluation; rrapply() offers a number of special arguments to overcome this limitation. The rrapply() function aims at efficiency by building on rapply()'s native C implementation and does not require any external R package dependencies.
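A minimal sketch of these three enhancements (the nested list is invented for illustration; rrapply()'s condition, how, and special arguments are as documented in the package):

```r
library(rrapply)

## a small nested list, invented for illustration
x <- list(
  a = list(score = 10, note = "ok"),
  b = list(score = -1, note = "bad")
)

## i) a general condition function instead of a class filter,
## ii) how = "prune" drops elements that fail the condition
rrapply(x, condition = function(x) is.numeric(x) && x >= 0, how = "prune")

## iii) the special argument .xname exposes the element's name to f
rrapply(x, f = function(x, .xname) paste0(.xname, "=", x), how = "flatten")
```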
The rrapply package is available on CRAN, and several vignettes illustrating its use can be found online. Link to package or code repository.

ID: 263 / ep-02: 29 Elevator Pitch Topics: Mathematical models Keywords: flow chart, flow diagram, model diagram, ggplot2, visualization An R package to flexibly generate simulation model flow diagrams Andreas Handel1, Andrew Tredennick2 1University of Georgia; 2Western EcoSystems Technology, Inc. We recently developed an R package that allows users to quickly generate ggplot2-based flow diagrams of the compartmental simulation models commonly used in infectious disease modeling and many other areas of science and engineering. The package allows users to create publication-quality diagrams in a user-friendly manner. Full access to the ggplot2 code that generates the diagram means advanced users can further customize the final diagram as needed. In this talk, we will provide a brief overview and introduction to the package. Link to package or code repository. https://github.com/andreashandel/flowdiagramr

ID: 272 / ep-02: 30 Elevator Pitch Topics: Biostatistics and Epidemiology Keywords: Markdown, automation, trend epidemiology, daily report, metrics Using R Markdown to Automate COVID-19 Reporting Farzad Islam, Michael Elten, Najmus Saqib Public Health Agency of Canada, Canada The COVID-19 pandemic has impacted the operational needs of the Public Health Agency of Canada (PHAC), and consequently the day-to-day responsibilities of its employees. The pandemic's emergency surveillance needs require around-the-clock monitoring seven days a week. To accompany this surveillance, daily reports were developed to keep the Office of the Chief Public Health Officer (OCPHO) informed of nationwide trends that would ultimately help inform public policy decisions and craft communication strategies. Because these needs arose abruptly, the solutions initially devised were labour-intensive and inefficient.
Epidemiologists were using the same datasets across different teams, writing scripts in various languages and maintaining them in silos. The Centre for Data Management, Innovation and Analytics at PHAC was responsible for taking over these functions and improving them so that a) they became standardized, b) they reduced the need for manual labour, and c) they eliminated the risk of human error. As a result, R was used to automate the reporting functions, which were moved to the back end, and the scripts' outputs were generated as PowerPoint decks. This included the use of various plots (ggplot2), charts and tables (flextable, officer), and cross-functionality with Python (reticulate). The data ingestion systems were also improved by using Google Sheets, reading public data directly from websites, and using web-scraping techniques to pull data reported daily. As a result of these efforts, daily reporting tasks that used to take hours were reduced to the click of a button and five minutes of processing.

ID: 186 / ep-02: 31 Elevator Pitch Topics: Statistical models Keywords: statistics, Cumulative Link Mixed-effects Models, Ordinal response variable Cumulative Link Mixed-effects Models (CLMMs) as a tool to model ordinal response variables and incorporate random effects Christophe Bousquet Lyon Neuroscience Research Center, France Ordinal response variables are frequent in various scientific domains, including ecology, ethology and psychology. However, researchers often analyse these data with methods suited to non-ordinal response variables. The R package 'ordinal' has been developed specifically to model ordinal response variables and also offers the possibility of incorporating random effects. In this elevator pitch, I will present how to approach this kind of analysis, from the integration of random effects to the production of visualisations that communicate the results.
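As a hedged sketch of such an analysis (using the wine bitterness data shipped with the 'ordinal' package; this formula is the package's standard example, not the mallard analysis from the talk):

```r
library(ordinal)

## wine: ordinal bitterness ratings from the 'ordinal' package
data(wine)

## cumulative link mixed model: fixed effects for serving temperature
## and contact, plus a random intercept for each judge
fit <- clmm(rating ~ temp + contact + (1 | judge), data = wine)
summary(fit)
```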
The dataset comes from experiments in behavioural biology, specifically on leadership in mallards. The code to access the data and the analysis is available on GitHub and may help other researchers learn analysis techniques for ordinal data. Link to package or code repository. https://github.com/krisanathema/Tutorials/tree/master/Cumulative%20Link%20Mixed-effects%20Models_R

ID: 242 / ep-02: 32 Elevator Pitch Topics: Data visualisation Keywords: ggplot2 High dimensional data visualization in ggplot2 Zehao Xu, Wayne Oldford University of Waterloo The 'ggmulti' package extends 'ggplot2' with high-dimensional visualization functionality such as serial axes coordinates (e.g., parallel coordinates) and multivariate scatterplot glyphs (e.g. encoding many variables in a radial-axes or star glyph). Much more general glyphs (e.g., polygons, images) are also now possible as point symbols in a scatterplot and can provide more evocative pictures for each point (e.g. an airplane for flight data or a team's logo for sports data). As the name suggests, serial axes coordinates arrange variable axes in series (radially for stars, in parallel for parallel coordinates) and can be used as a plot in their own right or as a glyph. These are extended to continuous curve representations (e.g., Andrews curves) through function transformations (e.g. Fourier series). The parallel coordinates work in the ggplot pipeline, allowing histograms, densities, etc. to be overlaid on the axes. In this talk, an overview of ggmulti will be given, largely by example. Link to package or code repository. https://github.com/great-northern-diver/ggmulti/

ID: 211 / ep-02: 33 Elevator Pitch Topics: Data visualisation Keywords: API Charting Covid with the DatawRappr-Package Benedict Witzenberger Süddeutsche Zeitung / TUM Covid-19 swept across the world like a huge, sudden wave. Data journalists all around the globe had a brand new beat to cover from one moment to the next.
Many newsrooms used the available data to build automated, regularly updated visualizations or dashboards. One tool often used for creating charts, maps or dashboard-like tables in journalism (and in corporate settings) is Datawrapper. I created an R API package that combines the power of R code for analysing data with the many options Datawrapper offers for creating interactive and responsive visualizations. I would like to show some examples and best practices for useful automated visualizations in Datawrapper, created in R. Link to package or code repository. https://github.com/munichrocker/DatawRappr

ID: 154 / ep-02: 34 Elevator Pitch Topics: Efficient programming Keywords: C++, AutoDiff, packages Bringing AutoDiff to R packages Michael Komodromos Imperial College London We demonstrate the use of a C++ automatic differentiation (AD) library and show how it can be used with R to solve problems in optimization, MCMC and beyond. In particular, we show how gradients produced with AD can be used with R's built-in optimization routines. We hope such integrations will enable package developers to produce robust, efficient code by removing the need to hand-write functions that compute gradients. Link to package or code repository. https://github.com/mkomod/rad

ID: 218 / ep-02: 35 Elevator Pitch Topics: Community and Outreach Keywords: interface, community, education, workflow Healthier & Happier Hands: Software and Hardware Solutions for More Ergonomic Typing John Paul Helveston George Washington University Most R users spend multiple hours every day typing on a keyboard, which can lead to serious injuries such as Repetitive Strain Injury (RSI) and Carpal Tunnel Syndrome. This talk discusses a variety of software and hardware tools to improve the ergonomics of typing. I will discuss a wide range of solutions, from software tools for remapping keys to a split mechanical keyboard for improved hand and arm positioning.
Each solution involves a trade-off between the time and effort required to learn and implement it and its benefits in terms of health and typing improvements, such as speed and accuracy. I will also showcase some specific ways these solutions can improve the experience of working with R. No one solution will work for everyone, but my goal is that by introducing a broad overview of solutions, many will leave inspired to try (and eventually adopt) some and end up with healthier and happier hands.

ID: 144 / ep-02: 36 Elevator Pitch Topics: Algorithms Keywords: high-dimensional data High Dimensional Penalized Generalized Linear Mixed Models: The glmmPen R Package Hillary M. Heiling1, Naim U. Rashid1,2, Quefeng Li1, Joseph G. Ibrahim1 1University of North Carolina at Chapel Hill; 2UNC Lineberger Comprehensive Cancer Center Generalized linear mixed models (GLMMs) are popular for their flexibility and their ability to estimate population-level effects while accounting for between-unit heterogeneity. While GLMMs are very versatile, the specification of fixed and random effects is a critical part of the modeling process. Historically, variable selection in GLMMs has been restricted to a search over a limited set of candidate models, or has required selection criteria that are computationally difficult to compute for GLMMs, limiting variable selection to lower-dimensional models. To address this, we developed the R package glmmPen, which simultaneously selects fixed and random effects in high-dimensional penalized generalized linear mixed models (pGLMMs). Model parameters are estimated using a Monte Carlo Expectation Conditional Maximization (MCECM) algorithm, which leverages Stan and RcppArmadillo to increase computational efficiency. Our package supports the MCP, SCAD, and Lasso penalty functions and the Binomial, Gaussian, and Poisson distributional families.
Tools available in the package include automated tuning parameter selection and automated initialization of the random effect variance. Optimal tuning parameters are selected using BIC-ICQ or other BIC-type selection criteria; the marginal log-likelihoods used in the BIC calculations are estimated using a corrected arithmetic mean estimator. The package can also be used to fit traditional generalized linear mixed models without penalization, and provides a user interface similar to the popular lme4 R package. Link to package or code repository. https://github.com/hheiling/glmmPen

ID: 140 / ep-02: 37 Elevator Pitch Topics: Reproducibility Keywords: R markdown trackdown: collaborative writing and editing your R Markdown and Sweave documents in Google Drive Filippo Gambarota1, Claudio Zandonella Callegher1, Janosch Linkersdörfer2, Mathew Ling3, Emily Kothe3 1University of Padova; 2University of California, San Diego; 3Misinformation Lab, Deakin University The advantages of literate programming, which combines plain text and code chunks (e.g., R Markdown and Sweave), are well recognized: it allows the creation of rich, high-quality, and reproducible documents. However, collaborative writing and editing have always been a bottleneck. Distributed version control systems like Git are recommended for collaborative code editing but are far from ideal when working with prose. For prose, other software (e.g., Microsoft Word or Google Docs) offers a more fluent experience, tracking document changes in a simple and intuitive way. When you further consider that collaborators often do not have the same level of programming competence, there does not appear to be an optimal collaborative workflow for writing reproducible documents. trackdown (formerly rmdrive) overcomes this issue by offering a simple solution for collaborative writing and editing of reproducible documents.
Using trackdown, the local R Markdown or Sweave document is uploaded as plain text to Google Drive, allowing colleagues to contribute to the prose using convenient features like tracked changes and comments. After integrating all authors' contributions, the edited document is downloaded and rendered locally. This smooth workflow combines the advantages of easily readable Markdown and LaTeX plain text with the optimal and well-known text-editing experience offered by Google Docs. In this contribution, we will present the package and its main features. trackdown aims to promote good scientific practices that enhance overall work quality and reproducibility, allowing collaborators with no or limited R knowledge to contribute to literate programming workflows. Link to package or code repository. https://github.com/ekothe/trackdown

ID: 289 / ep-02: 38 Elevator Pitch Topics: Statistical models Keywords: multivariate functional data, outlier detection, functional classification, clustering, machine learning Multivariate functional data analysis Manuel Oviedo-de la Fuente1, Manuel Febrero-Bande2 1University of Coruña, Spain; 2University of Santiago de Compostela, Spain This talk proposes new tools for working with multivariate functional data (MFD) in R. To this end, the class "mfdata" is proposed for handling multivariate functional data, and the class "ldata" for handling complex data (scalar, multivariate, directional, images, and functional). These new classes are useful in problems such as i) visualizing centrality and detecting outliers in MFD, ii) extending supervised classification algorithms in machine learning, and iii) extending unsupervised algorithms such as hierarchical and k-means procedures.
Link to package or code repository. https://cran.r-project.org/web/packages/fda.usc/

ID: 220 / ep-02: 39 Elevator Pitch Topics: Bioinformatics / Biomedical or health informatics Keywords: big data Multivariate functional principal component analysis on high dimensional gait data Sajal Kaur Minhas1, Morgan Sangeux3, Julia Polak2, Michelle Carey1 1University College Dublin; 2School of Mathematics and Statistics, University of Melbourne, Melbourne, Australia; 3Murdoch Childrens Research Institute, Melbourne, Australia A typical gait analysis requires the analysis of the kinematics of five joints (trunk, pelvis, hip, knee and ankle/foot) in three planes. It expresses how much a subject's gait deviates from an average normal profile as a single number, which can quantify the overall severity of a condition affecting walking, monitor progress, or evaluate the outcome of an intervention prescribed to improve the gait pattern. The Gait Deviation Index (GDI) and Gait Profile Score (GPS) are the standard indices for measuring gait abnormality and work well on common gait pathologies such as cerebral palsy. The GDI is easy to interpret and is normally distributed, allowing parametric statistical testing, whereas the GPS can decompose scores by individual joints/planes and produce altered indices without the need for a large control database, but is not normally distributed. Neither index accounts for the potential co-variation between the kinematic variables for any individual subject, i.e. the motions of one joint affect the motions of adjacent joints. Additionally, the intrinsic smoothness of the gait movement in each kinematic variable is not accounted for, i.e. the position of a joint at one instant affects its position at a later instant.
The aim of this work is to use techniques from multivariate functional principal component analysis, via the R package MFPCA, to create an index that combines the advantages of the existing GDI and GPS, i.e. an index that is easy to interpret, is normally distributed, can decompose scores by individual joints and planes, and is easily adaptable, while also accounting for the intrinsic smoothness of the gait movement in each kinematic variable and the potential co-variation between the kinematic variables. The functional gait deviation index is implemented in R and provides a computationally efficient and easily administered metric to quantify gait impairment. Link to package or code repository. https://github.com/Sajal010/MFPCA_gaitanalysis

ID: 184 / ep-02: 40 Elevator Pitch Topics: Teaching R/R in Teaching Keywords: teaching, lecture, introduction, programming Teaching an introductory programming course with R Reto Stauffer1,2, Joanna Chimiak-Opoka1, Luis M Rodriguez-R1,3, Achim Zeileis2 1Digital Science Center, Universität Innsbruck, Austria; 2Department of Statistics, Universität Innsbruck, Austria; 3Department of Microbiology, Universität Innsbruck, Austria As part of a large digitalization initiative, Universität Innsbruck established a Digital Science Center that aims to foster both interdisciplinary research and modern education using digital and data-driven methods. Specifically, the center offers a package of elective courses, open to all students, that covers programming, data management, data analysis, and further aspects of digitalization. The first course within this package is a general introduction to programming for novices, offered in two tracks using either Python or R. The focus is on teaching data types including object classes, writing and testing functions, control flow, etc. While some basic data management and data analysis is touched upon, these topics are mainly deferred to subsequent courses.
As this design differs from most introductory R materials, which emphasize data analysis early on, we developed new course materials centered around an online textbook: https://discdown.org/rprogramming/. Our course follows a flipped classroom design, allowing the diverse group of participants to learn at their own pace. In class, open questions are resolved before students work jointly on non-mandatory programming tasks with guidance and feedback from the instructors. Assessment is based on short weekly (randomized) online quizzes generated with the R/exams package (http://www.R-exams.org/) that are automatically graded, as well as manually graded mid-term and final exams. The concept of the course has turned out to work well both in-person and in virtual teaching. Link to package or code repository. https://discdown.org/rprogramming/

ID: 255 / ep-02: 41 Elevator Pitch Topics: Data mining / Machine learning / Deep Learning and AI Keywords: XAI, DALEX, iml, flashlight, shap, Interpretable Artificial Intelligence Landscape of R packages for eXplainable Artificial Intelligence Szymon Maksymiuk, Alicja Gosiewska, Przemysław Biecek Warsaw University of Technology, Poland The growing availability of data and computing power is fueling the development of predictive models. To ensure the safe and effective functioning of such models, we need methods for exploration, debugging, and validation. New methods and tools for this purpose are being developed within the eXplainable Artificial Intelligence (XAI) subdomain of machine learning. In this lightning talk, we present our taxonomy of model explanation methods, show which methods are included in the most popular R XAI packages, and highlight trends in recent developments.
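As a hedged sketch of the kind of model exploration these packages support (here with DALEX, one of the surveyed packages, on its bundled titanic_imputed data; the glm model is purely illustrative):

```r
library(DALEX)

## a simple model on DALEX's bundled titanic_imputed data
model <- glm(survived ~ ., data = titanic_imputed, family = "binomial")

## wrap the model in a model-agnostic explainer
explainer <- explain(
  model,
  data = titanic_imputed[, colnames(titanic_imputed) != "survived"],
  y = titanic_imputed$survived
)

## permutation-based variable importance, one family of explanation
## methods covered by the taxonomy
plot(model_parts(explainer))
```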
Link to package or code repository. Link to a site presenting the results: http://xai-tools.drwhy.ai/ Repo with code: https://github.com/MI2DataLab/XAI-tools

ID: 227 / ep-02: 42 Elevator Pitch Topics: R in production Keywords: interactive visualization Reactive PK/PD: An R shiny application simplifying the PK/PD review process Kristoffer Segerstrøm Mørk, Steffen Falgreen Larsen Novo Nordisk In phase 1 of clinical drug development there is great interest in the pharmacokinetics (PK) and pharmacodynamics (PD) of a drug. PK describes what the body does to the drug; PD describes what the drug does to the body. Due to the limitations and uncertainties of the procedures used to assess the PK and PD of a drug, there is a need to review the PK and PD data at the patient level. Such a review is usually conducted in a small group of people from different skill areas. In this elevator pitch you will see how we at Novo Nordisk have simplified and automated many of the tasks related to a PK/PD review using R shiny. We have developed an application that automatically generates the figures needed to conduct a review. The app enables users to comment on the data through the auto-generated figures, and the comments are instantly shared with other users. Once a review has been conducted, minutes can be downloaded in Word format, including the added comments.

ID: 121 / ep-02: 43 Elevator Pitch Topics: Teaching R/R in Teaching Keywords: data processing r-cubed: Guiding the overwhelmed scientist from random wrangling to Reproducible Research in R Hannah Chatwin1, Luke W. Johnston2, Helene Baek Juel3, Bettina Lengger4, Daniel R.
Witte2,5, Malene Revsbech Christiansen3, Anders Aasted Isaksen5 1University of Southern Denmark; 2Steno Diabetes Center Aarhus; 3University of Copenhagen; 4Technical University of Denmark; 5Aarhus University The volume of biological data increases yearly, driven largely by technologies like high-throughput omics, real-time monitoring, and high-resolution imaging, as well as by greater access to routine administrative data and larger study populations. This presents operational challenges and demands considerable knowledge and skill to manage, process, and analyze these data. Along with the growing open science movement, research is also increasingly expected to be open, transparent, and reproducible. Training in modern computational skills has not kept pace, particularly in biomedical research, where training often focuses on clinical, experimental, or wet-lab competencies. We developed a computational learning module, r-cubed, designed with biomedical researchers in mind, that introduces and improves skills in R, reproducibility, and open science. The r-cubed learning module is structured as a three-day workshop with five submodules. Across the five submodules, we use a combination of code-alongs, exercises, lectures, and a group project to cover collaboration with Git and GitHub, project management, data wrangling, reproducible document writing, and data visualization. We have specifically designed the module as an open educational resource that instructors can use directly or modify for their own lessons, and that learners can use independently or as a reference during and after the workshop. All content is available for re-use under CC-BY and MIT licenses. The course website is at https://r-cubed.rostools.org/ and the repository with the source material is at https://gitlab.com/rostools/r-cubed.
Link to package or code repository. https://r-cubed.rostools.org/

ID: 128 / ep-02: 44 Elevator Pitch Topics: Databases / Data management Keywords: databases Validate observations stored in a DB Edwin de Jonge Statistics Netherlands / CBS Data cleaning is an important step before analyzing your data. It is often wise to check the validity of your observations before running statistical methods on them. Validation checks embody real-world knowledge about your observations, e.g. age cannot be negative or over 150 years. The R package validate allows validation checks to be formulated in R syntax and run on a data.frame. validatedb brings validate to the database: it runs the same validation checks on (potentially very large) database tables, offering the same benefits as validate, namely a clean, documented set of validation rules, but checked on a database. The presentation will go into the details of the implementation, describe the output of the validation checks, and also discuss an alternative sparse format for describing errors in your data. Link to package or code repository.

ID: 208 / ep-02: 45 Elevator Pitch Topics: Teaching R/R in Teaching Keywords: data science class, flipped classroom, learnr, gradethis Teaching Biology students to code smoothly with learnR and gradethis Guyliann Engels, Philippe Grosjean Numerical Ecology Department, Complexys and InforTec Institutes, University of Mons, Belgium R is taught in the biology curriculum at the University of Mons, Belgium, in the context of five data science courses spanning from the 2nd Bachelor to the last Master classes (https://wp.sciviews.org). Since 2018 a flipped classroom approach has been used, with three levels of exercises of increasing difficulty. First, students read a {bookdown} with integrated interactive exercises written in H5P or {Shiny}. Then, they practice R using {learnr} tutorials.
Finally, they apply the new concepts to real datasets in individual or group projects managed with GitHub and GitHub Classroom (https://github.com/BioDataScience-Course). {learnr} is a useful tool to bridge the gap between theory and practice when learning R. Students can self-assess their skills and get immediate feedback thanks to {gradethis}. All the exercises generate xAPI events that are recorded in a MongoDB database (more than 300,000 events recorded so far for a total of 182 students over three academic years). These data allow us to quantify and visualize progression (individual progress reports as {Shiny} applications). Thanks to the detailed visualization of their own progression, students are more motivated to complete the exercises. Whether {learnr} is used alone or in combination with {gradethis} for immediate feedback on the answers determines students' behavior: they spend more time on each exercise and try harder to find the right answer when {gradethis} is used. Link to package or code repository. https://github.com/BioDataScience-Course

ID: 138 / ep-02: 46 Elevator Pitch Topics: Statistical models Keywords: algorithms Partial Least Squares Regression for Beta Regression Models Frederic Bertrand, Myriam Maumy European University of Technology - Troyes Technology University Many responses, for instance experimental results, yields or economic indices, can naturally be expressed as rates or proportions whose values must lie between zero and one, or between any two given values. Beta regression often allows these data to be modelled accurately, since the shapes of the densities of Beta distributions are very versatile. Yet, like any of the usual regression models, it cannot be applied safely in the presence of multicollinearity, and not at all when the model matrix is rectangular. These situations are frequently found in fields from chemistry to medicine through economics or marketing.
To circumvent this difficulty, we derived an extension of PLS regression to Beta regression models in: Bertrand, F., [...], Maumy-Bertrand, M. (2013). "Régression Bêta PLS" [in French]. JSFDS, 154(3):143-159. The plsRbeta package provides partial least squares regression for (weighted) beta regression models and k-fold cross-validation using various criteria. It allows for missing data in the explanatory variables. Bootstrap confidence interval construction is also available. Parallel computing (CPU and GPU) support is currently being implemented. Link to package or code repository.

ID: 194 / ep-02: 47 Elevator Pitch Topics: Data mining / Machine learning / Deep Learning and AI Keywords: interpretability, machine learning, explainability Simpler is Better: Lifting Interpretability-Performance Trade-off via Automated Feature Engineering Alicja Gosiewska1, Anna Kozak1, Przemysław Biecek1,2 1Warsaw University of Technology, Poland; 2University of Warsaw, Poland Machine learning generates useful predictive models that can and should support decision-makers in many areas. The availability of AutoML tools makes it possible to quickly create an effective but complex predictive model. However, the complexity of such models is often a major obstacle in applications, especially for high-stakes decisions. We are experiencing a growing number of examples where the use of black boxes leads to decisions that are harmful, unfair, or simply wrong. Here, we show that very often we can simplify complex models without compromising their performance, with the benefit of much-needed transparency. We propose a framework that uses elastic black boxes as supervisor models to create simpler, less opaque, yet still accurate and interpretable glass-box models. The new models are created using newly engineered features extracted with the help of a supervisor model.
We support the analysis with a large-scale benchmark on several tabular datasets from the OpenML database. There are three main results: 1) we show that extracting information from complex models may improve the performance of simpler models, 2) we question the common myth that complex predictive models outperform simpler predictive models, 3) we present a real-life application of the proposed method. The proposed method is available as an R package, rSAFE, https://github.com/ModelOriented/rSAFE. Link to package or code repository.https://github.com/ModelOriented/rSAFE ID: 162 / ep-02: 48 Elevator Pitch Topics: Bayesian modelsKeywords: Bayesian analysis State of the Market - Infinite State Hidden Markov Models Dean Markwick BestX The stock market is either in a bull or a bear market at any given time. In a bull market, prices increase on average; in a bear market, prices decrease on average. In this talk I will build a non-parametric Bayesian model that can classify the stock market into these different states. This model is a practical application of my dirichletprocess R package and will serve as an introduction to both the package and non-parametric Bayesian models. I use free stock data and take you through the full quantitative modelling process. I will show how to prepare the data, build the model, and analyze the model output. This model is able to highlight the dot-com crash of the 2000s, the credit crisis of 2008, and the more recent COVID turmoil in the market. As it is a Bayesian model, I am also able to highlight the uncertainty around these market states without having to do any extra work. Overall, this talk will provide a practical example of and introduction to how R can be used in quantitative finance. 
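The abstract above describes the workflow at a high level; a minimal sketch with the dirichletprocess package might look as follows. The function names follow the package's documentation, but the returns here are simulated stand-ins for real stock data, so this is an illustration, not the speaker's code:

```r
library(dirichletprocess)

# Simulated daily returns standing in for real stock data:
# a calm regime followed by a more volatile one
set.seed(1)
returns <- c(rnorm(200, 0.001, 0.01), rnorm(100, -0.002, 0.03))

# Fit a nonparametric Gaussian mixture; each inferred mixture
# component can be read as a latent market state
dp <- DirichletProcessGaussian(scale(returns)[, 1])
dp <- Fit(dp, 500, progressBar = FALSE)

# How many observations fall into each inferred state
table(dp$clusterLabels)
```

Because the number of mixture components is not fixed in advance, the model chooses how many market states the data support, which is the "infinite state" aspect of the talk's title.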
Link to package or code repository.http://dm13450.github.io/2020/06/03/State-of-the-Market.html ID: 164 / ep-02: 49 Elevator Pitch Topics: Bioinformatics / Biomedical or health informaticsKeywords: algorithms networkABC: Network Reverse Engineering with Approximate Bayesian Computation Myriam Maumy, Frederic Bertrand European Technology University - Troyes Technology University We developed an inference tool based on approximate Bayesian computation to decipher network data and assess the strength of the inferred links between the network's actors. It is a new multi-level approximate Bayesian computation (ABC) approach. At the first level, the method captures the global properties of the network, such as scale-freeness and clustering coefficients, whereas the second level targets local properties, including the probability of each pair of genes being linked. Up to now, ABC algorithms have been scarcely used in that setting and, due to the computational overhead, their application was limited to a small number of genes. In contrast, our algorithm was designed to cope with that issue and has a low computational cost. It can be used, for instance, to elucidate gene regulatory networks, which is an important step towards understanding normal cell physiology and complex pathological phenotypes. Reverse engineering consists of using gene expression over time or over different experimental conditions to discover the structure of the gene network in a targeted cellular process. Link to package or code repository. 
ID: 187 / ep-02: 50 Elevator Pitch Topics: Economics / Finance / InsuranceKeywords: Tidymodels, Tidyverse, actuarial science, actuarial claim cost analysis Navigating Insurance Claim Data Through Tidymodels Universe Jun Haur Lok, Tin Seong Kam Singapore Management University, Singapore The increasing ability to store and analyze data, thanks to advances in technology, has given actuaries opportunities to optimize the capital held by insurance companies. Optimizing capital typically lowers a company's cost of capital. This could translate into an increase in profit from the lower cost incurred, or an increase in competitiveness through lowering the premiums companies charge for their insurance plans. In this analysis, the tidyverse and tidymodels packages are used to demonstrate how modern data science R packages can assist actuaries in predicting the ultimate claim cost once claims are reported. The conformity of these R packages with tidy data concepts has flattened the learning curve for using different machine learning techniques to complement conventional actuarial analysis. This has effectively allowed actuaries to build various machine learning models in a tidier and more efficient manner. The packages also enable users to harness the power of data science to mine the "gold" in unstructured data, such as claim descriptions, item descriptions, and so on. Together, these would enable companies to hold smaller reserves through more accurate claim estimation, while not compromising solvency, allowing the capital to be re-deployed for other purposes. 
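As an illustration of the kind of tidymodels workflow the abstract describes (the claim data and column names below are invented for the sketch; only the rsample/parsnip calls follow the packages' documented APIs):

```r
library(tidymodels)

# Invented claim data: ultimate cost vs. two reported features
set.seed(7)
claims <- tibble(
  ultimate = rlnorm(500, 8, 1),     # hypothetical ultimate claim cost
  age      = runif(500, 18, 80),    # hypothetical policyholder age
  reported = rlnorm(500, 7.5, 1)    # hypothetical cost at reporting
)

# Hold out 20% of claims for evaluation
split <- initial_split(claims, prop = 0.8)

# A simple parsnip model spec fitted on the training claims
model <- linear_reg() |>
  set_engine("lm") |>
  fit(log(ultimate) ~ age + log(reported), data = training(split))

# Predict ultimate cost (on the log scale) for held-out claims
predict(model, new_data = testing(split))
```

The value of the spec-then-fit pattern is that swapping `linear_reg()` for another parsnip model (e.g., a boosted tree) leaves the rest of the pipeline unchanged.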
ID: 146 / ep-02: 51 Elevator Pitch Topics: Biostatistics and EpidemiologyKeywords: regression, mixed-effects model, grouped data, correlated outcomes, transformation model tramME: Mixed-Effects Transformation Models Using Template Model Builder Balint Tamasi, Torsten Hothorn Epidemiology, Biostatistics and Prevention Institute (EBPI), University of Zurich, Switzerland Statistical models that allow for departures from strong distributional assumptions on the outcome and accommodate correlated data structures are essential in many applied regression settings. Our technical note presents the R package tramME, which implements the mixed-effects extension of linear transformation models. The model is appealing because it directly parameterizes the (conditional) distribution function and estimates the necessary transformation of the outcome in a data-driven way. As a result, transformation models represent a general and flexible approach to regression modeling of discrete and continuous outcomes. The package tramME builds on existing implementations of transformation models (the mlt and tram packages) as well as the Laplace approximation and automatic differentiation (using the TMB package) to perform fast and efficient likelihood-based estimation and inference in mixed-effects transformation models. The resulting framework can be readily applied to a wide range of regression problems with grouped data structures. Two examples are presented, which demonstrate how the model can be used for modeling correlated outcomes without strict distributional assumptions: 1) a mixed-effects continuous outcome logistic regression for longitudinal data with a bounded response; 2) a flexible parametric proportional hazards model for time-to-event data from a multi-center trial. Keywords: correlated outcomes, mixed-effects models, R package development, regression, transformation models Link to package or code repository. 
ID: 119 / ep-02: 52 Elevator Pitch Topics: Statistical modelsKeywords: big data The one-step estimation procedure in R Alexandre Brouste1, Christophe Dutang2 1Le Mans Université; 2Université Paris-Dauphine In finite-dimensional parameter estimation, the Le Cam one-step procedure is based on an initial guess estimator and a Fisher scoring step on the log-likelihood function. For an initial √n-consistent guess estimator, the one-step estimation procedure is asymptotically efficient. As soon as the guess estimator is in closed form, it can also be computed faster than the maximum likelihood estimator. More recently, it has been shown that this procedure can be extended to an initial guess estimator with a slower speed of convergence. Based on this result, we propose in the OneStep package (available on CRAN) a procedure to compute the one-step estimator in any situation, faster than the MLE for large datasets. Monte Carlo simulations are carried out for several examples of statistical experiments generated by i.i.d. observation samples (discrete and continuous probability distributions). Thereby, we exhibit the performance of Le Cam's one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size. A real application and future package developments will also be discussed. 
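To make the procedure concrete, here is a small base-R sketch of a Le Cam one-step estimator (not code from the OneStep package itself): estimating the location of a Cauchy sample, with the sample median as the √n-consistent closed-form initial guess, followed by a single Fisher scoring step on the log-likelihood.

```r
# One-step estimation for the location of a Cauchy(theta, 1) sample.
# The score of one observation is 2*(x - theta) / (1 + (x - theta)^2)
# and the per-observation Fisher information is I = 1/2, so the
# one-step update is theta1 = theta0 + score(theta0) / (n * I).
onestep_cauchy <- function(x) {
  theta0 <- median(x)  # closed-form, sqrt(n)-consistent initial guess
  score  <- sum(2 * (x - theta0) / (1 + (x - theta0)^2))
  theta0 + score / (length(x) * 0.5)  # single Fisher scoring step
}

set.seed(42)
x <- rcauchy(10000, location = 3)
onestep_cauchy(x)  # close to 3, no iterative optimisation needed
```

A single scoring step from a consistent guess attains the same asymptotic efficiency as the MLE while costing one pass over the data, which is the speed advantage the abstract refers to.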
Link to package or code repository.https://cran.r-project.org/web/packages/OneStep/index.html 2:30pm - 2:45pm BreakVirtual location: The Lounge #lobby 2:45pm - 3:45pm Keynote: Enseñando a enseñar sin perder a nadie en el camino - Teaching how to teach without leaving anyone behindVirtual location: The Lounge #key_metadocenciaSession Chair: Juan Pablo Ruiz NicoliniZoom Host: Nasrin Fathollahzadeh AttarReplacement Zoom Host: Tuli AmutenyaSession Sponsor: ixpantia Session Slide ID: 357 / [Single Presentation of ID 357]: 1 Keynote Talk Topics: Teaching R/R in Teaching Paola Corrales, Elio Campitelli, Ivan Poggio Metadocencia Metadocencia nace en marzo de 2020 cuando la pandemia nos obligó a cambiar la manera en que enseñamos y aprendemos. En ese momento nos encontramos casi sin tiempo ni recursos pero con muchas ganas de ayudar y compartir nuestra experiencia con otras y otros docentes. Comenzamos dando un taller para compartir métodos educativos basados en evidencia y que se pudieran aplicar de manera sencilla. También brindamos recursos abiertos para fomentar prácticas de enseñanza eficaces e invitamos a las personas a compartir su experiencia y formar una comunidad. Un año después abrimos 3 nuevos talleres y llegamos a más de 1500 personas en 30 países. En esta charla contaremos sobre algunos de los valores fundamentales que nos definen: encontrarnos con nuestros estudiantes en su lugar y atendiendo a su contexto en Latinoamérica. Esto significa no hacer suposiciones sobre su conocimiento de tecnología o sobre el acceso y la disponibilidad de Internet, diferencias culturales, barreras y necesidades específicas. Queremos compartir lo que aprendimos enseñando en comunidad. Metadocencia was born in March 2020 when the pandemic forced us to change the way we teach and learn. At the time we found ourselves with little time or resources but eager to help by sharing our experience with other teachers. 
We began by running a workshop to share evidence-based educational methods that could be applied in a simple way. We also provided open resources to encourage effective teaching practices and invited people to share their experience and form a community. A year later, we opened 3 new workshops and reached more than 1500 people in 30 countries. In this talk we will discuss some of the core values that define us: meeting our students where they are and attending to their context in Latin America. This means making no assumptions about their familiarity with technology or about Internet access and availability, cultural differences, barriers, and specific needs. We want to share what we have learned teaching in community. 3:45pm - 4:00pm BreakVirtual location: The Lounge #lobby 4:00pm - 5:00pm Incubator: The role of the R community in the RSE movementLocation: The Lounge #incubator_rseSession Chair: Matt BannertSession Chair: Heather TurnerZoom Host: Faith MusiliReplacement Zoom Host: Rachel Heyard Session Slides ID: 352 / [Single Presentation of ID 352]: 1 Incubator Topics: Community and Outreach Heidi Seibold, Heather Turner, Matt Bannert Johner Institut, Germany The term "Research Software Engineer" (RSE) was proposed by a group of software developers working in academia at a workshop in Oxford, UK, in 2012. It was the beginning of a grass-roots movement to establish Research Software Engineering as a profession for people who combine expertise in programming with an intricate understanding of research. Since then, the movement has grown substantially, leading to recognition, reward and career opportunities for RSEs and the creation of national RSE associations in Australia/New Zealand, Belgium, Germany, the Netherlands, the Nordic region, the UK and the USA. This incubator will provide an opportunity to discuss the role of the R community in the RSE movement. What can we share with this wider community? How can we help the movement grow? What could the R community gain from this movement? We will identify a range of actions, from quick wins to more ambitious projects, that could be pursued after useR! 2021. 4:00pm - 5:00pm Panel: R User or R Developer? 
This is the question!Location: The Lounge #panel_user_developerSession Chair: Francesca VitaliniZoom Host: Maryam AlizadehReplacement Zoom Host: Tuli Amutenya Session Slides ID: 214 / [Single Presentation of ID 214]: 1 Panel Topics: OtherKeywords: DevOps Francesca Vitalini, Riccardo Porreca, Stéphanie Gehring, Peter Schmid Mirai Solutions GmbH Since its first official release back in 1995, R has outgrown its statistician-tool origins, spreading out to different fields. A key factor in R's popularity is without a doubt its approachability for people without a software engineering background. As a result, R is often considered more of a scripting / prototyping / data analytics tool than a proper software development language. Thanks to its low barrier to entry and accessibility, however, people with a wide variety of (non-technical) backgrounds can quickly become active and effective users. This in turn can set the ground for exploring and building up programming and development skills, transitioning towards what is normally associated with software engineering profiles. This raises some questions: what does it mean to be an R user? Is there such a thing as an R developer, and (how) does it differ from being an R user? In a time when IT skills are required across virtually every domain, can an R user afford not to be a software engineer as well? What is the R equivalent of the Python full-stack developer, and does it even exist? What type of background and expertise should an R user have to fit what companies are looking for? And what about academia? What is the current trend? We will discuss these questions in a panel featuring the points of view of experts from both industry and academia, data scientists who have made the transition from R users to software developers, the R Core Team, and of course the Community perspective. 5:00pm - 6:00pm mixR!Music, networking channel and raffles. 
To end the day in a relaxing way 11:15pm - 11:59pm Tutorials - Track 2 11:15pm - 2:15amID: 321 / 2A-Tut: 1 Tutorial Topics: Efficient programmingTranslating R to Your Language Michael Chirico, Michael Lawrence - Language: English - Duration: 180 - Participants: 30 - Level: Intermediate+ R users are a global bunch. Providing error messages in languages besides English can greatly improve the user experience (and debugging experience) of those R users who are not native English speakers. This tutorial aims to get package developers and other R community members started implementing foreign-language translations of R's messages (errors, warnings, verbose output, etc.) into a language of their choosing. The standard tools for providing translations can be somewhat esoteric; in this tutorial, we'll go over some of the challenges presented by translations, the process for providing and/or updating translations to R itself, and finally introduce a package (potools) that will remove some of the friction potential translators may face. We especially encourage attendance from speakers of major world languages currently missing from the R translation database, in particular Hindi, Arabic, Bengali, Urdu, and Bahasa Indonesia.  Date: Wednesday, 07/July/2021 12:05am Movie: Coded Bias The movie "Coded Bias" will be available to watch for 24 hours, and we'll have a channel open for discussions 7:00am - 11:59pm Tutorials - Track 1 7:00am - 9:00amID: 300 / 1-Tut: 1 Tutorial Topics: Data visualisationData visualization using ggplot2 and its extensions Haifa Ben Messaoud, Mouna Belaid, Kaouthar Driss, Amir Souissi - Language: English - Duration: 120mn - N° Participants: 100 - Level: Beginner "This tutorial will cover the introduction to ggplot2 and its main functions. 
We will cover how to make visualizations of one variable, two variables, and three or more variables, how to lay out multiple plots, the use of ggstats for statistical visualizations, how to make interactive graphs using plotly, animations with gganimate, and some extensions of ggplot2. Finally, we will show you how to enhance the quality of your graphs by changing the theme or adding a logo, and how to export your graph. We will share the code on the GitHub repository of R-Ladies Tunis." 9:00am - 9:15amID: 322 / 1-Tut: 2 Breaks Break useR! 2021 9:15am - 11:45amID: 301 / 1-Tut: 3 Tutorial Topics: Bayesian modelsAdditive Bayesian Networks Modeling Gilles Kratzer, Reinhard Furrer - Language: English - Duration: 150 mn - N° Participants: 60 - Level: Intermediate Additive Bayesian Networks (ABN) have been developed to disentangle complex relationships in highly correlated datasets, as frequently encountered in risk factor analysis studies. ABN is an efficient approach to sorting out direct and indirect relationships among variables, a situation that is surprisingly common in systemic epidemiology. After the tutorial, you will run the particular steps of an ABN analysis with real-world data. You will be able to contrast this approach with standard regression (linear, logistic, Poisson regression, and multinomial models) used for classical risk factor analysis. Towards the end, we also cover Bayesian model averaging in the context of an ABN, which is useful for assessing the validity of the learned model and for more advanced inference on the network. 11:45am - 12:00pmID: 323 / 1-Tut: 4 Breaks Break useR! 2021 12:00pm - 2:00pmID: 302 / 1-Tut: 5 Tutorial Topics: Spatial analysis, Data visualisationQuick high quality maps with R Jan-Philipp Kolb - Language: English - Duration: 120 mn - N° Participants: 40 - Level: Beginner This tutorial covers the basic use of R for creating maps. Useful tools, as well as data sources, are presented. 
Concerning tools, the focus is on the packages osmplotr, tmap, and raster. In the first part of the tutorial, you will learn how to use OpenStreetMap data. Geocoding and the creation of bounding boxes will be presented, as well as the use of shapefiles to create thematic maps and color-coding in R. After this introduction to the basic concepts and functionalities of mapping with R, you will go through a prototypical data analysis workflow: import, wrangling, exploration, (basic) analysis, reporting. You will have the opportunity to create your own maps during the workshop. A GitHub repo for the course will be shared. 2:00pm - 2:15pmID: 324 / 1-Tut: 6 Breaks Break useR! 2021 2:15pm - 4:15pmID: 305 / 1-Tut: 7 Tutorial Topics: Big / High dimensional data, Spatial analysis, Efficient programmingKeywords: spam, maximum likelihood estimation, covariance function, BLUP, Gaussian process Spatial Statistics for huge datasets and best practices Reinhard Furrer, Roman Flury, Federico Blasi - Language: English - Duration: 120 mn - N° Participants: 50 - Level: Advanced During the last decade, several advanced approaches have been proposed to address the computational issues of larger and larger multivariate space-time datasets. These can essentially be categorized as (i) constructing "simpler" models (e.g., low-rank models, composite likelihood methods, predictive process models) or (ii) approximating the models (e.g., with Gaussian Markov random fields, compactly supported covariance functions). In this tutorial, we discuss the latter point by using sparse covariance matrix approximations. With long vectors and 64-bit handling algorithms, there is seemingly no limit to the sample size. However, the devil is in the details, and to avoid unpleasant surprises we provide best practices, strategies, and tricks for modeling huge spatial data. 4:15pm - 4:30pmID: 325 / 1-Tut: 8 Breaks Break useR! 
2021 4:30pm - 7:30pmID: 304 / 1-Tut: 9 Tutorial Topics: Community and Outreach, AlgorithmsContributing to R Gabriel Becker, Martin Maechler Clindata Insights, United States of America - Language: English - Duration: 180 mn - N° Participants: 30 - Level: Intermediate to Advanced. Have you always wanted to contribute to (base) R but don't know how? Come to our tutorial! We will show cases where and how users have contributed actively to (base) R: submitting bug reports with minimal reproducible examples, and how testing, reading source code, and providing patches to the R source code have helped make R better. Depending on the participants' willingness and level of sophistication, we will look into doing things right now, for currently unresolved issues and bug reports. 7:30pm - 7:45pmID: 326 / 1-Tut: 10 Breaks Break useR! 2021 7:45pm - 10:15pmID: 303 / 1-Tut: 11 Tutorial Topics: Data visualisationGraphing multivariate categorical data: The how, what and why of mosaic plots and alluvial diagrams Joyce Robbins, Ludmila Janda - Language: English - Duration: 150 - N° Participants: 24 - Level: Beginner Multivariate categorical data present unique data visualization challenges. This tutorial provides two options to meet such challenges: mosaic plots and alluvial diagrams. First, we will focus on how to choose the best graph for given data types and communication goals. You will then learn how to get the underlying data into the correct shape for each graph and then create both graph types using the vcd and ggalluvial packages. We will use engaging datasets and aim to equip you with the skills to make these graphs (and decide whether to use them) on your own. 7:00am - 11:59pm Tutorials - Track 2 7:00am - 10:00amID: 316 / 2-Tut: 1 Tutorial Topics: Other, Efficient programmingKeywords: testing, vcr, testthat, mocking, fixtures GET better at testing your R package! 
Maëlle Salmon, Scott Chamberlain Are you a package developer who wants to improve your understanding and practice of unit testing? You've come to the right place: this tutorial is about advanced testing of R packages, with HTTP testing as a case study. Unit tests have numerous advantages, such as preventing future breakage of your package and helping you define features (test-driven development). In many introductions to package development you learn how to set up testthat infrastructure and how to write a few "cute little tests" (https://testthat.r-lib.org/articles/test-fixtures.html#test-fixtures) with only inline assertions. This might work for a bit, but soon you will encounter some practical and theoretical challenges: e.g., where do you put data and helpers for your tests? If your package is wrapping a web API, how do you test it independently of any internet connection? And how do you test the behavior of your package in case of API errors? In this tutorial we shall use HTTP testing with the vcr package as an opportunity to empower you with more knowledge of testing principles (e.g., cleaning up after yourself, testing error behavior) and testthat practicalities (e.g., testthat helper files, testthat custom skippers). After this tutorial, you will be able to use the handy vcr package for your package wrapping a web API or any other web resource, and you will also have gained skills transferable to your other testing endeavours! Come and learn from rOpenSci expertise! Related materials: https://devguide.ropensci.org/building.html#testing https://books.ropensci.org/http-testing https://blog.r-hub.io/2019/10/29/mocking/ https://blog.r-hub.io/2020/11/18/testthat-utility-belt/ 10:00am - 10:15amID: 327 / 2-Tut: 2 Breaks Break useR! 
2021 10:15am - 2:15pmID: 317 / 2-Tut: 3 Tutorial Topics: Other, R in productionSystematic data validation with the validate package Mark van der Loo, Edwin de Jonge Statistics Netherlands, The Netherlands - Language: English - Duration: 240 mn - N° Participants: 30 - Level: Intermediate Checking the quality of data is a task that pervades data analysis. It does not matter whether you are working with raw data, cleaned data, or with the results of an analysis: it is always important to convince yourself that the data you are using is fit for its intended purpose. Since it is such a common task, why not automate it? The 'validate' package is designed for exactly this task: it implements a domain-specific language for data checking that aims to encompass any check you might wish to perform. In this course you will learn to define and measure data quality in a precise way with the validate package. We will focus on the main workflow, and show you how you can involve domain experts directly in your work, even if they do not know R. You will learn the main principles of data validation, both from the point of view of organizing a data processing workflow and from a more formal perspective. You will exercise data validation tasks that range from checking input formats and types to complex checks that involve data from multiple sources. You will learn how to follow the evolution of data quality as it is processed using the lumberjack package. And you will learn how to flush out redundant or contradictory quality demands using the validatetools package. The course will consist of hands-on work, based on a prepared tutorial that will be published on GitHub. There will be break-out sessions with assignments where you can discuss the materials with other course participants. The presentations will include some Kahoot quizzes to keep things interactive, fun, and focused. 2:15pm - 2:30pmID: 330 / 2-Tut: 4 Breaks Break useR! 
2021 2:30pm - 6:00pmID: 318 / 2-Tut: 5 Tutorial Topics: R in production, Web Applications (Shiny/Dash)Production-grade Shiny Apps with {golem} - French Vincent GUYADER, Cervan Girard - Language: French - Duration: 120 mn - N° Participants: 30 - Level: Intermediate This tutorial is aimed at intermediate or advanced Shiny application developers who want to design "clean" applications following best practices. We will present the different steps necessary to obtain an application deployed in production. Active participation is expected, with screen sharing, microphone (and if possible webcam). 6:00pm - 6:15pmID: 331 / 2-Tut: 6 Breaks Break useR! 2021 6:15pm - 7:15pmID: 319 / 2-Tut: 7 Tutorial Topics: Data mining / Machine learning / Deep Learning and AIPenguins in a Box: Interactive Data Science Tutorial with Penguins. Maria Dermit, Susana Escobar - Language: English - Duration: 60 mn - N° Participants: 100 - Level: Intermediate Penguins in a Box is a learnr package that covers the topics of the R for Data Science book, using the widely used penguins dataset to explore the book's concepts. The package currently contains one tutorial for each chapter of the book and will be introduced during the presentation. In addition, you will join breakout rooms to work on modules covering the book's main sections (i.e., Explore, Wrangle, Program, Model and Communicate; 6 sections in total) according to your learning objectives. This tutorial is aimed at both students who want to improve their data science skills in an interactive way and teachers who want access to additional learnr resources similar to RStudio Primers (https://rstudio.cloud/learn/primers). The tutorial is designed to be interactive, and peer instruction between attendees is intended to guide learning in the breakout rooms. 7:15pm - 8:00pmID: 329 / 2-Tut: 8 Breaks Break useR! 
2021 8:00pm - 11:00pmID: 320 / 2-Tut: 9 Tutorial Topics: Teaching R/R in Teaching, OtherProfessional, Polished, Presentable: Making Great Slides with xaringan Garrick Aden-Buie, Silvia Canelón - Language: English - Duration: 180mn - N° Participants: 60 - Level: Intermediate The xaringan package brings professional, impressive, and visually appealing slides to the powerful R Markdown ecosystem. Through our hands-on tutorial, you will learn how to design highly effective slides that support presentations for teaching and reporting alike. Over three hours, you will learn how to create an accessible baseline design that matches your institution or organization’s style guide. Together we’ll explore the basics of CSS—the design language of the internet—and how we can leverage CSS to produce elegant slides for effective communication. Finally, we’ll deploy our slides online where they can be shared and discovered by others long after they support our presentations. The tutorial will demonstrate how to use the skills learned to incorporate principles of accessible design into your presentations. The tutorial will feature live coding and interactive question-and-answer periods, interspersed with small-group breakout sessions for guided hands-on experience. The tutorial will be supported by a repository of materials. 11:00pm - 11:15pmID: 328 / 2-Tut: 10 Breaks Break useR! 2021 7:00am - 11:59pm Tutorials - Track 3 7:00am - 10:00amID: 311 / 3-Tut: 1 Tutorial Topics: Algorithms, Data mining / Machine learning / Deep Learning and AIKeywords: Interpretable Machine Learning, Explainable Artificial Intelligence, Machine Learning, Fairness, Responsible Machine Learning Introduction to Responsible Machine Learning Anna Kozak, Hubert Baniecki, Przemyslaw Biecek, Jakub Wisniewski - Language: English - Duration: 180 - N° Participants: 150 - Level: Beginner What? 
The workshop focuses on responsible machine learning, including areas such as model fairness, explainability, and validation. Why? To gain theory and hands-on experience in developing safe and effective predictive models. For whom? For those with basic knowledge of R, familiar with supervised machine learning and interested in model validation. What will be used? We will use the DALEX package for explanations, fairmodels for checking bias, and modelStudio for interactive model analysis. Where? 100% online When? Wednesday, 7th of July, 7:00 - 10:00 am (UTC) 10:00am - 10:15amID: 332 / 3-Tut: 2 Breaks Break useR! 2021 10:15am - 2:15pmID: 312 / 3-Tut: 3 Tutorial Topics: Spatial analysis, Data visualisationEntry level R maps from African data - French - English Andy South, Anelda van der Walt, Ahmadou Dicko, Shelmith Kariuki, Laurie Baker - Language: French - English - Duration: 240 - N° Participants: 60 - Level: Beginner This tutorial will provide an introduction to mapping and spatial data in R using African data. By the end of the tutorial, you should be able to make a map that is useful to you from data that you have brought yourself. We will focus on developing confidence in doing the basics really well, in preference to straying too far into more advanced analyses. Our tutorials focus on flexible workflows that you can take away. You will also learn how to spot and avoid common pitfalls. The training will be partly based around a set of interactive learnr tutorials that we have created as part of the afrilearnr package (https://github.com/afrimapr/afrilearnr) and accompanying online demos described in this blog post: https://afrimapr.github.io/afrimapr.website/blog/2021/interactive-tutorials-for-african-maps/. The tutorial will be available on shinyapps for those who are unable to install locally. There will be separate English & French language groups with dedicated materials. 
Each group will start together for the first few sessions and then break into sub-groups of up to 10 learners with one trainer each, for improved feedback and discussion. Towards the end of the tutorial we will challenge you to make a map using data that you have brought or found. Each language group will come back together for a final wrap-up session. 2:15pm - 2:30pmID: 333 / 3-Tut: 4 Breaks Break useR! 2021 2:30pm - 5:30pmID: 313 / 3-Tut: 5 Tutorial Topics: Community and Outreach, Reproducibility, OtherHow to build a package with the "Rmd First" method Sébastien Rochette, Emily Riederer ThinkR - Language: English - Duration: 180 - N° Participants: 30 - Level: Intermediate The "Rmd First" method can reduce mental load when building packages by keeping users in a natural environment, using a tool they know: an R Markdown document. The step between writing your own R code to analyze some data and refactoring it into a well-documented, ready-to-share R package seems unreachable to many R users. The package structure is sometimes perceived as useful only for building general-purpose tools for data analysis to be shared on official platforms. However, packages can be used for a broader range of purposes, from internal use to open-source sharing. Because packages are designed for robustness and enforce helpful standards for documentation and testing, the package structure provides a useful framework for refactoring analyses and preparing them to go into production. 
The following approach to writing a development or an analysis inside an Rmd will significantly reduce the work of transforming an Rmd into a package: - _Design_: define the goal of your next steps and the tools needed to reach them - _Prototype_: use some small examples to prototype your script in Rmd - _Build_: build your script as functions and document your work so that you can use them, in the future, on real-life datasets - _Strengthen_: create tests to ensure the stability of your code and follow modifications through time - _Deploy_: transform it into a well-structured package to deploy and share with your community During this tutorial, we will work through the steps of Rmd Driven Development to persuade attendees that their experience writing R code means that they already know how to build a package. They only need a safe environment in which to find this out, which is what we propose. We will take advantage of all existing tools such as {devtools}, {testthat}, {attachment} and {usethis} that ease package development, from Rmd to built package. The recent package [{fusen}](https://thinkr-open.github.io/fusen), which "inflates a package from a simple flat Rmd", will be presented to further reduce the step between a well-designed Rmd and package deployment. Attendees will leave this workshop having built their first package with the "Rmd First" method and with the skills and tools to build more packages on their own. 5:30pm - 5:45pmID: 334 / 3-Tut: 6 Breaks Break useR! 2021 5:45pm - 8:30pmID: 314 / 3-Tut: 7 Tutorial Topics: Bayesian models, Statistical modelsBayesian modeling in R with {rstanarm} - Spanish Fernando Antonio Zepeda Herrera - Language: Spanish - Duration: 165 min - N° Participants: 30 - Level: Intermediate This tutorial will introduce Bayesian modeling in R, particularly through {rstanarm}. We will alternate between "lectures" and "practical" examples (with {learnr} tutorials). 
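To give a flavour of the workflow such a tutorial covers, a minimal {rstanarm} fit might look like the sketch below. It assumes the package is installed; the mtcars formula is purely illustrative, not taken from the tutorial.

```r
# Sketch: Bayesian linear regression with {rstanarm}.
library(rstanarm)

fit <- stan_glm(mpg ~ wt + hp, data = mtcars,
                family = gaussian(),
                chains = 2, iter = 1000, refresh = 0)

summary(fit)                          # posterior summaries and diagnostics
posterior_interval(fit, prob = 0.9)   # 90% credible intervals
```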
Starting with a brief introduction to the Bayesian paradigm, we will cover linear and generalized linear regression as well as useful diagnostics and posterior visualization. 8:30pm - 8:45pmID: 335 / 3-Tut: 8 Breaks Break useR! 2021 8:45pm - 11:45pmID: 315 / 3-Tut: 9 Tutorial Topics: Big / High dimensional data, R in productionIntroduction to TileDB for R Dirk Eddelbuettel, Aaron Wolen - Language: English - Duration: 180 min - N° Participants: 200 - Level: Intermediate TileDB is an open source universal data engine that natively supports dense and sparse multidimensional arrays, as well as data frames. Large datasets can be stored on multiple backends ranging from a local filesystem to cloud storage providers such as Amazon S3 (as well as Google Cloud Storage and Azure Cloud Storage) and accessed using almost any language, including Python and R. The tutorial introduces the 'tiledb' R package on CRAN, which allows users to efficiently operate on large dense/sparse arrays using familiar R techniques and data structures. It also offers key features of the underlying TileDB Embedded library: parallelised read and write operations, multiple compression formats, time traveling (i.e., the ability to recover data stored at previous timepoints), flexible encryption, and Apache Arrow support. Several simple usage examples will be provided and you will have an opportunity to follow along on your laptops. One or two fuller usage examples from bioinformatics will serve as a more extended case study. We will illustrate how TileDB can be used to create a performant data store for results produced by Genome-Wide Association Studies, and demonstrate the Bioconductor package TileDBArray, which is built on top of the DelayedArray framework and has shown excellent performance relative to existing (HDF5-based) solutions. Finally, usage of TileDB with cloud storage providers will be illustrated. 
This covers both direct reads and writes to, for example, Amazon S3 as well as a brief illustration of the 'pay-as-you-go' Software-as-a-Service offering of TileDB Cloud with its additional features. 7:00am - 11:59pm Tutorials - Track 4 7:00am - 9:00amID: 306 / 4-Tut: 1 Tutorial Topics: Web Applications (Shiny/Dash)Keywords: Shiny, Modules, Code reuse, Software engineering, Reactivity Structure your app: introduction to Shiny modules Jonas Hagenberg - Language: English - Duration: 120 min - N° Participants: 25 - Level: Intermediate You communicate your results interactively with Shiny, maintain a dashboard or provide business logic, but the codebase of your app becomes too complex? Then modules are the right tool for you: they are Shiny's built-in solution for managing this complexity. Shiny modules allow you to break down your code into smaller building blocks that can be combined and reused. In this tutorial I give an introduction to modules, their advantages over plain R functions, and how existing functionality can be transferred to modules. For an easy start, I cover common pitfalls that need to be overcome for productive use of modules: - Passing reactive objects to modules - Returning reactive values from the module to the calling environment - Nesting modules - Dynamically generating modules (including UI) The contents of the tutorial are delivered by short lectures followed by hands-on coding sessions in break-out rooms. For this, you need a basic knowledge of reactive programming/Shiny. 9:00am - 10:30amID: 336 / 4-Tut: 2 Breaks Break useR! 2021 10:30am - 1:30pmID: 307 / 4-Tut: 3 Tutorial Topics: Data mining / Machine learning / Deep Learning and AI, Interfaces with other programming languagesGetting started with torch (in French) Sigrid Keydana, Daniel Falbel - Language: French - Duration: 180 min - N° Participants: 100 - Level: Intermediate Torch (https://torch.mlverse.org/) is an open source machine learning framework based on PyTorch. 
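As a taste of the basics such a tutorial covers, tensors and automatic differentiation in {torch} can be sketched as follows (assuming the package and its backend libraries are installed):

```r
# Sketch: a tensor, a computation, and its gradient via autograd.
library(torch)

x <- torch_tensor(c(1, 2, 3), requires_grad = TRUE)
y <- (x^2)$sum()   # y = x1^2 + x2^2 + x3^2
y$backward()       # compute dy/dx by automatic differentiation
x$grad             # gradient is 2 * x
```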
Not requiring any Python dependencies, torch for R is at once a powerful computational engine including GPU acceleration, a neural network library, and an ecosystem providing tools for, among others, image, text, and audio processing. This tutorial will provide a thorough introduction to torch basics: tensors, automatic differentiation, and neural network modules. Thereafter, we delve into two areas of special interest to R users: time series forecasting and numerical optimization. All sections will include time slots for practice. Training materials will be available in an English version as well. Participants not speaking French, but who would like to join the training anyway, are welcome to ask questions in English in the chat. 1:30pm - 1:45pmID: 337 / 4-Tut: 4 Breaks Break useR! 2021 1:45pm - 2:45pmID: 308 / 4-Tut: 5 Tutorial Topics: Data mining / Machine learning / Deep Learning and AIPinguinos en caja: tutorial interactivo de ciencia de datos con pinguinos - Español. Maria Dermit, Susana Escobar - Language: Spanish - Duration: 60 min - N° Participants: 100 - Level: Intermediate Pingüinos en Caja ("Penguins in a Box") is a learnr package that covers the topics of the book R for Data Science, using the well-known penguins dataset to explore the book's concepts. The package currently contains one tutorial per chapter of the book and will be presented during the workshop. In addition, attendees will work in breakout rooms on modules organised by the book's main sections (e.g. Explore, Wrangle, Program, Model, and Communicate; six sections in total) according to their learning goals. The audience for this tutorial comprises students who want to improve their data science skills interactively and teachers who want access to additional learning resources similar to the RStudio Primers (https://rstudio.cloud/learn/primers). 
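A typical "Explore"-style exercise on this dataset might look like the sketch below (assuming the {palmerpenguins} and {dplyr} packages are installed; the particular summary is illustrative, not taken from the tutorials):

```r
# Sketch: summarise penguin body mass by species.
library(palmerpenguins)
library(dplyr)

penguins %>%
  filter(!is.na(body_mass_g)) %>%
  group_by(species) %>%
  summarise(mean_mass_g = mean(body_mass_g), n = n())
```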
The tutorial aims to be interactive, and peer instruction among attendees will be used to guide learning in the breakout rooms. 2:45pm - 3:00pmID: 338 / 4-Tut: 6 Breaks Break useR! 2021 3:00pm - 6:00pmID: 309 / 4-Tut: 7 Tutorial Topics: Big / High dimensional data, R in production, Web Applications (Shiny/Dash)Data Pipelines at scale with R and Kubernetes - Spanish Frans van Dunné - Language: Spanish - Duration: 180 min - N° Participants: 40 - Level: Advanced Many R users are confronted with larger and larger amounts of data that need to be processed. In this tutorial we will show you how to go to the next level by massively parallelizing your R code on a Kubernetes cluster. We will show you how to move your entire data pipeline to Kubernetes, where each node in the pipeline consists of a container running R code. These containers can run with multiple cores, and can be farmed out as tens or hundreds of containers running in parallel. Our experience has shown that this allows for massive speed gains, at relatively low cost when the Kubernetes cluster is populated with ephemeral virtual machines (e.g. preemptible VMs on GCP, Spot Instances on AWS). You need to have an interest in the more technical aspects of running R code, but only to a degree. We hope to dispel any fear that you might have that setting up a cluster is something that is very difficult. A key tool we will introduce is Pachyderm (the open-source version), which creates data pipelines on Kubernetes. The tutorial will be a combination of theory, break-outs to run things hands-on, and regrouping to talk about experiences before taking the next step. We will set up code examples in steps, so that if one step did not work out, after regrouping the group can take off from the next starting point. 6:00pm - 6:15pmID: 339 / 4-Tut: 8 Breaks Break useR! 
2021 6:15pm - 9:15pmID: 310 / 4-Tut: 9 Tutorial Topics: Spatial analysis, Environmental sciences, Data visualisationDatos espaciales a lo tidy - Español Elio Campitelli, Paola Corrales - Language: Spanish - Duration: 180 min - N° Participants: 40 - Level: Intermediate In this tutorial you will learn how to download, read, analyse, and visualise gridded spatial data in R using tidy data. It will be a hands-on tutorial with live coding and exercises, built on the idea that you can use the data to answer your own questions by writing your own code. By the end of the workshop you will have learned how to: - download meteorological and climate data programmatically from R, - read them into a tidy format, - compute spatial and temporal statistics, - plot the results using ggplot2 and extensions.  Date: Thursday, 08/July/2021 12:30am - 1:30am Keynote: Expanding the Vocabulary of R GraphicsVirtual location: The Lounge #key_murrellSession Chair: Joyce RobbinsZoom Host: Olgun AydinReplacement Zoom Host: Jyoti Bhogal ID: 345 / [Single Presentation of ID 345]: 1 Keynote Talk Topics: Data visualisation Paul Murrell The University of Auckland, New Zealand At the heart of the R Graphics system lies a graphics engine. This defines a graphics vocabulary for R - a set of possible graphics operations like drawing a line, colouring in a polygon, or setting a clipping region. Graphics packages like 'ggplot2' allow users to describe a plot in terms of high-level concepts like geoms, scales, and aesthetics, but that high-level description has to be reduced to a set of graphics operations that the graphics engine can understand. Unfortunately, the R graphics engine has a limited vocabulary. It can only draw simple shapes, it can only fill regions with solid colour, and it can only set rectangular clipping regions. 
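That traditional vocabulary can be seen directly in base R's {grid} package, which talks almost straight to the engine. A minimal sketch using standard {grid} calls (not the new engine features the talk describes):

```r
# Simple shapes, solid fills, rectangular clipping: the engine's
# traditional vocabulary, exercised via {grid}.
library(grid)

grid.newpage()
grid.rect(width = 0.6, height = 0.6, gp = gpar(fill = "grey80"))
grid.polygon(x = c(0.3, 0.7, 0.5), y = c(0.3, 0.3, 0.7),
             gp = gpar(fill = "steelblue"))

# Clipping is limited to the rectangle of a viewport
pushViewport(viewport(width = 0.5, height = 0.5, clip = "on"))
grid.lines(x = c(-1, 2), y = c(-1, 2))
popViewport()
```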
This makes it hard (or impossible) for packages like 'ggplot2' to produce some types of graphical output because the graphics engine does not support several fundamental graphical operations. This talk will describe recent work on the graphics engine that expands its vocabulary to include gradient fills, pattern fills, clipping paths, and masks. 1:30am - 1:45am BreakVirtual location: The Lounge #lobby 1:45am - 3:15am 5A - Teaching R and R in TeachingVirtual location: The Lounge #talk_teachingSession Chair: Karthik RamanZoom Host: Adrian MagaReplacement Zoom Host: Jyoti BhogalSession Sponsor: Appsilon Session Slides 1:45am - 2:05amTalk-VideoID: 149 / ses-05-A: 1 Regular Talk Topics: Teaching R/R in TeachingKeywords: ecology Developing a datasets-based R package to teach environmental data science Allison Horst1, Julien Brun2 1UC Santa Barbara; 2NCEAS There are many openly available environmental datasets out there. However, it is time- and energy-consuming for teachers to identify, explore and clean complex datasets for use in environmental data science classes. As the success (>60k downloads) of the recent palmerpenguins R package demonstrates, there is strong demand and interest in curated real-world datasets ready to be used “out of the box” for data science teaching purposes. In this project, our goal was to develop a sample dataset and an associated analytical example for every site of the Long Term Ecological Research (LTER) network. This network, founded by the US National Science Foundation, comprises 30 sites where both observational and experimental environmental data sets are collected with a long-term perspective, providing a treasure trove of interesting, real-world environmental data. All of those resources have been combined into one R package. R packages are an ideal vehicle for teaching datasets because R is widely used in environmental research communities and degree programs, and packages can be installed in one command. 
In addition, the R Markdown ecosystem provides a suite of tools to publish the documentation and examples as a website to expose all the pedagogic content to non-R users as well. We relied on the package structure to develop a reproducible workflow to ingest and document the LTER data. We also wanted to share the code necessary to access the full dataset to enable further investigation of more complex datasets. In this presentation, we will explain our process to design this R package and provide a set of analysis examples for environmental data science teaching purposes. 2:05am - 2:25amTalk-VideoID: 248 / ses-05-A: 2 Regular Talk Topics: Teaching R/R in TeachingKeywords: community, outreach, rmarkdown Using R as a Community Workbench for The Carpentries Lesson Infrastructure Zhian N. Kamvar, François Michonneau The Carpentries, United States of America The Carpentries is a global community of volunteers that collaboratively develops and delivers lessons to build capacity in data and coding skills (in R and multiple other languages) to researchers worldwide. For the past five years, our collaboratively-developed lesson template (https://github.com/carpentries/styles/) has been the basis for our growing collection of peer-reviewed lesson content. This template was fully self-contained with all the tools and styles needed to create a full lesson website. While the lessons themselves were designed to be easy to author, there were two significant barriers in our toolchain for contributors: software installation and style updating. As our lesson repertoire and community has continued to grow, this template model has not scaled well, resulting in barriers to entry and wasted volunteer time. In 2020 we began the process to redesign our template from the ground up using a combination of R’s literate programming ecosystem and GitHub Workflows, resulting in three R packages called {sandpaper}, {pegboard}, and {varnish} for handling, validating, and styling lessons. 
The new approach separates the content from the tools and style, allowing for seamless updates so the maintainers can focus on authoring their lessons and not on the tools needed to build them. To accommodate the wide array of diverse skill sets in our community, we wanted to ensure the tools could be used by anyone without any prior knowledge of R. We will detail how we involved our community in iterative development of the new template with user stories, passive community feedback, community member interviews, and user experience testing. In the end, we will show how the wide array of tools available in the R ecosystem makes it easy for us to rebuild our lesson infrastructure in a way that significantly reduces the barrier to entry for our community volunteers. Link to package or code repository. 2:25am - 2:45amTalk-VideoID: 134 / ses-05-A: 3 Regular Talk Topics: Teaching R/R in TeachingKeywords: Bayesian analysis Teaching and Learning Bayesian Statistics with {bayesrules} Mine Dogucu1, Alicia A. Johnson2, Miles Ott3 1University of California, Irvine; 2Macalester College; 3Smith College Bayesian statistics is becoming more popular in data science. Data scientists are often not trained in Bayesian statistics, and if they are, it is usually part of their graduate training. During this talk, we will introduce an introductory course in Bayesian statistics for learners at the undergraduate level and comparably trained practitioners. We will share tools for teaching (and learning) the first course in Bayesian statistics, specifically the {bayesrules} package that accompanies the open-access Bayes Rules! An Introduction to Bayesian Modeling with R book. We will provide an outline of the curriculum and examples for novice learners and their instructors. 
Link to package or code repository: https://github.com/mdogucu/bayesrules 2:45am - 3:05amTalk-LiveID: 246 / ses-05-A: 4 Regular Talk Topics: Teaching R/R in TeachingKeywords: textbook, open-source, non-profit, bookdown, continuous integration Building and maintaining OpenIntro using the R ecosystem Mine Cetinkaya-Rundel Duke University, RStudio, United States of America OpenIntro's (openintro.org) mission is to make educational products that are free and transparent and that lower barriers to education. The products include textbooks (in print and online) and supporting resources for instructors as well as students. From day one, OpenIntro materials have been built using tools within the R ecosystem. In this talk we will discuss how the OpenIntro project has shaped and grown over the years, our process for developing and publishing open-source textbooks at the high school and college level, and our computing resources such as interactive R tutorials and R packages as well as labs in various languages. We will highlight recent workflows we have developed and lessons learned for converting books from LaTeX to bookdown and give an overview of our project organization and tooling for authoring, collaboration, and maintenance, much of which is built with R, R Markdown, Git, and GitHub. Finally, we will discuss opportunities for educators and students to get involved in contributing to the development of open-source educational resources under the OpenIntro umbrella and beyond. Link to package or code repository. 
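The bookdown side of such a workflow is compact. A minimal sketch of rendering a book (assuming a project containing an index.Rmd and the {bookdown} package; the gitbook output format is one common choice, not necessarily OpenIntro's):

```r
# Sketch: render a bookdown project from its root directory.
library(bookdown)

# Chapters are the .Rmd files configured in _bookdown.yml,
# starting from index.Rmd.
render_book("index.Rmd", output_format = "bookdown::gitbook")
```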
 1:45am - 3:15am 5B - Mathematical/Statistical MethodsVirtual location: The Lounge #talk_math_statsSession Chair: Marcela Alfaro CordobaZoom Host: Nick SpyrisonReplacement Zoom Host: Olgun AydinSession Slides 1:45am - 2:05amTalk-VideoID: 254 / ses-05-B: 1 Regular Talk Topics: Statistical modelsKeywords: Model misspecification, tidyverse, assumptions, variance estimation maars: Tidy Inference under misspecified statistical models in R Riccardo Fogliato, Shamindra Shrotriya, Arun Kumar Kuchibhotla Carnegie Mellon University, United States of America Linear regression using ordinary least squares (OLS) is a critical part of every statistician's toolkit. In R, this is elegantly implemented via lm() and its related functions. However, the statistical inference output from this suite of functions is based on the assumption that the model is well specified. This assumption is often unrealistic and at best satisfied approximately. In the statistics and econometric literature, this has long been recognized and a large body of work provides inference for OLS under more practical assumptions (e.g., only assuming independence of the observations). In this talk, we will introduce our package “maars” (models as approximations) that aims to bring research on inference in misspecified models to R via a comprehensive workflow. Our "maars" package differs from other packages that also implement variance estimation, such as “sandwich”, in three key ways. First, all functions in “maars” follow a consistent grammar and return output in tidy format (Wickham, 2014), with minimal deviation from the typical lm() workflow. Second, “maars” contains several tools for inference including the empirical, wild, and residual bootstraps, and subsampling. Third, “maars” is developed with pedagogy in mind. For this, most of its functions explicitly return the assumptions under which the output is valid. 
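The issue “maars” targets can be illustrated with base R alone: under heteroskedasticity, classical lm() standard errors and sandwich-style (HC0) robust standard errors disagree. The sketch below only illustrates that motivation; it is not the “maars” interface.

```r
# Classical vs. HC0 robust standard errors for OLS, base R only.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = abs(x))   # heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)
u <- residuals(fit)

classical <- sqrt(diag(vcov(fit)))
bread <- solve(crossprod(X))      # (X'X)^{-1}
meat  <- crossprod(X * u)         # sum of u_i^2 * x_i x_i'
robust <- sqrt(diag(bread %*% meat %*% bread))

rbind(classical, robust)          # the two sets of SEs differ
```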
This key innovation makes “maars” useful for teaching inference under misspecification and also makes it a powerful tool for applied researchers. We hope our default feature of explicitly presenting assumptions will become a de facto standard for most statistical modeling in R. Link to package or code repository. 2:05am - 2:25amTalk-LiveID: 285 / ses-05-B: 2 Regular Talk Topics: Operational research and optimizationKeywords: Evolutionary Strategies, Mixed Integer Problems, Multifidelity Optimization, Black Box Optimization, Multi-Objective Optimization Mixed Integer Evolutionary Strategies with "miesmuschel" Martin Binder LMU Munich, Germany Evolutionary Strategies (ES) are optimization algorithms inspired by biological evolution that do not make use of gradient information, and are therefore well-suited for "black-box optimization" where this information is not available. Mixed-Integer ES (MIES) are an extension that allows optimization of mixed continuous, integer, and categorical search spaces by defining different mutation and recombination operations on different subspaces. We present our new package "miesmuschel" (pronounced MEES-mooshl), a modular toolbox for MIES optimization. It provides "Operator" objects for mutation, recombination, and parent/survival selection that can be configured and combined in various ways to match the optimization problem at hand. Configuration parameters of operators can even be self-adaptive and evolve together with the solutions of the optimization problem. Miesmuschel can be used for both single- and multi-objective optimization, simply by using different selection operations. The multi-fidelity optimization capabilities of miesmuschel can be used for expensive objectives where early generations or new samples are preliminarily evaluated with less effort. 
A standard optimization loop (parent selection, recombination, mutation, survival selection) is given and can be used out-of-the-box, but the supplied methods can also be combined as building blocks to form more specialized algorithms. Miesmuschel makes use of the "paradox" and "bbotk" packages and integrates with the "mlr3" ecosystem. Link to package or code repository: https://github.com/mlr-org/miesmuschel 2:25am - 2:45amTalk-VideoID: 110 / ses-05-B: 3 Regular Talk Topics: Data mining / Machine learning / Deep Learning and AIKeywords: anomaly detection Here is the anomalow-down! Sevvandi Kandanaarachchi1, Rob J Hyndman2 1RMIT University; 2Monash University Why should we care about anomalies? They demand our attention because they are telling a different story from the norm. An anomaly might signify a failing heart rate of a patient, a fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies. What are the challenges in anomaly detection? As with many machine/statistical learning tasks, high-dimensional data poses a problem. Another challenge is selecting appropriate parameters. Yet another challenge is high false positive rates. In this talk we introduce two R packages – dobin and lookout – that address different challenges in anomaly detection. Dobin is a dimension reduction technique catered especially to anomaly detection. So, dobin is somewhat similar to PCA, but dobin puts anomalies at the forefront. We can use dobin as a pre-processing step and find anomalies using fewer dimensions. On the other hand, lookout is an anomaly detection method that uses kernel density estimates and extreme value theory. But there is a difference. Generally, anomaly detection methods that use kernel density estimates require a user-defined bandwidth parameter. But does the user know how to specify this elusive bandwidth parameter? 
Lookout addresses this challenge by constructing an appropriate bandwidth for anomaly detection using topological data analysis, so the user doesn’t need to specify a bandwidth parameter. Furthermore, lookout has a low false positive rate because it uses extreme value theory. We also introduce the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly is high. Link to package or code repository. 3:15am - 3:30am BreakVirtual location: The Lounge #lobby 3:30am - 4:30am Incubator: Strategies to build a strong AsiaR CommunityVirtual location: The Lounge #incubator_asiarSession Chair: Janani RaviSession Chair: Adithi R. UpadhyaZoom Host: Jyoti BhogalReplacement Zoom Host: Olgun Aydin ID: 342 / [Single Presentation of ID 342]: 1 Incubator Topics: Community and OutreachKeywords: Community building, R in Asia, Education and outreach, Diversity Adithi R. Upadhya1, Dr. Janani Ravi2 1ILK Labs, India; 2Michigan State University, United States of America R has been a very inclusive community, and collective learning has always helped; with many R users in Asian countries, we can likewise have a strongly knit community. Inspired by the MENA (Middle East and North Africa) R, AfricaR, and LatinR user groups, we propose a similar panel discussion to connect and strengthen the R community in Asia. We aim to target participants who are active R users and/or learners who have not been engaged with any R community, and we want to invite panelists who are successful R developers/educators/community leaders in various Asian countries. Our goal is to build a diverse and vibrant R community within Asia. We wish to connect Asian useRs to each other, identify Asian R speakers/participants, and facilitate regular webinars and workshops. 
We want to address the lower participation of Asians, especially Asian underrepresented minorities, in local meetups and international conferences like useR! 2021, and discuss and learn about best practices for nucleating and sustaining an engaged community. We also would like to understand how people from various backgrounds and organisations engage the community for assistance. We also wish to build a strong enough community to host an AsiaR conference in the upcoming years. 4:30am - 5:30am aRt GalleryVirtual location: The Lounge #announcementsSession Chair: Sara MortaraSession Chair: Marcela Alfaro CordobaMeet some aRtists!This is a live session with some of our aRtists: people who have been using R to produce art. We'll have live interviews and a gallery for all of us to enjoy. 5:30am - 7:00am 6A - Data visualisationVirtual location: The Lounge #talk_datavizSession Chair: Praveena MathewsZoom Host: Olgun AydinReplacement Zoom Host: Adrian MagaSession Sponsor: RStudio Session Slides 5:30am - 5:50amTalk-VideoID: 172 / ses-06-A: 1 Regular Talk Topics: Data visualisationKeywords: ggplot2 Easy R Markdown reporting with chronicle Philippe Heymans Smith The chronicle package aims to ease the process of making R Markdown reports for R practitioners. With chronicle, the user is only required to provide the data and structure of the report, and chronicle will write the corresponding R Markdown file on behalf of the user. This means that the user can take the role of a director of the report, focusing on its content and structure, while delegating all the intricacies of visual consistency and interactivity to the package. chronicle currently supports 16 of the most popular R Markdown output formats, and lets the user add each element of a report in an additive paradigm inspired by ggplot. 
Link to package or code repository: https://github.com/pheymanss/chronicle 5:50am - 6:10amTalk-VideoID: 156 / ses-06-A: 2 Regular Talk Topics: Data visualisationKeywords: exploratory data analysis virgo: a layered interactive grammar of graphics in R Stuart Lee1, Earo Wang2 1Monash University; 2University of Auckland The virgo package enables interactive graphics for exploratory data analysis (EDA). Like ggplot2, our package takes a grammar-based approach, that is, variables are mapped to visual encodings and plots are built layer by layer with marks. However, unlike ggplot2, the virgo package incorporates interactivity directly into its design by extending the Vega-Lite JavaScript library and the vegawidget R package. Users can easily initialize "selection" objects to specify client-side events like brushing or clicking. Once a "selection" object is specified, it can be used in two different ways. First, a "selection" can be broadcast to modify an encoding channel - for example, points being colored after a selection event has happened. Second, the data in a visual layer can react to a "selection" - for example, computing a mean on the fly given the occurrence of a selection event. Through composing multiple selection objects we can achieve rich interactivity. In this talk, we will discuss the motivations behind the virgo package and grammar. We will demonstrate how virgo seamlessly integrates into existing EDA workflows through a case study. The virgo package is available online at https://vegawidget.github.io/virgo. Link to package or code repository: https://github.com/vegawidget/virgo 6:10am - 6:30amTalk-LiveID: 155 / ses-06-A: 3 Regular Talk Topics: Data visualisationKeywords: dynamic graphics New displays for the visualization of multivariate data in the tourr package Ursula Laa University of Natural Resources and Life Sciences, Vienna Tour methods allow the visualization of multi-dimensional structures as animated sequences of interpolated projections. 
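Running a tour takes only a couple of lines (a sketch assuming the {tourr} package is installed; flea is an example dataset shipped with the package):

```r
# Sketch: animate a 2-D grand tour of six numeric variables.
library(tourr)

animate_xy(flea[, 1:6])
```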
The viewer can extrapolate from the observed low-dimensional shapes, to build intuition about the high-dimensional distribution. These methods are available in the tourr package (Wickham et al., 2011), including a range of display functions. The package is on CRAN, see https://CRAN.R-project.org/package=tourr. The traditional displays are, however, limited in the case of large data: in scenarios with many observations, overplotting will often hide features, while a large number of variables typically leads to piling of the observations near the center of a projection. In this talk I will introduce new tourr displays that can address these issues. The slice tour (Laa et al., 2020) shows sections of the data, alleviating overplotting issues and potentially revealing concave structures not visible in projections; the sage display (Laa et al., under review, arXiv:2009.10979) redistributes the projected data points to reverse piling effects. After introducing the new displays I will briefly describe the implementation in R and show examples that illustrate the advantages of the new approaches. Link to package or code repository: https://github.com/ggobi/tourr 6:30am - 6:50amTalk-LiveID: 269 / ses-06-A: 4 Regular Talk Topics: Data visualisationKeywords: AR, VR, 3D plotVR - walk through your data Philipp Thomann D ONE, Switzerland Are you bored by 3D plots that only give you a simple rotatable 2D projection? plotVR is an open source package that provides a simple way for data scientists to plot data, pick up a phone, get a real 3D impression - either by VR or by AR - and use the computer's keyboard to walk through the scatter plot: https://www.github.com/thomann/plotVR After installing the package and plotting your dataframe, scan the QR code on your phone (iOS or Android) and start walking. Either with recent phones directly in the web browser, or using an iOS app (an Android app is in preparation). Once you are immersed in your Cardboard, how do you navigate through the scatter? 
plotVR lets you use the computer's keyboard to walk as you would in any first-person game. You want to share your impression? Just use the generated USD (iOS) or gltf (Android) files! The technologies beneath this project are: a web server that handles the communication between the data science session and the phone, WebSockets to quickly proxy the keyboard events, QR codes to facilitate the simple pairing of the two, and an HTML page on the computer to grab the keyboard events. Translating these keyboard events into 3D motion is a nice exercise in three.js, OpenGL, and SceneKit for HTML, Android, and iOS respectively. For an in-browser AR experience the package generates USD and GLTF formats. Ready to see your data as you have never seen it before? Join the talk! Link to package or code repository.https://github.com/thomann/plotVR 5:30am - 7:00am 6B - R in Production 1Virtual location: The Lounge #talk_r_production_1Session Chair: Emi TanakaZoom Host: Jyoti BhogalReplacement Zoom Host: Nick SpyrisonSession Sponsored by: cynkra Session Slides 5:30am - 5:50amsponsored-liveID: 350 / ses-06-B: 1 Sponsored Talk Topics: Big / High dimensional dataKeywords: memory, big data Big Memory for R Jingchao Sun1, Chris Kang3, Austin Gutierrez2 1MemVerge; 2The Translational Genomics Research Institute (TGen); 3Analytical Biosciences As we are stepping into the big data era, R and programs written in R are facing various new challenges. First, the data to be processed is growing exponentially and thus results in large memory usage when running R programs. Memory is becoming one of the bottlenecks for large data processing. Second, large data processing dramatically increases the processing time and the risk of program crashes. Scientists or developers might lose hours or days due to a program crash with no chance to save the data. Third, iterative analysis for large data is a pain point due to the long data loading time from disk. 
Fourth, a large amount of legacy R code does not support multi-threading, which was only recently introduced to R. This leads to the sequential processing of R code and wastes the CPU's multi-core capability. To tackle these challenges, MemVerge developed the Memory Machine software, which supports Intel Optane Persistent Memory. With the help of Memory Machine, R programs can use up to 8 TB of memory on a single server with a 30-50% cost saving. R users can take snapshots of their workload at any time within 1 second to get data persistence without writing data to disk, and restore the workload within 1 second without loading data from disk. Moreover, with the help of instant restore, R users can easily try different parameters multiple times for their workloads. Memory Machine can also restore the workload into different namespaces to enable parallel processing and greatly reduce program execution time. This talk will provide an overview of Big Memory Computing consisting of Intel Optane Persistent Memory and memory virtualization software working together. R users from Analytical Biosciences and The Translational Genomics Research Institute (TGen) will provide overviews of their implementations of Big Memory. 5:50am - 6:10amTalk-LiveID: 282 / ses-06-B: 2 Regular Talk Topics: R in productionKeywords: DevOps, Agile, Production, Docker Bridging the Unproductive Valley: Building Data Products Strictly Without Magic Maximilian Held, Najko Jahn State- and University Library Goettingen, Germany Between GUI-based reports and scripted data science lies an unproductive valley that combines the worst of both worlds: poor scalability *and* high overhead. To avoid getting stuck there, small and medium-sized teams must 1) build strategic data products (not one-off scripts), 2) adopt software development best practices (not hacks) and 3) concentrate on business value (not infrastructure). 
1) Strategic data products focus on the ETL pipelines, common visualisations and other modules that are central to the mission. These Unix-style building blocks can then be recombined into various reports. 2) These modules are designed "as-if-for-CRAN" and written as type/length-stable, unit-tested and exported functions. 3) If something is not related to our mission, we rely on industry standards (Docker) and CaaS/DBaaS (Azure, GCP). {muggle}'s opinionated DevOps provides some technical scaffolding to help with this transition. It standardises the compute environment in development, testing and deployment on a multi-stage Dockerfile with ONBUILD triggers for lightweight target images and leverages public cloud services (RSPM, GitHub Actions, GitHub Packages). In contrast to some existing approaches, {muggle} never infers developer intent and has a minimal git footprint. Success also requires a cultural shift. Development may still be agile, but it must not build prototype code. Fancy plots and reports are good, but reproducibility is more important. We believe this is a necessary change to ensure value generation, and thereby to ensure the future of democratic, open-source data science. Link to package or code repository.https://subugoe.github.io/muggle 6:10am - 6:30amTalk-VideoID: 237 / ses-06-B: 3 Regular Talk Topics: R in productionKeywords: API Data science serverless-style with R and OpenFaaS Peter Solymos Analythium Solutions Inc. R is well suited for data science due to its diverse tooling and its ability to leverage and integrate with other languages and solutions. In production, R is often just a piece of a much larger puzzle providing API endpoints via e.g. plumber, RestRServe, or a similar web framework. Managing many API endpoints can lead to problems due to shifting dependency requirements or more recent additions breaking older code. The common solution is to use Docker containers to provide isolation to these components. 
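As background for the kind of endpoint being containerised here, a minimal plumber API file can be sketched as follows (a generic illustration, not the talk's code; it assumes the plumber package is installed, and the route names are invented):

```r
# --- plumber.R: a minimal annotated API definition ---

#* Health check endpoint (hypothetical route)
#* @get /healthz
function() {
  list(status = "ok")
}

#* Toy forecast endpoint (hypothetical route): return n random draws
#* @get /forecast
function(n = 7) {
  rnorm(as.integer(n))
}

# --- serve it, e.g. as the entry point of a Docker container ---
# plumber::pr("plumber.R") |> plumber::pr_run(host = "0.0.0.0", port = 8000)
```

Each such file becomes one containerised service, which is exactly the unit that the serverless tooling discussed next manages at scale.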
However, managing containers at scale is not trivial, and managing serverless infrastructure is often outsourced to public cloud providers. Providers differ in their approaches, leading to independent integrations of R and repeated efforts. The OpenFaaS project was born to mitigate these problems and to avoid vendor lock-in. OpenFaaS is an open-source framework to deploy functions and microservices anywhere (local cluster, public cloud, edge devices) and at any scale (including 0), with an emphasis on Kubernetes. It provides auto-scaling, metrics, API gateway, and is language-agnostic. In this talk, I introduce R templates for OpenFaaS. The templates support different Docker R base images (Debian, Ubuntu, Alpine Linux) and 6 different frameworks, including plumber. I explain the development life cycle with OpenFaaS using an example cloud function for time series forecasting on daily updated epidemiological data. I end with a review of production use cases where R can truly shine in the multilingual serverless landscape. Link to package or code repository.https://github.com/analythium/openfaas-rstats-templates 7:00am - 7:15am BreakVirtual location: The Lounge #lobby 7:15am - 8:15am Keynote: Research software engineers and academiaVirtual location: The Lounge #key_seiboldSession Chair: Dorothea Hug PeterZoom Host: Adrian MagaReplacement Zoom Host: Nick SpyrisonSession Slides ID: 351 / [Single Presentation of ID 351]: 1 Keynote Talk Topics: R in production Heidi Seibold Johner Institut, Germany Academia is a strange place. On the one hand it is a hotbed of innovations, on the other hand it is a frustratingly lethargic system. The movement of Research Software Engineers (RSEs) shows this really well as nearly all research relies on research software, yet we are still lacking adequate acknowledgment let alone career paths for RSEs. 
In this talk I want to discuss the status quo and future of software in research, the role of the R community, and also what it has to do with my personal path. 8:15am - 9:15am RechaRge 3Virtual location: The Lounge #announcementsSession Chair: Marcela Alfaro CordobaYoga for the Spine + StretchingJana will teach us how to stretch our spines and offer a bit of help for back pain. Come join us! Beginners are welcome. 9:15am - 10:45am 7A - Ecology and Environmental SciencesVirtual location: The Lounge #talk_ecology_environmentSession Chair: Ulfah MardhiahZoom Host: Adrian MagaReplacement Zoom Host: Dorothea Hug PeterSession Slide 9:15am - 9:35amTalk-LiveID: 147 / ses-07-A: 1 Regular Talk Topics: Environmental sciencesKeywords: big data startR: A tool for large multi-dimensional data processing An-Chi Ho, Núria Pérez-Zanón, Nicolau Manubens, Francesco Benincasa, Pierre-Antoine Bretonnière Barcelona Supercomputing Center (BSC-CNS) Nowadays, the growing data volume and variety in various scientific domains have made data analysis challenging. Simple operations like extracting data from storage and performing statistical analysis on them have to be rethought. startR is an R package developed at the Earth Science Department in Barcelona Supercomputing Center (BSC-CNS) that makes it possible to retrieve, arrange, and process large multi-dimensional datasets automatically with a concise workflow. startR provides a framework under which the datasets to be processed can be perceived as a single multi-dimensional array. The array is first declared, then a user-defined function can be applied to the relevant dimensions in an apply-like fashion, building up a declarative workflow that can be executed on various computing platforms. During execution, startR implements the MapReduce paradigm, chunking the data and processing them either locally or remotely on high-performance computing systems, leveraging multi-node and multi-core parallelism where possible. 
Besides the data, metadata are also well-preserved and expanded with the operation information, ensuring the reproducibility of the analysis. Several functionalities in startR, like spatial interpolation and time manipulation, are tailored for atmospheric sciences research such as climate, weather, and air quality. It is compatible with other R tools developed in BSC-CNS, forming a strong toolset for climate research. However, it can potentially serve other research fields as well. Even though netCDF is the only data format supported in the current release, adaptors for other file formats can be plugged in, enabling the tool to be exploited in different scientific domains where large multi-dimensional data is involved. Link to package or code repository.https://earth.bsc.es/gitlab/es/startR 9:35am - 9:55amTalk-LiveID: 193 / ses-07-A: 2 Regular Talk Topics: Environmental sciencesKeywords: hydrology, river hydrograph, hydrograph separation, climate change, spatial analysis grwat: a new R package for automated separation and analysis of river hydrograph Timofey Samsonov1, Ekaterina Rets2, Maria Kireeva1 1Faculty of Geography, Lomonosov Moscow State University, Russian Federation; 2Institute of Water Problems, Russian Academy of Sciences, Russian Federation grwat is a new R package aimed at the analysis of a river hydrograph — a time series of river discharge values. The overall shape of a hydrograph is specific for each river and is heavily influenced by climatic conditions within a river basin. Since the climate is changing, the shape of a typical hydrograph for each river is also transformed. The main goal of the grwat package is to provide automated tools to extract the genetic components of river discharge (e.g. how much discharge is due to thaws, floods etc.) as well as graphical and statistical tools to reveal interannual and long-term changes of these components. The core procedure which allows extraction of genetic components is separation. 
The implementation of separation in grwat is two-stage. First, it follows the generally accepted approach to separate the discharge into quick flow and baseflow. Second, it involves the temperature and precipitation time series to separate the quick flow into seasonal (snowmelt), thaw- and flood-induced discharge using an originally developed algorithm. The separation is programmed in pure C++17 (standard library only) and interfaced to grwat via Rcpp. The separated hydrograph is represented as a data frame in which, for each observation, the input total discharge is distributed between several columns, each representing a genetic component. Such a data frame can be further analyzed with grwat, resulting in more than 30 interannual and long-term statistically tested variables characterizing the aggregated values, dates and durations of specific events and periods. Examples are seasonal flood runoff, annual groundwater discharge, number of thaw days, and the beginning date of the seasonal flood. Finally, grwat contains convenient functions to quickly visualize one or more variables using ggplot2 graphics, and to generate high-quality R Markdown-based HTML reports which combine graphics and results of statistical tests for all computed variables. Development is funded by the Russian Science Foundation (Project 19-77-10032). Link to package or code repository.https://tsamsonov.github.io/grwat/ 9:55am - 10:15amTalk-VideoID: 190 / ses-07-A: 3 Regular Talk Topics: EcologyKeywords: agent-based modelling, animal, R6, simulation, OOP Using R6 object-oriented programming to build agent-based models Liam Daniel Bailey, Alexandre Courtiol IZW Berlin, Germany Agent or individual-based modelling is an invaluable tool in the biological sciences, used to understand complex topics such as conservation management, invasive species, and animal population dynamics. 
However, while R is one of the most common programming languages used in the biological sciences it is often considered 'unsuitable' for agent-based modelling tasks, with other tools such as NetLogo, Java, and C++ utilized instead. Here, we introduce how the package R6 can be used to build agent-based models and simulate complex population and evolutionary dynamics in R. R6 offers the possibility to easily define classes with encapsulated methods. It has become the package of choice behind many well-known R packages that use encapsulated object-oriented programming (e.g. shiny, dplyr, testthat). Yet, while simulations have been built in R using other class systems such as S3 and S4, the potential of R6 to perform such tasks remains untapped. We provide a real-world example from our research on the large African carnivore, the spotted hyena. Object-oriented programming using R6 was easy to learn and implement, and working in R allowed us to quickly build, document, and unit test our code by taking advantage of existing tools in R/RStudio with which we were already familiar (e.g. RStudio projects, roxygen2, testthat). Implementing agent-based modelling in R will allow ecologists to easily make use of this powerful tool in their research. Researchers will not be required to learn any new programming languages but can instead implement agent-based models in the same language they already use for data wrangling, statistical analysis, and data visualisation. 
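The encapsulated style described above can be sketched in a few lines. The agent class and dynamics below are invented for illustration only (they are not taken from the authors' hyena model) and assume the R6 package is installed:

```r
library(R6)

# A minimal agent with encapsulated state (age) and methods (step, survives)
Agent <- R6Class("Agent",
  public = list(
    age = 0,
    initialize = function(age = 0) {
      self$age <- age
    },
    step = function() {
      self$age <- self$age + 1   # agents age by one unit per time step
      invisible(self)
    },
    survives = function(p_death = 0.1) {
      runif(1) > p_death         # stochastic survival
    }
  )
)

# Toy birth-free population loop: step all agents, then drop the dead ones
set.seed(1)
population <- replicate(10, Agent$new(), simplify = FALSE)
for (t in 1:5) {
  for (a in population) a$step()
  population <- Filter(function(a) a$survives(), population)
}
length(population)  # survivors after 5 steps
```

Because R6 methods mutate their object in place (reference semantics), calling `a$step()` inside a loop updates each agent directly, which is what makes this class system a natural fit for agent-based simulation.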
10:15am - 10:35amTalk-LiveID: 171 / ses-07-A: 4 Regular Talk Topics: Environmental sciencesKeywords: data processing Climate Forecast Analysis Tools Framework: from the storage to the HPC to get reproducible climate research results and services Núria Pérez-Zanón1, An-Chi Ho1, Francesco Benincasa1, Pierre-Antoine Bretonnière1, Louis-Philippe Caron2, Chihchung Chou1, Carlos Delgado-Torres1, Llorenç Lledó1, Nicolau Manubens3, Lluís Palma1 1Barcelona Supercomputing Center (BSC); 2Ouranos; 3NA Climate forecast researchers need to assess the quality of their forecasts by comparing them against reference observation datasets using state-of-the-art verification metrics. This procedure requires reading in the seasonal forecasts and reference data and restructuring them for later comparison (e.g. regridding, resampling or reordering). Only then can statistical methods be applied to assess forecast skill and, finally, tailored visualization tools are employed to explore the results. At the Earth Sciences department of the Barcelona Supercomputing Center, the expertise in seasonal forecast research has traditionally been compiled in the s2dverification R package since its first release in 2009. The package provides tools implementing all the steps required for the procedure described above, allowing researchers to share their methods while reducing development and maintenance cost. However, as the department broadened its activity to include research on sub-seasonal forecasts, decadal prediction and climate projections, as well as development of climate services for various stakeholders, new state-of-the-art tools to manipulate climate data became necessary. As a result, the department is currently maintaining eight R packages. 
These packages can be used separately or in their common framework, and include methods for calibration, downscaling and combination in the CSTools package, climate indicators in ClimProjDiags and CSIndicators, and other climatological methods in s2dv (s2dverification’s successor). The framework has been designed to be flexible and efficient. The Big Data issue inherent to climate data analysis is addressed by employing the startR and multiApply packages to seamlessly enable chunked multi-core processing, optionally leveraging multi-node parallelism in HPC platforms. 9:15am - 10:45am 7B: Statistical modeling in RVirtual location: The Lounge #talk_statsSession Chair: Sevvandi KandanaarachchiZoom Host: Nick SpyrisonReplacement Zoom Host: Olgun AydinSession Slides 9:15am - 9:35amTalk-VideoID: 207 / ses-07-B: 1 Regular Talk Topics: Statistical modelsKeywords: distributional regression, probabilistic forecasts, regression trees, random forests, graphical model assessment Probability Distribution Forecasts: Learning with Random Forests and Graphical Assessment Moritz N. Lang1, Reto Stauffer1,2, Achim Zeileis1 1Department of Statistics, Faculty of Economics and Statistics, Universität Innsbruck; 2Digital Science Center, Universität Innsbruck Forecasts in terms of entire probability distributions (often called "probabilistic forecasts" for short) - as opposed to predictions of only the mean of these distributions - are of prime importance in many different disciplines from natural sciences to social sciences and beyond. Hence, distributional regression models have been receiving increasing interest over the last decade. Here, we make contributions to two common challenges in distributional regression modeling: 1. Obtaining sufficiently flexible regression models that can capture complex patterns in a data-driven way. 2. 
Assessing the goodness-of-fit of distributional models both in-sample and out-of-sample using visualizations that bring out potential deficits of these models. Regarding challenge 1, we present the R package "disttree" (Schlosser et al. 2021), which implements distributional trees and forests (Schlosser et al. 2019). These blend the recursive partitioning strategy of classical regression trees and random forests with distributional modeling. The resulting tree-based models can capture nonlinear effects and interactions and automatically select the relevant covariates that determine differences in the underlying distributional parameters. For graphically evaluating the goodness-of-fit of the resulting probabilistic forecasts (challenge 2), the R package "topmodels" (Zeileis et al. 2021) is introduced, providing extensible probabilistic forecasting infrastructure and corresponding diagnostic graphics such as Q-Q plots of randomized residuals, PIT (probability integral transform) histograms, reliability diagrams, and rootograms. In addition to distributional trees and forests, other models can be plugged into these displays, which can be rendered both in base R graphics and "ggplot2" (Wickham 2016). Link to package or code repository. 9:35am - 9:55amTalk-LiveID: 178 / ses-07-B: 2 Regular Talk Topics: Statistical modelsspaMM: an R package to fit generalized, linear, and mixed models allowing for complex covariance structures François Rousset1, Alexandre Courtiol2 1Univ. Montpellier, CNRS, Institut des Sciences de l'Evolution, Montpellier, France; 2Leibniz Institute for Zoo and Wildlife Research, Berlin Introduced to make the fit of spatial mixed models accessible, the R package spaMM has grown a lot since its first CRAN release eight years ago. 
The package now offers the possibility to fit a variety of regression models, from simple linear models (LM) to generalised linear mixed-effects models (GLMM), including multivariate-response models, and double hierarchical GLMMs (DHGLM) in which both the mean of a response and the residual variance can be modelled as a function of fixed and random effects. The package provides a diversity of response families beyond the standard ones, such as the (truncated or not) negative binomial, and the Conway-Maxwell-Poisson, as well as non-Gaussian random-effect families such as the inverse Gaussian. Random effects can further be modelled using several autocorrelation functions for the consideration of spatial, temporal and other forms of dependence between observations (e.g. genetic pedigrees). spaMM handles this diversity of models through a simple formula-based interface akin to glm() or lme4::glmer(). Advanced users will nonetheless appreciate the possibility to fine-tune many aspects of the fit (e.g. select among several likelihood approximations; set parameters to fixed values). The package also provides tailored methods for many generics, so that, for instance, anova() can be called to perform likelihood ratio tests by parametric bootstrap and AIC() computes both the marginal and conditional AIC. Finally, the package is competitive in terms of computational speed for non-spatial, geostatistical, and autoregressive models alike. Link to package or code repository. 9:55am - 10:15amwithdrawnID: 362 / ses-07-B: 3 Regular Talk Topics: Statistical modelsKeywords: big data Changed to Elevator Pitch: The one-step estimation procedure in R Alexandre Brouste1, Christophe Dutang2 1Le Mans Université; 2Université Paris-Dauphine In finite-dimensional parameter estimation, the Le Cam one-step procedure is based on an initial guess estimator and a Fisher scoring step on the log-likelihood function. 
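The one-step idea just described can be illustrated in a few lines of base R (an illustrative sketch, not the OneStep package itself). In the Cauchy location model the sample median is a root-n-consistent initial guess, and one Fisher scoring step on the log-likelihood already gives an asymptotically efficient estimator:

```r
# Le Cam one-step sketch: Cauchy location model, base R only.
set.seed(42)
x <- rcauchy(1e5, location = 2)
n <- length(x)

theta0 <- median(x)                              # initial guess estimator
# Score of the Cauchy log-likelihood: d/dtheta log f = 2(x - theta)/(1 + (x - theta)^2)
score  <- sum(2 * (x - theta0) / (1 + (x - theta0)^2))
info   <- n / 2                                  # Fisher information: 1/2 per observation
theta1 <- theta0 + score / info                  # one Fisher scoring (Newton) step

c(initial = theta0, one_step = theta1)
```

The one-step estimator costs a single pass over the data from a closed-form initial guess, which is why it can be much faster than a full iterative MLE on large samples.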
For an initial $\sqrt{n}$-consistent guess estimator, the one-step estimation procedure is asymptotically efficient. As soon as the guess estimator is in a closed form, it can also be computed faster than the maximum likelihood estimator. More recently, it has been shown that this procedure can be extended to an initial guess estimator with a slower speed of convergence. Based on this result, we propose in the OneStep package (available on CRAN) a procedure to compute the one-step estimator in any situation faster than the MLE for large datasets. Monte-Carlo simulations are carried out for several examples of statistical experiments generated by i.i.d. observation samples (discrete and continuous probability distributions). Thereby, we exhibit the performance of Le Cam’s one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size. A real application and the future package developments will also be discussed. Link to package or code repository.https://cran.r-project.org/web/packages/OneStep/index.html 10:15am - 10:35amTalk-VideoID: 206 / ses-07-B: 4 Regular Talk Topics: Statistical modelsKeywords: probabilistic graphical models The R Package stagedtrees for Structural Learning of Stratified Staged Trees Federico Carli1, Manuele Leonelli2, Eva Riccomagno1, Gherardo Varando3 1Università degli Studi di Genova, Dipartimento di Matematica, Italy; 2IE University, School of Human Sciences and Technology, Spain; 3Universitat de València, Image Processing Laboratory (IPL), Spain stagedtrees is an R package which includes several algorithms for learning the structure of staged trees and chain event graphs from categorical data. In the past twenty years there has been an explosion of the use of graphical models to represent the relationship among a vector of random variables and perform inference taking advantage of the underlying graphical representations. 
Bayesian networks are nowadays one of the most used graphical models, with applications to a wide array of domains and implementations in various software packages. However, they can only represent symmetric conditional independence statements, which in practical applications may not be fully justified. Most often, the greater the number of levels of categorical variables involved, the more difficult it is for conditional independence to hold for all the variables’ levels. Therefore, models that also accommodate asymmetric relations such as context-specific, partial and local independences have been developed. Staged trees are one such class. Staged tree modeling has proved its worth in many fields, for instance in cohort studies, causal analysis, case-control studies, Bayesian games and medical diagnosis. stagedtrees makes it possible to estimate any type of non-symmetric conditional independence from data via score-based and clustering-based algorithms. Various functionalities are implemented to provide inferential, visualization, descriptive and summary-statistics tools for such models and their graph structure. These functions help users in handling categorical experimental data and analyzing the learned models to untangle complex dependence structures. Link to package or code repository. 9:15am - 10:45am 7C - Teaching, Automation and ReproducibilityVirtual location: The Lounge #talk_teaching_automation_rSession Chair: Earo WangZoom Host: Jyoti BhogalReplacement Zoom Host: Matt BannertSession Sponsor: Roche Session Slide 9:15am - 9:35amTalk-LiveID: 143 / ses-07-C: 1 Regular Talk Topics: Teaching R/R in TeachingKeywords: automation A semi-automatic grader for R scripts Vik Gopal, Samuel Seah, Viknesh Jaya Kumar National University of Singapore My department teaches a class in R. The aims of this class are to teach visualisation and good programming practices in R. Every week, we would attempt to go over as many script submissions as we could, as closely as we could. 
We would then summarise the feedback verbally to the students. Due to the increasing class size and time constraints, we were unable to rigorously go through every single student script every week. As such, we could not identify the common misconceptions that students had. We could not intervene and correct the most critical ones early on in the class. Finally, we were unable to view all the visualisations that students created. Hence we developed an R package to automatically run all student scripts and extract metrics such as run-time and certain code features. The package would also collate all the graphs so that we can see them in one go. We also set up a server for students to test their code before submission, ensuring that we can run their code smoothly. We can now ensure that every student's code is run and analysed consistently and reliably. Instead of scrutinising the code, we look through a summary table of features generated for each script. If something looks strange here, we go back to the script. By uploading this table, with comments, to our LMS, we can provide custom feedback for each student. Finally, having such a summary table of features indicates the areas that students need more practice in - it allows us to tailor future homework problems. Link to package or code repository.https://cran.r-project.org/web/packages/autoharp/index.html 9:35am - 9:55amTalk-LiveID: 174 / ses-07-C: 2 Regular Talk Topics: Teaching R/R in TeachingKeywords: Training, Automation, Systems Administration, Reproducibility, Workflow Automating bespoke online teaching with R Rhian Davies Jumping Rivers, United Kingdom At Jumping Rivers we deliver over 100 R, Python and Stan training courses each year, engaging with thousands of new learners. The necessity to move to fully online training in March last year meant we quickly had to completely rethink how to deliver R training interactively online. 
We internally trialled running our usual in-person training simply over Zoom - and, trust us, it really doesn’t work! We already used R & R Markdown to create all training materials including slides and notes. However, our new workflow uses R every step of the way, from creating a bespoke learning environment, to collating feedback and generating certificates. Upcoming training sessions are stored in Asana. Using a single call from R, we extract the relevant Asana task details and:
* Provide the client with a single URL that contains all necessary information for the course
* Deploy a bespoke virtual training environment with {analogsea}
* Automate password generation with {shiny}
* Track and upload attendance sheets
* Create bespoke Google Documents for code quizzes
* Generate fill-in-the-blank tutor R scripts
* Provide automatic feedback reports for clients with {rmarkdown}, {shiny} & {rtypeform}
* Deliver personalised certificates in {shiny}
* Tag the training materials and VM to enable a completely reproducible set-up
This improves the learning experience as the “small” things are automated and allows the trainer to concentrate on actual training. 9:55am - 10:15amTalk-VideoID: 270 / ses-07-C: 3 Regular Talk Topics: ReproducibilityKeywords: reproducibility, rmarkdown, knitr, report, communication Extend the functionality of your R Markdown documents Christophe Dervieux RStudio R Markdown is a powerful tool that has quickly grown since its creation. While it can be rather simple to quickly create and maintain a simple reproducible report, it is more challenging to do advanced customization and dynamic content creation due to the different tools involved (rmarkdown, knitr, Pandoc, LaTeX, ...) and a lot of possible tweaks. And this is all the more true if you consider the already widespread and still-growing ecosystem surrounding R Markdown. 
Helping users to better find and know how to do specific tasks with R Markdown was the main driver for the book "R Markdown Cookbook" (CRC Press). This talk is based on the content of this book and will present a selection of advanced recipes to go further with an R Markdown document. These examples combine little-known features of some R packages (rmarkdown, knitr) and other tools (Pandoc) to provide flexibility and to greatly extend the functionality for producing communication products, programmatically and reproducibly. This talk will also cover the latest features (at the time of the talk) in the R Markdown family of packages (rmarkdown, knitr, bookdown, blogdown, ...) Link to package or code repository. 10:15am - 10:35amTalk-VideoID: 189 / ses-07-C: 4 Regular Talk Topics: Teaching R/R in TeachingKeywords: psychometrics, reliability, item response theory, Shiny, teaching R Computational aspects of psychometrics taught with R and Shiny Patricia Martinkova1,2 1Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic; 2Faculty of Education, Charles University, Prague, Czech Republic Psychometrics deals with the advancement of quantitative measurement practices in psychology, education, health, and many other fields. It covers a number of statistical methods that are useful for the behavioral and social sciences. Among other topics, it includes the estimation of reliability to deal with the omnipresence of measurement error, as well as a more detailed description of item functioning encompassed in item response theory models. In this talk, I will discuss some computational aspects of psychometrics, and how understanding these aspects may be supported by real and simulated datasets, interactive examples, and hands-on methods. I will first focus on reliability estimation and the issue of restricted range, showing that zero may not always be zero. 
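The restricted-range effect mentioned here can be demonstrated with a few lines of base R (a generic simulation for illustration, not an example from the talk): two parallel test forms share a true score, and selecting only high scorers on one form shrinks their correlation, deflating the reliability estimate.

```r
# Range restriction deflates a parallel-forms reliability estimate.
set.seed(123)
true  <- rnorm(5000)                     # latent true scores
form1 <- true + rnorm(5000, sd = 0.5)    # observed score, form 1
form2 <- true + rnorm(5000, sd = 0.5)    # observed score, form 2 (parallel)

r_full <- cor(form1, form2)              # reliability estimate, full range

keep   <- form1 > quantile(form1, 0.8)   # restrict to the top 20% on form 1
r_rest <- cor(form1[keep], form2[keep])  # markedly lower in restricted sample

c(full_range = r_full, restricted = r_rest)
```

With these simulation settings the theoretical reliability is 1/(1 + 0.25) = 0.8, so the full-range estimate sits near 0.8 while the restricted estimate drops well below it.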
I will then focus on a deeper understanding of the context behind more complex models and their much simpler counterparts. The last example discusses group-specific models and the importance of item-level analysis for situations where differences in overall gains are not apparent but the differences in item gains may be. I will finally discuss experiences from teaching computational aspects of psychometrics to a diverse group of students from various fields, including statistics, computer science, psychology, education, medicine, and participants from industry. I will discuss the challenges and joys of creating a truly interdisciplinary course. Link to package or code repository.https://github.com/patriciamar/ShinyItemAnalysis 10:45am - 11:45am mixR! Music, networking channel and raffles. To end the day in a relaxing way.