Conference Agenda

Overview and details of the sessions of this conference. Please select a date or location to show only the sessions held on that day or at that location. Please select a single session for a detailed view (with abstracts and downloads, if available).

Please note that all times are shown in the time zone of the conference (UTC).

Session Overview
Date: Sunday, 04/July/2021
10:15pm - 11:15pm Welcome and Navigation Guide (PDT)
Virtual location: The Lounge #announcements
Session Chair: Rocío Joo
Session Chair: Heather Turner
Zoom Host: Marcela Alfaro Cordoba
Replacement Zoom Host: Andrea Sánchez-Tapia
Session Chair: Yanina Noemi Bellini Saibene
A warm welcome for our first timezone, along with a quick guide on how to navigate the conference
11:15pm - 11:59pm My first UseR! (PDT)
Location: The Lounge #new_to_UseR!
Session Chair: Batool Almarzouq
Zoom Host: Andrea Sánchez-Tapia
Replacement Zoom Host: Marcela Alfaro Cordoba
A session for newcomers to useR!
 
11:15pm - 12:15am
ID: 360
Social Events

My first UseR! (PDT)

Batool Almarzouq

 
Date: Monday, 05/July/2021
7:15am - 8:15am Welcome and Navigation Guide (AWST)
Virtual location: The Lounge #announcements
Session Chair: Dorothea Hug Peter
Session Chair: Matt Bannert
Zoom Host: Saranjeet Kaur Bhogal
Replacement Zoom Host: Faith Musili
Session Chair: Heather Turner
A warm welcome for our second timezone, along with a quick guide on how to navigate the conference
1:15pm - 2:15pm Welcome and Navigation Guide (CEST)
Virtual location: The Lounge #announcements
Session Chair: Rocío Joo
Session Chair: Yanina Noemi Bellini Saibene
Zoom Host: Faith Musili
Replacement Zoom Host: Saranjeet Kaur Bhogal
Replacement Zoom Host 2: Heather Turner
Session Chair: Matt Bannert
A warm welcome for our third timezone, along with a quick guide on how to navigate the conference
2:15pm - 3:15pm My first UseR! (CEST)
Location: The Lounge #new_to_UseR!
Session Chair: Batool Almarzouq
Zoom Host: Erin West
Replacement Zoom Host: Natalia Soledad Morandeira
A session for newcomers to useR!
3:30pm - 4:30pm Keynote: R Spatial
Virtual location: The Lounge #key_pebesma
Session Chair: Peter Macharia
Zoom Host: Linda Jazmín Cabrera Orellana
Replacement Zoom Host: s gwynn sturdevant
 
ID: 353 / [Single Presentation of ID 353]: 1
Keynote Talk
Topics: Spatial analysis

Edzer Pebesma

Universität Münster, Germany

R Spatial is a lively community of people using R for analysing spatial data. From the early days of R, spatial packages have formed a substantial part of the R package ecosystem. Things took off from 2005 on when packages like sp, rgdal, rgeos and raster provided shareable infrastructure for spatial vector and raster data. Second generation packages including sf, stars and terra will take over this role when rgdal and rgeos retire in 2024. The pattern of “download followed by local file access” gradually shifts to directly accessing and processing massive, cloud-based spatiotemporal data sources that include remote sensing imagery archives, weather and climate data, point clouds, and large vector datasets such as OpenStreetMap or census datasets. Responsive analysis by working at lower resolutions or with other spatial generalizations forms a challenge that is particular to spatial data. R Spatial has constantly relied on the OSGEO libraries GDAL, PROJ and GEOS for I/O, coordinate transformations, and geometrical operations. These libraries create de facto interoperability across geospatial communities and will remain central. Upcoming changes for R Spatial include switching to spherical geometry, handling of data cubes, and time-dependent coordinate reference systems that cope with plate tectonics.

 
4:30pm - 4:45pm Break
Virtual location: The Lounge #lobby
4:45pm - 6:15pm 1A - Community and Outreach 1
Location: The Lounge #talk_community_outreach_1
Session Chair: Erin LeDell
Zoom Host: s gwynn sturdevant
Replacement Zoom Host: Linda Jazmín Cabrera Orellana
Session Sponsor: ixpantia
Session Slide
 
4:45pm - 5:05pm
Talk-Live
ID: 260 / ses-01-A: 1
Regular Talk
Topics: Community and Outreach
Keywords: community, infrastructure, maintenance, non-profit sector, open source

rOpenSci’s Model for Managing a Federated Open Source Software Community

Stefanie Butland1, Lou Woodley2, Karthik Ram1,3

1rOpenSci; 2Center for Scientific Collaboration and Community Engagement; 3University of California, Berkeley

rOpenSci hosts over 350 staff- and community-contributed R packages. We have evolved a unique model of community management to support the complex needs of people who develop, review, and use these packages.

The Community Participation Model from the Center for Scientific Collaboration and Community Engagement (http://doi.org/10.5281/zenodo.3997802) provides a framework to assess how community members interact with programs and each other. The four modes on a continuum are: Convey/Consume; Contribute; Collaborate; Co-create. Engagement among rOpenSci community members happens primarily in the first three modes. 1) Convey/Consume. One-way dissemination of information via a newsletter, and Community Contributing Guide (https://contributing.ropensci.org/) that helps people match their motivations and skills to different ways to contribute. 2) Contribute. Opportunities for members to share knowledge e.g. via our package development guide (https://devguide.ropensci.org/), and blog posts written by members that draw attention to their work. 3) Collaborate. Scaffolded activities where members work together e.g. in open software peer-review as authors, reviewers, or editors. Additional programming, such as community calls, provides opportunities for multiple modes of participation at once e.g. by collaborating on developing topics and presenting, contributing by sharing resources and questions and consuming by attending presentations or reading the recaps.

Strong social facilitation by a full-time community manager and a team that values trust-based relationships supports community members in modes 1 - 3. We have recently explored what might be required to provide more opportunities for co-creation - mode 4 in the CSCCE Community Participation Model. An interviews-based assessment of community needs and audit of current programming aims to help us understand what people get from the rOpenSci community that they can’t get elsewhere, and how we can best facilitate productive and valuable collaboration and co-creation. We’ll present the results of this work as a methodology for others to consider how to review their own community engagement activities.



5:05pm - 5:25pm
Talk-Video
ID: 297 / ses-01-A: 2
Regular Talk
Topics: Community and Outreach
Keywords: bug fixing, contribution, outreach, R development

R Developer's Guide

Saranjeet Kaur Bhogal1, Heather Turner2, Michael Lawrence3

1Savitribai Phule Pune University, India; 2University of Warwick; 3Genentech

The R Developer's Guide (https://github.com/forwards/rdevguide) is an open-source project that aims to facilitate the on-boarding of new contributors to R Core development. New contributors are often not aware of where they can start contributing to the development of R. This guide provides ways in which you can start contributing. How do you report a bug? What is the procedure for submitting a patch? How can you help improve the documentation? These are some of the questions that new contributors may not know how to answer. In this talk, I will discuss these procedures and walk through the guide for new contributors who want to refer to it when making their contributions.

Link to package or code repository.
https://github.com/forwards/rdevguide


5:25pm - 5:45pm
Talk-Video
ID: 228 / ses-01-A: 3
Regular Talk
Topics: Community and Outreach
Keywords: community

Packages submission and reviews; how does it work?

Lluís Revilla Sancho1,2

1IDIBAPS; 2CIBEREHD

We all benefit from others' work on R and from the packages they share for our programming tasks. Occasionally we generate a piece of software that we want to share with the community. Usually, sharing our work with the R community means submitting a package to an archive (CRAN, Bioconductor or others). While each archive has its own rules, they share some common principles.

If your package follows the archive's rules about the submission process and meets its quality standards, it will be included. All submissions go through some common stages: first, an initial screening; second, a more thorough manual review of the code. Then, if the reviewers' suggestions are applied or answered correctly, the package is included in the archive.

At each step, rules and criteria are used to decide whether the package moves forward. Understanding what these rules say, along with common problems and comments from reviewers, helps avoid submitting a package only to have it rejected, reducing the friction between sharing our work, providing useful packages to the community, and minimizing reviewers' time and effort.

Looking at the review process of three archives of R packages, CRAN, Bioconductor and rOpenSci, I'll explain common rules, patterns, timelines and checks required to get a package included, as well as personal anecdotes about them. The talk is based on the post analyzing reviews available here: https://llrs.dev/tags/reviews/

Link to package or code repository.
https://llrs.dev/tags/reviews/
 
4:45pm - 6:15pm 1B - R packages 1
Virtual location: The Lounge #talk_packages_1
Session Chair: Luis Darcy Verde Arregoitia
Zoom Host: Juan Pablo Narváez-Gómez
Replacement Zoom Host: CHIBUOKEM BEN UBAH
Session Sponsor: RStudio
Session Slide
 
4:45pm - 5:05pm
Talk-Video
ID: 123 / ses-01-B: 1
Regular Talk
Topics: Interfaces with other programming languages
Keywords: anomaly detection

RcppDeepState, a simple way to fuzz test R packages.

Akhila Chowdary Kolla, Dr. Toby Dylan Hocking, Dr. Alex Groce

Northern Arizona University

R packages are typically tested using expected input/output pairs that are manually coded by package developers. These manually written tests are validated under various CRAN checks that perform both static and dynamic analysis. Manually written tests can still allow subtle bugs if they do not anticipate all possible inputs, and they may miss important code paths. In contrast, fuzzers are programs that pass random, unexpected, and potentially invalid inputs to a function, expecting it to break or to reveal subtle bugs.
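A generic base-R illustration of the fuzzing idea described above (this is not the RcppDeepState interface; the random input generator and the example function are invented for illustration):

```r
# Generic fuzzing sketch in plain R, not the RcppDeepState API: feed random,
# possibly invalid inputs to a function and record the ones that make it error.
fuzz <- function(f, n = 1000L) {
  random_input <- function() {
    switch(sample(4, 1),
           rnorm(sample(0:10, 1)),               # numeric vectors, possibly empty
           sample(c(NA, TRUE, FALSE), 5, TRUE),  # logicals with missing values
           list(),                               # an empty list
           character(0))                         # a zero-length character vector
  }
  failures <- list()
  for (i in seq_len(n)) {
    x <- random_input()
    res <- tryCatch(f(x), error = function(e) e)
    if (inherits(res, "error")) {
      failures[[length(failures) + 1L]] <- list(input = x, error = conditionMessage(res))
    }
  }
  failures
}

# Example: which random inputs break a naive range function?
bad <- fuzz(function(x) max(x) - min(x))
length(bad)
```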

Link to package or code repository.
https://github.com/akhikolla/RcppDeepState


5:05pm - 5:25pm
Talk-Video
ID: 153 / ses-01-B: 2
Regular Talk
Topics: R in production
Keywords: business, industry

R in Regulated Industries: Assessing Risk with {riskmetric}

Doug Kelkhoff1,2, Yilong Zhang3, and many more contributors at the R Validation Hub (https://www.pharmar.org/about/)1

1R Validation Hub; 2Roche; 3Merck

Regulated industries are built on long histories of software validation to bring confidence to the results that they produce. Historically, this validation has come through licensed software — at a time when such software tools were the preferred tools of practitioners. As free and open source tools rise in popularity, the spectrum of preferred tools grows ever wider. Today, R is the preferred language for many graduating statisticians and biologists. Furthermore, the open nature of these tools brings more ready access to the bleeding-edge methods expected by regulators.

While the R world brings some amazing pieces of infrastructure for software quality, there is an inevitable gap in translating open source practices to a validation world that is unfamiliar with open software development. With {riskmetric}, we have built a platform for implementing such quality assessments. From code quality to community engagement, we hope to characterize R packages to an extent that industry players can be confident that they can stand behind the tools they use when presenting analysis and software to regulators.

Here we present the {riskmetric} package and surrounding tools to help support R package validation, show how we’ve built a foundation for pulling metrics from a diverse set of information sources and provide next steps in the R Validation Hub’s roadmap to help bridge the gap toward open source tools in regulated industries.
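A hedged sketch of the assessment pipeline, following the pkg_ref()/pkg_assess()/pkg_score() chain shown in the {riskmetric} README (the choice of packages is arbitrary; check the package documentation for the current interface):

```r
# Assess a few packages and collapse the gathered metrics into scores
# (assuming the pkg_ref() -> pkg_assess() -> pkg_score() pipeline from the
# riskmetric README).
library(riskmetric)
library(dplyr)

pkg_ref(c("riskmetric", "utils")) %>%  # build references to the packages
  pkg_assess() %>%                     # gather metrics from available sources
  pkg_score()                          # turn each metric into a numeric score
```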

Link to package or code repository.
https://github.com/pharmaR/riskmetric


5:25pm - 5:45pm
Talk-Video
ID: 198 / ses-01-B: 3
Regular Talk
Topics: Other
Keywords: base, grammar, data manipulation, independence

{poorman} - A dependency free recreation of {dplyr}

Nathan Wayne Eastwood

NE Data

{poorman} is a package that unapologetically attempts to recreate the {dplyr} API in a dependency-free way using only {base} R. {poorman} is still under development and doesn't have all of {dplyr}'s functionality, but what is considered the "core" functionality is included, and the package is available on CRAN at: https://cran.r-project.org/web/packages/poorman/. The idea behind {poorman} is that a user should be able to take a {dplyr}-based script and run it using {poorman} without any hiccups.

{poorman} provides a consistent set of verbs that help you solve the most common data manipulation challenges:

* select() picks variables based on their names.

* mutate() adds new variables that are functions of existing variables.

* filter() picks cases based on their values.

* summarise() reduces multiple values down to a single summary.

* arrange() changes the ordering of the rows.
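A minimal usage sketch of these verbs (assuming {poorman} is installed from CRAN; the same pipeline should run after swapping in library(dplyr)):

```r
# The core dplyr-style verbs provided by {poorman}, run on a built-in dataset.
library(poorman)

mtcars %>%
  select(mpg, cyl, wt) %>%             # pick variables by name
  filter(cyl %in% c(4, 6)) %>%         # pick cases by value
  mutate(wt_kg = wt * 453.6) %>%       # add a variable derived from existing ones
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%  # reduce each group to a single summary
  arrange(desc(mean_mpg))              # reorder the rows
```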

 
6:15pm - 6:30pm Break
Virtual location: The Lounge #lobby
6:30pm - 8:00pm 2A - Data Management
Virtual location: The Lounge #talk_data_management
Session Chair: Christopher Maronga
Zoom Host: Linda Jazmín Cabrera Orellana
Replacement Zoom Host: s gwynn sturdevant
Session Sponsor: cynkra
Session Slide
 
6:30pm - 6:50pm
Talk-Video
ID: 235 / ses-02-A: 1
Regular Talk
Topics: Algorithms
Keywords: applications, case studies

Solving Big Data Problems with Apache Arrow

Neal Richardson

Ursa Computing

As distributed computing platforms and data warehouses become more prevalent, R users face new challenges in working with the data they generate and store. From R, we may need to analyze a dataset that is too big to fit into memory, is split across many files, is hosted on cloud storage, or was produced by systems using other languages. The {arrow} package helps to solve these integration problems, allowing R users to employ familiar, idiomatic R code to work with larger-than-memory datasets on their own workstations. This presentation will discuss several of these common challenges people face when working with bigger data, and using case studies from the community, it will demonstrate how Arrow can help solve these problems.
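A short sketch of the pattern described above (the S3 path, partitioning and column names are hypothetical):

```r
# Open a directory of Parquet files as a larger-than-memory dataset; filters
# and column selection are evaluated by Arrow, and only the result reaches R.
library(arrow)
library(dplyr)

ds <- open_dataset("s3://my-bucket/taxi-parquet", partitioning = "year")

ds %>%
  filter(year == 2019, passenger_count > 1) %>%    # pushed down to Arrow
  select(year, passenger_count, fare_amount) %>%
  collect() %>%                                    # materialise the filtered rows
  group_by(year) %>%
  summarise(trips = n(), mean_fare = mean(fare_amount))
```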



6:50pm - 7:10pm
Talk-Live
ID: 292 / ses-02-A: 2
Regular Talk
Topics: Bioinformatics / Biomedical or health informatics
Keywords: simulation, relational data, data analysis, computing, biostatistics, CDISC, clinical trials

respectables and synthetic.cdisc.data: A framework for creating relational data with application

Gabriel Becker1, Adrian Waddell2

1Clindata Insights, United States of America; 2F. Hoffmann-La Roche, Switzerland

Synthetic data provides a privacy-safe mechanism for developing, benchmarking, testing, and showcasing analysis plans and data processing pipelines. Existing tools in R focus primarily on creating or manipulating individual rectangular datasets (dplyr) or combining already existing multiple rectangular datasets (dm). Many crucial types of data, however, involve inter-related rectangular sets of data, with columns in one table acting as keys or lookups within another. An example of this is the CDISC data standard for clinical trial data, which, given an overall set of patients, has some rectangular datasets containing exactly one row per patient and other datasets where some patients have data represented in multiple rows while other patients are entirely absent (e.g., because they did not have any adverse events).

The respectables package defines a recipe-based framework which allows for the specification of data to be synthesized at three distinct levels. First, we provide an intuitive recipe mechanism for specifying the creation - via sampling, synthesis or both - of a rectangular dataset. These recipes support the definition of both conditional and joint behaviors between sets of variables. Second, we define the concept of a scaffolding join recipe which specifies the creation of a rectangular dataset with a particular foreign-key style relationship with another dataset. Finally, we combine these two types of recipes to create recipe books which specify, a priori, the creation and construction of full database-like cohorts of inter-related datasets, either based on existing starting data or from whole cloth.

The synthetic.cdisc.data package provides respectables recipes for the creation of synthetic clinical trial readout data that adheres to the CDISC standard.

respectables and synthetic.cdisc.data will be released open source - approval granted prior to abstract submission - and available on github at the time of the presentation.



7:10pm - 7:30pm
Talk-Video
ID: 165 / ses-02-A: 3
Regular Talk
Topics: Databases / Data management
Keywords: big data

Going Big and Fast - {kafkaesque} for Kafka Access

Peter Meißner

virtual7

Kafka is a big data technology that allows for high-throughput, low-latency stream processing, storing and distributing data. Kafka is written in Java and has become an industry-standard infrastructure for real-time, near-time, microservice and distributed applications.

This talk introduces {kafkaesque}, a package that allows R users to integrate their code and models with Kafka, e.g., to use it as a distributed message queue or to access and process data fed into Kafka by other systems. Besides presenting core concepts and how to use the package, the talk will also cover the development process and the pros and cons of using Java code in R packages.

Link to package or code repository.
https://github.com/petermeissner/kafkaesque


7:30pm - 7:50pm
Talk-Video
ID: 148 / ses-02-A: 4
Regular Talk
Topics: Databases / Data management
Keywords: API

Scaling R for Enterprise Data

Mark Hornick

Oracle

While R has made significant gains in performance with each new release, sometimes the data itself is the bottleneck - having to move it from one environment to another. With enterprise data stored in Oracle databases, the integration of R with Oracle Database enables using R more efficiently on larger volumes of data. Oracle Machine Learning for R (OML4R) – the R interface to in-database machine learning and R script deployment from Oracle – enables you to work with database tables and views using familiar R syntax and functions. For scalable and performant data exploration, data preparation, and machine learning, R users leverage Oracle Database as a high-performance compute engine and build machine learning models using parallelized in-database algorithms with an R formula-based specification. Further, deployment of user-defined R functions from SQL facilitates application and dashboard development, where R engines are dynamically spawned and controlled by Oracle Database. Users can even take advantage of running user-defined R functions in a data-parallel and task-parallel manner.

 
6:30pm - 8:00pm 2B - Shiny
Location: The Lounge #talk_shiny
Session Chair: Mohamed El Fodil Ihaddaden
Zoom Host: CHIBUOKEM BEN UBAH
Replacement Zoom Host: Juan Pablo Narváez-Gómez
Session Sponsor: Open Analytics
Session Slide
 
6:30pm - 6:50pm
Talk-Video
ID: 205 / ses-02-B: 1
Regular Talk
Topics: Web Applications (Shiny/Dash)
Keywords: Lateral Flow Assay, Smartphone-based Analysis, Point-of-care Diagnostics, R Shiny App, Image Analysis

All-in-one smartphone-based system for quantitative analysis of point-of-care diagnostics

Weronika Schary1,2, Filip Paskali1,2, Matthias Kohl1,2

1Furtwangen University, Medical and Life Sciences Faculty, Jakob-Kienzle Str. 17, D-78054, Villingen Schwenningen, Germany; 2Institute of Precision Medicine, Jakob-Kienzle Str. 17, D-78054, Villingen-Schwenningen, Germany

We propose a smartphone-based system for the quantification of various lateral flow assays for the detection and diagnosis of diseases. The proposed smartphone-based system consists of a 3D-printed photo box for standardized positioning and lighting, a smartphone for image acquisition and an R Shiny software package with modular, customizable applications for image editing, analysis, data extraction, calibration and quantification. This system is less expensive than commonly used hardware and software for analysis, so it could prove very beneficial for diagnostic testing in the context of pandemics, as well as in low-resource countries in which laboratory equipment and diagnostic facilities are scarce.

The proposed system is built with R Shiny, an open-source package that is free to use and modify. It can be used without extensive programming skills, which could help make diagnosis simpler, quicker, more efficient and still cost-effective compared to the gold-standard methods used in detection and diagnosis today. Also, the automatic documentation of all analysis steps implemented in the application via R Markdown allows for accurate reproducibility in research and clinical practice.

For further image analysis, package LFApp was created to enable image editing, cropping, segmentation, background correction, data analysis, calibration and quantification of extracted pixel intensity values from the image.

Besides Shiny, other major packages used are EBImage, ggplot2, DT, shinyjs, stats, shinyFiles, rmarkdown and shinythemes. Furthermore, we designed an additional version of the UI module, using ShinyMobile, to make the app more accessible on small touchscreens.

Our goal was to build a versatile free open-source system, that is scalable and extensible, and also modifiable to suit any research team requirements. It represents an all-in-one, portable, cost-efficient and easily reproducible system for full analysis, that works well on computers as well as portable devices, such as smartphones.



6:50pm - 7:10pm
Talk-Live
ID: 226 / ses-02-B: 2
Regular Talk
Topics: Web Applications (Shiny/Dash)
Keywords: CI/CD

Unit Testing Shiny App Reactivity

Jonathan Sidi

Sage Therapeutics

When developing Shiny apps there are a lot of reactivity problems that can arise when one reactive or observe element triggers other elements. In some cases these can create cascading reactivity (the horror). The goal of reactor is to efficiently diagnose these reactivity problems and then plan unit tests to avert them during app development, making it a less painful and more robust experience. Reactor can improve the stability of shiny app development with many collaborators through its application in a version control and CI/CD framework.

Link to package or code repository.
https://github.com/yonicd/reactor


7:10pm - 7:30pm
Talk-Video
ID: 116 / ses-02-B: 3
Regular Talk
Topics: Web Applications (Shiny/Dash)
Keywords: Shiny, RStudio add-ins, dashboards, web application

ShinyQuickStarter: Build Shiny apps interactively with Drag & Drop

Leon Binder

Deggendorf Institute of Technology

The development of Shiny apps is often very time-consuming. This applies to the initial setup of the folder structure, but especially to the creation of the user interface and the implementation of the program logic. Many UI and logic elements are available and distributed across several packages.

To make the development process more efficient, we developed the RStudio addin ‘ShinyQuickStarter’. ‘ShinyQuickStarter’ is designed both for beginners who have acquired some basic knowledge of Shiny but have not gained much practical experience, and for advanced users who want to accelerate the process of developing new, powerful Shiny apps. It helps to set up the design of Shiny apps within a few minutes, so developers can start implementing the actual program logic almost immediately.

‘ShinyQuickStarter’ enables developers to create Shiny apps interactively using an intuitive drag and drop interface. A variety of page types and over 75 UI elements for navigation, layout, inputs, and outputs are supported. The options of these UI elements are interactively customizable, so developers can easily tailor Shiny apps towards their requirements and see the effect of an option immediately. Context-sensitive documentation furthermore supports especially beginners in orchestrating easy-to-use Shiny apps. ‘ShinyQuickStarter’ also creates the required folder structure for a new app and exports the source code of both the UI and the server component. Developers have the opportunity to organize the source code into Shiny modules. ‘ShinyQuickStarter’ solves the core problem of creating Shiny apps by streamlining the workflow of creating UI elements and corresponding server-side elements.



7:30pm - 7:50pm
sponsored-video
ID: 348 / ses-02-B: 4
Sponsored Talk
Topics: Web Applications (Shiny/Dash)
Keywords: Shiny, enterprise computing, open source

ShinyProxy. The Good News Show.

Tobia De Koninck

Open Analytics

Interactive web applications have become standard data science artefacts. For five years now, ShinyProxy has offered a 100% open source enterprise solution to run and manage such applications. Whether you want to deploy Shiny, Dash, H2O Wave or Streamlit apps, ShinyProxy has your back. Whether you serve a small team or host internet-facing apps for thousands of users, ShinyProxy will scale and stand the load. Whether you use LDAP, ActiveDirectory, OpenID Connect, SAML 2.0 or Kerberos to authenticate / authorize users, ShinyProxy makes it happen. Want to mix in an IDE (e.g. RStudio) or notebook server (e.g. Jupyter or Zeppelin notebooks)? Look no further than ShinyProxy. Monitor your stack and gather usage statistics? Check. Embed the apps over APIs into other websites? Solved problem.

In this talk we provide technical detail to the good news and focus on the latest ShinyProxy refinements and developments.

Link to package or code repository.
https://shinyproxy.io/
 
8:00pm - 8:30pm Break
Virtual location: The Lounge #lobby
8:30pm - 10:00pm Keynote: Tools and technologies for supporting algorithm fairness and inclusion
Virtual location: The Lounge #key_resp_prog
Session Chair: Vebashini Naidoo
Session Chair: Shelmith Kariuki
Zoom Host: Juan Pablo Narváez-Gómez
Replacement Zoom Host: CHIBUOKEM BEN UBAH
 
ID: 359 / [Single Presentation of ID 359]: 1
Keynote Talk
Topics: Algorithms

Achim Zeileis1, Dorothy Gordon3, Kristian Lum3, Jonathan Godfrey2

1Faculty of Economics and Statistics at Universität Innsbruck; 2Massey University, Aotearoa New Zealand; 3TBA

As R programmers and R users we create artefacts for use. The hope is that the artefact may serve, and be used by, all in our intended audience. However, we often exclude certain people, or have embedded bias in our data, that we need to be aware of: Is our package, teaching material, or visualisations accessible to people with disabilities? Is my algorithm or data analysis biased with respect to gender, race, or class? Is my technology or algorithm deepening inequality? On the other hand, graphic representations created by R are easily understood by even the most statistically illiterate individuals, which makes it a great tool for advancing public policy and the Sustainable Development Goals (SDGs).

“Inyathi ibuzwa kwabaphambili” is a Xhosa proverb, which means wisdom is learnt or sought from the elders, or those ahead in the journey. In this multi-contribution keynote we will hear from those ahead in the journey - Dorothy Gordon, Achim Zeileis, Kristian Lum and Jonathan Godfrey.

Dorothy Gordon, chair of the UNESCO Information For All Programme, will talk about making technology accessible particularly to women and Africans, and how utilising tools such as R can help advance public policy. Achim Zeileis, Professor of Statistics, Universität Innsbruck, Austria, will discuss making the color schemes in data visualizations accessible for as many users as possible. Kristian Lum, Assistant Research Professor, in the Department of Computer and Information Science at University of Pennsylvania will shine a light on what may be missing from a dataset and bias in algorithms used in high-stake decision making. Jonathan Godfrey, Lecturer of Statistics at Massey University, Aotearoa New Zealand, will discuss how to choose the right tools that make collaboration possible and fruitful so that people from all walks of life can see themselves as part of the community.

 
10:00pm - 10:30pm Break
Virtual location: The Lounge #lobby
10:30pm - 11:30pm Incubator: Five principles to grow up your R community
Virtual location: The Lounge #incubator_r_community
Session Chair: Joselyn Chávez
Session Chair: Leonardo Collado Torres
Zoom Host: s gwynn sturdevant
Replacement Zoom Host: Linda Jazmín Cabrera Orellana
 
ID: 344 / [Single Presentation of ID 344]: 1
Incubator
Topics: Community and Outreach
Keywords: CDSB, five principles, grow, community

Joselyn C. Chávez-Fuentes1, Erick Cuevas-Fernández1, Alejandro Reyes2, Leonardo Collado-Torres3

1Instituto de Biotecnología, UNAM, Mexico; 2Novartis Institutes for BioMedical Research Basel: Basel, Basel-Stadt, CH; 3Lieber Institute for Brain Development: Baltimore, MD, US

The Community of Bioinformatics Software Developers CDSB (https://comunidadbioinfo.github.io) was created in 2018 with the aim to promote R software development in México, increase Latin American representation in global communities, and facilitate the transition from software users to software developers. Through CDSB, we encourage users to develop and present R packages at international conferences, such as useR, RStudio, Bioconductor, LatinR, and ConectaR. Three years after its creation, the CDSB community has members from all over the country as well as from other Latin American countries, e.g. Costa Rica, Perú, Ecuador, Colombia. Importantly, this community has served as a base for CDSB members to create their own local communities, like R-Ladies Chapters (https://rladiesmx.netlify.app). In this incubator session, we will share five key principles that have helped us grow as a community, and how you can apply them in your local community.

Link to package or code repository.
https://comunidadbioinfo.github.io
 
11:30pm - 11:59pm MiR Meeting
Location: The Lounge #MiR_minorities_in_R
Session Chair: Audris Campbell
Zoom Host: CHIBUOKEM BEN UBAH
Replacement Zoom Host: Juan Pablo Narváez-Gómez
11:30pm - 11:59pm RechaRge 1
Virtual location: The Lounge #announcements
Zoom Host: Marcela Alfaro Cordoba
Yoga Intro + Breathing exercises
Date: Tuesday, 06/July/2021
12:00am - 12:30am RechaRge 1
Session Chair: Marcela Alfaro Cordoba
Yoga Intro + Breathing exercises
12:30am - 12:45am Break
Virtual location: The Lounge #lobby
12:45am - 2:15am Elevator Pitches 1
Virtual location: The Lounge #elevator_pitches
 
ID: 264 / ep-01: 1
Elevator Pitch
Topics: Social sciences
Keywords: national identification number, demographic data, generator, privacy, Finland

Hetu-package: Validating and Extracting Information from Finnish National Identification Numbers

Pyry Kantanen1, Måns Magnusson2, Jussi Paananen3, Leo Lahti1

1University of Turku, Finland; 2Uppsala University, Sweden; 3University of Eastern Finland

The need to uniquely identify citizens has been critical for efficient governance in the modern era. Novel techniques, such as iris scans, fingerprints and other biometric information have only recently begun to supplement the tried-and-true method of assigning each individual a unique identifier, a national identification number.

In Nordic countries national identification numbers are not random strings but contain information about the person’s birth date, sex and, in the case of Swedish personal identity numbers, place of birth. In addition, most identification numbers contain control characters that make them robust against input errors, ensuring data integrity. Datasets that lack aforementioned demographic information can be appended with data extracted from national identification numbers and already existing demographic data can be validated by comparing it to extracted data.

The method of validating and extracting information from identification numbers is manually doable and simple in principle but in practice becomes unfeasible with datasets larger than a few dozen observations. Hetu-package provides easy-to-use tools for programmatic handling of Finnish personal identity codes (henkilötunnus) and Business ID codes (y-tunnus). Hetu-package utilizes R’s efficient vectorized operations and is able to generate and validate over 5 million Finnish personal identity codes or Business Identity Codes in under 10 minutes. This covers the practical upper limit set by the current population of Finland (5.5 million people) and also provides adequate headroom for handling large registry datasets.

Privacy concerns can push Finland and other Nordic countries towards redesigning their national identification numbers to omit the embedded personal information sometime in the future, but policy changes will be closely monitored and, if necessary, the package functions will be adjusted accordingly.
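As a concrete illustration of the control-character idea mentioned above, here is a base-R sketch (this is not the hetu package interface; the lookup table and the mod-31 rule follow the publicly documented specification and should be verified against the package):

```r
# Check the control character of a Finnish personal identity code
# (format DDMMYYCZZZQ). Not the hetu API; the table below is an assumption
# taken from the public specification.
check_control_char <- function(pin) {
  ctrl_chars <- strsplit("0123456789ABCDEFHJKLMNPRSTUVWXY", "")[[1]]
  digits <- paste0(substr(pin, 1, 6), substr(pin, 8, 10))  # ddmmyy + individual number
  expected <- ctrl_chars[as.numeric(digits) %% 31 + 1]
  toupper(substr(pin, 11, 11)) == expected
}

check_control_char("010101-0101")  # TRUE only if the last character matches the rule
```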



ID: 150 / ep-01: 2
Elevator Pitch
Topics: Data visualisation
Keywords: feature selection

Visualising variable importance and variable interaction effects in machine learning models.

Alan Inglis1, Andrew Parnell1, Catherine Hurley2

1Hamilton Institute, Maynooth University; 2Dept. of Mathematics and statistics, Maynooth University

Variable importance, interaction measures and partial dependence plots are important summaries in the interpretation of statistical and machine learning models. In our R package vivid (variable importance and variable interaction displays) we create new visualisation techniques for exploring these model summaries. We construct heatmap and graph-based displays showing variable importance and interaction jointly, which are carefully designed to highlight important aspects of the fit. We also construct a new matrix-type layout showing all single and bivariate partial dependence plots, and an alternative layout based on graph Eulerians focusing on key subsets. Our new visualisations are model-agnostic and are applicable to regression and classification supervised learning settings. They enhance interpretation even in situations where the number of variables is large and the interaction structure complex. In this work we demonstrate our visualisation techniques on a data set and explore and interpret the relationships provided by these important summaries.

Link to package or code repository.
https://github.com/AlanInglis/vivid


ID: 279 / ep-01: 3
Elevator Pitch
Topics: Ecology
Keywords: landscape ecology, spatial data, remote sensing, wetland ecology, wildfire

Obtaining reproducible reports on satellite hotspot data during a wildfire disaster

Natalia Soledad Morandeira1,2

1University of San Martín, Environmental Research and Engineering Institute, Argentine Republic; 2CONICET (National Scientific and Technical Research Council, Argentina)

Wildfires can be monitored and analyzed using thermal hotspot records derived from satellite data. In 2020, the Paraná River floodplain (Argentina) suffered from a severe drought, and thousands of hotspots —associated with active fires— were reported by the Fire Information for Resource Management System (FIRMS-NASA). FIRMS-NASA products are provided in spatial objects (shapefiles), including recent and archive records from several sensors (VIIRS and MODIS). I aimed to handle these data, analyze the number of hotspots during 2020, and compare the disaster with previous years' situation. Using sf, tidyverse, janitor, stringr, spdplyr, ggplot2 and rmarkdown, I imported and pre-processed the spatial objects, generated plots, and obtained reproducible reports. I used R to handle satellite data, monitor the number of active fires, and detect which wetland areas were being affected: this allowed me to quickly respond to peers and journalists about how the wildfires were evolving.

As a case study, I summarize the 2020 outputs for my study area, the Paraná River Delta (19,300 km2). A total of 39,821 VIIRS thermal hotspots were detected, with August (winter in the Southern Hemisphere) accounting for 39.8% of the whole year’s hotspots. While VIIRS data (resolution: 375 m) is available from 2012, MODIS data is available from 2001. However, MODIS resolution is 1 km, so fewer hotspots are reported and each hotspot corresponds to a greater area. The cumulative MODIS hotspots recorded during 2020 were 8,673, the highest number of hotspots of the last 11 years. However, MODIS hotspots detected in 2020 were 62.9% of those recorded during 2008. All the plots were obtained in English and Spanish versions, showing daily and cumulative hotspots, monthly summaries, and a comparison with hotspots detected in previous years. My workflow can be used to analyze thermal hotspot data in any other interest area.
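A generic sketch of this kind of workflow (the file names and the ACQ_DATE column are assumptions based on the FIRMS shapefile format, not the author's code):

```r
# Read hotspot records, keep those inside the study area, and summarise
# monthly counts. Paths and column names are assumptions.
library(sf)
library(dplyr)
library(lubridate)
library(ggplot2)

hotspots <- st_read("fire_archive_V1_2020.shp")   # VIIRS hotspot points
delta    <- st_read("parana_river_delta.shp")     # study-area polygon

monthly <- hotspots %>%
  st_intersection(delta) %>%                      # hotspots within the study area
  mutate(month = month(as.Date(ACQ_DATE), label = TRUE)) %>%
  st_drop_geometry() %>%
  count(month, name = "hotspots")

ggplot(monthly, aes(month, hotspots)) +
  geom_col() +
  labs(title = "VIIRS thermal hotspots per month, 2020")
```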

Link to package or code repository.
Code repository:
https://github.com/nmorandeira/Fires_ParanaRiverDelta ;

Three main dissemination articles and interviews (full list of articles in the repository):

In English: https://www.reuters.com/article/us-argentina-environment/argentinas-wetlands-under-assault-by-worst-fires-in-more-than-a-decade-idUSKBN25T35V ;

In Spanish (1/2):
http://www.unsam.edu.ar/tss/el-delta-en-llamas/ ;

In Spanish (2/2): http://www.hamartia.com.ar/2020/08/10/rodeados-fuego/


ID: 117 / ep-01: 4
Elevator Pitch
Topics: Statistical models
Keywords: clustering

Modeling spatio-temporal point processes with nphawkes package

Peter Boyd, Dr. James Molyneux

Oregon State University

As the literature on Hawkes processes grows, the use of such models continues to expand, encompassing a wide array of applications such as earthquakes, disease spread, social networks, neuron activity, and mass shootings. As new implementations are explored, correctly parameterizing the model is difficult with a dearth of field-specific research on parameter values, thus creating the need for nonparametric models. The model independent stochastic declustering (MISD) algorithm accomplishes this task through a complex, computationally expensive algorithm. In the package nphawkes, I have employed Rcpp functionalities to create a quick and user-friendly approach to MISD. The nphawkes R package allows users to analyze data in time or space-time, with or without a mark covariate, such as the magnitude of an earthquake. We demonstrate the use of such models on an earthquake catalog and highlight some features of the package such as using stationary/nonstationary background rates, model fitting, visualizations, and model diagnostics.

Link to package or code repository.
https://github.com/boydpe/nphawkes


ID: 204 / ep-01: 5
Elevator Pitch
Topics: Spatial analysis
Keywords: Functional Programming, Spatial Point Pattern Analysis, Parallelization, tidy code

Use Case: Functional Programming and Parallelization in Spatial Point Pattern Analysis

Clara Chua, Tin Seong Kam

Singapore Management University, Singapore

Performing Spatial Point Pattern Analysis (SPPA) can be computationally intensive for larger data sets, or data with non-uniform observation windows. It can take a day or more to run a dataset of 7,000 points. There is also often a need to repeatedly apply the same method to different cuts of data (e.g. running the same tests for different regions, subtypes), or when mapping and visualising the results of the analysis.

Part of my project looks at SPPA of Airbnb listings in Singapore, using an envelope simulation of Ripley’s K-function test from the spatstat package to determine if there is clustering in specific subregions.

In my talk I will briefly explain the K-test and compare the performance of a for-loop function and a functional programming approach using purrr, as well as the performance of the normal K-test and the K-test using the Fast Fourier Transform. I show that using functionals helps to break down the analysis into tidier chunks, resulting in tidier code and easier reproducibility down the road. Computation time may sometimes be quicker with functional programming.

Despite this, there is still a need for parallelization for spatial analysis of larger datasets. There are no built-in parallelization methods in spatstat. Parallelization is also OS-dependent: functions such as `mclapply` from the base parallel package work on Mac and Linux, but not on Windows. Hence, I will also share my efforts to parallelize the envelope simulations in a way that is OS agnostic.
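A minimal sketch of the pattern contrasted in the talk (the built-in spatstat point patterns stand in for the Airbnb subregions, and nsim is an arbitrary assumption):

```r
# For-loop vs functional vs OS-agnostic parallel runs of an envelope
# simulation of the K-function.
library(spatstat)
library(purrr)
library(future)
library(furrr)

regions <- list(cells, redwood, japanesepines)  # stand-ins for subregion data

run_ktest <- function(pp) envelope(pp, Kest, nsim = 99, verbose = FALSE)

# for-loop version
results <- vector("list", length(regions))
for (i in seq_along(regions)) results[[i]] <- run_ktest(regions[[i]])

# functional version: the same computation as one tidier, reusable expression
results <- map(regions, run_ktest)

# parallel version: multisession workers run on Windows, macOS and Linux
# alike, unlike parallel::mclapply
plan(multisession, workers = 4)
results <- future_map(regions, run_ktest, .options = furrr_options(seed = TRUE))
```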

Link to package or code repository.
https://github.com/clarachua/capstone-proj


ID: 135 / ep-01: 6
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: Clinical trials, Bayesian modeling, Shiny

Predicting the COVID-19 Pandemic Impact on Clinical Trial Recruitment at GSK

Valeriia Sherina, Nicky Best, Graeme Archer, Jack Euesden, Dave Lunn, Inna Perevozskaya, Doug Thompson, Magda Zwierzyna

GlaxoSmithKline

The COVID-19 pandemic required an unprecedented response from the pharmaceutical industry, both in terms of developing new antiviral medicines and vaccines as rapidly and safely as possible, and in continuing to deliver its existing portfolio of important new medicines. As many countries went into lockdown to slow the spread of the disease, sponsors faced the twin dilemma of replanning study delivery on the fly, while rebalancing their portfolios to meet the emergent medical need. Our multidisciplinary team at GSK tackled the problem of delayed recruitment due to the pandemic. We aggregated external data on the pandemic across 42 countries into epidemiological forecasts, designed a novel Bayesian hierarchical model with 3 levels: site initiation, patient screening, and patient randomization, to link classical recruitment predictions with epidemiological COVID-19 predictions. We obtained COVID-adjusted estimates of time to achieve key recruitment milestones via forward sampling from posterior distributions of the model parameters. The results of this exercise were summarized and deployed in a user-friendly Shiny application to assist study teams with recruitment planning in the face of the pandemic. Here we showcase the results of the effective collaboration of statisticians and data scientists and how it fits into decision-making framework in the clinical operations.



ID: 112 / ep-01: 7
Elevator Pitch
Topics: Databases / Data management
Keywords: agriculture

The Grammar of Experimental Design

Emi Tanaka

Monash University

The critical role of data collection is well captured in the expression "garbage in, garbage out" -- in other words, if the collected data is rubbish then no analysis, however complex it may be, can make something out of it. The gold standard for data collection is through well-designed experiments. Re-running an experiment is generally expensive, contrary to statistical analysis where re-doing it is generally low-cost; there's a higher stake in getting it wrong for experimental designs. But how do we design experiments in R? In this talk, I present my R-package edibble that implements a novel framework, which I refer to as the "grammar of experimental design", to facilitate the data collection and design of an experiment. The grammar builds the experimental design by describing the fundamental components of the experiment. Because the grammar resembles a natural language, there is greater clarity about the experimental structure, and includes considerations beyond the construction of the experimental layout. I will reconstruct some experimental layout using edibble with comparison to other popular R-packages.

Link to package or code repository.
https://github.com/emitanaka/edibble/


ID: 243 / ep-01: 8
Elevator Pitch
Topics: Spatial analysis
Keywords: open source

rspatialdata: a collection of data sources and tutorials on downloading and visualising spatial data using R

Varsha Ujjini Vijay Kumar1,2, Dilinie Seimon1,2, Paula Moraga2

1Faculty of Business and Economics, Monash University, Australia; 2Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia

Open and reliable data, analytical tools and collaborative research are crucial for solving global challenges and achieving sustainable development. Spatial and spatio-temporal data are used in a wide range of fields including health, social and environmental disciplines to improve the evaluation and monitoring of goals both within and across countries.

To facilitate the increasing need to easily access reliable spatial and spatio-temporal data using R, many R packages have been recently developed as clients for various spatial databases and repositories. While documentation and many open source repositories on how to use these packages exist, there is an increased need for a one stop repository for this information.

In this talk, we present rspatialdata, a website that provides a collection of data sources and tutorials on downloading and visualising spatial data using R. The website includes a collection of data sources and tutorials on how to download and visualise a wide range of datasets including administrative boundaries of countries, Open Street Map data, population, elevation, temperature, vegetation and malaria data.

The website can be considered a useful resource for individuals working with problems that require spatial data analysis and visualisation, such as estimating air pollution, quantifying disease burdens, predicting species occurrences, and evaluating and monitoring the UN Sustainable Development Goals.



ID: 361 / ep-01: 9
Elevator Pitch
Topics: Web Applications (Shiny/Dash)

Moved to Session 2



ID: 157 / ep-01: 10
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: teaching, machine learning, data science, undergraduate

Teaching Advanced Data Science in R: Successes and Failures

Lisa Lendway

Macalester College, United States of America

I am about to embark on teaching a new course I named Advanced Data Science in R. This is a capstone course being taught to undergraduates at Macalester College, a small liberal arts college in the US. The course expands on what students learned in an Intro Data Science course focused on data visualization and wrangling using the tidyverse and a Statistical Machine Learning course focused on machine learning algorithms implemented via the caret package.

Students will review machine learning algorithms while learning the tidymodels packages. They will become familiar with the larger machine learning process by using data from a database within RStudio and implementing the basics of putting a model into production. Other components of the class include practicing reproducible research by using Git and GitHub in RStudio; creating a website using R Markdown, distill, or blogdown; and building shiny apps. In order for the students to become more comfortable learning new R packages on their own, in groups of three, they will teach the class about a package or set of functions of their choosing. They will create materials that students can refer back to and will also write homework problems and solutions. The last few weeks of the course will be dedicated to working on group projects where they will be expected to implement the skills they learned in the course.

Since this course hasn’t started yet, I do not yet know what the successes and failures will be. I will share the course website and will reflect on how it went and the changes I will make for the future.

Link to package or code repository.
https://advanced-ds-in-r.netlify.app/


ID: 234 / ep-01: 11
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: annotation

rRbiMs, r tools for Reconstructing bin Metabolisms.

Mirna Vázquez Rosas Landa, Valerie de Anda Torres, Sahil Shan, Brett Baker

The University of Texas at Austin

Although microorganisms play a crucial role in the biogeochemical cycles and ecosystems, most of their diversity is still unknown. Understanding the metabolic potential of novel microbial genomes is an essential step to access this diversity. Most annotation pipelines use predefined databases to add the functional annotation. However, they do not allow the user to add metadata information, making it challenging to explore the metabolism based on a specific scientific question. Our motivation to build the package rRbiMs is to create a workflow that helps researchers to explore metabolism data in a reproducible manner. rRbiMs reads different database outputs and includes a new custom-curated database that the user can easily access. Its module-based design allows the user to choose between running the whole pipeline or just part of it. Finally, a key feature is that it facilitates the incorporation of metadata such as taxonomy and sampling data, allowing the user to answer specific questions of biological relevance. rRbiMs is a user-friendly R workflow that allows performing reproducible and accurate microbial metabolism analyses. We are working on the package and look forward to submitting it to R/Bioconductor and making it available to the research community.

Link to package or code repository.
https://github.com/mirnavazquez/RbiMs


ID: 202 / ep-01: 12
Elevator Pitch
Topics: Statistical models
Keywords: relative weights analysis, Key Drivers Analysis, residualization, nonlinear, main effect, interaction

ResidualRWA: Detecting relevant variable using relative weight analysis with residualization

Maikol Solís, Carlos Pasquier

Universidad de Costa Rica, Escuela de Matemática, Centro de Investigación en Matemática Pura y Aplicada, Costa Rica

In statistical models, determining the most influential variables is a continuous task. A common technique is called relative weights analysis (RWA) (a.k.a. Key Drivers Analysis). The method described in Tonidandel & LeBreton (2015) uses an orthonormal projection of the data to determine which variable has more impact on the model. Packages like “rwa” (https://CRAN.R-project.org/package=rwa) or “flipRegression” (https://github.com/Displayr/flipRegression) handle the situation when there are multiple linear primary effects in the model.

To extend those packages, we present the novel package “ResidualRWA” (https://github.com/maikol-solis/residualrwa). This new package implements relative weights analysis with the possibility of handling complex models with nonlinear effects (via restricted splines) and nonlinear interactions. The package appropriately residualizes each interaction to correctly report the main and the pure interaction effects.

The interactions are residualized against the primary effects as described in LeBreton et al. (2013). This step is necessary to remove the influence of the main variables on the interactions. This way, we can separate the effect of the main variables of a model from the pure interaction effect due to the true synergy of the variables.

The package “ResidualRWA” handles the fit of the model, the residualization of interactions and the relative weight analysis estimation. It also reports the results to the user through easy-to-read tables and graphics.

In this presentation we will show the capabilities of this package and test it with some simulated and real data.
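A base-R sketch of the residualization step itself (this illustrates the idea, not the ResidualRWA interface; the simulated data are invented):

```r
# Residualize an interaction against its main effects so that its contribution
# reflects only the synergy of the two variables.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rnorm(n)

raw_int   <- x1 * x2
resid_int <- residuals(lm(raw_int ~ x1 + x2))  # pure interaction, main effects removed

# The residualized column is orthogonal to x1 and x2, so its weight in the
# final model is not contaminated by the main effects.
round(cor(cbind(x1, x2, raw_int, resid_int)), 2)
summary(lm(y ~ x1 + x2 + resid_int))
```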

* References *

LeBreton, J. M., Tonidandel, S., & Krasikova, D. V. (2013). Residualized Relative Importance Analysis. Organizational Research Methods, 16(3), 449–473. https://doi.org/10.1177/1094428113481065

Tonidandel, S., & LeBreton, J. M. (2015). RWA Web: A Free, Comprehensive, Web-Based, and User-Friendly Tool for Relative Weight Analyses. Journal of Business and Psychology, 30(2), 207–216. https://doi.org/10.1007/s10869-014-9351-z

Link to package or code repository.
https://github.com/maikol-solis/residualrwa


ID: 203 / ep-01: 13
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: survey, serology, serological-survey, srvyr, tidyverse

serosurvey R package: Serological Survey Analysis For Prevalence Estimation Under Misclassification

Andree Valle-Campos

Centro Nacional de Epidemiología Prevención Control Enfermedades CDC Perú, Peru

Population-based serological surveys are fundamental to quantify how many people have been infected by a certain pathogen and where we are on the epidemic curve. Various methods exist to estimate prevalence considering misclassifications due to an imperfect diagnostic test with sensitivity and specificity known with certainty. However, during the first months of a novel pathogen outbreak like the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), diagnostic test performance is “unknown” or “uncertain” given the limited validation studies available. During the pandemic, the increased demand for prevalence estimates and the absence of standardized procedures to address this issue went along with the usage of different methods and non-reproducible workflows for SARS-CoV-2 serological surveys within proprietary software or non-public R code repositories. Given this scenario, we created the serosurvey R package to gather serological survey analysis functions and workflow templates for prevalence estimation under misclassification. We provide functions to calculate single prevalences using the srvyr package environment and generate tidy outputs for a hierarchical Bayesian approach that incorporates the unknown test performance from Larremore et al. [bioRxiv, May 2020]. We applied them in a reproducible workflow with the purrr and furrr R packages to efficiently iterate and parallelize this step for multiple prevalences. We tested it by simulating the use of an imperfect test to measure an outcome within a multi-stage sampling survey design, incorporating the test uncertainty into the sampling design uncertainty. Therefore, the serosurvey R package could facilitate the generation of prevalence estimates for the current pandemic and improve our preparedness for the next ones. In conclusion, the serosurvey R package reduces the reproducibility gap of serological survey analysis using diagnostic tests with unknown performance within a free software environment.
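A hedged sketch of the design-based part of such an analysis using {srvyr} alone (the simulated data and design are invented, and the Bayesian adjustment for unknown test performance is not shown):

```r
# Design-based prevalence by group with srvyr; the serosurvey package builds
# on this kind of workflow and adds the misclassification adjustment.
library(dplyr)
library(srvyr)

set.seed(42)
serodata <- tibble(                       # simulated stand-in survey data
  cluster_id    = rep(1:20, each = 25),
  stratum       = rep(c("urban", "rural"), each = 250),
  samp_weight   = runif(500, 0.5, 2),
  age_group     = sample(c("<30", "30-59", "60+"), 500, replace = TRUE),
  test_positive = rbinom(500, 1, 0.12)
)

svy <- serodata %>%
  as_survey_design(ids = cluster_id, strata = stratum, weights = samp_weight)

svy %>%
  group_by(age_group) %>%
  summarise(prev = survey_mean(test_positive, vartype = "ci"))
```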



ID: 191 / ep-01: 14
Elevator Pitch
Topics: Interfaces with other programming languages
Keywords: HTTP, API, reprex

httrex: Help Debugging HTTP API Clients

Greg Daniel Freedman Ellis

Crunch.io, United States of America

HTTP APIs are a fantastic way to collaborate across programming language divides. It does not matter what language the server's infrastructure is written in; an R user can write code to bring the data into R. There are many packages on CRAN, and presumably even more internally developed packages in organizations, that are wrappers of web APIs, abstracting over some of the details of APIs to make a friendlier interface for R users. When everything is working as intended, this is the ideal way to work, with users not needing to worry about the inner workings of APIs on a day-to-day basis. But when there is a bug, it can be hard to figure out the source because of the layers of abstraction and potentially the programming language barrier between the user and the API owner. The httrex package provides tools to debug R code based on HTTP APIs and communicate what the code is doing in a language-agnostic way.

The httrex package explores two approaches to this problem. The first is a Shiny app that runs alongside your current R session, tracks the code you run and the API calls that the code makes, and visualizes each step of that process. The second creates a document inspired by the reprex package that reruns code in a new environment and includes the HTTP API calls interspersed with the R code and output. With either of these two approaches, users can better understand the HTTP requests that their code is making and share them with API owners in a way that doesn't require them to understand R.
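A plain {httr} illustration of the underlying problem (not the httrex interface): surfacing the HTTP request hidden behind wrapper code so it can be shared with an API owner:

```r
# Print the raw request and response headers alongside the R call, so the
# HTTP traffic can be discussed without knowing R.
library(httr)

resp <- GET("https://api.github.com/repos/gergness/httrex", verbose())

status_code(resp)
content(resp, as = "parsed")$full_name
```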

Link to package or code repository.
https://github.com/gergness/httrex


ID: 213 / ep-01: 15
Elevator Pitch
Topics: Efficient programming
Keywords: CLI (command line interface)

Reading, Combining, and Pre-Filtering Files with R and AWK

David Shilane, Mayur Bansal, Chung Woo Lee

Columbia University

The authors have been developing a software package named awkreader that provides an R interface to utilize AWK for the purpose of combined and pre-filtered file reading.

The proposed software is designed to solve a number of problems. For one or more files of the same column structure, we will demonstrate a method to read and bind the data. A vector of names can be supplied to limit the data to selected columns. Pre-filtering the data can then be achieved either through a specification of patterns to match or through logical inclusion criteria. The program then performs a translation to build a corresponding coding statement in AWK. Importantly, the logical criteria can be directly specified in syntax that is familiar to R’s users. In addition to reading in the data, the code can also return its translations to AWK. Combined data sets can also include a column that identifies the source file.

The awkreader package creates a variety of novel capabilities. It reduces the computational complexity of filtering and binding data sets. Targeted queries can search a wider range of data than what might otherwise be loaded into R because the filters are applied in the reading process. A single statement can replace the labor and code of reading, binding, and filtering multiple files. Users can benefit from AWK for file reading without having to learn its syntax. Those who are interested in learning AWK will be able to generate working examples that correspond to their more familiar setting of programming in R.
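As a rough illustration of the idea (not the awkreader interface), a pre-filter can be applied while reading by piping a file through AWK before it ever reaches R; the file name and filter below are hypothetical, and data.table's fread() is assumed as the reader.

    # Not the awkreader interface -- a sketch of pre-filtering while reading,
    # assuming data.table and a system with awk available on the PATH.
    library(data.table)

    # Keep the header row (NR == 1) plus rows whose third column exceeds 100;
    # awkreader generates statements of this kind from R-style criteria.
    dt <- fread(cmd = "awk -F',' 'NR == 1 || $3 > 100' measurements.csv")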

The presentation will showcase the benefits of the awkreader package. We will provide some details on the translation process, particularly for logical subsetting operators such as %in% and %nin%. We will also demonstrate examples of the code in data processing applications.

Link to package or code repository.
https://github.com/MB4511/awkreader


ID: 131 / ep-01: 16
Elevator Pitch
Topics: Social sciences
Keywords: data management

DATA PIPELINE: IMPROVING MANAGEMENT OF FINANCIAL CONTRIBUTIONS TO THE FIGHT AGAINST POVERTY IN COSTA RICA

Roberto Delgado Castro

DIRECCION GENERAL DE DESARROLLO SOCIAL Y ASIGNACIONES FAMILIARES

FODESAF, administered by DESAF (part of the Ministry of Labor and Social Security), is Costa Rica's and Latin America's largest public social investment fund. It transfers around US$1,000 million per year (2% of local GDP) to a wide variety of social programs nationwide.

Local employers and the Ministry of Treasury (Government) provide its economic resources through monthly financial contributions.

Along with the monthly financial transfers, a large database with detailed information on contributors is attached. Since 1978, the year FODESAF was established, DESAF had not had the opportunity to classify and analyze such crucial data.

A data science project was developed in SQL and RStudio to implement a data pipeline, in which all data was loaded and broken down into its key elements, in order to improve DESAF authorities' decision-making capabilities.

As inputs, annual databases from 2003 to 2019 (17 years) were loaded into the pipeline separately (250,000 records per year, 30.8 million in total). The results were automatic R Markdown reports with brand-new data frames and visualizations for each year, which helped authorities visualize and analyze elements that had not been seen in 43 years of DESAF's history.

After its implementation, the local government now has a unique, recurrent data science tool to improve management of financial contributions to the fight against poverty. As key learnings, data-taming skills were strengthened, project questions were defined to structure the project's code, a brand-new Contributors Mass Report was developed, and the project has been used as an economic-recovery follow-up instrument in the post-pandemic era.

Keywords: data pipeline, data taming, visualizations, automatic reports, poverty.



ID: 114 / ep-01: 17
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: sleep, chronobiology, mctq, data wrangling, reproducibility

mctq: An R Package for the Munich ChronoType Questionnaire

Daniel Vartanian, Ana Amélia Benedito-Silva, Mario Pedrazzoli

School of Arts, Sciences and Humanities (EACH), University of Sao Paulo (USP), Sao Paulo, Brazil

mctq is an R package that provides a complete and consistent toolkit to process the Munich ChronoType Questionnaire (MCTQ), a quantitative and validated method to assess people's sleep behavior presented by Till Roenneberg, Anna Wirz-Justice, and Martha Merrow in 2003. The aim of mctq is to facilitate the work of sleep and chronobiology scientists with MCTQ data while also helping with research reproducibility.

Although it may look like a simple questionnaire, the MCTQ requires a lot of date/time manipulation. This poses a challenge for many scientists, since most people have difficulty with date/time data, especially when dealing with an extensive set of data. The mctq package addresses this issue. mctq can handle the processing tasks for the three MCTQ versions (standard, micro, and shift) with few dependencies, relying largely on the lubridate and hms packages from the tidyverse. We also designed mctq with the user experience in mind, by creating an interface that resembles the way the questionnaire data are shown in MCTQ publications, and by providing extensive and detailed documentation about each computation proposed by the MCTQ authors. The package also includes several utility tools to deal with different time representations (e.g., decimal hours, radians) and time arithmetic issues, along with fictional datasets for testing and learning purposes.
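As a small illustration of the kind of date/time arithmetic involved (this is not the mctq interface), sleep duration and mid-sleep can be derived from onset and wake times with the hms package, taking care of episodes that cross midnight; the example times are invented.

    # Not the mctq interface -- a sketch of MCTQ-style time arithmetic with hms.
    library(hms)

    sleep_onset <- parse_hm("23:30")
    sleep_end   <- parse_hm("07:15")

    # Sleep episodes usually cross midnight, so wrap the difference into 0-24 h.
    sleep_duration <- as_hms((as.numeric(sleep_end) - as.numeric(sleep_onset)) %% 86400)

    # Mid-sleep: onset plus half the duration, again wrapped around midnight.
    mid_sleep <- as_hms((as.numeric(sleep_onset) + as.numeric(sleep_duration) / 2) %% 86400)
    sleep_duration
    mid_sleep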

The first stable version of mctq is available for download on GitHub. The package is currently under a software peer-review by the rOpenSci initiative. We plan to submit it to CRAN soon after the review process ends.



ID: 107 / ep-01: 18
Elevator Pitch
Topics: Ecology
Keywords: flow cytometry, cytometric diversity, microbial ecology, {flowDiv} workflow

{flowDiv} workflow: reproducible cytometric diversity estimates

María Victoria Quiroga1, Bruno M. S. Wanderley2, André M. Amado3, Fernando Unrein1

1Instituto Tecnológico de Chascomús (INTECH, UNSAM-CONICET), Argentina; 2Departamento de Oceanografia e Limnologia, Universidade Federal do Rio Grande do Norte, Brazil; 3Departamento de Biologia, Universidade Federal de Juiz de Fora, Brazil

Flow cytometry is widely used in life sciences, as it records thousands of single-cell data related to their morphological or physiological state within minutes. Hence, each sample has a characteristic cytometric pattern that can be studied through diversity indices: evenness, alpha-diversity and beta-diversity. Applying this approach to microbial ecology research frequently involves intricate handling of data recorded with different instrumental settings or sample dilutions. The {flowDiv} package overcomes this through data normalization and volume correction steps before estimating cytometric diversity. Here, we share a reproducible {flowDiv} workflow, hoping to help researchers streamline flow cytometry data processing.
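For context, the diversity indices involved can be illustrated with a few lines of base R on hypothetical binned counts (this is not the {flowDiv} interface, which works on flow cytometry files after normalization and volume correction):

    # Hypothetical binned cytometric counts for one sample
    counts <- c(bin1 = 1200, bin2 = 300, bin3 = 80, bin4 = 20)

    p <- counts / sum(counts)             # relative abundance of each bin
    shannon  <- -sum(p * log(p))          # alpha-diversity (Shannon index)
    evenness <- shannon / log(length(p))  # Pielou's evenness
    c(shannon = shannon, evenness = evenness)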



ID: 118 / ep-01: 19
Elevator Pitch
Topics: R in production
Keywords: data management

From Sheets to Success: Reliable, Reproducible, yet Flexible Production Pipelines

Katrina Brock

Sunbasket

At our growth-stage startup, our data team was challenged to produce accurate and reliable forecasts despite ever-changing product lines. To solve this, we used YAML-based configuration to specify inputs and models in a standardized way, the validate package to check inputs before running any custom transformations, and the drake package to isolate the business logic for each product line's forecast. We transformed a process that was previously a patchwork of SQL, R, and hundreds of spreadsheets into a set of robust pipelines that use a standardized set of steps. The reliability frees our team from constant troubleshooting and allows us to focus on building and tuning models to ever-increasing accuracy.
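A minimal sketch of how these three ingredients can fit together is shown below; the file names, validation rules, and forecasting function are hypothetical, and the yaml, validate, and drake packages are assumed.

    library(yaml)
    library(validate)
    library(drake)

    plan <- drake_plan(
      config = read_yaml(file_in("forecast_config.yml")),   # inputs/models per product line
      raw    = read.csv(file_in("orders.csv")),
      checks = confront(raw, validator(quantity >= 0, !is.na(product_line))),
      clean  = raw[raw$quantity >= 0 & !is.na(raw$product_line), ],
      fc     = fit_forecast(clean, config)                  # hypothetical business logic
    )

    make(plan)  # drake rebuilds only the targets whose inputs changed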



ID: 281 / ep-01: 20
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: survey design, multiwave sampling, Neyman allocation

Efficient multi-wave sampling with the R package optimall

Jasper B. Yang1, Bryan E. Shepherd2, Thomas Lumley3, Pamela A. Shaw1

1Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, U.S.A; 2Department of Biostatistics, Vanderbilt University School of Medicine, Vanderbilt University, Nashville, TN, U.S.A; 3Department of Statistics, University of Auckland, Auckland, New Zealand

When a study population is composed of heterogeneous subpopulations, stratified random sampling techniques are often employed to obtain more precise estimates of population characteristics. Efficiently allocating samples to strata under this method is a crucial step in the study design process, especially when data are expensive to collect. One common approach in epidemiological studies is a two-phase sampling design, where inexpensive variables collected on all sampling units are used to inform the sampling scheme for collecting the expensive variables on a subsample in the second phase. Recent studies have demonstrated that even more precise estimates can be obtained when the second phase is conducted over a series of adaptive waves. Unlike simpler sampling schemes, executing multi-phase and multi-wave designs requires careful management of many moving parts over repetitive steps, which can be cumbersome and error-prone. We present the R package optimall, which offers a collection of functions that efficiently streamline the design process of survey sampling, ranging from simple to complex. The package’s main functions allow users to interactively define and adjust strata cut points based on values or quantiles of auxiliary covariates, adaptively calculate the optimum number of samples to allocate to each stratum in order to minimize the variance of the target sample mean using Neyman allocation, and select specific IDs to sample based on a stratified sampling design. As the survey is performed, optimall provides a framework for every aspect of the sampling process, including the data and metadata, to be stored in a single object. Although it is particularly tailored towards multi-wave sampling under two- or three-phase designs, the R package optimall may be useful for any sampling survey.
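For context, the Neyman allocation that optimall automates can be written in a few lines of base R (this is the textbook formula, not the package's interface); the stratum sizes and standard deviations below are made up.

    N_h <- c(A = 5000, B = 2000, C = 500)   # stratum population sizes
    S_h <- c(A = 1.2,  B = 3.5,  C = 8.0)   # stratum standard deviations
    n   <- 300                              # total sample size for this wave

    # Neyman allocation: sample each stratum proportionally to N_h * S_h
    n_h <- n * (N_h * S_h) / sum(N_h * S_h)
    round(n_h)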

Link to package or code repository.
https://github.com/yangjasp/optimall


ID: 185 / ep-01: 21
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: explainable machine learning, predictive modeling, interactive visualization, dashboards

Open the Machine Learning Black-Box with modelStudio & Arena

Hubert Baniecki, Piotr Piątyszek

Warsaw University of Technology, Poland

Complex machine learning predictive models, a.k.a. black boxes, demonstrate high efficiency in a rapidly increasing number of applications. Simultaneously, there is a growing awareness among machine learning practitioners that we need more comprehensive tools for model explainability. Responsible machine learning will require continuous model monitoring, validation, and black-box transparency. These challenges can be met with novel frameworks that add automation and interactivity to explainable machine learning pipelines.

In this talk, we present the modelStudio and arenar packages which, at their core, automatically generate interactive and customizable dashboards that allow users to "open the black box". These tools build upon the DALEX package and are model-agnostic: compatible with most predictive models and frameworks in R. The emphasis is on lowering the entry threshold for crucial parts of modern MLOps practice. We showcase how little coding is needed to produce a powerful dashboard consisting of model explanations and data exploration visualizations. The output can be saved and shared with anyone, further promoting reproducibility and explainability in machine learning practice. Finally, we highlight the Arena dashboard's features; it specifically aims to compare various predictive models.
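A minimal sketch of this workflow is shown below, using a simple linear model on the apartments data shipped with DALEX; the call pattern follows the packages' documentation but should be read as illustrative rather than a definitive recipe.

    library(DALEX)
    library(modelStudio)

    model <- lm(m2.price ~ ., data = apartments)
    explainer <- explain(model,
                         data  = apartments[, colnames(apartments) != "m2.price"],
                         y     = apartments$m2.price,
                         label = "lm on apartments")

    # Generates the interactive, shareable dashboard that "opens the black box"
    modelStudio(explainer)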



ID: 247 / ep-01: 22
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: classification

Classifying Student’s Learning Pattern using R Sequence Analysis Packages: The Impact of Procrastination on Performance

Teck Kiang Tan

National University of Singapore,

Sequence analysis is an analytical technique for analyzing categorical longitudinal data, incorporating classification procedures that categorize categorical sequences. A framework for sequence analysis will be provided to give an overview of this approach.

This talk shares the findings of a study extracted from the National University of Singapore Learning Management System (LMS), which provides online video learning materials to students. The time spent learning and watching the online videos of a statistics course that ran for 35 days was extracted from the LMS to form sequences for all students, with each student forming a learning sequence of 35 states. Sequence analysis was carried out to classify students' time-spent learning patterns, and the outcomes of the sequence analysis are used to explain student performance.

The R package TraMineR and a few cluster analysis packages were used to carry out the sequence analysis and classify the sequences. Fifteen sequence distances were computed and their goodness-of-fit indices determined in order to select the best distance metric for classifying students. Four sequence complexity measures were examined to quantify whether students vary their time-use learning patterns as a study strategy: sequence entropy, turbulence, complexity, and the precarity index. The usefulness of graphing sequence analysis using the state distribution plot, sequence frequency plot, transversal entropy plot, sequence modal state plot, and representative sequence plot will be shared to point out their relevance in explaining the results of classifying students' learning sequences. Finally, inferential regression findings will show how the classification resulting from the sequence analysis, interpreted as a measure of learning procrastination, can be used to determine the degree to which procrastination affects student performance.
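The skeleton of such an analysis with TraMineR is sketched below on three invented daily-state sequences; the real study uses 35-day sequences and a wider set of distance metrics and cluster packages.

    library(TraMineR)

    # Toy data: three students, three days, states "none" vs "video"
    acts <- data.frame(
      d1 = c("none",  "video", "video"),
      d2 = c("none",  "video", "none"),
      d3 = c("video", "video", "none")
    )

    seqs  <- seqdef(acts)                                   # define state sequences
    dists <- seqdist(seqs, method = "OM", sm = "CONSTANT")  # optimal matching distances
    clust <- hclust(as.dist(dists), method = "ward.D2")     # classify learning patterns
    cutree(clust, k = 2)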



ID: 200 / ep-01: 23
Elevator Pitch
Topics: Data visualisation
Keywords: outliers, dataviz, ggplot2, data communication, blogdown

Outlier redemption

Violeta Roizman

Self-employed, France

In Data Science, outliers are usually perceived negatively, as a problem to solve prior to the data analysis. Many tutorials, blog posts and papers are written to offer tips about how to get rid of them. There is even the whole area of Robust Statistics focused on techniques that try to ignore outliers. However, I’ve discovered that outliers can be a very exciting part of the data. Outliers are quite often fun facts about the data. And I love fun facts. I love them so much that I collect all these fun-data-facts in a website called outlier redemption, where I only publish short data-driven stories about them. For each story, I publish the data and the R code used to produce the graphics and simulations. In this short talk, I will start by introducing the positive side of outliers. After that, I will present different visual ways in which we can spot and highlight outliers with R. I will finish with some of the fun outlier stories that I encountered while analyzing data.



ID: 115 / ep-01: 24
Elevator Pitch
Topics: Reproducibility

The Canyon of Success: Scaling Best Practices in R with Internal R packages

Malcolm Barrett

Teladoc Health

At useR! 2016, Hadley Wickham said that what he sought to design in the tidyverse was (after Jeff Atwood) a pit of success: it should be easy for users to succeed. They shouldn’t have to trudge uphill to use your tools effectively. They should fall right into the pit of success--no guard rails! In this talk, I’ll discuss how we use internal R packages to scale best practices across our data science and analytics teams at Teladoc Health. Using R packages has allowed new team members to quickly onboard, set up their work environments, and create reproducible work that aligns with our standards. Our sets of packages include opinionated, usethis-style workflows, database connections, reporting, and more. Designing these tools to make it easy to succeed has become a keystone in our design approach, allowing us to scale our practices with less intervention.



withdrawn
ID: 102 / ep-01: 25
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: Cancer, simulation, mathematical modeling

Modeling tumor evolutionary dynamics with SITH

Phillip Nicol1, Amir Asiaee2

1Harvard University, United States of America; 2Ohio State University, United States of America

A tumor of clinically detectable size is the result of a decades-long evolutionary process of cell replication and mutation. Mathematical models of cancer growth are becoming increasingly valuable for situations where high-quality clinical data is unavailable. We designed "SITH," a CRAN package that implements a stochastic spatial model of tumor growth and mutation. The goal of "SITH" is to provide fast simulations with a convenient interface for researchers interested in cancer evolution. The core simulation algorithm is written in C++ and integrated into R using the "Rcpp" framework. 3D interactive visualizations of the simulated tumor (using the "rgl" package) allow users to investigate the spatial location of various subpopulations. "SITH" also provides functions for analyzing the amount of heterogeneity present in the tumor. Finally, "SITH" can create simulated DNA sequencing datasets. This feature may be helpful for researchers interested in understanding how spatial heterogeneity can bias observed data.

Link to package or code repository.
https://CRAN.R-project.org/package=SITH


withdrawn
ID: 225 / ep-01: 26
Elevator Pitch
Topics: Reproducibility
Keywords: ecology

A Glimpse into the Reproducibility of Scientific Papers published in Movement Ecology: How are we doing?

Jenicca Poongavanan, Rocio Joo Arakawa, Mathieu Basille

University of Florida

Reproducibility is the hallmark of science, and thus of Movement Ecology as well. However, studies in disciplines such as biology and geosciences have shown that published work is rarely reproducible. Ensuring reproducibility is not a mandatory part of the research process, and thus there are no clear procedures in place to assess the reproducibility of scientific articles. In this study we put forward a reproducibility workflow scoring sheet based on six criteria that lead to successful reproducible papers. The reproducibility workflow can be used by authors to evaluate the reproducibility of their studies before publication and by reviewers to evaluate the reproducibility of scientific papers. To assess the state of reproducibility in Movement Ecology, we attempted to reproduce the results from Movement Ecology papers that use behavioral pattern identification methods. We selected 75 papers published in several journals from 2010-2020. According to our proposed reproducibility workflow, sixteen studies reflected at least some reproducibility (scores ≥ 4). In particular, we were only able to obtain the data for 16 out of 75 papers. Out of these, a minority of papers also provided code with the data (6 out of the 16 studies). Out of the 6 studies that made both data and code available, only four studies reflected a high level of reproducibility (scores ≥ 9), owing to good code annotation and execution. Based on our findings, we proposed guidelines for authors, journals and academic institutions to enhance the state of reproducibility in Movement Ecology.



Talk-Video
ID: 180 / ep-01: 27
Elevator Pitch
Topics: Time series
Keywords: Time series, Forecasting, Business/Industry, Economics, R programming

Forecasting under COVID: when simple works - and when it doesn’t

Maryam Shobeirinejad1, Steph Stammel2

1Transurban, Australia; 2Transurban, Australia

The worldwide impact of COVID-19 represented a global change point across almost all areas of life. As a result, forecasting is presenting new challenges for us to manage. Fit-for-purpose time series forecasts can range from deceptively simple but highly useful models to eye-wateringly complex ones that require expertise to implement correctly and usefully. In addition, COVID-19 has meant that many of the exogenous variables used to improve forecast models (such as macroeconomic data) are also compromised by the same global events. This talk discusses how we have dealt with these heightened challenges in corporate forecasting: balancing highly interpretable solutions with few assumptions (for example, seasonal decomposition models) against complex modelling that explicitly manages a rapidly changing environment (e.g. state space switching models). The balancing of different forecast model features (accuracy, interpretability, required assumptions, etc.) with the needs of stakeholders is a subject of considerable empirical review within our team. In a corporate context, this is balanced with tight time frames and a need to rapidly fit, estimate and compare large numbers of models and come to a solution that meets the needs of a complex array of users.

This presentation will discuss how the tidyverts ecosystem (tsibble, feasts, etc.) has been used to test and iterate batch sets of modelling to achieve robust solutions to time-sensitive business problems. We will discuss some of our findings in the context of a cost-benefit framework as we traded off the key features of a forecast model in search of the ‘fit for purpose’ solution.
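A minimal sketch of that batch-modelling pattern with tsibble and fable is shown below; the monthly series is simulated and the model set is illustrative.

    library(tsibble)
    library(fable)
    library(dplyr)

    demand <- tibble(
      month = yearmonth("2018 Jan") + 0:35,
      trips = 100 + 5 * sin(2 * pi * (0:35) / 12) + rnorm(36, sd = 3)
    ) %>%
      as_tsibble(index = month)

    fits <- demand %>%
      model(
        snaive = SNAIVE(trips),   # simple, highly interpretable baseline
        ets    = ETS(trips),      # exponential smoothing
        arima  = ARIMA(trips)     # more complex alternative
      )

    accuracy(fits)          # compare candidate models
    forecast(fits, h = 12)  # 12-month-ahead forecasts for each model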



ID: 209 / ep-01: 28
Elevator Pitch
Topics: R in the wild (unusual applications of R)
Keywords: GIS

gatpkg: Developing a geographic aggregation tool in R for non-programmers

Abigail Stamm

New York State Department of Health Bureau of Environmental and Occupational Epidemiology,

The Geographic Aggregation Tool (GAT) was developed in R to simplify and standardize the development of small geographic areas that meet the minimum thresholds required to provide stable and meaningful population measures. To improve usability, we converted GAT to an R package and designed it to be accessible to non-programmers with little to no experience in R. To run GAT, users need only install the package and run one line of R code. GAT's user-friendly interface offers a series of dialogs for the user to select their options and saves all files without requiring additional code. The package includes documentation on the files created, guidance on interpreting them, and advice on how to address aggregation issues. It also includes several examples using embedded shapefiles, a tutorial, and detailed documentation for advanced users interested in modifying or enhancing the tool. This talk will provide an overview of the tool, how it works, and the planning that informed its development, keeping accessibility and reproducibility in mind.

Link to package or code repository.
https://github.com/ajstamm/gatpkg


ID: 278 / ep-01: 29
Elevator Pitch
Topics: R in production
Keywords: HIV/AIDS, package, analysis, automation

tidyndr: An R package for analysis of the Nigeria HIV National Data Repository

Stephen Taiye Balogun, Scholastica Olanrewaju, Oluwaseun Okunuga, Temitope Kolade, Geraldine Chizoba Abone, Fati Murtala-Ibrahim, Helen Omuh

Institute of Human Virology Nigeria, Nigeria

Nigeria, which has the fourth-largest HIV epidemic in the world, is central to achieving the UNAIDS target of epidemiologic control of HIV/AIDS by 2030. Data on 1.3 million HIV-positive patients on treatment in the country are stored centrally in the National Data Repository (NDR). Using access levels, these data are accessible to the Government of Nigeria, donor agencies, implementing partners and other stakeholders to track progress and improve HIV programming. To achieve this, the data must be cleaned, processed, summarized, and communicated for easy recognition of progress and identification of gaps for tailored intervention. The analysis is traditionally conducted in Microsoft Excel using a file downloaded from the NDR. This means that the Excel software must be installed on the user's computer, and the user must be familiar with the formulas for calculating the various indicators. It lacks reproducibility and is error-prone, with errors occasionally going unnoticed. Performing the same analysis periodically can also be quite tedious and time-consuming.

The tidyndr package eliminates these bottlenecks by improving the user friendliness and automating routine analysis, saving several man-hours in the process while eliminating individual errors. The functions are grouped into four categories: importing, treatment, supporting, and summary functions. Together, these ensure that patient-level data are consistently imported into R, subset the data based on specific indicators, and provide summary tables both aggregated and disaggregated in line with the national requirements.

The output from this process is very useful to improve program performance and help achieve epidemiologic control of the virus at local, state, and national levels. With continued national efforts to provide patient-level information for HIV prevention and other services, the package can be scaled-up to support the analysis of these data. Finally, it provides a foundation upon which other relevant program applications can be built.

Link to package or code repository.
https://github.com/stephenbalogun/tidyndr


ID: 259 / ep-01: 30
Elevator Pitch
Topics: Data visualisation
Keywords: color accessibility, data visualization, data organization, cvd accessible colors, microbiome

Do you see what I see? Introducing microshades: An R package for improving color accessibility and organization of complex data

Lisa Karstens, Erin Dahl, Emory Neer

Oregon Health & Science University, United States of America

Approximately 300 million people in the world have Color Vision Deficiency (CVD). When creating figures and graphics with color, it is important to consider that individuals with CVD will interact with this material, and may incorrectly perceive information associated with color. Multiple CVD-friendly color palettes are available in R; however, they are limited to eight colors. When working with complex data, such as microbiome data, this is insufficient. To overcome this limitation, we created the microshades R package, designed to provide custom color shading palettes that improve accessibility and data organization.

The microshades package includes two crafted color palettes, microshades_cvd_palettes and microshades_palettes. Each color palette contains six base colors with five incremental light to dark shades, for a total of 30 available colors per palette type that can be directly applied to any plot. The microshades_cvd_palettes contain colors that are universally CVD friendly. The individual microshades_palettes are CVD friendly, but when used in conjunction with multiple microshades_palettes, are not universally accessible.

The microshades package also contains functions to aid in data visualization, such as creating stacked bar plots organized by a data-driven hierarchy. To further assist users with data storytelling, there are functions to sort data both vertically and horizontally based on ranked abundance or user specification. The accessibility and advanced color organization features help data reviewers and consumers notice visual patterns and trends in data more easily. Examples of microshades in action are available on our website, for both microbiome and other datasets.

Link to package or code repository.
https://karstenslab.github.io/microshades
 
2:15am - 3:15ammixR!
Music, networking channel and raffles. To end the day in a relaxing way
6:30am - 7:30amCreationLab
Session Chair: Marcela Alfaro Cordoba
Draw cartoons and tell stories
7:30am - 8:30amKeynote: Can we do this in R? - Answering questions about air quality one code at a time
Virtual location: The Lounge #key_kushwaha
Session Chair: Adithi R. Upadhya
Zoom Host: Rachel Heyard
Replacement Zoom Host: Nasrin Fathollahzadeh Attar
 
ID: 356 / [Single Presentation of ID 356]: 1
Keynote Talk
Topics: Environmental sciences

Meenakshi Kushwaha

ILK Labs, Bangalore

We are a young team of Environmental health researchers, geospatial analysts, and air quality researchers using innovative solutions for air quality monitoring in low resource settings. Every time we encounter a large dataset, a new modelling approach, a new statistical technique, a new visualization challenge, we ask ourselves - “Can we do this in R ?”, and for the past four years (since we started this work), the answer has been a resounding “yes”. I will share how we use R not just for data analysis and visualization but also as a great open source tool for collaboration and engagement.

 
8:30am - 8:45amBreak
Virtual location: The Lounge #lobby
8:45am - 10:15am3A - Machine Learning and Data Management
Location: The Lounge #talk_ml_dm
Session Chair: Young-suk Lee
Zoom Host: Nasrin Fathollahzadeh Attar
Replacement Zoom Host: Dorothea Hug Peter
Session Sponsor: MemVerge
Session Slides
 
8:45am - 9:05am
Talk-Live
ID: 188 / ses-03-A: 1
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: Automated Machine Learning, R package, Hyperband

mlr3automl - Automated Machine Learning in R

Alexander Bernd Hanf

Ludwig-Maximilians-Universität Munich

We introduce mlr3automl, an open-source framework for Automated Machine Learning in R. Based on the mlr3 Machine Learning package, mlr3automl builds robust and accurate classification and regression models for tabular data.

mlr3automl provides automatic preprocessing, which guarantees stable performance in the presence of missing data, categorical and high-cardinality features, and large data sets. Preprocessing and model building are handled through a flexible pipeline implemented with mlr3pipelines. This allows mlr3automl to jointly optimize preprocessing, model selection and model hyperparameters using Hyperband.

mlr3automl shows strong performance and stability on a benchmark consisting of 39 challenging classification tasks. mlr3automl successfully completed every task in the benchmark within the strict time budget, which three out of five other state-of-the-art AutoML systems failed to achieve.
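For context, the building blocks that mlr3automl automates are sketched below with plain mlr3 (this is not the mlr3automl interface): a task, a manually chosen learner, and a resampled evaluation.

    library(mlr3)

    task    <- tsk("sonar")                     # built-in binary classification task
    learner <- lrn("classif.rpart", cp = 0.01)  # one manually chosen model
    rr      <- resample(task, learner, rsmp("cv", folds = 5))

    rr$aggregate(msr("classif.acc"))
    # mlr3automl replaces these manual choices with a preprocessing +
    # model-selection pipeline tuned via Hyperband.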

Link to package or code repository.
https://github.com/a-hanf/mlr3automl


9:05am - 9:25am
Talk-Video
ID: 168 / ses-03-A: 2
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: exploratory data analysis

Triplot: model agnostic measures and visualisations for variable importance in predictive models that take into account the hierarchical correlation structure

Katarzyna Pękala, Katarzyna Woźnica, Przemysław Biecek

MI2 Data Lab, Warsaw University of Technology

One of the key elements of the explanatory analysis of a predictive model is to assess the importance of the individual variables. The rapid development of the area of predictive model exploration (also called explainable artificial intelligence or interpretable machine learning) has led to the popularization of local (instance-level) and global (dataset-level) methods, such as permutational variable importance, Shapley values (SHAP), Local Interpretable Model-agnostic Explanations (LIME), Break Down, and so on. However, these methods do not use information about the correlation between features, which significantly reduces the explainability of the model behaviour.

In this work, we propose new methods to support model analysis by exploiting the information about the correlation between variables. The dataset level aspect importance measure is inspired by the block permutations procedure, while the instance level aspect importance measure is inspired by the LIME method. We show how to analyse groups of variables (aspects) both when they are proposed by the user and when they should be determined automatically based on the hierarchical structure of correlations between variables.

Additionally, we present a new type of model visualisation, triplot, that exploits a hierarchical structure of variable grouping to produce a high information density model visualisation. This visualisation provides a consistent illustration for either local or global model and data exploration.

We also show an example of real-world data with 5k instances and 37 features in which a significant correlation between variables affects the interpretation of the effect of variable importance.

The proposed method is, to our knowledge, the first to allow direct use of the correlation between variables in exploratory model analysis. Triplot package for R is developed under open source GPL-3 licence and is available on GitHub repository at https://github.com/ModelOriented/triplot.

Link to package or code repository.
https://github.com/ModelOriented/triplot


9:25am - 9:45am
Talk-Video
ID: 252 / ses-03-A: 3
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: networks, embeddings, machine learning, algorithms

Getting sprung in R: Introduction to the rsetse package for embedding feature-rich networks

Jonathan Bourne

UCL, United Kingdom

The Strain Elevation Tension Spring embedding algorithm (SETSe) is a deterministic method for embedding feature-rich networks. The algorithm uses simple Newtonian equations of motion and Hooke's law to embed the network onto a locally Euclidean manifold. To create the embedding, SETSe converts node attributes into forces and the edge attributes into springs. SETSe finds an equilibrium position when the forces on the springs balance the forces of the nodes. The algorithm has low time complexity and linear memory complexity; this means the algorithm avoids issues faced by other physics-based embedding methods and can be used to embed graphs with tens of thousands of nodes and more than a million edges.

Some applications of SETSe include: analysing social networks; understanding the robustness of power grids; geographical analysis; predicting node features; understanding power dynamics between individuals and organisations; and analysing molecular structures.

This presentation will provide both a brief technical discussion of the algorithm and its implementation, as well as several use cases. The use cases describe how to embed a network and then how to interpret that embedding.

There are very few options for graph embeddings using R, and this is something that rsetse seeks to address; the algorithm has been implemented in the package `rsetse` and is available on CRAN.



9:45am - 10:05am
Talk-Video
ID: 137 / ses-03-A: 4
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: data envelopment analysis

An R package for the implementation of Efficiency Analysis Trees and the estimation of technical efficiency

Miriam Esteve, Victor J. España, Juan Aparicio, Xavier Barber

Miguel Hernández University of Elche

EAT is a new R package that includes functions to estimate production frontiers and technical efficiency measures using non-parametric techniques based on CART regression trees. The package implements the main algorithms associated with a new technique introduced to estimate the efficiency of a set of decision making units in Economics and Engineering through machine learning techniques, called Efficiency Analysis Trees (Esteve et al., 2020). It encompasses the estimation of radial measures, oriented Russell efficiency measures, the directional distance function, the weighted additive model, graphical representations of the production frontier using tree-shaped structures and the classification of input variable importance. In addition, it incorporates a code to carry out an adaptation of the Random Forest Algorithm to estimate technical efficiency. This work describes the methodology and application of the functions.

Link to package or code repository.
https://github.com/MiriamEsteve/EAT
 
8:45am - 10:15am3B - Spatial Analysis
Location: The Lounge #talk_spatial_analysis
Session Chair: Inger Fabris-Rotelli
Zoom Host: Tuli Amutenya
Replacement Zoom Host: Rachel Heyard
 
8:45am - 9:05am
Talk-Video
ID: 192 / ses-03-B: 1
Regular Talk
Topics: Spatial analysis
Keywords: spatial-analysis, graph-analysis, simple-features, tidygraph, spatial-networks

Tidy Geospatial Networks in R

Lucas van der Meer1, Lorena Abad1, Andrea Gilardi2, Robin Lovelace3

1University of Salzburg, Austria; 2University of Milano - Bicocca, Italy; 3University of Leeds, England

Geospatial networks are graphs embedded in geographical space. That means that both the nodes and edges in the graph can be represented as geographic features: the nodes most commonly as points, and the edges as linestrings. They play an important role in many different domains, ranging from transportation planning and logistics to ecology and epidemiology. The structure and characteristics of geospatial networks go beyond standard graph topology, and therefore it is crucial to explicitly take space into account when analyzing them. The sfnetworks R package is created to facilitate such an integrated workflow. It brings together the sf package for spatial data science and the tidygraph package for standard graph analysis. The core of the package is a data structure that can be provided as input to both the graph analytical functions of tidygraph as well as the spatial analytical functions of sf, without the need for conversion. Additionally, it offers a set of geospatial network specific functions, such as routines for shortest path calculation, network cleaning and topology modification. The package is designed as a general-purpose package suitable for usage across different application domains, and can be seamlessly integrated in "tidy" workflows that use the tidyverse packages for data science.
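A short sketch of that integrated workflow, using the roxel demo data shipped with sfnetworks, is given below; the helper names follow the package documentation and should be read as illustrative.

    library(sfnetworks)
    library(tidygraph)
    library(sf)
    library(dplyr)

    net <- as_sfnetwork(roxel, directed = FALSE) %>%
      activate("edges") %>%
      mutate(weight = edge_length())          # spatial edge lengths as weights

    # Graph measures (tidygraph) and spatial output (sf) on the same object
    net %>%
      activate("nodes") %>%
      mutate(bc = centrality_betweenness(weights = weight)) %>%
      st_as_sf()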



9:05am - 9:25am
Talk-Live
ID: 290 / ses-03-B: 2
Regular Talk
Topics: Spatial analysis
Keywords: slopes, gradient

slopes: a package for calculating slopes of roads, rivers and other linear (simple) features

Robin Lovelace1, Rosa Félix2

1University of Leeds, United Kingdom; 2University of Lisbon, Portugal

The goal of slopes is to enable reproducible calculation of slopes for urban, transport, and ecological applications and research projects using free and open source software. We have developed the package to be fast, accurate and user friendly, calculating the longitudinal steepness of linear features such as roads and rivers based on open access datasets such as road geometries and digital elevation models (DEMs). The package has a few unique features, including the ability to calculate slopes based on multiple input classes for raster data, and the ability to download and use DEM data on the fly in places where users lack DEM data. The package is a work in progress but has already attracted attention, with road steepness maps of cities in Portugal and Brazil. Integrating with other packages such as sf and sfnetworks, the package should provide a strong foundation for research into the impacts of vertical gradient profiles on phenomena ranging from aquatic migration patterns to flooding and walking and cycling potential. In the talk we will present both the package and some of the research questions we have used it to explore, and will ask the audience: how steep a hill would you be willing to walk or cycle up? We will conclude by discussing limitations of the package and future directions of development.

Link to package or code repository.
https://github.com/ITSLeeds/slopes


9:25am - 9:45am
Talk-Live
ID: 142 / ses-03-B: 3
Regular Talk
Topics: Spatial analysis
Keywords: cartography, maps, spatial analysis

mapsf, a New Package for Thematic Mapping

Timothée Giraud

UMS RIATE - CNRS

{mapsf} helps to design various cartographic representations such as proportional symbols, choropleth or typology maps. It also offers several functions to display layout elements that improve the graphic presentation of maps.

The aim of {mapsf} is to obtain thematic maps with the visual quality of those built with a classical mapping or GIS software while being lightweight, versatile and user-friendly. To achieve this goal, the package takes advantage of the features offered by {sf} and provides a limited number of simple mapping functions.

{mapsf} is the successor of {cartography}; it offers the same core features but is simpler and more robust. Unlike other popular cartographic packages, it does not use the grammar of graphics; it depends on a limited number of packages and displays georeferenced plots using base R graphics.

The main function of the package, mf_map(), gives access to 9 map types: base maps, proportional or graduated symbols, choropleth maps, typology maps and various combinations of symbology. Many parameters are available to fine-tune the cartographic representations. These parameters are the common ones found in GIS and automatic cartography tools (e.g. classification, color palettes, symbol sizes, legend layout...).
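A minimal example of that interface, based on the mtq demo dataset shipped with the package and following the documented argument names (treat the details as illustrative), could look like this:

    library(mapsf)

    mtq <- mf_get_mtq()                      # demo dataset: municipalities of Martinique
    mf_map(mtq)                              # base map
    mf_map(mtq, var = "POP", type = "prop",  # proportional symbols on top
           add = TRUE)
    mf_layout(title = "Population in Martinique")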

Some additional functions are dedicated to layout design (graphic themes, legends, scale bar, north arrow, title, credits…), map insets or map exports.

The development of {mapsf} follows the current best practices of the R ecosystem (CI/CD, coverage tests) and its documentation is enhanced by a vignette and a website.



9:45am - 10:05am
Talk-Video
ID: 141 / ses-03-B: 4
Regular Talk
Topics: Spatial analysis
Keywords: data management

osmextract: An R package to download, convert, and import large OpenStreetMap datasets

Andrea Gilardi1, Robin Lovelace2

1University of Milano - Bicocca; 2University of Leeds

OpenStreetMap (OSM) is an online database that provides open-access geographic and rich attribute data worldwide, representing a wide range of physical and human features, including roads, rivers, and political boundaries. OSM is the world’s largest open-access source of geographic vector data, comprising nodes (points), ways (lines and polygons) and relations (describing a wide range of entities). Practical applications include disaster response, transport planning, and service location. OSM datasets can be manually downloaded from the project’s servers directly or via the R package osmdata, which uses the Overpass API. Large 'extracts' are also available from external providers (such as geofabrik.de) in a compressed binary format based on protocol buffers. The aim of osmextract is to enable processing and import of such OSM extracts. The package is composed of three main functions that can be used to 1) match an input location with one of the OSM extracts, either via spatial matching or approximate string distance; 2) download the chosen file; 3) convert the compressed data to Geopackage format. The main function, named oe_get(), returns sf objects. This workflow is effective for importing OSM extracts covering large geographical areas. Furthermore, the conversion process is based on GDAL routines, enabling customized spatial filters or SQL-like queries, further boosting import performance.
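A small example of this workflow with oe_get() is sketched below; the place name, layer, and SQL-like filter are illustrative.

    library(osmextract)

    # Match the place to an extract, download and convert it, and read only the
    # road network into an sf object (filtered during the conversion step).
    roads <- oe_get(
      "Isle of Wight",
      layer = "lines",
      query = "SELECT * FROM lines WHERE highway IS NOT NULL"
    )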

Link to package or code repository.
Repository: https://github.com/ropensci/osmextract; Website: https://docs.ropensci.org/osmextract/
 
8:45am - 10:15am3C - R Packages 2
Location: The Lounge #talk_packages_2
Session Chair: Maëlle Salmon
Zoom Host: Maryam Alizadeh
Replacement Zoom Host: Faith Musili
 
8:45am - 9:05am
Talk-Video
ID: 169 / ses-03-C: 1
Regular Talk
Topics: Efficient programming
Keywords: API

autotest: Automatic testing of R packages

Mark Padgham

rOpenSci

The 'autotest' package has been developed by rOpenSci to automatically test the robustness of R packages to unexpected inputs. We hope that its usage will enable and encourage software to reach the highest possible quality prior to our peer-review process. 'autotest' implements a form of mutation testing which identifies expected or permitted forms for each parameter, and examines how each function responds to mutations of those inputs. Many software bugs are uncovered by packages being used in ways that developers themselves may not have anticipated, yet no developer can anticipate all potential ways software may be used. 'autotest' eases the task of making software robust to "unexpected" usage by testing and reporting any points at which mutations to inputs generate unexpected results.

The package also matches expectations with textual descriptions provided by function documentation, and ensures that descriptions of input and output parameters are sufficient for users to understand the ranges of admissible inputs and of returned values. Application of 'autotest' to a package should thus ensure that the forms and ranges of every parameter of every function are clearly described, and that all functions respond consistently to as many diverse forms of input as possible.

Finally, 'autotest' can also be used to automatically generate a package test suite. Although results are highly variable, applying the tool to a package consisting primarily of numeric algorithms can "automatically" generate a test suite covering well over 50% of the code.

Link to package or code repository.
https://github.com/ropenscilabs/autotest


9:05am - 9:25am
Talk-Video
ID: 111 / ses-03-C: 2
Regular Talk
Topics: Efficient programming
Keywords: algorithms

A fresh look at unit testing with tinytest

Mark van der Loo

Statistics Netherlands

"The tinytest package[1,2] implements a light weight and flexible framework for unit testing R packages. In spite of its young age, tinytest has become quite popular: since it was released on CRAN in the spring of 2019, more than 140 packages on CRAN and Bioconductor have started to use tinytest for automatic unit testing. This includes influential packages such as Rcpp.

Tinytest has a few unique features that set it apart from other testing frameworks, such as parallelization and tracking of side-effects. Side effects such as changes in environment variables are important, for example when working with locale-sensitive operations such as sorting, or date-time conversions. Tinytest also makes it easy to temporarily manipulate the testing environment during the run of the test. For example by changing environment

variables. In tinytest, test results are just another type of data. They can be easily translated to data frame layout which to investigate results, or export them from an automated build environment. Moreover, tests are installed with the package so package authors can ask their users to run tests on the user's infrastructure. Using tinytest is easy, as test scripts require no special code: tinytest automatically collects and organizes test results that are

created by any unit test expectation occurring in the script.

As the name suggests, tinytest is a small package, built in less than 1200 lines of code and no dependencies other than two R-base packages that come with any R installation.

[1] M van der Loo (2017). tinytest: R package version 1.2.4. https://cran.r-project.org/package=tinytest

[2] MPJ van der Loo (2020) A method for deriving information from running R code. R-Journal (Accepted) https://arxiv.org/abs/2002.07472
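A minimal test file (for example inst/tinytest/test_sum.R in a hypothetical package) needs no special setup code; expectations are ordinary function calls and the results come back as data:

    library(tinytest)

    expect_equal(sum(1:4), 10)          # passes
    expect_true(is.numeric(sum(1:4)))   # passes
    expect_error(sum("a"))              # passes: sum() errors on character input

    # For an installed package, users can run the shipped tests themselves and
    # inspect the results as a data frame ("mypkg" is a placeholder):
    # as.data.frame(test_package("mypkg"))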

Link to package or code repository.
https://cran.r-project.org/package=tinytest


9:25am - 9:45am
Talk-Live
ID: 239 / ses-03-C: 3
Regular Talk
Topics: R in production
Keywords: R markdown

{fusen}: Create a package from Rmd files

Sébastien Rochette

ThinkR

You know how to build an R Markdown file for reproducibility, you were told to (or would like to) put your work in an R package, but you think this is too much work? You do not understand where to put what, and when? What if writing an Rmd were the same as writing a package? Let {fusen} help you with this task.

When you write an R Markdown file (or a vignette), you create documentation for your analysis (or package). Inside, you write functions, apply them to examples, and perhaps write unit tests to verify the outputs. This is even more true if you follow this guide: ['Rmd first': When development starts with documentation](https://rtask.thinkr.fr/blog/rmd-first-when-development-starts-with-documentation/).

Why not transform this workflow into a documented, tested and maintainable R package to ensure the sustainability of your analyses? To do so, you would need to move your functions and scripts to the correct places. Let {fusen} do this transformation for you!

{fusen} is first addressed to people who have never written a package before but know how to write an R Markdown file. Understanding the package infrastructure and setting it up correctly can be daunting. This package may help them take the first step!

{fusen} is also addressed to more advanced developers who are tired of switching between R files, test files and vignettes. In particular, when changing the arguments of a function, we need to update examples and unit tests in multiple places. Here, you can do it in one place, with no risk of forgetting one. The {fusen} package is itself built with {fusen}, from Rmd template files stored in the appropriate place.
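A sketch of the workflow could look like the following; the flat Rmd file name is illustrative and the inflate() arguments follow the package documentation, so treat them as assumptions rather than a definitive recipe.

    library(fusen)

    # 1. Write functions, examples and tests as chunks of one Rmd file,
    #    e.g. dev/flat_analysis.Rmd. Then turn it into a package:
    inflate(
      flat_file     = "dev/flat_analysis.Rmd",
      vignette_name = "Analysis",
      check         = TRUE
    )
    # 2. {fusen} dispatches each chunk to R/, tests/ and vignettes/ in a
    #    standard package structure.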

 
10:15am - 11:15amRechaRge 2
Session Chair: Marcela Alfaro Cordoba
Zoom Host: Rachel Heyard
Replacement Zoom Host: Maryam Alizadeh
Replacement Zoom Host 2: Nasrin Fathollahzadeh Attar
Yoga for Calm + Meditation
10:15am - 11:15amRLadies Meeting
Location: The Lounge #R-Ladies
Session Chair: Sara Mortara
Zoom Host: Tuli Amutenya
Replacement Zoom Host: Faith Musili
11:15am - 12:45pm4A - Trends, Markets and Models
Location: The Lounge #talk_trends_markets_models
Session Chair: Mouna Belaid
Zoom Host: Faith Musili
Replacement Zoom Host: Maryam Alizadeh
 
11:15am - 11:35am
Talk-Live
ID: 217 / ses-04-A: 1
Regular Talk
Topics: Social sciences
Keywords: business/industry, internationalization, Google Trends

Let Me Google That for You – Measuring global trends using Google Trends

Harald Puhr, Jakob Müllner

WU Vienna,

We present the globaltrends package as a flexible and user-friendly means to analyze data from Google Trends. Google offers public access to global search volumes from its search engine through the Google Trends portal. Users select keywords for which they want to obtain search volumes and specify the time period and location (global, country, state) of interest. For these combinations of keywords, periods, and locations, Google Trends provides search volumes that indicate the number of search queries submitted to the Google search engine. However, Google constrains users to batches of five keywords and normalizes results for each batch. Thereby, large-scale analysis and comparability across batches are impaired. By re-normalizing results to a set of user-defined baseline keywords, the globaltrends package overcomes these limitations. This gives users the opportunity to download and measure search scores, i.e., volumes set to a common baseline, for several keywords across or within locations. In addition, users can visualize distributions, developments, and out-of-the-ordinary changes in global search scores or for specific locations. The package allows researchers and analysts to use these search scores to investigate global trends based on patterns within them. This offers insights such as the degree of internationalization of firms and organizations, or the dissemination of political, social, or technological trends across the globe or within single countries.

Link to package or code repository.
https://github.com/ha-pu/globaltrends


11:35am - 11:55am
Talk-Live
ID: 230 / ses-04-A: 2
Regular Talk
Topics: Economics / Finance / Insurance
Keywords: economics, finance, beahaviour, package

Computing Disposition Effect on Financial Market Data

Lorenzo Mazzucchelli1, Marco Zanotti2

1University of Milan; 2T-Voice - Triboo Group

In recent years, an irrational phenomenon in financial markets has been grabbing the attention of behavioral economists: the disposition effect. First documented by H. Shefrin and M. Statman (1985), the disposition effect is the finding that investors are more likely to sell an asset when it is gaining value than when it is losing value. It is a phenomenon closely related to the sunk cost bias, diminishing sensitivity, and loss aversion.

From 1985 until now, the disposition effect has been documented among US retail stock investors as well as foreign retail investors, and even among professionals and institutions. By now, it is a well-established fact that the disposition effect is a real behavioral anomaly that strongly influences investors' final profits (or losses). Furthermore, being able to correctly capture these irrational behaviors in a timely manner is even more important in periods of high financial volatility such as the present.

The presentation focuses on the new dispositionEffect R package, which allows users to quickly evaluate the presence of disposition-effect behaviors in an investor based solely on their transactions and the market prices of the traded assets. A simple step-by-step practical guide is presented to show how to use all the implemented functionalities effectively. Finally, since financial data may be huge in size, efficiency concerns are discussed and the parallelized versions of the functions are shown.



11:55am - 12:15pm
Talk-Live
ID: 152 / ses-04-A: 3
Regular Talk
Topics: Economics / Finance / Insurance
Keywords: benchmarking

The R Package diseq: Estimation Methods for Markets in Equilibrium and Disequilibrium

Pantelis Karapanagiotis

Goethe University Frankfurt

Market models constitute a major cornerstone of empirical research in industrial organization and macroeconomics. Previous literature in these fields has proposed a variety of estimation methods both for markets in equilibrium, which typically entail a market-clearing condition, and in disequilibrium, in which the primary identification condition comes from the short-side rule. Although methodologically attractive, the estimation of such models, in particular of the disequilibrium models, is computationally demanding, and software providing simple, out-of-the-box methods for estimating them is scarce. Econometricians, therefore, mostly rely on their own implementations for estimating these models. This talk presents the R package diseq, which provides functionality to simplify the estimation of models for markets in equilibrium and disequilibrium using full information maximum likelihood methods. The basic functionality of the package is presented based on the data and the classic analysis originally performed by Fair & Jaffee (1972). The talk also gives an overview of the design of the package, presents the post-estimation analysis capabilities that accompany it, and provides statistical evidence of the computational performance of its functionality gathered via large-scale benchmarking simulations. diseq is free software that is distributed under the MIT license as part of the R software project. It comprises a set of estimation tools which are to a large extent not available from either alternative R packages or other statistical software projects.

Link to package or code repository.
https://github.com/pi-kappa-devel/diseq
 
11:15am - 12:45pm4B - Data viz and Spatial Applications
Location: The Lounge #talk_dataviz_spatial
Session Chair: Natalia Soledad Morandeira
Zoom Host: Rachel Heyard
Replacement Zoom Host: Tuli Amutenya
 
11:15am - 11:35am
Talk-Live
ID: 216 / ses-04-B: 1
Regular Talk
Topics: Data visualisation
Keywords: high-dimensional data

Visual Diagnostics for Constrained Optimisation with Application to Guided Tours

H. Sherry Zhang1, Dianne Cook1, Ursula Laa2, Nicolas Langrené3, Patricia Menéndez1

1Monash University; 2University of Natural Resources and Life Sciences; 3CSIRO Data61

The guided tour searches for interesting low-dimensional views of high-dimensional data by optimising a projection pursuit index function. The first paper on projection pursuit, by Friedman and Tukey (1974), stated that “the technique used for maximising the projection index strongly influences both the statistical and the computational aspects of the procedure.” While much work has been done on proposing indices in the literature, less has been done on evaluating the performance of the optimisers. In this paper, we implement a data collection object in the optimisation of the projection pursuit guided tour and introduce visual diagnostics based on the data object collected. These diagnostics and this workflow can be applied to a broad class of optimisers to assess their performance. An R package, ferrn, has been created to implement the diagnostics.

Link to package or code repository.
https://github.com/huizezhang-sherry/ferrn


11:35am - 11:55am
Talk-Video
ID: 181 / ses-04-B: 2
Regular Talk
Topics: Spatial analysis
Keywords: open data, spatial data, data visualization, spatial analysis

geofi-package: Facilitating the access to key spatial datasets in Finland

Markus Kainu

National Social Insurance Institution of Finland (KELA), Finland

There is a growing demand for presenting statistical data on maps. COVID-19 launched a race across the internet in spatial data visualization, where aesthetics, usability and real-timeliness are highly valued. The demand for real-time data favours solutions that can be scripted, automated and refactored quickly, and for that purpose we developed the geofi R package to facilitate access to key Finnish geospatial datasets <https://ropengov.github.io/geofi/>. geofi combines resources from two Statistics Finland APIs: the regional classification API and the spatial data API. Time series of regional classifications are shipped as on-board data, while larger spatial data are fetched through a WFS API providing administrative borders, zip codes, and both population and statistical grids at various resolutions. The package aims to be an onboarding technology into the R ecosystem, with clear and concrete vignettes covering the basics of spatial data manipulation, working with attribute data, and step-by-step instructions for creating both static and interactive maps. This talk describes the main functions and design principles of the package and compares it with similar packages such as geobr for Brazil and geouy for Uruguay.
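As a brief illustration of the workflow the talk describes, the sketch below fetches municipality polygons and draws them with ggplot2. It assumes the get_municipalities() interface documented on the package site; argument defaults may differ between geofi versions.

library(geofi)
library(ggplot2)

# sf object with the regional classification shipped as attribute columns
muni <- get_municipalities(year = 2021)

ggplot(muni) +
  geom_sf() +
  labs(title = "Finnish municipalities, 2021 (geofi)")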

Link to package or code repository.
https://github.com/rOpenGov/geofi


11:55am - 12:15pm
Talk-Video
ID: 159 / ses-04-B: 3
Regular Talk
Topics: Data visualisation
Keywords: environmental sciences

Virtual Environments: Using R as a Frontend for 3D Rendering of Digital Landscapes

Michael J. Mahoney1, Colin M. Beier2, Aidan C. Ackerman3

1Graduate Program in Environmental Science, State University of New York College of Environmental Science and Forestry; 2Department of Sustainable Resources Management, State University of New York College of Environmental Science and Forestry; 3Department of Landscape Architecture, State University of New York College of Environmental Science and Forestry

This talk discusses a new approach to using R to create 3D landscape visualizations, which relies on external tooling designed specifically for detailed 3D rendering and interactive exploration. By using R as a frontend for high-performance rendering engines, users are able to quickly create data-defined renders which can then be interactively explored and manipulated. Two of the most promising engines for this approach are the (proprietary, source-available) Unity rendering engine, which excels at visualizing large swaths of land, and the (free and open-source) Blender engine, which is well adapted for visualizing smaller settings.

Our new {terrainr} package (available from CRAN) helps users quickly produce terrain surfaces from real-world data in Unity, visualizing environmental patterns and processes across large scales. Two new experimental packages, {mvdf} and {forthetrees}, focus on creating smaller-scale renders in Blender. Taken together, these packages suggest a way for users to create data-defined 3D renderings within R, using their preexisting coding abilities in the place of complex user interfaces to control powerful rendering engines. Our approach makes it possible for users to create renderings from data in these engines faster and easier than has been historically possible.
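A minimal sketch of the data-retrieval step that {terrainr} provides, assuming the get_tiles()/merge_rasters() interface and the service names from the package documentation; the bounding box is illustrative only.

library(terrainr)
library(sf)

# Illustrative area of interest as an sf bounding-box polygon (WGS84)
aoi <- st_sf(geometry = st_as_sfc(st_bbox(
  c(xmin = -74.05, ymin = 44.05, xmax = -73.95, ymax = 44.15),
  crs = st_crs(4326))))

tiles  <- get_tiles(aoi, services = c("elevation", "ortho"))  # downloads tile files per service
merged <- lapply(tiles, merge_rasters)                        # one mosaicked raster per service
# The merged rasters can then be exported as terrain surfaces for Unity (or Blender).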

 
12:45pm - 1:00pmBreak
Virtual location: The Lounge #lobby
1:00pm - 2:30pmElevator Pitches 2
Virtual location: The Lounge #elevator_pitches
 
ID: 195 / ep-02: 1
Elevator Pitch
Topics: Reproducibility
Keywords: DOI, reproducibility, credibility, Open Science, Zenodo, data science

Make Your Computational Analysis Citable

Batool Almarzouq1,2,3

1University of Liverpool, United Kingdom; 2Open Science Community Saudi Arabia; 3King Abdullah International Medical Research Center (KAIMRC), Saudi Arabia

Although there are overwhelming resources about licensing and citation for R software packages, less attention is paid to making non-package (data science) R code citable. Academics and researchers who want to embrace Open Science practices are often unaware of how to make their R code citable before publishing in academic journals and what kind of licence they may use to protect the intellectual property of their work. This lightning talk will highlight the aspects that matter to data scientists, including generating persistent DOIs, metadata, tracking of data re-use, licensing, access control and long-term availability. It will start by introducing the `zen4R` package, which will be used to generate a Digital Object Identifier (DOI) for any R code from RStudio. This R package provides an interface to the Zenodo e-infrastructure API, a general-purpose open-access repository developed under the European OpenAIRE programme and operated by CERN. Then, I'll show how you can add metadata and track your data/code re-use. Also, to protect the project's intellectual property, several types of licences applicable to non-package (data science) R code will be described and applied using the `usethis` package.

By the end of the talk, academics and researchers who use R frequently will have the tools needed to publish the full research life cycle of their projects while protecting the intellectual property of their work. This will increase efficiency and bring benefits to the broader scientific community by increasing reproducibility and credibility.
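A hedged sketch of the workflow outlined above: zen4R exposes the Zenodo API through R6 classes, and the class and method names used here (ZenodoManager, ZenodoRecord, depositRecord) follow the package documentation but may differ between versions; the licensing step uses usethis.

library(zen4R)

zenodo <- ZenodoManager$new(token = Sys.getenv("ZENODO_TOKEN"))

rec <- ZenodoRecord$new()
rec$setTitle("Analysis code for my study")
rec$setDescription("R scripts and data-processing pipeline.")
rec$setUploadType("software")

deposited <- zenodo$depositRecord(rec)  # creates the deposit on Zenodo and mints a DOI

# License the non-package code locally, e.g. with an MIT licence:
usethis::use_mit_license("Your Name")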

Link to package or code repository.
This is based on several blog posts I'll be publishing in:
https://batool-blabber.netlify.app/posts/2021-06-23-make-your-computational-analysis-citable/


ID: 291 / ep-02: 2
Elevator Pitch
Topics: Ecology
Keywords: structural connectivity, landscape ecology, landscape metrics, principal component analysis, wetland forests

Structural connectivity in the Lower Uruguay River Forest

Adriana Rojas1, Mariel Bazzalo2, Natalia Morandeira1,3

13iA-UNSAM; 2CARU; 3CONICET

In recent decades, a process of agricultural expansion took place in the wetlands of the lower Uruguay River, leading to the fragmentation of the landscape. We aimed to estimate the structural connectivity of the hydrophilic forest and the open forest in the basins of the main tributaries of the Lower Uruguay River for the years 1985, 2002 and 2017. Our inputs were land-cover classifications previously generated by the authors from Landsat imagery. For each date, the study area (339,000 km2) was subdivided into 49,800 cells of 1 km2. Connectivity was estimated by calculating 14 landscape metrics in each of the 49,800 cells for each date. The spatial representation of the connectivity indices was processed using the sf, tidyverse and dplyr packages. Subsequently, we performed a PCA to reduce the dimensionality of the connectivity analysis and propose a simpler connectivity index without redundant variables; the stats, FactoMineR and factoextra packages were used for this step. The variables with the highest scores on components 1 and 2 of the PCA (which explain the greatest variability) are represented graphically for one of the cells. Our proposed index is based on four landscape metrics: class area, number of patches, landscape shape, and area-weighted mean patch area. Based on this index, we identified areas with low/high forest connectivity and trends in connectivity changes during the study period.
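A minimal sketch of the dimension-reduction step, assuming a data frame with one row per 1 km2 cell and one column per landscape metric (simulated here); the PCA and its variable plot use FactoMineR and factoextra, as in the abstract.

library(FactoMineR)
library(factoextra)

set.seed(42)
metrics <- data.frame(class_area = runif(100), n_patches = rpois(100, 5),
                      landscape_shape = runif(100), awm_patch_area = runif(100))

res_pca <- PCA(metrics, scale.unit = TRUE, graph = FALSE)
fviz_pca_var(res_pca)               # variable loadings on components 1 and 2
head(get_pca_var(res_pca)$contrib)  # contributions used to pick non-redundant metrics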



ID: 222 / ep-02: 3
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: epidemiology

Fitting the beta distribution for the intra-apiary dynamics study of the infestation rate with Varroa destructor in honey bee colonies

Camila Miotti, Ana Molineri, Adriana Pacini, Emanuel Orellano, Marcelo Signorini, Agostina Giacobino

Instituto de Investigación de la Cadena Láctea (IDICAL-CONICET-INTA)

The aim of this study was to estimate the infestation level of honey bee colonies with V. destructor mites as a function of the autumn-winter parasitic dynamics. A total of six apiaries (with five colonies each), distributed within a 30 km radius and with a minimum distance of 2 km between them, were set up. All colonies were set up with sister queens, and the apiaries were balanced according to adult bee population size. The following experimental conditions were established: a) two apiaries in a circular arrangement, each with one colony infested with Varroa mites (donor colony); b) four apiaries in a linear arrangement, two of them with a donor colony located at the edge of the line and two with a donor colony located in the middle of the line. All colonies except the donor colonies were treated against V. destructor during autumn with amitraz (Amivar 500®) to reduce the infestation level of the receiver colonies (four within each apiary) to 0%. Samples to diagnose phoretic Varroa infestation (PV) were taken 45 days after treatment (mid-May) and monthly from June to September. The PV mite infestation (estimated as N° Varroa / N° bees) was evaluated as a function of the colony arrangement (circular / linear-middle / linear-edge) and the initial PV mite level of the donor colony. A generalized linear mixed model with a Beta distribution and logit link was fitted using the glmmTMB function (glmmTMB package), including the colony as a random effect. After the descriptive analysis, a cubic model was fitted. The colony arrangement effect and the initial PV mite level were statistically significant (P=0.0126 and P=0.0314, respectively). This result suggests that the PV temporal dynamics within each colony differ according to the initial PV of neighbouring colonies, and that the infestation probability is higher for colonies in the linear-middle arrangement.
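A compact sketch of the model described above, a Beta GLMM with logit link and a random intercept per colony fitted with glmmTMB; the data frame and variable names are illustrative (the response must lie strictly between 0 and 1).

library(glmmTMB)

fit <- glmmTMB(
  pv_rate ~ arrangement + initial_pv + (1 | colony),
  family = beta_family(link = "logit"),
  data = varroa_data   # illustrative data set: one row per colony and sampling date
)
summary(fit)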



withdrawn
ID: 274 / ep-02: 4
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: NEET; C50; Classification trees; Imbalanced data; SDGs.

C50 Classification of young Moroccan men and women not in employment, education or training (NEET).

Salima MANSOURI, Hafsa EL HAFYANI, Ichark LAFRAM

High Commission for Planning, Morocco

Within the framework of the 2030 Agenda for Sustainable Development, the proportion of youth not in employment, education or training (NEET) has to be substantially reduced. In this context, and in order to draw a clearer picture for targeting-policy designers, the present study investigates the composition of Moroccan young NEET men and women aged 15 to 29 by building two classification trees, one for NEET men and one for NEET women, using predictors previously shown to be relevant (disability status, marital status, age, level of education, economic activity of the head of household). The study compares different classification trees obtained by implementing several algorithms in R and Python (R: C50; Python: scikit-learn, Orange3) and presents the optimal trees that best split the data. It should also be noted that youth with NEET status form a population characterised by great gender-related heterogeneity with respect to economic activity status: most NEET women are housewives (76.7%) or unemployed (13%), while NEET men are mostly unemployed with no work experience (51.6%), unemployed having worked previously (25.4%) or economically inactive (23%). Consequently, the class-imbalance problem in the target variable first had to be addressed by applying SMOTE, oversampling and SMOTE-ENN methods.

Keywords: NEET; Classification trees; Imbalanced data; R; Python; SMOTE.
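An illustrative sketch (simulated data, not the study's microdata): a C5.0 tree fitted after a naive oversampling step standing in for the SMOTE / SMOTE-ENN balancing used by the authors.

library(C50)

set.seed(1)
neet <- data.frame(
  status    = factor(sample(c("housewife", "unemployed", "inactive"), 500,
                            replace = TRUE, prob = c(.7, .2, .1))),
  age       = sample(15:29, 500, replace = TRUE),
  education = factor(sample(c("none", "primary", "secondary"), 500, replace = TRUE))
)

# Naive class balancing: oversample each class to the size of the largest class
counts   <- table(neet$status)
balanced <- do.call(rbind, lapply(split(neet, neet$status), function(d)
  d[sample(nrow(d), max(counts), replace = TRUE), ]))

tree <- C5.0(status ~ age + education, data = balanced)
summary(tree)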

Link to package or code repository.
https://2019.isiproceedings.org/Files/8.Contributed-Paper-Session(CPS)-Volume-2.pdf (page 59)
https://2019.isiproceedings.org/ Contributed Paper Session (CPS) - Volume 2 (page 59)


ID: 257 / ep-02: 5
Elevator Pitch
Topics: R in production
Keywords: DataOps, Banks, Regulation, Government, Reproducibility

Decision support using R and DataOps at a European Union bank regulator

Jonas Bergstrom, Nicolas Pochet

Single Resolution Board, Belgium

We describe how the Single Resolution Board (SRB) created an environment for efficient and reproducible decision support using R and DataOps principles.

The SRB is the Resolution Authority for the EU Banking Union. Its mission is to manage failures of large EU banking groups while protecting financial stability and minimizing the impact on public finances. The SRB develops quantitative models to simulate interbank contagion and impacts on the financial system. The SRB uses these models as a basis for decisions, both as part of its day-to-day work and in crisis management situations. In a bank crisis, it is important that the SRB is able to respond immediately to changing data. Furthermore, decisions taken by the SRB regarding a failing bank can be subject to legal proceedings, and it is crucial that the SRB can reproduce and justify its decisions regarding the affected bank(s).

The constraints imposed on SRB make the case for code-driven data analysis using R and DataOps principles to ensure reproducibility, correctness and the ability to quickly deploy new models in production. Working side-by-side, SRB IT operations engineers and Data Scientists have created an R-based infrastructure where models, packages and dashboards are automatically built, tested and deployed in reproducible environments. Finally, models in production deliver automated feedback which is used to improve future models. The end result is improved quality and reduced time to production.

In conclusion, we make the case that using R and DevOps, public authorities can deliver better quality decisions more quickly and with lower cost to taxpayers.



ID: 199 / ep-02: 6
Elevator Pitch
Topics: Bayesian models
Keywords: Model-Based Clustering, Finite Mixture Models, Infinite Mixture Models

fipp : a bridge between domain knowledge and model specification in Dirichlet Process Mixtures and Mixture of Finite Mixtures

Jan Greve, Bettina Grün, Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter

WU Vienna University of Economics and Business, Austria

Bayesian methods have established a firm foothold in unsupervised learning, particularly in the area of clustering. The probabilistic and generative nature of the Bayesian paradigm offers a rich inference framework for clustering that has been successfully applied to various areas in science and industry, such as natural language processing, computer vision and volatility modeling, to name a few. The fipp package aims at enhancing the use of the most popular and successful Bayesian methodology in this area: Dirichlet Process Mixtures (DPMs) and their parametric counterpart, Mixtures of Finite Mixtures (MFMs). A major source of uncertainty when implementing these models in practice is how one can incorporate domain-specific knowledge into the prior distributions and hyperparameters. For example, a practitioner may have a rough idea of the number of clusters to expect or of the unevenness of the partition structure, which should be translated appropriately into the prior and hyperparameter specification. Bridging this gap between statistical formulation and domain knowledge is what the functionalities implemented in the fipp package do. Specifically, the package allows users to evaluate the prior distribution of the number of clusters and to compute functionals over the prior partitions in a computationally efficient manner. This enables efficient experimentation with various prior and hyperparameter settings. The suggested use of this package is to combine it with R packages aimed at fitting the aforementioned models to real data, such as PReMiuM and dirichletprocess.



ID: 266 / ep-02: 7
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data science teaching, learnr, shiny, bookdown, learning management system

An integrated teaching environment for R with {learnitdown}

Philippe Grosjean, Guyliann Engels

Numerical ecology department, Complexys and InforTech Institutes, University of Mons, Belgium

Many R resources exist for teaching R and data science, like {bookdown}, {blogdown} or {distill} for textbook material, {learnr} and {gradethis} for tutorials with interactive exercises, {shiny} applications for interactive demonstrations, and R/exams for exam generation and administration. However, as far as we know, there is still no integrated system that manages all these tools, and other common ones like Moodle or H5P, in a coherent teaching platform. The {learnitdown} R package (https://github.com/SciViews/learnitdown) brings all these tools together into a small LMS (learning management system) dedicated to teaching with R and R Markdown.

Student authentication from Moodle or WordPress allows individual activity in the H5P, {learnr} or {shiny} exercises to be tracked in a centralized database. A list of exercises is built automatically for each {bookdown} chapter, and an auto-generated progress report helps students manage their exercises more easily. Data gathered from these exercises can be pseudonymized and analyzed. The {learnitdown} system has been used to teach data science to biology students at the University of Mons, Belgium, since 2018 with great satisfaction; see https://wp.sciviews.org (in French) and https://github.com/BioDataScience-Course.



withdrawn
ID: 120 / ep-02: 8
Elevator Pitch
Topics: Data visualisation
Keywords: applications, case studies

Visualization of flexible one-way ANOVA tests with {doexplot}

Mustafa Cavus

Eskisehir Technical University, Department of Statistics

It is not always easy to interpret the output of statistical tests. This task can be made easier with visualization methods. The {ggbetweenstats} package provides tools for visualizing and reporting the output of ANOVA tests under normality. However, violation of assumptions is a commonly faced problem in ANOVA. The {doex} package provides several one-way ANOVA tests for heteroscedastic and non-normally distributed data. In this study, the {doexplot} package is implemented to visualize the output of the one-way ANOVA tests provided in the {doex} package. In this way, it becomes easier for researchers to interpret and report the results of flexible ANOVA methods when the assumptions are violated.



ID: 129 / ep-02: 9
Elevator Pitch
Topics: Ecology
Keywords: biology

TrackJR: a new R-package using Julia language for tracking tiny insects

Gerardo de la vega1,2, Federico Triñanes2, Andres Gonzalez Ritzel2

1CONICET (IFAB-INTA) ARGENTINA; 2LEQ (UDELAR) URUGUAY

Here we present the trackJR package, a tool for analysing tiny-insect behaviour in bioassays where the most important variable is the position of the insect (for example, an olfactometer bioassay or other orientation experiment). The package works with tiny objects, understood as an individual representing ~1% of the frame, so it could also be used with species other than insects. It was written in Julia and R as a common tool for biologists, with a user-friendly Shiny widget for a broad audience. The package therefore allows biologists to use a script written in the Julia language with only basic knowledge of R. Also, the results can easily be merged with other R objects (i.e., data frames, matrices or lists).



ID: 232 / ep-02: 10
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: javascript

shiny.fluent and shiny.react: Build Beautiful Shiny Apps Using Microsoft's Fluent UI

Marek Rogala

Appsilon

In this talk we will present the functionality and ideas behind a new open source package we have developed called shiny.fluent.

UI plays a huge role in the success of Shiny projects. shiny.fluent enables you to build Shiny apps in a novel way using Microsoft’s Fluent UI as the UI foundation. It gives your app a beautiful, professional look and a rich set of components while retaining the speed of development that Shiny is famous for.

Fluent UI is based on the Javascript library React, so it’s a challenging task to make it work with Shiny. We have put the parts responsible for making this possible in a separate package called shiny.react, which enables you to port other React-based components and UI libraries so that they work in Shiny.

During the talk, we will demonstrate how to use shiny.fluent to build your own Shiny apps, and explain how we solved the main challenges in integrating React and Shiny.
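A minimal hello-world along the lines of the package documentation; the component constructors (fluentPage, Text, PrimaryButton.shinyInput) are taken from the shiny.fluent docs and may evolve with the package.

library(shiny)
library(shiny.fluent)

ui <- fluentPage(
  Text(variant = "xxLarge", "Hello from Fluent UI"),
  PrimaryButton.shinyInput("go", text = "Click me"),
  textOutput("clicks")
)

server <- function(input, output, session) {
  clicks <- reactiveVal(0)
  observeEvent(input$go, clicks(clicks() + 1))
  output$clicks <- renderText(paste("Clicks:", clicks()))
}

shinyApp(ui, server)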

Link to package or code repository.
https://github.com/Appsilon/shiny.fluent


ID: 231 / ep-02: 11
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: API

Conducting Effective User Tests for Shiny Dashboards

Maria Grycuk

Appsilon

User tests are a crucial part of development, yet we frequently skip over them or conduct them too late in the process. Involving users early on allows us to verify if the tool we want to build will be used by them or will be forgotten in the next few months. Another risk that increases significantly when we don’t show the product to end users before going live is that we will build something unintuitive and difficult to use. When you are working with a product for a few months and you know every button and feature by heart, it is hard to take a step back and think about usability. In this talk, I would like to share a few tips on how to perform an excellent user interview, based on my experience working with Fortune 500 clients on Shiny dashboards. I will show why conducting effective user tests is so critical, and explain how to ask the right questions to gain the most from the interview.



ID: 124 / ep-02: 12
Elevator Pitch
Topics: R in production
Keywords: business, industry

NNcompare: An R package supporting the peer programming process in clinical studies

Mette Bendtsen, Steffen Falgreen Larsen, Frederik Vandvig Heinen, Claus Dethlefsen

Novo Nordisk A/S, Alfred Nobels Vej 27, DK-9220 Aalborg Øst, Denmark

Analysing and reporting data from clinical studies require a high level of quality in the entire process from data collection to the final clinical study report (CSR). In Novo Nordisk, part of the quality assurance is ‘peer programming’ of important data derivations, complex combinations, and statistical analyses included in data sets and TFLs (tables, figures, and listings) for the CSR. In this context, peer programming involves two persons solving a specific programming task: the programmer and the reviewer. The programmer creates a program that solves the task, and the reviewer creates a ‘peer program’ that reviews/validates the programmer’s work. To avoid being influenced by the programmer’s code, the reviewer should not read it until after preparing the peer program. NNcompare is an R package that supports this peer programming process in Novo Nordisk. The package builds on the comparedf() function from the ‘arsenal’ package, which essentially provides functionality for comparing two data frames and reporting the results of the comparison. To support the peer programming process in Novo Nordisk, the NNcompare package provides additional functionality for exporting comparison reports to various formats using R Markdown, and for creating summary reports across multiple peer programs to provide an overview of the status of all peer programs for a given trial. Furthermore, the package includes functionality for comparing PNG files using pixel-wise comparisons and marking differences in the plot. Future development will include comparisons of other file types and comparisons of multiple data frames with one function call.
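The core comparison step that NNcompare builds on can be illustrated with arsenal alone; the two data frames below stand in for the programmer's and the reviewer's outputs.

library(arsenal)

programmer <- data.frame(id = 1:3, value = c(1.0, 2.0, 3.0))
reviewer   <- data.frame(id = 1:3, value = c(1.0, 2.1, 3.0))

cmp <- comparedf(programmer, reviewer, by = "id")
summary(cmp)  # reports the rows and columns that differ; NNcompare wraps this in R Markdown reports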



ID: 253 / ep-02: 13
Elevator Pitch
Topics: Statistical models

The evolution of the dependencies of CRAN packages

Clement Lee

Lancaster University, United Kingdom

The number of CRAN packages has been growing steadily over the years. In this talk, we examine two aspects of package dependencies. First, we look at a snapshot of the dependency network and apply statistical network models to study its properties, including the degree distribution and the different clusters of packages. Second, we study the evolution of the network over the last year and how the number of reverse dependencies grows for a typical package. This allows us to examine the extent to which the preferential attachment model (or the rich-get-richer effect) is valid.
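A generic illustration of the kind of snapshot analysis described (not the crandep package itself): build the CRAN Imports graph from available.packages() and look at the reverse-dependency degree distribution with igraph.

library(igraph)

db    <- available.packages()
deps  <- tools::package_dependencies(rownames(db), db = db, which = "Imports")
edges <- do.call(rbind, lapply(names(deps), function(p)
  if (length(deps[[p]])) data.frame(from = p, to = deps[[p]])))

g <- graph_from_data_frame(edges, directed = TRUE)
summary(degree(g, mode = "in"))  # in-degree = number of reverse dependencies per package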

Link to package or code repository.
https://cran.r-project.org/package=crandep


ID: 249 / ep-02: 14
Elevator Pitch
Topics: Algorithms
Keywords: Resampling, Linear mixed-effect models, Bootstrap, Nested data

Bootstrapping Multilevel Models in R using lmeresampler

Adam Loy

Carleton College, United States of America

Linear mixed-effects (LME) models are commonly used to analyze clustered data, such as split-plot experiments, longitudinal studies, and stratified samples. In R, there are two primary packages for fitting LME models: nlme and lme4. In this talk, we present an extension of the nlme and lme4 packages that adds methods for bootstrapping model fits. The lmeresampler package implements several bootstrap methods for LME models with nested dependence structures using a unified framework: the cases bootstrap resamples entire clusters or observations within clusters (or both); the parametric bootstrap simulates data from the model fit; the residual bootstrap resamples both the predicted random effects and the predicted error terms; and the random effect block bootstrap utilizes the marginal residuals to calculate nonparametric predicted random effects as part of the resampling process. We will discuss and demonstrate the implementation of these bootstrap procedures and outline plans for future development.

lmeresampler is available on CRAN.
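A short sketch of the interface described in the package vignette: bootstrap() takes a fitted lme4 (or nlme) model, a summary function, a bootstrap type, and the number of resamples; argument names follow the documentation at the time of writing.

library(lme4)
library(lmeresampler)

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
boo <- bootstrap(fit, .f = fixef, type = "parametric", B = 500)
boo  # bootstrap estimates, standard errors and bias for the fixed effects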



ID: 299 / ep-02: 15
Elevator Pitch
Topics: Other
Keywords: IDE

RCode, a new IDE for R

Nicolas Baradel, William Jouot

PGM Solutions, France

RCode is a new and modern IDE for R. It includes the usual features such as code highlighting, an environment pane for R variables, execution history, etc. It also provides extra features such as an Excel-like data grid in which data.frames are directly editable. RCode is multiplatform and available in several languages.



ID: 295 / ep-02: 16
Elevator Pitch
Topics: R in production
Keywords: pharma, validation, verification, qualification

R Package Validation and {valtools}

Ellis Hughes

Fred Hutch Cancer Research Center, United States of America

The R Package Validation Framework offers a clear, easy to follow guide to automate the creation of validated R packages for use in regulated industries. By combining many of the package development tools and philosophies already in existence in the R ecosystem, the framework minimizes overhead while improving the quality of both the package and validation.

{valtools} is the implementation of this framework as an R package. Much like {usethis}, {valtools} automates the creation of the validation infrastructure and eventual validation report so users can focus on what matters: writing the R package.

By the end of this talk, listeners will know the basics to implement the R Package Validation Framework using the {valtools} package.



ID: 151 / ep-02: 17
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: algorithms

NetCoupler: Inferring causal pathways between high-dimensional metabolomics data and external factors

Luke W. Johnston1, Clemens Wittenbecher2, Fabian Eichelmann3

1Steno Diabetes Center Aarhus; 2Harvard T.H. Chan School of Public Health; 3Department of Molecular Epidemiology, German Institute of Human Nutrition and German Center for Diabetes Research

High-dimensional metabolomics data are highly intercorrelated, implying that associations with lifestyle and other exposures or with disease outcomes generally propagate across sets of co-varying metabolites. When inferring biological pathways from metabolomics studies, it is often crucial to detect direct exposure-metabolite or metabolite-outcome relationships instead of associations that can be explained by correlations with other metabolites. To tackle this challenge, we have developed the NetCoupler-algorithm R package (found at github.com/NetCoupler). NetCoupler builds on evidence showing that data-driven networks recover biological dependencies from metabolomics data and that, based on causal inference theory, adjustment for at least one subset of direct neighbors is sufficient to block all confounding influences within a conditional dependency network. NetCoupler estimates a conditional dependency network from metabolomics data and then uses a multi-model approach to adjust for all possible subsets of direct neighbors in the network in order to identify exposure-affected metabolites or metabolites that have direct effects on disease outcomes. We demonstrate using simulated data that NetCoupler correctly identifies direct exposure-metabolite and metabolite-outcome effects and provide an example of its application in a prospective cohort study to integrate the information on food consumption habits, metabolomics profiles, and type 2 diabetes incidence. While NetCoupler was developed from a need to process and analyze the data from metabolomics studies, NetCoupler can also be applied to detect direct links between other external variables and network types.

Link to package or code repository.
https://github.com/NetCoupler/NetCoupler


ID: 236 / ep-02: 18
Elevator Pitch
Topics: R in production
Keywords: CI/CD

Continuously expanding Techguides: An open source project based on bookdown using CI/CD pipelines from GitHub Actions

Peter Schmid

Mirai Solutions

A data scientist's work is often about solving unfamiliar problems. Online resources are a blessing in this regard, with the community providing answers to virtually any problem. However, it can be difficult to find the working solution in an ocean of more or less useful suggestions. Therefore, we at Mirai Solutions have started to gather solutions to some of these issues in an open source project: techguides. This initiative is meant to give back to the community a bit of our know-how. It resulted in a public repository that elegantly puts together several R Markdown files and renders them as a bookdown website served on GitHub Pages.

In this talk, I would like to show how we are continuously expanding our techguides in a flexible way, based on an automated continuous integration and deployment workflow using GitHub Actions. As GitHub Actions is fairly new and not yet trivial to set up, we hope that our explanations can help and inspire others to consider using CI/CD.
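The R side of such a pipeline is small: the CI job essentially runs the bookdown build before deploying the rendered site to GitHub Pages. The actual techguides workflow lives in the repository linked below; this call is only illustrative.

# Build step executed inside the GitHub Actions job, e.g. via Rscript
bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")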

Link to package or code repository.
https://github.com/miraisolutions/techguides


ID: 261 / ep-02: 19
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: k-mer, prediction, protein, functional analysis

R as an environment for the functional analysis of proteins

Michał Burdukiewicz

Medical University of Białystok, Poland

The functional analysis of proteins, i.e., the development of models associating a protein's sequence with its function, has always been one of the cornerstones of bioinformatics. Like every other application of machine learning, it is prone to issues such as reproducibility and benchmarking. Moreover, as the potential users are mostly biologists, these models should be accessible without any coding. Unfortunately, the resources necessary to build and share such models in accordance with CRAN/Bioconductor guidelines and the requirements of reproducible science are still scattered.

During my presentation, I sum up my experience of developing several tools for functional analysis of proteins (AmyloGram, SignalHsmm, AmpGram, and CancerGram). I show the advantages of the R ecosystem during the development of the model (tidysq, mlr3) and deployment (R packages, Shiny web servers, and Electron-based standalone apps). As sharing very large (>10 MB) predictive models on CRAN is not intuitive, I show how to do it in a way that satisfies submission requirements.



ID: 139 / ep-02: 20
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: Bioimaging, R workflow, high dimensional data

Statistical Workflows in R for Imaging Mass Spectrometry Data

Hoang Tran, Valeriia Sherina, Fang Xie

GlaxoSmithKline, United States of America

Matrix-assisted laser desorption/ionization (MALDI) imaging mass spectrometry (IMS) is a technique that can reveal powerful insights into the correlation between molecular distributions and histological features. Due to their high-dimensional, hierarchical and spatial nature, MALDI IMS datasets present numerous statistical challenges. In collaboration with the bioimaging team at GlaxoSmithKline (GSK), we have developed special purpose statistical workflows in R that provide end-to-end support for the entire MALDI IMS analysis pipeline, from study design and assay quantification to functional pharmacology. These applications leverage numerous R packages, with a particular focus on the “tidyverse” and “tidymodels” ecosystems due to their modularity and interconnectedness (to protect GSK’s intellectual property, we are currently unable to share our code). Our workflows include robust smoothing and estimation of calibration curves; non-trivial animal and tissue sample size calculations via in silico experiments; and AI/ML implementations for prediction of drug effects from the high-dimensional molecular space. These solutions addressed unique biological and quantitative challenges, and yielded actionable insights for GSK’s bioimaging team.



ID: 258 / ep-02: 21
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: Shiny, NLP, Human-computer interaction, Chatbot, AI&Society

Hi, Let’s Talk About Data Science! - Customize Your Personal Data Science Assistant Bot.

Livia Eichenberger, Oliver Guggenbühl

STATWORX GmbH, Switzerland

In June 2020, OpenAI released their newest NLP model, GPT-3, and thus set a new standard for language understanding and generation. GPT-3 is an autoregressive language model, enabling the generation of human-like text. Sample use cases are chatbots, Q&A systems and text summarization. Due to the complexity of GPT-3, it is difficult for non-technical specialists to experience both the strengths but also the shortcomings of this technology. A fundamental challenge faced today is educating society about the potentials and risks of AI and not leaving anyone behind.

To approach this task, R’s Shiny framework can be leveraged to lower the barrier of entry for interaction with AI models. Specifically, GPT-3 can be instructed to incorporate different types of chatbots by supplying it with a precise description of how it should behave during a conversation. We provide an interface to chat with a Data Science bot, where various parameters of the bot’s behaviour can be selected on the fly. Examples are the preferred language and the user’s knowledge level. A mockup of our interface is attached.

Shiny is the preferred framework for this application because it comes packaged with all the necessary tools for interacting with a customizable chatbot based on GPT-3. With Shiny’s input widgets the user can manipulate various parameters to influence the pre-defined chatbot’s personality. The chatbot will immediately adjust its behaviour and fine-tune its personality, allowing the user to experience the effect of their input on GPT-3 in real time. All this is done in a clearly laid out interface where users need no prior experience with R coding or creating Shiny apps.

We present how we use Shiny to lower the barrier to interact with AI models with little overhead and thus to tackle one of today’s most important problems: AI education of the broader population.
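A stripped-down sketch of such an interface: Shiny widgets set the bot's language and the user's knowledge level, and ask_gpt3() is a hypothetical placeholder for the actual OpenAI API call, which is not shown here.

library(shiny)

# Hypothetical helper standing in for the GPT-3 request
ask_gpt3 <- function(prompt, language, level) {
  paste0("[", language, ", ", level, "] bot reply to: ", prompt)
}

ui <- fluidPage(
  selectInput("language", "Preferred language", c("English", "German")),
  selectInput("level", "Knowledge level", c("Beginner", "Advanced")),
  textInput("msg", "Your message"),
  actionButton("send", "Send"),
  verbatimTextOutput("reply")
)

server <- function(input, output, session) {
  reply <- eventReactive(input$send, ask_gpt3(input$msg, input$language, input$level))
  output$reply <- renderText(reply())
}

shinyApp(ui, server)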

Link to package or code repository.
http://files.statworx.com/datascience-assistant.jpg


ID: 133 / ep-02: 22
Elevator Pitch
Topics: R in production
Keywords: business, industry

NNSampleSize: A tool for communicating, determining and documenting sample size in clinical trials

Claus Dethlefsen1, Steffen Falgreen Larsen1, Anders Ellern Bilgrau2, Nynne Holdt-Caspersen1, Maika Lindkvist Jensen1

1Novo Nordisk A/S; 2Seluxit

Determination of sample size in clinical studies is an iterative process involving many stakeholders and leading to many decisions. When data from other studies become available, assumptions may be revised or other scenarios for the study design may be considered. Assumptions also feed into a decision-guiding framework aimed at determining whether the sample size is adequate to make a decision about the future development of the product. At Novo Nordisk, we have developed an R Shiny application that assists us in this process. In the application, several sample size scenarios can be explored for a given study. The application has a documentation module for keeping track of decisions using R Markdown, as well as facilities for programming and reviewing the final determination of the sample size. When finalized, the idea is to download Word files ready for archiving in a documentation system.



ID: 250 / ep-02: 23
Elevator Pitch
Topics: Multivariate analysis
Keywords: True Discovery Proportion, Permutation Test, Multiple Testing, Selective Inference, fMRI Cluster Analysis

pARI package: valid double-dipping via permutation-based All Resolutions Inference

Angela Andreella1, Jelle Goeman2, Livio Finos3, Wouter Weeda4, Jesse Hemerik5

1Department of Statistical Sciences, University of Padova; 2Biomedical Data Sciences, Leiden University Medical Center; 3Department of Developmental Psychology and Socialization, University of Padua; 4Methodology and Statistics Unit, Department of Psychology, Leiden University; 5Biometris, Wageningen University and Research

Cluster extent-based thresholding in functional Magnetic Resonance Imaging is popular for finding neural activation associated with some stimulus. However, it suffers from the spatial specificity paradox: we only know that a specific cluster of voxels is significant under the null hypothesis of no activation. We cannot find out the number of truly active voxels inside that cluster without falling into the double-dipping problem. To address this, Rosenblatt et al. (2018) developed All-Resolutions Inference (ARI), which provides a lower bound on the number of truly active voxels in each cluster. However, ARI can lose power if the data are strongly correlated, as fMRI data are. We therefore re-phrase it using permutation theory, developing the package pARI. The main function, pARIbrain, takes as input a list of contrast maps, one for each subject, produced by neuroimaging tools. The user can then supply a cluster map, and pARIbrain returns the lower bounds of true discoveries for each cluster in the supplied cluster map. The package was developed for the fMRI scenario; however, we also provide the function pARI. It takes the permutation null distribution of p-values and the indices of the hypotheses of interest as inputs and returns the lower bound for the number of true discoveries inside the specified set of hypotheses. The user can compute the permutation null distribution for two-sample and one-sample t-tests with the permTest and signTest functions. The set of hypotheses can be specified as often as the user wants, and pARI still controls the FWER.



ID: 233 / ep-02: 24
Elevator Pitch
Topics: Web Applications (Shiny/Dash)
Keywords: biostatistics

Data Access and dynamic Visualization for Clinical Insights (DaVinci)

Matthias Trampisch, Julia Igel, Andre Haugg

Boehringer Ingelheim

This talk introduces the Boehringer Ingelheim initiative on Data Access and dynamic Visualization for Clinical Insights (DaVinci). It is named after Leonardo da Vinci, one of the most diversely talented individuals ever to have lived. The main objective of the DaVinci project is to reflect this diversity by creating a modular framework based on shiny, which enables end-users to access clinical data directly via advanced visualization during clinical development.

DaVinci consists of a collection of shiny-based modules to review, aggregate and visualize data to develop and deliver safe and effective treatments for patients. Based on harmonized data concepts (SDTM/ADaM), DaVinci provides and maintains GCP compliant modules for data review and analysis, which can easily be combined and customized into trial-specific dashboards by the end-user.

The talk outlines the approach we developed, including the module manager and highly flexible, custom-designed modules, which together lead to an individual and customizable app experience. The main advantages of this approach are that the individual modules can be validated separately and used flexibly in a joint shiny application, which permits easy validation considering GDPR, GxP and 21 CFR Part 11. The approach also supports trial-, project- or substance-specific needs to get the most value out of the data.

Deployment of these apps is done via a CI/CD pipeline using the Atlassian Stack and Jenkins, resulting in dockerized shiny server instances, which can easily scale up to the application needs.



ID: 179 / ep-02: 25
Elevator Pitch
Topics: Environmental sciences
Keywords: Environmental research; Big data; Reproducibility; Data visualisation

Reproducibility and dissemination in the research: a case of study of the bioaerosol dynamics

Jesús Rojo1, Antonio Picornell2, Jeroen Buters3, Jose Oteros4

1Department of Pharmacology, Pharmacognosy and Botany, Complutense University. Madrid (Spain); 2Department of Botany and Plant Physiology. University of Malaga. Malaga (Spain); 3Center of Allergy & Environment (ZAUM), Technische Universität München/Helmholtz Center Munich. Munich (Germany); 4Department of Botany, Ecology and Plant Physiology. University of Cordoba. Cordoba (Spain)

Environmental databases are constantly growing, which requires computational tools to manage them efficiently. This experience is an example of the procedure followed to manage the aerobiological databases used in the publication led by Rojo et al. [Environ Res, 174:160-169; doi:10.1016/j.envres.2019.04.027] on the effect of height on pollen exposure. While the analysis of pollen time series at a local scale may provide unclear or even contradictory findings across study areas, a global study provides robust results, avoiding biases and the effect of local factors masking the true patterns of bioaerosol dynamics. We analysed about 2,000,000 daily pollen concentrations from 59 monitoring stations in Europe, North America and Australia, using R and 'AeRobiology', a specific package in this field [Rojo et al., Methods Ecol Evol, 10:1371-1376; doi:10.1111/2041-210X.13203]. Due to the huge number of data contributors involved, the first step was exhaustive filtering and quality control of the data to make the datasets standard and comparable between sites. This quality control required basic rules for removing uncertain or missing data, but also scientific criteria based on the optimisation of parameters such as distance or degree of similarity between sites. The pollen rate between paired stations was used to study the effect of height on pollen concentrations, which constituted the second step (data analysis) and yielded the main scientific findings. One of the key benefits of computational tools is the automation of processes. In this case, the processing and analysis systems made it possible to dynamically incorporate pollen data from new stations, obtaining an automatic update of the statistical analysis. Finally, since reproducibility and dissemination are both very important principles of scientific research, we designed a Shiny application where users may interpret the results and generate the graphs by selecting specific scientific criteria themselves. Link to the Shiny application: https://modeling-jesus-rojo.shinyapps.io/result_app2/



ID: 210 / ep-02: 26
Elevator Pitch
Topics: Environmental sciences
Keywords: environmental sciences

R in the aiR!

Adithi R. Upadhya1, Pratyush Agrawal1, Sreekanth Vakacherla2, Meenakshi Kushwaha1

1ILK Labs, Bengaluru, India; 2Center for Study of Science, Technology and Policy, Bengaluru, India

R is a powerful tool for analysing air-quality data. With the ever-increasing global measurements of air pollutants (through stationary, mobile, low-cost, and satellite monitoring), the amount of data being collected is huge and necessitates the use of management platforms. In an effort to address this issue, we developed two Shiny applications to analyse and visualise air-pollution data.

‘mmaqshiny’, now on CRAN, is aimed at handling, calibrating, integrating, and visualising spatially and temporally acquired air-pollution data from mobile monitoring campaigns. Currently, the application caters to data collected using specific instruments. With just the click of a button, even non-programmers can generate summary statistics, time series, and spatial maps. The application is capable of handling high-resolution data from multiple instruments and formats. Moreover, it also allows users to visualize data at near-real time and helps in keeping a tab on data quality and instrument health.

Our second Shiny application (currently in the development phase) is specific to India and allows users to handle open-source air-quality datasets available from OpenAQ (https://openaq.org/#/countries/IN?_k=5ecycz), CPCB (https://app.cpcbccr.com/ccr/#/caaqm-dashboard-all/caaqm-landing), and AirNow (https://www.airnow.gov/international/us-embassies-and-consulates/#India). Users can visualize data, perform basic statistical operations, and generate a variety of publication-ready plots. It also provides outlier detection and replacement of fill/negative values. We have also integrated the popular openair package in this application.
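For a flavour of the kind of analysis the second application wraps, the openair package (which it integrates) ships an example data set and standard air-quality plots; the calls below follow the openair documentation.

library(openair)

data(mydata)                                    # example hourly air-quality data shipped with openair
timePlot(mydata, pollutant = c("nox", "pm10"))  # time series of two pollutants
summaryPlot(mydata)                             # data coverage and distributions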



ID: 108 / ep-02: 27
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics

segmenter: A Wrapper for JAVA ChromHMM

Mahmoud Ahmed, Deok Ryong Kim

Gyeongsang National University

Chromatin segmentation analysis transforms ChIP-seq data into signals over the genome. The latter represent the observed states in a multivariate Markov model used to predict the chromatin's underlying (hidden) states. ChromHMM, written in Java, integrates histone modification datasets to learn the chromatin states de novo. We developed an R package around this program to leverage the existing R/Bioconductor tools and data structures in the context of segmentation analysis. segmenter wraps the Java modules to call ChromHMM and captures the output in an S4 R object. This allows iterating with different parameters, which are given in R syntax. Capturing the output in R makes it easier to work with the results and to integrate them into downstream analyses. Finally, segmenter provides additional tools to test, select and visualize the models. In summary, we developed an R package that wraps a popular chromatin segmentation tool and captures the output in R for testing and visualization.

Link to package or code repository.
https://github.com/MahShaaban/segmenter


ID: 122 / ep-02: 28
Elevator Pitch
Topics: Efficient programming
Keywords: recursion, list, nested, efficient programming, C

Efficient list recursion in R with rrapply

Joris Chau

Open Analytics

The little-used R function rapply() applies a function to all elements of a list recursively and provides control over structuring the result. Although occasionally useful due to its simplicity, the rapply() function is not sufficiently flexible to solve many common list recursion tasks. In such cases, the solution is to write custom list recursion code, which can quickly become hard to follow or reason about, making it time-consuming and error-prone to update or modify the code. The rrapply() function in the rrapply package is an attempt to enhance and extend base rapply() to make it more generally applicable for efficient list recursion in R. For instance: i) rapply() only allows a function f to be applied to list elements of certain classes; rrapply() generalizes this concept through a general condition function; ii) rrapply() allows additional flexibility in structuring the result by e.g. pruning or unnesting list elements; iii) with rapply() there is no convenient way to access the name or location of the list element under evaluation; rrapply() provides a number of special arguments to overcome this limitation. The rrapply() function aims at efficiency by building on rapply()’s native C implementation and does not require any external R package dependencies. The rrapply package is available on CRAN and several vignettes illustrating its use can be found online.
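A small sketch of the enhancements described above applied to a toy nested list: a general condition function, pruning of non-matching elements, and the special .xname argument giving access to the current element's name.

library(rrapply)

x <- list(a = list(score = 10, note = "keep"),
          b = list(score = NA, note = "drop"))

rrapply(
  x,
  condition = function(v) is.numeric(v) && !is.na(v),   # keep numeric, non-missing leaves
  f = function(v, .xname) paste0(.xname, " = ", v),     # .xname: name of the current element
  how = "prune"                                         # drop everything else from the result
)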



ID: 263 / ep-02: 29
Elevator Pitch
Topics: Mathematical models
Keywords: flow chart, flow diagram, model diagram, ggplot2, visualization

An R package to flexibly generate simulation model flow diagrams

Andreas Handel1, Andrew Tredennick2

1University of Georgia; 2Western EcoSystems Technology, Inc.

We recently developed an R package that allows users to quickly generate ggplot2 based flow diagrams of compartmental simulation models that are commonly used in infectious disease modeling and many other areas of science and engineering. The package allows users to create publication quality diagrams in a user-friendly manner. Full access to the ggplot2 code that generates the diagram means advanced users can further customize the final diagram as needed. In this talk, we will provide a brief overview and introduction to the package.
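A hedged sketch based on the repository documentation (the function names prepare_diagram()/make_diagram() and the variables-plus-flows list structure are assumptions taken from the README and may change): an SIR model turned into a ggplot2 diagram.

library(flowdiagramr)

sir_model <- list(
  variables = c("S", "I", "R"),
  flows = list(S_flows = c("-b*S*I"),
               I_flows = c("b*S*I", "-g*I"),
               R_flows = c("g*I"))
)

diagram_list <- prepare_diagram(sir_model)
make_diagram(diagram_list)  # returns a ggplot object that can be customised further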

Link to package or code repository.
https://github.com/andreashandel/flowdiagramr


ID: 272 / ep-02: 30
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: Markdown, automation, trend epidemiology, daily report, metrics

Using R Markdown to Automate COVID-19 Reporting

Farzad Islam, Michael Elten, Najmus Saqib

Public Health Agency of Canada, Canada

The COVID-19 pandemic has impacted the operational needs of the Public Health Agency of Canada (PHAC), and consequently the day-to-day responsibilities of its employees. The emergency surveillance needs arising from the pandemic require around-the-clock monitoring 7 days a week. To accompany the surveillance, daily reporting was developed to keep the Office of the Chief Public Health Officer (OCPHO) informed about nationwide trends that would ultimately help inform public policy decisions and craft communication strategies. Because these needs arose abruptly, the solutions initially devised were labour-intensive and inefficient. Epidemiologists were using the same datasets across different teams, writing scripts in various languages and maintaining them in silos.

The Center for Data Management, Innovation and Analytics at PHAC was responsible for taking over these functions and improving them so that a) they became standardized, b) they reduced the need for manual labour, and c) they eliminated the risk of human error. As a result, R was used to automate the reporting functions, which were moved to the back end, and the outputs of the scripts were generated in the form of PowerPoint decks. This included the use of various plots (ggplot2), charts (flextable, officer), and cross-functionality with Python (reticulate). The data ingestion systems were also improved by using Google Sheets, reading public data directly from websites, and applying web-scraping techniques to pull data reported daily.

As a result of these efforts, daily reporting needs which could take hours to accomplish were reduced to the click of a button and five minutes of processing.
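The one-call automation this describes boils down to rendering a parameterised R Markdown report straight to a PowerPoint deck; the file name and parameters below are illustrative and would need to be declared in the report's YAML header.

rmarkdown::render(
  "covid_daily_report.Rmd",
  output_format = "powerpoint_presentation",
  output_file   = paste0("daily_report_", Sys.Date(), ".pptx"),
  params        = list(report_date = Sys.Date())
)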



ID: 186 / ep-02: 31
Elevator Pitch
Topics: Statistical models
Keywords: statistics, Cumulative Link Mixed-effects Models, Ordinal response variable

Cumulative Link Mixed-effects Models (CLMMs) as a tool to model ordinal response variables and incorporate random effects

Christophe Bousquet

Lyon Neuroscience Research Center, France

Ordinal response variables are frequent in various scientific domains, including ecology, ethology and psychology. However, researchers often analyse these data with methods suitable for non-ordinal response variables. The R package ‘ordinal’ has been developed specifically to model ordinal response variables and also offers the possibility to incorporate random effects. In this elevator pitch, I will present how to approach this kind of analysis, from the integration of random effects to the production of visualisations to communicate the results. The dataset is based on experiments in behavioural biology, specifically on leadership in mallards. The code to access the data and analysis is available on GitHub and may allow other researchers to learn analysis techniques for ordinal data.
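A compact example of the approach using the wine data shipped with the ordinal package: an ordered rating modelled with fixed effects for temperature and contact and a random effect for judge.

library(ordinal)

fit <- clmm(rating ~ temp + contact + (1 | judge), data = wine)
summary(fit)  # cumulative logit coefficients plus the random-effect variance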



ID: 242 / ep-02: 32
Elevator Pitch
Topics: Data visualisation
Keywords: ggplot2

High dimensional data visualization in ggplot2

Zehao Xu, Wayne Oldford

University of Waterloo

The 'ggmulti' package extends 'ggplot2' with high-dimensional visualization functionality such as serialaxes coordinates (e.g., parallel coordinates) and multivariate scatterplot glyphs (e.g., encoding many variables in radial axes or a star glyph).

Much more general glyphs (e.g., polygons, images) are also now possible as point symbols in a scatterplot and can provide more evocative pictures for each point (e.g. an airplane for flight data or a team’s logo for sports data).

As its name suggests, serial axes coordinates arrange variable axes in series (radially for stars, in parallel for parallel coordinates) and can be used as a plot in its own right or as a glyph. These are extended to a continuous curve representation (e.g., Andrews curves) through function transformations (e.g., Fourier series). The parallel coordinates work in the ggplot pipeline, allowing histograms, densities, etc. to be overlaid on the axes.

In this talk, an overview of ggmulti will be given, largely by example.

Link to package or code repository.
https://github.com/great-northern-diver/ggmulti/


ID: 211 / ep-02: 33
Elevator Pitch
Topics: Data visualisation
Keywords: API

Charting Covid with the DatawRappr-Package

Benedict Witzenberger

Süddeutsche Zeitung / TUM

Covid-19 swept across the world like a huge, sudden wave. Data journalists all around the globe had a brand new beat to cover from one moment to the next. A lot of newsrooms used the available data to start automated and regularly updated visualizations or dashboards. One tool that is often used for creating charts, maps or dashboard-like tables in journalism (and in corporate settings) is Datawrapper.

I created an R API package to combine the power of R code for analysing data with the various options Datawrapper offers for creating interactive and responsive visualizations.

I would like to show some examples and best practices for useful automated visualizations in Datawrapper - created in R.
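A hedged sketch of the round trip (function names follow the DatawRappr README and may change): authenticate once, create a chart, push a data frame from R, and publish; covid_cases is an illustrative data frame.

library(DatawRappr)

datawrapper_auth(api_key = Sys.getenv("DW_API_KEY"))

chart <- dw_create_chart(title = "Daily COVID-19 cases", type = "d3-lines")
dw_data_to_chart(covid_cases, chart_id = chart)  # covid_cases: a data frame prepared in R
dw_publish_chart(chart)                          # updates the live, embeddable chart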

Link to package or code repository.
https://github.com/munichrocker/DatawRappr


ID: 154 / ep-02: 34
Elevator Pitch
Topics: Efficient programming
Keywords: C++, AutoDiff, packages

Bringing AutoDiff to R packages

Michael Komodromos

Imperial College London

We demonstrate the use of a C++ automatic differentiation (AD) library and show how it can be used with R to solve problems in optimization, MCMC and beyond. In particular, we show how gradients produced with AD can be used with R's built-in optimization routines. We hope such integrations will enable package developers to produce robust, efficient code by removing the need to hand-write functions that compute gradients.
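A generic illustration of the integration point, not the AD library's own interface: optim() accepts a user-supplied gradient, which is exactly where an AD-generated gradient would be plugged in (here the Rosenbrock gradient is written by hand as a stand-in).

rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
rosenbrock_grad <- function(p) c(
  -2 * (1 - p[1]) - 400 * p[1] * (p[2] - p[1]^2),
  200 * (p[2] - p[1]^2)
)

# gr = would be replaced by the automatically differentiated gradient
optim(c(-1.2, 1), fn = rosenbrock, gr = rosenbrock_grad, method = "BFGS")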

Link to package or code repository.
https://github.com/mkomod/rad


ID: 218 / ep-02: 35
Elevator Pitch
Topics: Community and Outreach
Keywords: interface, community, education, workflow

Healthier & Happier Hands: Software and Hardware Solutions for More Ergonomic Typing

John Paul Helveston

George Washington University

Most R users spend multiple hours every day typing on a keyboard, which can lead to serious injuries such as Repetitive Strain Injury (RSI) and Carpal Tunnel Syndrome. This talk discusses a variety of software and hardware tools to improve the ergonomics of typing. I will discuss a wide range of solutions, from implementing software tools for remapping keys to using a split mechanical keyboard for improved hand and arm positioning. Each solution involves a trade-off between the time and effort required to learn and implement it and the benefits in terms of health and typing improvements, like speed and accuracy. I will also showcase some specific applications of how these solutions can improve the experience of working with R. No one solution will work for everyone, but my goal is that by introducing a broad overview of solutions, many will leave inspired to try (and eventually adopt) some and end up with healthier and happier hands.



ID: 144 / ep-02: 36
Elevator Pitch
Topics: Algorithms
Keywords: high-dimensional data

High Dimensional Penalized Generalized Linear Mixed Models: The glmmPen R Package

Hillary M. Heiling1, Naim U. Rashid1,2, Quefeng Li1, Joseph G. Ibrahim1

1University of North Carolina at Chapel Hill; 2UNC Lineberger Comprehensive Cancer Center

Generalized linear mixed models (GLMMs) are popular for their flexibility and their ability to estimate population-level effects while accounting for between-unit heterogeneity. While GLMMs are very versatile, the specification of fixed and random effects is a critical part of the modeling process. Historically, variable selection in GLMMs has been restricted to a search over a limited set of candidate models or has required selection criteria that are computationally difficult to compute for GLMMs, limiting variable selection in GLMMs to lower dimensional models. To address this, we developed the R package glmmPen, which simultaneously selects fixed and random effects from high dimensional penalized generalized linear mixed models (pGLMMs). Model parameters are estimated using a Monte Carlo Expectation Conditional Maximization (MCECM) algorithm, which leverages Stan and RcppArmadillo to increase computational efficiency. Our package supports the penalty functions MCP, SCAD, and Lasso, and the distributional families Binomial, Gaussian, and Poisson. Tools available in the package include automated tuning parameter selection and automated initialization of the random effect variance. Optimal tuning parameters are selected using BIC-ICQ or other BIC selection criteria; the marginal log-likelihoods used for the BIC criteria calculation are estimated using a corrected arithmetic mean estimator. The package can also be used to fit traditional generalized linear mixed models without penalization, and provides a user interface that is similar to the popular lme4 R package.

Link to package or code repository.
https://github.com/hheiling/glmmPen
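A hedged sketch of what a fit might look like, given the abstract's statement that the interface resembles lme4; the function name `glmmPen()` and its arguments shown here are assumptions to check against the package documentation.

```r
# Sketch only: argument names are assumptions patterned on the lme4-like
# interface described in the abstract; `mydata` is a hypothetical data set.
library(glmmPen)

# Simultaneously select fixed and random effects for a binary outcome
fit <- glmmPen(y ~ x1 + x2 + x3 + (x1 + x2 + x3 | grp),
               data    = mydata,
               family  = "binomial",
               penalty = "MCP")   # MCP, SCAD, or lasso per the abstract

summary(fit)
```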


ID: 140 / ep-02: 37
Elevator Pitch
Topics: Reproducibility
Keywords: R markdown

trackdown: collaborative writing and editing your R Markdown and Sweave documents in Google Drive

Filippo Gambarota1, Claudio Zandonella Callegher1, Janosch Linkersdörfer2, Mathew Ling3, Emily Kothe3

1University of Padova; 2University of California, San Diego; 3Misinformation Lab, Deakin University

"The advantages of using literate programming that combines plain-text and code chunks (e.g., R Markdown and Sweave) are well recognized. This allows creation of rich, high quality, and reproducible documents. However, collaborative writing and editing have always been a bottleneck. Distributed version control systems like git are recommended for collaborative code editing but far from ideal when working with prose. In the latter cases, other software (e.g, Microsoft Word or Google Docs) offer a more fluent experience, tracking document changes in a simple and intuitive way. When you further consider that collaborators often do not have the same level of programming competence, there does not appear to be an optimal collaborative workflow for writing reproducible documents.

trackdown (formerly rmdrive) overcomes this issue by offering a simple solution to collaborative writing and editing of reproducible documents. Using trackdown, the local R Markdown or Sweave document is uploaded as plain-text in Google Drive allowing other colleagues to contribute to the prose using convenient features like tracking changes and comments. After integrating all authors’ contributions, the edited document is downloaded and rendered locally. This smooth workflow allows taking advantage of the easily readable Markdown and LaTeX plain-text combined with the optimal and well-known text editing experience offered by Google Docs.

In this contribution, we will present the package and its main features. trackdown aims to promote good scientific practices that enhance overall work quality and reproducibility, allowing collaborators with no or limited R knowledge to contribute to literate programming workflows.

Link to package or code repository.
https://github.com/ekothe/trackdown
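A minimal sketch of the round-trip workflow described above; the function names used (`upload_file()`, `download_file()`) are assumptions and should be verified against the trackdown documentation.

```r
# Sketch of the collaborative round trip described above; function names are
# assumptions to verify against the trackdown documentation.
library(trackdown)

# 1. Push the local R Markdown file to Google Drive as plain text
upload_file(file = "analysis/report.Rmd")

# 2. Collaborators edit and comment in Google Docs ...

# 3. Pull the edited text back and render it locally
download_file(file = "analysis/report.Rmd")
rmarkdown::render("analysis/report.Rmd")
```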


ID: 289 / ep-02: 38
Elevator Pitch
Topics: Statistical models
Keywords: multivariate functional data, outlier detection, functional classification, clustering, machine learning

Multivariate functional data analysis

Manuel Oviedo-de la Fuente1, Manuel Febrero-Bande2

1University of Coruña, Spain; 2University of Santiago de Compostela, Spain

This talk proposes new tools for working with multivariate functional data (MFD) in R. The class "mfdata" is proposed to handle multivariate functional data, and the class "ldata" to handle complex data (scalar, multivariate, directional, images, and functional). These new classes are useful in problems such as (i) visualizing centrality and detecting outliers in MFD, (ii) extending supervised classification algorithms in machine learning, and (iii) unsupervised algorithms such as hierarchical and k-means procedures.

Link to package or code repository.
https://cran.r-project.org/web/packages/fda.usc/


ID: 220 / ep-02: 39
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: big data

Multivariate functional principal component analysis on high dimensional gait data

Sajal Kaur Minhas1, Morgan Sangeux3, Julia Polak2, Michelle Carey1

1University College Dublin; 2School of Mathematics and Statistics, University of Melbourne, Melbourne, Australia; 3Murdoch Childrens Research Institute, Melbourne, Australia

A typical gait analysis requires the analysis of the kinematics of five joints (trunk, pelvis, hip, knee and ankle/foot) in three planes. It is often necessary to express how much a subject's gait deviates from an average normal profile as a single number. This can quantify the overall severity of a condition affecting walking, monitor progress, or evaluate the outcome of an intervention prescribed to improve the gait pattern. The Gait Deviation Index (GDI) and Gait Profile Score (GPS) are the standard indices for measuring gait abnormality and work well on common gait pathologies such as cerebral palsy. The GDI is easy to interpret and is normally distributed, allowing for parametric statistical testing, whereas the GPS has the ability to decompose scores by individual joints/planes and produce altered indices without the need for a large control database, but it is not normally distributed. Neither index accounts for the potential co-variation between the kinematic variables for any individual subject, i.e. the motions of one joint affect the motions of adjacent joints. Additionally, the intrinsic smoothness of the gait movement in each kinematic variable is not accounted for, i.e. the position of a joint at one time affects its position at a later instant. The aim of this work is to utilize techniques from multivariate functional principal components analysis in the R package MFPCA to create an index that combines the advantages of the existing GDI and GPS, i.e. an index that is easy to interpret, is normally distributed, has the ability to decompose scores by individual joints and planes, and is easily adaptable, while also accounting for the intrinsic smoothness of the gait movement in each kinematic variable and the potential co-variation between the kinematic variables. The functional gait deviation index is implemented in R and provides a computationally efficient and easily administered metric to quantify gait impairment.

Link to package or code repository.
https://github.com/Sajal010/MFPCA_gaitanalysis
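A hedged sketch of the kind of call involved in such an analysis with the MFPCA package; the construction of functional data objects via the funData package and the `MFPCA()` arguments shown are assumptions to check against the package documentation, and the example matrices are hypothetical.

```r
# Sketch only: assumes kinematic curves are stored as funData objects, one per
# kinematic variable, observed on a common gait-cycle grid (0-100%).
library(funData)
library(MFPCA)

# hypothetical matrices: rows = subjects, columns = % of gait cycle
knee_flex   <- funData(argvals = seq(0, 100, by = 1), X = knee_matrix)
pelvic_tilt <- funData(argvals = seq(0, 100, by = 1), X = pelvis_matrix)

gait <- multiFunData(knee_flex, pelvic_tilt)

# Multivariate FPCA with univariate FPCA expansions for each element
res <- MFPCA(gait, M = 5,
             uniExpansions = list(list(type = "uFPCA"),
                                  list(type = "uFPCA")))

res$scores   # subject-level scores that could feed a gait deviation index
```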


ID: 184 / ep-02: 40
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: teaching, lecture, introduction, programming

Teaching an introductory programming course with R

Reto Stauffer1,2, Joanna Chimiak-Opoka1, Luis M Rodriguez-R1,3, Achim Zeileis2

1Digital Science Center, Universität Innsbruck, Austria; 2Department of Statistics, Universität Innsbruck, Austria; 3Department of Microbiology, Universität Innsbruck, Austria

As part of a large digitalization initiative, Universität Innsbruck established a Digital Science Center that aims to foster both interdisciplinary research and modern education using digital and data-driven methods. Specifically, the center offers a package of elective courses that can be taken by all students and that covers programming, data management, data analysis, and further aspects of digitalization.

The first course within this package is a general introduction to programming for novices, offered in two tracks, using either Python or R. The focus is on teaching data types including object classes, writing and testing functions, control flow, etc. While some basic data management and data analysis is touched upon, these topics are mainly deferred to subsequent courses.

As this design differs from most introductory R materials that emphasize data analysis early on, we developed new course materials centered around an online textbook: https://discdown.org/rprogramming/. Our course follows the flipped classroom design, allowing the diverse group of participants to learn at their own pace. In class, open questions are resolved before students work jointly on non-mandatory programming tasks with guidance and feedback from the instructors. The assessment is based on short weekly (randomized) online quizzes generated with the R/exams package (http://www.R-exams.org/) that are automatically graded, as well as manually graded mid-term and final exams. The concept of the course has turned out to work well both in in-person and in virtual teaching.

Link to package or code repository.
https://discdown.org/rprogramming/


ID: 255 / ep-02: 41
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: XAI, DALEX, iml, flashlight, shap, Interpretable Artificial Intelligence

Landscape of R packages for eXplainable Artificial Intelligence

Szymon Maksymiuk, Alicja Gosiewska, Przemysław Biecek

Warsaw University of Technology, Poland

The growing availability of data and computing power is fueling the development of predictive models. To ensure the safe and effective functioning of such models, we need methods for exploration, debugging, and validation. New methods and tools for this purpose are being developed within the eXplainable Artificial Intelligence (XAI) subdomain of machine learning. In this lightning talk, we present our taxonomy of model-explanation methods, show which methods are included in the most popular R XAI packages, and identify trends in recent developments.

Link to package or code repository.
Link to a site presenting the results: http://xai-tools.drwhy.ai/
Repo with codes: https://github.com/MI2DataLab/XAI-tools
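To make the comparison concrete, here is a minimal example of the explanation workflow that packages such as DALEX provide (a model wrapped in an explainer, then model-agnostic variable importance); the dataset and model are illustrative only, not the talk's benchmark.

```r
# Minimal DALEX workflow: wrap a fitted model in an explainer, then compute
# model-agnostic (permutation-based) variable importance.
library(DALEX)

model <- glm(survived ~ gender + age + fare, data = titanic_imputed,
             family = "binomial")

explainer <- explain(model,
                     data  = titanic_imputed[, c("gender", "age", "fare")],
                     y     = titanic_imputed$survived,
                     label = "logistic regression")

vip <- model_parts(explainer)   # permutation importance
plot(vip)
```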


ID: 227 / ep-02: 42
Elevator Pitch
Topics: R in production
Keywords: interactive visualization

Reactive PK/PD: An R shiny application simplifying the PK/PD review process

Kristoffer Segerstrøm Mørk, Steffen Falgreen Larsen

Novo Nordisk

In phase 1 of clinical drug development there is great interest in the pharmacokinetics (PK) and pharmacodynamics (PD) of a drug. PK describes what the body does to the drug. PD describes what the drug does to the body. Due to the limitations and uncertainties related to the procedures used to assess the PK and PD of a drug, there is a need to review the PK and PD data on a patient level. Such a review is usually conducted in a smaller group of people from different skill areas.

In this elevator pitch we will present how we at Novo Nordisk have simplified and automated many of the tasks related to a PK/PD review using R Shiny. We have developed an application that automatically generates the figures that we need in order to conduct a review. The app enables users to comment on the data through the autogenerated figures, and the comments are instantly shared with other users. Once a review has been conducted, minutes can be downloaded in Word format, including the added comments.



ID: 121 / ep-02: 43
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data processing

r-cubed: Guiding the overwhelmed scientist from random wrangling to Reproducible Research in R

Hannah Chatwin1, Luke W. Johnston2, Helene Baek Juel3, Bettina Lengger4, Daniel R. Witte2,5, Malene Revsbech Christiansen3, Anders Aasted Isaksen5

1University of Southern Denmark; 2Steno Diabetes Center Aarhus; 3University of Copenhagen; 4Technical University of Denmark; 5Aarhus University

The volume of biological data increases yearly, driven largely by technologies like high-throughput omics, real-time monitoring, and high-resolution imaging, as well as by greater access to routine administrative data and larger study populations. This presents operational challenges and requires considerable knowledge of and skills to manage, process, and analyze this data. Along with the growing open science movement, research is also increasingly expected to be open, transparent, and reproducible. Training in modern computational skills has not yet kept pace, particularly in biomedical research where training often focuses on clinical, experimental, or wet lab competencies. We developed a computational learning module, r-cubed, that introduces and improves skills in R, reproducibility, and open science that was designed with biomedical researchers in mind. The r-cubed learning module is structured as a three-day workshop with five submodules. Over the five submodules, we use a combination of code-alongs, exercises, lectures, and a group project to cover skills in collaboration with Git and GitHub, project management, data wrangling, reproducible document writing, and data visualization. We have specifically designed the module as an open educational resource that instructors can use directly or to modify for their own lessons, and that learners can use independently or as a reference during and after participating in the workshop. All content is available for re-use under CC-BY and MIT Licenses. The course website is found at https://r-cubed.rostools.org/ and the repository with the source material is at https://gitlab.com/rostools/r-cubed.

Link to package or code repository.
https://r-cubed.rostools.org/


ID: 128 / ep-02: 44
Elevator Pitch
Topics: Databases / Data management
Keywords: databases

Validate observations stored in a DB

Edwin de Jonge

Statistics Netherlands / CBS

Data cleaning is an important step before analyzing your data. Often it is wise to check the validity of your observations before running your statistical methods on the data. Validation checks embody real-world knowledge about your observations, e.g. age cannot be negative or over 150 years old.

The R package `validate` allows for formulating validation checks in R syntax and running these checks on a `data.frame`. `validatedb` brings `validate` to the database: it allows for running the validation checks on (potentially very large) database tables, offering the same benefits as `validate`, namely a clean, documented set of validation rules, but checked on a database. The presentation will go into the details of the implementation, describe the output of the validation checks, and also discuss an alternative sparse format for describing errors in your data.
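A minimal sketch of this workflow: the `validator()`/`confront()` calls below are standard validate usage, while pointing `confront()` at a database table via validatedb is shown only as a commented assumption to verify in that package's documentation.

```r
# Define validation rules once, then confront data with them.
library(validate)

rules <- validator(
  age >= 0,
  age <= 150,
  income >= 0
)

# In-memory data.frame with the validate package
cf <- confront(data.frame(age = c(34, -2, 160), income = c(3e4, 2e4, NA)), rules)
summary(cf)

# With validatedb the same rules can be confronted with a database table
# (sketch; see the validatedb documentation for the exact interface):
# library(validatedb)
# con <- DBI::dbConnect(RSQLite::SQLite(), "observations.db")
# persons <- dplyr::tbl(con, "persons")
# confront(persons, rules)
```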



ID: 208 / ep-02: 45
Elevator Pitch
Topics: Teaching R/R in Teaching
Keywords: data science class, flipped classroom, learnr, gradethis

Teaching Biology students to code smoothly with learnR and gradethis

Guyliann Engels, Philippe Grosjean

Numerical Ecology Department, Complexys and InforTec Institutes, University of Mons, Belgium

R is taught in a biology curriculum at the University of Mons, Belgium, in the context of five data science courses spanning from the 2nd Bachelor to the last Master classes (https://wp.sciviews.org). Since 2018 the flipped classroom approach has been used. Three levels of exercises of increasing difficulty are proposed. First, students read a {bookdown} with integrated interactive exercises written in H5P or {Shiny}. Then, they practice R using {learnr} tutorials. Finally, they apply the new concepts to real datasets in individual or group projects managed with GitHub and GitHub Classroom (https://github.com/BioDataScience-Course).

{learnr} is a useful tool to bridge the gap between theory and practice when learning R. Students can self-assess their skills and get immediate feedback thanks to {gradethis}. All the exercises generate xAPI events that are recorded in a MongoDB database (more than 300,000 events recorded so far for a total of 182 students over three academic years). These data allow us to quantify and visualize progression (individual progress reports as {Shiny} applications). Thanks to the detailed visualization of their own progression, students are more motivated to complete the exercises. Whether {learnr} is used alone or in combination with {gradethis} for immediate feedback on the answers determines students' behaviour: they spend more time on each exercise and try harder to find the right answer when {gradethis} is used.

Link to package or code repository.
https://github.com/BioDataScience-Course


ID: 138 / ep-02: 46
Elevator Pitch
Topics: Statistical models
Keywords: algorithms

Partial Least Squares Regression for Beta Regression Models

Frederic Bertrand, Myriam Maumy

European University of Technology - Troyes Technology University

Many responses, for instance, experimental results, yields or economic indices, can be naturally expressed as rates or proportions whose values must lie between zero and one or between any two given values.

The Beta regression often allows modelling these data accurately since the shapes of the densities of Beta laws are very versatile.

Yet, like any of the usual regression models, it cannot be applied safely in the presence of multicollinearity, and not at all when the model matrix is rectangular. These situations are frequently found in fields from chemistry to medicine, through economics and marketing.

To circumvent this difficulty, we derived an extension of PLS regression to Beta regression models: Bertrand, F., [...], Maumy-Bertrand, M. (2013). “Régression Bêta PLS” [French]. JSFDS, 154(3):143-159.

The plsRbeta package provides partial least squares regression for (weighted) Beta regression models and k-fold cross-validation using various criteria. It allows for missing data in the explanatory variables. Bootstrap confidence interval construction is also available. Parallel computing (CPU and GPU) support is currently being implemented.
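A hedged sketch of fitting a PLS Beta regression with the plsRbeta package; the argument names used below (`nt` for the number of components, `modele = "pls-beta"`) follow the plsRglm family of packages but are assumptions to verify in the plsRbeta documentation, and `mydata` is hypothetical.

```r
# Sketch only: argument names are assumptions patterned on the plsRglm family.
library(plsRbeta)

# y is a rate/proportion in (0, 1); X may have more columns than rows
fit <- plsRbeta(y ~ ., data = mydata,
                nt = 3,                  # number of PLS components
                modele = "pls-beta")

summary(fit)
```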



ID: 194 / ep-02: 47
Elevator Pitch
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: interpretability, machine learning, explainability

Simpler is Better: Lifting Interpretability-Performance Trade-off via Automated Feature Engineering

Alicja Gosiewska1, Anna Kozak1, Przemysław Biecek1,2

1Warsaw University of Technology, Poland; 2University of Warsaw, Poland

Machine learning generates useful predictive models that can and should support decision-makers in many areas. The availability of tools for AutoML makes it possible to quickly create an effective but complex predictive model. However, the complexity of such models is often a major obstacle in applications, especially in terms of high-stake decisions. We are experiencing a growing number of examples where the use of black boxes leads to decisions that are harmful, unfair, or simply wrong. Here, we show that very often we can simplify complex models without compromising their performance; however, with the benefit of much-needed transparency.

We propose a framework that uses elastic black boxes as supervisor models to create simpler, less opaque, yet still accurate and interpretable glass box models. The new models were created using newly engineered features extracted with the help of a supervisor model.

We support the analysis with a large-scale benchmark on several tabular data sets from the OpenML database. There are three main results: 1) we show that extracting information from complex models may improve the performance of simpler models; 2) we question the common myth that complex predictive models outperform simpler predictive models; 3) we present a real-life application of the proposed method.

The proposed method is available as an R package rSAFE, https://github.com/ModelOriented/rSAFE.

Link to package or code repository.
https://github.com/ModelOriented/rSAFE


ID: 162 / ep-02: 48
Elevator Pitch
Topics: Bayesian models
Keywords: Bayesian analysis

State of the Market - Infinite State Hidden Markov Models

Dean Markwick

BestX

The stock market is either in a bull or a bear market at any given time. In a bull market, on average prices increase. In a bear market, prices decrease on average. In this talk I will build a non-parametric Bayesian model that can classify the stock market into these different states.

This model is a practical application of my dirichletprocess R package and will serve as an introduction to both the package and non-parametric Bayesian models. I use free stock data and take you through the full quantitative modelling process. I will show how to: prepare the data, build the model and analyze the model output. This model is able to highlight the dot-com crash of the 2000s, the credit crisis of 2008 and the more recent COVID turmoil in the market. As it is a Bayesian model I am also able to highlight the uncertainty around these market states without having to do any extra work. Overall, this talk will provide a practical example and introduction into how R can be used in quantitative finance.
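As a hedged taste of the dirichletprocess package mentioned above, the sketch below fits a plain Dirichlet process mixture of Gaussians to daily stock returns; the constructor and `Fit()` call follow the package's introductory examples but should be checked against the current documentation, and the full infinite-state hidden Markov model in the talk involves additional machinery.

```r
# Sketch only: a Dirichlet process mixture of Gaussians on daily returns,
# not the full infinite-state HMM from the talk.
library(dirichletprocess)

returns <- diff(log(EuStockMarkets[, "DAX"]))  # example data shipped with R
returns <- scale(returns)[, 1]                 # standardise for the default prior

dp <- DirichletProcessGaussian(returns)
dp <- Fit(dp, its = 500)                       # MCMC iterations

plot(dp)                                       # posterior cluster structure
```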



ID: 164 / ep-02: 49
Elevator Pitch
Topics: Bioinformatics / Biomedical or health informatics
Keywords: algorithms

networkABC: Network Reverse Engineering with Approximate Bayesian Computation

Myriam Maumy, Frederic Bertrand

European Technology University - Troyes Technology University

We developed an inference tool based on approximate Bayesian computation to decipher network data and assess the strength of the inferred links between the network's actors.

It is a new multi-level approximate Bayesian computation (ABC) approach. At the first level, the method captures the global properties of the network, such as scale-freeness and clustering coefficients, whereas the second level is targeted to capture local properties, including the probability of each couple of genes being linked.

Up to now, ABC algorithms have been scarcely used in this setting and, due to the computational overhead, their application was limited to a small number of genes. On the contrary, our algorithm was designed to cope with this issue and has a low computational cost. It can be used, for instance, for elucidating gene regulatory networks, which is an important step towards understanding normal cell physiology and complex pathological phenotypes.

Reverse-engineering consists of using gene expressions over time or over different experimental conditions to discover the structure of the gene network in a targeted cellular process.



ID: 187 / ep-02: 50
Elevator Pitch
Topics: Economics / Finance / Insurance
Keywords: Tidymodels, Tidyverse, actuarial science, actuarial claim cost analysis

Navigating Insurance Claim Data Through Tidymodels Universe

Jun Haur Lok, Tin Seong Kam

Singapore Management University, Singapore

The increasing ability to store and analyze data due to advances in technology has provided actuaries with opportunities to optimize the capital held by insurance companies. Often, the ability to optimize capital lowers the cost of capital for companies. This could translate into an increase in profit from the lower cost incurred, or an increase in competitiveness through lowering the premiums companies charge for their insurance plans. In this analysis, the tidyverse and tidymodels packages are used to demonstrate how modern data science R packages can assist actuaries in predicting the ultimate claim cost once claims are reported. The conformity of these R packages with tidy data concepts has flattened the learning curve for using different machine learning techniques to complement conventional actuarial analysis. This has effectively allowed actuaries to build various machine learning models in a tidier and more efficient manner. The packages also enable users to harness the power of data science to mine the “gold” in unstructured data, such as claim descriptions, item descriptions, and so on. Together, these would enable companies to hold smaller reserves through more accurate claim estimation while not compromising solvency, allowing the capital to be re-deployed for other purposes.



ID: 146 / ep-02: 51
Elevator Pitch
Topics: Biostatistics and Epidemiology
Keywords: regression, mixed-effects model, grouped data, correlated outcomes, transformation model

tramME: Mixed-Effects Transformation Models Using Template Model Builder

Balint Tamasi, Torsten Hothorn

Epidemiology, Biostatistics and Prevention Institute (EBPI), University of Zurich, Switzerland

Statistical models that allow for departures from strong distributional assumptions on the outcome and accommodate correlated data structures are essential in many applied regression settings. Our technical note presents the R package tramME that implements the mixed-effects extension of linear transformation models. The model is appealing because it directly parameterizes the (conditional) distribution function and estimates the necessary transformation of the outcome in a data-driven way. As a result, transformation models represent a general and flexible approach to regression modeling of discrete and continuous outcomes. The package tramME builds on existing implementations of transformation models (the mlt and tram packages) as well as the Laplace approximation and automatic differentiation (using the TMB package) to perform fast and efficient likelihood-based estimation and inference in mixed-effects transformation models. The resulting framework can be readily applied to a wide range of regression problems with grouped data structures. Two examples are presented, which demonstrate how the model can be used for modeling correlated outcomes without strict distributional assumptions: 1) a mixed-effects continuous outcome logistic regression for longitudinal data with a bounded response; 2) a flexible parametric proportional hazards model for time-to-event data from a multi-center trial.

Keywords: correlated outcomes, mixed-effects models, R package development, regression, transformation models
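A hedged sketch of the model syntax, assuming tramME mirrors the tram/lme4 conventions (model functions such as `LmME()` with random effects specified in the formula); the function name and arguments should be verified in the package documentation.

```r
# Sketch only: assumes an lme4-style formula interface with transformation
# model functions such as LmME().
library(tramME)

data("sleepstudy", package = "lme4")

# Normal linear mixed model expressed as a transformation model, with a
# random intercept and slope for each subject
fit <- LmME(Reaction ~ Days + (Days | Subject), data = sleepstudy)

summary(fit)
```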



ID: 119 / ep-02: 52
Elevator Pitch
Topics: Statistical models
Keywords: big data

The one-step estimation procedure in R

Alexandre Brouste1, Christophe Dutang2

1Le Mans Université; 2Université Paris-Dauphine

In finite-dimensional parameter estimation, the Le Cam one-step procedure is based on an initial guess estimator and a Fisher scoring step on the log-likelihood function. For an initial $\sqrt{n}$-consistent guess estimator, the one-step estimation procedure is asymptotically efficient. As soon as the guess estimator is available in closed form, it can also be computed faster than the maximum likelihood estimator. More recently, it has been shown that this procedure can be extended to an initial guess estimator with a slower speed of convergence. Based on this result, we propose in the OneStep package (available on CRAN) a procedure to compute the one-step estimator in any situation faster than the MLE for large datasets. Monte-Carlo simulations are carried out for several examples of statistical experiments generated by i.i.d. observation samples (discrete and continuous probability distributions). Thereby, we exhibit the performance of Le Cam’s one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size. A real application and future package developments will also be discussed.
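To make the idea concrete, here is a small base-R illustration of the Le Cam one-step principle for the location of a Cauchy sample: start from a $\sqrt{n}$-consistent guess (the median) and take a single Fisher scoring step on the log-likelihood. This is a didactic sketch, not the OneStep package's interface.

```r
# One-step estimation for the location of a Cauchy(theta, 1) sample:
# initial guess (median) + one Fisher scoring step on the log-likelihood.
set.seed(1)
x <- rcauchy(1e5, location = 2)

theta0 <- median(x)                                       # sqrt(n)-consistent guess
score  <- sum(2 * (x - theta0) / (1 + (x - theta0)^2))    # d/dtheta of the log-likelihood
fisher <- length(x) / 2                                   # Fisher information: n * 1/2

theta1 <- theta0 + score / fisher                         # asymptotically efficient
c(initial = theta0, one_step = theta1)
```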

 
2:30pm - 2:45pmBreak
Virtual location: The Lounge #lobby
2:45pm - 3:45pmKeynote: Enseñando a enseñar sin perder a nadie en el camino - Teaching how to teach without leaving anyone behind
Virtual location: The Lounge #key_metadocencia
Session Chair: Juan Pablo Ruiz Nicolini
Zoom Host: Nasrin Fathollahzadeh Attar
Replacement Zoom Host: Tuli Amutenya
Session Sponsor: ixpantia Session Slide
 
ID: 357 / [Single Presentation of ID 357]: 1
Keynote Talk
Topics: Teaching R/R in Teaching

Paola Corrales, Elio Campitelli, Ivan Poggio

Metadocencia

Metadocencia was born in March 2020 when the pandemic forced us to change the way we teach and learn. At the time we found ourselves with little time or resources but eager to help by sharing our experience with other teachers.

We began by running a workshop with evidence-based educational methods that could be applied in a simple way. We also provided open resources to encourage effective teaching practices and invited people to share their experience and form a community. A year later, we opened 3 new workshops and reached more than 1500 people in 30 countries.

In this talk we will share some of the core values that define us: meeting our students where they are and attending to their context in Latin America. This means making no assumptions about their knowledge of technology or about Internet access and availability, cultural differences, barriers, or specific needs. We want to share what we have learned by teaching in community.

 
3:45pm - 4:00pmBreak
Virtual location: The Lounge #lobby
4:00pm - 5:00pmIncubator: The role of the R community in the RSE movement
Location: The Lounge #incubator_rse
Session Chair: Matt Bannert
Session Chair: Heather Turner
Zoom Host: Faith Musili
Replacement Zoom Host: Rachel Heyard
 
ID: 352 / [Single Presentation of ID 352]: 1
Incubator
Topics: Community and Outreach

Heidi Seibold, Heather Turner, Matt Bannert

Johner Institut, Germany

The term "Research Software Engineer"(RSE) was proposed by a group of software developers working in academia at a workshop in Oxford, UK, 2012. It was the beginning of a grass-roots movement to establish Research Software Engineering as a profession for people that combined expertise in programming with an intricate understanding of research. Since then, the movement has grown substantially, leading to recognition, reward and career opportunities for RSEs and the creation of national RSE associations in Australia/New Zealand, Belgium, Germany, the Netherlands, the Nordic region, the UK and the USA.

This incubator will provide an opportunity to discuss the role of the R community in the RSE movement. What can we share with this wider community? How can we help the movement grow? What could the R community gain from this movement? We will identify a range of actions from quick wins to more ambitious projects that could be pursued after useR! 2021.

 
4:00pm - 5:00pmPanel: R User or R Developer? This is the question!
Location: The Lounge #panel_user_developer
Session Chair: Francesca Vitalini
Zoom Host: Maryam Alizadeh
Replacement Zoom Host: Tuli Amutenya
 
ID: 214 / [Single Presentation of ID 214]: 1
Panel
Topics: Other
Keywords: DevOps

Francesca Vitalini, Riccardo Porreca, Stéphanie Gehring, Peter Schmid

Mirai Solutions GmbH,

Since its first official release back in 1995, R has outgrown its statistician-tool origin, spreading out to different fields. A key factor in R’s popularity is without a doubt its approachability for people without a software engineering background.

As a result, R is often considered more as a scripting / prototyping / data analytics tool than a proper software development language. Thanks to its low barrier and accessibility, however, people with a wide variety of (non-technical) backgrounds can quickly become active and effective users. This in turn can set the ground for exploring and building up programming and development skills, transitioning towards what is normally associated with software engineering profiles.

This raises some questions: what does it mean to be an R user? Is there such a thing as an R developer, and (how) does it differ from being an R user? In a time when IT skills are required across virtually every domain, can an R user afford not to be a software engineer as well? Is there an R equivalent of the full-stack Python developer, and if so, what does it look like? What type of background and expertise should an R user have to fit what companies are looking for? And what about academia? What is the current trend?

We will discuss these questions in a panel featuring the point of view of:

- experts from both industry and academia,

- data scientists who have made the transition from R users to software developers,

- R Core team,

and of course the Community perspective.

 
5:00pm - 6:00pmmixR!
Music, a networking channel and raffles to end the day in a relaxing way.
11:15pm - 11:59pmTutorials - Track 2
 
11:15pm - 2:15am
ID: 321 / 2A-Tut: 1
Tutorial
Topics: Efficient programming

Translating R to Your Language

Michael Chirico, Michael Lawrence

- Language: English

- Duration: 180

- Participants: 30

- Level: Intermediate+

R users are a global bunch. Providing error messages in languages besides English can greatly improve the user experience (and debugging experience) of R users who are not native English speakers.

This tutorial aims to get package developers and other R community members started implementing foreign-language translations of R's messages (errors, warnings, verbose output, etc.) into a language of their choosing.

The standard tools for providing translations can be somewhat esoteric; in this tutorial, we'll go over some of the challenges presented by translations, the process for providing and/or updating translations to R itself, and finally introduce a package (`potools`) that will remove some of the frictions potential translators may face.

We especially encourage attendance from speakers of major world languages currently missing from the R translation database, in particular Hindi, Arabic, Bengali, Urdu, and Bahasa Indonesia.

 
Date: Wednesday, 07/July/2021
12:05amMovie: Coded Bias
The movie: "Coded Bias" will be available to watch during 24 hours, and we'll have a channel open for discussions
7:00am - 11:59pmTutorials - Track 1
 
7:00am - 9:00am
ID: 300 / 1-Tut: 1
Tutorial
Topics: Data visualisation

Data visualization using ggplot2 and its extensions

Haifa Ben Messaoud, Mouna Belaid, Kaouthar Driss, Amir Souissi

- Language: English

- Duration: 120mn

- N° Participants: 100

- Level: Beginner

"This tutorial will cover the introduction to ggplot2 and its main functions. We will cover how to make visualization of one variable, two variables, and three or more variables, how to lay out multiple plots, the use of ggstats for statistical visualizations, how to make interactive graphs using plotly and gganimate, some extensions of ggplot2. Finally, we will show you how to enhance the quality of your graphs by changing the theme or adding a logo and how to export your graph. We will share the code on the github repository of R-Ladies Tunis."



9:00am - 9:15am
ID: 322 / 1-Tut: 2
Breaks

Break

useR! 2021



9:15am - 11:45am
ID: 301 / 1-Tut: 3
Tutorial
Topics: Bayesian models

Additive Bayesian Networks Modeling

Gilles Kratzer, Reinhard Furrer

- Language: English

- Duration: 150 mn

- N° Participants: 60

- Level: Intermediate

Additive Bayesian Networks (ABN) have been developed to disentangle complex relationships in highly correlated datasets as frequently encountered in risk factor analysis studies. ABN is an efficient approach to sort out direct and indirect relationships among variables, which is surprisingly common in systemic epidemiology. After the tutorial, you will be able to run the individual steps of an ABN analysis with real-world data. You will be able to contrast this approach with standard regression (linear, logistic, Poisson regression, and multinomial models) used for classical risk factor analysis.

Towards the end, we also cover Bayesian Model Averaging in the context of an ABN, which is useful to assess the validity of the learned model and more advanced inference on the network.



11:45am - 12:00pm
ID: 323 / 1-Tut: 4
Breaks

Break

useR! 2021



12:00pm - 2:00pm
ID: 302 / 1-Tut: 5
Tutorial
Topics: Spatial analysis, Data visualisation

Quick high quality maps with R

Jan-Philipp Kolb

- Language: English

- Duration: 120 mn

- N° Participants: 40

- Level: Beginner

This tutorial covers the basic use of R for creating maps. Useful tools, as well as data sources, are both presented. Concerning tools, the focus is on the packages osmplotr, tmap, and raster.

In the first part of the tutorial, you will learn how to use OpenStreetMap data. Geocoding and creation of bounding boxes will be presented, as well as the use of shapefiles to create thematic maps and color-coding in R. After this introduction to the basic concepts and functionalities of mapping with R, you will go through a prototypical data analysis workflow: import, wrangling, exploration, (basic) analysis, reporting. You will have the opportunity to create your own maps during the workshop. A GitHub repo for the course will be shared.
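As a small taste of the kind of thematic map covered, the sketch below uses the World dataset shipped with tmap; this is a minimal illustrative example, not the tutorial's own material.

```r
# Quick thematic (choropleth) map with tmap's built-in World dataset
library(tmap)

data("World")

tm_shape(World) +
  tm_polygons("life_exp", title = "Life expectancy") +
  tm_layout(legend.outside = TRUE)
```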



2:00pm - 2:15pm
ID: 324 / 1-Tut: 6
Breaks

Break

useR! 2021



2:15pm - 4:15pm
ID: 305 / 1-Tut: 7
Tutorial
Topics: Big / High dimensional data, Spatial analysis, Efficient programming
Keywords: spam, maximum likelihood estimation, covariance function, BLUP, Gaussian process

Spatial Statistics for huge datasets and best practices

Reinhard Furrer, Roman Flury, Federico Blasi

- Language: English

- Duration: 120 mn

- N° Participants: 50

- Level: Advanced

During the last decade, several advanced approaches have been proposed to address the computational issues of larger and larger multivariate space-time datasets. These can essentially be categorized as (i) constructing "simpler" models (e.g., low-rank models, composite likelihood methods, predictive process models) or (ii) approximating the models (e.g., with Gaussian Markov random fields, compactly supported covariance functions). In this tutorial, we discuss the latter point by using sparse covariance matrix approximations. There is seemingly no limit to the sample size, with the possibility of working with long vectors jointly with 64-bit handling algorithms. However, the devil is in the details, and to avoid negative surprises we provide best practices, strategies, and tricks for modeling huge spatial data.



4:15pm - 4:30pm
ID: 325 / 1-Tut: 8
Breaks

Break

useR! 2021



4:30pm - 7:30pm
ID: 304 / 1-Tut: 9
Tutorial
Topics: Community and Outreach, Algorithms

Contributing to R

Gabriel Becker, Martin Maechler

Clindata Insights, United States of America

- Language: English

- Duration: 180 mn

- N° Participants: 30

- Level: Intermediate to Advanced.

Did you always want to contribute to (base) R but don't know how? Come to our Tutorial!

We will show cases where and how users have contributed actively to (base) R: submitting bug reports with minimal reproducible examples, testing, reading source code, and providing patches to the R source code, all of which have helped make R better.

Depending on the participants' willingness and level of sophistication, we will look into doing things right now, for currently unresolved issues and bug reports.



7:30pm - 7:45pm
ID: 326 / 1-Tut: 10
Breaks

Break

useR! 2021



7:45pm - 10:15pm
ID: 303 / 1-Tut: 11
Tutorial
Topics: Data visualisation

Graphing multivariate categorical data: The how, what and why of mosaic plots and alluvial diagrams

Joyce Robbins, Ludmila Janda

- Language: English

- Duration: 150

- N° Participants: 24

- Level: Beginner

Multivariate categorical data present unique data visualization challenges. This tutorial provides two options to meet such challenges: mosaic plots and alluvial diagrams. First, we will focus on how to choose the best graph for given data types and communication goals. You will then learn how to get the underlying data in the correct shape to make each graph and then create both graph types using the vcd and ggalluvial packages. We will use engaging datasets and aim to equip you with the skills to make these graphs (and the choice whether to use them) on your own.

 
7:00am - 11:59pmTutorials - Track 2
 
7:00am - 10:00am
ID: 316 / 2-Tut: 1
Tutorial
Topics: Other, Efficient programming
Keywords: testing, vcr, testthat, mocking, fixtures

GET better at testing your R package!

Maëlle Salmon, Scott Chamberlain

Are you a package developer who wants to improve your understanding and practice of unit testing?

You've come to the right place: This tutorial is about Advanced testing of R packages, with HTTP testing as a case study.

Unit tests have numerous advantages like preventing future breakage of your package and helping you define features (test-driven development).

In many introductions to package development you learn how to set up testthat infrastructure, and how to write a few “cute little tests” (https://testthat.r-lib.org/articles/test-fixtures.html#test-fixtures) with only inline assertions.

This might work for a bit but soon you will encounter some practical and theoretical challenges: e.g. where do you put data and helpers for your tests? If your package is wrapping a web API, how do you test it independently from any internet connection? And how do you test the behavior of your package in case of API errors?

In this tutorial we shall use HTTP testing with the vcr package as an opportunity to empower you with more knowledge of testing principles (e.g. cleaning after yourself, testing error behavior) and testthat practicalities (e.g. testthat helper files, testthat custom skippers).

After this tutorial, you will be able to use the handy vcr package for your package wrapping a web API or any other web resource, but you will also have gained skills transferable to your other testing endeavours!

Come and learn from rOpenSci expertise!

Related materials

https://devguide.ropensci.org/building.html#testing

https://books.ropensci.org/http-testing

https://blog.r-hub.io/2019/10/29/mocking/

https://blog.r-hub.io/2020/11/18/testthat-utility-belt/
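As a concrete taste of the HTTP testing pattern taught here, the test below wraps an API call in `vcr::use_cassette()`, so the first run records the HTTP interaction to a fixture file and later runs replay it offline; the package under test and the endpoint are illustrative only.

```r
# tests/testthat/test-api.R -- record once, replay offline afterwards
library(testthat)
library(vcr)

test_that("the client parses the API response", {
  vcr::use_cassette("gh_user", {
    # any function in your package that performs an HTTP request, e.g. via httr
    res <- httr::GET("https://api.github.com/users/ropensci")
  })
  expect_equal(httr::status_code(res), 200)
})
```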



10:00am - 10:15am
ID: 327 / 2-Tut: 2
Breaks

Break

useR! 2021



10:15am - 2:15pm
ID: 317 / 2-Tut: 3
Tutorial
Topics: Other, R in production

Systematic data validation with the validate package

Mark van der Loo, Edwin de Jonge

Statistics Netherlands, The Netherlands

- Language: English

- Duration: 240 mn

- N° Participants: 30

- Level: Intermediate

Checking the quality of data is a task that pervades data analyses. It does not matter whether you are working with raw data, cleaned data, or with the results of an analysis: it is always important to convince yourself that the data you are using is fit for its intended purpose. Since it is such a common task, why not automate it? The 'validate' package is designed for exactly this task: it implements a domain-specific language for data checking that aims to encompass any check you might wish to perform.

In this course you will learn to define and measure data quality in a precise way with the validate package. We will focus on the main workflow, and show you how you can involve domain experts directly in your work, even if they do not know R. You will learn the main principles of data validation, both from the point of view of organizing a data processing workflow and from a more formal perspective. You will exercise data validation tasks that range from checking input format and types to complex checks that involve data from multiple sources. You will learn how to follow the evolution of data quality as it is processed using the lumberjack package, and how to flush out redundant or contradictory quality demands using the validatetools package. The course will consist of hands-on work, based on a prepared tutorial that will be published on GitHub. There will be break-out sessions with assignments where you can discuss the materials with other course participants. The presentations will include some Kahoot quizzes to keep things interactive, fun, and focused.



2:15pm - 2:30pm
ID: 330 / 2-Tut: 4
Breaks

Break

useR! 2021



2:30pm - 6:00pm
ID: 318 / 2-Tut: 5
Tutorial
Topics: R in production, Web Applications (Shiny/Dash)

Production-grade Shiny Apps with {golem} - French

Vincent GUYADER, Cervan Girard

- Language: French

- Duration: 120 mn

- N° Participants: 30

- Level: Intermediate

This tutorial is aimed at intermediate or advanced Shiny application developers who want to design "clean" applications following best practices. We will present the different steps necessary to obtain an application deployed in production. Active participation is expected, with screen sharing and microphone (and if possible webcam).



6:00pm - 6:15pm
ID: 331 / 2-Tut: 6
Breaks

Break

useR! 2021



6:15pm - 7:15pm
ID: 319 / 2-Tut: 7
Tutorial
Topics: Data mining / Machine learning / Deep Learning and AI

Penguins in a Box: Interactive Data Science Tutorial with Penguins.

Maria Dermit, Susana Escobar

- Language: English

- Duration: 60 mn

- N° Participants: 100

- Level: Intermediate

Penguins in a Box is a learnr package that covers the topics of the R for Data Science book and uses the widely used penguins dataset to explore the book's concepts. The package currently contains one tutorial for each chapter of the book and will be introduced during the presentation. In addition, you will join breakout rooms to work on modules covering the book's main sections (i.e., Explore, Wrangle, Program, Model and Communicate; 6 sections in total) according to your learning objectives. This tutorial is aimed at both students who want to improve their data science skills in an interactive way and teachers who want access to additional learnr resources similar to the RStudio Primers (https://rstudio.cloud/learn/primers). The tutorial is designed to be interactive, and peer instruction between attendees will guide learning in the breakout rooms.



7:15pm - 8:00pm
ID: 329 / 2-Tut: 8
Breaks

Break

useR! 2021



8:00pm - 11:00pm
ID: 320 / 2-Tut: 9
Tutorial
Topics: Teaching R/R in Teaching, Other

Professional, Polished, Presentable: Making Great Slides with xaringan

Garrick Aden-Buie, Silvia Canelón

- Language: English

- Duration: 180mn

- N° Participants: 60

- Level: Intermediate

The xaringan package brings professional, impressive, and visually appealing slides to the powerful R Markdown ecosystem. Through our hands-on tutorial, you will learn how to design highly effective slides that support presentations for teaching and reporting alike. Over three hours, you will learn how to create an accessible baseline design that matches your institution or organization’s style guide. Together we’ll explore the basics of CSS (the design language of the internet) and how we can leverage CSS to produce elegant slides for effective communication. Finally, we’ll deploy our slides online where they can be shared and discovered by others long after they support our presentations. The tutorial will demonstrate how to use the skills learned to incorporate principles of accessible design into your presentations. The tutorial will feature live coding and interactive question-and-answer periods, interspersed with small-group breakout sessions for guided hands-on experience. The tutorial will be supported by a repository of materials.



11:00pm - 11:15pm
ID: 328 / 2-Tut: 10
Breaks

Break

useR! 2021

 
7:00am - 11:59pmTutorials - Track 3
 
7:00am - 10:00am
ID: 311 / 3-Tut: 1
Tutorial
Topics: Algorithms, Data mining / Machine learning / Deep Learning and AI
Keywords: Interpretable Machine Learning, Explainable Artificial Intelligence, Machine Learning, Fairness, Responsible Machine Learning

Introduction to Responsible Machine Learning

Anna Kozak, Hubert Baniecki, Przemyslaw Biecek, Jakub Wisniewski

- Language: English

- Duration: 180

- N° Participants: 150

- Level: Beginner

What? The workshop focuses on responsible machine learning, including areas such as model fairness, explainability, and validation.

Why? To gain the theory and hands-on experience in developing safe and effective predictive models.

For whom? For those with basic knowledge of R, familiar with supervised machine learning, and interested in model validation.

What will be used? We will use the DALEX package for explanations, fairmodels for checking bias, and modelStudio for interactive model analysis; see the sketch after this list.

Where? 100% online

When? Wednesday, 7th of July, 7:00 - 10:00 am (UTC)
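A hedged sketch of the workflow these packages enable: wrap a model with DALEX, audit it with fairmodels, and explore it in modelStudio. The data, model, and protected attribute are illustrative only; check each package's documentation for current arguments.

```r
# Sketch: explain -> check fairness -> explore interactively
library(DALEX)
library(fairmodels)
library(modelStudio)

model <- glm(survived ~ gender + age + fare, data = titanic_imputed,
             family = "binomial")

exp <- explain(model,
               data = titanic_imputed[, c("gender", "age", "fare")],
               y    = titanic_imputed$survived)

# Bias audit with respect to a protected attribute
fc <- fairness_check(exp,
                     protected  = titanic_imputed$gender,
                     privileged = "male")
plot(fc)

# Interactive model analysis dashboard
modelStudio(exp)
```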



10:00am - 10:15am
ID: 332 / 3-Tut: 2
Breaks

Break

useR! 2021



10:15am - 2:15pm
ID: 312 / 3-Tut: 3
Tutorial
Topics: Spatial analysis, Data visualisation

Entry level R maps from African data - French - English

Andy South, Anelda van der Walt, Ahmadou Dicko, Shelmith Kariuki, Laurie Baker

- Language: French - English

- Duration: 240

- N° Participants: 60

- Level: Beginner

This tutorial will provide an introduction to mapping and spatial data in R using African data. By the end of the tutorial, you should be able to make a map that is useful to you from data that you have brought yourselves. We will focus on developing confidence in doing the basics really well in preference to straying too far into more advanced analyses. Our tutorials focus on flexible workflows that you can take away. You will also learn how to spot and avoid common pitfalls. The training will be partly based around a set of interactive learnr tutorials that we have created as part of the afrilearnr package (https://github.com/afrimapr/afrilearnr) and accompanying online demos described in this blog post: https://afrimapr.github.io/afrimapr.website/blog/2021/interactive-tutorials-for-african-maps/.

The tutorial will be available on shinyapps for those that are unable to install locally. There will be separate English & French language groups with dedicated materials. Each group will start together for the first few sessions and then break into sub-groups of up to 10 learners with one trainer each for improved feedback and discussion. Towards the end of the tutorial we will challenge you to make a map using data that you have brought or found.

Each language group will come back together for a final wrapup session.



2:15pm - 2:30pm
ID: 333 / 3-Tut: 4
Breaks

Break

useR! 2021



2:30pm - 5:30pm
ID: 313 / 3-Tut: 5
Tutorial
Topics: Community and Outreach, Reproducibility, Other

How to build a package with "Rmd First" method

Sébastien Rochette, Emily Riederer

ThinkR,

- Language: English

- Duration: 180

- N° Participants: 30

- Level: Intermediate

"Rmd First" method can reduce mental load when building packages by keeping users in a natural environment, using a tool they know: a RMarkdown document. The step between writing your own R code to analyze some data and refactoring it into a well-documented, ready-to-share R package seems unreachable to many R users. The package structure is sometimes perceived as useful only for building general-purpose tools for data analysis to be shared on official platforms. However, packages can be used for a broader range of purposes, from internal use to open-source sharing. Because packages are designed for robustness and enforce helpful standards for documentation and testing, the package structure provides a useful framework for refactoring analyses and preparing them to go into production. The following approach to write a development or an analysis inside a Rmd, will significantly reduce the work to transform a Rmd into a package : - _Design_ : define the goal of your next steps and the tools needed to reach them - _Prototype_ : use some small examples to prototype your script in Rmd - _Build_ : Build your script as functions and document your work to be able to use them, in the future, on real-life datasets - _Strengthen_ : Create tests to assure stability of your code and follow modifications through time - _Deploy_ : Transform as a well-structured package to deploy and share with your community During this tutorial, we will work through the steps of Rmd Driven Development to persuade attendees that their experience writing R code means that they already know how to build a package. They only need to be in a safe environment to find it out, which will be what we propose. We will take advantage of all existing tools such as {devtools}, {testthat}, {attachment} and {usethis} that ease package development from Rmd to building a package. The recent package [{fusen}](https://thinkr-open.github.io/fusen), which "inflates a package from a simple flat Rmd", will be presented to further reduce the step between well-designed Rmd and package deployment. Attendees will leave this workshop having built their first package with the "Rmd First" method and with the skills and tools to build more packages on their own.



5:30pm - 5:45pm
ID: 334 / 3-Tut: 6
Breaks

Break

useR! 2021



5:45pm - 8:30pm
ID: 314 / 3-Tut: 7
Tutorial
Topics: Bayesian models, Statistical models

Bayesian modeling in R with {rstanarm} - Spanish

Fernando Antonio Zepeda Herrera

- Language: Spanish

- Duration: 165

- N° Participants: 30

- Level: Intermediate

This tutorial would introduce Bayesian modeling in R particularly through {rstanarm}. We would alternate between "lectures" and "practical" examples (with {learnr} tutorials). Starting with a brief introduction of the Bayesian paradigm, we would cover linear and generalized linear regression as well as useful diagnostics and posterior visualization.
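A minimal example of the kind of model covered: a Bayesian linear regression fitted with rstanarm's `stan_glm()`. The dataset and settings are illustrative only.

```r
# Bayesian linear regression with rstanarm
library(rstanarm)

fit <- stan_glm(mpg ~ wt + hp, data = mtcars,
                family = gaussian(),
                chains = 4, iter = 2000, seed = 123)

print(fit)                          # posterior summaries
posterior_interval(fit, prob = 0.9) # 90% credible intervals
```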



8:30pm - 8:45pm
ID: 335 / 3-Tut: 8
Breaks

Break

useR! 2021



8:45pm - 11:45pm
ID: 315 / 3-Tut: 9
Tutorial
Topics: Big / High dimensional data, R in production

Introduction to TileDB for R

Dirk Eddelbuettel, Aaron Wolen

- Language: English

- Duration: 180 mn

- N° Participants: 200

- Level: Intermediate

TileDB is an open source universal data engine that natively supports dense and sparse multidimensional arrays, as well as data frames. Large datasets can be stored on multiple backends ranging from a local filesystem to cloud storage providers such as Amazon S3 (as well Google Cloud Storage and Azure Cloud Storage) and accessed using almost any language, including Python and R. The tutorial introduces the 'tiledb' R package on CRAN, which allows users to efficiently operate on large dense/sparse arrays using familiar R techniques and data structures. It also offers key features of the underlying TileDB Embedded library: parallelised read and write operations, multiple compression formats, time traveling (i.e., the ability to recover data stored at previous timepoints), flexible encryption, and Apache Arrow support. Several simple usage examples will be provided and you will have an opportunity to follow along on your laptops. One or two fuller usage examples from Bioinformatics will serve as a more extended case study.

We will illustrate how TileDB can be used to create a performant data store for results produced by Genome-Wide Association Studies, and demonstrate the Bioconductor package TileDBArray, which is built on top of the DelayedArray framework and has shown excellent performance relative to existing (HDF5-based) solutions. Finally, usage of TileDB with cloud storage providers will be illustrated. This covers both direct reads and writes to, for example, Amazon S3, as well as a brief illustration of the 'pay-as-you-go' Software-as-a-Service offering of TileDB Cloud with its additional features.

 
7:00am - 11:59pmTutorials - Track 4
 
7:00am - 9:00am
ID: 306 / 4-Tut: 1
Tutorial
Topics: Web Applications (Shiny/Dash)
Keywords: Shiny, Modules, Code reuse, Software engineering, Reactivity

Structure your app: introduction to Shiny modules

Jonas Hagenberg

- Language: English

- Duration: 120

- N° Participants: 25

- Level: Intermediate

You communicate your results interactively with Shiny, maintain a dashboard or provide business logic, but the codebase of your app is becoming too complex? Then modules are the right tool for you: they are Shiny's built-in solution for managing this complexity. Shiny modules allow you to break down your code into smaller building blocks that can be combined and reused.

In this tutorial I give an introduction to modules, their advantages over simple R functions, and how existing functionality can be transferred to modules.

For an easy start, I cover common pitfalls that need to be overcome for productive use of modules:

- Passing reactive objects to modules

- Returning reactive values from the module to the calling environment

- Nesting modules

- Dynamically generating modules (including UI)

The contents of the tutorial are delivered by short lectures followed by hands-on coding sessions in break-out rooms. For this, you need a basic knowledge of reactive programming/Shiny.
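A minimal sketch of the module pattern covered here (a hypothetical counter module; the names are illustrative):

```r
# A minimal sketch of a Shiny module: UI and server functions paired by a
# shared id, with inputs/outputs namespaced via NS().
library(shiny)

counterUI <- function(id, label = "Count") {
  ns <- NS(id)                       # namespace all input/output ids
  tagList(
    actionButton(ns("button"), label),
    textOutput(ns("out"))
  )
}

counterServer <- function(id) {
  moduleServer(id, function(input, output, session) {
    count <- reactiveVal(0)
    observeEvent(input$button, count(count() + 1))
    output$out <- renderText(count())
    count                            # return the reactive to the caller
  })
}

ui <- fluidPage(counterUI("counter1"))
server <- function(input, output, session) {
  counterServer("counter1")
}

# shinyApp(ui, server)
```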



9:00am - 10:30am
ID: 336 / 4-Tut: 2
Breaks

Break

useR! 2021



10:30am - 1:30pm
ID: 307 / 4-Tut: 3
Tutorial
Topics: Data mining / Machine learning / Deep Learning and AI, Interfaces with other programming languages

Getting started with torch (in French)

Sigrid Keydana, Daniel Falbel

- Language: French

- Duration: 180 mn

- N° Participants: 100

- Level: Intermediate

Torch (https://torch.mlverse.org/) is an open source machine learning framework based on PyTorch. Not requiring any Python dependencies, torch for R is at once a powerful computational engine with GPU acceleration, a neural network library, and an ecosystem providing tools for, among others, image, text, and audio processing. This tutorial will provide a thorough introduction to torch basics: tensors, automatic differentiation, and neural network modules. Thereafter, we delve into two areas of special interest to R users: time series forecasting and numerical optimization. All sections will include time slots for practice.

Training materials will be available in an English version as well. Participants not speaking French, but who would like to join the training anyway, are welcome to ask questions in English in the chat.
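A minimal sketch of the basics the first part covers (assuming the CRAN {torch} package): tensors and automatic differentiation for a single gradient step.

```r
# A minimal sketch with the {torch} package: tensors and autograd.
library(torch)

x      <- torch_randn(100, 3)
w_true <- torch_tensor(c(1.5, -2, 0.5))
y      <- x$matmul(w_true) + torch_randn(100) * 0.1

w    <- torch_zeros(3, requires_grad = TRUE)
loss <- torch_mean((x$matmul(w) - y)^2)
loss$backward()                 # populates w$grad

with_no_grad({
  w$sub_(0.1 * w$grad)          # one gradient-descent step, in place
})
w
```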



1:30pm - 1:45pm
ID: 337 / 4-Tut: 4
Breaks

Break

useR! 2021



1:45pm - 2:45pm
ID: 308 / 4-Tut: 5
Tutorial
Topics: Data mining / Machine learning / Deep Learning and AI

Pingüinos en caja: tutorial interactivo de ciencia de datos con pingüinos - Español

Maria Dermit, Susana Escobar

- Language: Spanish

- Duration: 60 mn

- N° Participants: 100

- Level: Intermediate

Pingüinos en Caja (Penguins in a Box) is a learnr package that covers the topics of the book R for Data Science, using the well-known penguins dataset to explore the book's concepts.

The package currently contains a tutorial for each chapter of the book and will be presented during the workshop. In addition, attendees will work in small-group breakout rooms on modules split by the main sections of the book (for example, Explore, Wrangle, Program, Model, and Communicate; 6 sections in total), according to their learning goals.

The target audience for this tutorial is students who want to improve their data science skills interactively and teachers who want access to additional learning resources similar to the RStudio Primers (https://rstudio.cloud/learn/primers). The tutorial is designed to be interactive, and peer instruction among attendees will be used to guide learning in the breakout rooms.
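A minimal sketch of how learnr tutorials bundled in a package are launched (shown here with learnr's own built-in "hello" tutorial; the workshop's package and tutorial names would differ):

```r
# A minimal sketch with the CRAN {learnr} package: list and launch a tutorial
# shipped inside a package.
library(learnr)

available_tutorials(package = "learnr")        # what tutorials does a package ship?
run_tutorial("hello", package = "learnr")      # launch one by name
```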



2:45pm - 3:00pm
ID: 338 / 4-Tut: 6
Breaks

Break

useR! 2021



3:00pm - 6:00pm
ID: 309 / 4-Tut: 7
Tutorial
Topics: Big / High dimensional data, R in production, Web Applications (Shiny/Dash)

Data Pipelines at scale with R and Kubernetes - Spanish

Frans van Dunné

- Language: Spanish

- Duration: 180

- N° Participants: 40

- Level: Advanced

"Many R users are confronted with larger and larger amounts of data that need to be processed. In this tutorial we will show you how to go to the next level by massively parallelizing your R code on a Kubernetes cluster. We will show you how to move your entire data pipeline to Kubernetes where each node in the pipeline consists of a container running R code. These containers can run with multiple cores, and then farmed out to tens or hundreds of these containers running in parallel.

Our experience has shown that this allows for massive speed gains, at relatively low cost when the kubernetes cluster is populated with ephemeral virtual machines (e.g. preemptible VM's on GCP - Spot instances on AWS). You need to have an interest in the more technical aspects of running R code, but only to a degree. We hope to dispel any fear that you might have that setting up a cluster is something that is very difficult. A key tool we will introduce is a tool to create data pipelines on kubernetes called Pachyderm (the open source version). The tutorial will be a combination of theory, break outs to run things hands on, regrouping to talk about experiences and then taking the next step. We will set up code examples in steps, so that if

one step di not work out, after regrouping the group can take off from the next starting point.
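A minimal sketch of what one containerised pipeline step could look like in R (assuming a Pachyderm-style convention where inputs are mounted under /pfs/&lt;repo&gt; and outputs go to /pfs/out; the repo name "input" is hypothetical):

```r
# A minimal sketch of an R script running as one pipeline step. Assumptions:
# a Pachyderm-style layout with input data mounted at /pfs/<repo> and results
# written to /pfs/out; the repo name 'input' is hypothetical.
input_dir  <- "/pfs/input"
output_dir <- "/pfs/out"

files <- list.files(input_dir, pattern = "\\.csv$", full.names = TRUE)

for (f in files) {
  dat <- read.csv(f)
  # placeholder computation: means of the numeric columns
  num <- dat[vapply(dat, is.numeric, logical(1))]
  out <- as.data.frame(t(colMeans(num, na.rm = TRUE)))
  write.csv(out, file.path(output_dir, paste0("summary_", basename(f))),
            row.names = FALSE)
}
```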



6:00pm - 6:15pm
ID: 339 / 4-Tut: 8
Breaks

Break

useR! 2021



6:15pm - 9:15pm
ID: 310 / 4-Tut: 9
Tutorial
Topics: Spatial analysis, Environmental sciences, Data visualisation

Datos espaciales a lo tidy - Español

Elio Campitelli, Paola Corrales

- Language: Spanish

- Duration: 180

- N° Participants: 40

- Level: Intermediate

In this tutorial you will learn how to download, read, analyse, and visualise gridded spatial data in R using tidy data. It will be a hands-on tutorial with live coding and exercises, built around the idea that you can use the data to answer your own questions by writing your own code.

By the end of the workshop you will have learned how to:

- download meteorological and climate data programmatically from R,

- read them into a tidy format,

- compute spatial and temporal statistics,

- plot the results using ggplot2 and its extensions.
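A minimal sketch of the tidy approach to gridded data (assuming the CRAN metR, dplyr and ggplot2 packages; the NetCDF file name and the variable/dimension names below are hypothetical):

```r
# A minimal sketch: read a gridded NetCDF file into a tidy table with {metR},
# then summarise and plot it. "temperature.nc", the variable "air" and the
# dimension names lon/lat are hypothetical and depend on the actual file.
library(metR)
library(dplyr)
library(ggplot2)

air <- ReadNetCDF("temperature.nc", vars = "air")

climatology <- air %>%
  group_by(lon, lat) %>%
  summarise(air_mean = mean(air, na.rm = TRUE), .groups = "drop")

ggplot(climatology, aes(lon, lat, fill = air_mean)) +
  geom_raster()
```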

 
Date: Thursday, 08/July/2021
12:30am - 1:30amKeynote: Expanding the Vocabulary of R Graphics
Virtual location: The Lounge #key_murrell
Session Chair: Joyce Robbins
Zoom Host: Olgun Aydin
Replacement Zoom Host: Jyoti Bhogal
 
ID: 345 / [Single Presentation of ID 345]: 1
Keynote Talk
Topics: Data visualisation

Paul Murrell

The University of Auckland, New Zealand

At the heart of the R Graphics system lies a graphics engine. This defines a graphics vocabulary for R - a set of possible graphics operations like drawing a line, colouring in a polygon, or setting a clipping region. Graphics packages like 'ggplot2' allow users to describe a plot in terms of high-level concepts like geoms, scales, and aesthetics, but that high-level description has to be reduced to a set of graphics operations that the graphics engine can understand.

Unfortunately, the R graphics engine has a limited vocabulary. It can only draw simple shapes, it can only fill regions with solid colour, and it can only set rectangular clipping regions. This makes it hard (or impossible) for packages like 'ggplot2' to produce some types of graphical output because the graphics engine does not support several fundamental graphical operations.

This talk will describe recent work on the graphics engine that expands its vocabulary to include gradient fills, pattern fills, clipping paths, and masks.
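A minimal sketch of one of the new capabilities (requires R >= 4.1.0, where the graphics engine gained gradient fills, and a graphics device that supports them; grid is used directly here rather than a high-level package):

```r
# A minimal sketch: fill a grid rectangle with a linear gradient
# (new graphics engine feature in R >= 4.1.0).
library(grid)

grad <- linearGradient(c("steelblue", "white"))   # two-stop linear gradient
grid.newpage()
grid.rect(width = 0.8, height = 0.5, gp = gpar(fill = grad))
```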

 
1:30am - 1:45amBreak
Virtual location: The Lounge #lobby
1:45am - 3:15am5A - Teaching R and R in Teaching
Virtual location: The Lounge #talk_teaching
Session Chair: Karthik Raman
Zoom Host: Adrian Maga
Replacement Zoom Host: Jyoti Bhogal
Session Sponsor: Appsilon Session Slides
 
1:45am - 2:05am
Talk-Video
ID: 149 / ses-05-A: 1
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: ecology

Developing a datasets based R package to teach environmental data science

Allison Horst1, Julien Brun2

1UC Santa Barbara; 2NCEAS

There are many openly available environmental datasets out there. However, it is time- and energy-consuming for teachers to identify, explore and clean complex datasets for use in environmental data science classes. As the success (>60k downloads) of the recent palmerpenguins R package demonstrates, there is strong demand and interest in curated real-world datasets ready to be used “out of the box” for data science teaching purposes. In this project, our goal was to develop a sample dataset and an associated analytical example for every site of the Long Term Ecological Research (LTER) network. This network, founded by the US National Science Foundation, comprises 30 sites where both observational and experimental environmental data sets are collected with a long-term perspective, and thus provides a treasure trove of interesting, real-world environmental data. All of those resources have been combined into one R package. R packages are an ideal vehicle for teaching datasets because R is widely used in environmental research communities and degree programs, and packages can be installed in one command. In addition, the R Markdown ecosystem provides a suite of tools to publish the documentation and examples as a website to expose all the pedagogic content to non-R users as well. We relied on the package structure to develop a reproducible workflow to ingest and document the LTER data. We also wanted to share the code necessary to access the full dataset to enable further investigation of more complex datasets. In this presentation, we will explain our process to design this R package and provide a set of analysis examples for environmental data science teaching purposes.



2:05am - 2:25am
Talk-Video
ID: 248 / ses-05-A: 2
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: community, outreach, rmarkdown

Using R as a Community Workbench for The Carpentries Lesson Infrastructure

Zhian N. Kamvar, François Michonneau

The Carpentries, United States of America

The Carpentries is a global community of volunteers that collaboratively develops and delivers lessons to build capacity in data and coding skills (in R and multiple other languages) to researchers worldwide. For the past five years, our collaboratively-developed lesson template (https://github.com/carpentries/styles/) has been the basis for our growing collection of peer-reviewed lesson content. This template was fully self-contained with all the tools and styles needed to create a full lesson website. While the lessons themselves were designed to be easy to author, there were two significant barriers in our toolchain for contributors: software installation and style updating. As our lesson repertoire and community has continued to grow, this template model has not scaled well, resulting in barriers to entry and wasted volunteer time. In 2020 we began the process to redesign our template from the ground up using a combination of R’s literate programming ecosystem and GitHub Workflows, resulting in three R packages called {sandpaper}, {pegboard}, and {varnish} for handling, validating, and styling lessons. The new approach separates the content from the tools and style, allowing for seamless updates so the maintainers can focus on authoring their lessons and not on the tools needed to build them. To accommodate the wide array of diverse skill sets in our community, we wanted to ensure the tools could be used by anyone without any prior knowledge of R. We will detail how we involved our community in iterated development of the new template with user stories, passive community feedback, community member interviews, and user experience testing. In the end, we will show how the wide array of tools available in the R ecosystem makes it easy for us to rebuild our lesson infrastructure in a way that significantly reduces the barrier for entry for our community volunteers.



2:25am - 2:45am
Talk-Video
ID: 134 / ses-05-A: 3
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: Bayesian analysis

Teaching and Learning Bayesian Statistics with {bayesrules}

Mine Dogucu1, Alicia A. Johnson2, Miles Ott3

1University of California, Irvine; 2Macalester College; 3Smith College

Bayesian statistics is becoming more popular in data science. Data scientists are often not trained in Bayesian statistics and if they are, it is usually part of their graduate training. During this talk, we will introduce an introductory course in Bayesian statistics for learners at the undergraduate level and comparably trained practitioners. We will share tools for teaching (and learning) the first course in Bayesian statistics, specifically the {bayesrules} package that accompanies the open-access Bayes Rules! An Introduction to Bayesian Modeling with R book. We will provide an outline of the curriculum and examples for novice learners and their instructors.

Link to package or code repository.
https://github.com/mdogucu/bayesrules


2:45am - 3:05am
Talk-Live
ID: 246 / ses-05-A: 4
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: textbook, open-source, non-profit, bookdown, continuous integration

Building and maintaining OpenIntro using the R ecosystem

Mine Cetinkaya-Rundel

Duke University, RStudio, United States of America

OpenIntro's (openintro.org) mission is to make educational products that are free and transparent and that lower barriers to education. The products include textbooks (in print and online), supporting resources for instructors as well as for students. From day one, OpenIntro materials have been built using tools within the R ecosystem. In this talk we will discuss how the OpenIntro project has shaped and grown over the years, our process for developing and publishing open-source textbooks at the high school and college level, and our computing resources such as interactive R tutorials and R packages as well as labs in various languages. We will highlight recent workflows we have developed and lessons learned for converting books from LaTeX to bookdown and give an overview of our project organization and tooling for authoring, collaboration, and maintenance, much of which is built with R, R Markdown, Git, and GitHub. Finally, we will discuss opportunities for getting involved for educators and students contributing to the development of open-source educational resources under the OpenIntro umbrella and beyond.

 
1:45am - 3:15am5B - Mathematical/Statistical Methods
Virtual location: The Lounge #talk_math_stats
Session Chair: Marcela Alfaro Cordoba
Zoom Host: Nick Spyrison
Replacement Zoom Host: Olgun Aydin
 
1:45am - 2:05am
Talk-Video
ID: 254 / ses-05-B: 1
Regular Talk
Topics: Statistical models
Keywords: Model misspecification, tidyverse, assumptions, variance estimation

maars: Tidy Inference under misspecified statistical models in R

Riccardo Fogliato, Shamindra Shrotriya, Arun Kumar Kuchibhotla

Carnegie Mellon University, United States of America

Linear regression using ordinary least squares (OLS) is a critical part of every statistician's toolkit. In R, this is elegantly implemented via lm() and its related functions. However, the statistical inference output from this suite of functions is based on the assumption that the model is well specified. This assumption is often unrealistic and at best satisfied approximately. In the statistics and econometric literature, this has long been recognized and a large body of work provides inference for OLS under more practical assumptions (e.g., only assuming independence of the observations). In this talk, we will introduce our package “maars” (models as approximations) that aims at bringing research on inference in misspecified models to R via a comprehensive workflow. Our “maars” package differs from other packages that also implement variance estimation, such as “sandwich”, in three key ways. First, all functions in “maars” follow a consistent grammar and return output in tidy format (Wickham, 2014), with minimal deviation from the typical lm() workflow. Second, “maars” contains several tools for inference including empirical, wild, residual bootstrap, and subsampling. Third, “maars” is developed with pedagogy in mind. For this, most of its functions explicitly return the assumptions under which the output is valid. This key innovation makes “maars” useful in teaching inference under misspecification and also a powerful tool for applied researchers. We hope our default feature of explicitly presenting assumptions will become a de facto standard for most statistical modeling in R.



2:05am - 2:25am
Talk-Live
ID: 285 / ses-05-B: 2
Regular Talk
Topics: Operational research and optimization
Keywords: Evolutionary Strategies, Mixed Integer Problems, Multifidelity Optimization, Black Box Optimization, Multi-Objective Optimization

Mixed Integer Evolutionary Strategies with "miesmuschel"

Martin Binder

LMU Munich, Germany

Evolutionary Strategies (ES) are optimization algorithms inspired by biological evolution that do not make use of gradient information, and are therefore well-suited for "black-box optimization" where this information is not available. Mixed-Integer ES (MIES) are an extension that allow optimization of mixed continuous, integer, and categorical search spaces by defining different mutation and recombination operations on different subspaces. We present our new package "miesmuschel" (pronounced MEES-mooshl), a modular toolbox for MIES optimization. It provides "Operator" objects for mutation, recombination, and parent/survival selection that can be configured and combined in various ways to match the optimization problem at hand. Configuration parameters of operators can even be self-adaptive and evolve together with the solutions of the optimization problem. Miesmuschel can be used for both single- and multi-objective optimization, simply by using different selection operations. The multi-fidelity optimization capabilities of miesmuschel can be used for expensive objectives where early generations or new samples are preliminarily evaluated with less effort.

A standard optimization loop (parent selection, recombination, mutation, survival selection) is given and can be used out-of-the-box, but the supplied methods can also be combined as building blocks to form more specialized algorithms.

Miesmuschel makes use of the "paradox" and "bbotk" packages and integrates with the "mlr3" ecosystem.

Link to package or code repository.
https://github.com/mlr-org/miesmuschel


2:25am - 2:45am
Talk-Video
ID: 110 / ses-05-B: 3
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: anomaly detection

Here is the anomalow-down!

Sevvandi Kandanaarachchi1, Rob J Hyndman2

1RMIT University; 2Monash University

Why should we care about anomalies? They demand our attention because they are telling a different story from the norm. An anomaly might signify a failing heart rate of a patient, a fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies.

What are the challenges in anomaly detection? As with many machine/statistical learning tasks high dimensional data poses a problem. Another challenge is selecting appropriate parameters. Yet another challenge is high false positive rates.

In this talk we introduce two R packages – dobin and lookout – that address different challenges in anomaly detection. Dobin is a dimension reduction technique especially catered to anomaly detection. So, dobin is somewhat similar to PCA; but dobin puts anomalies in the forefront. We can use dobin as a pre-processing step and find anomalies using fewer dimensions.

On the other hand, lookout is an anomaly detection method that uses kernel density estimates and extreme value theory. But there is a difference. Generally, anomaly detection methods that use kernel density estimates require a user-defined bandwidth parameter. But does the user know how to specify this elusive bandwidth parameter? Lookout addresses this challenge by constructing an appropriate bandwidth for anomaly detection using topological data analysis, so the user doesn’t need to specify a bandwidth parameter. Furthermore, lookout has a low false positive rate because it uses extreme value theory.

We also introduce the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly is high.
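A minimal sketch of how the two packages might be combined (assuming the CRAN 'dobin' and 'lookout' packages; the component names `coords` and `outliers` follow their documentation and may differ by version):

```r
# A minimal sketch: reduce dimensions with dobin, then flag anomalies with
# lookout. Component names may differ across package versions.
library(dobin)
library(lookout)

set.seed(1)
X <- rbind(matrix(rnorm(500 * 5), ncol = 5),
           matrix(rnorm(5 * 5, mean = 6), ncol = 5))   # 5 injected outliers

red <- dobin(X)            # anomaly-oriented dimension reduction
head(red$coords[, 1:2])    # first two anomaly-preserving coordinates

out <- lookout(X)          # KDE + extreme value theory, no bandwidth to tune
out$outliers               # observations flagged as anomalies
```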

 
3:15am - 3:30amBreak
Virtual location: The Lounge #lobby
3:30am - 4:30amIncubator: Strategies to build a strong AsiaR Community
Virtual location: The Lounge #incubator_asiar
Session Chair: Janani Ravi
Session Chair: Adithi R. Upadhya
Zoom Host: Jyoti Bhogal
Replacement Zoom Host: Olgun Aydin
 
ID: 342 / [Single Presentation of ID 342]: 1
Incubator
Topics: Community and Outreach
Keywords: Community building, R in Asia, Education and outreach, Diversity

Adithi R. Upadhya1, Dr. Janani Ravi2

1ILK Labs, India; 2Michigan State University, United States of America

R has been a very inclusive community and collective learning has always helped; with many R users in Asian countries, we can likewise have a strongly knit community for R users. Inspired by the MENA (Middle East and North Africa) R, AfricaR, and LatinR user groups, we propose a similar panel discussion to connect and strengthen the R community in Asia. We aim to target participants who are active R users and/or learners who have not been engaged with any R community, and we want to invite panelists who are successful R developers, educators and community leaders in various Asian countries. Our goal is to build a diverse and vibrant R community within Asia. We wish to connect Asian useRs to each other, identify Asian R speakers and participants, and facilitate regular webinars and workshops. We want to address the lower participation of Asians, especially Asian underrepresented minorities, in local meetups and international conferences like useR! 2021, and discuss and learn about best practices for nucleating and sustaining an engaged community. We also would like to understand how people from various backgrounds and organisations engage the community for assistance. Finally, we wish to build a community strong enough to host an AsiaR conference in the upcoming years.

 
4:30am - 5:30amaRt Gallery
Virtual location: The Lounge #announcements
Session Chair: Sara Mortara
Session Chair: Marcela Alfaro Cordoba
Meet some aRtists!
5:30am - 7:00am6A - Data visualisation
Virtual location: The Lounge #talk_dataviz
Session Chair: Praveena Mathews
Zoom Host: Olgun Aydin
Replacement Zoom Host: Adrian Maga
Session Sponsor: R Studio Session Slides
 
5:30am - 5:50am
Talk-Video
ID: 172 / ses-06-A: 1
Regular Talk
Topics: Data visualisation
Keywords: ggplot2

Easy R Markdown reporting with chronicle

Philippe Heymans Smith

NA

The chronicle package aims to ease the process of making R Markdown reports for R practitioners. With chronicle, the user is only required to provide the data and structure of the report, and chronicle will write the corresponding R Markdown file on behalf of the user. This means that the user can take the role of a director of the report, focusing on its content and structure, while delegating all the intricacies of visual consistency and interactivity to the package.

chronicle currently supports 16 of the most popular R Markdown output formats, and lets the user add each element of a report in an additive paradigm inspired by ggplot.

Link to package or code repository.
https://github.com/pheymanss/chronicle


5:50am - 6:10am
Talk-Video
ID: 156 / ses-06-A: 2
Regular Talk
Topics: Data visualisation
Keywords: exploratory data analysis

virgo: a layered interactive grammar of graphics in R

Stuart Lee1, Earo Wang2

1Monash University; 2University of Auckland

The virgo package enables interactive graphics for exploratory data analysis (EDA). Like ggplot2, our package takes a grammar based approach, that is, variables are mapped to visual encodings and plots are built layer by layer with marks. However, unlike ggplot2, the virgo package incorporates interactivity directly into its design by extending the Vega-Lite Javascript library and the vegawidget R package.

Users can easily initialize "selection" objects to specify client-side events like brushing or clicking. Once a "selection" object is specified, it can be used in two different ways. First, a "selection" can be broadcast to modify an encoding channel - for example, points being colored after a selection event has happened. Second, the data in a visual layer can react to a "selection" - for example, computing a mean on the fly given the occurrence of a selection event. Through composing multiple selection objects we can achieve rich interactivity.

In this talk, we will discuss the motivations behind the virgo package and grammar. We will demonstrate how virgo seamlessly integrates into existing EDA workflows through a case study. The virgo package is available online at https://vegawidget.github.io/virgo.

Link to package or code repository.
https://github.com/vegawidget/virgo


6:10am - 6:30am
Talk-Live
ID: 155 / ses-06-A: 3
Regular Talk
Topics: Data visualisation
Keywords: dynamic graphics

New displays for the visualization of multivariate data in the tourr package

Ursula Laa

University of Natural Resources and Life Sciences, Vienna

Tour methods allow the visualization of multi-dimensional structures as animated sequences of interpolated projections. The viewer can extrapolate from the observed low-dimensional shapes, to build intuition about the high-dimensional distribution. These methods are available in the tourr package (Wickham et al., 2011), including a range of display functions. The package is on CRAN, see https://CRAN.R-project.org/package=tourr. The traditional displays are however limited in the case of large data: in scenarios with many observations, overplotting will often hide features, while a large number of variables typically leads to piling of the observations near the center of a projection.

In this talk I will introduce new tourr displays that can address these issues. The slice tour (Laa et al., 2020) shows sections of the data, alleviating overplotting issues and potentially revealing concave structures not visible in projections; the sage display (Laa et al., under review, arXiv:2009.10979 ) redistributes the projected data points to reverse piling effects. After introducing the new displays I will briefly describe the implementation in R and show examples that illustrate the advantages of the new approaches.

Link to package or code repository.
https://github.com/ggobi/tourr
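A minimal sketch (using the CRAN tourr package and its bundled flea data) of the classic projection display alongside the newer slice display:

```r
# A minimal sketch with the CRAN {tourr} package and its bundled 'flea' data.
library(tourr)

animate_xy(flea[, 1:6])      # classic 2-d projection grand tour
animate_slice(flea[, 1:6])   # slice tour: thin sections through the data
```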


6:30am - 6:50am
Talk-Live
ID: 269 / ses-06-A: 4
Regular Talk
Topics: Data visualisation
Keywords: AR, VR, 3D

plotVR - walk through your data

Philipp Thomann

D ONE, Switzerland

Are you bored by 3D-plots that only give you a simple rotatable 2d-projection? plotVR is an open source package that provides a simple way for data scientists to plot data, pick up a phone, get a real 3d impression - either by VR or by AR - and use the computer's keyboard to walk through the scatter plot:

https://www.github.com/thomann/plotVR

After installing the package and plotting your data frame, scan the QR code on your phone (iOS or Android) and start walking, either directly in the web browser on recent phones or using an iOS app (an Android app is in preparation).

Once you are immersed in your Cardboard, how do you navigate through the scatter? plotVR lets you use the computer's keyboard to walk as you would in any first-person game.

You want to share your impression? Just use the generated USD (iOS) or gltf (Android) files!

The technologies behind this project are: a web server that handles the communication between the data-science session and the phone, WebSockets to quickly proxy the keyboard events, QR codes to facilitate simple pairing of the two, and an HTML page on the computer to grab the keyboard events. Translating these keyboard events into 3D terms is a nice exercise in three.js, OpenGL, and SceneKit for HTML, Android, and iOS respectively. For an in-browser AR experience the package generates USD and GLTF formats.

Ready to see your data as you have never seen it before? Join the talk!

Link to package or code repository.
https://github.com/thomann/plotVR
 
5:30am - 7:00am6B - R in Production 1
Virtual location: The Lounge #talk_r_production_1
Session Chair: Emi Tanaka
Zoom Host: Jyoti Bhogal
Replacement Zoom Host: Nick Spyrison
Session Sponsored by: cynkra Session Slides
 
5:30am - 5:50am
sponsored-live
ID: 350 / ses-06-B: 1
Sponsored Talk
Topics: Big / High dimensional data
Keywords: memory, big data

Big Memory for R

Jingchao Sun1, Chris Kang3, Austin Gutierrez2

1MemVerge; 2The Translational Genomics Research Institute (TGen); 3Analytical Biosciences

As we are stepping into the big data era, R and programs written in R are facing various new challenges. First, the data to be processed is growing exponentially and thus results in large memory usage when running R programs. Memory is becoming one of the bottlenecks for large data processing. Second, large data processing dramatically increases the processing time and the risk of program crashes. Scientists or developers might lose hours or days due to a program crash with no chance to save the data. Third, iterative analysis for large data is a pain point due to the long data loading time from disk. Fourth, a large amount of legacy R code does not support multi-threading which was recently introduced to R. This leads to the sequential processing of R code and wastes the CPU's multi-core capability.

To tackle these challenges, MemVerge developed Memory Machine software which supports Intel Optane Persistent Memory. With the help of Memory Machine, R programs can use up to 8 TB of memory on a single server with a 30-50% cost saving. R users can take snapshots of their workload at any time within 1 second to get data persistence without writing data to disk, and restore the workload within 1 second without loading data from disk. Moreover, with the help of instant restore, R users can easily try different parameters multiple times for their workloads. Memory Machine can also restore the workload into different namespaces to enable parallel processing and greatly reduce program execution time.

This talk will provide an overview of Big Memory Computing consisting of Intel Optane Persistent Memory and memory virtualization software working together. R users from Analytical Biosciences and The Translational Genomics Research Institute (TGen) will provide overviews of their implementations of Big Memory.



5:50am - 6:10am
Talk-Live
ID: 282 / ses-06-B: 2
Regular Talk
Topics: R in production
Keywords: DevOps, Agile, Production, Docker

Bridging the Unproductive Valley: Building Data Products Strictly Without Magic

Maximilian Held, Najko Jahn

State- and University Library Goettingen, Germany

Between GUI-based reports and scripted data science lies an unproductive valley that combines the worst of both worlds: poor scalability *and* high overhead.

To avoid getting stuck there, small and medium-sized teams must 1) build strategic data products (not one-off scripts), 2) adopt software development best practices (not hacks) and 3) concentrate on business value (not infrastructure).

1) Strategic data products focus on the ETL pipelines, common visualisations and other modules that are central to the mission. These unix-style building blocks can then be recombined into various reports.

2) These modules are designed "as-if-for-CRAN" and written as type/length-stable, unit-tested and exported functions.

3) If something is not related to our mission, we rely on industry standards (Docker) and CaaS/DBaaS (Azure, GCP).

{muggle}'s opinionated DevOps provides some technical scaffolding to help with this transition.

It standardises the compute environment in development, testing and deployment on a multi-stage `Dockerfile` with `ONBUILD` triggers for lightweight target images and leverages public cloud services (RSPM, GitHub Actions, GitHub Packages).

In contrast to some existing approaches, {muggle} never infers developer intent and has a minimal git footprint.

Success also requires a cultural shift. Development may still be agile, but it must not build prototype code. Fancy plots and reports are good, but reproducibility is more important.

We believe this is a necessary change to ensure value generation, and thereby, to ensure the future of democratic, and open-source data science.

Link to package or code repository.
https://subugoe.github.io/muggle


6:10am - 6:30am
Talk-Video
ID: 237 / ses-06-B: 3
Regular Talk
Topics: R in production
Keywords: API

Data science serverless-style with R and OpenFaaS

Peter Solymos

Analythium Solutions Inc.

R is well suited for data science due to its diverse tooling and its ability to leverage and integrate with other languages and solutions. In production, R is often just a piece of a much larger puzzle providing API endpoints via e.g. plumber, RestRServe, or a similar web framework. Managing many API endpoints can lead to problems due to shifting dependency requirements or more recent additions breaking older code. The common solution is to use Docker containers to provide isolation to these components. However, managing containers at scale is not trivial, and managing serverless infrastructure is often outsourced to public cloud providers. Providers differ in their approaches, leading to independent integrations of R and repeated efforts. The OpenFaaS project was born to mitigate these problems and to avoid vendor lock-in. OpenFaaS is an open-source framework to deploy functions and microservices anywhere (local cluster, public cloud, edge devices) and at any scale (including 0), with an emphasis on Kubernetes. It provides auto-scaling, metrics, API gateway, and is language-agnostic. In this talk, I introduce R templates for OpenFaaS. The templates support different Docker R base images (Debian, Ubuntu, Alpine Linux) and 6 different frameworks, including plumber. I explain the development life cycle with OpenFaaS using an example cloud function for time series forecasting on daily updated epidemiological data. I end with a review of production use cases where R can truly shine in the multilingual serverless landscape.

Link to package or code repository.
https://github.com/analythium/openfaas-rstats-templates
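The talk centres on wrapping R web frameworks such as plumber in OpenFaaS functions; a minimal sketch of the kind of endpoint such a template might serve (the route and parameters are hypothetical, not taken from the templates):

```r
# plumber.R -- a minimal sketch of an endpoint an R serverless template could
# wrap; the route and parameters are hypothetical.
library(plumber)

#* Return a toy forecast: repeat the mean of the supplied values h times
#* @param y Comma-separated numeric values
#* @param h:int Forecast horizon
#* @get /forecast
function(y = "", h = 7) {
  vals <- as.numeric(strsplit(y, ",")[[1]])
  list(mean = mean(vals), forecast = rep(mean(vals), as.integer(h)))
}

# To serve locally: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
```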
 
7:00am - 7:15amBreak
Virtual location: The Lounge #lobby
7:15am - 8:15amKeynote: Research software engineers and academia
Virtual location: The Lounge #key_seibold
Session Chair: Dorothea Hug Peter
Zoom Host: Adrian Maga
Replacement Zoom Host: Nick Spyrison
 
ID: 351 / [Single Presentation of ID 351]: 1
Keynote Talk
Topics: R in production

Heidi Seibold

Johner Institut, Germany

Academia is a strange place. On the one hand it is a hotbed of innovations, on the other hand it is a frustratingly lethargic system. The movement of Research Software Engineers (RSEs) shows this really well as nearly all research relies on research software, yet we are still lacking adequate acknowledgment let alone career paths for RSEs. In this talk I want to discuss the status quo and future of software in research, the role of the R community, and also what it has to do with my personal path.

 
8:15am - 9:15amRechaRge 3
Virtual location: The Lounge #announcements
Session Chair: Marcela Alfaro Cordoba
Yoga for the Spine + Stretching
9:15am - 10:45am7A - Ecology and Environmental Sciences
Virtual location: The Lounge #talk_ecology_environment
Session Chair: Ulfah Mardhiah
Zoom Host: Adrian Maga
Replacement Zoom Host: Dorothea Hug Peter
 
9:15am - 9:35am
Talk-Live
ID: 147 / ses-07-A: 1
Regular Talk
Topics: Environmental sciences
Keywords: big data

startR: A tool for large multi-dimensional data processing

An-Chi Ho, Núria Pérez-Zanón, Nicolau Manubens, Francesco Benincasa, Pierre-Antoine Bretonnière

Barcelona Supercomputing Center (BSC-CNS)

Nowadays, the growing data volume and variety in various scientific domains have made data analysis challenging. Simple operations like extracting data from storage and performing statistical analysis on them have to be rethought. startR is an R package developed at the Earth Science Department in Barcelona Supercomputing Center (BSC-CNS) that allows users to retrieve, arrange, and process large multi-dimensional datasets automatically with a concise workflow.

startR provides a framework under which the datasets to be processed can be perceived as a single multi-dimensional array. The array is first declared, then a user-defined function can be applied to the relevant dimensions in an apply-like fashion, building up a declarative workflow that can be executed in various computing platforms. During execution, startR implements the MapReduce paradigm, chunking the data and processing them either locally or remotely on high-performance computing systems, leveraging multi-node and multi-core parallelism where possible. Besides the data, metadata are also well-preserved and expanded with the operation information, ensuring the reproducibility of the analysis.

Several functionalities in startR, like spatial interpolation and time manipulation, are tailored for atmospheric sciences research such as climate, weather, and air quality. It is compatible with other R tools developed at BSC-CNS, forming a strong toolset for climate research. However, it can potentially serve other research fields as well. Even though netCDF is the only data format supported in the current release, adaptors for other file formats can be plugged in, enabling the tool to be exploited in different scientific domains where large multi-dimensional data is involved.

Link to package or code repository.
https://earth.bsc.es/gitlab/es/startR


9:35am - 9:55am
Talk-Live
ID: 193 / ses-07-A: 2
Regular Talk
Topics: Environmental sciences
Keywords: hydrology, river hydrograph, hydrograph separation, climate change, spatial analysis

grwat: a new R package for automated separation and analysis of river hydrograph

Timofey Samsonov1, Ekaterina Rets2, Maria Kireeva1

1Faculty of Geography, Lomonosov Moscow State University, Russian Federation; 2Institute of Water Problems, Russian Academy of Science, Russian Federation

`grwat` is a new R package aimed at analysis of the river hydrograph — a time series of river discharge values. The overall shape of a hydrograph is specific to each river and is heavily influenced by climatic conditions within a river basin. Since the climate is changing, the shape of a typical hydrograph for each river is also transformed. The main goal of the grwat package is to provide automated tools to extract the genetic components of river discharge (e.g. how much discharge is due to thaws, floods etc.) as well as graphical and statistical tools to reveal interannual and long-term changes of these components. The core procedure which allows extraction of genetic components is separation. The implementation of separation in `grwat` is two-stage. First, it follows the generally accepted approach of separating the discharge into quick flow and baseflow. Second, it involves the temperature and precipitation time series to separate the quick flow into seasonal (snowmelting), thaw and flood-induced discharge using an originally developed algorithm. The separation is programmed in pure STL C++17 and then interfaced into `grwat` via Rcpp. The separated hydrograph is represented as a data frame where, for each observation, the input total discharge is distributed between several columns, each representing a genetic component. Such a data frame can be further analyzed with `grwat`, resulting in more than 30 interannual and long-term statistically tested variables characterizing the aggregated values, dates and durations of specific events and periods. Examples are seasonal flood runoff, annual groundwater discharge, number of thaw days, and the date of seasonal flood beginning. Finally, `grwat` contains convenient functions to quickly visualize one or more variables using ggplot2 graphics, and to generate high-quality R Markdown-based HTML reports which combine graphics and results of statistical tests for all computed variables. Development is funded by the Russian Science Foundation (Project 19-77-10032).

Link to package or code repository.
https://tsamsonov.github.io/grwat/


9:55am - 10:15am
Talk-Video
ID: 190 / ses-07-A: 3
Regular Talk
Topics: Ecology
Keywords: agent-based modelling, animal, R6, simulation, OOP

Using R6 object-oriented programming to build agent-based models

Liam Daniel Bailey, Alexandre Courtiol

IZW Berlin, Germany

Agent or individual-based modelling is an invaluable tool in the biological sciences, used to understand complex topics such as conservation management, invasive species, and animal population dynamics. However, while R is one of the most common programming languages used in the biological sciences it is often considered 'unsuitable' for agent-based modelling tasks, with other tools such as NetLogo, Java, and C++ utilized instead. Here, we introduce how the package R6 can be used to build agent-based models and simulate complex population and evolutionary dynamics in R. R6 offers the possibility to easily define classes with encapsulated methods. It has become the package of choice behind many well-known R packages that use encapsulated object-oriented programming (e.g. shiny, dplyr, testthat). Yet, while simulations have been built in R using other class systems such as S3 and S4, the potential of R6 to perform such tasks remains untapped. We provide a real-world example from our research on the large African carnivore, the spotted hyena. Object-oriented programming using R6 was easy to learn and implement, and working in R allowed us to quickly build, document, and unit test our code by taking advantage of existing tools in R/RStudio with which we were already familiar (e.g. RStudio projects, roxygen2, testthat). Implementing agent-based modelling in R will allow ecologists to easily make use of this powerful tool in their research. Researchers will not be required to learn any new programming languages but can instead implement agent-based models in the same language they already use for data wrangling, statistical analysis, and data visualisation.
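A minimal sketch of the approach (an R6 class with encapsulated fields and methods; the agent's fields and behaviour here are hypothetical, not the hyena model itself):

```r
# A minimal sketch of an R6 "agent" class of the kind such models build on;
# the fields and behaviour are hypothetical.
library(R6)

Animal <- R6Class("Animal",
  public = list(
    age = NULL,
    alive = TRUE,
    initialize = function(age = 0) {
      self$age <- age
    },
    step = function(survival_prob = 0.9) {
      # one time step: the agent ages and possibly dies
      self$age <- self$age + 1
      self$alive <- runif(1) < survival_prob
      invisible(self)
    }
  )
)

pop <- lapply(1:100, function(i) Animal$new(age = sample(0:5, 1)))
invisible(lapply(pop, function(a) a$step()))
mean(vapply(pop, function(a) a$alive, logical(1)))   # survival this step
```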



10:15am - 10:35am
Talk-Live
ID: 171 / ses-07-A: 4
Regular Talk
Topics: Environmental sciences
Keywords: data processing

Climate Forecast Analysis Tools Framework: from the storage to the HPC to get reproducible climate research results and services

Núria Pérez-Zanón1, An-Chi Ho1, Francesco Benincasa1, Pierre-Antoine Bretonnière1, Louis-Philippe Caron2, Chihchung Chou1, Carlos Delgado-Torres1, Llorenç Lledó1, Nicolau Manubens3, Lluís Palma1

1Barcelona Supercomputing Center (BSC); 2Ouranos; 3NA

Climate forecast researchers need to assess the quality of their forecasts by comparing them against reference observation datasets using state-of-the-art verification metrics. This procedure requires reading in the seasonal forecasts and reference data and restructuring them for later comparison (e.g.: regridding, resampling or reordering). Only then, statistical methods can be applied to assess forecast skill and, finally, tailored visualization tools are employed to explore the results.

At the Earth Sciences department of the Barcelona Supercomputing Center, the expertise in seasonal forecast research has traditionally been compiled in the s2dverification R package since its first release in 2009. The package provides tools implementing all the steps required for the procedure described above, allowing researchers to share their methods while reducing development and maintenance cost. However, as the department broadened its activity to include research on sub-seasonal forecast, decadal prediction and climate projections, as well as development of climate services for various stakeholders, new state-of-the-art tools to manipulate climate data became necessary.

As a result, the department is currently maintaining eight R packages. These packages can be used separately or in their common framework, and include methods for calibration, downscaling and combination in the CSTools package, climate indicators in ClimProjDiags, and CSIndicators -among other climatological methods- in s2dv (s2dverification’s successor). The framework has been designed to be flexible and efficient. The Big Data issue inherent to climate data analysis is addressed by employing the startR and multiApply packages to seamlessly enable chunked multi-core processing, optionally leveraging multi-node parallelism in HPC platforms.

 
9:15am - 10:45am7B: Statistical modeling in R
Virtual location: The Lounge #talk_stats
Session Chair: Sevvandi Kandanaarachchi
Zoom Host: Nick Spyrison
Replacement Zoom Host: Olgun Aydin
 
9:15am - 9:35am
Talk-Video
ID: 207 / ses-07-B: 1
Regular Talk
Topics: Statistical models
Keywords: distributional regression, probabilistic forecasts, regression trees, random forests, graphical model assessment

Probability Distribution Forecasts: Learning with Random Forests and Graphical Assessment

Moritz N. Lang1, Reto Stauffer1,2, Achim Zeileis1

1Department of Statistics, Faculty of Economics and Statistics, Universität Innsbruck; 2Digital Science Center, Universität Innsbruck

Forecasts in terms of entire probability distributions (often called "probabilistic forecasts" for short) - as opposed to predictions of only the mean of these distributions - are of prime importance in many different disciplines from natural sciences to social sciences and beyond. Hence, distributional regression models have been receiving increasing interest over the last decade. Here, we make contributions to two common challenges in distributional regression modeling:

1. Obtaining sufficiently flexible regression models that can capture complex patterns in a data-driven way.

2. Assessing the goodness-of-fit of distributional models both in-sample and out-of-sample using visualizations that bring out potential deficits of these models.

Regarding challenge 1, we present the R package "disttree" (Schlosser et al. 2021), that implements distributional trees and forests (Schlosser et al. 2019). These blend the recursive partitioning strategy of classical regression trees and random forests with distributional modeling. The resulting tree-based models can capture nonlinear effects and interactions and automatically select the relevant covariates that determine differences in the underlying distributional parameters.

For graphically evaluating the goodness-of-fit of the resulting probabilistic forecasts (challenge 2), the R package "topmodels" (Zeileis et al. 2021) is introduced, providing extensible probabilistic forecasting infrastructure and corresponding diagnostic graphics such as Q-Q plots of randomized residuals, PIT (probability integral transform) histograms, reliability diagrams, and rootograms. In addition to distributional trees and forests other models can be plugged into these displays, which can be rendered both in base R graphics and "ggplot2" (Wickham 2016).
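A minimal sketch of the diagnostic graphics (assuming the "topmodels" package, currently developed on R-Forge/GitHub rather than CRAN; a count GLM stands in here for a distributional tree or forest):

```r
# A minimal sketch with the 'topmodels' diagnostics on a simple count model;
# the same displays are meant to apply to other probabilistic models.
library(topmodels)

m <- glm(dist ~ speed, data = cars, family = poisson)

rootogram(m)   # observed vs expected frequencies
pithist(m)     # probability integral transform histogram
qqrplot(m)     # Q-Q plot of (randomized) quantile residuals
```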



9:35am - 9:55am
Talk-Live
ID: 178 / ses-07-B: 2
Regular Talk
Topics: Statistical models

spaMM: an R package to fit generalized, linear, and mixed models allowing for complex covariance structures

François Rousset1, Alexandre Courtiol2

1Univ. Montpellier, CNRS, Institut des Sciences de l'Evolution, Montpellier, France; 2Leibniz Institute for Zoo and Wildlife Research, Berlin

Introduced to make the fit of spatial Mixed Models accessible, the R package spaMM has grown a lot since its first CRAN release eight years ago. The package now offers the possibility to fit a variety of regression models, from simple linear models (LM) to generalised linear mixed-effects models (GLMM), including multivariate-response models, and double hierarchical GLMMs (DHGLM) in which both the mean of a response and the residual variance can be modelled as a function of fixed and random effects. The package provides a diversity of response families beyond the standard ones, such as the (truncated or not) negative binomial, and the Conway-Maxwell-Poisson, as well as non-gaussian random effect families such as the inverse gaussian. Random effects can further be modelled using several autocorrelation functions for the consideration of spatial, temporal and other forms of dependence between observations (e.g. genetic pedigrees). spaMM handles this diversity of models through a simple formula-based interface akin to glm() or lme4::glmer(). Advanced users will nonetheless appreciate the possibility to fine-tune many aspects of the fit (e.g. select among several likelihood approximations; set parameters to fixed values). The package also provides tailored methods for many generics, so that for instance anova() can be called to perform likelihood ratio tests by parametric bootstrap and AIC() computes both the marginal and conditional AIC. The package is finally competitive in terms of computational speed, for both non-spatial, geostatistical, and autoregressive models.
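A minimal sketch following the package's own documented example with its bundled blackcap data: a spatial mixed model with a Matérn-correlated random effect.

```r
# A minimal sketch with the CRAN {spaMM} package and its bundled 'blackcap'
# data: a Matérn spatial random effect over longitude/latitude.
library(spaMM)

data("blackcap")
fit <- fitme(migStatus ~ means + Matern(1 | longitude + latitude),
             data = blackcap)
summary(fit)
```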



9:55am - 10:15am
withdrawn
ID: 362 / ses-07-B: 3
Regular Talk
Topics: Statistical models
Keywords: big data

Changed to Elevator Pitch: The one-step estimation procedure in R

Alexandre Brouste1, Christophe Dutang2

1Le Mans Université; 2Université Paris-Dauphine

In finite-dimensional parameter estimation, the Le Cam one-step procedure is based on an initial guess estimator and a Fisher scoring step on the loglikelihood function. For an initial $\sqrt{n}$-consistent guess estimator, the one-step estimation procedure is asymptotically efficient. As soon as the guess estimator is in a closed form, it can also be computed faster than the maximum likelihood estimator. More recently, it has been shown that this procedure can be extended to an initial guess estimator with a slower speed of convergence. Based on this result, we propose in the OneStep package (available on CRAN) a procedure to compute the one-step estimator in any situation faster than the MLE for large datasets. Monte-Carlo simulations are carried out for several examples of statistical experiments generated by i.i.d. observation samples (discrete and continuous probability distributions). Thereby, we exhibit the performance of Le Cam’s one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size. A real application and the future package developments will also be discussed.



10:15am - 10:35am
Talk-Video
ID: 206 / ses-07-B: 4
Regular Talk
Topics: Statistical models
Keywords: probabilistic graphical models

The R Package stagedtrees for Structural Learning of Stratified Staged Trees

Federico Carli1, Manuele Leonelli2, Eva Riccomagno1, Gherardo Varando3

1Università degli Studi di Genova, Dipartimento di Matematica, Italy; 2IE University, School of Human Sciences and Technology, Spain; 3Universitat de València, Image Processing Laboratory (IPL), Spain

stagedtrees is an R package which includes several algorithms for learning the structure of staged trees and chain event graphs from categorical data. In the past twenty years there has been an explosion in the use of graphical models to represent the relationships among a vector of random variables and to perform inference taking advantage of the underlying graphical representations. Bayesian networks are nowadays one of the most used graphical models, with applications to a wide array of domains and implementations in various software packages. However, they can only represent symmetric conditional independence statements, which in practical applications may not be fully justified. Most often, the greater the number of levels of the categorical variables involved, the more difficult it is for conditional independence to hold for all the variables' levels. Therefore, models that also accommodate asymmetric relations such as context-specific, partial and local independences have been developed. Staged trees are one such class. Staged tree modeling has proved its worth in many fields, for instance cohort studies, causal analysis, case-control studies, Bayesian games and medical diagnosis.

stagedtrees allows users to estimate any type of non-symmetric conditional independence from data via score-based and clustering-based algorithms. It implements various functionalities providing inferential, visualization, descriptive and summary statistics tools for such models and their graph structure. These functions help users handle categorical experimental data and analyze the learned models to untangle complex dependence structures.
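A minimal sketch (assuming the CRAN stagedtrees package, where full() creates a fully parameterised staged tree and stages_bhc() merges stages by backward hill climbing):

```r
# A minimal sketch with the CRAN {stagedtrees} package: learn a staged tree
# from categorical data (the classic Titanic table, expanded to one row per
# observation).
library(stagedtrees)

titanic_df   <- as.data.frame(Titanic)
titanic_long <- titanic_df[rep(seq_len(nrow(titanic_df)), titanic_df$Freq),
                           c("Class", "Sex", "Age", "Survived")]

mod_full   <- full(titanic_long)        # fully parameterised staged tree
mod_staged <- stages_bhc(mod_full)      # backward hill-climbing stage merging
plot(mod_staged)
```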

 
9:15am - 10:45am7C - Teaching, Automation and Reproducibility
Virtual location: The Lounge #talk_teaching_automation_r
Session Chair: Earo Wang
Zoom Host: Jyoti Bhogal
Replacement Zoom Host: Matt Bannert
Session Sponsor: Roche
Session Slide
 
9:15am - 9:35am
Talk-Live
ID: 143 / ses-07-C: 1
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: automation

A semi-automatic grader for R scripts

Vik Gopal, Samuel Seah, Viknesh Jaya Kumar

National University of Singapore

My department teaches a class in R. The aims of this class are to teach visualisation and good programming practices in R. Every week, we would attempt to go over as many script submissions as we could, as closely as we could. We would then summarise the feedback verbally to the students.

Due to the increasing class size and time constraints, we were unable to rigorously go through every single student script every week. As such, we could not identify the common misconceptions that students had. We could not intervene and correct the most critical ones early on in the class. Finally, we were unable to review all the visualisations that students created.

Hence we developed an R package to automatically run all student scripts and extract metrics such as run-time and certain code features. The package would also collate all the graphs so that we can see them at one go. We also set up a server for students to test their code before submission, ensuring that we can run their code smoothly.

We can now ensure that every student’s code is run and analysed consistently and reliably. Instead of scrutinising the code, we look through a summary table of features generated for each script. If something looks strange there, we go back to the script. By uploading this table, with comments, to our LMS, we can provide custom feedback for each student. Finally, having such a summary table of features indicates the areas in which students need more practice - it allows us to tailor future homework problems.



9:35am - 9:55am
Talk-Live
ID: 174 / ses-07-C: 2
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: Training, Automation, Systems Administration, Reproducibility, Workflow

Automating bespoke online teaching with R

Rhian Davies

Jumping Rivers, United Kingdom

At Jumping Rivers we deliver over 100 R, Python and Stan training courses each year, engaging with thousands of new learners. The necessity to move to fully online training in March last year meant we quickly had to completely rethink how to deliver R training interactively online. We internally trialled running our usual in-person training just on Zoom - and it really doesn’t work, trust us!

We already used R & R Markdown to create all training materials including slides and notes.

However, our new workflow uses R in every step of the way, from creating a bespoke learning environment, to collating feedback and generating certificates.

Upcoming training sessions are stored in Asana. Using a single call from R, we extract the relevant Asana task details and:

* Provide the client with a single URL that contains all necessary information for the course

* Deploy a bespoke virtual training environment with {analogsea}

* Automate password generation with {shiny}

* Track and upload attendance sheets

* Create bespoke Google Documents for code quizzes

* Generate fill-in-the-blank tutor R scripts

* Provide automatic feedback reports for clients with {rmarkdown}, {shiny} & {rtypeform}

* Deliver personalised certificates in {shiny}

* Tag the training materials and VM to enable a completely reproducible set-up

This improves the learning experience as the “small” things are automated and allows the trainer to concentrate on actual training.



9:55am - 10:15am
Talk-Video
ID: 270 / ses-07-C: 3
Regular Talk
Topics: Reproducibility
Keywords: reproducibility, rmarkdown, knitr, report, communication

Extend the functionality of your R Markdown documents

Christophe Dervieux

RStudio

R Markdown is a powerful tool that has quickly grown since its creation. If it can be rather simple to quickly create and maintain a simple reproducible report, it can be more challenging to do advanced customization and dynamic content creation due to the different tools involved (rmarkdown, knitr, Pandoc, LaTeX, ...) and a lot of possible tweaks. And this is increaded all the more if you consider the already widespread and still growing ecosytem surrounding R Markdown.

Helping users better find out how to do specific tasks with R Markdown was the main driver for the book "R Markdown Cookbook" (CRC Press). This talk is based on the content of the book and presents a selection of advanced recipes for going further with an R Markdown document. These examples combine little-known features of some R packages (rmarkdown, knitr) and other tools (Pandoc) to provide flexibility and greatly extend the functionality for producing communication products, programmatically and reproducibly.

The talk will also cover the latest features, at the time of the talk, in the R Markdown family of packages (rmarkdown, knitr, bookdown, blogdown, ...).
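As a flavour of this kind of recipe (a generic illustration rather than a specific example from the book), a chunk with the option results = "asis" can generate a whole section per group programmatically:

# inside an R Markdown chunk with the chunk option results = "asis"
for (sp in unique(iris$Species)) {
  cat("\n\n## Species: ", as.character(sp), "\n\n", sep = "")  # emit a Markdown heading
  print(knitr::kable(head(iris[iris$Species == sp, 1:4])))     # emit a Markdown table
}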



10:15am - 10:35am
Talk-Video
ID: 189 / ses-07-C: 4
Regular Talk
Topics: Teaching R/R in Teaching
Keywords: psychometrics, reliability, item response theory, Shiny, teaching R

Computational aspects of psychometrics taught with R and Shiny

Patricia Martinkova1,2

1Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic; 2Faculty of Education, Charles University, Prague, Czech Republic

Psychometrics deals with the advancement of quantitative measurement practices in psychology, education, health, and many other fields. It covers a number of statistical methods that are useful for the behavioral and social sciences. Among other topics, it includes the estimation of reliability to deal with the omnipresence of measurement error, as well as a more detailed description of item functioning encompassed in item response theory models.

In this talk, I will discuss some computational aspects of psychometrics, and how understanding these aspects may be supported by real and simulated datasets, interactive examples, and hands-on methods. I will first focus on reliability estimation and the issue of restricted range, showing that zero may not always be zero. I will then focus on a deeper understanding of the context behind more complex models and their much simpler counterparts. The last example discusses group-specific models and the importance of item-level analysis for situations where differences in overall gains are not apparent but differences in item gains may be.

I will finally discuss experiences from teaching computational aspects of psychometrics to a diverse group of students from various fields, including statistics, computer science, psychology, education, medicine, and participants from industry. I will discuss the challenges and joys of creating a truly interdisciplinary course.

Link to package or code repository.
https://github.com/patriciamar/ShinyItemAnalysis
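For readers who want to explore this interactively, a minimal sketch (assuming the CRAN packages ShinyItemAnalysis and psych, and using the example dataset bundled with ShinyItemAnalysis) might look like:

library(ShinyItemAnalysis)
library(psych)

data(HCI, package = "ShinyItemAnalysis")  # example multiple-choice test data
alpha(HCI[, 1:20])                        # Cronbach's alpha for the 20 scored items
startShinyItemAnalysis()                  # launch the interactive Shiny application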
 
10:45am - 11:45ammixR!
Music, networking channel and raffles. To end the day in a relaxing way
Date: Friday, 09/July/2021
12:30pm - 1:30pmKeynote: The R-universe project
Virtual location: The Lounge #key_ooms
Session Chair: Heather Turner
Zoom Host: Yuya Matsumura
Replacement Zoom Host: Priyanka Gagneja
 
ID: 355 / [Single Presentation of ID 355]: 1
Keynote Talk
Topics: Other

Jeroen Ooms

UC Berkeley, Netherlands

R-universe <https://r-universe.dev> is a new platform by rOpenSci under which we experiment with various ideas for improving publication and discovery of research software in R.

R-universe provides users or organizations with a personal CRAN-like repository for publishing packages, rmarkdown articles, and other R content. The system automatically tracks upstream git package repositories, builds binary packages for Windows and macOS, renders vignettes, and makes data available through dashboards, feeds and APIs.

In the talk, we explain how R users of any level can benefit from creating a personal universe. Example use cases include a personal portfolio, an incubator for experimental projects, or a staging pool for dev versions of CRAN packages. We also discuss possibilities for publishing non-software R packages (e.g. research compendia or reproducible articles) and ongoing work on integrating software citations and health metrics in R-universe.
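As one concrete illustration of the documented usage pattern (the universe URL below is the speaker's own; any other user or organization name works the same way), installing a package from someone's universe only requires adding it to the repos argument:

install.packages(
  "jsonlite",
  repos = c("https://jeroen.r-universe.dev",  # personal universe repository
            "https://cloud.r-project.org")    # fall back to CRAN for dependencies
)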

 
1:30pm - 1:45pmBreak
Virtual location: The Lounge #lobby
1:45pm - 3:15pm8A - Statistics and Bioinformatics
Virtual location: The Lounge #talk_stats_bioinformatics
Session Chair: Leonardo Collado Torres
Zoom Host: Priyanka Gagneja
Replacement Zoom Host: Aditi Shah
Session Sponsor: Roche
Session Slides
 
1:45pm - 2:05pm
Talk-Video
ID: 113 / ses-08-A: 1
Regular Talk
Topics: Bioinformatics / Biomedical or health informatics
Keywords: computational biology, bioinformatics, molecular evolution, phylogeny, sequence analysis, R Shiny

MolEvolvR: Web-app and R-package for characterizing proteins using molecular evolution and phylogeny

Samuel Z Chen, Lauren M Sosinski, John B Johnston, Janani Ravi

Michigan State University, United States of America

Molecular evolution and phylogeny can provide key insights into pathogenic protein families. Studying how these proteins evolve across bacterial lineages can help identify lineage-specific and pathogen-specific signatures and variants, and consequently, their functions. We have developed a streamlined computational approach for characterizing the molecular evolution and phylogeny of target proteins, widely applicable across proteins and species of interest. Our approach starts with query protein(s) of interest, identifying their homologs, and characterizing each protein by its domain architecture and phyletic spread. We have developed the MolEvolvR webapp, written entirely in R and Shiny, to enable biologists to run our entire workflow on their data by simply uploading a list of their proteins of interest. The webapp accepts inputs in multiple formats: protein/domain sequences, multi-protein operons/homologous proteins, or motif/domain scans. Depending on the input, MolEvolvR returns the complete set of homologs/phylogenetic tree, domain architectures, common partner domains. Users can obtain graphical summaries that include multiple sequence alignments and phylogenetic trees, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns, and relative occurrences across lineages. Thus, the MolEvolvR webapp provides a powerful, easy-to-use interface for a wide range of protein characterization analyses, starting from homology searches and phylogeny to domain architectures. In addition to this analysis, researchers can use the app for data summarization and dynamic visualization. The webapp can be accessed here: http://jravilab.org/molevolvr. Soon, it will be available as an R-package for use by computational biologists.

Link to package or code repository.
http://jravilab.org/molevolvr


2:05pm - 2:25pm
Talk-Live
ID: 201 / ses-08-A: 2
Regular Talk
Topics: Web Applications (Shiny/Dash)
Keywords: API, cloud computing, web application, applications/case studies, plumber

Using R to Empower a Precision Dosing Web Application

Gergely Daroczi

Rx Studio Inc., United States of America

R has a long history in PK/PD modeling, and it has been heavily used in both research and clinical practice, but making these R packages available outside of the R community has its (technical, compliance and UX) challenges that even hosted Shiny apps cannot easily solve yet. We are working on and presenting a scalable platform and web application built on top of R (Docker, containerized Plumber API), hosted on HIPAA-compliant infrastructure (AWS and GCP services), and made available to end users via a user-friendly and configurable web interface (Angular and Nebular). This talk will focus on the overall cloud infrastructure, how we integrate R and other services, and the challenges with scalability, error handling, user experience, etc. in a HIPAA-compliant but startup environment.
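To make the R-facing side concrete (a generic, minimal plumber sketch; the endpoint, parameter and dose calculation are hypothetical and not Rx Studio's actual API), an R function can be exposed as an HTTP endpoint along these lines:

# plumber.R
#* Return a toy dose recommendation
#* @param weight_kg Patient weight in kilograms
#* @post /dose
function(weight_kg) {
  list(dose_mg = 5 * as.numeric(weight_kg))  # placeholder calculation, not a clinical model
}

# In a separate script or session, serve the API:
# plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)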



2:25pm - 2:45pm
Talk-Video
ID: 136 / ses-08-A: 3
Regular Talk
Topics: Data visualisation
Keywords: biology

Visualization of highly-multiplexed imaging data with cytomapper

Nils Eling, Nicolas Damond, Tobias Hoch, Bernd Bodenmiller

University of Zurich

Highly multiplexed imaging (HMI) produces images that contain up to 40 channels. In the field of cell biology, HMI is used to capture differences between individual cells, which are defined as distinct objects on the images. To derive those objects from multi-channel images, different segmentation approaches can be used. Several challenges in terms of data visualisation arise from this type of high dimensional data: 1. more than 3 channels need to be visualised at once, 2. the features of segmented objects need to be visualised together with pixel-level information and 3. tens to hundreds of images need to be visualised in parallel. Here, we have developed cytomapper, an R/Bioconductor package to address these challenges. The main functions of the package allow 1. the visualisation of pixel-level information across multiple channels, 2. the display of object-level information on segmentation masks and 3. the interactive visualization of images based on an integrated shiny application. Finally, we also developed an on-disk representation framework to expand the usability of the package to several hundreds of images.
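A minimal sketch of the two main plotting functions, assuming the Bioconductor package and its bundled example data (the dataset names, channel names and metadata columns below come from that example data and will differ for other datasets):

library(cytomapper)

data("pancreasImages")  # multi-channel example images
data("pancreasMasks")   # matching segmentation masks
data("pancreasSCE")     # per-cell measurements and metadata

# 1. pixel-level information across multiple channels
plotPixels(pancreasImages, colour_by = c("H3", "CD99"))

# 2. object-level information displayed on the segmentation masks
plotCells(pancreasMasks, object = pancreasSCE,
          img_id = "ImageNb", cell_id = "CellNb",
          colour_by = "CellType")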



2:45pm - 3:05pm
sponsored-live
ID: 346 / ses-08-A: 4
Sponsored Talk
Topics: Biostatistics and Epidemiology

Analyzing Clinical Trials Data using R for Exploratory and Regulatory Analyses

Adrian Waddel

Roche

The R language has gained traction in the pharmaceutical industry for analyzing clinical trials data. Shiny apps in particular have proven useful for exploratory analyses. We will present two frameworks to analyze clinical trials data: one to create interactive Shiny apps and another for static output generation. In our demonstration we will focus in particular on modularity and reproducibility.

 
1:45pm - 3:15pm8B - R in Production 2
Virtual location: The Lounge #talk_r_production_2
Session Chair: Zhian N. Kamvar
Session Chair: Matt Bannert
Zoom Host: Pamela Pairo
Replacement Zoom Host: Yuya Matsumura
Session Sponsor: Appsilon
Session Slides
 
1:45pm - 2:05pm
Talk-Live
ID: 223 / ses-08-B: 1
Regular Talk
Topics: R in production
Keywords: deployment

Shiny PoC to Production Application in 8 steps

Marcin Dubel

Appsilon

"A great advantage of Shiny applications is that a proof of concept can be created quickly and easily. It is a great way for subject matter experts to present their ideas to stakeholders before moving on to production. However, taking the next step to a production application requires help from experienced software developers. The actions should be focused on two areas: to make the application a great experience for users and to make it maintainable for future work. Focusing on these will assure that the app will be scalable, performant, bug-free, extendable, and enjoyable. Close collaboration between engineers and experts paves a wave to many successful projects in data science and is Appsilon’s confirmed path to production-ready solutions.

The very first step should always be to build a comfortable and (importantly) reproducible workflow, thus setting up the development environment and organizing the folder structure [renv + docker]. Once this is done, engineers should proceed to limit the codebase by cleaning the code – i.e., removing redundant comments, extracting the constants and inline styles [ymls + styler]. Now the fun begins: extract the business logic into separate functions, modules and classes [packages/R6 + plumber]. Restrict reactivity to a minimum. Check the logic [data.validator + drake]. Add tests [testthat + cypress/shinytest]. Organize your /www and move actions to the browser [shiny + css/js]. Finally, style the app [sass/bslib + shiny.fluent]. And, voilà! A world-class Shiny app.



2:05pm - 2:25pm
Talk-Live
ID: 251 / ses-08-B: 2
Regular Talk
Topics: R in production
Keywords: packages, reproducibility, projects, production

Reliably Reproducible Project Packages

Alex Kahn Gold

RStudio, United States of America

We all dread sharing a data science project with a collaborator or returning to a project only to find that it doesn't run because of mismatched package versions. Maintaining and sharing R projects is historically a fragile endeavor, relying mainly on crossed fingers.

There are now simple workflows to create and share an isolated R package environment for any project, making luck irrelevant to the process.

In this talk, you'll learn to use the {renv} package to easily and quickly create isolated project environments, capture the packages in those environments, and share them with collaborators. Additionally, you'll learn how to take advantage of dated repository URLs from public RStudio Package Manager to make sure that you can add more packages and continue work on your project, no matter how far down the road that is.
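A minimal sketch of the workflow described (the snapshot date in the repository URL is illustrative):

# Pin the package repository to a dated RStudio Package Manager snapshot
options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/2021-07-05"))

renv::init()       # create an isolated project library and an renv.lock file
renv::snapshot()   # record the exact package versions used by the project

# Later, or on a collaborator's machine:
renv::restore()    # reinstall the recorded versions into the project library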



2:25pm - 2:45pm
Talk-Live
ID: 224 / ses-08-B: 3
Regular Talk
Topics: R in production
Keywords: deployment, DevOps, infrastructure, integration, R packages

Binary R Packages for Linux: Past, Present and Future

Iñaki Ucar1, Dirk Eddelbuettel2

1Universidad Carlos III de Madrid; 2University of Illinois at Urbana-Champaign

Pre-compiled binary packages provide a very convenient way of efficiently distributing software that has been adopted by most Linux package management systems. However, the heterogeneity of the Linux ecosystem, combined with the growing number of R extensions available, poses a scalability problem. As a result, efforts to bring binary R packages to Linux have been scattered, and lack a proper mechanism to fully integrate them with R’s package manager. This work reviews the past and present of binary distribution for Linux, and presents a path forward by showcasing the ‘cran2copr’ project, an RPM-based proof-of-concept implementation of an automated scalable binary distribution system with the capability of building, maintaining and distributing thousands of packages, while providing a portable and extensible bridge to the system package manager. This not only benefits desktop/server users of Linux systems, but also Windows and macOS users that rely on CI/CD systems to test packages and/or deploy code.



2:45pm - 3:05pm
Talk-Live
ID: 219 / ses-08-B: 4
Regular Talk
Topics: Community and Outreach
Keywords: community

R for Non-Programmers: Creating Paradigm Shifts in Reporting for Community-Facing Organizations Using R

Lisa Kulka, Sulagna Patra, Mohammad Haque

CCNY Inc.

The automation of reporting processes for community-based organizations working with diverse communities has created a paradigm shift in the way they can approach quality improvement, allocate time and resources to data analyses and management, and utilize various kinds of data to support the communities with which they work.

At CCNY, a nonprofit organization that supports the evaluation and analytics work of community-facing organizations, the use of R to generate and enhance reporting schema has made a significant impact for those who have little to no programming experience. The most successful projects leading to increased organizational effectiveness via use of R with non-programmers include CCNY’s support of its local county’s Children’s System of Care. Non-automated data reporting and management created challenges in terms of credential tracking and ensuring children and families were receiving appropriate services. CCNY deployed R to pre-process and automate training data, generating reports that predict which community providers are eligible to render services. This new organizational ability created streamlined reporting processes that led to non-programmers running R and applying this newly-established data framework to automate other data-related tasks, including dramatically increasing community impact by reducing data processing time, making faster decisions informed by real-time data, and leveraging increased data processing capabilities to improve overall organizational capacity.

In this session, we will review specific features of the project’s unique code used to establish streamlined automated reporting for those with limited R proficiency, as well as the project input, the output deliverables, and techniques for engaging non-programmers in the logistics of building R schema so that basic principles of automated reporting can be understood and easily generalized within and outside of community-facing organizations.

Link to package or code repository.
Code will be shared in specific pieces as it contains sensitive information in certain parts, thank you for your understanding!
 
1:45pm - 3:15pm8C - Statistical modeling & Data Analysis 1
Virtual location: The Lounge #talk_stats_data_analysis_1
Session Chair: Liz Hare
Zoom Host: Rachel Heyard
Replacement Zoom Host: Balogun Stephen
 
1:45pm - 2:05pm
Talk-Video
ID: 176 / ses-08-C: 1
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: Fairness, Bias, AI, Machine Learning, Visualization

fairmodels: A Flexible Tool For Bias Detection, Visualization, And Mitigation

Jakub Wiśniewski1, Przemysław Biecek1,2

1Faculty of Mathematics and Information Science, Warsaw University of Technology; 2Faculty of Mathematics, Informatics and Mechanics, University of Warsaw

Machine learning decision systems are getting increasingly omnipresent in our lives. From dating apps to rating loan seekers, algorithms affect both our well-being and future. These systems, however, are not infallible. Moreover, complex predictive models are very eager to learn social biases present in historical data, which can lead to increasing discrimination. If we want to create models responsibly, then we need tools for in-depth validation of models, also from the perspective of potential discrimination.

This talk introduces the R package fairmodels, which helps to validate fairness and eliminate bias in classification models in an easy and flexible fashion. The fairmodels package offers a model-agnostic approach to bias detection, visualization, and mitigation. The implemented set of functions and fairness metrics enables model fairness validation from different perspectives. The package includes a series of methods for bias mitigation that aim to diminish the discrimination in the model.

The package is designed not only to examine a single model but also to facilitate comparisons between multiple models.
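A hedged sketch of the workflow, following the package's DALEX-based design (the data, model and protected attribute below are illustrative):

library(DALEX)
library(fairmodels)

model <- glm(survived ~ gender + age + class, family = "binomial",
             data = titanic_imputed)
explainer <- explain(model, data = titanic_imputed,
                     y = titanic_imputed$survived)

fcheck <- fairness_check(explainer,
                         protected  = titanic_imputed$gender,
                         privileged = "male")
plot(fcheck)  # visualise the fairness metrics for the chosen protected group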



2:05pm - 2:25pm
Talk-Video
ID: 106 / ses-08-C: 2
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: causal analysis, estimation, high-dimensional data, machine learning, statistical inference

DoubleML - Double Machine Learning in R

Philipp Bach1, Victor Chernozhukov2, Malte S. Kurz1, Martin Spindler1

1University of Hamburg, Germany; 2MIT

The R package DoubleML implements the double/debiased machine learning framework of Chernozhukov et al. (2018). DoubleML makes it possible to estimate causal parameters based on machine learning methods. The double machine learning framework consists of three key ingredients: Neyman orthogonality, high-quality machine learning estimation and sample splitting. Estimation of nuisance components can be performed by various state-of-the-art machine learning methods that are available in the mlr3 ecosystem. DoubleML allows users to perform inference in a variety of causal models, including partially linear and interactive regression models and their extensions to instrumental variable estimation. The object-oriented implementation of DoubleML enables high flexibility in model specification and makes it easily extendable. This talk serves as an introduction to the double machine learning framework and the R package DoubleML. We demonstrate how users of DoubleML can perform valid inference based on machine learning methods in reproducible code examples with simulated and real data sets.

References:

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins, J. (2018), Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21: C1-C68, URL: https://academic.oup.com/ectj/article/21/1/C1/5056401.

Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L. and Bischl, B. (2019), mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, doi:10.21105/joss.01903, URL: https://mlr3.mlr-org.com/.
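A hedged sketch of a partially linear regression example, following the package documentation (the data-generating function, learners and sample size are illustrative, and argument details may differ across package versions):

library(DoubleML)
library(mlr3)
library(mlr3learners)

# Simulated data set used in the package's documentation examples
data <- make_plr_CCDDHNR2018(n_obs = 500, return_type = "DoubleMLData")

# Nuisance learners from the mlr3 ecosystem (outcome and treatment regressions)
dml_plr <- DoubleMLPLR$new(data, lrn("regr.ranger"), lrn("regr.ranger"))
dml_plr$fit()
dml_plr$summary()  # estimate and confidence interval for the causal parameter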



2:25pm - 2:45pm
Talk-Video
ID: 173 / ses-08-C: 3
Regular Talk
Topics: Multivariate analysis
Keywords: statistical independence, conditional independence, variable selection, causal analysis

copent: Estimating Copula Entropy and Transfer Entropy in R

Jian MA

NA

Statistical independence and conditional independence are two fundamental concepts in statistics and machine learning. Copula Entropy is a mathematical concept for measuring and testing multivariate statistical independence, and it has also proved to be closely related to conditional independence (or transfer entropy). It has been applied to solve several fundamental statistical or machine learning problems, including association discovery, structure learning, variable selection, and causal discovery. The method for estimating copula entropy with rank statistics and the kNN method is implemented in the 'copent' package in R. This talk first introduces the theory and estimation of Copula Entropy, and then the implementation details of the package. Three examples will also be presented to demonstrate the usage of the package: one with simulated data and the other two with real-world data for variable selection and causal discovery. The copent package is available on CRAN and also on GitHub at https://github.com/majianthu/copent/.

Link to package or code repository.
https://github.com/majianthu/copent
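A hedged sketch of the simulated-data use case (the correlation in the simulated data is illustrative):

library(copent)
library(MASS)

set.seed(1)
x <- mvrnorm(n = 500, mu = c(0, 0),
             Sigma = matrix(c(1, 0.6, 0.6, 1), 2, 2))
copent(x)  # copula entropy estimate; it is zero when the columns are independent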
 
3:15pm - 4:15pmtRivia
Session Chair: Sara Mortara
Session Chair: Marcela Alfaro Cordoba
4:15pm - 5:45pmKeynote: Communication - elevating data analysis to make a real impact
Virtual location: The Lounge #key_communication
Session Chair: Federico Marini
Zoom Host: Aditi Shah
Replacement Zoom Host: Balogun Stephen
 
ID: 358 / [Single Presentation of ID 358]: 1
Keynote Talk
Topics: Community and Outreach

Catherine Gicheru1, Katharine Hayhoe2

1Africa Women Journalism Project; 2Texas Tech University

Data analysis is key to identifying patterns, understanding processes, and guiding effective policy-making for real-world problems, such as climate change, COVID-19 and other diseases, and gender inequity. But sometimes this is not enough. Our two speakers in this session, data journalist Catherine Gicheru and atmospheric scientist Katharine Hayhoe, will share their work and experience communicating key data results to the general public and stakeholders. Catherine will help us understand how to tell stories with data, by liberating access to information to empower citizens and help them generate changes, particularly in issues related to gender, health and development in Africa. Katharine will share her struggle to bridge the gap between scientists and stakeholders about potential impacts of climate change. The contributions in this joint keynote will spark a discussion on how to enhance the impact of data analysis by means of efficient communication strategies.

 
5:45pm - 6:45pmRechaRge 4
Session Chair: Marcela Alfaro Cordoba
Chair Yoga
6:45pm - 7:45pmIncubator: Expanding the R community in the Middle East and North Africa (MENA)
Virtual location: The Lounge #incubator_mena
Session Chair: Batool Almarzouq
Zoom Host: Pamela Pairo
Replacement Zoom Host: Aditi Shah
 
ID: 196 / [Single Presentation of ID 196]: 1
Incubator
Topics: Community and Outreach
Keywords: Community, Middle East, North Africa, useR group

Batool Almarzouq1,2,3, Mouna Belaid4, Iman Al Hasani7, Haifa Ben Messaoud6, Fahad Almsned8,9, Mohammed El Fodil10, Hussain Alsalman11, Kamila Benadrouche5, Amal TLILI12, Hedia Tnani13

1University of Liverpool, United Kingdom; 2Open Science Community Saudi Arabia; 3King Abdullah International Medical Research Center (KAIMRC), Saudi Arabia; 4Prime Analytics, Tunis; 5Sonatrach, Algeria; 6National Pen, Tunis; 7Sultan Qaboos University (SQU), Oman; 8The National Institute of Neurological Disorders and Stroke (NINDS); 9The National Institutes of Health (NIH); 10Ecole Nationale Supérieure de Statistiques et d'Economie Appliquée (ENSSEA); 11Saudi Aramco, Saudi Arabia; 12DNA Analytics, Tunis; 13Pasteur Institute of Tunis

Data Science (DS) has become an in-demand, highly paid career in MENA, especially in the UAE, KSA and other oil-rich Gulf countries, but despite this interest in DS in the Middle East and North Africa (MENA), there are very few R meetups or R user groups dedicated to learning, teaching, and sharing information about the R programming language in MENA. Several R-Ladies and useR group chapters have been established in Saudi Arabia, Tunis, Algeria, Sudan, Muscat and Egypt to bring out more R users and empower data scientists from underrepresented communities in MENA. However, the growth of the R community in MENA is limited by the language barrier, which means that the communities that exist around R (e.g. rOpenSci, R-Ladies, ...) are not well known in MENA. This incubator aims to establish a working group dedicated to expanding the R community in MENA and building a network between the different groups. It will engage researchers and data scientists in the MENA region through in-person collaboration on open-source projects. By the end of the incubator, we'll have established a working group and identified the barriers, and how to overcome them, to expand the R community in MENA. This working group aims to facilitate collaboration and integration between data scientists in MENA and R communities worldwide.

Link to package or code repository.
In this incubator, we will invite community builders, founders and organisers of R-Ladies and UseR chapters across the MENA region as well as R-bloggers who are enriching the Arabic content in R.
 
6:45pm - 7:45pmIncubator: Stop reinventing the wheel: R package(s) for conference and abstract management
Virtual location: The Lounge #incubator_conference_abstract
Session Chair: Matt Bannert
Session Chair: Janani Ravi
Zoom Host: Balogun Stephen
Replacement Zoom Host: Yuya Matsumura
 
ID: 343 / [Single Presentation of ID 343]: 1
Incubator
Topics: R in production
Keywords: R package development, conference/abstract management, R community, Reproducibility

Matthias Bannert1, Janani Ravi2

1ETH, Zurich, Switzerland; 2Michigan State University, United States of America

The ability to host an entire conference online went from nice-to-have to absolutely essential for many communities within just a bit more than a year. As a consequence, online conference tools were exposed to a wider audience, faced tougher scrutiny, and hauled in more feedback than ever before.

So far no clear front runner among conference tools has emerged from the process. Strikingly, all current conference tools used by the open-source community have major deficiencies: accessibility issues, bulky navigation for admins, and/or a hefty price tag. The idea of this incubator is to discuss whether and how processes such as registration, abstract evaluation, or submission can be mapped into an R package and the open-source ecosystem. We aim to challenge established processes and walk through possible process changes, e.g., reviews. We also see such a system as an opportunity for more equitable practices in picking topics, forming committees, choosing reviewers, providing feedback to submitters, and selecting contributions. We intend to work on a solution that is easy to reproduce and facilitates knowledge transfer for annual conferences with changing organizers.

 
7:45pm - 8:00pmBreak
Virtual location: The Lounge #lobby
8:00pm - 9:30pm9A - R in production 3
Virtual location: The Lounge #talk_r_production_3
Session Chair: Jennifer Bryan
Zoom Host: Aditi Shah
Replacement Zoom Host: Pamela Pairo
Session Sponsor: Roche
Session Slides
 
8:00pm - 8:20pm
Talk-Video
ID: 262 / ses-09-A: 1
Regular Talk
Topics: R in production
Keywords: metrics, monitoring, shiny, plumber

Production Metrics for R with the 'openmetrics' Package

Aaron Jacobs

Crescendo Technology

Production applications are often expected to emit "metrics" so that they can be monitored in real time for problems. However, traditional monitoring vendors expected developers to use their proprietary client libraries, none of which included support for R. This left R users without the option of supporting their organisation's existing vendor, and monitoring support for R applications (such as Shiny apps or Plumber APIs) has remained poor.

Over the last few years, the open-source Prometheus project has become the de facto metrics solution in the Kubernetes community, and its text-based format (now called 'OpenMetrics') has been formalised as a draft IETF standard. This effort is the closest thing in the monitoring community to a widely-accepted standard in its history.

This talk will introduce the {openmetrics} package, a client library for the OpenMetrics standard. It allows R applications to expose metrics to any Prometheus instance, or one of the growing number of open-source and commercial tools that can ingest the OpenMetrics format.

In addition to user-defined, application-specific metrics, the package also bundles general-purpose metrics and can automatically inject them into Shiny apps or Plumber APIs. This makes it easy for R users to add production-ready metrics to their applications in only a few lines of code.

As production use of R grows, expectations about production features -- such as metrics and corresponding alerts -- will grow as well.
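A hedged sketch of the user-defined part (based on the package README; the metric name is illustrative), defining an application-specific counter and incrementing it from application code:

library(openmetrics)

requests <- counter_metric(
  "app_requests_total",                     # Prometheus-style metric name
  "Number of requests handled by the app."  # help text
)
requests$inc()  # bump the counter each time the application handles a request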



8:20pm - 8:40pm
sponsored-live
ID: 349 / ses-09-A: 2
Sponsored Talk
Topics: R in the wild (unusual applications of R)

Automating business processes with R

Frans van Dunné

ixpantia, Costa Rica

At ixpantia we help organizations to become their most innovative and data-driven selves through personalized coaching and knowledge transfer, continuous and transparent code handover and the implementation of an efficient and cooperative data culture. To bring the results of data analysis to business processes and decision making, we usually need to automate their execution. Often this means that we need a daily process to write, for example, predicted values to a database.

The possibilities to add value to organizations through the use of tools that are available to us in R are legion. This value lies not only in advanced analytics, but also in the power of dynamic (automated) reports and pre-calculated values that combine multiple formal and informal data sources. The value of R for automating these tasks is often so high because the domain experts themselves are writing it, and they can iterate at high speed to respond to changes in their business context.

In this talk we will share some of the experiences we have had automating tasks with R. We will also present the pragmatic approach we have developed to run and monitor scheduled tasks. This approach is based on R Markdown, a task scheduler (such as cron) and a Shiny app to monitor task execution and completion.
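A minimal, hedged version of that pattern (file names, paths and the schedule are illustrative): a small R script renders the report, and cron runs the script on a schedule.

# daily_forecast.R
rmarkdown::render(
  "forecast_report.Rmd",
  output_file = paste0("forecast-", Sys.Date(), ".html")
)

# illustrative crontab entry: render the report every day at 06:00
# 0 6 * * * Rscript /srv/reports/daily_forecast.R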



8:40pm - 9:00pm
sponsored-live
ID: 347 / ses-09-A: 3
Sponsored Talk
Topics: R in production
Keywords: RStudio Workbench, Collaboration, DevOps, Docker

RStudio Managed Workbench

Patrick Schratz

cynkra GmbH, Zurich, Switzerland

As a certified RStudio partner, we offer to deploy a full-fledged RStudio Workbench installation on your infrastructure (on-premise) or in the cloud.

In this talk we showcase the benefits of running a centralised RStudio Workbench installation and explain the added value of our containerised service compared to the default binary installation of RStudio Workbench.

Link to package or code repository.
https://cynkra.com/rstudio


9:00pm - 9:20pm
sponsored-live
ID: 354 / ses-09-A: 4
Sponsored Talk
Topics: R in production

A little bit about RStudio

Joseph Rickert

RStudio / R Consortium, United States of America

TBA

 
8:00pm - 9:30pm9B - Community and Outreach 2
Virtual location: The Lounge #talk_community_outreach_2
Session Chair: Stefanie Butland
Zoom Host: Yuya Matsumura
Replacement Zoom Host: Joselyn Chávez
 
8:00pm - 8:20pm
Talk-Video-NE
ID: 294 / ses-09-B: 1
Regular Talk
Topics: Community and Outreach
Keywords: Survey, underrepresentation, inclusion, community

Using R in Latin America: the great, the good, the bad, and the ugly

Virginia A. García Alonso1,2, Paola Corrales3, Claudia A. Huaylla4, Andrea Gómez Vargas5, Joselyn Chávez6, Denisse Fierro Arcos7,8

1Universidad de Buenos Aires, Facultad de Ciencias Exactas y Naturales, Departamento de Biodiversidad y Biología Experimental (DBBE), Buenos Aires, Argentina; 2CONICET-Universidad de Buenos Aires, Instituto de Biodiversidad y Biología Experimental y Aplicada (IBBEA), Buenos Aires, Argentina; 3CIMA, FCEN, UBA; 4IRNAD-CONICET-UNRN; 5FSOC, UBA; 6IBt-UNAM; 7University of Tasmania, Hobart, Tasmania; 8ARC Centre of Excellence for Climate Extremes, Hobart, Tasmania

R is used globally for diverse purposes. However, how widely is it used in peripheral countries? What barriers and challenges are faced there? We are a group of R useRs from Latin America, a region facing several barriers: language, infrastructure, distance, and unequal access to resources, information and training.

This inequality often represents not only an additional barrier for people to learn and use R, but also an obstacle to participating in international events, which reduces our representation on the global sphere. Before improving the inclusion of Latin Americans in the global R community, we need to comprehensively understand useRs’ experiences. With this aim in mind, we created an online survey designed to assess the challenges Latin American R users face, which has been shared within a strong existing network of Latin American useRs from Tijuana to Ushuaia.

Over 900 useRs completed the survey and preliminary results already highlight that several useRs from this region often face challenges associated with not being native English speakers or with lacking access to certain resources. However, it also revealed that many belong to different R communities such as R-Ladies and LatinR among others, serving as useful platforms to connect with others who can help to cope with the shared difficulties they face.

In this talk we will present the analysis and results from this survey and invite others to do the same as this open and reusable initiative seeks to inspire other underserved regions around the world to measure their strengths and challenges to help increase inclusion and diversity of the international R community.



8:20pm - 8:40pm
Talk-Live
ID: 280 / ses-09-B: 2
Regular Talk
Topics: Community and Outreach
Keywords: R community, data-driven exploration, dashboards

r-community.org: a central community infrastructure for R

Chibuokem Ben Ubah, Rick Pack, Meet Bhatnagar, João Vitor Cavalcante

R Central, Nigeria

Large amounts of user data are generated from several social and technology spaces within the R Community daily, providing information about important trends and insights that could be helpful in planning for the future of a sustainable open-source community.

This presentation is about r-community.org, a website for tracking R-related data from popular online spaces such as Meetup, CRAN, GitHub, Twitter, Stack Overflow and R-bloggers, with the purpose of understanding time-related trends (where applicable). The visualizations from this tool are intended to provide an easy way for newcomers and veterans of the R language to discover the communities and events available to them, while allowing an exploration of past and future R conferences and meetups for all. The infrastructure relies on open-source R packages and continuous integration technology for regular updates of the data.

With the results of this tool, users, organizations and sub-communities are able to find extra insights for driving decision-making regarding developments around the R language and its global community. 

Link to package or code repository.
https://github.com/r-community


8:40pm - 9:00pm
Talk-Live
ID: 271 / ses-09-B: 3
Regular Talk
Topics: Community and Outreach
Keywords: Community, R markdown, R packages, GitHub, CI/CD

CovidR and rmdgallery: A streamlined process for collecting community contributions in a gallery website

Riccardo Porreca, Francesca Vitalini

Mirai Solutions GmbH, Switzerland

The active open source community is certainly one of R’s greatest assets. During the organization of the virtual e-Rum2020, in the midst of the pandemic outbreak, the Organizing Committee thought of engaging the R community by gathering contributions around the topic of COVID-19, as part of the pre-conference event CovidR (https://2020.erum.io/covidr-contest). This came with the need for a smooth and integrated way of collecting community work, which was elegantly addressed in the form of a contributions gallery website (https://milano-r.github.io/erum2020-covidr-contest), populated with submissions coming as Pull Requests / Issues in a GitHub repository.

Behind the scenes, this was supported by the development of a novel package rmdgallery (https://riccardoporreca.github.io/rmdgallery), providing an R Markdown site generator for a gallery of (embedded) content created in a dynamic way based on metadata in YAML or JSON format. This simple yet flexible tool, paired with GitHub Actions and the community features of GitHub, was key to the CovidR success in a number of ways. These include: a seamless submission process, automated inclusion of contributed abstracts and content in the gallery, collecting "likes" using Utterances (https://utteranc.es), dynamic badges reflecting the status as the contest evolved, and live updates during the event announcing the awardees in a matter of a pull request.

In this talk, I will go through the main features of the package and its application for the CovidR contest, highlighting the power of using open source tools for community and collaborative initiatives.



9:00pm - 9:20pm
Talk-Live
ID: 296 / ses-09-B: 4
Regular Talk
Topics: Community and Outreach
Keywords: media, podcast, rmarkdown

How open-source and R enable state-of-the-art media production

Eric Nantz

Statistician / Podcast Host

The global pandemic has impacted numerous aspects of our lives, with many of us learning new techniques for harnessing digital technology for our daily work and for communicating with the rest of the world. The landscape of software for producing content in both audio and video has seen tremendous growth in capabilities, creating new opportunities to share knowledge. In what seems like a lifetime ago, I created the R-Podcast in 2011 as a unique way to share my journey in learning more about the R language, to provide resources to new and experienced users alike, and to showcase the awesome work of the R community. Even in that early stage, I was able to harness open-source software to produce and distribute the episodes, with R playing a prominent role in the web presence. More recently, I have launched a new venture called the Shiny Developer Series that leverages the latest open-source media production software, such as Open Broadcaster Software (OBS) and OBS Ninja, giving me full control to produce my content without any compromises. In this talk, I will share my key learnings from producing media content about R, and how I and the growing community of R media content producers are using open source to spread knowledge of R to the entire world.

 
8:00pm - 9:30pm9C - Statistical modeling and data analysis 2
Virtual location: The Lounge #talk_stats_data_analysis_2
Session Chair: Balasubramanian Narasimhan
Zoom Host: Balogun Stephen
Session Sponsor: R Consortium
Session Slides
 
8:00pm - 8:20pm
Talk-Video
ID: 145 / ses-09-C: 1
Regular Talk
Topics: Time series
Keywords: Bayesian analysis, Markov chain Monte Carlo, state space models, R package, sequential Monte Carlo

bssm: Bayesian Inference of Non-linear and Non-Gaussian State Space Models in R

Jouni Helske, Matti Vihola

University of Jyväskylä

State space models are a flexible class of latent variable models commonly used in analysing time series data. The R package bssm is designed for Bayesian inference of general state space models with non-Gaussian and/or non-linear observational and state equations. The package provides easy-to-use and efficient functions for fully Bayesian inference with common time series models such as the basic structural time series model with exogenous covariates, simple stochastic volatility models, and discretized diffusion models, making it straightforward and efficient to make predictions and perform other inference in a Bayesian setting. Unlike existing packages, bssm allows easy-to-use approximate inference based on Gaussian approximations such as the Laplace approximation and the extended Kalman filter. The inference is based on fully automatic, adaptive Markov chain Monte Carlo (MCMC) on the hyperparameters, with optional parallelizable importance sampling post-correction to eliminate any approximation bias. The bssm package also implements a direct pseudo-marginal MCMC and a delayed acceptance pseudo-marginal MCMC using intermediate approximations. The package directly supports models with linear-Gaussian state dynamics and non-Gaussian observation models, and has an Rcpp interface for specifying custom non-linear and diffusion models.
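A hedged sketch following the package's documented basic structural model example (the prior choices and MCMC length are illustrative, and details may differ across package versions):

library(bssm)

prior <- halfnormal(0.1, 1)  # weakly informative prior for the standard deviations
model <- bsm_lg(log10(UKgas),
                sd_y = prior, sd_level = prior,
                sd_slope = prior, sd_seasonal = prior)

fit <- run_mcmc(model, iter = 5000)
summary(fit)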



8:20pm - 8:40pm
Talk-Video
ID: 238 / ses-09-C: 2
Regular Talk
Topics: Bayesian models
Keywords: graphical models

bnmonitor: Checking the Robustness and Sensitivity of Bayesian Networks

Rachel Wilkerson1, Manuele Leonelli2, Ramsiya Ramanathan3

1Baylor University; 2IE University, Madrid; 3Università di Bologna, Bologna, Italy

Bayesian networks (BNs) are the most common approach to investigate the relationship between random variables. There are now a variety of R packages with the capability of learning such models from data and performing inference. The new bnmonitor R package is the only package which enables users to perform robustness and sensitivity analysis for BNs, both in the discrete and in the continuous case.

Various prequential monitors are implemented to check how well a BN describes a dataset used to learn the model. By checking the elements of the structure, we can adjust the model, presumably the best in the equivalence class, to ensure a good fit. Checking the forecasts that flow from the model allows users to check elements of the model structure in an online setting.

Furthermore the impact of the learned probabilities is investigated using sensitivity functions which describe the functional relationship between an output of interest and the model's parameters.

The output of these monitors is concisely reported via a tailored plot method taking advantage of ggplot2. We illustrate our methods with an example that explores the relationships between body measurements used to predict the percentage of body fat. Our example highlights the importance of checking a BN with the appropriate diagnostics.



8:40pm - 9:00pm
Talk-Live
ID: 293 / ses-09-C: 3
Regular Talk
Topics: Social sciences
Keywords: matching, causal, observational

FLAME: Interpretable Matching for Causal Inference

Vittorio Dominic Orlandi, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky

Duke University, United States of America

Matching methods are a class of techniques for estimating causal effects from observational data. Such methods match similar units together to emulate the randomization achieved by controlled experiments. Crucially, matching methods rely on a distance measure to determine similarity and thereby match units together. In this talk, we present an R package, FLAME, implementing the Fast, Large-scale Almost Matching Exactly (FLAME) and Dynamic Almost Matching Exactly (DAME) algorithms for performing matching on categorical datasets. These algorithms learn a weighted Hamming distance metric via machine learning on a held-out dataset and match units directly on covariate values, prioritizing matches on more important covariates. The R package features an efficient bit-vector implementation, allowing it to scale to datasets with hundreds of thousands of units and dozens of covariates, with a database implementation under development that allows it to operate on datasets too large to fit in memory. FLAME provides easy summarization, analysis, and visualization of treatment effect estimates, and features a wide variety of options for how matching is to be performed, allowing users to make analysis-specific decisions throughout the matching procedure. We present an overview of the main functionality of the package and then illustrate an application to the 2010 US NCHS Natality Dataset, in which we study the effect of smoking during pregnancy on NICU admissions.

Link to package or code repository.
https://github.com/vittorioorlandi/FLAME


9:00pm - 9:20pm
sponsored-video
ID: 363 / ses-09-C: 4
Sponsored Talk
Topics: Community and Outreach

R Consortium and You: How you can help us connect the dots

Mehar Pratap Singh

ProCogia

This presentation will showcase what the R Consortium is and how it improves the R ecosystem. A lot is going on! We have awarded over USD 1,300,000 in funding to the R community. The Infrastructure Steering Committee (ISC) and the diverse working group (WG) programs have stimulated conversation and alignment on crucial areas such as industry adoption, package health, and educational standards. This presentation will showcase several working groups and projects funded and shepherded by the R Consortium. The discussion will enable the audience to understand the work being done and the exciting opportunities to participate in these initiatives.

 
9:30pm - 11:30pmClosing and Awards
Virtual location: The Lounge #announcements
Session Chair: Andrea Sánchez-Tapia
Session Chair: Marcela Alfaro Cordoba
Zoom Host: Pamela Pairo
We will announce awards for presentations and recognize the work of those who have helped the R community grow. We will also have a short presentation about the next useR!

 