Conference Agenda

7B: Statistical modeling in R
Thursday, 08/July/2021:
9:15am - 10:45am

Session Chair: Sevvandi Kandanaarachchi
Zoom Host: Nick Spyrison
Replacement Zoom Host: Olgun Aydin
Virtual location: The Lounge #talk_stats

Session Topics:
Statistical models

9:15am - 9:35am
ID: 207 / ses-07-B: 1
Regular Talk
Topics: Statistical models
Keywords: distributional regression, probabilistic forecasts, regression trees, random forests, graphical model assessment

Probability Distribution Forecasts: Learning with Random Forests and Graphical Assessment

Moritz N. Lang1, Reto Stauffer1,2, Achim Zeileis1

1Department of Statistics, Faculty of Economics and Statistics, Universität Innsbruck; 2Digital Science Center, Universität Innsbruck

Forecasts in terms of entire probability distributions (often called "probabilistic forecasts" for short) - as opposed to predictions of only the mean of these distributions - are of prime importance in many different disciplines from natural sciences to social sciences and beyond. Hence, distributional regression models have been receiving increasing interest over the last decade. Here, we make contributions to two common challenges in distributional regression modeling:

1. Obtaining sufficiently flexible regression models that can capture complex patterns in a data-driven way.

2. Assessing the goodness-of-fit of distributional models both in-sample and out-of-sample using visualizations that bring out potential deficits of these models.

Regarding challenge 1, we present the R package "disttree" (Schlosser et al. 2021), which implements distributional trees and forests (Schlosser et al. 2019). These blend the recursive partitioning strategy of classical regression trees and random forests with distributional modeling. The resulting tree-based models can capture nonlinear effects and interactions and automatically select the relevant covariates that determine differences in the underlying distributional parameters.

For graphically evaluating the goodness-of-fit of the resulting probabilistic forecasts (challenge 2), the R package "topmodels" (Zeileis et al. 2021) is introduced, providing extensible probabilistic forecasting infrastructure and corresponding diagnostic graphics such as Q-Q plots of randomized residuals, PIT (probability integral transform) histograms, reliability diagrams, and rootograms. In addition to distributional trees and forests, other models can be plugged into these displays, which can be rendered both in base R graphics and in "ggplot2" (Wickham 2016).
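As an illustration of one of the diagnostics named above, the following is a minimal base R sketch of a PIT histogram. It does not call "topmodels" or "disttree"; the simulated data, the variable names (pit_lm, pit_true) and the choice of a constant-variance Gaussian forecast from lm() are assumptions made purely for this example. For a well-calibrated forecast the PIT values should look roughly uniform on [0, 1], so the histogram bars should sit near the dashed line.

```r
# Simulated heteroscedastic data (assumption for illustration):
# both the mean and the spread of y depend on x.
set.seed(1)
n <- 500
x <- runif(n)
y <- rnorm(n, mean = 1 + 2 * x, sd = 0.2 + 1.5 * x)

# Forecast A: Gaussian with constant variance, taken from a linear model.
fit <- lm(y ~ x)
sigma_hat <- summary(fit)$sigma
pit_lm <- pnorm(y, mean = fitted(fit), sd = sigma_hat)

# Forecast B: the true data-generating distribution, for comparison.
pit_true <- pnorm(y, mean = 1 + 2 * x, sd = 0.2 + 1.5 * x)

# PIT histograms: each PIT value is the forecast CDF evaluated at the
# observation; the dashed line marks the uniform density.
breaks <- seq(0, 1, by = 0.1)
op <- par(mfrow = c(1, 2))
hist(pit_lm, breaks = breaks, freq = FALSE,
     main = "Constant-variance forecast", xlab = "PIT")
abline(h = 1, lty = 2)
hist(pit_true, breaks = breaks, freq = FALSE,
     main = "True distribution", xlab = "PIT")
abline(h = 1, lty = 2)
par(op)
```

In this setup the constant-variance forecast ignores the dependence of the spread on x, so its histogram would be expected to deviate from the flat shape more visibly than the one based on the true distribution.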

9:35am - 9:55am
ID: 178 / ses-07-B: 2
Regular Talk
Topics: Statistical models

spaMM: an R package to fit generalized, linear, and mixed models allowing for complex covariance structures

François Rousset1, Alexandre Courtiol2

1Univ. Montpellier, CNRS, Institut des Sciences de l'Evolution, Montpellier, France; 2Leibniz Institute for Zoo and Wildlife Research, Berlin

Introduced to make the fit of spatial Mixed Models accessible, the R package spaMM has grown a lot since its first CRAN release eight years ago. The package now offers the possibility to fit a variety of regression models, from simple linear models (LM) to generalised linear mixed-effects models (GLMM), including multivariate-response models and double hierarchical GLMMs (DHGLM), in which both the mean of a response and the residual variance can be modelled as a function of fixed and random effects. The package provides a diversity of response families beyond the standard ones, such as the negative binomial (truncated or not) and the Conway-Maxwell-Poisson, as well as non-Gaussian random-effect families such as the inverse Gaussian. Random effects can further be modelled using several autocorrelation functions to account for spatial, temporal and other forms of dependence between observations (e.g. genetic pedigrees). spaMM handles this diversity of models through a simple formula-based interface akin to glm() or lme4::glmer(). Advanced users will nonetheless appreciate the possibility to fine-tune many aspects of the fit (e.g. select among several likelihood approximations; set parameters to fixed values). The package also provides tailored methods for many generics: for instance, anova() can be called to perform likelihood ratio tests by parametric bootstrap, and AIC() computes both the marginal and conditional AIC. Finally, the package is competitive in terms of computational speed for non-spatial, geostatistical, and autoregressive models alike.

9:55am - 10:15am
ID: 362 / ses-07-B: 3
Regular Talk
Topics: Statistical models
Keywords: big data

Changed to Elevator Pitch: The one-step estimation procedure in R

Alexandre Brouste1, Christophe Dutang2

1Le Mans Université; 2Université Paris-Dauphine

In finite-dimensional parameter estimation, the Le Cam one-step procedure is based on an initial guess estimator and a Fisher scoring step on the log-likelihood function. For an initial $\sqrt{n}$-consistent guess estimator, the one-step estimation procedure is asymptotically efficient. When the guess estimator is available in closed form, the one-step estimator can also be computed faster than the maximum likelihood estimator (MLE). More recently, it has been shown that this procedure can be extended to an initial guess estimator with a slower speed of convergence. Based on this result, we propose in the OneStep package (available on CRAN) a procedure to compute the one-step estimator in any situation faster than the MLE for large datasets. Monte Carlo simulations are carried out for several examples of statistical experiments generated by i.i.d. observation samples (discrete and continuous probability distributions). Thereby, we exhibit the performance of Le Cam’s one-step estimation procedure in terms of efficiency and computational cost on observation samples of finite size. A real application and the future package developments will also be discussed.
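The Fisher scoring step described above can be written, in notation assumed here for illustration, as $\hat\theta_n^{(1)} = \tilde\theta_n + \frac{1}{n} I(\tilde\theta_n)^{-1} \sum_{i=1}^{n} \dot\ell(\tilde\theta_n; X_i)$, where $\tilde\theta_n$ is the initial guess, $\dot\ell$ the score of a single observation and $I$ the Fisher information per observation. The base R sketch below applies it to a Cauchy location model with unit scale, using the sample median as the initial guess. It does not use the OneStep package; the model choice and all variable names are assumptions for this example only.

```r
# Simulated Cauchy sample with known location (assumption for illustration).
set.seed(42)
n <- 1e5
theta_true <- 3
x <- rcauchy(n, location = theta_true, scale = 1)

# Initial guess: the sample median, which is closed-form and
# sqrt(n)-consistent for the Cauchy location parameter.
theta0 <- median(x)

# Score of the log-likelihood at theta0:
# d/dtheta of -log(1 + (x - theta)^2), summed over observations.
score <- sum(2 * (x - theta0) / (1 + (x - theta0)^2))

# Fisher information per observation for the Cauchy location model
# with unit scale is 1/2.
fisher_info <- 1 / 2

# One Fisher scoring step starting from the initial guess.
theta1 <- theta0 + score / (n * fisher_info)

# Numerical MLE for reference, searched in a neighbourhood of theta0.
negloglik <- function(theta) -sum(dcauchy(x, location = theta, log = TRUE))
mle <- optimize(negloglik, interval = theta0 + c(-1, 1))$minimum

print(c(median = theta0, one_step = theta1, mle = mle))
```

The one-step estimate needs a single pass over the data after computing the median, whereas the numerical MLE evaluates the log-likelihood repeatedly; this is the kind of cost difference the abstract refers to for large datasets.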

10:15am - 10:35am
ID: 206 / ses-07-B: 4
Regular Talk
Topics: Statistical models
Keywords: probabilistic graphical models

The R Package stagedtrees for Structural Learning of Stratified Staged Trees

Federico Carli1, Manuele Leonelli2, Eva Riccomagno1, Gherardo Varando3

1Università degli Studi di Genova, Dipartimento di Matematica, Italy; 2IE University, School of Human Sciences and Technology, Spain; 3Universitat de València, Image Processing Laboratory (IPL), Spain

stagedtrees is an R package which includes several algorithms for learning the structure of staged trees and chain event graphs from categorical data. In the past twenty years there has been an explosion in the use of graphical models to represent the relationships among a vector of random variables and to perform inference taking advantage of the underlying graphical representations. Bayesian networks are nowadays one of the most used graphical models, with applications to a wide array of domains and implementations in various software. However, they can only represent symmetric conditional independence statements, which in practical applications may not be fully justified. Most often, the greater the number of levels of the categorical variables involved, the more difficult it is for conditional independence to hold for all the variables’ levels. Therefore, models that also accommodate asymmetric relations, such as context-specific, partial and local independences, have been developed. Staged trees are one such class. Staged tree modeling has proved its worth in many fields, for instance cohort studies, causal analysis, case-control studies, Bayesian games and medical diagnosis.

stagedtrees makes it possible to estimate any type of non-symmetric conditional independence from data via score-based and clustering-based algorithms. It implements various inferential, visualization, descriptive and summary-statistics tools for such models and their graph structure. These functions help users in handling categorical experimental data and analyzing the learned models to untangle complex dependence structures.