Conference Agenda

5B - Mathematical/Statistical Methods
Thursday, 08/July/2021:
1:45am - 3:15am

Session Chair: Marcela Alfaro Cordoba
Zoom Host: Nick Spyrison
Replacement Zoom Host: Olgun Aydin
Virtual location: The Lounge #talk_math_stats

Session Topics:
Data mining / Machine learning / Deep Learning and AI, Statistical models, Operational research and optimization

1:45am - 2:05am
ID: 254 / ses-05-B: 1
Regular Talk
Topics: Statistical models
Keywords: Model misspecification, tidyverse, assumptions, variance estimation

maars: Tidy Inference under misspecified statistical models in R

Riccardo Fogliato, Shamindra Shrotriya, Arun Kumar Kuchibhotla

Carnegie Mellon University, United States of America

Linear regression using ordinary least squares (OLS) is a critical part of every statistician's toolkit. In R, this is elegantly implemented via lm() and its related functions. However, the statistical inference output from this suite of functions is based on the assumption that the model is well specified. This assumption is often unrealistic and at best satisfied approximately. In the statistics and econometrics literature, this has long been recognized, and a large body of work provides inference for OLS under more practical assumptions (e.g., only assuming independence of the observations). In this talk, we will introduce our package “maars” (models as approximations), which aims to bring research on inference in misspecified models to R via a comprehensive workflow. Our “maars” package differs from other packages that also implement variance estimation, such as “sandwich”, in three key ways. First, all functions in “maars” follow a consistent grammar and return output in tidy format (Wickham, 2014), with minimal deviation from the typical lm() workflow. Second, “maars” contains several tools for inference, including the empirical, wild, and residual bootstrap, as well as subsampling. Third, “maars” is developed with pedagogy in mind: most of its functions explicitly return the assumptions under which the output is valid. This key innovation makes “maars” useful in teaching inference under misspecification and also a powerful tool for applied researchers. We hope our default feature of explicitly presenting assumptions will become a de facto standard for most statistical modeling in R.
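
As a rough illustration of one of the tools named in the abstract, the empirical (pairs) bootstrap for an OLS slope can be sketched in a few lines. This is not the “maars” API: the sketch is written in plain Python, and the function names `ols_slope` and `empirical_bootstrap_se`, as well as the simulated heteroscedastic data, are assumptions made only for this example.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (model with intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def empirical_bootstrap_se(xs, ys, n_boot=500, seed=0):
    """Standard error of the slope from resampling (x, y) pairs with replacement.

    Resampling whole pairs only relies on the observations being independent,
    not on the linear model being correctly specified.
    """
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.stdev(slopes)

# Hypothetical data: the noise standard deviation grows with x, so the
# constant-variance assumption behind classical OLS standard errors fails.
rng = random.Random(42)
xs = [rng.uniform(0.0, 3.0) for _ in range(200)]
ys = [1.0 + 2.0 * x + x * rng.gauss(0.0, 1.0) for x in xs]

slope = ols_slope(xs, ys)
se = empirical_bootstrap_se(xs, ys)
print(f"slope = {slope:.3f}, bootstrap SE = {se:.3f}")
```

The bootstrap standard error here stands in for the kind of misspecification-robust variance estimate the abstract refers to; the wild and residual bootstrap differ in what is resampled and in the assumptions they require.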

2:05am - 2:25am
ID: 285 / ses-05-B: 2
Regular Talk
Topics: Operational research and optimization
Keywords: Evolutionary Strategies, Mixed Integer Problems, Multifidelity Optimization, Black Box Optimization, Multi-Objective Optimization

Mixed Integer Evolutionary Strategies with "miesmuschel"

Martin Binder

LMU Munich, Germany

Evolutionary Strategies (ES) are optimization algorithms inspired by biological evolution that do not make use of gradient information, and are therefore well-suited for "black-box optimization" where this information is not available. Mixed-Integer ES (MIES) are an extension that allows optimization over mixed continuous, integer, and categorical search spaces by defining different mutation and recombination operations on different subspaces. We present our new package "miesmuschel" (pronounced MEES-mooshl), a modular toolbox for MIES optimization. It provides "Operator" objects for mutation, recombination, and parent/survival selection that can be configured and combined in various ways to match the optimization problem at hand. Configuration parameters of operators can even be self-adaptive and evolve together with the solutions of the optimization problem. Miesmuschel can be used for both single- and multi-objective optimization, simply by using different selection operations. The multi-fidelity optimization capabilities of miesmuschel can be used for expensive objectives where early generations or new samples are preliminarily evaluated with less effort.

A standard optimization loop (parent selection, recombination, mutation, survival selection) is given and can be used out-of-the-box, but the supplied methods can also be combined as building blocks to form more specialized algorithms.

Miesmuschel makes use of the "paradox" and "bbotk" packages and integrates with the "mlr3" ecosystem.
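
The standard loop described in this abstract (parent selection, recombination, mutation, survival selection) can be sketched generically as follows. This is a minimal illustration in plain Python and does not use or mirror the "miesmuschel" API; the objective function, the operator functions, and all parameter values are hypothetical choices made for this example.

```python
import random

CATEGORIES = ["a", "b", "c"]

def objective(x, k, c):
    """Hypothetical black-box objective to minimise; optimum at x=1.5, k=3, c='b'."""
    penalty = {"a": 1.0, "b": 0.0, "c": 2.0}[c]
    return (x - 1.5) ** 2 + (k - 3) ** 2 + penalty

def mutate(individual, rng):
    """Apply a different mutation operator to each subspace."""
    x, k, c = individual
    x = x + rng.gauss(0.0, 0.3)                   # continuous: Gaussian step
    k = min(10, max(0, k + rng.choice([-1, 0, 1])))  # integer: unit step, clamped
    if rng.random() < 0.2:                        # categorical: random reset
        c = rng.choice(CATEGORIES)
    return (x, k, c)

def recombine(parent1, parent2, rng):
    """Uniform crossover: each component is copied from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(parent1, parent2))

def run_mies(generations=60, mu=5, lam=20, seed=1):
    rng = random.Random(seed)
    population = [
        (rng.uniform(-5.0, 5.0), rng.randint(0, 10), rng.choice(CATEGORIES))
        for _ in range(mu)
    ]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            p1, p2 = rng.sample(population, 2)    # parent selection (uniform)
            offspring.append(mutate(recombine(p1, p2, rng), rng))
        # (mu + lambda) survival selection: keep the best mu of parents and offspring
        population = sorted(population + offspring,
                            key=lambda ind: objective(*ind))[:mu]
    return population[0]

best = run_mies()
print(best, objective(*best))
```

Swapping the `sorted(...)` step for a different selection rule, or replacing `mutate` with another operator, changes the algorithm without touching the rest of the loop, which is the building-block idea the abstract describes.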


2:25am - 2:45am
ID: 110 / ses-05-B: 3
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: anomaly detection

Here is the anomalow-down!

Sevvandi Kandanaarachchi (1), Rob J Hyndman (2)

(1) RMIT University; (2) Monash University

Why should we care about anomalies? They demand our attention because they are telling a different story from the norm. An anomaly might signify a patient's failing heart rate, fraudulent credit card activity, or an early indication of a tsunami. As such, it is extremely important to detect anomalies.

What are the challenges in anomaly detection? As with many machine/statistical learning tasks, high-dimensional data poses a problem. Another challenge is selecting appropriate parameters. Yet another challenge is high false positive rates.

In this talk we introduce two R packages – dobin and lookout – that address different challenges in anomaly detection. Dobin is a dimension reduction technique tailored to anomaly detection. It is somewhat similar to PCA, but dobin puts anomalies in the forefront. We can use dobin as a pre-processing step and find anomalies using fewer dimensions.

On the other hand, lookout is an anomaly detection method that uses kernel density estimates and extreme value theory. But there is a difference. Generally, anomaly detection methods that use kernel density estimates require a user-defined bandwidth parameter. But does the user know how to specify this elusive bandwidth parameter? Lookout addresses this challenge by constructing an appropriate bandwidth for anomaly detection using topological data analysis, so the user doesn’t need to specify a bandwidth parameter. Furthermore, lookout has a low false positive rate because it uses extreme value theory.

We also introduce the concept of anomaly persistence, which explores the birth and death of anomalies as the bandwidth changes. If a data point is identified as an anomaly for a large range of bandwidth values, then its significance as an anomaly is high.
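
The idea of anomaly persistence can be illustrated with a small sketch. It is not the lookout algorithm: lookout selects its bandwidth via topological data analysis and flags anomalies using extreme value theory, whereas this example substitutes a fixed bandwidth grid and a simple rule (leave-one-out density below a quarter of the median density). The function names, the threshold, and the toy data are all assumptions made for illustration.

```python
import math
import statistics

def loo_density(points, i, bandwidth):
    """Leave-one-out Gaussian kernel density estimate at points[i]."""
    total = sum(
        math.exp(-0.5 * ((points[i] - p) / bandwidth) ** 2)
        for j, p in enumerate(points) if j != i
    )
    return total / ((len(points) - 1) * bandwidth * math.sqrt(2.0 * math.pi))

def flag_anomalies(points, bandwidth, ratio=0.25):
    """Flag points whose density is below `ratio` times the median density.

    Stand-in rule for this sketch only; lookout itself uses extreme value theory.
    """
    dens = [loo_density(points, i, bandwidth) for i in range(len(points))]
    cutoff = ratio * statistics.median(dens)
    return [d < cutoff for d in dens]

def persistence(points, bandwidths):
    """Fraction of bandwidths at which each point is flagged as an anomaly."""
    counts = [0] * len(points)
    for h in bandwidths:
        for i, is_flagged in enumerate(flag_anomalies(points, h)):
            counts[i] += is_flagged
    return [c / len(bandwidths) for c in counts]

# Toy data: a cluster on [0, 2], one mildly isolated point, one clear outlier.
cluster = [i * 0.1 for i in range(21)]
points = cluster + [2.6, 6.0]
scores = persistence(points, bandwidths=[0.2, 0.4, 0.8, 1.6])

for value, score in zip(points[-2:], scores[-2:]):
    print(f"point {value}: flagged at {score:.0%} of bandwidths")
```

A point that stays flagged across the whole bandwidth range (the outlier at 6.0) is a more significant anomaly than one flagged only at small bandwidths (the point at 2.6), which matches the interpretation given in the abstract.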