Conference Agenda

3A - Machine Learning and Data Management
Tuesday, 06/July/2021:
8:45am - 10:15am

Session Chair: Young-suk Lee
Zoom Host: Nasrin Fathollahzadeh Attar
Replacement Zoom Host: Dorothea Hug Peter
Location: The Lounge #talk_ml_dm

Session Topics:
Data mining / Machine learning / Deep Learning and AI

Session Sponsor: MemVerge
Session Slides

8:45am - 9:05am
ID: 188 / ses-03-A: 1
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: Automated Machine Learning, R package, Hyperband

mlr3automl - Automated Machine Learning in R

Alexander Bernd Hanf

Ludwig-Maximilians-Universität Munich

We introduce mlr3automl, an open-source framework for Automated Machine Learning in R. Based on the mlr3 Machine Learning package, mlr3automl builds robust and accurate classification and regression models for tabular data.

mlr3automl provides automatic preprocessing, which guarantees stable performance in the presence of missing data, categorical and high-cardinality features, and large data sets. Preprocessing and model building are handled through a flexible pipeline implemented with mlr3pipelines. This allows mlr3automl to jointly optimize preprocessing, model selection and model hyperparameters using Hyperband.
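As a rough illustration of the idea behind Hyperband, the sketch below shows successive halving, the elimination routine that Hyperband runs in several brackets with different trade-offs between the number of candidates and the starting budget. It is written in Python for brevity and is not mlr3automl code; the names `successive_halving`, `toy_loss` and the `lr` hyperparameter are hypothetical and chosen only for this example.

```python
import random


def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the candidates each round and multiply the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget (lower is better).
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        keep = max(1, len(scored) // eta)
        survivors = scored[:keep]
        budget *= eta
    return survivors[0]


def toy_loss(cfg, budget):
    # Hypothetical loss: distance from an "ideal" learning rate of 0.1,
    # plus a penalty that shrinks as more budget is spent.
    return abs(cfg["lr"] - 0.1) + 1.0 / budget


rng = random.Random(0)
candidates = [{"lr": rng.uniform(0.001, 1.0)} for _ in range(27)]
best = successive_halving(candidates, toy_loss)
print(best)
```

Starting from 27 candidates with `eta=3`, the loop keeps 9, then 3, then 1, so most of the budget is spent on the configurations that looked promising early on.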

mlr3automl shows strong performance and stability on a benchmark consisting of 39 challenging classification tasks. mlr3automl successfully completed every task in the benchmark within the strict time budget, which three out of five other state-of-the-art AutoML systems failed to achieve.

Link to package or code repository.

9:05am - 9:25am
ID: 168 / ses-03-A: 2
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: exploratory data analysis

Triplot: model agnostic measures and visualisations for variable importance in predictive models that take into account the hierarchical correlation structure

Katarzyna Pękala, Katarzyna Woźnica, Przemysław Biecek

MI2 Data Lab, Warsaw University of Technology

One of the key elements of the explanatory analysis of a predictive model is to assess the importance of the individual variables. The rapid development of the area of predictive model exploration (also called explainable artificial intelligence or interpretable machine learning) has led to the popularization of local (instance level) and global (dataset level) methods, such as Permutational Variable Importance, Shapley Values (SHAP), Local Interpretable Model Explanations (LIME), Break Down and so on. However, these methods do not use information about the correlation between features, which significantly reduces the explainability of the model behaviour.

In this work, we propose new methods to support model analysis by exploiting the information about the correlation between variables. The dataset level aspect importance measure is inspired by the block permutations procedure, while the instance level aspect importance measure is inspired by the LIME method. We show how to analyse groups of variables (aspects) both when they are proposed by the user and when they should be determined automatically based on the hierarchical structure of correlations between variables.
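The block permutation idea mentioned above can be sketched as follows: all columns in a group (an aspect) are shuffled with the same row permutation, so correlated variables are perturbed together, and the resulting increase in error is taken as the group's importance. This is a minimal Python sketch under that assumption and not the triplot package's API; `group_permutation_importance`, `mse` and the toy model are hypothetical names used for illustration.

```python
import random


def mse(model, X, y):
    """Mean squared error of `model` (a function of one row) on the data."""
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def group_permutation_importance(model, X, y, group, n_repeats=20, seed=0):
    """Average increase in MSE when the columns in `group` are shuffled together."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)
        X_perm = []
        for i, row in enumerate(X):
            new_row = list(row)
            # Every column of the group takes its value from the same donor row,
            # which preserves the correlation inside the group.
            for col in group:
                new_row[col] = X[order[i]][col]
            X_perm.append(new_row)
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


# Toy data: columns 0 and 1 are strongly correlated, column 2 is pure noise.
data_rng = random.Random(1)
X = []
for _ in range(50):
    a = data_rng.gauss(0, 1)
    X.append([a, a + 0.1 * data_rng.gauss(0, 1), data_rng.gauss(0, 1)])


def model(row):
    # Hypothetical fitted model that uses only the two correlated columns.
    return row[0] + row[1]


y = [model(row) for row in X]

importance_correlated = group_permutation_importance(model, X, y, group=[0, 1])
importance_noise = group_permutation_importance(model, X, y, group=[2])
print(importance_correlated, importance_noise)
```

In this toy setting the correlated pair receives a positive importance, while the unused noise column receives an importance of zero.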

Additionally, we present a new type of model visualisation, triplot, that exploits a hierarchical structure of variable grouping to produce a high information density model visualisation. This visualisation provides a consistent illustration for either local or global model and data exploration.

We also show an example using real-world data with 5k instances and 37 features, in which significant correlation between variables affects the interpretation of variable importance.

The proposed method is, to our knowledge, the first to allow direct use of the correlation between variables in exploratory model analysis. The triplot package for R is developed under the open source GPL-3 licence and is available in a GitHub repository at

Link to package or code repository.

9:25am - 9:45am
ID: 252 / ses-03-A: 3
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: networks, embeddings, machine learning, algorithms

Getting sprung in R: Introduction to the rsetse package for embedding feature-rich networks

Jonathan Bourne

UCL, United Kingdom

The Strain Elevation Tension Spring embedding algorithm (SETSe) is a deterministic method for embedding feature-rich networks. The algorithm uses simple Newtonian equations of motion and Hooke's law to embed the network onto a locally Euclidean manifold. To create the embedding, SETSe converts node attributes into forces and the edge attributes into springs. SETSe finds an equilibrium position when the forces on the springs balance the forces of the nodes. The algorithm has a time complexity of O(2) and linear memory complexity. This means the algorithm avoids issues faced by other physics-based embedding methods and can be used to embed graphs with tens of thousands of nodes and more than a million edges.
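To make the physical picture concrete, the toy below relaxes a small network in one dimension: each node carries an external force, each edge acts as a Hooke's law spring, and damped Newtonian motion is integrated until the forces balance. It is a Python sketch for illustration only and is not the rsetse implementation; the function `relax`, the unit masses, the zero rest length and the damping value are all assumptions made for this example.

```python
def relax(forces, edges, dt=0.01, damping=1.0, steps=5000):
    """Integrate damped motion of unit-mass nodes until spring and node forces balance.

    forces: dict mapping node -> external force (assumed to sum to zero)
    edges:  list of (node_a, node_b, stiffness) tuples, each a spring
    """
    z = {n: 0.0 for n in forces}  # node positions (elevations)
    v = {n: 0.0 for n in forces}  # node velocities
    for _ in range(steps):
        net = dict(forces)
        for a, b, k in edges:
            pull = k * (z[b] - z[a])  # Hooke's law: force proportional to extension
            net[a] += pull
            net[b] -= pull
        for n in z:
            # Semi-implicit Euler step with velocity damping (mass = 1).
            v[n] += dt * (net[n] - damping * v[n])
            z[n] += dt * v[n]
    return z


# Two nodes pushed in opposite directions and joined by one spring.
positions = relax({"a": 1.0, "b": -1.0}, [("a", "b", 2.0)])
extension = positions["a"] - positions["b"]
print(extension)
```

At equilibrium the spring extension equals force divided by stiffness, here 1.0 / 2.0 = 0.5, which the simulation approaches as the damping removes kinetic energy.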

Some applications of SETSe are: analysing social networks, understanding the robustness of power grids, geographical analysis, predicting node features, understanding power dynamics between individuals and organisations, and analysis of molecular structures.

This presentation will provide both a brief technical discussion of the algorithm and its implementation, as well as several use cases. The use cases describe how to embed a network and then how to interpret that embedding.

There are very few options for graph embeddings using R, and this is something that rsetse seeks to address; the algorithm has been implemented in the package `rsetse` and is available on CRAN.

9:45am - 10:05am
ID: 137 / ses-03-A: 4
Regular Talk
Topics: Data mining / Machine learning / Deep Learning and AI
Keywords: data envelopment analysis

An R package for the implementation of Efficiency Analysis Trees and the estimation of technical efficiency

Miriam Esteve, Victor J. España, Juan Aparicio, Xavier Barber

Miguel Hernández University of Elche

EAT is a new R package that includes functions to estimate production frontiers and technical efficiency measures using non-parametric techniques based on CART regression trees. The package implements the main algorithms associated with a new technique introduced to estimate the efficiency of a set of decision making units in Economics and Engineering through machine learning techniques, called Efficiency Analysis Trees (Esteve et al., 2020). It encompasses the estimation of radial measures, oriented Russell efficiency measures, the directional distance function, the weighted additive model, graphical representations of the production frontier using tree-shaped structures and the ranking of input variable importance. In addition, it incorporates code to carry out an adaptation of the Random Forest algorithm to estimate technical efficiency. This work describes the methodology and application of the functions.

Link to package or code repository.