stat updates on arXiv.org

A Bayesian neural network predicts the dissolution of compact planetary systems. (arXiv:2101.04117v1 [astro-ph.EP])

Despite over three hundred years of effort, no solutions exist for predicting when a general planetary configuration will become unstable. We introduce a deep learning architecture to push forward this problem for compact systems. While current machine learning algorithms in this area rely on scientist-derived instability metrics, our new technique learns its own metrics from scratch, enabled by a novel internal structure inspired from dynamics theory. Our Bayesian neural network model can accurately predict not only if, but also when a compact planetary system with three or more planets will go unstable. Our model, trained directly from short N-body time series of raw orbital elements, is more than two orders of magnitude more accurate at predicting instability times than analytical estimators, while also reducing the bias of existing machine learning algorithms by nearly a factor of three. Despite being trained on compact resonant and near-resonant three-planet configurations, the model demonstrates robust generalization to both non-resonant and higher multiplicity configurations, in the latter case outperforming models fit to that specific set of integrations. The model computes instability estimates up to five orders of magnitude faster than a numerical integrator, and unlike previous efforts provides confidence intervals on its predictions. Our inference model is publicly available in the SPOCK package, with training code open-sourced.

General Hannan and Quinn Criterion for Common Time Series. (arXiv:2101.04210v1 [math.ST])

This paper aims to study data driven model selection criteria for a large class of time series, which includes ARMA or AR($\infty$) processes, as well as GARCH or ARCH($\infty$), APARCH and many others processes. We tackled the challenging issue of designing adaptive criteria which enjoys the strong consistency property. When the observations are generated from one of the aforementioned models, the new criteria, select the true model almost surely asymptotically. The proposed criteria are based on the minimization of a penalized contrast akin to the Hannan and Quinn's criterion and then involved a term which is known for most classical time series models and for more complex models, this term can be data driven calibrated. Monte-Carlo experiments and an illustrative example on the CAC 40 index are performed to highlight the obtained results.

Flexible Validity Conditions for the Multivariate Mat\'ern Covariance in any Spatial Dimension and for any Number of Components. (arXiv:2101.04235v1 [stat.ME])

Flexible multivariate covariance models for spatial data are on demand. This paper addresses the problem of parametric constraints for positive semidefiniteness of the multivariate Mat{\'e}rn model. Much attention has been given to the bivariate case, while highly multivariate cases have been explored to a limited extent only. The existing conditions often imply severe restrictions on the upper bounds for the collocated correlation coefficients, which makes the multivariate Mat{\'e}rn model appealing for the case of weak spatial cross-dependence only. We provide a collection of validity conditions for the multivariate Mat{\'e}rn covariance model that allows for more flexible parameterizations than those currently available. We also prove that, in several cases, we can attain much higher upper bounds for the collocated correlation coefficients in comparison with our competitors. We conclude with a simple illustration on a trivariate geochemical dataset and show that our enlarged parametric space allows for better fitting performance with respect to our competitors.

On the Convergence of Deep Networks with Sample Quadratic Overparameterization. (arXiv:2101.04243v1 [cs.LG])

The remarkable ability of deep neural networks to perfectly fit training data when optimized by gradient-based algorithms is yet to be fully explained theoretically. Explanations by recent theoretical works rely on the networks to be wider by orders of magnitude than the ones used in practice. In this work, we take a step towards closing the gap between theory and practice. We show that a randomly initialized deep neural network with ReLU activation converges to a global minimum in a logarithmic number of gradient-descent iterations, under a considerably milder condition on its width. Our analysis is based on a novel technique of training a network with fixed activation patterns. We study the unique properties of the technique that allow an improved convergence, and can be transformed at any time to an equivalent ReLU network of a reasonable size. We derive a tight finite-width Neural Tangent Kernel (NTK) equivalence, suggesting that neural networks trained with our technique generalize well at least as good as its NTK, and it can be used to study generalization as well.

Estimating the probability that a given vector is in the convex hull of a random sample. (arXiv:2101.04250v1 [math.PR])

For a $d$-dimensional random vector $X$, let $p_{n, X}$ be the probability that the convex hull of $n$ i.i.d. copies of $X$ contains a given point $x$. We provide several sharp inequalities regarding $p_{n, X}$ and $N_X$, which denotes the smallest $n$ with $p_{n, X} \ge 1/2$. As a main result, we derive a totally general inequality which states $1/2 \le \alpha_X N_X \le 16d$, where $\alpha_X$ (a.k.a. the Tukey depth) is the infimum of the probability that $X$ is contained in a fixed closed halfspace including the point $x$. We also provide some applications of our results, one of which gives a moment-based bound of $N_X$ via the Berry-Esseen type estimate.

Evaluation of Logistic Regression Applied to Respondent-Driven Samples: Simulated and Real Data. (arXiv:2101.04253v1 [stat.ME])

Objective: To investigate the impact of different logistic regression estimators applied to RDS samples obtained by simulation and real data. Methods: Four simulated populations were created combining different connectivity models, levels of clusterization and infection processes. Each subject in the population received two attributes, only one of them related to the infection process. From each population, RDS samples with different sizes were obtained. Similarly, RDS samples were obtained from a real-world dataset. Three logistic regression estimators were applied to assess the association between the attributes and the infection status, and subsequently the observed coverage of each was measured. Results: The type of connectivity had more impact on estimators performance than the clusterization level. In simulated datasets, unweighted logistic regression estimators emerged as the best option, although all estimators showed a fairly good performance. In the real dataset, the performance of weighted estimators presented some instabilities, making them a risky option. Conclusion: An unweighted logistic regression estimator is a reliable option to be applied to RDS samples, with similar performance to random samples and, therefore, should be the preferred option.

Weighted Approach for Estimating Effects in Principal Strata with Missing Data for a Categorical Post-Baseline Variable in Randomized Controlled Trials. (arXiv:2101.04263v1 [stat.ME])

This research was motivated by studying anti-drug antibody (ADA) formation and its potential impact on long-term benefit of a biologic treatment in a randomized controlled trial, in which ADA status was not only unobserved in the control arm but also in a subset of patients from the experimental treatment arm. Recent literature considers the principal stratum estimand strategy to estimate treatment effect in groups of patients defined by an intercurrent status, i.e. in groups defined by a post-randomization variable only observed in one arm and potentially associated with the outcome. However, status information might be missing even for a non-negligible number of patients in the experimental arm. For this setting, a novel weighted principal stratum approach is presented: Data from patients with missing intercurrent event status were re-weighted based on baseline covariates and additional longitudinal information. A theoretical justification of the proposed approach is provided for different types of outcomes, and assumptions allowing for causal conclusions on treatment effect are specified and investigated. Simulations demonstrated that the proposed method yielded valid inference and was robust against certain violations of assumptions. The method was shown to perform well in a clinical study with ADA status as an intercurrent event.

High-Dimensional Low-Rank Tensor Autoregressive Time Series Modeling. (arXiv:2101.04276v1 [stat.ME])

Modern technological advances have enabled an unprecedented amount of structured data with complex temporal dependence, urging the need for new methods to efficiently model and forecast high-dimensional tensor-valued time series. This paper provides the first practical tool to accomplish this task via autoregression (AR). By considering a low-rank Tucker decomposition for the transition tensor, the proposed tensor autoregression can flexibly capture the underlying low-dimensional tensor dynamics, providing both substantial dimension reduction and meaningful dynamic factor interpretation. For this model, we introduce both low-dimensional rank-constrained estimator and high-dimensional regularized estimators, and derive their asymptotic and non-asymptotic properties. In particular, by leveraging the special balanced structure of the AR transition tensor, a novel convex regularization approach, based on the sum of nuclear norms of square matricizations, is proposed to efficiently encourage low-rankness of the coefficient tensor. A truncation method is further introduced to consistently select the Tucker ranks. Simulation experiments and real data analysis demonstrate the advantages of the proposed approach over various competing ones.

Mode Hunting Using Pettiest Components Analysis. (arXiv:2101.04288v1 [stat.ME])

Principal component analysis has been used to reduce dimensionality of datasets for a long time. In this paper, we will demonstrate that in mode detection the components of smallest variance, the pettiest components, are more important. We prove that when the data follows a multivariate normal distribution, by implementing "pettiest component analysis" when the data is normally distributed, we obtain boxes of optimal size in the sense that their size is minimal over all possible boxes with the same number of dimensions and given probability. We illustrate our result with a simulation revealing that pettiest component analysis works better than its competitors.

Regret Analysis of Distributed Gaussian Process Estimation and Coverage. (arXiv:2101.04306v1 [cs.RO])

We study the problem of distributed multi-robot coverage over an unknown, nonuniform sensory field. Modeling the sensory field as a realization of a Gaussian Process and using Bayesian techniques, we devise a policy which aims to balance the tradeoff between learning the sensory function and covering the environment. We propose an adaptive coverage algorithm called Deterministic Sequencing of Learning and Coverage (DSLC) that schedules learning and coverage epochs such that its emphasis gradually shifts from exploration to exploitation while never fully ceasing to learn. Using a novel definition of coverage regret which characterizes overall coverage performance of a multi-robot team over a time horizon $T$, we analyze DSLC to provide an upper bound on expected cumulative coverage regret. Finally, we illustrate the empirical performance of the algorithm through simulations of the coverage task over an unknown distribution of wildfires.

Change-point detection using spectral PCA for multivariate time series. (arXiv:2101.04334v1 [stat.AP])

We propose a two-stage approach Spec PC-CP to identify change points in multivariate time series. In the first stage, we obtain a low-dimensional summary of the high-dimensional time series by Spectral Principal Component Analysis (Spec-PCA). In the second stage, we apply cumulative sum-type test on the Spectral PCA component using a binary segmentation algorithm. Compared with existing approaches, the proposed method is able to capture the lead-lag relationship in time series. Our simulations demonstrate that the Spec PC-CP method performs significantly better than competing methods for detecting change points in high-dimensional time series. The results on epileptic seizure EEG data and stock data also indicate that our new method can efficiently {detect} change points corresponding to the onset of the underlying events.

The Beta-Mixture Shrinkage Prior for Sparse Covariances with Posterior Minimax Rates. (arXiv:2101.04351v1 [math.ST])

Statistical inference for sparse covariance matrices is crucial to reveal dependence structure of large multivariate data sets, but lacks scalable and theoretically supported Bayesian methods. In this paper, we propose beta-mixture shrinkage prior, computationally more efficient than the spike and slab prior, for sparse covariance matrices and establish its minimax optimality in high-dimensional settings. The proposed prior consists of beta-mixture shrinkage and gamma priors for off-diagonal and diagonal entries, respectively. To ensure positive definiteness of the resulting covariance matrix, we further restrict the support of the prior to a subspace of positive definite matrices. We obtain the posterior convergence rate of the induced posterior under the Frobenius norm and establish a minimax lower bound for sparse covariance matrices. The class of sparse covariance matrices for the minimax lower bound considered in this paper is controlled by the number of nonzero off-diagonal elements and has more intuitive appeal than those appeared in the literature. The obtained posterior convergence rate coincides with the minimax lower bound unless the true covariance matrix is extremely sparse. In the simulation study, we show that the proposed method is computationally more efficient than competitors, while achieving comparable performance. Advantages of the shrinkage prior are demonstrated based on two real data sets.

Dynamic Spectrum Access using Stochastic Multi-User Bandits. (arXiv:2101.04388v1 [cs.IT])

A stochastic multi-user multi-armed bandit framework is used to develop algorithms for uncoordinated spectrum access. In contrast to prior work, it is assumed that rewards can be non-zero even under collisions, thus allowing for the number of users to be greater than the number of channels. The proposed algorithm consists of an estimation phase and an allocation phase. It is shown that if every user adopts the algorithm, the system wide regret is order-optimal of order $O(\log T)$ over a time-horizon of duration $T$. The regret guarantees hold for both the cases where the number of users is greater than or less than the number of channels. The algorithm is extended to the dynamic case where the number of users in the system evolves over time, and is shown to lead to sub-linear regret.

New Bias Calibration for Robust Estimation in Small Areas. (arXiv:2101.04390v1 [stat.ME])

Using sample surveys as a cost effective tool to provide estimates for characteristics of interest at population and sub-populations (area/domain) level has a long tradition in "small area estimation". However, the existence of outliers in the sample data can significantly affect the estimation for areas in which they occur, especially where the domain-sample size is small. Based on existing robust estimators for small area estimation we propose two novel approaches for bias calibration. A series of simulations shows that our methods lead to more efficient estimators in comparison with other existing bias-calibration methods. As a real data example we apply our estimators to obtain \textit{Gini} coefficients in labour market areas of the Tuscany region of Italy, where our sources of information are the EU-SILC survey and the Italian census. This analysis shows that the new methods reveal a different picture than existing methods. We extend our ideas to predictions for non-sampled areas.

Statistical analysis of periodic data in neuroscience. (arXiv:2101.04408v1 [stat.ME])

Many experimental paradigms in neuroscience involve driving the nervous system with periodic sensory stimuli. Neural signals recorded with a variety of techniques will then include phase-locked oscillations at the stimulation frequency. The analysis of such data often involves standard univariate statistics such as T-tests, conducted on the Fourier amplitude components (ignoring phase). However, the assumptions of these tests will often be violated because amplitudes are not normally distributed, and furthermore weak signals might be missed if the phase information is discarded. An alternative approach is to conduct multivariate statistical tests using the real and imaginary Fourier components. Here the performance of two multivariate extensions of the T-test are compared: Hotelling's $T^2$ and a variant called $T^2_{circ}$. A novel test of the assumptions of $T^2_{circ}$ is developed, based on the condition index of the data (the square root of the ratio of eigenvalues of a bounding ellipse), and a heuristic for excluding outliers using the Mahalanobis distance is proposed. The $T^2_{circ}$ statistic is then extended to multi-level designs, resulting in a new statistical test termed $ANOVA^2_{circ}$. This has identical assumptions to $T^2_{circ}$, and is shown to be more sensitive than MANOVA when these assumptions are met. The use of these tests is demonstrated for two publicly available empirical data sets, and practical guidance is suggested for choosing which test to run. Implementations of these novel tools are provided as an R package, in the hope that their wider adoption will improve the sensitivity of statistical inferences involving periodic data.

Penalized regression calibration: a method for the prediction of survival outcomes using complex longitudinal and high-dimensional data. (arXiv:2101.04426v1 [stat.ME])

Longitudinal and high-dimensional measurements have become increasingly common in biomedical research. However, methods to predict survival outcomes using covariates that are both longitudinal and high-dimensional are currently missing. In this article we propose penalized regression calibration (PRC), a method that can be employed to predict survival in such situations.

PRC comprises three modelling steps: first, the trajectories described by the longitudinal predictors are flexibly modelled through the specification of multivariate latent process mixed models. Second, subject-specific summaries of the longitudinal trajectories are derived from the fitted mixed effects models. Third, the time to event outcome is predicted using the subject-specific summaries as covariates in a penalized Cox model.

To ensure a proper internal validation of the fitted PRC models, we furthermore develop a cluster bootstrap optimism correction procedure (CBOCP) that allows to correct for the optimistic bias of apparent measures of predictiveness.

After studying the behaviour of PRC via simulations, we conclude by illustrating an application of PRC to data from an observational study that involved patients affected by Duchenne muscular dystrophy (DMD), where the goal is predict time to loss of ambulation using longitudinal blood biomarkers.

Ergodic Exploration using Tensor Train: Applications in Insertion Tasks. (arXiv:2101.04428v1 [cs.RO])

By generating control policies that create natural search behaviors in autonomous systems, ergodic control provides a principled solution to address tasks that require exploration. A large class of ergodic control algorithms relies on spectral analysis, which suffers from the curse of dimensionality, both in storage and computation. This drawback has prohibited the application of ergodic control in robot manipulation since it often requires exploration in state space with more than 2 dimensions. Indeed, the original ergodic control formulation will typically not allow exploratory behaviors to be generated for a complete 6D end-effector pose. In this paper, we propose a solution for ergodic exploration based on the spectral analysis in multidimensional spaces using low-rank tensor approximation techniques. We rely on tensor train decomposition, a recent approach from multilinear algebra for low-rank approximation and efficient computation of multidimensional arrays. The proposed solution is efficient both computationally and storage-wise, hence making it suitable for its online implementation in robotic systems. The approach is applied to a peg-in-hole insertion task using a 7-axis Franka Emika Panda robot, where ergodic exploration allows the task to be achieved without requiring the use of force/torque sensors.

A patient-specific approach for quantitative and automatic analysis of computed tomography images in lung disease: application to COVID-19 patients. (arXiv:2101.04430v1 [physics.med-ph])

Quantitative metrics in lung computed tomography (CT) images have been widely used, often without a clear connection with physiology. This work proposes a patient-independent model for the estimation of well-aerated volume of lungs in CT images (WAVE). A Gaussian fit, with mean (Mu.f) and width (Sigma.f) values, was applied to the lower CT histogram data points of the lung to provide the estimation of the well-aerated lung volume (WAVE.f). Independence from CT reconstruction parameters and respiratory cycle was analysed using healthy lung CT images and 4DCT acquisitions. The Gaussian metrics and first order radiomic features calculated for a third cohort of COVID-19 patients were compared with those relative to healthy lungs. Each lung was further segmented in 24 subregions and a new biomarker derived from Gaussian fit parameter Mu.f was proposed to represent the local density changes. WAVE.f resulted independent from the respiratory motion in 80% of the cases. Differences of 1%, 2% and up to 14% resulted comparing a moderate iterative strength and FBP algorithm, 1 and 3 mm of slice thickness and different reconstruction kernel. Healthy subjects were significantly different from COVID-19 patients for all the metrics calculated. Graphical representation of the local biomarker provides spatial and quantitative information in a single 2D picture. Unlike other metrics based on fixed histogram thresholds, this model is able to consider the inter-and intra-subject variability. In addition, it defines a local biomarker to quantify the severity of the disease, independently of the observer.

Bayesian equation selection on sparse data for discovery of stochastic dynamical systems. (arXiv:2101.04437v1 [stat.ME])

Often the underlying system of differential equations driving a stochastic dynamical system is assumed to be known, with inference conditioned on this assumption. We present a Bayesian framework for discovering this system of differential equations under assumptions that align with real-life scenarios, including the availability of relatively sparse data. Further, we discuss computational strategies that are critical in teasing out the important details about the dynamical system and algorithmic innovations to solve for acute parameter interdependence in the absence of rich data. This gives a complete Bayesian pathway for model identification via a variable selection paradigm and parameter estimation of the corresponding model using only the observed data. We present detailed computations and analysis of the Lorenz-96, Lorenz-63, and the Orstein-Uhlenbeck system using the Bayesian framework we propose.

Implementing Adaptive Quadrature for Bayesian Inference: the aghq Package. (arXiv:2101.04468v1 [stat.CO])

I introduce the aghq package for implementing Bayesian inference using adaptive Gauss-Hermite quadrature. I describe the method and software, and illustrate its use in several challenging low- and high-dimensional examples. Specifically, I show how the aghq package can be used as a basis for implementing more complicated inference methods with a focus on aproximate Bayesian inference for Extended Latent Gaussian Models, with two difficult applications in non-Gaussian geostatistical modelling. I also show how the package can be used to make fully Bayesian inferences in models currently fit using frequentist inference by leveraging code from other packages, with an application to a zero-inflated, overdispersed Poisson regression fit using the glmmTMB package.

Bayesian inference in high-dimensional models. (arXiv:2101.04491v1 [math.ST])

Models with dimension more than the available sample size are now commonly used in various applications. A sensible inference is possible using a lower-dimensional structure. In regression problems with a large number of predictors, the model is often assumed to be sparse, with only a few predictors active. Interdependence between a large number of variables is succinctly described by a graphical model, where variables are represented by nodes on a graph and an edge between two nodes is used to indicate their conditional dependence given other variables. Many procedures for making inferences in the high-dimensional setting, typically using penalty functions to induce sparsity in the solution obtained by minimizing a loss function, were developed. Bayesian methods have been proposed for such problems more recently, where the prior takes care of the sparsity structure. These methods have the natural ability to also automatically quantify the uncertainty of the inference through the posterior distribution. Theoretical studies of Bayesian procedures in high-dimension have been carried out recently. Questions that arise are, whether the posterior distribution contracts near the true value of the parameter at the minimax optimal rate, whether the correct lower-dimensional structure is discovered with high posterior probability, and whether a credible region has adequate frequentist coverage. In this paper, we review these properties of Bayesian and related methods for several high-dimensional models such as many normal means problem, linear regression, generalized linear models, Gaussian and non-Gaussian graphical models. Effective computational approaches are also discussed.

Data augmentation and feature selection for automatic model recommendation in computational physics. (arXiv:2101.04530v1 [stat.ML])

Classification algorithms have recently found applications in computational physics for the selection of numerical methods or models adapted to the environment and the state of the physical system. For such classification tasks, labeled training data come from numerical simulations and generally correspond to physical fields discretized on a mesh. Three challenging difficulties arise: the lack of training data, their high dimensionality, and the non-applicability of common data augmentation techniques to physics data. This article introduces two algorithms to address these issues, one for dimensionality reduction via feature selection, and one for data augmentation. These algorithms are combined with a wide variety of classifiers for their evaluation. When combined with a stacking ensemble made of six multilayer perceptrons and a ridge logistic regression, they enable reaching an accuracy of 90% on our classification problem for nonlinear structural mechanics.

Perturbations of copulas and Mixing properties. (arXiv:2101.04573v1 [math.PR])

This paper explores the impact of perturbations of copulas on the dependence properties of the Markov chains they generate. We consider Markov chains generated by perturbed copulas. Results are provided for the mixing coefficients $\beta_n$, $\psi_n$ and $\phi_n$. Several results are provided on mixing for the considered perturbations. New copula functions are provided in connection with perturbations of variables that induce other types of perturbation of copulas not considered in the literature.

Sharp detection boundaries on testing dense subhypergraph. (arXiv:2101.04584v1 [math.ST])

We study the problem of testing the existence of a dense subhypergraph. The null hypothesis is an Erdos-Renyi uniform random hypergraph and the alternative hypothesis is a uniform random hypergraph that contains a dense subhypergraph. We establish sharp detection boundaries in both scenarios: (1) the edge probabilities are known; (2) the edge probabilities are unknown. In both scenarios, sharp detectable boundaries are characterized by the appropriate model parameters. Asymptotically powerful tests are provided when the model parameters fall in the detectable regions. Our results indicate that the detectable regions for general hypergraph models are dramatically different from their graph counterparts.

Hybrid Random Networks Mixing Preferential Attachment with Uniform Attachment Mechanisms. (arXiv:2101.04611v1 [cs.SI])

Motivated by the complexity of network data, we propose a directed hybrid random network that mixes preferential attachment (PA) rules with uniform attachment (UA) rules. When a new edge is created, with probability $p\in [0,1]$, it follows the PA rule. Otherwise, this new edge is added between two uniformly chosen nodes. Such mixture makes the in- and out-degrees of a fixed node grow at a slower rate, compared to the pure PA case, thus leading to lighter distributional tails. Useful inference methods for the proposed hybrid model are then provided and applied to both synthetic and real datasets. We see that with extra flexibility given by the parameter $p$, the hybrid random network provides a better fit to real-world scenarios, where lighter tails from in- and out-degrees are observed.

Moving sum data segmentation for stochastics processes based on invariance. (arXiv:2101.04651v1 [stat.ME])

The segmentation of data into stationary stretches also known as multiple change point problem is important for many applications in time series analysis as well as signal processing. Based on strong invariance principles, we analyse data segmentation methodology using moving sum (MOSUM) statistics for a class of regime-switching multivariate processes where each switch results in a change in the drift. In particular, this framework includes the data segmentation of multivariate partial sum, integrated diffusion and renewal processes even if the distance between change points is sublinear. We study the asymptotic behaviour of the corresponding change point estimators, show consistency and derive the corresponding localisation rates which are minimax optimal in a variety of situations including an unbounded number of changes in Wiener processes with drift. Furthermore, we derive the limit distribution of the change point estimators for local changes - a result that can in principle be used to derive confidence intervals for the change points.

Benchmarking Simulation-Based Inference. (arXiv:2101.04653v1 [stat.ML])

Recent advances in probabilistic modelling have led to a large number of simulation-based inference algorithms which do not require numerical evaluation of likelihoods. However, a public benchmark with appropriate performance metrics for such 'likelihood-free' algorithms has been lacking. This has made it difficult to compare algorithms and identify their strengths and weaknesses. We set out to fill this gap: We provide a benchmark with inference tasks and suitable performance metrics, with an initial selection of algorithms including recent approaches employing neural networks and classical Approximate Bayesian Computation methods. We found that the choice of performance metric is critical, that even state-of-the-art algorithms have substantial room for improvement, and that sequential estimation improves sample efficiency. Neural network-based approaches generally exhibit better performance, but there is no uniformly best algorithm. We provide practical advice and highlight the potential of the benchmark to diagnose problems and improve algorithms. The results can be explored interactively on a companion website. All code is open source, making it possible to contribute further benchmark tasks and inference algorithms.

A new method for constructing continuous distributions on the unit interval. (arXiv:2101.04661v1 [math.ST])

A novel approach towards construction of absolutely continuous distributions over the unit interval is proposed. Considering two absolutely continuous random variables with positive support, this method conditions on their convolution to generate a new random variable in the unit interval. This approach is demonstrated using some popular choices of the positive random variables such as the exponential, Lindley, gamma. Some existing distributions like the uniform and the beta are formulated with this method. Several new structures of density functions having potential for future application in real life problems are also provided. One of the new distributions having one parameter is considered for parameter estimation and real life modelling application and shown to provide better fit than the popular one parameter Topp-Leone model.

A note on a confidence bound of Kuzborskij and Szepesv\'ari. (arXiv:2101.04671v1 [math.PR])

In an interesting recent work, Kuzborskij and Szepesv\'ari derived a confidence bound for functions of independent random variables, which is based on an inequality that relates concentration to squared perturbations of the chosen function. Kuzborskij and Szepesv\'ari also established the PAC-Bayes-ification of their confidence bound. Two important aspects of their work are that the random variables could be of unbounded range, and not necessarily of an identical distribution. The purpose of this note is to advertise/discuss these interesting results, with streamlined proofs. This expository note is written for persons who, metaphorically speaking, enjoy the "featured movie" but prefer to skip the preview sequence.

SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities. (arXiv:1807.06756v3 [cs.LG] UPDATED)

The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be tackled, as manifested by the many vulnerabilities reported on a daily basis. This calls for machine learning methods for vulnerability detection. Deep learning is attractive for this purpose because it alleviates the requirement to manually define features. Despite the tremendous success of deep learning in other application domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities in C/C++ programs with source code. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate syntax and semantic information pertinent to vulnerabilities. Our experiments with 4 software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been "silently" patched by the vendors when releasing newer versions of the pertinent software products.

High-Temperature Structure Detection in Ferromagnets. (arXiv:1809.08204v4 [math.ST] UPDATED)

This paper studies structure detection problems in high temperature ferromagnetic (positive interaction only) Ising models. The goal is to distinguish whether the underlying graph is empty, i.e., the model consists of independent Rademacher variables, versus the alternative that the underlying graph contains a subgraph of a certain structure. We give matching upper and lower minimax bounds under which testing this problem is possible/impossible respectively. Our results reveal that a key quantity called graph arboricity drives the testability of the problem. On the computational front, under a conjecture of the computational hardness of sparse principal component analysis, we prove that, unless the signal is strong enough, there are no polynomial time tests which are capable of testing this problem. In order to prove this result we exhibit a way to give sharp inequalities for the even moments of sums of i.i.d. Rademacher random variables which may be of independent interest.

Estimating Robot Strengths with Application to Selection of Alliance Members in FIRST Robotics Competitions. (arXiv:1810.05763v4 [stat.AP] UPDATED)

Since the inception of the FIRST Robotics Competition (FRC) and its special playoff system, robotics teams have longed to appropriately quantify the strengths of their designed robots. The FRC includes a playground draft-like phase (alliance selection), arguably the most game-changing part of the competition, in which the top-8 robotics teams in a tournament based on the FRC's ranking system assess potential alliance members for the opportunity of partnering in a playoff stage. In such a three-versus-three competition, several measures and models have been used to characterize actual or relative robot strengths. However, existing models are found to have poor predictive performance due to their imprecise estimates of robot strengths caused by a small ratio of the number of observations to the number of robots. A more general regression model with latent clusters of robot strengths is, thus, proposed to enhance their predictive capacities. Two effective estimation procedures are further developed to simultaneously estimate the number of clusters, clusters of robots, and robot strengths. Meanwhile, some measures are used to assess the predictive ability of competing models, the agreement between published FRC measures of strength and model-based robot strengths of all, playoff, and FRC top-8 robots, and the agreement between FRC top-8 robots and model-based top robots. Moreover, the stability of estimated robot strengths and accuracies is investigated to determine whether the scheduled matches are excessive or insufficient. In the analysis of qualification data from the 2018 FRC Houston and Detroit championships, the predictive ability of our model is also shown to be significantly better than those of existing models. Teams who adopt the new model can now appropriately rank their preferences for playoff alliance partners with greater predictive capability than before.

Adaptive test of independence based on HSIC measures. (arXiv:1902.06441v5 [math.ST] UPDATED)

Dependence measures based on reproducing kernel Hilbert spaces, also known as Hilbert-Schmidt Independence Criterion and denoted HSIC, are widely used to statistically decide whether or not two random vectors are dependent. Recently, non-parametric HSIC-based statistical tests of independence have been performed. However, these tests lead to the question of the choice of the kernels associated to the HSIC. In particular, there is as yet no method to objectively select specific kernels with theoretical guarantees in terms of first and second kind errors. One of the main contributions of this work is to develop a new HSIC-based aggregated procedure which avoids such a kernel choice, and to provide theoretical guarantees for this procedure. To achieve this, we first introduce non-asymptotic single tests based on Gaussian kernels with a given bandwidth, which are of prescribed level $\alpha \in (0,1)$. From a theoretical point of view, we upper-bound their uniform separation rate of testing over Sobolev and Nikol'skii balls. Then, we aggregate several single tests, and obtain similar upper-bounds for the uniform separation rate of the aggregated procedure over the same regularity spaces. Another main contribution is that we provide a lower-bound for the non-asymptotic minimax separation rate of testing over Sobolev balls, and deduce that the aggregated procedure is adaptive in the minimax sense over such regularity spaces. Finally, from a practical point of view, we perform numerical studies in order to assess the efficiency of our aggregated procedure and compare it to existing independence tests in the literature.

Channels, Remote Estimation and Queueing Systems With A Utilization-Dependent Component: A Unifying Survey Of Recent Results. (arXiv:1905.04362v2 [math.OC] UPDATED)

In this article, we survey the main models, techniques, concepts, and results centered on the design and performance evaluation of engineered systems that rely on a utilization-dependent component (UDC) whose operation may depend on its usage history or assigned workload. Specifically, we report on research themes concentrating on the characterization of the capacity of channels and the design with performance guarantees of remote estimation and queueing systems. Causes for the dependency of a UDC on past utilization include the use of replenishable energy sources to power the transmission of information among the sub-components of a networked system, and the assistance of a human operator for servicing a queue. Our analysis unveils the similarity of the UDC models typically adopted in each of the research themes, and it reveals the differences in the objectives and technical approaches employed. We also identify new challenges and future research directions inspired by the cross-pollination among the central concepts, techniques, and problem formulations of the research themes discussed.

Linear Aggregation in Tree-based Estimators. (arXiv:1906.06463v3 [stat.ME] UPDATED)

Regression trees and their ensemble methods are popular methods for non-parametric regression: they combine strong predictive performance with interpretable estimators. In order to improve their utility for locally smooth response surfaces, we study regression trees and random forests with linear aggregation functions. We introduce a new algorithm that finds the best axis-aligned split to fit linear aggregation functions on the corresponding nodes, and we offer a quasilinear time implementation. We apply the algorithm to several simulated and real-world data sets. We showcase its favorable performance in an extensive simulation study, and demonstrate its improved interpretability using a large get-out-the-vote randomized controlled trial. We also provide a software package that implements several tree-based estimators with linear aggregation functions and includes tools for inference.

Robust Design and Analysis of Clinical Trials With Non-proportional Hazards: A Straw Man Guidance from a Cross-pharma Working Group. (arXiv:1908.07112v4 [stat.AP] UPDATED)

Loss of power and clear description of treatment differences are key issues in designing and analyzing a clinical trial where non-proportional hazard is a possibility. A log-rank test may be very inefficient and interpretation of the hazard ratio estimated using Cox regression is potentially problematic. In this case, the current ICH E9 (R1) addendum would suggest designing a trial with a clinically relevant estimand, e.g., expected life gain. This approach considers appropriate analysis methods for supporting the chosen estimand. However, such an approach is case specific and may suffer lack of power for important choices of the underlying alternate hypothesis distribution. On the other hand, there may be a desire to have robust power under different deviations from proportional hazards. Also, we would contend that no single number adequately describes treatment effect under non-proportional hazards scenarios. The cross-pharma working group has proposed a combination test to provide robust power under a variety of alternative hypotheses. These can be specified for primary analysis at the design stage and methods appropriately accounting for combination test correlations are efficient for a variety of scenarios. We have provided design and analysis considerations based on a combination test under different non-proportional hazard types and present a straw man proposal for practitioners. The proposals are illustrated with real life example and simulation.

On the Estimation of Network Complexity: Dimension of Graphons. (arXiv:1909.02900v2 [stat.ML] UPDATED)

Network complexity has been studied for over half a century and has found a wide range of applications. Many methods have been developed to characterize and estimate the complexity of networks. However, there has been little research with statistical guarantees. In this paper, we develop a statistical theory of graph complexity in a general model of random graphs, the so-called graphon model.

Given a graphon, we endow the latent space of the nodes with the neighborhood distance that measures the propensity of two nodes to be connected with similar nodes. Our complexity index is then based on the covering number and the Minkowski dimension of (a purified version of) this metric space. Although the latent space is not identifiable, these indices turn out to be identifiable. This notion of complexity has simple interpretations on popular examples of random graphs: it matches the number of communities in stochastic block models; the dimension of the Euclidean space in random geometric graphs; the regularity of the link function in H\"older graphon models.

From a single observation of the graph, we construct an estimator of the neighborhood-distance and show universal non-asymptotic bounds for its risk, matching minimax lower bounds. Based on this estimated distance, we compute the corresponding covering number and Minkowski dimension and we provide optimal non-asymptotic error bounds for these two plug-in estimators.

Minimax Nonparametric Two-sample Test under Smoothing. (arXiv:1911.02171v4 [stat.ME] UPDATED)

We consider the problem of comparing probability densities between two groups. A new probabilistic tensor product smoothing spline framework is developed to model the joint density of two variables. Under such a framework, the probability density comparison is equivalent to testing the presence/absence of interactions. We propose a penalized likelihood ratio test for such interaction testing and show that the test statistic is asymptotically chi-square distributed under the null hypothesis. Furthermore, we derive a sharp minimax testing rate based on the Bernstein width for nonparametric two-sample tests and show that our proposed test statistics is minimax optimal. In addition, a data-adaptive tuning criterion is developed to choose the penalty parameter. Simulations and real applications demonstrate that the proposed test outperforms the conventional approaches under various scenarios.

Advanced analysis of temporal data using Fisher-Shannon information: theoretical development and application in geosciences. (arXiv:1912.02452v2 [stat.ME] UPDATED)

Complex non-linear time series are ubiquitous in geosciences. Quantifying complexity and non-stationarity of these data is a challenging task, and advanced complexity-based exploratory tool are required for understanding and visualizing such data. This paper discusses the Fisher-Shannon method, from which one can obtain a complexity measure and detect non-stationarity, as an efficient data exploration tool. The state-of-the-art studies related to the Fisher-Shannon measures are collected, and new analytical formulas for positive unimodal skewed distributions are proposed. Case studies on both synthetic and real data illustrate the usefulness of the Fisher-Shannon method, which can find application in different domains including time series discrimination and generation of times series features for clustering, modeling and forecasting. The paper is accompanied with Python and R libraries for the non-parametric estimation of the proposed measures.

Predicting Attributes of Nodes Using Network Structure. (arXiv:1912.12264v3 [cs.LG] UPDATED)

In many graphs such as social networks, nodes have associated attributes representing their behavior. Predicting node attributes in such graphs is an important problem with applications in many domains like recommendation systems, privacy preservation, and targeted advertisement. Attributes values can be predicted by analyzing patterns and correlations among attributes and employing classification/regression algorithms. However, these approaches do not utilize readily available network topology information. In this regard, interconnections between different attributes of nodes can be exploited to improve the prediction accuracy. In this paper, we propose an approach to represent a node by a feature map with respect to an attribute $a_i$ (which is used as input for machine learning algorithms) using all attributes of neighbors to predict attributes values for $a_i$. We perform extensive experimentation on ten real-world datasets and show that the proposed feature map significantly improves the prediction accuracy as compared to baseline approaches on these datasets.

Training Normalizing Flows with the Information Bottleneck for Competitive Generative Classification. (arXiv:2001.06448v5 [cs.LG] UPDATED)

The Information Bottleneck (IB) objective uses information theory to formulate a task-performance versus robustness trade-off. It has been successfully applied in the standard discriminative classification setting. We pose the question whether the IB can also be used to train generative likelihood models such as normalizing flows. Since normalizing flows use invertible network architectures (INNs), they are information-preserving by construction. This seems contradictory to the idea of a bottleneck. In this work, firstly, we develop the theory and methodology of IB-INNs, a class of conditional normalizing flows where INNs are trained using the IB objective: Introducing a small amount of {\em controlled} information loss allows for an asymptotically exact formulation of the IB, while keeping the INN's generative capabilities intact. Secondly, we investigate the properties of these models experimentally, specifically used as generative classifiers. This model class offers advantages such as improved uncertainty quantification and out-of-distribution detection, but traditional generative classifier solutions suffer considerably in classification accuracy. We find the trade-off parameter in the IB controls a mix of generative capabilities and accuracy close to standard classifiers. Empirically, our uncertainty estimates in this mixed regime compare favourably to conventional generative and discriminative classifiers.

AI-GAN: Attack-Inspired Generation of Adversarial Examples. (arXiv:2002.02196v2 [cs.LG] UPDATED)

Deep neural networks (DNNs) are vulnerable to adversarial examples, which are crafted by adding imperceptible perturbations to inputs. Recently different attacks and strategies have been proposed, but how to generate adversarial examples perceptually realistic and more efficiently remains unsolved. This paper proposes a novel framework called Attack-Inspired GAN (AI-GAN), where a generator, a discriminator, and an attacker are trained jointly. Once trained, it can generate adversarial perturbations efficiently given input images and target classes. Through extensive experiments on several popular datasets \eg MNIST and CIFAR-10, AI-GAN achieves high attack success rates and reduces generation time significantly in various settings. Moreover, for the first time, AI-GAN successfully scales to complicated datasets \eg CIFAR-100 with around $90\%$ success rates among all classes.

Estimating Optimal Treatment Rules with an Instrumental Variable: A Partial Identification Learning Approach. (arXiv:2002.02579v5 [stat.ME] UPDATED)

Individualized treatment rules (ITRs) are considered a promising recipe to deliver better policy interventions. One key ingredient in optimal ITR estimation problems is to estimate the average treatment effect conditional on a subject's covariate information, which is often challenging in observational studies due to the universal concern of unmeasured confounding. Instrumental variables (IVs) are widely-used tools to infer the treatment effect when there is unmeasured confounding between the treatment and outcome. In this work, we propose a general framework of approaching the optimal ITR estimation problem when a valid IV is allowed to only partially identify the treatment effect. We introduce a novel notion of optimality called "IV-optimality". A treatment rule is said to be IV-optimal if it minimizes the maximum risk with respect to the putative IV and the set of IV identification assumptions. We derive a bound on the risk of an IV-optimal rule that illuminates when an IV-optimal rule has favorable generalization performance. We propose a classification-based statistical learning method that estimates such an IV-optimal rule, design computationally-efficient algorithms, and prove theoretical guarantees. We contrast our proposed method to the popular outcome weighted learning (OWL) approach via extensive simulations, and apply our method to study which mothers would benefit from traveling to deliver their premature babies at hospitals with high level neonatal intensive care units.

Modelling volatile time series with v-transforms and copulas. (arXiv:2002.10135v5 [q-fin.RM] UPDATED)

An approach to the modelling of volatile time series using a class of uniformity-preserving transforms for uniform random variables is proposed. V-transforms describe the relationship between quantiles of the stationary distribution of the time series and quantiles of the distribution of a predictable volatility proxy variable. They can be represented as copulas and permit the formulation and estimation of models that combine arbitrary marginal distributions with copula processes for the dynamics of the volatility proxy. The idea is illustrated using a Gaussian ARMA copula process and the resulting model is shown to replicate many of the stylized facts of financial return series and to facilitate the calculation of marginal and conditional characteristics of the model including quantile measures of risk. Estimation is carried out by adapting the exact maximum likelihood approach to the estimation of ARMA processes and the model is shown to be competitive with standard GARCH in an empirical application to Bitcoin return data.

Deep Survival Machines: Fully Parametric Survival Regression and Representation Learning for Censored Data with Competing Risks. (arXiv:2003.01176v2 [cs.LG] UPDATED)

We describe a new approach to estimating relative risks in time-to-event prediction problems with censored data in a fully parametric manner. Our approach does not require making strong assumptions of constant proportional hazard of the underlying survival distribution, as required by the Cox-proportional hazard model. By jointly learning deep nonlinear representations of the input covariates, we demonstrate the benefits of our approach when used to estimate survival risks through extensive experimentation on multiple real world datasets with different levels of censoring. We further demonstrate advantages of our model in the competing risks scenario. To the best of our knowledge, this is the first work involving fully parametric estimation of survival times with competing risks in the presence of censoring.

TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. (arXiv:2003.04696v3 [eess.IV] UPDATED)

Processing of medical images such as MRI or CT presents unique challenges compared to RGB images typically used in computer vision. These include a lack of labels for large datasets, high computational costs, and metadata to describe the physical properties of voxels. Data augmentation is used to artificially increase the size of the training datasets. Training with image patches decreases the need for computational power. Spatial metadata needs to be carefully taken into account in order to ensure a correct alignment of volumes.

We present TorchIO, an open-source Python library to enable efficient loading, preprocessing, augmentation and patch-based sampling of medical images for deep learning. TorchIO follows the style of PyTorch and integrates standard medical image processing libraries to efficiently process images during training of neural networks. TorchIO transforms can be composed, reproduced, traced and extended. We provide multiple generic preprocessing and augmentation operations as well as simulation of MRI-specific artifacts.

Source code, comprehensive tutorials and extensive documentation for TorchIO can be found at https://github.com/fepegar/torchio. The package can be installed from the Python Package Index running 'pip install torchio'. It includes a command-line interface which allows users to apply transforms to image files without using Python. Additionally, we provide a graphical interface within a TorchIO extension in 3D Slicer to visualize the effects of transforms.

TorchIO was developed to help researchers standardize medical image processing pipelines and allow them to focus on the deep learning experiments. It encourages open science, as it supports reproducibility and is version controlled so that the software can be cited precisely. Due to its modularity, the library is compatible with other frameworks for deep learning with medical images.

Towards Automatic Bayesian Optimization: A first step involving acquisition functions. (arXiv:2003.09643v2 [cs.AI] UPDATED)

Bayesian Optimization is the state of the art technique for the optimization of black boxes, i.e., functions where we do not have access to their analytical expression nor its gradients, they are expensive to evaluate and its evaluation is noisy. The most popular application of bayesian optimization is the automatic hyperparameter tuning of machine learning algorithms, where we obtain the best configuration of machine learning algorithms by optimizing the estimation of the generalization error of these algorithms. Despite being applied with success, bayesian optimization methodologies also have hyperparameters that need to be configured such as the probabilistic surrogate model or the acquisition function used. A bad decision over the configuration of these hyperparameters implies obtaining bad quality results. Typically, these hyperparameters are tuned by making assumptions of the objective function that we want to evaluate but there are scenarios where we do not have any prior information about the objective function. In this paper, we propose a first attempt over automatic bayesian optimization by exploring several heuristics that automatically tune the acquisition function of bayesian optimization. We illustrate the effectiveness of these heurisitcs in a set of benchmark problems and a hyperparameter tuning problem of a machine learning algorithm.

Likelihood landscape and maximum likelihood estimation for the discrete orbit recovery model. (arXiv:2004.00041v3 [math.ST] UPDATED)

We study the non-convex optimization landscape for maximum likelihood estimation in the discrete orbit recovery model with Gaussian noise. This model is motivated by applications in molecular microscopy and image processing, where each measurement of an unknown object is subject to an independent random rotation from a rotational group. Equivalently, it is a Gaussian mixture model where the mixture centers belong to a group orbit.

We show that fundamental properties of the likelihood landscape depend on the signal-to-noise ratio and the group structure. At low noise, this landscape is "benign" for any discrete group, possessing no spurious local optima and only strict saddle points. At high noise, this landscape may develop spurious local optima, depending on the specific group. We discuss several positive and negative examples, and provide a general condition that ensures a globally benign landscape. For cyclic permutations of coordinates on $\mathbb{R}^d$ (multi-reference alignment), there may be spurious local optima when $d \geq 6$, and we establish a correspondence between these local optima and those of a surrogate function of the phase variables in the Fourier domain.

We show that the Fisher information matrix transitions from resembling that of a single Gaussian in low noise to having a graded eigenvalue structure in high noise, which is determined by the graded algebra of invariant polynomials under the group action. In a local neighborhood of the true object, the likelihood landscape is strongly convex in a reparametrized system of variables given by a transcendence basis of this polynomial algebra. We discuss implications for optimization algorithms, including slow convergence of expectation-maximization, and possible advantages of momentum-based acceleration and variable reparametrization for first- and second-order descent methods.

Bayesian ODE Solvers: The Maximum A Posteriori Estimate. (arXiv:2004.00623v2 [math.NA] UPDATED)

It has recently been established that the numerical solution of ordinary differential equations can be posed as a nonlinear Bayesian inference problem, which can be approximately solved via Gaussian filtering and smoothing, whenever a Gauss--Markov prior is used. In this paper the class of $\nu$ times differentiable linear time invariant Gauss--Markov priors is considered. A taxonomy of Gaussian estimators is established, with the maximum a posteriori estimate at the top of the hierarchy, which can be computed with the iterated extended Kalman smoother. The remaining three classes are termed explicit, semi-implicit, and implicit, which are in similarity with the classical notions corresponding to conditions on the vector field, under which the filter update produces a local maximum a posteriori estimate. The maximum a posteriori estimate corresponds to an optimal interpolant in the reproducing Hilbert space associated with the prior, which in the present case is equivalent to a Sobolev space of smoothness $\nu+1$. Consequently, using methods from scattered data approximation and nonlinear analysis in Sobolev spaces, it is shown that the maximum a posteriori estimate converges to the true solution at a polynomial rate in the fill-distance (maximum step size) subject to mild conditions on the vector field. The methodology developed provides a novel and more natural approach to study the convergence of these estimators than classical methods of convergence analysis. The methods and theoretical results are demonstrated in numerical examples.

Time Series to Images: Monitoring the Condition of Industrial Assets with Deep Learning Image Processing Algorithms. (arXiv:2005.07031v3 [cs.LG] UPDATED)

The ability to detect anomalies in time series is considered highly valuable in numerous application domains. The sequential nature of time series objects is responsible for an additional feature complexity, ultimately requiring specialized approaches in order to solve the task. Essential characteristics of time series, situated outside the time domain, are often difficult to capture with state-of-the-art anomaly detection methods when no transformations have been applied to the time series. Inspired by the success of deep learning methods in computer vision, several studies have proposed transforming time series into image-like representations, used as inputs for deep learning models, and have led to very promising results in classification tasks.

In this paper, we first review the signal to image encoding approaches found in the literature. Second, we propose modifications to some of their original formulations to make them more robust to the variability in large datasets. Third, we compare them on the basis of a common unsupervised task to demonstrate how the choice of the encoding can impact the results when used in the same deep learning architecture. We thus provide a comparison between six encoding algorithms with and without the proposed modifications. The selected encoding methods are Gramian Angular Field, Markov Transition Field, recurrence plot, grey scale encoding, spectrogram, and scalogram. We also compare the results achieved with the raw signal used as input for another deep learning model. We demonstrate that some encodings have a competitive advantage and might be worth considering within a deep learning framework.

The comparison is performed on a dataset collected and released by Airbus SAS, containing highly complex vibration measurements from real helicopter flight tests. The different encodings provide competitive results for anomaly detection.

Comparing BERT against traditional machine learning text classification. (arXiv:2005.13012v2 [cs.CL] UPDATED)

The BERT model has arisen as a popular state-of-the-art machine learning model in the recent years that is able to cope with multiple NLP tasks such as supervised text classification without human supervision. Its flexibility to cope with any type of corpus delivering great results has make this approach very popular not only in academia but also in the industry. Although, there are lots of different approaches that have been used throughout the years with success. In this work, we first present BERT and include a little review on classical NLP approaches. Then, we empirically test with a suite of experiments dealing different scenarios the behaviour of BERT against the traditional TF-IDF vocabulary fed to machine learning algorithms. Our purpose of this work is to add empirical evidence to support or refuse the use of BERT as a default on NLP tasks. Experiments show the superiority of BERT and its independence of features of the NLP problem such as the language of the text adding empirical evidence to use BERT as a default technique to be used in NLP problems.

A Generalised Linear Model Framework for $\beta$-Variational Autoencoders based on Exponential Dispersion Families. (arXiv:2006.06267v2 [cs.LG] UPDATED)

Although variational autoencoders (VAE) are successfully used to obtain meaningful low-dimensional representations for high-dimensional data, the characterization of critical points of the loss function for general observation models is not fully understood. We introduce a theoretical framework that is based on a connection between $\beta$-VAE and generalized linear models (GLM). The equality between the activation function of a $\beta$-VAE and the inverse of the link function of a GLM enables us to provide a systematic generalization of the loss analysis for $\beta$-VAE based on the assumption that the observation model distribution belongs to an exponential dispersion family (EDF). As a result, we can initialize $\beta$-VAE nets by maximum likelihood estimates (MLE) that enhance the training performance on both synthetic and real world data sets. As a further consequence, we analytically describe the auto-pruning property inherent in the $\beta$-VAE objective and reason for posterior collapse.

The Influence of Shape Constraints on the Thresholding Bandit Problem. (arXiv:2006.10006v2 [cs.LG] UPDATED)

We investigate the stochastic Thresholding Bandit problem (TBP) under several shape constraints. On top of (i) the vanilla, unstructured TBP, we consider the case where (ii) the sequence of arm's means $(\mu_k)_k$ is monotonically increasing MTBP, (iii) the case where $(\mu_k)_k$ is unimodal UTBP and (iv) the case where $(\mu_k)_k$ is concave CTBP. In the TBP problem the aim is to output, at the end of the sequential game, the set of arms whose means are above a given threshold. The regret is the highest gap between a misclassified arm and the threshold. In the fixed budget setting, we provide problem independent minimax rates for the expected regret in all settings, as well as associated algorithms. We prove that the minimax rates for the regret are (i) $\sqrt{\log(K)K/T}$ for TBP, (ii) $\sqrt{\log(K)/T}$ for MTBP, (iii) $\sqrt{K/T}$ for UTBP and (iv) $\sqrt{\log\log K/T}$ for CTBP, where $K$ is the number of arms and $T$ is the budget. These rates demonstrate that the dependence on $K$ of the minimax regret varies significantly depending on the shape constraint. This highlights the fact that the shape constraints modify fundamentally the nature of the TBP.

Spatio-temporal evolution of global surface temperature distributions. (arXiv:2006.12386v3 [stat.AP] UPDATED)

Climate is known for being characterised by strong non-linearity and chaotic behaviour. Nevertheless, few studies in climate science adopt statistical methods specifically designed for non-stationary or non-linear systems. Here we show how the use of statistical methods from Information Theory can describe the non-stationary behaviour of climate fields, unveiling spatial and temporal patterns that may otherwise be difficult to recognize. We study the maximum temperature at two meters above ground using the NCEP CDAS1 daily reanalysis data, with a spatial resolution of 2.5 by 2.5 degree and covering the time period from 1 January 1948 to 30 November 2018. The spatial and temporal evolution of the temperature time series are retrieved using the Fisher Information Measure, which quantifies the information in a signal, and the Shannon Entropy Power, which is a measure of its uncertainty -- or unpredictability. The results describe the temporal behaviour of the analysed variable. Our findings suggest that tropical and temperate zones are now characterized by higher levels of entropy. Finally, Fisher-Shannon Complexity is introduced and applied to study the evolution of the daily maximum surface temperature distributions.

Feature Expansive Reward Learning: Rethinking Human Input. (arXiv:2006.13208v2 [cs.RO] UPDATED)

When a person is not satisfied with how a robot performs a task, they can intervene to correct it. Reward learning methods enable the robot to adapt its reward function online based on such human input, but they rely on handcrafted features. When the correction cannot be explained by these features, recent work in deep Inverse Reinforcement Learning (IRL) suggests that the robot could ask for task demonstrations and recover a reward defined over the raw state space. Our insight is that rather than implicitly learning about the missing feature(s) from demonstrations, the robot should instead ask for data that explicitly teaches it about what it is missing. We introduce a new type of human input in which the person guides the robot from states where the feature being taught is highly expressed to states where it is not. We propose an algorithm for learning the feature from the raw state space and integrating it into the reward function. By focusing the human input on the missing feature, our method decreases sample complexity and improves generalization of the learned reward over the above deep IRL baseline. We show this in experiments with a physical 7DOF robot manipulator, as well as in a user study conducted in a simulated environment.

A Nearest Neighbor Characterization of Lebesgue Points in Metric Measure Spaces. (arXiv:2007.03937v4 [cs.LG] UPDATED)

The property of almost every point being a Lebesgue point has proven to be crucial for the consistency of several classification algorithms based on nearest neighbors. We characterize Lebesgue points in terms of a 1-Nearest Neighbor regression algorithm for pointwise estimation, fleshing out the role played by tie-breaking rules in the corresponding convergence problem. We then give an application of our results, proving the convergence of the risk of a large class of 1-Nearest Neighbor classification algorithms in general metric spaces where almost every point is a Lebesgue point.

Learning to Reweight with Deep Interactions. (arXiv:2007.04649v2 [cs.LG] UPDATED)

Recently, the concept of teaching has been introduced into machine learning, in which a teacher model is used to guide the training of a student model (which will be used in real tasks) through data selection, loss function design, etc. Learning to reweight, which is a specific kind of teaching that reweights training data using a teacher model, receives much attention due to its simplicity and effectiveness. In existing learning to reweight works, the teacher model only utilizes shallow/surface information such as training iteration number and loss/accuracy of the student model from training/validation sets, but ignores the internal states of the student model, which limits the potential of learning to reweight. In this work, we propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model, and the teacher model returns adaptive weights of training samples to enhance the training of the student model. The teacher model is jointly trained with the student model using meta gradients propagated from a validation set. Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.

Reverse Annealing for Nonnegative/Binary Matrix Factorization. (arXiv:2007.05565v2 [cs.LG] UPDATED)

It was recently shown that quantum annealing can be used as an effective, fast subroutine in certain types of matrix factorization algorithms. The quantum annealing algorithm performed best for quick, approximate answers, but performance rapidly plateaued. In this paper, we utilize reverse annealing instead of forward annealing in the quantum annealing subroutine for nonnegative/binary matrix factorization problems. After an initial global search with forward annealing, reverse annealing performs a series of local searches that refine existing solutions. The combination of forward and reverse annealing significantly improves performance compared to forward annealing alone for all but the shortest run times.

GeoStat Representations of Time Series for Fast Classification. (arXiv:2007.06682v3 [cs.LG] UPDATED)

Recent advances in time series classification have largely focused on methods that either employ deep learning or utilize other machine learning models for feature extraction. Though successful, their power often comes at the requirement of computational complexity. In this paper, we introduce GeoStat representations for time series. GeoStat representations are based off of a generalization of recent methods for trajectory classification, and summarize the information of a time series in terms of comprehensive statistics of (possibly windowed) distributions of easy to compute differential geometric quantities, requiring no dynamic time warping. The features used are intuitive and require minimal parameter tuning. We perform an exhaustive evaluation of GeoStat on a number of real datasets, showing that simple KNN and SVM classifiers trained on these representations exhibit surprising performance relative to modern single model methods requiring significant computational power, achieving state of the art results in many cases. In particular, we show that this methodology achieves good performance on a challenging dataset involving the classification of fishing vessels, where our methods achieve good performance relative to the state of the art despite only having access to approximately two percent of the dataset used in training and evaluating this state of the art.

A simple correction for COVID-19 sampling bias. (arXiv:2007.07426v3 [stat.ME] UPDATED)

COVID-19 testing has become a standard approach for estimating prevalence which then assist in public health decision making to contain and mitigate the spread of the disease. The sampling designs used are often biased in that they do not reflect the true underlying populations. For instance, individuals with strong symptoms are more likely to be tested than those with no symptoms. This results in biased estimates of prevalence (too high). Typical post-sampling corrections are not always possible. Here we present a simple bias correction methodology derived and adapted from a correction for publication bias in meta analysis studies. The methodology is general enough to allow a wide variety of customization making it more useful in practice. Implementation is easily done using already collected information. Via a simulation and two real datasets, we show that the bias corrections can provide dramatic reductions in estimation error.

Learning discrete distributions: user vs item-level privacy. (arXiv:2007.13660v3 [cs.LG] UPDATED)

Much of the literature on differential privacy focuses on item-level privacy, where loosely speaking, the goal is to provide privacy per item or training example. However, recently many practical applications such as federated learning require preserving privacy for all items of a single user, which is much harder to achieve. Therefore understanding the theoretical limit of user-level privacy becomes crucial.

We study the fundamental problem of learning discrete distributions over $k$ symbols with user-level differential privacy. If each user has $m$ samples, we show that straightforward applications of Laplace or Gaussian mechanisms require the number of users to be $\mathcal{O}(k/(m\alpha^2) + k/\epsilon\alpha)$ to achieve an $\ell_1$ distance of $\alpha$ between the true and estimated distributions, with the privacy-induced penalty $k/\epsilon\alpha$ independent of the number of samples per user $m$. Moreover, we show that any mechanism that only operates on the final aggregate counts should require a user complexity of the same order. We then propose a mechanism such that the number of users scales as $\tilde{\mathcal{O}}(k/(m\alpha^2) + k/\sqrt{m}\epsilon\alpha)$ and hence the privacy penalty is $\tilde{\Theta}(\sqrt{m})$ times smaller compared to the standard mechanisms in certain settings of interest. We further show that the proposed mechanism is nearly-optimal under certain regimes.

We also propose general techniques for obtaining lower bounds on restricted differentially private estimators and a lower bound on the total variation between binomial distributions, both of which might be of independent interest.

Communication Efficient Distributed Learning with Censored, Quantized, and Generalized Group ADMM. (arXiv:2009.06459v2 [cs.LG] UPDATED)

In this paper, we propose a communication-efficiently decentralized machine learning framework that solves a consensus optimization problem defined over a network of inter-connected workers. The proposed algorithm, Censored and Quantized Generalized GADMM (CQ-GGADMM), leverages the worker grouping and decentralized learning ideas of Group Alternating Direction Method of Multipliers (GADMM), and pushes the frontier in communication efficiency by extending its applicability to generalized network topologies, while incorporating link censoring for negligible updates after quantization. We theoretically prove that CQ-GGADMM achieves the linear convergence rate when the local objective functions are strongly convex under some mild assumptions. Numerical simulations corroborate that CQ-GGADMM exhibits higher communication efficiency in terms of the number of communication rounds and transmit energy consumption without compromising the accuracy and convergence speed, compared to the censored decentralized ADMM, and the worker grouping method of GADMM.

Neuro-symbolic Neurodegenerative Disease Modeling as Probabilistic Programmed Deep Kernels. (arXiv:2009.07738v3 [cs.LG] UPDATED)

We present a probabilistic programmed deep kernel learning approach to personalized, predictive modeling of neurodegenerative diseases. Our analysis considers a spectrum of neural and symbolic machine learning approaches, which we assess for predictive performance and important medical AI properties such as interpretability, uncertainty reasoning, data-efficiency, and leveraging domain knowledge. Our Bayesian approach combines the flexibility of Gaussian processes with the structural power of neural networks to model biomarker progressions, without needing clinical labels for training. We run evaluations on the problem of Alzheimer's disease prediction, yielding results that surpass deep learning in both accuracy and timeliness of predicting neurodegeneration, and with the practical advantages of Bayesian nonparametrics and probabilistic programming.

A Framework of Learning Through Empirical Gain Maximization. (arXiv:2009.14250v2 [cs.LG] UPDATED)

We develop in this paper a framework of empirical gain maximization (EGM) to address the robust regression problem where heavy-tailed noise or outliers may present in the response variable. The idea of EGM is to approximate the density function of the noise distribution instead of approximating the truth function directly as usual. Unlike the classical maximum likelihood estimation that encourages equal importance of all observations and could be problematic in the presence of abnormal observations, EGM schemes can be interpreted from a minimum distance estimation viewpoint and allow the ignorance of those observations. Furthermore, it is shown that several well-known robust nonconvex regression paradigms, such as Tukey regression and truncated least square regression, can be reformulated into this new framework. We then develop a learning theory for EGM, by means of which a unified analysis can be conducted for these well-established but not fully-understood regression approaches. Resulting from the new framework, a novel interpretation of existing bounded nonconvex loss functions can be concluded. Within this new framework, the two seemingly irrelevant terminologies, the well-known Tukey's biweight loss for robust regression and the triweight kernel for nonparametric smoothing, are closely related. More precisely, it is shown that the Tukey's biweight loss can be derived from the triweight kernel. Similarly, other frequently employed bounded nonconvex loss functions in machine learning such as the truncated square loss, the Geman-McClure loss, and the exponential squared loss can also be reformulated from certain smoothing kernels in statistics. In addition, the new framework enables us to devise new bounded nonconvex loss functions for robust learning.

Policy Learning Using Weak Supervision. (arXiv:2010.01748v2 [cs.LG] UPDATED)

Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert's demonstrations in behavioral cloning (BC). These quality supervisions are either infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the weak supervisions to perform policy learning efficiently. To handle this problem, we treat the "weak supervisions" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreements). Our way of leveraging peer agent's information offers us a family of solutions that learn effectively from weak supervisions with theoretical guarantees. Extensive evaluations on tasks including RL with noisy reward, BC with weak demonstrations and standard policy co-training (RL + BC) show that the proposed approach leads to substantial improvements, especially when the complexity or the noise of the learning environments grows.

Machine Learning Force Fields. (arXiv:2010.07067v2 [physics.chem-ph] UPDATED)

In recent years, the use of Machine Learning (ML) in computational chemistry has enabled numerous advances previously out of reach due to the computational complexity of traditional electronic-structure methods. One of the most promising applications is the construction of ML-based force fields (FFs), with the aim to narrow the gap between the accuracy of ab initio methods and the efficiency of classical FFs. The key idea is to learn the statistical relation between chemical structure and potential energy without relying on a preconceived notion of fixed chemical bonds or knowledge about the relevant interactions. Such universal ML approximations are in principle only limited by the quality and quantity of the reference data used to train them. This review gives an overview of applications of ML-FFs and the chemical insights that can be obtained from them. The core concepts underlying ML-FFs are described in detail and a step-by-step guide for constructing and testing them from scratch is given. The text concludes with a discussion of the challenges that remain to be overcome by the next generation of ML-FFs.

On the asymptotic rate of convergence of Stochastic Newton algorithms and their Weighted Averaged versions. (arXiv:2011.09706v2 [math.ST] UPDATED)

The majority of machine learning methods can be regarded as the minimization of an unavailable risk function. To optimize the latter, given samples provided in a streaming fashion, we define a general stochastic Newton algorithm and its weighted average version. In several use cases, both implementations will be shown not to require the inversion of a Hessian estimate at each iteration, but a direct update of the estimate of the inverse Hessian instead will be favored. This generalizes a trick introduced in [2] for the specific case of logistic regression, by directly updating the estimate of the inverse Hessian. Under mild assumptions such as local strong convexity at the optimum, we establish almost sure convergences and rates of convergence of the algorithms, as well as central limit theorems for the constructed parameter estimates. The unified framework considered in this paper covers the case of linear, logistic or softmax regressions to name a few. Numerical experiments on simulated data give the empirical evidence of the pertinence of the proposed methods, which outperform popular competitors particularly in case of bad initializa-tions.

A scoping review of causal methods enabling predictions under hypothetical interventions. (arXiv:2011.09815v2 [stat.ME] UPDATED)

Background and Aims: The methods with which prediction models are usually developed mean that neither the parameters nor the predictions should be interpreted causally. However, when prediction models are used to support decision making, there is often a need for predicting outcomes under hypothetical interventions. We aimed to identify published methods for developing and validating prediction models that enable risk estimation of outcomes under hypothetical interventions, utilizing causal inference: their main methodological approaches, underlying assumptions, targeted estimands, and potential pitfalls and challenges with using the method, and unresolved methodological challenges.

Methods: We systematically reviewed literature published by December 2019, considering papers in the health domain that used causal considerations to enable prediction models to be used for predictions under hypothetical interventions.

Results: We identified 4919 papers through database searches and a further 115 papers through manual searches, of which 13 were selected for inclusion, from both the statistical and the machine learning literature. Most of the identified methods for causal inference from observational data were based on marginal structural models and g-estimation.

Conclusions: There exist two broad methodological approaches for allowing prediction under hypothetical intervention into clinical prediction models: 1) enriching prediction models derived from observational studies with estimated causal effects from clinical trials and meta-analyses; and 2) estimating prediction models and causal effects directly from observational data. These methods require extending to dynamic treatment regimes, and consideration of multiple interventions to operationalise a clinical decision support system. Techniques for validating 'causal prediction models' are still in their infancy.

Towards Generalized Implementation of Wasserstein Distance in GANs. (arXiv:2012.03420v2 [cs.LG] UPDATED)

Wasserstein GANs (WGANs), built upon the Kantorovich-Rubinstein (KR) duality of Wasserstein distance, is one of the most theoretically sound GAN models. However, in practice it does not always outperform other variants of GANs. This is mostly due to the imperfect implementation of the Lipschitz condition required by the KR duality. Extensive work has been done in the community with different implementations of the Lipschitz constraint, which, however, is still hard to satisfy the restriction perfectly in practice. In this paper, we argue that the strong Lipschitz constraint might be unnecessary for optimization. Instead, we take a step back and try to relax the Lipschitz constraint. Theoretically, we first demonstrate a more general dual form of the Wasserstein distance called the Sobolev duality, which relaxes the Lipschitz constraint but still maintains the favorable gradient property of the Wasserstein distance. Moreover, we show that the KR duality is actually a special case of the Sobolev duality. Based on the relaxed duality, we further propose a generalized WGAN training scheme named Sobolev Wasserstein GAN (SWGAN), and empirically demonstrate the improvement of SWGAN over existing methods with extensive experiments.

Molecule Optimization via Fragment-based Generative Models. (arXiv:2012.04231v2 [cs.LG] UPDATED)

In drug discovery, molecule optimization is an important step in order to modify drug candidates into better ones in terms of desired drug properties. With the recent advance of Artificial Intelligence, this traditionally in vitro process has been increasingly facilitated by in silico approaches. We present an innovative in silico approach to computationally optimizing molecules and formulate the problem as to generate optimized molecular graphs via deep generative models. Our generative models follow the key idea of fragment-based drug design, and optimize molecules by modifying their small fragments. Our models learn how to identify the to-be-optimized fragments and how to modify such fragments by learning from the difference of molecules that have good and bad properties. In optimizing a new molecule, our models apply the learned signals to decode optimized fragments at the predicted location of the fragments. We also construct multiple such models into a pipeline such that each of the models in the pipeline is able to optimize one fragment, and thus the entire pipeline is able to modify multiple fragments of molecule if needed. We compare our models with other state-of-the-art methods on benchmark datasets and demonstrate that our methods significantly outperform others with more than 80% property improvement under moderate molecular similarity constraints, and more than 10% property improvement under high molecular similarity constraints.

The Study of Urban Residential's Public Space Activeness using Space-centric Approach. (arXiv:2101.03725v2 [stat.AP] UPDATED)

With the advancement of the Internet of Things (IoT) and communication platform, large scale sensor deployment can be easily implemented in an urban city to collect various information. To date, there are only a handful of research studies about understanding the usage of urban public spaces. Leveraging IoT, various sensors have been deployed in an urban residential area to monitor and study public space utilization patterns. In this paper, we propose a data processing system to generate space-centric insights about the utilization of an urban residential region of multiple points of interest (PoIs) that consists of 190,000m$^2$ real estate. We identify the activeness of each PoI based on the spectral clustering, and then study their corresponding static features, which are composed of transportation, commercial facilities, population density, along with other characteristics. Through the heuristic features inferring, the residential density and commercial facilities are the most significant factors affecting public place utilization.