
# National Institute for Applied Statistics Research Australia Seminar Series

## February 6, 2019 @ 1:30 pm - 2:30 pm

# Speakers

**Olivier Thas**, I-Biostat, Hasselt University, Belgium

Department of Data Analysis and Mathematical Modelling, Ghent University, Belgium

NIASRA, University of Wollongong, Australia

**Leyla Kodalci**, Hasselt University

# Title

A semiparametric model for compositional data with applications to RNASeq and microbiome studies

# Abstract

Compositional observations are multivariate and are characterised by a sum constraint, i.e. the sum of the vector elements equals a constant. For example, in geochemical studies the chemical composition of a soil sample is represented by a vector with the weights of the individual chemical compounds, but the sum of these weights must equal the weight of the soil sample. Hence, only ratios of the weights are informative. This compositional structure is also present in massively parallel sequencing experiments (e.g. RNASeq or microbiome): read counts of targets (e.g. genes or taxa) sum up to the library size, which is often not informative for the research question (e.g. detection of differentially expressed genes or differentially abundant taxa). Many data analysis methods developed for compositional data make use of log ratios of the components of the observation vector. However, in sequencing data many data entries are zero, which causes problems when ratios and logarithms need to be computed. A typical ad hoc solution consists of adding an arbitrary constant to the observations before computing the log ratios.
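The zero problem described above can be illustrated with a minimal sketch. The count matrix and the helper `log_ratio` are hypothetical and only serve to show why a log ratio breaks on a zero count and how the result depends on the arbitrary constant that is added.

```python
import numpy as np

def log_ratio(counts, i, j, pseudocount=0.0):
    """Log ratio of component i to component j for every sample (row).

    `pseudocount` is the arbitrary constant added to all counts; its value
    is an assumption of this sketch, not a recommendation.
    """
    shifted = np.asarray(counts, dtype=float) + pseudocount
    # Suppress NumPy warnings so the non-finite results can be inspected.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(shifted[..., i] / shifted[..., j])

# Hypothetical read counts: 2 samples (rows) by 3 taxa (columns).
counts = np.array([[10, 0, 5],
                   [3, 7, 0]])

raw = log_ratio(counts, 0, 1)            # sample 0 divides by zero
plus_one = log_ratio(counts, 0, 1, 1.0)  # sample 0: log(11 / 1)
plus_half = log_ratio(counts, 0, 1, 0.5) # sample 0: log(10.5 / 0.5) = log(21)

print(raw, plus_one, plus_half)
```

In this toy example the log ratio for the first sample is not finite without a pseudocount, and it changes from log(11) to log(21) depending on whether 1 or 0.5 is added, which is the arbitrariness the abstract refers to.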

In this talk we focus on a two-sample problem, i.e. comparing two groups of samples (assessing differentially expressed genes or differentially abundant taxa). We have developed a semiparametric method in the spirit of the probabilistic index models (Thas et al., 2012). In particular, we consider a semiparametric model for the probability that the outcome of component i is smaller than the outcome of component j. The estimation of this probability only requires information about the ordering of the vector elements corresponding to components i and j, and hence zero observations cause no problems. Testing for differential abundance then reduces to testing that the probabilistic indexes are the same in the two treatment groups. We have constructed the semiparametric efficient estimator of the effect size parameter in the model, and a hypothesis test based on this estimator. In sequencing studies the observation vectors are high-dimensional (hundreds to thousands of components) and hence a multiple testing procedure is needed to control the false discovery rate at its nominal level. Both permutation and asymptotic procedures are studied.
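The idea of comparing orderings rather than ratios can be sketched as follows. This is not the semiparametric efficient estimator presented in the talk: it is a naive empirical estimate of the probabilistic index, paired with a simple permutation test and a Benjamini-Hochberg adjustment as one common way to control the false discovery rate. The function names, the tie handling (ties count as one half) and the toy data are all assumptions made for illustration.

```python
import numpy as np

def prob_index(counts, i, j):
    """Empirical P(Y_i < Y_j) + 0.5 * P(Y_i = Y_j) over samples (rows).

    Only the ordering of the two components is used, so zeros are harmless.
    Counting ties as one half is an assumption of this sketch.
    """
    counts = np.asarray(counts)
    less = counts[:, i] < counts[:, j]
    tie = counts[:, i] == counts[:, j]
    return float(np.mean(less + 0.5 * tie))

def permutation_pvalue(counts, groups, i, j, n_perm=2000, seed=0):
    """Permutation p-value for equal probabilistic indexes in two groups."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    groups = np.asarray(groups)

    def stat(g):
        # Absolute difference between the two group-wise estimates.
        return abs(prob_index(counts[g == 1], i, j)
                   - prob_index(counts[g == 0], i, j))

    observed = stat(groups)
    exceed = sum(stat(rng.permutation(groups)) >= observed
                 for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

# Hypothetical data: 12 samples, 2 components, zeros included.
group0 = np.array([[0, 3], [1, 4], [0, 2], [2, 5], [0, 1], [1, 6]])
group1 = np.array([[4, 0], [5, 1], [3, 0], [6, 2], [2, 0], [7, 1]])
counts = np.vstack([group0, group1])
groups = np.array([0] * 6 + [1] * 6)

pi0 = prob_index(group0, 0, 1)   # component 0 is always the smaller one
pi1 = prob_index(group1, 0, 1)   # component 0 is always the larger one
pval = permutation_pvalue(counts, groups, 0, 1)
print(pi0, pi1, pval)
```

With many component pairs, one p-value per pair would be collected and passed to `benjamini_hochberg` before declaring any pair differentially abundant.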

The method is evaluated in a simulation study and applied to a microbiome case study.

After the seminar, NIASRA will sponsor coffee at The Yard for the audience. All welcome!