Test for the difference of two cluster means after any clustering algorithm, for matrix normal model with arbitrary scale matrices.
test.clusters.MC.RdTest for the difference of two cluster means after any clustering algorithm, for matrix normal model with arbitrary scale matrices.
Usage
test.clusters.MC(
X,
U = NULL,
Sigma = NULL,
Y = NULL,
UY = NULL,
precUY = NULL,
clusters,
cl_fun,
NC = NULL,
cl = NULL,
ndraws = 2000
)Arguments
- X
A \(n \times p\) matrix drawn from a \(n \times p\) matrix normal distribution \(\mathcal{MN}(\)
M,U,Sigma\()\).Xmust have \(n\) rows and \(p\) columns.- U
A \(n \times n\) positive-definite matrix describing the dependence structure between the rows in
X. IfNULL, observations are considered independent andUis set to the \(n \times n\) identity matrix.- Sigma
A \(p \times p\) positive-definite matrix describing the dependence structure between the columns in
X. IfNULL,Sigmais over-estimated (in the sense of the Loewner partial order).- Y
If
SigmaisNULL, an i.i.d. copy ofXallowing its estimation.Ymust have the same number of columns asX.- UY
If
SigmaisNULL, a positive-definite matrix describing the dependence structure between the rows inY. IfNULLand its inverse is not provided, set to the identity matrix by default.- precUY
The inverse matrix of
UY, that can be provided to increase computational efficiency. IfUYis notNULLandprecUYisNULL,precUYis obtained by invertingUY.- clusters
A vector of two integers from 1 to
NCindicating the pair of clusters whose means have to be compared.- cl_fun
A function returning assignments to clusters. The function must take as input the data matrix
Xand the number of clustersNC.- NC
The number of clusters to choose, that will be passed as argument to
cl_fun. Must be set toNULLif not required bycl_fun.- cl
The result of clustering
Xusingcl_fun. It can useful to precompute this quantity before choosingclusters.- ndraws
The number of Monte Carlo iterations.
Value
pvalue - The p-value for the difference of cluster means.
stat - The test statistic.
stdrr - The Monte Carlo standard error.
clusters - The partition of the
nobservations retrieved by the clustering algorithm.
References
[1] L. L. Gao, J. Bien, and D. Witten. Selective inference for hierarchical clustering. Journal of the American Statistical Association, 0(0):1–11, 2022.
Examples
n <- 50
p <- 20
M <- Matrix::Matrix(0, nrow = n , ncol = p) # Mean matrix
Sigma <- stats::toeplitz(seq(1, 0.1, length = p)) # Sigma: dependence between features
U <- matrixNormal::I(n) # U: dependence between observations
X <- matrixNormal::rmatnorm(s = 1, M, U, Sigma)
Y <- matrixNormal::rmatnorm(s = 1, M, U, Sigma) # i.i.d. copy of X
# Using HDBSCAN clustering from dbscan package. This algorithm selects
# automatically the number of clusters NC.
# Additional clustering parameters must be set as default values
# when defining cl_fun.
# install.packages('dbscan')
hdbscan.clustering <- function(X, NC = NULL, min.occupancy = 5){
X.clus <- dbscan::hdbscan(X, minPts = min.occupancy)
return(X.clus$cluster + 1)
}
# We start by clustering the data
clusters_X <- hdbscan.clustering(X)
#> Error in loadNamespace(x): there is no package called ‘dbscan’
# We test for the equality of clusters 3 and 1
test.clusters.MC(X, U = U, Sigma = Sigma, clusters = c(3,1),
cl = clusters_X, cl_fun = hdbscan.clustering, NC = NULL, ndraws = 500)
#> Error: object 'clusters_X' not found