Simulating data following a given covariance structure

There are many ways to simulate multivariate gaussian distribution that follow a given covariance matrix. One of the most popular method is based on the Cholesky decomposition. Here I simply use the mvrnorm() function from the MASS package. I listed both codes of two approaches.

The next question would be how to generate a positive definite correlation matrix. A positive definite matrix means it’s symmetric and all the eigenvalues are positive. This is a strong requirement for the correlation matrix. When data is not highly correlated (off-diagonal elements close \(0\)), it will not have big problem. However, when data has high correlation, the positive definite is always violated.

I’ve tried couple ways to generate a large full-rank random correlation matrix with some strong correlations present.

Simple way with fixed pattern

One trivial way to obtain a perfectly positive definite matrix with some strong correlations presents by making all off-diagonal elements equal to \(\pm \rho\). That is, generate a mixture of matrices where (1) all the off-diagonal values equal \(\rho\) or (2) they equal \((−1)^{i+j}\rho\) in the \((i,j)\) position. I wouldn’t recommend this because the matrix also has fixed pattern (like Toeplitz). We would like the distribution of off-diagonal elements in each matrix to be reasonably “spread”, with a uniform distribution.

library(gtools)
m=rep(0.6, 190)
m[even(1:190)]=-0.6  # change even number to negative

rho=diag(20)
rho[which(rho==0)]=m
rho[lower.tri(rho)] = t(rho)[lower.tri(rho)]
#show first 6 subset elements in matrix
head(rho[,1:6],6)

##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]  1.0 -0.6  0.6 -0.6  0.6 -0.6
## [2,] -0.6  1.0 -0.6  0.6 -0.6  0.6
## [3,]  0.6 -0.6  1.0 -0.6  0.6 -0.6
## [4,] -0.6  0.6 -0.6  1.0 -0.6  0.6
## [5,]  0.6 -0.6  0.6 -0.6  1.0 -0.6
## [6,] -0.6  0.6 -0.6  0.6 -0.6  1.0

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=rho)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

QR decomposition using Q

Our idea is generating random orthogonal \(Q\) (e.g. by generating random matrix \(A\) and doing its \(QR\) decomposition, or via Gram-Schmidt process) and random diagonal \(D\) with all positive elements, form \(B=QDQ^T\). Obtained matrix \(B\) can be easily normalized to have all ones on the diagonal: \(C=E^{−1/2}BE^{−1/2}\), where \(E=diag(B)\) is the diagonal matrix with the same diagonal as \(B\). However, this way to generate \(B\) result in correlation matrix \(C\) having off-diagonal elements close \(0\).

set.seed(556)
#A = matrix(rnorm(400),nrow=20,ncol=20);
A = matrix(runif(400),nrow=20,ncol=20);
Q=qr.Q(qr(A)) #Reconstruct the Q from QR decomposition
R=qr.R(qr(A)) #Reconstruct the R from QR decomposition
D=diag(runif(20,0,10))
B=Q %*% D %*% t(Q)
C=diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B)))

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

Nonnegative matrix add a positive matrix

Another simple way to achieve the goal is to randomly generate matrix \(W\), form the covariance matrix \(W^TW\) (which will be nonnegative matrix) and add to it a random diagonal matrix \(D\) with positive elements to make \(B=W^TW+D\) full rank. The resulting covariance matrix \(B\) can be normalized to become a correlation matrix \(C\): \(C=E^{-1/2}BE^{-1/2}\), where \(E=diag(B)\) is the diagonal matrix with the same diagonal as \(B\). The off-diagonal elements would show strong correlation when \(W \sim unif(0,1)\). Otherwise, it’s close to \(0\) when \(W \sim unif(-1,1)\).

When \(W \sim unif(-1,1)\)

#W = matrix(rnorm(2000,0,1),nrow=20,ncol=100);
W = matrix(runif(2000,-0.2,1),nrow=100,ncol=20);
B = t(W) %*% W + diag(runif(20,0,10));
C = diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B))); #normalized covariance

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

When \(W \sim unif(0,1)\)

W = matrix(runif(2000,0,1),nrow=100,ncol=20);
B = t(W) %*% W + diag(runif(20,0,10));
C = diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B))); #normalized covariance

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

QR decomposition using R

Generate random positive definite matrix (corvariance) \(B\) in another ways is \[B=A^TA=(QR)^T(QR)=R^TQ^TQR=R^TR\].

where \(R\) is an upper triangular matrix and \(Q\) is an orthogonal matrix. Then normalize covariance matrix \(B\) similarly as above mentioned. Similarly, the off-diagonal elements would show strong correlation when \(R \sim unif(0,1)\). Otherwise, it’s close to \(0\) when \(R \sim unif(-1,1)\).

When \(R \sim Unif(-1,1)\)

R=matrix(runif(400, min=-1, max=1),ncol=20,nrow=20)
#R=matrix(rnorm(400),ncol=20,nrow=20)
R[lower.tri(R)] <- 0
B=t(R)  %*%  R
E <- matrix(0,20,20); diag(E) <- 1/sqrt(diag(B)); head(E[,1:6],6)

##          [,1]     [,2]      [,3]     [,4]     [,5]      [,6]
## [1,] 1.399558 0.000000 0.0000000 0.000000 0.000000 0.0000000
## [2,] 0.000000 1.276806 0.0000000 0.000000 0.000000 0.0000000
## [3,] 0.000000 0.000000 0.7017101 0.000000 0.000000 0.0000000
## [4,] 0.000000 0.000000 0.0000000 1.060301 0.000000 0.0000000
## [5,] 0.000000 0.000000 0.0000000 0.000000 1.198243 0.0000000
## [6,] 0.000000 0.000000 0.0000000 0.000000 0.000000 0.7415409

C=E %*% B %*% E

library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

When \(R \sim Unif(0,1)\)

##          [,1]     [,2]      [,3]     [,4]      [,5]      [,6]
## [1,] 7.005529 0.000000 0.0000000 0.000000 0.0000000 0.0000000
## [2,] 0.000000 1.825062 0.0000000 0.000000 0.0000000 0.0000000
## [3,] 0.000000 0.000000 0.7629245 0.000000 0.0000000 0.0000000
## [4,] 0.000000 0.000000 0.0000000 1.002518 0.0000000 0.0000000
## [5,] 0.000000 0.000000 0.0000000 0.000000 0.7932434 0.0000000
## [6,] 0.000000 0.000000 0.0000000 0.000000 0.0000000 0.6364622

Conjugate by an orthogonal matrix

Let’s conjugate by an orthogonal matrix \(Q\) to get a positive definite matrix.

\[B=Q^TR^TRQ\]

\[B=QRR^TQ^T\]

Mixture of different uniform distribution

Since we want some correlation matrix with not only all high correlated or low correlated elements. We would take the mixture of uniform distributions which represent different type of correlation. In the following example, the distribution would be \(Unif(-1,0)\), \(Unif(10,20)\), \(Unif(0,1)\) and \(Unif(0,0.3)\) for 5 columns each, respectively. The \(Unif(10,20)\) represents high correlation. \(Unif(-1,0)\) and \(Unif(0,1)\) represent positive and negative median correlation. And \(Unif(0,0.3)\) denotes the low correlation.

Finally, we would generate a predictor matrix \(X\) with \(p=20\) variables and \(n=5000\) observations. And we could choose 4 different type of correlation matrix,mixture correlated, highly correlated, median correlated and low correlated.

W = matrix(runif(2000,0,0.3),nrow=100,ncol=20);
W[,1:5]=matrix(runif(500,-1,0),nrow=100,ncol=5);
W[,6:10]=matrix(runif(500,10,20),nrow=100,ncol=5);
W[,11:15]=matrix(runif(500,0,1),nrow=100,ncol=5);

B = t(W) %*% W + diag(runif(20,0,10));
C = diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B))); #normalized covariance

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

Highly correlated matrix

Mixture of \(Unif(0,1)\) and \(Unif(-1,0)\) for 10 columns each.

W = matrix(runif(2000,0,1),nrow=100,ncol=20);
W[,1:10]=matrix(runif(500,-1,0),nrow=100,ncol=5);
B = t(W) %*% W + diag(runif(20,0,10));
C = diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B))); #normalized covariance

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

Median correlated matrix

Mixture of \(Unif(0,0.5)\) and \(Unif(-0.5,0)\) for 10 columns each.

W = matrix(runif(2000,0,0.5),nrow=100,ncol=20);
W[,1:10]=matrix(runif(500,-0.5,0),nrow=100,ncol=5);
B = t(W) %*% W + diag(runif(20,0,10));
C = diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B))); #normalized covariance

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))

Low correlated matrix

Only use \(Unif(-1,1)\) is enough.

W = matrix(runif(2000,-1,1),nrow=100,ncol=20);
B = t(W) %*% W + diag(runif(20,0,10));
C = diag(1/sqrt(diag(B))) %*% B %*% diag(1/sqrt(diag(B))); #normalized covariance

k=20
library(MASS)
set.seed(246)
x <- mvrnorm(n=5000, mu=rep(0,k), Sigma=C)
library(corrplot)
corrplot.mixed(cor(x),lower = "circle", upper = "number", tl.cex = 1, mar=c(0.5,0,0.5,0))