Clustered data #5

@AngelikaGeroldinger

I was trying to use simdata to generate data in which some of the variables are constant within clusters of fixed size. For my purposes the following function was sufficient. Here n_obs is the number of observations, cluster_size is the number of observations per cluster, cor_mat is the correlation matrix, and clustervar is an integer such that 1:clustervar are the indices of the variables that are constant within clusters:

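cmvtnorm <- function(n_obs=100, cluster_size=4, cor_mat, clustervar=0){
  if (n_obs %% cluster_size != 0) {
    stop("n_obs is not divisible by cluster_size")
  }
  # independent standard normal draws, one column per variable
  X <- matrix(rnorm(n_obs*ncol(cor_mat)), nrow=n_obs, ncol=ncol(cor_mat))
  if (clustervar>0) {
    # repeat each of the first n_obs/cluster_size rows cluster_size times, so that
    # columns 1:clustervar are constant within blocks of consecutive rows
    X[, 1:clustervar] <- X[rep(1:(n_obs/cluster_size), each=cluster_size), 1:clustervar]
  }
  # upper triangular factor with t(chol_cor) %*% chol_cor equal to cor_mat
  chol_cor <- chol(cor_mat)
  X <- X %*% chol_cor
  return(X)
}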
  

library(simdata)  # provides cor_from_upper()

cor_mat <- cor_from_upper(5,
                          rbind(c(1,2,0.5), c(1,3,0.5),
                                c(2,4,0.5), c(3,5,-0.3),
                                c(4,5,0.5) ))

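test <- cmvtnorm(n_obs=100000, cluster_size=4, cor_mat=cor_mat, clustervar=3)

# the empirical correlations should be close to the target matrix
cor(test)
cor_mat

# columns should have standard deviation close to 1 and mean close to 0
apply(test, 2, sd)
apply(test, 2, mean)
# the first three columns should repeat in blocks of four rows
head(test, 20)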
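
A note on why the cluster variables have to come first: chol() returns an upper triangular factor, so column j of X %*% chol_cor is a combination of columns 1:j of X only. The first clustervar columns of the result therefore depend only on the cluster-constant inputs. A minimal sketch of a check for this, in base R only: cor_mat is built by hand so that simdata is not required, cmvtnorm() is a compact copy of the function above, and within_cluster_range() is a hypothetical helper that is not part of simdata.

```r
# Build the same 5 x 5 correlation matrix by hand (no simdata needed).
cor_mat <- diag(5)
upper <- rbind(c(1, 2, 0.5), c(1, 3, 0.5),
               c(2, 4, 0.5), c(3, 5, -0.3),
               c(4, 5, 0.5))
for (k in seq_len(nrow(upper))) {
  i <- upper[k, 1]
  j <- upper[k, 2]
  cor_mat[i, j] <- cor_mat[j, i] <- upper[k, 3]
}

# Compact copy of the generator above.
cmvtnorm <- function(n_obs = 100, cluster_size = 4, cor_mat, clustervar = 0) {
  if (n_obs %% cluster_size != 0) stop("n_obs is not divisible by cluster_size")
  X <- matrix(rnorm(n_obs * ncol(cor_mat)), nrow = n_obs, ncol = ncol(cor_mat))
  if (clustervar > 0) {
    X[, 1:clustervar] <- X[rep(1:(n_obs / cluster_size), each = cluster_size), 1:clustervar]
  }
  X %*% chol(cor_mat)
}

# Hypothetical helper: for every column, the largest (max - min) found within
# any cluster, assuming clusters are blocks of cluster_size consecutive rows.
within_cluster_range <- function(X, cluster_size) {
  cluster_id <- rep(seq_len(nrow(X) / cluster_size), each = cluster_size)
  apply(X, 2, function(col) {
    max(tapply(col, cluster_id, function(v) max(v) - min(v)))
  })
}

set.seed(1)
X <- cmvtnorm(n_obs = 20000, cluster_size = 4, cor_mat = cor_mat, clustervar = 3)
ranges <- within_cluster_range(X, cluster_size = 4)
print(ranges)
```

The first three entries of ranges should be zero up to floating point error, and the last two should be clearly positive. Under this construction, choosing a different set of cluster-constant variables would mean reordering the columns of cor_mat so that those variables come first.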

Maybe this could be a nice extension?

Labels

documentation, enhancement