Get started

A simple example with two variables

We start by importing gen_data, get_tps and latentcor from the latentcor package.

from latentcor import gen_data, get_tps, latentcor

First, we generate a pair of variables of different types, with sample size n=100, to serve as example data. Here the first variable is ternary and the second variable is continuous.

simdata = gen_data(n = 100, tps = ["ter", "con"])
print(simdata['X'][:6, :])
[[ 2.          0.58681373]
 [ 0.          1.56689355]
 [ 1.          1.03330599]
 [ 1.          1.8223853 ]
 [ 1.          0.17617261]
 [ 1.         -0.3987981 ]]
simdata['plotX']

The output of gen_data has 2 elements, accessed by key:

  • simdata['X']: a matrix (\(100\times 2\)) whose first column is the ternary variable and whose second column is the continuous variable.

  • simdata['plotX']: None by default, because showplot = False. Set showplot = True in the call to gen_data to display a plot of the generated data.

Then we use get_tps to guess the data types automatically.

data_types = get_tps(simdata['X'])
print(data_types)
['ter' 'con']
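To give a feel for what guessing a type from the data can look like, here is a minimal sketch that classifies a column by its number of distinct values. The function guess_type, its thresholds, and the "bin" label are assumptions made for illustration; this is not the rule that get_tps implements, and it ignores other variable types the package may support.

```python
def guess_type(values):
    """Guess a variable type from the number of distinct observed values.

    Hypothetical heuristic for illustration only; it is not the
    rule implemented by latentcor's get_tps.
    """
    n_levels = len(set(values))
    if n_levels <= 2:
        return "bin"  # assumed label for a binary variable
    if n_levels == 3:
        return "ter"  # three distinct levels: treat as ternary
    return "con"      # many distinct values: treat as continuous


# The six rows printed earlier, split into their two columns.
ternary_column = [2.0, 0.0, 1.0, 1.0, 1.0, 1.0]
continuous_column = [0.58681373, 1.56689355, 1.03330599,
                     1.8223853, 0.17617261, -0.3987981]

guessed = [guess_type(ternary_column), guess_type(continuous_column)]
print(guessed)
```

A count-based rule like this is easy to fool (a continuous variable observed only a few times would be mislabeled), which is one reason to check the guessed types before passing them on.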

Next, we estimate the latent correlation matrix for these 2 variables using the latentcor function.

estimate = latentcor(simdata['X'], tps = data_types)
print(estimate['R'])
          0         1
0  1.000000  0.550859
1  0.550859  1.000000
print(estimate['Rpointwise'])
         0        1
0  1.00000  0.55141
1  0.55141  1.00000
print(estimate['plot'])
None
print(estimate['K'])
          0         1
0  1.000000  0.306667
1  0.306667  1.000000
print(estimate['zratios'])
[[0.3 nan]
 [0.8 nan]]
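The off-diagonal entry of estimate['K'] is a Kendall \(\tau_{a}\) statistic. As a rough sketch of how such a value can be computed by hand, the function below counts concordant and discordant pairs and divides by the total number of pairs. The name kendall_tau_a is hypothetical and is not part of latentcor; the data are only the six rows printed earlier, so the result will not match the 0.306667 obtained from all 100 observations.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / (n choose 2).

    Tied pairs contribute zero to the numerator but still count
    in the denominator, which is what distinguishes tau-a.
    """
    n = len(x)
    score = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            score += 1   # concordant pair
        elif product < 0:
            score -= 1   # discordant pair
    return score / (n * (n - 1) / 2)


# First six rows of simdata['X'] as printed above.
x = [2.0, 0.0, 1.0, 1.0, 1.0, 1.0]
y = [0.58681373, 1.56689355, 1.03330599, 1.8223853, 0.17617261, -0.3987981]

tau = kendall_tau_a(x, y)
print(tau)
```

The double loop is quadratic in the sample size, which is fine for a toy example but would be slow on large data.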

The output of latentcor, stored here in estimate, has several elements:

  • estimate['R']: the final estimated latent correlation matrix. If use_nearPD = True, this matrix is guaranteed to be strictly positive definite (through the statsmodels.stats.correlation_tools.corr_nearest projection and the parameter nu; see Mathematical framework for estimation).

  • estimate['Rpointwise']: matrix of pointwise estimated correlations. Because each entry is estimated separately, this matrix is not guaranteed to be positive semi-definite.

  • estimate['plot']: None by default, as showplot = False in latentcor. Otherwise, a heatmap of the latent correlation matrix is displayed.

  • estimate['K']: Kendall's \(\tau\) (\(\tau_{a}\)) correlation matrix for these \(2\) variables.

  • estimate['zratios']: a list with the same length as the number of variables. Here the first element is a (\(2\times1\)) vector of cumulative proportions for the ternary variable: its first entry is the proportion of zeros, and its second entry is the proportion of zeros and ones combined (0.3 and 0.8 in the output above). The second element of the list is numpy.nan for the continuous variable.
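The cumulative proportions described for the ternary variable can be reproduced with a few lines of plain Python. This is a minimal sketch: ternary_zratios is a hypothetical helper written for illustration, not a latentcor function, and the toy data are made up so that the proportions come out to round numbers.

```python
def ternary_zratios(values):
    """Return [P(value == 0), P(value <= 1)] for a 0/1/2-coded variable."""
    n = len(values)
    prop_zeros = sum(1 for v in values if v == 0) / n
    prop_zeros_and_ones = sum(1 for v in values if v in (0, 1)) / n
    return [prop_zeros, prop_zeros_and_ones]


# Toy ternary sample: three zeros, five ones, two twos.
toy = [0, 0, 0, 1, 1, 1, 1, 1, 2, 2]

zr = ternary_zratios(toy)
print(zr)  # [0.3, 0.8]
```

Because the second entry is cumulative, it is never smaller than the first, and the proportion of twos is one minus the second entry.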