Mathematical Framework for latentcor
Main Framework
Latent Gaussian Copula Model for Mixed Data
latentcor utilizes the powerful semi-parametric latent Gaussian copula models to estimate latent correlations between mixed data types (continuous/binary/ternary/truncated or zero-inflated).
Below we review the definitions for each type.
from latentcor import gen_data, latentcor
Definition of continuous model
A random \(X\in\cal{R}^{p}\) satisfies the Gaussian copula (or nonparanormal) model if there exist monotonically increasing \(f=(f_{j})_{j=1}^{p}\) with \(Z_{j}=f_{j}(X_{j})\) satisfying \(Z\sim N_{p}(0, \Sigma)\), \(\sigma_{jj}=1\); we denote \(X\sim NPN(0, \Sigma, f)\) [FLNZ17].
print(gen_data(n = 6, tps = "con")['X'])
[[-0.28200758]
[-1.36307758]
[-0.29729347]
[ 0.65155748]
[ 0.42049419]
[ 0.66931717]]
Definition of binary model
A random \(X\in\cal{R}^{p}\) satisfies the binary latent Gaussian copula model if there exists \(W\sim NPN(0, \Sigma, f)\) such that \(X_{j}=I(W_{j}>c_{j})\), where \(I(\cdot)\) is the indicator function and \(c_{j}\) are constants [FLNZ17].
print(gen_data(n = 6, tps = "bin")['X'])
[[0.]
[1.]
[1.]
[0.]
[0.]
[1.]]
Definition of ternary model
A random \(X\in\cal{R}^{p}\) satisfies the ternary latent Gaussian copula model if there exists \(W\sim NPN(0, \Sigma, f)\) such that \(X_{j}=I(W_{j}>c_{j})+I(W_{j}>c'_{j})\), where \(I(\cdot)\) is the indicator function and \(c_{j}<c'_{j}\) are constants [QBW18].
print(gen_data(n = 6, tps = "ter")['X'])
[[1.]
[2.]
[1.]
[1.]
[0.]
[0.]]
Definition of truncated or zero-inflated model
A random \(X\in\cal{R}^{p}\) satisfies the truncated latent Gaussian copula model if there exists \(W\sim NPN(0, \Sigma, f)\) such that \(X_{j}=I(W_{j}>c_{j})W_{j}\), where \(I(\cdot)\) is the indicator function and \(c_{j}\) are constants [YCG20].
print(gen_data(n = 6, tps = "tru")['X'])
[[1.66626733]
[0. ]
[0. ]
[0.41171889]
[2.15349059]
[0. ]]
Mixed latent Gaussian copula model
The mixed latent Gaussian copula model jointly models \(W=(W_{1}, W_{2}, W_{3}, W_{4})\sim NPN(0, \Sigma, f)\) such that \(X_{1j}=W_{1j}\), \(X_{2j}=I(W_{2j}>c_{2j})\), \(X_{3j}=I(W_{3j}>c_{3j})+I(W_{3j}>c'_{3j})\) and \(X_{4j}=I(W_{4j}>c_{4j})W_{4j}\).
X = gen_data(n = 100, tps = ["con", "bin", "ter", "tru"])['X']
print(X[ :6, : ])
[[-0.12368305 0. 0. 0. ]
[ 0.92606721 0. 0. 0. ]
[ 1.80474359 0. 1. 0. ]
[ 1.41132209 1. 1. 1.4011385 ]
[-1.58790565 0. 0. 0. ]
[ 0.38887018 0. 1. 0.27415979]]
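The four observation rules above can be sketched directly with NumPy. This is only an illustration, not how gen_data is implemented: it assumes identity transformations \(f_{j}\) (so \(W\) itself is Gaussian), and the function name simulate_mixed, the cutoff values and the equicorrelated \(\Sigma\) are all hypothetical choices.

```python
import numpy as np

def simulate_mixed(n, sigma, cutoffs, seed=0):
    """Draw n rows of (continuous, binary, ternary, truncated) data.

    Assumes identity transformations f_j, so the latent W is Gaussian.
    cutoffs = (c2, c3, c3_prime, c4) are illustrative thresholds.
    """
    rng = np.random.default_rng(seed)
    w = rng.multivariate_normal(np.zeros(4), sigma, size=n)
    c2, c3, c3_prime, c4 = cutoffs
    x = np.empty_like(w)
    x[:, 0] = w[:, 0]                                   # continuous: X = W
    x[:, 1] = w[:, 1] > c2                              # binary: I(W > c)
    x[:, 2] = (w[:, 2] > c3).astype(float) + (w[:, 2] > c3_prime)  # ternary
    x[:, 3] = (w[:, 3] > c4) * w[:, 3]                  # truncated: I(W > c) W
    return x

# Latent correlation matrix with unit diagonal and 0.5 off-diagonal (assumed).
sigma = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
X_sim = simulate_mixed(100, sigma, cutoffs=(0.0, -0.5, 0.5, 0.0))
```

Each column of X_sim is a deterministic function of one latent coordinate, which is why the dependence between observed columns is governed entirely by \(\Sigma\).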
Moment-based estimation of latent correlation matrix based on bridge functions
The estimation of latent correlation matrix \(\Sigma\) is achieved via the bridge function \(F\) which is defined such that \(E(\hat{\tau}_{jk})=F(\sigma_{jk})\), where \(\sigma_{jk}\) is the latent correlation between variables \(j\) and \(k\), and \(\hat{\tau}_{jk}\) is the corresponding sample Kendall’s \(\tau\).
Kendall’s correlation
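Given observed \(\mathbf{x}_{j}, \mathbf{x}_{k}\in\cal{R}^{n}\),
\[\hat{\tau}_{jk}=\hat{\tau}(\mathbf{x}_{j}, \mathbf{x}_{k})=\frac{2}{n(n-1)}\sum_{1\le i<i'\le n}sign(x_{ij}-x_{i'j})sign(x_{ik}-x_{i'k}), \qquad (1)\]
where \(n\) is the sample size.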
latentcor calculates pairwise Kendall’s \(\widehat \tau\) as part of the estimation process.
K = latentcor(X, tps = ["con", "bin", "ter", "tru"])['K']
print(K)
0 1 2 3
0 1.000000 0.210505 0.189495 0.228081
1 0.210505 1.000000 0.214141 0.220606
2 0.189495 0.214141 1.000000 0.279394
3 0.228081 0.220606 0.279394 1.000000
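For intuition, the sample Kendall’s \(\tau\) for one pair of variables can be computed by brute force over all pairs of observations. The sketch below assumes the tau-a form (products of signs averaged over all pairs, with no correction for ties); the helper name kendall_tau_a is hypothetical and latentcor may compute the same quantity differently internally.

```python
import numpy as np

def kendall_tau_a(x, y):
    """Kendall's tau-a: average of sign products over all pairs i < i'."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Pairwise sign matrices; the diagonal and tied pairs contribute 0.
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    # The full matrix counts every unordered pair twice, so dividing by
    # n * (n - 1) matches 2 / (n * (n - 1)) times the sum over i < i'.
    return (sx * sy).sum() / (n * (n - 1))

tau_binary = kendall_tau_a([0, 0, 1, 1], [1, 2, 3, 4])
```

With a binary first argument, tied pairs contribute zero, so the value stays strictly below 1 even under perfect monotone agreement. This is the reason the bound \(\bar{\tau}_{jk}\) appears later in the approximation method.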
Using \(F\) and \(\widehat \tau_{jk}\), a moment-based estimator is \(\hat{\sigma}_{jk}=F^{-1}(\hat{\tau}_{jk})\) with the corresponding \(\hat{\Sigma}\) being consistent for \(\Sigma\) [FLNZ17, QBW18, YCG20].
The explicit form of the bridge function \(F\) has been derived for all combinations of continuous (C)/binary (B)/ternary (N)/truncated (T) variable types, and we summarize the corresponding references. Each of these combinations is implemented in latentcor.
Below we provide an explicit form of \(F\) for each combination.
Theorem (explicit form of bridge function)
Let \(W_{1}\in\cal{R}^{p_{1}}\), \(W_{2}\in\cal{R}^{p_{2}}\), \(W_{3}\in\cal{R}^{p_{3}}\), \(W_{4}\in\cal{R}^{p_{4}}\) be such that \(W=(W_{1}, W_{2}, W_{3}, W_{4})\sim NPN(0, \Sigma, f)\) with \(p=p_{1}+p_{2}+p_{3}+p_{4}\). Let \(X=(X_{1}, X_{2}, X_{3}, X_{4})\in\cal{R}^{p}\) satisfy \(X_{j}=W_{j}\) for \(j=1, ..., p_{1}\), \(X_{j}=I(W_{j}>c_{j})\) for \(j=p_{1}+1, ..., p_{1}+p_{2}\), \(X_{j}=I(W_{j}>c_{j})+I(W_{j}>c'_{j})\) for \(j=p_{1}+p_{2}+1, ..., p_{1}+p_{2}+p_{3}\) and \(X_{j}=I(W_{j}>c_{j})W_{j}\) for \(j=p_{1}+p_{2}+p_{3}+1, ..., p\) with \(\Delta_{j}=f_{j}(c_{j})\). The rank-based estimator of \(\Sigma\) based on the observed \(n\) realizations of \(X\) is the matrix \(\mathbf{\hat{R}}\) with \(\hat{r}_{jj}=1\), \(\hat{r}_{jk}=\hat{r}_{kj}=F^{-1}(\hat{\tau}_{jk})\) with block structure
where \(\Delta_{j}=\Phi^{-1}(\pi_{0j})\), \(\Delta_{k}=\Phi^{-1}(\pi_{0k})\), \(\Delta_{j}^{1}=\Phi^{-1}(\pi_{0j})\), \(\Delta_{j}^{2}=\Phi^{-1}(\pi_{0j}+\pi_{1j})\), \(\Delta_{k}^{1}=\Phi^{-1}(\pi_{0k})\), \(\Delta_{k}^{2}=\Phi^{-1}(\pi_{0k}+\pi_{1k})\),
and
Estimation methods
Given the form of the bridge function \(F\), obtaining a moment-based estimate \(\widehat \sigma_{jk}\) requires inversion of \(F\). latentcor implements two methods for calculating the inverse:
method = "original"
method = "approx"
Both methods calculate the inverse bridge function applied to each element of the sample Kendall’s \(\tau\) matrix. Because the calculation is performed point-wise (separately for each pair of variables), the resulting point-wise estimator of the correlation matrix may not be positive semi-definite. latentcor performs projection of the pointwise estimator to the space of positive semi-definite matrices, and allows for shrinkage towards the identity matrix using the parameter nu.
Original method (`method = "original"`)
The original estimation approach relies on numerical inversion of \(F\) by solving a one-dimensional bounded minimization problem. Given the calculated \(\widehat \tau_{jk}\) (sample Kendall’s \(\tau\) between variables \(j\) and \(k\)), the estimate of the latent correlation \(\widehat \sigma_{jk}\), denoted \(\hat{r}_{jk}\), is obtained by calling the scipy.optimize.fminbound function to solve the following optimization problem:
\[\hat{r}_{jk}=\arg\min_{r}\{F(r)-\hat{\tau}_{jk}\}^{2}.\]
The parameter tol controls the desired accuracy of the minimizer and is passed to scipy.optimize.fminbound, with the default precision of \(10^{-8}\).
estimate_original = latentcor(X, tps = ["con", "bin", "ter", "tru"], method = "original", tol = 1e-8)
Algorithm for Original method
Input: \(F(r)=F(r, \mathbf{\Delta})\) - bridge function based on the type of variables \(j\), \(k\)
Step 1. Calculate \(\hat{\tau}_{jk}\) using \((1)\).
print(estimate_original['K'])
0 1 2 3
0 1.000000 0.210505 0.189495 0.228081
1 0.210505 1.000000 0.214141 0.220606
2 0.189495 0.214141 1.000000 0.279394
3 0.228081 0.220606 0.279394 1.000000
Step 2. For binary/truncated variable \(j\), set \(\hat{\mathbf{\Delta}}_{j}=\hat{\Delta}_{j}=\Phi^{-1}(\pi_{0j})\) with \(\pi_{0j}=\sum_{i=1}^{n}\frac{I(x_{ij}=0)}{n}\). For ternary variable \(j\), set \(\hat{\mathbf{\Delta}}_{j}=(\hat{\Delta}_{j}^{1}, \hat{\Delta}_{j}^{2})\) where \(\hat{\Delta}_{j}^{1}=\Phi^{-1}(\pi_{0j})\) and \(\hat{\Delta}_{j}^{2}=\Phi^{-1}(\pi_{0j}+\pi_{1j})\) with \(\pi_{0j}=\sum_{i=1}^{n}\frac{I(x_{ij}=0)}{n}\) and \(\pi_{1j}=\sum_{i=1}^{n}\frac{I(x_{ij}=1)}{n}\).
print(estimate_original['zratios'])
[[nan 0.5 0.3 0.5]
[nan nan 0.8 nan]]
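Step 2 amounts to counting zeros (and ones, for ternary variables) and applying the standard normal quantile function. A minimal sketch, assuming scipy.stats.norm.ppf for \(\Phi^{-1}\); the helper name zratio_and_delta is hypothetical and is not part of the latentcor API.

```python
import numpy as np
from scipy.stats import norm

def zratio_and_delta(x, ternary=False):
    """Return cumulative proportions and the matching cutoffs Phi^{-1}(.)."""
    x = np.asarray(x)
    pi0 = np.mean(x == 0)                      # proportion of zeros
    if not ternary:
        cum = np.array([pi0])                  # binary / truncated: pi_0
    else:
        pi1 = np.mean(x == 1)                  # proportion of ones
        cum = np.array([pi0, pi0 + pi1])       # ternary: pi_0 and pi_0 + pi_1
    return cum, norm.ppf(cum)

# Ten ternary observations: three zeros, five ones, two twos.
x_ter = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2])
cum_ter, delta_ter = zratio_and_delta(x_ter, ternary=True)
```

For this toy vector the cumulative proportions are 0.3 and 0.8, the same pattern as the ternary column of zratios printed above.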
Step 3. Compute \(F^{-1}(\hat{\tau}_{jk})\) as \(\hat{r}_{jk}=argmin\{F(r)-\hat{\tau}_{jk}\}^{2}\), solved via the scipy.optimize.fminbound function with accuracy tol.
print(estimate_original['Rpointwise'])
0 1 2 3
0 1.000000 0.459150 0.349073 0.409995
1 0.459150 1.000000 0.550386 0.551777
2 0.349073 0.550386 1.000000 0.585960
3 0.409995 0.551777 0.585960 1.000000
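Step 3 can be illustrated on the one case where the bridge function has a simple closed form: for two continuous variables under the Gaussian copula, \(F(r)=\frac{2}{\pi}\arcsin(r)\), so the numerical answer can be checked against \(\sin(\pi\tau/2)\). The sketch below is not latentcor's internal code; the function names and the search interval \([-0.99, 0.99]\) are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import fminbound

def bridge_cc(r):
    """Continuous/continuous bridge function, F(r) = (2 / pi) * arcsin(r)."""
    return 2.0 / np.pi * np.arcsin(r)

def invert_bridge(tau, bridge=bridge_cc, tol=1e-8):
    """Numerically invert a bridge function by bounded scalar minimization."""
    objective = lambda r: (bridge(r) - tau) ** 2
    # The search interval is an assumption of this sketch.
    return float(fminbound(objective, -0.99, 0.99, xtol=tol))

r_hat = invert_bridge(0.3)
closed_form = np.sin(np.pi * 0.3 / 2.0)   # exact inverse for this bridge
```

For the mixed cases the same minimization applies, with bridge replaced by the appropriate \(F(r, \mathbf{\Delta})\) evaluated at the cutoffs estimated in Step 2.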
Approximation method (`method = "approx"`)
A faster approximation method is based on multi-linear interpolation of pre-computed inverse bridge function on a fixed grid of points [YMullerG21]. This is possible as the inverse bridge function is an analytic function of at most \(5\) parameters:
Kendall’s \(\tau\)
Proportion of zeros in the \(1st\) variable
(Possibly) proportion of zeros and ones in the \(1st\) variable
(Possibly) proportion of zeros in the \(2nd\) variable
(Possibly) proportion of zeros and ones in the \(2nd\) variable
In short, d-dimensional multi-linear interpolation uses a weighted average of \(2^{d}\) neighbors to approximate the function values at the points within the d-dimensional cube of the neighbors. To perform interpolation, latentcor uses scipy.interpolate.RegularGridInterpolator from the Python package SciPy. This approximation method was first described in [YMullerG21] for continuous/binary/truncated cases. In latentcor, we additionally implement the ternary case, and optimize the choice of grid as well as the interpolation boundary for faster computations with a smaller memory footprint.
estimate_approx = latentcor(X, tps = ["con", "bin", "ter", "tru"], method = "approx")
print(estimate_approx['Rpointwise'])
0 1 2 3
0 1.000000 0.458928 0.348806 0.409425
1 0.458928 1.000000 0.558389 0.551532
2 0.348806 0.558389 1.000000 0.583079
3 0.409425 0.551532 0.583079 1.000000
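The idea of interpolating a pre-computed inverse bridge function can be sketched in the simplest setting, \(d=1\), where the only grid axis is \(\tau\) and each query uses \(2^{1}=2\) neighbors. The example uses the continuous/continuous inverse \(\sin(\pi\tau/2)\) as a stand-in; the grid size of 201 points is an arbitrary assumption and is not the grid shipped with latentcor. The mixed cases add one grid axis per cutoff.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Pre-compute the inverse bridge function once on a fixed grid of tau values.
tau_grid = np.linspace(-1.0, 1.0, 201)
inverse_on_grid = np.sin(np.pi * tau_grid / 2.0)

# Multi-linear interpolator over the (one-dimensional) grid.
interp = RegularGridInterpolator((tau_grid,), inverse_on_grid, method="linear")

def approx_inverse(tau):
    """Approximate F^{-1}(tau) by interpolating the pre-computed values."""
    points = np.atleast_1d(np.asarray(tau, dtype=float)).reshape(-1, 1)
    return interp(points)

taus = np.linspace(-0.95, 0.95, 50)
max_error = np.max(np.abs(approx_inverse(taus) - np.sin(np.pi * taus / 2.0)))
```

Once the grid values are stored, each query costs a lookup and a weighted average instead of a numerical minimization, which is the source of the speed-up.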
Algorithm for Approximation method
Input: Let \(\check{g}=h(g)\), pre-computed values \(F^{-1}(h^{-1}(\check{g}))\) on a fixed grid \(\check{g}\in\check{\cal{G}}\) based on the type of variables \(j\) and \(k\). For binary/continuous case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j})\); for binary/binary case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}, \check{\Delta}_{k})\); for truncated/continuous case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j})\); for truncated/truncated case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}, \check{\Delta}_{k})\); for ternary/continuous case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2})\); for ternary/binary case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k})\); for ternary/truncated case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k})\); for ternary/ternary case, \(\check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k}^{1}, \check{\Delta}_{k}^{2})\).
Steps 1 and 2 are the same as in the Original method.
Step 3. If \(|\hat{\tau}_{jk}|\le \mbox{ratio}\times \bar{\tau}_{jk}(\cdot)\), apply interpolation; otherwise apply Original method.
To avoid interpolation in areas with high approximation errors close to the boundary, we use a hybrid scheme in Step 3. The parameter ratio controls the size of the region where the interpolation is performed (ratio = 0 means no interpolation, ratio = 1 means interpolation is always performed). For the derivation of the approximate bound for the BC, BB, TC, TB, TT cases see [YMullerG21]. The derivation of the approximate bound for the NC, NB, NN, NT cases is in the Appendix.
By default, latentcor uses ratio = 0.9, as this value was recommended in [YMullerG21] as having a good balance of accuracy and computational speed. This value, however, can be modified by the user:
print(latentcor(X, tps = ["con", "bin", "ter", "tru"], method = "approx", ratio = 0.99)['R'])
print(latentcor(X, tps = ["con", "bin", "ter", "tru"], method = "approx", ratio = 0.4)['R'])
print(latentcor(X, tps = ["con", "bin", "ter", "tru"], method = "original")['R'])
0 1 2 3
0 1.000000 0.458469 0.348457 0.409015
1 0.458469 1.000000 0.557830 0.550981
2 0.348457 0.557830 1.000000 0.582496
3 0.409015 0.550981 0.582496 1.000000
0 1 2 3
0 1.000000 0.458691 0.348457 0.409015
1 0.458691 1.000000 0.549835 0.551231
2 0.348457 0.549835 1.000000 0.582496
3 0.409015 0.551231 0.582496 1.000000
0 1 2 3
0 1.000000 0.458691 0.348729 0.409567
1 0.458691 1.000000 0.549835 0.551247
2 0.348729 0.549835 1.000000 0.585358
3 0.409567 0.551247 0.585358 1.000000
The lower the ratio, the closer the approximation method is to the original method (with ratio = 0 being equivalent to method = "original"), but also the higher the cost of computations.
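The switching rule of Step 3 can be sketched as a small dispatcher: interpolate when \(|\hat{\tau}_{jk}|\le \mbox{ratio}\times\bar{\tau}_{jk}\), otherwise fall back to numerical inversion. The example again uses the continuous/continuous stand-in, for which the bound is taken to be \(\bar{\tau}=1\) (an assumption of this sketch); all names are hypothetical and this is not latentcor's internal implementation.

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator
from scipy.optimize import fminbound

TAU_BAR = 1.0  # assumed bound on |tau| for the continuous/continuous stand-in

# Pre-computed inverse bridge values (stand-in: F^{-1}(tau) = sin(pi * tau / 2)).
_tau_grid = np.linspace(-1.0, 1.0, 201)
_interp = RegularGridInterpolator((_tau_grid,), np.sin(np.pi * _tau_grid / 2.0))

def _original(tau, tol=1e-8):
    """Numerical inversion of F(r) = (2 / pi) * arcsin(r)."""
    objective = lambda r: (2.0 / np.pi * np.arcsin(r) - tau) ** 2
    return float(fminbound(objective, -0.999, 0.999, xtol=tol))

def hybrid_inverse(tau, ratio=0.9):
    """Return (estimate, method used) following the Step 3 rule."""
    if abs(tau) <= ratio * TAU_BAR:
        return float(_interp([[tau]])[0]), "interpolation"
    return _original(tau), "original"

inner = hybrid_inverse(0.3)    # inside the interpolation region
outer = hybrid_inverse(0.95)   # near the boundary, numerical inversion
```

Lowering ratio shrinks the interpolation region, so more pairs are sent to the slower numerical inversion.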
Rescaled Grid for Interpolation
Since \(|\hat{\tau}|\le \bar{\tau}\), the grid does not need to cover the whole domain \(\tau\in[-1, 1]\). To optimize memory associated with storing the grid, we rescale \(\tau\) as follows:
where \(\bar{\tau}_{jk}\) is as defined above.
In addition, for ternary variable \(j\), it always holds that
and
Thus, the grid should not cover the area corresponding to
We thus rescale as follows:
Adjustment of pointwise-estimator for positive-definiteness
Since the estimation is performed point-wise, the resulting matrix of estimated latent correlations is not guaranteed to be positive semi-definite. For example, this could be expected when the sample size is small (and so the estimation error for each pairwise correlation is larger).
X = gen_data(n = 6, tps = ["con", "bin", "ter", "tru"])['X']
print(latentcor(X, tps = ["con", "bin", "ter", "tru"])['Rpointwise'])
0 1 2 3
0 1.000000 0.147724 0.985722 0.854822
1 0.147724 1.000000 0.526404 0.172673
2 0.985722 0.526404 1.000000 0.986970
3 0.854822 0.172673 0.986970 1.000000
latentcor automatically corrects the pointwise estimator to be positive definite by making two adjustments.
First, if Rpointwise has a smallest eigenvalue less than zero, latentcor projects this matrix to the nearest positive semi-definite matrix.
The user is notified of this adjustment through the message (suppressed in previous code chunk), e.g.
print(latentcor(X, tps = ["con", "bin", "ter", "tru"])['R'])
0 1 2 3
0 1.000000 0.168348 0.919167 0.883261
1 0.168348 1.000000 0.481831 0.192184
2 0.919167 0.481831 1.000000 0.923864
3 0.883261 0.192184 0.923864 1.000000
Second, latentcor shrinks the adjusted matrix of correlations towards the identity matrix using the parameter \(\nu\) with default value of 0.001 (nu = 0.001), so that the resulting R is strictly positive definite with the minimal eigenvalue being greater than or equal to \(\nu\). That is,
\[R=(1-\nu)\widetilde{R}+\nu I,\]
where \(\widetilde R\) is the nearest positive semi-definite matrix to Rpointwise.
print(latentcor(X, tps = ["con", "bin", "ter", "tru"], nu = 0.001)['R'])
0 1 2 3
0 1.000000 0.168349 0.919170 0.883262
1 0.168349 1.000000 0.481829 0.192184
2 0.919170 0.481829 1.000000 0.923863
3 0.883262 0.192184 0.923863 1.000000
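The two adjustments can be sketched with plain NumPy. The projection below clips negative eigenvalues and rescales to unit diagonal, which is a simple stand-in and not necessarily the nearest-matrix algorithm used by latentcor, so its output will generally differ from the R printed above. The shrinkage step forms \((1-\nu)\widetilde{R}+\nu I\). The function name project_and_shrink is hypothetical.

```python
import numpy as np

def project_and_shrink(r_pointwise, nu=0.001):
    """Clip negative eigenvalues, rescale to unit diagonal, shrink to identity."""
    r = np.asarray(r_pointwise, dtype=float)
    vals, vecs = np.linalg.eigh(r)
    # Stand-in projection: replace negative eigenvalues by zero.
    r_psd = (vecs * np.clip(vals, 0.0, None)) @ vecs.T
    # Rescale so the result is again a correlation matrix (unit diagonal).
    d = np.sqrt(np.diag(r_psd))
    r_psd = r_psd / np.outer(d, d)
    # Shrinkage towards the identity matrix.
    return (1.0 - nu) * r_psd + nu * np.eye(r.shape[0])

# Pointwise estimate printed earlier for the n = 6 sample.
r_pointwise = np.array([
    [1.000000, 0.147724, 0.985722, 0.854822],
    [0.147724, 1.000000, 0.526404, 0.172673],
    [0.985722, 0.526404, 1.000000, 0.986970],
    [0.854822, 0.172673, 0.986970, 1.000000],
])
R_adj = project_and_shrink(r_pointwise, nu=0.001)
min_eig_before = np.linalg.eigvalsh(r_pointwise).min()
min_eig_after = np.linalg.eigvalsh(R_adj).min()
```

Because the projected matrix is positive semi-definite, every eigenvalue of the shrunk matrix is at least \(\nu\), which is what makes the final estimate strictly positive definite.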
As a result, R and Rpointwise could be quite different when sample size n is small. When n is large and p is moderate, the difference is typically driven by parameter nu.
X = gen_data(n = 100, tps = ["con", "bin", "ter", "tru"])['X']
out = latentcor(X, tps = ["con", "bin", "ter", "tru"], nu = 0.001)
print(out['Rpointwise'])
print(out['R'])
0 1 2 3
0 1.000000 0.318497 0.450533 0.595619
1 0.318497 1.000000 0.357000 0.226670
2 0.450533 0.357000 1.000000 0.386246
3 0.595619 0.226670 0.386246 1.000000
0 1 2 3
0 1.000000 0.318179 0.450083 0.595023
1 0.318179 1.000000 0.356643 0.226443
2 0.450083 0.356643 1.000000 0.385860
3 0.595023 0.226443 0.385860 1.000000
Appendix
Derivation of bridge function for ternary/truncated case
Without loss of generality, let \(j=1\) and \(k=2\). By the definition of Kendall’s \(\tau\),
Since \(X_{1}\) is ternary,
Since \(X_{2}\) is truncated, \(C_{1}>0\) and
Since \(f\) is monotonically increasing, \(sign(X_{2}-X_{2}')=sign(Z_{2}-Z_{2}')\),
From the definition of \(U\), let \(Z_{j}=f_{j}(U_{j})\) and \(\Delta_{j}=f_{j}(C_{j})\) for \(j=1,2\). Using \(sign(x)=2I(x>0)-1\), we obtain
Since \(\{\frac{Z_{2}'-Z_{2}}{\sqrt{2}}, -Z_{1}\}\), \(\{\frac{Z_{2}'-Z_{2}}{\sqrt{2}}, Z_{1}'\}\) and \(\{\frac{Z_{2}'-Z_{2}}{\sqrt{2}}, -Z_{2}'\}\) are standard bivariate normally distributed variables with correlation \(-\frac{1}{\sqrt{2}}\), \(\frac{r}{\sqrt{2}}\) and \(-\frac{r}{\sqrt{2}}\), respectively, by the definition of \(\Phi_3(\cdot,\cdot, \cdot;\cdot)\) and \(\Phi_4(\cdot,\cdot, \cdot,\cdot;\cdot)\) we have
Using the facts that
and
So that,
It is easy to get the bridge function for truncated/ternary case by switching \(j\) and \(k\).
Derivation of approximate bound for the ternary/continuous case
Let \(n_{0x}=\sum_{i=1}^{n_x}I(x_{i}=0)\), \(n_{2x}=\sum_{i=1}^{n_x}I(x_{i}=2)\), \(\pi_{0x}=\frac{n_{0x}}{n_{x}}\) and \(\pi_{2x}=\frac{n_{2x}}{n_{x}}\), then
For ternary/binary and ternary/ternary cases, we combine the two individual bounds.
Derivation of approximate bound for the ternary/truncated case
Let \(\mathbf{x}\in\mathcal{R}^{n}\) and \(\mathbf{y}\in\mathcal{R}^{n}\) be the observed \(n\) realizations of ternary and truncated variables, respectively. Let \(n_{0x}=\sum_{i=1}^{n}I(x_{i}=0)\), \(\pi_{0x}=\frac{n_{0x}}{n}\), \(n_{1x}=\sum_{i=1}^{n}I(x_{i}=1)\), \(\pi_{1x}=\frac{n_{1x}}{n}\), \(n_{2x}=\sum_{i=1}^{n}I(x_{i}=2)\), \(\pi_{2x}=\frac{n_{2x}}{n}\), \(n_{0y}=\sum_{i=1}^{n}I(y_{i}=0)\), \(\pi_{0y}=\frac{n_{0y}}{n}\), \(n_{0x0y}=\sum_{i=1}^{n}I(x_{i}=0 \;\& \; y_{i}=0)\), \(n_{1x0y}=\sum_{i=1}^{n}I(x_{i}=1 \;\& \; y_{i}=0)\) and \(n_{2x0y}=\sum_{i=1}^{n}I(x_{i}=2 \;\& \; y_{i}=0)\), then
Since \(n_{0x0y}\leq\min(n_{0x},n_{0y})\), \(n_{1x0y}\leq\min(n_{1x},n_{0y})\) and \(n_{2x0y}\leq\min(n_{2x},n_{0y})\) we obtain
It is easy to get the approximate bound for truncated/ternary case by switching \(\mathbf{x}\) and \(\mathbf{y}\).