nl_causal.ts_models.ts_twas

Two Stage IV Causal Methods based GWAS Dataset

Module Contents

Classes

_2SLS

Two-Stage least squares (2SLS) regression is a statistical technique that is used in the analysis of structural equations:

_2SIR

Two-stage instrumental regression (2SIR): sliced inverse regression + least sqaure

class nl_causal.ts_models.ts_twas._2SLS(normalize=True, sparse_reg=None)

Bases: object

Two-Stage least squares (2SLS) regression is a statistical technique that is used in the analysis of structural equations:

(stage 1) x = z^T theta + omega;                (Stage 2) y = beta x + z^T alpha + epsilon

Note that data is expected to be centered, and y is normarlized as sd(y) = 1.

Parameters:
normalize: bool, default=True

Whether to normalize the resulting theta in Stage 1.nl_causal/ts_models/ts_twas.py

fit_flag: bool, default=False

A flag to indicate if the estimation is done.

sparse_reg: class, default=None

A sparse regression used in the Stage 2. If set to None, we will use OLS in for the Stage 2.

Examples

>>> import numpy as np
>>> from nl_causal import ts_models
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> n, p = 1000, 50
>>> np.random.seed(0)
>>> Z = np.random.randn(n, p)
>>> U, eps, gamma = np.random.randn(n), np.random.randn(n), np.random.randn(n)
>>> theta0 = np.random.randn(p)
>>> theta0 = theta0 / np.sqrt(np.sum(theta0**2))
>>> beta0 = .5
>>> X = np.dot(Z, theta0) + U**2 + eps
>>> y = beta0 * X + U + gamma
>>> center = StandardScaler(with_std=False)
>>> mean_X, mean_y = X.mean(), y.mean()
>>> y_scale = y.std()
>>> y = y / y_scale
>>> Z, X, y = center.fit_transform(Z), X - mean_X, y - mean_y
>>> Z1, Z2, X1, X2, y1, y2 = train_test_split(Z, X, y, test_size=.5, random_state=42)
>>> n1, n2 = len(Z1), len(Z2)
>>> LD_Z1, cov_ZX1 = np.dot(Z1.T, Z1), np.dot(Z1.T, X1)
>>> LD_Z2, cov_ZY2 = np.dot(Z2.T, Z2), np.dot(Z2.T, y2)
>>> LS = ts_models._2SLS(sparse_reg=None)
>>> LS.fit_theta(LD_Z1, cov_ZX1)
>>> LS.theta
array([ 0.07268042, -0.06393631, -0.34863568, -0.04453319, -0.10313442,
    0.09284717,  0.13514061,  0.24597997,  0.08304104,  0.02459302,
    -0.06349473,  0.21502581,  0.21177209, -0.08895919,  0.08082504,
    -0.13174606,  0.04644996,  0.23102817, -0.05635546,  0.08286319,
    0.20736079, -0.07574444, -0.20129764,  0.20311458, -0.15965619,
    -0.02148001,  0.01761156,  0.10617795,  0.02028776, -0.00221961,
    -0.11686226, -0.09116777,  0.08004126,  0.00663467, -0.13549927,
    -0.12674926,  0.09331474, -0.24688913, -0.18701941,  0.02714403,
    0.0854651 ,  0.30291367,  0.08926479,  0.023272  , -0.04798961,
    0.26668035, -0.16051689,  0.01169355, -0.08651508, -0.1342292 ])
>>> LS.fit_beta(LD_Z2, cov_ZY2, n2=n2)
>>> LS.beta
0.47526913304288304
>>> LS.test_effect(n2, LD_Z2, cov_ZY2)
>>> LS.p_value
1.2114293989960872e-59
>>> LS.CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, level=.95)
array([0.38869162, 0.56184665])
Attributes:
p_value: float
P-value for hypothesis testing::

H_0: beta = 0; H_a: beta neq 0.

theta: array of shape (n_features, )

Estimated linear coefficients for the linear regression in Stage 1.

beta: float

The marginal causal effect beta in Stage 2.

alpha: array of shape (n_features, )

Estimated linear coefficients for Invalid IVs in Stage 2.

CI: array of shape (2, )

Estimated confidence interval for marginal causal effect (beta).

fit_theta(LD_Z1, cov_ZX1)

Fit the linear model in Stage 1.

Parameters:
LD_Z1: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the first sample: LD_Z1 = np.dot(Z1.T, Z1)

cov_ZX1: {array-like, float} of shape (n_features, )

Cov(Z1, X1); cov_ZX1 = np.dot(Z1.T, X1)

Returns:
self: returns an theta of self.
fit_beta(LD_Z2, cov_ZY2, n2, criterion='ebic')

Fit the linear model in Stage 2 based on GWAS data.

Parameters:
LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

n2: int

The number of sample in the second GWAS dataset.

Returns:
self: returns beta of self.
selection_summary()

A summary for the result of model selection of the sparse regression in Stage 2.

Returns:
df: dataframe

dataframe with columns: candidate_model, criteria, and mse.

est_var_res(n2, LD_Z2, cov_ZY2)

Estimated variance for y regress on Z.

Parameters:
n2: int

The number of sample on the second dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:
The estimated variance y regress on Z.
fit(LD_Z1, cov_ZX1, LD_Z2, cov_ZY2, n2)

Fit the linear model in Stage 2 based on GWAS data.

Parameters:
LD_Z1: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the first sample: LD_Z1 = np.dot(Z1.T, Z1)

cov_ZX1: {array-like, float} of shape (n_features, )

Cov(Z1, X1); cov_ZX1 = np.dot(Z1.T, X1)

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

n2: int

The number of sample in the second GWAS dataset.

Returns:
self: return theta of self.
self: returns beta of self.
CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, level=0.95)

Estimated confidence interval (CI) for the causal effect beta.

Parameters:
n1: int

The number of sample on the first dataset.

n2: int

The number of sample on the second dataset.

Z1: {array-like, float} of shape (n_sample, n_features)

Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.

X1: {array-like, float} of shape (n_sample)

Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

level: float, default=0.95

The confidence level to compute, which must be between 0 and 1 inclusive.

test_effect(n2, LD_Z2, cov_ZY2)

Causal inference for the marginal causal effect beta. P-value for hypothesis testing:

H0: `beta` = 0;             Ha: `beta` neq 0.
Parameters:
n2: int

The number of samples in the second dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:
self: returns p-value of self.
summary(precision=4)
class nl_causal.ts_models.ts_twas._2SIR(n_directions=1, n_slices='auto', data_in_slice=100, cond_mean=KNeighborsRegressor(n_neighbors=10), if_fit_link=True, sparse_reg=None)

Bases: object

Two-stage instrumental regression (2SIR): sliced inverse regression + least sqaure

Two-stage instrumental regression (2SIR) is a statistical technique used in the analysis of structural equations:

(stage 1) phi(x) = z^T theta + omega;               (Stage 2) y = beta phi(x) + z^T alpha + epsilon

Note that data is expected to be centered, and y is normarlized as sd(y) = 1.

Parameters:
n_directions: int, default=False

A number of directions for sliced inverse regression (SIR). Currently, we only focus on 1-dimensional case.

n_slices: int, default=’auto’

A number of slices for SIR, if n_slices=’auto’, then it is determined by int(n_sample/data_in_slice).

data_in_slice: int, default=100

A number of samples in a slice for SIR. If data_in_slice is not None, then n_slices=int(n_sample/data_in_slice).

cond_mean: class, default=KNeighborsRegressor(n_neighbors=10)

A nonparameteric regression model for estimate link function.

if_fit_link: bool, default=True

Whether to calculate the link function phi for this model.

fit_flag: bool, default=False

A flag to indicate if the estimation is done.

sparse_reg: class, default=None

A sparse regression used in the Stage 2. If set to None, we will use OLS in for the Stage 2.

Examples

>>> import numpy as np
>>> from nl_causal import ts_models
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> n, p = 2000, 50
>>> np.random.seed(0)
>>> Z = np.random.randn(n, p)
>>> U, eps, gamma = np.random.randn(n), np.random.randn(n), np.random.randn(n)
>>> theta0 = np.random.randn(p)
>>> theta0 = theta0 / np.sqrt(np.sum(theta0**2))
>>> beta0 = 1.
>>> X = 1. / (np.dot(Z, theta0) + U**2 + eps)
>>> phi = 1. / X
>>> y = beta0 * phi + U + gamma
>>> ## normalize Z, X, y
>>> center = StandardScaler(with_std=False)
>>> mean_X, mean_y = X.mean(), y.mean()
>>> Z, X, y = center.fit_transform(Z), X - mean_X, y - mean_y
>>> y_scale = y.std()
>>> y = y / y_scale
>>> Z1, Z2, X1, X2, y1, y2 = train_test_split(Z, X, y, test_size=.5, random_state=42)
>>> n1, n2 = len(Z1), len(Z2)
>>> LD_Z1, cov_ZX1 = np.dot(Z1.T, Z1), np.dot(Z1.T, X1)
>>> LD_Z2, cov_ZY2 = np.dot(Z2.T, Z2), np.dot(Z2.T, y2)
>>> ## Define 2SLS
>>> nl_twas = ts_models._2SIR(sparse_reg=None)
>>> ## Estimate theta in Stage 1
>>> nl_twas.fit_theta(Z1, X1)
>>> nl_twas.theta
>>> array([-0.31331769, -0.10041349, -0.0068858 ,  0.01292981, -0.06329046,
    -0.16037138, -0.16804225,  0.21417387,  0.03366815,  0.06368144,
    -0.24217008,  0.05969993,  0.10396408,  0.1464862 , -0.08197703,
    0.06766957,  0.0663083 , -0.11894489,  0.01675101,  0.29531242,
    0.1923396 , -0.02299256,  0.14921243, -0.01075526, -0.04526044,
    -0.03111288,  0.05537396, -0.02358006,  0.12615653,  0.03938541,
    -0.09100911,  0.12907855,  0.19518874, -0.23574715, -0.12036349,
    -0.04914323, -0.03463147,  0.01019404,  0.15832153, -0.0180269 ,
    0.05225932,  0.33307795,  0.1104155 , -0.21012056, -0.16505056,
    0.16029017,  0.04530822,  0.24969932,  0.13906269,  0.13336765])
>>> ## Estimate beta in Stage 2
>>> nl_twas.fit_beta(LD_Z2, cov_ZY2, n2)
>>> nl_twas.beta*y_scale
>>> 1.0294791446256897
>>> ## p-value for infer if causal effect is zero
>>> nl_twas.test_effect(n2, LD_Z2, cov_ZY2)
>>> nl_twas.p_value
>>> 5.582982509645985e-48
>>> nl_twas.CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, B_sample=1000, level=.95)
>>> nl_twas.CI*y_scale
>>> array([0.82237254, 1.23658575])
Attributes:
p_value: float
P-value for hypothesis testing::

H0: beta = 0; Ha: beta neq 0.

theta: array of shape (n_features, )

Estimated linear coefficients for the linear regression in Stage 1.

beta: float

The marginal causal effect beta in Stage 2.

alpha: array of shape (n_features, )

Estimated linear coefficients for Invalid IVs in Stage 2.

rho: float

A correction ratio for the link estimation.

CI: array of shape (2, )

Estimated confidence interval for marginal causal effect (beta).

fit_theta(Z1, X1)

Estimate theta in Stage 1 by using sliced inverse regression (SIR).

Parameters:
Z1: {array-like, float} of shape (n_sample, n_features)

Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.

X1: {array-like, float} of shape (n_sample)

Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

Returns:
self: returns theta of self.
fit_beta(LD_Z2, cov_ZY2, n2, criterion='ebic')

Fit the linear model in Stage 2 based on GWAS data.

Parameters:
LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

n2: int

The number of samples in the second GWAS dataset, which will be used for model selection in sparse regerssion.

Returns:
self: returns beta of self.
selection_summary()

A summary for the result of model selection of the sparse regression in Stage 2.

Returns:
df: dataframe

dataframe with columns: candidate_model, criteria, and mse.

Estimate nonlinear link (phi) in Stage 1.

Parameters:
Z1: {array-like, float} of shape (n_sample, n_features)

Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.

X1: {array-like, float} of shape (n_sample)

Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

Returns:
self: returns cond_mean of self.

Returns estimated conditional mean of z^T theta conditional on X1

self: returns rho of self.

Return estimated correction ratio of conditional mean.

fit(Z1, X1, LD_Z2, cov_ZY2, n2)

Fit theta, beta, (and phi) in the causal model.

Parameters:
Z1: {array-like, float} of shape (n_sample, n_features)

Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.

X1: {array-like, float} of shape (n_sample)

Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

n2: int

The number of samples in the second GWAS dataset, which will be used for model selection in sparse regerssion.

Returns:
self: returns theta of self.
self: returns beta of self.
self: returns cond_mean of self.

Returns estimated conditional mean of z^T theta conditional on X1

self: returns rho of self.

Return estimated correction ratio of conditional mean.

Values of the link function in Stage 1 on instances X.

Parameters:
X1: {array-like, float} of shape (n_sample)

Samples of X, where n_sample is the number of samples.

Returns:
link: {array-like, float} of shape (n_sample)

Returns values of the link function on instances X.

est_var_res(n2, LD_Z2, cov_ZY2)

Estimated variance for y regress on Z.

Parameters:
n2: int

The number of sample on the second dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:
The estimated variance y regress on Z.
CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, B_sample=1000, boot_over='theta', level=0.95)

Estimated confidence interval (CI) for the causal effect beta

Parameters:
n1: int

The number of sample on the first dataset.

n2: int

The number of sample on the second dataset.

Z1: {array-like, float} of shape (n_sample, n_features)

Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.

X1: {array-like, float} of shape (n_sample)

Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

B_sample: int, default=1000

The number of bootstrap for estimate CI.

level: float, default=0.95

The confidence level to compute, which must be between 0 and 1 inclusive.

test_effect(n2, LD_Z2, cov_ZY2)

Causal inference for the marginal causal effect.

Parameters:
n2: int

The number of samples in the second dataset.

LD_Z2: {array-like, float} of shape (n_features, n_features)

LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)

cov_ZY2: {array-like, float} of shape (n_features, )

Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:
self: returns p-value of self.
summary(precision=4)