`nl_causal.ts_models.ts_twas`

Two Stage IV Causal Methods based GWAS Dataset

Module Contents

Classes

`_2SLS`	Two-Stage least squares (2SLS) regression is a statistical technique that is used in the analysis of structural equations:
`_2SIR`	Two-stage instrumental regression (2SIR): sliced inverse regression + least sqaure

class nl_causal.ts_models.ts_twas._2SLS(normalize=True, sparse_reg=None)

Bases: object

Two-Stage least squares (2SLS) regression is a statistical technique that is used in the analysis of structural equations:

(stage 1) x = z^T theta + omega;                (Stage 2) y = beta x + z^T alpha + epsilon

Note that data is expected to be centered, and y is normarlized as sd(y) = 1.

Parameters:

normalize: bool, default=True: Whether to normalize the resulting theta in Stage 1.nl_causal/ts_models/ts_twas.py
fit_flag: bool, default=False: A flag to indicate if the estimation is done.
sparse_reg: class, default=None: A sparse regression used in the Stage 2. If set to None, we will use OLS in for the Stage 2.

Examples

>>> import numpy as np
>>> from nl_causal import ts_models
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> n, p = 1000, 50
>>> np.random.seed(0)
>>> Z = np.random.randn(n, p)
>>> U, eps, gamma = np.random.randn(n), np.random.randn(n), np.random.randn(n)
>>> theta0 = np.random.randn(p)
>>> theta0 = theta0 / np.sqrt(np.sum(theta0**2))
>>> beta0 = .5
>>> X = np.dot(Z, theta0) + U**2 + eps
>>> y = beta0 * X + U + gamma
>>> center = StandardScaler(with_std=False)
>>> mean_X, mean_y = X.mean(), y.mean()
>>> y_scale = y.std()
>>> y = y / y_scale
>>> Z, X, y = center.fit_transform(Z), X - mean_X, y - mean_y
>>> Z1, Z2, X1, X2, y1, y2 = train_test_split(Z, X, y, test_size=.5, random_state=42)
>>> n1, n2 = len(Z1), len(Z2)
>>> LD_Z1, cov_ZX1 = np.dot(Z1.T, Z1), np.dot(Z1.T, X1)
>>> LD_Z2, cov_ZY2 = np.dot(Z2.T, Z2), np.dot(Z2.T, y2)
>>> LS = ts_models._2SLS(sparse_reg=None)
>>> LS.fit_theta(LD_Z1, cov_ZX1)
>>> LS.theta
array([ 0.07268042, -0.06393631, -0.34863568, -0.04453319, -0.10313442,
    0.09284717,  0.13514061,  0.24597997,  0.08304104,  0.02459302,
    -0.06349473,  0.21502581,  0.21177209, -0.08895919,  0.08082504,
    -0.13174606,  0.04644996,  0.23102817, -0.05635546,  0.08286319,
    0.20736079, -0.07574444, -0.20129764,  0.20311458, -0.15965619,
    -0.02148001,  0.01761156,  0.10617795,  0.02028776, -0.00221961,
    -0.11686226, -0.09116777,  0.08004126,  0.00663467, -0.13549927,
    -0.12674926,  0.09331474, -0.24688913, -0.18701941,  0.02714403,
    0.0854651 ,  0.30291367,  0.08926479,  0.023272  , -0.04798961,
    0.26668035, -0.16051689,  0.01169355, -0.08651508, -0.1342292 ])
>>> LS.fit_beta(LD_Z2, cov_ZY2, n2=n2)
>>> LS.beta
0.47526913304288304
>>> LS.test_effect(n2, LD_Z2, cov_ZY2)
>>> LS.p_value
1.2114293989960872e-59
>>> LS.CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, level=.95)
array([0.38869162, 0.56184665])

Attributes:

p_value: float

P-value for hypothesis testing::: H_0: beta = 0; H_a: beta neq 0.

theta: array of shape (n_features, )

Estimated linear coefficients for the linear regression in Stage 1.

beta: float

The marginal causal effect beta in Stage 2.

alpha: array of shape (n_features, )

Estimated linear coefficients for Invalid IVs in Stage 2.

CI: array of shape (2, )

Estimated confidence interval for marginal causal effect (beta).

fit_theta(LD_Z1, cov_ZX1)

Fit the linear model in Stage 1.

Parameters:

LD_Z1: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the first sample: LD_Z1 = np.dot(Z1.T, Z1)
cov_ZX1: {array-like, float} of shape (n_features, ): Cov(Z1, X1); cov_ZX1 = np.dot(Z1.T, X1)

Returns:

self: returns an theta of self.

fit_beta(LD_Z2, cov_ZY2, n2, criterion='ebic')

Fit the linear model in Stage 2 based on GWAS data.

Parameters:

LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)
n2: int: The number of sample in the second GWAS dataset.

Returns:

self: returns beta of self.

selection_summary()

A summary for the result of model selection of the sparse regression in Stage 2.

Returns:

df: dataframe: dataframe with columns: candidate_model, criteria, and mse.

est_var_res(n2, LD_Z2, cov_ZY2)

Estimated variance for y regress on Z.

Parameters:

n2: int: The number of sample on the second dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:

The estimated variance y regress on Z.

fit(LD_Z1, cov_ZX1, LD_Z2, cov_ZY2, n2)

Fit the linear model in Stage 2 based on GWAS data.

Parameters:

LD_Z1: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the first sample: LD_Z1 = np.dot(Z1.T, Z1)
cov_ZX1: {array-like, float} of shape (n_features, ): Cov(Z1, X1); cov_ZX1 = np.dot(Z1.T, X1)
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)
n2: int: The number of sample in the second GWAS dataset.

Returns:

self: return theta of self.
self: returns beta of self.

CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, level=0.95)

Estimated confidence interval (CI) for the causal effect beta.

Parameters:

n1: int: The number of sample on the first dataset.
n2: int: The number of sample on the second dataset.
Z1: {array-like, float} of shape (n_sample, n_features): Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.
X1: {array-like, float} of shape (n_sample): Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)
level: float, default=0.95: The confidence level to compute, which must be between 0 and 1 inclusive.

test_effect(n2, LD_Z2, cov_ZY2)

Causal inference for the marginal causal effect beta. P-value for hypothesis testing:

H0: `beta` = 0;             Ha: `beta` neq 0.

Parameters:

n2: int: The number of samples in the second dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:

self: returns p-value of self.

summary(precision=4)

class nl_causal.ts_models.ts_twas._2SIR(n_directions=1, n_slices='auto', data_in_slice=100, cond_mean=KNeighborsRegressor(n_neighbors=10), if_fit_link=True, sparse_reg=None)

Bases: object

Two-stage instrumental regression (2SIR): sliced inverse regression + least sqaure

Two-stage instrumental regression (2SIR) is a statistical technique used in the analysis of structural equations:

(stage 1) phi(x) = z^T theta + omega;               (Stage 2) y = beta phi(x) + z^T alpha + epsilon

Note that data is expected to be centered, and y is normarlized as sd(y) = 1.

Parameters:

n_directions: int, default=False: A number of directions for sliced inverse regression (SIR). Currently, we only focus on 1-dimensional case.
n_slices: int, default=’auto’: A number of slices for SIR, if n_slices=’auto’, then it is determined by int(n_sample/data_in_slice).
data_in_slice: int, default=100: A number of samples in a slice for SIR. If data_in_slice is not None, then n_slices=int(n_sample/data_in_slice).
cond_mean: class, default=KNeighborsRegressor(n_neighbors=10): A nonparameteric regression model for estimate link function.
if_fit_link: bool, default=True: Whether to calculate the link function phi for this model.
fit_flag: bool, default=False: A flag to indicate if the estimation is done.
sparse_reg: class, default=None: A sparse regression used in the Stage 2. If set to None, we will use OLS in for the Stage 2.

Examples

>>> import numpy as np
>>> from nl_causal import ts_models
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.model_selection import train_test_split
>>> n, p = 2000, 50
>>> np.random.seed(0)
>>> Z = np.random.randn(n, p)
>>> U, eps, gamma = np.random.randn(n), np.random.randn(n), np.random.randn(n)
>>> theta0 = np.random.randn(p)
>>> theta0 = theta0 / np.sqrt(np.sum(theta0**2))
>>> beta0 = 1.
>>> X = 1. / (np.dot(Z, theta0) + U**2 + eps)
>>> phi = 1. / X
>>> y = beta0 * phi + U + gamma
>>> ## normalize Z, X, y
>>> center = StandardScaler(with_std=False)
>>> mean_X, mean_y = X.mean(), y.mean()
>>> Z, X, y = center.fit_transform(Z), X - mean_X, y - mean_y
>>> y_scale = y.std()
>>> y = y / y_scale
>>> Z1, Z2, X1, X2, y1, y2 = train_test_split(Z, X, y, test_size=.5, random_state=42)
>>> n1, n2 = len(Z1), len(Z2)
>>> LD_Z1, cov_ZX1 = np.dot(Z1.T, Z1), np.dot(Z1.T, X1)
>>> LD_Z2, cov_ZY2 = np.dot(Z2.T, Z2), np.dot(Z2.T, y2)
>>> ## Define 2SLS
>>> nl_twas = ts_models._2SIR(sparse_reg=None)
>>> ## Estimate theta in Stage 1
>>> nl_twas.fit_theta(Z1, X1)
>>> nl_twas.theta
>>> array([-0.31331769, -0.10041349, -0.0068858 ,  0.01292981, -0.06329046,
    -0.16037138, -0.16804225,  0.21417387,  0.03366815,  0.06368144,
    -0.24217008,  0.05969993,  0.10396408,  0.1464862 , -0.08197703,
    0.06766957,  0.0663083 , -0.11894489,  0.01675101,  0.29531242,
    0.1923396 , -0.02299256,  0.14921243, -0.01075526, -0.04526044,
    -0.03111288,  0.05537396, -0.02358006,  0.12615653,  0.03938541,
    -0.09100911,  0.12907855,  0.19518874, -0.23574715, -0.12036349,
    -0.04914323, -0.03463147,  0.01019404,  0.15832153, -0.0180269 ,
    0.05225932,  0.33307795,  0.1104155 , -0.21012056, -0.16505056,
    0.16029017,  0.04530822,  0.24969932,  0.13906269,  0.13336765])
>>> ## Estimate beta in Stage 2
>>> nl_twas.fit_beta(LD_Z2, cov_ZY2, n2)
>>> nl_twas.beta*y_scale
>>> 1.0294791446256897
>>> ## p-value for infer if causal effect is zero
>>> nl_twas.test_effect(n2, LD_Z2, cov_ZY2)
>>> nl_twas.p_value
>>> 5.582982509645985e-48
>>> nl_twas.CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, B_sample=1000, level=.95)
>>> nl_twas.CI*y_scale
>>> array([0.82237254, 1.23658575])

Attributes:

p_value: float

P-value for hypothesis testing::: H0: beta = 0; Ha: beta neq 0.

theta: array of shape (n_features, )

Estimated linear coefficients for the linear regression in Stage 1.

beta: float

The marginal causal effect beta in Stage 2.

alpha: array of shape (n_features, )

Estimated linear coefficients for Invalid IVs in Stage 2.

rho: float

A correction ratio for the link estimation.

CI: array of shape (2, )

Estimated confidence interval for marginal causal effect (beta).

fit_theta(Z1, X1)

Estimate theta in Stage 1 by using sliced inverse regression (SIR).

Parameters:

Z1: {array-like, float} of shape (n_sample, n_features): Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.
X1: {array-like, float} of shape (n_sample): Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

Returns:

self: returns theta of self.

fit_beta(LD_Z2, cov_ZY2, n2, criterion='ebic')

Fit the linear model in Stage 2 based on GWAS data.

Parameters:

LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)
n2: int: The number of samples in the second GWAS dataset, which will be used for model selection in sparse regerssion.

Returns:

self: returns beta of self.

selection_summary()

A summary for the result of model selection of the sparse regression in Stage 2.

Returns:

df: dataframe: dataframe with columns: candidate_model, criteria, and mse.

fit_link(Z1, X1)

Estimate nonlinear link (phi) in Stage 1.

Parameters:

Z1: {array-like, float} of shape (n_sample, n_features): Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.
X1: {array-like, float} of shape (n_sample): Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.

Returns:

self: returns cond_mean of self.: Returns estimated conditional mean of z^T theta conditional on X1
self: returns rho of self.: Return estimated correction ratio of conditional mean.

fit(Z1, X1, LD_Z2, cov_ZY2, n2)

Fit theta, beta, (and phi) in the causal model.

Parameters:

Z1: {array-like, float} of shape (n_sample, n_features): Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.
X1: {array-like, float} of shape (n_sample): Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)
n2: int: The number of samples in the second GWAS dataset, which will be used for model selection in sparse regerssion.

Returns:

self: returns theta of self.
self: returns beta of self.
self: returns cond_mean of self.: Returns estimated conditional mean of z^T theta conditional on X1
self: returns rho of self.: Return estimated correction ratio of conditional mean.

link(X)

Values of the link function in Stage 1 on instances X.

Parameters:

X1: {array-like, float} of shape (n_sample): Samples of X, where n_sample is the number of samples.

Returns:

link: {array-like, float} of shape (n_sample): Returns values of the link function on instances X.

est_var_res(n2, LD_Z2, cov_ZY2)

Estimated variance for y regress on Z.

Parameters:

n2: int: The number of sample on the second dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:

The estimated variance y regress on Z.

CI_beta(n1, n2, Z1, X1, LD_Z2, cov_ZY2, B_sample=1000, boot_over='theta', level=0.95)

Estimated confidence interval (CI) for the causal effect beta

Parameters:

n1: int: The number of sample on the first dataset.
n2: int: The number of sample on the second dataset.
Z1: {array-like, float} of shape (n_sample, n_features): Samples of Z in the first dataset, where n_sample = n1 is the number of samples in the first dataset, and n_feature is the number of features.
X1: {array-like, float} of shape (n_sample): Samples of X in the first dataset, where n_sample = n1 is the number of samples in the first dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)
B_sample: int, default=1000: The number of bootstrap for estimate CI.
level: float, default=0.95: The confidence level to compute, which must be between 0 and 1 inclusive.

test_effect(n2, LD_Z2, cov_ZY2)

Causal inference for the marginal causal effect.

Parameters:

n2: int: The number of samples in the second dataset.
LD_Z2: {array-like, float} of shape (n_features, n_features): LD matrix of Z based on the second sample: LD_Z2 = np.dot(Z2.T, Z2)
cov_ZY2: {array-like, float} of shape (n_features, ): Matrix product of Z2 and Y2; cov_ZX2 = np.dot(Z2.T, Y2)

Returns:

self: returns p-value of self.

summary(precision=4)

nl_causal.ts_models.ts_twas

Module Contents

Classes

`nl_causal.ts_models.ts_twas`