In their 2017 paper,“Bayesian Compression for Deep Learning”, Christos Louizos, Karen Ullrich, and Max Welling outline a method for compressing neural networks, both via weight pruning and bit precision reduction. Others (Scaradapane et al. 2016; Wen, Wu 2016) have applied sparsity priors such as the Bayesian LASSO (exponential prior on the variance parameter) in order to reduce the amount of memory required to store neural networks, but Louizos, Ullrich, and Welling’s approach improves on previous work by 1) placing scale-mixture sparsity priors on entire rows of weight layers rather than on individual weights, and 2) using the estimated variance of the parameters to calculate the bit precision, or precision quantization, required to store the weights without losing too much information (i.e. causing data misfit).
Because neural networks are typically stored using a matrix representation of the weight layers, pruning individual weights independently does relatively little to reduce the memory required to store the network: pruning weights individually leads to a sprinkling of zeros in the matrix representation, but reduction in the dimensions of the matrix can only occur if an entire row or column is filled with zeros. By instead using a hierarchical structure in which entire rows share the sparsity-inducing parameter, Louizos, Ullrich, and Welling are able to “zero out” entire rows — entire neurons — and thus reduce the dimensions of the matrix required to store the weight layer. The dimension reduction also compounds across neural network layers: for fully-connected layers, since row \(i\) of weight layer \(j\) is used to compute the \(i\)’th input in weight layer \(j+1\), if row \(i\) (neuron \(i\)) of layer \(j\) is omitted, then column \(i\) of layer \(j+1\) can also be omitted.
In addition, the authors offer an information theory-based argument for using the posterior weights’ variances to impute the optimal number of bits required to encode the weights, leading to further compressibility. The main idea here is that the larger a weight’s variance, the lower the precision required to store it: if a weight has variance 1, then storing the weight with precision to the 8th decimal versus the 4th decimal does not preserve much information about the weight’s distribution. A Normal(4.0001, 1) distribution is less different from a Normal(4.00010001, 1) distribution than a Normal(4.0001, 0.01) distribution is from a Normal(4.00010001, 0.01).
The motivation for this project is the insight that, for the first weight layer, pruning the weights for entire columns would correspond to removing features or covariates. The focus of the method described above is not on variable selection in this way: while the hierarchical sparsity prior on weight layer rows leads to downstream elimination of columns, the input layer is not downstream from any other layer. Thus the method does not lead to variable selection in the usual statistical sense, though it can be used to examine feature importance on an observation-by-observation basis, because it does not remove columns in the first weight layer.
Our idea is that, by repurposing the bit quantization estimate to provide a threshold for weights to be zero’d out, we can in fact eliminate columns form the input layer in a reasonable fashion. That is, we can obtain principled variable selection with known operating characteristics via the Bayesian sparsity prior specification and variance-based thresholding,
A general construction for the scale mixture of normals entails placing a distribution \(p(z)\) on the scale parameter \(z\): \[ z \sim p(z); \quad w \sim N(w; 0, z^2). \]
Specifying the distribution \(p(z)\) to have both relatively large mass at zero and heavy tails biases the posterior distribution for \(w\) to be sparse. For example, when \(p(z)\) is a mixture of a Dirac delta at zero and a Normal distribution, we obtain the spike-and-slab; another well-known case is the LASSO estimator, obtained by placing a Laplace distribution on \(z\): \(p(z^2) = Exp(\lambda)\).
We follow Louizos, Ullrich, and Welling 2017 in using the improper log-uniform prior \(p(z) \propto |z|^{-1}\) (the authors also describe the implementation of their methods using the Horseshoe prior). The primary reason for this choice is the simplicity of computation: marginalizing over the scale parameter \(z\), the weights \(w\) follow the Normal-Jeffreys (or log-uniform) prior:
\[ p(w) \propto \int \frac{1}{|z|} N(w|0, z^2)dz = \frac{1}{|w|}. \]
In comparison, the spike-and-slab delta spike at zero is computationally expensive at scale: 1) variational Bayes techniques are unavailable as the Bernoulli scale mixture defies gradient descent, and 2) the search space contains \(2^M\) models, where \(M\) is the number of parameters, i.e. the number of weights in the neural network.
Group sparsity is accomplished under the authors’ model by requiring weights belonging to the same group (i.e. neuron or convolutional filter) to share the scale parameter. Let \(W\) be the weight matrix of a fully connected neural network layer with dimension \(A \times B\), where \(A\) is the number of neurons and \(B\) the dimension of the weights for each neuron. Then the joint prior is:
\[ p(W, z) \propto \prod_i^A \frac{1}{|z_i|} \prod_{i,j}^{A,B} N(w_{ij} | 0, z_i^2) \]
The authors use the approximate joint posterior
\[\begin{aligned} q_{\phi} (W,z) &= \prod_i^A N(z_i | \mu_{z_i}, \mu^2_{z_i} \alpha_i) \prod_{i,j}^{A,B} N(w_{ij} | z_i \mu_{ij}, z_i^2 \sigma_{ij}^2), \\ \end{aligned}\]
where \(\alpha_i\) is the dropout rate of group or neuron \(i\). After reparameterizing such that \(\sigma^2_{z_i} = \mu^2_{z_i} \alpha_i\) (to avoide high variance gradients), the evidence lower bound is: \[ \mathcal{L}(\phi)=\mathbb{E}_{q_{\phi}(\mathbf{z}) q_{\phi}(\mathbf{W} \mid \mathbf{z})}[\log p(\mathcal{D} \mid \mathbf{W})]-\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[K L\left(q_{\phi}(\mathbf{W} \mid \mathbf{z}) \| p(\mathbf{W} \mid \mathbf{z})\right)\right]-K L\left(q_{\phi}(\mathbf{z}) \| p(\mathbf{z})\right) \]
where the KL-divergence for the conditional prior \(p(W|z)\) to the approximate posterior \(q_\phi (W|z)\) is:
\[\begin{aligned} KL(q_\phi(W|z) || p(W|z)) &= \frac{1}{2} \sum_{i,j}^{A,B} \left( \log \frac{z_i^2}{z_i^2 \sigma^2_{ij}} + \frac{z_i^2 \sigma^2_{ij}}{z_i^2} + \frac{z_i^2 \mu^2_{ij}}{z_i^2} - 1 \right) \\ &= \frac{1}{2} \sum_{i,j}^{A,B} \left( \log \frac{1}{\sigma^2_{ij}} + \sigma^2_{ij} + \mu^2_{ij} - 1 \right). \end{aligned}\]
Last, the authors obtain the KL-divergence from the normal-Jeffreys scale prior \(p(z)\) to the Gaussian variational posterior \(q_\phi (z)\) to be dependent only on the implied dropout rate \(\alpha_i = \sigma^2_{z_i} / \mu^2_{z_i}\):
\[ -K L\left(q_{\phi}(\mathbf{z}) \| p(\mathbf{z})\right) \approx \sum_{i}^{A}\left(k_{1} \sigma\left(k_{2}+k_{3} \log \alpha_{i}\right)-0.5 m\left(-\log \alpha_{i}\right)-k_{1}\right) \]
where \(\sigma(\cdot), m(\cdot)\) are the sigmoid and softplus functions, and \(k_{1}=0.63576, k_{2}=1.87320, k_{3}=1.48695\).
Groups of parameters are then pruned by specifying a threshold for the variational dropout rate \(\alpha_i\) such that \(\log \alpha_{i}=\left(\log \sigma_{z_i}^{2}-\log \mu_{z_i}^{2}\right)>t\)
The variational posterior marginal variance \[ \mathbb{V}\left(w_{i j}\right)_{N J}=\sigma_{z_{i}}^{2}\left(\sigma_{i j}^{2}+\mu_{i j}^{2}\right)+\sigma_{i j}^{2} \mu_{z_{i}}^{2} \] is then used to compute the optimal number of bits that minimizes the sum of the complexity cost of communicating the model and error cost (the data misfit). The theoretical arguments underpinning bit precision optimization are out of the scope of this writeup, but are delineated in Louizos, Ullrich, Welling 2017 Section 2 and Appendix B.
We examined variable selection performance for a basic linear regression problem by generating synthetic data consisting of 10 true covariates, 20 nuisance variables, with N = 10000. The true covariates and nuisance variables were randomly assigned to arise from normal and binomial distributions (code below), and the outcome \(y\) made to be a linear transformation of the true covariates plus Normal noise.
generate_linear_data <- function(
ave_beta = 2,
sd_beta = 10,
sd_eps = 2,
n_obs = 1E4,
n_covars = 10,
n_nuisance = 20
){
beta_vec <- round(rnorm(n_covars + 1, mean = ave_beta, sd = sd_beta), 2)
dat <- data.frame(matrix(0, nrow = n_obs, ncol = n_covars + n_nuisance))
distn_vec <- rep(NA, n_covars + n_nuisance)
for (j in 1:(n_covars+n_nuisance)) {
distn <- sample(c("normal","binomial"), 1)
if (distn == "normal"){
mu_j <- round(runif(1, 0, 10))
sig_j <- round(runif(1, 1, 5))
dat[, j] <- rnorm(n_obs, mu_j, sig_j)
distn_vec[j] <- paste0(distn, "(", mu_j, ", ", sig_j, ")")
} else if (distn == "binomial"){
p_j <- runif(1, 0.1, 1)
n_j <- floor(runif(1, 1, 10))
dat[,j] <- rbinom(n_obs, size = n_j, prob = p_j)
distn_vec[j] <- paste0(distn, "(", n_j, ", ", p_j, ")")
}
}
# construct y
# dat <- round(dat, 4)
covar_inds <- sort(sample(1:(n_covars + n_nuisance), n_covars))
covars <- as.matrix(dat[, covar_inds])
covar_distns <- distn_vec[covar_inds]
eps <- rnorm(n_obs, mean = 0, sd = sd_eps)
dat$y <- as.vector(cbind(1, covars) %*% beta_vec + eps)
true_covars <- colnames(dat)[covar_inds]
nuisance_vars <- colnames(dat)[setdiff(1:(n_nuisance + n_covars), covar_inds)]
names(beta_vec) <- c("(Intercept)", true_covars)
res <- list(
"dat" = dat,
"eps" = eps,
"sd_eps" = sd_eps,
"distn_vec" = distn_vec,
"n_obs" = n_obs,
"n_covars" = n_covars,
"n_nuisance" = n_nuisance,
"beta_vec" = beta_vec,
"covar_inds" = covar_inds,
"true_covars" = true_covars,
"nuisance_vars" = nuisance_vars
)
return(res)
}set.seed(0)
lindat_obj <- generate_linear_data()
dat <- lindat_obj$dat
true_covars <- lindat_obj$true_covars
nuisance_vars <- lindat_obj$nuisance_vars
covar_inds <- lindat_obj$covar_inds
beta_vec <- lindat_obj$beta_vec
true_covars_inds <- as.integer(as.integer(unlist(lapply(strsplit(true_covars, "X"), function(x) x[2]))) - 1)
nuisance_covars_inds <- as.integer(as.integer(unlist(lapply(strsplit(nuisance_vars, "X"), function(x) x[2]))) - 1)
samp <- dat %>%
slice_sample(n = 500) %>%
gather(key = "covar", value = "value", -y)
#
# p_all <- dat %>%
# slice_sample(n = 500) %>%
# gather(key = "covar", value = "value", -y) %>%
# mutate(true_cov = if_else(
# covar %in% true_covars,
# "covar_true",
# "nuisance"
# )
# ) %>%
# ggplot() +
# geom_density(
# aes(
# x = value,
# color = covar,
# fill = covar
# ),
# alpha = .2
# ) +
# theme(legend.position = "none") +
# facet_wrap(. ~ true_cov, ncol = 1) +
# labs(
# title = "density plot of all variables in dataset"
# )
p_true <- dat[, true_covars_inds] %>%
slice_sample(n = 500) %>%
gather(key = "covar", value = "value") %>%
ggplot() +
geom_density(
aes(
x = value,
color = covar,
fill = covar
),
alpha = .1
) +
theme(legend.position = "none") +
labs(
title = "true covariates only"
)
p_all <- dat %>%
slice_sample(n = 500) %>%
gather(key = "covar", value = "value", -y) %>%
mutate(true_cov = if_else(
covar %in% lindat_obj$true_covars,
"covar_true",
"nuisance"
)
) %>%
ggplot() +
geom_density(
aes(
x = value,
color = covar,
fill = covar,
alpha = I(ifelse(true_cov=="covar_true", .25, .1))
)
) +
theme(legend.position = "none") +
labs(
title = "density plot of all covariates",
subtitle = "(mix of normal and binomial distributions)"
)
grid.arrange(
p_all + xlim(-10, 20) + ylim(0, 2),
p_true + xlim(-10, 20) + ylim(0, 2),
ncol = 1
# layout_matrix = rbind(
# c(1, 1, 2),
# c(1, 1, 3)
# )
)# dat %>%
# slice_sample(n = 500) %>%
# gather(key = "covar", value = "value", -y) %>%
# mutate(true_cov = if_else(
# covar %in% lindat_obj$true_covars,
# "covar_true",
# "nuisance"
# )
# ) %>%
# ggplot() +
# geom_point(
# aes(
# y = y,
# x = value,
# color = covar,
# alpha = I(ifelse(true_cov=="covar_true", .5, .25)),
# size = I(ifelse(true_cov=="covar_true", 1.25, .9))
# )
# ) +
# geom_smooth(
# aes(
# y = y,
# x = value
# ),
# method = 'lm'
# ) +
# theme(legend.position = "none") +
# ylim(c(0, 150)) +
# xlim(-5, 15) +
# labs(
# title = "scatterplot of data with linear regression output",
# )
# lmfit <- lm("y ~ .", data = dat)
# lmfit_coefs <- coef(lmfit)
# ols_varsel_err(lmfit, true_covars, nuisance_vars)
# dat <- data.frame(apply(dat, 2, scale))I trained a neural network consisting of 3 fully connected layers over 50 training epochs, with dimension:
from __future__ import print_function, division
import os
os.chdir(r.fpath)
import sys
sys.path.append(r.KUpath)
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('tkagg')
import torch as torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
# source Bayesian Compression .py files
import BayesianLayers
from compression import compute_compression_rate, compute_reduced_weights
from utils import visualize_pixel_importance, generate_gif, visualise_weights
import compression
import utils
# global arguments
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=50)## _StoreAction(option_strings=['--epochs'], dest='epochs', nargs=None, const=None, default=50, type=<class 'int'>, choices=None, help=None, metavar=None)
parser.add_argument('--batchsize', type=int, default=100)## _StoreAction(option_strings=['--batchsize'], dest='batchsize', nargs=None, const=None, default=100, type=<class 'int'>, choices=None, help=None, metavar=None)
parser.add_argument('--thresholds', type=float, nargs='*', default=[-2.8, -3., -5.])## _StoreAction(option_strings=['--thresholds'], dest='thresholds', nargs='*', const=None, default=[-2.8, -3.0, -5.0], type=<class 'float'>, choices=None, help=None, metavar=None)
FLAGS = parser.parse_args()
FLAGS.cuda = torch.cuda.is_available() # check if we can put the net on the GPU
# load data in from R
X = np.asarray(r.dat)[:, :r.dat.shape[1]-1]
y = np.asarray(r.dat)[:, r.dat.shape[1]-1]
X, y = torch.tensor(X).float(), torch.tensor(y).float().unsqueeze(-1)
N = len(y)
N_train = N * 4 // 5
# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)
# Uses first 80% random indices for train
train_idx = idx[:N_train]
test_idx = idx[N_train:]
# Generates train and test
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# # checking that R and python return the same OLS fit
# from sklearn.linear_model import LinearRegression
# linr = LinearRegression()
# linr.fit(X, y)
# print(linr.intercept_, linr.coef_)
# r.lmfit_coefs#
# coef_check <- cbind(
# "linr.fit (python)" = c(py$linr$intercept_, py$linr$coef_),
# "lm (R)" = lmfit_coefs
# )
# mykable(round(coef_check, 3), cap = "check that data from R and imported to python return same OLS fit")# merge into tuples
def merge(list1, list2):
merged_list = [(list1[i], list2[i]) for i in range(0, len(list1))]
return merged_list
train_dat = merge(X_train, y_train)
train_loader = DataLoader(
train_dat,
batch_size=FLAGS.batchsize,
shuffle=True
)
test_dat = merge(X_test, y_test)
test_loader = DataLoader(
test_dat,
batch_size=FLAGS.batchsize,
shuffle=True
)
mask = 255. * (np.ones((1, 30)))
examples = X_train[0:5,]
images = np.vstack([mask, examples])
# check dataloader
# for batch_idx, (data, y) in enumerate(test_loader):
# # print(data)
# # print(y)
# print(data.size())
# print(y.size())
print("")# build a simple MLP
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
# activation
self.relu = nn.ReLU()
# layers
self.fc1 = BayesianLayers.LinearGroupNJ(1 * 30, 20, clip_var=0.04, cuda=FLAGS.cuda) # 28 x 28 <--- 1 * 31
self.fc2 = BayesianLayers.LinearGroupNJ(20, 30, cuda=FLAGS.cuda)
self.fc3 = BayesianLayers.LinearGroupNJ(30, 1, cuda=FLAGS.cuda)
# layers including kl_divergence
self.kl_list = [self.fc1, self.fc2, self.fc3]
def forward(self, x):
x = x.view(-1, 1*30)
x = self.relu(self.fc1(x))
x = self.relu(self.fc2(x))
return self.fc3(x)
def get_masks(self,thresholds):
weight_masks = []
mask = None
for i, (layer, threshold) in enumerate(zip(self.kl_list, thresholds)):
# compute dropout mask
if mask is None:
log_alpha = layer.get_log_dropout_rates().cpu().data.numpy()
mask = log_alpha < threshold
else:
mask = np.copy(next_mask)
try:
log_alpha = layers[i + 1].get_log_dropout_rates().cpu().data.numpy()
next_mask = log_alpha < thresholds[i + 1]
except:
# must be the last mask
next_mask = np.ones(10)
weight_mask = np.expand_dims(mask, axis=0) * np.expand_dims(next_mask, axis=1)
weight_masks.append(weight_mask.astype(np.float))
return weight_masks
def kl_divergence(self):
KLD = 0
for layer in self.kl_list:
KLD += layer.kl_divergence()
return KLD
# init model
model = Net()
if FLAGS.cuda:
model.cuda()
# init optimizer
optimizer = optim.Adam(model.parameters())
# we optimize the variational lower bound scaled by the number of data
# points (so we can keep our intuitions about hyper-params such as the learning rate)
# discrimination_loss = nn.functional.cross_entropy
discrimination_loss = nn.MSELoss(reduction='none')
def objective(output, target, kl_divergence):
discrimination_error = discrimination_loss(output, target)
# print("kl.item value, type: {}, {}".format(kl_divergence.item(), type(kl_divergence.item())))
# print("kl_divergence data type: {}".format(type(kl_divergence)))
# print("loss data value, type: {}".format(discrimination_error, type(discrimination_error)))
variational_bound = (discrimination_error + kl_divergence) / N_train
# print("variational_bound calculated")
# print("variational_bound value, type: {}, {}".format(torch.mean(variational_bound), type(torch.mean(variational_bound))))
if FLAGS.cuda:
variational_bound = variational_bound.cuda()
return variational_bound
def train(epoch):
model.train()
for batch_idx, (data, target) in enumerate(train_loader):
if FLAGS.cuda:
data, target = data.cuda(), target.cuda()
data, target = Variable(data), Variable(target)
optimizer.zero_grad()
output = model(data)
# print("output = model(data)")
loss = objective(output, target, model.kl_divergence())
# print("loss = ...")
# loss.backward()
loss.sum().backward()
# print("loss.sum().backward()")
optimizer.step()
# clip the variances after each step
for layer in model.kl_list:
layer.clip_variances()
print('Epoch: {} \tTrain loss: {:.6f} \t'.format(
epoch, torch.mean(loss.data)))
def test():
model.eval()
test_loss = 0
correct = 0
for data, target in test_loader:
if FLAGS.cuda:
data, target = data.cuda(), target.cuda()
data, target = Variable(data, volatile=True), Variable(target)
output = model(data)
# test_loss += discrimination_loss(output, target, size_average=False).data
test_loss += discrimination_loss(output, target).data
pred = output.data.max(1, keepdim=True)[1]
correct += pred.eq(target.data.view_as(pred)).cpu().sum()
test_loss /= len(test_loader.dataset)
print('Test loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
torch.mean(test_loss), correct, len(test_loader.dataset),
100. * correct / len(test_loader.dataset)))
#### run / load model ----
if r.params["retrain"]:
# train the model and save some visualisations on the way
print("--start training--")
for epoch in range(1, FLAGS.epochs + 1):
print("--epoch" + str(epoch) + "--")
train(epoch)
test()
# visualizations
weight_mus = [model.fc1.weight_mu, model.fc2.weight_mu]
log_alphas = [model.fc1.get_log_dropout_rates(), model.fc2.get_log_dropout_rates(),
model.fc3.get_log_dropout_rates()]
visualise_weights(weight_mus, log_alphas, epoch=epoch)
log_alpha = model.fc1.get_log_dropout_rates().cpu().data.numpy()
# visualize_pixel_importance(images, log_alpha=log_alpha, epoch=str(epoch))
# compute compression rate and new model accuracy
layers = [model.fc1, model.fc2, model.fc3]
thresholds = FLAGS.thresholds
compute_compression_rate(layers, model.get_masks(thresholds))
weights = compute_reduced_weights(layers, model.get_masks(thresholds))
for layer, weight in zip(layers, weights):
if FLAGS.cuda:
layer.post_weight_mu.data = torch.Tensor(weight).cuda()
else:
layer.post_weight_mu.data = torch.Tensor(weight)
# for layer in layers: layer.deterministic = True
torch.save(model.state_dict(), 'model_weights.pth')
# generate_gif(save='pixel', epochs=FLAGS.epochs)
generate_gif(save='weight0_e', epochs=FLAGS.epochs)
generate_gif(save='weight1_e', epochs=FLAGS.epochs)
else:
model.load_state_dict(torch.load('model_weights.pth'))
## <All keys matched successfully>
layers = [model.fc1, model.fc2, model.fc3]
thresholds = FLAGS.thresholds
weight_mus = [model.fc1.weight_mu, model.fc2.weight_mu]
weight_mus, weight_vars = compression.extract_pruned_params(layers, model.get_masks(thresholds))## <string>:37: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
## Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def extract_pruned_layer(layer, mask):
# extract non-zero columns
layer = layer[:, mask.sum(axis = 0) != 0]
# extract non-zero rows
layer = layer[mask.sum(axis = 1) != 0, :]
return layer
def extract_pruned_layers(layers, masks):
res = []
for layer, mask in zip(layers, masks):
l = extract_pruned_layer(layer, mask)
res.append(l)
return(res)
pruned_weights = extract_pruned_layers(weight_mus, model.get_masks(thresholds))
# pruned_weights[0].shape
mask0 = model.get_masks(thresholds)[0]Unfortunately, my python coding ability is not yet good enough to have run many simiulations in an automated fashion, but here is a representative outcome from one sample dataset. Comparisons against OLS (variable inclusion determined using a naive threshold of p-value > 0.05 as well as using Benjamini-Hochberg) were made; sample output is below.
TP) denote variables that were correctly included;TN) denote variables that were correctly excludedgamma_hat <- colSums(abs(py$weight_mus[[1]])) != 0
gamma_true <- rep(FALSE, length(gamma_hat))
gamma_true[as.numeric(gsub("X", "", true_covars))] <- TRUE
binary_err <- function(pred, true){
TP <- sum(true & pred)
FP <- sum(!true & pred)
TN <- sum(!true & !pred)
FN <- sum(true & !pred)
c("TP" = TP, "FP" = FP, "TN" = TN, "FN" = FN)
}
nn_varsel <- binary_err(pred = gamma_hat, true = gamma_true)
lmfit <- lm("y ~ .", data = dat)
lmfit_coefs <- coef(lmfit)
ols_varsel <- ols_varsel_err(lmfit, true_covars, nuisance_vars)
BH_varsel <- ols_varsel_err(lmfit, true_covars, nuisance_vars, adjust.method = "BH")
varsel_results <- rbind(
"OLS" = ols_varsel,
"BH" = BH_varsel,
"Compression" = nn_varsel
)
mykable(varsel_results, cap = "variable selection results")| TP | FP | TN | FN | |
|---|---|---|---|---|
| OLS | 10 | 3 | 17 | 0 |
| BH | 10 | 1 | 19 | 0 |
| Compression | 10 | 0 | 20 | 0 |
The evolution of the first (input) layer’s \(30 \times 20\) weight matrix (30 covariates, 20 neurons) during training can be seen below:
input layer evolution over training epochs
Unfortunately, this displays the weights from each neuron in a column (as opposed to rows, which is how I have been referring to them) — the y-axis increments correspond to each of the 30 covariates; the x-axis increments correspond to the 20 neurons of the input layer. We can see that training ends up pruning columns (neurons), but there is also a steady push towards 0 across the rows, implying that certain covariates have only a minor effect.
Below (generated in ggplot) is the final version of the input layer’s weight matrix after thresholding based on the estimated variance and bit precision roundoffs (unfortunately, the plot scaling is off here, but just as in the animated .gif above, there are 30 rows and 20 columns corresponding to the 30 covariates and 20 neurons):
vismat(py$weight_mus[[1]]) +
coord_flip() +
labs(
y = "neurons in first layer",
x = "covariates"
) +
labs(
title = "first weight layer",
subtitle = "grey = pruned weights"
)Unfortunately, because I do not know how to code in Python without using R’s reticulate package, and I do not know how to loop through chunks of code containing both Python and R code, I can only re-run the simulation manually right now (so I am unable to display results from numerous simulations). However, the results tended to be similar under the same simulation conditions, with the Bayesion Compression algorithm typically performing slightly better than OLS (naive p-value thresholding or with the Benjamini-Hochberg adjustment).
One potential problem, however, is that re-training the exact same model on the exact same dataset occasionally produced different results; my best guess here is that the Bayesian Compression algorithm may be sensitive to starting values, or I may need to train for more epochs.
In the absence of multiple simulations, this is meant mostly as a proof-of-concept, i.e. that we can use the methods outlined by Louizos, Ullrich, and Welling with slight modification to perform principled variable selection. I am encouraged by these initial results, and hope to explore it further. A few potential issues for further work here:
Last, though only linear data was attempted here, I am also excited about the potential applications for functional data analysis and variable selection, as we could then truly leverage the main strengths of neural networks — their expressivity. That is, the end goal here would be to develop a method yielding principled variable selection with well-studied operating characteristics (from the Bayesian sparsity prior) which would also be robust to misspecification of functional forms and nonlinearities.