In their 2017 paper, “Bayesian Compression for Deep Learning”, Christos Louizos, Karen Ullrich, and Max Welling outline a method for compressing neural networks via both weight pruning and reduced bit precision. Others (Scardapane et al. 2016; Wen et al. 2016) have applied sparsity priors such as the Bayesian LASSO (an exponential prior on the variance parameter) to reduce the memory required to store neural networks, but Louizos, Ullrich, and Welling’s approach improves on previous work by 1) placing scale-mixture sparsity priors on entire rows of weight layers rather than on individual weights, and 2) using the estimated variance of the parameters to calculate the bit precision, or precision quantization, required to store the weights without losing too much information (i.e. without causing data misfit).
Because neural networks are typically stored using a matrix representation of the weight layers, pruning individual weights independently does relatively little to reduce the memory required to store the network: pruning weights individually leads to a sprinkling of zeros in the matrix representation, but reduction in the dimensions of the matrix can only occur if an entire row or column is filled with zeros. By instead using a hierarchical structure in which entire rows share the sparsity-inducing parameter, Louizos, Ullrich, and Welling are able to “zero out” entire rows — entire neurons — and thus reduce the dimensions of the matrix required to store the weight layer. The dimension reduction also compounds across neural network layers: for fully-connected layers, since row \(i\) of weight layer \(j\) is used to compute the \(i\)’th input in weight layer \(j+1\), if row \(i\) (neuron \(i\)) of layer \(j\) is omitted, then column \(i\) of layer \(j+1\) can also be omitted.
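As a minimal numpy sketch of this compounding effect (purely illustrative; the layer shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical fully connected layers: W1 maps 5 inputs to 4 neurons, W2 maps those 4 to 3
W1 = rng.normal(size=(4, 5))   # rows of W1 = neurons of layer 1
W2 = rng.normal(size=(3, 4))   # columns of W2 correspond to the outputs of layer 1

# suppose the group sparsity prior zeroes out neurons 2 and 4 of layer 1
keep = np.array([True, False, True, False])

# dropping those rows of W1 also lets us drop the matching columns of W2,
# since the corresponding inputs to layer 2 are now identically zero
W1_pruned = W1[keep, :]   # shape (2, 5)
W2_pruned = W2[:, keep]   # shape (3, 2)
print(W1_pruned.shape, W2_pruned.shape)
```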
In addition, the authors offer an information-theoretic argument for using the posterior variances of the weights to determine the optimal number of bits required to encode them, leading to further compressibility. The main idea is that the larger a weight’s variance, the lower the precision required to store it: if a weight has variance 1, then storing it to the 8th decimal place rather than the 4th preserves essentially no additional information about its distribution. A Normal(4.0001, 1) distribution is less distinguishable from a Normal(4.00010001, 1) distribution than a Normal(4.0001, 0.01) distribution is from a Normal(4.00010001, 0.01) distribution.
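To make this concrete with a small worked example (mine, not the paper’s): for two Normals that differ only in their means by \(\delta\),
\[ KL\left( N(\mu, \sigma^2) \,\|\, N(\mu + \delta, \sigma^2) \right) = \frac{\delta^2}{2\sigma^2}, \]
so for \(\delta = 10^{-8}\) (the difference between 4.00010001 and 4.0001) the divergence is \(5 \times 10^{-17}\) when \(\sigma^2 = 1\) but one hundred times larger, \(5 \times 10^{-15}\), when \(\sigma^2 = 0.01\); the smaller the variance, the more information each additional decimal place carries.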
The motivation for this project is the insight that, for the first weight layer, pruning the weights of entire columns would correspond to removing features or covariates. The method described above does not focus on variable selection in this way: while the hierarchical sparsity prior on the rows of each weight layer leads to downstream elimination of columns, the input layer is not downstream from any other layer, so its columns are never removed. Thus the method does not perform variable selection in the usual statistical sense, though it can still be used to examine feature importance on an observation-by-observation basis.
Our idea is that, by repurposing the bit quantization estimate to provide a threshold below which weights are zeroed out, we can in fact eliminate columns from the input layer in a reasonable fashion. That is, we can obtain principled variable selection with known operating characteristics via the Bayesian sparsity prior specification and variance-based thresholding. One possible concrete version of this rule is sketched below.
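A minimal sketch of one such rule (purely illustrative; `post_mean` and `post_var` stand in for the first layer’s variational posterior means and marginal variances, and the constant `c` is a tuning choice of mine, not something from the paper):

```python
import numpy as np

def select_columns(post_mean, post_var, c=1.0):
    """Keep input column (covariate) j only if some weight in that column is larger
    in magnitude than c posterior standard deviations, i.e. it would not simply
    round to zero at the precision implied by its posterior variance."""
    keep = (np.abs(post_mean) > c * np.sqrt(post_var)).any(axis=0)
    return keep  # boolean vector over covariates

# toy example: 4 neurons x 3 covariates
post_mean = np.array([[0.9,  0.01, 0.0],
                      [1.2, -0.02, 0.0],
                      [0.7,  0.03, 0.0],
                      [1.1,  0.00, 0.0]])
post_var = np.full_like(post_mean, 0.04)
print(select_columns(post_mean, post_var))  # [ True False False]
```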
A general construction for the scale mixture of normals entails placing a distribution \(p(z)\) on the scale parameter \(z\): \[ z \sim p(z); \quad w \sim N(w; 0, z^2). \]
Specifying the distribution \(p(z)\) to have both relatively large mass near zero and heavy tails biases the posterior distribution of \(w\) toward sparsity. For example, when \(p(z)\) is a mixture of a Dirac delta at zero and a Normal distribution, we obtain the spike-and-slab; another well-known case is the Bayesian LASSO, obtained by placing an exponential prior on the variance, \(p(z^2) = \text{Exp}(\lambda)\), which yields a Laplace marginal prior on \(w\).
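A quick Monte Carlo sketch (mine, not from the paper) illustrates the effect: drawing the variance from an exponential distribution and the weight conditionally from a Normal yields a marginal with far more mass near zero and heavier tails than a Normal of the same variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Bayesian LASSO as a scale mixture: z^2 ~ Exponential(1), w | z ~ N(0, z^2)
z2 = rng.exponential(scale=1.0, size=n)
w = rng.normal(loc=0.0, scale=np.sqrt(z2))

# reference: a Normal with the same marginal variance
w_normal = rng.normal(loc=0.0, scale=np.sqrt(w.var()), size=n)

print(np.mean(np.abs(w) < 0.1), np.mean(np.abs(w_normal) < 0.1))  # more mass near zero
print(np.mean(np.abs(w) > 4), np.mean(np.abs(w_normal) > 4))      # heavier tails
```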
We follow Louizos, Ullrich, and Welling 2017 in using the improper log-uniform prior \(p(z) \propto |z|^{-1}\) (the authors also describe the implementation of their methods using the Horseshoe prior). The primary reason for this choice is the simplicity of computation: marginalizing over the scale parameter \(z\), the weights \(w\) follow the Normal-Jeffreys (or log-uniform) prior:
\[ p(w) \propto \int \frac{1}{|z|} N(w|0, z^2)dz = \frac{1}{|w|}. \]
In comparison, the spike-and-slab’s point mass at zero is computationally expensive at scale: 1) standard variational Bayes techniques are unavailable because the Bernoulli scale mixture is not amenable to gradient-based optimization, and 2) the model search space contains \(2^M\) models, where \(M\) is the number of parameters, i.e. the number of weights in the neural network (even a small network with \(M = 10^4\) weights yields \(2^{10^4} \approx 10^{3010}\) candidate models).
Group sparsity is accomplished under the authors’ model by requiring weights belonging to the same group (i.e. neuron or convolutional filter) to share the scale parameter. Let \(W\) be the weight matrix of a fully connected neural network layer with dimension \(A \times B\), where \(A\) is the number of neurons and \(B\) the dimension of the weights for each neuron. Then the joint prior is:
\[ p(W, z) \propto \prod_i^A \frac{1}{|z_i|} \prod_{i,j}^{A,B} N(w_{ij} | 0, z_i^2) \]
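As a direct (unnormalized) transcription of this joint prior, assuming `W` is an \(A \times B\) array and `z` a length-\(A\) vector:

```python
import numpy as np

def log_joint_prior(W, z):
    """Unnormalized log p(W, z): sum_i -log|z_i| + sum_{i,j} log N(w_ij | 0, z_i^2)."""
    z = np.asarray(z)
    log_p_z = -np.sum(np.log(np.abs(z)))
    var = (z ** 2)[:, None]                      # each row i shares the scale z_i
    log_p_w_given_z = np.sum(-0.5 * np.log(2 * np.pi * var) - W ** 2 / (2 * var))
    return log_p_z + log_p_w_given_z
```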
The authors use the approximate joint posterior
\[\begin{aligned} q_{\phi} (W,z) &= \prod_i^A N(z_i | \mu_{z_i}, \mu^2_{z_i} \alpha_i) \prod_{i,j}^{A,B} N(w_{ij} | z_i \mu_{ij}, z_i^2 \sigma_{ij}^2), \\ \end{aligned}\]
where \(\alpha_i\) is the dropout rate of group (neuron) \(i\). After reparameterizing such that \(\sigma^2_{z_i} = \mu^2_{z_i} \alpha_i\) (to avoid high-variance gradients), the evidence lower bound is: \[ \mathcal{L}(\phi)=\mathbb{E}_{q_{\phi}(\mathbf{z}) q_{\phi}(\mathbf{W} \mid \mathbf{z})}[\log p(\mathcal{D} \mid \mathbf{W})]-\mathbb{E}_{q_{\phi}(\mathbf{z})}\left[K L\left(q_{\phi}(\mathbf{W} \mid \mathbf{z}) \| p(\mathbf{W} \mid \mathbf{z})\right)\right]-K L\left(q_{\phi}(\mathbf{z}) \| p(\mathbf{z})\right) \]
where the KL-divergence from the conditional prior \(p(W|z)\) to the approximate posterior \(q_\phi (W|z)\) is:
\[\begin{aligned} KL(q_\phi(W|z) || p(W|z)) &= \frac{1}{2} \sum_{i,j}^{A,B} \left( \log \frac{z_i^2}{z_i^2 \sigma^2_{ij}} + \frac{z_i^2 \sigma^2_{ij}}{z_i^2} + \frac{z_i^2 \mu^2_{ij}}{z_i^2} - 1 \right) \\ &= \frac{1}{2} \sum_{i,j}^{A,B} \left( \log \frac{1}{\sigma^2_{ij}} + \sigma^2_{ij} + \mu^2_{ij} - 1 \right). \end{aligned}\]
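Both pieces are easy to compute directly. The sketch below (my own transcription of the formulas above, not the authors’ code) draws one reparameterized sample from \(q_\phi(z) q_\phi(W|z)\) and evaluates the closed-form KL term:

```python
import numpy as np

def sample_weights(mu_z, alpha, mu_w, sigma_w, rng):
    """One reparameterized draw from q(z) q(W | z).
    mu_z, alpha have length A; mu_w, sigma_w have shape (A, B)."""
    sigma_z = np.sqrt(alpha) * np.abs(mu_z)          # sigma_z_i^2 = alpha_i * mu_z_i^2
    z = mu_z + sigma_z * rng.standard_normal(mu_z.shape)
    eps = rng.standard_normal(mu_w.shape)
    W = z[:, None] * (mu_w + sigma_w * eps)          # w_ij ~ N(z_i mu_ij, z_i^2 sigma_ij^2)
    return z, W

def kl_conditional(mu_w, sigma2_w):
    """KL( q(W|z) || p(W|z) ) = 0.5 * sum_ij ( -log sigma_ij^2 + sigma_ij^2 + mu_ij^2 - 1 )."""
    return 0.5 * np.sum(-np.log(sigma2_w) + sigma2_w + mu_w ** 2 - 1.0)

rng = np.random.default_rng(0)
z, W = sample_weights(np.array([1.0, 0.5]), np.array([0.1, 2.0]),
                      np.zeros((2, 3)), np.ones((2, 3)), rng)
print(W.shape)                                            # (2, 3)
print(kl_conditional(np.zeros((2, 3)), np.ones((2, 3))))  # 0.0 at mu = 0, sigma^2 = 1
```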
Last, the authors show that the KL-divergence from the normal-Jeffreys scale prior \(p(z)\) to the Gaussian variational posterior \(q_\phi (z)\) depends only on the implied dropout rate \(\alpha_i = \sigma^2_{z_i} / \mu^2_{z_i}\):
\[ -K L\left(q_{\phi}(\mathbf{z}) \| p(\mathbf{z})\right) \approx \sum_{i}^{A}\left(k_{1} \sigma\left(k_{2}+k_{3} \log \alpha_{i}\right)-0.5 m\left(-\log \alpha_{i}\right)-k_{1}\right) \]
where \(\sigma(\cdot), m(\cdot)\) are the sigmoid and softplus functions, and \(k_{1}=0.63576, k_{2}=1.87320, k_{3}=1.48695\).
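A direct numpy transcription of this approximation (a sketch using the constants quoted above):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

def neg_kl_z(log_alpha, k1=0.63576, k2=1.87320, k3=1.48695):
    """Approximate -KL( q(z) || p(z) ) for the normal-Jeffreys prior,
    as a function of the per-group log dropout rates log alpha_i."""
    return np.sum(k1 * sigmoid(k2 + k3 * log_alpha) - 0.5 * softplus(-log_alpha) - k1)

# groups with a high dropout rate (large log alpha) contribute almost no KL penalty
print(neg_kl_z(np.array([-5.0, 0.0, 5.0])))
```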
Groups of parameters are then pruned by specifying a threshold \(t\) for the variational dropout rate: group \(i\) is pruned whenever \(\log \alpha_{i}=\log \sigma_{z_i}^{2}-\log \mu_{z_i}^{2} > t\).
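In code this is just a comparison against the threshold (a sketch; in the experiment below the thresholds are supplied via the `--thresholds` flag):

```python
import numpy as np

def keep_mask(log_alpha, threshold):
    """Keep group i only if its log dropout rate stays below the pruning threshold."""
    return log_alpha < threshold

log_alpha = np.array([-6.2, -0.4, -3.1, 1.7])
print(keep_mask(log_alpha, threshold=-3.0))  # [ True False  True False]
```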
The variational posterior marginal variance \[ \mathbb{V}\left(w_{i j}\right)_{N J}=\sigma_{z_{i}}^{2}\left(\sigma_{i j}^{2}+\mu_{i j}^{2}\right)+\sigma_{i j}^{2} \mu_{z_{i}}^{2} \] is then used to compute the optimal number of bits that minimizes the sum of the complexity cost of communicating the model and error cost (the data misfit). The theoretical arguments underpinning bit precision optimization are out of the scope of this writeup, but are delineated in Louizos, Ullrich, Welling 2017 Section 2 and Appendix B.
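The marginal variance itself is straightforward to evaluate from the variational parameters (a sketch; the mapping from this variance to an optimal bit width follows the paper’s Appendix B and is not reproduced here):

```python
import numpy as np

def marginal_variance(mu_z, sigma2_z, mu_w, sigma2_w):
    """Posterior marginal variance V(w_ij) = sigma_z_i^2 (sigma_ij^2 + mu_ij^2) + sigma_ij^2 mu_z_i^2,
    with mu_z, sigma2_z of length A and mu_w, sigma2_w of shape (A, B)."""
    return sigma2_z[:, None] * (sigma2_w + mu_w ** 2) + sigma2_w * (mu_z ** 2)[:, None]
```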
We examined variable selection performance on a basic linear regression problem by generating synthetic data consisting of 10 true covariates and 20 nuisance variables, with \(N = 10{,}000\) observations. The true covariates and nuisance variables were randomly assigned to arise from normal or binomial distributions (code below), and the outcome \(y\) was constructed as a linear combination of the true covariates plus Normal noise.
generate_linear_data <- function(
  ave_beta = 2,
  sd_beta = 10,
  sd_eps = 2,
  n_obs = 1E4,
  n_covars = 10,
  n_nuisance = 20
){
  beta_vec <- round(rnorm(n_covars + 1, mean = ave_beta, sd = sd_beta), 2)
  dat <- data.frame(matrix(0, nrow = n_obs, ncol = n_covars + n_nuisance))
  distn_vec <- rep(NA, n_covars + n_nuisance)

  for (j in 1:(n_covars + n_nuisance)) {
    distn <- sample(c("normal", "binomial"), 1)
    if (distn == "normal"){
      mu_j <- round(runif(1, 0, 10))
      sig_j <- round(runif(1, 1, 5))
      dat[, j] <- rnorm(n_obs, mu_j, sig_j)
      distn_vec[j] <- paste0(distn, "(", mu_j, ", ", sig_j, ")")
    } else if (distn == "binomial"){
      p_j <- runif(1, 0.1, 1)
      n_j <- floor(runif(1, 1, 10))
      dat[, j] <- rbinom(n_obs, size = n_j, prob = p_j)
      distn_vec[j] <- paste0(distn, "(", n_j, ", ", p_j, ")")
    }
  }

  # construct y
  # dat <- round(dat, 4)
  covar_inds <- sort(sample(1:(n_covars + n_nuisance), n_covars))
  covars <- as.matrix(dat[, covar_inds])
  covar_distns <- distn_vec[covar_inds]
  eps <- rnorm(n_obs, mean = 0, sd = sd_eps)
  dat$y <- as.vector(cbind(1, covars) %*% beta_vec + eps)
  true_covars <- colnames(dat)[covar_inds]
  nuisance_vars <- colnames(dat)[setdiff(1:(n_nuisance + n_covars), covar_inds)]
  names(beta_vec) <- c("(Intercept)", true_covars)
  res <- list(
    "dat" = dat,
    "eps" = eps,
    "sd_eps" = sd_eps,
    "distn_vec" = distn_vec,
    "n_obs" = n_obs,
    "n_covars" = n_covars,
    "n_nuisance" = n_nuisance,
    "beta_vec" = beta_vec,
    "covar_inds" = covar_inds,
    "true_covars" = true_covars,
    "nuisance_vars" = nuisance_vars
  )
  return(res)
}
set.seed(0)
lindat_obj <- generate_linear_data()
dat <- lindat_obj$dat
true_covars <- lindat_obj$true_covars
nuisance_vars <- lindat_obj$nuisance_vars
covar_inds <- lindat_obj$covar_inds
beta_vec <- lindat_obj$beta_vec

# 0-based column indices, for use on the Python side
true_covars_inds <- as.integer(as.integer(unlist(lapply(strsplit(true_covars, "X"), function(x) x[2]))) - 1)
nuisance_covars_inds <- as.integer(as.integer(unlist(lapply(strsplit(nuisance_vars, "X"), function(x) x[2]))) - 1)
samp <- dat %>%
  slice_sample(n = 500) %>%
  gather(key = "covar", value = "value", -y)
p_true <- dat[, true_covars] %>%   # subset by name (the *_inds vectors are 0-based, for python)
  slice_sample(n = 500) %>%
  gather(key = "covar", value = "value") %>%
  ggplot() +
  geom_density(
    aes(
      x = value,
      color = covar,
      fill = covar
    ),
    alpha = .1
  ) +
  theme(legend.position = "none") +
  labs(
    title = "true covariates only"
  )

p_all <- dat %>%
  slice_sample(n = 500) %>%
  gather(key = "covar", value = "value", -y) %>%
  mutate(true_cov = if_else(
    covar %in% lindat_obj$true_covars,
    "covar_true",
    "nuisance"
  )) %>%
  ggplot() +
  geom_density(
    aes(
      x = value,
      color = covar,
      fill = covar,
      alpha = I(ifelse(true_cov == "covar_true", .25, .1))
    )
  ) +
  theme(legend.position = "none") +
  labs(
    title = "density plot of all covariates",
    subtitle = "(mix of normal and binomial distributions)"
  )

grid.arrange(
  p_all + xlim(-10, 20) + ylim(0, 2),
  p_true + xlim(-10, 20) + ylim(0, 2),
  ncol = 1
  # layout_matrix = rbind(
  #   c(1, 1, 2),
  #   c(1, 1, 3)
  # )
)
# dat %>%
# slice_sample(n = 500) %>%
# gather(key = "covar", value = "value", -y) %>%
# mutate(true_cov = if_else(
# covar %in% lindat_obj$true_covars,
# "covar_true",
# "nuisance"
# )
# ) %>%
# ggplot() +
# geom_point(
# aes(
# y = y,
# x = value,
# color = covar,
# alpha = I(ifelse(true_cov=="covar_true", .5, .25)),
# size = I(ifelse(true_cov=="covar_true", 1.25, .9))
# )
# ) +
# geom_smooth(
# aes(
# y = y,
# x = value
# ),
# method = 'lm'
# ) +
# theme(legend.position = "none") +
# ylim(c(0, 150)) +
# xlim(-5, 15) +
# labs(
# title = "scatterplot of data with linear regression output",
# )
# lmfit <- lm("y ~ .", data = dat)
# lmfit_coefs <- coef(lmfit)
# ols_varsel_err(lmfit, true_covars, nuisance_vars)
# dat <- data.frame(apply(dat, 2, scale))
I trained a neural network consisting of 3 fully connected layers (with dimensions \(30 \to 20\), \(20 \to 30\), and \(30 \to 1\), the 30 inputs corresponding to the 10 true covariates plus 20 nuisance variables) over 50 training epochs:
from __future__ import print_function, division
import os
os.chdir(r.fpath)
import sys
sys.path.append(r.KUpath)
import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.switch_backend('tkagg')
import torch as torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import Dataset, DataLoader
# source Bayesian Compression .py files
import BayesianLayers
from compression import compute_compression_rate, compute_reduced_weights
from utils import visualize_pixel_importance, generate_gif, visualise_weights
import compression
import utils
# global arguments
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=50)
parser.add_argument('--batchsize', type=int, default=100)
parser.add_argument('--thresholds', type=float, nargs='*', default=[-2.8, -3., -5.])
FLAGS = parser.parse_args()
FLAGS.cuda = torch.cuda.is_available()  # check if we can put the net on the GPU
# load data in from R
X = np.asarray(r.dat)[:, :r.dat.shape[1]-1]
y = np.asarray(r.dat)[:, r.dat.shape[1]-1]
X, y = torch.tensor(X).float(), torch.tensor(y).float().unsqueeze(-1)

N = len(y)
N_train = N * 4 // 5

# Shuffles the indices
idx = np.arange(N)
np.random.shuffle(idx)

# Uses first 80% random indices for train
train_idx = idx[:N_train]
test_idx = idx[N_train:]

# Generates train and test
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
# # checking that R and python return the same OLS fit
# from sklearn.linear_model import LinearRegression
# linr = LinearRegression()
# linr.fit(X, y)
# print(linr.intercept_, linr.coef_)
# r.lmfit_coefs
#
# coef_check <- cbind(
# "linr.fit (python)" = c(py$linr$intercept_, py$linr$coef_),
# "lm (R)" = lmfit_coefs
# )
# mykable(round(coef_check, 3), cap = "check that data from R and imported to python return same OLS fit")
# merge into tuples
def merge(list1, list2):
    merged_list = [(list1[i], list2[i]) for i in range(0, len(list1))]
    return merged_list

train_dat = merge(X_train, y_train)
train_loader = DataLoader(
    train_dat,
    batch_size=FLAGS.batchsize,
    shuffle=True
)

test_dat = merge(X_test, y_test)
test_loader = DataLoader(
    test_dat,
    batch_size=FLAGS.batchsize,
    shuffle=True
)

mask = 255. * (np.ones((1, 30)))
examples = X_train[0:5, ]
images = np.vstack([mask, examples])
# check dataloader
# for batch_idx, (data, y) in enumerate(test_loader):
# # print(data)
# # print(y)
# print(data.size())
# print(y.size())
print("")
# build a simple MLP
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # activation
        self.relu = nn.ReLU()
        # layers: 30 covariates -> 20 -> 30 -> 1 output
        self.fc1 = BayesianLayers.LinearGroupNJ(1 * 30, 20, clip_var=0.04, cuda=FLAGS.cuda)
        self.fc2 = BayesianLayers.LinearGroupNJ(20, 30, cuda=FLAGS.cuda)
        self.fc3 = BayesianLayers.LinearGroupNJ(30, 1, cuda=FLAGS.cuda)
        # layers including kl_divergence
        self.kl_list = [self.fc1, self.fc2, self.fc3]

    def forward(self, x):
        x = x.view(-1, 1 * 30)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

    def get_masks(self, thresholds):
        weight_masks = []
        mask = None
        for i, (layer, threshold) in enumerate(zip(self.kl_list, thresholds)):
            # compute dropout mask
            if mask is None:
                log_alpha = layer.get_log_dropout_rates().cpu().data.numpy()
                mask = log_alpha < threshold
            else:
                mask = np.copy(next_mask)
            try:
                log_alpha = self.kl_list[i + 1].get_log_dropout_rates().cpu().data.numpy()
                next_mask = log_alpha < thresholds[i + 1]
            except IndexError:
                # must be the last mask; the regression output layer has a single unit
                next_mask = np.ones(1)
            weight_mask = np.expand_dims(mask, axis=0) * np.expand_dims(next_mask, axis=1)
            weight_masks.append(weight_mask.astype(float))
        return weight_masks

    def kl_divergence(self):
        KLD = 0
        for layer in self.kl_list:
            KLD += layer.kl_divergence()
        return KLD
# init model
model = Net()
if FLAGS.cuda:
    model.cuda()

# init optimizer
optimizer = optim.Adam(model.parameters())

# we optimize the variational lower bound scaled by the number of data
# points (so we can keep our intuitions about hyper-params such as the learning rate)
# discrimination_loss = nn.functional.cross_entropy
discrimination_loss = nn.MSELoss(reduction='none')
def objective(output, target, kl_divergence):
    discrimination_error = discrimination_loss(output, target)
    variational_bound = (discrimination_error + kl_divergence) / N_train
    if FLAGS.cuda:
        variational_bound = variational_bound.cuda()
    return variational_bound
def train(epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if FLAGS.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data), Variable(target)
        optimizer.zero_grad()
        output = model(data)
        loss = objective(output, target, model.kl_divergence())
        loss.sum().backward()
        optimizer.step()
        # clip the variances after each step
        for layer in model.kl_list:
            layer.clip_variances()
    print('Epoch: {} \tTrain loss: {:.6f} \t'.format(
        epoch, torch.mean(loss.data)))
def test():
    model.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        if FLAGS.cuda:
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output = model(data)
        # test_loss += discrimination_loss(output, target, size_average=False).data
        test_loss += discrimination_loss(output, target).data
        # "accuracy" below is a holdover from the original MNIST classification example
        # and is not meaningful for this regression problem
        pred = output.data.max(1, keepdim=True)[1]
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()
    test_loss /= len(test_loader.dataset)
    print('Test loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
        torch.mean(test_loss), correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
#### run / load model ----
if r.params["retrain"]:
    # train the model and save some visualisations on the way
    print("--start training--")
    for epoch in range(1, FLAGS.epochs + 1):
        print("--epoch" + str(epoch) + "--")
        train(epoch)
        test()
        # visualizations
        weight_mus = [model.fc1.weight_mu, model.fc2.weight_mu]
        log_alphas = [model.fc1.get_log_dropout_rates(), model.fc2.get_log_dropout_rates(),
                      model.fc3.get_log_dropout_rates()]
        visualise_weights(weight_mus, log_alphas, epoch=epoch)
        log_alpha = model.fc1.get_log_dropout_rates().cpu().data.numpy()
        # visualize_pixel_importance(images, log_alpha=log_alpha, epoch=str(epoch))

    # compute compression rate and new model accuracy
    layers = [model.fc1, model.fc2, model.fc3]
    thresholds = FLAGS.thresholds
    compute_compression_rate(layers, model.get_masks(thresholds))

    weights = compute_reduced_weights(layers, model.get_masks(thresholds))
    for layer, weight in zip(layers, weights):
        if FLAGS.cuda:
            layer.post_weight_mu.data = torch.Tensor(weight).cuda()
        else:
            layer.post_weight_mu.data = torch.Tensor(weight)
    # for layer in layers: layer.deterministic = True

    torch.save(model.state_dict(), 'model_weights.pth')

    # generate_gif(save='pixel', epochs=FLAGS.epochs)
    generate_gif(save='weight0_e', epochs=FLAGS.epochs)
    generate_gif(save='weight1_e', epochs=FLAGS.epochs)
else:
    model.load_state_dict(torch.load('model_weights.pth'))
layers = [model.fc1, model.fc2, model.fc3]
thresholds = FLAGS.thresholds
weight_mus = [model.fc1.weight_mu, model.fc2.weight_mu]
weight_mus, weight_vars = compression.extract_pruned_params(layers, model.get_masks(thresholds))
def extract_pruned_layer(layer, mask):
    # extract non-zero columns
    layer = layer[:, mask.sum(axis=0) != 0]
    # extract non-zero rows
    layer = layer[mask.sum(axis=1) != 0, :]
    return layer

def extract_pruned_layers(layers, masks):
    res = []
    for layer, mask in zip(layers, masks):
        l = extract_pruned_layer(layer, mask)
        res.append(l)
    return res

pruned_weights = extract_pruned_layers(weight_mus, model.get_masks(thresholds))
# pruned_weights[0].shape
mask0 = model.get_masks(thresholds)[0]
Unfortunately, my Python coding ability is not yet good enough to have run many simulations in an automated fashion, but here is a representative outcome from one sample dataset. Comparisons were made against OLS, with variable inclusion determined both by a naive p-value threshold of 0.05 and by the Benjamini-Hochberg adjustment; sample output is below.
True positives (`TP`) denote variables that were correctly included; true negatives (`TN`) denote variables that were correctly excluded.

gamma_hat <- colSums(abs(py$weight_mus[[1]])) != 0
gamma_true <- rep(FALSE, length(gamma_hat))
gamma_true[as.numeric(gsub("X", "", true_covars))] <- TRUE
binary_err <- function(pred, true){
  TP <- sum(true & pred)
  FP <- sum(!true & pred)
  TN <- sum(!true & !pred)
  FN <- sum(true & !pred)
  c("TP" = TP, "FP" = FP, "TN" = TN, "FN" = FN)
}

nn_varsel <- binary_err(pred = gamma_hat, true = gamma_true)

lmfit <- lm("y ~ .", data = dat)
lmfit_coefs <- coef(lmfit)

ols_varsel <- ols_varsel_err(lmfit, true_covars, nuisance_vars)
BH_varsel <- ols_varsel_err(lmfit, true_covars, nuisance_vars, adjust.method = "BH")

varsel_results <- rbind(
  "OLS" = ols_varsel,
  "BH" = BH_varsel,
  "Compression" = nn_varsel
)
mykable(varsel_results, cap = "variable selection results")
|             | TP | FP | TN | FN |
|-------------|----|----|----|----|
| OLS         | 10 |  3 | 17 |  0 |
| BH          | 10 |  1 | 19 |  0 |
| Compression | 10 |  0 | 20 |  0 |
The evolution of the first (input) layer’s \(30 \times 20\) weight matrix (30 covariates, 20 neurons) during training can be seen below:
(animated .gif: input layer evolution over training epochs)

Unfortunately, this animation displays the weights of each neuron as a column (rather than as a row, which is how I have been referring to them): the y-axis increments correspond to the 30 covariates, and the x-axis increments to the 20 neurons of the input layer. We can see that training ends up pruning columns (neurons), but there is also a steady push towards 0 across the rows, implying that certain covariates have only a minor effect.
Below (generated in `ggplot`) is the final version of the input layer’s weight matrix after thresholding based on the estimated variances and bit precision roundoffs (unfortunately, the plot scaling is off here, but just as in the animated .gif above, there are 30 rows and 20 columns corresponding to the 30 covariates and 20 neurons):
vismat(py$weight_mus[[1]]) +
  coord_flip() +
  labs(
    y = "neurons in first layer",
    x = "covariates"
  ) +
  labs(
    title = "first weight layer",
    subtitle = "grey = pruned weights"
  )
Unfortunately, because I do not know how to code in Python without using R’s `reticulate` package, and I do not know how to loop over chunks containing both Python and R code, I can only re-run the simulation manually right now (so I am unable to display results from numerous simulations). However, the results tended to be similar under the same simulation conditions, with the Bayesian Compression algorithm typically performing slightly better than OLS (whether using naive p-value thresholding or the Benjamini-Hochberg adjustment).
One potential problem, however, is that re-training the exact same model on the exact same dataset occasionally produced different results; my best guess here is that the Bayesian Compression algorithm may be sensitive to starting values, or I may need to train for more epochs.
In the absence of multiple simulations, this is meant mostly as a proof of concept: that we can use the methods outlined by Louizos, Ullrich, and Welling, with slight modification, to perform principled variable selection. I am encouraged by these initial results and hope to explore the approach further; a few potential issues and extensions stand out for further work.
Last, though only linear data was attempted here, I am also excited about the potential applications to functional data analysis and variable selection, where we could truly leverage the main strength of neural networks: their expressivity. That is, the end goal would be a method yielding principled variable selection with well-studied operating characteristics (from the Bayesian sparsity prior) that is also robust to misspecification of functional forms and nonlinearities.