


Expectation Propagation in Sparse Linear Models with Spike and Slab Priors

José Miguel Hernández-Lobato, Computer Science Department, Universidad Autónoma de Madrid. Joint work with D. Hernández-Lobato, A. Suárez and P. Dupont.

Cambridge UK, March 4th, 2011


Outline

1. Introduction
2. Sparse Linear Regression
3. Network-based Sparse Bayesian Classification
4. Multi-task Feature Selection and Group Feature Selection

Introduction



Sparse Linear Models...

...include a few coefficients that are different from zero and many coefficients that are exactly zero. ...are especially useful for addressing problems with a reduced number of training instances and a high-dimensional feature space (large d and small n).

Three approaches for enforcing sparsity:
1. Select a small subset of features in advance.
2. Add a penalty term to the objective function.
3. Use a sparsity-enforcing prior in a Bayesian approach.


Different Sparsity Enforcing Priors

[Figure: density plots of three sparsity-enforcing priors: spike and slab, Laplace, and degenerate Student's t.]

Selective shrinkage [Ishwaran and Rao (2005)]: only the posterior mean of truly zero coefficients is strongly shrunk towards zero.

We propose to use sparse linear models with spike and slab priors to address problems that belong to the large d and small n class.

Sparse Linear Regression



Linear Regression Model with Spike and Slab Priors (LRMSSP)

Likelihood:

P(y | w, X) = ∏_{i=1}^n N(y_i | wᵀx_i, σ₀²).

Spike and slab prior:

P(w | z) = ∏_{i=1}^d [ z_i N(w_i | 0, v_s) + (1 − z_i) δ(w_i) ],   P(z) = ∏_{i=1}^d Bern(z_i | p₀),

where the Gaussian N(w_i | 0, v_s) is the slab and the point mass δ(w_i) is the spike.

The posterior is intractable: use MCMC [George and McCulloch (1997)]. However, MCMC often has a large cost: on average O(p₀² d³ k), with k ≪ d. Proposed alternative: expectation propagation (EP) [Minka (2001)].
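As a quick illustration of what this prior encodes, the sketch below draws a coefficient vector from it (a hypothetical helper, not code from the talk; v_s is the slab variance and p0 the prior inclusion probability):

```python
import numpy as np

def sample_spike_slab(d, p0, v_s, rng):
    """Draw w from the spike-and-slab prior:
    z_i ~ Bern(p0); w_i = 0 if z_i = 0, else w_i ~ N(0, v_s)."""
    z = rng.random(d) < p0                                  # inclusion indicators
    w = np.where(z, rng.normal(0.0, np.sqrt(v_s), d), 0.0)  # slab draw or exact zero
    return w, z

rng = np.random.default_rng(0)
w, z = sample_spike_slab(1000, 0.02, 1.0, rng)
# Most coefficients are exactly zero; on average p0 * d = 20 are non-zero.
```

Unlike a Laplace prior, this places an actual point mass at zero, which is what produces exact zeros rather than merely small coefficients.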


Expectation Propagation in the LRMSSP

EP approximates the posterior P(w, z | X, y) by a simpler distribution:

Q(w, z) = ∏_{i=1}^d N(w_i | m_i, v_i) Bern(z_i | p_i).

[Figure: EP approximations (1) and (2) compared with the exact posterior.]

EP selects the parameters of Q by iteratively minimizing:

(1) D_KL[ P(y | w, X) f̂₂ f̂₃ ‖ f̂₁ f̂₂ f̂₃ ]   and   (2) D_KL[ f̂₁ P(w | z) f̂₃ ‖ f̂₁ f̂₂ f̂₃ ].

When d > n, the cost of EP is linear in d: O(n² d).
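Per coefficient, the KL minimizations above reduce to matching the moments of a "tilted" spike-and-slab distribution with a single Gaussian. The sketch below shows that one-site computation in the generic cavity-Gaussian formulation (an illustrative sketch, not the talk's exact implementation):

```python
import numpy as np

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def match_spike_slab_site(m, v, p0, v_s):
    """Moment-match the tilted distribution
        q(w) ∝ N(w | m, v) [ p0 N(w | 0, v_s) + (1 - p0) δ(w) ]
    with one Gaussian -- the basic KL-minimisation step EP repeats.
    Returns (slab responsibility, matched mean, matched variance)."""
    z_slab = p0 * norm_pdf(m, 0.0, v + v_s)   # evidence of the slab branch
    z_spike = (1 - p0) * norm_pdf(0.0, m, v)  # evidence of the spike branch
    r = z_slab / (z_slab + z_spike)           # posterior probability of the slab
    mu = m * v_s / (v + v_s)                  # slab-branch posterior mean
    tau = v * v_s / (v + v_s)                 # slab-branch posterior variance
    mean = r * mu                             # first moment of the mixture
    var = r * (tau + mu ** 2) - mean ** 2     # second central moment
    return r, mean, var
```

With p0 = 1 this recovers the usual Gaussian-product update; with p0 = 0 the matched Gaussian collapses onto the spike at zero.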


Experimental Evaluation

Regression problems with large d and small n:
1. Reverse-engineering of transcription control networks.
2. Reconstruction of sparse signals.
3. Sentiment prediction from user-written product reviews.

Methods analyzed:
- SS-EP: LRMSSP, EP.
- SS-MCMC: LRMSSP, MCMC [George and McCulloch (1997)].
- Laplace: linear model, Laplace prior, EP [Seeger (2008)].
- RVM: linear model, degenerate Student's t prior, type-II maximum likelihood [Tipping (2001)].


Reconstruction of Transcription Networks

Given a d × n gene expression matrix X, we assume

X = WX + σ₀E   (steady-state model),

where diag(W) = 0 and e_ij ∼ N(0, 1).

- Inference on W by solving d regression problems, assuming sparsity.
- Predict a link j → i when P(w_ij ≠ 0 | X) > γ.
- Performance measures: AUC-PR and AUC-ROC as γ is varied.
- n = d = 100; X generated by GeneNetWeaver [Marbach et al. (2009)].
- σ₀ = 1, p₀ = 0.02; v_s fixed using the rule of Steinke et al. (2007).

Average results for 100 networks: [results figure omitted].


Posterior Mean for W

Superior selective shrinkage of spike and slab priors!



Reconstruction of Sparse Signals

Reconstruct a d-dimensional sparse signal w₀ from n measurements:

y = Xw₀ + σ₀e,   where e_i ∼ N(0, 1).

- Non-uniform spike signals: 20 spikes sampled from N(0, 1).
- Uniform spike signals: 20 spikes sampled from {−1, 1}.
- Rows of X sampled from the unit hyper-sphere; d = 512, σ₀ = 0.005, v_s = 1, p₀ = 20/512; n = 75 (non-uniform case) and n = 100 (uniform case).
- Performance measure: reconstruction error ‖ŵ − w₀‖₂ / ‖w₀‖₂.

Average results for 100 signals:

Non-uniform spike signals:
        SS-MCMC  Laplace  RVM   SS-EP
Error   0.19     0.82     0.19  0.04
Time    798      0.12     0.07  0.19

Uniform spike signals:
        SS-MCMC  Laplace  RVM   SS-EP
Error   1.03     0.84     0.66  0.01
Time    1783     0.17     0.12  0.2
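The data generation and the error measure from this experiment can be reproduced in a few lines of NumPy. This sketch covers only the setup and the metric (with a trivial minimum-norm least-squares baseline, not any of the inference methods compared in the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k, sigma0 = 512, 100, 20, 0.005

# Uniform spike signal: k spikes drawn from {-1, +1} at random positions.
w0 = np.zeros(d)
support = rng.choice(d, size=k, replace=False)
w0[support] = rng.choice([-1.0, 1.0], size=k)

# Measurement matrix with rows sampled from the unit hyper-sphere.
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

y = X @ w0 + sigma0 * rng.normal(size=n)

def reconstruction_error(w_hat, w0):
    """Performance measure from the slides: ||w_hat - w0||_2 / ||w0||_2."""
    return np.linalg.norm(w_hat - w0) / np.linalg.norm(w0)

# Minimum-norm least squares as a non-sparse baseline;
# a sparse method should do far better in this d >> n regime.
w_ls = np.linalg.pinv(X) @ y
```

Note that predicting all zeros already achieves error 1.0, so the Laplace score of 0.84 in the uniform case is only marginally better than doing nothing.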


Signal Reconstruction, Uniform Case

EP is less affected by sub-optimal modes!


Sentiment Prediction

Predict, from the review text, the rating (1-5 stars) given by the user to the product.

- Features: the unigrams and bigrams appearing in at least 100 reviews [Blitzer et al. (2007)].
- Hyper-parameters fixed by a 10-fold cross-validation grid search; σ₀ = 1.
- 500 training instances; 1213 (Books) and 824 (Kitchen Appliances) features.
- Test instances: 5001 (Books) and 4649 (Kitchen Appliances).
- Performance measure: squared error.

Average results on 20 train/test partitions:

Books dataset:
        SS-MCMC  Laplace  RVM   SS-EP
Error   1.81     1.84     2.38  1.81
Time    155,438  9.9      2.1   11.1

Kitchen Appliances dataset:
        SS-MCMC  Laplace  RVM   SS-EP
Error   1.59     1.64     1.91  1.59
Time    40,662   7.6      0.9   9.5


Posterior Mean for w, Kitchen Dataset

Superior selective shrinkage



Conclusions LRMSSP

- In the LRMSSP, EP can outperform Gibbs sampling at a lower computational cost.
- The LRMSSP can improve the results of sparse models with Laplace and degenerate Student's t priors.
- The spike and slab prior distribution has a superior selective shrinkage capacity.
- EP seems to be less affected than Gibbs sampling by the multiple sub-optimal modes of the posterior distribution.

Network-based Sparse Bayesian Classification



Network of Feature Dependencies

- In some classification problems with large d and small n there is prior information about feature dependencies.
- Very often, we know that two features are likely to be either both relevant or both irrelevant for prediction.
- This prior information can be encoded in an undirected network or graph G = (V, E), whose nodes correspond to features and whose edges connect dependent features.
- A sparse linear classifier that incorporates this prior information may improve its predictive performance.


Sparse Bayesian Classifier

Likelihood:

P(y | w, ε, X) = ∏_{i=1}^n [ ε (1 − Θ(y_i wᵀx_i)) + (1 − ε) Θ(y_i wᵀx_i) ],

where Θ is the Heaviside step function and w₀ is the bias coefficient.

Prior for the noise ε in the class labels:

P(ε) = Beta(ε | a₀, b₀)   [Hernández-Lobato et al. (2008)].

Spike and slab prior:

P(w | z) = ∏_{i=1}^d [ (z_i + 1)/2 · N(w_i | 0, v_i) + (1 − z_i)/2 · δ(w_i) ],

where v₀ = 100, v_i = 1 for i = 1, …, d, and z_i ∈ {−1, 1}.
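The noisy-label likelihood above can be evaluated directly: each label agrees with the hard linear rule sign(wᵀx) with probability 1 − ε and is flipped with probability ε. A sketch (illustrative helper, with ε written as eps):

```python
import numpy as np

def log_likelihood(y, X, w, eps):
    """Log-likelihood of labels y in {-1, +1} under the noisy threshold model:
    P(y_i) = 1 - eps if Theta(y_i w^T x_i) = 1, and eps otherwise."""
    correct = (y * (X @ w)) > 0           # Theta(y_i w^T x_i)
    p = np.where(correct, 1.0 - eps, eps) # per-instance label probability
    return np.sum(np.log(p))
```

Because misclassified points contribute log ε rather than −∞, the model tolerates label noise instead of forcing perfect separation.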


Network-based Sparse Bayesian Classifier (NBSBC)

The information in the network of features G can be incorporated using a Markov random field as the prior for z:

P(z | G, α, β) ∝ exp( α Σ_i z_i + β Σ_{(i,j)∈E} z_i z_j ).

- α determines the level of sparsity.
- β determines the strength of the dependencies (β ≥ 0).
- The posterior P(w, z, ε | X, y, G, α, β) is intractable.
- Approximate inference with EP [Hernández-Lobato et al. (2011)].
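Assuming the standard Ising parameterisation over z ∈ {−1, 1}^d (consistent with α controlling sparsity and β ≥ 0 coupling neighbouring features; the slides do not show the exact formula), the unnormalised log-prior can be sketched as:

```python
import numpy as np

def log_mrf_prior(z, edges, alpha, beta):
    """Unnormalised log of an Ising-style MRF prior over z in {-1, 1}^d:
        P(z | G) ∝ exp(alpha * sum_i z_i + beta * sum_{(i,j) in E} z_i z_j).
    With beta >= 0, configurations where connected features agree on being
    (ir)relevant receive higher prior support."""
    pairwise = sum(z[i] * z[j] for i, j in edges)
    return alpha * np.sum(z) + beta * pairwise

# A chain graph over 4 features.
edges = [(0, 1), (1, 2), (2, 3)]
agree = np.array([1, 1, -1, -1])     # neighbours mostly agree
disagree = np.array([1, -1, 1, -1])  # every edge disagrees
```

This is the mechanism behind the "smoothing effect" reported in the results: the β term pulls the posterior inclusion probabilities of connected features towards each other.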


Expectation Propagation in the NBSBC

The EP approximation:

Q(w, z, ε) = Beta(ε | a, b) ∏_{i=1}^d N(w_i | m_i, v_i) Bern(z_i | p_i).

[Figure: factor graph indicating which factors are approximated by EP and which are kept exact.]

When G is sparse, the cost of EP is linear in d and n: O(nd).


Experimental Evaluation

Classification problems with large d, small n and a network G:
1. English phonemes (aa vs. ao).
2. Handwritten digits (7 vs. 9) with background noise.
3. Precipitation amounts (positive vs. zero).
4. Metastasis-free survival time (larger vs. shorter).

Methods analyzed:
- NBSBC: the proposed method.
- SBC: the proposed method with no network info (β = 0).
- NBSVM: the network-based support vector machine [Zhu et al. (2009)].
- GL: logistic regression with the graph lasso penalty [Jacob et al. (2009)].
- SVM: the standard support vector machine.

Hyper-parameter values:
- α = 0 (non-informative prior on the sparsity level).
- β selected by a grid search: the value that minimizes E_Q[ε].
- a₀ = 1 and b₀ = 9.


English Phonemes

- "aa" as in dark vs. "ao" as the first vowel in water.
- 256 features per data instance, forming a log-periodogram.
- 1022 instances of class "ao" and 695 instances of class "aa".
- 150 training instances.


Results on English Phonemes

       SVM    NBSVM   GL     SBC    NBSBC
Error  20.66  20.24   20.55  20.19  19.48
Time   10.47  145.60  6.49   0.58   8.37

The network produces a smoothing effect on the posterior probabilities of the latent variables z.


Handwritten Digits

- 7 vs. 9 in the MNIST dataset.
- 784 features per data instance (pixel intensities).
- Background noise added.
- 7293 instances of class 7 and 6958 instances of class 9.
- 150 training instances.


Results on Handwritten Digits

       SVM    NBSVM    GL     SBC   NBSBC
Error  10.32  10.23    11.18  9.18  8.35
Time   35.80  1992.51  29.93  0.56  21.32

Smoothing effect on the posterior probabilities.


Precipitation Amounts, Moscow Meteorological Station

- Instance features: rainfall amounts at 222 other stations.
- 2217 dry days, 2326 days with positive precipitation.
- 150 training instances.
- Results for Moscow are representative of those for other stations.

       SVM    NBSVM   GL     SBC    NBSBC
Error  38.12  36.69   32.31  35.16  33.17
Time   9.82   254.37  14.74  0.31   8.36


Metastasis Free Survival Time. Breast Cancer.



Stability Analysis of Feature Selection Methods

Let B_i^k be the set with the k most relevant features as estimated by the method in the i-th train/test episode.

Index of feature selection stability [Kuncheva (2007)]:

I_FSS(k) = (2 / (T(T − 1))) Σ_{i=1}^{T−1} Σ_{j=i+1}^{T} (o_ij^k d − k²) / (k(d − k)),   for k = 1, …, d,

where d is the data dimensionality and o_ij^k is the number of common elements between sets B_i^k and B_j^k.

Relevance of the i-th feature:
- SBC and NBSBC: P(z_i = 1 | X, y, G, α, β).
- NBSVM and GL: |w_i|.
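The index can be implemented directly from the formula above (function name illustrative). Each pairwise term is the overlap o corrected for the overlap expected by chance, k²/d, and normalised so that identical sets score 1:

```python
import numpy as np
from itertools import combinations

def kuncheva_stability(selected_sets, d):
    """Index of feature-selection stability [Kuncheva 2007]: average pairwise
    consistency (o*d - k^2) / (k*(d - k)) over the T train/test episodes,
    where o = |B_i ∩ B_j| and k is the (common) size of the selected sets."""
    T = len(selected_sets)
    k = len(selected_sets[0])
    total = 0.0
    for a, b in combinations(selected_sets, 2):
        o = len(set(a) & set(b))                  # shared features in the pair
        total += (o * d - k ** 2) / (k * (d - k)) # chance-corrected consistency
    return 2.0 * total / (T * (T - 1))

# Perfectly stable selection: the same 3 features in every episode.
stable = kuncheva_stability([{0, 1, 2}] * 3, d=10)    # -> 1.0
# Disjoint selections score below zero (worse than chance overlap).
unstable = kuncheva_stability([{0, 1}, {2, 3}], d=10)
```

The chance correction is what distinguishes this index from a raw overlap average: random selections score around 0 rather than k/d.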


Stability Results Agreement between feature rankings in train/test episodes:



Conclusions NBSBC

- Taking into account dependencies between features can improve the predictive performance of a sparse linear model.
- These dependencies can be incorporated into a model with spike and slab priors by using a Markov random field.
- The feature rankings of NBSBC are robust and stable against small perturbations of the training set.
- Hyper-parameter β can be selected with a grid search that minimizes E_Q[ε] (no cross-validation needed).

Multi-task Feature Selection and Group Feature Selection



Multi-task Feature Selection

K learning tasks with coefficients w₁, …, w_K ∈ R^d, W = [w₁ᵀ, …, w_Kᵀ]ᵀ. The same d features are shared across tasks.

Prior assumption: only a common subset of features is relevant for prediction in all the tasks.

Multi-task spike and slab prior:

P(W | z) = ∏_{k=1}^K P(w_k | z) = ∏_{k=1}^K ∏_{i=1}^d [ z_i N(w_ik | 0, v_s) + (1 − z_i) δ(w_ik) ],

where P(z) = ∏_{i=1}^d Bern(z_i | p₀).

Approximate inference with EP in a probit classification model [Hernández-Lobato et al. (2010)].
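The structure this prior encodes — one shared indicator vector z selecting the same features for every task — can be illustrated with a small NumPy sketch (hypothetical helper, not code from the talk):

```python
import numpy as np

def sample_multitask_prior(K, d, p0, v_s, rng):
    """Multi-task spike and slab: a single z ~ Bern(p0)^d is shared by all
    K tasks, so every task row of W has the same sparsity pattern."""
    z = rng.random(d) < p0                                          # shared support
    W = np.where(z, rng.normal(0.0, np.sqrt(v_s), size=(K, d)), 0.0)
    return W, z

rng = np.random.default_rng(2)
W, z = sample_multitask_prior(K=3, d=50, p0=0.1, v_s=1.0, rng=rng)
# Columns of W are zero or non-zero for all tasks simultaneously.
```

Contrast this with K independent spike-and-slab priors, which would let each task pick its own support; the shared z is what transfers selection information across tasks.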


Group Feature Selection

Prior assumption: the components of w can be partitioned into disjoint groups, and the components in the same group are either all equal to zero or all different from zero.

Group spike and slab prior: w is split into G disjoint groups, wᵀ = (w₁ᵀ, …, w_Gᵀ), z = (z₁, …, z_G)ᵀ, and

P(w | z) = ∏_{i=1}^d [ z_g(i) N(w_i | 0, v_s) + (1 − z_g(i)) δ(w_i) ],

where g(i) is the index of the group containing the i-th feature, and P(z) = ∏_{g=1}^G Bern(z_g | p_g).

Approximate inference with EP in a linear classification model. Document under review.
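As with the multi-task prior, the group structure is easiest to see by sampling from it: one indicator per group switches all of that group's coefficients on or off together (a NumPy sketch with illustrative names):

```python
import numpy as np

def sample_group_prior(groups, p, v_s, rng):
    """Group spike and slab: one indicator z_g per group; all coefficients in
    a group are zero or non-zero together. groups[i] is the group index g(i)
    of feature i."""
    G = max(groups) + 1
    z = rng.random(G) < p                     # one Bernoulli draw per group
    active = z[np.asarray(groups)]            # expand to per-feature indicators
    w = np.where(active, rng.normal(0.0, np.sqrt(v_s), len(groups)), 0.0)
    return w, z

rng = np.random.default_rng(3)
# Three groups of two features each: g(0)=g(1)=0, g(2)=g(3)=1, g(4)=g(5)=2.
w, z = sample_group_prior([0, 0, 1, 1, 2, 2], p=0.5, v_s=1.0, rng=rng)
# Within each pair, both entries are zero or both are non-zero.
```

Setting every feature in its own group recovers the ordinary spike-and-slab prior, so this is a strict generalisation.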


References I

- Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730-773.
- George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7(2):339-373.
- Minka, T. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT.
- Seeger, M. W. (2008). Bayesian inference and optimal design for the sparse linear model. Journal of Machine Learning Research, 9:759-813.
- Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244.
- Hernández-Lobato, J. M., Hernández-Lobato, D., and Suárez, A. (2011). Network-based sparse Bayesian classification. Pattern Recognition, 44(4):886-900.
- Zhu, Y., Shen, X., and Pan, W. (2009). Network-based support vector machine for classification of microarray samples. BMC Bioinformatics, 10(Suppl 1):S21.
- Jacob, L., Obozinski, G., and Vert, J.-P. (2009). Group lasso with overlap and graph lasso. In ICML 2009, pages 433-440.


References II

- Kuncheva, L. I. (2007). A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, pages 390-395. ACTA Press.
- Marbach, D., Schaffter, T., Mattiussi, C., and Floreano, D. (2009). Generating realistic in silico gene networks for performance assessment of reverse engineering methods. Journal of Computational Biology, 16(2):229-239.
- Steinke, F., Seeger, M., and Tsuda, K. (2007). Experimental design for efficient identification of gene regulatory networks using sparse Bayesian models. BMC Systems Biology, 1(1):51.
- Blitzer, J., Dredze, M., and Pereira, F. (2007). Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the ACL, pages 440-447.
- Hernández-Lobato, D. and Hernández-Lobato, J. M. (2008). Bayes machines for binary classification. Pattern Recognition Letters, 29(10):1466-1473.
- Hernández-Lobato, D., Hernández-Lobato, J. M., Helleputte, T., and Dupont, P. (2010). Expectation propagation for Bayesian multi-task feature selection. In ECML PKDD 2010, Part I, LNAI 6321, pages 522-537.


Thank you for your attention!
