Lecture 9: Multi Kernel SVM
Stéphane Canu, stephane.canu@litislab.eu
Sao Paulo 2014, April 16, 2014
Stéphane Canu (INSA Rouen - LITIS)

Roadmap
1 Tuning the kernel: MKL
   The multiple kernel problem
   Sparse kernel machines for regression: SVR
   SimpleMKL: the multiple kernel solution

Standard learning with kernels
The user chooses a single kernel k; the learning machine maps the data to a function f.
http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf

Learning the kernel framework
The user supplies a kernel family {k_m}; the learning machine returns both the function f and the kernel k(., .).

From SVM to Multiple Kernel Learning (MKL)
SVM, a single kernel k:
   $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + b$
MKL, a set of M kernels $k_1, \dots, k_m, \dots, k_M$:
 - learn the classifier and the combination weights jointly
 - can be cast as a convex optimization problem
   $f(x) = \sum_{i=1}^n \alpha_i \sum_{m=1}^M d_m k_m(x, x_i) + b$, with $\sum_{m=1}^M d_m = 1$ and $0 \le d_m$
   $\phantom{f(x)} = \sum_{i=1}^n \alpha_i K(x, x_i) + b$, where $K(x, x_i) = \sum_{m=1}^M d_m k_m(x, x_i)$
http://www.nowozin.net/sebastian/talks/ICCV-2009-LPbeta.pdf

Multiple kernel: the model
Given M kernel functions $k_1, \dots, k_M$ that are potentially well suited for a given problem, find a positive linear combination of these kernels such that the resulting kernel k is "optimal":
   $k(x, x') = \sum_{m=1}^M d_m k_m(x, x')$, with $d_m \ge 0$ and $\sum_m d_m = 1$
Learned together: the kernel coefficients $d_m$ and the SVM parameters $\alpha_i$, b.

Multiple kernel: illustration [figure]

Multiple kernel strategies
 - Wrapper methods (Weston et al., 2000; Chapelle et al., 2002): solve the SVM, then gradient descent on $d_m$ on a criterion (margin criterion or span criterion)
 - Kernel learning and feature selection: use the kernels as a dictionary
 - Embedded Multi Kernel Learning (MKL)

Multiple kernel functional learning
The problem (for a given C):
   $\min_{f \in H, b, \xi, d}\ \frac{1}{2}\|f\|_H^2 + C\sum_i \xi_i$
   with $y_i(f(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0\ \forall i$, $\sum_{m=1}^M d_m = 1$, $d_m \ge 0\ \forall m$,
   $f = \sum_m f_m$ and $k(x, x') = \sum_{m=1}^M d_m k_m(x, x')$ with $d_m \ge 0$.
The functional framework: $H = \bigoplus_{m=1}^M H'_m$ with $\langle f, g \rangle_{H'_m} = \frac{1}{d_m}\langle f, g \rangle_{H_m}$.

Written in terms of the component functions $f_m$:
   $\min_{\{f_m\}, b, \xi, d}\ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C\sum_i \xi_i$
   with $y_i(\sum_m f_m(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0\ \forall i$, $\sum_m d_m = 1$, $d_m \ge 0\ \forall m$
Treated as a bi-level optimization task:
   $\min_{d \in \mathbb{R}^M}\ \Big\{\min_{\{f_m\}, b, \xi}\ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C\sum_i \xi_i$ s.t. $y_i(\sum_m f_m(x_i) + b) \ge 1 - \xi_i,\ \xi_i \ge 0\ \forall i\Big\}$
   s.t. $\sum_m d_m = 1$, $d_m \ge 0\ \forall m$

Multiple kernel representer theorem and dual
The Lagrangian:
   $L = \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C\sum_i \xi_i - \sum_i \alpha_i\big(y_i(\sum_m f_m(x_i) + b) - 1 + \xi_i\big) - \sum_i \beta_i \xi_i$
Associated KKT stationarity conditions:
   $\nabla_m L = 0 \iff \frac{1}{d_m} f_m(\bullet) = \sum_{i=1}^n \alpha_i y_i k_m(\bullet, x_i)$, m = 1, ..., M
Representer theorem:
   $f(\bullet) = \sum_m f_m(\bullet) = \sum_{i=1}^n \alpha_i y_i \underbrace{\sum_m d_m k_m(\bullet, x_i)}_{K(\bullet, x_i)}$
We have a standard SVM problem with respect to the function f and the kernel K.
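The combined kernel $K(x, x') = \sum_m d_m k_m(x, x')$ can be formed directly at the Gram-matrix level. A minimal sketch in Python (not from the lecture; the Gaussian base kernels, bandwidths, and toy points are illustrative assumptions):

```python
import math

def gaussian_kernel(x, xp, bw):
    # k_m(x, x') = exp(-||x - x'||^2 / (2 * bw^2))
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-d2 / (2.0 * bw ** 2))

def combined_gram(points, bandwidths, d):
    # K = sum_m d_m K_m, with the MKL constraints sum(d) = 1 and d_m >= 0
    assert abs(sum(d) - 1.0) < 1e-12 and all(dm >= 0 for dm in d)
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for m, bw in enumerate(bandwidths):
        for i in range(n):
            for j in range(n):
                K[i][j] += d[m] * gaussian_kernel(points[i], points[j], bw)
    return K

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = combined_gram(points, bandwidths=[0.5, 1.0, 2.0], d=[0.2, 0.3, 0.5])
```

Since each Gaussian base kernel has $k_m(x, x) = 1$ and $\sum_m d_m = 1$, the diagonal of K equals 1, and K inherits symmetry and positive semi-definiteness from the base Gram matrices.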
Multiple kernel algorithm
Use a reduced gradient algorithm [1]:
   $\min_{d \in \mathbb{R}^M} J(d)$ s.t. $\sum_m d_m = 1$, $d_m \ge 0\ \forall m$
SimpleMKL algorithm:
   set $d_m = \frac{1}{M}$ for m = 1, ..., M
   while stopping criterion not met do
      compute J(d) using a QP solver with $K = \sum_m d_m K_m$
      compute $\frac{\partial J}{\partial d_m}$, and the projected gradient as a descent direction D
      $\gamma \leftarrow$ compute optimal stepsize
      $d \leftarrow d + \gamma D$
   end while
Improvement reported using the Hessian.
[1] Rakotomamonjy et al., JMLR 2008.

Computing the reduced gradient
At the optimum, the primal cost equals the dual cost:
   $\underbrace{\frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C\sum_i \xi_i}_{\text{primal cost}} = \underbrace{\frac{1}{2}\alpha^\top G\alpha - e^\top\alpha}_{\text{dual cost}}$
with $G = \sum_m d_m G_m$, where $(G_m)_{ij} = k_m(x_i, x_j)$. The dual cost is easier to differentiate:
   $\nabla_{d_m} J(d) = \frac{1}{2}\alpha^\top G_m \alpha$
Reduce (or project) to maintain the constraint $\sum_m d_m = 1 \Rightarrow \sum_m D_m = 0$:
   $D_m = \nabla_{d_m} J(d) - \nabla_{d_1} J(d)$ for m ≥ 2, and $D_1 = -\sum_{m=2}^M D_m$

Complexity
For each iteration:
 - SVM training: $O(n\,n_{sv} + n_{sv}^3)$. Inverting $K_{sv,sv}$ is $O(n_{sv}^3)$, but it might already be available as a by-product of the SVM training.
 - Computing H: $O(M n_{sv}^2)$
 - Finding d: $O(M^3)$
The number of iterations is usually less than 10. When $M < n_{sv}$, computing d is not more expensive than the QP.
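The sum-preserving update on d can be sketched without an SVM solver by replacing J(d) with a toy linear objective (the objective $J(d) = w^\top d$, the fixed step size, and the crude clipping projection are illustrative assumptions; in SimpleMKL, J(d) and its gradient come from the SVM dual, and the step size is found by line search):

```python
def simplemkl_step(d, grad, gamma):
    # Reduced gradient on the simplex: G_m = grad_m - grad_1 (m >= 2),
    # G_1 = -sum_{m>=2} G_m, so that sum(G) = 0 and sum(d) stays equal to 1.
    G = [g - grad[0] for g in grad]
    G[0] = -sum(G[1:])
    D = [-g for g in G]                                  # descent direction
    d_new = [dm + gamma * Dm for dm, Dm in zip(d, D)]    # d <- d + gamma * D
    d_new = [max(dm, 0.0) for dm in d_new]               # clip to d_m >= 0
    s = sum(d_new)
    return [dm / s for dm in d_new]                      # renormalize

# toy objective J(d) = w . d: mass should move toward the smallest w_m
w = [3.0, 1.0, 2.0]
d = [1.0 / 3] * 3
for _ in range(50):
    d = simplemkl_step(d, w, gamma=0.1)
```

After a few iterations the weight of the kernel with the largest gradient (m = 1 here) is driven to zero, while the simplex constraint $\sum_m d_m = 1$ is maintained at every step.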
MKL on the Caltech-101 dataset
http://www.robots.ox.ac.uk/~vgg/software/MKL/

Support vector regression (SVR)
The t-insensitive loss:
   $\min_{f \in H}\ \frac{1}{2}\|f\|_H^2$ with $|f(x_i) - y_i| \le t$, i = 1, ..., n
The support vector regression (SVR) introduces slack variables:
   $\min_{f \in H}\ \frac{1}{2}\|f\|_H^2 + C\sum_i |\xi_i|$ with $|f(x_i) - y_i| \le t + \xi_i$, $0 \le \xi_i$, i = 1, ..., n
This is a typical multi-parametric quadratic program (mpQP), with a piecewise-linear regularization path:
   $\alpha(C, t) = \alpha(C_0, t_0) + \big(\frac{1}{C} - \frac{1}{C_0}\big)u + \frac{1}{C_0}(t - t_0)v$
a 2d Pareto front (the tube width and the regularity).

Support vector regression illustration [figure: SVM regression fits for C large and C small]
There exist other formulations, such as LP SVR.

Multiple kernel learning for regression
The problem (for given C and t):
   $\min_{\{f_m\}, b, \xi, d}\ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C\sum_i \xi_i$
   s.t. $\big|\sum_m f_m(x_i) + b - y_i\big| \le t + \xi_i\ \forall i$, $\xi_i \ge 0\ \forall i$, $\sum_m d_m = 1$, $d_m \ge 0\ \forall m$
Regularization formulation:
   $\min_{\{f_m\}, b, d}\ \frac{1}{2}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + C\sum_i \max\big(\big|\sum_m f_m(x_i) + b - y_i\big| - t,\ 0\big)$, with $\sum_m d_m = 1$, $d_m \ge 0\ \forall m$
Equivalently:
   $\min_{\{f_m\}, b, d}\ \sum_i \max\big(\big|\sum_m f_m(x_i) + b - y_i\big| - t,\ 0\big) + \frac{1}{2C}\sum_m \frac{1}{d_m}\|f_m\|_{H_m}^2 + \mu\sum_m |d_m|$
As in the classification case, this is treated as a bi-level optimization task: an inner SVR problem in $\{f_m\}, b, \xi$ for fixed d, and an outer problem in d on the simplex.

Multiple kernel experiments
[figure: LinChirp, Wave, Blocks, and Spikes test functions with their multi-kernel fits]

Table: normalized mean square error (%) averaged over 20 runs.

   Data set  | Single kernel |      Kernel Dil       |   Kernel Dil-Trans
             | Norm. MSE (%) | #Kernel | Norm. MSE   | #Kernel | Norm. MSE
   LinChirp  | 1.46 ± 0.28   |  7.0    | 1.00 ± 0.15 |  21.5   | 0.92 ± 0.20
   Wave      | 0.98 ± 0.06   |  5.5    | 0.73 ± 0.10 |  20.6   | 0.79 ± 0.07
   Blocks    | 1.96 ± 0.14   |  6.0    | 2.11 ± 0.12 |  19.4   | 1.94 ± 0.13
   Spike     | 6.85 ± 0.68   |  6.1    | 6.97 ± 0.84 |  12.8   | 5.58 ± 0.84

Conclusion on multiple kernels (MKL)
 - MKL: kernel tuning, variable selection, ...
 - extension to classification and one-class SVM
 - SVM-KM: an efficient Matlab toolbox (available at MLOSS) [2]
 - Multiple kernels for image classification: software and experiments on Caltech-101 [3]
 - new trend: multi kernel, multi task, and an infinite number of kernels
[2] http://mloss.org/software/view/33/
[3] http://www.robots.ox.ac.uk/~vgg/software/MKL/

Bibliography
A. Rakotomamonjy, F. Bach, S. Canu & Y. Grandvalet. SimpleMKL. J. Mach. Learn. Res., 9:2491-2521, 2008.
M. Gönen & E. Alpaydin. Multiple kernel learning algorithms. J. Mach. Learn. Res., 12:2211-2268, 2011.
http://www.cs.nyu.edu/~mohri/icml2011-tutorial/tutorial-icml2011-2.pdf
http://www.robots.ox.ac.uk/~vgg/software/MKL/
http://www.nowozin.net/sebastian/talks/ICCV-2009-LPbeta.pdf

Lecture 7: Tuning hyperparameters using cross validation
Stéphane Canu, stephane.canu@litislab.eu
Sao Paulo 2014, April 4, 2014
Stéphane Canu (INSA Rouen - LITIS)

Roadmap
1 Tuning hyperparameters
   Motivation
   Machine learning without data
   Assessing the quality of a trained SVM
   Model selection
[figure: validation error as a function of log C and the log of the bandwidth]
"Evaluation is the key to making real progress in data mining", [Witten & Frank, 2005], p. 143 (from N. Japkowicz & M. Shah, ICML 2012 tutorial)

Motivation: the influence of C on the SVM
[figure: test error as a function of C (log scale); decision boundaries for C too small, a nice C, and C too large]

Motivation: the need for model selection (tuning the hyperparameters)
 - it requires a good estimate of the performance on future data
 - choose a relevant performance measure

Machine learning without data: minimizing IP(error).

Training and test data
Split the dataset into two randomly picked groups (the hold-out strategy):
 - Training set: used to train the classifier
 - Test set: used to estimate the error rate of the trained classifier
   (X, y): total available data; (Xa, ya): training data; (Xt, yt): test data
   (Xa, ya, Xt, yt) ← split(X, y, option = 1/3)
Generally, the larger the training set, the better the classifier; the larger the test set, the more accurate the error estimate.

Assessing the quality of a trained SVM: minimum error rate
Definition (confusion matrix). A matrix showing the predicted and actual classifications, of size L × L, where L is the number of different classes.

   Observed \ predicted | positive | negative
   positive             |    a     |    b
   negative             |    c     |    d

   Error rate = 1 − Accuracy = (b + c)/(a + b + c + d) = (b + c)/n = 1 − (a + d)/n
True positive rate (recall, sensitivity): a/(a + b). True negative rate (specificity): d/(c + d). Also precision, false positive rate, false negative rate, ...

Other performance measures: N. Japkowicz & M. Shah, "Evaluating Learning Algorithms: A Classification Perspective", Cambridge University Press, 2011.

The learning equation
Learning = training + testing + tuning

Table: my experimental error rates
              | State of the art | my new method | Bayes error
   problem 1  | 10% ± 1.25       | 8.5% ± .5     | 11%
   problem 2  | 5% (.25)         | 4% (.5)       | 2%
Is my new method good for problem 1? (Its estimated error lies below the Bayes error, the best achievable rate, so the estimate cannot be trusted.)

Error bars on Bernoulli trials
The error rate $\hat p$ is a Binomial proportion, $B(p)$. With confidence α (normal approximation interval):
   $p = IP(\text{error}) \in \hat p \pm u_{1-\alpha/2}\sqrt{\dfrac{\hat p(1-\hat p)}{n_t}}$
With confidence α (improved approximation):
   $p \in \dfrac{1}{1 + \frac{1}{K}u_{1-\alpha/2}^2}\left(\hat p \pm u_{1-\alpha/2}\sqrt{\dfrac{\hat p(1-\hat p)}{n_t}}\right)$
What if $\hat p = 0$?
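The normal-approximation interval above is a one-liner. A minimal sketch (the clipping to [0, 1] and the default $u_{1-\alpha/2} = 1.96$ for α = 95% are conventional choices, not from the slides):

```python
import math

def error_rate_interval(p_hat, n_t, u=1.96):
    """Normal-approximation (Wald) interval for an error rate:
    p_hat +/- u * sqrt(p_hat * (1 - p_hat) / n_t), clipped to [0, 1]."""
    half = u * math.sqrt(p_hat * (1.0 - p_hat) / n_t)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

lo, hi = error_rate_interval(0.10, 100)   # 10% error observed on 100 test points
```

Note that for $\hat p = 0$ this interval collapses to a single point, which is exactly the failure mode the slide's closing question points at; a standard remedy is the "rule of three" upper bound of roughly 3/n_t for a zero count.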
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval

To improve the estimate
 - Random subsampling (the repeated holdout method)
 - K-fold cross-validation (K = 10, K = 2, or K = n)
 - Leave-one-out cross-validation (each fold holds out a single example)
 - Bootstrap

Error bars: the Gaussian approximation
To stabilize the estimate, iterate K times, say K = 10 (the repeated holdout method):
 - the holdout estimate can be made more reliable by repeating the process with different subsamples
 - in each iteration, use a different random split
 - average the error rates over the iterations:
   mean error rate $\bar e = \frac{1}{K}\sum_{k=1}^K e_k$, variance $\hat\sigma^2 = \frac{1}{K-1}\sum_{k=1}^K (e_k - \bar e)^2$
   confidence interval: $\bar e \pm t_{\alpha/2, K-1}\sqrt{\hat\sigma^2/K}$, with e.g. $t_{0.025, 9} = 2.262$

Cross validation
Definition (cross-validation). A method for estimating the accuracy of an inducer by dividing the data into K mutually exclusive subsets (the "folds") of approximately equal size.
Example: K = 3-fold cross-validation [figure: alternating training and test folds]
How many folds are needed (K = ?)
 - large K: small bias, large variance, as well as large computational time
 - small K: reduced computation time, small variance, large bias
A common choice for K-fold cross-validation is K = 5.

Leave-one-out cross validation: theoretical guarantees.
The bootstrap.

Comparing results: two different issues
 - what is the best method for my problem?
 - how good is my learning algorithm?

Comparing two algorithms: McNemar's test
Build the joint confusion matrix of the two algorithms:

   Algo 1 \ Algo 2 | right                                 | wrong
   right           | number well classified by both        | e01: well classified by 1 but not by 2
   wrong           | e10: misclassified by 1 but not by 2  | number misclassified by both

Under H0 (the two algorithms are equivalent) we expect $e_{10} = e_{01} = \frac{e_{10} + e_{01}}{2}$, and
   $\dfrac{(|e_{10} - e_{01}| - 1)^2}{e_{10} + e_{01}} \sim \chi^2_1$
Beware: if $e_{10} + e_{01} < 20$, better use the sign test.
Matlab function: http://www.mathworks.com/matlabcentral/fileexchange/189-discrim/content/discrim/mcnemar.m
J. L. Fleiss (1981). Statistical Methods for Rates and Proportions, Second Edition. Wiley.

Model selection strategy
Model selection criteria attempt to find a good compromise between the complexity of a model and its prediction accuracy on the training data.
   1 (Xa, ya, Xt, yt) ← split(X, y, options)
   2 (C, b) ← tune(Xa, ya, options)
   3 model ← train(Xa, ya, C, b, options)
   4 error ← test(Xt, yt, C, b, options)
Occam's razor: the best theory is the smallest one that describes all the facts.

Model selection: the tuning function
function (C, b) ← tune(Xa, ya, options)
   1 (Xl, yl, Xv, yv) ← split(Xa, ya, options)
   2 loop on a grid for C
   3 loop on a grid for b
      1 model ← train(Xl, yl, C, b, options)
      2 error ← test(Xv, yv, C, b, options)
The three sets
 - Training set: a set of examples used for learning, i.e. to fit the parameters
 - Validation set: a set of examples used to tune the hyperparameters
 - Test set: independent instances that have played no part in the formation of the classifier

How to design the grids
A grid on b: a much simpler trick is to pick, say, 1000 pairs (x, x') at random from your dataset, compute the distances of all such pairs, and take the median, the 0.1 quantile, and the 0.9 quantile. Now pick b to be the inverse of any of these three numbers.
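The median-distance trick above can be sketched directly (the quantile indexing, the number of sampled pairs, and the fixed seed are implementation choices of this sketch, not prescribed by the slides):

```python
import math
import random

def bandwidth_candidates(points, n_pairs=1000, seed=0):
    """Pick random pairs, compute their distances, and return the inverses
    of the 0.1 quantile, the median, and the 0.9 quantile as candidate
    values for the kernel bandwidth parameter b."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_pairs):
        x, xp = rng.sample(points, 2)   # a random pair of distinct points
        dists.append(math.dist(x, xp))
    dists.sort()
    quantiles = [dists[int(q * (len(dists) - 1))] for q in (0.1, 0.5, 0.9)]
    return [1.0 / q for q in quantiles]

# five equally spaced 1-d points: pairwise distances are 1, 2, 3, or 4
cands = bandwidth_candidates([(float(i),) for i in range(5)])
```

For these points the 0.1 quantile of sampled distances is 1 and the median is 2, so the first two candidates are 1.0 and 0.5; in practice the three candidates bracket a sensible range for a grid on b.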
http://blog.smola.org/post/940859888/easy-kernel-width-choice
A grid on C: from $C_{min}$ to ∞ is too much!

The coarse-to-fine strategy
 1 use a large coarse grid on a few data points to localize interesting values
 2 fine-tune on all the data in this zone
   1 (Xa, ya, Xt, yt) ← split(X, y)
   2 (C, b) ← tune(Xa, ya, coarse grids, small training set)
   3 fine grids ← fit_grid(C, b)
   4 (C, b) ← tune(Xa, ya, fine grids, large training set)
   5 model ← train(Xa, ya, C, b, options)
   6 error ← test(Xt, yt, C, b, options)
The computing time is the key issue.

Evaluation measures: the span bound.

Bibliography
http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
http://www.cs.odu.edu/~mukka/cs795sum13dm/Lecturenotes/Day3/Chapter5.pdf
http://www.cs.cmu.edu/~epxing/Class/10701-10s/Lecture/lecture8.pdf
http://www.mohakshah.com/tutorials/icml2012/Tutorial-ICML2012/Tutorial_at_ICML_2012_files/ICML2012-Tutorial.pdf

Lecture 6: Minimum encoding ball and Support Vector Data Description (SVDD)
Stéphane Canu, stephane.canu@litislab.eu
Sao Paulo 2014, May 12, 2014
Stéphane Canu (INSA Rouen - LITIS)

Plan
1 Support Vector Data Description (SVDD)
   SVDD, the smallest enclosing ball problem
   The minimum enclosing ball problem with errors
   The minimum enclosing ball problem in a RKHS
   The two-class support vector data description (SVDD)

The minimum enclosing ball (MEB) problem [Tax and Duin, 2004]
Given n points $\{x_i,\ i = 1, ..., n\}$, find the center c and radius R of the smallest ball containing them all:
   $\min_{R \in \mathbb{R},\, c \in \mathbb{R}^d} R^2$ with $\|x_i - c\|^2 \le R^2$, i = 1, ..., n
Where does this sit in the convex programming hierarchy: LP, QP, QCQP, SOCP, SDP?

The convex programming hierarchy (part of)
   LP:   $\min_x f^\top x$ with $Ax \le d$ and $0 \le x$
   QP:   $\min_x \frac{1}{2}x^\top Gx + f^\top x$ with $Ax \le d$
   QCQP: $\min_x \frac{1}{2}x^\top Gx + f^\top x$ with $x^\top B_i x + a_i^\top x \le d_i$, i = 1, ..., n
   SOCP: $\min_x f^\top x$ with $\|x - a_i\| \le b_i^\top x + d_i$, i = 1, ..., n
Model generality: LP < QP < QCQP < SOCP < SDP.

MEB as a QP in the primal
Theorem (MEB as a QP). The two following problems are equivalent:
   $\min_{R, c} R^2$ with $\|x_i - c\|^2 \le R^2$, i = 1, ..., n
   $\min_{w, \rho} \frac{1}{2}\|w\|^2 - \rho$ with $w^\top x_i \ge \rho + \frac{1}{2}\|x_i\|^2$
with $\rho = \frac{1}{2}(\|c\|^2 - R^2)$ and $w = c$.
Proof:
   $\|x_i - c\|^2 \le R^2$
   $\|x_i\|^2 - 2x_i^\top c + \|c\|^2 \le R^2$
   $2x_i^\top c \ge \|x_i\|^2 + \|c\|^2 - R^2$
   $x_i^\top c \ge \underbrace{\tfrac{1}{2}(\|c\|^2 - R^2)}_{\rho} + \tfrac{1}{2}\|x_i\|^2$

MEB and the one-class SVM
SVDD: $\min_{w, \rho} \frac{1}{2}\|w\|^2 - \rho$ with $w^\top x_i \ge \rho + \frac{1}{2}\|x_i\|^2$.
If $\|x_i\|^2$ is constant for all i = 1, ..., n, this is the linear one-class SVM (OCSVM) [Schölkopf and Smola, 2002]:
   $\min_{w, \rho'} \frac{1}{2}\|w\|^2 - \rho'$ with $w^\top x_i \ge \rho'$, where $\rho' = \rho + \frac{1}{2}\|x_i\|^2$
⇒ the OCSVM is a particular case of the SVDD.
When $\|x_i\|^2 = 1$ for all i: $\|x_i - c\|^2 \le R^2 \iff w^\top x_i \ge \rho$ with $\rho = \frac{1}{2}(\|c\|^2 - R^2 + 1)$. "Belonging to the ball" is also "being above" a hyperplane.

MEB: KKT conditions
   $L(c, R, \alpha) = R^2 + \sum_{i=1}^n \alpha_i\big(\|x_i - c\|^2 - R^2\big)$
KKT conditions:
 - Stationarity: $2c\sum_i \alpha_i - 2\sum_i \alpha_i x_i = 0$ (the representer theorem), and $1 - \sum_i \alpha_i = 0$
 - Primal admissibility: $\|x_i - c\|^2 \le R^2$
 - Dual admissibility: $\alpha_i \ge 0$, i = 1, ..., n
 - Complementarity: $\alpha_i\big(\|x_i - c\|^2 - R^2\big) = 0$, i = 1, ..., n
Complementarity tells us there are two groups of points: the support vectors, with $\|x_i - c\|^2 = R^2$, and the insiders, with $\alpha_i = 0$.

MEB: dual
The representer theorem and $\sum_i \alpha_i = 1$ give $c = \sum_{i=1}^n \alpha_i x_i$. Plugging into the Lagrangian:
   $L(\alpha) = \sum_{i=1}^n \alpha_i\big\|x_i - \sum_{j=1}^n \alpha_j x_j\big\|^2$
With $\sum_i\sum_j \alpha_i\alpha_j x_i^\top x_j = \alpha^\top G\alpha$ and $\sum_i \alpha_i x_i^\top x_i = \alpha^\top\mathrm{diag}(G)$, where $G = XX^\top$ is the Gram matrix, $G_{ij} = x_i^\top x_j$:
   $\min_{\alpha \in \mathbb{R}^n} \alpha^\top G\alpha - \alpha^\top\mathrm{diag}(G)$ with $e^\top\alpha = 1$ and $0 \le \alpha_i$, i = 1, ..., n

SVDD primal vs. dual
 - Primal: d + 1 unknowns, n constraints; can be recast as a QP; perfect when d << n.
 - Dual: n unknowns, with G the pairwise-influence Gram matrix; n box constraints; easy to solve; to be used when d > n.
But where is R²?

Looking for R²
The Lagrangian of the dual: $L(\alpha, \mu, \beta) = \alpha^\top G\alpha - \alpha^\top\mathrm{diag}(G) + \mu(e^\top\alpha - 1) - \beta^\top\alpha$
Stationarity: $\nabla_\alpha L(\alpha, \mu, \beta) = 2G\alpha - \mathrm{diag}(G) + \mu e - \beta = 0$
The bi-dual: $\min_\alpha \alpha^\top G\alpha + \mu$ with $-2G\alpha + \mathrm{diag}(G) \le \mu e$
By identification, $R^2 = \mu + \alpha^\top G\alpha = \mu + \|c\|^2$, where µ is the Lagrange multiplier associated with the equality constraint $\sum_{i=1}^n \alpha_i = 1$.
Also, by the complementarity condition, if $x_i$ is a support vector, then $\beta_i = 0$ implies $\alpha_i > 0$ and $R^2 = \|x_i - c\|^2$.

The minimum enclosing ball problem with errors (slack)
The same road map: initial formulation; reformulation (as a QP); Lagrangian and KKT; dual formulation; bi-dual.
Initial formulation, for a given C:
   $\min_{R, c, \xi} R^2 + C\sum_{i=1}^n \xi_i$ with $\|x_i - c\|^2 \le R^2 + \xi_i$ and $\xi_i \ge 0$, i = 1, ..., n
The MEB with slack as a QP:
   $\min_{w, \rho, \xi} \frac{1}{2}\|w\|^2 - \rho + \frac{C}{2}\sum_i \xi_i$ with $w^\top x_i \ge \rho + \frac{1}{2}\|x_i\|^2 - \frac{1}{2}\xi_i$ and $\xi_i \ge 0$, i = 1, ..., n
again with the OCSVM as a particular case. With $G = XX^\top$, the dual is:
   $\min_\alpha \alpha^\top G\alpha - \alpha^\top\mathrm{diag}(G)$ with $e^\top\alpha = 1$ and $0 \le \alpha_i \le C$, i = 1, ..., n
for a given C ≤ 1.
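The MEB dual above is a small QP; for intuition, the primal can also be solved approximately without any QP solver by the Badoiu-Clarkson core-set iteration (this algorithm is not part of the lecture; it is a standard (1+ε)-approximation sketch, shown here on toy 2-d points):

```python
import math

def minimum_enclosing_ball(points, iters=2000):
    """Badoiu-Clarkson iteration: repeatedly move the center a step of
    size 1/(t+1) toward the current farthest point; the resulting radius
    converges to the optimal R at rate O(1/t)."""
    c = list(points[0])
    for t in range(1, iters + 1):
        far = max(points, key=lambda x: math.dist(x, c))
        step = 1.0 / (t + 1)
        c = [ci + step * (xi - ci) for ci, xi in zip(c, far)]
    radius = max(math.dist(x, c) for x in points)
    return c, radius

# four points on the unit circle: the exact MEB is centered at the origin, R = 1
c, R = minimum_enclosing_ball([(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)])
```

The support vectors of the QP formulation correspond exactly to the "farthest points" this iteration keeps pulling the center toward.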
If C is larger than one, the box constraint is useless: it is the no-slack case. As before, $R^2 = \mu + c^\top c$, with µ denoting the Lagrange multiplier associated with the equality constraint $\sum_{i=1}^n \alpha_i = 1$.

Variations over SVDD
Adaptive SVDD, the weighted-error case, for given $w_i$, i = 1, ..., n:
   $\min_{c, R, \xi} R + C\sum_{i=1}^n w_i\xi_i$ with $\|x_i - c\|^2 \le R + \xi_i$, $\xi_i \ge 0$, i = 1, ..., n
The dual of this problem is a QP [see for instance Liu et al., 2013]:
   $\min_{\alpha} \alpha^\top XX^\top\alpha - \alpha^\top\mathrm{diag}(XX^\top)$ with $\sum_i \alpha_i = 1$ and $0 \le \alpha_i \le Cw_i$, i = 1, ..., n
Density-induced SVDD (D-SVDD):
   $\min_{c, R, \xi} R + C\sum_{i=1}^n \xi_i$ with $w_i\|x_i - c\|^2 \le R + \xi_i$, $\xi_i \ge 0$, i = 1, ..., n

SVDD in a RKHS
The feature map: $c \mapsto f(\bullet)$, $x_i \mapsto k(x_i, \bullet)$, and $\|x_i - c\|^2_{\mathbb{R}^p} \le R^2 \mapsto \|k(x_i, \bullet) - f(\bullet)\|_H^2 \le R^2$.
Kernelized SVDD (in a RKHS) is also a QP:
   $\min_{f, R, \xi} R^2 + C\sum_{i=1}^n \xi_i$ with $\|k(x_i, \bullet) - f(\bullet)\|_H^2 \le R^2 + \xi_i$, $\xi_i \ge 0$, i = 1, ..., n

SVDD in a RKHS: KKT, dual and R²
   $L = R^2 + C\sum_i\xi_i + \sum_i\alpha_i\big(\|k(x_i,\cdot) - f(\cdot)\|_H^2 - R^2 - \xi_i\big) - \sum_i\beta_i\xi_i$
   $\phantom{L} = R^2 + C\sum_i\xi_i + \sum_i\alpha_i\big(k(x_i, x_i) - 2f(x_i) + \|f\|_H^2 - R^2 - \xi_i\big) - \sum_i\beta_i\xi_i$
KKT conditions:
 - Stationarity: $2f(\cdot)\sum_i\alpha_i - 2\sum_i\alpha_i k(\cdot, x_i) = 0$ (the representer theorem); $1 - \sum_i\alpha_i = 0$; $C - \alpha_i - \beta_i = 0$
 - Primal admissibility: $\|k(x_i,\cdot) - f(\cdot)\|^2 \le R^2 + \xi_i$, $\xi_i \ge 0$
 - Dual admissibility: $\alpha_i \ge 0$, $\beta_i \ge 0$
 - Complementarity: $\alpha_i\big(\|k(x_i,\cdot) - f(\cdot)\|^2 - R^2 - \xi_i\big) = 0$ and $\beta_i\xi_i = 0$
Dual, using $f(\cdot) = \sum_j \alpha_j k(\cdot, x_j)$:
   $L(\alpha) = \sum_i\alpha_i k(x_i, x_i) - 2\sum_i\alpha_i f(x_i) + \|f\|_H^2 = \sum_i\alpha_i k(x_i, x_i) - \sum_i\sum_j\alpha_i\alpha_j\underbrace{k(x_i, x_j)}_{G_{ij}}$
   $\min_\alpha \alpha^\top G\alpha - \alpha^\top\mathrm{diag}(G)$ with $e^\top\alpha = 1$ and $0 \le \alpha_i \le C$, i = 1, ..., n
As in the linear case, $R^2 = \mu + \|f\|_H^2$, with µ denoting the Lagrange multiplier associated with the equality constraint $\sum_i\alpha_i = 1$.

SVDD train and val in a RKHS
Train using the dual form (in: G, C; out: α, µ). Val with the center in the RKHS, $f(\cdot) = \sum_i\alpha_i k(\cdot, x_i)$:
   $\phi(x) = \|k(x,\cdot) - f(\cdot)\|_H^2 - R^2 = k(x,x) - 2f(x) + \|f\|_H^2 - R^2 = -2f(x) + k(x,x) - \mu = -2\sum_i\alpha_i k(x, x_i) + k(x,x) - \mu$
φ(x) = 0 is the decision border.

An important theoretical result
For a well-calibrated bandwidth, the SVDD estimates the underlying distribution level set [Vert and Vert, 2006]. The level sets of a probability density function IP(x) are the sets $C_p = \{x \in \mathbb{R}^d \mid IP(x) \ge p\}$. They are well estimated by the empirical minimum volume sets
   $V_p = \{x \in \mathbb{R}^d \mid \|k(x,\cdot) - f(\cdot)\|_H^2 - R^2 \le 0\}$
The frontiers coincide.

SVDD: the generalization error
For a well-calibrated bandwidth and $(x_1, \dots, x_n)$ i.i.d. from some fixed but unknown IP(x), [Shawe-Taylor and Cristianini, 2004] show: with probability at least 1 − δ (∀δ ∈ ]0, 1[), for any margin m > 0,
   $IP\big(\|k(x,\cdot) - f(\cdot)\|_H^2 \ge R^2 + m\big) \le \frac{1}{mn}\sum_{i=1}^n\xi_i + \frac{6R^2}{m\sqrt n} + 3\sqrt{\frac{\ln(2/\delta)}{2n}}$

Equivalence between SVDD and OCSVM for translation-invariant (diagonal-constant) kernels
Theorem. Let H be a RKHS on some domain X endowed with kernel k. If there exists some constant c such that ∀x ∈ X, k(x, x) = c, then the two following problems are equivalent:
   $\min_{f, R, \xi} R + C\sum_i\xi_i$ with $\|k(x_i,\cdot) - f(\cdot)\|_H^2 \le R + \xi_i$, $\xi_i \ge 0$, i = 1, ..., n
   $\min_{f, \rho, \varepsilon} \frac{1}{2}\|f\|_H^2 - \rho + C\sum_i\varepsilon_i$ with $f(x_i) \ge \rho - \varepsilon_i$, $\varepsilon_i \ge 0$, i = 1, ..., n
with $\rho = \frac{1}{2}(c + \|f\|_H^2 - R)$ and $\varepsilon_i = \frac{1}{2}\xi_i$.

Proof of the equivalence. Since $\|k(x_i,\cdot) - f(\cdot)\|_H^2 = k(x_i, x_i) + \|f\|_H^2 - 2f(x_i)$, the SVDD constraint reads
   $2f(x_i) \ge k(x_i, x_i) + \|f\|_H^2 - R - \xi_i$, $\xi_i \ge 0$, i = 1, ..., n
Introducing $\rho = \frac{1}{2}(c + \|f\|_H^2 - R)$, that is $R = c + \|f\|_H^2 - 2\rho$, and since $k(x_i, x_i)$ is constant and equal to c, the SVDD problem becomes
   $\min_{f, \rho, \xi} \frac{1}{2}\|f\|_H^2 - \rho + \frac{C}{2}\sum_i\xi_i$ with $f(x_i) \ge \rho - \frac{1}{2}\xi_i$, $\xi_i \ge 0$, i = 1, ..., n
leading, with $\varepsilon_i = \frac{1}{2}\xi_i$, to the classical one-class SVM formulation (OCSVM):
   $\min_{f, \rho, \varepsilon} \frac{1}{2}\|f\|_H^2 - \rho + C\sum_i\varepsilon_i$ with $f(x_i) \ge \rho - \varepsilon_i$, $\varepsilon_i \ge 0$, i = 1, ..., n
Note that by putting $\nu = \frac{1}{nC}$ we get the so-called ν-formulation of the OCSVM:
   $\min_{f', \rho', \xi'} \frac{1}{2}\|f'\|_H^2 - n\nu\rho' + \sum_i\xi'_i$ with $f'(x_i) \ge \rho' - \xi'_i$, $\xi'_i \ge 0$, i = 1, ..., n
with $f' = Cf$, $\rho' = C\rho$, and $\xi' = C\xi$.

Duality
The dual of the SVDD is
   $\min_\alpha \alpha^\top G\alpha - \alpha^\top g$ with $\sum_i\alpha_i = 1$ and $0 \le \alpha_i \le C$, i = 1, ..., n
where G is the kernel matrix of general term $G_{ij} = k(x_i, x_j)$ and g the diagonal vector such that $g_i = k(x_i, x_i) = c$. The dual of the OCSVM is the following equivalent QP:
   $\min_\alpha \frac{1}{2}\alpha^\top G\alpha$ with $\sum_i\alpha_i = 1$ and $0 \le \alpha_i \le C$, i = 1, ..., n
Both dual forms provide the same solution α, but not the same Lagrange multipliers. ρ is the Lagrange multiplier of the equality constraint of the OCSVM dual, and $R = c + \alpha^\top G\alpha - 2\rho$. Using the SVDD dual, it turns out that $R = \lambda_{eq} + \alpha^\top G\alpha$, where $\lambda_{eq}$ is the Lagrange multiplier of the equality constraint of the SVDD dual form.

The two-class support vector data description (SVDD)
[figure: two-class SVDD boundaries on 2D data]
The two-class SVDD problem:
   $\min_{c, R, \xi^+, \xi^-} R^2 + C\big(\sum_{y_i = 1}\xi_i^+ + \sum_{y_i = -1}\xi_i^-\big)$
   with $\|x_i - c\|^2 \le R^2 + \xi_i^+$, $\xi_i^+ \ge 0$ for i such that $y_i = 1$,
   and $\|x_i - c\|^2 \ge R^2 - \xi_i^-$, $\xi_i^- \ge 0$ for i such that $y_i = -1$.

The two-class SVDD as a QP. Expanding the squares:
   $\|x_i\|^2 - 2x_i^\top c + \|c\|^2 \le R^2 + \xi_i^+$ ($y_i = 1$), $\quad\|x_i\|^2 - 2x_i^\top c + \|c\|^2 \ge R^2 - \xi_i^-$ ($y_i = -1$)
   $2x_i^\top c \ge \|c\|^2 - R^2 + \|x_i\|^2 - \xi_i^+$ ($y_i = 1$), $\quad-2x_i^\top c \ge -\|c\|^2 + R^2 - \|x_i\|^2 - \xi_i^-$ ($y_i = -1$)
   combined: $2y_i x_i^\top c \ge y_i(\|c\|^2 - R^2 + \|x_i\|^2) - \xi_i$, $\xi_i \ge 0$, i = 1, ..., n
Change of variable $\rho = \|c\|^2 - R^2$:
   $\min_{c, \rho, \xi} \|c\|^2 - \rho + C\sum_{i=1}^n\xi_i$ with $2y_i x_i^\top c \ge y_i(\rho + \|x_i\|^2) - \xi_i$ and $\xi_i \ge 0$, i = 1, ..., n

The dual of the two-class SVDD, with $G_{ij} = y_i y_j x_i^\top x_j$:
   $\min_{\alpha} \alpha^\top G\alpha - \sum_{i=1}^n \alpha_i y_i\|x_i\|^2$ with $\sum_{i=1}^n y_i\alpha_i = 1$ and $0 \le \alpha_i \le C$, i = 1, ..., n

The two-class SVDD vs. the one-class SVDD
[figure: the two-class SVDD (left) vs. the one-class SVDD (right)]

Small sphere and large margin (SSLM) approach
Support vector data description with margin [Wu and Ye, 2009]:
   $\min_{c, R, \xi} R^2 + C\big(\sum_{y_i = 1}\xi_i^+ + \sum_{y_i = -1}\xi_i^-\big)$
   with $\|x_i - c\|^2 \le R^2 - 1 + \xi_i^+$, $\xi_i^+ \ge 0$ ($y_i = 1$), and $\|x_i - c\|^2 \ge R^2 + 1 - \xi_i^-$, $\xi_i^- \ge 0$ ($y_i = -1$)
Since $\|x_i - c\|^2 \ge R^2 + 1 - \xi_i^-$ with $y_i = -1$ ⇔ $y_i\|x_i - c\|^2 \le y_i R^2 - 1 + \xi_i^-$, the Lagrangian is
   $L(c, R, \xi, \alpha, \beta) = R^2 + C\sum_i\xi_i + \sum_i\alpha_i\big(y_i\|x_i - c\|^2 - y_i R^2 + 1 - \xi_i\big) - \sum_i\beta_i\xi_i$
Optimality: $c = \sum_i\alpha_i y_i x_i$; $\sum_i\alpha_i y_i = 1$; $0 \le \alpha_i \le C$. Then
   $L(\alpha) = \sum_i\alpha_i y_i\big\|x_i - \sum_j\alpha_j y_j x_j\big\|^2 + \sum_i\alpha_i = -\sum_i\sum_j\alpha_i\alpha_j y_i y_j x_j^\top x_i + \sum_i\|x_i\|^2 y_i\alpha_i + \sum_i\alpha_i$
The dual SVDD with margin is also a quadratic program:
   $\min_{\alpha} \alpha^\top G\alpha - e^\top\alpha - f^\top\alpha$ with $y^\top\alpha = 1$ and $0 \le \alpha_i \le C$, i = 1, ..., n
with G the symmetric n × n matrix $G_{ij} = y_i y_j x_j^\top x_i$ and $f_i = \|x_i\|^2 y_i$.

Conclusion
Applications: outlier detection, change detection, clustering, large numbers of classes, variable selection, ...
A clear path: reformulation (to a standard problem), KKT, dual, bi-dual.
A lot of variations: L2-SVDD, two non-symmetric classes, two symmetric classes (SVM), the multi-class issue.
Practical problems arise with translation-invariant kernels.

Bibliography
Bo Liu, Yanshan Xiao, Longbing Cao, Zhifeng Hao, and Feiqi Deng. SVDD-based outlier detection on uncertain data. Knowledge and Information Systems, 34(3):597-618, 2013.
B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
David M. J. Tax and Robert P. W. Duin. Support vector data description. Machine Learning, 54(1):45-66, 2004.
Régis Vert and Jean-Philippe Vert. Consistency and convergence rates of one-class SVMs and related algorithms. Journal of Machine Learning Research, 7:817-854, 2006.
Mingrui Wu and Jieping Ye. A small sphere and large margin approach for novelty detection using training data with outliers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2088-2092, 2009.

Lecture 5: SVM as a kernel machine
Stéphane Canu, stephane.canu@litislab.eu
Sao Paulo 2014, April 26, 2014
Stéphane Canu (INSA Rouen - LITIS)

Plan
1 Kernel machines
   Non sparse kernel machines
   Sparse kernel machines: SVM
   SVM: variations on a theme
   Sparse kernel machines for regression: SVR
[figure: decision function level sets −1, 0, 1]

Interpolation splines
Find $f \in H$ such that $f(x_i) = y_i$, i = 1, ..., n. It is an ill-posed problem.

Interpolation splines: minimum norm interpolation
   $\min_{f \in H} \frac{1}{2}\|f\|_H^2$ such that $f(x_i) = y_i$, i = 1, ..., n
The Lagrangian ($\alpha_i$ are the Lagrange multipliers):
   $L(f, \alpha) = \frac{1}{2}\|f\|^2 - \sum_{i=1}^n \alpha_i\big(f(x_i) - y_i\big)$
Optimality for f: $\nabla_f L(f, \alpha) = 0 \iff f(x) = \sum_{i=1}^n \alpha_i k(x_i, x)$
Dual formulation (remove f from the Lagrangian):
   $Q(\alpha) = -\frac{1}{2}\sum_i\sum_j \alpha_i\alpha_j k(x_i, x_j) + \sum_i \alpha_i y_i$
Solution: $\max_{\alpha \in \mathbb{R}^n} Q(\alpha) \iff K\alpha = y$

Representer theorem
Theorem (representer theorem). Let H be a RKHS with kernel k(s, t). Let ℓ be a loss function from X to IR and Φ a nondecreasing function from IR to IR.
If there exists a function f ∗minimizing: f ∗ = argmin f ∈H Xn i=1 ℓ yi , f (xi)  + Φ kf k 2 H  then there exists a vector α ∈ IRn such that: f ∗ (x) = Xn i=1 αik(x, xi) it can be generalized to the semi parametric case: + Pm j=1 βjφj(x)Elements of a proof 1 Hs = span{k(., x1), ..., k(., xi), ..., k(., xn)} 2 orthogonal decomposition: H = Hs ⊕ H⊥ ⇒ ∀f ∈ H; f = fs + f⊥ 3 pointwise evaluation decomposition f (xi) = fs (xi) + f⊥(xi) = hfs (.), k(., xi)iH + hf⊥(.), k(., xi)iH | {z } =0 = fs (xi) 4 norm decomposition kf k 2 H = kfsk 2 H + kf⊥k 2 H | {z } ≥0 ≥ kfsk 2 H 5 decompose the global cost Xn i=1 ℓ yi , f (xi)  + Φ kf k 2 H  = Xn i=1 ℓ yi , fs (xi)  + Φ kfsk 2 H + kf⊥k 2 H  ≥ Xn i=1 ℓ yi , fs (xi)  + Φ kfsk 2 H  6 argmin f ∈H = argmin f ∈Hs .Smooting splines introducing the error (the slack) ξ = f (xi) − yi (S)    min f ∈H 1 2 kf k 2 H + 1 2λ Xn i=1 ξ 2 i such that f (xi) = yi + ξi , i = 1, n 3 equivalent definitions (S ′ ) min f ∈H 1 2 Xn i=1 f (xi ) − yi 2 + λ 2 kf k 2 H    min f ∈H 1 2 kf k 2 H such that Xn i=1 f (xi ) − yi 2 ≤ C ′    min f ∈H Xn i=1 f (xi ) − yi 2 such that kf k 2 H ≤ C ′′ using the representer theorem (S ′′) min α∈IRn 1 2 kKα − yk 2 + λ 2 α ⊤Kα solution: (S) ⇔ (S ′ ) ⇔ (S ′′) ⇔ α = (K + λI) −1 y 6= ridge regression: min α∈IRn 1 2 kKα − yk 2 + λ 2 α ⊤α with α = (K ⊤K + λI) −1K ⊤yKernel logistic regression inspiration: the Bayes rule D(x) = sign f (x) + α0  =⇒ log  IP(Y =1|x) IP(Y =−1|x)  = f (x) + α0 probabilities: IP(Y = 1|x) = expf (x)+α0 1 + expf (x)+α0 IP(Y = −1|x) = 1 1 + expf (x)+α0 Rademacher distribution L(xi , yi , f , α0) = IP(Y = 1|xi) yi +1 2 (1 − IP(Y = 1|xi)) 1−yi 2 penalized likelihood J(f , α0) = − Xn i=1 log L(xi , yi , f , α0)  + λ 2 kf k 2 H = Xn i=1 log  1 + exp−yi (f (xi )+α0)  + λ 2 kf k 2 HKernel logistic regression (2) (R)    min f ∈H 1 2 kf k 2 H + 1 λ Xn i=1 log 1 + exp−ξi  with ξi = yi (f (xi) + α0), i = 1, n Representer theorem J(α, α0) = 1I⊤ log  1I + expdiag(y)Kα+α0y  + λ 2 α ⊤Kα gradient 
vector anf Hessian matrix: ∇αJ(α, α0) = K y − (2p − 1I)  + λKα HαJ(α, α0) = Kdiag p(1I − p)  K + λK solve the problem using Newton iterations α new = α old+ Kdiag p(1I − p)  K + λK −1 K y − (2p − 1I) + λαLet’s summarize pros ◮ Universality ◮ from H to IRn using the representer theorem ◮ no (explicit) curse of dimensionality splines O(n 3 ) (can be reduced to O(n 2 )) logistic regression O(kn3 ) (can be reduced to O(kn2 ) no scalability! sparsity comes to the rescue!Roadmap 1 Kernel machines Non sparse kernel machines Sparse kernel machines: SVM SVM: variations on a theme Sparse kernel machines for regression: SVR −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 −1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Stéphane Canu (INSA Rouen - LITIS) April 26, 2014 11 / 38SVM in a RKHS: the separable case (no noise)    max f ,b m with yi f (xi) + b  ≥ m and kf k 2 H = 1 ⇔ ( min f ,b 1 2 kf k 2 H with yi f (xi) + b  ≥ 1 3 ways to represent function f f (x) | {z } in the RKHS H = X d j=1 wj φj(x) | {z } d features = Xn i=1 αi yi k(x, xi) | {z } n data points ( min w,b 1 2 kwk 2 IRd = 1 2 w⊤w with yi w⊤φ(xi) + b  ≥ 1 ⇔ ( min α,b 1 2 α ⊤Kα with yi α ⊤K(:, i) + b  ≥ 1using relevant features... 
a data point becomes a function: x −→ k(x, •)

Representer theorem for SVM
$\min_{f,b} \ \frac{1}{2}\|f\|^2_H$ with $y_i\big(f(x_i) + b\big) \ge 1$
Lagrangian: $L(f, b, \alpha) = \frac{1}{2}\|f\|^2_H - \sum_{i=1}^n \alpha_i\big(y_i(f(x_i) + b) - 1\big)$, $\alpha \ge 0$
optimality condition: $\nabla_f L(f, b, \alpha) = 0 \ \Leftrightarrow\ f(x) = \sum_{i=1}^n \alpha_i y_i k(x_i, x)$
Eliminate f from L: $\|f\|^2_H = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j)$ and $\sum_{i=1}^n \alpha_i y_i f(x_i) = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j)$, giving
$Q(b, \alpha) = -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j y_i y_j k(x_i, x_j) - b\sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i$

Dual formulation for SVM
the intermediate function $Q(b, \alpha)$ above is solved through $\max_\alpha \min_b Q(b, \alpha)$; b can be seen as the Lagrange multiplier of the (balancing) constraint $\sum_{i=1}^n \alpha_i y_i = 0$, which is also the KKT optimality condition on b
Dual formulation: $\max_{\alpha\in\mathbb{R}^n} \ -\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i\alpha_j y_i y_j k(x_i, x_j) + \sum_{i=1}^n \alpha_i$ such that $\sum_{i=1}^n \alpha_i y_i = 0$ and $0 \le \alpha_i$, $i = 1, n$
The dual formulation gives a quadratic program (QP): $\min_{\alpha\in\mathbb{R}^n} \ \frac{1}{2}\alpha^\top G\alpha - \mathbf{1}^\top\alpha$ with $\alpha^\top y = 0$ and $0 \le \alpha$, where $G_{ij} = y_i y_j k(x_i, x_j)$
with the linear kernel $f(x) = \sum_{i=1}^n \alpha_i y_i (x^\top x_i) = \sum_{j=1}^d \beta_j x_j$; when d is small wrt.
n primal may be interesting.the general case: C-SVM Primal formulation (P)    min f ∈H,b,ξ∈IRn 1 2 kf k 2 + C p Xn i=1 ξ p i such that yi f (xi) + b  ≥ 1 − ξi , ξi ≥ 0, i = 1, n C is the regularization path parameter (to be tuned) p = 1 , L1 SVM ( max α∈IRn − 1 2 α ⊤Gα + α ⊤1I such that α ⊤y = 0 and 0 ≤ αi ≤ C i = 1, n p = 2, L2 SVM ( max α∈IRn − 1 2 α ⊤ G + 1 C I  α + α ⊤1I such that α ⊤y = 0 and 0 ≤ αi i = 1, n the regularization path: is the set of solutions α(C) when C variesData groups: illustration f (x) = Xn i=1 αi k(x, xi ) D(x) = sign f (x) + b  useless data important data suspicious data well classified support α = 0 0 < α < C α = C the regularization path: is the set of solutions α(C) when C variesThe importance of being support f (x) = Xn i=1 αi yik(xi , x) data point α constraint value set xi useless αi = 0 yi f (xi) + b  > 1 I0 xi support 0 < αi < C yi f (xi) + b  = 1 Iα xi suspicious αi = C yi f (xi) + b  < 1 IC Table : When a data point is « support » it lies exactly on the margin. here lies the efficiency of the algorithm (and its complexity)! sparsity: αi = 0The active set method for SVM (1)    min α∈IRn 1 2 α ⊤Gα − α ⊤1I such that α ⊤y = 0 i = 1, n and 0 ≤ αi i = 1, n    Gα − 1I − β + by = 0 α ⊤y = 0 0 ≤ αi i = 1, n 0 ≤ βi i = 1, n αiβi = 0 i = 1, n αa 0 − − + b 1 1 0 β0 ya y0 = 0 0 G α − − 1I β + b y = 0 Ga Gi G0 G ⊤ i (1) Gaαa − 1Ia + bya = 0 (2) Giαa − 1I0 − β0 + by0 = 0 1 solve (1) (find α together with b) 2 if α < 0 move it from Iα to I0 goto 1 3 else solve (2) if β < 0 move it from I0 to Iα goto 1The active set method for SVM (2) Function (α, b, Iα) ←Solve_QP_Active_Set(G, y) % Solve minα 1/2α⊤Gα − 1I⊤α % s.t. 
0 ≤ α and y⊤α = 0 (Iα, I0, α) ← initialization while The_optimal_is_not_reached do (α, b) ← solve  Gaαa − 1Ia + bya y⊤ a αa = 0 if ∃i ∈ Iα such that αi < 0 then α ← projection( αa, α) move i from Iα to I0 else if ∃j ∈ I0 such that βj < 0 then use β0 = y0(Kiαa + b1I0) − 1I0 move j from I0 to Iα else The_optimal_is_not_reached ← FALSE end if end while α α old α new Projection step of the active constraints algorithm d = alpha - alphaold; alpha = alpha + t * d; Caching Strategy Save space and computing time by computing only the needed parts of kernel matrix GTwo more ways to derivate SVM Using the hinge loss min f ∈H,b∈IR 1 p Xn i=1 max 0, 1 − yi(f (xi) + b) p + 1 2C kf k 2 Minimizing the distance between the convex hulls    min α ku − vk 2 H with u(x) = X {i|yi =1} αik(xi , x), v(x) = X {i|yi =−1} αik(xi , x) and X {i|yi =1} αi = 1, X {i|yi =−1} αi = 1, 0 ≤ αi i = 1, n f (x) = 2 ku − vk 2 H u(x) − v(x)  and b = kuk 2 H − kvk 2 H ku − vk 2 H the regularization path: is the set of solutions α(C) when C variesRegularization path for SVM min f ∈H Xn i=1 max(1 − yif (xi), 0) + λo 2 kf k 2 H Iα is the set of support vectors s.t. yi f (xi) = 1; ∂f J(f ) = X i∈Iα γi yiK(xi , •) − X i∈I1 yiK(xi , •) + λo f (•) with γi ∈ ∂H(1) =] − 1, 0[Regularization path for SVM min f ∈H Xn i=1 max(1 − yif (xi), 0) + λo 2 kf k 2 H Iα is the set of support vectors s.t. 
yi f (xi) = 1; ∂f J(f ) = X i∈Iα γi yiK(xi , •) − X i∈I1 yiK(xi , •) + λo f (•) with γi ∈ ∂H(1) =] − 1, 0[ Let λn a value close enough to λo to keep the sets I0, Iα and IC unchanged In particular at point xj ∈ Iα (fo (xj) = fn(xj) = yj) : ∂f J(f )(xj) = 0 P i∈Iα γioyiK(xi , xj) = P i∈I1 yiK(xi P , xj) − λo yj i∈Iα γinyiK(xi , xj) = P i∈I1 yiK(xi , xj) − λn yj G(γn − γo) = (λo − λn)y avec Gij = yiK(xi , xj) γn = γo + (λo − λn)w w = (G) −1 yExample of regularization path γi ∈] − 1, 0[ yiγi ∈] − 1, −1[ λ = 1 C γi = − 1 C αi ; performing together estimation and data selectionHow to choose ℓ and P to get linear regularization path? the path is piecewise linear ⇔ one is piecewise quadratic and the other is piecewise linear the convex case [Rosset & Zhu, 07] min β∈IRd ℓ(β) + λP(β) 1 piecewise linearity: lim ε→0 β(λ + ε) − β(λ) ε = constant 2 optimality ∇ℓ(β(λ)) + λ∇P(β(λ)) = 0 ∇ℓ(β(λ + ε)) + (λ + ε)∇P(β(λ + ε)) = 0 3 Taylor expension lim ε→0 β(λ + ε) − β(λ) ε = ∇2 ℓ(β(λ)) + λ∇2P(β(λ))−1∇P(β(λ)) ∇2 ℓ(β(λ)) = constant and ∇2P(β(λ)) = 0 Lecture 4: kernels and associated functions Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 March 4, 2014Plan 1 Statistical learning and kernels Kernel machines Kernels Kernel and hypothesis set Functional differentiation in RKHSIntroducing non linearities through the feature map SVM Val f (x) = X d j=1 xjwj + b = Xn i=1 αi(x ⊤ i x) + b  t1 t2  ∈ IR2 x1 x2 x3 x4 x5 linear in x ∈ IR5 Stéphane Canu (INSA Rouen - LITIS) March 4, 2014 3 / 37Introducing non linearities through the feature map SVM Val f (x) = X d j=1 xjwj + b = Xn i=1 αi(x ⊤ i x) + b  t1 t2  ∈ IR2 φ(t) = t1 x1 t 2 1 x2 t2 x3 t 2 2 x4 t1t2 x5 linear in x ∈ IR5 quadratic in t ∈ IR2 The feature map φ : IR2 −→ IR5 t 7−→ φ(t) = x x ⊤ i x = φ(ti) ⊤φ(t) Stéphane Canu (INSA Rouen - LITIS) March 4, 2014 3 / 37Introducing non linearities through the feature map A. Lorena & A. 
de Carvalho, Uma Introdução às Support Vector Machines, 2007 Stéphane Canu (INSA Rouen - LITIS) March 4, 2014 4 / 37

Non linear case: dictionary vs. kernel
in the non linear case, use a dictionary of functions $\phi_j(x)$, $j = 1, \dots, p$, with possibly $p = \infty$ (for instance polynomials, wavelets...)
$f(x) = \sum_{j=1}^p w_j \phi_j(x)$ with $w_j = \sum_{i=1}^n \alpha_i y_i \phi_j(x_i)$, so that $f(x) = \sum_{i=1}^n \alpha_i y_i \underbrace{\sum_{j=1}^p \phi_j(x_i)\phi_j(x)}_{k(x_i,\, x)}$
$p \ge n$: so what, since $k(x_i, x) = \sum_{j=1}^p \phi_j(x_i)\phi_j(x)$

closed form kernel: the quadratic kernel
The quadratic dictionary in $\mathbb{R}^d$: $\Phi : \mathbb{R}^d \to \mathbb{R}^{p},\ p = 1 + d + \frac{d(d+1)}{2}$, $s \mapsto \Phi = \big(1, s_1, s_2, \dots, s_d, s_1^2, s_2^2, \dots, s_d^2, \dots, s_i s_j, \dots\big)$
in this case $\Phi(s)^\top \Phi(t) = 1 + s_1 t_1 + s_2 t_2 + \dots + s_d t_d + s_1^2 t_1^2 + \dots + s_d^2 t_d^2 + \dots + s_i s_j t_i t_j + \dots$
The quadratic kernel: $s, t \in \mathbb{R}^d$, $k(s,t) = (s^\top t + 1)^2 = 1 + 2\, s^\top t + (s^\top t)^2$ computes the dot product of the reweighted dictionary $\Phi = \big(1, \sqrt{2}s_1, \sqrt{2}s_2, \dots, \sqrt{2}s_d, s_1^2, s_2^2, \dots, s_d^2, \dots, \sqrt{2}s_i s_j, \dots\big)$
$p = 1 + d + \frac{d(d+1)}{2}$ multiplications vs. $d + 1$: use the kernel to save computation

kernel: features through pairwise comparisons
x → φ(x), e.g. a text → its bag of words (BOW); Φ is n examples × p features, K is n examples × n examples with $k(x_i, x_j) = \sum_{\ell=1}^p \phi_\ell(x_i)\phi_\ell(x_j)$
K is the matrix of pairwise comparisons ($O(n^2)$)

Kernel machine
kernel as a dictionary: $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$, where $\alpha_i$, the influence of example i, depends on $y_i$, while the kernel $k(x, x_i)$ does NOT depend on $y_i$
Definition (Kernel): Let X be a non empty set (the input space). A kernel is a function k from X × X onto IR: $k : X \times X \to \mathbb{R}$, $(s, t) \mapsto k(s, t)$
semi-parametric version: given the family $q_j(x)$, $j = 1, \dots, p$: $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) + \sum_{j=1}^p \beta_j q_j(x)$

Kernel Machine
Definition (Kernel machines): $A\big((x_i, y_i)_{i=1,n}\big)(x) = \psi\Big(\sum_{i=1}^n \alpha_i k(x, x_i) + \sum_{j=1}^p \beta_j q_j(x)\Big)$, with α and β the parameters to be estimated.
Examples: $A(x) = \sum_{i=1}^n \alpha_i (x - x_i)^3_+ + \beta_0 + \beta_1 x$ (splines); $A(x) = \mathrm{sign}\big(\sum_{i\in I} \alpha_i \exp(-\tfrac{\|x - x_i\|^2}{b}) + \beta_0\big)$ (SVM); $\mathbb{P}(y|x) = \frac{1}{Z} \exp\big(\sum_{i\in I} \alpha_i \mathbb{1}_{\{y=y_i\}}(x^\top x_i + b)^2\big)$ (exponential family)

Plan 1 Statistical learning and kernels: Kernel machines; Kernels; Kernel and hypothesis set; Functional differentiation in RKHS Stéphane Canu (INSA Rouen - LITIS) March 4, 2014 10 / 37

In the beginning was the kernel...
Definition (Kernel) a function of two variable k from X × X to IR Definition (Positive kernel) A kernel k(s,t) on X is said to be positive if it is symetric: k(s,t) = k(t,s) an if for any finite positive interger n: ∀{αi}i=1,n ∈ IR, ∀{xi}i=1,n ∈ X , Xn i=1 Xn j=1 αiαjk(xi , xj) ≥ 0 it is strictly positive if for αi 6= 0 Xn i=1 Xn j=1 αiαjk(xi , xj) > 0Examples of positive kernels the linear kernel: s, t ∈ IRd , k(s, t) = s ⊤t symetric: s ⊤t = t ⊤s positive: Xn i=1 Xn j=1 αiαj k(xi , xj ) = Xn i=1 Xn j=1 αiαj x ⊤ i xj = Xn i=1 αi xi !⊤  Xn j=1 αj xj   = Xn i=1 αi xi 2 the product kernel: k(s, t) = g(s)g(t) for some g : IRd → IR, symetric by construction positive: Xn i=1 Xn j=1 αiαj k(xi , xj ) = Xn i=1 Xn j=1 αiαj g(xi )g(xj ) = Xn i=1 αi g(xi ) !  Xn j=1 αj g(xj )   = Xn i=1 αi g(xi ) !2 k is positive ⇔ (its square root exists) ⇔ k(s, t) = hφs, φti J.P. Vert, 2006Example: finite kernel let φj , j = 1, p be a finite dictionary of functions from X to IR (polynomials, wavelets...) the feature map and linear kernel feature map: Φ : X → IRp s 7→ Φ = φ1(s), ..., φp(s)  Linear kernel in the feature space: k(s, t) = φ1(s), ..., φp(s) ⊤ φ1(t), ..., φp(t)  e.g. the quadratic kernel: s, t ∈ IRd , k(s, t) = s ⊤t + b 2 feature map: Φ : IRd → IRp=1+d+ d(d +1) 2 s 7→ Φ = 1, √ 2s1, ..., √ 2sj , ..., √ 2sd ,s 2 1 , ...,s 2 j , ...,s 2 d , ..., √ 2sisj , ...Positive definite Kernel (PDK) algebra (closure) if k1(s,t) and k2(s,t) are two positive kernels DPK are a convex cone: ∀a1 ∈ IR+ a1k1(s, t) + k2(s, t) product kernel k1(s, t)k2(s, t) proofs by linearity: Xn i=1 Xn j=1 αiαj a1k1(i, j) + k2(i, j)  = a1 Xn i=1 Xn j=1 αiαj k1(i, j) +Xn i=1 Xn j=1 αiαj k2(i, j) assuming ∃ψℓ s.t. k1(s, t) = X ℓ ψℓ(s)ψℓ(t) Xn i=1 Xn j=1 αiαj k1(xi , xj )k2(xi , xj ) = Xn i=1 Xn j=1 αiαj X ℓ ψℓ(xi )ψℓ(xj )k2(xi , xj )  = X ℓ Xn i=1 Xn j=1 αiψℓ(xi )  αjψℓ(xj )  k2(xi , xj ) N. Cristianini and J. 
Shawe Taylor, kernel methods for pattern analysis, 2004Kernel engineering: building PDK for any polynomial with positive coef. φ from IR to IR φ k(s,t)  if Ψis a function from IRd to IRd k Ψ(s), Ψ(t)  if ϕ from IRd to IR+, is minimum in 0 k(s,t) = ϕ(s + t) − ϕ(s − t) convolution of two positive kernels is a positive kernel K1 ⋆ K2 Example : the Gaussian kernel is a PDK exp(−ks − tk 2 ) = exp(−ksk 2 − ktk 2 + 2s ⊤t) = exp(−ksk 2 ) exp(−ktk 2 ) exp(2s ⊤t) s ⊤t is a PDK and function exp as the limit of positive series expansion, so exp(2s ⊤t) is a PDK exp(−ksk 2 ) exp(−ktk 2 ) is a PDK as a product kernel the product of two PDK is a PDK O. Catoni, master lecture, 2005an attempt at classifying PD kernels stationary kernels, (also called translation invariant): k(s,t) = ks (s − t) ◮ radial (isotropic) gaussian: exp  − r 2 b  , r = ks − tk ◮ with compact support c.s. Matèrn : max 0, 1 − r b κ  r b kBk r b  , κ ≥ (d + 1)/2 ◮ locally stationary kernels: k(s,t) = k1(s + t)ks (s − t) K1 is a non negative function and K2 a radial kernel. non stationary (projective kernels): k(s,t) = kp(s ⊤t) ◮ separable kernels k(s,t) = k1(s)k2(t) with k1 and k2(t) PDK in this case K = k1k ⊤ 2 where k1 = (k1(x1), ..., k1(xn)) MG Genton, Classes of Kernels for Machine Learning: A Statistics Perspective - JMLR, 2002some examples of PD kernels... type name k(s,t) radial gaussian exp  − r 2 b  , r = ks − tk radial laplacian exp(−r/b) radial rationnal 1 − r 2 r 2+b radial loc. gauss. max 0, 1 − r 3b d exp(− r 2 b ) non stat. 
χ 2 exp(−r/b), r = P k (sk−tk ) 2 sk+tk projective polynomial (s ⊤t) p projective affine (s ⊤t + b) p projective cosine s ⊤t/kskktk projective correlation exp  s⊤t kskktk − b  Most of the kernels depends on a quantity b called the bandwidththe importance of the Kernel bandwidth for the affine Kernel: Bandwidth = biais k(s, t) = (s ⊤t + b) p = b p  s ⊤t b + 1 p for the gaussian Kernel: Bandwidth = influence zone k(s, t) = 1 Z exp  − ks − tk 2 2σ 2  b = 2σ 2the importance of the Kernel bandwidth for the affine Kernel: Bandwidth = biais k(s, t) = (s ⊤t + b) p = b p  s ⊤t b + 1 p for the gaussian Kernel: Bandwidth = influence zone k(s, t) = 1 Z exp  − ks − tk 2 2σ 2  b = 2σ 2 Illustration 1 d density estimation b = 1 2 b = 2 + data (x1, x2, ..., xn) – Parzen estimate IPb(x) = 1 Z Xn i=1 k(x, xi)kernels for objects and structures kernels on histograms and probability distributions kernel on strings spectral string kernel k(s, t) = P u φu(s)φu(t) using sub sequences similarities by alignements k(s, t) = P π exp(β(s, t, π)) kernels on graphs the pseudo inverse of the (regularized) graph Laplacian L = D − A A is the adjency matrixD the degree matrix diffusion kernels 1 Z(b) expbL subgraph kernel convolution (using random walks) and kernels on HMM, automata, dynamical system... Shawe-Taylor & Cristianini’s Book, 2004 ; JP Vert, 2006Multiple kernel M. Cuturi, Positive Definite Kernels in Machine Learning, 2009Gram matrix Definition (Gram matrix) let k(s,t) be a positive kernel on X and (xi)i=1,n a sequence on X . the Gram matrix is the square K of dimension n and of general term Kij = k(xi , xj). 
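The Gram-matrix construction above, together with the eigenvalue check of positivity used in the slides, can be sketched numerically. This is a minimal NumPy sketch, not part of the lecture: the Gaussian kernel and bandwidth convention $b = 2\sigma^2$ follow the slides, while the function name and the random data are illustrative.

```python
import numpy as np

def gaussian_gram(X, b=2.0):
    """Gram matrix K_ij = k(x_i, x_j) = exp(-||x_i - x_j||^2 / b)."""
    sq = np.sum(X ** 2, axis=1)
    # squared pairwise distances via the expansion ||s - t||^2 = ||s||^2 + ||t||^2 - 2 s.t
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(D2, 0.0) / b)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))   # 20 sample points in the plane
K = gaussian_gram(X)

# the practical trick: K is positive iff it is symmetric with non-negative eigenvalues
assert np.allclose(K, K.T)
print(np.linalg.eigvalsh(K).min() > -1e-8)  # → True
```

Shrinking the bandwidth b drives K toward the identity matrix, while a large b makes it close to the all-ones matrix, which is one way to see why the bandwidth matters more than the kernel itself.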
practical trick to check kernel positivity: K is positive ⇔ λi > 0 its eigenvalues are posivies: if Kui = λiui ; i = 1, n u ⊤ i Kui = λiu ⊤ i ui = λi matrix K is the one to be usedExamples of Gram matrices with different bandwidth raw data Gram matrix for b = 2 b = .5 b = 10different point of view about kernels kernel and scalar product k(s, t) = hφ(s), φ(t)iH kernel and distance d(s, t) 2 = k(s, s) + k(t, t) − 2k(s, t) kernel and covariance: a positive matrix is a covariance matrix IP(f) = 1 Z exp − 1 2 (f − f0) ⊤K −1 (f − f0)  if f0 = 0 and f = Kα, IP(α) = 1 Z exp − 1 2 α⊤Kα Kernel and regularity (green’s function) k(s, t) = P ∗Pδs−t for some operator P (e.g. some differential)Let’s summarize positive kernels there is a lot of them can be rather complex 2 classes: radial / projective the bandwith matters (more than the kernel itself) the Gram matrix summarize the pairwise comparizonsRoadmap 1 Statistical learning and kernels Kernel machines Kernels Kernel and hypothesis set Functional differentiation in RKHS Stéphane Canu (INSA Rouen - LITIS) March 4, 2014 25 / 37From kernel to functions H0 =    f mf < ∞; fj ∈ IR;tj ∈ X , f (x) = Xmf j=1 fjk(x,tj)    let define the bilinear form (g(x) = Pmg i=1 gi k(x, si )) : ∀f , g ∈ H0, hf , giH0 = Xmf j=1 Xmg i=1 fj gi k(tj ,si) Evaluation functional: ∀x ∈ X f (x) = hf (•), k(x, •)iH0 from k to H for any positive kernel, a hypothesis set can be constructed H = H0 with its metricRKHS Definition (reproducing kernel Hibert space (RKHS)) a Hilbert space H embeded with the inner product h•, •iH is said to be with reproducing kernel if it exists a positive kernel k such that ∀s ∈ X , k(•,s) ∈ H ∀f ∈ H, f (s) = hf (•), k(s, •)iH Beware: f = f (•) is a function while f (s) is the real value of f at point s positive kernel ⇔ RKHS any function in H is pointwise defined defines the inner product it defines the regularity (smoothness) of the hypothesis set Exercice: let f (•) = Pn i=1 αik(•, xi). 
Show that $\|f\|^2_H = \alpha^\top K \alpha$.

Other kernels (what really matters)
finite kernels: $k(s,t) = \big(\phi_1(s), \dots, \phi_p(s)\big)^\top \big(\phi_1(t), \dots, \phi_p(t)\big)$
Mercer kernels: positive on a compact set ⇔ $k(s,t) = \sum_{j=1}^p \lambda_j \phi_j(s)\phi_j(t)$
positive kernels: positive semi-definite
conditionally positive (for some functions $p_j$): $\forall \{x_i\}_{i=1,n},\ \forall \alpha_i$ with $\sum_i \alpha_i p_j(x_i) = 0$, $j = 1, \dots, p$: $\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(x_i, x_j) \ge 0$
symmetric non positive: $k(s,t) = \tanh(s^\top t + \alpha_0)$
non symmetric – non positive
the key property $\nabla J_t(f) = k(t, \cdot)$ holds. C. Ong et al., ICML, 2004

The kernel map
observation: $x = (x_1, \dots, x_j, \dots, x_d)^\top$, $f(x) = w^\top x = \langle w, x\rangle_{\mathbb{R}^d}$
feature map: $x \to \Phi(x) = (\phi_1(x), \dots, \phi_j(x), \dots, \phi_p(x))^\top$, $\Phi : \mathbb{R}^d \to \mathbb{R}^p$, $f(x) = w^\top \Phi(x) = \langle w, \Phi(x)\rangle_{\mathbb{R}^p}$
kernel dictionary: $x \to k(x) = (k(x, x_1), \dots, k(x, x_i), \dots, k(x, x_n))^\top$, $k : \mathbb{R}^d \to \mathbb{R}^n$, $f(x) = \sum_{i=1}^n \alpha_i k(x, x_i) = \langle \alpha, k(x)\rangle_{\mathbb{R}^n}$
kernel map: $x \to k(\cdot, x)$, $p = \infty$, $f(x) = \langle f(\cdot), k(\cdot, x)\rangle_H$

Roadmap 1 Statistical learning and kernels: Kernel machines; Kernels; Kernel and hypothesis set; Functional differentiation in RKHS

Functional differentiation in RKHS
Let J be a functional, $J : H \to \mathbb{R}$, $f \mapsto J(f)$; examples: $J_1(f) = \|f\|^2_H$, $J_2(f) = f(x)$
directional derivative of J in direction g at point f: $dJ(f, g) = \lim_{\varepsilon \to 0} \frac{J(f + \varepsilon g) - J(f)}{\varepsilon}$
Gradient $\nabla_J(f)$: the element of H such that $dJ(f, g) = \langle \nabla_J(f), g\rangle_H$
exercise: find $\nabla_{J_1}(f)$ and $\nabla_{J_2}(f)$. Hint: $dJ(f, g) = \frac{d}{d\varepsilon} J(f + \varepsilon g)\big|_{\varepsilon = 0}$

Solution
$dJ_1(f, g) = \lim_{\varepsilon\to 0} \frac{\|f + \varepsilon g\|^2 - \|f\|^2}{\varepsilon} = \lim_{\varepsilon\to 0} \frac{\varepsilon^2\|g\|^2 + 2\varepsilon\langle f, g\rangle_H}{\varepsilon} = \lim_{\varepsilon\to 0} \big(\varepsilon\|g\|^2 + 2\langle f, g\rangle_H\big) = \langle 2f, g\rangle_H \ \Leftrightarrow\ \nabla_{J_1}(f) = 2f$
$dJ_2(f, g) = \lim_{\varepsilon\to 0} \frac{f(x) + \varepsilon g(x) - f(x)}{\varepsilon} = g(x) = \langle k(x, \cdot), g\rangle_H \ \Leftrightarrow\ \nabla_{J_2}(f) = k(x, \cdot)$
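The reproducing property $f(x) = \langle f(\cdot), k(x, \cdot)\rangle_H$ and the exercise's norm identity $\|f\|^2_H = \alpha^\top K \alpha$ can be checked numerically for a function built from a Gaussian kernel. A small sketch, not from the lecture; all names and the bandwidth are illustrative choices.

```python
import numpy as np

def k(s, t, b=2.0):
    # Gaussian kernel with bandwidth b (a positive kernel)
    return np.exp(-np.sum((s - t) ** 2) / b)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))   # data points x_1, ..., x_n
alpha = rng.normal(size=10)    # coefficients alpha_i
K = np.array([[k(xi, xj) for xj in X] for xi in X])

def f(x):
    # f(.) = sum_i alpha_i k(., x_i), an element of the RKHS H
    return sum(a * k(x, xi) for a, xi in zip(alpha, X))

# reproducing property: f(x_j) = <f(.), k(., x_j)>_H = (K alpha)_j
print(np.allclose([f(xj) for xj in X], K @ alpha))  # → True

# norm identity: ||f||_H^2 = alpha^T K alpha, non-negative since k is positive
print(alpha @ K @ alpha >= 0)  # → True
```

The norm identity follows by bilinearity of the inner product: $\|f\|^2_H = \sum_i \sum_j \alpha_i \alpha_j \langle k(\cdot, x_i), k(\cdot, x_j)\rangle_H = \sum_i \sum_j \alpha_i \alpha_j k(x_i, x_j)$.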
Minimize f ∈H J(f ) ⇔ ∀g ∈ H, dJ(f , g) = 0 ⇔ ∇J (f ) = 0Subdifferential in a RKHS H Definition (Sub gradient) a subgradient of J : H 7−→ IR at f0 is any function g ∈ H such that ∀f ∈ V(f0), J(f ) ≥ J(f0) + hg,(f − f0)iH Definition (Subdifferential) ∂J(f ), the subdifferential of J at f is the set of all subgradients of J at f . H = IR J3(x) = |x| ∂J3(0) = {g ∈ IR | − 1 < g < 1} H = IR J4(x) = max(0, 1 − x) ∂J4(1) = {g ∈ IR | − 1 < g < 0} Theorem (Chain rule for linear Subdifferential) Let T be a linear operator H 7−→ IR and ϕ a function from IR to IR If J(f ) = ϕ(Tf ) Then ∂J(f ) = {T ∗g | g ∈ ∂ϕ(Tf )}, where T ∗ denotes T’s adjoint operatorexample of subdifferential in H evaluation operator and its adjoint T : H −→ IRn f 7−→ Tf = (f (x1), . . . , f (xn))⊤ T ∗ : IRn −→ H α 7−→ T ∗α build the adjoint hTf , αiIRn = hf ,T ∗αiHexample of subdifferential in H evaluation operator and its adjoint T : H −→ IRn f 7−→ Tf = (f (x1), . . . , f (xn))⊤ T ∗ : IRn −→ H α 7−→ T ∗α = Xn i=1 αik(•, xi) build the adjoint hTf , αiIRn = hf ,T ∗αiH hTf , αiIRn = Xn i=1 f (xi)αi = Xn i=1 hf (•), k(•, xi)iHαi = hf (•), Xn i=1 αik(•, xi) | {z } T∗α iHexample of subdifferential in H evaluation operator and its adjoint T : H −→ IRn f 7−→ Tf = (f (x1), . . . , f (xn))⊤ T ∗ : IRn −→ H α 7−→ T ∗α = Xn i=1 αik(•, xi) build the adjoint hTf , αiIRn = hf ,T ∗αiH hTf , αiIRn = Xn i=1 f (xi)αi = Xn i=1 hf (•), k(•, xi)iHαi = hf (•), Xn i=1 αik(•, xi) | {z } T∗α iH TT∗ : IRn −→ IRn α 7−→ TT∗α = Xn j=1 αjk(xj , xi) = Kαexample of subdifferential in H evaluation operator and its adjoint T : H −→ IRn f 7−→ Tf = (f (x1), . . . 
, f (xn))⊤ T ∗ : IRn −→ H α 7−→ T ∗α = Xn i=1 αik(•, xi) build the adjoint hTf , αiIRn = hf ,T ∗αiH hTf , αiIRn = Xn i=1 f (xi)αi = Xn i=1 hf (•), k(•, xi)iHαi = hf (•), Xn i=1 αik(•, xi) | {z } T∗α iH TT∗ : IRn −→ IRn α 7−→ TT∗α = Xn j=1 αjk(xj , xi) = Kα Example of subdifferentials x given J5(f ) = |f (x)| ∂J5(f0) =  g(•) = αk(•, x) ; −1 < α < 1 x given J6(f ) = max(0, 1 − f (x)) ∂J6(f1) =  g(•) = αk(•, x) ; −1 < α < 0 Optimal conditions Theorem (Fermat optimality criterion) When J(f ) is convex, f ⋆ is a stationary point of problem min f ∈H J(f ) If and only if 0 ∈ ∂J(f ⋆ ) f f ⋆ ⋆ ∂J(f ⋆ ) exercice: find for a given y ∈ IR (from Obozinski) min x∈IR 1 2 (x − y) 2 + λ|x|Let’s summarize positive kernels ⇔ RKHS = H ⇔ regularity kf k 2 H the key property: ∇Jt (f ) = k(t, .) holds not only for positive kernels f (xi) exists (pointwise defined functions) universal consistency in RKHS the Gram matrix summarize the pairwise comparizons Lecture 3: Linear SVM with slack variables Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 March 23, 2014The non separable case −1.5 −1 −0.5 0 0.5 1 1.5 2 2.5 3 −1 −0.5 0 0.5 1 1.5 2 2.5Road map 1 Linear SVM The non separable case The C (L1) SVM The L2 SVM and others “variations on a theme” The hinge loss 0 0 Slack jThe non separable case: a bi criteria optimization problem Modeling potential errors: introducing slack variables ξi (xi , yi)  no error: yi(w⊤xi + b) ≥ 1 ⇒ ξi = 0 error: ξi = 1 − yi(w⊤xi + b) > 0 0 0 Slack j    min w,b,ξ 1 2 kwk 2 min w,b,ξ C p Xn i=1 ξ p i with yi(w⊤xi + b) ≥ 1 − ξi ξi ≥ 0 i = 1, n Our hope: almost all ξi = 0Bi criteria optimization and dominance    L(w) = 1 p Xn i=1 ξ p i P(w) = kwk 2 Dominance w1 dominates w2 if L(w1) ≤ L(w2) and P(w1) ≤ P(w2) Pareto front (or Pareto Efficient Frontier) it is the set of all nondominated solutions P(w) = || w ||2 L(w) = 1/p Y n i=1 j p i admisible set Pareto’s front w = 0 Admissible solution Figure: dominated point (red), non dominated point 
(purple) and Pareto front (blue). Pareto frontier ⇔ Regularization path Stéphane Canu (INSA Rouen - LITIS) March 23, 2014 5 / 293 equivalent formulations to reach Pareto’s front min w∈IRd 1 p Xn i=1 ξ p i + λ kwk 2 it works for CONVEX criteria! Stéphane Canu (INSA Rouen - LITIS) March 23, 2014 6 / 293 equivalent formulations to reach Pareto’s front min w∈IRd 1 p Xn i=1 ξ p i + λ kwk 2    min w 1 p Xn i=1 ξ p i with kwk 2 ≤ k it works for CONVEX criteria! Stéphane Canu (INSA Rouen - LITIS) March 23, 2014 6 / 293 equivalent formulations to reach Pareto’s front min w∈IRd 1 p Xn i=1 ξ p i + λ kwk 2    min w 1 p Xn i=1 ξ p i with kwk 2 ≤ k    min w kwk 2 with 1 p Xn i=1 ξ p i ≤ k ′ it works for CONVEX criteria! Stéphane Canu (INSA Rouen - LITIS) March 23, 2014 6 / 29The non separable case Modeling potential errors: introducing slack variables ξi (xi , yi)  no error: yi(w⊤xi + b) ≥ 1 ⇒ ξi = 0 error: ξi = 1 − yi(w⊤xi + b) > 0 Minimizing also the slack (the error), for a given C > 0    min w,b,ξ 1 2 kwk 2 + C p Xn i=1 ξ p i with yi(w⊤xi + b) ≥ 1 − ξi i = 1, n ξi ≥ 0 i = 1, n Looking for the saddle point of the lagrangian with the Lagrange multipliers αi ≥ 0 and βi ≥ 0 L(w, b, α, β) = 1 2 kwk 2 + C p Xn i=1 ξ p i − Xn i=1 αi yi(w ⊤xi + b) − 1 + ξi  − Xn i=1 βi ξiThe KKT(p = 1) L(w, b, α, β) = 1 2 kwk 2 + C p Xn i=1 ξ p i − Xn i=1 αi yi(w ⊤xi + b) − 1 + ξi  − Xn i=1 βi ξi stationarity w − Xn i=1 αi yixi = 0 and Xn i=1 αi yi = 0 C − αi − βi = 0 i = 1, . . . , n primal admissibility yi(w⊤xi + b) ≥ 1 i = 1, . . . , n ξi ≥ 0 i = 1, . . . , n dual admissibility αi ≥ 0 i = 1, . . . , n βi ≥ 0 i = 1, . . . , n complementarity αi  yi(w⊤xi + b) − 1 + ξi  = 0 i = 1, . . . , n βi ξi = 0 i = 1, . . . , n Let’s eliminate β!KKT (p = 1) stationarity w − Xn i=1 αi yixi = 0 and Xn i=1 αi yi = 0 primal admissibility yi(w⊤xi + b) ≥ 1 i = 1, . . . , n ξi ≥ 0 i = 1, . . . , n; dual admissibility αi ≥ 0 i = 1, . . . , n C − αi ≥ 0 i = 1, . . . 
, n; complementarity αi  yi(w⊤xi + b) − 1 + ξi  = 0 i = 1, . . . , n (C − αi) ξi = 0 i = 1, . . . , n sets I0 IA IC αi 0 0 < α < C C βi C C − α 0 ξi 0 0 1 − yi(w⊤xi + b) yi(w⊤xi + b) > 1 yi(w⊤xi + b) = 1 yi(w⊤xi + b) < 1 useless usefull (support vec) suspiciousThe importance of being support −2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4 −2 −1 0 1 2 3 4 . data point α constraint value set xi useless αi = 0 yi w⊤xi + b  > 1 I0 xi support 0 < αi < C yi w⊤xi + b  = 1 Iα xi suspicious αi = C yi w⊤xi + b  < 1 IC Table: When a data point is « support » it lies exactly on the margin. here lies the efficiency of the algorithm (and its complexity)! sparsity: αi = 0Optimality conditions (p = 1) L(w, b, α, β) = 1 2 kwk 2 + C Xn i=1 ξi − Xn i=1 αi yi(w ⊤xi + b) − 1 + ξi  − Xn i=1 βi ξi Computing the gradients:    ∇wL(w, b, α) = w − Xn i=1 αi yixi ∂L(w, b, α) ∂b = Xn i=1 αi yi ∇ξiL(w, b, α) = C − αi − βi no change for w and b βi ≥ 0 and C − αi − βi = 0 ⇒ αi ≤ C The dual formulation:    min α∈IRn 1 2 α ⊤Gα − e ⊤α with y ⊤α = 0 and 0 ≤ αi ≤ C i = 1, nSVM primal vs. 
dual Primal    min w,b,ξ∈IRn 1 2 kwk 2 + C Xn i=1 ξi with yi(w⊤xi + b) ≥ 1 − ξi ξi ≥ 0 i = 1, n d + n + 1 unknown 2n constraints classical QP to be used when n is too large to build G Dual    min α∈IRn 1 2 α ⊤Gα − e ⊤α with y ⊤α = 0 and 0 ≤ αi ≤ C i = 1, n n unknown G Gram matrix (pairwise influence matrix) 2n box constraints easy to solve to be used when n is not too largeThe smallest C C small ⇒ all the points are in IC : αi = C −2 −1 0 1 2 3 4 −3 −2 −1 0 1 2 3 4 5 6 −1 ≤ fj = C Xn i=1 yi(x ⊤ i xj)+b ≤ 1 fM = max(f ) fm = min(f ) Cmax = 2 fM − fmRoad map 1 Linear SVM The non separable case The C (L1) SVM The L2 SVM and others “variations on a theme” The hinge loss 0 0 Slack jL2 SVM: optimality conditions (p = 2) L(w, b, α, β) = 1 2 kwk 2 + C 2 Xn i=1 ξ 2 i − Xn i=1 αi yi(w ⊤xi + b) − 1 + ξi  Computing the gradients:    ∇wL(w, b, α) = w − Xn i=1 αi yixi ∂L(w, b, α) ∂b = Xn i=1 αi yi ∇ξiL(w, b, α) = Cξi − αi no need of the positivity constraint on ξi no change for w and b Cξi − αi = 0 ⇒ C 2 Pn i=1 ξ 2 i − Pn i=1 αi ξi = − 1 2C Pn i=1 α 2 i The dual formulation:    min α∈IRn 1 2 α ⊤(G + 1 C I)α − e ⊤α with y ⊤α = 0 and 0 ≤ αi i = 1, nSVM primal vs. 
dual Primal    min w,b,ξ∈IRn 1 2 kwk 2 + C 2 Xn i=1 ξ 2 i with yi(w⊤xi + b) ≥ 1 − ξi d + n + 1 unknown n constraints classical QP to be used when n is too large to build G Dual    min α∈IRn 1 2 α ⊤(G + 1 C I)α − e ⊤α with y ⊤α = 0 and 0 ≤ αi i = 1, n n unknown G Gram matrix is regularized n box constraints easy to solve to be used when n is not too largeOne more variant: the ν SVM    max v,a m with min i=1,n |v ⊤xi + a| ≥ m kvk 2 = k    min v,a 1 2 kvk 2 − ν m + Pn i=1 ξi with yi(v ⊤xi + a) ≥ m − ξi ξi ≥ 0, m ≥ 0 The dual formulation:    min α∈IRn 1 2 α ⊤Gα with y ⊤α = 0 and 0 ≤ αi ≤ 1/n i = 1, n m ≤ e ⊤αThe convex hull formulation Minimizing the distance between the convex hulls    min α ku − vk with u = X {i|yi =1} αixi , v = X {i|yi =−1} αixi and X {i|yi =1} αi = 1, X {i|yi =−1} αi = 1, 0 ≤ αi ≤ C i = 1, n w ⊤x = 2 ku − vk u ⊤x − v ⊤x  and b = kuk − kvk ku − vkSVM with non symetric costs Problem in the primal (p = 1)    min w,b,ξ∈IRn 1 2 kwk 2 + C + X {i|yi =1} ξi + C − X {i|yi =−1} ξi with yi w⊤xi + b  ≥ 1 − ξi , ξi ≥ 0, i = 1, n for p = 1 the dual formulation is the following: ( max α∈IRn − 1 2 α ⊤Gα + α ⊤e with α ⊤y = 0 and 0 ≤ αi ≤ C + or C − i = 1, n It generalizes to any cost (useful for unbalanced data)Road map 1 Linear SVM The non separable case The C (L1) SVM The L2 SVM and others “variations on a theme” The hinge loss 0 0 Slack jEliminating the slack but not the possible mistakes    min w,b,ξ∈IRn 1 2 kwk 2 + C Xn i=1 ξi with yi(w⊤xi + b) ≥ 1 − ξi ξi ≥ 0 i = 1, n Introducing the hinge loss ξi = max 1 − yi(w ⊤xi + b), 0  min w,b 1 2 kwk 2 + C Xn i=1 max 0, 1 − yi(w ⊤xi + b)  Back to d + 1 variables, but this is no longer an explicit QPOoops! 
the notion of sub differential Definition (Sub gradient) a subgradient of J : IRd 7−→ IR at f0 is any vector g ∈ IRd such that ∀f ∈ V(f0), J(f ) ≥ J(f0) + g ⊤(f − f0) Definition (Subdifferential) ∂J(f ), the subdifferential of J at f is the set of all subgradients of J at f . IRd = IR J3(x) = |x| ∂J3(0) = {g ∈ IR | − 1 < g < 1} IRd = IR J4(x) = max(0, 1 − x) ∂J4(1) = {g ∈ IR | − 1 < g < 0}Regularization path for SVM min w Xn i=1 max(1 − yiw ⊤xi , 0) + λo 2 kwk 2 Iα is the set of support vectors s.t. yiw⊤xi = 1; ∂wJ(w) = X i∈Iα αi yixi − X i∈I1 yixi + λo w with αi ∈ ∂H(1) =] − 1, 0[Regularization path for SVM min w Xn i=1 max(1 − yiw ⊤xi , 0) + λo 2 kwk 2 Iα is the set of support vectors s.t. yiw⊤xi = 1; ∂wJ(w) = X i∈Iα αi yixi − X i∈I1 yixi + λo w with αi ∈ ∂H(1) =] − 1, 0[ Let λn a value close enough to λo to keep the sets I0, Iα and IC unchanged In particular at point xj ∈ Iα (w ⊤ o xj = w ⊤ n xj = yj) : ∂wJ(w)(xj) = 0 P i∈Iα αioyix ⊤ i xj = P i∈I1 yix ⊤ i P xj − λo yj i∈Iα αinyix ⊤ i xj = P i∈I1 yix ⊤ i xj − λn yj G(αn − αo) = (λo − λn)y with Gij = yix ⊤ i xj αn = αo + (λo − λn)d d = (G) −1 ySolving SVM in the primal min w,b 1 2 kwk 2 + C Xn i=1 max 0, 1 − yi(w ⊤xi + b)  What for: Yahoo!, Twiter, Amazon, Google (Sibyl), Facebook. . . 
: Big data

Data-intensive machine learning systems: "on terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines". How: a hybrid online+batch approach, adaptive gradient updates (stochastic gradient descent). Code available: http://olivier.chapelle.cc/primal/

Solving SVM in the primal
    J(w, b) = 1/2 ||w||²₂ + C/2 Σ_{i=1}^n max(1 − yi(w⊤xi + b), 0)²
            = 1/2 ||w||²₂ + C/2 ξ⊤ξ   with ξi = max(1 − yi(w⊤xi + b), 0)
    ∇_w J(w, b) = w − C Σ_{i=1}^n max(1 − yi(w⊤xi + b), 0) yi xi = w − C (diag(y)X)⊤ξ
    H_w J(w, b) = I_d + C Σ_{i∉I0} xi xi⊤
Optimal step size ρ in the Newton direction:
    w_new = w_old − ρ H_w⁻¹ ∇_w J(w_old, b_old)

The hinge and other losses
Square hinge (Huber/hinge) and Lasso SVM:
    min_{w,b} ||w||₁ + C Σ_{i=1}^n max(1 − yi(w⊤xi + b), 0)^p
Penalized logistic regression (Maxent):
    min_{w,b} ||w||²₂ + C Σ_{i=1}^n log(1 + exp(−2 yi(w⊤xi + b)))
The exponential loss (commonly used in boosting):
    min_{w,b} ||w||²₂ + C Σ_{i=1}^n exp(−yi(w⊤xi + b))
The sigmoid loss:
    min_{w,b} ||w||²₂ − C Σ_{i=1}^n tanh(yi(w⊤xi + b))
(figure: classification losses as a function of y f(x): 0/1 loss, hinge, squared hinge, logistic, exponential, sigmoid)

Choosing the data fitting term and the penalty
For a given C, controlling the tradeoff between loss and penalty:
    min_{w,b} pen(w) + C Σ_{i=1}^n Loss(yi(w⊤xi + b))
For a long list of possible penalties: A. Antoniadis, I. Gijbels, M. Nikolova, Penalized likelihood regression for generalized linear models with non-quadratic penalties, 2011. A tentative classification: convex/non convex, differentiable/non differentiable. What are we looking for: consistency, efficiency −→ sparsity.

Conclusion: variables or data point?
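The gradient and Hessian above suggest a simple Newton loop for the squared-hinge primal. A sketch only (not Chapelle's actual code; the bias is absorbed by appending a constant feature, so here it is slightly regularized too, and the step size ρ is fixed to 1):

```python
import numpy as np

def primal_l2svm_newton(X, y, C=1.0, n_iter=20):
    """Newton iterations on J(w) = 0.5*||w||^2 + C/2 * sum_i max(0, 1 - y_i w'x_i)^2."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # constant feature plays the role of b
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        margins = y * (Xb @ w)
        sv = margins < 1                             # points outside I0 (active in the loss)
        grad = w - C * Xb[sv].T @ (y[sv] * (1 - margins[sv]))
        H = np.eye(len(w)) + C * Xb[sv].T @ Xb[sv]   # Id + C * sum of x x' over active points
        w -= np.linalg.solve(H, grad)                # Newton step (rho = 1)
    return w

X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = primal_l2svm_newton(X, y, C=10.0)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))  # matches y on this separable toy set
```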
seeking for a universal learning algorithm
  ◮ no model for IP(x, y)
the linear case: data is separable
  ◮ the non separable case
double objective: minimizing the error together with the regularity of the solution
  ◮ multi objective optimisation
duality: variable – example
  ◮ use the primal when d < n (in the linear case) or when the matrix G is hard to compute
  ◮ otherwise use the dual
universality = nonlinearity
  ◮ kernels

Bibliography
C. Cortes & V. Vapnik, Support-vector networks, Machine Learning, 1995
J. Bi & V. Vapnik, Learning with rigorous SVM, COLT 2003
T. Hastie, S. Rosset, R. Tibshirani, J. Zhu, The entire regularization path for the support vector machine, JMLR, 2004
P. Bartlett, M. Jordan, J. McAuliffe, Convexity, classification, and risk bounds, JASA, 2006
A. Antoniadis, I. Gijbels, M. Nikolova, Penalized likelihood regression for generalized linear models with non-quadratic penalties, 2011
A. Agarwal, O. Chapelle, M. Dudík, J. Langford, A reliable effective terascale linear learning system, 2011
informatik.unibas.ch/fileadmin/Lectures/FS2013/CS331/Slides/my_SVM_without_b.pdf
http://ttic.uchicago.edu/~gregory/courses/ml2010/lectures/lect12.pdf
http://olivier.chapelle.cc/primal/
Stéphane Canu (INSA Rouen - LITIS) March 23, 2014 29 / 29

Lecture 2: Linear SVM in the Dual
Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 March 12, 2014

Road map 1 Linear SVM: Optimization in 10 slides; Equality constraints; Inequality constraints; Dual formulation of the linear SVM; Solving the dual
Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.

Linear SVM: the problem
Linear SVMs are the solution of the following problem (called primal). Let {(xi, yi); i = 1 : n} be a set of labelled data with xi ∈ IRd, yi ∈ {1, −1}.
A support vector machine (SVM) is a linear classifier associated with the following decision function: D(x) = sign(w⊤x + b), where w ∈ IRd and b ∈ IR are given through the solution of the following problem:
    min_{w,b} 1/2 ||w||² = 1/2 w⊤w   with yi(w⊤xi + b) ≥ 1, i = 1, n
This is a quadratic program (QP):
    min_z 1/2 z⊤Az − d⊤z   with Bz ≤ e
z = (w, b)⊤, d = (0, . . . , 0)⊤, A = [I 0; 0 0], B = −[diag(y)X, y] and e = −(1, . . . , 1)⊤

Road map 1 Linear SVM: Optimization in 10 slides; Equality constraints; Inequality constraints; Dual formulation of the linear SVM; Solving the dual

A simple example (to begin with)
    min_{x1,x2} J(x) = (x1 − a)² + (x2 − b)²   with H(x) = α(x1 − c)² + β(x2 − d)² + γ x1 x2 − 1
Ω = {x | H(x) = 0}
(figure: iso-cost lines J(x) = k, the constraint set Ω, the solution x⋆, the gradients ∇xJ(x) and ∇xH(x) and the tangent hyperplane; at the optimum ∇xH(x) = λ ∇xJ(x))

The only one equality constraint case
    min_x J(x)   with H(x) = 0
    J(x + εd) ≈ J(x) + ε∇xJ(x)⊤d,   H(x + εd) ≈ H(x) + ε∇xH(x)⊤d
Loss J: d is a descent direction if there exists ε0 ∈ IR such that ∀ε ∈ IR, 0 < ε ≤ ε0, J(x + εd) < J(x) ⇒ ∇xJ(x)⊤d < 0
Constraint H: d is a feasible descent direction if there exists ε0 ∈ IR such that ∀ε ∈ IR, 0 < ε ≤ ε0, H(x + εd) = 0 ⇒ ∇xH(x)⊤d = 0
If at x⋆ the vectors ∇xJ(x⋆) and ∇xH(x⋆) are collinear, there is no feasible descent direction d. Therefore, x⋆ is a local solution of the problem.

Lagrange multipliers
Assume J and the functions Hi are continuously differentiable (and independent)
    P:  min_{x∈IRn} J(x)   with H1(x) = 0 ← λ1, H2(x) = 0 ← λ2, . . .
Hp(x) = 0 ← λp
Each constraint is associated with a λi: the Lagrange multiplier.

Theorem (First order optimality conditions)
For x⋆ to be a local minimum of P, it is necessary that:
    ∇xJ(x⋆) + Σ_{i=1}^p λi ∇xHi(x⋆) = 0   and   Hi(x⋆) = 0, i = 1, p

Plan 1 Linear SVM: Optimization in 10 slides; Equality constraints; Inequality constraints; Dual formulation of the linear SVM; Solving the dual
Stéphane Canu (INSA Rouen - LITIS) March 12, 2014 8 / 32

The only one inequality constraint case
    min_x J(x)   with G(x) ≤ 0
    J(x + εd) ≈ J(x) + ε∇xJ(x)⊤d,   G(x + εd) ≈ G(x) + ε∇xG(x)⊤d
Cost J: d is a descent direction if there exists ε0 ∈ IR such that ∀ε ∈ IR, 0 < ε ≤ ε0, J(x + εd) < J(x) ⇒ ∇xJ(x)⊤d < 0
Constraint G: d is a feasible descent direction if there exists ε0 ∈ IR such that ∀ε ∈ IR, 0 < ε ≤ ε0, G(x + εd) ≤ 0:
    G(x) < 0: no limit here on d;   G(x) = 0: ∇xG(x)⊤d ≤ 0
Two possibilities: if x⋆ lies at the limit of the feasible domain (G(x⋆) = 0) and if the vectors ∇xJ(x⋆) and ∇xG(x⋆) are collinear and in opposite directions, there is no feasible descent direction d at that point; therefore x⋆ is a local solution of the problem... Or ∇xJ(x⋆) = 0.

Two possibilities for optimality
    ∇xJ(x⋆) = −µ ∇xG(x⋆) with µ > 0 and G(x⋆) = 0,   or   ∇xJ(x⋆) = 0 with µ = 0 and G(x⋆) < 0
This alternative is summarized in the so-called complementarity condition:
    µ G(x⋆) = 0   (µ = 0 when G(x⋆) < 0;  µ > 0 when G(x⋆) = 0)

First order optimality condition (1)
Problem P:
    min_{x∈IRn} J(x)   with hj(x) = 0, j = 1, . . . , p   and gi(x) ≤ 0, i = 1, . . . , q
Definition: Karush, Kuhn and Tucker (KKT) conditions
    stationarity           ∇J(x⋆) + Σ_{j=1}^p λj ∇hj(x⋆) + Σ_{i=1}^q µi ∇gi(x⋆) = 0
    primal admissibility   hj(x⋆) = 0, j = 1, . . . , p;  gi(x⋆) ≤ 0, i = 1, . . . , q
    dual admissibility     µi ≥ 0, i = 1, . . . , q
    complementarity        µi gi(x⋆) = 0, i = 1, . . . , q
λj and µi are called the Lagrange multipliers of problem P.

First order optimality condition (2)
Theorem (12.1 Nocedal & Wright pp 321): If a vector x⋆ is a stationary point of problem P, then there exist Lagrange multipliers such that (x⋆, {λj}j=1:p, {µi}i=1:q) fulfill the KKT conditions (under some conditions, e.g. linear independence constraint qualification). If the problem is convex, then a stationary point is the solution of the problem.
A quadratic program (QP) is convex when. . .
    (QP)  min_z 1/2 z⊤Az − d⊤z  with Bz ≤ e
. . . when the matrix A is positive definite.

KKT condition - Lagrangian (3)
Problem P:
    min_{x∈IRn} J(x)   with hj(x) = 0, j = 1, . . . , p   and gi(x) ≤ 0, i = 1, . . . , q
Definition: Lagrangian
The Lagrangian of problem P is the following function:
    L(x, λ, µ) = J(x) + Σ_{j=1}^p λj hj(x) + Σ_{i=1}^q µi gi(x)
The importance of being a Lagrangian: the stationarity condition can be written ∇L(x⋆, λ, µ) = 0; the Lagrangian saddle point is max_{λ,µ} min_x L(x, λ, µ). Primal variables: x; dual variables: λ, µ (the Lagrange multipliers).

Duality – definitions (1)
Primal and (Lagrange) dual problems:
    P:  min_{x∈IRn} J(x)  with hj(x) = 0, j = 1, p  and gi(x) ≤ 0, i = 1, q
    D:  max_{λ∈IRp, µ∈IRq} Q(λ, µ)  with µj ≥ 0, j = 1, q
Dual objective function:
    Q(λ, µ) = inf_x L(x, λ, µ) = inf_x J(x) + Σ_{j=1}^p λj hj(x) + Σ_{i=1}^q µi gi(x)
Wolfe dual problem:
    W:  max_{x, λ∈IRp, µ∈IRq} L(x, λ, µ)  with µj ≥ 0, j = 1, q
        and ∇J(x⋆) + Σ_{j=1}^p λj ∇hj(x⋆) + Σ_{i=1}^q µi ∇gi(x⋆) = 0

Duality – theorems (2)
Theorem (12.12, 12.13 and 12.14 Nocedal & Wright pp 346): If f, g and h are convex and continuously differentiable, then the solution of the dual problem is the same as the solution of the primal, under some conditions such as
linear independence constraint qualification.
    (λ⋆, µ⋆) = solution of problem D
    x⋆ = arg min_x L(x, λ⋆, µ⋆)
    Q(λ⋆, µ⋆) = min_x L(x, λ⋆, µ⋆) = L(x⋆, λ⋆, µ⋆) = J(x⋆) + λ⋆H(x⋆) + µ⋆G(x⋆) = J(x⋆)
and for any feasible point x: Q(λ, µ) ≤ J(x), hence 0 ≤ J(x) − Q(λ, µ).
The duality gap is the difference between the primal and dual cost functions.

Road map 1 Linear SVM: Optimization in 10 slides; Equality constraints; Inequality constraints; Dual formulation of the linear SVM; Solving the dual
Figure from L. Bottou & C.J. Lin, Support vector machine solvers, in Large scale kernel machines, 2007.

Linear SVM dual formulation - The Lagrangian
    min_{w,b} 1/2 ||w||²  with yi(w⊤xi + b) ≥ 1, i = 1, n
Looking for the Lagrangian saddle point max_α min_{w,b} L(w, b, α) with the so-called Lagrange multipliers αi ≥ 0:
    L(w, b, α) = 1/2 ||w||² − Σ_{i=1}^n αi (yi(w⊤xi + b) − 1)
αi represents the influence of the constraint, and thus the influence of the training example (xi, yi).

Stationarity conditions
Computing the gradients:
    ∇w L(w, b, α) = w − Σ_{i=1}^n αi yi xi
    ∂L(w, b, α)/∂b = Σ_{i=1}^n αi yi
We have the following optimality conditions:
    ∇w L(w, b, α) = 0  ⇒  w = Σ_{i=1}^n αi yi xi
    ∂L(w, b, α)/∂b = 0  ⇒  Σ_{i=1}^n αi yi = 0

KKT conditions for SVM
    stationarity           w − Σ_{i=1}^n αi yi xi = 0  and  Σ_{i=1}^n αi yi = 0
    primal admissibility   yi(w⊤xi + b) ≥ 1, i = 1, . . . , n
    dual admissibility     αi ≥ 0, i = 1, . . . , n
    complementarity        αi (yi(w⊤xi + b) − 1) = 0, i = 1, . . . , n
The complementarity conditions split the data into two sets. Let A be the set of active constraints, the useful points:
    A = {i ∈ [1, n] | yi(w⋆⊤xi + b⋆) = 1}
and its complement A̅ the useless points: if i ∉ A, αi = 0.

The KKT conditions for SVM
The same KKT conditions, using matrix notation and the active set A:
    stationarity           w − X⊤Dy α = 0,  α⊤y = 0
    primal admissibility   Dy(Xw + b 1I) ≥ 1I
    dual admissibility     α ≥ 0
    complementarity        Dy(XA w + b 1IA) = 1IA,  αA̅ = 0
Knowing A, the solution verifies the following linear system:
    w − XA⊤ Dy αA = 0
    −Dy XA w − b yA = −eA
    −yA⊤ αA = 0
with Dy = diag(yA), αA = α(A), yA = y(A) and XA = X(A, :).

The KKT conditions as a linear system
    [  I        −XA⊤ Dy    0   ] [ w  ]   [  0   ]
    [ −Dy XA     0        −yA  ] [ αA ] = [ −eA  ]
    [  0        −yA⊤       0   ] [ b  ]   [  0   ]
We can work on it to separate w from (αA, b).

The SVM dual formulation
The SVM Wolfe dual:
    max_{w,b,α} 1/2 ||w||² − Σ_{i=1}^n αi (yi(w⊤xi + b) − 1)
    with αi ≥ 0, i = 1, . . . , n
    and w − Σ_{i=1}^n αi yi xi = 0,  Σ_{i=1}^n αi yi = 0
Using the fact that w = Σ_{i=1}^n αi yi xi, the SVM Wolfe dual without w and b is:
    max_α −1/2 Σ_{i=1}^n Σ_{j=1}^n αj αi yi yj xj⊤xi + Σ_{i=1}^n αi
    with αi ≥ 0, i = 1, . . . , n  and Σ_{i=1}^n αi yi = 0

Linear SVM dual formulation
    L(w, b, α) = 1/2 ||w||² − Σ_{i=1}^n αi (yi(w⊤xi + b) − 1)
Optimality: w = Σ_{i=1}^n αi yi xi,  Σ_{i=1}^n αi yi = 0
    L(α) = 1/2 Σ_i Σ_j αj αi yi yj xj⊤xi (= w⊤w) − Σ_i αi yi Σ_j αj yj xj⊤xi (= the w⊤xi terms) − b Σ_i αi yi (= 0) + Σ_i αi
         = −1/2 Σ_{i=1}^n Σ_{j=1}^n αj αi yi yj xj⊤xi + Σ_{i=1}^n αi
The dual of the linear SVM is also a quadratic program:
    problem D:  min_{α∈IRn} 1/2 α⊤Gα − e⊤α  with y⊤α = 0 and 0 ≤ αi, i = 1, n
with G a symmetric n × n matrix such that Gij = yi yj xj⊤xi.

SVM primal vs.
dual Primal    min w∈IRd ,b∈IR 1 2 kwk 2 with yi(w⊤xi + b) ≥ 1 i = 1, n d + 1 unknown n constraints classical QP perfect when d << n Dual    min α∈IRn 1 2 α ⊤Gα − e ⊤α with y ⊤α = 0 and 0 ≤ αi i = 1, n n unknown G Gram matrix (pairwise influence matrix) n box constraints easy to solve to be used when d > nSVM primal vs. dual Primal    min w∈IRd ,b∈IR 1 2 kwk 2 with yi(w⊤xi + b) ≥ 1 i = 1, n d + 1 unknown n constraints classical QP perfect when d << n Dual    min α∈IRn 1 2 α ⊤Gα − e ⊤α with y ⊤α = 0 and 0 ≤ αi i = 1, n n unknown G Gram matrix (pairwise influence matrix) n box constraints easy to solve to be used when d > n f (x) = X d j=1 wjxj + b = Xn i=1 αi yi(x ⊤xi) + bThe bi dual (the dual of the dual)    min α∈IRn 1 2 α ⊤Gα − e ⊤α with y ⊤α = 0 and 0 ≤ αi i = 1, n L(α, λ, µ) = 1 2 α ⊤Gα − e ⊤α + λ y ⊤α − µ ⊤α ∇αL(α, λ, µ) = Gα − e + λ y − µ The bidual    max α,λ,µ − 1 2 α ⊤Gα with Gα − e + λ y − µ = 0 and 0 ≤ µ since kwk 2 = 1 2 α ⊤Gα and DXw = Gα ( max w,λ − 1 2 kwk 2 with DXw + λ y ≥ e by identification (possibly up to a sign) b = λ is the Lagrange multiplier of the equality constraintCold case: the least square problem Linear model yi = X d j=1 wjxij + εi , i = 1, n n data and d variables; d < n min w = Xn i=1  Xd j=1 xijwj − yi   2 = kXw − yk 2 Solution: we = (X ⊤X) −1X ⊤y f (x) = x ⊤ (X ⊤X) −1X ⊤y | {z } we What is the influence of each data point (matrix X lines) ? 
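The variables-to-examples correspondence sketched in the least-squares "cold case" above can be verified directly on synthetic data (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 3                      # n examples, d variables, d < n
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.linalg.solve(X.T @ X, X.T @ y)       # primal view: d coefficients (one per variable)
alpha = X @ np.linalg.solve(X.T @ X, w)     # dual view: n coefficients (one per example)

x = rng.standard_normal(d)
# same prediction from variables (w) or from example influences (alpha):
print(np.allclose(x @ w, alpha @ (X @ x)))  # True
```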
Shawe-Taylor & Cristianini’s Book, 2004data point influence (contribution) for any new data point x f (x) = x ⊤ (X ⊤X)(X ⊤X) −1 (X ⊤X) −1X ⊤y | {z } we = x ⊤ X ⊤ X(X ⊤X) −1 (X ⊤X) −1X ⊤y | {z } αb x⊤ n examples d variables X ⊤ αb we f (x) = X d j=1 wejxjdata point influence (contribution) for any new data point x f (x) = x ⊤ (X ⊤X)(X ⊤X) −1 (X ⊤X) −1X ⊤y | {z } we = x ⊤ X ⊤ X(X ⊤X) −1 (X ⊤X) −1X ⊤y | {z } αb x⊤ n examples d variables X ⊤ αb we x⊤xi f (x) = X d j=1 wejxj = Xn i=1 αbi (x ⊤xi) from variables to examples αb = X(X ⊤X) −1we | {z } n examples et we = X ⊤αb | {z } d variables what if d ≥ n !SVM primal vs. dual Primal    min w∈IRd ,b∈IR 1 2 kwk 2 with yi(w⊤xi + b) ≥ 1 i = 1, n d + 1 unknown n constraints classical QP perfect when d << n Dual    min α∈IRn 1 2 α ⊤Gα − e ⊤α with y ⊤α = 0 and 0 ≤ αi i = 1, n n unknown G Gram matrix (pairwise influence matrix) n box constraints easy to solve to be used when d > n f (x) = X d j=1 wjxj + b = Xn i=1 αi yi(x ⊤xi) + bRoad map 1 Linear SVM Optimization in 10 slides Equality constraints Inequality constraints Dual formulation of the linear SVM Solving the dual Figure from L. Bottou & C.J. 
Lin, Support vector machine solvers, in Large scale kernel machines, 2007.

Solving the dual (1)
Data point influence: αi = 0, this point is useless; αi ≠ 0, this point is said to be a support vector.
    f(x) = Σ_{j=1}^d wj xj + b = Σ_{i=1}^n αi yi (x⊤xi) + b = Σ_{i=1}^3 αi yi (x⊤xi) + b
The decision border only depends on 3 points (d + 1).

Solving the dual (2)
Assume we know these 3 data points:
    min_{α∈IRn} 1/2 α⊤Gα − e⊤α  with y⊤α = 0 and 0 ≤ αi, i = 1, n
    =⇒  min_{α∈IR3} 1/2 α⊤Gα − e⊤α  with y⊤α = 0
    L(α, b) = 1/2 α⊤Gα − e⊤α + b y⊤α
Solve the following linear system:
    Gα + b y = e
    y⊤α = 0

U = chol(G);                  % G = U'U, U upper triangular
a = U \ (U'\e);               % a = G^{-1} e
c = U \ (U'\y);               % c = G^{-1} y
b = (y'*c) \ (y'*a);          % b = (y'a)/(y'c), from y'alpha = 0
alpha = U \ (U'\(e - b*y));   % alpha = G^{-1}(e - b y)

Conclusion: variables or data point?
seeking for a universal learning algorithm
  ◮ no model for IP(x, y)
the linear case: data is separable
  ◮ the non separable case
double objective: minimizing the error together with the regularity of the solution
  ◮ multi objective optimisation
duality: variable – example
  ◮ use the primal when d < n (in the linear case) or when the matrix G is hard to compute
  ◮ otherwise use the dual
universality = nonlinearity
  ◮ kernels

SVM and Kernel machine
Lecture 1: Linear SVM
Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 March 12, 2014

Road map 1 Linear SVM: Separating hyperplanes; The margin; Linear SVM: the problem; Linear programming SVM

"The algorithms for constructing the separating hyperplane considered above will be utilized for developing a battery of programs for pattern recognition."
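A Python counterpart of the Cholesky solve of the reduced dual system above; since Gα + by = e and y⊤α = 0, the bias is b = (y⊤G⁻¹e)/(y⊤G⁻¹y). A sketch assuming G is positive definite on the support vectors:

```python
import numpy as np

def solve_svm_dual_system(G, y, e):
    """Solve  G alpha + b y = e,  y'alpha = 0  for (alpha, b)."""
    L = np.linalg.cholesky(G)                 # G = L L'
    solve_G = lambda r: np.linalg.solve(L.T, np.linalg.solve(L, r))
    a, c = solve_G(e), solve_G(y)             # a = G^{-1} e, c = G^{-1} y
    b = (y @ a) / (y @ c)                     # from y'alpha = y'a - b y'c = 0
    alpha = solve_G(e - b * y)
    return alpha, b

G = np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.array([1.0, -1.0])
alpha, b = solve_svm_dual_system(G, y, np.ones(2))
print(np.allclose(G @ alpha + b * y, 1.0), np.isclose(y @ alpha, 0.0))  # True True
```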
in Learning with kernels, 2002 - from V. Vapnik, 1982

Hyperplanes in 2d: intuition
It's a line!

Hyperplanes: formal definition
Given a vector v ∈ IRd and a bias a ∈ IR:
Hyperplane as a function h: IRd −→ IR, x 7−→ h(x) = v⊤x + a
Hyperplane as a border in IRd (and an implicit function): ∆(v, a) = {x ∈ IRd | v⊤x + a = 0}
The border invariance property: ∀k ≠ 0, ∆(kv, ka) = ∆(v, a)
(figure: the decision border ∆ = {x ∈ IR2 | v⊤x + a = 0}, a point (x, h(x)) with h(x) = v⊤x + a, its projection (x, 0), and the distance d(x, ∆))

Separating hyperplanes
Find a line to separate (classify) blue from red: D(x) = sign(v⊤x + a); the decision border: v⊤x + a = 0. There are many solutions... The problem is ill posed. How to choose a solution?

This is not the problem we want to solve
{(xi, yi); i = 1 : n} a training sample, i.i.d.
drawn according to IP(x, y), unknown; we want to be able to classify new observations: minimize IP(error).
Looking for a universal approach: use training data (a few errors); prove IP(error) remains small; scalable - algorithmic complexity.
With high probability (for the canonical hyperplane): IP(error) < IP̂(error) (= 0 here) + ϕ(1/margin) (= ϕ(||v||)). Vapnik's Book, 1982

Margin guarantees
    min_{i∈[1,n]} dist(xi, ∆(v, a))   (the margin m)
Theorem (Margin Error Bound): Let R be the radius of the smallest ball BR(c) = {x ∈ IRd | ||x − c|| < R} containing the points (x1, . . . , xn), i.i.d. from some unknown distribution IP. Consider a decision function D(x) = sign(v⊤x) associated with a separating hyperplane v of margin m (no training error). Then, with probability at least 1 − δ for any δ > 0, the generalization error of this hyperplane is bounded by
    IP(error) ≤ 2 √(R² / (n m²)) + 3 √(ln(2/δ) / (2n))
Theorem 4.17 p 102 in J. Shawe-Taylor, N. Cristianini, Kernel methods for pattern analysis, Cambridge 2004

Statistical machine learning – Computational learning theory (COLT)
(figure: learning diagram; a sample {xi, yi}, i = 1, n is fed to an algorithm A producing f = v⊤x + a; the empirical error IP̂(error) = 1/n Σ L(f(xi), yi) for a loss L, and the generalization error IP(error) = IE(L) is to be controlled for all IP ∈ P) Vapnik's Book, 1982

linear discrimination
Find a line to classify blue and red: D(x) = sign(v⊤x + a); the decision border: v⊤x + a = 0. There are many solutions... The problem is ill posed. How to choose a solution?
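Each candidate hyperplane can be ranked by its margin, the distance dist(x, ∆(v, a)) = |v⊤x + a| / ||v|| from the closest training point to the decision border; a minimal sketch (function names are illustrative):

```python
import numpy as np

def dist_to_hyperplane(x, v, a):
    """Distance from x to the border {s : v's + a = 0}: |v'x + a| / ||v||."""
    return abs(v @ x + a) / np.linalg.norm(v)

def margin(X, v, a):
    """Geometrical margin: distance of the closest point to the border."""
    return min(dist_to_hyperplane(x, v, a) for x in X)

v, a = np.array([3.0, 4.0]), -5.0
print(dist_to_hyperplane(np.zeros(2), v, a))             # 1.0  (|-5| / 5)
print(margin(np.array([[0.0, 0.0], [3.0, 4.0]]), v, a))  # 1.0
```

Note that the distance, like the border itself, is invariant when (v, a) is rescaled to (kv, ka).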
⇒ choose the one with the larger margin.

Road map 1 Linear SVM: Separating hyperplanes; The margin; Linear SVM: the problem; Linear programming SVM

Maximize our confidence = maximize the margin
The decision border: ∆(v, a) = {x ∈ IRd | v⊤x + a = 0}
Maximize the margin: max_{v,a} min_{i∈[1,n]} dist(xi, ∆(v, a))   (the margin m)
Maximize the confidence:
    max_{v,a} m  with  min_{i=1,n} |v⊤xi + a| / ||v|| ≥ m
The problem is still ill posed: if (v, a) is a solution, then ∀ 0 < k, (kv, ka) is also a solution. . .

Margin and distance: details
Theorem (The geometrical margin): Let x be a vector in IRd and ∆(v, a) = {s ∈ IRd | v⊤s + a = 0} a hyperplane. The distance between the vector x and the hyperplane ∆(v, a) is dist(x, ∆(v, a)) = |v⊤x + a| / ||v||.
Let sx be the closest point to x in ∆, sx = arg min_{s∈∆} ||x − s||. Then x = sx + r v/||v||, i.e. r v/||v|| = x − sx. Taking the scalar product with the vector v we have:
    v⊤(r v/||v||) = v⊤(x − sx) = v⊤x − v⊤sx = v⊤x + a − (v⊤sx + a) = v⊤x + a   (since v⊤sx + a = 0)
and therefore r = (v⊤x + a)/||v||, leading to:
    dist(x, ∆(v, a)) = min_{s∈∆} ||x − s|| = r = |v⊤x + a| / ||v||

Geometrical and numerical margin
(figure: the decision border ∆ = {x ∈ IR2 | v⊤x + a = 0}, the geometrical margin d(x, ∆) = |v⊤x + a| / ||v|| and the numerical margin m = |v⊤x + a| shown for a red point xr and a blue point xb)

From the geometrical to the numerical margin
(figure: value of the margin in the one-dimensional case, ±1/||w|| around the border {x | w⊤x = 0})
Maximize the (geometrical) margin:
    max_{v,a} m  with  min_{i=1,n} |v⊤xi + a| / ||v|| ≥ m
If the min is greater, everybody is greater (yi ∈ {−1, 1}):
    max_{v,a} m  with  yi(v⊤xi + a) / ||v|| ≥ m, i = 1, n
Change of variable: w = v / (m||v||) and b = a / (m||v||), so that ||w|| = 1/m:
    max_{w,b} m  with  yi(w⊤xi + b) ≥ 1, i = 1, n  and  m = 1/||w||
    ⇔  min_{w,b} ||w||²  with  yi(w⊤xi + b) ≥ 1, i = 1, n

The canonical hyperplane
    min_{w,b} ||w||²  with  yi(w⊤xi + b) ≥ 1, i = 1, n
Definition (The canonical hyperplane): A hyperplane (w, b) in IRd is
said to be canonical with respect to the set of vectors {xi ∈ IRd, i = 1, n} if
    min_{i=1,n} |w⊤xi + b| = 1
so that the distance
    min_{i=1,n} dist(xi, ∆(w, b)) = |w⊤x + b| / ||w|| = 1 / ||w||
The maximal margin (= minimal norm) canonical hyperplane.

Road map 1 Linear SVM: Separating hyperplanes; The margin; Linear SVM: the problem; Linear programming SVM

Linear SVM: the problem
The maximal margin (= minimal norm) canonical hyperplane.
Linear SVMs are the solution of the following problem (called primal). Let {(xi, yi); i = 1 : n} be a set of labelled data with x ∈ IRd, yi ∈ {1, −1}. A support vector machine (SVM) is a linear classifier associated with the following decision function: D(x) = sign(w⊤x + b), where w ∈ IRd and b ∈ IR are given through the solution of the following problem:
    min_{w∈IRd, b∈IR} 1/2 ||w||²  with  yi(w⊤xi + b) ≥ 1, i = 1, n
This is a quadratic program (QP):
    min_z 1/2 z⊤Az − d⊤z  with  Bz ≤ e

Support vector machines as a QP
The standard QP formulation:
    min_{w,b} 1/2 ||w||²  with  yi(w⊤xi + b) ≥ 1, i = 1, n
    ⇔  min_{z∈IRd+1} 1/2 z⊤Az − d⊤z  with  Bz ≤ e
z = (w, b)⊤, d = (0, . . . , 0)⊤, A = [I 0; 0 0], B = −[diag(y)X, y] and e = −(1, . . . , 1)⊤
Solve it using a standard QP solver such as (for instance):

% QUADPROG Quadratic programming.
% X = QUADPROG(H, f, A, b) attempts to solve the quadratic programming problem:
%   min 0.5*x'*H*x + f'*x   subject to:  A*x <= b
%    x
% so that the solution is in the range LB <= X <= UB

For more solvers (just to name a few) have a look at: plato.asu.edu/sub/nlores.html#QP-problem and www.numerical.rl.ac.uk/people/nimg/qp/qp.html

Road map 1 Linear SVM: Separating hyperplanes; The margin; Linear SVM: the problem; Linear programming SVM

Other SVMs: Equivalence between norms
L1 norm: variable selection (especially with redundant noisy features), Mangasarian, 1965:
    max_{m,v,a} m  with  yi(v⊤xi + a) ≥ m ||v||₂ ≥ m (1/√d) ||v||₁, i = 1, n
1-norm or Linear Programming SVM (LP SVM):
    min_{w,b} ||w||₁ = Σ_{j=1}^p |wj|  with  yi(w⊤xi + b) ≥ 1, i = 1, n
Generalized SVM (Bradley and Mangasarian, 1998):
    min_{w,b} ||w||_p^p  with  yi(w⊤xi + b) ≥ 1, i = 1, n
p = 2: SVM; p = 1: LP SVM (also with p = ∞); p = 0: L0 SVM; p = 1 and 2: doubly regularized SVM (DrSVM)

Linear programming support vector machines (LP SVM)
    min_{w,b} ||w||₁ = Σ_{j=1}^p (w+j + w−j)  with  yi(w⊤xi + b) ≥ 1, i = 1, n
    w = w+ − w−  with  w+ ≥ 0 and w− ≥ 0
The standard LP formulation:
    min_x f⊤x  with  Ax ≤ d and 0 ≤ x
    x = [w+; w−; b], f = [1 . . . 1; 0], d = −[1 . . . 1]⊤, A = [−yiXi  yiXi  −yi]

% linprog(f, A, b, Aeq, beq, LB, UB)
% attempts to solve the linear programming problem:
%   min f'*x   subject to:  A*x <= b
%    x
% so that the solution is in the range LB <= X <= UB

An example of linear discrimination: SVM and LP SVM
Figure: SVM and LP SVM (true line, QP SVM, LP SVM)
The linear discrimination problem, from Learning with Kernels, B. Schölkopf and A. Smola, MIT Press, 2002.

Conclusion
SVM = Separating hyperplane (to begin with the simpler) + Margin, norm and statistical learning + Quadratic and linear programming (and associated rewriting issues) + Support vectors (sparsity). SVM performs the selection of the most relevant data points.

Bibliography
V.
Vapnik, The generalized portrait method, p 355 in Estimation of dependences based on empirical data, Springer, 1982
B. Boser, I. Guyon & V. Vapnik, A training algorithm for optimal margin classifiers, COLT, 1992
P. S. Bradley & O. L. Mangasarian, Feature selection via concave minimization and support vector machines, ICML 1998
B. Schölkopf & A. Smola, Learning with Kernels, MIT Press, 2002
M. Mohri, A. Rostamizadeh & A. Talwalkar, Foundations of Machine Learning, MIT Press, 2012
http://agbs.kyb.tuebingen.mpg.de/lwk/sections/section72.pdf
http://www.cs.nyu.edu/~mohri/mls/lecture_4.pdf
http://en.wikipedia.org/wiki/Quadratic_programming
Stéphane Canu (INSA Rouen - LITIS) March 12, 2014 25 / 25

Understanding SVM (and associated kernel machines) through the development of a Matlab toolbox. Stéphane Canu. Ecole d'ingénieur. Introduction to Support Vector Machines (SVM), Sao Paulo, 2014, pp.33. HAL Id: cel-01003007, https://cel.archives-ouvertes.fr/cel-01003007, submitted on 8 Jun 2014.
Lecture 8: Multi Class SVM
Stéphane Canu stephane.canu@litislab.eu Sao Paulo 2014 April 10, 2014

Roadmap 1 Multi Class SVM: 3 different strategies for multi class SVM; Multi Class SVM by decomposition; Multi class SVM; Coupling convex hulls

3 different strategies for multi class SVM
1 Decomposition approaches
  ◮ one vs all: winner takes all
  ◮ one vs one: max-wins voting; pairwise coupling: use probability – best results
  ◮ c SVDD
2 Global approach (size c × n)
  ◮ formal (different variations):
    min_{f∈H, α0, ξ∈IRn} 1/2 Σ_{ℓ=1}^c ||fℓ||²_H + C/p Σ_{i=1}^n Σ_{ℓ=1, ℓ≠yi}^c ξiℓ^p
    with f_{yi}(xi) + b_{yi} ≥ fℓ(xi) + bℓ + 2 − ξiℓ  and  ξiℓ ≥ 0,  for i = 1, ..., n; ℓ = 1, ..., c; ℓ ≠ yi
    a non consistent estimator, but practically useful
  ◮ structured outputs
3 A coupling formulation using the convex hulls

Multiclass SVM: complexity issues
n training data (n = 60,000 for MNIST), c classes (c = 10 for MNIST)

approach      | problem size | number of sub problems | discrimination | rejection
1 vs. all     | n            | c                      | ++             | -
1 vs. 1       | 2n/c         | c(c−1)/2               | ++             | -
c SVDD        | n/c          | c                      | -              | ++
all together  | n × c        | 1                      | ++             | -
coupling CH   | n            | 1                      | +              | +

Roadmap 1 Multi Class SVM: 3 different strategies for multi class SVM; Multi Class SVM by decomposition; Multi class SVM; Coupling convex hulls

Multi Class SVM by decomposition
One-Against-All methods → winner-takes-all strategy
One-vs-One: pairwise methods → max-wins voting → directed acyclic graph (DAG) → error-correcting codes → post process probabilities
Hierarchical binary tree for multi-class SVM
http://courses.media.mit.edu/2006fall/mas622j/Projects/aisen-project/

SVM and probabilities (Platt, 1999)
The decision function of the SVM is sign(f(x) + b); log [IP(Y = 1|x) / IP(Y = −1|x)] should have (almost) the same sign as f(x) + b:
    log [IP(Y = 1|x) / IP(Y = −1|x)] = a1(f(x) + b) + a2
    IP(Y = 1|x) = 1 − 1 / (1 + exp(a1(f(x) + b) + a2))
a1 and a2 are estimated using maximum likelihood on new data:
    max_{a1,a2} L  with  L = Π_{i=1}^n IP(Y = 1|xi)^{yi} (1 − IP(Y = 1|xi))^{(1−yi)}
    log L = Σ_{i=1}^n yi log(IP(Y = 1|xi)) + (1 − yi) log(1 − IP(Y = 1|xi))
          = Σ_{i=1}^n yi log [IP(Y = 1|xi) / (1 − IP(Y = 1|xi))] + log(1 − IP(Y = 1|xi))
          = Σ_{i=1}^n yi (a1(f(xi) + b) + a2) − log(1 + exp(a1(f(xi) + b) + a2))
          = Σ_{i=1}^n yi a⊤zi − log(1 + exp(a⊤zi))
Newton iterations: a_new ← a_old − H⁻¹ ∇log L

SVM and probabilities (Platt, 1999)
    max_{a∈IR2} log L = Σ_{i=1}^n yi a⊤zi − log(1 + exp(a⊤zi))
Newton iterations: a_new ← a_old − H⁻¹ ∇log L
    ∇log L = Σ_{i=1}^n (yi − exp(a⊤zi)/(1 + exp(a⊤zi))) zi = Σ_{i=1}^n (yi − IP(Y = 1|xi)) zi = Z⊤(y − p)
    H = −Σ_{i=1}^n zi zi⊤ IP(Y = 1|xi)(1 − IP(Y = 1|xi)) = −Z⊤WZ
Newton iterations: a_new ← a_old + (Z⊤WZ)⁻¹ Z⊤(y − p)

SVM and probabilities: practical issues
    y −→ t = 1 − ε+ = (n+ + 1)/(n+ + 2) if yi = 1;   t = ε− = 1/(n− + 2) if yi = −1
1 in: X, y, f / out: p
2 t ←
3 Z ←
4 loop until convergence
  1 p ← 1 − 1/(1 + exp(a⊤z))
  2 W ← diag(p(1 − p))
  3 a_new ← a_old + (Z⊤WZ)⁻¹ Z⊤(t − p)

SVM and probabilities: pairwise coupling
From pairwise probabilities IP(cℓ, cj) to class probabilities
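The Newton/IRLS loop for Platt's sigmoid fit described above can be sketched in Python (an illustrative sketch; `scores` stands for the SVM outputs f(xi) + b, labels are taken in {0, 1}, and Platt's smoothed targets t replace the raw labels):

```python
import numpy as np

def platt_scaling(scores, y01, n_iter=25):
    """Newton fit of p_i = sigmoid(a1*f_i + a2), following
    a <- a + (Z'WZ)^(-1) Z'(t - p)  with W = diag(p(1-p))."""
    n_pos = y01.sum()
    n_neg = len(y01) - n_pos
    t = np.where(y01 == 1, (n_pos + 1) / (n_pos + 2), 1 / (n_neg + 2))
    Z = np.column_stack([scores, np.ones_like(scores)])   # z_i = (f_i, 1)
    a = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ a))
        W = p * (1 - p)
        a += np.linalg.solve(Z.T @ (W[:, None] * Z), Z.T @ (t - p))
    return a

scores = np.array([-2.0, -1.0, 1.0, 2.0])
labels = np.array([0, 0, 1, 1])
a = platt_scaling(scores, labels)
print(a[0] > 0)  # True: higher SVM score -> higher fitted P(Y=1|x)
```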
pℓ = IP(cℓ |x) min p Xc ℓ=1 X ℓ−1 j=1 IP(cℓ , cj) 2 (pℓ − pj) 2  Q e e ⊤ 0  p µ  =  0 1  with Qℓj =  IP(cℓ , cj) 2 P ℓ 6= j i IP(cℓ , ci) 2 ℓ = j The global procedure : 1 (Xa, ya, Xt, yt) ← split(X, y) 2 (Xℓ, yℓ, Xp, yp) ← split(Xa, ya) 3 loop for all pairs (ci , cj) of classes 1 modeli,j ← train_SVM(Xℓ, yℓ,(ci , cj)) 2 IP(ci , cj) ← estimate_proba(Xp, yp, model) % Platt estimate 4 p ← post_process(Xt, yt,IP) % Pairwise Coupling Wu, Lin & Weng, 2004, Duan & Keerti, 05SVM and probabilities Some facts SVM is universally consistent (converges towards the Bayes risk) SVM asymptotically implements the bayes rule but theoretically: no consistency towards conditional probabilities (due to the nature of sparsity) to estimate conditional probabilities on an interval (typically[ 1 2 − η, 1 2 + η]) to sparseness in this interval (all data points have to be support vectors) Bartlett & Tewari, JMLR, 07SVM and probabilities (2/2) An alternative approach g(x) − ε −(x) ≤ IP(Y = 1|x) ≤ g(x) + ε +(x) with g(x) = 1 1+4−f (x)−α0 non parametric functions ε − and ε + have to verify: g(x) + ε +(x) = exp−a1(1−f (x)−α0)++a2 1 − g(x) − ε −(x) = exp−a1(1+f (x)+α0)++a2 with a1 = log 2 and a2 = 0 Grandvalet et al., 07Roadmap 1 Multi Class SVM 3 different strategies for multi class SVM Multi Class SVM by decomposition Multi class SVM Coupling convex hulls 1.5 1.5 1.5 1.5 1.5 1.5 2.5 2.5 2.5 −0.5 0 0.5 1 1.5 0 0.5 1 1.5 2Multi class SVM: the decision function One hyperplane by class fℓ(x) = w ⊤ ℓ x + bℓ ℓ = 1, c Winner takes all decision function D(x) = Argmax ℓ=1,c w ⊤ 1 x + b1, w ⊤ 2 x + b2, . . . , w ⊤ ℓ x + bℓ , . . . 
, w ⊤ c x + bc  We can revisit the 2 classes case in this setting c × (d + 1) unknown variables (wℓ, bℓ); ℓ = 1, cMulti class SVM: the optimization problem The margin in the multidimensional case m = min ℓ6=yi v ⊤ yi xi − ayi − v ⊤ ℓ xi + aℓ  = v ⊤ yi xi + ayi − max ℓ6=yi v ⊤ ℓ xi + aℓ  The maximal margin multiclass SVM    max vℓ,aℓ m with v ⊤ yi xi + ayi − v ⊤ ℓ xi − aℓ ≥ m for i = 1, n; ℓ = 1, c; ℓ 6= yi and 1 2 Xc ℓ=1 kvℓk 2 = 1 The multiclass SVM    min wℓ,bℓ 1 2 Xc ℓ=1 kwℓk 2 with x ⊤ i (wyi − wℓ) + byi − bℓ ≥ 1 for i = 1, n; ℓ = 1, c; ℓ 6= yiMulti class SVM: KKT and dual form: The 3 classes case    min wℓ,bℓ 1 2 X 3 ℓ=1 kwℓk 2 with w⊤ yi xi + byi ≥ w⊤ ℓ xi + bℓ + 1 for i = 1, n; ℓ = 1, 3; ℓ 6= yi    min wℓ,bℓ 1 2 kw1k 2 + 1 2 kw2k 2 + 1 2 kw3k 2 with w⊤ 1 xi + b1 ≥ w⊤ 2 xi + b2 + 1 for i such that yi = 1 w⊤ 1 xi + b1 ≥ w⊤ 3 xi + b3 + 1 for i such that yi = 1 w⊤ 2 xi + b2 ≥ w⊤ 1 xi + b1 + 1 for i such that yi = 2 w⊤ 2 xi + b2 ≥ w⊤ 3 xi + b3 + 1 for i such that yi = 2 w⊤ 3 xi + b3 ≥ w⊤ 1 xi + b1 + 1 for i such that yi = 3 w⊤ 3 xi + b3 ≥ w⊤ 2 xi + b2 + 1 for i such that yi = 3 L = 1 2 (kw1k 2 + kw2k 2 + kw3k 2 ) −α ⊤ 12(X1(w1 − w2) + b1 − b2 − 1) −α ⊤ 13(X1(w1 − w3) + b1 − b3 − 1) −α ⊤ 21(X2(w2 − w1) + b2 − b1 − 1) −α ⊤ 23(X2(w2 − w3) + b2 − b3 − 1) −α ⊤ 31(X3(w3 − w1) + b3 − b1 − 1) −α ⊤ 32(X3(w3 − w2) + b3 − b2 − 1)Multi class SVM: KKT and dual form: The 3 classes case L = 1 2 kwk 2 − α ⊤(XMw + Ab − 1) with w =   w1 w2 w3   ∈ IR3d M = M ⊗ I =   I −I 0 I 0 −I −I I 0 0 I −I −I 0 I 0 −I I   a 6d × 3d matrix where I the identity matrix and X =   X1 0 0 0 0 0 0 X1 0 0 0 0 0 0 X2 0 0 0 0 0 0 X2 0 0 0 0 0 0 X3 0 0 0 0 0 0 X3   a 2n × 6d matrix with input data X =   X1 X2 X3   n × dMulti class SVM: KKT and dual form: The 3 classes case KKT Stationality conditions = ∇wL = w − M⊤X ⊤α ∇bL = A ⊤α The dual min α∈IR2n 1 2 α ⊤Gα − e ⊤α with Ab = 0 and 0 ≤ α With G = XMM⊤X ⊤ = X (M ⊗ I)(M ⊗ I) ⊤X 
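The winner-takes-all decision function D(x) = Argmax_ℓ (wℓ⊤x + bℓ) used above, as a minimal sketch (names are illustrative):

```python
import numpy as np

def wta_decision(x, W, b):
    """Winner-takes-all: return argmax_l of w_l'x + b_l.
    W has one row per class (c x d), b has c entries; classes indexed 0..c-1."""
    return int(np.argmax(W @ x + b))

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # one hyperplane per class
b = np.zeros(3)
print(wta_decision(np.array([2.0, 0.0]), W, b))    # 0
print(wta_decision(np.array([-1.0, -1.0]), W, b))  # 2
```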
⊤ = X (MM⊤ ⊗ I)X ⊤ = (MM⊤ ⊗ I). × XX ⊤ = (MM⊤ ⊗ I). × 1I K 1I⊤ and M =   1 −1 0 1 0 −1 −1 1 0 0 1 −1 −1 0 1 0 −1 1  Multi class SVM and slack variables (2 variants) A slack for all (Vapnik & Blanz, Weston & Watkins 1998)    min wℓ,bℓ,ξ∈IRcn 1 2 Xc ℓ=1 kwℓk 2 + C Xn i=1 Xc ℓ=1,ℓ6=yi ξiℓ with w⊤ yi xi + byi − w⊤ ℓ xi − bℓ ≥ 1 − ξiℓ and ξiℓ ≥ 0 for i = 1, n; ℓ = 1, c; ℓ 6= yi The dual min α∈IR2n 1 2 α ⊤Gα − e ⊤α with Ab = 0 and 0 ≤ α ≤ C Max error, a slack per training data (Cramer and Singer, 2001)    min wℓ,bℓ,ξ∈IRn 1 2 Xc ℓ=1 kwℓk 2 + C Xn i=1 ξi with (wyi − wℓ) ⊤xi ≥ 1 − ξi for i = 1, n; ℓ = 1, c; ℓ 6= yi X i=1 and ξi ≥ 0 for i = 1, nMulti class SVM and Kernels    min f ∈H,α0,ξ∈IRcn 1 2 Xc ℓ=1 kfℓk 2 H + C Xn i=1 Xc ℓ=1,ℓ6=yi ξiℓ with fyi (xi) + byi − fℓ(xi) − bℓ ≥ 1 − ξiℓ Xn i=1 and ξiℓ ≥ 0 for i = 1, n; ℓ = 1, c; ℓ 6= yi The dual min α∈IR2n 1 2 α ⊤Gα − e ⊤α with Ab = 0 and 0 ≤ α≤ C where G is the multi class kernel matrixOther Multi class SVM Lee, Lin & Wahba, 2004    min f ∈H λ 2 Xc ℓ=1 kfℓk 2 H + 1 n Xn i=1 Xc ℓ=1,ℓ6=yi (fℓ(xi) + 1 c − 1 )+ with Xc ℓ=1 fℓ(x) = 0 ∀x Structured outputs = Cramer and Singer, 2001 MSVMpack : A Multi-Class Support Vector Machine Package Fabien Lauer & Yann GuermeurRoadmap 1 Multi Class SVM 3 different strategies for multi class SVM Multi Class SVM by decomposition Multi class SVM Coupling convex hulls 1.5 1.5 1.5 1.5 1.5 1.5 2.5 2.5 2.5 −0.5 0 0.5 1 1.5 0 0.5 1 1.5 2One more way to derivate SVM Minimizing the distance between the convex hulls    min α ku − vk 2 with u(x) = X {i|yi =1} αi(x ⊤ i x), v(x) = X {i|yi =−1} αi(x ⊤ i x) and X {i|yi =1} αi = 1, X {i|yi =−1} αi = 1, 0 ≤ αi i = 1, nThe multi class case    min α Xc ℓ=1 Xc ℓ ′=1 kuℓ − uℓ ′k 2 with uℓ(x) = X {i|yi =ℓ} αi,ℓ(x ⊤ i x), ℓ = 1, c and X {i|yi =ℓ} αi,ℓ = 1, 0 ≤ αi,ℓ i = 1, n; ℓ = 1, cBibliography Estimating probabilities ◮ Platt, J. (2000). 
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers. MIT Press.
- H.-T. Lin, C.-J. Lin & R. C. Weng (2007). A note on Platt's probabilistic outputs for support vector machines. Machine Learning 68, 267–276.
- http://www.cs.cornell.edu/courses/cs678/2007sp/platt.pdf

Multiclass SVM
- K.-B. Duan & S. S. Keerthi (2005). "Which is the best multiclass SVM method? An empirical study".
- T.-F. Wu, C.-J. Lin & R. C. Weng (2004). Probability estimates for multi-class classification by pairwise coupling. JMLR 5, 975–1005.
- K. Crammer & Y. Singer (2001). "On the algorithmic implementation of multiclass kernel-based vector machines". JMLR 2, 265–292.
- Y. Lee, Y. Lin & G. Wahba (2001). "Multicategory support vector machines". Computing Science and Statistics 33.
- http://www.loria.fr/~guermeur/NN2008_M_SVM_YG.pdf
- http://jmlr.org/papers/volume12/lauer11a/lauer11a.pdf

https://hal.inria.fr/cel-01082588v2/document

Android Tutorial — A Hands-on Lab (Tutoriel Android - TP de prise en main)

To cite this version: Dima Rodriguez. Tutoriel Android - TP de prise en main. Engineering school course notes, France, 2014, 51 pp. HAL Id: cel-01082588, https://hal.archives-ouvertes.fr/cel-01082588v2, submitted on 26 Nov 2014.
Android™ Tutorial — A Hands-on Lab
Dima Rodriguez
Polytech' Paris Sud, November 2014

Table of Contents

Preamble
1 Installing the IDE
2 Configuring the IDE
  Installing additional packages and updates
  Configuring an emulator
3 Our First Android Application
  Creating a project and a "Hello World" application
  Running the application
  Finding your way around the project
  Modifying the user interface
  Responding to events
  Creating and launching another activity
  Creating animations
  Creating a custom View to manage a game
  Timing
  Adding a button to the action bar
  Launching another application
  Changing the language
  Conclusion
Appendices
  Explanation of the default generated code for the Principale class
  Life cycle of an activity

List of Figures
2.1 SDK Manager
2.2 Android Virtual Device Manager
2.3 Creating a virtual device
3.1 Creating a project
3.2 Creating an activity
3.3 New activity
3.4 Running the application
3.5 Overview of the Eclipse interface
3.6 LinearLayout class hierarchy
3.7 First test of the modified application
3.8 Input field and button
3.9 Creating a new activity
3.10 A new XML file defining an animation
3.11 Animation with a LinearLayout
3.12 Animation with a RelativeLayout
3.13 Creating the MonViewPerso class
3.14 Adding a button to launch the game
3.15 Activity with a custom view
3.16 Action bar
3.17 Life cycle of an activity

Preamble

The Android operating system is currently the most widely used OS in the world, running smartphones, tablets, smartwatches, e-readers, interactive televisions, and much more. It is an open-source system based on the Linux kernel. It was created by Android, Inc., which was bought by Google in 2005. Android applications are developed in Java using dedicated libraries.
The goal of this tutorial is to familiarize you with the Android development mindset and its libraries. We will introduce the basic concepts of application development by implementing a few simple features. This tutorial is by no means exhaustive: the potential of Android applications is far greater, and the examples given in this document should not limit your imagination or your curiosity. On the official Android developer site you will find the class documentation, tutorials, and the guidelines for preparing a Google Play release. A glossary at the end of this document defines some of the Android vocabulary used in this tutorial.

1 Installing the IDE

In this section we describe how to install an Android development environment.

Warning: the steps below must be executed in the order given.

a. Download JDK 7 (Java Development Kit), available on Oracle's website¹.
b. Uninstall any earlier versions of the JDK.
c. Install the new JDK.
d. Download the ADT (Android Developer Tools) bundle. It contains the Android SDK (Software Development Kit) and a version of Eclipse with ADT built in.
e. To install the IDE, simply place the downloaded folder in the directory where you usually install your programs (or directly on your main partition) and unzip it. You may rename it if you wish, but make sure the new name contains no spaces or accented characters.
f. In the unzipped folder you will find an Eclipse executable, which you can now launch to start configuring your environment.

Note: at the time of writing, Eclipse is the only officially supported IDE (Integrated Development Environment).

¹ This tutorial was made with JDK7u60.
A new environment, Android Studio, is under development, but it is still in a rather unstable beta. If you want to use a version of Eclipse already installed on your machine, you will need to get the SDK and an ADT plugin and configure Eclipse accordingly.

2 Configuring the IDE

Installing additional packages and updates

a. Launch Eclipse.
b. First, make sure the installed environment is up to date. In the Help menu, select Check for Updates and install any available updates.
c. To check the installed SDK version, go to the Window > Android SDK Manager menu and launch the SDK manager. In the manager (fig. 2.1) you will see the installed SDK version (with the available updates), as well as the installed API (Application Programming Interface) version and the OS version it lets you develop for. Install the packages proposed by default.

Note: if you want to develop for older Android versions, you must install the corresponding API versions.

Configuring an emulator

An emulator reproduces the behavior of a real device virtually. Using an emulator saves us from having to load the application onto a real device every time we want to test it: we launch the application from the IDE and it runs on a virtual device, called an Android Virtual Device (AVD), which emulates the behavior of a phone, a tablet, or another device. Eclipse does not provide an emulator by default, so before creating our application we must configure one.

Figure 2.1 – SDK Manager. In this example, an update is available for the SDK. The installed API is version 20, which targets Android 4.4, but a more recent API exists for Android 5.0.
Figure 2.2 – Android Virtual Device Manager

Go to the Window > Android Virtual Device Manager menu; once the manager is open, click the Create button (fig. 2.2). A configuration window appears (fig. 2.3a). We suggest configuring a Nexus One emulator with the indicated parameters (fig. 2.3b). Note that the device's resolution is reported when it is created. In this example the device has a 480x800 resolution, which corresponds to hdpi (high density dots per inch). This is important to keep in mind when integrating images into the application.

Note: for some of the proposed emulators, the processor image is not installed by default; to create them, you must install a suitable processor image from the SDK Manager.

Figure 2.3 – Creating a virtual device: (a) AVD creation window, (b) creating a Nexus One device

3 Our First Android Application

Creating a project and a "Hello World" application

a. In the File > New menu, select Android Application Project and fill in the information as in figure 3.1.
   - Application name: the name that will appear in the list of applications on the device and in the Play Store.
   - Project name: the name used by Eclipse (typically the same as the application's).
   - Package name: used as the application's identifier; it allows different versions of an application to be treated as the same application.
   - Minimum required SDK: the oldest Android version the application can run on. Avoid going too far back, as it would limit the features you can give your application.
   - Target SDK: the version the application is developed and tested for; typically the latest API version you have installed.¹
   - Compile with: the API version to use for compilation; typically the latest installed SDK version.
   - Theme: the default look your application will have.

¹ This tutorial was made with version 4.4.2.

b. Click Next and keep the default choices. You may change the project's location by unchecking Create Project in Workspace and browsing the disk to select another folder.
c. Click Next. The next window lets you define an icon for your application. We will keep the default icon; later on you can create your own icons for your applications. Note that the image must be provided in several resolutions to adapt to different devices.
d. Click Next. We now come to the creation of an activity (a screen with a graphical interface). Select Blank Activity (fig. 3.2) and click Next.
e. Depending on your ADT version, you will see either the window of figure 3.3a or that of figure 3.3b. The latest version imposes the use of fragments. Every activity has a layout that defines how components are arranged on the screen. An activity can be divided into portions (fragments), each with its own layout. The notion of fragment was introduced to promote the reuse of pieces of an activity (a fragment can be defined once and reused in several activities). Fill in the fields as shown in the figure.
f. Click Finish; the project is created.

Warning: if you created a fragment, it is the fragment_principale.xml file you will need to modify in the rest of this tutorial; otherwise you will modify activite_principale.xml.

Running the application

On the emulator: press the run button (fig. 3.4) and select Android Application in the window that appears.
The emulator starts up; this can take a few minutes, so be patient. Rest assured, you will not have to relaunch it every time you compile your project: leave it open, and each time you compile and relaunch your application it will be reloaded into the running emulator.

Figure 3.1 – Creating a project
Figure 3.2 – Creating an activity
Figure 3.3 – New activity: (a) creating an activity without a fragment, (b) creating an activity with a fragment
Figure 3.4 – Running the application
Figure 3.5 – Overview of the Eclipse interface: project explorer, palette of graphical components, navigator of open files, list of the activity's components, properties of the selected component, switching between graphical and XML views, output, activity preview, debug and run

On a real device: connect the device to the computer with a USB cable and install the driver if necessary. Enable the USB debugging option on your device (generally under Settings > Applications > Development). Launch the application from Eclipse as before; Eclipse loads the application onto your device and starts it.

Note: once your application is compiled, a file MonAppli.apk is created in the bin folder of your working directory. This is your application's executable, and it is the file you deploy to distribute your application. Its contents can be inspected with any standard file compression/decompression tool.

Finding your way around the project

Figure 3.5 shows the main elements of the Eclipse interface. Every Android project must follow a precise hierarchy, which lets the compiler find the various elements and resources when building the application.
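As an illustration, an ADT-generated project tree looks roughly like the sketch below (the names MonAppli and polytech.android.monappli come from this tutorial's project; the exact set of folders may vary with the ADT version):

```text
MonAppli/
├── src/
│   └── polytech/android/monappli/
│       └── Principale.java          (Java code: the application's behavior)
├── gen/
│   └── polytech/android/monappli/
│       └── R.java                   (auto-generated resource identifiers)
├── res/
│   ├── layout/                      (XML files: on-screen arrangement)
│   ├── drawable-ldpi/ … drawable-xxhdpi/   (images per screen density)
│   ├── menu/                        (XML menu definitions)
│   └── values/                      (strings.xml, dimensions, colors, styles)
├── bin/                             (build output, including MonAppli.apk)
└── AndroidManifest.xml              (mandatory application descriptor)
```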
This hierarchy promotes the modularity of Android applications. When the project is created, Eclipse automatically creates folders to hold the Java source files, the XML files, and the media files. The project explorer lets you navigate these folders. The folders we will use most are src and res. The first contains the Java code that defines the application's behavior; the second has subfolders storing the resources that define the application's interface (its appearance).

Note: the separation between functionality and appearance is a key point of the Android philosophy.

The code of the application's main class (Principale.java) is located in the polytech.android.monappli subfolder of src. You will find in the appendix a brief explanation of the code generated there by default. All the classes we create in this project will be saved in the src folder. Everything related to the user interface, on the other hand, goes into the subfolders of res, briefly described below:

- layout groups the XML files that define how components are laid out on screen. From the project's creation, it already contains the layout of the main activity we created.
- drawable-**** contains everything that can be drawn on screen: images (preferably PNG), shapes, animations, transitions, icons, etc. Five drawable folders let developers provide graphical elements for every kind of Android device according to its resolution. By populating these folders properly, you can create applications whose interface adapts to every screen resolution with a single .apk file.
  - ldpi: low-resolution dots per inch,
    for images aimed at low-resolution screens (~120 dpi)
  - mdpi: for medium-resolution screens (~160 dpi)
  - hdpi: for high-resolution screens (~240 dpi)
  - xhdpi: for extra-high-resolution screens (~320 dpi)
  - xxhdpi: for extra-extra-high-resolution screens (~480 dpi)
- menu contains the XML files defining menus.
- values contains the XML files defining constant values (strings, dimensions, colors, styles, etc.).

In the gen folder you will see Java code generated automatically by Eclipse. We are particularly interested in the R.java file in the polytech.android.monappli package. This file defines a class R in which the identifiers of the application's resources are defined. Each time you add a resource to your application, an identifier is generated automatically in this class, allowing you to reference the resource later from your code.²

You will also find, at the root of the project, a file named AndroidManifest.xml. This file is mandatory in every Android project and must always have exactly this name. It is what allows the system to recognize the application.

Modifying the user interface

For now our application merely displays a message on screen; in this section we will modify the interface to add an input field and a button. A user interface generally consists of what are called ViewGroups, which contain objects of type View as well as other ViewGroups. A View is a component, such as a button or a text field, and ViewGroups are containers that define an arrangement of the components (Views) placed in them. ViewGroup is the base class of the various layouts.
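This nesting of a ViewGroup around Views can be pictured with a small layout sketch (illustrative only; the string reference reuses the project's hello_world resource and the component choice is arbitrary):

```xml
<!-- A ViewGroup (here a LinearLayout) containing two Views -->
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="vertical" >

    <!-- A View: a text component -->
    <TextView
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="@string/hello_world" />

    <!-- Another View: a button -->
    <Button
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="@string/hello_world" />
</LinearLayout>
```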
Understanding the layout

The arrangement of our interface is defined in the fragment_principale.xml file located in the layout folder of res (or in activite_principale.xml if you did not define a fragment when creating your project). Open this file.

² Inside the class R, several so-called nested classes are defined, such as string, drawable, layout, menu, id, etc. A nested class is a member of the class that contains it. Such classes are generally used when a class is only needed inside another class; declared private, it is visible only within the enclosing class, which can in turn access the nested class's private attributes. It is a way to improve code readability by grouping functionality that belongs together. In our case, all the classes nested in R are public, hence accessible from outside, but since they are members of R you must go through R to access them; we use notations such as R.string, these classes being static.

The first tag you will find is the one that defines the type of container composing the interface; it dictates how the components will be arranged. Several container types exist; the most common are RelativeLayout, LinearLayout, TableLayout, GridView, and ListView. Using a RelativeLayout, for instance, means that components are placed in positions relative to one another. A LinearLayout imposes a linear arrangement, vertical or horizontal; a GridView arranges elements in a grid that can scroll; and so on.
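For reference, the default layout generated for a blank activity looks roughly like the following sketch (the file's actual XML was lost in the PDF extraction; this is a reconstruction of the typical ADT template, with a RelativeLayout as root container):

```xml
<RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent" >

    <TextView
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="@string/hello_world" />
</RelativeLayout>
```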
Inside the root tag you will see a set of attributes defined in the format platform:attribute="value". For example, the first attribute, xmlns:android, states where the Android tags used in this file are defined. The TextView tag, a child of the root tag, defines a text component placed on the layout: it is indeed on this component that the "Hello World" displayed by our application is written. This string is defined by the android:text attribute. The notation "@string/hello_world" refers to a string named hello_world defined in the strings.xml file (in the values folder).

Changing the layout type

We will now change the layout type to turn it into a LinearLayout. Figure 3.6 traces the derivation of the LinearLayout class. We will then add our components onto this layout in a linear arrangement.

Figure 3.6 – LinearLayout class hierarchy. Layouts are ViewGroups, which are themselves Views [1].

In the fragment_principale.xml file:
- delete the TextView element;
- replace the root element with a LinearLayout;
- add the android:orientation attribute and set its value to "horizontal".

Adding an input field

- Add an EditText element inside the LinearLayout.

We have thus placed an input field with the following attributes:
- android:id gives this View a unique identifier, which we will use to reference the object inside our code. The @ symbol is required to reference a resource object from an XML file; id is the resource type and chp_saisie is the name we give our resource. The + symbol is used when defining an ID for the first time: it tells the SDK tools that an ID should be generated in the R.java file to reference this object.
  A public static final attribute chp_saisie will be defined in the id class. The + symbol must be used only once, at the moment the resource is declared for the first time; afterwards, to reference this element from an XML file, it suffices to write @id/chp_saisie.
- android:layout_width specifies the element's width. "wrap_content" means the View should be just wide enough to fit its content. If instead we specified "match_parent", as we did for the LinearLayout, the EditText would occupy the full width of the screen, since its width would be that of its parent, i.e. the LinearLayout.
- android:layout_height: same as layout_width, but for the height.
- android:hint specifies the default text to display in the input field when it is empty. We could have hard-coded the string here, but we prefer to use a resource that we will define in strings.xml. Note that the + symbol is not needed here, because we are referencing a concrete resource (to be defined in the XML file) and not an identifier that the SDK must create in the R class.

Note: always favor string resources over hard-coded strings. This gathers all of your interface's text in one place, which simplifies searching and updating the text, and it is indispensable for making your application multilingual. The IDE will display a warning when this recommendation is not followed.

After the change we have just made, when you save the file, an error message will tell you that the identifier str_chp_saisie is unknown. So let us define it.

- Open the strings.xml file located in res > values.
- Add a new string named str_chp_saisie whose value is "Entrer un texte".
- You may optionally delete the line that defines "hello_world".

Your strings.xml file will then define the strings MonAppli (the application name) and Entrer un texte.

Figure 3.7 – First test of the modified application

Once your changes are saved, you will notice the creation of two attributes in the R.java file:
- a constant attribute named chp_saisie in the id class: a unique number identifying the EditText element we just added, which will let us manipulate the element from the code;
- a constant attribute named str_chp_saisie in the string class, which refers to the string and will let us use it in the code.

Launch the application; the emulator displays a screen like the one in figure 3.7. Type some text and notice how the size of the input field adapts to the length of the text.

Figure 3.8 – Input field and button: (a) default arrangement, (b) the EditText with a weight of 1

Adding a button

- In the strings.xml file, add a string named "btn_envoyer" whose value is Envoi.
- In the layout file, add a Button element.
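Putting the pieces of this section together, the layout with the input field and the button would look roughly as follows (a reconstruction, since the tutorial's exact XML was lost in extraction; the android:layout_weight="1" attribute on the EditText is an assumption matching variant (b) of figure 3.8, where the field expands to fill the remaining width):

```xml
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:orientation="horizontal" >

    <EditText
        android:id="@+id/chp_saisie"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_weight="1"
        android:hint="@string/str_chp_saisie" />

    <Button
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:text="@string/btn_envoyer" />
</LinearLayout>
```

Without the layout_weight attribute, the field only takes the width of its content, as in figure 3.8a.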