Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos

Eric Nichols, Department of Computer Science, Indiana University, Bloomington, Indiana, USA. Email: epnichols@gmail.com
Charles DuHadway, Hrishikesh Aradhye, and Richard F. Lyon, Google, Inc., Mountain View, California, USA. Email: {duhadway,hrishi,dicklyon}@google.com

Abstract—Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.

Keywords—talent discovery; singing; intonation; music; melody; video; YouTube

I. INTRODUCTION AND PRIOR WORK

Video sharing sites such as YouTube provide people everywhere a platform to showcase their talents. Occasionally, this leads to incredible successes. Perhaps the best known example is Justin Bieber, who is believed to have been discovered on YouTube and whose videos have since received over 2 billion views. However, many talented performers are never discovered. Part of the problem is the sheer volume of videos: sixty hours of video are uploaded to YouTube every minute (nearly ten years of content every day) [23]. This creates a "rich get richer" bias in which only those with a large established viewer base continue to attract most of the new visitors. Moreover, even "singing at home" videos vary widely not only in choice of song but also in the sophistication of the audio capture equipment and the extent of postproduction. An algorithm that can analyze all of YouTube's daily uploads to automatically identify talented amateur singers and musicians would go a long way towards removing these biases.

We present in this paper a system that uses acoustic analysis and machine learning to (a) detect "singing at home" videos, and (b) quantify the quality of the musical performances therein. To the best of our knowledge, no prior work exists for this specific problem, especially given an unconstrained dataset such as videos on YouTube.

Figure 1. "Singing at home" videos.

While performance quality will always have a large subjective component, one relatively objective measure of quality is intonation—that is, how in-tune is a musical performance? In the case of unaccompanied audio, the method in [14] uses features derived from both intonation and vibrato analysis to automatically evaluate singing quality. These sorts of features have also been investigated by music educators attempting to quantify intonation quality under certain constraints. The InTune system [1], for example, processes an instrumentalist's recording to generate a graph of deviations from desired pitches, based on alignment with a known score followed by analysis of the strongest FFT bin near each expected pitch. Other systems for intonation visualization are reviewed in [1]; these differ in whether or not the score is required and in the types of instruments recognized.
The practical value of such systems on large-scale data such as YouTube is limited because (a) the original recording and/or score may not be known, and (b) most published approaches to intonation estimation assume a fixed reference pitch such as A=440 Hz. Previous work on estimating the reference pitch has generally been based on FFT or filterbank analysis [8], [9], [10]. To ensure scalability to a corpus of millions of videos, we propose a computationally efficient means for estimating both the reference pitch and overall intonation. We then use it to construct an intonation-based feature for musical performance quality.

Another related subproblem relevant to performance quality is the analysis of melody in audio. There are many approaches to automatically extracting the melody line from a polyphonic audio signal (see the review in [15]), ranging from simple autocorrelation methods [3], [5] to FFT analysis and more complex systems [16], [18], [22]. Melody extraction has been a featured task in the MIREX competition in recent years; the best result so far for singing is the 78% accuracy obtained by [16] on a standard test set with synthetic (as opposed to natural) accompaniment. That system combined FFT analysis with heuristics which favor extracted melodies with typically-musical contours. We present a new melody-based feature for musical performance quality.

In addition to these new features, the proposed approach uses a large set of previously published acoustic features including MFCC, SAI [12], intervalgram [21], volume, and spectrogram. When identifying candidate videos we also use video features including HOG [17], CONGAS [19], and Hue-Saturation color histograms [11].

II. APPROACH

A. Identifying Candidate Videos

We first identify "singing at home" videos. These videos are correlated with features such as ambient indoor lighting, a head-and-shoulders view of a person singing in front of a fixed camera, few instruments, and a single dominant voice. A full description of this stage is beyond this paper's scope. We use the approach in [2] to train a classifier to identify these videos. In brief, we collected a large set of videos that were organically included in YouTube playlists related to amateur performances. We then used this set as weakly labeled ground truth against a large set of randomly picked negative samples to train a "singing at home" classifier. We use a combination of audio and visual features including HOG, CONGAS [19], MFCC, SAI [12], intervalgram [21], volume, and spectrograms. Our subsequent analyses for feature extraction and singing quality estimation are based on the high-precision range of this classifier. Figure 1 shows a sample of videos identified by this approach.

B. Feature Extraction

We developed two feature sets, each comprising 10 floating point numbers: an intonation feature set, intonation, and a melody line feature set, melody.

1) Intonation-based Features:

Intonation Histogram: Given that for an arbitrary YouTube video we know neither the tuning reference nor the desired pitches, we implemented a two-step algorithm to estimate the in-tuneness of an audio recording. The first step computes a tuning reference (see Figure 2). To this end, we first detect STFT amplitude peaks in the audio (monophonic 22.05 kHz, frame size 4096 samples = 186 ms, 5.38 Hz bin size). From these peaks we construct an amplitude-weighted histogram, and set the tuning reference to the maximum bin.

Figure 2. Pitch histogram for an in-tune recording.
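As a concrete illustration of this first step, the sketch below estimates the tuning reference from STFT amplitude peaks. It is a minimal reading of the description above, not the authors' code: the hop size, peak picking, frequency range, and the choice to express the reference as a cent offset from A=440 are all assumptions.

```python
import numpy as np

def tuning_offset_cents(audio, sr=22050, frame=4096, bins=100):
    """Estimate a tuning reference as an offset (in cents) from A=440.

    Sketch of step 1 above: detect STFT amplitude peaks, fold each peak's
    frequency onto a 0-100 cent offset from the A=440 equal-tempered grid,
    accumulate an amplitude-weighted histogram, and return the centre of
    the maximum bin.  Peak picking and the frequency range are assumptions.
    """
    window = np.hanning(frame)
    hist = np.zeros(bins)
    for start in range(0, len(audio) - frame, frame):   # non-overlapping frames
        spec = np.abs(np.fft.rfft(window * audio[start:start + frame]))
        freqs = np.fft.rfftfreq(frame, 1.0 / sr)
        # crude local-maximum peak picking
        p = np.where((spec[1:-1] > spec[:-2]) & (spec[1:-1] >= spec[2:]))[0] + 1
        p = p[(freqs[p] > 80) & (freqs[p] < 4000)]
        cents = 1200.0 * np.log2(freqs[p] / 440.0)      # offset from A440
        offset = np.mod(cents, 100.0)                   # fold to one semitone
        idx = (offset / 100.0 * bins).astype(int) % bins
        np.add.at(hist, idx, spec[p])                   # amplitude-weighted
    return (np.argmax(hist) + 0.5) * 100.0 / bins       # cents above the grid
```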
The second step uses the previously computed tuning reference to build a histogram of distances from the nearest chromatic pitches. Note that this computation is very simple and efficient compared with filterbank approaches such as [14], and because it allows for multiple peaks, it works with polyphonic audio recordings. In this step we first use the tuning reference to induce a grid of "correct" pitch frequencies based on an equal-tempered chromatic scale. We then make an amplitude-weighted histogram of differences from the correct frequencies. Histogram heights are normalized to sum to 1. We used 7 bins to cover each 100-cent range (1 semitone), which works out nicely because the middle bin then collects pitches within ±7.1 cents of the correct pitch; the range ±7 cents was found to sound "in-tune" in experiments [7]. When possible we match audio to known reference tracks using the method in [21] and use this matching to identify and remove frames that are primarily non-pitched, such as talking or rapping, when computing the tuning reference.

Feature Representation: We then generate a summary vector consisting of the 7 heights of the histogram followed by three low-order weighted moments about zero. These statistics (standard deviation, skew, and kurtosis) describe the data's deviation from the reference tuning grid. See Table I. This set of 10 values, which we refer to collectively as intonation, summarizes the intonation of a recording by describing how consistent the peaks of each frame are with the tuning reference derived from the set of all these peaks.

Figure 3. Distance to tuning reference. (a) In-tune (good) audio. (b) Out-of-tune (bad) audio.

Table I: Intonation feature vectors for Figures 3(a) and 3(b).

              bar1  bar2  bar3  bar4  bar5  bar6  bar7  stddev  skew   kurtosis
In-tune       .00   .05   .14   .70   .08   .03   .00   .006    .15    2.34
Out-of-tune   .14   .20   .13   .30   .10   .07   .06   .015    -.42   -.87

Figure 3(b) shows the histogram for an out-of-tune recording. For the high-quality recording in Figure 3(a), the central bar of the histogram is relatively high, indicating that most peaks were at in-tune frequencies. The histogram is also relatively symmetrical and has lower values for more out-of-tune frequencies. The high kurtosis and low skew and standard deviation of the data reflect this. The low-quality recording, on the other hand, does have a central peak, but it is much shorter relative to the other bars, and in general its distribution's moments do not correspond well with a normal distribution.

Note that while we expect a symmetrical, peaked distribution in this histogram to be an indicator of "good singing", we do not build this expectation into our prediction system explicitly; rather, these histogram features are provided as input to a machine learning algorithm. Good performances across different genres of music might result in differently shaped histograms; the system should learn which shapes to expect based on the training data. For example, consider music to which extensive pitch correction has been applied by a system such as Auto-Tune. We processed several such tracks, and the result was histograms with a very tall central bar and very short other bars; almost all notes fell within 7 cents of the computed reference grid. If listeners rated these recordings highly, this shape might lead our system to predict high quality; if listeners disliked this sound, it might have the inverse effect. Similarly, consider vocal vibrato.
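A corresponding sketch of this second step and the resulting 10-dimensional intonation vector follows. The bin layout matches the 7-bins-per-semitone description above; the exact units and the definition of the weighted moments about zero are assumptions.

```python
import numpy as np

def intonation_features(peak_cents, peak_amps, ref_offset_cents):
    """10-dimensional intonation feature: 7 histogram heights + 3 moments.

    `peak_cents`/`peak_amps` are the spectral-peak positions (in cents
    relative to A=440) and amplitudes gathered over all frames;
    `ref_offset_cents` is the estimated tuning reference.  The weighted
    moments about zero are one plausible reading of the text, and the
    cent units are an assumption, not a verified reproduction.
    """
    # signed distance (cents) to the nearest pitch of the reference grid
    d = np.mod(peak_cents - ref_offset_cents + 50.0, 100.0) - 50.0
    w = peak_amps / np.sum(peak_amps)

    edges = np.linspace(-50.0, 50.0, 8)           # 7 bins of ~14.3 cents each
    hist, _ = np.histogram(d, bins=edges, weights=w)
    hist = hist / hist.sum()

    m2 = np.sum(w * d**2)                         # weighted moments about zero
    m3 = np.sum(w * d**3)
    m4 = np.sum(w * d**4)
    std = np.sqrt(m2)
    skew = m3 / m2**1.5
    kurt = m4 / m2**2 - 3.0                       # excess kurtosis
    return np.concatenate([hist, [std, skew, kurt]])
```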
If the extent of the vibrato (the amplitude of the frequency modulation) is much more than 50 cents in each direction from the mean frequency of a note, then this approach yields a flatter histogram, which can obscure the intonation quality we are trying to capture. Operatic singing often has vibrato with an extent of a whole semitone, giving a very flat distribution; early music performance, on the other hand, is characterized by very little vibrato. Popular music comprises the bulk of the music studied here. Although we did not analyze the average vibrato extent in this collection, an informal look at histograms produced with this approach suggests that performances that sound in-tune in our data tend to have histograms with a central peak. For musical styles with a large vibrato extent, such as opera, we would need to refine our technique to explicitly model the vibrato in order to recover the mean fundamental frequency of each note, as in [14]. For styles with a moderate amount of vibrato, frequency energy is placed symmetrically about the central histogram bar, and in-tune singing still yields the expected peaked distribution. For example, if a perfectly sinusoidal vibrato ranges from 50 cents above to 50 cents below the mean frequency, then approximately 65% of each note's duration is spent within the middle three bars of the histogram; reducing the vibrato extent to 20 cents above and below causes all frequencies of an in-tune note to fall within the middle three bars.

2) Melody-based Features:

Melody Line: As we are interested in the quality of the vocal line in particular, a primary goal in analyzing singing quality is to isolate the vocal signal. One method for doing so is to extract the melody line and to assume that, most of the time, the primary melody is the singing part we are interested in. This is a reasonable assumption for many of the videos we encounter in which people have recorded themselves singing, especially when someone is singing over a background karaoke track. Our problem would be easier if we had access to a symbolic score (e.g., the sheet music) for the piece being sung, as in [1]. However, we have no information available other than the recording itself. We therefore use two ideas to extract a good candidate for a melody line: the Stabilized Auditory Image (SAI) [20] and the Viterbi algorithm.

Algorithm: We compute the SAI for each frame of audio, with the frame rate set to 50 frames per second. At 22,050 Hz, this results in a frame size of 441 samples. The SAI is a matrix with lag times on one axis and frequency on the other; we convert the lag dimension into a pitch-class representation for each frame using the method employed in [21], but without wrapping pitch to chroma. This gives a vector of strengths in each frequency bin. Our frequency bins span 8 octaves, and we tried various numbers of bins per octave, such as 12, 24, and 36. In our experiments, 12 bins per octave gave the best results. This 96-element vector of bin strengths for each frame looks much like a spectrogram, although unlike a spectrogram, we cannot recover the original audio signal with an inverse transform. However, the bins with high strengths should correspond to perceptually salient frequencies, and we assume that for most frames the singer's voice will be one of the most salient frequencies. We next extract a melody using a best-path approach.
We represent the successive SAI summary vectors as layers in a trellis graph, where nodes correspond to frequency bins for each frame and each adjacent pair of layers is fully connected. We then use the Viterbi algorithm to find the best path using the following transition score function:

$$S_t[i, j] = \mathrm{SAI}_t[j] - \alpha\,\frac{p_m + p_l + |i - j|}{T} \tag{1}$$

where

$$p_m = \begin{cases} 1 & \text{if } i \neq j \\ 0 & \text{otherwise} \end{cases}, \qquad p_l = \begin{cases} 1 & \text{if the transition is} \ge 1 \text{ octave} \\ 0 & \text{otherwise} \end{cases},$$

and $T$ is the frame length in seconds. We used $\alpha = 0.15$ in our experiments.

Figure 4(a) shows the SAI summary frames and the best path computed for a professional singer. Figure 4(b) shows the best path for the recording of a badly rated amateur singer. We observed that the paths look qualitatively different in the two cases, although the difference is hard to describe precisely. For the professional singer, the path is smoother and is characterized by longer horizontal bars (corresponding to single sustained notes) and fewer large vertical jumps. Note that this is just an example suggestive of some potentially useful features to be extracted below; the training set and learning algorithm will make use of these features only if they turn out to be useful for prediction.

Feature Representation: Recalling that our aim is to study not the quality of the underlying melody of the song, but rather the quality of the performance, we realized we could use the shape of the extracted melody as an indicator of the strength and quality of the singing. This idea may seem counterintuitive, but we study characteristics of the extracted melody—rather than the correlation between the performance and a desired melody—simply because we do not have access to the sheet music and "correct" notes of the melody. Obviously, this depends a great deal on the quality of the melody-extraction algorithm, but because we train a classifier on the extraction results, we expect that even with an imperfect extraction algorithm, useful trends should emerge that can help distinguish between low- and high-quality performances. Differences between songs also obviously affect the global melody contour, but we maintain that for any given song, a better singer should produce a melody line that is more easily extracted and which locally conforms better to expected shapes.

To study the shape and quality of the extracted melody, we first define a "note" to be a contiguous horizontal segment of the note path, so that each note has a single frequency bin. We then compute 10 statistics at the note level to form the melody feature vector (a sketch of the path extraction and these statistics follows the list):

1) Mean and standard deviation of note length (µ_len, σ_len)
2) Difference between the standard deviation and mean of note length
3) Mean and standard deviation of note frequency bin number (µ_bin, σ_bin)
4) Mean and standard deviation of note strength (sum of bin strengths divided by note length) (µ_str, σ_str)
5) Mean and standard deviation of the vertical leap distance between adjacent notes, in bins (µ_leap, σ_leap)
6) Total Viterbi best-path score divided by the total number of frames
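The following sketch implements the best-path extraction of Eq. (1) and the note-level statistics listed above, assuming the per-frame SAI summary vectors are already available as a matrix. Initialization, tie-breaking, and the exact note-strength convention are assumptions rather than the published implementation.

```python
import numpy as np

def extract_melody(sai, alpha=0.15, frame_len=1.0 / 50, bins_per_octave=12):
    """Viterbi best-path melody extraction over per-frame SAI summary
    vectors (num_frames x num_bins), using the transition score of Eq. (1)."""
    n_frames, n_bins = sai.shape
    i = np.arange(n_bins)[:, None]                          # previous bin
    j = np.arange(n_bins)[None, :]                          # current bin
    p_m = (i != j).astype(float)                            # note-change penalty
    p_l = (np.abs(i - j) >= bins_per_octave).astype(float)  # octave-leap penalty
    trans = -alpha * (p_m + p_l + np.abs(i - j)) / frame_len

    score = sai[0].copy()
    back = np.zeros((n_frames, n_bins), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] + trans                       # predecessors x current
        back[t] = np.argmax(cand, axis=0)
        score = cand[back[t], np.arange(n_bins)] + sai[t]

    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, float(score.max()) / n_frames              # path, per-frame score

def melody_features(path, sai, score_per_frame):
    """The 10 note-level statistics listed above.  A "note" is a maximal
    run of frames whose path stays in a single frequency bin."""
    change = np.flatnonzero(np.diff(path)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(path)]))

    lengths = (ends - starts).astype(float)
    bins_ = path[starts].astype(float)
    strengths = np.array([sai[s:e, path[s]].sum() / (e - s)
                          for s, e in zip(starts, ends)])
    leaps = np.abs(np.diff(bins_)) if len(bins_) > 1 else np.zeros(1)

    return np.array([lengths.mean(), lengths.std(),
                     lengths.std() - lengths.mean(),
                     bins_.mean(), bins_.std(),
                     strengths.mean(), strengths.std(),
                     leaps.mean(), leaps.std(),
                     score_per_frame])
```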
The intuition behind this choice of statistics follows. Comparing Figures 4(a) and 4(b), we see that the path is more fragmented for the lower-quality performance: there are more, shorter notes than there should be. Note length is therefore an obvious statistic to compute. If we assume that note length is governed by a Poisson process, we would expect an exponential distribution of note lengths, and the mean and standard deviation would be about the same. However, we conjecture that a Poisson process is not the best model for the lengths of notes in musical compositions. If the best path chosen by the Viterbi algorithm is more in line with the correct melody, we would expect a non-exponential distribution; the difference between the standard deviation and mean of note length is therefore computed as a useful signal about the distribution type. Note strength is computed because we suspect that notes with larger amplitude values are more likely to correspond to instances of strong, clear singing. Note frequency bins are analyzed because vocal performances usually lie in a certain frequency range; deviations from that range would signal that something went wrong in the melody detection process and hence that the performance might not be so good. Leap distance between adjacent notes is a useful statistic because musical melody paths follow certain patterns, and problems in the path show up if the leaps are not distributed as expected. Finally, the average path score per frame from the Viterbi algorithm is recorded, although it may prove to be a useless statistic because it is notoriously hard to interpret path scores from different data files—more analysis is necessary to determine which of these features are most useful. Table II gives examples of these statistics for the paths in Figures 4(a) and 4(b) as well as for one other medium-quality melody.

Figure 4. Best-path melody extraction. (a) A better quality recording. (b) A lower quality recording. The best path is shown as a blue line superimposed on the plot. Higher-amplitude frequency bins are shown in red. Upper and lower frequency bins were cropped for clarity.

Table II: Melody feature vectors for Figures 4(a) and 4(b) and a medium-quality example.

         µ_len   σ_len   σ_len−µ_len  µ_bin  σ_bin  µ_str  σ_str  µ_leap  σ_leap  path score
good     71.77   83.71   11.93        37.12  4.00   0.094  0.028  3.44    2.52    32.14
medium   43.64   41.61   -2.02        38.71  2.87   0.105  0.012  3.46    2.18    30.49
bad      45.46   46.08   0.62         38.16  3.84   0.101  0.032  3.84    2.60    32.64

C. Performance Quality Estimation

Given a pool of candidate videos, our next step is to estimate the performance quality of each video. For sets on the order of a hundred videos, human ratings could be used directly for ranking. However, to consider thousands or more videos we require an automated solution. We train kernelized passive-aggressive (PA) [6] rankers to estimate the quality of each candidate video. We tried several kernels, including linear, intersection, and polynomial, and found that the intersection kernel worked best overall. Unless noted otherwise, we used this kernel in all our experiments. The training data for these rankers is given as pairs of video feature sets where one video has been observed to be of higher quality than the other. Given a new video, the ranker generates a single quality score estimate.

III. EXPERIMENTAL RESULTS

A. "Singing At Home" Video Dataset

We have described two feature sets for describing properties of a melody, each a vector of 10 floating point numbers. To test their utility, the features are used to predict human ratings on a set of pairs of music videos. This corpus is composed of over 5,000 pairs of videos, where for each pair, human judges have selected which video of the pair is better. Carterette et al. [4] showed that preference judgements of this type can be more effective than absolute judgements. Each pair is evaluated by at least 3 different judges.
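To make the ranking step concrete, here is a simplified sketch of training on such preference pairs. The paper trains a kernelized passive-aggressive ranker with an intersection kernel; this linear PA-I variant only illustrates the pairwise training idea and is not the authors' implementation.

```python
import numpy as np

def train_preference_ranker(pairs, epochs=5, C=1.0):
    """Simplified linear passive-aggressive ranker trained on preference
    pairs (feat_better, feat_worse): enforce score(better) > score(worse)
    by a margin, making the smallest weight change that fixes a violation."""
    dim = len(pairs[0][0])
    w = np.zeros(dim)
    for _ in range(epochs):
        for better, worse in pairs:
            diff = np.asarray(better) - np.asarray(worse)
            loss = max(0.0, 1.0 - w @ diff)               # hinge on the margin
            if loss > 0:
                tau = min(C, loss / (diff @ diff + 1e-12))
                w += tau * diff                           # passive-aggressive step
    return w                                              # score(x) = w @ x
```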
In this experiment, we only consider the subset of video pairs where the winner was selected unanimously. Our training dataset is made up of this subset, which comprises 1,573 unique videos.

B. Singing Quality Ranker Training

For each video, we computed the intonation and melody feature vectors described above, as well as a large feature vector, large, composed of other audio analysis features including MFCC, SAI [12], intervalgram [21], volume, and spectrograms. These features are used to train a ranker which outputs a floating-point score for each input example. To test the ranker, we generate the ranking score for each example in each pair, choose the higher-scoring example as the winner, and compare this winner to the one chosen by unanimous human consent. Thus, although we use a floating-point ranker as an intermediate step, the final ranker output is a simple binary choice, and baseline performance is 50%.

C. Prediction Results

Training consisted of 10-fold cross-validation. The percentages given below are the mean accuracies over the 10 cross-validation folds, where accuracy is computed as the number of correct predictions of the winner in a pair divided by the total number of pairs. Overall, large yields the best accuracy, 67.5%; melody follows with 61.2%; and intonation achieves just 51.9%. The results for our two new feature vectors, as well as for large, are given in Table III. Because large has so many dimensions, it is unsurprising that it performs better than our 10-dimensional features. To better understand the utility of each feature, we broke large down into the subsets also listed in Table III, calculated the accuracy gain above the 50% baseline for each feature subset, computed the average gain per feature dimension, and ranked the features accordingly. The intonation and melody features offer the most accuracy per dimension.

Table III: Prediction accuracy by feature set.

Feature                  Accuracy (%)  # dimensions  Accuracy gain / # dimensions  Rank
intonation               51.9          10            0.1900                        2
melody                   61.2          10            1.1200                        1
large                    67.5          14,352        0.0012                        9
all                      67.8          14,372        0.0012                        8
large-MFCC               61.4          2,000         0.0057                        5
large-SAI-boxes          66.7          7,168         0.0023                        6
large-SAI-intervalgram   58.6          4,096         0.0021                        7
large-spectrum           62.7          1,024         0.0124                        4
large-volume             59.9          64            0.1547                        3

Our metric of gain per dimension is important because we are concerned with computational resources when analyzing large collections of videos. For the subsets of the large vector which required thousands of dimensions, it was interesting to see how useful each subset was relative to the amount of computation required (assuming that the number of dimensions is a rough correlate of computation time). For example, it seems clear that melody is more useful than large-SAI-intervalgram, as it has better accuracy with fewer dimensions; melody is also probably more useful than large-MFCC when computation time is limited, as the two have similar accuracy but very different accuracy gains per dimension.

D. Effect of Production Quality

We performed one further experiment to determine whether the above rankers were simply learning to distinguish videos with better production quality. To test this possibility we trained another ranker on pairs of videos with similar production quality. This dataset contained 999 pairs, with ground truth established through the majority vote of 5 human operators. As before, we trained and tested rankers using 10-fold cross-validation. The average accuracy of the resulting rankers, using the large feature set, was 61.8%. This suggests that the rankers are indeed capturing more than simple production quality.
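The evaluation protocol and the accuracy-gain-per-dimension metric of Table III can be written down directly. The sketch below also reproduces the Table III number for the melody feature set as a sanity check; the scoring function is any trained ranker, for example score_fn = lambda x: w @ x with the weights from the sketch above.

```python
import numpy as np

def pairwise_accuracy(score_fn, pairs):
    """Fraction of video pairs where the higher-scoring video matches the
    (unanimous) human choice.  `pairs` holds (feat_winner, feat_loser)
    feature-vector tuples; `score_fn` is any trained ranking function."""
    correct = sum(score_fn(win) > score_fn(lose) for win, lose in pairs)
    return correct / len(pairs)

def accuracy_gain_per_dimension(accuracy, n_dims, baseline=50.0):
    """The comparison metric used in Table III: percentage points of
    accuracy above the 50% baseline, divided by feature dimensionality."""
    return (accuracy - baseline) / n_dims

# e.g. the melody feature set: (61.2 - 50) / 10 = 1.12 gain per dimension
print(accuracy_gain_per_dimension(61.2, 10))
```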
IV. DISCUSSION

The results in Table III show that the melody feature set performed quite well, with the best accuracy gain per dimension and good raw accuracy. The intonation feature set achieved second place according to the accuracy-gain metric, but its raw accuracy was not much better than baseline. Kernel choice may have had a large impact, however: the large feature set performs better with the intersection kernel, while intonation alone does better (54.1%) with a polynomial kernel. Integrating the different types of features using multi-kernel methods might help.

Note that while we developed these features for vocal analysis, they could be applied to other music sources—the feature sets analyze the strongest or most perceptually salient frequency components of a signal, which might come from any instrument in a recording. In our case of "singing at home" videos, these analyzed components are often the sung melody that we are interested in, but even when they are not, the intonation and melody shape of other components of the recording are still likely indicators of overall video quality.

The output of our system is a set of high-quality video performances, but the system is not (yet) capable of identifying the very small set of performers with extraordinary talent and potential. This is not surprising given that pitch and consistently strong singing are only two of many factors that determine a musician's popularity. Our system has two properties that make it well suited for use as a filtering step for a competition driven by human ratings. First, it can evaluate very large sets of candidate videos which would overwhelm a crowd-based ranking system with limited users. Second, it can eliminate obviously low-quality videos which would otherwise reduce the entertainment value of such a competition.

V. FUTURE WORK

Our ongoing work includes several improvements to these features. For instance, we have used the simple bin index of the FFT to estimate frequencies. Although it would increase computation time, we could use the instantaneous phase (with the derivative approximated by a one-frame difference) to more precisely estimate the frequency of a component present in a particular bin [13]. With this modification, step 1 of our algorithm would no longer use a histogram; instead, we would compute the tuning reference that minimizes the total error in step 2. Our present implementation avoids this fine-tuning by using a fairly large frame size (at the expense of time resolution), so that our maximum error (half the bin size) is 2.7 Hz, or approximately 10 cents for a pitch near 440 Hz.

The proposed intonation feature extraction algorithm can easily be modified to run on small segments (e.g., 10 seconds) of audio instead of over the whole song. This has the advantage of allowing the algorithm to throw out extremely out-of-tune frames which are probably due to speech or other non-pitched events. Finally, we are also working on substantially improving the process of vocal line extraction from a polyphonic signal. Once this is achieved, many additional details could augment our current feature sets to provide a deeper analysis of singing quality; such features may include vibrato analysis of the melody line, strength of the vocal signal, dynamics (expression), and the duration and strength of long notes.
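The phase-based frequency refinement mentioned in Section V is a standard phase-vocoder-style estimate; a minimal sketch, assuming consecutive non-overlapping frames, is given below. Function and parameter names are illustrative only.

```python
import numpy as np

def refined_peak_frequency(prev_frame, cur_frame, bin_idx, sr=22050,
                           frame=4096, hop=4096):
    """Refine an FFT peak frequency using the phase advance between two
    consecutive frames (one-frame difference of the instantaneous phase).
    With hop = frame, the wrapped phase deviation resolves the frequency
    to within about half a bin (~2.7 Hz here), matching the error bound
    quoted above."""
    X0 = np.fft.rfft(np.hanning(frame) * prev_frame)
    X1 = np.fft.rfft(np.hanning(frame) * cur_frame)
    bin_hz = sr / frame
    expected = 2 * np.pi * bin_idx * hop / frame          # expected phase advance
    dphi = np.angle(X1[bin_idx]) - np.angle(X0[bin_idx]) - expected
    dphi = np.mod(dphi + np.pi, 2 * np.pi) - np.pi        # wrap to (-pi, pi]
    return bin_idx * bin_hz + dphi * sr / (2 * np.pi * hop)
```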
REFERENCES

[1] K. Ae and C. Raphael: "InTune: A System to Support an Instrumentalist's Visualization of Intonation", Computer Music Journal, Vol. 34, No. 3, Fall 2010.
[2] H. Aradhye, G. Toderici, and J. Yagnik: "Video2Text: Learning to Annotate Video Content", ICDM Workshop on Internet Multimedia Mining, 2009.
[3] P. Boersma: "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound", Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, pp. 97–110, 1993.
[4] B. Carterette, P. Bennett, D. Chickering, and S. Dumais: "Here or There: Preference Judgments for Relevance", Advances in Information Retrieval, Vol. 4956/2008, pp. 16–27.
[5] A. de Cheveigne: "YIN, a fundamental frequency estimator for speech and music", J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002.
[6] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer: "Online passive-aggressive algorithms", Journal of Machine Learning Research (JMLR), Vol. 7, 2006.
[7] D. Deutsch: The Psychology of Music, p. 205, 1982.
[8] S. Dixon: "A Dynamic Modelling Approach to Music Recognition", ICMC, 1996.
[9] E. Gomez: "Comparative Analysis of Music Recordings from Western and Non-Western Traditions by Automatic Tonal Feature Extraction", Empirical Musicology Review, Vol. 3, No. 3, pp. 140–156, March 2008.
[10] A. Lerch: "On the requirement of automatic tuning frequency estimation", ISMIR, 2006.
[11] T. Leung and J. Malik: "Representing and recognizing the visual appearance of materials using three-dimensional textons", IJCV, 2001.
[12] R. Lyon, M. Rehn, S. Bengio, T. Walters, and G. Chechik: "Sound Retrieval and Ranking Using Sparse Auditory Representations", Neural Computation, Vol. 22, pp. 2390–2416, 2010.
[13] D. McMahon and R. Barrett: "Generalization of the method for the estimation of the frequencies of tones in noise from the phases of discrete Fourier transforms", Signal Processing, Vol. 12, No. 4, pp. 371–383, 1987.
[14] T. Nakano, M. Goto, and Y. Hiraga: "An Automatic Singing Skill Evaluation Method for Unknown Melodies Using Pitch Interval Accuracy and Vibrato Features", ICSLP, pp. 1706–1709, 2006.
[15] G. Poliner: "Melody Transcription From Music Audio: Approaches and Evaluation", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, May 2007.
[16] J. Salamon and E. Gómez: "Melody Extraction from Polyphonic Music: MIREX 2011", Music Information Retrieval Evaluation eXchange (MIREX), extended abstract, 2011.
[17] J. Shotton, M. Johnson, and R. Cipolla: "Semantic texton forests for image categorization and segmentation", CVPR, 2008.
[18] L. Tan and A. Alwan: "Noise-robust F0 estimation using SNR-weighted summary correlograms from multi-band comb filters", ICASSP, pp. 4464–4467, 2011.
[19] E. Tola, V. Lepetit, and P. Fua: "A fast local descriptor for dense matching", CVPR, 2008.
[20] T. Walters: Auditory-Based Processing of Communication Sounds, Ph.D. thesis, University of Cambridge, 2011.
[21] T. Walters, D. Ross, and R. Lyon: "The Intervalgram: An Audio Feature for Large-scale Melody Recognition", accepted for CMMR, 2012.
[22] L. Yi and D. Wang: "Detecting pitch of singing voice in polyphonic audio", ICASSP, 2005.
[23] YouTube: "Statistics", http://www.youtube.com/t/press_statistics, retrieved April 11, 2012.

Estimation, Optimization, and Parallelism when Data is Sparse or Highly Varying

John C. Duchi, Michael I. Jordan, H. Brendan McMahan
November 10, 2013

Abstract

We study stochastic optimization problems when the data is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of the increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium to large-scale learning tasks.

1 Introduction and problem setting

In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let $\{F(\cdot;\xi),\ \xi \in \Xi\}$ be a collection of real-valued convex functions, each of whose domains contains the convex set $\mathcal{X} \subset \mathbb{R}^d$. For a probability distribution $P$ on $\Xi$, we consider the following optimization problem:

$$\operatorname*{minimize}_{x \in \mathcal{X}} \quad f(x) := \mathbb{E}[F(x;\xi)] = \int_\Xi F(x;\xi)\,dP(\xi). \tag{1}$$

By data sparsity, we mean that the sampled data $\xi$ is sparse: samples $\xi$ are assumed to lie in $\mathbb{R}^d$, and if we define the support $\mathrm{supp}(x)$ of a vector $x$ to be the set of indices of its non-zero components, we assume that

$$\mathrm{supp}\,\nabla F(x;\xi) \subset \mathrm{supp}\,\xi. \tag{2}$$

The sparsity condition (2) means that $F(x;\xi)$ does not "depend" on the values of $x_j$ for indices $j$ such that $\xi_j = 0$. (Formally, if we define $\pi_\xi$ as the coordinate projection that zeros all indices $j$ of its argument where $\xi_j = 0$, then $F(\pi_\xi(x);\xi) = F(x;\xi)$ for all $x, \xi$; this is implied by first-order conditions for convexity [6, Chapter VI.2].) This type of data sparsity is prevalent in statistical optimization problems and machine learning applications, though in spite of its prevalence, study of such problems has been somewhat limited.

As a motivating example, consider a text classification problem: data $\xi \in \mathbb{R}^d$ represents words appearing in a document, and we wish to minimize a logistic loss $F(x;\xi) = \log(1 + \exp(\langle \xi, x\rangle))$ on the data (we encode the label implicitly with the sign of $\xi$). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance very few entries of $\xi$ are non-zero [8]. From a modelling perspective, it thus makes sense to allow a dense predictor $x$: any non-zero entry of $\xi$ is potentially relevant and important. In a sense, this is dual to the standard approach to high-dimensional problems; one usually assumes that the data $\xi$ may be dense, but that there are only a few relevant features, so a parsimonious model $x$ is desirable [2]. So while such sparse-data problems are prevalent—natural language processing, information retrieval, and other large-data settings all have significant data sparsity—they do not appear to have attracted as much study as their high-dimensional "duals" of dense data and sparse predictors.

In this paper, we investigate algorithms and their inherent limitations for solving problem (1) under natural conditions on the data-generating distribution. Recent work in the optimization and machine learning communities has shown that data sparsity can be leveraged to develop parallel optimization algorithms [12, 13, 14], but these works do not study the statistical effects of data sparsity.
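The sparsity condition (2) is immediate for the logistic-loss example above, since the gradient is a scalar multiple of the sample ξ. A minimal sketch (hypothetical code, not from the paper):

```python
import numpy as np

def logistic_grad(x, xi):
    """Gradient of F(x; xi) = log(1 + exp(<xi, x>)) for a sparse sample xi.

    The gradient equals sigmoid(<xi, x>) * xi, so supp(grad F(x; xi)) is
    contained in supp(xi): only the coordinates where xi is non-zero are
    ever touched by an update."""
    return xi / (1.0 + np.exp(-(xi @ x)))

xi = np.zeros(10)
xi[[2, 7]] = [1.0, -1.0]          # sparse sample (label encoded in the sign)
x = np.random.randn(10)
g = logistic_grad(x, xi)
assert set(np.flatnonzero(g)) <= set(np.flatnonzero(xi))
```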
In recent work, Duchi et al. [4] and McMahan and Streeter [9] develop "adaptive" stochastic gradient algorithms designed to address problems in sparse data regimes (2). These algorithms exhibit excellent practical performance and have theoretical guarantees on their convergence, but it is not clear whether they are optimal—in the sense that no algorithm can attain better statistical performance—or whether they can leverage parallel computing as in the papers [12, 14].

In this paper, we take a two-pronged approach. First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition (2). These results have two main implications. They show that in some scenarios, learning with sparse data is quite difficult, as essentially each coordinate $j \in [d]$ can be relevant and must be optimized for. In spite of this seemingly negative result, we are also able to show that the AdaGrad algorithms of [4, 9] are optimal, and we show examples in which their dependence on the dimension $d$ can be made exponentially better than standard gradient methods.

As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity in terms of the number of samples $\xi$ used. We develop two new algorithms, asynchronous dual averaging (AsyncDA) and asynchronous AdaGrad (AsyncAdaGrad), which allow asynchronous parallel solution of the problem (1) for general convex $f$ and $\mathcal{X}$. Combining insights of Niu et al.'s Hogwild! [12] with a new analysis, we prove that our algorithms can achieve linear speedup in the number of processors while maintaining optimal statistical guarantees. We also give experiments on text-classification and web-advertising tasks to illustrate the benefits of the new algorithms.

Notation. For a convex function $x \mapsto f(x)$, we let $\partial f(x)$ denote its subgradient set at $x$ (if $f$ has two arguments, $\partial_x f(x, y)$ is the subgradient with respect to $x$). For a positive semi-definite matrix $A$, we let $\|\cdot\|_A$ be the (semi)norm defined by $\|v\|_A^2 := \langle v, Av\rangle$, where $\langle\cdot,\cdot\rangle$ is the standard inner product. We let $1\{\cdot\}$ be the indicator function, which is 1 when its argument is true and 0 otherwise.

2 Minimax rates for sparse optimization

We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let $\hat{x}$ denote any estimator for a minimizer of the objective (1). We define the optimality gap $\epsilon_N$ for the estimator $\hat{x}$ based on $N$ samples $\xi^1, \ldots, \xi^N$ from the distribution $P$ as

$$\epsilon_N(\hat{x}, F, \mathcal{X}, P) := f(\hat{x}) - \inf_{x \in \mathcal{X}} f(x) = \mathbb{E}_P[F(\hat{x};\xi)] - \inf_{x \in \mathcal{X}} \mathbb{E}_P[F(x;\xi)].$$

This quantity is a random variable, since $\hat{x}$ is a random variable (it is a function of $\xi^1, \ldots, \xi^N$). To define the minimax error, we thus take expectations of the quantity $\epsilon_N$, though we require a bit more than simply $\mathbb{E}[\epsilon_N]$. We let $\mathcal{P}$ denote a collection of probability distributions, and we consider a collection of loss functions specified by a collection $\mathcal{F}$ of convex losses $F : \mathcal{X} \times \Xi \to \mathbb{R}$. We can then define the minimax error for the family of losses $\mathcal{F}$ and distributions $\mathcal{P}$ as

$$\epsilon_N^*(\mathcal{X}, \mathcal{P}, \mathcal{F}) := \inf_{\hat{x}} \sup_{P \in \mathcal{P}} \sup_{F \in \mathcal{F}} \mathbb{E}_P[\epsilon_N(\hat{x}, F, \mathcal{X}, P)], \tag{3}$$

where the infimum is taken over all possible estimators (optimization schemes) $\hat{x}$.
2.1 Minimax lower bounds

Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider in order to provide the lower bound. For the next proposition, we let $\mathcal{P}$ consist of distributions supported on $\Xi = \{-1, 0, 1\}^d$, and we let $p_j := P(\xi_j \neq 0)$ be the marginal probability of appearance of feature $j$ ($j \in \{1, \ldots, d\}$). For our class of functions, we set $\mathcal{F}$ to consist of functions $F$ satisfying the sparsity condition (2) and with the additional constraint that for $g \in \partial_x F(x;\xi)$, the $j$th coordinate satisfies $|g_j| \le M_j$ for a constant $M_j < \infty$. We obtain

Proposition 1. Let the conditions of the preceding paragraph hold. Let $R$ be a constant such that $\mathcal{X} \supset [-R, R]^d$. Then

$$\epsilon_N^*(\mathcal{X}, \mathcal{P}, \mathcal{F}) \ge \frac{R}{8} \sum_{j=1}^d M_j \min\left\{ p_j,\ \frac{\sqrt{p_j}}{\sqrt{N \log 3}} \right\}.$$

We provide the proof of Proposition 1 in Appendix A.1 and give a few remarks here. We begin with a corollary to Proposition 1 that follows when the data $\xi$ obeys a type of power law: let $p_0 \in [0, 1]$, and assume that $P(\xi_j \neq 0) = p_0 j^{-\alpha}$. We have

Corollary 2. Let $\alpha \ge 0$. Let the conditions of Proposition 1 hold with $M_j \equiv M$ for all $j$, and assume the power law condition $P(\xi_j \neq 0) = p_0 j^{-\alpha}$ on the coordinate appearance probabilities. Then

(1) If $d > (p_0 N)^{1/\alpha}$,

$$\epsilon_N^*(\mathcal{X}, \mathcal{P}, \mathcal{F}) \ge \frac{MR}{8}\left[ \frac{2}{2-\alpha}\sqrt{\frac{p_0}{N}}\left((p_0 N)^{\frac{2-\alpha}{2\alpha}} - 1\right) + \frac{p_0}{1-\alpha}\left(d^{1-\alpha} - (p_0 N)^{\frac{1-\alpha}{\alpha}}\right) \right].$$

(2) If $d \le (p_0 N)^{1/\alpha}$,

$$\epsilon_N^*(\mathcal{X}, \mathcal{P}, \mathcal{F}) \ge \frac{MR}{8}\sqrt{\frac{p_0}{N}}\left[ \frac{1}{1-\alpha/2}\, d^{1-\frac{\alpha}{2}} - \frac{1}{1-\alpha/2} \right].$$

For simplicity, assume that the features are not too extraordinarily sparse, say $\alpha \in [0, 2]$, and that the number of samples is large enough that $d \le (p_0 N)^{1/\alpha}$. Then we find ourselves in regime (2) of Corollary 2, so that the lower bound on optimization error is of order

$$MR\sqrt{\frac{p_0}{N}}\, d^{1-\frac{\alpha}{2}} \ \text{ when } \alpha < 2, \qquad MR\sqrt{\frac{p_0}{N}}\, \log d \ \text{ when } \alpha \to 2, \qquad MR\sqrt{\frac{p_0}{N}} \ \text{ when } \alpha > 2. \tag{4}$$

These results beg the question of tightness: are they improvable? As we see presently, they are not.

2.2 Algorithms for attaining the minimax rate

The lower bounds specified by Proposition 1 and the subsequent specializations are sharp, meaning that they are unimprovable by more than constant factors. To show this, we review a few stochastic gradient algorithms. We first recall stochastic gradient descent, after which we review dual averaging methods and an extension of both.

We begin with stochastic gradient descent (SGD): for this algorithm, we repeatedly sample $\xi \sim P$, compute $g \in \partial_x F(x;\xi)$, then perform the update $x \leftarrow \Pi_{\mathcal{X}}(x - \eta g)$, where $\eta$ is a stepsize parameter and $\Pi_{\mathcal{X}}$ denotes Euclidean projection onto $\mathcal{X}$. Standard analyses of stochastic gradient descent (e.g. [10]) show that after $N$ samples $\xi$, in our setting the SGD estimator $\hat{x}(N)$ satisfies

$$\mathbb{E}[f(\hat{x}(N))] - \inf_{x \in \mathcal{X}} f(x) \le O(1)\, \frac{R_2 M \sqrt{\sum_{j=1}^d p_j}}{\sqrt{N}}, \tag{5}$$

where $R_2$ denotes the $\ell_2$-radius of $\mathcal{X}$. Dual averaging, due to Nesterov [11] and referred to as "follow the regularized leader" in the machine learning literature (see, e.g., the survey article by Hazan [5]), is somewhat more complex. In dual averaging, one again samples $g \in \partial_x F(x;\xi)$, but instead of updating the parameter vector $x$ one updates a dual vector $z$ by $z \leftarrow z + g$, then computes

$$x \leftarrow \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle z, x\rangle + \frac{1}{\eta}\,\psi(x) \right\},$$

where $\psi(x)$ is a strongly convex function defined over $\mathcal{X}$ (often one takes $\psi(x) = \frac{1}{2}\|x\|_2^2$). The dual averaging algorithm, as we shall see, is somewhat more natural in asynchronous and parallel computing environments, and it enjoys the same type of convergence guarantees (5) as SGD.
The AdaGrad algorithm [4, 9] is a slightly more complicated extension of the preceding stochastic gradient methods. It maintains a diagonal matrix $S$, where upon receiving a new sample $\xi$, AdaGrad performs the following: it computes $g \in \partial_x F(x;\xi)$, then updates $S_j \leftarrow S_j + g_j^2$ for $j \in [d]$. Depending on whether the dual averaging or stochastic gradient descent (SGD) variant is being used, AdaGrad performs one of two updates. In the dual averaging case, it maintains the dual vector $z$, which is updated by $z \leftarrow z + g$; in the SGD case, the parameter $x$ is maintained. The updates for the two cases are then

$$x \leftarrow \operatorname*{argmin}_{x' \in \mathcal{X}} \left\{ \langle g, x'\rangle + \frac{1}{2\eta}\left\langle x' - x,\ S^{\frac{1}{2}}(x' - x)\right\rangle \right\}$$

for stochastic gradient descent and

$$x \leftarrow \operatorname*{argmin}_{x' \in \mathcal{X}} \left\{ \langle z, x'\rangle + \frac{1}{2\eta}\left\langle x',\ S^{\frac{1}{2}} x'\right\rangle \right\}$$

for dual averaging, where $\eta$ is a stepsize. An appropriate choice of $\eta$ then shows that after $N$ samples $\xi$, the averaged parameter $\hat{x}(N)$ that AdaGrad returns satisfies

$$\mathbb{E}[f(\hat{x}(N))] - \inf_{x \in \mathcal{X}} f(x) \le O(1)\, \frac{R_\infty M}{\sqrt{N}} \sum_{j=1}^d \sqrt{p_j}, \tag{6}$$

where $R_\infty$ denotes the $\ell_\infty$-radius of $\mathcal{X}$ (e.g. [4, Section 1.3 and Theorem 5], where one takes $\eta \approx R_\infty$). By inspection, the AdaGrad rate (6) matches the lower bound in Proposition 1 and is thus optimal. It is interesting to note, though, that in the power law setting of Corollary 2 (recall the error orders (4)), a calculation shows that the multiplier for the SGD guarantee (5) becomes $R_\infty \sqrt{d}\,\max\{d^{(1-\alpha)/2}, 1\}$, while AdaGrad attains rate at worst $R_\infty \max\{d^{1-\alpha/2}, \log d\}$ (by evaluation of $\sum_j \sqrt{p_j}$). Thus for $\alpha > 1$ the AdaGrad rate is no worse, and for $\alpha \ge 2$ it is more than $\sqrt{d}/\log d$ better than SGD—an exponential improvement in the dimension.
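A minimal sketch of the dual-averaging form of AdaGrad described above, specialized to a box domain so that the argmin has a closed form, is given below. The stepsize handling and the box specialization are simplifications, not the algorithm exactly as analyzed in [4, 9].

```python
import numpy as np

def adagrad_dual_averaging(sample_xi, grad, T, d, eta, R_inf, delta=1e-8):
    """Dual-averaging AdaGrad sketch on the box X = [-R_inf, R_inf]^d,
    where the argmin of <z, x> + (1/2 eta) <x, S^{1/2} x> reduces to
    coordinate-wise clipping.  `sample_xi` draws a data point and
    `grad(x, xi)` returns a subgradient; the averaged iterate is returned."""
    z = np.zeros(d)                    # dual vector: running sum of gradients
    S = np.full(d, delta)              # diagonal sum of squared gradients
    x = np.zeros(d)
    x_sum = np.zeros(d)
    for _ in range(T):
        xi = sample_xi()
        g = grad(x, xi)
        z += g
        S += g * g
        x = np.clip(-eta * z / np.sqrt(S), -R_inf, R_inf)
        x_sum += x
    return x_sum / T
```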
3 Parallel and asynchronous optimization with sparsity

As we noted in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of AdaGrad-type algorithms, it is natural to focus on their parallelization in the hope that we can leverage their ability to "adapt" to sparsity in the data. To provide the setting for our further algorithms, we first revisit Niu et al.'s Hogwild!.

The Hogwild! algorithm of Niu et al. [12] is an asynchronous (parallelized) stochastic gradient algorithm that proceeds as follows. To apply Hogwild!, we must assume the domain $\mathcal{X}$ in problem (1) is a product space, meaning that it decomposes as $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$, where $\mathcal{X}_j \subset \mathbb{R}$. Fix a stepsize $\eta > 0$. Then a pool of processors, each running independently, performs the following updates asynchronously to a centralized vector $x$:

1. Sample $\xi \sim P$
2. Read $x$ and compute $g \in \partial_x F(x;\xi)$
3. For each $j$ such that $g_j \neq 0$, update $x_j \leftarrow \Pi_{\mathcal{X}_j}(x_j - \eta g_j)$

Here $\Pi_{\mathcal{X}_j}$ denotes projection onto the $j$th coordinate of the domain $\mathcal{X}$. The key of Hogwild! is that in step 2, the parameter $x$ at which $g$ is calculated may be somewhat inconsistent—it may have received partial gradient updates from many processors—though for appropriate problems this inconsistency is negligible. Indeed, Niu et al. [12] show a linear speedup in optimization time as the number of independent processors grows; they show this empirically in many scenarios, providing a proof under the somewhat restrictive assumption that there is at most one non-zero entry in any gradient $g$.

3.1 Asynchronous dual averaging

One weakness of Hogwild! is that, as written, it appears applicable only to problems for which the domain $\mathcal{X}$ is a product space, and the known analysis assumes that $\|g\|_0 = 1$ for all gradients $g$. In an effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, AsyncDA. In AsyncDA, instead of asynchronously updating a centralized parameter vector $x$, we maintain a centralized dual vector $z$. A pool of processors performs asynchronous additive updates to $z$, where each processor repeatedly performs the following updates:

1. Read $z$ and compute $x := \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ \langle z, x\rangle + \frac{1}{\eta}\psi(x) \right\}$  // Implicitly increment the "time" counter $t$ and let $x(t) = x$
2. Sample $\xi \sim P$ and let $g \in \partial_x F(x;\xi)$  // Let $g(t) = g$
3. For $j \in [d]$ such that $g_j \neq 0$, update $z_j \leftarrow z_j + g_j$

Because the actual computation of the vector $x$ in asynchronous dual averaging (AsyncDA) is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function $\psi$ and domain $\mathcal{X}$. The only communication point between any of the processors is the addition operation in step 3. As noted by Niu et al. [12], this operation can often be performed atomically on modern processors.

In our analysis of AsyncDA, and in our subsequent analysis of the adaptive methods, we require a measurement of time elapsed. With that in mind, we let $t$ denote a time index that exists (roughly) behind the scenes. We let $x(t)$ denote the vector $x \in \mathcal{X}$ computed in the $t$th step 1 of the AsyncDA algorithm, that is, whichever $x$ is the $t$th actually computed by any of the processors. This quantity exists and is recoverable from the algorithm, and it is possible to track the running sum $\sum_{\tau=1}^t x(\tau)$. Additionally, we require two assumptions encapsulating the conditions underlying our analysis.

Assumption A. There is an upper bound $m$ on the delay of any processor. In addition, for each $j \in [d]$ there is a constant $p_j \in [0, 1]$ such that $P(\xi_j \neq 0) \le p_j$.

We also require an assumption about the continuity (Lipschitzian) properties of the loss functions being minimized; the assumption amounts to a second moment constraint on the subgradients of the instantaneous $F$ along with a rough measure of the sparsity of the gradients.

Assumption B. There exist constants $M$ and $(M_j)_{j=1}^d$ such that the following bounds hold for all $x \in \mathcal{X}$: $\mathbb{E}[\|\partial_x F(x;\xi)\|_2^2] \le M^2$, and for each $j \in [d]$ we have $\mathbb{E}[|\partial_{x_j} F(x;\xi)|] \le p_j M_j$.

With these definitions, we have the following theorem, which captures the convergence behavior of AsyncDA under the assumption that $\mathcal{X}$ is a Cartesian product, meaning that $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_d$ with $\mathcal{X}_j \subset \mathbb{R}$, and that $\psi(x) = \frac{1}{2}\|x\|_2^2$. Note that the algorithm itself can still be efficiently parallelized for more general convex $\mathcal{X}$, even if the theorem does not apply.

Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then

$$\mathbb{E}\left[\sum_{t=1}^T F(x(t);\xi^t) - F(x^*;\xi^t)\right] \le \frac{1}{2\eta}\|x^*\|_2^2 + \frac{\eta}{2}\, T M^2 + \eta T m \sum_{j=1}^d p_j^2 M_j^2.$$

We provide the proof of Theorem 3 in Appendix B. As stated, the theorem is somewhat unwieldy, so we provide a corollary and a few remarks to explain and simplify the result. Under the more stringent condition that $|\partial_{x_j} F(x;\xi)| \le M_j$, Assumption A implies $\mathbb{E}[\|\partial_x F(x;\xi)\|_2^2] \le \sum_{j=1}^d p_j M_j^2$. Thus, without loss of generality, for the remainder of this section we take $M^2 = \sum_{j=1}^d p_j M_j^2$, which serves as an upper bound on the Lipschitz continuity constant of the objective function $f$. We then obtain the following corollary.

Corollary 4. Define $\hat{x}(T) = \frac{1}{T}\sum_{t=1}^T x(t)$, and set $\eta = \|x^*\|_2 / (M\sqrt{T})$. Then

$$\mathbb{E}[f(\hat{x}(T)) - f(x^*)] \le \frac{M\|x^*\|_2}{\sqrt{T}} + \frac{m\|x^*\|_2}{2M\sqrt{T}} \sum_{j=1}^d p_j^2 M_j^2.$$

Corollary 4 is almost immediate.
To see the result, note that since $\xi^t$ is independent of $x(t)$, we have $\mathbb{E}[F(x(t);\xi^t) \mid x(t)] = f(x(t))$; applying Jensen's inequality to $f(\hat{x}(T))$ and performing an algebraic manipulation gives the corollary.

If the data is suitably "sparse," meaning that $p_j \le 1/m$ (which may also occur if the data is of relatively high variance in Assumption B), the bound in Corollary 4 simplifies to

$$\mathbb{E}[f(\hat{x}(T)) - f(x^*)] \le \frac{3}{2}\,\frac{M\|x^*\|_2}{\sqrt{T}} = \frac{3}{2}\,\frac{\sqrt{\sum_{j=1}^d p_j M_j^2}\;\|x^*\|_2}{\sqrt{T}}, \tag{7}$$

which is the convergence rate of stochastic gradient descent (and dual averaging) even in non-asynchronous situations (5). In non-sparse cases, setting $\eta \propto \|x^*\|_2 / \sqrt{m M^2 T}$ in Theorem 3 recovers the bound

$$\mathbb{E}[f(\hat{x}(T)) - f(x^*)] \le O(1)\,\sqrt{m}\cdot\frac{M\|x^*\|_2}{\sqrt{T}}.$$

The convergence guarantee (7) shows that after $T$ timesteps we have error scaling as $1/\sqrt{T}$; however, if we have $k$ processors, then updates can occur roughly $k$ times as quickly, as all updates are asynchronous. Thus in time scaling as $n/k$ we can evaluate $n$ gradient samples: a linear speedup.

3.2 Asynchronous AdaGrad

We now turn to extending AdaGrad to asynchronous settings, developing AsyncAdaGrad (asynchronous AdaGrad). As in the AsyncDA algorithm, AsyncAdaGrad maintains a shared dual vector $z$ among the processors, which is the sum of gradients observed; AsyncAdaGrad also maintains the matrix $S$, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The matrix $S$ is initialized as $\mathrm{diag}(\delta^2)$, where $\delta_j \ge 0$ is an initial value. Each processor asynchronously performs the following iterations:

1. Read $S$ and $z$ and set $G = S^{1/2}$. Compute $x := \operatorname*{argmin}_{x \in \mathcal{X}}\left\{ \langle z, x\rangle + \frac{1}{2\eta}\langle x, Gx\rangle \right\}$  // Implicitly increment the "time" counter $t$ and let $x(t) = x$, $S(t) = S$
2. Sample $\xi \sim P$ and let $g \in \partial F(x;\xi)$
3. For $j \in [d]$ such that $g_j \neq 0$, update $S_j \leftarrow S_j + g_j^2$ and $z_j \leftarrow z_j + g_j$

As in the description of AsyncDA, we note that $x(t)$ is the vector $x \in \mathcal{X}$ computed in the $t$th "step" of the algorithm (step 1), and we similarly associate $\xi^t$ with $x(t)$. To analyze AsyncAdaGrad, we make a somewhat stronger assumption on the sparsity properties of the losses $F$ than Assumption B.

Assumption C. There exist constants $(M_j)_{j=1}^d$ such that for any $x \in \mathcal{X}$ and $\xi \in \Xi$, we have $\mathbb{E}[(\partial_{x_j} F(x;\xi))^2 \mid \xi_j \neq 0] \le M_j^2$.

Indeed, taking $M^2 = \sum_j p_j M_j^2$ shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result, whose proof we defer to Appendix C.

Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that $\delta^2 \ge M_j^2 m$ for all $j$ and that $\mathcal{X} \subset [-R_\infty, R_\infty]^d$. Then

$$\sum_{t=1}^T \mathbb{E}\left[F(x(t);\xi^t) - F(x^*;\xi^t)\right] \le \sum_{j=1}^d \min\left\{ \frac{1}{\eta} R_\infty^2\, \mathbb{E}\!\left[\Big(\delta^2 + \sum_{t=1}^T g_j(t)^2\Big)^{\frac{1}{2}}\right] + \eta\, \mathbb{E}\!\left[\Big(\sum_{t=1}^T g_j(t)^2\Big)^{\frac{1}{2}}\right](1 + p_j m),\ \ M_j R_\infty p_j T \right\}.$$

We can also relax the condition on the initial constant diagonal term $\delta$ slightly, which gives a qualitatively similar result (see Appendix C.3).

Corollary 6. Under the conditions of Theorem 5, assume instead that $\delta^2 \ge M_j^2 \min\{m,\ 6\max\{\log T,\ m p_j\}\}$ for all $j$. Then

$$\sum_{t=1}^T \mathbb{E}\left[F(x(t);\xi^t) - F(x^*;\xi^t)\right] \le \sum_{j=1}^d \min\left\{ \frac{1}{\eta} R_\infty^2\, \mathbb{E}\!\left[\Big(\delta^2 + \sum_{t=1}^T g_j(t)^2\Big)^{\frac{1}{2}}\right] + \frac{3}{2}\eta\, \mathbb{E}\!\left[\Big(\sum_{t=1}^T g_j(t)^2\Big)^{\frac{1}{2}}\right](1 + p_j m),\ \ M_j R_\infty p_j T \right\}.$$

It is natural to ask in which situations the bound provided by Theorem 5 and Corollary 6 is optimal. We note that, as in the case of Theorem 3, we may take an expectation with respect to $\xi^t$ and obtain a convergence rate for $f(\hat{x}(T)) - f(x^*)$, where $\hat{x}(T) = \frac{1}{T}\sum_{t=1}^T x(t)$. By Jensen's inequality, we have for any $\delta$ that

$$\mathbb{E}\!\left[\Big(\delta^2 + \sum_{t=1}^T g_j(t)^2\Big)^{\frac{1}{2}}\right] \le \Big(\delta^2 + \sum_{t=1}^T \mathbb{E}[g_j(t)^2]\Big)^{\frac{1}{2}} \le \sqrt{\delta^2 + T p_j M_j^2}.$$
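Before interpreting these bounds, the AsyncAdaGrad worker loop of Section 3.2 can be sketched as follows, again specialized to a box domain so the argmin has a closed form. Python threads stand in for the processors; real implementations rely on atomic adds rather than these unsynchronized in-place updates.

```python
import numpy as np
import threading

def async_adagrad(sample_xi, grad, d, eta, R_inf, delta,
                  n_workers=4, steps_per_worker=1000):
    """AsyncAdaGrad sketch: a shared dual vector z and diagonal accumulator
    S receive asynchronous additive updates, while each worker computes its
    own (possibly slightly stale) iterate x locally."""
    z = np.zeros(d)
    S = np.full(d, float(delta) ** 2)

    def worker():
        for _ in range(steps_per_worker):
            # 1. read (a possibly inconsistent) S, z and compute x locally
            x = np.clip(-eta * z / np.sqrt(S), -R_inf, R_inf)
            # 2. sample a data point and a subgradient
            xi = sample_xi()
            g = grad(x, xi)
            # 3. additive updates on the non-zero coordinates only
            j = np.flatnonzero(g)
            S[j] += g[j] ** 2
            z[j] += g[j]

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return np.clip(-eta * z / np.sqrt(S), -R_inf, R_inf)
```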
For interpretation, let us now make a few assumptions on the probabilities $p_j$. If we assume that $p_j \le c/m$ for a universal (numerical) constant $c$, then Theorem 5 guarantees that

$$\mathbb{E}[f(\hat{x}(T)) - f(x^*)] \le O(1)\left(\frac{1}{\eta} R_\infty^2 + \eta\right) \sum_{j=1}^d M_j \min\left\{ \frac{\sqrt{\log(T)/T + p_j}}{\sqrt{T}},\ p_j \right\}. \tag{8}$$

Auditory Sparse Coding

Steven R. Ness, University of Victoria
Thomas Walters, Google Inc.
Richard F. Lyon, Google Inc.

CONTENTS
1.1 Summary
1.2 Introduction
1.2.1 The stabilized auditory image
1.3 Algorithm
1.3.1 Pole–Zero Filter Cascade
1.3.2 Image Stabilization
1.3.3 Box Cutting
1.3.4 Vector Quantization
1.3.5 Machine Learning
1.4 Experiments
1.4.1 Sound Ranking
1.4.2 MIREX 2010
1.5 Conclusions

1.1 Summary

The concept of sparsity has attracted considerable interest in the field of machine learning in the past few years. Sparse feature vectors contain mostly values of zero and only one or a few non-zero values. Although these feature vectors can be classified by traditional machine learning algorithms, such as SVM, various recently developed algorithms explicitly take advantage of the sparse nature of the data, leading to massive speedups in time as well as improved performance. Some fields that have benefited from the use of sparse algorithms are finance, bioinformatics, text mining [1], and image classification [4]. Because of their speed, these algorithms perform well on very large collections of data [2]; large collections are becoming increasingly relevant given the huge amounts of data collected and warehoused by Internet businesses.

In this chapter, we discuss the application of sparse feature vectors in the field of audio analysis, and specifically their use in conjunction with preprocessing systems that model the human auditory system. We present early results that demonstrate the applicability of the combination of auditory-based processing and sparse coding to content-based audio analysis tasks. We present results from two different experiments: a search task in which ranked lists of sound effects are retrieved from text queries, and a music information retrieval (MIR) task dealing with the classification of music into genres.

1.2 Introduction

Traditional approaches to audio analysis problems typically employ a short-window fast Fourier transform (FFT) as the first stage of the processing pipeline. In such systems a short, perhaps 25 ms, segment of audio is taken from the input signal and windowed in some way, then the FFT of that segment is taken. The window is then shifted a little, by perhaps 10 ms, and the process is repeated.
This technique yields a two-dimensional spectrogram of the original audio, with the frequency axis of the FFT as one dimension, and time (quantized by the step-size of the window) as the other dimension. While the spectrogram is easy to compute, and a standard engineering tool, it bears little resemblance to the early stages of the processing pipeline in the human auditory system. The mammalian cochlea can be viewed as a bank of tuned filters the output of which is a set of band-pass filtered versions of the input signal that are continuous in time. Because of this property, fine-timing information is preserved in the output of cochlea, whereas in the spectrogram described above, there is no fine-timing information available below the 10ms hop-size of the windowing function. This fine-timing information from the cochlea can be made use of in later stages of processing to yield a three-dimensional representation of audio, the stabilized auditory image (SAI)[11], which is a movie-like representation of sound which has a dimension of ‘time-interval’ in addition to the standard dimensions of time and frequency in the spectrogram. The periodicity of the waveform gives rise to a vertical banding structure in this time interval dimension, which provides information about the sound which is complementary to that available in the frequency dimension. A single example frame of a stabilized auditory image is shown in Figure 1.1. While we believe that such a representation should be useful for audio analysis tasks, it does come at a cost. The data rate of the SAI is many times that of the original input audio, and as such some form of dimensionalityAuditory Sparse Coding 5 reduction is required in order to create features at a suitable data rate for use in a recognition system. One approach to this problem is to move from a the dense representation of the SAI to a sparse representation, in which the overall dimensionality of the features is high, but only a limit number of the dimensions are nonzero at any time. In recent years, machine learning algorithms that utilize the properties of sparsity have begun to attract more attention and have been shown to outperform approaches that use dense feature vectors. One such algorithm is the passive-aggressive model for image retrieval (PAMIR), a machine learning algorithm that learns a ranking function from the input data, that is, it takes an input set of documents and orders them based on their relevance to a query. PAMIR was originally developed as a machine vision method and has demonstrated excellent results in this field. There is also growing evidence that in the human nervous system sensory inputs are coded in a sparse manner; that is, only small numbers of neurons are active at a given time [10]. Therefore, when modeling the human auditory system, it may be advantageous to investigate this property of sparseness in relation to the mappings that are being developed. The nervous systems of animals have evolved over millions of years to be highly efficient in terms of energy consumption and computation. Looking into the way sound signals are handled by the auditory system could give us insights into how to make our algorithms more efficient and better model the human auditory system. One advantage of using sparse vectors is that such coding allows very fast computation of similarity, with a trainable similarity measure [4]. The efficiency results from storing, accessing, and doing arithmetic operations on only the non-zero elements of the vectors. 
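To illustrate why sparsity makes similarity computation cheap, here is a small sketch in which sparse vectors are stored as index-to-value dictionaries, so only non-zero entries are stored or touched. The optional per-dimension weights merely stand in for a trainable similarity measure (the measure in [4] is more general), and all names and numbers in the example are made up.

def sparse_dot(a, b, weights=None):
    """Similarity between two sparse vectors stored as {index: value} dicts.

    Cost is proportional to the number of non-zero entries of the shorter
    vector rather than to the full dimensionality. `weights` is an optional
    {index: weight} dict acting as a simple (diagonal) learned similarity.
    """
    if len(a) > len(b):          # iterate over the shorter vector
        a, b = b, a
    score = 0.0
    for idx, val in a.items():
        if idx in b:
            w = 1.0 if weights is None else weights.get(idx, 0.0)
            score += w * val * b[idx]
    return score

# Example: two bag-of-auditory-words histograms with a handful of active codewords.
h1 = {12: 3.0, 407: 1.0, 9981: 2.0}
h2 = {407: 2.0, 555: 4.0, 9981: 1.0}
print(sparse_dot(h1, h2))  # 1.0*2.0 + 2.0*1.0 = 4.0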
In one study that examined the performance of sparse representations in the field of natural language processing, a 20- to 80-fold speedup over LIBSVM was found [7]. They comment that kernel-based methods, like SVM, scale quadratically with the number of training examples and discuss how sparsity can allow algorithms to scale linearly based on the number of training examples. In this chapter, we use the stabilized auditory image (SAI) as the basis of a sparse feature representation which is then tested in a sound ranking task and a music information retrieval task. In the sound raking task, we generate a two-dimensional SAI for each time slice, and then sparse-code those images as input to PAMIR. We use the ability of PAMIR to learn representations of sparse data in order to learn a model which maps text terms to audio features. This PAMIR model can then be used rank a list of unlabeled sound effects according to their relevance to some text query. We present results that show that in certain tasks our methods can outperform highly tuned FFT based approaches. We also use similar sparse-coded SAI features as input to a music genre classification system. This system uses an SVM classifier on the sparse features, and learns text terms associated with music. The system was entered into the annual music information retrieval evaluation exchange evaluation (MIREX 2010).6 Book title goes here Results from the sound-effects ranking task show that sparse auditorymodel-based features outperform standard MFCC features, reaching precision about 73% for the top-ranked sound, compared to about 60% for standard MFCC and 67% for the best MFCC variant. These experiments involved ranking sounds in response to text queries through a scalable online machine learning approach to ranking. 1.2.1 The stabilized auditory image In our system we have taken inspiration from the human auditory system in order to come up with a rich set of audio features that are intended to more closely model the audio features that we use to listen and process music. Such fine timing relations are discarded by traditional spectral techniques. A motivation for using auditory models is that the auditory system is very effective at identifying many sounds. This capability may be partially attributed to acoustic features that are extracted at the early stages of auditory processing. We feel that there is a need to develop a representation of sounds that captures the full range of auditory features that humans use to discriminate and identify different sounds, so that machines have a chance to do so as well. FIGURE 1.1 An example of a single SAI of a sound file of a spoken vowel sound. The vertical axis is frequency with lower frequencies at the bottom of the figure and higher frequencies on the top. The horizontal axis is the autocorrelation lag. From the positions of the vertical features, one can determine the pitch of the sound. This SAI representation generates a 2D image from each section of waveform from an audio file. We then reduce each image in several steps: first cutting the image into overlapping boxes converted to fixed resolution per box; second, finding row and column sums of these boxes and concatenating those into a vector; and finally vector quantizing the resulting medium-Auditory Sparse Coding 7 dimensionality vector, using a separate codebook for each box position. 
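A minimal sketch of the reduction just described: take the row and column sums of one rectangular SAI region, concatenate them, and replace the resulting vector with the index of its nearest codeword in that box position's codebook. The 32-channel by 16-lag box, 48-dimensional marginal vector, and 200-entry codebook match sizes quoted later in the chapter, but the code and random data here are purely illustrative.

import numpy as np

def box_marginals(box):
    """Row and column sums of one rectangular SAI region, concatenated.
    For a 32-channel x 16-lag box this gives a 48-dimensional vector."""
    return np.concatenate([box.sum(axis=1), box.sum(axis=0)])

def quantize(vec, codebook):
    """1-of-N sparse code: index of the nearest codeword (rows of `codebook`)."""
    dists = np.linalg.norm(codebook - vec, axis=1)
    return int(np.argmin(dists))

# Example with made-up data: one 32x16 box and a 200-entry codebook for that
# box position (in the real system the codebook comes from k-means training).
rng = np.random.default_rng(0)
box = rng.random((32, 16))
codebook = rng.random((200, 48))
print(quantize(box_marginals(box), codebook))  # index of the winning codeword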
The VQ codeword index is a representation of a 1-of-N sparse code for each box, and the concatenation of all of those sparse vectors, for all the box positions, makes the sparse code for the SAI image. The resulting sparse code is accumulated across the audio file, and this histogram (count of number of occurrences of each codeword) is then used as input to an SVM [5] classifier[3]. This approach is similar to that of the “bag of words” concept, originally from natural language processing, but used heavily in computer vision applications as “bag of visual words”; here we have a “bag of auditory words”, each “word” being an abstract feature corresponding to a VQ codeword. The bag representation is a list of occurrence counts, usually sparse. 1.3 Algorithm In our experiments, we generate a stream of SAIs using a series of modules that process an incoming audio stream through the various stages of the auditory model. The first module filters the audio using the pole–zero filter cascade (PZFC) [9], then subsequent modules find strobe points in this audio, and generate a stream of SAIs at a rate of 50 per second. The SAIs are then cut into boxes and are transformed into a high dimensional dense feature vector [12] which is vector quantized to give a high dimensional sparse feature vector. This sparse vector is then used as input to a machine learning system which performs either ranking or classification. This whole process is shown in diagrammatic form in Figure 1.2 1.3.1 Pole–Zero Filter Cascade We first process the audio with the pole–zero filter cascade (PZFC) [9], a model inspired by the dynamics of the human cochlea. The PZFC is a cascade of a large number of simple filters with an output tap after each stage. The effect of this filter cascade is to transform an incoming audio signal into a set of band-pass filtered versions of the signal. In our case we used a cascade with 95 stages, leading to 95 output channels. Each output channel is halfwave rectified to simulate the output of the inner hair cells along the length of the cochlea. The PZFC also includes an automatic gain control (AGC) system that mimics the effect of the dynamic compression mechanisms seen in the cochlea. A smoothing network, fed from the output of each channel, dynamically modifies the characteristics of the individual filter stages. The AGC can respond to changes in the output on the timescale of milliseconds, leading to very fast-acting compression. One way of viewing this filter cascade is that its outputs are an approximation of the instantaneous neuronal firing rate as a function of cochlear place, modeling both the frequency filtering and8 Book title goes here FIGURE 1.2 A flowchart describing the flow of data in our system. First, either the pole– zero filter cascade (PZFC) or gammatone filterbank filters the input audio signal. Filtered signals then pass through a half-wave rectification module (HCL), and trigger points in the signal are then calculated by the local-max module. The output of this stage is the SAI, the image in which each signal is shifted to align the trigger time to the zero lag point in the image. The SAI is then cut into boxes with the box-cutting module, and the resulting boxes are then turned into a codebook with the vector-quantization module. the automatic gain control characteristics of the human cochlea [8]. The PZFC parameters used for the sound-effects ranking task are described in [9]. 
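Returning to the bag-of-auditory-words representation introduced at the start of this section, the accumulation of per-frame codeword indices into one per-file histogram can be sketched as follows; the key layout and toy sizes are choices made here for illustration.

from collections import Counter

def bag_of_auditory_words(frame_codes, num_boxes, codebook_size):
    """Accumulate per-frame, per-box VQ codeword indices into one sparse
    histogram for a whole audio file (the "bag of auditory words").

    frame_codes: iterable of lists, one list per SAI frame, each containing
                 `num_boxes` codeword indices (one per box position).
    The key (box * codebook_size + codeword) indexes the concatenated
    num_boxes * codebook_size -dimensional sparse vector.
    """
    counts = Counter()
    for codes in frame_codes:
        for box, codeword in enumerate(codes):
            counts[box * codebook_size + codeword] += 1
    return counts  # sparse: only observed (box, codeword) pairs are stored

# Example: three frames, two box positions, codebooks of size 4.
frames = [[1, 3], [1, 0], [2, 3]]
print(bag_of_auditory_words(frames, num_boxes=2, codebook_size=4))
# Counter({1: 2, 7: 2, 4: 1, 2: 1}); keys 1, 2 come from box 0, keys 4, 7 from box 1.

The PZFC front end described above supplies the SAI frames that feed this histogram; as noted, its parameters for the sound-effects task follow [9].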
We did not do any further tuning of this system to the problems of genre, mood or song classification; this would be a fruitful area of further research. 1.3.2 Image Stabilization The output of the PZFC filterbank is then subjected to a process of strobe finding where large peaks in the PZFC signal are found. The temporal locations of these peaks are then used to initiate a process of temporal integration whereby the stabilized auditory image is generated. These strobe points “stabilize” the signal in a manner analogous to the trigger mechanism in an oscilloscope. When these strobe points are found, a modified form of autocorrelation, known as strobed temporal integration, which is like a sparse version of autocorrelation where only the strobe points are correlated against the sig-Auditory Sparse Coding 9 FIGURE 1.3 The cochlear model, a filter cascade with half-wave rectifiers at the output taps, and an automatic gain control (AGC) filter network that controls the tuning and gain parameters in response to the sound. nal. Strobed temporal integration has the advantage of being considerably less computationally expensive than full autocorrelation. 1.3.3 Box Cutting We then divide each image into a number of overlapping boxes using the same process described in [9]. We start with rectangles of size 16 lags by 32 frequency channels, and cover the SAI with these rectangles, with overlap. Each of these rectangles is added to the set of rectangles to be used for vector quantization. We then successively double the height of the rectangle up to the largest size that fits in an SAI frame, but always reducing the contents of each box back to 16 by 32 values. Each of these doublings is added to the set of rectangles. We then double the width of each rectangle up to the width of the SAI frame and add these rectangles to the SAI frame. The output of this step is a set of 44 overlapping rectangles. The process of box-cutting is shown in Figure 1.4. In order to reduce the dimensionality of these rectangles, we then take their row and column marginals and join them together into a single vector. 1.3.4 Vector Quantization The resulting dense vectors from all the boxes of a frame are then converted to a sparse representation by vector quantization. We first preprocessed a collection of 1000 music files from 10 genres using a PZFC filterbank followed by strobed temporal integration to yield a set of10 Book title goes here FIGURE 1.4 The boxes, or multi-scale regions, used to analyze the stabilized auditory images are generated in a variety of heights, widths, and positions. SAI frames for each file . We then take this set of SAI and apply the boxcutting technique described above. The followed by the calculation of row and column marginals. These vectors are then used to train dictionaries of 200 entries, representing abstract “auditory words”, for each box position, using a k-means algorithm. This process requires the processing of large amounts of data, just to train the VQ codebooks on a training corpus. The resulting dictionaries for all boxes are then used in the MIREX experiment to convert the dense features from the box cutting step on the test corpus songs into a set of sparse features where each box was represented by a vector of 200 elements with only one element being non-zero. The sparse vectors for each box were then concatenated, and these long spare vectors are histogrammed over the entire audio file to produce a sparse feature vector for each song or sound effect. 
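The codebook training described in this section can be sketched with plain NumPy: ordinary k-means (Lloyd's algorithm) over the dense marginal vectors gathered for a single box position, with the resulting centroids serving as that box's dictionary of abstract "auditory words". The toy sizes, fixed iteration count, and random initialization below are illustrative assumptions; the real system trains a separate 200-entry dictionary per box position over a large corpus.

import numpy as np

def train_codebook(marginal_vectors, k=200, iters=20, seed=0):
    """k-means codebook for ONE box position over its dense marginal vectors."""
    rng = np.random.default_rng(seed)
    data = np.asarray(marginal_vectors, dtype=float)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid.
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            members = data[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# Example with toy sizes: 1000 48-dimensional marginal vectors, 8 codewords.
rng = np.random.default_rng(1)
print(train_codebook(rng.random((1000, 48)), k=8).shape)  # (8, 48)

The resulting per-box dictionaries are what the dense box features are quantized against before the per-file histogramming described above.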
This operation of constructing a sparse bag of auditory words was done for both the training and testing corpora.Auditory Sparse Coding 11 1.3.5 Machine Learning For this system, we used the support vector machine learning system from libSVM which is included in the Marsyas[13] framework. Standard Marsyas SVM parameters were used in order to classify the sparse bag of auditory words representation of each song. It should be noted that SVM is not the ideal algorithm for doing classification on such a sparse representation, and if time permitted, we would have instead used the PAMIR machine learning algorithm as described in [9]. This algorithm has been shown to outperform SVM on ranking tasks, both in terms of execution speed and quality of results. 1.4 Experiments 1.4.1 Sound Ranking We performed an experiment in which we examined a quantitative ranking task over a diverse set of audio files using tags associated with the audio files. For this experiment, we collected a dataset of 8638 sound effects, which came from multiple places. 3855 of the sound files were from commercially available sound effect libraries, of these 1455 were from the BBC sound effects library. The other 4783 audio files were collected from a variety of sources on the internet, including findsounds.com, partnersinrhyme.com, acoustica.com, ilovewaves.com, simplythebest.net, wav-sounds.com, wav-source.com and wavlist.com. We then manually annotated this dataset of sound effects with a small number of tags for each file. Some of the files were already assigned tags and for these, we combined our tags with this previously existing tag information. In addition, we added higher level tags to each file, for example, files with the tags “cat”, “dog” and “monkey” were also given the tags “mammal” and “animal”. We found that the addition of these higher level tags assist retrieval by inducing structure over the label space. All the terms in our database were stemmed, and we used the Porter stemmer for English, which left a total of 3268 unique tags for an average of 3.2 tags per sound file. In order to estimate the performance of the learned ranker, we used a standard three-fold cross-validation experimental setup. In this scheme, two thirds of the data is used for training and one third is used for testing; this process is then repeated for all three splits of the data and results of the three are averaged. We removed any queries that had fewer than 5 documents in either the training set or the test set, and if the corresponding documents had no other tags, these documents were removed as well. To determine the values of the hyperparameters for PAMIR we performed a second level of cross-validation where we iterated over values for the aggressiveness parameter C and the number of training iterations. We found that in12 Book title goes here general system performance was good for moderate values of C and that lower values of C required a longer training time. For the agressiveness parameter, we selected a value of C=0.1, a value which was also found to be optimal in other research [6]. For the number of iterations, we chose 10M, and found that in our experience, the system was not very sensitive to the exact value of these parameters. We evaluated our learned model by looking at the precision within the top k audio files from the test set as ranked by each query. Precision at top k is a commonly used measure in retrieval tasks such as these and measures the fraction of positive results within the top k results from a query. 
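Precision at top k is straightforward to compute; a small sketch with made-up sound-file identifiers:

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant to the query."""
    top = ranked_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / float(k)

# Example: 3 of the top 5 returned sound files carry the query's tag.
ranking = ["gulp1", "pour3", "drum7", "drink2", "laugh9"]
relevant = {"gulp1", "drink2", "pour3", "gulps4"}
print(precision_at_k(ranking, relevant, k=5))  # 0.6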
The stabilized auditory image generation process has a number of parameters which can be adjusted including the parameters of the PZFC filter and the size of rectangles that the SAI is cut into for subsequent vector quantization. We created a default set of parameters and then varied these parameters in our experiments. The default SAI box-cutting was performed with 16 lags and 32 channels, which gave a total of 49 rectangles. These rectangles were then reduced to their marginal values which gives a 48 dimension vector, and a codebook of size 256 was used for each box, giving a total of 49 x 256 = 12544 feature dimensions. Starting from these, we then made systematic variations to a number of different parameters and measured their effect on precision of retrieval. For the box-cutting step, we adjusted various parameters including the smallest sized rectangle, and the maximum number of rectangles used for segmentation. We also varied the codebook sizes that we used in the sparse coding step. In order to evaluate our method, we compared it with results obtained using a very common feature extraction method for audio analysis, MFCCs (mel-frequency cepstral coefficients). In order to compare this type of feature extraction with our own, we turned these MFCC coefficients into a sparse code. These MFCC coefficients were calculated with a Hamming window with initial parameters based on a setting optimized for speech. We then changed various parameters of the MFCC algorithm, including the number of cepstral coefficients (13 for speech), the length of each frame (25ms for speech), and the number of codebooks that were used to sparsify the dense MFCC features for each frame. We obtained the best performance with 40 cepstral coefficients, a window size of 40ms and codebooks of size 5000. We investigated the effect of various parameters of the SAI feature extraction process on test-set precision, these results are displayed graphically in Figure 1.5 where the precision of the top ranked sound file is plotted against the number of features used. As one can see from this graph, performance saturates when the number of features approaches 105 which results from the use of 4000 code words per codebook, with a total of 49 codebooks. This particular set of parameters led to a performance of 73%, significantly better than the best MFCC result which achieved a performance of 67%, which represents a smaller error of 18% (from 33 % to 27 % error). It is also notableAuditory Sparse Coding 13 FIGURE 1.5 Ranking at top-1 retrieved result for all the experimental runs described in this section. A few selected experiment names are plotted next to each point, and different experiments are shown by different icons. The convex hull that connects the best-performing experiments is plotted as a solid line. that SAI can achieve better precision-at-top-k consistently for all values of k, albeit with a smaller improvement in relative precision. In table 1.2 results of three queries along with the top five sound files that were returned by the best SAI-based and MFCC-based systems. From this table, one can see that the two systems perform in different ways, this can be expected when one considers the basic audio features that these two systems extract. For example, for the query “gulp”, the SAI system returns “pouring” and “water-dripping”, all three of these share the similarity of involving the movement of water or liquids. When we calculated performance, it was based on textual tags, which are often noisy and incomplete. 
Due to the nature of human language and perception, people often use different words to describe sounds that are very similar, for example, a Chopin Mazurka could be described with the words “piano”, “soft”, “classical”, “Romantic”, and “mazurka”. To compound this diffi- culty, a song that had a female vocalist singing could be labelled as “woman”, “women”, “female”, “female vocal”, or “vocal”. This type of multi-label problem is common in the field of content based retrieval. It can be alleviated by a number of techniques, including the stemming of words, but due to the varying nature of human language and perception, will continue to remain an issue. In Figure 1.6 the performance of the SAI and MFCC based systems are14 Book title goes here FIGURE 1.6 A comparison of the average precision of the SAI and MFCC based systems. Each point represents a single query, with the horizontal position being the MFCC average precision and the vertical position being the SAI average precision. More of the points appear above the y=x line, which indicates that the SAI based system achieved a higher mean average precision. compared to each other with respect to their average precision. A few select full tag names are placed on this diagram, for the rest, only a plus is shown. This is required because otherwise the text would overlap to such a great degree that it would be impossible to read. In this diagram we plot the average precision of the SAI based system against that of the MFCC based system, with the SAI precision shown along the vertical axis and the MFCC precision shown along the horizontal axis. If the performance of the two systems was identical, all points would lie on the line y=x. Because more points lie above the line than below the line, the performance of the SAI based system is better than that of the MFCC based system.Auditory Sparse Coding 15 top-k SAI MFCC percent error reduction 1 27 33 18 % 2 39 44 12 % 5 60 62 4 % 10 72 74 3 % 20 81 84 4 % TABLE 1.1 A comparison of the best SAI and MFCC configurations. This table shows the percent error at top-k, where error is defined as 1 - precision. Query SAI file (labels) MFCC file (labels) tarzan Tarzan-2 (tarzan, yell) TARZAN (tarzan, yell) tarzan2 (tarzan, yell) 175orgs (steam, whistle) 203 (tarzan) mosquito-2 (mosquito) wolf (mammal, wolves, wolf, ...) evil-witch-laugh (witch, evil, laugh) morse (morse, code) Man-Screams (horror, scream, man) applause 27-Applause-from-audience 26-Applause-from-audience audience 30-Applause-from-audience phase1 (trek, phaser, star) golf50 (golf) fanfare2 (fanfare, trumpet) firecracker 45-Crowd-Applause (crowd, applause) 53-ApplauseLargeAudienceSFX golf50 gulp tite-flamn (hit, drum, roll) GULPS (gulp, drink) water-dripping (water, drip) drink (gulp, drink) Monster-growling (horror, monster, growl) california-myotis-search (blip) Pouring (pour,soda) jaguar-1 (bigcat, jaguar, mammal, ...) TABLE 1.2 A comparison of the best SAI and MFCC configurations. 
This table shows the percent error at top-k, where error is defined as (1 - precision).16 Book title goes here Algorithm Classification Accuracy SAI/VQ 0.4987 Marsyas MFCC 0.4430 Best 0.6526 Average 0.455 TABLE 1.3 Classical composer train/test classification task Algorithm Classification Accuracy SAI/VQ 0.4861 Marsyas MFCC 0.5750 Best 0.6417 Average 0.49 TABLE 1.4 Music mood train/test classification task 1.4.2 MIREX 2010 All of these algorithms were then ported to the Marsyas music information retrieval framework from AIM-C, and extensive tests were written as described above. These algorithms were submitted to the MIREX 2010 competition as C++ code, which was then run by the organizers on blind data. As of this date, only results for two of the four train/test tasks have been released. One of these is for the task of classifying classical composers and the other is for classifying the mood of a piece of music. There were 40 groups participating in this evaluation, the most ever for MIREX, which gives some indication about how this classification task is increasingly important in the real world. Below I present the results for the best entry, the average of all entries, our entry, and the other entry for the Marsyas system. It is instructive to compare our result to that of the standard Marsyas system because in large part we would like to compare the SAI audio feature to the standard MFCC features, and since both of these systems use the SVM classifier, we partially negate the influence of the machine learning part of the problem. For the classical composer task the results are shown in table 1.3 and for the mood classification task, results are shown in table 1.4 From these results we can see that in the classical composer task we outperformed the traditional Marsyas system which has been tuned over the course of a number of years to perform well. This gives us the indication that the use of these SAI features has promise. However, we underperform the best algorithm, which means that there is work to be done in terms of testing different machine learning algorithms that would be better suited to this type of data. However, in a more detailed analysis of the results, which is shown in 1.7, it is evident that each of the algorithms has a wide range of perfor-Auditory Sparse Coding 17 mance on different classes. This graph shows that the most well predicted in our SAI/VQ classifier overlap significantly with those from the highest scoring classification engines. FIGURE 1.7 Per class results for classical composer In the mood task, we underperform both Marsyas and the leading algorithm. This is interesting and might speak to the fact that we did not tune the parameters of this algorithm for the task of music classification, but instead used the parameters that worked best for the classification of sound effects. Music mood might be a feature that has spectral aspects that evolve over longer time periods than other features. For this reason, it would be important to search for other parameters in the SAI algorithm that would perform well for other tasks in music information retrieval. For these results, due to time constraints, we only used the SVM classifier on the SAI histograms. This has been shown in [9] to be an inferior classifier for this type of sparse, high-dimensional data than the PAMIR algorithm. In the future, we would like to add the PAMIR algorithm to Marsyas and to try these experiments using this new classifier. 
It was observed that the MIR community is increasingly becoming focused on advanced machine learning techniques, and it is clear that it will be critical to both try different machine learning algorithms on these audio features as well as to perform wider sweeps of parameters for these classifiers. Both of these will be important in increasing the performance of these novel audio features.18 Book title goes here 1.5 Conclusions The use of physiologically-plausible acoustic models combined with a sparsi- fication approach has shown promising results in both the sound effects ranking and MIREX 2010 experiments. These features are novel and hold great promise in the field of MIR for the classification of music as well as other tasks. Some of the results obtained were better than that of a highly tuned MIR system on blind data. In this task we were able to expose the MIR community to these new audio features. These new audio features have been shown to outperform MFCC features in a sound-effects ranking task, and by evaluating these features with machine learning algorithms more suited for these high dimensional, sparse features, we have great hope that we will obtain even better results in future MIREX evaluations.Bibliography [1] Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classifiers in the massive data setting. J. Mach. Learn. Res., 9:313–337, 2008. [2] L´eon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston. Large-Scale Kernel Machines (Neural Information Processing). The MIT Press, 2007. [3] O. Chappelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006. [4] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. J. Mach. Learn. Res., 11:1109–1135, 2010. [5] Yasser EL-Manzalawy and Vasant Honavar. WLSVM: Integrating LibSVM into Weka Environment, 2005. [6] David Grangier and Samy Bengio. A discriminative kernel-based approach to rank images from text queries. IEEE Trans. Pattern Anal. Mach. Intell., 30(8):1371–1384, 2008. [7] Patrick Haffner. Fast transpose methods for kernel learning on sparse data. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 385–392, New York, NY, USA, 2006. ACM. [8] R. F. Lyon. Automatic gain control in cochlear mechanics. In P Dallos et al., editor, The Mechanics and Biophysics of Hearing, pages 395–420. Springer-Verlag, 1990. [9] Richard F. Lyon, Martin Rehn, Samy Bengio, Thomas C. Walters, and Gal Chechik. Sound retrieval and ranking using auditory sparsecode representations. Neural Computation, 22, 2010. [10] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current opinion in neurobiology, 14(4):481–487, 2004. [11] R. D. Patterson. Auditory images: how complex sounds are represented in the auditory system. The Journal of the Acoustical Society of America, 21:183–190, 2000. 1920 Book title goes here [12] Martin Rehn, Richard F. Lyon, Samy Bengio, Thomas C. Walters, and Gal Chechik. Sound ranking using auditory sparse-code representations. ICML 2009 Workshop on Sparse Methods for Music Audio, 2009. [13] G. Tzanetakis. Marsyas-0.2: A case study in implementing music information retrieval systems, chapter 2, pages 31–49. Intelligent Music Information Systems: Tools and Methodologies. Information Science Reference, 2008. Shen, Shepherd, Cui, Liu (eds). 
Extracting Patterns from Location History Andrew Kirmse Google Inc Mountain View, California akirmse@google.com Tushar Udeshi Google Inc Boulder, Colorado tudeshi@google.com Jim Shuma Google Inc Mountain View, California jshuma@google.com Pablo Bellver Google Inc Mountain View, California pablob@google.com ABSTRACT In this paper, we describe how a user's location history (recorded by tracking the user's mobile device location with his permission) is used to extract the user's location patterns. We describe how we compute the user's commonly visited places (including home and work), and commute patterns. The analysis is displayed on the Google Latitude history dashboard [7] which is only accessible to the user. Categories and Subject Descriptors D.0 [General]: Location based services. General Terms Algorithms. Keywords Location history analysis, commute analysis. 1. INTRODUCTION Location-based services have been gaining in popularity. Most services[4,5] utilize a “check-in” model where a user takes some action on the phone to announce that he has reached a particular place. He can then advertise this to his friends and also to the business owner who might give him some loyalty points. Google Latitude [6] utilizes a more passive model. The mobile device periodically sends his location to a server which shares it with his registered friends. The user also has the option of opting into latitude location history. This allows Google to store the user's location history. This history is analyzed and displayed for the user on a dashboard [7]. A user's location history can be used to provide several useful services. We can cluster the points to determine where he frequents and how much time he spends at each place. We can determine the common routes the user drives on, for instance, his daily commute to work. This analysis can be used to provide useful services to the user. For instance, one can use real-time traffic services to alert the user when there is traffic on the route he is expected to take and suggest an alternate route. We expect many more useful services to arise from location history. It is important to note that a user's location history is stored only if he explicitly opts into this feature. However, once signed in, he can get several useful services without any additional work on his part (like checking in). 2. PREVIOUS WORK Much previous work assumes clean location data sampled at very high frequency. Ashbrook and Starner [2] cluster a user's significant locations from GPS traces by identifying locations where the GPS signal reappears after an absence of 10 minutes or longer. This approach is unable to identify important outdoor places and is also susceptible to spurious GPS signal loss (e.g. in urban canyons or when the recording device is off). In addition they use a Markov model to predict where the user is likely to go next from where he is. Liao, et al [11] attempt to segment a user's day into everyday activities such as “working”, “sleeping” etc. using a hierarchical activity model. Both these papers obtain one GPS reading per second. This is impractical with today's mobile devices due to battery usage. Kang et al [10] use time-based clustering of locations obtained using a “Place Lab Client” to infer the user's important locations. The “Place lab client” infers locations by listening to RF-emissions from known wi-fi access points. This requires less power than GPS. However, their clustering algorithm assumes a continuous trace of one sample per second. 
Real-world data is not so reliable and often has missing and noisy data as illustrated in Section 3.2. Ananthanarayanan et al [1] describe a technique to infer a user's driving route. They also match users having similar routes to suggest carpool partners. Liao et al [12] use a hierarchical Markov Model to infer a user's transportation patterns including different modes of transportation (e.g. bus, on foot, car etc.). Both these papers use clean regularly-sampled GPS traces as input. 3. LOCATION ANALYSIS 3.1 Input Data For every user, we have a list of timestamped points. Each point has a geolocation (latitude and longitude), an accuracy radius and an input source: 17% of our data points are from GPS and these have an accuracy in the 10 meter range. Points derived from wifi signatures have an accuracy in the 100 meter range and represent 57% of our data. The remaining 26% of our points are derived Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SIGSPATIAL GIS '11, November 1-4, 2011. Chicago, IL, USA Copyright © 2011 ACM ISBN 978-1-4503-1031-4/11/11...$10.00from cell tower triangulation and these have an accuracy in the 1000 meter range. We have a test database of location history of around a thousand users. We used this database to generate the data in this paper. 3.2 Location Filtering The raw geolocations reported from mobile devices may contain errors beyond the measurable uncertainties inherent in the collection method. Hardware or software bugs in the mobile device may induce spurious readings, or variations in signal strength or terrain may cause a phone to connect to a cell tower that is not the one physically closest to the device. Given a stream of input locations, we apply the following filters to account for these errors: 1. Reject any points that fall outside the boundaries of international time zones over land. While this discards some legitimate points over water (presumably collected via GPS), in practice it removes many more false readings. 2. Reject any points with timestamps before the known public launch of the collection software. 3. Identify cases of “jitter”, where the reported location jumps to a distant point and soon returns. As shown in Figure 1, this is surprisingly common. We look for a sequence of consecutive locations {P1, P2, …, Pn} where the following conditions hold: ◦ P1 and Pn are within a small distance threshold D of each other. ◦ P1 and Pn have timestamps within a few hours of each other. ◦ P1 and Pn have high reported accuracy. ◦ P2, …, Pn-1 have low reported accuracy. ◦ P2, …, Pn-1 are farther than D from P1. In such a case, we conclude that the points P2, …, Pn-1 are due to jitter, and discard them. 4. If a pair of consecutive points implies a non-physical velocity, reject the later one. Any points that are filtered are discarded, and are not used in the remaining algorithms described in this paper. 3.3 Computing Frequently Visited Places In this section, we describe the algorithms we use to compute places frequented by a user from his location history. We first filter out the points for which the user is stationary i.e. moving at a very low velocity. 
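Two of these filtering steps can be sketched as follows, assuming a simple Point record and great-circle distances; the speed thresholds (90 m/s for "non-physical", 1 m/s for "stationary") are illustrative assumptions, not values from this paper.

import math
from dataclasses import dataclass

@dataclass
class Point:
    lat: float          # degrees
    lon: float          # degrees
    timestamp: float    # seconds since epoch
    accuracy: float     # reported accuracy radius, meters

def haversine_m(p, q):
    """Great-circle distance between two points, in meters."""
    r = 6371000.0
    phi1, phi2 = math.radians(p.lat), math.radians(q.lat)
    dphi = math.radians(q.lat - p.lat)
    dlmb = math.radians(q.lon - p.lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drop_nonphysical(points, max_speed_mps=90.0):
    """Filter 4 above: if a pair of consecutive points implies an impossible
    velocity (or a non-increasing timestamp), reject the later point."""
    kept = []
    for p in points:
        if kept:
            dt = p.timestamp - kept[-1].timestamp
            if dt <= 0 or haversine_m(kept[-1], p) / dt > max_speed_mps:
                continue
        kept.append(p)
    return kept

def stationary(points, max_speed_mps=1.0):
    """Keep points where the user is moving at very low velocity relative to
    the previous point (the threshold is again an assumption)."""
    out = []
    for prev, cur in zip(points, points[1:]):
        dt = cur.timestamp - prev.timestamp
        if dt > 0 and haversine_m(prev, cur) / dt < max_speed_mps:
            out.append(cur)
    return out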
These stationary points need to be clustered to extract interesting locations. 3.3.1 Clustering Stationary Points We use two different algorithms for clustering stationary points. 3.3.1.1 Leader-based Clustering For every point, we determine if it belongs to any of the already generated clusters by computing its distance to the cluster leader's location. If this is below a threshold radius, the point is added to the cluster. Psuedocode is in Figure 2. This algorithm is simple and efficient. It runs in O(NC) where N is the number of points and C is the number of clusters. However, the output clusters are dependent on the order of the input points. For example, consider 3 points P1, P2 and P3 which lie on a straight line with a distance of radius between them as shown in Figure 3. If the input points are ordered {P1, P2, P3}, we would get 2 clusters: {P1} and {P2, P3}. But if they are ordered {P2, P1 ,P3} we would get only 1 cluster containing all 3 points. 3.3.1.2 Mean Shift Clustering Mean shift [3] is an iterative procedure that moves every point to the average of the data points in its neighborhood. The iterations are stopped when all points move less than a threshold. This algorithm is guaranteed to converge. We use a weighted average to compute the move position where the weights are inversely proportional to the accuracy. This causes the points to gravitate towards high accuracy points. Once the iterations converge, the moved locations of the points are chosen as cluster centers. All input points within a threshold radius to a cluster center are added to its cluster. We revert back to leader-based clustering if the iterations do not converge or some of the input points remain unclustered. Psuedocode is shown in Figure 4. This algorithm does not suffer from the input-order dependency of leader-based clustering. For the input point set of Figure 3, it will always return a cluster comprising all 3 points. The algorithm generates a smaller number of better located clusters compared to leader-based clustering. For example consider 4 points on the vertices of a square with a diagonal of 2*radius as shown in Figure 5. Leader-based clustering would generate 4 clusters, 1 per Figure 3. Three equidistant points on a line. P2 P3 P 1 radius radius Let points = input points. Let clusters = [] foreach p in points: foreach c in clusters: if distance(c.leader(), p) < radius: Add p to c break else: Create a new cluster c with p as leader clusters.add(c) Figure 2. Leader-based Clustering Algorithm. Figure 1. A set of reported locations exhibiting “jitter”. One of the authors was actually stationary during the time interval represented by these points.point. Mean-shift clustering would return only 1 cluster whose centroid is at the center of the square. The iterative nature of this algorithm makes it expensive. We therefore limit the maximum number of iterations and revert to leader-based clustering if the algorithm does not converge quickly enough. When we ran Mean-shift clustering on our test database, the algorithm converged in 2.4 iterations on an average. 3% of the input points could not be clustered (i.e. we had to revert to leaderbased for them). However, it did not cause a significant reduction in the number of computed clusters (< 1%). We concluded that the marginal improvement in quality did not justify the increased computational cost. 3.3.1.3 Adaptive Radius Clustering The two clustering algorithms described above return clusters of the input points. 
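Of the two, the leader-based variant can be written very compactly. A runnable version of the Figure 2 pseudocode, keeping its O(NC) behaviour and its dependence on input order, with the distance function passed in (for example, the haversine distance from the earlier sketch):

def leader_cluster(points, radius, distance):
    """Leader-based clustering: each point joins the first cluster whose
    leader is within `radius`; otherwise it becomes the leader of a new
    cluster. Each cluster is a list whose first element is its leader."""
    clusters = []
    for p in points:
        for c in clusters:
            if distance(c[0], p) < radius:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters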
One possibility would be to deem clusters larger than a threshold as interesting locaitons. However, this is not ideal since the input points have varying accuracy. For instance, if there are three stationary GPS points within close proximity of each other, we have high confidence that the user visited that place as opposed to three stationary cell tower points. We run the clustering algorithms multiple times, increasing the radius as well as the minimum cluster size after every iteration. When a cluster is generated, we check to see if it overlaps an already computed cluster (generated from a smaller radius). If that is the case, we merge it into the larger cluster. Note that adaptive radius clustering can be used in conjunction with any clustering algorithm. From our test database, we found that adaptively increasing the clustering radius from 20 meters to 500 meters and the minimum cluster size from 2 to 4, increased the number of computed visited places by 81% as compared to clustering with a fixed radius of 500 meters and a minimum cluster size of 4. We also surveyed users and found that the majority of the new visited places generated were correct and useful enough to display on the Latitude history dashboard [7]. 3.3.2 Computing Home and Work Locations We use a simple heuristic to determine the user's home and work locations. A user is likely to be at home at night. We filter out the user's points which occur at night and cluster them. The largest cluster is deemed the user's home location. Similarly, work location is derived by clustering points which occur on weekdays in the middle of the day and clustering them. Note that this heuristic will not work for users with non-standard schedules (e.g. work at night or work in multiple locations). Such users have the option of correcting their home and work location on the Latitude history dashboard [7]. These updated locations will be used for other analyses (e.g. commute analysis described in Section 3.4). 3.3.3 Computing Visited Places We do some additional filtering of the input points before clustering for visited places: 1. We remove points which are within a threshold distance of home and work locations. 2. We remove points which are on the user's commute between home and work. These points are determined using the algorithm described in Section 3.4.1. 3. We remove points near airports since these are reported as flights as described in Section 3.5. Without these filters, we get spurious visited places. Even when a user is stationary at home or work, the location reported can jump around, as described in Section 3.2. Without the first filter, we would get multiple visited places near home and work. If a user regularly stops at a long traffic signal on his commute to work, it has a good chance of being clustered to a visited place. This is why we need the second filter. 3.4 Commute Analysis We can deduce a user's driving commute patterns from his location history. The main challenge here is that points are reported infrequently and we have to derive the path the user has taken in between these points. Also, the accuracy of the points can be very low and so one needs to snap the points to the road he is likely to be on. The commutes are analyzed in three steps: (1) Extract sets of commute points from the input. (2) Fit the commute points to a road path. (3) Cluster paths together spatially and (optionally) temporally to generate the most common commutes taken by the user. 
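Step (3) relies on a notion of spatial closeness between two fitted paths; as described in Section 3.4.3 below, the Hausdorff distance between them is compared against a threshold. A minimal sketch using the discrete Hausdorff distance over the paths' sampled points (the 500 m threshold is an illustrative assumption, and `distance` can be the haversine function from the earlier sketch):

def hausdorff_m(path_a, path_b, distance):
    """Discrete Hausdorff distance between two paths given as lists of points:
    the largest distance from any point on one path to its nearest point on
    the other, symmetrized by taking the max of both directions."""
    def directed(a, b):
        return max(min(distance(p, q) for q in b) for p in a)
    return max(directed(path_a, path_b), directed(path_b, path_a))

def spatially_close(path_a, path_b, distance, threshold_m=500.0):
    """Two commutes are deemed spatially close if their Hausdorff distance is
    within a threshold."""
    return hausdorff_m(path_a, path_b, distance) <= threshold_m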
3.4.1 Extracting Commute Points Given a source and destination location (e.g. home and work), we extract the points from the user's location history which likely occurred on the user's driving commute from source to destination. We first filter out all points with low accuracy from the input set. We then find pairs of source-destination points. All points between a pair are candidate points for a single commute. The input points are noisy and therefore we do some sanity checks on the commute candidate points: (1) The commute distance should be reasonable. (2) The commute duration should be reasonable. (3) The commute should be at reasonable driving velocity. Figure 5. Four points on a square. Let points = input points Let clusters = [] foreach p in points: Compute Weight(p) Let cluster_centers = points while all cluster_centers move < shift_threshold: foreach p in cluster_centers: Find all points np which are within a threshold distance of p p = ∑ Weight (npi )×npi ∑ Weight( npi ) while not cluster_centers.empty(): Choose p with highest accuracy from cluster_centers Find all points, say rp, in points which are within radius of p Create a new cluster c with rp clusters.add(c) points = points – rp cluster_centers = cluster_centers – Moved(rp) Cluster remaining points with Leader-based clustering. Figure 4. Mean-Shift Clustering Algorithm 2*radius3.4.2 Fitting a Road Path to Commute Points We use the routing engine used to compute driving directions in Google Maps [8] to fit a path to the commute point. For the rest of this paper, we will refer to this routing engine as “Pathfinder”. This is an iterative algorithm. We first query Pathfinder for the route between source and destination. If all the commute points are within the accuracy threshold distance (used in Section 3.4.1), we terminate and return this path. If not, we add the point which is furthest away from the path as a waypoint and query Pathfinder again. If Pathfinder fails, we assume that the waypoint is not valid (for example it might be in water) and drop it. We continue iterating until all commute points are within the accuracy distance threshold. Psuedocode is shown in Figure 6. Two iterations of this algorithm are shown in Figure 7. The small blue markers are the points from the user's history. The large green markers are the Pathfinder waypoints used to generate the path. After the second iteration all the user points are within the accuracy threshold. 3.4.3 Clustering Commutes Given a bag of commute paths and the time intervals the commutes occurred, we cluster them to determine the most frequent commutes. Two commutes are deemed temporally close if they start and end within some threshold of each other on the same day of the week. Two commutes are deemed spatially close if their Hausdorff distance [9] is within a threshold. We use a variant of leader-based clustering (described in Section 3.3.1.1) to generate commute clusters. The largest commute cluster on a particular day of week is the most common route taken by the user. This can be used for traffic alerts as described in Section 4.1. 4. ACKNOWLEDGMENTS Our thanks to Matthieu Devin, Max Braun, Jesse Rosenstock, Will Robinson, Dale Hawkins, Jim Guggemos, and Baris Gultekin for contributing to this project. 5. REFERENCES [1] Ananthanarayanan, G., Haridasan , M., Mohomed , I., Terry, D., Thekkath , C. A. 2009. StarTrack: A Framework for Enabling Track-Based Applications. Proceedings of the 7th international conference on Mobile systems, applications, and services. 
[2] Ashbrook, D., Starner, T. 2003. Using GPS to Learn Significant Locations and Predict Movement across Multiple Users. Personal and Ubiquitous Computing, Vol. 7, Issue 5. [3] Cheng, Y. C. 1995. Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 8. [4] Facebook places http://www.facebook.com/places. [5] Foursquare http://www.foursquare.com. [6] Google Latitude http://www.google.com/latitude. [7] Google Latitude History Dashboard. http://www.google.com/latitude/history/dashboard [8] Google Maps http://maps.google.com. [9] Huttenlocher D. P., Klanderman, G. A., and Rucklidge W. J. 1993. Comparing Images using the Hausdorff Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol 15, No 9. [10] Kang, J.H., Welbourne, W., Stewart, B. and Borriello, G. 2005. Extracting Places from Traces of Locations. SIGMOBILE Mob. Comput. Commun. Rev. 9, 3. [11] Liao, L., Fox, D., and Kautz, H. 2007. Extracting Places and Activities from GPS traces using Hierarchical Conditional Random Fields. International Journal of Robotics Research. [12] Liao, L., Patterson D. J., Fox, D., and Kautz, H. 2007, Learning and Inferring Transportation Routines. Artificial Intelligence. Vol. 171, Issues 5-6. Figure 7. Two iterations of path fitting algorithm. The small markers are the user's points. The large marker are Pathfinder waypoints used to generate the path. Iteration 1 Iteration 2 Let points = input points Let waypoints = [source, destination] Let current_path = Pathfinder route for waypoints while points.size() > 2: Let p be the point in points farthest away from current_path if distance(p, current_path) < threshold: Record current_path as a commute. break Add p to waypoints in the correct position current_path = Pathfinder route for waypoints if Pathfinder fails: Erase p from waypoints Erase p from points Figure 6. Algorithm to fit a path to commute Abstract When we started implementing a refactoring tool for real-world C programs, we recognized that preprocessing and parsing in straightforward and accurate ways would result in unacceptably slow analysis times and an overly-complicated parsing system. Instead, we traded some accuracy so we could parse, analyze, and change large, real programs while still making the refactoring experience feel interactive and fast. Our tradeoffs fell into three categories: using different levels of accuracy in different parts of the analysis, recognizing that collected wisdom about C programs didn't hold for Objective-C programs, and finding ways to exploit delays in typical interaction with the tool. Categories and Subject Descriptors D.2.6 [Software Engineering]: Programming Environments General Terms Design, Language Keywords: refactoring, case study, scalability, Objective-C 1. Introduction Taking software engineering tools from research to development requires addressing the practical details of software development: huge amounts of source code, the nuances of real languages, and multiple build configurations. Making tools useful for real programmers requires either addressing all these sorts of issues, or accepting various trade-offs in order to ship a reasonable software tool. In our case, we wanted to add refactoring to Apple’s Xcode IDE (integrated development environment.) 1 The refactoring feature would manipulate programs written in Objective-C. Objective-C is an object-oriented extension to C, and Apple’s primary development language [1]. 
In past research [2], I’d found it acceptable to take multiple minutes to perform a transformation on a small Scheme program. The critical requirements for our commercial tool were quite different: • Support the most common and useful transformations. Renaming declarations, replacing a block of code with a call to a new function, and moving declarations up and down a class hierarchy were mandatory features. • Refactor 200,000 line programs. The feature had to work on real, medium-sized applications. The actual amount of code to parse was much larger than the program’s size. Most Mac OS X compilation units pull in headers for common system libraries, requiring at least another 60-120,000 lines of code that would need to be parsed for every compilation unit. Such large sets of headers are not unique to Mac OS X. C programs using large libraries like the Qt user interface library would encounter similar scalability issues. • Interactive behavior. Xcode’s refactoring feature would be part of the source code editor. Users will expect transformations to complete in seconds rather than minutes, and the whole experience would need to feel interactive [3]. Parsing and analyzing programs of this size in straightforward ways would result in an unacceptable user experience. In one of my first experiences with a similar product, renaming a declaration in a 4,200 line C program (with the previously-mentioned 60,000 lines of headers) took two minutes. • Don't force the user to change the program in order to refactor. The competing product previously mentioned could provide much more acceptable performance if the user specified a pre-compiled header—a single header included by all compilation units. However, converting a large existing project to use a pre-compiled header is not a trivial task, and the additional and hidden setup step discourages new users. • Be aware of use of C's preprocessor. The programs being manipulated would make common use of preprocessor macros and conditionally compiled code. If we did not fully address how the preprocessor affected refactoring, we would at least need to be aware of the potential issues. • Reuse existing parsing infrastructure. We realized there wasn’t sufficient time or resources to write a new parser from scratch. Analysis would need to be done by an existing Objective C parser used for indexing global declarations. Refactoring had to work best for our third-party developers—primarily developers writing GUI applications. It should also work well for developers within Apple, but not for those writing low-level operating system or device driver code. Performance and interactivity were key—we wanted to provide an excellent refactoring experience. In order to meet these performance and interactivity goals, we attacked three areas: using different levels of accuracy in different parts of the tool, recognizing differences between our target programmers and typical C programmers, and finding ways to exploit delays in the user’s interaction with the tool. 2. Different Levels of Accuracy In C, each source file is preprocessed and compiled independently as a “compilation unit”. 
Each can include different headers, or can include the same headers with different inclusion order or Performance Trade-offs Implementing Refactoring Support for Objective-C Robert Bowdidge* rbowdidge@mac.com Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 3rd Workshop on Refactoring Tools '09, Oct. 26, 2009, Orlando, FL. Copyright © 2009 ACM 978-1-60558-909-1...$10.00 * This work was performed while the author was at Apple, and discusses the initial implementation of refactoring for Xcode 3.0. The author is currently at Google.initial macro settings. As a result, each compilation unit may interpret the same headers different ways, and may parse different declarations in those same headers. For correct parsing, the compiler needs to compile every source file independently, read in header files anew each time, and fully parse all headers. For small programs, this may not matter, but with Mac OS X, each source file includes between 60-120,000 lines of code from header files. Precompiled headers and other optimizations could speed compile times, but not all developers use precompiled headers, nor could we demand that developers use such schemes in order to use refactoring. Naively parsing all source code was not acceptable; we saw parse times of around five seconds to parse a typical set of headers, so five seconds minimum per file per build configuration would be completely unacceptable. We realized two facts about programs that made us question whether we needed compilation-unit-level accuracy. We realized that although programmers have the opportunity for header files to be interpreted differently in each compilation unit, most programmers intend for the headers to be processed the same in all compilation units. (When header files are not processed uniformly, it can cause subtle, nasty bugs that can take days to track down.) We also realized that system header files are not really part of the project, and not targets for refactoring. We needed to correctly parse system header files merely for their information on types and external function declarations. For most refactoring operations, we didn’t care if the my_integer_t type was 4 bytes long or 8; we just needed to know that the name referred to a type. We also knew that correct refactoring transformations shouldn’t change the write-protected system header files. We thus made two assumptions about headers we parsed. First, we decided to parse each header file at most once, and would assume that the files were interpreted the same in each compilation unit. This meant that we could shorten parsing times for at least five seconds per file to five seconds (for all system header files), plus the additional time to only parse the source files and headers in the project. Second, we gathered less position information for system header files. We knew that changes in system header files were both incorrect (because we couldn’t change the existing code in libraries) and uninteresting (because we couldn’t change all other clients of the header file.) We gathered less exact position information for such files, and would flag errors if a transformation would change code in a system header file. 
We also realized that the user interface needed information about the source code to identify whether refactoring was possible for a given selection, which transformations were possible, and what the default parameters for the transformation would be. Because we wanted the user interface to make these suggestions immediately without waiting for parsing to complete, we used saved information from the Xcode’s declaration index when helping the user propose a refactoring transformation. We did have some issues where indexer information had inaccuracies (when its less accurate parser misparsed certain constructs), but in general we found the information good enough for our first release. 3. The Typical Programmer Dealing with conditional code and multiple build configurations is another major issue for refactoring and source code analysis of C programs. We realized that many of the assumptions about C code did not hold for Objective-C programs, and changed our expectations of what we would implement. C’s preprocessor supports conditional code—code only compiled if certain macros are set. Although some conventions exist for using conditional directives, the criteria triggering a particular block of code usually can be understood only by evaluating the values of the controlling macros at the point the preprocessor would have interpreted the directive. If source code with conditional code was refactored without considering all potential conditions, syntax errors or changed behavior could be introduced. Others have proposed various solutions for handling conditional code. Garrido and Johnson expanded the conditional code to cover entire declarations, and annotated the ASTs to mark the configurations including each declaration [4]. Vittek suggested parsing only the feasible sets of configuration macros, parsing each condition separately, and merging the resulting parse trees [5]. McCloskey and Brewer proposed a new preprocessor amenable to analysis and change, with tools to migrate existing programs to the new preprocessor [6]. We instead chose to parse for a single build configuration—a single set of macros, compiler flags, and include paths. Parsing a single build configuration appeared reasonable because ObjectiveC programs use the preprocessor less than typical C programs, because occurrences of conditional code were unlikely to be refactored, and because remaining uses of conditional code were insensitive to the refactoring changes. Ernst’s survey of preprocessor use found that UNIX utilities varied in their use of preprocessor directives. He found the percentage of preprocessor directives to total non-comment, nonblank (NCNB) lines ranged between 4% and 22% [7]. By contrast, only 3-8% of lines in typical Objective-C programs were preprocessor directives. (Measurements were made on sources for the Osirix medical visualization application, Adium multiprotocol chat client, and Xcode itself.) Within those Objective-C programs, preprocessor directives and conditional code also occurred much more frequently in the code unlikely to be refactored. Many were either in third-party utility code, or in cross-platform C++ code. The utility code was often public-domain source code intended for multiple operating systems. Such code is unlikely to be refactored for fear of complicating merges of newer versions. For applications designed for multiple operating systems, often a core C++ library would be the basis of all versions, and separate user interface code would be written for each operating system. 
Because our first release would not refactor or parse C++ code, such core code would be irrelevant to refactoring. For the Objective-C portions of the projects, only 2-4% of all lines were preprocessor directives. The preprocessor directives that do appear in Objective-C code are often irrelevant to refactoring. Of Ernst’s eleven categories of conditional code, many are either unlikely to affect the target audience, or are irrelevant to refactoring in general. Include guards are less frequently used in Objective-C because a separate directive (#import) ensures a file is included only once. Conditional directives that always disabled code (“#if (0)”) can be handled in the same way comments are processed. Operating system -specific conditional code is unlikely in Objective-C code because the language is used only on Mac OS X. However, there are three problematic conditional code directives that appear in Objective-C programs: code for debugging, architecture-specific code, and conditional code for enabling and disabling features in the project. Conditional code for debugging is unlikely to be troublesome. The rename transformation will make an incorrect change if a declaration is referenced in conditional code that is not parsed. If the condition is parsed, then the conditional code is not a concern. The most dangerous case occurs when code that needs to be manipulated exists in two conditionally compiled sections of code never parsed at the same time. Luckily, most conditional code controlled by debugging macros only adds code to the debug case, and does not add code to the non-debug case. As long as we parse the program with debug macros set (which should be the default during development), then we should parse all necessary code. Architecture-specific code is more common at Apple because we support two architectures (x86 and PowerPC), both in 32 and 64 bit versions. Most of the architecture-specific conditional code is found in low level system code and device drivers. The external developers we are targeting with refactoring would be working on application software, and would be unlikely to have architecturespecific code. Project-specific features controlled by conditional compilation directives represent a larger risk. Some of these may actually be in use (such as code shared between an iPhone and Mac application), and others may represent dead code. Code may exist on both sides of a condition. For the first release, we only changed code in the current build configuration, and relied on the user to be aware of and avoid changes in project-specific conditional code. 4. Exploiting Interaction Delays A final area for optimization was deciding when parsing and refactoring work would begin during actual use. Even with our previous decisions, parsing speed still wasn’t acceptable. Our rough numbers were that we could parse all the system header files in about 5 seconds, and then could parse an additional ten files a second on a typical machine. Caching the results of the header file parsing was an obvious solution, but we weren’t sure we had the time to implement such caching. A straightforward implementation would start parsing after the user specified the transformation to be performed, and only show results when the transformation was complete. We realized we could speed perceived performance by starting parsing early, and showing partial results before the transformation completed. 
4.1 Optimistically Starting Parsing It usually takes a few seconds for a programmer to specify a refactoring transformation. Even for the simple rename, the user needs to indicate that he wants to rename a declaration, then needs to type in the new name. For “extract function”, the additional choices for parameter name and order requires additional time. To improve perceived performance, we began parsing the currently active file and header files as soon as the programmer had selected the “refactor” menu item. For refactoring transformations that only affected a single file, this often meant that as soon as the user specified the parameters for refactoring, the parsing had already been completed, and the transformation would be ready immediately. 4.2 Showing Partial Results When performing transformations changing multiple files, we similarly exploited how programmers would interact with the refactoring tool. We knew that most programmers beginning to use refactoring might want to examine the changes being made to double-check that the transformation was correct. If we assumed that most transformations would be successful (because the programmer was unlikely to try a transformation they thought would break their code), then we could begin showing partial results immediately rather than waiting for the entire transformation to be complete and validated to be safe. Most descriptions of refactoring break each transformation into two parts: the pre-conditions (which indicate the requirements that must be met before a transformation may be performed) and the changes to the source code (which are only performed after the change is believed safe. [8]) Because parse times are liable to be longer than a few seconds, the “check, then perform” approach would not have been interactive. The user would have to wait until all source code was parsed and all refactoring complete before examining any results. Similarly, parse trees for all functions would need to be generated before any refactoring work could begin. If the project being manipulated was particularly large, then the parse trees could consume huge amounts of memory. To make refactoring more palatable on large projects, we designed our transformations to work in several phases so that changes could be presented shown after only some of the code had been parsed and portions of the transformation performed. (See Figure 1). We also could dispose of some parse trees as soon as that code has been analyzed. The seven phases for our transformations are: • check user input: precondition checks that could be done with the initial inputs to the transformation only. • check first file: precondition checks to do after the file containing the selection is parsed. Generally, the analysis performed in this phase only performs initial sanity checks requiring parse trees. For the rename transformation, the phase checks that the declaration can be renamed, if the name is a valid C identifier, and if the declaration is not in a system header file. • perform first file: apply any changes that can be determined after the first file is parsed. Few transformations do work in this phase. • check per-file: precondition checks to do after parsing each compilation unit. • perform per-file: changes to apply after parsing each compilation unit. Most transformations do the bulk of their work in the per-file category. The check and perform parts both look at newly found uses of relevant declarations, and make appropriate changes. 
Each transformation specifies if the memory for parsed representations of function bodies can be freed before beginning the next file. • check final: precondition checks to do after parsing all files. The after-parsing checks tend to involve existence tests or nonexistence tests—whether any situations exist that indicate the transformation is unsafe such as “did we ever see any declaraparse b.c check per-file perform per-file check per-file perform per-file check per-file perform first file perform per-file check first file parse a.c parse c.c perform final check final Process a.c Process b.c Process c.c Figure 1: Order of processing of interleaved refactoring transformation on three source files a.c, b.c, and c.c. Results of the transformation are incrementally updated after each perform- phase is complete.tions with this name already?” Some of these checks could be done incrementally as each file is parsed. • perform final: changes to apply after parsing all files. The perform final phase is typically used for edits that cannot be constructed until all sources have been parsed. For example, when converting references to a structure’s field to call getter or setter functions, the transformation needs to determine where to place the new accessor functions. The accessors need to be placed in a source file (rather than a header), preferably near existing references to the field or the definition of the structure. Typically, the transformation can place the functions as soon as a likely location is found. If no appropriate location for the new code is found in any source file, the perform final phase chooses an arbitrary location. By breaking up each transformation in this way, the user experience of refactoring becomes more interactive. The refactoring user interface can show the list of files which must be parsed for a transformation. As each file is parsed and changes are identified, the user interface indicates completion and notes the number of changes in that file. Selecting the filename shows a side-by-side view of the source before and after the change. As the transformation progresses, more files and edits are displayed. The user can examine proposed changes as soon as each file is processed. While examining the changes, the user can also choose not to include some changes, or can make additional edits to the changed source code. In this way, the user can both measure progress and can be working productively as the transformation progresses. The interleaved transformation approach has the risk of declaring a transformation unsafe after the user has already examined some changes. This turned out not to be a problem in actual use. Programmers weren’t bothered by the delayed negative answer. We also found very few transformations where we could outright refuse to do a transformation. We might warn the result is incorrect, but we found programmers often wanted the chance to apply those incorrect changes and then fix remaining problems with straight edits. 5. Conclusions Overall, our progress on refactoring matched effort described on similar projects. Our first prototype was completed in three months by one person, and our first release required two years and three people. We found the transformations tended to be easy to write. Most of our parsing effort focused on scalability - getting parsing performance and memory use low, and making sure it worked well inside the IDE. 
We also found that implementing a polished user interface took the majority of the overall effort, with two of the engineers working full time on refactoring workflow and on making the file comparison view as polished as possible. With the trade-offs described here, we met our performance goals. Our goal at the beginning of the project was to permit refactoring on 200,000 line projects, and be able to rename a frequently-referenced declaration within 30 seconds. On a 2.2 GHz Dual Xeon PowerMac with 1 GB of memory, we renamed declarations in a 270,000 line Objective-C project. We found we could rename a class referenced in 382 places through 123 files in 28 seconds. We could rename a class used in 65 files in 15 seconds. Operations involving only a single file took around 8 seconds; this time was irrespective of the source file because parsing the headers dominated. Most transformations only require parsing a small subset of source files in a project. However, one of the transformations searches all code for iterators that can be converted to use a new language feature. Parsing the entire 270,000 line project for this transformation takes around 90 seconds. This is not acceptable for the interactive transformations, but is adequate for an infrequently run transformation that changes all source files. The refactoring feature as described shipped as part of Xcode 3.0 and Mac OS X 10.5. Building software development tools in industry requires making tradeoffs in both requirements and design. Some are driven by the expected needs of users such as the size of programs to be refactored, or response times expected. Some are driven by scalability issues such as whether to save pre-processed header files in the IDE between refactorings, or whether to re-parse headers from scratch each time. Other tradeoffs occur for business, timing, or staffing reasons, affecting whether a feature might even be implemented, or whether a new parser is written from scratch. As described in this paper, our requirements strongly affected what we could and did implement. The particular tradeoffs we made may not appear to be the "right" or "perfect" decision in all cases, but they are representative of the sorts of decisions that must be made during the process of commercial development. Our three themes of trade-offs—identifying where different levels of accuracy were acceptable, recognizing differences between "our typical user" and "a typical user", and exploiting delays in user interaction to improve responsiveness—suggest ways that other tools can meet their own goals. Acknowledgements Thanks to Michael Van De Vanter and Todd Fernandez for their feedback on a previous version of this paper. Dave Payne originally suggested applying the transformations file-by-file. Andrew Pontious and Yuji Akimoto implemented the refactoring user interface, and kept us focused on an interactive experience. Our approach for incrementally showing refactoring results is also described in U.S. Patent Application 20080052684, “ Stepwise source code refactoring”. References [1] Apple, "Apple Developer Documentation: Objective-C Programming Language," Cupertino, CA 2007. [2] R. W. Bowdidge and W. G. Griswold, "Supporting the Restructuring of Data Abstractions through Manipulation of a Program Visualization," ACM Transactions on Software Engineering and Methodology, vol. 7(2), 1998. [3] D. Bäumer, E. Gamma, and A. Kiezun, "Integrating refactoring support into a Java development tool," in OOPSLA 2001 Companion, 2001. [4] A. Garrido and R. 
Johnson, "Analyzing Multiple Configurations of a C Program," in 21st IEEE International Conference on Software Maintenance (ICSM), 2005. [5] M. Vittek, "Refactoring Browser with Preprocessor," in 7th European Conference on Software Maintenance and Reengineering, Benevento, Italy, 2003. [6] B. McCloskey and E. Brewer, "ASTEC: a new approach to refactoring C," in 13th ACM SIGSOFT international symposium on Foundations of Software Engineering ESEC/FSE-13, 2005. [7] M. D. Ernst, G. J. Badros, and D. Notkin, "An Empirical Analysis of C Preprocessor Use," IEEE Transactions on Software Engineering, vol. 28, pp. 1146-1170, December 2002. [8] W. F. Opdyke, "Refactoring: A Program Restructuring Aid in Designing Object-Oriented Application Frameworks," University of Illinois, Urbana-Champaign, 1991. Linear-Space Computation of the Edit-Distance between a String and a Finite Automaton Cyril Allauzen1 and Mehryar Mohri2,1 1 Google Research 76 Ninth Avenue, New York, NY 10011, US. 2 Courant Institute of Mathematical Sciences 251 Mercer Street, New York, NY 10012, US. Abstract. The problem of computing the edit-distance between a string and a finite automaton arises in a variety of applications in computational biology, text processing, and speech recognition. This paper presents linear-space algorithms for computing the edit-distance between a string and an arbitrary weighted automaton over the tropical semiring, or an unambiguous weighted automaton over an arbitrary semiring. It also gives an efficient linear-space algorithm for finding an optimal alignment of a string and such a weighted automaton. 1 Introduction The problem of computing the edit-distance between a string and a finite automaton arises in a variety of applications in computational biology, text processing, and speech recognition [8, 10, 18, 21, 14]. This may be to compute the edit-distance between a protein sequence and a family of protein sequences compactly represented by a finite automaton [8, 10, 21], or to compute the error rate of a word lattice output by a speech recognition with respect to a reference transcription [14]. A word lattice is a weighted automaton, thus this further motivates the need for computing the edit-distance between a string and a weighted automaton. In all these cases, an optimal alignment is also typically sought. In computational biology, this may be to infer the function and various properties of the original protein sequence from the one it is best aligned with. In speech recognition, this determines the best transcription hypothesis contained in the lattice. This paper presents linear-space algorithms for computing the edit-distance between a string and an arbitrary weighted automaton over the tropical semiring, or an unambiguous weighted automaton over an arbitrary semiring. It also gives an efficient linear-space algorithm for finding an optimal alignment of a string and such a weighted automaton. Our linear-space algorithms are obtained by using the same generic shortest-distance algorithm but by carefully defining different queue disciplines. More precisely, our meta-queue disciplines are derived in the same way from an underling queue discipline defined over states with the same level.The connection between the edit-distance and the shortest distance in a directed graph was made very early on (see [10, 4–6] for a survey of string algorithms). 
This paper revisits some of these algorithms and shows that they are all special instances of the same generic shortest-distance algorithm using different queue disciplines. We also show that the linear-space algorithms all correspond to using the same meta-queue discipline using different underlying queues. Our approach thus provides a better understanding of these classical algorithms and makes it possible to easily generalize them, in particular to weighted automata. The first algorithm to compute the edit-distance between a string x and a finite automaton A as well as their alignment was due to Wagner [25] (see also [26]). Its time complexity was in O(|x||A| 2 Q) and its space complexity in O(|A| 2 Q|Σ| + |x||A|Q), where Σ denotes the alphabet and |A|Q the number of states of A. Sankoff and Kruskal [23] pointed out that the time and space complexity O(|x||A|) can be achieved when the automaton A is acyclic. Myers and Miller [17] significantly improved on previous results. They showed that when A is acyclic or when it is a Thompson automaton, that is an automaton obtained from a regular expression using Thompson’s construction [24], the edit-distance between x and A can be computed in O(|x||A|) time and O(|x|+|A|) space. They also showed, using a technique due to Hirschberg [11], that the optimal alignment between x and A can be obtained in O(|x| + |A|) space, and in O(|x||A|) time if A is acyclic, and in O(|x||A| log |x|) time when A is a Thompson automaton. The remainder of the paper is organized as follows. Section 2 introduces the definition of semirings, and weighted automata and transducers. In Section 3, we give a formal definition of the edit-distance between a string and a finite automaton, or a weighted automaton. Section 4 presents our linear-space algorithms, including the proof of their space and time complexity and a discussion of an improvement of the time complexity for automata with some favorable graph structure property. 2 Preliminaries This section gives the standard definition and specifies the notation used for weighted transducers and automata which we use in our computation of the edit-distance. Finite-state transducers are finite automata [20] in which each transition is augmented with an output label in addition to the familiar input label [2, 9]. Output labels are concatenated along a path to form an output sequence and similarly input labels define an input sequence. Weighted transducers are finitestate transducers in which each transition carries some weight in addition to the input and output labels [22, 12]. Similarly, weighted automata are finite automata in which each transition carries some weight in addition to the input label. A path from an initial state to a final state is called an accepting path. A weighted transducer or weighted automaton is said to be unambiguous if it admits no two accepting paths with the same input sequence.The weights are elements of a semiring (K, ⊕, ⊗, 0, 1), that is a ring that may lack negation [12]. Some familiar semirings are the tropical semiring (R+ ∪ {∞}, min, +, ∞, 0) and the probability semiring (R+ ∪ {∞}, +, ×, 0, 1), where R+ denotes the set of non-negative real numbers. In the following, we will only consider weighted automata and transducers over the tropical semiring. However, all the results of section 4.2 hold for an unambiguous weighted automaton A over an arbitrary semiring. The following gives a formal definition of weighted transducers. Definition 1. 
A weighted finite-state transducer T over the tropical semiring (R+ ∪ {∞}, min, +, ∞, 0) is an 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where Σ is the finite input alphabet of the transducer, ∆ its finite output alphabet, Q is a finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E ⊆ Q × (Σ ∪ {ǫ}) × (∆ ∪ {ǫ}) × (R+ ∪ {∞}) × Q a finite set of transitions, λ : I → R+ ∪ {∞} the initial weight function, and ρ : F → R+ ∪ {∞} the final weight function mapping F to R+ ∪ {∞}. We define the size of T as |T | = |T |Q + |T |E where |T |Q = |Q| is the number of states and |T |E = |E| the number of transitions of T . The weight of a path π in T is obtained by summing the weights of its constituent transitions and is denoted by w[π]. The weight of a pair of input and output strings (x, y) is obtained by taking the minimum of the weights of the paths labeled with (x, y) from an initial state to a final state. For a path π, we denote by p[π] its origin state and by n[π] its destination state. We also denote by P(I, x, y, F) the set of paths from the initial states I to the final states F labeled with input string x and output string y. The weight T (x, y) associated by T to a pair of strings (x, y) is defined by: T (x, y) = min π∈P (I,x,y,F ) λ(p[π]) + w[π] + ρ[n[π]]. (1) Figure 1(a) shows an example of weighted transducer over the tropical semiring. Weighted automata can be defined as weighted transducers A with identical input and output labels, for any transition. Thus, only pairs of the form (x, x) can have a non-zero weight by A, which is why the weight associated by A to (x, x) is abusively denoted by A(x) and identified with the weight associated by A to x. Similarly, in the graph representation of weighted automata, the output (or input) label is omitted. Figure 1(b) shows an example. 3 Edit-distance We first give the definition of the edit-distance between a string and a finite automaton. Let Σ be a finite alphabet, and let Ω be defined by Ω = (Σ ∪ {ǫ}) × (Σ ∪ {ǫ}) − {(ǫ, ǫ)}. An element of Ω can be seen as a symbol edit operation: (a, ǫ) is a deletion, (ǫ, a) an insertion, and (a, b) with a 6= b a substitution. We will denote by h the natural morphism between Ω∗ and Σ∗ × Σ∗ defined by a:b/.1 0 1 a:b/.2 2/1 a:b/.4 3/.8 b:a/.6 b:a/.3 b:a/.5 a/.1 0 1 a/.2 2/1 a/.4 3/.8 b/.6 b/.3 b/.5 (a) (b) Fig. 1. (a) Example of a weighted transducer T. (b) Example of a weighted automaton A. T(aab, bba) = A(aab) = min(.1 +.2 +.6 +.8, .2 +.4 +.5 +.8). A bold circle indicates an initial state and a double-circle a final state. The final weight ρ[q] of a final state q is indicated after the slash symbol representing q. h((a1, b1)· · ·(an, bn)) = (a1 · · · an, b1 · · · bn). An alignment ω between two strings x and y is an element of Ω∗ such that h(ω) = (x, y). Let c : Ω → R+ be a function associating a non-negative cost to each edit operation. The cost of an alignment ω = ω1 · · · ωn is defined as c(ω) = Pn i=1 c(ωi). Definition 2. The edit-distance d(x, y) of two strings x and y is the minimal cost of a sequence of symbols insertions, deletions or substitutions transforming one string into the other: d(x, y) = min h(ω)=(x,y) c(ω). (2) When c is the function defined by c(a, a) = 0 and c(a, ǫ) = c(ǫ, a) = c(a, b) = 1 for all a, b in Σ such that a 6= b, the edit-distance is also known as the Levenshtein distance. 
The edit-distance d(x, A) between a string x and a finite automaton A can then be defined as d(x, A) = min y∈L(A) d(x, y), (3) where L(A) denotes the regular language accepted by A. The edit-distance d(x, A) between a string x and a weighted automaton A over the tropical semiring is defined as: d(x, A) = min y∈Σ∗ A(y) + d(x, y)  . (4) 4 Algorithms In this section, we present linear-space algorithms both for computing the editdistance d(x, A) between an arbitrary string x and an automaton A, and an optimal alignment between x and A, that is an alignment ω such that c(ω) = d(x, A). We first briefly describe two general algorithms that we will use as subroutines.4.1 General algorithms Composition. The composition of two weighted transducers T1 and T2 over the tropical semiring with matching input and output alphabets Σ, is a weighted transducer denoted by T1 ◦ T2 defined by: (T1 ◦ T2)(x, y) = min z∈Σ∗ T1(x, z) + T2(z, y). (5) T1 ◦ T2 can be computed from T1 and T2 using the composition algorithm for weighted transducers [19, 15]. States in the composition T1 ◦ T2 are identified with pairs of a state of T1 and a state of T2. In the absence of transitions with ǫ inputs or outputs, the transitions of T1 ◦ T2 are obtained as a result of the following matching operation applied to the transitions of T1 and T2: (q1, a, b, w1, q′ 1 ) and (q2, b, c, w2, q′ 2 ) → ((q1, q′ 1 ), a, c, w1 + w2,(q2, q′ 2 )). (6) A state (q1, q2) of T1 ◦T2 is initial (resp. final) iff q1 and q2 are initial (resp. final) and, when it is final, its initial (resp.final) weight is the sum of the initial (resp. final) weights of q1 and q2. In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving state q2, thus the space and time complexity of composition is quadratic, that is O(|T1||T2|). Shortest distance. Let A be a weighted automaton over the tropical semiring. The shortest distance from p to q is defined as d[p, q] = min π∈P (p,q) w[π]. (7) It can be computed using the generic single-source shortest-distance algorithm of [13], a generalization of the classical shortest-distance algorithms. This generic shortest-distance algorithm works with an arbitrary queue discipline, that is the order according to which elements are extracted from a queue. We shall make use of this key property in our algorithms. The pseudocode of a simplified version of the generic algorithm for the tropical semiring is given in Figure 2. The complexity of the algorithm depends on the queue discipline selected for S. Its general expression is O(|Q| + C(A) max q∈Q N(q)|E| + (C(I) + C(X))X q∈Q N(q)), (8) where N(q) denotes the number of times state q is extracted from queue S, C(X) the cost of extracting a state from S, C(I) the cost of inserting a state in S, and C(A) the cost of an assignment. With a shortest-first queue discipline implemented using a heap, the algorithm coincides with Dijkstra’s algorithm [7] and its complexity is O((|E| + |Q|) log |Q|). For an acyclic automaton and with the topological order queue discipline, the algorithm coincides with the standard linear-time (O(|Q| + |E|)) shortest-distance algorithm [3].Shortest-Distance(A, s) 1 for each p ∈ Q do 2 d[p] ← ∞ 3 d[s] ← 0 4 S ← {s} 5 while S 6= ∅ do 6 q ← Head(S) 7 Dequeue(S) 8 for each e ∈ E[q] do 9 if (d[s] + w[e] < d[n[e]]) then 10 d[n[e]] ← d[s] + w[e] 11 if (n[e] 6∈ S) then 12 Enqueue(S, n[e]) Fig. 2. Pseudocode of the generic shortest-distance algorithm. 
4.2 Edit-distance algorithms The edit cost function c can be naturally represented by a one-state weighted transducer over the tropical semiring Tc = (Σ, Σ, {0}, {0}, {0}, Ec, 1, 1), or T in the absence of ambiguity, with each transition corresponding to an edit operation: Ec = {(0, a, b, c(a, b), 0)|(a, b) ∈ Ω}. Lemma 1. Let A be a weighted automaton over the tropical semiring and let X be the finite automaton representing a string x. Then, the edit-distance between x and A is the shortest-distance from the initial state to a final state in the weighted transducer U = X ◦ T ◦ A. Proof. Each transition e in T corresponds to an edit operation (i[e], o[e]) ∈ Ω, and each path π corresponds to an alignment ω between i[π] and o[π]. The cost of that alignment is, by definition of T , c(ω) = w[π]. Thus, T defines the function: T (u, v) = min ω∈Ω∗ {c(ω): h(ω) = (u, v)} = d(u, v), (9) for any strings u, v in Σ∗ . Since A is an automaton and x is the only string accepted by X, it follows from the definition of composition that U(x, y) = T (x, y) + A(y) = d(x, y) + A(y). The shortest-distance from the initial state to a final state in U is then: min π∈PU (I,F ) w[π] = min y∈Σ∗ min π∈PU (I,x,y,F ) w[π] = min y∈Σ∗ U(x, y) (10) = min y∈Σ∗ d(x, y) + A(y)  = d(x, A), (11) that is the edit-distance between x and A. ⊓⊔0 1 a/0 2 b/0 3/0 a/0 0/0 1 a/0 b/0 (a) (b) 0/0 ε:a/1 ε:b/1 a:ε/1 a:a/0 a:b/1 b:ε/1 b:a/1 b:b/0 0,0 0,1 ε:a/1 1,0 a:ε/1 1,1 a:a/0 ε:b/1 a:b/1 a:ε/1 ε:a/1 2,0 b:ε/1 2,1 b:a/1 ε:b/1 b:b/0 b:ε/1 ε:a/1 3,0/0 a:ε/1 3,1 a:a/0 ε:b/1 a:b/1 a:ε/1 ε:a/1 ε:b/1 (c) (d) Fig. 3. (a) Finite automaton X representing the string x = aba. (b) Finite automaton A. (c) Edit transducer T over the alphabet {a, b} where the cost of any insertion, deletion and substitution is 1. (d) Weighted transducer U = X ◦ T ◦ A. Figure 3 shows an example illustrating Lemma 1. Using the lateral strategy of the 3-way composition algorithm of [1] or an ad hoc algorithm exploiting the structure of T , U = X ◦ T ◦ A can be computed in O(|x||A|) time. The shortestdistance algorithm presented in Section 4.1 can then be used to compute the shortest distance from an initial state of U to a final state and thus the edit distance of x and A. Let us point out that different queue disciplines in the computation of that shortest distance lead to different algorithms and complexities. In the next section, we shall give a queue discipline enabling us to achieve a linear-space complexity. 4.3 Edit-distance computation in linear space Using the shortest-distance algorithm described in Section 4.1 leads to an algorithm with space complexity linear in the size of U, i.e. in O(|x||A|). However, taking advantage of the topology of U, it is possible to design a queue discipline that leads to a linear space complexity O(|x| + |A|). We assume that the finite automaton X representing the string x is topologically sorted. A state q in the composition U = X ◦T ◦A can be identified with a triplet (i, 0, j) where i is a state of X, 0 the unique state of T , and j a state of A. Since T has a unique state, we further simplify the notation by identifying each state q with a pair (i, j). For a state q = (i, j) of U, we will refer to i by the level of q. A key property of the levels is that there is a transition in U from q to q ′iff level(q ′ ) = level(q) or level(q ′ ) = level(q) + 1. 
Indeed, a transition from (i, j) to (i ′ , j′ ) in U corresponds to taking a transition in X (in that case i ′ = i + 1 since X is topologically sorted) or staying at the same state in X and taking an input-ǫ transition in T (in that case i ′ = i). From any queue discipline ≺ on the states of U, we can derive a new queue discipline ≺l over U defined for all q, q′ in U as follows: q ≺l q ′ iff level(q) < level(q ′ )  or level(q) = level(q ′ ) and q ≺ q ′  . (12) Proposition 1. Let ≺ be a queue discipline that requires at most O(|V |) space to maintain a queue over any set of states V . Then, the edit-distance between x and A can be computed in linear space, O(|x| + |A|), using the queue discipline ≺l. Proof. The benefit of the queue discipline ≺l is that when computing the shortest distance to q = (i, j) in U, only the shortest distances to the states in U of level i and i − 1 need to be stored in memory. The shortest distances to the states of level strictly less than i − 1 can be safely discarded. Thus, the space required to store the shortest distances is in O(|A|Q). Similarly, there is no need to store in memory the full transducer U. Instead, we can keep in memory the last two levels active in the shortest-distance algorithm. This is possible because the computation of the outgoing transitions of a state with level i only requires knowledge about the states with level i and i + 1. Therefore, the space used to store the active part of U is in O(|A|E +|A|Q) = O(|A|). Thus, it follows that the space required to compute the edit-distance of x and A is linear, that is in O(|x| + |A|). ⊓⊔ The time complexity of the algorithm depends on the underlying queue discipline ≺. A natural choice is for ≺ is the shortest-first queue discipline, that is the queue discipline used in Dijkstra’s algorithm. This yields the following corollary. Corollary 1. The edit-distance between a string x and an automaton A can be computed in time O(|x||A| log |A|Q) and space O(|x| + |A|) using the queue discipline ≺l. Proof. A shortest-first queue is maintained for each level and contains at most |A|Q states. The cost for the global queue of an insertion, C(I), or an assignment, C(A), is in O(log |A|Q) since it corresponds to inserting in or updating one of the underlying level queues. Since N(q) = 1, the general expression of the complexity (8) leads to an overall time complexity of O(|x||A| log |A|Q) for the shortestdistance algorithm. ⊓⊔ When the automaton A is acyclic, the time complexity can be further improved by using for ≺ the topological order queue discipline. Corollary 2. If the automaton A is acyclic, the edit-distance between x and A can be computed in time O(|x||A|) and space O(|x| + |A|) using the queue discipline ≺l with the topological order queue discipline for ≺.Proof. Computing the topological order for U would require O(|U|) space. Instead, we use the topological order on A, which can be computed in O(|A|), to define the underlying queue discipline. The order inferred by (12) is then a topological order on U. ⊓⊔ Myers and Miller [17] showed that when A is a Thompson automaton, the time complexity can be reduced to O(|x||A|) even when A is not acyclic. This is possible because of the following observation: in a weighted automaton over the tropical semiring, there exists always a shortest path that is simple, that is with no cycle, since cycle weights cannot decrease path weight. In general, it is not clear how to take advantage of this observation. 
However, a Thompson automaton has additionally the following structural property: a loop-connectedness of one. The loop-connectedness of A is k if in any depth- first search of A, a simple path goes through at most k back edges. [17] showed that this property, combined with the observation made previously, can be used to improve the time complexity of the algorithm. The results of [17] can be generalized as follows. Corollary 3. If the loop-connectedness of A is k, then the edit-distance between x and A can be computed in O(|x||A|k) time and O(|x| + |A|) space. Proof. We first use a depth-first search of A, identify back edges, and mark them as such. We then compute the topological order for A, ignoring these back edges. Our underlying queue discipline ≺ is defined such that a state q = (i, j) is ordered first based on the number of times it has been enqueued and secondly based on the order of j in the topological order ignoring back edges. This underlying queue can be implemented in O(|A|Q) space with constant time costs for the insertion, extraction and updating operations. The order ≺l derived from ≺ is then not topological for a transition e iff e was obtained by matching a back edge in A and level(p[e]) = level(n[e]). When such a transition e is visited, n[e] is reinserted in the queue. When state q is dequeued for the lth time, the value of d[q] is the weight of the shortest path from the initial state to q that goes through at most l −1 back edges. Thus, the inequality N(q) ≤ k + 1 holds for all q and, since the costs for managing the queue, C(I), C(A), and C(X), are constant, the time complexity of the algorithm is in O(|x||A|k). ⊓⊔ 4.4 Optimal alignment computation in linear space The algorithm presented in the previous section can also be used to compute an optimal alignment by storing a back pointer at each state in U. However, this can increase the space complexity up to O(|x||A|Q). The use of back pointers to compute the best alignment can be avoided by using a technique due to Hirschberg [11], also used by [16, 17]. As pointed out in previous sections, an optimal alignment between x and A corresponds to a shortest path in U = X ◦ T ◦ A. We will say that a state q in U is a midpoint of an optimal alignment between x and A if q belongs to a shortest path in U and level(q) = ⌊|x|/2⌋.Lemma 2. Given a pair (x, A), a midpoint of the optimal alignment between x and A can be computed in O(|x|+|A|) space with a time complexity in O(|x||A|) if A is acyclic and in O(|x||A| log |A|Q) otherwise. Proof. Let us consider U = X ◦ T ◦ A. For a state q in U let d[q] denote the shortest distance from the initial state to q, and by d R[q] the shortest distance from q to a final state. For a given state q = (i, j) in U, d[(i, j)] +d R[(i, j)] is the cost of the shortest path going through (i, j). Thus, for any i, the edit-distance between x and A is d(x, A) = minj (d[(i, j)] + d R[(i, j)]). For a fixed i0, we can compute both d[(i0, j)] and d R[(i0, j)] for all j in O(|x||A| log |A|Q) time (or O(|x||A| time if A is acyclic) and in linear space O(|x| + |A|) using the algorithm from the previous section forward and backward and stopping at level i0 in each case. Running the algorithm backward (exchanging initial and final states and permuting the origin and destination of every transition) can be seen as computing the edit-distance between x R and AR, the mirror images of x and A. Let us now set i0 = ⌊|x|/2⌋ and j0 = argminj (d[(i0, j)] + d R[(i0, j)]). 
It then follows that (i0, j0) is a midpoint of the optimal alignment. Hence, for a pair (x, A), the running-time complexity of determining the midpoint of the alignment is in O(|x||A|) if A is acyclic and O(|x||A| log |A|Q) otherwise. ⊓⊔ The algorithm proceeds recursively by first determining the midpoint of the optimal alignment. At step 0 of the recursion, we first find the midpoint (i0, j0) between x and A. Let x 1 and x 2 be such that x = x 1x 2 and |x 1 | = i0, and let A1 and A2 be the automaton obtained from A by respectively changing the final state to j0 in A1 and the initial state to j0 in A2 . We can now recursively find the alignment between x 1 and A1 and between x 2 and A2 . Theorem 1. An optimal alignment between a string x and an automaton A can be computed in linear space O(|x| + |A|) and in time O(|x||A|) if A is acyclic, O(|x||A| log |x| log |A|Q) otherwise. Proof. We can assume without loss of generality that the length of x is a power of 2. At step k of the recursion, we need to compute the midpoints for 2k string-automaton pairs (x i k , Ai k )1≤i≤2k . Thus, the complexity of step k is in O( P2 k i=1 |x i k ||Ai k | log |Ai k |Q) = O( |x| 2 k P2 k i=1 |Ai k | log |Ai k |Q) since |x i k | = |x|/2 k for all i. When A is acyclic, the log factor can be avoided and the equality P2 k i=1 |Ai k | = O(|A|) holds, thus the time complexity of step k is in O(|x||A|/2 k ). In the general case, each |Ai k | can be in the order of |A|, thus the complexity of step k is in O(|x||A| log |A|Q). Since there are at most log |x| steps in the recursion, this leads to an overall time complexity in O(|x||A|) if A is acyclic and O(|x||A| log |A|Q log |x|) in general. ⊓⊔ When the loop-connectedness of A is k, the time complexity can be improved to O(k|x||A| log |x|) in the general case.5 Conclusion We presented general algorithms for computing in linear space both the editdistance between a string and a finite automaton and their optimal alignment. Our algorithms are conceptually simple and make use of existing generic algorithms. Our results further provide a better understanding of previous algorithms for more restricted automata by relating them to shortest-distance algorithms and general queue disciplines. References 1. C. Allauzen and M. Mohri. 3-way composition of weighted finite-state transducers. In O. Ibarra and B. Ravikumar, editors, Proceedings of CIAA 2008, volume 5148 of Lecture Notes in Computer Science, pages 262–273. Springer-Verlag Berlin Heidelberg, 2008. 2. J. Berstel. Transductions and Context-Free Languages. Teubner Studienbucher: Stuttgart, 1979. 3. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press: Cambridge, MA, 1992. 4. M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 5. M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994. 6. M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific, 2002. 7. E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959. 8. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probalistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998. 9. S. Eilenberg. Automata, Languages and Machines, volume A–B. Academic Press, 1974–1976. 10. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK, 1997. 11. D. S. Hirschberg. 
A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, June 1975. 12. W. Kuich and A. Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, 1986. 13. M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002. 14. M. Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003. 15. M. Mohri, F. C. N. Pereira, and M. Riley. Weighted automata in text and speech processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended finite state models of language, Budapest, Hungary. John Wiley and Sons, Chichester, 1996. 16. E. W. Myers and W. Miller. Optimal alignments in linear space. CABIOS, 4(1):11– 17, 1988. 17. E. W. Myers and W. Miller. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5–37, 1989.18. G. Navarro and M. Raffinot. Flexible pattern matching. Cambridge University Press, 2002. 19. F. Pereira and M. Riley. Finite State Language Processing, chapter Speech Recognition by Composition of Weighted Finite Automata. The MIT Press, 1997. 20. D. Perrin. Finite automata. In J. V. Leuwen, editor, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 1–57. Elsevier, Amsterdam, 1990. 21. P. A. Pevzner. Computational Molecular Biology: an Algorithmic Approach. MIT Press, 2000. 22. A. Salomaa and M. Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, 1978. 23. D. Sankoff and J. B. Kruskal. Time Wraps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983. 24. K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):365–375, 1968. 25. R. A. Wagner. Order-n correction for regular languages. Communications of the ACM, 17(5):265–268, May 1974. 26. R. A. Wagner and J. I. Seiferas. Correcting counter-automaton-recognizable languages. SIAM Journal on Computing, 7(3):357–375, August 1978. JMLR: Workshop and Conference Proceedings vol 23 (2012) 44.1–44.3 25th Annual Conference on Learning Theory Open Problem: Better Bounds for Online Logistic Regression H. Brendan McMahan MCMAHAN@GOOGLE.COM Google Inc., Seattle, WA Matthew Streeter MSTREETER@GOOGLE.COM Google Inc., Pittsburgh, PA Editor: Shie Mannor, Nathan Srebro, Robert C. Williamson Abstract Known algorithms applied to online logistic regression on a feasible set of L2 diameter D achieve regret bounds like O(e D log T) in one dimension, but we show a bound of O( √ D + log T) is possible in a binary 1-dimensional problem. Thus, we pose the following question: Is it possible to achieve a regret bound for online logistic regression that is O(poly(D) log(T))? Even if this is not possible in general, it would be interesting to have a bound that reduces to our bound in the one-dimensional case. Keywords: online convex optimization, online learning, regret bounds 1. Introduction and Problem Statement Online logistic regression is an important problem, with applications like click-through-rate prediction for web advertising and estimating the probability that an email message is spam. 
We formalize the problem as follows: on each round t the adversary selects an example (xt , yt) ∈ R n × {−1, 1}, the algorithm chooses model coefficients wt ∈ R n , and then incurs loss `(wt ; xt , yt) = log(1 + exp(−ytwt · xt)), (1) the negative log-likelihood of the example under a logistic model. For simplicity we assume kxtk2 ≤ 1 so that any gradient kO`(wt)k2 ≤ 1. While conceptually any w ∈ R n could be used as model parameters, for regret bounds we consider competing with a feasible set W = {w | kwk2 ≤ D/2}, the L2 ball of diameter D centered at the origin. Existing algorithms for online convex optimization can immediately be applied. First-order algorithms like online gradient descent (Zinkevich, 2003) achieve bounds like O(D √ T). On a bounded feasible set logistic loss (Eq. (1)) is exp-concave, and so we can use second-order algorithms like Follow-The-Approximate-Leader (FTAL), which has a general bound of O(( 1 α + GD)n log T) (Hazan et al., 2007) when the loss functions are α-exp-concave on the feasible set; we have α = e −D/2 for the logistic loss (see Appendix A), which leads to a bound of O((exp(D) + D)n log T) in the general case, or O(exp(D) log T) in the one-dimensional case. The exponential dependence on the diameter of the feasible set can make this bound worse than the O(D √ T) bounds for practical problems where the post-hoc optimal probability can be close to zero or one. We suggest that better bounds may be possible. In the next section, we show that a simple Follow-The-Regularized-Leader (FTRL) algorithm can achieve a much better result, namely c 2012 H.B. McMahan & M. Streeter.MCMAHAN STREETER O( √ D + log T), for one-dimensional problems where the adversary is further constrained1 to pick xt ∈ {−1, 0, +1}. A single mis-prediction can cost about D/2, and so the additive dependence on the diameter of the feasible set is less than the cost of one mistake. The open question is whether such a bound is achievable for problems of arbitrary finite dimension n. Even the general onedimensional case, where xt ∈ [−1, 1], is not obvious. 2. Analysis in One Dimension We analyze an FTRL algorithm. We can ignore any rounds when xt = 0, and then since only the sign of ytxt matters, we assume xt = 1 and the adversary picks yt ∈ {−1, 1}. The cumulative loss function on P positive examples and N negative examples is c(w; N, P) = P log(1 + exp(−w)) + N log(1 + exp(w)). Let Nt denote the number of negative examples seen through the t’th round, with Pt the corresponding number of positive examples. We play FTRL, with wt+1 = arg min w c(w; Nt + λ, Pt + λ), for a constant λ > 0. This is just FTRL with a regularization function r(w) = c(w; λ, λ). Using the FTRL lemma (e.g., McMahan and Streeter (2010, Lemma 1)), we have Regret ≤ r(w ∗ ) +X T t=1 ft(wt) − ft(wt+1) where ft(w) = `(w; xt , yt). It is easy to verify that r(w) ≤ λ(|w| + 2 log 2). It remains to bound ft(wt) − ft(wt+1). Fix a round t. For compactness, we write N = Nt−1 and P = Pt−1. Suppose that yt = −1, so Nt = N + 1 and Pt = P (the case when yt+1 = +1 is analogous). Since ft is convex, by definition ft(w) ≥ ft(wt) + gt(w − wt) where gt = Oft(wt). Taking w = wt+1 and re-arranging, we have ft(wt) − ft(wt+1) ≤ gt(wt − wt+1) ≤ |gt ||wt − wt+1|. It is easy to verify that |gt | ≤ 1, and also that wt = log  P + λ N + λ  . Since yt = −1, wt+1 < wt , and so |wt − wt+1| = log  P + λ N + λ  − log  P + λ N + 1 + λ  = log(N + 1 + λ) − log(N + λ) = log  1 + 1 N + λ  ≤ 1 N + λ . 1. 
Constraining the adversary in this way is reasonable in many applications. For example, re-scaling each xt so kxtk2 = 1 is a common pre-processing step, and many problems also are naturally featurized by xt,i ∈ {0, 1}, where xt,i = 1 indicates some property i is present on the t’th example. 44.2OPEN PROBLEM: ONLINE LOGISTIC REGRESSION Thus, if we let T − = {t | yt = −1}, we have X t∈T − ft(wt) − ft(wt+1) ≤ X NT N=0 1 N + λ ≤ 1 λ + X NT N=1 1 N ≤ 1 λ + log(NT ) + 1. Applying a similar argument to rounds with positive labels and summing over the rounds with positive and negative labels independently gives Regret ≤ λ(|w ∗ | + 2 log 2) + log(PT ) + log(NT ) + 2 λ + 2. Note log(PT ) + log(NT ) ≤ 2 log T. We wish to compete with w ∗ where |w ∗ | ≤ D/2, so we can choose λ = √ 1 D/2 which gives Regret ≤ O( √ D + log T). References Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69, December 2007. H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003. Appendix A. The Exp-Concavity of the Logistic Loss Theorem 1 The logistic loss function `(wt ; xt , yt) = log(1 + exp(−ytwt · xt)), from Eq. (1), is α-exp-concave with α = exp(−D/2) over set W = {w | kwk2 ≤ D/2} when kxtk2 ≤ 1 and yt ∈ {−1, 1}. Proof Recall that a function ` is α-exp-concave if O 2 exp(−α`(w))  0. When `(w) = g(w·x) for x ∈ R n , we have O 2 exp(−α`(w)) = O 2f 00(z)xx>, where f(z) = exp(−αg(z)). For the logistic loss, we have g(z) = log(1 + exp(z)) (without loss of generality, we consider a negative example), and so f(z) = (1 + exp(z))−α. Then, f 00(z) = αez (1 + e z ) −α−2 (αez − 1). We need the largest α such that f 00(z) ≤ 0, given a fixed z. We can see by inspection that α = 0 is a zero. Since e z (1 + e z ) −α−2 > 0, from the term (αez − 1) we conclude α = e −z is the largest value of α where f 00(z) ≤ 0. Note that z = wt · xt , and so |z| ≤ D/2 since kxtk2 ≤ 1, and so taking the worst case over wt ∈ W and xt with kxtk2 ≤ 1, we have α = exp(−D/2). 44.3 Online Microsurveys for User Experience Research Abstract This case study presents a critical analysis of microsurveys as a method for conducting user experience research. We focus specifically on Google Consumer Surveys (GCS) and analyze a combination of log data and GCSs run by the authors to investigate how they are used, who the respondents are, and the quality of the data. We find that such microsurveys can be a great way to quickly and cheaply gather large amounts of survey data, but that there are pitfalls that user experience researchers should be aware of when using the method. Author Keywords Microsurveys; user experience research; user research methods ACM Classification Keywords H.5.2. User Interfaces: Theory and methods. Introduction To keep up with fast paced design and development teams, user researchers must develop a toolkit of methods to quickly and efficiently address research questions. One such method is the microsurvey, or a short survey of only one to three questions. 
There are several commercial microsurvey platforms—including Google Consumer Surveys (GCS), SlimSurveys, and Survata—that promise to provide people with large amounts of data quickly and at a relatively low cost. In this case study, we present a critical analysis of one type of microsurvey, Google Consumer Surveys, addressing questions about how they are being used, who their respondents are, and of what quality is the data they collect. We conclude with some current best practices for using this method in user research.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s). CHI 2014, April 26–May 1, 2014, Toronto, Ontario, Canada. ACM 978-1-4503-2474-8/14/04. http://dx.doi.org/10.1145/2559206.2559975

Victoria Schwanda Sosik, Cornell University, 301 College Ave., Ithaca, NY 14850, vsosik@cs.cornell.edu
Elie Bursztein, Google Inc., Mountain View, CA 94043, elieb@google.com
Sunny Consolvo, Google Inc., Mountain View, CA 94043, sconsolvo@google.com
David Huffaker, Google Inc., Mountain View, CA 94043, huffaker@google.com
Gueorgi Kossinets, Google Inc., Mountain View, CA 94043, gkossinets@google.com
Kerwell Liao, Google Inc., Mountain View, CA 94043, kerwell@google.com
Paul McDonald, Google Inc., Mountain View, CA 94043, pmcdonald@google.com
Aaron Sedley, Google Inc., Mountain View, CA 94043, asedley@google.com

Figure 1. An example of how a respondent encounters GCS. The respondent is asked to answer a short survey question, or share the page they are reading via social media, in order to continue reading the publisher's content.
Figure 2. Part of the GCS results interface. To the left are controls to filter responses by demographics, and results to a multiple choice question are shown to the right.

Example of a Microsurvey: GCS
Since we use GCS in this case study, we first provide a brief overview of how it works. Each GCS respondent is shown only one question, two if there is a screening question. If a survey has more than one question, each respondent is randomly shown only one of the survey questions. The survey designer can choose one of twelve predefined question formats that include open ended, single answer, multiple answer, and rating scale responses. Certain question formats allow for images in the question or responses. Questions and responses must be short, with 125 character and 44 character limits respectively; multiple choice questions are limited to showing 5 response options to each respondent. Survey designers can request a representative population, or target respondents based on specific demographics (as inferred by IP addresses and DoubleClick cookies) or by using a screening question. Questions are then shown to people trying to access a publisher's premium content—primarily in the categories of News, Arts & Entertainment, and Reference—and people answer the question in order to continue reading the content (see Figure 1); in this way, these microsurveys act as a surveywall between the respondent and the content they want to access.

After data is collected, survey designers can view the results in the GCS interface, which provides basic analysis tools including comparison of results by different demographics and automatic, editable clustering of open-ended text responses (see Figure 2).

Results: Analysis of GCS
We analyzed GCS log data and data from several surveys run by the authors. Some of the surveys were run specifically to gather data about GCS as a method, and others were run to answer user research questions for our product teams; however, we analyzed them from a methodological perspective for this case study.

GCS by the Numbers
GCS log data shows that the two most frequently used types of questions are multiple-choice questions (see Table 1). Together, single answer and multiple answer questions make up over 80% of all deployed GCS questions. However, the most common question type—multiple answer—has the lowest completion rate (see Table 1). On average, respondents spend 9.7 seconds responding to a GCS question, and the modal response time is 4 seconds (see Figure 3). GCSs also collect data very quickly—on average, surveys are approved to start collecting data between one and four hours after being created, and complete data collection in about two to four days. General population surveys finish data collection on the lower end of that range, whereas targeted surveys tend to take the four days.

Table 1. Rate of usage among survey designers and completion rate among respondents for the 12 different types of GCS questions.
Question Type / Usage / Completion Rate
Multiple answers: 62.04% / 20.56%
Single answer: 21.71% / 39.37%
Open Ended: 4.62% / 27.03%
Rating: 3.81% / 34.19%
Numeric open ended: 1.60% / 25.30%
Rating with text: 1.50% / 34.09%
Rating with image: 1.30% / 27.20%
Large image choice: 0.99% / 28.49%
Side-by-side images: 0.92% / 29.37%
Image with menu: 0.82% / 36.57%
Open ended with image: 0.69% / 27.79%
Two choices with image: -- / --

Figure 3. Distribution of response times in seconds to GCS survey questions.

Who are GCS Respondents?
In November 2012, PEW Research ran a study to compare GCS demographics with those of their telephone panels. Their overall findings were that GCS respondents "conform closely to the demographic composition of the overall internet population," and that there is little evidence that GCS is biased towards heavy internet users [4].

Table 2. Inferred GCS demographics compared to PEW demographics.
Demographic / PEW / GCS
Men: 32% / 27%
Women: 35% / 27%
18–24: 33% / 18%
25–34: 37% / 30%
35–44: 49% / 32%
45–54: 38% / 28%
55–64: 28% / 26%
65+: 18% / 23%
Unknown Age: — / 27%

Table 3. Social network usage among older Americans, using PEW and GCS survey samples.
Do you ever use the internet to use a social networking site like MySpace, Facebook, or LinkedIn.com? — PEW: 42% [age 50+]; GCS: 46% [age 45+]
What is the primary social networking site you use? — GCS: Facebook (85%), LinkedIn (6%), Twitter (4%), Google+ (3%), MySpace (1%)

We ran a series of GCSs to dig deeper into demographic and technology-use questions. We found that the rate of tablet ownership (PEW = 34%, GCS = 28%), cell phone ownership (91%, 67%) and use of cell phones (35%, 33%) or the internet for banking (61%, 48%) was lower among GCS than PEW respondents. In terms of demographics, GCS shows lower rates across age and gender (see Table 2). With respect to social networking site usage among older Americans, our findings using GCS were close to PEW's (see Table 3).

We also compared GCS respondents to respondents from Survey Sampling International (SSI) and Knowledge Networks (KN) panels with respect to internet use and technology adoption. Results across the panels were similar, with SSI respondents tending to be the heavier internet users and technology adopters, and KN the lowest (see Table 4).

Overall, while we notice demographic differences between the survey samples—likely due to the number of unknowns in GCS—technology usage and adoption is similar across all four samples, with PEW and KN representing the high and low extremes, respectively.

Respondents' Attitudes Toward Surveywalls
We ran a GCS to explore respondents' attitudes toward surveywalls that stand between them and content they are trying to access. We asked them which of five options they would prefer when trying to access premium content. We found that the most popular response was taking a short microsurvey (47%), followed by having content sponsored by an advertiser (34%), making a small one-time payment (10%), purchasing a subscription (6%), and other (3%; which they then had to specify as open-ended text).

Data Quality: Survey Attentiveness
As one measure of data quality, we ran a GCS that asked respondents one of several trap questions. For a summary of how respondents performed, see the sidebar to the left. We find that our GCS respondents answered the "Very Often" trap question correctly less often (73%) than in a published example of the same trap question asked on a paper survey (97%) [3]. A trap survey run on Mechanical Turk found only 61% of respondents answering correctly when asked to read an email and answer two questions [2], but this task is arguably harder than the questions we asked.

Data Quality: Garbage Open Ended Responses
We also analyzed data quality by looking at the rate of garbage responses that we received across 25 GCS questions run for other projects. Examples of these questions include: "which web browser(s) do you use?" and "what does clicking on this image allow you to do?" We counted responses such as "blah", "who cares", and "zzzzz" as garbage, and found that the percentage of garbage responses ranged from 1.8% to 23.4% (Mean = 7.8%).
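As a rough illustration of the kind of per-question tally described above, the sketch below computes garbage-response rates from exported (question, answer) pairs. The keyword list and the data layout are our own assumptions for illustration, not the coding scheme the authors actually used; the next paragraph relates these rates to "I don't know" answers.

```python
# Illustrative sketch only (assumed keyword-based flag, not the authors' coding):
# estimate the garbage-response rate for each open-ended question.
from collections import defaultdict

GARBAGE_MARKERS = {"blah", "who cares", "zzzzz", "asdf"}  # hypothetical list

def garbage_rate(responses):
    """responses: iterable of (question_id, answer_text) pairs."""
    totals, garbage = defaultdict(int), defaultdict(int)
    for qid, text in responses:
        totals[qid] += 1
        if text.strip().lower() in GARBAGE_MARKERS:
            garbage[qid] += 1
    return {qid: garbage[qid] / totals[qid] for qid in totals}

demo = [("q1", "Chrome"), ("q1", "blah"), ("q2", "zoom the image"), ("q2", "zzzzz")]
print(garbage_rate(demo))  # {'q1': 0.5, 'q2': 0.5}
```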
Our analysis revealed that the percentage of "I don't know" responses tended to correlate with the percentage of garbage responses, suggesting that people were more likely to provide such garbage responses when they were not sure of what the question was asking of them.

Conclusion: Best Practices for Microsurveys
We find that microsurveys such as Google Consumer Surveys can quickly provide large amounts of data with relatively low setup costs. We also see that the GCS population is fairly representative as compared to other large-scale survey panels. However, there are also pitfalls to keep in mind. Our findings from the trap question survey suggest that being concise is important to maximize data quality, which supports GCS's question length constraints. We also suggest that it is important to appropriately target surveys to a population in order to keep garbage open ended responses to a minimum. If respondents are being asked about something they are unfamiliar with, they are less likely to provide meaningful responses. Finally, multiple answer questions had the lowest completion rate—which is often used as a measure of data quality (e.g. [1])—so we suggest that people think critically about the types of questions they use, and consider using other question types if at all appropriate. With respect to analyzing microsurveys, first it is important to remember that demographics are inferred, and there are many "unknowns". We also suggest using built-in text clustering tools to categorize open-ended responses, and if desired, following up with multiple choice questions to determine how frequent these categories are.

References
[1] Dillman, D. A. & Schaefer, D. R. (1998). Development of a standard e-mail methodology: results of an experiment. Public Opinion Quarterly, 62(3).
[2] Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of CHI '10.
[3] Hargittai, E. (2005). Survey Measures of Web-Oriented Digital Literacy. Social Science Computer Review, 23(3), 371–379.
[4] Pew Research (November 2012). A Comparison of Results from Surveys by the Pew Research Center and Google Consumer Surveys.

Table 4. Technology use and adoption among 3 different survey panels. Columns: KN / GCS / SSI.
For personal purposes, I normally use the Internet (5 = every hour or more, 1 = once per week or less): 3.2 / 3.5 / 3.8
Other people often seek my ideas and advice regarding technology (5 = describes me very well, 1 = describes me very poorly): 2.7 / 3.1 / 3.2
I am willing to pay more for the latest technology (same as above): 2.3 / 2.6 / 3.1
Which of the following best describes when you buy or try out new technology? (5 = Among the first people, 1 = I am usually not interested): 2.5 / 2.6 / 3.1
How frequently do you post on social networks? (5 = multiple times a day, 1 = once a month or less): 1.7 / 2.1 / 2.4

Trap Questions in GCS
• What is the color of a red ball? (90.3% correct)
• What is the shape of a red ball? (85.7%)
• The purpose of this question is to assess your attentiveness to question wording. For this question please mark the 'Very Often' response. (72.5%)
• The purpose of this question is to assess your attentiveness to question wording. Ignore the question below, and select "blue" from the answers. What color is a basketball? (57%)

Minimizing off-target signals in RNA fluorescent in situ hybridization
Aaron Arvey1,2, Anita Hermann3, Cheryl C.
Hsia3, Eugene Ie2,4, Yoav Freund2 and William McGinnis3,*
1 Computational and Systems Biology Center, Memorial Sloan-Kettering Cancer Center, New York, NY 10065, 2 Department of Computer Sciences and Engineering, 3 Division of Biological Sciences, University of California, San Diego, La Jolla, CA 92093 and 4 Google Inc., Mountain View, CA 94043, USA
Received November 4, 2009; Revised December 11, 2009; Accepted January 17, 2010

ABSTRACT
Fluorescent in situ hybridization (FISH) techniques are becoming extremely sensitive, to the point where individual RNA or DNA molecules can be detected with small probes. At this level of sensitivity, the elimination of 'off-target' hybridization is of crucial importance, but typical probes used for RNA and DNA FISH contain sequences repeated elsewhere in the genome. We find that very short (e.g. 20 nt) perfect repeated sequences within much longer probes (e.g. 350–1500 nt) can produce significant off-target signals. The extent of noise is surprising given the long length of the probes and the short length of non-specific regions. When we removed the small regions of repeated sequence from either short or long probes, we find that the signal-to-noise ratio is increased by orders of magnitude, putting us in a regime where fluorescent signals can be considered to be a quantitative measure of target transcript numbers. As the majority of genes in complex organisms contain repeated k-mers, we provide genome-wide annotations of k-mer-uniqueness at http://cbio.mskcc.org/aarvey/repeatmap.

INTRODUCTION
The gene expression profiles of individual cells can be drastically different from those of adjacent cells. This is particularly true in developing or heterogeneous tissues such as embryos (1), proliferative adult epithelia (2) and tumors (3). Visualization of RNA expression patterns in fields of cells is often accomplished with fluorescence in situ hybridization (FISH) using antisense probes. Analysis of cellular patterns of gene expression by FISH has provided insight into prognosis (3) and cell fate (4) of tissues. A challenge for the future is to use FISH in tissues to quantify RNA expression levels on a cell-by-cell basis, which requires high resolution, high sensitivity and high signal-to-noise ratios (1,5–8). A major hurdle in making RNA FISH methods quantitative has been increasing sensitivity and specificity to the point where genuine target RNA signals can be distinguished from background. One way to produce probes of high specificity has been to produce chemically synthesized oligonucleotides that are directly labeled with fluorophores, and tiled along regions of RNA sequence (6–8). Although directly-labeled oligo probes are elegant, they have not yet been widely applied, in part due to their expense, and in part due to their relatively low signal strength (6–8). One alternative method for single RNA molecule detection employs long haptenylated riboprobes that are enzymatically synthesized from cDNAs (1,5). Such probes are cheaply and easily produced, and when detected with primary and fluorescently-labeled secondary antibodies, they have higher signal intensities and equivalent resolution when compared to probes that are directly labeled with fluorophores (1,5).
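The probe-screening idea summarized in the Abstract above (removing short, e.g. 20 nt, subsequences that are repeated elsewhere in the genome) can be sketched roughly as follows. This is an illustrative reconstruction under our own assumptions: the function name and the precomputed set of repeated genomic 20-mers are ours, not the authors' published pipeline, which instead provides genome-wide k-mer-uniqueness annotations to guide probe design.

```python
# Rough sketch (our assumption, not the published pipeline): mask every position
# of a candidate probe that is covered by a 20-mer also found elsewhere in the
# genome, given a precomputed set of repeated (non-unique) genomic k-mers.
K = 20

def mask_repeated_kmers(probe_seq, repeated_kmers, k=K):
    """Replace every position covered by a repeated k-mer with 'N'."""
    probe_seq = probe_seq.upper()
    masked = list(probe_seq)
    for i in range(len(probe_seq) - k + 1):
        if probe_seq[i:i + k] in repeated_kmers:
            masked[i:i + k] = ["N"] * k
    return "".join(masked)

# Toy example: one hypothetical repeated 20-mer at the start of a 60 nt probe.
repeats = {"ACGT" * 5}
probe = "ACGT" * 5 + "TTGACCTGA" * 4 + "ACGT"
print(mask_repeated_kmers(probe, repeats)[:30])  # first 20 positions masked
```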
However, tiling probes have a natural advantage with respect to specificity: if a single probe 'tile' hybridizes to an off-target transcript, it is unlikely to generate sufficient signal to pass an intensity threshold that is characteristic of genuine RNA transcripts, which contain multiple tiled binding sites. In contrast, a single haptenylated probe, even if fragmented to sizes in the range of hundreds of nucleotides, may yield strong off-target signals due to the amplification conferred by primary and secondary antibodies. One traditional approach to determine background levels of fluorescence, and thus act as a crude estimate of specificity, has been the use of sense
*To whom correspondence should be addressed. Tel: 858 822 0461; Fax: 858 822 3021; Email: wmcginnis@ucsd.edu. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. Published online 17 February 2010. Nucleic Acids Research, 2010, Vol. 38, No. 10 e115, doi:10.1093/nar/gkq042. The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Collaboration in the Cloud at Google
Yunting Sun, Diane Lambert, Makoto Uchida, Nicolas Remy
Google Inc.
January 8, 2014

Abstract
Through a detailed analysis of logs of activity for all Google employees1, this paper shows how the Google Docs suite (documents, spreadsheets and slides) enables and increases collaboration within Google. In particular, visualization and analysis of the evolution of Google's collaboration network show that new employees2 have started collaborating more quickly and with more people as usage of Docs has grown. Over the last two years, the percentage of new employees who collaborate on Docs per month has risen from 70% to 90% and the percentage who collaborate with more than two people has doubled from 35% to 70%. Moreover, the culture of collaboration has become more open, with public sharing within Google overtaking private sharing.
1 Full-time Google employees, excluding interns, part-time workers, vendors, etc.
2 Full-time employees who have joined Google within the last 90 days.

1 Introduction
Google Docs is a cloud productivity suite designed to make collaboration easy and natural, regardless of whether users are in the same or different locations, working at the same or different times, or working on desktops or mobile devices. Edits and comments on the document are displayed as they are made, even if many people are simultaneously writing and commenting on or viewing the document. Comments enable real-time discussion and feedback on the document, without changing the document itself. Authors are notified when a new comment is made or replied to, and authors can continue a conversation by replying to the comment, or end the discussion by resolving it, or re-start the discussion by re-opening a closed discussion stream. Because documents are stored in the cloud, users can access any document they own or that has been shared with them anywhere, any time and on any device. The question is whether this enriched model of collaboration matters. There have been a few previous qualitative analyses of the effects of Google Docs on collaboration.
For example, the review of Google Docs in [1] suggested that its features should improve collaboration and productivity among college students. A technical report [2] from the University of Southern Queensland, Australia argued that Google Docs can overcome barriers to usability such as difficulty of installation and document version control and help resolve conflicts among co-authors of research papers. There has also been at least one rigorous study of the effect of Google Docs on collaboration. Blau and Caspi [3] ran a small experiment that was designed to compare collaboration on writing documents to merely sharing documents. In their experiment, 118 undergraduate students of the Open University of Israel were randomized to one of five groups in which they shared their written assignments and received feedback from other students to varying degrees, ranging from keeping texts private to allowing in-text suggestions or allowing in-text edits. None of the students had used Google Docs previously. The authors found that only students in the collaboration group perceived the quality of their final document to be higher after receiving feedback, and students in all groups thought that collaboration improves documents. This paper takes a different approach, and looks for the effects of collaboration on a large, diverse organization with thousands of users over a much longer period of time. The first part of the paper describes some of the contexts in which Google Docs is used for collaboration, and the second part analyzes how collaboration has evolved over the last two years. 2 Collaboration Visualization 2.1 The Data This section introduces a way to visualize the events during a collaboration and some simple statistics that summarize how widespread collaboration using Google Docs is at Google. The graphics and metrics are based on the view, edit and comment actions of all full-time employees on tens of thousands of documents created in April 2013. 2.2 A Simple Example To start, a document with three collaborators Adam (A), Bryant (B) and Catherine (C) is shown in Figure 1. The horizontal axis represents time during the collaboration. The vertical axis is broken into three regions representing viewing, editing and commenting. Each contributor is assigned a color. A box with the contributor's color is drawn in any time interval in which the contributor was active, at a vertical position that indicates what the user was doing in that time interval. This allows us to see when contributors were active and how often they contributed to the document. Stacking the boxes allows us to show when contributors were acting at the same time. Only time intervals in which at least one contributor was active are shown, and gaps in time that are shorter than a threshold are ignored. Gray vertical bars of fixed width are used to represent periods of no activity that are longer than the threshold. In this paper, the threshold is set to be 12 hours in all examples. In Figure 1, an interval represents an hour. Adam and Bryant edited the document together during the hour of 10 AM May 4 and Bryant edited alone in the following hour. The collaboration paused for 8 days and resumed during the hour of 2 pm on May 12. Adam, Bryant and Catherine all viewed the document during that hour. Catherine commented on the document in the next hour.
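The bookkeeping behind this kind of timeline can be sketched as follows. This is our own reconstruction, not the paper's code: we assume activity arrives as (user, action, timestamp) tuples, bucket it into hour-long intervals, and collapse idle gaps longer than the 12-hour threshold described above into a single pause marker.

```python
# Sketch (our reconstruction, not Google's implementation): bucket collaboration
# events into hourly intervals and collapse long idle gaps into pause markers.
from collections import defaultdict
from datetime import datetime, timedelta

GAP_THRESHOLD = timedelta(hours=12)   # matches the 12-hour threshold in the text

def build_timeline(events):
    """events: iterable of (user, action, datetime) tuples."""
    buckets = defaultdict(lambda: defaultdict(set))
    for user, action, ts in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour][action].add(user)

    timeline, prev = [], None
    for hour in sorted(buckets):
        if prev is not None and hour - prev > GAP_THRESHOLD:
            timeline.append(("pause", hour - prev))
        timeline.append((hour, buckets[hour]))
        prev = hour
    return timeline

events = [
    ("Adam", "edit", datetime(2013, 5, 4, 10, 15)),
    ("Bryant", "edit", datetime(2013, 5, 4, 10, 40)),
    ("Bryant", "edit", datetime(2013, 5, 4, 11, 5)),
    ("Catherine", "comment", datetime(2013, 5, 12, 15, 30)),
]
for entry in build_timeline(events):
    print(entry)
```

Coloring each hourly bucket by user (or by location or organization) is what the figures in this section stack vertically.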
Altogether, the collaboration had two active sessions, with a pause of 8 days between them. Figure 1: This figure shows an example of the collaboration visualization technique. Each colored block except the gray one represents an hour and the gray one represents a period of no activity. The Y axis is the number of users for each action type. This document has three contributors, each assigned a different color. Although we have used color to represent collaborators here, we could instead use color to represent the locations of the collaborators, their organizations, or other variables. Examples with different colorings are given in Sections 2.5 and 2.6. 2.3 Collaboration Metrics To estimate the percentage of users who concurrently edit a document and the percentage of documents which had concurrent editing, we discretize the timestamps of editing actions into 15 minute intervals and consider editing actions by different contributors in the same 15 minute interval to be concurrent. Two users who edit the same document but always more than 15 minutes apart would not be considered as concurrent, although they would still be considered collaborators. Edge cases in which two collaborators edit the same document within 15 minutes of each other but in two adjacent 15 minute intervals would not be counted as concurrent events. The choice of 15 minutes is arbitrary; however, metrics based on a 15 minute discretization and a 5 minute discretization are little different. The choice of 15 minute intervals makes computation faster. A more accurate approach would be to look for sequences of editing actions by different users with gaps below 15 minutes, but that requires considerably more computing. 2.4 Collaborative Editing Collaborative editing is common at Google. 53% of the documents that were created and shared in April 2013 were edited by more than one employee, and half of those had at least one concurrent editing session in the following six months. Looking at employees instead of documents, 80% of the employees who edited any document contributed content to a document owned by others and 65% participated in at least one 15 minute concurrent editing session in April 2013. Concurrent editing is sticky, in the sense that 76% of the employees who participate in a 15 minute concurrent editing session in April will do so again the following month. There are many use cases for collaborative editing, including weekly reports, design documents, and coding interviews. The following three plots show an example of each of these use cases. Figure 2: Collaboration activity on a design document. The X axis is time in hours and the Y axis is the number of users for each action type. The document was mainly edited by 3 employees, commented on by 18 and viewed by 50+. Figure 2 shows the life of a design document created by engineers. The X axis is time in hours and the Y axis is the number of employees working on the document for each action type. The document was mainly edited by three employees, commented on by 18 employees and viewed by more than 50 employees from three major locations. This document was completed within two weeks and viewed many times in the subsequent month. Design documents are common at Google, and they typically have many contributors. Figure 3 shows the life of a weekly report document.
Each bar represents a day and the Y axis is the number of employees who edited and viewed the document in a day. This document has the following submission rules: • Wednesday, AM: Reminder for submissions • Wednesday, PM: All teams submit updates • Thursday, AM: Document is locked The activities on the document exhibit a pronounced weekly pattern that mirrors the submission rules. Weekly reports and meeting notes that are updated regularly are often used by employees to keep everyone up-to-date as projects progress. Figure 3: Collaboration on a weekly report. The X axis is time in days and the Y axis is the number of users for each action type. The activities exhibit a pronounced weekly pattern and reflect the submission rules of the document. Finally, Figure 4 shows the life of a document used in an interview. The X axis represents time in minutes. The document was prepared by a recruiter and then viewed by an engineer. At the beginning of the interview, the engineer edited the document and the candidate then wrote code in the document. The engineer was able to watch the candidate typing. At the end of the interview, the candidate's access to the document was revoked so no further change could be made, and the document was reviewed by the engineer. Collaborative editing allows the coding interview to take place remotely, and it is an integral part of interviews for software engineers at Google. Figure 4: The activity on a phone interview document. The X axis is time in minutes and the Y axis is the number of users for each action type. The engineer was able to watch the candidate typing on the document during a remote interview. 2.5 Commenting Commenting is common at Google. 30% of the documents created in April 2013 that are shared received comments within six months of creation. 57% of the employees who used Google Docs in April commented at least once in April, and 80% of the users who commented in April commented again in the following month. Figure 5: Commenting and editing on a design document. The X axis is time in hours and the Y axis is the number of user actions for each user location. There are four user actions, each assigned a different color. Timestamps are in Pacific time. Figure 5 shows the life of a design document. Here color represents the type of user action (create a comment, reply to a comment, resolve a comment and edit the document), and the Y axis is split into two locations. The document was written by one engineering team and reviewed by another. The review team used commenting to raise many questions, which the engineering team resolved over the next few days. Collaborators were located in London, UK and Mountain View, California, with a nine hour time zone difference, so the two teams were almost "taking turns" working on the document (timestamps are in Pacific time). There are many similar communication patterns between engineers via commenting to ask questions, have discussions and suggest modifications. 2.6 Collaboration Across Sites Employees use the Docs suite to collaborate with colleagues across the world, as Figure 6 shows. In that figure, employees working from nine locations in eight countries across the globe contributed to a document that was written within a week. The document was either viewed or edited with gaps of less than 12 hours (the threshold for suppressing gaps in the plot) in the first seven days as people worked in their local timezones.
After final changes were made to the document, it was reviewed by people in Dublin, Mountain View, and New York. Figure 7 shows one month of global collaborations for full-time employees using Google Docs. The blue dots show the locations of the employees and a line connects two locations if a document is created in one location and viewed in the other. The warmer the color of the line, moving from green to red, the more documents shared between the two locations. Figure 6: Activity on a document. Each user location is assigned a different color. The X axis is time in hours and the Y axis is the number of locations for each action type. Users from nine different locations contributed to the document. Figure 7: Global collaboration on Docs. The blue dots are locations and the dots are connected if there is collaboration on Google Docs between the two locations. 2.7 Cross Device Work The advantage of cloud-based software and storage is that a document can be accessed from any device. Figure 8 shows one employee's visits to a document from multiple devices and locations. When the employee was in Paris, a desktop or laptop was used during working hours and a mobile device during non-working hours. Apparently, the employee traveled to Aix-en-Provence on August 18. On August 18 and the first part of August 19, the employee continued working on the same document from a mobile device while on the move. Figure 8: Visits to a document by one user working on multiple devices and from multiple locations. Not surprisingly, the pattern of working on desktops or laptops during working hours and on mobile devices out of business hours holds generally at Google, as Figure 9 shows. The day of week is shown on the X axis and hour of day in local time on the Y axis. Each pixel is colored according to the average number of employees working in Google Docs in a day of week and time of day slot, with brighter colors representing higher numbers. Pixel values are normalized within each plot separately. Desktop and laptop usage of Google Docs peaks during conventional working hours (9:00 AM to 11:00 AM and 1:00 PM to 5:00 PM), while mobile device usage peaks during conventional commuting and other out-of-office hours (7:00 AM to 9:00 AM and 6:00 PM to 8:00 PM). Figure 9: The average number of active users working in Google Docs in each day of week and time of day slot. The X axis is day of the week and the Y axis is time of the day in local time. Desktop/Laptop usage peaks during working hours while mobile usage peaks at out-of-office hours. 3 The Evolution of Collaboration 3.1 The Data This section explores changes in the usage of Google Docs over time. Section 2 defined collaborators as users who edited or commented on the same document and used logs of employee editing, viewing and commenting actions to describe collaboration within Google. This section defines collaborators differently using metadata on documents. Metadata is much less rich than the event history logs used in Section 2, but metadata is retained for a much longer period of time. Document metadata includes the document creation time and the last time that the document was accessed, but no other information about its revision history.
However, the metadata does include the identification numbers for employees who have subscribed to the document, where a subscriber is anyone who has permission to view, edit or comment on a document and who has viewed the document at least once. Here we use metadata on documents, slides and spreadsheets. We call two employees collaborators (or subscription collaborators to be clear) if one is a subscriber to a document owned by the other and has viewed the document at least once and the document has fewer than 20 subscribers. The owner of the document is said to have shared the document with the subscriber. The number of subscribers is capped at 20 to avoid overcounting collaborators. The more subscribers the document has, the less likely it is that all the subscribers contributed to the document. There is no timestamp for when the employee subscribed to the document in the metadata, so the exact time of the collaboration is not known. Instead, the document creation time, which is known, is taken to be the time of the collaboration. An analysis (not shown here) of the event history data discussed in Section 2 showed that most collaborators join a collaboration soon after a document is created, so taking collaboration time to be document creation time is not unreasonable. To make this assumption even more tenable, we exclude documents for which the time of the last view, comment or edit is more than six months after the document was created. This section uses metadata on documents created between January 1, 2011 and March 31, 2013. We say that two employees had a subscription collaboration in July if they collaborated on a document that was created in July. 3.2 Collaboration for New Employees Here we define the new employees for a given month to be all the employees who joined Google no more than 90 days before the beginning of the month and started using Google Docs in the given month. For example, employees called new in the month of January 2011 must have joined Google no more than 90 days before January 1, 2011 and used Google Docs in January 2011. Each month can include different employees. New employees are said to share a document if they own a document that someone else subscribed to, whether or not the person subscribed to the document is a new employee. Similarly, a new employee is counted as a subscriber, regardless of the tenure of the document creator. Figure 10 shows that collaboration among new employees has increased since 2011. Over the last two years, subscribing has risen from 55% to 85%, sharing has risen from 30% to 50%, and the fraction of users who either share or subscribe has risen from 70% to 90%. In other words, new employees are collaborating earlier in their career, so there is a faster ramp-up and easier access to collective knowledge. Figure 10: This figure shows the percentage of new employees who share, subscribe to others' documents and either share or subscribe in each one-month period over the last two years. Not only do new employees start collaborating more often (as measured by subscription and sharing), they also collaborate with more people. Figure 11 shows the percentage of new employees with at least a given number of collaborators by month. For example, the percentage of new employees with at least three subscription collaborators was 35% in January 2011 (the bottom red curve) and 70% in March 2013 (the top blue curve), a doubling over two years.
It is interesting that the curves hardly cross each other and the curves for the farthest back months lie below those for recent months, suggesting that there has been steady growth in the number of subscription collaborators per new employee over this period. Figure 11: This figure shows the proportion of new employees who have at least a given number of collaborators in each one-month period. Each period is assigned a different color. The cooler the color of the curve, moving from red to blue, the more recent the month. The legend only shows the labels for a subset of curves. The percentage of new employees who have at least three collaborators has doubled from 35% to 70%. To present the data in Figure 11 in another way, Table 1 shows percentiles of the distribution of the number of subscription collaborators per new employee using Google Docs in January 2011 and in January 2013. For example, the lowest 25% of new employees using Google Docs had no such collaborators in January 2011 and two such collaborators in January 2013.
Percentile: 25% / 50% / 75% / 90% / 95%
January 2011: 0 / 1 / 4 / 7 / 11
January 2013: 2 / 5 / 10 / 17 / 22
Table 1: This table shows the percentiles of the number of collaborators a new employee had in January 2011 and January 2013. The entire distribution shifts to the right.
3.3 Collaboration in Sales and Marketing Section 3.2 compared new employees who joined Google in different months. This section follows current employees in Sales and Marketing who joined Google before January 1, 2011. That is, the previous section considered changes in new employee behavior over time and this section considers changes in behavior for a fixed set of employees over time. We only analyze subscription collaborations among this fixed set of employees, and collaborations with employees not in this set are excluded. Figure 12: This figure shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators in each one-month period. Figure 12 shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators at several times in the past. There we see that more employees are sharing and subscribing over time because the fraction of the group with at least one subscription collaborator has increased from 80% to 95%, and the fraction of the group with at least three subscription collaborators has increased from 50% to 80%. It shows that many of the employees who used to have no or very few subscription collaborators have migrated to having multiple subscription collaborators. In other words, the distribution of the number of subscription collaborators for employees who have been in Sales and Marketing since January 1, 2011 has shifted right over time, which implies that collaboration in that group of employees has increased over time. Finally, the number of documents shared by the employees who have been in Sales and Marketing at Google since January 1, 2011 has nearly doubled over the last two years. Figure 13 shows the number of shared documents normalized by the number of shared documents in January 2011. Figure 13: This figure shows the number of shared documents created by employees in Sales and Marketing each month normalized by the number of shared documents in January 2011. The number has almost doubled over the last two years. 3.4 Collaboration Between Organizations Collaboration between organizations has increased over time.
To show that, we consider hundreds of employees in nine teams within the Sales and Marketing group and the Engineering and Product Management group who joined Google before January 1, 2011, were still active on March 31, 2013 and used Google Docs in that period. Figure 14 represents the Engineering and Product Management employees as red dots and the Sales and Marketing employees as blue dots. The same dots are included in all three plots in Figure 14 because the employees included in this analysis do not change. A line connects two dots if the two employees had at least one subscription collaboration in the month shown. The denser the lines in the graph, the more collaboration, and the more lines connecting red and blue dots, the more collaboration between organizations. Clearly, subscription collaboration has increased both within and across organizations in the past two years. Moreover, the network shows more pronounced communities (groups of connected dots) over time. Although there are nine individual teams, there seem to be only three major communities in the network. Figure 14 indicates that teams can work closely with each other even though they belong to separate departments. We also sampled 187 teams within the Sales and Marketing group and the Engineering and Product Management group. Figure 15 represents teams in Engineering and Product Management as red dots and teams in Sales and Marketing as blue dots. Two dots are connected if the two teams had at least one subscription collaboration between their members in the month. Figure 15 shows that the collaboration between those teams has increased and the interaction between the two organizations has become stronger over the past two years. Figure 14: An example of collaboration across organizations. Red dots represent employees in Engineering and Product Management and blue dots represent employees in Sales and Marketing. Figure 15: An example of collaboration between teams. Red dots represent teams in Engineering and Product Management and blue dots represent teams in Sales and Marketing. 3.5 Cultural Changes in Collaboration Google Docs allows users to specify the access level (visibility) of their documents. The default access level in Google Docs is private, which means that only the user who created the document or the current owner of the document can view it. Employees can change the access level on a document they own and allow more people to access it. For example, the document owner can specify particular employees who are allowed to access the document, or the owner can mark the document as public within Google, in which case any employee can access the document. Clearly, not all documents created in Google can be visible to everyone at Google, but the more documents are widely shared, the more open the environment is to collaboration. Figure 16: This figure shows the percentage of shared documents that are "public within Google" created in each month. Public sharing is overtaking private sharing at Google. Figure 16 shows the percentage of shared documents in Google created each month between January 1, 2012 and March 31, 2013 that are public within Google. The red line, which is a curve fit to the data to smooth out variability, shows that the percentage has increased about 12% from 48% to 54% in the last year alone.
In that sense, the culture of sharing is changing in Google from private sharing to public sharing. 4 Conclusions We have examined how Google employees collaborate with Docs and how that collaboration has evolved using logs of user activity and document metadata. To show the current usage of Docs in Google, we have developed a visualization technique for the revision history of a document and analyzed key features in Docs such as collaborative editing, commenting, and access from anywhere and on any device. To show the evolution of collaboration in the cloud, we have analyzed new employees and a fixed group of employees in Sales and Marketing, and computed collaboration network statistics each month. We find that employees are engaged in using the Docs suite, and collaboration has grown rapidly over the last two years. It would also be interesting to conduct a similar analysis for other enterprises and see how long it would take them to reach the benchmark Google has set for collaboration on Docs. Not only has collaboration on Docs changed at Google; the number of emails, comments on G+, and calendar meetings between people who work together has also changed significantly over the past few years. How those changes reinforce each other over time would also be an interesting topic to study. Acknowledgements We would like to thank Ariel Kern for her insights about collaboration on Google Docs, Penny Chu and Tony Fagan for their encouragement and support, and Jim Koehler for his constructive feedback.
References
[1] Dan R. Herrick (2009). Google this!: using Google apps for collaboration and productivity. Proceedings of the ACM SIGUCCS fall conference (pp. 55-64).
[2] Stijn Dekeyser, Richard Watson (2009). Extending Google Docs to Collaborate on Research Papers. Technical Report, The University of Southern Queensland, Australia.
[3] Ina Blau, Avner Caspi (2009). What Type of Collaboration Helps? Psychological Ownership, Perceived Learning and Outcome Quality of Collaboration Using Google Docs. Learning in the technological era: Proceedings of the Chais conference on instructional technologies research (pp. 48-55).

[March, 2013] WORKING GROUP 4 Network Security Best Practices FINAL Report – BGP Security Best Practices

Table of Contents
1 RESULTS IN BRIEF
1.1 CHARTER
1.2 EXECUTIVE SUMMARY
2 INTRODUCTION
2.1 CSRIC STRUCTURE
2.2 WORKING GROUP [#4] TEAM MEMBERS
3 OBJECTIVE, SCOPE, AND METHODOLOGY
3.1 OBJECTIVE
3.2 SCOPE
3.3 METHODOLOGY
4 BACKGROUND
4.1 DEPLOYMENT SCENARIOS
5 ANALYSIS, FINDINGS AND RECOMMENDATIONS
5.1 BGP SESSION-LEVEL VULNERABILITY
5.1.1 SESSION HIJACKING
5.1.2 DENIAL OF SERVICE (DOS) VULNERABILITY
5.1.3 SOURCE-ADDRESS FILTERING
5.2 BGP INJECTION AND PROPAGATION VULNERABILITY
5.2.1 BGP INJECTION AND PROPAGATION COUNTERMEASURES
5.2.2 BGP INJECTION AND PROPAGATION RECOMMENDATIONS
5.3 OTHER ATTACKS AND VULNERABILITIES OF ROUTING INFRASTRUCTURE
5.3.1 HACKING AND UNAUTHORIZED 3RD PARTY ACCESS TO ROUTING INFRASTRUCTURE
5.3.2 ISP INSIDERS INSERTING FALSE ENTRIES INTO ROUTERS
5.3.3 DENIAL-OF-SERVICE ATTACKS AGAINST ISP INFRASTRUCTURE
5.3.4 ATTACKS AGAINST ADMINISTRATIVE CONTROLS OF ROUTING IDENTIFIERS
6 CONCLUSIONS
7 APPENDIX
7.1 BACKGROUND
7.1.1 SALIENT FEATURES OF BGP OPERATION
7.1.2 REVIEW OF ROUTER OPERATIONS
7.2 BGP SECURITY INCIDENTS AND VULNERABILITIES
7.3 BGP RISKS MATRIX
7.4 BGP BCP DOCUMENT REFERENCES

1 Results in Brief
1.1 Charter
This Working Group was convened to examine and make recommendations to the Council regarding best practices to secure the Domain Name System (DNS) and routing system of the Internet during the period leading up to some significant deployment of protocol extensions such as the Domain Name System Security Extensions (DNSSEC), Secure BGP (Border Gateway Protocol) and the like. The focus of the group is limited to what is possible using currently available and deployed hardware and software. Development and refinement of protocol extensions for both systems is ongoing, as is the deployment of such extensions, and is the subject of other FCC working groups. The scope of Working Group 4 is to focus on currently deployed and available feature-sets and processes and not future or non-widely deployed protocol extensions.
1.2 Executive Summary
Routing is what provides reachability between the various end-systems on the Internet, be they servers hosting web or email applications, home user machines, VoIP (Voice over Internet Protocol) equipment, mobile devices, or connected home monitoring or entertainment systems. Across the length and breadth of the global network it is inter-domain routing that allows a given network to learn of the destinations available in a distant network. BGP (Border Gateway Protocol) has been used for inter-domain routing for over 20 years and has proven itself a dynamic, robust, and manageable solution to meet these goals. BGP is configured within a network and between networks to exchange information about which IP address ranges are reachable from that network. Among its many features, BGP allows for a flexible and granular expression of policy between a given network and other networks that it exchanges routes with. Implicit in this system is required trust in information learned from distant entities. That trust has, from time to time, been the source of reachability and stability problems. These episodes have typically been short-lived but underscored the need for expanding the use of Best Current Practices (BCPs) for improving the security of BGP and the inter-domain routing system. These mechanisms have been described in a variety of sources and this document does not seek to re-create the work done elsewhere but to provide an overview and gloss on the vulnerabilities and methods to address each. Additionally, the applicability of these BCPs can vary somewhat given different deployment scenarios such as the scale of a network's BGP deployment and the number of inter-domain neighbors. By tailoring advice for these various scenarios, recommendations that may seem confusing or contradictory can be clarified. Further, an appendix includes a table that indexes the risks and countermeasures according to different deployment scenarios.
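As a toy illustration of the policy and trust issues sketched in this summary (not a recommendation drawn from this report), the following checks whether a prefix announced by a customer falls within address space that the customer is authorized to originate; real deployments typically derive such filters from routing-registry data.

```python
# Toy illustration only: accept a customer's BGP announcement only if it falls
# inside a prefix that customer is authorized to originate. The AUTHORIZED data
# below is hypothetical.
import ipaddress

AUTHORIZED = {
    "AS64500": ["192.0.2.0/24", "198.51.100.0/22"],
}

def accept_announcement(customer_asn, announced_prefix):
    announced = ipaddress.ip_network(announced_prefix)
    for prefix in AUTHORIZED.get(customer_asn, []):
        if announced.subnet_of(ipaddress.ip_network(prefix)):
            return True
    return False

print(accept_announcement("AS64500", "192.0.2.0/25"))    # True: inside 192.0.2.0/24
print(accept_announcement("AS64500", "203.0.113.0/24"))  # False: not authorized
```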
Issues that the working group considered included:
• Session hijacking
• Denial of service (DoS) vulnerabilities
• Source-address filtering
• BGP injection and propagation vulnerabilities
• Hacking and unauthorized access to routing infrastructure
• Attacks against administrative controls of routing identifiers
Working Group 4 recommends that the FCC encourage adoption of numerous best practices for protecting ISPs' routing infrastructures and addressing risks related to routing that are continuously faced by ISPs. Inter-domain routing via BGP is a fundamental requirement for ISPs and their customers to connect and interoperate with the Internet. As such, it is a critical service that ISPs must ensure is resilient to operational challenges and protect from abuse by miscreants. SPECIAL NOTE: For brevity, and to address the remit of the CSRIC committee to make recommendations for ISPs, the term ISP is used throughout the paper. However, in most instances the reference or the recommendations are applicable to any BGP service components whether implemented by an ISP or by other organizations that peer to the Internet such as business enterprises, hosting providers, and cloud providers. 2 Introduction CSRIC was established as a federal advisory committee designed to provide recommendations to the Commission regarding Best Practices and actions the Commission may take to ensure optimal operability, security, reliability, and resiliency of communications systems, including telecommunications, media, and public safety communications systems. Due to the large scope of the CSRIC mandate, the committee then divided into a set of Working Groups, each of which was designed to address individual issue areas. In total, some 10 different Working Groups were created, including Working Group 4 on Network Security Best Practices. This Working Group will examine and make recommendations to the Council regarding best practices to secure the Domain Name System (DNS) and routing system of the Internet during the period leading up to an anticipated widespread implementation of protocol updates such as the Domain Name System Security Extensions (DNSSEC) and Secure Border Gateway Protocol (BGPsec) extensions. The Working Group presented its report on DNS Security issues in September 2012. This Final Report – BGP Best Practices documents the efforts undertaken by CSRIC Working Group 4 Network Security Best Practices with respect to securing the inter-domain routing infrastructure that is within the purview of ISPs, enterprises, and other BGP operators. Issues affecting the security of management systems that provide control and designation of routing and IP-space allocation records that BGP is based on were also considered. Routing and BGP-related services are necessary and fundamental components of all ISP operations, and there are many established practices and guidelines available for operators to consult. Thus most ISPs have mature BGP/routing management and infrastructures in place. Still, there remain many issues and exposures that introduce major risk elements to ISPs, since the system itself is largely insecure and unauthenticated, yet provides the fundamental traffic control system of the Internet.
This report enumerates the issues the group identified as most critical and/or that may need more attention.
2.1 CSRIC Structure
[Organizational chart: CSRIC Steering Committee and Working Groups 1–10, each with a Chair or Co-Chairs. Working Group 1: Next Generation 911; Working Group 2: Next Generation Alerting; Working Group 3: E911 Location Accuracy; Working Group 4: Network Security Best Practices; Working Group 5: DNSSEC Implementation Practices for ISPs; Working Group 6: Secure BGP Deployment; Working Group 7: Botnet Remediation; Working Group 8: E911 Best Practices; Working Group 9: Legacy Broadcast Alerting Issues; Working Group 10: 911 Prioritization.]
2.2 Working Group [#4] Team Members
Working Group [#4] consists of the members listed below for work on this report.
Name – Company
Rodney Joffe – Co-Chair, Neustar, Inc.
Rod Rasmussen – Co-Chair, Internet Identity
Mark Adams, ATIS (Works for Cox Communications)
Steve Bellovin, Columbia University
Donna Bethea-Murphy, Iridium
Rodney Buie, TeleCommunication Systems, Inc.
Kevin Cox, Cassidian Communications, an EADS NA Comp
John Crain, ICANN
Michael Currie, Intrado, Inc.
Dale Drew, Level 3 Communications
Chris Garner, CenturyLink
Joseph Gersch, Secure64 Software Corporation
Jose A. Gonzalez, Sprint Nextel Corporation
Kevin Graves, TeleCommunication Systems (TCS)
Tom Haynes, Verizon
Chris Joul, T-Mobile
Mazen Khaddam, Cox
Kathryn Martin, Access Partnership
Ron Mathis, Intrado, Inc.
Danny McPherson, Verisign
Doug Montgomery, NIST
Chris Oberg, ATIS (Works for Verizon Wireless)
Victor Oppleman, Packet Forensics
Elman Reyes, Internet Identity
Ron Roman, Applied Communication Sciences
Heather Schiller, Verizon
Jason Schiller, Google
Marvin Simpson, Southern Company Services, Inc.
Tony Tauber, Comcast
Paul Vixie, Internet Systems Consortium
Russ White, Verisign
Bob Wright, AT&T
Tony Tauber Comcast Paul Vixie Internet Systems Consortium Russ White Verisign Bob Wright AT&T Table 1 - List of Working Group Members 3 Objective, Scope, and Methodology 3.1 Objective This Working Group was convened to examine and make recommendations to the Council regarding best practices to secure the Domain Name System (DNS) and routing system of the Internet during the period leading up to what some anticipate might be widespread implementation of protocol updates such as the Domain Name System Security Extensions (DNSSEC) and Secure Border Gateway Protocol (BGPsec) extensions (though the latter outcome is not entirely uncontroversial). DNS is the directory system that associates a domain name with an IP (Internet Protocol) address. In order to achieve this translation, the DNS infrastructure makes hierarchical inquiries to servers that contain this global directory. As DNS inquiries are made, their IP packets rely on routing protocols to reach their correct destination. BGP is the protocol utilized to identify the best available paths for packets to take between points on the Internet at any given moment. This foundational system was built upon a distributed unauthenticated trust model that has been mostly sufficient for over two decades but has some room for improvement. These foundational systems are vulnerable to compromise through operator procedural mistakes as well as through malicious attacks that can suspend a domain name or IP address's availability, or compromise their information and integrity. While there are formal initiatives under way within the IETF (which has been chartered to develop Internet technical standards and protocols) that will improve this situation significantly, global adoption and implementation will take some time. The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [9] of [40] This Working Group will examine vulnerabilities within these areas and recommend best practices to better secure these critical functions of the Internet during the interval of time preceding deployment of more robust, secure protocol extensions. This report covers the BGP portion of these overall group objectives. 3.2 Scope Working Group 4’s charter clearly delineates its scope to focus on two subsets of overall network security, DNS and routing. It further narrows that scope to exclude consideration of the implementation of DNSSEC (tasked to Working Group 5) and secure extensions of BGP (tasked to Working Group 6). While those groups deal with protocol modifications requiring new software and/or hardware deployments; WG4 is geared toward items that either don't require these extensions or are risks which are outside the scope of currently contemplated extensions. For this report regarding BGP, the focus is on using known techniques within the Operator community. Some of these methods and the risks they seek to address are useful even in cases where protocol extensions are used in some future landscape. 3.3 Methodology With the dual nature of the work facing Working Group 4, the group was divided into two subgroups, one focused on issues in DNS security, another in routing security. Starting in December 2011, the entire Working Group met every two weeks via conference call(s) to review research and discuss issues, alternating between sub-groups. The group created a mailing list to correspond and launched a wiki to gather documents and to collectively collaborate on the issues. 
Additional subject matter experts were occasionally tapped to provide information to the working group via conference calls. The deliverables schedule called for a series of reports starting in June 2012 that would first identify issues for both routing and DNS security, then enumerate potential solutions, and finally present recommendations. The initial deliverables schedule was updated in March in order to concentrate efforts in each particular area for separate reports. This first report on DNS security issues was presented in September 2012, and this, the second report on routing issues, is being published in March 2013. Based on the discussions of the group, a list of BGP risks, potential solutions, and relevant BCP documents was created and refined over the course of the work. Subject matter experts in BGP then drove development of the initial documentation of issues and recommendations. These were then brought together into a full document for review and feedback. Text contributions, as completed, were reviewed, edited and approved by the full membership of Working Group 4. 4 Background 4.1 Deployment Scenarios BGP is deployed in many different kinds of networks of different size and profiles. Many different recommendations exist to improve the security and resilience of the inter-domain routing system. Some of the advice can even appear somewhat contradictory and often the key decision can come down to understanding what is most important or appropriate for a given network considering its size, the number of external connections, number of BGP routers, size The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [10] of [40] and expertise of the staff and so forth. We attempt to tailor the recommendations and highlight which are most significant for a given network operator’s situation. Further background and information on routing operations can be found in the Appendix (Section 7) of this document for readers unfamiliar with this area of practice. 5 Analysis, Findings and Recommendations The primary threats to routing include: • Risks to the routers and exchange of routing information • Routing information that is incorrect or propagates incorrectly • General problems with network operations 5.1 BGP Session-Level Vulnerability When two routers are connected and properly configured, they form a BGP peering session. Routing information is exchanged over this peering session, allowing the two peers to build a local routing table, which is then used to forward actual packets of information. The first BGP4 attack surface is the peering session between two individual routers, along with the routers themselves. Two classes of attacks are included here, session hijacking and denial of service. 5.1.1 Session Hijacking The BGP session between two routers is based on the Transport Control Protocol (TCP), a session protocol also used to transfer web pages, naming information, files, and many other types of data across the Internet. Like all these other connection types, BGP sessions can be hijacked or disrupted, as shown in Figure 1. Figure 1: Session Hijacking In this diagram, the attack host can either take over the existing session between Routers A and B, or build an unauthorized session with Router A. 
By injecting itself into the peering between Routers A and B, the attacker can inject new routing information, change the routing information being transmitted to Router A, or even act as a "man in the middle," modifying the routing information being exchanged between these two routers.
5.1.1.1 Session-Level Countermeasures
Current solutions to these types of attacks center on secure hash mechanisms, such as HMAC-MD5 (which has been deprecated) and HMAC-SHA. These mechanisms rely on the peering routers sharing a key (a shared key – essentially a password) that is used to calculate a cryptographic hash across each packet (or message) transmitted between the two routers, with the hash included in the packet itself. The receiving router can use the same key to calculate the hash. If the hash in the packet matches the locally calculated hash, the packet could only have been transmitted by another router that knows the shared key.
This type of solution is subject to a number of limitations. First, the key must actually be shared among the routers building peering sessions. In this case, the routers involved in the peering session are in different administrative domains. Coming to some uniform agreement about how keys are generated and communicated (e.g., phone, email, etc.) with the often hundreds of partners and customers of an ISP is an impractical task. There is the possibility that any key-sharing mechanism deployed to ease this administrative burden could itself come under attack (although such attacks have never been seen in the wild). Lastly, some concerns have been raised that the burden of cryptographic calculations could itself become a vector for a Denial-of-Service (DoS) attack by a directed stream of packets with invalid hash components. One way to deny service is to make the processor that is responsible for processing routing updates and maintaining liveness too busy to reliably process these updates in a timely manner. In many routers the processor responsible for calculating the cryptographic hash is also responsible for processing newly learned routing information, sending out new routing information, and even transmitting keep-alive messages to keep all existing sessions up. Since calculating the cryptographic hash is computationally expensive, even a relatively small flood of packets with invalid hashes can consume all the resources of the processor, making it easier to drive the processor into this overloaded state.
Other mechanisms, such as the Generalized TTL Security Mechanism (GTSM, described in RFC 5082), focus on reducing the scope of such attacks. This technique relies on a feature of the IP protocol that would prevent an attacker from effectively reaching the BGP process on a router with forged packets from some remote point on the Internet. Since most BGP sessions are built across point-to-point links (on which only two devices can communicate), this approach would prevent most attackers from interfering in the BGP session. Sessions built over a shared LAN, as is the case in some Internet exchanges, will be protected from those outside the LAN, but will remain vulnerable to all parties that are connected to the LAN. This solution is more complicated to implement when BGP-speaking routers are not directly connected. It is possible to count the number of hops between routers and limit the TTL value to only that number of hops.
This will provide some protection, limiting the scope of possible attackers to be within that many hops. If this approach is used consider failure scenarios of devices between the pair of BGP speaking routers, what impact those failures will have on the hop count between the routers, and if you want to expand the TTL value to allow the session to remain up for a failure that increases the hop count. The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [12] of [40] 5.1.1.2 Session-Level Current Recommendations For a network with a small (e.g., single-digit) number of eBGP neighbors, it is reasonable to follow the lead of what is specified by the upstream ISPs who may have a blanket policy of how they configure their eBGP sessions. A network with larger numbers of eBGP neighbors may be satisfied that they can manage the number of keys involved either through data-store or rubric. Note that a rubric may not always be feasible as you cannot ensure that your neighbors will always permit you to choose the key. Managing the keys for a large number of routers involved in BGP sessions (a large organization may have hundreds or thousands of such routers) can be an administrative burden. Questions and issues can include: • In what system should the keys be stored and who should have access? • Should keys be unique per usage having only one key for internal usage and another key that is shared for all external BGP sessions? • Should keys be unique per some geographical or geo-political boundary say separate keys per continent or per country or per router? • Should keys be unique to each administrative domain, for example a separate key for each Autonomous System a network peers with? • There is no easy way to roll over keys, as such changing a key is quite painful, as it disrupts the transmission of routing information, and requires simultaneous involvement from parties in both administrative domains. This makes questions of how to deal with the departure of an employee who had access to the keys, or what keys to use when peering in a hostile country more critical. Another consideration is the operational cost of having a key. Some routing domains will depend on their peers to provide the key each time a new session is established, and not bother to make a record of the key. This avoids the problems of how to store the key, and ensure the key remains secure. However if a session needs to be recreated because configuration information is lost either due to accidental deletion of the configuration, or hardware replacement, then the key is no longer known. The session will remain down until the peer can be contacted and the key is re-shared. Often times this communication does not occur, and the peer may simply try to remove the key as a troubleshooting step, and note the session reestablishes. When this happens the peer will often prefer for the session to remain up, leaving the peering session unsecured until the peer can be contacted, and a maintenance window can be scheduled. For unresponsive peers, an unsecured peering session could persist, especially considering that the urgency to address the outage has now passed. Despite these vulnerabilities having been widely known for a decade or more, they have not been implicated in any notable number of incidents. As a result some network operators have not found the cost/benefit trade-offs to warrant the operational cost of deploying such mechanisms while others have. 
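To make the shared-key mechanism described in Section 5.1.1.1 concrete, the following minimal Python sketch shows how a keyed hash lets a receiver confirm that a message came from a peer holding the same key. It uses HMAC-SHA-256 over an illustrative message string; the key, the message format, and the function names are invented for illustration and do not reflect the actual TCP MD5/TCP-AO processing performed inside a router's TCP stack.

import hashlib
import hmac

SHARED_KEY = b"example-peering-key"  # illustrative only; real keys are provisioned per session

def sign_message(payload: bytes, key: bytes = SHARED_KEY) -> bytes:
    # Keyed hash computed over the message; transmitted alongside the payload.
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify_message(payload: bytes, received_digest: bytes, key: bytes = SHARED_KEY) -> bool:
    # Recompute the hash locally and compare in constant time.
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, received_digest)

update = b"BGP UPDATE: announce 192.0.2.0/24"
digest = sign_message(update)
print(verify_message(update, digest))        # True: sender holds the shared key
print(verify_message(update, b"\x00" * 32))  # False: forged or tampered message is rejected

A peer that cannot produce a valid digest cannot inject or alter messages, but, as noted above, every forged packet still costs the receiver one hash computation, which is the basis of the hash-flooding denial-of-service concern.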
Given these facts, the Working Group recommends that individual network operators continue to make their own determinations in using these countermeasures. 5.1.2 Denial of Service (DoS) Vulnerability Because routers are specialized hosts, they are subject to the same sorts of Denial of Service (DoS) attacks any other host or server is subject to. These attacks fall into three types:The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [13] of [40] 1. Attacks that seek to consume all available interface bandwidth making it difficult for enough legitimate traffic to get through such as UDP floods and reflective attacks 2. Attacks that seek to exhaust resources such as consume all available CPU cycles, memory, or ports so that the system is too busy to respond such as TCP SYN attacks 3. Attacks utilizing specially crafted packets in an attempt to cause the system to crash or operate in an unexpected way such as buffer overflow attacks, or malformed packet attacks that create an exception that is not properly dealt with Bandwidth exhaustion attacks attempt to use so much bandwidth that there is not enough available bandwidth for services to operate properly. This type of an attack can cause routers to fail to receive routing protocol updates or keep-alive messages resulting in out-of-date routing information, routing loops, or interruption of routing altogether, such as happens when a BGP session goes down and the associated routing information is lost. Resource exhaustion attacks target traffic to the router itself, and attempt to make the router exhaust its CPU cycles or memory. In the case of the former, the router’s CPU becomes too busy to properly process routing keepalives and updates causing the adjacencies to go down. In the case of the latter, the attacker sends so much routing protocol information that the router has no available memory to store all of the required routing information. Crafted packets attacks attempt to send a relatively small number of packets that the router does not deal with appropriately. When a router receives this type of packet it may fill up interface buffers and then not forward any traffic on that interface causing routing protocols to crash and restart, reboot, or hang. In some cases the router CPU may restart, reboot, or hang likely causing loss of all topological and routing state. One example was the “protocol 55” attack, where some router vendors simply did not code properly how to deal with this traffic type. Some routers are specialized to forward high rates of traffic. These routers often implement their forwarding capabilities in hardware that is optimized for high throughput, and implement the less demanding routing functions in software. As such bandwidth exhaustion attacks are targeted at the routers interfaces or the backplane between those interfaces, or the hardware responsible for making forwarding decisions. The other types of attack target the software responsible for making the routing decisions. Due to the separation between routing and forwarding, a fourth class of attacks are targeted at exhausting the bandwidth of the internal interconnection between the forwarding components and the routing components. The section below on “Denial-of-Service Attacks on ISP Infrastructure” contains a discussion of disruptive attacks besides those targeting the exchange of BGP routing information. 
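Several of the countermeasures that follow rely on the GTSM hop-count check introduced in Section 5.1.1.1, so a minimal sketch of the check may be useful here. The idea in RFC 5082 is that the sender transmits its BGP packets with the maximum TTL of 255 and the receiver discards anything whose remaining TTL implies it was sent from farther away than the configured radius; a distant attacker cannot raise the TTL of its packets above what the intervening hops leave them with. The Python below is a simplification, with a made-up trust_radius parameter standing in for the configured hop count.

MAX_TTL = 255

def gtsm_accept(received_ttl: int, trust_radius: int = 0) -> bool:
    # Accept only packets whose remaining TTL shows the sender is within
    # trust_radius router hops of us (simplified RFC 5082 style check).
    return received_ttl >= MAX_TTL - trust_radius

print(gtsm_accept(255))                  # True: directly connected peer sending with TTL 255
print(gtsm_accept(200))                  # False: forged packet from a distant point on the Internet
print(gtsm_accept(253, trust_radius=2))  # True: multihop session allowing two intermediate hops

As discussed in Section 5.1.1.1, widening the radius to survive failure scenarios also widens the set of hosts that can reach the BGP process, so the value should be chosen deliberately.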
5.1.2.1 Denial of Service Countermeasures GTSM, described above, can be an effective counter to some forms of DoS attacks against routers, by preventing packets originating outside the direct connection between two BGP peers from being processed by the router under attack. GTSM cannot resolve simple buffer overflow problems, or DoS attacks that exploit weaknesses in packet processing prior to the TTL check, however.The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [14] of [40] Another mechanism currently used to prevent DoS attacks against routers is to simply make the interfaces on which the BGP session is running completely unreachable from outside the local network or the local segment. Using link-local addresses in IPv6, is one technique (with obviously limited applicability). Another approach is applying packet-filters on the relevant address ranges at the network edge. (This process is called infrastructure filtering). Other well-known and widely deployed DoS mitigation techniques can be used to protect routers from attack just as they can be used to protect other hosts. For instance, Control Plane Policing can prevent the routing process on a router from being overwhelmed with high levels of traffic by limiting the amount of traffic accepted by the router directed at the routing processor itself. 5.1.2.2 Denial-of-Service Current Recommendations Since routers are essentially specialized hosts, mechanisms that can be used to protect individual routers and peering sessions from attack are widely studied and well understood. What prevents these techniques from being deployed on a wide scale? Two things: the perception that the problem space is not large enough to focus on, and the administrative burden of actually deploying such defenses. For instance, when GTSM is used with infrastructure filtering, cryptographic measures may appear to be an administrative burden without much increased security. Smaller operators, and end customers, often believe the administrative burden too great to configure and manage any of these techniques. Despite these vulnerabilities having been widely known for a decade or more, they have not been implicated in any notable number of incidents. As a result some network operators have not found the cost/benefit trade-offs to warrant the operational cost of deploying such mechanisms while others have. Given these facts, the Working Group recommends that individual network operators continue to make their own determinations in using these countermeasures. In dealing with vulnerabilities due to “crafted packets”, the vendor should provide notification to customers as the issues are discovered as well as providing fixed software in a timely manner. Customers should make it a point to keep abreast of notifications from their vendors and from various security information clearing-houses. 5.1.2.2.1 Interface Exhaustion Attacks Recommendations include: 1. Understanding the actual forwarding capabilities of your equipment in your desired configuration 2. Examining your queuing configuration 3. Carefully considering which types of traffic share a queue with your routing protocols, and if that traffic can be blocked, rate-limited or forced to another queue 4. Understanding packet filtering capabilities of your equipment, and under what scenarios it is safe to deploy packet filters 5. 
When it is safe to do so, tactically deploy packet filters upstream from a router that is being attacked.
The first thing to consider with regard to attacks that attempt to consume all available bandwidth is to determine the actual throughput of the router. It is not safe to assume that a router with two or more 100 Gigabit Ethernet interfaces can receive 100G on one interface and transmit that same 100G out another interface. Some routers can forward at line rate, and some routers cannot. Performance may vary with packet size; for example, forwarding traffic in software becomes more taxing on the CPU as the number of packets increases.
The next thing to consider is outbound queuing on routers upstream from the router that is being attacked. Routers typically place routing protocol traffic in a separate Network Control (NC) queue. Determine the characteristics of this queue, such as the queue depth and the frequency with which it is serviced. These values may be tunable. Also consider what types of traffic are placed in this queue, and specifically what traffic an outside attacker can place in this queue. Consider preventing users of the network from being able to place traffic in this queue if they do not need to exchange routing information with your network. For direct customers running eBGP, limit traffic permitted into the NC queue to only the traffic required to support their routing protocols. Consider rate-limiting this traffic so no one customer can fill up the queue. Note that rate limits will increase convergence time, so test a customer configuration that is advertising and receiving the largest set of routes, and measure how long it takes to re-learn the routing table after the BGP session is reset with and without the rate limits.
If the attack traffic has a particular profile, and all traffic matching that profile can be dropped without impacting legitimate traffic, then a packet filter can be deployed upstream from the router that is under attack. Ensure that deploying a packet filter will not impact the performance of your router by testing packet filters with various types of attacks and packet sizes on your equipment in a lab environment. Ensure that total throughput is not decreased, and that there is not a particular packet-per-second count that causes the router to crash, become unresponsive, stop forwarding traffic reliably, cause routing protocols to time out, etc. If the upstream router belongs to a non-customer network, you will need to work with them to mitigate the attack. Additional bandwidth on the interconnect may allow you to move the bottleneck deeper into your network where you can deal with it. Often the IP destination of these attacks is something downstream from the router. It is possible that some or all of the attack traffic may be destined to the router itself. In that case some of the mitigation techniques in the next section may also be helpful.
5.1.2.2.2 Resource Exhaustion Attacks
Recommendations include (items 3 and 5 are illustrated in the sketch that follows this list):
1. Consider deploying GTSM
2. Consider making router interfaces reachable only from directly connected networks
3. Consider only permitting traffic sourced from configured neighbors
4. Consider deploying MD5
5. Deploy maximum-prefix limits
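As a rough illustration of items 3 and 5 above, the sketch below derives a per-neighbor interface filter and prefix limit from a list of configured sessions. The data model and the emitted pseudo-configuration lines are invented for this example and do not correspond to any particular vendor's CLI; real routers generate equivalent filters from their own configuration.

from dataclasses import dataclass

@dataclass
class BgpNeighbor:
    address: str        # neighbor's peering address
    max_prefixes: int   # prefix limit appropriate for this neighbor

def bgp_protection_rules(neighbors: list) -> list:
    # Permit TCP/179 only from configured neighbors and cap the routing state
    # each neighbor can create; everything else is dropped at the interface.
    rules = []
    for n in neighbors:
        rules.append(f"permit tcp host {n.address} any eq 179")
        rules.append(f"neighbor {n.address} maximum-prefix {n.max_prefixes}")
    rules.append("deny tcp any any eq 179")
    return rules

for line in bgp_protection_rules([BgpNeighbor("192.0.2.1", 500),
                                  BgpNeighbor("198.51.100.7", 150000)]):
    print(line)

The prefix-limit values here are arbitrary; the report's guidance on choosing thresholds appears in Section 5.2.1.4.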
The first set of recommendations is to consider deploying mechanisms that restrict who can send routing protocol traffic to a router. The second set restricts how much routing protocol state a neighbor can cause a router to hold.
GTSM, described above, can be an effective counter to some forms of DoS attacks against routers by limiting who can send routing protocol traffic to the router to a configured hop-count radius. GTSM works by preventing packets originating outside the direct connection between two BGP peers from being processed by the router under attack. GTSM cannot resolve simple buffer overflow problems, however, or DoS attacks that exploit weaknesses in packet processing prior to the TTL check.
Another mechanism currently used to prevent DoS attacks against routers is to simply make the interfaces on which the BGP session is running completely unreachable from outside the local network or the local segment. Using link-local addresses in IPv6 is one technique (with obviously limited applicability). Another approach is applying packet filters on the relevant address ranges at the network edge (this process is called infrastructure filtering). Some routers can dynamically generate packet filters from other portions of the router configuration. This enables one to create an interface packet filter that only allows traffic on the BGP ports from source IP addresses that belong to a configured neighbor. This means attempts to send packets to the BGP port from IP addresses that are not a configured neighbor will be dropped right at the interface.
The same session-level protections discussed earlier, such as MD5, can also limit who can send routing information to only those routers or hosts that have the appropriate key. As such this is also an effective mechanism to limit who can send routing protocol traffic. While these packets will be processed by the router, and could possibly tax the CPU, they cannot cause the router to create additional routing state, such as adding entries to the Routing Information Base (RIB) or Forwarding Information Base (FIB).
Lastly, each neighbor should be configured to limit the number of prefixes they can send to a reasonable value. A single neighbor accidentally or intentionally de-aggregating all of the address space they are permitted to send could consume a large amount of RIB and FIB memory, especially with the large IPv6 allocations.
5.1.2.2.3 Crafted Packet Attacks
Crafted packet attacks typically occur when a router receives some exception traffic that the vendor did not plan for. In some cases it may be possible to mitigate these attacks by filtering the attack traffic, if that traffic has a profile that can be matched on and all traffic matching that profile can be discarded. More often than not, this is not the case. In all cases the vendor should provide new code that deals with the exception.
Recommendations include keeping current with all SIRT advisories from your vendors. When a vulnerability is published, move quickly to upgrade vulnerable versions of the code. This may require an upgrade to a newer version of the code or a patch to an existing version. For larger organizations that have extensive and lengthy software certification programs, it is often more reasonable to ask the vendor to provide a patch for the specific version(s) of code that organization is running.
If possible, the vendor should indicate the extent to which the code is modified, to quantify how substantial the change is and help the provider plan what should be included in the abbreviated software certification tests.
For smaller organizations, or organizations that complete little or no software certification, the newer version of code with the fix in place should be deployed. Generally this is deployed cautiously at first, with a limited field trial to see if issues arise, followed by a more widespread deployment.
5.1.2.2.4 Internal Bandwidth Exhaustion Attacks
Other well-known and widely deployed DoS mitigation techniques can be used to protect routers from attack just as they can be used to protect other hosts. For instance, Control Plane Policing can prevent the routing process on a router from being overwhelmed with high levels of traffic by limiting the amount of traffic accepted by the router directed at the routing processor itself. One should consider not only the impact on the router CPU, but also the impact on the bandwidth between the forwarding components and the router CPU. There may be some internal queuing in place on the interconnect between the forwarding components and the router CPU. It may be possible to influence which queue routing protocol traffic is placed in, which queue traffic generated by customers is placed in, and the depth and servicing of these queues, in order to separate routing traffic and minimize the ability of non-routing traffic to impact it. If traffic generated by customers (for routing protocols or otherwise) can crowd out a network's internal routing protocol traffic, then operators may consider separately rate-limiting this customer traffic.
5.1.3 Source-address filtering
Many Internet security exploits hinge on the ability of an attacker to send packets with spoofed source IP addresses. Masquerading in this way can give the attacker entry to unauthorized access at the device or application level, and some BGP vulnerabilities are also in this category. The problem of source-spoofing has long been recognized, and countermeasures are available for filtering at the interface level.
5.1.3.1 Source-address spoofing example
Consider the diagram below, which illustrates the legitimate bi-directional traffic flow between two hosts on the left-hand side. An attacker connected to another network can send IP packets with the source address field set to the address of one of the other machines, unless filtering is applied at the point where that attacker's host or network attaches to AS65002.
5.1.3.2 Source-address spoofing attacks
Though most IP transactions are bi-directional, attacks utilizing spoofed source IPs do not require bi-directional communication but instead exploit particular protocol or programmatic semantic weaknesses. Exploits using this technique have covered many areas over the years, including the following types, each followed by some examples:
• Attacks against services which rely only on the source IP of the incoming packet for authorization
  • rsh, rlogin, NFS, Xwindows, etc.
• Attacks where the unreachability of the source can be exploited
  • TCP SYN floods which exhaust resources on the server
• Attacks where the attacker masquerades as the "victim"
  • Small DNS or SNMP requests resulting in a highly asymmetric data flow back toward the victim
  • Abusive traffic which results in the legitimate user getting blocked from the server or network
5.1.3.3 Source-address filtering challenges
The barriers to implementing these countermeasures have ranged from lack of vendor support to lack of solid motivation to implement them.
• Lack of proper vendor support: In older implementations of network devices, filtering based on the source address of a packet was performed in software, rather than hardware, and thus had a major impact on the rate of forwarding through the device. Most modern network equipment can perform source filtering in the hardware switching path, eliminating this barrier to deployment.
• Lack of scalable deployment and configuration management: In older deployments, filters based on the source of traffic were configured and managed manually, adding a large expense to the entity running the network. This barrier has largely been resolved through remotely triggered black hole filtering, unicast Reverse Path Forwarding (uRPF), and loose uRPF options.
• Fear of interrupting legitimate traffic, for example in multi-homed situations: Vendors have created flexibility in uRPF filtering to reduce or eliminate this barrier. Future possible additions include "white lists," which would allow traffic to pass through a uRPF check even though it didn't meet the rules.
• Lack of business motivation: Unilateral application of these features does not benefit or protect a network or its customers directly; rather, it contributes to the overall security of the Internet. The objection to incurring this "cost" is increasingly being overcome by the recognition that if everyone performs this type of filtering, then everyone benefits.
5.1.3.4 Source-address filtering recommendations
Filtering should be applied as close as possible to the entry point of traffic. Wherever one host, network, or subnet is attached, a feature such as packet filtering, uRPF, or source-address validation should be used. Ensure adequate support from equipment vendors for subscriber-management systems (e.g., for cable and DSL providers) and for data-center or campus routers and switches. Stub networks should also filter traffic at their borders to ensure that IP ranges assigned to them do not appear in the source field of incoming packets and that only those ranges appear in the source field of outgoing packets. Transit networks should likewise use features such as uRPF. Strict mode should be used at a border with a topological stub network and loose mode between transit networks. Transit networks that provide connectivity primarily to stub networks, such as consumer ISPs, should consider uRPF strict mode on interfaces facing their customers. If these providers supply a home router to their customers, they should consider making uRPF part of the default home router configuration. Transit networks that provide connectivity to a mix of stub networks and multi-homed networks must consider the administrative burden of configuring uRPF strict mode only on stub customers, and uRPF loose mode or no uRPF on customers that are, or become, multi-homed.
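The strict-mode check recommended above can be sketched in a few lines: a packet is accepted only if the best route back toward its source address points out the interface the packet arrived on. The toy forwarding table and interface names below are invented for illustration.

import ipaddress

# Toy FIB: prefix -> interface the router would use to reach that prefix.
FIB = {
    ipaddress.ip_network("192.0.2.0/24"):    "customer-1",
    ipaddress.ip_network("198.51.100.0/24"): "customer-2",
    ipaddress.ip_network("0.0.0.0/0"):       "transit-uplink",
}

def egress_interface(addr: str) -> str:
    ip = ipaddress.ip_address(addr)
    matches = [prefix for prefix in FIB if ip in prefix]
    return FIB[max(matches, key=lambda prefix: prefix.prefixlen)]  # longest-prefix match

def urpf_strict_accept(src_addr: str, arrival_interface: str) -> bool:
    # Strict uRPF: the reverse path to the source must use the arrival interface.
    return egress_interface(src_addr) == arrival_interface

print(urpf_strict_accept("192.0.2.55", "customer-1"))  # True: source is routed via this customer port
print(urpf_strict_accept("192.0.2.55", "customer-2"))  # False: spoofed source, dropped at the edge

Loose mode, discussed next, relaxes the test to require only that some route to the source exist, which is why its main value is catching unrouted (dark or RFC 1918) sources.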
When using uRPF loose mode in the presence of a default route, one must take special care to consider the configuration options that include or exclude the default route. The value of loose-mode uRPF on networks in the default-free zone is debatable. It will only prevent traffic with a source address in RFC 1918 space or dark IPs (IP addresses that are not routed on the Internet). Often these dark IP addresses are useful for backscatter techniques and for tracing the source(s) of a spoofed DoS attack. It is also important to consider whether RFC 1918 addresses are used internally in the transit provider's network. This practice may become more common if ISPs implement Carrier Grade NAT. It is also worth pointing out that some business customers depend on VPN software that is poorly implemented and only changes the destination IP address when re-encapsulating a packet. If these customers are using non-routed IP addresses in their internal network, then enabling uRPF will break these customers.
It is important to measure the impact on forwarding when enabling uRPF. Even when uRPF is implemented in hardware, the router must look up the destination as well as the source. A double lookup will cause forwarding throughput to be reduced by half. This may have no impact on the forwarding rate if the throughput of the forwarding hardware is more than twice the rate of all the interfaces it supports.
Further, more detailed advice and treatment of this subject can be found in:
• IETF BCP 38/RFC 2827, Network Ingress Filtering (http://tools.ietf.org/html/bcp38)
• BCP 84/RFC 3704, Ingress Filtering for Multihomed Networks (http://tools.ietf.org/html/bcp84)
• ICANN SAC004, Securing the Edge
5.2 BGP Injection and Propagation Vulnerability
A second form of attack against the routing information provided by BGP4 is through injection of misleading routing information, as shown in Figure 2.
Figure 2: A Prefix Hijacking Attack
In this network, AS65000 has authority to originate 192.0.2.0/24. Originating a route, in this context, means that computers having addresses within the advertised address space are actually reachable within your network; a computer with the address 192.0.2.1, for instance, is physically attached to your network. Assume AS65100 would like to attract traffic normally destined to a computer within the 192.0.2.0/24 address range. Why would AS65100 want to do this? There are a number of possible motivations, including:
• A server with an address in this range accepts logins from customers or users, such as a financial web site, or a site that hosts other sensitive information or information of value
• A server with an address in this range processes information crucial to the operation of a business the owner of AS65100 would like to damage in some way, such as a competitor, or a political entity under attack
AS65100, the attacker, can easily attract packets normally destined to 192.0.2.0/24 within AS65000 by simply advertising a competing route for 192.0.2.0/24 to AS65002. From within BGP itself, there is no way for the operators in AS65002 to know which of these two advertisements is correct (or whether both origins are valid, a configuration which does see occasional legitimate use).
The impact of the bogus information may be limited to the directly neighboring AS(es) depending on the routing policy of the nearby ASes. The likelihood of the incorrect route being chosen can be improved by two attributes of the route: • A shorter AS Path A shorter AS Path has the semantic value of indicating a topologically “closer” network. In the example above, the normal propagation of the route would show AS65100 as “closer” to AS65001 thus, other factors being equal, more preferred than the legitimate path via AS65000. • A longer prefix Longer prefixes represent more-specific routing information, so a longer prefix is always preferred over a shorter one. For instance, in this case the attacker might advertise 192.0.2.0/25, rather than 192.0.2.0/24, to make the false route to the destination appear more desirable than the real one. • A higher local-preference setting Local-preference is the non-transitive BGP attribute that most network operators use to administratively influence their local routing. Typically, routes learned from a “customer” (i.e., paying) network are preferred over those where the neighboring network has a non-transit relationship or where the operator is paying for transit from the neighboring network. This attribute is more important in the decision algorithm for BGP than AS-path length so routes learned over such a session can draw traffic even without manipulation of the AS-path attribute. This illustration can be used to help describe some related types of risks:The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [22] of [40] Figure 3: BGP Propagation Vulnerabilities Route Leak: In this case, AS65100 is a transit customer of both AS65000 and AS65002. The operator of AS65100 accidentally leaks routing information advertised by AS65000 into its peering session with AS65002. This could possibly draw traffic passing from AS65002 towards a destination reachable through AS65000 through AS65100 when this path was not intended to provide transit between these two networks. Most often such happenings are the result of misconfigurations and can result in overloading the links between AS65100 and the other ASes. As the name suggests, this phenomenon has most often been the result of inadvertent misconfiguration. Occasionally they can result in more malicious outcomes: • Man in the Middle: In this case, all the autonomous systems shown have non-transit relationships. For policy reasons, AS65000 would prefer traffic destined to 192.0.2.0/24 pass through AS65100. To enforce this policy, AS65000 filters the route for 192.0.2.0/24 towards AS65100. In order to redirect traffic through itself (for instance, in order to snoop on the traffic stream), AS65100 generates a route advertisement that makes it appear as if AS65000 has actually advertised 192.0.2.0/24, advertising this route to AS65002, and thus drawing traffic destined to 192.0.2.0/24 through itself. • False Origination: This attack is similar to the man in the middle explained above, however in this case there is no link between AS65000 and AS65100. Any traffic destined to 192.0.2.0/24 into AS65100, is discarded rather than being delivered. Note that all three of these vulnerabilities are variations on a single theme: routing information that should not be propagated based on compliance with some specific policy nonetheless is. 
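The pull of a more-specific announcement described above is easy to demonstrate with a longest-prefix-match lookup; the prefixes below mirror the figure's example, and the origin labels are illustrative.

import ipaddress

# Competing routes for overlapping destination space.
routes = {
    ipaddress.ip_network("192.0.2.0/24"): "via AS65000 (legitimate origin)",
    ipaddress.ip_network("192.0.2.0/25"): "via AS65100 (hijacker's more-specific)",
}

def best_route(dest: str) -> str:
    ip = ipaddress.ip_address(dest)
    matching = [prefix for prefix in routes if ip in prefix]
    return routes[max(matching, key=lambda prefix: prefix.prefixlen)]  # longest prefix always wins

print(best_route("192.0.2.1"))    # lower half of the /24 follows the /25 toward AS65100
print(best_route("192.0.2.200"))  # addresses outside the /25 still follow the legitimate /24

Because the /25 covers only half of the /24, an attacker wanting the entire range would have to announce both halves; either way, the more-specific routes draw the traffic regardless of AS-path length or local preference.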
5.2.1 BGP Injection and Propagation Countermeasures 5.2.1.1 Prefix Filtering The key bulwark against entry and propagation of illegitimate routing announcements into the global routing system is prefix-level filtering; typically at the edge between the ISP and their customers. The usual method involves the customer communicating a list of prefixes and downstream ASes which they expect to be reachable through the connection to the ISP. The ISP will then craft a filter applied to the BGP session which explicitly enumerates this list of expected prefixes (a “prefix-list”), perhaps allowing for announcement of some more-specific The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [23] of [40] prefixes within the ranges such as might be needed by the customer to achieve some goals in adjusting the load of the customer’s inbound traffic across their various connections. This configuration forms a “white-list”, in security parlance, of possible downstream destinations but does not validate the overall semantic correctness of the resulting routing table. 5.2.1.1.1 Manual Prefix-Filter Limitations The validity of the information in this list is obviously important and if the customer is either malicious or simply mistaken in the prefixes they communicate, the prefix filter could obviously still leave open a vulnerability to bogus route injection. Thus typically the information communicated by the customer is checked against registration records such as offered in the “whois” information available from Regional Internet Registries (RIRs) and/or others in the address assignment hierarchy. However, there is no information in the RIR information explicitly indicating a mapping between the address-assignment and the origin AS. The reality is that the process of manually checking a filter that is a few thousand lines long, with hundreds of changes a week is tedious and time consuming. Many transit providers do not check at all. Others have a policy to always check, but support staff may grow complacent with updates from certain customers that have long filter lists or have filter lists that change weekly, especially if those customers have never had a questionable prefix update in the past. In these cases they may only spot-check, only check the first few changes and then give up, or grow fatigued and be less diligent the more lines they check. Moreover, the entity name fields in the “whois” information are free-form and often can’t be reliably matched to the entity name used by the ISP’s customer records. Typically it is a fuzzy match between the customer name on record and the company name listed in the “whois” record. A truly malicious actor could order service with a name which is intentionally similar to a company name whose IP addresses they intend to use. Another possibility is that the company name on record is legitimate and an exact match to the company name on the “whois” record, but the customer is a branch office, and the legitimate holder of the IP addresses is the corporate office which has not authorized the branch office to use their space. It is also possible that the customer is the legitimate holder of the address space, but the individual who called in to the provider support team is not authorized to change the routing of the IP block in question. This problem is further complicated when a transit provider’s customer has one or more downstream customers of its own. 
These relationships are typically hard or impossible to verify. If every transit provider accurately filtered all of the prefixes their customers advertised, and each network that a transit provider peers with could be trusted to also accurately filter all of the prefixes of their customers, then route origination and propagation problems could be virtually eliminated. However, managing filters requires thousands of operators examining, devising, and adjusting the filters on millions of devices throughout the Internet. While there are processes and tools within any given network, such highly inconsistent processes, particularly when handling large amounts of data (a tedious process in and of itself), tend to produce an undesirable rate of errors. Each time an individual operator misjudges a particular piece of information, or simply makes a mistake in building a filter, the result is a set of servers (or services) that are unreachable until the mistake is found and corrected.
5.2.1.2 Internet Routing Registry (IRR)
The second source of information a provider can use as a basis for filtering received routing information is a voluntary set of databases of routing policy and origination called Internet Routing Registries (IRRs). These IRRs allow providers to register information about their policies towards customers and other providers, and also allow network operators to register which address space they intend to originate. Some providers require their customers to register their address space in an IRR before accepting the customer's routes; oftentimes the provider will "proxy register" information on the customer's behalf, since most customers are not versed in IRR details.
5.2.1.2.1 IRR Limitations
Because IRRs are voluntary, there is some question about the accuracy and timeliness of the information they contain (see Research on Routing Consistency in the Internet Routing Registry by Nagahashi and Esaki for a mostly negative view, and How Complete and Accurate is the Internet Routing Registry by Khan for a more positive view). Anecdotally, RIPE's IRR is in widespread use today, and some large providers actually build their filtering off this database, so the accuracy level is at least operationally acceptable for some number of operators. Some IRR repositories use an authorization model as well as authentication, but none that primarily serve North America perform RPSL authorization using the scheme described in RFC 2725 – Routing Policy System Security (https://tools.ietf.org/html/rfc2725).
5.2.1.3 AS-Path filtering
Filters on the AS_PATH contents of incoming BGP announcements can also be part of a defensive strategy to guard against improper propagation of routing information. Some ISPs have used AS-path filters on customer-facing BGP sessions instead of prefix filters. This approach is generally inadequate to protect against even the most naïve misconfigurations, much less a deliberate manipulation. Often a leak has involved inadvertently redistributing BGP routes from one of a stub network's ISPs to the other. Another problem in the past has involved redistributing BGP routes into an internal routing protocol and back into BGP. Where AS-path filters can be useful is to guard against an egregious leak. For instance, an ISP would not expect ASNs belonging to known large ISPs to show up in the AS_PATH of updates from an enterprise-type customer network.
Applying an AS-path filter to such a BGP session could act as a second line of defense to the specific prefix-list filter. Similarly, if there are networks which the ISP has non-transit relationships with, applying a similar AS-path filter to those sessions (which wouldn’t be candidates for prefix filters) could help guard against a leak resulting in an unintended transit path. 5.2.1.3.1 AS-Path filtering limitations Maintaining such a list of “known” networks which aren’t expected to show up in transit adjacencies can be fairly manual, incomplete and error-prone. Again, applying a filter which validates the neighbor AS is in the path is useless since this state is the norm of what’s expected. 5.2.1.4 Maximum-prefix cut-off threshold Many router feature-sets include the ability to limit the number of prefixes that are accepted from a neighbor via BGP advertisements. When the overall limit is exceeded, the BGP session is torn down on the presumption that this situation is a dangerous error condition. Typically also a threshold can be set at which a warning notification (e.g. log message) to the Operations staff 3 https://tools.ietf.org/html/rfc2725The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [25] of [40] is issued. This way a gradual increase in the number of advertisements will trigger a sensible manual raise in the cut-off threshold without causing an outage. This tool can be used to guard against the most egregious leaks which can, if the numbers are large enough, exhaust the routing table memory on the recipient’s routers and/or otherwise cause widespread network instability. Typical deployments will set the threshold based on the current observed number of advertisements within different bands; for instance, 1-100, 100-1000, 1000-5000, 5000-10000, 10000-50000, 50,000-100,000, 100,000-150,000, 150,000-200,000, 200,000-250,000, 250,000- 300,000. 5.2.1.4.1 Maximum-prefix limitations When the threshold is exceeded, the session is shut down and manual intervention is required to bring it back up. In the case where a network has multiple interconnection points to another network (thus multiple BGP neighbors), all sessions will typically go down at the same time assuming all are announcing the same number of prefixes. In this case, it may be the case that all connectivity between the two networks is lost during this period. Obviously this measure is an attempt to balance two different un-desirable outcomes so must be weighed judiciously. Above 10000 or perhaps 50000 (e.g., a full Internet routing table from a transit provider), applying maximum-prefix thresholds provide limited protection. A small number of neighbors each advertising a unique set of 300,000 routes would fill the memory of the receiving router anyway. However if these neighbors are all advertising a large portion of the Internet routes, with many routes overlapping, then the limit offers some protection. 5.2.1.5 Monitoring Aside from a proactive filtering approach, a network operator can use various vantage points external to their own network (e.g, “route servers” or “looking glasses”) to monitor the prefixes for which they have authority to monitor for competing announcements which may have entered the BGP system. Some tools such as BGPmon have been devised to automate such monitoring. 
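A minimal version of such monitoring boils down to comparing announcements seen at external vantage points against the operator's own expectations. In the sketch below, the observed feed and the expected-origin table are assumptions for illustration; real deployments consume data from route collectors, looking glasses, or services such as BGPmon.

import ipaddress

# (prefix, origin ASN) pairs as they might be reported by an external vantage point.
observed = [
    ("192.0.2.0/24",   65000),
    ("192.0.2.0/25",   65100),  # more-specific of our space from an unexpected origin
    ("203.0.113.0/24", 64500),  # someone else's space; not ours to monitor
]

# Prefixes we are authoritative for, and the origin ASN expected to announce them.
expected_origin = {"192.0.2.0/24": 65000}

def alerts(observed, expected_origin):
    ours = {ipaddress.ip_network(p): asn for p, asn in expected_origin.items()}
    for prefix, origin in observed:
        net = ipaddress.ip_network(prefix)
        for our_net, our_asn in ours.items():
            if net == our_net and origin != our_asn:
                yield f"origin mismatch for {prefix}: saw AS{origin}, expected AS{our_asn}"
            elif net != our_net and net.subnet_of(our_net):
                yield f"unexpected more-specific {prefix} originated by AS{origin}"

for alarm in alerts(observed, expected_origin):
    print(alarm)  # flags the /25 announced by AS65100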
5.2.1.5.1 Monitoring Limitations Obviously, this approach is reactive rather than proactive and steps would then need to be taken to contact the offending AS and/or intermediate AS(es) to stop the advertisement and/or propagation of the misinformation. Also, the number of such vantage points is limited so a locally impacting bogus route may or may not be detected with this method. 5.2.2 BGP Injection and Propagation Recommendations The most common router software implementations of BGP do not perform filtering of route advertisements, either inbound or outbound, by default. While this situation eases the burden of configuration on network operators (the customers of the router vendors), it has also caused the majority of unintentional inter-domain routing problems to date. Thus it is recommended that network operators of all sizes take extra care in configuration of BGP sessions to keep unintentional routes from being injected and propagated. Stub network operators should configure their outbound sessions to only explicitly allow the The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [26] of [40] prefixes which they expect to be advertising over a particular session. ISPs should explicitly filter their inbound sessions at the boundary with their “customer edge”. The inter-provider connections between large ISPs are impractical locations for filtering given the requirement for significant dynamism in BGP routing and traffic-engineering across the global Internet. However, the cumulative gains accrued when each ISP filters at this “customer edge” are significant enough to lessen the residual risk of not filtering on these “non-customer” BGP sessions. ISPs (and even stub networks) should also consider using AS-path filters and maximum-prefix limits on sessions as a second line of defense to guard against leaks or other pathological conditions. 5.3 Other Attacks and Vulnerabilities of Routing Infrastructure There are many vulnerabilities and attack vectors that can be used to disrupt the routing infrastructure of an ISP outside of the BGP protocol and routing-specific operations. These are just as important to address as issues the working group has identified within the routing space itself. The largest attack surface for routing infrastructure likely lies within the standard operational security paradigm that applies to any critical networked asset. Therefore the working group looked at including BCPs relating to network and operational security as part of addressing these issues, and ISPs should be aware that they are likely to see attacks against their routing infrastructure based on these “traditional” methods of computer and network intrusion. 5.3.1 Hacking and unauthorized 3rd party access to routing infrastructure ISPs and all organizations with an Internet presence face the ever-present risk of hacking and other unauthorized access attempts on their infrastructure from various actors, both on and off network. This was already identified as a key risk for ISPs, and CSRIC 2A – Cyber Security Best Practices was published in March 2011 to provide advice to address these types of attacks and other risks for any ISP infrastructure elements, including routing infrastructure. 
The current CSRIC III has added a new Working Group 11 that will report out an update to prior CSRIC work in light of recent advancements in cybersecurity practices and a desire of several US government agencies to adopt consensus guidelines to protect government and critical infrastructure computers and networks. A recent SANS publication, Twenty Critical Security Controls for Effective Cyber Defense: Consensus Audit Guidelines (CAG)4 lays out these principals and maps them out versus prior work, including another relevant document, NIST SP-800-53 Recommended Security Controls for Federal Information Systems and Organizations. 5 The SANS publication appears to be a primary driver for Working Group 11’s work. The entire document is available for review, and we have included the 20 topic areas here for reference: 4 http://www.sans.org/critical-security-controls/ 5 http://csrc.nist.gov/publications/PubsSPs.htmlThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [27] of [40] Critical Control 1: Inventory of Authorized and Unauthorized Devices Critical Control 2: Inventory of Authorized and Unauthorized Software Critical Control 3: Secure Configurations for Hardware and Software on Laptops, Workstations, and Servers Critical Control 4: Continuous Vulnerability Assessment and Remediation Critical Control 5: Malware Defenses Critical Control 6: Application Software Security Critical Control 7: Wireless Device Control Critical Control 8: Data Recovery Capability Critical Control 9: Security Skills Assessment and Appropriate Training to Fill Gaps Critical Control 10: Secure Configurations for Network Devices such as Firewalls, Routers, and Switches Critical Control 11: Limitation and Control of Network Ports, Protocols, and Services Critical Control 12: Controlled Use of Administrative Privileges Critical Control 13: Boundary Defense Critical Control 14: Maintenance, Monitoring, and Analysis of Audit Logs Critical Control 15: Controlled Access Based on the Need to Know Critical Control 16: Account Monitoring and Control Critical Control 17: Data Loss Prevention Critical Control 18: Incident Response Capability Critical Control 19: Secure Network Engineering Critical Control 20: Penetration Tests and Red Team Exercises Because this work is being analyzed directly by Working Group 11 to address the generic risk to ISPs of various hacking and unauthorized access issues, Working Group 4 will not be commenting in-depth in this area, and refers readers to reports from Working Group 11 for comprehensive, and updated coverage of these risks when they issue their report. We will comment upon current BCPs for ISPs to look to adopt in the interim, and provide further background around risks unique to running BGP servers/routers in this area. An ISP’s routing infrastructure is an important asset to protect, as gaining control of it can lead to a wide variety of harms to ISP customers. Further, an ISP’s staff computers, servers, and networking infrastructure also rely upon their own routers to correctly direct traffic to its intended destinations. The ISP’s own sensitive data and processes could be compromised via hacked routers/servers. Thus routers should be included on the list of network assets that are assigned the highest level of priority for protection under any type of ISP security program. 
There are many industry standard publications pertaining to overall cybersecurity best practices available for adoption by ISPs or any organization at risk of attack, including prior CSRIC reports. It is incumbent upon ISPs to maintain their overall security posture and be up-to-date on the latest industry BCPs and adopt the practices applicable to their organization. Of particular note is the IETF’s RFC 4778 - Current Operational Security Practices in Internet Service Provider Environments6 which offers a comprehensive survey of ISP security practices. An older IETF publication, but still active BCP, that still applies to ISP environments can be found with BCP 46, aka RFC 3013 Recommended Internet Service Provider Security Services and Procedures7 . NIST also puts out highly applicable advice and BCPs for running 6 http://www.ietf.org/rfc/rfc4778.txt 7 http://www.apps.ietf.org/rfc/rfc3013.txtThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [28] of [40] government networks, with the most currently relevant special report, NIST SP-800-53. The ultimate goal of someone attempting unauthorized access to routing infrastructure would be to either deny customer use of those servers or, more likely, insert false entries within the router to misdirect the users of those routers. This is a functional equivalent to route injection and propagation attacks as already described in section 5.2. So the analysis and recommendations presented in section 5.2.1.5 with respect to monitoring for and reacting to route injection and propagation attacks apply in the scenario where an attacker has breached a router to add incorrect entries. 5.3.1.1 Recommendations 1) ISPs should refer to and implement the practices found in CSRIC 2A – Cyber Security Best Practices that apply to securing servers and ensure that routing infrastructure is protected. 2) ISPs should adopt applicable BCPs found in other relevant network security industry approved/adopted publications. Monitor for applicable documents and update. Three documents were identified that currently apply to protecting ISP networks: IETF RFC 4778 and BCP 46 (RFC 3013); NIST special publications series: NIST SP-800-53 3) ISPs should ensure that methods exist within the ISP’s operations to respond to detected or reported successful route injection and propagation attacks, so that such entries can be rapidly remediated. 4) ISPs should consider implementing routing-specific monitoring regimes to assess the integrity of data being reported by the ISP’s routers that meet the particular operational and infrastructure environments of the ISP. 5.3.2 ISP insiders inserting false entries into routers While insider threats can be considered a subset of the more general security threat of unauthorized access and hacking, they deserve special attention in the realm of routing security. ISP insiders have unparalleled access to any systems run by an ISP, and in the case of routers, the ability to modify entries is both trivially easy and potentially difficult to detect. Since routers don’t typically have company-sensitive information, are accessed by thousands of machines continuously, and are not usually hardened or monitored like other critical servers, it is relatively easy for an insider to alter a router’s configuration in a way that adversely affects routing. 
The analysis and recommendations for this particular threat do not differ significantly from those presented in Section 5.3.1 of this report - Hacking and unauthorized 3rd party access to routing infrastructure. However, it is worth paying special attention to this particular exposure given the liabilities an ISP may be exposed to from such difficult-to-detect activities of its own employees.

5.3.2.1 Recommendations
1) Refer to section 5.3.1.1 for generic hacking threats.

5.3.3 Denial-of-Service Attacks against ISP Infrastructure
Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) attacks are some of the oldest and most prolific attacks that ISPs have faced over the years and continue to defend against today. Typically, an external actor who is targeting some Internet presence or infrastructure to make it unusable is behind such attacks. However, DoS/DDoS attacks come in many flavors that can be broadly lumped into two primary categories: logic attacks and resource exhaustion/flooding attacks.8 Logic attacks exploit vulnerabilities to cause a server or service to crash or reduce performance below usable thresholds. Resource exhaustion or flooding attacks cause server or network resources to be consumed to the point where the targeted service no longer responds, or service is reduced to the point that it is operationally unacceptable. We will examine the latter, resource exhaustion, type of attack in this section of the analysis. Logic attacks are largely directed at breaking services/servers and can be largely addressed with the analysis and recommendations described above with respect to BGP-specific issues, and also those put forward in section 5.3.1 that cover protecting networked assets from various hacking and other attacks.

There is a large variety of flooding attacks that an ISP could face in daily operations. These can be targeted at networks or any server, machine, router, or even user of an ISP’s network. From the perspective of routing operations, it is helpful to differentiate between “generic” DoS attacks that could affect any server, and those that exploit some characteristic of BGP that can be utilized to affect routers in particular, which have already been covered. Due to the long history, huge potential impact, and widespread use of various DoS and DDoS attacks, there is an abundance of materials, services, techniques and BCPs available for dealing with these attacks. ISPs will likely have some practices in place for dealing with attacks both originating from their networks and being directed at their networks and impacting their services. The IETF’s RFC 4732 Internet Denial-of-Service Considerations9 provides an ISP with a thorough overview of DoS/DDoS attacks and mitigation strategies and is a solid foundational document. The SANS Institute has published another useful reference document of BCPs against DoS/DDoS attacks entitled A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider Environment10. As mentioned in section 5.3.1, there are several documents that cover general ISP security concerns, and those typically include prescriptive advice for protecting a network against DoS/DDoS attacks.
Such advice can be found in previously cited documents including prior CSRIC reports: CSRIC 2A – Cyber Security Best Practices11, the IETF’s RFC 4778 - Current Operational Security Practices in Internet Service Provider Environments12, BCP 46, RFC 3013 Recommended Internet Service Provider Security Services and Procedures13 and NIST’s special report, NIST SP-800-53. For the most part, an ISP’s routers for interdomain routing must be publicly available in order for the networks they serve to be reachable across the Internet. Thus measures to restrict access 8 http://static.usenix.org/publications/library/proceedings/sec01/moore/moore.pdf 9 http://tools.ietf.org/rfc/rfc4732.txt 10 http://www.sans.org/reading_room/whitepapers/intrusion/summary-dos-ddos-preventionmonitoring-mitigation-techniques-service-provider-enviro_1212 11 http://www.fcc.gov/pshs/docs/csric/WG2A-Cyber-Security-Best-Practices-Final-Report.pdf 12 http://www.ietf.org/rfc/rfc4778.txt 13 http://www.apps.ietf.org/rfc/rfc3013.txtThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [30] of [40] that can be implemented for an ISP’s internal infrastructure are unavailable as options for these connecting routers. This leaves an ISP with limited choices for DDoS protection, including the traditional approaches of overprovisioning of equipment and bandwidth and various DoS/DDoS protection services and techniques. 5.3.3.1 Recommendations 1) ISPs should implement BCPs and recommendations for securing an ISP’s infrastructure against DoS/DDoS attacks that are enumerated in the IETF’s RFC 4732 Internet Denial-of-Service Considerations and consider implementing BCPs enumerated in the SANS Institute reference document of BCPs against DoS/DDoS attacks entitled A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider Environment. 2) ISPs should refer to and implement the BCPs related to DoS/DDoS protection found in CSRIC 2A – Cyber Security Best Practices that apply to protecting servers from DoS/DDoS attacks. 3) ISPs should consider adopting BCPs found in other relevant network security industry approved/adopted publications that pertain to DoS/DDoS issues, and monitor for applicable documents and updates. Four that currently apply to protecting ISP networks from DoS/DDoS threats are IETF RFC 4778 and BCP 46 (RFC 3013); NIST special publications series: NIST SP-800-53; and ISOC Publication Towards Improving DNS Security, Stability, and Resiliency. 4) ISPs should review and apply BCPs for protecting network assets against DoS/DDoS attacks carefully to ensure they are appropriate to protect routing infrastructure. 5.3.4 Attacks against administrative controls of routing identifiers Blocks of IP space and Autonomous Systems Numbers (ASNs) are allocated by various registries around the world. Each of these Regional Internet Registries (RIR’s) is provided IP space and ASN allocation blocks by IANA, to manage under their own rules and practices. Inturn, several of these registries allow for country or other region/use specific registries to suballocate IP space based on their own rules, processes and systems. Each RIR maintains a centralized “whois” database that designates the “owner” of IP spaces or ASN’s within their remit. Access to the databases that control these designations, and thus “rights” to use a particular space or ASN is provided and managed by the RIR’s and sub RIR registries depending upon the region. 
Processes for authentication and management of these identifier resources are not standardized and, until recently, were relatively unsophisticated and insecure. This presents an administrative attack vector allowing a miscreant to use a variety of account attack methods, from hacking to password guessing to social engineering and more, that could allow them to assume control over an ASN or IP space allocation. In other fields, such attacks would be considered “hijacking” or “account take-over” attacks, but the use of the word “hijacking” in the BGP space to include various injection and origin announcements complicates the common taxonomy. Thus for this section, we will refer to account “hijacking” as “account take-over”.

The primary concern for most ASN and prefix block owners and the ISPs that service them in such scenarios is the take-over of active space they are using. A miscreant could literally “take over” IP space being routed and used by the victim, much like an origin attack as described in section 5.2. In this case, it would be equivalent to a full take-over, with the majority of the global routing system recognizing the miscreant’s announcement as being the new “legitimate” one, with all the inherent risks previously described. The real owner will have to prove their legitimacy and actual legal ownership/control of the resource that has been taken over. Depending upon the authentication scheme the registry uses, this can prove difficult, especially for legacy space and older ASN registrations. A corollary of this attack scenario is a miscreant taking over “dormant” IP space or an unused ASN, and thus “squatting” in unused territory14. While not impactful on existing Internet presence, squatting on IP space can lead to many forms of abuse, including the announcement of bogus peering arrangements if the squatted resource is an ASN.

In a take-over scenario, a miscreant typically impersonates or compromises the registrant of the ASN and/or IP space in order to gain access to the management account for that ASN or CIDR block. Until recently, nearly all RIRs and registries used an e-mail authentication scheme to manage registrant change requests. Thus, if the registrant’s e-mail address uses an available domain name, the miscreant can register the domain name, recreate the administration email address, and authenticate himself with the registry. If the domain isn’t available, the criminal could still try to hijack the domain name registration account to gain control of that same domain. If a registry or RIR requires more verification for registrant account management, the criminal can use various social engineering tricks against the registry staff to get into the management account. Once a criminal has control of the registration account, they can update the information there to allow them to move to a new peering ISP, create new announcements from their “new” space, or launch any sort of BGP-type attack as listed above. Even more basically, the criminal can simply utilize their new control of the ASN/prefix to have their own abusive infrastructure announced on the Internet for whatever purpose they would like. This includes direct abuse against the Internet in general (e.g. hosting malware controllers, phishing, on-line scams, etc.), but also the ability to impersonate the original holder of the space they have taken over.
Of course they can also intercept traffic originally destined for the legitimate holder of the space as previously described for various route-hijacking scenarios. The end result of an administrative account take-over is likely to be similar to other injection attacks against routing infrastructure as covered in section 5.2. Thus, ISPs will want to consult BCPs covering techniques for monitoring and reacting to those types of attacks. These BCPs cover the general effects of a BGP origin attacks – dealing with service interruptions and the worldwide impacts. ISPs typically do not have direct access or control to RIR or other registry account information that has been compromised in most hijacking attacks. The ISP is dependent upon the affected registry to restore control of the ISP’s management account, or in the case of a serious breach, the registrar/registry’s own services. Once control is re-established, the original, correct information needs to be re-entered and published again. This will usually mean updating 14 For a full description of the taxonomy of hijacking, squatting, and spoofing attacks in routing space, see Internet Address Hijacking, Spoofing and Squatting Attacks - http://securityskeptic.typepad.com/the-security-skeptic/2011/06/internet-address-hijackingspoofing-and-squatting-attacks.htmlThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [32] of [40] routing table entries/BGP announcements, and fixing any account information that has been modified. The industry has largely been slow to adopt security measures to protect account access for controlling ASN and CIDR block management that are found in other online services like financial services, e-commerce, or even some ISP management systems. The industry also has many participants with a wide variety of geographical regions, with few standards and requirements for the security of registration systems, and very limited oversight. This means it is often difficult to find support for typical online security tools like multi-factor authentication, multi-channel authentication, and verification of high-value transactions. This has been changing in recent years at the RIR level, with ARIN, RIPE, and several other IP registries at various levels of authority implementing new controls and auditing account information. Despite this, gaps exist, especially with “legacy” data entered many years ago before current management systems and authentication processes were implemented. While there is scant guidance on this topic area for ASN/IP block management, ICANN’s Security and Stability Advisory Committee (SSAC) has released two documents to address these issues in the domain name space which is quite analogous to the provisioning of IP space. These documents provide BCPs for avoiding and mitigating many of these issues. SAC 40 Measures to Protect Domain Registration Services Against Exploitation or Misuse15, addresses issues faced by domain name registrars and offers numerous BCPs and recommendations for securing a registrar against the techniques being used by domain name hijackers. Many of the BCP’s presented there would be applicable to RIR’s and other IP address provisioning authorities, including ISPs managing their own customers. SSAC 44, A Registrant's Guide to Protecting Domain Name Registration Accounts16, provides advice to domain name registrants to put in place to better protect their domains from hijacking. 
Similar techniques could be used by operators to protect their own IP space allocations. Given the limited choices and practices followed by various IP space allocators, ISPs need to carefully evaluate their security posture and the practices of their RIR’s or other IP space allocators with these BCPs in mind. 5.3.4.1 Recommendations 1) ISPs and their customers should refer to the BCPs and recommendations found in SSAC 44 A Registrant's Guide to Protecting Domain Name Registration Accounts with respect to managing their ASN’s and IP spaces they register and use to provide services. 2) ISPs should review the BCPs and recommendations found in SAC 40 Measures to Protect Domain Registration Services Against Exploitation or Misuse to provide similar protections for IP space they allocate to their own customers. 6 Conclusions Working Group 4 has recommended the adoption of numerous best practices for protecting the inter-domain BGP routing system. As a distributed infrastructure requiring several actors to both enable and protect it, network operators face challenges outside of their direct control in tackling many of the issues identified. The more widely Best Current Practices are utilized, the more robust the whole system will be to both bad actors and simple mistakes. See Appendix 7.3 for a 15 http://www.icann.org/en/groups/ssac/documents/sac-040-en.pdf 16 http://www.icann.org/en/committees/security/sac044.pdfThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [33] of [40] tabular display of risks indexed with the appropriate countermeasures as discussed in the body text of the document. 7 Appendix 7.1 Background Note that in order to remain consistent with other CSRIC III reports, considerable portions of “Salient Features of BGP Operation” have been taken verbatim from the Appendix section of the CSRIC III, Working Group 6 Interim Report published March 8, 2012. Other parts have been taken from the NIST (National Institute of Standards and Technologies) report entitled Border Gateway Protocol Security. 7.1.1 Salient Features of BGP Operation This section is intended for non-experts who have a need to understand the origins of BGP security problems. Although unknown to most users, the Border Gateway Protocol (BGP) is critical to keeping the Internet running. BGP is a routing protocol, which means that it is used to update routing information between major systems. BGP is in fact the primary inter-domain routing protocol, and has been in use since the commercialization of the Internet. Because systems connected to the Internet change constantly, the most efficient paths between systems must be updated on a regular basis. Otherwise, communications would quickly slow down or stop. Without BGP, email, Web page transmissions, and other Internet communications would not reach their intended destinations. Securing BGP against attacks by intruders is thus critical to keeping the Internet running smoothly. Many organizations do not need to operate BGP routers because they use Internet service providers (ISP) that take care of these management functions. But larger organizations with large networks have routers that run BGP and other routing protocols. The collection of routers, computers, and other components within a single administrative domain is known as an autonomous system (AS). An ISP typically represents a single AS. 
In some cases, corporate networks tied to the ISP may also be part of the ISP’s AS, even though some aspects of their administration are not under the control of the ISP. Participating in the global BGP routing infrastructure gives an organization some control over the path traffic traverses to and from its IP addresses (Internet destinations). To participate in the global BGP routing infrastructure, an organization needs:
• Assigned IP addresses, grouped into IP network addresses (aka prefixes) for routing.
• A unique integer identifier called an Autonomous System Number (ASN).
• A BGP router ready to connect to a neighbor BGP router on an Internet Service Provider’s network (or another already connected AS) that is willing to establish a BGP session and exchange routing information and packet traffic with the joining organization.

The basic operation of BGP is remarkably simple – each BGP-speaking router can relay messages to its neighbors about routes to network addresses (prefixes) that it already knows, either because it “owns” these prefixes, or because it already learned routes to them from another neighbor. As a route announcement travels from one border router to another, it incrementally collects information about the ASes that the route “update” traversed in an attribute called AS_PATH. Therefore, every BGP route is constructed hop-by-hop according to local routing policies in each AS. This property of BGP is a source of its flexibility in serving diverse business needs, and also a source of vulnerabilities. The operators of BGP routers can configure routing policy rules that determine which received routes will be rejected, which will be accepted, and which will be propagated further – possibly with modified attributes, and can specify which prefixes will be advertised as allocated to, or reachable through, the router’s AS. In contrast to the simplicity of the basic operation of BGP, a routing policy installed in a BGP router can be very complex. A BGP router can have very extensive capabilities for manipulating and transforming routes to implement the policy, and such capabilities are not standardized but instead are largely dictated by AS interconnection and business relationships. A route received from a neighbor can be transformed before a decision is made to accept or reject the route, and can be transformed again before the route is relayed to other neighbors; or the route may not be disseminated at all.

All this works quite well most of the time – largely because of certain historically motivated trust and established communication channels among the human operators of the global BGP routing system. This is the trust that a route received from a neighbor accurately describes a path to a prefix legitimately reachable through the neighbor ASes’ networks, and that its attributes have not been tampered with. Notwithstanding the above, the “trust but verify” rule applies: Best Current Practices recommend filtering the routes received from neighbors. While this can be done correctly for well-known direct customers, there is currently no validated repository of the “ground truth” allowing for correct filtering of routes to all networks in the world.
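To make the hop-by-hop construction of routes and the filtering discussed above more concrete, the following is a minimal Python sketch, written for this report's illustration only and not taken from any router implementation, of how a BGP speaker might handle the AS_PATH attribute together with two basic safeguards: rejecting routes whose AS_PATH already contains the local ASN (loop prevention) and filtering a customer-edge session against an explicit allow-list of prefixes. The Route structure, ASNs, and prefixes are hypothetical.

```python
# Simplified illustration of AS_PATH handling and customer-edge prefix filtering.
import ipaddress
from dataclasses import dataclass, field

@dataclass
class Route:
    prefix: str                                   # e.g. "198.51.100.0/24"
    as_path: list = field(default_factory=list)   # e.g. [64501, 64500]

def accept_from_neighbor(route: Route, local_asn: int, allowed_prefixes=None) -> bool:
    """Basic inbound policy: loop prevention plus an optional per-session prefix filter."""
    if local_asn in route.as_path:                # mandated loop prevention
        return False
    if allowed_prefixes is not None:              # e.g. a customer-edge session
        net = ipaddress.ip_network(route.prefix)
        if not any(net.subnet_of(ipaddress.ip_network(p)) for p in allowed_prefixes):
            return False
    return True

def propagate(route: Route, local_asn: int) -> Route:
    """Prepend the local ASN to AS_PATH before relaying the route to other neighbors."""
    return Route(route.prefix, [local_asn] + route.as_path)

# Example: a customer session allowed to announce only prefixes inside 198.51.100.0/22.
r = Route("198.51.100.0/24", as_path=[64501])
if accept_from_neighbor(r, local_asn=64496, allowed_prefixes=["198.51.100.0/22"]):
    relayed = propagate(r, local_asn=64496)       # AS_PATH becomes [64496, 64501]
```

Real policy engines are, of course, far richer than this sketch: they transform attributes, apply communities, and enforce maximum-prefix limits, but the accept/transform/relay skeleton is the same.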
Now observe that the BGP protocol itself provides a perfect mechanism for spreading malformed or maliciously constructed routes, unless the BGP players are vigilant in filtering them out from further propagation. However, adequate route filtering may not be in place, and from time to time a malicious or inadvertent router configuration change creates a BGP security incident: malformed or maliciously constructed routing messages will propagate from one AS to another simply by exploiting legitimate route propagation rules, and occasionally can spread to virtually all BGP routers in the world. Because some BGP-speaking routers advertise all local BGP routes to all external BGP peers by default, another example that commonly occurs involves a downstream customer of two or more upstream ASes advertising routes learned from one upstream ISP to another ISP. Both the customer and the ISPs should put controls in place to scope the propagation of all routes to those explicitly allocated to the customer AS, but this is difficult given the lack of “ground truth”. The resulting routing distortions can cause very severe Internet service disruptions, in particular effective disconnection of victim networks or third parties from parts or all of the Internet, or forcing traffic through networks that shouldn’t carry it, potentially opening higher-level Internet transactions up to packet snooping or man-in-the-middle attacks.

7.1.2 Review of Router Operations
In a small local area network (LAN), data packets are sent across the wire, typically using Ethernet hardware, and all hosts on the network see the transmitted packets. Packets addressed to a host are received and processed, while all others are ignored. Once networks grow beyond a few hosts, though, communication must occur in a more organized manner. Routers perform the task of communicating packets among individual LANs or larger networks of hosts. To make internetworking possible, routers must accomplish these primary functions:
• Parsing address information in received packets
• Forwarding packets to other parts of the network, sometimes filtering out packets that should not be forwarded
• Maintaining tables of address information for routing packets.

BGP is used in updating routing tables, which are essential in assuring the correct operation of networks. BGP is a dynamic routing scheme: it updates routing information based on packets that are continually exchanged between BGP routers on the Internet. Routing information received from other BGP routers (often called “BGP speakers”) is accumulated in a routing table. The routing process uses this routing information, plus local policy rules, to determine routes to various network destinations. These routes are then installed in the router’s forwarding table. The forwarding table is actually used in determining how to forward packets, although the term routing table is often used to describe this function (particularly in documentation for home networking routers).

7.2 BGP Security Incidents and Vulnerabilities
In this section we classify the observed BGP security incidents, outline the known worst-case scenarios, and attempt to tie the incidents to features of proposed solutions that could prevent them. Many of the larger incidents are believed to have been the result of misconfigurations or mistakes rather than intentional malice or criminal intent.
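The incident classes described in the remainder of this section can be made more tangible with a small monitoring-style check. The sketch below is a hypothetical illustration, not a procedure prescribed by this report: it compares observed announcements against a locally maintained baseline of expected origin ASNs per prefix (for example, built from IRR or "whois" data) and flags both unexpected origin changes and unexpected more-specific announcements, the two symptoms most commonly associated with origin hijacks and de-aggregation attacks. All prefixes and ASNs shown are illustrative.

```python
# Hypothetical sketch: flag announcements whose origin AS or prefix length
# deviates from a locally maintained baseline.
import ipaddress

# baseline: prefix -> set of expected origin ASNs (illustrative values)
BASELINE = {
    "203.0.113.0/24": {64500},
    "198.51.100.0/22": {64501},
}

def check_announcement(prefix: str, origin_asn: int):
    """Return human-readable alerts for one observed (prefix, origin) announcement."""
    alerts = []
    net = ipaddress.ip_network(prefix)
    for known, origins in BASELINE.items():
        known_net = ipaddress.ip_network(known)
        if net == known_net and origin_asn not in origins:
            alerts.append(f"origin change for {prefix}: AS{origin_asn} not in {origins}")
        elif net != known_net and net.subnet_of(known_net):
            alerts.append(f"more-specific {prefix} inside {known} announced by AS{origin_asn}")
    return alerts

# Example: a /24 carved out of a monitored /22 by an unexpected AS.
print(check_announcement("198.51.100.0/24", 64666))
```

As noted in section 5.2.1.5.1, the usefulness of any such check depends entirely on the quality of the baseline and the number of vantage points observing announcements.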
It has long been suspected that more frequent, less visible incidents have been occurring without attracting much attention. BGP security incidents usually originate in just one particular BGP router, or a group of related BGP routers in an AS, by means of a change to the router’s configuration leading to announcements of a peculiar route or routes that introduce new paths towards a given destination or trigger bugs or other misbehaviors in neighboring routers in the course of propagation. There are no generally accepted criteria for labeling a routing incident as an “attack”, and – as stressed in the recommendations – there is a lack of broadly accepted routing security metrics that could automatically identify certain routing changes as “routing security violations”. BGP security incidents observed to date can be classified as follows:
• Route origin hijacking (unauthorized announcements of routes to IP space not assigned to the announcer). Such routing integrity violations may happen under various scenarios: malicious activity, inadvertent misconfigurations (“fat fingers”), or errors in traffic engineering. There are further sub-categories of such suspected security violations:
o Hijacking of unused IP space, such as repetitive hijacks of routes to prefixes within large IP blocks assigned to an entity such as the US government but normally not routed on the public Internet. Temporarily using these “unused” addresses enables criminal or antisocial activities (spam, network attacks) while complicating efforts to detect and diagnose the perpetrators.
o Surgically targeted hijacks of specific routes and de-aggregation attacks on specific IP addresses. They may be hard to identify unless anomaly detection is unambiguous, or the victim is important enough to create a large commotion. Example: Pakistan Hijacks YouTube17 (the advertisement of a more-specific prefix is globally accepted and totally black-holes the traffic to the victim). There may be significantly more such attacks than publicly reported, as they may be difficult to distinguish from legitimate traffic engineering or network re-engineering activities.
o Unambiguous massive hijacks of many routes, where many distinct legitimate origin ASes are replaced by a new unauthorized origin AS advertising the hijacked routes. Significant recent incidents include the 2010 “China's 18-minute Mystery”18, the hijacking of a very large portion of the Internet for several hours by TTNet in 200419, and a 2006 ConEd incident20. Without knowing the motivations of the implicated router administrators it is difficult to determine whether these and similar incidents were due to malicious intent or to errors in implementations of routing policy changes.
• Manipulation of the AS_PATH attribute in transmitted BGP messages, executed by malicious, selfish, or erroneous policy configuration. The intention of such attacks is to exploit BGP routers’ route selection algorithms that depend on AS_PATH properties, such as immediate rejection of a route with the router’s own ASN in the AS_PATH (mandated to prevent routing loops), or AS_PATH length. Alternatively, such attacks may target software bugs in distinct BGP implementations (of which quite a few were triggered in recent years with global impact).
o For routing incidents triggered by long AS_PATHs see House of Cards21, AfNOG Takes Byte Out of Internet22, Longer is Not Always Better23 for actual examples. o Route leaks - A possibility of “man in the middle” (MITM) AS_PATH attacks detouring traffic via a chosen AS was publicly demonstrated at DEFCON in 17 http://www.renesys.com/blog/2008/02/pakistan-hijacks-youtube-1.shtml 18 http://www.renesys.com/blog/2010/11/chinas-18-minute-mystery.shtml 19 Alin C. Popescu, Brian J. Premore, and Todd Underwood, Anatomy of a Leak: AS9121. NANOG 34, May 16, 2005. 20 http://www.renesys.com/blog/2006/01/coned-steals-the-net.shtml 21 http://www.renesys.com/blog/2010/08/house-of-cards.shtml 22 http://www.renesys.com/blog/2009/05/byte-me.shtml 23 http://www.renesys.com/blog/2009/02/longer-is-not-better.shtmlThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [37] of [40] 200824. Two other similar incidents were found in a 7-month period surrounding the DEFCON demo by mining of a BGP update repository conducted in 200925 but were not confirmed as malicious. This can occur either by accident as detailed above, and is sometimes referred to as route “leaks”, or may be intentional. Additionally, such attacks may or may not attempt to obscure the presence of additional ASes in the AS path, should they exist. These are particularly problematic to identify as they require some knowledge of intent by the resource holder and intermediate ASes. o AS_PATH poisoning – sometimes used by operators to prevent their traffic AS from reaching and/or transiting a selected AS, or steer the traffic away from certain paths. It is technically a violation of BGP protocol and could be used harmfully as well. • Exploitations of router packet forwarding bugs, router performance degradation, bugs in BGP update processing o Example of a transient global meltdown caused by a router bug tickled by deaggregation26 and several other cases cited there. There are also BGP vulnerabilities that may have not been exploited in the wild so far, but that theoretically could do a lot of damage. The BGP protocol does not have solid mathematical foundations, and certain bizarre behaviors – such as persistent route oscillations – are quite possible. There have been several RFCs and papers addressing BGP vulnerabilities in the context of protocol standard specification and threat modeling, see the following Request For Comments (RFCs): • RFC 4272 “BGP Security Vulnerabilities Analysis” S. Murphy, Jan 2006. • RFC 4593 “Generic Threats to Routing Protocols”, A. Barbir, S. Murphy and Y. Yang, Oct 2006. • Internet draft draft-foo-sidr-simple-leak-attack-bgpsec-no-help-01 “Route Leak Attacks Against BGPSEC”, D. McPherson and S. Amante, Nov 2011. • Internet draft draft-ietf-sidr-bgpsec-threats-01 “Threat Model for BGP Path Security”, S. Kent and A. Chi, Feb 2012. 24 A. Pilosov and T. Kapela, Stealing the internet, DEFCON 16 August 10, 2008 25C. Hepner and E. Zmijewski, Defending against BGP Man-in-the-Middle attacks, Black Hat DC February 2009 26 J. Cowie, The Curious Incident of 7 November 2011, NANOG 54, February 7, 2012Page [38] of [40] 7.3 BGP Risks Matrix BGP Routing Security Risks Examined by WG 4 Network Operator Role Risks Report Sect. Recommendations Stub network (e.g. 
Enterprise, Data Center) Session-level threats 5.1.1 • Consider MD5 or GTSM if neighbor recommends it DoS (routers and routing info) 5.1.2 • Control-Plane Policing (rate-limiting) • Keep up-to-date router software Spoofed Source IP Addresses 5.1.3 • Use uRPF (unicast Reverse Path Forwarding) in strict mode or other similar features at access edge of network (e.g. datacenter or campus). • Filter source IP address on packets at network edge to ISPs Incorrect route injection and propagation 5.2.1 • Keep current information in “whois” and IRR (Internet Routing Registry) databases • Outbound prefix filtering • Use monitoring services to check for incorrect routing announcements and/or propagation Other Attacks (e.g., hacking, insider, social engineering) 5.3 • Consider many recommendations about operational security processes Internet Service Provider Network Session-level threats 5.1.1 • Consider a plan to use MD5 or GTSM including flexibility to adjust to different deployment scenario specifics DoS (routers and routing info) 5.1.2 • Control-Plane Policing (rate-limiting)The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [39] of [40] • Keep up-to-date router software Spoofed Source IP Addresses 5.1.3 • Use uRPF (unicast Reverse Path Forwarding) in strict or loose mode as appropriate (e.g. strict mode at network ingress such as data-center or subscriber edge, loose mode at inter-provider border) Incorrect route injection and propagation 5.2.1 • Keep current information in “whois” and IRR (Internet Routing Registry) databases • Consult current information in “whois” and IRR (Internet Routing Registry) databases when provisioning or updating customer routing • Implement inbound prefix filtering from customers • Consider AS-path filters and maximum-prefix limits as second line of defense • Use monitoring services to check for incorrect routing announcements and/or propagation Other Attacks (e.g., hacking, insider, social engineering) 5.3 • Consider many recommendations about operational security processesPage [40] of [40] 7.4 BGP BCP Document References Network Protection Documents NIST Special Publication 800-54 Border Gateway Protocol (BGP) Security Recommendations WG2A - Cyber Security Best Practices SANS: Twenty Critical Security Controls for Effective Cyber Defense: Consensus Audit Guidelines (CAG) NIST Special Publication 800-53 Recommended Security Controls for Federal Information Systems and Organizations IETF RFC 4778 - Current Operational Security Practices in Internet Service Provider Environments IETF RFC 3013 Recommended Internet Service Provider Security Services and Procedures Source Address verification/filtering IETF BCP38/RFC 2827 Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing BCP 84/RFC 3704 Ingress Filtering for Multihomed Networks BCP 140/RFC 5358 Preventing Use of Recursive Nameservers in Reflector Attacks ICANN SAC004 Securing the Edge DoS/DDoS Considerations IETF RFC 4732 Internet Denial-of-Service Considerations SANS A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider Environment Scalable Dynamic Nonparametric Bayesian Models of Content and Users ∗ Amr Ahmed1 Eric Xing2 1Research @ Google , 2Carnegie Mellon University amra@google.com, epxing@cs.cmu.edu Abstract Online content have become an important medium to disseminate information and express opinions. 
With their proliferation, users are faced with the problem of missing the big picture in a sea of irrelevant and/or diverse content. In this paper, we address the problem of information organization of online document collections, and provide algorithms that create a structured representation of the otherwise unstructured content. We leverage the expressiveness of latent probabilistic models (e.g., topic models) and non-parametric Bayes techniques (e.g., Dirichlet processes), and give online and distributed inference algorithms that scale to terabyte datasets and adapt the inferred representation with the arrival of new documents. This paper is an extended abstract of the dissertation of Ahmed [2011], recipient of the 2012 ACM SIGKDD best doctoral dissertation award.

1 Introduction
Our online infosphere is evolving at an astonishing rate. It is reported that there are 50 million scientific journal articles published thus far [Jinha, 2010], 126 million blogs1, an average of one news story published per second, and around 500 million tweets per day. With the proliferation of such content, users are faced with the problem of missing the big picture in a sea of irrelevant and/or diverse content. Thus several unsupervised techniques have been proposed to build a structured representation of users and content.

Traditionally, clustering is used as a popular unsupervised technique to explore and visualize a document collection. When applied to document modeling, it assumes that each document is generated from a single component (cluster or topic) and that each cluster is a uni-gram distribution over a given vocabulary. This assumption limits the expressive power of the model, and does not allow for modeling documents as a mixture of topics.

∗ The dissertation on which this extended abstract is based was the recipient of the 2012 ACM SIGKDD best doctoral dissertation award [Ahmed, 2011].
1 http://www.blogpulse.com/

Recently, mixed membership models [Erosheva et al., 2004], also known as admixture models, have been proposed to remedy the aforementioned deficiency of mixture models. Statistically, an object wd is said to be derived from an admixture if it consists of a bag of elements, say {wd1, . . . , wdN}, each sampled independently or coupled in some way from a mixture model, according to an admixing coefficient vector θ, which represents the (normalized) fraction of the contribution from each of the mixture components to the object being modeled. In a typical text modeling setting, each document corresponds to an object, the words thereof correspond to the elements constituting the object, and the document-specific admixing coefficient vector is often known as a topic vector; the model is known as the latent Dirichlet allocation (LDA) model due to the choice of a Dirichlet distribution as the prior for the topic vector θ [Blei et al., 2003]. Notwithstanding these developments, existing models cannot faithfully model the dynamic nature of online content, represent multiple facets of the same topic, and scale to the size of the data on the internet.

In this paper, we highlight several techniques to build a structured representation of content and users. First we present a flexible dynamic non-parametric Bayesian process called the Recurrent Chinese Restaurant Process for modeling longitudinal data, and then present several applications in modeling scientific publications, social media, and tracking of user interests.
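For reference, the admixture (LDA) generative process described above can be written out explicitly. This is the standard formulation following Blei et al. [2003], not new material from this abstract, with K topics φ_{1:K}, Dirichlet hyperparameters α and β, and N_d words in document d:

```latex
% Standard LDA generative process (Blei et al., 2003)
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), \quad k = 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) \\
z_{dn} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), \quad n = 1,\dots,N_d \\
w_{dn} \mid z_{dn}, \phi_{1:K} &\sim \mathrm{Multinomial}(\phi_{z_{dn}})
\end{align*}
```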
2 Recurrent Chinese Restaurant Process
Standard clustering techniques assume that the number of clusters is known a priori or can be determined using cross validation. Alternatively, one can consider non-parametric techniques that adapt the number of clusters as new data arrives. The power of non-parametric techniques is not limited to model selection; they also endow the designer with the necessary tools to specify priors over sophisticated (possibly infinite) structures like trees, and provide a principled way of learning these structures from data.

A key non-parametric distribution is the Dirichlet process (DP). The DP is a distribution over distributions [Ferguson, 1973]. A DP denoted by DP(G0, α) is parameterized by a base measure G0 and a concentration parameter α. We write G ∼ DP(G0, α) for a draw of a distribution G from the Dirichlet process. G itself is a distribution over a given parameter space θ, therefore we can draw parameters θ1:N from G.

[Figure 1: Left: the NIPS conference timeline as discovered by the iDTM. Right: the evolution of the topic “Kernel Methods”, with representative papers in each year.]

Integrating out G, the parameters θ follow a Polya urn distribution [Blackwell and MacQueen, 1973], also known as the Chinese restaurant process (CRP), in which the previously drawn values of θ have strictly positive probability of being redrawn, thus making the underlying probability measure G discrete with probability one. More formally,

\theta_i \mid \theta_{1:i-1}, G_0, \alpha \;\sim\; \sum_k \frac{m_k}{i-1+\alpha}\,\delta(\phi_k) \;+\; \frac{\alpha}{i-1+\alpha}\,G_0, \qquad (1)

where φ1:k denotes the distinct values among the parameters θ, and mk is the number of parameters θ having value φk. By using the DP at the top of a hierarchical model, one obtains the Dirichlet process mixture model, DPM [Antoniak, 1974]. The generative process thus proceeds as follows:

G \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_d \mid G \sim G, \qquad w_d \mid \theta_d \sim F(\cdot \mid \theta_d), \qquad (2)

where F is a given likelihood function parameterized by θ.
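To make the Pólya urn / CRP view of Eq. (1) concrete, here is a small Python sketch of our own (not code from the dissertation) that seats points sequentially: an existing cluster k is chosen with probability proportional to its count m_k, and a new cluster with probability proportional to α.

```python
import random
from collections import Counter

def crp_assignments(n_points: int, alpha: float, seed: int = 0):
    """Sequentially assign n_points items to clusters according to a CRP(alpha)."""
    rng = random.Random(seed)
    counts = Counter()            # cluster id -> number of points m_k
    assignments = []
    for _ in range(n_points):
        clusters = list(counts)
        weights = [counts[k] for k in clusters] + [alpha]   # existing clusters + new
        choice = rng.choices(clusters + ["new"], weights=weights)[0]
        if choice == "new":
            choice = len(counts)  # label the newly created cluster
        counts[choice] += 1
        assignments.append(choice)
    return assignments

# e.g. crp_assignments(1000, alpha=1.0) typically yields on the order of alpha*log(n) clusters
```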
Dirichlet process mixture (or CRP) models provide a flexible Bayesian framework; however, the full exchangeability assumption they employ makes them an unappealing choice for modeling longitudinal data such as text streams that arrive or accumulate as epochs, where data points inside the same epoch can be assumed to be fully exchangeable, whereas across epochs both the structure (i.e., the number of mixture components) and the parametrization of the data distributions can evolve and are therefore not exchangeable. In this section, we present the Recurrent Chinese Restaurant Process (RCRP) [Ahmed and Xing, 2008] as a framework for modeling such complex longitudinal data, in which the number of mixture components at each time point is unbounded; the components themselves can be retained, die out, or emerge over time; and the actual parametrization of each component can also evolve over time in a Markovian fashion.

In the RCRP, documents are assumed to be divided into epochs (e.g., one hour or one day); we assume exchangeability only within each epoch. For a new document at epoch t, a probability mass proportional to α is reserved for generating a new cluster. Each existing cluster may be selected with probability proportional to the sum m_{kt} + m'_{kt}, where m_{kt} is the number of documents at epoch t that belong to cluster k, and m'_{kt} is the prior weight for cluster k at time t. If we let c_{td} denote the cluster assignment of document d at time t, then

c_{td} \mid c_{1:t-1}, c_{t,1:d-1} \sim \mathrm{RCRP}(\alpha, \lambda, \Delta) \qquad (3)

indicates the distribution

P(c_{td} = k \mid c_{1:t-1}, c_{t,1:d-1}) \propto
\begin{cases}
m'_{kt} + m^{-td}_{kt} & \text{existing cluster } k \\
\alpha & \text{new cluster.}
\end{cases} \qquad (4)

As in the original CRP, the count m^{-td}_{kt} is the number of documents in cluster k at epoch t, not including d. The temporal aspect of the model is introduced via the prior m'_{kt}, which is defined as

m'_{kt} = \sum_{\delta=1}^{\Delta} e^{-\delta/\lambda}\, m_{k,t-\delta}. \qquad (5)

This prior defines a time-decaying kernel, parametrized by ∆ (width) and λ (decay factor). When ∆ = 0 the RCRP degenerates to a set of independent Chinese Restaurant Processes at each epoch; when ∆ = T and λ = ∞ we obtain a global CRP that ignores time. In between, the values of these two parameters affect the expected life span of a given component, such that the lifespan of each storyline follows a power law distribution [Ahmed and Xing, 2008]. In addition, the distribution φk of each component changes over time in a Markovian fashion, i.e., φ_{kt} | φ_{k,t−1} ∼ P(·|φ_{k,t−1}). In the following three sections we give various models built on top of the RCRP and highlight how inference is performed and scaled to the size of data on the internet.
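The time-decayed prior of Eq. (5) is straightforward to compute. The following short Python sketch, a simplified illustration under our own naming rather than the authors' code, derives the prior weight m'_{kt} from per-epoch counts and combines it with current-epoch counts to form the unnormalized cluster weights of Eq. (4).

```python
import math

def rcrp_prior_weight(counts_history, k, t, width, decay):
    """m'_{kt} = sum_{delta=1..width} exp(-delta/decay) * m_{k, t-delta}.

    counts_history: dict mapping epoch -> {cluster k: count m_{kt}}.
    """
    return sum(
        math.exp(-delta / decay) * counts_history.get(t - delta, {}).get(k, 0)
        for delta in range(1, width + 1)
    )

def rcrp_cluster_weights(counts_history, current_counts, t, width, decay, alpha):
    """Unnormalized weights over existing clusters plus the new-cluster mass (Eq. 4)."""
    clusters = set(current_counts) | {
        k for d in range(1, width + 1) for k in counts_history.get(t - d, {})
    }
    weights = {
        k: current_counts.get(k, 0) + rcrp_prior_weight(counts_history, k, t, width, decay)
        for k in clusters
    }
    weights["new"] = alpha
    return weights
```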
3 Modeling Scientific Publications
With the large number of research publications available online, it is important to develop automated methods that can discover salient topics (research areas), when each topic started, how each topic developed over time, and what the representative publications in each topic are in each year. Mixed-membership models (such as LDA) are static in nature, and while several dynamic extensions have been proposed ([Blei and Lafferty, 2006]), none of them can deal with evolving all of the aforementioned aspects. While the RCRP model can be used for modeling the temporal evolution of research topics, it assumes that each document is generated from a single topic (cluster). To marry these two approaches, we first introduce Hierarchical Dirichlet Processes (HDP) [Teh et al., 2006] and then illustrate our proposed model.

Instead of modeling each document wd as a single data point, we could model each document as a DP. In this setting, each word w_{dn} is a data point and thus will be associated with a topic sampled from the random measure Gd, where Gd ∼ DP(α, G0). The random measure Gd thus represents the document-specific mixing vector over a potentially infinite number of topics. To share the same set of topics across documents, we tie the document-specific random measures by modeling the base measure G0 itself as a random measure sampled from a DP(γ, H). The discreteness of the base measure G0 ensures topic sharing between all the documents.

Now we proceed to introduce our model, iDTM [Ahmed and Xing, 2010b], which allows for an infinite number of topics with variable durations. The documents in epoch t are modeled using an epoch-specific HDP with a high-level base measure denoted as G^t_0. These epoch-specific base measures {G^t_0} are tied together using the RCRP of Section 2 to evolve the topics’ popularity and distribution over words as time proceeds. To enable the evolution of the topic distribution over words, we model each topic as a logistic normal distribution and evolve its parameters using a Kalman filter. This choice introduces non-conjugacy between the base measure and the likelihood function, and we deal with it using a Laplace approximate inference technique proposed in [Ahmed and Xing, 2007]. We applied this model to the collection of papers published in the NIPS conference over 18 years. In Figure 1 we depict the conference timeline and the evolution of the topic “Kernel Methods” along with popular papers in each year. In addition to modeling the temporal evolution of topics, in [Ahmed et al., 2009] we developed a mixed-membership model for retrieving relevant research papers based on multiple modalities: for example, figures or key entities in the paper such as genes or protein names (as in biomedical papers). Figures in biomedical papers pose various modeling challenges that we omit here due to space limitations.

4 Modeling Social Media
News portals and blogs/Twitter are the main means to disseminate news stories and express opinions. With the sheer volume of documents and blog entries generated every second, it is hard to stay informed. This section explores methods that create a structured representation of news and opinions. Storylines emerge from events in the real world, such as the tsunami in Japan, and have certain durations. Each story can be classified under multiple topics such as disaster, rescue, and economics.

[Figure 2: Some example storylines and topics extracted by our system. For each storyline we list the top words in the left column and the top named entities at the right; the plot at the bottom shows the storyline strength over time. For topics we show the top words. The lines between storylines and topics indicate that at least 10% of terms in a storyline are generated from the linked topic.]

In addition, each storyline focuses on certain
words and named entities, such as the names of the cities or people involved in the event. In [Ahmed et al., 2011b,a] we used the RCRP to model storylines. In a nutshell, we emulate the process of generating news articles. A story is characterized by a mixture of topics and the names of the key entities involved in it. Any article discussing this story then draws its words from the topic mixture associated with the story, the associated named entities, and any story-specific words that are not well explained by the topic mixture. The latter modification allows us to improve our estimates for a given story once it becomes popular. In summary, we model news story clustering by applying a topic model to the clusters, while simultaneously allowing for cluster generation using the RCRP. Such a model has a number of advantages: estimates in topic models improve with the amount of data available, and modeling a story by its mixture of topics ensures that we have a plausible cluster model right from the start, even after observing only one article for a new story. Further, the RCRP ensures a continuous flow of new stories over time. Finally, a distinct named entity model ensures that we capture the characteristic terms rapidly. In order to infer storylines from the text stream, we developed a Sequential Monte Carlo (SMC) algorithm that assigns news articles to storylines in real time. Applying our online model to a collection of news articles extracted from a popular news portal, we discovered the structure shown in Figure 2. This structure enables the user to browse the storylines by topic as well as retrieve relevant storylines based on any combination of the storyline attributes. Note that named entities are extracted in a preprocessing step using standard extractors. Quantitatively, we compared the accuracy of our online clustering with a strong offline algorithm [Vadrevu et al., 2011], with a favorable outcome.

Next, we address the problem of ideology-bias detection in user-generated content such as microblogs.
We follow the notion of ideology as defined by Van Dijk [Dijk, 1998]: “a set of general abstract beliefs commonly shared by a group of people.” In other words, an ideology is a set of ideas that directs one’s goals, expectations, and actions. For instance, freedom of choice is a general aim that directs the actions of “liberals”, whereas conservation of values is the parallel for “conservatives”. In Ahmed and Xing [2010a] we developed a multi-view mixed-membership model that utilizes a factored representation of topics, where each topic is composed of two parts: an unbiased part (shared across ideologies) and a biased part (different for each ideology). Applying this model to a few ideologically labelled documents as seeds and many unlabeled documents, we were able to identify how each ideology stands with respect to mainstream topics. For instance, in Figure 3 we show the result of applying the model to a set of articles written on the Middle East conflict by both Israeli and Palestinian writers.

[Figure 3: Ideology detection. Middle topics represent the unbiased portion of each topic, while each side gives the Israeli and Palestinian perspective.]

Given a new document, the model can 1) detect its ideological bias (if any), 2) point out where the bias appears (i.e.
highlight words and/or biased sentences), and 3) retrieve documents written on the same topic from the opposing ideology. Our model achieves state-of-the-art results on tasks 1 and 3 while being unique in solving task 2.
Figure 5: Dynamic interests of two users (topic proportions over days for topics such as Baseball, Dating, Celebrity, Health, Finance, and Jobs).
5 Modeling User Interests
Historical user activity is key for building user profiles to predict user behaviour and affinities in many web applications such as targeting of online advertising, content personalization and social recommendations. User profiles are temporal, and changes in a user’s activity patterns are particularly useful for improved prediction and recommendation. For instance, an increased interest in car-related web pages suggests that the user might be shopping for a new vehicle. In Ahmed et al. [2011c] we present a comprehensive statistical framework for user profiling based on the RCRP model which is able to capture such effects in a fully unsupervised fashion. Our method models the topical interests of a user dynamically, where both the user’s association with the topics and the topics themselves are allowed to vary over time, thus ensuring that the profiles remain current. For instance, if we represent each user as a bag of the words in their search history, we could use the iDTM model described in Section 3. However, unlike research papers, which exist in a single epoch, users exist across multiple epochs (where each epoch here might denote a day). To solve this problem we extend iDTM by modeling each user as an RCRP that evolves over time, as shown in Figure 4. To deal with the size of data on the internet, we developed a streaming, distributed inference algorithm that distributes users over multiple machines and synchronizes the model parameters using an asynchronous consensus protocol described in more detail in [Ahmed et al., 2012; Smola and Narayanamurthy, 2010]. Figure 5 shows qualitatively the output of our model for two users. Quantitatively, the discovered interests, when used as features in an advertising task, result in a significant improvement over a strong deployed system.
6 Conclusions
Our infosphere is diverse and dynamic. Automated methods that create a structured representation of users and content are key to helping users stay informed. We presented a flexible nonparametric Bayesian model called the Recurrent Chinese Restaurant Process and showed how using this formalism (in addition to mixed-membership models) can solve this task. We validated our approach on many domains and showed how to scale the inference to the size of the data on the internet and how to perform inference in online settings.
References
A. Ahmed and E. P. Xing. On tight approximate inference of the logistic normal topic admixture model. In AISTATS, 2007. A. Ahmed and E. P. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In SDM, pages 219–230. SIAM, 2008. A. Ahmed and E. P. Xing.
Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP, 2010. A. Ahmed and E. P. Xing. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. In UAI, 2010. A. Ahmed, E. P. Xing, W. W. Cohen, and R. F. Murphy. Structured correspondence topic models for mining captioned figures in biological literature. In KDD, pages 39–48. ACM, 2009. A. Ahmed, Q. Ho, J. Eisenstein, E. P. Xing, A. J. Smola, and C. H. Teo. Unified analysis of streaming news. In WWW, 2011. A. Ahmed, Q. Ho, C. H. Teo, J. Eisenstein, A. J. Smola, and E. P. Xing. Online inference for the infinite topic-cluster model: Storylines from text stream. In AISTATS, 2011. A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, pages 114–122, 2011. A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Web Science and Data Mining (WSDM), 2012. A. Ahmed. Modeling Users and Content: Structured Probabilistic Representation, and Scalable Online Inference Algorithms. PhD thesis, School of Computer Science, Carnegie Mellon University, 2011. C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974. D. Blackwell and J. MacQueen. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353–355, 1973. D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, volume 148, pages 113–120. ACM, 2006. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. T. A. Van Dijk. Ideology: A Multidisciplinary Approach. 1998. E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership models of scientific publications. PNAS, 101(1), 2004. T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973. A. E. Jinha. Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing, 23(3):258–263, 2010. A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010. Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(576):1566–1581, 2006. S. Vadrevu, C. H. Teo, S. Rajan, K. Punera, B. Dom, A. J. Smola, Y. Chang, and Z. Zheng. Scalable clustering of news search results. In WSDM, 2011.
Towards A Unified Modeling and Verification of Network and System Security Configurations
Mohammed Noraden Alsaleh, Ehab Al-Shaer, University of North Carolina at Charlotte, Charlotte, NC, USA. Email: {malsaleh, ealshaer}@uncc.edu
Adel El-Atawy, Google Inc, Mountain View, CA, USA. Email: aelatawy@google.com
Abstract—System and network access control configurations are usually analyzed independently although they are logically combined to define the end-to-end security property. While system and application security policies define access control based on user identity or group, request type and the requested resource, network security policies use flow information such as host and service addresses for source and destination to define access control. Therefore, both network and system access control have to be configured consistently in order to enforce end-to-end security policies.
Much previous research attempts to verify either side separately, but it does not provide a unified approach to automatically validate the logical consistency between the two. Thus, using existing techniques requires error-prone manual and ad-hoc analysis to validate this link. In this paper, we introduce a cross-layer modeling and verification system that can analyze the configurations and policies across both application and network components as a single unit. It combines policies from different devices such as firewalls, NAT, routers and IPSec gateways, as well as basic RBAC-based policies of higher service layers. This allows analyzing, for example, firewall policies in the context of application access control and vice versa. Thus, by incorporating policies across the network and over multiple layers, we provide a true end-to-end configuration verification tool. Our model represents the system as a state machine where packet header, service request and location determine the state, and transitions that conform with the configuration, device operations, and packet values are established. We encode the model as Boolean functions using binary decision diagrams (BDDs). We use an extended version of computational tree logic (CTL) to provide more useful operators and then use it with symbolic model checking to prove or find counterexamples to needed properties. The tool is implemented, and we gave special consideration to efficiency and scalability. I. INTRODUCTION Users inadvertently trigger a long sequence of operations in many locations and devices by just a simple request. The application requests are encapsulated inside network packets which in turn are routed through the network devices and subjected to different types of routing, access control and transformation policies. Misconfigurations at the different layers in any of the network devices can affect the end-to-end connection between the hosts themselves and the communicating services running on top of them. Moreover, applications may need to transform a request into one or more other requests with different characteristics. This means that the network layer should guarantee more than one packet flow at the same time in order for the application request to be successful. Although it is already very hard to verify that only the legitimate packets can pass through the network successfully, the consistency between network and application layer access configuration adds another challenge. The different natures of policies, from network layer devices to the logic of application access control, make the problem more complex. In this paper we have extended ConfigChecker [3] to include application layer access control. ConfigChecker is a model checker for network configuration. It implements many network devices, including routers, firewalls, IPSec gateways, hosts, and NAT/PAT nodes. ConfigChecker models the transitions based on packet forwarding at the network layer, where the packet header fields along with the location represent the variables for the model checker. We define application layer requests following a loose RBAC model: a 4-tuple of ⟨user, role, object, action⟩. The request can be created by users or services running on top of hosts in the network. The services in our model can also transform a request into one or more other requests and forward them to a different destination. We have implemented a parallel model checker for application layer configuration.
The transitions in the application layer model are determined by the movement of the requests between different services. However, the network and application layer model checkers do not operate separately. Requirements regarding both models can be verified in a single query using our unified query interface. Moreover, inconsistencies between the network configuration and the application layer configuration can be detected. The nature of the problem of verifying network-wide configurations necessitates a very scalable system in terms of time and space requirements. Larger networks, more complex configurations, and a richer variety of devices are all dimensions that the system should handle gracefully. The application-layer access control depends on different variables than most network layer policies. We chose to implement a parallel model checker for application access control rather than adding the application layer variables (which correspond to request fields) to the network model checker itself. This can decrease the number of system states and improve performance. As in ConfigChecker, both model checkers are represented as state machines and encoded as Boolean functions using Binary Decision Diagrams (BDDs). We use an extended version of computational tree logic (CTL) to provide more useful operators and then use it with symbolic model checking to prove or find counterexamples for needed properties regarding both models.
Fig. 1. A simple overview of the framework design and flow.
The rest of this paper is organized as follows. We first briefly describe our framework components in Section II. We then present the model used for capturing the network and application layer configuration in Section III and Section IV respectively. Section V shows how to query the model for properties, and lists some sample queries. The related work is presented in Section VI. We finally present our conclusion and future remarks in Section VII. II. FRAMEWORK OVERVIEW The framework consists of a few key components: configuration loader, model compiler, and query engine. The duty of each component is described briefly below:
• The Configuration Loader parses the main configuration file that points to the configuration files of network devices. Each file represents a device or entity (e.g., firewall, router, application-layer service, etc.). Each configuration file consists mainly of two sections: meta-data directives and policy. The initial directives act as an initialization step to configure the device properties like default gateway, service port, host address, etc. The policy is listed afterwards as a simple list of rules.
• The Model Compiler translates the configuration into a Boolean expression for each policy rule and builds a single expression for each device. These expressions are then combined into a single expression representing the whole network.
• The Query Engine is responsible for verifying properties of the system by executing simple scripts based on CTL expressions. Scripts are written using a very limited set of primitives for refining the user output and for defining the property itself.
The model compiler component builds two separate expressions. The first represents the network layer configuration, reflecting packet forwarding and transformation through the network core and end points as described in Section III. The other expression represents the application layer configuration, including services and users.
This reflects how requests are forwarded and transformed at the service level. Section IV describes this process in more detail. Although we can integrate the two expressions and build only one expression that accommodates both the network and application layer configuration, we chose to split them into two expressions. The variables used in each of them are generally independent except for the location variable. Building one expression takes more space because the network configuration would be duplicated for each combination of application layer variables, generating more and more states. Splitting thus helps our model scale and avoids state explosion. III. NETWORK MODEL We model the network as a single monolithic finite state machine. The state space is the cross-product of the packet properties and the packet's possible locations in the network. The packet properties include the header information that determines the network response to a specific packet. A. State representation Initially, the only information we need about the packet is the source and destination information contained in the IP header and the current location of the packet in the network. Therefore, we can encode the state of the network with the following characteristic function:
σ_n : IP_s × port_s × IP_d × port_d × loc → {true, false}
where IP_s is the 32-bit source IP address, port_s the 16-bit source port number, IP_d the 32-bit destination IP address, port_d the 16-bit destination port number, and loc the 32-bit IP address of the device currently processing the packet.
The function σ_n encodes the state of the network by evaluating to true whenever the parameters used as input to the function correspond to a packet that is in the network, and false otherwise. If the network contains 5 different packets, then exactly five assignments to the parameters of the function σ_n will result in true. Note that because we abstract payload information, we cannot distinguish between two packets that are at the same device if they also have the same IP header information. Each device in the network can then be modeled by describing how it changes a packet that is currently located at the device. For example, a firewall might remove the packet from the network or it might allow it to move on to the device on the other side of the firewall. A router might change the location of the packet but leave all the header information unchanged. A device performing network address translation might change the location of the packet as well as some of the IP header information. A hub might copy the same packet to multiple new locations. The behavior of each of these devices can be described by a list of rules. Each rule has a condition and an action. The rule condition can be described using a Boolean formula over the bits of the state (the parameters of the characteristic function σ_n). If the packet at the device matches (satisfies) a rule condition, then the appropriate action is taken. As described above, the action could involve changing the packet location as well as changing IP header information. In all cases, however, the change can be described by a Boolean formula over the bits of the state. Sometimes the new values are constant (completely determined by the rule itself), and sometimes they may depend on the values of some of the bits in the current state. In either case, a transition relation can be constructed as a relation or characteristic function over two copies of the state bits. An assignment to the bits/variables in the transition relation yields true if the packet described by the first copy of the bits will be transformed into a packet described by the second copy of the bits when it is at the device in question.
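To make the state encoding and the condition/action rules concrete, here is a small explicit-state sketch in Python. It represents a packet state as a tuple of the σ_n variables and a device as an ordered rule list; the explicit enumeration and all names (PacketState, route_rule, etc.) are illustrative stand-ins for the BDD encoding used by the actual tool.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple

class PacketState(NamedTuple):
    """One assignment of the network state variables (cf. sigma_n)."""
    ip_src: str
    port_src: int
    ip_dst: str
    port_dst: int
    loc: str  # address of the device currently processing the packet

Rule = Tuple[Callable[[PacketState], bool],
             Callable[[PacketState], Optional[PacketState]]]

def route_rule(prefix: str, next_hop: str) -> Rule:
    """Router-like rule: move packets whose destination matches `prefix`
    to `next_hop`, leaving the header untouched."""
    return (lambda s: s.ip_dst.startswith(prefix),
            lambda s: s._replace(loc=next_hop))

def deny_rule(port: int) -> Rule:
    """Firewall-like rule: drop packets destined to `port`."""
    return (lambda s: s.port_dst == port,
            lambda s: None)

def device_step(rules: List[Rule], state: PacketState) -> Optional[PacketState]:
    """First-match semantics: apply the first rule whose condition holds;
    with no match the packet is dropped (no next state)."""
    for condition, action in rules:
        if condition(state):
            return action(state)
    return None

# A gateway that blocks telnet and forwards 10.0.0.0/8 traffic to 10.0.0.1.
gateway = [deny_rule(23), route_rule("10.", "10.0.0.1")]
packet = PacketState("192.168.1.5", 40000, "10.1.2.3", 80, loc="192.168.1.1")
print(device_step(gateway, packet))  # same header, new loc = 10.0.0.1
```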
B. Network devices We integrate the policies of different network devices including firewalls, routers, NAT and IPSec gateways. The details of their policies and how they are encoded into BDDs are discussed thoroughly in [3]. However, we have modified the encoding of hosts to reflect the request transformations performed by the services running on top of each host. A host may be configured to run one or multiple services, each of which has its own access-control list, as will be discussed in Section IV. The service configuration may also specify a set of possible request transformations where the incoming request is transformed into another (sometimes completely new) request. For example, a request to a web server can be translated into an NFS request to load a user's home page. The new request will be carried through the network over packets. In our initial model the host receives packets and then forwards them to the application layer within the host itself; it cannot forward them to another host. We need to modify this model so that the host can forward packets to other hosts in order to support the request transformations performed by the services running on top of it. IV. APPLICATION LAYER MODEL We also model the application layer as a finite state machine. The state space is the cross-product of the application layer request properties and the request's possible locations in the network. The request properties include the fields that determine the service response to a specific request:
σ_p : usr × role × obj × act × loc × srv → {true, false}
where usr is the 32-bit user ID, role the 32-bit ID of the role to which the user belongs, obj the 32-bit object ID, act the 16-bit action ID, loc the 32-bit IP address of the device currently processing the request, and srv the 16-bit service ID.
In the application layer model the devices in the network are modeled by describing how they change the requests. Only the devices that operate at the application level are considered (i.e., the devices that have a defined users list or services running on top of them). Here, we describe how we define access-control rights for service requests, and how we model these services and integrate them into the application layer state transition diagram. A. Application layer access-control In order to have a homogeneous policy definition across applications, we revert to a simplified RBAC model as a way to specify all application requests and, consequently, the access-control policy. As in a firewall policy, the access-control list of application layer services is defined by specifying an action (permit or deny) for the requests satisfying certain criteria. These criteria are defined using the request fields ⟨user, role, object, action⟩, or ⟨u, r, o, a⟩ for short. We assume that each host has a list of potential users who can use it to send requests. This list can simply be set to “any”, to indicate that all defined users can access the host, which enables a more powerful adversary model against which we want to verify the robustness of the policy. Also, another assumption is that any user can assume any role. This enables a more flexible usage of the model to incorporate more types of services. It is even possible to use one of the request fields with a slightly different meaning.
For example, in a web server model, an action can be a POST or a GET, the role can be a logged-in versus a guest visitor, and the object and user have their obvious meanings. On the other hand, for database servers, we might have users, roles, actions, and resources used in their original meaning. When a service receives a request from another device, it first verifies the request against its access-control policy. If the request satisfies the access-control policy, it is forwarded to the service to be executed or transformed, as shown in Fig. 2. If the request does not satisfy the access policy, it is dropped (i.e., there is no valid transition in the finite state machine that goes to the required service). A policy is typically defined as a list of tuples with an assigned action:
user  role   resource  action  decision
; user black-listing
1     *      *         *       deny
2     *      *         *       deny
; admin account
100   1      *         *       permit
; guests can only read
*     guest  1-50      3       permit
; a read-only resource
*     *      60        4       permit
*     *      60        *       deny
As in firewall policies, we use a first-match methodology. For example, the last two rules allow read access, and then deny every other action. Also, the first few black-listing rules do not conflict with the guest account rule that appears later in the policy. Based on common practice in the area of application-level and RBAC policies, we believe that it should never be the case that a user or role is specified as a range. The values of user and role IDs are arbitrary, so having a range in the specification is unlikely to have practical value. This fact does not affect our implementation, though (i.e., we support single values, ranges, or “any” values in all four fields of the access-control rules).
Fig. 2. Service Model. S1, S2, S3, and S4 represent services running on different hosts. The dashed lines represent application requests. Requests are subjected to the access control list of the target service.
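The first-match evaluation of such an access-control list is easy to state in code. The sketch below mirrors the example policy above; the helper names, the handling of wildcards and ranges, and the implicit default deny are our own illustrative assumptions, not the tool's implementation.

```python
def field_matches(rule_value, request_value):
    """A rule field may be '*', a single value, or a 'lo-hi' range."""
    if rule_value == "*":
        return True
    if isinstance(rule_value, str) and "-" in rule_value:
        lo, hi = (int(x) for x in rule_value.split("-"))
        return lo <= int(request_value) <= hi
    return rule_value == request_value

def check_request(policy, user, role, obj, action):
    """First-match evaluation over an ordered list of
    (user, role, resource, action, decision) tuples."""
    for r_user, r_role, r_obj, r_act, decision in policy:
        if (field_matches(r_user, user) and field_matches(r_role, role)
                and field_matches(r_obj, obj) and field_matches(r_act, action)):
            return decision
    return "deny"  # assumed implicit default when no rule matches

# The example policy from the text, in the same order.
policy = [
    (1, "*", "*", "*", "deny"),           # user black-listing
    (2, "*", "*", "*", "deny"),
    (100, 1, "*", "*", "permit"),         # admin account
    ("*", "guest", "1-50", 3, "permit"),  # guests can only read
    ("*", "*", 60, 4, "permit"),          # a read-only resource
    ("*", "*", 60, "*", "deny"),
]
print(check_request(policy, user=100, role=1, obj=60, action=5))      # permit
print(check_request(policy, user=7, role="guest", obj=60, action=5))  # deny
```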
B. State representation Requests that pass the access-control phase are forwarded to the execution phase of the service. We have simplified the execution phase to one of two options, as suggested in Fig. 2. A request can be executed on the service itself, which means that the request's lifetime ends at this phase and no further events are triggered. The other possibility is that the service transforms the request into another form by modifying one or more fields (i.e., user, role, object or action) and sends it to another service running on the same or a different host. For example, a request to a web server can be translated into an NFS request to load a user's home page. Each request transformation is associated with a packet flow at the network level; the host should be able to send the appropriate packet to its gateway based on the service transformation. Only the network devices that support application layer services, represented by hosts in our model, are included in the application layer model. The requests that leave a host may come from two sources: either a user operating directly on the host or through a service, or a request transformed from another one. We do not require the destination of each request to be defined in the configuration. We assume that any request instantiated in any host can be directed to any service in the network whose access control list allows the request to pass. To build the transition relation for each host in the network we need the following inputs.
• The set U of users who can access the host. Each user is represented by its unique ID in the system. It can also be expressed as a ⟨user, role⟩ pair.
• The set T of possible request transformations that can be performed in the host. The configuration may specify the exact ID and location of the target service, or it can be anonymous.
• The set P of access control policies for each service running in the network. These policies are encoded as BDD expressions before building the transition relation of the application layer. Each policy in the set corresponds to a particular service ID (the port number of the service can be used as its ID) running on a particular host.
Let us assume that the list U_H of ⟨user, role⟩ pairs represents the users who can access the host H. We first need to encode the possible states that result from having these users on the host H (recall that the state is, in this case, the product of the request properties and the location in which the request exists):
U_BDD = ⋁_{i ∈ U_H} (usr = u_i ∧ role = r_i ∧ loc = H)    (1)
where u_i and r_i are the user and role IDs of item i in the users list U_H. To find the transitions, we need to find out which services can be reached starting from the states defined in the expression U_BDD (i.e., any service whose access-control policy allows the defined requests to pass):
T_u = ⋁_{i ∈ indices(P)} (U_BDD ∧ P(i) ∧ loc′ = l_i ∧ srv′ = s_i)    (2)
where P(i), l_i, and s_i are the policy, location, and service ID of the target service i, respectively. The variables loc′ and srv′ represent the location and service ID variables in the next state of the transition. The expression T_u does not include the transitions that result from request transformations by the services running on the host. The following represents the transition that results from one transformation performed by the service i:
(usr = u_i ∧ role = r_i ∧ obj = o_i ∧ act = a_i ∧ loc = H ∧ srv = s_i) ∧ (usr′ = u′_i ∧ role′ = r′_i ∧ obj′ = o′_i ∧ act′ = a′_i ∧ P′(s′_i) ∧ loc′ = l′_i ∧ srv′ = s′_i)    (3)
The values ⟨u_i, r_i, o_i, a_i⟩ are the properties of the initial request and the values ⟨u′_i, r′_i, o′_i, a′_i⟩ represent the properties of the transformed request. The values P′(s′_i), l′_i and s′_i are the policy, location and service ID of the target service to which the request should be transformed. Note that we use P′(s′_i) instead of P(s′_i) to indicate that the transformed request (and not the initial request) should pass the target service's access-control policy in order to complete the transition. The disjunction of all the transitions caused by all the possible transformations, together with the expression T_u calculated earlier, formulates the total transition relation of the host H.
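As a small, explicit-state analogue of Equations (1) and (2), the sketch below enumerates the request states a host's user list can originate and keeps only those target services whose access-control policy admits the request; the real tool performs the corresponding computation symbolically on BDDs, and the data layout and the toy policy checker here are illustrative assumptions.

```python
def user_states(users, host, objects, actions):
    """Explicit-state analogue of Eq. (1): request states originated by
    the host's user list, over small universes of objects and actions."""
    return {(u, r, o, a, host)
            for (u, r) in users for o in objects for a in actions}

def host_transitions(users, host, objects, actions, services, check_request):
    """Explicit-state analogue of Eq. (2): for every originating request
    state, keep the (location, service) targets whose access-control
    policy permits it. `services` maps (location, service_id) -> policy."""
    transitions = set()
    for (u, r, o, a, loc) in user_states(users, host, objects, actions):
        for (target_loc, srv_id), policy in services.items():
            if check_request(policy, u, r, o, a) == "permit":
                transitions.add(((u, r, o, a, loc), (target_loc, srv_id)))
    return transitions

# Toy policy: the service on 10.0.0.5:80 permits only role 1.
def toy_check(policy, u, r, o, a):
    return "permit" if r in policy else "deny"

services = {("10.0.0.5", 80): {1}}   # policy = set of permitted role IDs
users_on_host = [(100, 1), (7, "guest")]
transitions = host_transitions(users_on_host, "192.168.1.9",
                               objects=[60], actions=[3],
                               services=services, check_request=toy_check)
print(sorted(transitions))  # only the role-1 user reaches the service
```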
V. QUERYING THE MODEL A query in our system takes the form of a Boolean expression that specifies some properties over packet flows, requests and locations, with temporal logic criteria specified using CTL operators. By evaluating the given expression in the context of the built state machine (i.e., states and transitions), we obtain the satisfying assignments to that expression, represented in the same symbolic representation as the model itself. The simplest form of the result can be the constant expression “true” (e.g., the property is always satisfied), or “false” (e.g., no one violates the required property), or it can be any subset of the space that satisfies the property (e.g., only flows with port 80, or traffic that starts from this location, etc.). In the following subsections, we will go over a few examples of reachability and security properties. For each, we show how to construct the query (i.e., the Boolean expression), and what the results look like. Moreover, we discuss how to write the script that extracts the results and data fields in the intended format that makes sense for each specific query. The aim of this presentation is to show the applicability of the system to many types of properties, as well as to show the expressive power of the model. A. Model Checking We have described how to construct a transition relation for each device in the network. Each such transition relation describes a list of outgoing transitions for the device it models. The formulas are constructed with the requirement that the current location be equal to the device being modeled, so these transitions can only be taken when a packet or a request is at the device. To get the transition relation for the entire network, we simply take the disjunction of the formulas for the individual devices. This is applied to both models (the network and the application layer models). The current location of a packet or a request will match the location of at most one device in the network, and so only that device's transitions will apply. Recall that this global transition relation is a characteristic function for transitions in the model. If we substitute the values for a packet that is in the system into the current state variables of the transition relation, what we are left with is a formula describing what the possible next states of that packet look like. We have all the machinery to perform symbolic model checking. We use BDDs for all the formulas described above and we use standard model checking algorithms to explore the state space and compute states that satisfy various CTL properties. The BuDDy BDD package provides all the required operations (including quantification). For a much more complete description of symbolic model checking, the reader is encouraged to see [7].
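For readers unfamiliar with these algorithms, the sketch below shows the standard fixed-point computation behind an operator such as EF, written over an explicit set of transitions; the tool performs the same backward-reachability iteration on BDDs. The set-based representation is an illustrative simplification.

```python
def preimage(transitions, states):
    """States with at least one successor inside `states`;
    `transitions` is a set of (state, next_state) pairs."""
    return {s for (s, t) in transitions if t in states}

def ef(transitions, target):
    """EF(target): least fixed point of  X = target ∪ pre(X)."""
    x = set(target)
    while True:
        expanded = x | preimage(transitions, x)
        if expanded == x:
            return x
        x = expanded

# Tiny example: a -> b -> c, with d unreachable from the others.
transitions = {("a", "b"), ("b", "c")}
print(sorted(ef(transitions, {"c"})))  # ['a', 'b', 'c']: all can reach c
```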
The network layer model checker and the application layer model checker are encoded separately. Each of them works on different variables. However, we may need to verify some requirements using both of them together. Although application layer requests are transmitted from one application to another, the requests are encapsulated inside network packets. We do not require a static one-to-one mapping between each request and the packet flow that should be used to transfer the request. The mapping can be expressed in the query itself by specifying precise network packet characteristics. For this purpose we use the variable loc, which is common between the two models and has the same meaning. Fig. 3 shows how the two models are used together to verify the requirements.
Fig. 3. Using two models to run a query.
B. Query structure and features The query in our model checkers retrieves the states that satisfy a given condition. The condition is expressed by restricting some variables in the model checker to a given value or by using CTL operators to express a temporal condition. We also need to specify what information is to be retrieved about the states that satisfy the query (i.e., a list of variables to be retrieved). An example query can look like this:
Q3 = [loc(10.12.13.14) ∧ EF((¬loc(10.0.0.0/8)) ∧ (¬Q2))]
Q3 : extractField loc dport
Q3 : listBounded 20 loc dport
The query Q3 is defined by the given expression (i.e., what flows are in a given location (10.12.13.14) and in the future will be outside of the domain (10.0.0.0/8) and do not satisfy a previously defined query (Q2)). The second and third lines are used to format the result of the query. The second line tells the query engine that we are only interested in the variables loc and dport. The third line specifies that we need to display only the first 20 satisfying assignments. If there is no satisfying assignment for the given query, nothing is returned. To handle queries over both the network layer and application layer models, we introduced the concept of a sub-query. Each sub-query is applied to one model. The application-layer sub-query should not include variables related to the network layer model, such as source or destination addresses and port numbers, and vice versa. A query can include one or more sub-queries based on the following cases.
• It can include only one sub-query. In this case the query is applied only to the appropriate model. The query engine detects the appropriate model based on the variables used.
• It can include more than one sub-query of the same type linked by logical operators such as AND, OR, IMPLIES, etc. In this case all the sub-queries are executed on the appropriate model and the final result is calculated by applying the specified operation. The results of the different sub-queries in this case are identical in terms of the number and type of the variables returned, so the linking operation can be directly applied.
• It can include multiple sub-queries of different types linked by logical operators. In this case, the results of the different types of sub-queries have different variables and we cannot apply the linking operation directly. The location variable (loc) is common between the two models and it has the same meaning and value for the same device. For different types of sub-queries we apply the logical operations based on the location variable only, as illustrated in the sketch after this list. For example, if an application-layer sub-query is combined with a network-layer sub-query by the AND operation, we calculate the result of both sub-queries and then calculate the intersection between the location values in both results. Only requests and packet flows whose location falls within the intersection are returned.
The result of a query is a list of states that satisfy the query expression. In our model we may have two types of results. The first is a set of network-layer states, each represented as packet flow characteristics and a location. The second is a set of application-layer states, each represented as request characteristics and a location. The existence of these two types depends on the types of sub-queries included within the query script.
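A minimal explicit-state sketch of the cross-model combination described above: an application-layer result set and a network-layer result set are combined under AND by intersecting on the shared loc variable. The tuple layouts and names are assumptions for illustration; the tool applies the analogous operation to BDD-encoded result sets.

```python
def combine_and(app_states, net_states):
    """AND across models: keep only states whose location appears in the
    results of both sub-queries. `app_states` holds (request, loc) pairs
    and `net_states` holds (flow, loc) pairs."""
    common = {loc for _, loc in app_states} & {loc for _, loc in net_states}
    return ({(req, loc) for req, loc in app_states if loc in common},
            {(flow, loc) for flow, loc in net_states if loc in common})

# Toy example: only location 10.0.0.5 satisfies both sub-queries.
app = {(("alice", "admin", "file1", "read"), "10.0.0.5"),
       (("bob", "guest", "file2", "write"), "10.0.0.9")}
net = {(("192.168.1.5", 40000, "10.0.0.5", 80), "10.0.0.5")}
print(combine_and(app, net))
```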
C. Example properties Property 1: Conflicting network and application access-control. (a): Given a user location and user ID, does the current configuration allow the user to access the server machine, while the application layer access-control blocks the connection? The query shown in Table I specifies the initial properties of flows with certain user information (e.g., a specific source IP "user_addr", and a user identifier in the application layer request) targeted towards a service residing elsewhere (i.e., server, port0). If there is an inconsistency in the configuration, the query returns a list of requests that cannot access the specified service and another list of packet flows that can eventually reach the host's network layer. We can see that the query combines two different types of sub-queries and restricts the location to a particular source machine. Each sub-query is surrounded by square brackets “[ ]”. (b): Does the current configuration block the user's access to the server machine through network layer filtering, while the application's access-control layer permits such a connection? As in the previous property (a), we try to see if a request that is permitted by the application layer access control will never reach the service. This means that somewhere before reaching the server hosting the service there is a network layer device that blocks the traffic or fails to route it correctly. We use the application layer model to find those requests that can pass from the source to a particular service, and we use the network model to find whether the underlying packet flows that should carry the requests are allowed to flow from the source to the appropriate destination. Property 2: Can a user access a resource under different credentials, if he is prohibited from accessing it under his original identity? In this property, we check if a certain user can masquerade under another identity to access a resource. This forms a back-door to this specific object-action pair. A straightforward example is a user who does not have access to an NFS server but can reach its content via a web server that retrieves it in the form of web pages. This can be achieved by an improper request transformation in a service that should not be reachable by the specified user. This is defined formally by evaluating the expression specifying which users cannot access an object, while the object can eventually be accessed if the constraint on the user identity is removed. Property 3: What access rights does an object require? (a): What roles can user u use to access object o? It is sometimes essential to know what roles a user can manifest when accessing a specific object or a group of objects. The query consists of checking the space of requests that can pass through the network and RBAC filtering to reach our object of interest. By restricting the user part of the space, we get the possible roles that can be used. We can also restrict the action if needed. A practically useful addition is a further restriction on the origin of the request: filtering on the location and source address from which the request originated, excluding those of the server itself. In Table I we show only the condition part of the query, neglecting the result-format part; we can specify that only the roles and/or any other fields be returned in the results. (b): Which users can access object o via a role r? This is a similar query to the previous one. It concerns the different users who can access a given object in a certain capacity. For example, we might want to know who can access a critical file as an administrator. Also, we can add extra restrictions to see who can access this object for writing rather than just reading. Property 4: Are there any conflicts within the application layer access-control? (a): Are there any inconsistencies in the allowed actions for a specific object? Such conflicts can arise within the same policy or across policies. For example, if a user is granted write access to an object then, most probably, read access should be allowed as well. This query is application dependent, and the priority between actions has to be specified explicitly (e.g., ‘delete’ > ‘write’ > ‘read’). We write the query for a service to check if it is possible for some user/role to reach the service via a higher action, but not with a lower one.
We present the general form of such a query in Table I; the profiles [high security requirements] and [low security requirements] can be replaced with any combination of constraints on the request fields. For example, to compare rights for reading and writing, the high security profile may be [obj(o) ∧ act(wr)] and the low security profile may be represented as [obj(o) ∧ act(rd)] for the particular object o. (b): Are role-role relations consistent? As in the previous property (a), we might need to verify that the order of role privileges is maintained. In other words, a more powerful role should always be capable of performing all actions possible for a weaker role. For example, an administrator should be able to perform at least everything doable by a staff member, and guests should never have more access than other roles. This is defined by checking whether the space of possible actions-over-objects that can be performed by role1 but not role2 is empty (given that role1 < role2). In this case the high and low security profiles can be represented as [role(role1) ∧ obj(o) ∧ act(a)] and [role(role2) ∧ obj(o) ∧ act(a)] respectively.
TABLE I. EXAMPLES FOR REACHABILITY AND SECURITY PROPERTIES
P1 (a): Requests reaching the host but not the service it is running:
loc(user_addr) ∧ [src(user_addr) ∧ dest(server) ∧ dport(port0) ∧ EF(loc(server))] ∧ [usr(userID) ∧ ¬EF(loc(server) ∧ srv(port0))]
P1 (b): Requests reaching the service but not the host itself:
loc(user_addr) ∧ [usr(userID) ∧ EF(loc(server) ∧ srv(port0))] ∧ [src(user_addr) ∧ dest(server) ∧ dport(port0) ∧ ¬EF(loc(server))]
P2: Backdoor: a user is denied direct access to a service, but can use another service to access it indirectly:
loc(user_addr) ∧ [usr(userID) ∧ ¬EF(usr(userID) ∧ obj(o) ∧ loc(server) ∧ srv(port0)) ∧ EF(usr(¬userID) ∧ obj(o) ∧ loc(server) ∧ srv(port0))] ∧ [EF(loc(server) ∧ dport(port0))]
P3 (a): What roles and actions can a user use to access a specific object from outside the server domain:
¬loc(server) ∧ [EF(loc(server) ∧ srv(port0) ∧ usr(u) ∧ obj(o))] ∧ [¬src(server) ∧ EF(loc(server) ∧ dport(port0))]
P3 (b): What users can access a given object:
¬loc(server) ∧ [¬src(server) ∧ EF(loc(server) ∧ dport(port0))] ∧ [EF(loc(server) ∧ srv(port0) ∧ role(r) ∧ obj(o) ∧ act(a))]
P4: Is there any inconsistency between the rights of low- and high-privilege requests:
EF(loc(server) ∧ srv(port0) ∧ [high security requirements]) ∧ ¬EF(loc(server) ∧ srv(port0) ∧ [low security requirements])
VI. RELATED WORK There has been significant research effort in the area of configuration verification and management in the past few years. We can classify the work in this area into two main approaches: top-down and bottom-up. The top-down approaches [20], [5] create clean-slate configurations based on high-level requirements, while the bottom-up approaches [13], [1], [24] analyze the existing configuration to verify desired properties. We focus our discussion on the bottom-up approach as it is closer to our work in this paper. There has been considerable work recently on detecting misconfigurations in routing and firewalls. Many of these approaches are specific to BGP misconfiguration [9], [17], [11], [4]; [23], [13], [1], [24] focused on conflict analysis of firewall configurations. A BDD-based modeling and taxonomy of IPSec configuration conflicts was presented in [2], [13]. FIREMAN [24] uses BDDs to show conflicts in Linux iptables configurations.
In [19] and [21], the authors developed a firewall analysis tool to perform customized queries on a set of filtering rules of a firewall, but no general model of network connections is used in this work. In the field of distributed firewalls, current research mainly focuses on the management of distributed firewall policies. The first generation of global policy management technology is presented in [12], which proposes a global policy definition language along with algorithms for verifying the policy and generating filtering rules. In [6], the authors adopted a better approach by using a modular architecture that separates the security policy from the underlying network topology, allowing flexible modification of the network topology without the need to update the security policy. Similar work has been done in [14] with a procedural policy definition language, and in [16] with an object-oriented policy definition language. In terms of distributed firewall policy enforcement, a novel architecture is proposed in [15] where the authors suggest using a trust management system to enforce a centralized security policy at individual network endpoints based on access rights granted to users or hosts. We found that none of the published work in this area addressed the problem of discovering conflicts in distributed firewall environments. A variety of approaches have been proposed in the area of policy conflict analysis. The most significant attempt at IPSec policy analysis is proposed in [10]. The technique simulates IPSec processing by tracking the protection applied to the traffic in every IPSec device. At any point in the simulation, if packet protection violates the security policy requirements, a policy conflict is reported. Although this approach can discover IPSec policy violations in a certain simulation scenario, there is no guarantee that it discovers every possible violation that may exist. In addition, the proposed technique only discovers IPSec conflicts resulting from incorrect tunnel overlapping, but does not address the other types of conflicts that we study in this research. Other works attempt to create general models for analyzing network configuration [8], [22]. An approach for formulating and deriving sufficient conditions for connectivity constraints is presented in [8]. The static analysis approach [22] is one of the most interesting works close to ConfigChecker. This work uses a graph-based approach to model connectivity of network configurations and uses set operations to perform static analysis. The transitive closure, as opposed to a fixed point in our approach, is computed; thus, it seems that all possible paths are computed explicitly. In addition, considering security devices and properties, providing a rich query interface based on our CTL extension, and utilizing BDD optimizations are major advantages of our work. Anteater [18] is another interesting tool for checking invariants in the data plane. It checks high-level network invariants, represented as instances of Boolean satisfiability problems (SAT), against network state using a SAT solver, and reports counterexamples for violations, if they exist. In conclusion, although this body of work has had a significant impact on the field, it provides limited analysis due to its restriction to specific network or application domains. Unlike the previous work, our work offers a global configuration verification that is comprehensive, scalable and highly expressive.
VII. CONCLUSION We presented an extension to the ConfigChecker tool to incorporate both network and application configurations in a unified system across the entire network. Our extended system models the configuration of various devices in the network layer (hubs, switches, routers, firewalls, IPSec gateways) and the access control of application layer services, including multiple levels of request translation. Network and system configuration can be modeled together and used to verify properties using CTL-embedded functions translated into Boolean operations. We show that we can separate the variables into two model checkers to reduce the state space and the required resources; yet both models can be used to run combined queries. Our future work includes enhancements in the model's performance for even faster execution and lower construction time. Also, we plan to extend the supported device and node types to add more virtual devices and compound devices that can incorporate multi-node functionality, as in some modern network-based devices. Moreover, we plan to add a user interface to facilitate interactive execution of queries as well as updating and editing of configurations, for more practical deployment of the tool. We will also try to find a practical mapping scheme between application requests and corresponding packet flows to automatically detect the flows required to communicate a request between different services. REFERENCES [1] E. Al-Shaer and H. Hamed. Discovery of policy anomalies in distributed firewalls. In Proceedings of IEEE INFOCOM'04, March 2004. [2] E. Al-Shaer and H. Hamed. Taxonomy of conflicts in network security policies. IEEE Communications Magazine, 44(3), March 2006. [3] E. Al-Shaer, W. Marrero, A. El-Atawy, and K. Elbadawi. Network configuration in a box: Towards end-to-end verification of network reachability and security. In ICNP, pages 123–132, 2009. [4] R. Alimi, Y. Wang, and Y. R. Yang. Shadow configuration as a network management primitive. In SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 conference on Data communication, pages 111–122, New York, NY, USA, 2008. ACM. [5] H. Ballani and P. Francis. Conman: a step towards network manageability. SIGCOMM Comput. Commun. Rev., 37(4):205–216, 2007. [6] Y. Bartal, A. Mayer, K. Nissim, and A. Wool. Firmato: A novel firewall management toolkit. ACM Trans. Comput. Syst., 22(4):381–420, 2004. [7] J. Burch, E. Clarke, K. McMillan, D. Dill, and J. Hwang. Symbolic model checking: 10^20 states and beyond. Journal of Information and Computation, 98(2):1–33, June 1992. [8] R. Bush and T. Griffin. Integrity for virtual private routed networks. In IEEE INFOCOM 2003, volume 2, pages 1467–1476, 2003. [9] N. Feamster and H. Balakrishnan. Detecting BGP configuration faults with static analysis. In NSDI, 2005. [10] Z. Fu, F. Wu, H. Huang, K. Loh, F. Gong, I. Baldine, and C. Xu. IPSec/VPN security policy: Correctness, conflict detection and resolution. In Policy'2001 Workshop, pages 39–56, January 2001. [11] T. G. Griffin and G. Wilfong. On the correctness of IBGP configuration. In SIGCOMM '02: Proceedings of the ACM SIGCOMM 2002 conference on Data communication, pages 17–29, 2002. [12] J. Guttman. Filtering posture: Local enforcement for global policies. In IEEE Symposium on Security and Privacy, pages 120–129, May 1997. [13] H. Hamed, E. Al-Shaer, and W. Marrero. Modeling and verification of IPSec and VPN security policies. In IEEE International Conference on Network Protocols (ICNP'2005), Nov. 2005. [14] S. Hinrichs.
Policy-based management: Bridging the gap. In 15th Annual Computer Security Applications Conference (ACSAC'99), pages 209–218, December 1999. [15] S. Ioannidis, A. Keromytis, S. Bellovin, and J. Smith. Implementing a distributed firewall. In 7th ACM Conference on Computer and Communications Security (CCS'00), pages 190–199, November 2000. [16] I. Luck, C. Schafer, and H. Krumm. Model-based tool assistance for packet-filter design. In IEEE Workshop on Policies for Distributed Systems and Networks (POLICY'01), pages 120–136, January 2001. [17] R. Mahajan, D. Wetherall, and T. Anderson. Understanding BGP misconfiguration. In SIGCOMM '02: Proceedings of the ACM SIGCOMM 2002 conference on Data communications, pages 3–16, New York, NY, USA, 2002. ACM. [18] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T. King. Debugging the data plane with Anteater. SIGCOMM Comput. Commun. Rev., 41(4):290–301, Aug. 2011. [19] A. Mayer, A. Wool, and E. Ziskind. Fang: A firewall analysis engine. In IEEE Symposium on Security and Privacy (SSP'00), pages 177–187, May 2000. [20] S. Narain. Network configuration management via model finding. In LISA, pages 155–168, 2005. [21] A. Wool. A quantitative study of firewall configuration errors. IEEE Computer, 37(6):62–67, 2004. [22] G. G. Xie, J. Zhan, D. Maltz, H. Zhang, A. Greenberg, G. Hjalmtysson, and J. Rexford. On static reachability analysis of IP networks. In IEEE INFOCOM 2005, volume 3, pages 2170–2183, 2005. [23] Y. Yang, C. U. Martel, and S. F. Wu. On building the minimum number of tunnels: An ordered-split approach to manage IPSec/VPN tunnels. In 9th IEEE/IFIP Network Operation and Management Symposium (NOMS 2004), pages 277–290, May 2004. [24] L. Yuan, J. Mai, Z. Su, H. Chen, C. Chuah, and P. Mohapatra. FIREMAN: A toolkit for firewall modeling and analysis. In IEEE Symposium on Security and Privacy (SSP'06), May 2006.
JMLR: Workshop and Conference Proceedings, 2012. 11th International Conference on Grammatical Inference.
Bootstrapping Dependency Grammar Inducers from Incomplete Sentence Fragments via Austere Models
Valentin I. Spitkovsky (valentin@cs.stanford.edu), Computer Science Department, Stanford University and Google Research, Google Inc. Hiyan Alshawi (hiyan@google.com), Google Research, Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA, 94043. Daniel Jurafsky (jurafsky@stanford.edu), Departments of Linguistics and Computer Science, Stanford University, Stanford, CA, 94305. Editors: Jeffrey Heinz, Colin de la Higuera, and Tim Oates
Abstract Modern grammar induction systems often employ curriculum learning strategies that begin by training on a subset of all available input that is considered simpler than the full data. Traditionally, filtering has been at granularities of whole input units, e.g., discarding entire sentences with too many words or punctuation marks. We propose instead viewing inter-punctuation fragments as atoms, initially, thus making some simple phrases and clauses of complex sentences available to training sooner. Splitting input text at punctuation in this way improved our state-of-the-art grammar induction pipeline. We observe that the resulting partial data, i.e., mostly incomplete sentence fragments, can be analyzed using reduced parsing models which, we show, can be easier to bootstrap than more nuanced grammars.
Starting with a new, bare dependency-and-boundary model (DBM-0), our grammar inducer attained 61.2% directed dependency accuracy on Section 23 (all sentences) of the Wall Street Journal corpus: more than 2% higher than previously published results for this task. Keywords: Dependency Grammar Induction; Unsupervised Dependency Parsing; Curriculum Learning; Partial EM; Punctuation; Unsupervised Structure Learning. 1. Introduction “Starting small” strategies (Elman, 1993) that gradually increase complexities of training models (Lari and Young, 1990; Brown et al., 1993; Frank, 2000; Gimpel and Smith, 2011) and/or input data (Brent and Siskind, 2001; Bengio et al., 2009; Krueger and Dayan, 2009; Tu and Honavar, 2011) have long been known to aid various aspects of language learning. In dependency grammar induction, pre-training on sentences up to length 15 before moving on to full data can be particularly effective (Spitkovsky et al., 2010a,b, 2011a,b). Focusing on short inputs first yields many benefits: faster training, better chances of guessing larger fractions of correct parse trees, and a preference for more local structures, to name a few. But there are also drawbacks: notably, unwanted biases, since many short sentences are not representative, and data sparsity, since most typical complete sentences can be quite long. We propose starting with short inter-punctuation fragments of sentences, rather than with small whole inputs exclusively. Splitting text on punctuation allows more and simpler word sequences to be incorporated earlier in training, alleviating data sparsity and complexity concerns. Many of the resulting fragments will be phrases and clauses, since punctuation correlates with constituent boundaries (Ponvert et al., 2010, 2011; Spitkovsky et al., 2011a), and may not fully exhibit sentence structure. Nevertheless, we can accommodate these and other unrepresentative short inputs using our dependency-and-boundary models (DBMs), which distinguish complete sentences from incomplete fragments (Spitkovsky et al., 2012). DBMs consist of overlapping grammars that share all information about head-dependent interactions, while modeling sentence root propensities and head word fertilities separately, for different types of input. Consequently, they can glean generalizable insights about local substructures from incomplete fragments without allowing their unrepresentative lengths and root word distributions to corrupt grammars of complete sentences. In addition, chopping up data plays into other strengths of DBMs — which learn from phrase boundaries, such as the first and last words of sentences — by increasing the number of visible edges.
Figure 1: Three types of input: (a) fragments lacking sentence-final punctuation are always considered incomplete; (b) sentences with trailing but no internal punctuation are considered complete though unsplittable; and (c) text that can be split on punctuation yields several smaller incomplete fragments, e.g., Bach’s, Air and followed. In modeling stopping decisions, Bach’s is still considered left-complete — and followed right-complete — since the original input sentence was complete. The figure's examples are: (a) “Odds and Ends”, an incomplete fragment; (b) “It happens.”, a complete sentence that cannot be split on punctuation; and (c) “Bach’s ‘Air’ followed.”, a complete sentence that can be split into three fragments.
2. Methodology All of our experiments make use of DBMs, which are head-outward (Alshawi, 1996) class-based models, to generate projective dependency parse trees for the Penn English Treebank's Wall Street Journal (WSJ) portion (Marcus et al., 1993). Instead of gold parts-of-speech, we use context-sensitive unsupervised tags,1 obtained by relaxing a hard clustering produced by Clark's (2003) algorithm using an HMM (Goldberg et al., 2008). As in our original setup without gold tags (Spitkovsky et al., 2011b), training is split into two stages of Viterbi EM (Spitkovsky et al., 2010b): first on shorter inputs (15 or fewer tokens), then on most sentences (up to length 45). Evaluation is against the reference parse trees of Section 23.2 Our baseline system learns DBM-2 in Stage I and DBM-3 (with punctuation-induced constraints) in Stage II, starting from uniform punctuation-crossing attachment probabilities (see Appendix A for details of DBMs). Smoothing and termination of both stages are as in Stage I of the original system. This strong baseline achieves 59.7% directed dependency accuracy — somewhat higher than our previous state-of-the-art result (59.1%, see also Table 1). In all experiments we will only make changes to Stage I's training, initialized from the same exact trees as in the baselines and affecting Stage II only via its initial trees.
1. http://nlp.stanford.edu/pubs/goldtags-data.tar.bz2:untagger.model
2. Unlabeled dependencies are converted from labeled constituents using deterministic “head-percolation” rules (Collins, 1999) — after discarding punctuation marks, tokens that are not pronounced where they appear (i.e., having gold part-of-speech tags $ and #) and any empty nodes — as is standard practice.
Table 1: Directed dependency and exact tree accuracies (DDA / TA) for our baseline, experiments with split data, and previous state-of-the-art on Section 23 of WSJ.
                                   Stage I          Stage II           DDA   TA
Baseline (§2)                      DBM-2            constrained DBM-3  59.7  3.4
Experiment #1 (§3)                 split DBM-2      constrained DBM-3  60.2  3.5
Experiment #2 (§4)                 split DBM-i      constrained DBM-3  60.5  4.9
Experiment #3 (§5)                 split DBM-0      constrained DBM-3  61.2  5.0
(Spitkovsky et al., 2011b, §5.2)   constrained DMV  constrained L-DMV  59.1  —
Table 2: Feature-sets parametrizing dependency-and-boundary models three, two, i and zero: if comp is false, then so are comp_root and both comp_dir values; otherwise, comp_root is true for unsplit inputs, comp_dir for prefixes (if dir = L) and suffixes (when dir = R).
Model                   P_ATTACH (root-head)        P_ATTACH (head-dependent)  P_STOP (adjacent/not)
DBM-3 (Appendix A)      (⋄, L, c_r, comp_root)      (c_h, dir, c_d, cross)     (comp_dir, c_e, dir, adj)
DBM-2 (§3, Appendix A)  (⋄, L, c_r, comp_root)      (c_h, dir, c_d)            (comp_dir, c_e, dir, adj)
DBM-i (§4, Appendix B)  (⋄, L, c_r, comp_root)      (c_h, dir, c_d)            (comp_dir, c_e, dir)
DBM-0 (§5, Appendix B)  (⋄, L, c_r) iff comp_root   (c_h, dir, c_d)            (comp_dir, c_e, dir)
3. Experiment #1 (DBM-2): Learning from Fragmented Data In our experience (Spitkovsky et al., 2011a), punctuation can be viewed as implicit partial bracketing constraints (Pereira and Schabes, 1992): assuming that some (head) word from each inter-punctuation fragment derives the entire fragment is a useful approximation in the unsupervised setting. With this restriction, splitting text at punctuation is equivalent to learning partial parse forests — partial because longer fragments are left unparsed, and forests because even the parsed fragments are left unconnected (Moore et al., 1995).
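As a concrete illustration of the splitting scheme (cf. Figure 1), the sketch below slices an input at punctuation and records, per fragment, the completeness flags used for root and stopping decisions: only an unsplit sentence keeps root-completeness, while the leftmost and rightmost fragments of a complete sentence remain left- and right-complete respectively. The tokenization, the chosen punctuation set, and the flag names are simplifications, not the paper's exact preprocessing.

```python
import re

SPLIT_PUNCT = r'[,.;:!?"“”()]'   # assumed set of splitting punctuation marks

def split_input(sentence):
    """Split one input at punctuation and assign completeness flags."""
    is_complete = bool(re.search(r'[.!?"”]\s*$', sentence))  # sentence-final punctuation
    fragments = [t.strip() for t in re.split(SPLIT_PUNCT, sentence) if t.strip()]
    annotated = []
    for i, frag in enumerate(fragments):
        annotated.append({
            "text": frag,
            # roots: only unsplit, complete sentences stay complete
            "root_complete": is_complete and len(fragments) == 1,
            # stopping decisions: edges of a complete sentence keep their status
            "left_complete": is_complete and i == 0,
            "right_complete": is_complete and i == len(fragments) - 1,
        })
    return annotated

for example in ["Odds and Ends", "It happens.", 'Bach\'s "Air" followed.']:
    print(example, "->", split_input(example))
```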
3. Experiment #1 (DBM-2): Learning from Fragmented Data

In our experience (Spitkovsky et al., 2011a), punctuation can be viewed as implicit partial bracketing constraints (Pereira and Schabes, 1992): assuming that some (head) word from each inter-punctuation fragment derives the entire fragment is a useful approximation in the unsupervised setting. With this restriction, splitting text at punctuation is equivalent to learning partial parse forests — partial because longer fragments are left unparsed, and forests because even the parsed fragments are left unconnected (Moore et al., 1995). We allow grammar inducers to focus on modeling lower-level substructures first [3], before forcing them to learn how these pieces may fit together. Deferring decisions about potentially long-distance inter-fragment relations and about dependency arcs from longer fragments to a later training stage is thus a variation on the "easy-first" strategy (Goldberg and Elhadad, 2010), a fast and powerful heuristic from the supervised dependency parsing setting.

We bootstrapped DBM-2 using snippets of text obtained by slicing up all input sentences at punctuation. Splitting the data increased the number of training tokens from 163,715 to 709,215 (and the number of effective short training inputs from 15,922 to 34,856). Ordinarily, tree generation would be conditioned on an exogenous sentence-completeness status (comp), using the presence of sentence-final punctuation as a binary proxy. We refined this notion to account for the new kinds of fragments: (i) for the purposes of modeling roots, only unsplit sentences remain complete; as for stopping decisions, (ii) leftmost fragments (prefixes of complete original sentences) are left-complete; and, analogously, (iii) rightmost fragments (suffixes) retain their status vis-à-vis right stopping decisions (see Figure 1). With this set-up, performance improved from 59.7 to 60.2% (from 3.4 to 3.5% for exact trees — see Table 1). Next, we show how to make better use of the additional fragmented training data.

[3] About which our loose and sprawl punctuation-induced constraints agree (Spitkovsky et al., 2011a, §2.2).

4. Experiment #2 (DBM-i): Learning with a Coarse Model

In modeling head word fertilities, DBMs distinguish between the adjacent case (adj = T, deciding whether or not to have any children in a given direction, dir ∈ {L, R}) and non-adjacent cases (adj = F, whether to cease spawning additional daughters — see PSTOP in Table 2). This level of detail can be wasteful for short fragments, however, since non-adjacency will be exceedingly rare there: most words will not have many children. Therefore, we can reduce the model by eliding adjacency. On the down side, this leads to some loss of expressive power; on the up side, pooled information about phrase edges can flow more easily inwards from input boundaries, since it is not so needlessly subcategorized. We implemented DBM-i by conditioning all stopping decisions only on the direction in which a head word is growing, the input's completeness status in that direction, and the identity of the head's farthest descendant on that side (the head word itself, in the adjacent case — see Table 2 and Appendix B). With this smaller initial model, directed dependency accuracy on the test set improved only slightly, from 60.2 to 60.5%; however, performance at the granularity of whole trees increased dramatically, from 3.5 to 4.9% (see Table 1).

5. Experiment #3 (DBM-0): Learning with an Ablated Model

DBM-i maintains separate root distributions for complete and incomplete sentences (see PATTACH for ⋄ in Table 2), which can isolate the verb and modal types heading typical sentences from the various noun types deriving captions, headlines, titles and other fragments that tend to be common in news-style data. Heads of inter-punctuation fragments are less homogeneous than actual sentence roots, however. Therefore, we can simplify the learning task by approximating what would be a high-entropy distribution with a uniform multinomial, which is equivalent to updating DBM-i via a "partial" EM variant (Neal and Hinton, 1999). We implemented DBM-0 by modifying DBM-i to hardwire the root probabilities as one over the number of word classes (1/200, in our case) for all incomplete inputs. With this more compact, asymmetric model, directed dependency accuracy improved substantially, from 60.5 to 61.2% (though only slightly for exact trees, from 4.9 to 5.0% — see Table 1).
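As a rough illustration of the DBM-0 ablation just described (not the authors' implementation), the root factor for incomplete inputs can simply be pinned at 1/K, where K is the number of word classes, while the complete-sentence root multinomial is still re-estimated:

# Sketch of DBM-0's "partial EM" treatment of root probabilities: roots of
# incomplete inputs are never re-estimated, just fixed uniform.
K = 200   # number of unsupervised word classes used in the paper

def root_prob(learned_roots, c_root, comp_root):
    # learned_roots: a dict {class: prob}, re-estimated from complete
    # sentences only at each Viterbi-EM iteration (hypothetical container).
    if not comp_root:
        return 1.0 / K          # hardwired uniform, never updated
    return learned_roots.get(c_root, 1.0 / K)

print(root_prob({}, c_root=17, comp_root=False))   # -> 0.005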
6. Conclusion

We presented an effective divide-and-conquer strategy for bootstrapping grammar inducers. Our procedure is simple and efficient, achieving state-of-the-art results on a standard English dependency grammar induction task by simultaneously scaffolding on both model and data complexity, using a greatly simplified dependency-and-boundary model with inter-punctuation fragments of sentences. Future work could explore inducing structure from sentence prefixes and suffixes — or even bootstrapping from intermediate n-grams, perhaps via novel parsing models that may be better equipped for handling distituent fragments.

Acknowledgments

We thank the anonymous reviewers and conference organizers for their help and suggestions. Funded, in part, by the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program, under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0181.

References

H. Alshawi. Head automata for speech translation. In ICSLP, 1996.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
M. R. Brent and J. M. Siskind. The role of exposure to isolated words in early vocabulary development. Cognition, 81, 2001.
P. F. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19, 1993.
A. Clark. Combining distributional and morphological information for part of speech induction. In EACL, 2003.
M. Collins. Head-Driven Statistical Models for Natural Language Parsing. PhD thesis, University of Pennsylvania, 1999.
J. L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48, 1993.
R. Frank. From regular to context-free to mildly context-sensitive tree rewriting systems: The path of child language acquisition. In A. Abeillé and O. Rambow, editors, Tree Adjoining Grammars: Formalisms, Linguistic Analysis and Processing. CSLI Publications, 2000.
K. Gimpel and N. A. Smith. Concavity and initialization for unsupervised dependency grammar induction. Technical report, CMU, 2011.
Y. Goldberg and M. Elhadad. An efficient algorithm for easy-first non-directional dependency parsing. In NAACL-HLT, 2010.
Y. Goldberg, M. Adler, and M. Elhadad. EM can find pretty good HMM POS-taggers (when given a good start). In HLT-ACL, 2008.
K. A. Krueger and P. Dayan. Flexible shaping: How learning in small steps helps. Cognition, 110, 2009.
K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4, 1990.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 1993.
R. Moore, D. Appelt, J. Dowding, J. M. Gawron, and D. Moran. Combining linguistic and statistical knowledge sources in natural-language processing for ATIS. In SLST, 1995.
R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, 1999.
F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL, 1992.
E. Ponvert, J. Baldridge, and K. Erk. Simple unsupervised identification of low-level constituents. In ICSC, 2010.
E. Ponvert, J. Baldridge, and K. Erk. Simple unsupervised grammar induction from raw text with cascaded finite state models. In ACL-HLT, 2011.
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. From Baby Steps to Leapfrog: How "Less is More" in unsupervised dependency parsing. In NAACL-HLT, 2010a.
V. I. Spitkovsky, H. Alshawi, D. Jurafsky, and C. D. Manning. Viterbi training improves unsupervised dependency parsing. In CoNLL, 2010b.
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Punctuation: Making a point in unsupervised dependency parsing. In CoNLL, 2011a.
V. I. Spitkovsky, A. X. Chang, H. Alshawi, and D. Jurafsky. Unsupervised dependency parsing without gold part-of-speech tags. In EMNLP, 2011b.
V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Three dependency-and-boundary models for grammar induction. In EMNLP-CoNLL, 2012.
K. Tu and V. Honavar. On the utility of curricula in unsupervised learning of probabilistic grammars. In IJCAI, 2011.

Appendix A. The Dependency-and-Boundary Models (DBMs 1, 2 and 3)

All DBMs begin by choosing a class for the root word (cr). Remainders of parse structures, if any, are produced recursively. Each node spawns off ever more distant left dependents by (i) deciding whether to have more children, conditioned on direction (left), the class of the (leftmost) fringe word in the partial parse (initially, itself), and other parameters (such as adjacency of the would-be child); then (ii) choosing its child's category, based on direction, the head's own class, etc. Right dependents are generated analogously, but using separate factors. Unlike traditional head-outward models, DBMs condition their generative process on more observable state: the left and right end words of the phrases being constructed. Since left and right child sequences are still generated independently, DBM grammars are split-head.

DBM-2 maintains two related grammars: one for complete sentences (comp = T), approximated by the presence of final punctuation, and another for incomplete fragments. These grammars communicate through shared estimates of word attachment parameters, making it possible to learn from mixtures of input types without polluting root and stopping factors. DBM-3 conditions attachments on additional context, distinguishing arcs that cross punctuation boundaries (cross = T) from lower-level dependencies. We allowed only heads of fragments to attach other fragments as part of (loose) constrained Viterbi EM; in inference, entire fragments could be attached by arbitrary external words (sprawl). All missing families of factors (e.g., those of punctuation-crossing arcs) were initialized as uniform multinomials.

Appendix B. Partial Dependency-and-Boundary Models (DBMs i and 0)

Since dependency structures are trees, few heads get to spawn multiple dependents on the same side. High fertilities are especially rare in short fragments, inviting economical models whose stopping parameters can be lumped together (because in adjacent cases heads and fringe words coincide: adj = T → h = e, hence ch = ce). Eliminating inessential components, such as the likely-heterogeneous root factors of incomplete inputs, can also yield benefits.
Consider the two-word sentence "a z". It admits two structures: one in which a is the root and attaches z, and one in which z is the root and attaches a. In theory, neither should be preferred. In practice, if the first parse occurs 100p% of the time, a multicomponent model could re-estimate its total probability as p^n + (1 − p)^n, where n may exceed the number of independent components. Only root and adjacent stopping factors are nondeterministic here: PROOT(a) = PSTOP(z, L) = p and PROOT(z) = PSTOP(a, R) = 1 − p; attachments are fixed (a can only attach z and vice versa). Tree probabilities are thus cubes (n = 3): a root factor and two stopping factors (one for each word, on different sides):

P(a z) = P(a is root) + P(z is root)
       = PROOT(a) · PSTOP(a, L) · (1 − PSTOP(a, R)) · PATTACH(a, R, z) · PSTOP(z, L) · PSTOP(z, R)
       + PROOT(z) · PSTOP(z, R) · (1 − PSTOP(z, L)) · PATTACH(z, L, a) · PSTOP(a, R) · PSTOP(a, L)
       = p · 1 · p · 1 · p · 1 + (1 − p) · 1 · (1 − p) · 1 · (1 − p) · 1
       = p^3 + (1 − p)^3.

For p ∈ [0, 1] and n ∈ Z+, p^n + (1 − p)^n ≤ 1, with strict inequality if p ∉ {0, 1} and n > 1. Clearly, as n grows above one, optimizers will more strongly prefer extreme solutions p ∈ {0, 1}, despite lacking evidence in the data. Since the exponent n is related to the number of input words and independent modeling components, a recipe of short inputs — combined with simpler, partial models — could help alleviate some of this pressure towards arbitrary determinism.
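A quick numeric check (ours, not from the paper) makes the determinism pressure concrete: the re-estimated total probability is flat in p when n = 1 but increasingly rewards extreme p as n grows.

# p**n + (1-p)**n as a function of p, for growing n.
for n in (1, 3, 6):
    for p in (0.5, 0.9, 1.0):
        print(f"n={n}  p={p:.1f}  p^n+(1-p)^n = {p**n + (1-p)**n:.3f}")
# n=1: always 1.000; n=3: 0.250, 0.730, 1.000; n=6: 0.031, 0.531, 1.000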
Building high-level features using large scale unsupervised learning
Quoc V. Le, Stanford University and Google
Joint work with: Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, Andrew Y. Ng

Slide highlights:
• Hierarchy of feature representations: pixels → edges → face parts (combinations of edges) → face detectors (Lee et al., 2009, sparse DBNs), contrasting faces with random images from the Internet.
• Key results: a face detector, a human body detector and a cat detector learned from unlabeled data.
• Algorithm: each RICA layer = one filtering layer + one pooling layer + one local contrast normalization layer (see Le et al., NIPS 2011 and CVPR 2011 for applications to action recognition, object recognition and biomedical imaging). The model is too large to fit on a single machine, which motivates model parallelism and data parallelism.
• [Figure: one layer operating on 200x200 images with 3 input channels, 8 maps/output channels, receptive field size 18, pooling size 5 and LCN size 5; the output feeds the next layer.]
• Local receptive field networks distribute the stacked sparse-autoencoder layers across machines (Le et al., Tiled Convolutional Neural Networks, NIPS 2010).
• Asynchronous parallel SGDs coordinated through a parameter server.
• Training: 10 million unlabeled 200x200 images from YouTube and the web; 1,000 machines (16,000 cores) for one week; 1.15 billion parameters (roughly 100x larger than previously reported models, yet small compared to the visual cortex).
• [Figures: top stimuli from the test set and the optimal stimulus obtained by numerical optimization for the face detector; histogram of feature values for faces vs. random distractors.]
• Invariance properties: feature response as a function of horizontal and vertical shifts (0-20 pixels), 3D rotation angle (0-90 degrees) and scale factor (0.4x-1.6x).
• ImageNet classification: 20,000 categories and 16,000,000 images; prior approaches combine hand-engineered features (SIFT, HOG, LBP), spatial pyramids, sparse coding/compression and kernel SVMs. 20,000 is a lot of categories, including dozens of fine-grained shark, ray and skate classes (smoothhound, hammerhead, stingray, manta ray, ...).
• Accuracy: random guess 0.005%; state of the art (Weston & Bengio, 2011) 9.5%; feature learning from raw pixels 15.8%. On ImageNet 2009 (10k categories), the best published result is 17% (Sanchez & Perronnin, 2011); our method reaches 19%, and over 50% when restricted to 1,000 categories.
• [Figures: best stimuli for learned features 1-13.]
• Conclusions: RICA learns invariant features; a face neuron emerges from totally unlabeled data given enough training and data; state-of-the-art performance on action recognition, cancer image classification and ImageNet.
• Additional thanks and joint work: Samy Bengio, Zhenghao Chen, Tom Dean, Pangwei Koh, Mark Mao, Jiquan Ngiam, Patrick Nguyen, Andrew Saxe, Mark Segal, Jon Shlens, Vincent Vanhoucke, Xiaoyun Wu, Peng Xe, Serena Yeung, Will Zou, Kai Chen, Greg Corrado, Jeff Dean, Matthieu Devin, Rajat Monga, Andrew Ng, Marc'Aurelio Ranzato, Paul Tucker, Ke Yang.

References:
Q. V. Le, M. A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, A. Y. Ng. Building high-level features using large-scale unsupervised learning. ICML, 2012.
Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, A. Y. Ng. Tiled Convolutional Neural Networks. NIPS, 2010.
Q. V. Le, W. Y. Zou, S. Y. Yeung, A. Y. Ng. Learning hierarchical spatio-temporal features for action recognition with independent subspace analysis. CVPR, 2011.
Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, A. Y. Ng. On optimization methods for deep learning. ICML, 2011.
Q. V. Le, A. Karpenko, J. Ngiam, A. Y. Ng. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. NIPS, 2011.
Q. V. Le, J. Han, J. Gray, P. Spellman, A. Borowsky, B. Parvin. Learning Invariant Features for Tumor Signatures. ISBI, 2012.
I. J. Goodfellow, Q. V. Le, A. M. Saxe, H. Lee, A. Y. Ng. Measuring invariances in deep networks. NIPS, 2009.
http://ai.stanford.edu/~quocle

Tera-scale deep learning
Quoc V. Le, Stanford University and Google
(joint work and additional thanks as listed above)

Slide highlights:
• Machine learning successes: face recognition, OCR, autonomous cars, recommendation systems, web page ranking, email classification; these are typically a classifier on top of mostly hand-crafted feature extraction (SIFT/HOG, SURF for computer vision; MFCC, spectrograms, ZCR for speech recognition).
• New feature-designing paradigm: unsupervised feature learning / deep learning (here, Reconstruction ICA), which has been expensive and typically applied to small problems; with the trend of big data, no matter the algorithm, more features tend to be more successful.
• Outline: Reconstruction ICA; applications to videos and cancer images; ideas for scaling up; scaling-up results.
• Topographic Independent Component Analysis (TICA): feature computation and learning with a weight matrix W; pooling features yields invariance (the pooled feature takes the same value regardless of the location of an edge).
• Reconstruction ICA (RICA): replaces TICA's orthogonality constraints with a reconstruction cost, exploiting the equivalence between sparse coding, autoencoders, RBMs and ICA; deep architectures are built by treating the output of one layer as the input to the next; unlike TICA, no data whitening is required (Le et al., ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning, NIPS 2011).
• Why RICA? Compared with sparse coding and RBMs/autoencoders on speed, ease of training and invariance of features. Summary of RICA: a two-layered network; a reconstruction cost instead of orthogonality constraints; learns invariant features.
• Applications of RICA: action recognition (sit up, drive car, get out of car, eat, answer phone, kiss, run, stand up, shake hands) — Le et al., CVPR 2011. [Figure: accuracy on the KTH, Hollywood2, UCF and YouTube benchmarks, comparing learned features (alone and combined) against engineered features such as HOG, HOF, HOG3D, Hessian/SURF, GRBMs, pLSA, 3DCNN and HMAX.]
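For readers unfamiliar with RICA, the sketch below writes out the idea summarized above: ICA's hard orthogonality constraint on the filters W is replaced by a soft reconstruction penalty. This follows the formulation of Le et al. (NIPS 2011) as we read it; the weight lam, the smoothing constant eps and the plain (unpooled) sparsity term are simplifications, not the exact objective.

# Minimal sketch of a RICA-style cost: reconstruction penalty + smooth sparsity.
import numpy as np

def rica_cost(W, X, lam=0.1, eps=1e-6):
    """W: (k, d) filters; X: (d, m) data columns."""
    WX = W @ X                                   # feature activations
    recon = W.T @ WX - X                         # reconstruction residual
    reconstruction = lam * np.sum(recon ** 2) / X.shape[1]
    sparsity = np.sum(np.sqrt(WX ** 2 + eps)) / X.shape[1]
    return reconstruction + sparsity

# Toy usage: random filters on random data; a real system would minimize
# rica_cost over W with L-BFGS or SGD.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 100)) * 0.01
X = rng.standard_normal((100, 32))
print(rica_cost(W, X))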
• Cancer classification: distinguishing apoptotic cells, viable tumor regions and necrosis in tissue images; RICA features outperform hand-engineered features (accuracies in the reported plot fall roughly in the 84-92% range) — Le et al., Learning Invariant Features of Tumor Signatures, ISBI 2012.
• Scaling up deep RICA networks: in practice it is better to have more features (Coates et al., An Analysis of Single-Layer Networks in Unsupervised Feature Learning, AISTATS 2011); most learned features are local, so local receptive field networks split the feature maps across machines (Le et al., Tiled Convolutional Neural Networks, NIPS 2010).
• Challenges with thousands of machines are addressed with asynchronous parallel SGDs coordinated by a parameter server (Le et al., Building high-level features using large-scale unsupervised learning, ICML 2012).
• Summary of scaling up: local connectivity and asynchronous SGDs, plus RPC instead of MapReduce, prefetching, single vs. double precision, removing slow machines, an optimized softmax, and more.
• Training: 10 million unlabeled 200x200 images from YouTube and the web; 2,000 machines (16,000 cores) for one week; 1.15 billion parameters (roughly 100x larger than previously reported models, yet small compared to the visual cortex).
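The asynchronous-SGD setup summarized above can be pictured as follows. This is a schematic toy sketch (ours, not Google's implementation): it emulates a parameter server and several model replicas with threads in one process, whereas the real system shards both data and parameters across many machines.

# Schematic asynchronous parallel SGD with a parameter server (toy version).
import threading
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()
    def fetch(self):
        with self.lock:
            return self.w.copy()
    def push(self, grad):
        with self.lock:
            self.w -= self.lr * grad          # apply each update as it arrives

def replica(server, X, y, steps=100):
    # Each model replica trains on its own data shard (here: least squares).
    for _ in range(steps):
        w = server.fetch()                    # possibly stale parameters
        grad = X.T @ (X @ w - y) / len(y)
        server.push(grad)                     # no synchronization barrier

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 20))
y = X @ rng.standard_normal(20)
server = ParameterServer(dim=20)
shards = np.array_split(np.arange(1000), 4)
threads = [threading.Thread(target=replica, args=(server, X[i], y[i])) for i in shards]
for t in threads: t.start()
for t in threads: t.join()
print("final loss:", float(np.mean((X @ server.fetch() - y) ** 2)))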
• The face neuron: top stimuli from the test set and the optimal stimulus found by numerical optimization; the histogram of feature values separates faces from random distractors. Invariance properties: feature response vs. horizontal and vertical shifts (0-20 pixels), 3D rotation angle (0-90 degrees) and scale (0.4x-1.6x). Analogous neurons emerge for pedestrians and cat faces.
• ImageNet classification: 22,000 categories and 14,000,000 images; prior approaches use hand-engineered features (SIFT, HOG, LBP), spatial pyramids and sparse coding/compression. 22,000 is a lot of categories, including dozens of fine-grained shark, ray and skate classes.
• [Figures: best stimuli for learned features 1-13.]
• Accuracy: random guess 0.005%; state of the art (Weston & Bengio, 2011) 9.5%; feature learning from raw pixels 15.8%. On ImageNet 2009 (10k categories), the best published result is 17% (Sanchez & Perronnin, 2011); our method reaches 20%, and over 50% when restricted to 1,000 categories.
• Other results: the same approach also yields strong features for speech recognition and word-vector embeddings for NLP.
• Conclusions: RICA learns invariant features; a face neuron emerges from totally unlabeled data given enough training and data; state-of-the-art performance on action recognition, cancer image classification and ImageNet.
References: the same publications listed for the preceding talk (Le et al., ICML 2012; NIPS 2010; CVPR 2011; ICML 2011; NIPS 2011; ISBI 2012; Goodfellow et al., NIPS 2009); http://ai.stanford.edu/~quocle

On the Predictability of Search Trends
Yair Shimshoni, Niv Efron, Yossi Matias
Google, Israel Labs
Draft date: August 17, 2009

1. Introduction

Since Google Trends and Google Insights for Search were launched, they have provided a daily insight into what the world is searching for on Google, by showing the relative volume of search traffic in Google for any search query. An understanding of web search trends can be useful for advertisers, marketers, economists, scholars, and anyone else interested in knowing more about their world and what's currently top-of-mind.

The trends of some search queries are quite seasonal and have repeated patterns. See, for instance, how the search trends for ski in the US and in Australia peak during their respective winter seasons, or how search trends for basketball correlate with annual league events and how consistent they are year over year. When looking at trends of the aggregated volume of search queries related to particular categories, one can also observe regular patterns in at least some of hundreds of categories, like the Food & Drink or Automotive categories. Such trends sequences appear quite predictable, and one would naturally expect the patterns of previous years to repeat going forward. On the other hand, for many other search queries and categories the trends are quite irregular and hard to predict: for example, the search trends for Obama, Twitter, Android, or global warming, and the trends of aggregate searches in the News & Current Events category.

Having predictable trends for a search query or for a group of queries could have interesting ramifications. One could forecast the trends into the future and use the forecast as a "best guess" for various business decisions such as budget planning, marketing campaigns and resource allocation. One could also identify deviations from such a forecast and identify new factors that are influencing the search volume, as in the detection of influenza epidemics using search queries [Ginsberg et al. 2009], known as Flu Trends. We were therefore interested in the following questions:
• How many search queries have trends that are predictable?
• Are some categories more predictable than others? What is the distribution of predictable trends across the various categories?
• How predictable are the trends of aggregated search queries for different categories? Which categories are more predictable and which are less so?
To learn about the predictability of search trends, and to overcome our basic limitation of not knowing what the future will entail, we characterize the predictability of a trends series based on its historical performance: that is, based on the a posteriori predictability of a sequence, determined by the discrepancy between forecast trends computed at some point in the past and the actual performance since then. Specifically, we have used a simple forecasting model that learns basic seasonality and the general trend. For each trends sequence of interest, we take a point in time, t, which is about a year back, compute a one-year forecast for t based on historical data available at time t, and compare it to the actual trends sequence observed since time t. The discrepancy between the forecast trends and the actual trends characterizes the predictability level of a sequence; when the discrepancy is smaller than a predefined threshold, we denote the trends query as predictable.

We investigate time series of search trends provided by Google Insights for Search (I4S), which represent query shares of given search terms (or aggregations of terms). A query share is the total number of queries for a search term (or an entire search category) in a given geographic region divided by the total number of queries in that region at a given point in time. The query share represents the popularity of a query, or the aggregated search interest that users have in a query, and we will therefore use the term search interest interchangeably with query share.

The highlights of our observations can be summarized as follows:
• Over half of the most popular Google search queries were found predictable in a 12-month-ahead forecast, with a mean absolute prediction error of approximately 12% on average.
• Nearly half of the most popular queries are not predictable, with respect to the prediction model and evaluation framework that we have used.
• Some categories have a particularly high fraction of predictable queries; for instance, Health (74%), Food & Drink (67%) and Travel (65%).
• Some categories have a particularly low fraction of predictable queries; for instance, Entertainment (35%) and Social Networks & Online Communities (27%).
• The trends of aggregated queries per category are much more predictable: 88% of the aggregated category search trends of over 600 categories in Insights for Search are predictable, with a mean absolute prediction error of less than 6% on average.
• There is a clear association between the existence of seasonality patterns and higher predictability, as well as an association between high levels of outliers and lower predictability.

Recently the research community has started to use Google search data provided publicly by Google Insights for Search (I4S) as auxiliary indicators for economic forecasting. [Choi & Varian 2009] have shown that aggregated search trends of Google categories can be used as extra indicators that effectively leverage several US econometric prediction models. [Askitas & Zimmermann 2009] and [Suhoy 2009] have shown similar findings on German and Israeli economic data, respectively. Getting a better insight into the behavior of relevant search trends therefore has high potential applicability in these domains. For queries or aggregated sets of queries whose search trends are predictable, one can use forecasted trends based on the prediction model as a baseline for identifying deviations in actual trends.
Such deviations are of particular interest, as they are often indicative of material changes in the domain of the queries. We consider a few examples with observed deviations of actual trends relative to the forecasted trends, including:
• Automotive Industry. We show that in the most recent 12 months there is a positive deviation relative to the forecast baseline (i.e., an increased query share) in searches for Auto Parts and Vehicle Maintenance, while there is a negative deviation (i.e., a decrease in query share) in searches for Vehicle Shopping and Auto Financing.
• US Unemployment. In relation to the recent research that showed an improvement in the prediction of unemployment rates using Google query shares [Choi and Varian 2009b], we show that the search interest in the category Welfare & Unemployment has risen substantially in the last year above the forecast based on the prediction model. We also show that the search interest in the category Jobs has significantly decreased relative to the prediction model.
• Mexico as Vacation Destination. We examine the large decrease in the query share of the category Mexico as a vacation destination, compared to the predictions for the last 12 months. We show that a similar deviation of actual vs. forecast query share is not observed for other related categories.
• Recession Markers. We show several examples that demonstrate possible influences of the recent recession on search behavior, like an observed increase in query share for the category Coupons & Rebate compared to the forecast. We also show a negative deviation between the query share for the category Restaurants and its forecast, whereas the category Cooking and Recipes shows a similar positive deviation.

Outline. The rest of this paper is organized as follows. In Section 2 we formulate the notion of predictability and describe the method of estimating it, along with the evaluation measures, the prediction model and the time series data that we use. In Section 3 we describe the experiments we conducted and present their results. In Section 4 we examine the association between the predictability of search interest and the level of seasonality or internal deviation of the underlying search trends. Section 5 presents sensitivity analysis and error diagnostics, and in Section 6 we discuss the potential use of forecasting as a baseline for identifying deviations from regular search behavior; we demonstrate with some examples that the discrepancies from model predictions can act as signals for recent changes in query share.

2. Time Series Predictability

In this section we define the notion of predictability as we use it in our experiments.

Predictability. We characterize the predictability of a time series with respect to a prediction model and a discrepancy measure, as follows. Assume we have:
• A time series X = {x_{t-H}, ..., x_{t+F}} with a history of size H and a future horizon of size F. Denote X^H = {x_{t-H}, ..., x_t} and X^F = {x_{t+1}, ..., x_{t+F}}.
• A prediction model M, which computes a forecast Y = M(X^H), where Y = {y_{t+1}, ..., y_{t+F}}.
• A discrepancy measure D = D(X, Y).
• A threshold D'.
Then we say that X is predictable w.r.t. (M, D, D', t, H, F) iff D = D(X, Y) < D'. The size of the discrepancy also characterizes the level of predictability of a series. We will often refer to a trends sequence as predictable (or not predictable) where the various parameters are implied by the context.
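As a concrete reading of this definition, the following sketch (ours; the forecast model and discrepancy function are placeholders for the paper's M and D) checks the predictability of a monthly series by forecasting its last F points from the preceding H points.

# Sketch of the predictability check: hold out the last F points, forecast
# them from the preceding H points, and compare against a threshold.
def is_predictable(series, forecast_model, discrepancy, H, F, threshold):
    history, future = series[-(H + F):-F], series[-F:]   # X^H and X^F
    forecast = forecast_model(history, F)                # Y = M(X^H)
    return discrepancy(future, forecast) < threshold     # D(X, Y) < D'

# Example with a trivial "same month last year" model and MAPE as discrepancy.
def naive_seasonal(history, F, period=12):
    return history[-period:][:F]

def mape(actual, forecast):
    return sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

series = [100 + 20 * ((i % 12) in (5, 6, 7)) for i in range(67)]  # toy seasonal data
print(is_predictable(series, naive_seasonal, mape, H=55, F=12, threshold=0.25))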
Data. The time series used in the following experiments are based on Google Insights for Search [1] (I4S), which reports the query share for search terms at any time and location and in any category, and can also report the most popular queries within a given time / location / category. A query share is defined as the total number of queries for a search term or a set of terms (e.g., an entire search category) in a given geographic region, divided by the total number of queries in that region, at a given point in time. The I4S categories are organized in a tree-like hierarchical structure, with about 30 root-level categories that are further divided into subcategories in a 3-level taxonomy, for a total of about 600 categories and sub-categories. Each search query is classified by I4S into a single category; nevertheless, it is also counted as part of the query share of all its 'parent' categories. For each category, I4S calculates an aggregated time series which represents the overall query share of this category (i.e., the combined search interest of all the queries in the category).

In order to stay focused on the most influential patterns of yearly seasonality and overall trend (direction), we use time series of monthly granularity (i.e., one data point per calendar month) and refer to the entire available period (2004-2009). Obviously, search trends with finer granularity (e.g., weekly or daily search data) capture more patterns of search behavior, within the month and especially the day-of-week effect; however, the finer-resolution data is also noisier and thus calls for prediction models with higher complexity and a less homogeneous model space. We leave that for future research. We have extracted time series for the entire available time range (2004-2009) [2], consisting of 67 data points, which were partitioned into two parts:
1. The History Period - 55 monthly data points (January 2004 - July 2008)
2. The Forecast Period - 12 monthly data points (August 2008 - July 2009)

Throughout this work, we will refer to 3 data sets of time series (with a similar format):
1. Country Data - time series of the query shares for the 10,000 most popular queries in each of these countries: USA, UK, Germany, France and Brazil.
2. Category Data - time series of the query shares for the 1,000 most popular queries in the US, for 10 major I4S categories: Automotive, Entertainment, Finance & Insurance, Food & Drink, Health, Social Networks & Online Communities, Real Estate, Shopping, Telecommunications, and Travel.
3. Aggregated Categories Data - time series of aggregated query shares for about 600 I4S categories, which represent the normalized combined search volume in the US for each respective category.

Generic Prediction Model. Our prediction process is based on the STL procedure [Cleveland et al. 1990], a filtering procedure based on locally weighted least squares for decomposing a given time series X into Trend, Seasonal and Residual components. STL is essentially an EM-like algorithm that calculates the seasonal part assuming knowledge of the trend part (iteratively). To compute the forecast of future values, we extrapolate the trend sub-series using regression and reuse the last seasonal period of the seasonal component.

The STL procedure uses 6 configuration parameters, 3 of which are smoothing parameters for the three components; in general, these should be chosen per time series. The prediction process in our experiments used a fixed STL configuration for the forecast of all the time series. Given a sampled archive of search time series, we used an exhaustive exploration and evaluation process that searched for the best parameter set from a pre-defined set of optional parameter values; the optimality criterion was minimal mean absolute error, and the output was a single parameter set w.r.t. the given sampled archive. By choosing a particular (fixed) configuration, rather than adjusting an individual parameter set for each given time series, we fit the configuration to a large set of time series, which simplifies the prediction model and enables much faster forecasting.

[1] http://www.google.com/insights/search/#
[2] The time series data was pulled during July 2009, so the value for the last month might still change.
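The paper's exact STL configuration is not reproduced here; the sketch below uses statsmodels' STL implementation as a stand-in, extrapolates the trend with a linear fit, and repeats the last seasonal cycle, mirroring the forecasting recipe just described.

# Sketch of the STL-based forecast: decompose the history, extrapolate the
# trend linearly, and repeat the last seasonal year (statsmodels as stand-in).
import numpy as np
from statsmodels.tsa.seasonal import STL

def stl_forecast(history, horizon=12, period=12):
    history = np.asarray(history, dtype=float)
    result = STL(history, period=period).fit()            # trend/seasonal/resid
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, result.trend, 1)      # linear trend model
    future_t = np.arange(len(history), len(history) + horizon)
    trend_part = intercept + slope * future_t
    seasonal_part = np.tile(result.seasonal[-period:],
                            int(np.ceil(horizon / period)))[:horizon]
    return trend_part + seasonal_part

# Example: forecast 12 months from a toy 55-month history.
toy = 100 + 10 * np.sin(2 * np.pi * np.arange(55) / 12) + 0.5 * np.arange(55)
print(np.round(stl_forecast(toy, horizon=12), 1))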
Prediction Discrepancy Function. We define the discrepancy D as a combination of several error metrics between the actual trends X^F and the forecasted trends Y, as well as seasonal-consistency metrics determined by the difference in the auto-correlation between X^H and X. Specifically, D is the tuple D = <MAPE, MaxAPE, NMSSE, MeanAbsACFDiff, MaxAbsACFDiff>, based on the metrics defined below, and the comparison with the threshold D' is taken componentwise. Thus, we say that a given time series is predictable within the available time frame, w.r.t. the prediction model we use and the above error and consistency metrics, if all of the following conditions are fulfilled:
1) The Mean Absolute Prediction Error (MAPE) < 25%
2) The Max Absolute Prediction Error (MaxAPE) < 100%
3) The Normalized Mean Sum of Squared Errors (NMSSE) < 10.0
4) The Mean Absolute Difference of the ACF Coefficient Sets (MeanAbsACFDiff) < 0.2
5) The Max Absolute Difference of the ACF Coefficient Sets (MaxAbsACFDiff) < 0.4

Predictability Ratio. Given a set A of time series, we denote by its predictability ratio the number of predictable time series in A divided by the total number of time series in A.
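One illustrative reading of these metrics is sketched below. The exact NMSSE normalization and the ACF lag set are not spelled out in the text, so they are omitted or assumed here; these functions are not the paper's definitions.

# Illustrative reading of the error and consistency metrics above.
import numpy as np

def mape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.mean(np.abs(forecast - actual) / actual)

def max_ape(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return np.max(np.abs(forecast - actual) / actual)

def acf(x, nlags=12):
    x = np.asarray(x, float) - np.mean(x)
    denom = np.sum(x * x)
    return np.array([np.sum(x[k:] * x[:-k]) / denom for k in range(1, nlags + 1)])

def acf_diffs(history, full_series, nlags=12):
    d = np.abs(acf(history, nlags) - acf(full_series, nlags))
    return d.mean(), d.max()          # MeanAbsACFDiff, MaxAbsACFDiff

def predictable(history, actual_future, forecast):
    full = np.concatenate([history, actual_future])
    mean_d, max_d = acf_diffs(history, full)
    return (mape(actual_future, forecast) < 0.25 and
            max_ape(actual_future, forecast) < 1.0 and
            mean_d < 0.2 and max_d < 0.4)       # NMSSE condition omitted here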
3. Experiments and Results

Comparing the Predictability of Top Queries in Different Countries. We conducted an experiment to test the predictability of the search trends of the 10,000 most popular search queries in five countries:

Country   Predictability Ratio   Avg. MAPE (predictable queries)   Avg. MaxAPE (predictable queries)
USA       54.1                   11.8                              27.1
UK        51.4                   12.7                              32.1
Germany   56.1                   11.8                              28.2
France    46.9                   12.8                              28.8
Brazil    46.3                   13.7                              30.5

Although the above results show some variability among the different countries, one can see that, in general, about half of the time series that correspond to popular queries in Google Web Search are predictable with respect to the given prediction model and discrepancy function / threshold. Among the predictable queries, the mean absolute prediction error (MAPE) is about 12% on average, while the maximum absolute prediction error (MaxAPE) is about 30% on average.

The Seasonality of Time Series. Time series often include various forms of regularity, like a consistent trend (upward or downward) or seasonal patterns (daily, weekly, monthly, etc.). In seasonal time series, the amplitude changes over time in a regularly recurring fashion according to the relevant season. In many practical cases it is common to apply a seasonality adjustment in which the seasonal component is subtracted from the time series before the analysis, and there are procedures that decompose time series into their seasonal and trend components [Cleveland and Tiao 1976], [Lytras et al. 2007]. We use such a decomposition to compute a metric that represents the relative portion of seasonality within a time series, as follows. Given a time series X = {x_1, ..., x_T} and a decomposition of X into a seasonal component S and a trend (i.e., directional) component Tr:

Seasonality Ratio(X) = ( Σ_i |S_i| ) / ( Σ_i |Tr_i| )

For each time series we forecast, we compute the respective seasonality ratio. For example, consider the time series which represents the search interest for the query Cheesecake (in the US, 2004-2009). The blue curve in the following plot shows the original time series, which has a significant seasonal component. The red curve is the seasonality-adjusted time series, i.e., the trend component that is left after subtracting the seasonality component; it has an upward trend (with slope 0.18) plus some variability. The seasonality ratio is 2.64 (at the 96th percentile of the 10,000 tested queries) and approximates the ratio between the area between the red and blue curves and the area underneath the red curve.

The Deviation of a Time Series. In order to assess the extent to which a time series contains extreme values or outliers with large deviations from the overall pattern, we calculate for each time series the deviation ratio. In general, we compute the sum of the top values in the series divided by the total sum of the series, assuming that a large ratio indicates the existence of considerable extreme values in the series, and we normalize by the relative number of top values under consideration. Given a time series X = {x_1, ..., x_T} and an integer w (1 ≤ w < 100), let Prc(X, w) be the w-th percentile of the values of X; then:

Deviation Ratio(X, w) = ( Σ_{i: x_i ≥ Prc(X, w)} x_i / Σ_i x_i ) / ( 1 - (w/100) )

We use w = 90. Notice that the normalization term, (1 - (w/100)), in the denominator sets the minimal ratio to 1. Due to the relatively short time series (of 67 points), and since many cases show seasonal patterns with high and narrow peaks (e.g., as in the plot above), it is possible that these sharp peaks would be considered outliers although they are a regular part of the time series' recurring dynamics. To mitigate this, we computed the deviation ratio on the seasonally adjusted time series (i.e., on the trend component that is left after the seasonal component is subtracted by the decomposition described above).
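Both regularity metrics can be computed directly from a seasonal-trend decomposition. The sketch below again uses statsmodels' STL as a stand-in for the paper's decomposition; it is illustrative, not the authors' code.

# Sketch of the seasonality ratio and deviation ratio defined above.
import numpy as np
from statsmodels.tsa.seasonal import STL

def seasonality_ratio(series, period=12):
    res = STL(np.asarray(series, float), period=period).fit()
    return np.sum(np.abs(res.seasonal)) / np.sum(np.abs(res.trend))

def deviation_ratio(series, w=90, period=12):
    # Computed on the seasonally adjusted series, as described above.
    res = STL(np.asarray(series, float), period=period).fit()
    adjusted = np.asarray(series, float) - res.seasonal
    top = adjusted[adjusted >= np.percentile(adjusted, w)]
    return (top.sum() / adjusted.sum()) / (1 - w / 100.0)

toy = 100 + 10 * np.sin(2 * np.pi * np.arange(67) / 12) + 0.3 * np.arange(67)
print(round(seasonality_ratio(toy), 2), round(deviation_ratio(toy), 2))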
The Predictability of Search Categories. In order to assess the predictability of categories, we extracted the 1,000 most popular queries in the US for a selection of 10 root-level categories and tested their predictability. In the following table, the Predictability Ratio column refers to the entire 1,000 queries (per category), while the two error metrics (MAPE and MaxAPE) refer only to the subset of predictable queries within each category; the seasonality and deviation ratios also refer to the entire category sets.

Category Name         Predictability Ratio   MAPE (predictable)   MaxAPE (predictable)   Seasonality Ratio   Deviation Ratio
Health                74.0                   9.0                  20.0                   0.73                1.58
Food & Drink          66.7                   11.9                 26.0                   1.20                1.74
Travel                64.7                   11.8                 27.0                   1.09                1.61
Shopping              63.3                   12.4                 28.0                   1.21                1.78
Automotive            57.6                   11.2                 24.9                   0.71                1.84
Finance & Insurance   52.9                   13.3                 30.6                   0.65                2.00
Real Estate           49.5                   12.9                 29.9                   0.72                1.82
Telecommunications    45.6                   12.9                 29.4                   0.32                2.34
Entertainment         35.4                   14.0                 32.3                   0.46                2.49
Social Networks       27.5                   14.1                 30.1                   0.19                2.95

The predictability per category spans from 74% for the Health category down to 27% for the Social Networks & Online Communities category. The Mean Absolute Prediction Error (MAPE) varies from 9% (in the Health category) to 14.1% (in the Social Networks & Online Communities category); the average MAPE over the 10 categories is 12.35%. Notice that the ordering by predictability ratio is not identical to the ordering by MAPE, since predictability is based on several other metrics as described in Section 2; however, the correlation between them is high (r = -0.85). The variability within the seasonality-ratio and deviation-ratio columns reflects the differences between the search profiles of the various categories, which correspond to the variability of the categories' predictability ratios. For example, notice the relatively high seasonality ratio and low deviation ratio of the Food & Drink category, which has a 66.7% predictability ratio, versus the opposite situation for the Entertainment category, which has a 35.4% predictability ratio with a relatively low seasonality ratio and a high deviation ratio. For the above summary results of the 10 categories, the correlation between the predictability and the seasonality ratio is r = 0.80, while the deviation ratio has a (negative) correlation of r = -0.94 with the predictability. In the next section we further examine the association between these regularity characteristics and predictability.

The Predictability of Aggregated Time Series that Represent Categories. We now show the results of an experiment forecasting aggregated time series that represent the overall query share of categories (i.e., the combined search interest of all the queries in a category). We ran the experiment on the aggregated time series of over 600 I4S categories and computed the average absolute prediction error over a period of 12 months ahead. We found 88% of the aggregated category time series to be predictable. The average MAPE for the entire set of aggregated category time series is 8.15% (6.7% for predictable series only), with STD = 4.18%. The average Maximum Prediction Error (MaxAPE) for the entire set was 19.2% (16.6% for predictable series only). In the table below, we show the prediction errors of the aggregated time series for the same 10 root categories we examined above. Notice that the prediction errors are now smaller, as expected; however, the order of the categories is not the same as in the previous experiment. In general, aggregated time series should have higher predictability due to the noise-reduction effect of aggregation. The rightmost column shows the MAPE Reduction Rate, which is the relative improvement between the average prediction error (MAPE) of the 1,000 queries per category (in the previous experiment) and the single MAPE of the aggregated category time series here. All categories (except Social Networks & Online Communities) had their MAPE reduced, from a 47% improvement for the Finance & Insurance category up to 85% for the Food & Drink category.
Category              MAPE    MaxAPE   Seasonality Ratio   Deviation Ratio   MAPE Reduction Rate
Food & Drink          1.76    4.52     0.70                1.18              0.85
Shopping              2.72    6.02     2.77                1.11              0.78
Entertainment         2.74    5.95     0.30                1.16              0.80
Health                2.99    7.69     1.04                1.11              0.67
Automotive            3.27    7.36     1.69                1.12              0.71
Travel                3.94    7.61     1.92                1.12              0.67
Telecommunications    5.2     9.07     0.74                1.20              0.60
Real Estate           5.62    12.8     2.95                1.11              0.56
Finance & Insurance   7.08    17.8     0.61                1.26              0.47
Social Networks       38.6    50.4     0.06                2.46              -1.74

The I4S classification into search categories is based on a hierarchical tree-like taxonomy where each category at the root level of the tree has several sub-categories under it. Thus, a combination of all the categories' prediction errors into an overall evaluation could consist of the average MAPE values of the 27 root-level categories. However, a 'regular' (uniform) average, which gives the same weight to each category, might be inaccurate. We therefore computed a weighted average of the root categories' MAPE, where the weights are the overall relative search interest of each root category. The weighted-average MAPE is 4.25%. The following table shows the prediction errors for the I4S root categories (sorted by MAPE):

Root Category             MAPE    MaxAPE
Food & Drink              1.76    4.52
Beauty & Personal Care    2.2     7.41
Home & Garden             2.21    4.9
Photo & Video             2.34    8.31
Lifestyles                2.38    5.27
Games                     2.59    4.45
Shopping                  2.72    6.02
Entertainment             2.74    5.95
Business                  2.91    11.5
Health                    2.99    7.69
Local                     3.24    5.49
Automotive                3.27    7.36
Reference                 3.7     8.2
Industries                3.77    7.14
Recreation                3.81    7.58
Computers & Electronics   3.93    7.83
Travel                    3.94    7.61
Internet                  4.87    15
Telecommunications        5.2     9.07
Society                   5.57    12.6
Real Estate               5.62    12.8
Sports                    5.81    29.3
Arts & Humanities         6.98    11.8
Finance & Insurance       7.08    17.8
Science                   10.1    15.5
News & Current Events     16.6    47
Social Networks           38.6    50.4
Average                   5.81    12.5

Comparing the Predictability of a Category and its Sub-Categories. It is reasonable to expect that a time series of the aggregated searches of a set of queries should, in general, be more predictable than single queries: the larger the aggregation set, the smaller the variability of the aggregated time series. This has implications for the predictability of categories vs. sub-categories, but also for aggregated time series of groups of queries such as campaign-related or brand/topic-related queries in general. In order to demonstrate this, we explored the MAPE and MaxAPE prediction errors of the I4S category Vehicle Brands (in the Automotive category), compared to all of its 31 'children' sub-categories. The variability of prediction errors (MAPE) within the 31 vehicle-brand sub-categories is substantial, ranging from 3% to 38%. The average MAPE of the 31 brands is 11.4% (with STD = 7.7%), which is quite similar to the average MAPE for the 1,000 most popular queries in the Automotive category (11.2%) presented above. As expected, the average MAPE of the 31 sub-categories is larger than the MAPE of the aggregated time series of the Vehicle Brands category, which is only 3.39%. We also calculated the median MAPE (9.3%) as well as the weighted-average MAPE, with the relative search interest per sub-category as weights (9.7%). Both the median and the weighted average are lower than the regular average but still much larger than the MAPE of the overall aggregated category of Vehicle Brands.
4. Predictability vs. Seasonality and Deviation Ratios

Among the 10 categories for which we analyzed the 1,000 most popular queries, we calculated a correlation of r = 0.80 between the predictability and the seasonality ratio, and r = -0.94 between the predictability and the deviation ratio (see the table in Section 3). Below, we examine the association between these two time-series characteristics and the MAPE prediction error in the experiment we conducted on the 10,000 most popular queries in the US.

Seasonality and Prediction Errors. Many patterns of search behavior have a strong seasonal component (e.g., holiday shopping, summer vacations, etc.), as implied by the specific market they are in. Occasionally, there is also a directional trend effect (up, down or changing) which might be less visually pronounced due to the confounding seasonal pattern. We used the seasonality ratio (described above) to represent the 'level of seasonality' of the queries. Among the 10,000 most popular queries in the US, the seasonality ratio varies over the rather large range [0.01, 13], from time series with no seasonal component up to extremely seasonal time series. The median seasonality ratio is 0.4 and its mean value is 0.8. We could see no significant correlation between the prediction error and the seasonality ratio. In order to visualize the possible association, we sorted the values of the seasonality ratio and created a 'smoothed' array of 10 average points [4]. Similarly, we computed a 'smoothed' array of averages for the 10,000 corresponding MAPE prediction errors, sorted according to the corresponding seasonality ratio. The resulting scatter plot of the 'smoothed' MAPE vs. the 'smoothed' seasonality ratio shows an unstable 'negative' association between prediction errors and seasonality. The correlation coefficient between the 'smoothed' arrays is substantial (r = 0.55), compared to the insignificant correlation we saw for the entire set. Repeating the same process for predictable time series only shows a stronger 'negative' association between the MAPE prediction error and the seasonality ratio.

[4] Given a time series {Y_N}, N = 10,000; K = 10; M = N/K = 1,000. We compute an array A = {A_1, A_2, ..., A_K} of the averages of K consecutive non-overlapping windows of size M over {Y_N}, such that A_k = (1/M) Σ_{i=(k-1)M+1}^{kM} Y_i, for k = 1, ..., K.

Deviation Ratio and Prediction Errors. The deviation ratio, which represents the level of outliers and irregular extreme values in a time series, was found to be associated with the predictability of the search-interest time series. For the 10,000 queries we tested, the average deviation ratio was 2.08 (STD = 1.9). Only 5% of the predictable time series had a deviation ratio in the upper quartile, and 73% of the predictable time series had a deviation ratio under the median. The correlation coefficient between the deviation ratio and the MAPE error was r = 0.29. The average deviation ratio for the predictable time series was 1.50, whereas for the non-predictable queries the average was 2.77. We applied the same smoothing process as above in order to visually demonstrate the association between MAPE and the deviation ratio. The resulting plot shows a clear positive association between the (sorted) 'smoothed' array of the deviation ratio and the corresponding 'smoothed' array of prediction errors (MAPE). The correlation coefficient calculated for the 'smoothed' arrays was r = 0.88 (compared to r = 0.29 computed on the original values). Hence, we can say that the larger the deviation level in the time series, the larger the prediction error. This can also be seen in the corresponding plot for the predictable queries only.
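The 'smoothed' arrays used throughout this section are simple block means over the sorted data. The sketch below (ours, with simulated data standing in for the 10,000 per-query statistics) illustrates the computation described in the footnote above.

# Sketch of the 'smoothed' arrays: sort queries by one statistic, then average
# both statistics over K consecutive non-overlapping blocks.
import numpy as np

def smoothed_arrays(stat, errors, K=10):
    stat, errors = np.asarray(stat, float), np.asarray(errors, float)
    order = np.argsort(stat)                        # sort queries by the statistic
    blocks_stat = np.array_split(stat[order], K)    # K non-overlapping windows
    blocks_err = np.array_split(errors[order], K)
    return (np.array([b.mean() for b in blocks_stat]),
            np.array([b.mean() for b in blocks_err]))

rng = np.random.default_rng(0)
ratio = rng.gamma(2.0, 0.4, size=10_000)            # toy seasonality ratios
err = 0.15 - 0.02 * np.log1p(ratio) + rng.normal(0, 0.03, size=10_000)
s, e = smoothed_arrays(ratio, err)
print(np.round(s, 2), np.round(e, 3), np.corrcoef(s, e)[0, 1])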
5. Sensitivity Analysis and Error Diagnostics

Sensitivity of the Predictability Thresholds. As described earlier, we chose a predefined set of thresholds corresponding to the three prediction error metrics (MAPE, MaxAPE, NMSSE) and the two consistency metrics. These thresholds are responsible for the trade-off between the Predictability Ratio and the distribution of errors within the Predictable time series. The sensitivity plot for the Mean Absolute Percentage Error (MAPE) shows how the Predictability Ratio behaves as a function of the Predictability Threshold. We present a separate analysis for each error measure, rather than the conjunction of all the conditions that appears in our Predictability definition. The plot shows that choosing a Predictability Threshold of MAPE<0.25 'qualifies' more than 60% of the queries (for this single-metric condition). Raising the MAPE threshold by 100%, to 0.5, would raise the Predictability Ratio by ~30% (using only the MAPE error metric). Raising the MAPE threshold by 200%, to 0.75, would raise the Predictability Ratio by ~50% and would qualify approximately 90% of the queries. The corresponding sensitivity plots for the MaxAPE and NMSSE error metrics show that both chosen Predictability thresholds (1.0 and 10.0, respectively) are located much farther into the "Predictable Region" and qualify almost 90% of the queries. Thus, in our experiments we use the MAPE as our primary 'filter', while the MaxAPE and the NMSSE play a secondary role. A similar presentation shows the number of Predictable time series as a function of the Predictability Threshold (using only the MAPE error measure).

Prediction Error Diagnostics. In this section we show diagnostic plots for the US data (top 10,000 queries). The first figure shows the actual values vs. the predicted values (in log scale) for each of the 12 months in the Forecast Period; the top 12 diagrams refer to the entire set of queries, followed by 12 diagrams for the Predictable queries only. One can clearly see the better prediction performance for the Predictable queries, as expected. Notice that the performance for the different months deteriorates with time (higher average and STD of the prediction errors), especially towards the later months. In order to learn more about the distribution of the average and maximum prediction errors within the top 10,000 most popular queries in the US, we present the histograms of the MAPE and MaxAPE error measures, with the density estimation superimposed (in red). Both distributions are positively skewed, and the value of the average error is largely affected by the extreme error values. Notice that we have trimmed the data at 0.75 and 3.0 for MAPE and MaxAPE respectively (i.e., 3x the chosen thresholds), to stay focused on the major part of the distribution.
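The single-metric sensitivity analysis above amounts to counting how many queries fall below a candidate threshold. A minimal sketch, assuming one MAPE value per query and ignoring the MaxAPE, NMSSE and consistency conditions:

```python
import numpy as np

def predictability_ratio_curve(mape_values, thresholds):
    """Fraction of queries that would qualify as Predictable at each MAPE threshold."""
    m = np.asarray(mape_values, float)
    return [(m < t).mean() for t in thresholds]

# Example: how the ratio grows when the threshold is raised from 0.25 to 0.5 and 0.75.
# curve = predictability_ratio_curve(all_query_mapes, [0.25, 0.5, 0.75])
```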
Comparison of the Forecast Performance along the Future Horizon. Since in our experiments we simultaneously predict 12 months ahead, it is expected that the forecasts for the later months will have larger prediction errors. We compared the prediction performance for the 12 consecutive months in the forecast period. The first plot shows the distribution of MAPE prediction errors for each future month, i.e., the average monthly MAPE for the Predictable queries only (among the 10,000 most popular in the US). Notice that the first month is predicted with greater accuracy than the rest; there is then an approximately constant error level for months 2-9, with some increase in the error rate in the last 3 months of the Forecast period. The second plot shows the same type of diagram, but for the Mean Prediction Error (i.e., the 'directional' error measure that retains the sign). We can learn from this plot that there was a positive (upward) bias in the predictions in all months except the 11th. Such a systematic tendency of the errors can be explained by a reduction in query share for many queries in the Forecast period (Aug 2008 - July 2009) due to the global economic crisis; the actual search-interest values were therefore lower than expected by a prediction model based on the previous years. In the following section we present examples of categories (and queries) from various markets and brands for which the actual monthly query shares in the recent 12 months differ from the model prediction.

6. Search Interest Forecasting as a Baseline for Identifying Deviations

The aggregated query shares of the Google Insights for Search (I4S) categories were used in recent work by Choi and Varian (2009), which showed how data taken from Google I4S can help to predict economic time series. For example, in their analysis of US Retail Trade they used the weekly aggregated time series of categories such as Automotive, Computers & Electronics, Apparel, Sporting Goods, Mass Shopping, and Merchants & Department Stores. In a later work, [Choi and Varian 2009b] applied the same methodology to the U.S. unemployment time series using two sub-categories, Jobs and Welfare & Unemployment. They did not attempt to forecast the Google query share; rather, they successfully used it as a predictor for external economic time series. Other works have shown similar results regarding the capability of aggregated categories' query shares to predict econometric and unemployment data from Germany [Askitas and Zimmerman 2009] as well as from Israel [Suhoy 2009]. In the following, we show time series of the monthly query share of categories, where the forecast values (in red) are superimposed on the actual values (in blue). The errors made by the prediction model express the deviation between the expected and the actual search behavior, which conveys valuable information regarding the current state of search interest in the respective categories. Choi and Varian have shown that users' search interest in several categories, as represented by the aggregated query shares, indeed has short-term predictive power for the actual underlying economic time series. The following plots show the aggregated time series of various categories that relate to some major US markets. These category plots, which are ordered by their average MAPE, vary in their Predictability level. Of the 10 category plots, many present a clear seasonal pattern. The first 7 time series showed a relatively low error rate (MAPE<6%), which is in accordance with the substantial regularity of search behavior in the respective categories that was maintained throughout the Forecast period.
However, notice that the Finance & Insurance category, which shows a seasonal pattern with some medium irregularities (its seasonality ratio is well above the median), underwent a considerable change in the recent 12 months, resulting in observed discrepancies between the predicted and the actual monthly search interest. The months of September-October 2008, which were low months in each year during the entire History period, appear as peak months in the Forecast period. This is an example where the prediction model could not anticipate unexpected exogenous events. The Energy & Utility category showed the most irregular search behavior (with the lowest Seasonality Ratio and the highest Deviation Ratio among the first 9 categories). In addition to the low regularity of its history, this category also seems to have undergone a change in the dynamics of search interest, probably since mid-2008. Both factors contributed to the low prediction quality for this category. Another good example of a lack of Predictability with respect to the prediction model is the last plot, of the Social Networks & Online Communities category, which showed considerable exponential growth in the forecast period (due to the growing popularity of social networks like Facebook and Twitter) that could not be captured by the prediction model (notice its high deviation ratio). We show below several other examples of the relation between prediction performance and external market events.

Next, we show several examples where one can use the (posterior) prediction results in order to explore the changing dynamics of users' search behavior and possibly gain insights into the relevant markets. Whenever we observe substantial prediction errors, i.e., discrepancies between the actual and the predicted values, we can conclude that the regularities in the time series (e.g., seasonality and trend) which were captured by the prediction model were disturbed in the Forecast period. In cases where the actual values show a regularity that is not in accordance with the history's regularity, one could investigate the reasons for such deviation in relation to known external factors. It is important to emphasize that users' search interest is not necessarily always related to consumer preferences, buying intentions, etc., and can sometimes be related to news or other associated events. A full discussion of the background and reasons for the following market observations is beyond the scope of this paper.

Example: The Automotive Industry. The forecast for the entire Vehicle Brands category for the 12-month period between Aug-08 and Jul-09 shows a relatively small average deviation of -2.3%.4 However, as we show below, there are some noticeable deviations in different sub-categories. The next four plots show that the Vehicle Shopping category has an average negative deviation of 6% from the prediction model in the last 12 months, and that the Auto Financing category shows a smaller negative deviation of 2.3% on average. Notice that both the Vehicle Maintenance and Auto Parts categories show positive average deviations of 4.3% and 5.2% respectively, compared to the predictions.

4. The time series data was pulled during July 2009, so the value for this last month is partial and might be biased.
Example: US Unemployment. Choi and Varian (2009b) used weekly time series of the aggregated I4S categories Welfare & Unemployment and Jobs to help in the short-term prediction of the "Initial Jobless Claims" reports issued by the US Department of Labor. In the following plots, we show that the search interest in the category Welfare & Unemployment has risen substantially above the forecast of the prediction model. The deviation for Welfare & Unemployment is systematic and relatively large: while the average MAPE for the entire set of (aggregated) categories' query shares is 8.1% (STD 8.2%), the MAPE for Welfare & Unemployment is 31.2%, which is 2.8 standard deviations above the overall average MAPE. The actual monthly values for the aggregated query share of the category Jobs are also all higher than forecasted by the model. That time series shows a seasonal pattern with a distinct low value in December of each year and a relatively constant level in between. At the end of the History period and throughout the Forecast, this regularity is shifted upwards by a confounding volatile factor, which causes large positive prediction errors; the average error is almost 9% per month. We also present the aggregated query share of the category Recruitment & Staffing, for which we observe a corresponding negative deviation, where the model expectations are larger than the actual search-interest values. Interestingly, despite a seasonal pattern similar to that of the Jobs category, it seems that the change in users' search behavior in this category did not start until March 2009; before that the predictions were rather accurate, and the average monthly deviation is therefore only about -4.8%.

Example: Mexico as a Vacation Destination. In this example we show that the search interest in Mexico as a vacation destination has decreased substantially in recent months. The I4S category Mexico is a sub-category of the Vacation Destinations category (in the Travel root category), which aggregates only the vacation-related searches on Mexico. The next plots show that the search interest in the category Mexico is down by almost 15% compared to the prediction. In comparison, the respective deviation for the entire Vacation Destinations category is only -1.6% on average over the same forecast period. Notice, for reference, that the search interest for another related vacation destination, the Caribbean Islands (with a similar seasonal pattern), also has not shown a deviation of similar magnitude (only -2.5%). We considered the recent outbreak of the Swine Flu pandemic, which started to spread in April 2009, as a possible contributor to such a negative deviation of actual-vs-forecast query share for Mexico. We examined the time series of the query share for H1N1 and found it to be highly (anti-)correlated (r = -0.93) with the observed deviations for Mexico. As a reference, we show the aggregated query share for the category Infectious Diseases, demonstrating the magnitude of the search interest in this subject (in blue), which spiked following the Swine Flu outbreak.
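The actual-vs-forecast analysis in these examples reduces to a deviation series and, optionally, its correlation with another query-share series (as in the Mexico vs. H1N1 comparison). A minimal sketch, with all series names standing in for real data:

```python
import numpy as np

def relative_deviation(actual, forecast):
    """Signed monthly deviation of the actual query share from the model forecast."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return (actual - forecast) / forecast

def deviation_summary(actual, forecast, other_series=None):
    """Average deviation and, if given, correlation of the deviations with another series."""
    dev = relative_deviation(actual, forecast)
    summary = {"mean_deviation": dev.mean()}
    if other_series is not None:
        summary["corr_with_other"] = np.corrcoef(dev, np.asarray(other_series, float))[0, 1]
    return summary

# e.g. deviation_summary(mexico_actual, mexico_forecast, h1n1_query_share)
```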
Example: Recession Markers. The following plots present the aggregated query share for some I4S sub-categories in subjects that might demonstrate the influence of the recent recession on the search behavior of consumers, and which often appear in articles and blog posts. The change in search interest for the category Coupons & Rebates is visible in the first plot, where one can see an average monthly deviation of 15.9% between the observed query share in the recent 12 months and the values predicted by the model; the model captured the general seasonal pattern but accounted for a lower holiday peak and a much more moderate upward trend. Next, we see that the observed query share of the I4S category Restaurants is systematically lower than the model predictions. The time series of the aggregated search interest in this category does not show a seasonal pattern, but there is an upward trend since 2004 which was apparently broken in September 2008, causing a negative actual-vs-forecast deviation with an average of -7.8% per month. For reference, the Cooking & Recipes category shows a systematic positive deviation of actual-vs-forecast query share: the average monthly deviation of 6.15% represents a higher observed search interest in this category throughout the Forecast period compared to the model prediction, with an almost constant deviation since January 2009. Another example is the category Gifts, for which the query share has decreased in the recent 12 months compared to the model predictions, by 11% per month on average. Finally, the category Luxury Goods shows a negative deviation in the actual-vs-forecast query share of 5.8% per month on average.

7. Conclusions

We studied the predictability of search trends. We found that over half of the most popular Google search queries are predictable with respect to the method we selected; that several search categories are considerably more predictable than others; that the aggregated queries of the different categories are more predictable than the individual queries; and that almost 90% of I4S categories have predictable query shares. In particular, we showed that queries with seasonal time series and lower levels of outliers are more predictable. We considered forecasting as a baseline for identifying actual-vs-forecast deviations, and examined some concrete examples from the automotive, travel and labor verticals. Further research can include an improved implementation of the prediction model as well as the incorporation of other forecasting models. We would also like to examine short-term forecasting at finer time granularity. Further analysis of actual-vs-forecast deviations (including confidence estimation) could be conducted in various domains, such as market analysis, economics and health. In conjunction with this study, a basic forecasting capability was introduced into Google Insights for Search, which provides forecasting for trends that are identified as predictable. Researchers, marketers, journalists, and others can use I4S to get a wide picture of search trends, which now also includes the predictability of single queries and aggregated categories in any area of interest.

Acknowledgments. We would like to thank Yannai Gonczarowsky for designing and implementing the forecasting capabilities in I4S, as well as Nir Andelman, Yuval Netzer and Amit Weinstein for creating the forecasting model library. We thank Hal Varian for his helpful comments. Special thanks to the entire team of Google Insights for Search that made this research possible.

References

[Askitas and Zimmerman 2009] Nikos Askitas and Klaus F. Zimmerman. Google econometrics and unemployment forecasting. Applied Economics Quarterly, 55:107-120, 2009. URL http://ftp.iza.org/dp4201.pdf
[Choi and Varian 2009] Hyunyoung Choi and Hal Varian. Predicting the present with Google Trends. Technical report, Google, 2009. URL http://google.com/googleblogs/pdfs/google_predicting_the_present.pdf
[Choi and Varian 2009b] Hyunyoung Choi and Hal Varian. Predicting Initial Claims for Unemployment Insurance Using Google Trends. Technical report, Google, 2009. URL http://research.google.com/archive/papers/initialclaimsUS.pdf
[Cleveland and Tiao 1976] W.P. Cleveland and G.C. Tiao. Decomposition of Seasonal Time Series: A Model for the Census X-11 Program. Journal of the American Statistical Association, Vol. 71, No. 355, 1976, pp. 581-587.
[Cleveland et al. 1990] R.B. Cleveland, W.S. Cleveland, J.E. McRae and Irma Terpenning. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1, 1990, pp. 3-73.
[Ginsberg et al. 2009] Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (2009). URL http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
[Lytras et al. 2007] Demerta P. Lytras, Roxanne M. Felpausch, and William R. Bell. Determining Seasonality: A Comparison of Diagnostics From X-12-ARIMA. Presented at ICES III, June 2007.
[Suhoy 2009] Tanya Suhoy. Query indices and a 2008 downturn: Israeli data. Technical report, Bank of Israel, 2009. URL http://www.bankisrael.gov.il/deptdata/mehkar/papers/dp0906e.pdf

Building High-level Features Using Large Scale Unsupervised Learning

Quoc V. Le quocle@cs.stanford.edu
Marc'Aurelio Ranzato ranzato@google.com
Rajat Monga rajatmonga@google.com
Matthieu Devin mdevin@google.com
Kai Chen kaichen@google.com
Greg S. Corrado gcorrado@google.com
Jeff Dean jeff@google.com
Andrew Y. Ng ang@cs.stanford.edu

Abstract

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections; the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

1. Introduction

The focus of this work is to build high-level, class-specific feature detectors from unlabeled images. For instance, we would like to understand if it is possible to build a face detector from only unlabeled images.
This approach is inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as "grandmother neurons." The extent of class-specificity of neurons in the brain is an area of active investigation, but current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005). Contemporary computer vision methodology typically emphasizes the role of labeled data in obtaining these class-specific feature detectors. For example, to build a face detector, one needs a large collection of images labeled as containing faces, often with a bounding box around the face. The need for large labeled sets poses a significant challenge for problems where labeled data are rare. Although approaches that make use of inexpensive unlabeled data are often preferred, they have not been shown to work well for building high-level features.

This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results. Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the "grandmother neuron" could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them, and not because it is guided by supervision or rewards.

Unsupervised feature learning and deep learning have emerged as methodologies in machine learning for building features from unlabeled data. Using unlabeled data in the wild to learn features is the key idea behind the self-taught learning framework (Raina et al., 2007). Successful feature learning algorithms and their applications can be found in recent literature using a variety of approaches such as RBMs (Hinton et al., 2006), autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007), sparse coding (Lee et al., 2007) and K-means (Coates et al., 2011). So far, most of these algorithms have only succeeded in learning low-level features such as "edge" or "blob" detectors. Going beyond such simple features and capturing complex invariances is the topic of this work.

Recent studies observe that it is quite time-intensive to train deep learning algorithms to yield state-of-the-art results (Ciresan et al., 2010). We conjecture that the long training time is partially responsible for the lack of high-level features reported in the literature. For instance, researchers typically reduce the sizes of datasets and models in order to train networks in a practical amount of time, and these reductions undermine the learning of high-level features. We address this problem by scaling up the core components involved in training deep networks: the dataset, the model, and the computational resources. First, we use a large dataset generated by sampling random frames from random YouTube videos.1 Our input data are 200x200 images, much larger than the typical 32x32 images used in deep learning and unsupervised feature learning (Krizhevsky, 2009; Ciresan et al., 2010; Le et al., 2010; Coates et al., 2011).
Our model, a deep autoencoder with pooling and local contrast normalization, is scaled to these large images by using a large computer cluster. To support parallelism on this cluster, we use the idea of local receptive fields, e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This idea reduces communication costs between machines and thus allows model parallelism (parameters are distributed across machines). Asynchronous SGD is employed to support data parallelism. The model was trained in a distributed fashion on a cluster with 1,000 machines (16,000 cores) for three days. Experimental results using classification and visualization confirm that it is indeed possible to build high-level features from unlabeled data. In particular, using a hold-out test set consisting of faces and distractors, we discover a feature that is highly selective for faces. This result is also validated by visualization via numerical optimization. Control experiments show that the learned detector is not only invariant to translation but also to out-of-plane rotation and scaling. Similar experiments reveal that the network also learns the concepts of cat faces and human bodies. The learned representations are also discriminative. Using the learned features, we obtain significant leaps in object recognition with ImageNet. For instance, on ImageNet with 22,000 categories, we achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art. Note that random guessing achieves less than 0.005% accuracy for this dataset.

1. This is different from the work of (Lee et al., 2009), who trained their model on images from one class.

2. Training set construction

Our training dataset is constructed by sampling frames from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. Each example is a color image with 200x200 pixels. A subset of training images is shown in Appendix A. To check the proportion of faces in the dataset, we ran an OpenCV face detector on 60x60 randomly-sampled patches from the dataset (http://opencv.willowgarage.com/wiki/). This experiment shows that patches detected as faces by the OpenCV face detector account for less than 3% of the 100,000 sampled patches.

3. Algorithm

In this section, we describe the algorithm that we use to learn features from the unlabeled training set.

3.1. Previous work

Our work is inspired by recent successful algorithms in unsupervised feature learning and deep learning (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2007). It is strongly influenced by the work of (Olshausen & Field, 1996) on sparse coding. According to their study, sparse coding can be trained on unlabeled natural images to yield receptive fields akin to V1 simple cells (Hubel & Wiesel, 1959). One shortcoming of early approaches such as sparse coding (Olshausen & Field, 1996) is that their architectures are shallow and typically capture low-level concepts (e.g., edge "Gabor" filters) and simple invariances. Addressing this issue is a focus of recent work in deep learning (Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009), which builds hierarchies of feature representations. In particular, Lee et al. (2008) show that stacked sparse RBMs can model certain simple functions of the V2 area of the cortex.
They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category.

Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.

3.2. Architecture

Our algorithm is built upon these ideas and can be viewed as a sparse deep autoencoder with three important ingredients: local receptive fields, pooling and local contrast normalization. First, to scale the autoencoder to large images, we use a simple idea known as local receptive fields (LeCun et al., 1998; Raina et al., 2009; Lee et al., 2009; Le et al., 2010). This biologically inspired idea proposes that each feature in the autoencoder can connect to only a small region of the lower layer. Next, to achieve invariance to local deformations, we employ local L2 pooling (Hyvärinen et al., 2009; Gregor & LeCun, 2010; Le et al., 2010) and local contrast normalization (Jarrett et al., 2009). L2 pooling, in particular, allows the learning of invariant features (Hyvärinen et al., 2009; Le et al., 2010). Our deep autoencoder is constructed by replicating three times the same stage composed of local filtering, local pooling and local contrast normalization. The output of one stage is the input to the next one, and the overall model can be interpreted as a nine-layered network (see Figure 1). The first and second sublayers are often known as filtering (or simple) and pooling (or complex) respectively. The third sublayer performs local subtractive and divisive normalization and is inspired by biological and computational models (Pinto et al., 2008; Lyu & Simoncelli, 2008; Jarrett et al., 2009).2

As mentioned above, central to our approach is the use of local connectivity between neurons. In our experiments, the first sublayer has receptive fields of 18x18 pixels and the second sublayer pools over 5x5 overlapping neighborhoods of features (i.e., pooling size). The neurons in the first sublayer connect to pixels in all input channels (or maps), whereas the neurons in the second sublayer connect to pixels of only one channel (or map).3 While the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs, and therefore it is known as L2 pooling. Our style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is reminiscent of Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999). It has also been argued to be an architecture employed by the brain (DiCarlo et al., 2012). Although we use local receptive fields, they are not convolutional: the parameters are not shared across different locations in the image. This is a stark difference between our approach and previous work (LeCun et al., 1998; Jarrett et al., 2009; Lee et al., 2009). In addition to being more biologically plausible, unshared weights allow the learning of more invariances other than translational invariance (Le et al., 2010).
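To make the stage structure concrete, here is a minimal 1D NumPy sketch of one stage: unshared local filtering, L2 pooling, and local contrast normalization. It is only an illustration of the computation pattern; the sizes, random weights and flat normalization window are invented (the real network uses 18x18 receptive fields, 5x5 pooling, a Gaussian window and learned weights).

```python
import numpy as np

def local_filter(x, W, rf=4, stride=2):
    """Unshared-weight filtering: output unit i has its own weights W[i],
    connected only to a local window of the input."""
    n_out = (len(x) - rf) // stride + 1
    return np.array([W[i] @ x[i * stride: i * stride + rf] for i in range(n_out)])

def l2_pool(h, pool=3):
    """L2 pooling: square root of the sum of squares over a local neighborhood."""
    return np.array([np.sqrt(np.sum(h[i:i + pool] ** 2)) for i in range(len(h) - pool + 1)])

def local_contrast_norm(p, width=3, c=0.01):
    """Subtractive then divisive normalization over a local window."""
    pad = width // 2
    padded = np.pad(p, pad, mode="edge")
    local_mean = np.array([padded[i:i + width].mean() for i in range(len(p))])
    g = p - local_mean
    gpad = np.pad(g, pad, mode="edge")
    local_energy = np.array([np.sqrt((gpad[i:i + width] ** 2).mean()) for i in range(len(p))])
    return g / np.maximum(c, local_energy)

x = np.random.randn(32)                      # toy 1D "image"
W = np.random.randn((32 - 4) // 2 + 1, 4)    # one weight vector per output unit
stage_output = local_contrast_norm(l2_pool(local_filter(x, W)))
```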
In terms of scale, our network is perhaps one of the largest known networks to date. It has 1 billion trainable parameters, which is more than an order of magnitude larger than other large networks reported in the literature, e.g., (Ciresan et al., 2010; Sermanet & LeCun, 2011) with around 10 million parameters. It is worth noting that our network is still tiny compared to the human visual cortex, which is 10^6 times larger in terms of the number of neurons and synapses (Pakkenberg et al., 2003).

3.3. Learning and Optimization

Learning: During learning, the parameters of the second sublayers (H) are fixed to uniform weights, whereas the encoding weights W1 and decoding weights W2 of the first sublayers are adjusted using the following optimization problem:

\min_{W_1, W_2} \sum_{i=1}^{m} \left[ \left\| W_2 W_1^{T} x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^{T} x^{(i)} \right)^2} \right]   (1)

Here, \lambda is a tradeoff parameter between sparsity and reconstruction; m and k are the number of examples and of pooling units in a layer, respectively; and H_j is the vector of weights of the j-th pooling unit. In our experiments, we set \lambda = 0.1. This optimization problem is also known as reconstruction Topographic Independent Component Analysis (Hyvärinen et al., 2009; Le et al., 2011a).4 The first term in the objective ensures that the representations encode important information about the data, i.e., that they can reconstruct the input data, whereas the second term encourages pooling features to group similar features together to achieve invariances.

2. The subtractive normalization removes the weighted average of neighboring neurons from the current neuron: g_{i,j,k} = h_{i,j,k} - \sum_{i,u,v} G_{u,v} h_{i,j+u,k+v}. The divisive normalization computes y_{i,j,k} = g_{i,j,k} / \max\{ c, (\sum_{i,u,v} G_{u,v} g_{i,j+u,k+v}^2)^{1/2} \}, where c is set to a small number, 0.01, to prevent numerical errors, and G is a Gaussian weighting window (Jarrett et al., 2009).
3. For more details regarding connectivity patterns and parameter sensitivity, see Appendix B and E.
4. In (Bengio et al., 2007; Le et al., 2011a), the encoding weights and the decoding weights are tied: W1 = W2. However, for better parallelism and better features, our implementation does not enforce tied weights.
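The following NumPy sketch evaluates an objective of the form of Eq. (1) on a mini-batch. It is a simplified, dense stand-in: W1, W2 and the fixed pooling matrix H are small dense arrays here, whereas in the actual model they are local and distributed across machines; the sizes and the value of epsilon are illustrative.

```python
import numpy as np

def rica_objective(X, W1, W2, H, lam=0.1, eps=1e-3):
    """X: (m, d) examples; W1: (d, n) encoder; W2: (d, n) decoder; H: (k, n) pooling weights."""
    Z = X @ W1                                 # first-sublayer responses W1^T x, shape (m, n)
    recon = Z @ W2.T                           # W2 W1^T x for each example, shape (m, d)
    recon_err = np.sum((recon - X) ** 2)       # first term of Eq. (1)
    pooled = np.sqrt(eps + (Z ** 2) @ H.T)     # sqrt(eps + H_j (W1^T x)^2), shape (m, k)
    sparsity = np.sum(pooled)                  # second term of Eq. (1)
    return recon_err + lam * sparsity

d, n, k, m = 64, 32, 8, 16
X = np.random.randn(m, d)
W1, W2 = 0.1 * np.random.randn(d, n), 0.1 * np.random.randn(d, n)
H = np.abs(np.random.randn(k, n))              # fixed, non-negative pooling weights
print(rica_objective(X, W1, W2, H))
```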
Optimization: All parameters in our model were trained jointly, with the objective being the sum of the objectives of the three layers. To train the model, we implemented model parallelism by distributing the local weights W1, W2 and H to different machines. A single instance of the model partitions the neurons and weights across 169 machines (where each machine had 16 CPU cores). A set of machines that collectively make up a single copy of the model is referred to as a "model replica." We have built a software framework called DistBelief that manages all the necessary communication between the different machines within a model replica, so that users of the framework merely need to write the desired upwards and downwards computation functions for the neurons in the model, and do not have to deal with the low-level communication of data across machines. We further scaled up the training by implementing asynchronous SGD using multiple replicas of the core model. For the experiments described here, we divided the training data into 5 portions and ran a copy of the model on each of these portions. The models communicate updates through a set of centralized "parameter servers," which keep the current state of all parameters for the model in a set of partitioned servers (we used 256 parameter server partitions for training the model described in this paper). In the simplest implementation, before processing each mini-batch a model replica asks the centralized parameter servers for an updated copy of its model parameters. It then processes a mini-batch to compute a parameter gradient and sends the parameter gradients to the appropriate parameter servers, which then apply each gradient to the current value of the model parameter. We can reduce the communication overhead by having each model replica request updated parameters every P steps and by sending updated gradient values to the parameter servers every G steps (where G might not be equal to P). Our DistBelief software framework automatically manages the transfer of parameters and gradients between the model partitions and the parameter servers, freeing implementors of the layer functions from having to deal with these issues. Asynchronous SGD is more robust to failure and slowness than standard (synchronous) SGD: for synchronous SGD, if one of the machines is slow, the entire training process is delayed, whereas for asynchronous SGD, if one machine is slow, only one copy of SGD is delayed while the rest of the optimization can still proceed. In our training, at every step of SGD, the gradient is computed on a mini-batch of 100 examples. We trained the network on a cluster with 1,000 machines for three days. See Appendices B, C, and D for more details regarding our implementation of the optimization.

4. Experiments on Faces

In this section, we describe our analysis of the learned representations in recognizing faces ("the face detector") and present control experiments to understand the invariance properties of the face detector. Results for other concepts are presented in the next section.

4.1. Test set

The test set consists of 37,000 images sampled from two datasets: the Labeled Faces in the Wild dataset (Huang et al., 2007) and the ImageNet dataset (Deng et al., 2009). There are 13,026 faces sampled from non-aligned Labeled Faces in the Wild.5 The rest are distractor objects randomly sampled from ImageNet. These images are resized to fit the visible areas of the top neurons. Some example images are shown in Appendix A.

5. http://vis-www.cs.umass.edu/lfw/lfw.tgz

4.2. Experimental protocols

After training, we used this test set to measure the performance of each neuron in classifying faces against distractors. For each neuron, we found its maximum and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among the 20 thresholds.

4.3. Recognition

Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy, while the best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%. To understand their contribution, we removed the local contrast normalization sublayers and trained the network again. Results show that the accuracy of the best neuron drops to 78.5%. This agrees with a previous study showing the importance of local contrast normalization (Jarrett et al., 2009).
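A concrete reading of the evaluation protocol in Section 4.2 is sketched below: sweep 20 equally spaced thresholds between a neuron's minimum and maximum activations and report the best accuracy. It assumes that a higher activation indicates a face, which the text does not state explicitly.

```python
import numpy as np

def best_threshold_accuracy(activations, labels, n_thresholds=20):
    """activations: one value per test image; labels: 1 for faces, 0 for distractors."""
    a = np.asarray(activations, float)
    y = np.asarray(labels, int)
    best = 0.0
    for t in np.linspace(a.min(), a.max(), n_thresholds):
        pred = (a > t).astype(int)             # assumes higher activation => face
        best = max(best, (pred == y).mean())
    return best
```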
We visualize histograms of activation values for face images and random images in Figure 2. It can be seen that, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors: when we give a face as an input image, the neuron tends to output a value larger than the threshold, 0, whereas for a random input image the neuron tends to output a value less than 0.

Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.

4.4. Visualization

In this section, we present two visualization techniques to verify whether the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near-optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus (Berkes & Wiskott, 2005; Erhan et al., 2009; Le et al., 2010). In particular, we find the norm-bounded input x which maximizes the output f of the tested neuron by solving:

x* = arg max_x f(x; W, H), subject to ||x||_2 = 1.

Here, f(x; W, H) is the output of the tested neuron given the learned parameters W, H and input x. In our experiments, this constrained optimization problem is solved by projected gradient descent with line search. These visualization methods have complementary strengths and weaknesses: visualizing the most responsive stimuli may suffer from fitting to noise, while the numerical optimization approach can be susceptible to local minima. Results, shown in Figure 3, confirm that the tested neuron indeed learns the concept of faces.

Figure 3. Top: the top 48 stimuli of the best neuron from the test set. Bottom: the optimal stimulus according to numerical constrained optimization.

4.5. Invariance properties

We would like to assess the robustness of the face detector against common object transformations, e.g., translation, scaling and out-of-plane rotation. First, we chose a set of 10 face images and performed distortions to them, e.g., scaling and translating. For out-of-plane rotation, we used 10 images of faces rotating in 3D ("out-of-plane") as the test set. To check the robustness of the neuron, we plot its averaged response over this small test set with respect to changes in scale, 3D rotation (Figure 4), and translation (Figure 5).6

6. Scaled and translated faces are generated by standard cubic interpolation. For 3D rotated faces, we used 10 sequences of rotated faces from The Sheffield Face Database – http://www.sheffield.ac.uk/eee/research/iel/research/face. See Appendix F for a sample sequence.

Figure 4. Scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.

Figure 5. Translational invariance properties of the best feature. The x-axis is in pixels.

The results show that the neuron is robust against complex and difficult-to-hard-wire invariances such as out-of-plane rotation and scaling.

Control experiments on a dataset without faces: As reported above, the best neuron achieves 81.7% accuracy in classifying faces against random distractors. What if we remove all images that have faces from the training set? We performed this control experiment by running a face detector in OpenCV and removing those training images that contain at least one face. The recognition accuracy of the best neuron dropped to 72.5%, which is as low as the simple linear filters reported in Section 4.3.
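Returning to the numerical-optimization visualization of Section 4.4, here is a minimal sketch of gradient ascent with projection onto the unit sphere. The gradient function is a placeholder for backpropagation through the trained network, and a fixed step size stands in for the line search used above; the toy check uses a linear "neuron," whose constrained maximizer is known in closed form.

```python
import numpy as np

def optimal_stimulus(grad_fn, dim, steps=200, lr=0.1, seed=0):
    """Maximize f(x; W, H) subject to ||x||_2 = 1 by projected gradient ascent.
    grad_fn(x) should return the gradient of the neuron output w.r.t. the input."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    x /= np.linalg.norm(x)                     # start on the unit sphere
    for _ in range(steps):
        x = x + lr * grad_fn(x)                # ascend the neuron output
        x /= np.linalg.norm(x)                 # project back onto ||x||_2 = 1
    return x

# Toy check: for f(x) = w.x the constrained maximizer is w / ||w||.
w = np.random.default_rng(1).standard_normal(16)
x_star = optimal_stimulus(lambda x: w, dim=16)
assert np.allclose(x_star, w / np.linalg.norm(w), atol=1e-3)
```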
5. Cat and human body detectors

Having achieved a face-sensitive neuron, we would like to understand whether the network is also able to detect other high-level concepts. For instance, cats and body parts are quite common on YouTube. Did the network also learn these concepts? To answer this question and quantify the selectivity properties of the network with respect to these concepts, we constructed two datasets, one for classifying human bodies against random backgrounds and one for classifying cat faces against other random distractors. For ease of interpretation, these datasets have a positive-to-negative ratio identical to the face dataset. The cat face images are collected from the dataset described in (Zhang et al., 2008). In this dataset, there are 10,000 positive images and 18,409 negative images (so that the positive-to-negative ratio is similar to the case of faces). The negative images are chosen randomly from the ImageNet dataset. Negative and positive examples in our human body dataset are subsampled at random from a benchmark dataset (Keller et al., 2009). In the original dataset, each example is a pair of stereo black-and-white images, but for simplicity we keep only the left images. In total, as in the case of human faces, we have 13,026 positive and 23,974 negative examples. We then followed the same experimental protocols as before. The results, shown in Figure 6, confirm that the network learns not only the concept of faces but also the concepts of cat faces and human bodies.

Figure 6. Visualization of the cat face neuron (left) and human body neuron (right).

Our high-level detectors also outperform standard baselines in terms of recognition rates, achieving 74.8% and 76.7% on cat and human body respectively. In comparison, the best linear filters (sampled from the training set) only achieve 67.2% and 68.1% respectively. In Table 1, we summarize all previous numerical results comparing the best neurons against other baselines such as linear filters and random guesses. To understand the effects of training, we also measure the performance of the best neurons in the same network at random initialization. We also compare our method against several other algorithms such as deep autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007) and K-means (Coates et al., 2011). Results for these baselines are reported in the bottom of Table 1.

6. Object recognition with ImageNet

We applied the feature learning method to the task of recognizing objects in the ImageNet dataset (Deng et al., 2009). We started from a network that had already learned features from YouTube and ImageNet images using the techniques described in this paper. We then added one-versus-all logistic classifiers on top of the highest layer of this network.

Table 1. Summary of numerical comparisons between our algorithm and other baselines. Top: our algorithm vs. simple baselines. Here, the first three columns are results for methods that do not require training: random guess, random weights (of the network at initialization, without any training) and the best linear filters selected from 100,000 examples sampled from the training set.
The last three columns are results for methods that involve training: the best neuron in the first layer, the best neuron in the highest layer after training, and the best neuron in the network when the contrast normalization layers are removed. Bottom: our algorithm vs. autoencoders and K-means.

Concept | Random guess | Same architecture with random weights | Best linear filter | Best first layer neuron | Best neuron | Best neuron without contrast normalization
Faces | 64.8% | 67.0% | 74.0% | 71.0% | 81.7% | 78.5%
Human bodies | 64.8% | 66.5% | 68.1% | 67.2% | 76.8% | 71.8%
Cats | 64.8% | 66.0% | 67.8% | 67.1% | 74.6% | 69.3%

Concept | Our network | Deep autoencoders (3 layers) | Deep autoencoders (6 layers) | K-means on 40x40 images
Faces | 81.7% | 72.3% | 70.9% | 72.5%
Human bodies | 76.7% | 71.2% | 69.8% | 69.3%
Cats | 74.8% | 67.5% | 68.3% | 68.5%

Table 2. Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.

Dataset version | 2009 (~9M images, ~10K categories) | 2011 (~14M images, ~22K categories)
State-of-the-art | 16.7% (Sanchez & Perronnin, 2011) | 9.3% (Weston et al., 2011)
Our method (without unsupervised pretraining) | 16.1% | 13.6%
Our method (with unsupervised pretraining) | 19.2% | 15.8%

This method of initializing a network by unsupervised learning is also known as "unsupervised pretraining." During supervised learning with labeled ImageNet images, the parameters of the lower layers and the logistic classifiers were both adjusted. This was done by first adjusting the logistic classifiers and then adjusting the entire network (also known as "fine-tuning"). As a control experiment, we also trained a network starting with all random weights (i.e., without unsupervised pretraining: all parameters are initialized randomly and adjusted only using the labeled ImageNet data). We followed the experimental protocols specified by (Deng et al., 2010; Sanchez & Perronnin, 2011), in which the datasets are randomly split into two halves for training and validation. We report the performance on the validation set and compare against state-of-the-art baselines in Table 2. Note that the splits are not identical to previous work, but validation set performances vary only slightly across different splits. The results show that our method, starting from scratch (i.e., from raw pixels), bests many state-of-the-art hand-engineered features. On ImageNet with 10K categories, our method yielded a 15% relative improvement over the previous best published result. On ImageNet with 22K categories, it achieved a 70% relative improvement over the highest other result of which we are aware (including unpublished results known to the authors of (Weston et al., 2011)). Note that random guessing achieves less than 0.005% accuracy for this dataset.

7. Conclusion

In this work, we simulated high-level class-specific neurons using unlabeled data. We achieved this by combining ideas from recently developed algorithms to learn invariances from unlabeled data. Our implementation scales to a cluster with thousands of machines thanks to model parallelism and asynchronous SGD. Our work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data. In our experiments, we obtained neurons that function as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos. These neurons naturally capture complex invariances such as out-of-plane and scale invariances. The learned representations also work well for discriminative tasks.
Starting from these representations, we obtain 15.8% accuracy for object recognition on ImageNet with 20,000 categories, a significant leap of 70% relative improvement over the state-of-the-art.

Acknowledgements: We thank Samy Bengio, Adam Coates, Tom Dean, Jia Deng, Mark Mao, Peter Norvig, Paul Tucker, Andrew Saxe, and Jon Shlens for helpful discussions and suggestions.

References

Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS, 2007.
Berkes, P. and Wiskott, L. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 2005.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
Desimone, R., Albright, T., Gross, C., and Bruce, C. Stimulus-selective properties of inferior temporal neurons in the macaque. The Journal of Neuroscience, 1984.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 2012.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of deep networks. Technical report, University of Montreal, 2009.
Fukushima, K. and Miyake, S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.
Gregor, K. and LeCun, Y. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
Hubel, D. H. and Wiesel, T. N. Receptive fields of single neurons in the cat's visual cortex. Journal of Physiology, 1959.
Hyvärinen, A., Hurri, J., and Hoyer, P. O. Natural Image Statistics. Springer, 2009.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
Keller, C., Enzweiler, M., and Gavrila, D. M. A new benchmark for stereo-based pedestrian detection. In Proc. of the IEEE Intelligent Vehicles Symposium, 2009.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.
Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA with reconstruction cost for efficient overcomplete feature learning. In NIPS, 2011a.
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In ICML, 2011b.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.
Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR, 2008.
Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
Pakkenberg, B., Pelvig, D., Marner, L., Bundgaard, M. J., Gundersen, H. J. G., Nyengaard, J. R., and Regeur, L. Aging and the human neocortex. Experimental Gerontology, 2003.
Pinto, N., Cox, D. D., and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I. Invariant visual representation by single neurons in the human brain. Nature, 2005.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.
Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
Riesenhuber, M. and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.
Sanchez, J. and Perronnin, F. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.
Sermanet, P. and LeCun, Y. Traffic sign recognition with multiscale convolutional networks. In IJCNN, 2011.
Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
Zhang, W., Sun, J., and Tang, X. Cat head detection - how to effectively exploit shape and texture features. In ECCV, 2008.

A. Training and test images

A subset of training images is shown in Figure 7. As can be seen, the positions, scales, and orientations of faces in the dataset are diverse. A subset of test images for identifying the face neuron is shown in Figure 8.

Figure 7. Thirty randomly-selected training images (shown before the whitening step).

Figure 8. Some example test set images (shown before the whitening step).

B. Models

Central to our approach in this paper is the use of locally-connected networks. In these networks, neurons only connect to a local region of the layer below. In Figure 9, we show the connectivity patterns of the neural network architecture described in the paper. The actual images in the experiments are 2D, but for simplicity, our images in the visualization are in 1D.

Figure 9. Diagram of the network we used, with more detailed connectivity patterns. Colored arrows mean that weights connect to only one map. Dark arrows mean that weights connect to all maps. Pooling neurons only connect to one map, whereas simple neurons and LCN neurons connect to all maps.

C. Model Parallelism

We use model parallelism to distribute the storage of parameters and gradient computations to different machines.
In Figure 10, we show how the weights are divided and stored in different "partitions," or, more simply, machines (see also (Krizhevsky, 2009)).

Figure 10. Model parallelism with the network architecture in use. Here, it can be seen that the weights are divided according to the locality of the image and stored on different machines. Concretely, the weights that connect to the left side of the image are stored in machine 1 ("partition 1"), the weights that connect to the central part of the image are stored in machine 2 ("partition 2"), and the weights that connect to the right side of the image are stored in machine 3 ("partition 3").

D. Further multicore parallelism

Machines in our cluster have many cores, which allow further parallelism. Hence, we split these cores to perform different tasks. In our implementation, the cores are divided into three groups: reading data, sending (or writing) data, and performing arithmetic computations. At every time instance, these groups work in parallel to load data, compute numerical results, and send results to the network or write data to disks.

E. Parameter sensitivity

The hyper-parameters of the network are chosen to fit computational constraints and optimize the training time of our algorithm. These parameters can be changed at the expense of longer training time or more computational resources. For instance, one could increase the size of the receptive fields at the expense of using more memory, more computation, and more network bandwidth per machine; or one could increase the number of maps at the expense of using more machines and memory. These hyper-parameters could also affect the performance of the features. We performed control experiments to understand the effects of two hyper-parameters: the size of the receptive fields and the number of maps. By varying each of these parameters and observing the test set accuracies, we can gain an understanding of how much they affect the performance on the face recognition task. Results, shown in Figure 11, confirm that the results are only slightly sensitive to changes in these control parameters.

Figure 11. Left: effects of receptive field sizes on the test set accuracy. Right: effects of the number of maps on the test set accuracy.

F. Example out-of-plane rotated face sequence

In Figure 12, we show an example sequence of 3D (out-of-plane) rotated faces. Note that the faces are black and white but are treated as color pictures in the test. More details are available at the webpage for The Sheffield Face Database – http://www.sheffield.ac.uk/eee/research/iel/research/face

Figure 12. A sequence of 3D (out-of-plane) rotated faces of one individual. The dataset consists of 10 sequences.

G. Best linear filters

In the paper, we performed control experiments to compare our features against "best linear filters." This baseline works as follows. The first step is to sample 100,000 random patches (or filters) from the training set (each patch has the size of a test set image). Then, for each patch, we compute its cosine distances to the test set images. The cosine distances are treated as feature values. Using these feature values, we then search among 20 thresholds to find the best accuracy of a patch in classifying faces against distractors. Each patch gives one accuracy for our test set. The reported accuracy is the best accuracy among the 100,000 patches randomly selected from the training set.
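A compact sketch of this baseline, under the assumption that cosine similarity between flattened patches and flattened test images is used directly as the per-patch feature value:

```python
import numpy as np

def cosine_scores(patch, test_images):
    """Cosine similarity between one flattened patch and each flattened test image."""
    t = test_images / np.linalg.norm(test_images, axis=1, keepdims=True)
    return t @ (patch / np.linalg.norm(patch))

def best_linear_filter_accuracy(patches, test_images, labels, n_thresholds=20):
    """Best threshold-sweep accuracy over all candidate patches (e.g. 100,000 of them)."""
    y = np.asarray(labels, int)
    best = 0.0
    for patch in patches:
        s = cosine_scores(patch, test_images)
        for t in np.linspace(s.min(), s.max(), n_thresholds):
            best = max(best, ((s > t).astype(int) == y).mean())
    return best
```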
The reported accuracy is the best accuracy among 100,000 patches randomly-selected from the training set. H. Histograms on the entire test set Here, we also show the detailed histograms for the neurons on the entire test sets. The fact that the histograms are distinctive for positive and negative images suggests that the network has learned the concept detectors.Building high-level features using large-scale unsupervised learning Figure 13. Histograms of neuron’s activation values for the best face neuron on the test set. Red: the histogram for face images. Blue: the histogram for random distractors. Figure 14. Histograms for the best human body neuron on the test set. Red: the histogram for human body images. Blue: the histogram for random distractors. I. Most responsive stimuli for cats and human bodies In Figure 16, we show the most responsive stimuli for cat and human body neurons on the test sets. Note that, the top stimuli for the human body neuron are black and white images because the test set images are black and white (Keller et al., 2009). J. Implementation details for autoencoders and K-means In our implementation, deep autoencoders are also locally connected and use sigmoidal activation function. For K-means, we downsample images to 40x40 in order to lower computational costs. We also varied the parameters of autoencoders, K-means and chose them to maximize performances given resource constraints. In our experiments, we used 30,000 centroids for Kmeans. These models also employed parallelism in a similar fashion described in the paper. They also used 1,000 machines for three days. Figure 15. Histograms for the best cat neuron on the test set. Red: the histogram for cat images. Blue: the histogram for random distractors. Figure 16. Top: most responsive stimuli on the test set for the cat neuron. Bottom: Most responsive human body stimuli on the test set for the human body neuron. On-Demand Language Model Interpolation for Mobile Speech Input Brandon Ballinger1, Cyril Allauzen2, Alexander Gruenstein1, Johan Schalkwyk2 1Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA 2Google, 76 Ninth Avenue, New York, NY 10011, USA brandonb@google.com, allauzen@google.com, alexgru@google.com, johans@google.com Abstract Google offers several speech features on the Android mobile operating system: search by voice, voice input to any text field, and an API for application developers. As a result, our speech recognition service must support a wide range of usage scenarios and speaking styles: relatively short search queries, addresses, business names, dictated SMS and e-mail messages, and a long tail of spoken input to any of the applications users may install. We present a method of on-demand language model interpolation in which contextual information about each utterance determines interpolation weights among a number of n-gram language models. On-demand interpolation results in an 11.2% relative reduction in WER compared to using a single language model to handle all traffic. Index Terms: language modeling, interpolation, mobile 1. Introduction Entering text on mobile devices is often slow and error-prone in comparison to typing on a full-sized keyboard. Google offers several features on Android aimed at making speech a viable alternative input method: search by voice, voice input into any text field, and a speech API for application developers. To search by voice, users simply tap a microphone icon on the desktop search box, or hold down the physical search button. 
They can speak any query, and are then shown the Google search results. To use the Voice Input feature, users tap the microphone key on the on-screen keyboard, and then speak to enter text virtually anywhere they would normally type. Users may dictate e-mail and SMS messages, fill in forms on web pages, or enter text into any application. Finally, the Android Speech API is a simple way for developers to integrate speech recognition capabilities into their own applications. While a large portion of usage of the speech recognition service is comprised of spoken queries and dictation of SMS messages, there is a long tail of usage from thousands of other applications. Due to this diversity, choosing an appropriate language model for each utterance (recorded audio) is challenging. Two viable options are to build a single language model to handle all traffic, or to train a language model appropriate to each major use case and then choose the “best” one for each utterance, depending on the context of that utterance. We develop and compare a third option in this paper, in which a development set of utterances from each context is used to optimize interpolation weights among a small number of component language models. Since there may be thousands of such “contexts”, the language models are interpolated ondemand, either during decoding or as a post-processing rescoring phase. On-demand interpolation is performed efficiently via the use of a “compact interpolated” finite state transducer (FST), in which transition weights are dynamically computed. Percent of utterances Voice input 49% Search by Voice 44% Speech API 7% Table 1: Breakdown of speech traffic on Android devices that support Voice Input, Search by Voice, and Speech API. 2. Related Work The technique of creating interpolated language models for different contexts has been used with success in a number of conversational interfaces [1, 2, 3] In this case, the pertinent context is the system’s “dialogue state”, and it’s typical to group transcribed utterances by dialogue state and build one language model per state. Typically, states with little data are merged, and the state-specific language models are interpolated, or otherwise merged. Language models corresponding to multiple states may also be interpolated, to share information across similar states. The technique we develop here differs in two key respects. First, we derive interpolation weights for thousands of recognition contexts, rather than a handful of dialogue states. This makes it impractical to create each interpolated language model offline and swap in the desired one at runtime. Our language models are large, and we only learn the recognition context for a particular utterance when the audio starts to arrive. Second, rather than relying on transcribed utterances from each recognition context to train state-specific language modes, we instead interpolate a small number of language models trained from large corpora. 3. Android Speech Usage Analysis The challenge of supporting a variety of use cases is illustrated by examining the usage of the speech features available on Android. Table 1 breaks down the portion of utterances from the Android platform associated with the three speech features: voice input, search by voice, and the speech API. We note that this distinction isn’t perfect, as some users might, for example, speak a search query into a text box in the browser using the voice input feature. 
In addition, a large majority of the speech API utterances come from built-in Google applications – Google Maps provides a popular voice-enabled search box, for example. Overall, we observe roughly an even split between searching and dictation.

The voice input feature encourages a wide range of usage. Since its launch in January 2010, users have dictated text into over 8,000 distinct text fields. Table 2 shows the 10 most popular text fields. SMS is extremely popular, with usage levels an order of magnitude greater than any other application. Moreover, among the top 10 fields, 4 of them come from either the built-in SMS application or one of the many SMS applications available on the Android Market. Also popular are other dictation-style applications: Gmail, Email, and Google Talk. Android Market and Maps, both of which also appear in the top 10, represent different kinds of utterances – search queries. Finally, the Browser category here actually encompasses a wide range of fields – any text field on any web page.

Table 2: The 10 most popular voice input text fields and their percent usage.
Text Field                               Usage
SMS - Compose                            63.1%
An SMS app from Market - Compose          4.9%
Browser                                   4.8%
Google Talk                               4.5%
Gmail - Compose                           3.3%
Android Market - Search                   2.4%
Email - Compose                           1.8%
SMS - To                                  1.3%
Maps - Directions Endpoint                1.0%
An SMS app from Market - Compose          1.0%

Figure 1 shows the cumulative usage per text field of the 100 most popular text fields, rank ordered by usage. Although the usage is certainly concentrated among a handful of applications, there remains a significant tail. While increasing accuracy for the tail may not have a huge effect on the overall accuracy of the system, it's important for users to have a seamless experience using voice input: users will have a difficult time discerning that voice input may work better in some text fields than others.

Figure 1: Cumulative usage for the most popular 100 text fields, rank ordered by usage.

4. Compact Interpolated FST

In this setting, we have a relatively small set of language models that is fixed and known in advance. At recognition time, each utterance comes with a custom set of interpolation (or mixture) weights, and we need to be able to efficiently compute on-demand the corresponding interpolated model. In a backoff language model, the conditional probability of w ∈ Σ given context h ∈ Σ∗ is recursively defined as

P(w \mid h) = \begin{cases} \bar{P}(w \mid h) & \text{if } hw \in S \\ \alpha_h \, P(w \mid h') & \text{otherwise} \end{cases}

where \bar{P} is the adjusted maximum likelihood probability (derived from the training corpus), S is the skeleton of the model, α_h is the backoff weight for the context h, and h' is the longest proper suffix of h. The order of the model is \max_{hw \in S} |hw|.

Figure 2: Outgoing transitions from state x in (a) G1, (b) G2 and (c) I. For λ = (.6, .4)^T, P_{I_λ}(a | x) = .6 × .5 + .4 × .24.

Such a language model can naturally be represented by a weighted automaton over the real semiring (R, +, ×, 0, 1) using failure transitions [4]: the set of states is Q =
{h ∈ Σ∗ | ∃w ∈ Σ such that hw ∈ S}, for each state h, there is a failure transition from h to h labeled by φ and with weight αh, and for each hw ∈ S, there is a transition from h to the longest suffix of hw that belongs to Q, labeled by w and with weight P(w | h). Given a set G = {G1,...,Gm} of m backoff language models and a vector of mixture weights λ = (λ1,...λm) T , the linear interpolation of G by λ is defined as the language model Iλ assigning the conditional probability: PIλ (w | h) = m i=1 λiPGi (w | h). (1) Using (1) directly to perform on-demand interpolation would be inefficient because for a given pair (w, h) we might need to backoff several times in several of the models and this can become rather expensive when using the automata representation. Instead, we chose to reformulate the interpolated model as a backoff model: PIλ (w | h) = λT phw if hw ∈ S(G), f(λ, αh)PIλ (w | h ) otherwise, where phw = (PG1 (w|h),..., PGm(w|h))T , S(G) = ∪m i=1S(Gi) and αh = (αh(G1),...,αh(Gm))T . There exists a closed-form expression of f(λ, α) that ensure the proper normalization of the model. However, in practice we decided to approximate it by the dot product of λ and αh: f(λ, αh) = λT αh. The benefit of this formulation is that it perfectly fits our requirement. Since the set of models is known in advance we can precompute S(G) and all the relevant vectors (phw and αh) effectively building a generic interpolated model I as a model over Rm. Given a new utterance and a corresponding vector of mixture weights λ, we can obtain the relevant interpolated model Iλ by taking the dot product of each component vector of I with λ. Moreover, this approach also allows for an efficient representation of I as a weighted automaton over the semiring (Rm, +, ◦, 0, 1) (◦ denotes componentwise multiplication), the weight of each transition in the automaton being a vector in Rm. The set of states is Q = {h ∈ Σ∗ | ∃w ∈ Σ such that hw ∈ S(G)}. For each state h, there is a failure transition from h to h labeled by φ and with weight αh, and for each hw ∈ S(G), there is a transition from h to the longest suffix of hw that belongs to Q, labeled by w and with weight phw. Figure 2 illustrates this construction. Given a new utterance and a corresponding vector of mixture weights λ, this automaton can be converted on-demand into a weighted automaton over the real semiring by taking the dot product of λ and the weight vector of each visited transition. 1813 ReFr: An Open-Source Reranker Framework Daniel M. Bikel, Keith B. Hall Google Research, New York, NY {dbikel,kbhall}@google.com Abstract ReFr (http://refr.googlecode.com) is a software architecture for specifying, training and using reranking models, which take the n-best output of some existing system and produce new scores for each of the n hypotheses that potentially induce a different ranking, ideally yielding better results than the original system. The Reranker Framework has some special support for building discriminative language models, but can be applied to any reranking problem. The framework is designed with parallelism and scalability in mind, being able to run on any Hadoop cluster out of the box. While extremely efficient, ReFr is also quite flexible, allowing researchers to explore a wide variety of features and learning methods. ReFr has been used for building state-of-the-art discriminative LM’s for both speech recognition and machine translation systems. 
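Before turning to ReFr, the on-demand interpolation scheme described just above (Section 4 of the preceding paper) can be sketched in a few lines. This toy dictionary-based version ignores the FST representation entirely; the probability and backoff vectors are made-up numbers loosely following Figure 2, and the unigram floor and suffix convention are assumptions for illustration only.

```python
import numpy as np

# Vector-valued "compact interpolated" backoff model over m = 2 component LMs.
# p[(h, w)] stores (P_G1(w|h), P_G2(w|h)) for n-grams in the merged skeleton
# S(G); alpha[h] stores the per-component backoff weights for context h.
p = {
    ("x", "a"): np.array([0.5, 0.24]),
    ("y", "a"): np.array([0.4, 0.6]),
    ("y", "b"): np.array([0.4, 0.2]),
    ("y", "c"): np.array([0.04, 0.02]),
}
alpha = {"x": np.array([0.5, 0.4]), "y": np.array([1.0, 1.0])}  # toy values

def backoff(h):
    # Longest proper suffix of the context; "" plays the role of the unigram state.
    return h[1:]

def prob(w, h, lam):
    """Interpolated P(w | h) for per-utterance mixture weights lam."""
    if (h, w) in p:
        return float(lam @ p[(h, w)])           # lam . p_hw
    if h == "":
        return 1e-6                             # unseen-unigram floor (toy)
    # Approximate backoff: f(lam, alpha_h) ~= lam . alpha_h, then recurse.
    return float(lam @ alpha.get(h, np.ones_like(lam))) * prob(w, backoff(h), lam)

lam = np.array([0.6, 0.4])                      # utterance-specific weights
print(prob("a", "x", lam))                      # 0.6*0.5 + 0.4*0.24 = 0.396
```

The printed value reproduces the worked computation in the Figure 2 caption; the real system obtains the same quantity by taking the dot product of λ with the vector-valued transition weights of the precomputed automaton.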
Index Terms: language modeling, discriminative language modeling, reranking, structured prediction 1. Introduction Creating effective software tools for research is a tricky business. The classic tension between flexibility and efficiency arises with greater urgency. We want researchers to be able to try out many different ideas easily, but we also want them to be able to have a quick code-test-evaluate cycle. ReFr grew out of the 2011 Johns Hopkins Summer Workshop, from the team using automatically generated confusions to synthesize training data for discriminative language models for speech and machine translation, led by Prof. Brian Roark of OHSU. That approach required tools that would scale up to training data sizes orders of magnitude larger than had previously been used to build discriminative language models, so we not only needed our training and inference to be inherently fast, but we needed to design tools with distributed computing in mind from the outset. This paper describes the tools we have developed to solve not only the immediate research problem of exploring confusions for discriminative language modeling, but also the more general problem of reranking approaches to speech and language processing, including structured prediction. We designed ReFr to have the following properites: • “library quality” code • industrial strength • academic flexibility • easy exploration of different types of features, different update methods (e.g., MIRA-style, direct loss minimization, loss-sensitive) and different learning methods (e.g., perceptron-style, log-linear, kernel methods) • modern, object-oriented design, complete with dynamic factories and dynamic composition for flexibility • parallelizable, especially for distributed-computing environments 2. Data Format for I/O There are two main choices when building discriminative reranking models for speech or machine translation: (a) rescore a lattice or hypergraph or (b) simply use a strict reranking approach applied to n-best lists. For ReFr, early on we decided to use (b) reranking n-best lists. The primary reasons were the flexibility this would allow us in designing features and tools. N-best lists readily allow for sentence-level features in a way that, say, lattices do not. Additionally, it is far easier to de- fine generic schemes of passing around n-best lists than it is for designing schemes to take speech lattices as well as machine translation hypergraphs or other, problem-specific data types. ReFr is meant to be flexible enough to allow for a variety of data sources. In order to avoid the need for overly complex data formats, we have chosen to adopt a formalism which allows one to augment the input format, allowing for flexible feature extraction and data manipulation/analysis. We opted to use a data format which mirrors the data-structures that are used internally for training. The Google protocol buffers[1] provide a programming-language independent specification framework to define data formats. The protocol buffers specification language is used by the protocol buffer tools to generate source-code for serializing and deserializing the data stored in the format. Code is generated to allow for native programming-language encapsulation of the data. For example, in C++ each item of data is stored in an object based on a object oriented data specification (a C++ class) allowing for access to the data.1 3. 
Core learning framework Consider Algorithm 1, which describes the training procedure for a generic online-learning algorithm. Each training example ei comprises a set of candidate hypotheses, each of which is projected via some function Φ into a feature space, R F . We typically think of Φ as being a suite of feature functions, one per dimension. The model itself is defined as a weight vector in this space, w. Decoding, or inference, is carried out simply by taking the dot product of the model and a test instance. More generally, any kernel function K may be used. The training procedure iterates over the training data T—each iteration is called an epoch—until the NEEDTOKEEPTRAINING() predicate returns false. Often, such a predicate is based on the average loss of the current model on some held-out development data D, which is the purpose of the EVALUATE(D) line in the TRAIN(T) procedure. 1For the 2011 Johns Hopkins Workshop, we were targeting multiple tasks (ASR and MT), and so our toolkit provides a means to convert from two types of text-based n-best formats, one the output of an ASR system, the other the output of an MT system. These conversion tools are not only useful in their own right, but serve as example implementations for any developer converting from their own, proprietary format to the Google Protocol Buffer format used by ReFr. Copyright © 2013 ISCA 25-29 August 2013, Lyon, France INTERSPEECH 2013: Show & Tell Contribution 756Algorithm 1 Training algorithm for online-learning reranking models. Let ei = {c1, . . . , ck} be a training example, where each cj is a candidate hypothesis. Similarly, let di = {c1, . . . , ck} be a held-out development data example, also consisting of k candidate hypotheses. Finally, let K be a kernel function. procedure TRAIN(T = {e1, . . . , en}, D = {d1, . . . , dm}) while NEEDTOKEEPTRAINING() do TRAINONEEPOCH(T) EVALUATE(D) end while end procedure procedure TRAINONEEPOCH(T) foreach training example ei do SCORECANDIDATES(ei) if NEEDTOUPDATE() then UPDATE() end if end for end procedure procedure SCORECANDIDATES(ei) foreach candidate hypothesis cj ∈ ei do cj .score ← K(wt, cj ) end for end procedure Model Candidate Scorer Update Predicate Updater … Figure 1: A pictorial view of how a Model wraps instances of other interfaces that specify the predicates and functions needed to carry out model training. For the basic perceptron, the model starts out at time step 0 as the zero vector; that is, wo = ~0. The update is wt+1 = wt + Rt [Φ (yoracle (ei)) − Φ (ˆy (ei))] , (1) where yoracle is a function that picks out the hypothesis towards which we want to bias our model, yˆ is a function that picks out the candidate hypothesis we want to bias our model against and Rt is a learning rate or step size. Most often, yoracle is defined to pick the hypothesis with the lowest loss relative to some goldstandard truth, and yˆ is defined to pick the candidate hypothesis that scores highest under the current model wt. Most of the variations of this basic learning method involve finding different ways of defining Rt, Φ, yoracle and yˆ, along with the various procedures and predicates shown in Algorithm 1. Therefore, we would like our Reranker Framework to make it easy for the researcher to define these various functions, as well as to specify which ones to use at run-time. ReFr defines a Model interface with virtual methods for all of the functions shown in Algorithm 1. 
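A compact sketch of the training loop in Algorithm 1 and the update in Equation (1) is given below, written as plain Python rather than against ReFr's actual C++ interfaces; the feature representation, the loss values, and the dev-set-based stopping rule are simplified assumptions.

```python
import numpy as np

def score(w, feats):
    # K(w, c) as a plain dot product (the linear-kernel case).
    return float(w @ feats)

def train(train_set, dev_set, dim, lr=0.1, max_epochs=20, patience=2):
    """Perceptron-style reranker training (Algorithm 1, simplified).

    Each example is a list of candidates: (feature_vector, loss_vs_reference).
    """
    w = np.zeros(dim)                                 # w_0 = 0
    best_dev, bad_epochs = np.inf, 0
    while bad_epochs < patience and max_epochs > 0:   # NEEDTOKEEPTRAINING()
        max_epochs -= 1
        for candidates in train_set:                  # TRAINONEEPOCH(T)
            scores = [score(w, f) for f, _ in candidates]   # SCORECANDIDATES
            oracle = min(candidates, key=lambda c: c[1])    # lowest loss
            guess = candidates[int(np.argmax(scores))]      # model's 1-best
            if guess is not oracle:                         # NEEDTOUPDATE()
                # w_{t+1} = w_t + R_t [ Phi(y_oracle) - Phi(y_hat) ]
                w = w + lr * (oracle[0] - guess[0])
        # EVALUATE(D): average loss of the model's 1-best on held-out data.
        dev_loss = np.mean([max(c, key=lambda x: score(w, x[0]))[1]
                            for c in dev_set])
        best_dev, bad_epochs = ((dev_loss, 0) if dev_loss < best_dev
                                else (best_dev, bad_epochs + 1))
    return w

# Toy usage: the second candidate is the oracle (lower loss); one update
# flips the ranking in its favor.
toy = [[(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 0.0)]] * 5
print(train(toy, toy, dim=2))
```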
To avoid the exponential blow-up of overriding different combinations of these methods, ReFr also employs dynamic composition. That is, we keep the idea of a Model interface, but additionally have each Model instance wrap a set of predicate/manipulator objects, each of which itself conforms to an interface. Figure 1 shows a pictorial representation of this scheme. As we discussed above, we employ dynamic composition to avoid defining a new subclass of Model every time we wish model file = "my model file"; // model output file model = PerceptronModel( name("my model"), score comparator(DirectLossScoreComparator())); exec feature extractor = ExecutiveFeatureExtractorImpl( feature extractors({NgramFeatureExtractor(n(2)), RankFeatureExtractor()}); training efe = exec feature extractor; dev efe = exec feature extractor; training files = {"training1.gz", "training2.gz"}; devtest files = {"dev1.gz", "dev2.gz"}; Figure 2: An example ReFr configuration file, read by its Interpreter class. to explore a new combination of learning method functions. To do this, ReFr includes a very lightweight and yet powerful interpreter for a language that allows for assignment statements for primitives, vectors of primitives, Factory-constructible objects and vectors of Factory-constructible objects. Figure 2 shows an example ReFr configuration file. The syntax is intentionally very similar to that of C++. This lightweight language provides a flexible mechanism by which to specify how feature extraction, training and inference shall occur. 4. Cluster-based distributed training As Algorithm 1 shows, the basic perceptron algorithm involves “online” updating, and thus it is possible to read in each training example from file each time it is needed, only keeping the model’s parameters persistently in memory. The Reranker Framework allows both the memory-intensive way of training as well as this “streaming mode” version of training, essential for distributed learning. The structured perceptron [2] and it’s variants have proven to be effective in supervised, discriminative language modeling work [3]. We have centered the development of our opensource discriminative learning toolkit around perceptron-style algorithms, which are, by definition, online learning algorithms. Identifying the optimal solution for a distributed online optimization algorithm is still an open research question. We borrow from our previous work on distributed perceptron training in [4, 5] and use the Iterative Parameter Mixtures algorithm for distributed computation. The Reranker Framework makes it easy to switch between single processor and distributed training, which uses the Hadoop implementation of MapReduce [6]. 5. Demo Plan Our demo will consist of a walk-through of all ReFr’s features, followed by a hands-on demonstration of how easy it is to implement a new class of features for the reranker based on the rank of each candidate hypothesis. We will also show how easy it is to integrate that new class of features into training and inference. We will then demonstrate the ease with which one can use the API and the interpreted configuration language to alter the training algorithm. Finally, we will demonstrate the simple way that a user can switch from single processor training to large-scale distributed training. 6. Acknowledgements The authors would like to thank Prof. 
Brian Roark of Oregon Health and Science University for leading a fantastic team at the 2011 Johns Hopkins Workshop, and we would also like to thank all of our teammates, especially Prof. Izhak Shafran of OHSU and Ph.D. candidate Maider Lehr, who are actively working with and helping us improve ReFr. 7577. References [1] Google, “Protocol buffers,” http://code.google.com/apis/protocolbuffers/. [2] M. Collins, “Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms,” in Proc. EMNLP, 2002, pp. 1–8. [3] B. Roark, M. Sarac¸lar, and M. Collins, “Discriminative n-gram language modeling,” Computer Speech and Language, vol. 21, no. 2, pp. 373 – 392, 2007. [Online]. Available: http: //www.sciencedirect.com/science/article/pii/S0885230806000271 [4] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies for the structured perceptron,” in HLT-NAACL, 2010. [5] K. Hall, S. Gilpin, and G. Mann, “Mapreduce/bigtable for distributed optimization,” in NIPS Workshop on Leaning on Cores, Clusters, and Clouds, 2010. [6] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” CACM, vol. 51:1, 2008. 758 Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices Xin Lei1 Andrew Senior2 Alexander Gruenstein1 Jeffrey Sorensen2 1Google Inc., Mountain View, CA USA 2Google Inc., New York, NY USA {xinlei,andrewsenior,alexgru,sorenj}@google.com Abstract In this paper we describe the development of an accurate, smallfootprint, large vocabulary speech recognizer for mobile devices. To achieve the best recognition accuracy, state-of-the-art deep neural networks (DNNs) are adopted as acoustic models. A variety of speedup techniques for DNN score computation are used to enable real-time operation on mobile devices. To reduce the memory and disk usage, on-the-fly language model (LM) rescoring is performed with a compressed n-gram LM. We were able to build an accurate and compact system that runs well below real-time on a Nexus 4 Android phone. Index Terms: Deep neural networks, embedded speech recognition, SIMD, LM compression. 1. Introduction Smartphones and tablets are rapidly overtaking desktop and laptop computers as people’s primary computing device. They are heavily used to access the web, read and write messages, interact on social networks, etc. This popularity comes despite the fact that it is significantly more difficult to input text on these devices, predominantly by using an on-screen keyboard. Automatic speech recognition (ASR) is a natural, and increasingly popular, alternative to typing on mobile sevices. Google offers the ability to search by voice [1] on Android, iOS, and Chrome; Apple’s iOS devices come with Siri, a conversational assistant. On both Android and iOS devices, users can also speak to fill in any text field where they can type (see, e.g., [2]), a capability heavily used to dictate SMS messages and e-mail. A major limitation of these products is that speech recognition is performed on a server. Mobile network connections are often slow or intermittent, and sometimes non-existant. Therefore, in this study, we investigate techniques to build an accurate, small-footprint speech recognition system that can run in real-time on modern mobile devices. Previously, speech recognition on handheld computers and smartphones has been studied in the DARPA sponsored Transtac Program, where speech-to-speech translation systems were developed on the phone [3, 4, 5]. 
In the Transtac systems, Gaussian mixture models (GMMs) were used to as acoustic models. While the task was a small domain, with limited training data, the memory usage in the resulting systems was moderately high. In this paper, we focus on large vocabulary on-device dictation. We show that deep neural networks (DNNs) can provide large accuracy improvements over GMM acoustic models, with a significantly smaller footprint. We also demonstrate how memory usage can be significantly reduced by performing onthe-fly rescoring with a compressed language model during decoding. The rest of this paper is organized as follows. In Section 2, the embedded GMM acoustic model is described. Section 3 presents the training of embedded DNNs, and the techniques we employed to speed up DNN inference at runtime. Section 4 describes the compressed language models for on-the-fly rescoring. Section 5 shows the experimental results of recognition accuracy and speed on Nexus 4 platform. Finally, Section 6 concludes the paper and discusses future work. 2. GMM Acoustic Model Our embedded GMM acoustic model is trained on 4.2M utterances, or more than 3,000 hours of speech data containing randomly sampled anonymized voice search queries and other dictation requests on mobile devices. The acoustic features are 9 contiguous frames of 13-dimensional PLP features spliced and projected to 40 dimensions by linear discriminant analysis (LDA). Semi-tied covariances [6] are used to further diagonalize the LDA transformed features. Boosted-MMI [7] was used to train the model discriminatively. The GMM acoustic model contains 1.3k clustered acoustic states, with a total of 100k Gaussians. To reduce model size and speed up computation on embedded platforms, the floatingpoint GMM model is converted to a fixed-point representation, similar to that described in [8]. Each dimension of the Gaussian mean vector is quantized into 8 bits, and 16-bit for precision vector. The resulting fixed-point GMM model size is about 1/3 of the floating-point model, and there is no loss of accuracy due to this conversion in our empirical testing. 3. DNNs for Embedded Recognition We have previously described the use of deep neural networks for probability estimation in our cloud-based mobile voice recognition system [9]. We have adopted this system for developing DNN models for embedded recognition, and summarize it here. The model is a standard feed-forward neural network with k hidden layers of nh nodes, each computing a nonlinear function of the weighted sum of the outputs of the previous layer. The input layer is the concatenation of ni consecutive frames of 40-dimensional log filterbank energies calculated on 25ms windows of speech every 10ms. The no softmax outputs estimate the posterior of each acoustic state. We have experimented with conventional logistic nonlinearities and rectified linear units that have recently shown superior performance in our large scale task [10], while also reducing computation. Copyright © 2013 ISCA 25-29 August 2013, Lyon, France INTERSPEECH 2013 662While our server-based model has 50M parameters (k = 4, nh = 2560, ni = 26 and no = 7969), to reduce the memory and computation requirement for the embedded model, we experimented with a variety of sizes and chose k = 6, nh = 512, ni = 16 and no = 2000, or 2.7M parameters. 
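The parameter counts quoted above follow directly from the layer sizes. The small helper below (an illustration, not Google's code) counts weights and biases for a plain fully connected stack and reproduces the roughly 2.7M figure for the embedded configuration; the server-side total it gives is in the tens of millions, and the 50M quoted above presumably reflects details (e.g., of the output layer) not spelled out in the text.

```python
def dnn_params(n_frames, n_filterbanks, hidden_layers, hidden_size, n_outputs):
    """Count weights + biases of a plain feed-forward DNN."""
    sizes = ([n_frames * n_filterbanks]
             + [hidden_size] * hidden_layers
             + [n_outputs])
    # Each layer contributes (fan_in * fan_out) weights plus fan_out biases.
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

# Embedded model: k=6 hidden layers of 512 units, 16 stacked frames of
# 40-dim log filterbanks, 2000 output states  ->  2,667,472 (~2.7M).
print(dnn_params(16, 40, 6, 512, 2000))

# Server-sized model: k=4, 2560 units, 26 frames, 7969 outputs -> ~43M by
# this simple count (the paper's ~50M likely includes unstated details).
print(dnn_params(26, 40, 4, 2560, 7969))
```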
The input window is asymmetric; each additional frame of future context adds 10ms of latency to the system so we limit ourselves to 5 future frames, and choose around 10 frames of past context, trading off accuracy and computation. Our context dependency (CD) phone trees were initially constructed using a GMM training system that gave 14,247 states. By pruning this system using likelihood gain thresholds, we can choose an arbitrary number of CD states. We used an earlier large scale model with the full state inventory that achieved around 14% WER to align the training data, then map the 14k states to the desired smaller inventory. Thus we use a better model to label the training data to an accuracy that cannot be achieved with the embedded scale model. 3.1. Training Training uses conventional backpropagation of gradients from a cross entropy error criterion. We use minibatches of 200 frames with an exponentially decaying learning rate and a momentum of 0.9. We train our neural networks on a dedicated GPU based system. With all of the data available locally on this system, the neural network trainer can choose minibatches and calculate the backpropagation updates. 3.2. Decoding speedup Mobile CPUs are designed primarily for lower power usage and do not have as many or as powerful math units as CPUs used in server or desktop applications. This makes DNN inference, which is mathematically computationally expensive, a particular challenge. We exploit a number of techniques to speed up the DNN score computation on these platforms. As described in [11], we use a fixed-point representation of DNNs. All activations and intermediate layer weights are quantized into 8-bit signed char, and biases are encoded as 32-bit int. The input layer remains floating-point, to better accommodate the larger dynamic ranges of input features. There is no measured accuracy loss resulting from this conversion to fixed-point format. Single Instruction Multiple Data (SIMD) instructions are used to speed up the DNN computation. With our choice of smaller-sized fixed-point integer units, the SIMD acceleration is significantly more efficient, exploiting up to 8 way parallelism in each computation. We use a combination of inline assembly to speed up the most expensive matrix multiplication functions, and compiler intrinsics in the sigmoid and rectified linear calculations. Batched lazy computation [11] is also performed. To exploit the multiple cores present on modern smartphones, we compute the activations up to the last layer in a dedicated thread. The output posteriors of the last layer are computed only when needed by the decoder in a separate thread. Each thread computes results for a batch of frames at a time. The choice of batch size is a tradeoff between computation efficiency and recognition latency. Finally, frame skipping [12] is adopted to further reduce computation. Activations and posteriors are computed only every nb frames and used for nb consecutive frames. In experiments we find that for nb = 2, the accuracy loss is negligible; however for nb ≥ 3, the accuracy degrades quickly. 4. Language Model Compression We create n-gram language models appropriate for embedded recognition by first training a 1M word vocabulary and 18M n-gram Katz-smoothed 4-gram language model using Google’s large-scale LM training infrastructure [13]. 
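Before continuing with the language model, the fixed-point inference of Section 3.2 above can be illustrated with a short sketch: weights and activations are quantized to signed 8-bit integers with a scale factor, the matrix-vector product is carried out in integer arithmetic with 32-bit accumulation, and the result is rescaled to floating point. This NumPy version is only a schematic of that idea, with a symmetric per-tensor scale chosen for simplicity; the real implementation relies on SIMD intrinsics and per-layer details not described here.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float values to int8."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def int8_matvec(q, scale, x_q, x_scale):
    # Accumulate in int32 (as SIMD code would), then rescale to float.
    acc = q.astype(np.int32) @ x_q.astype(np.int32)
    return acc * (scale * x_scale)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(512, 640))      # one hidden layer's weights
x = rng.normal(size=640).astype(np.float32)     # input activations

q_w, s_w = quantize_int8(w)
q_x, s_x = quantize_int8(x)

exact = w @ x
approx = int8_matvec(q_w, s_w, q_x, s_x)
print("max abs error:", np.max(np.abs(exact - approx)))
print("int8 storage:", q_w.nbytes, "bytes vs float32:",
      w.astype(np.float32).nbytes, "bytes")
```

Frame skipping is orthogonal to this: the activations computed for one frame are simply reused for the next n_b − 1 frames.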
The language model is trained using a very large corpus (on the order of 20 billion words) from a variety of sources, including search queries, web documents and transcribed speech logs. To reduce memory usage, we use two language models during decoding. First, a highly-pruned LM is used to build a small CLG transducer [14] that is traversed by the decoder. Second, we use a larger LM to perform on-the-fly lattice rescoring during search, similar to [15]. We have observed that a CLG transducer is generally two to three times larger than a standalone LM, so this rescoring technique significantly reduces the memory footprint. Both language models used in decoding are obtained by shrinking the 1M vocabulary and 18M n-gram LM. We aggressively reduce the vocabulary to the 50K highest unigram terms. We then apply relative entropy pruning [16] as implemented in the OpenGrm toolkit [17]. The resulting finite state model for rescoring LM has 1.4M n-grams, with just 280K states and 1.7M arcs. The LM for first pass decoding contains only unigrams and about 200 bigrams. We further reduce the memory footprint of the rescoring LM by storing it in an extremely memory-efficient manner, discussed below. 4.1. Succinct storage using LOUDS If you consider a backoff language model’s structure, the failure arcs from (n + 1)-gram contexts to n-gram contexts and, ultimately, to the unigram state form a tree. Trees can be stored using 2 bits per node using a level-order unary degree sequence (LOUDS), where we visit the nodes breadth-first writing 1s for the number of (n + 1)-gram contexts and then terminating with a 0 bit. We build a bit sequence similarly for the degree of outbound non-φ arcs. The LOUDS data structure provides first-child, last-child, and parent navigation, so we are able to store a language model without storing any next-state values. As a contiguous, indexfree data object, the language model can be easily memory mapped. The implementation of this model is part of the OpenFst library [18] and covered in detail in [19]. The precise storage requirements, measured in bits, are 4ns + na + (W + L)(ns + na) + W nf + c where ns is the number of states, nf the number of final states, na is the number of arcs, L is the number of bits per wordid, and W is the number of bits per probability value. This is approximately one third the storage required by OpenFst’s vector representation. For the models discussed here, we use 16 bits for both labels and weights. During run time, to support fast navigation in the language model, we build additional indexes of the LOUDS bit sequences to support the operations rankb(i) the number of b valued bits before index i, and its inverse selectb(r). We maintain a two level index that adds an additional 0.251(4ns + na) bits. Here it is important to make use of fast assembly operations such as find first set during decoding, which we do through compiler intrinsics. 6634.2. Symbol table compression The word symbol table for an LM is used to map words to unique identifiers. Symbol tabels are another example of a data structure that can be represented as a tree. In this case we relied upon the implementation contained in the MARISA library [20]. This produces a symbol table that fits in just one third the space of the concatenated strings of the vocabulary, yet provides a bidirectional mapping between integers and vocabulary strings. We are able to store our vocabulary in about 126K bytes, less than 3 bytes per entry in a memory mappable image. 
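Returning briefly to the LOUDS encoding of Section 4.1 above: the breadth-first unary degree sequence and the rank/select navigation it supports can be sketched in a few lines of Python. Bits are kept in a plain list and rank/select are computed by scanning, purely for clarity; a real implementation packs the bits and builds the two-level rank index described above.

```python
def louds_encode(children):
    """children[node] -> list of child ids; nodes are visited breadth-first.

    For each node we write one '1' per child, then a terminating '0'.
    """
    bits, order = [], [0]                  # node 0 is the root (unigram state)
    for node in order:                     # list grows as children are found
        kids = children.get(node, [])
        bits.extend([1] * len(kids) + [0])
        order.extend(kids)
    return bits, order

def rank(bits, value, i):
    """Number of `value` bits strictly before position i."""
    return sum(1 for b in bits[:i] if b == value)

def select(bits, value, r):
    """Position of the (r+1)-th `value` bit."""
    seen = -1
    for i, b in enumerate(bits):
        if b == value:
            seen += 1
            if seen == r:
                return i
    raise IndexError

def first_child(bits, node_rank):
    # The degree block of node r starts just after the r-th '0' (node 0's at 0).
    pos = select(bits, 0, node_rank - 1) + 1 if node_rank > 0 else 0
    if bits[pos] == 0:
        return None                        # leaf: no (n+1)-gram contexts
    return rank(bits, 1, pos) + 1          # BFS rank of the first child

# Tiny context tree: root -> {1, 2}, node 1 -> {3}; nodes 2 and 3 are leaves.
bits, bfs = louds_encode({0: [1, 2], 1: [3]})
print(bits)                                # [1, 1, 0, 1, 0, 0, 0]
print([first_child(bits, r) for r in range(4)])   # [1, 3, None, None]
```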
The MARISA library assigns the string to integer ids during compression, so we relabel all of the other components in our system to match this assignment. 5. Experimental Results To evaluate accuracy performance, we use a test set of 20,000 anonymized transcribed utterances from users speaking in order to fill in text fields on mobile devices. This biases the test set towards dictation, as opposed to voice search queries, because dictation is more useful than search when no network connection is available. To measure speed performance, we decode a subset of 100 utterances on an Android Nexus 4 (LG) phone. The Nexus 4 is equipped with a 1.5GHz quad-core Qualcomm Snapdragon S4 pro CPU, and 2GB of RAM. It runs the Android 4.2 operating system. To reduce start up loading time, all data files, including the acoustic model, the CLG transducer, the rescoring LM and the symbol tables are memory mapped on the device. We use a background thread to “prefetch” the memory mapped resources when decoding starts, which mitigates the slowdown in decoding for the first several utterances. 5.1. GMM acoustic model The GMM configuration achieves a word error rate (WER) of 20.7% on this task, with an average real-time (RT) factor of 0.63. To achieve this speed, the system uses integer arithmetic for likelihood calculation and decoding. The Mahalanobis distance computation is accelerated using fixed-point SIMD instructions. Gaussian selection is used to reduce the burden of likelihood computation, and further efficiencies come from computing likelihoods for batches of frames. 5.2. Accuracy with DNNs We compare the accuracy of DNNs with different configurations to the baseline GMM acoustic model in Table 1. A DNN with 1.48M parameters already outperforms the GMM in accuracy, with a disk size of only 17% of the GMM’s. By increasing the number of hidden layers from 4 to 6 and number of outputs from 1000 to 2000, we obtain a large improvement of 27.5% relative in WER compared to the GMM baseline. The disk size of this DNN is 26% of the size of the GMMs. For comparison, we also evaluate a server-sized DNN with an order of magnitude of more parameters, and it gives 12.3% WER. Note that all experiments in Table 1 use smaller LMs in decoding. In addition, with an un-pruned server LM, the server DNN achieves 9.9% WER while the server GMM achieves 13.5%. Therefore, compared to a full-size DNN server system, there is a 2.4% absolute loss due to smaller LMs, and 2.8% due to smaller DNN. Compared to the full-size GMM server system, the embedded DNN system is about 10% relatively worse in WER. The impact of frame skipping is evaluated with the DNN 6×512 model. As shown in Table 2, the accuracy performance quickly degrades when nb is larger than 2. Table 2: Accuracy results with frame skipping in a DNN system. nb 1 2 3 4 5 WER (%) 15.1 15.2 15.6 16.0 16.7 5.3. Speed benchmark For speed benchmark, we measure average RT factor as well as 90-percentile RT factor. As shown in Table 3, the baseline GMM system with SIMD optimization gives an average RT factor of 0.63. The fixed-point DNN gives 1.32×RT without SIMD optimization, and 0.75×RT with SIMD. Batched lazy computation improves average RT by 0.06 but degrades the 90- percentile RT performance, probably due to less efficient ondemand computation for difficult utterances. After frame skipping with nb = 2, the speed of DNN system is further improved slightly to 0.66×RT. Finally, the overhead of the compact LOUDS based LM is about 0.13×RT on average. 
Table 3: Averge real-time (RT) and 90-percentile RT factors of different system settings. Average RT RT(90) GMM 0.63 0.90 DNN (fixed-point) 1.32 1.43 + SIMD 0.75 0.87 + lazy batch 0.69 1.01 + frame skipping 0.66 0.97 + LOUDS 0.79 1.24 5.4. System Footprint Compared to the baseline GMM system, the new system with LM compression and DNN acoustic model achieves a much smaller footprint. The data files sizes are listed in Table 4. Note that conversion of the 34MB floating-point GMM model to a 14MB fixed-point GMM model itself provides a large reduction in size. The use of DNN reduces the size by 10MB, and the LM compression contributed to another 18MB reduction. Our final embedded DNN system size is reduced from 46MB to 17MB, while achieving a big WER reduction from 20.7% to 15.2%. 6. Conclusions In this paper, we have described a fast, accurate and smallfootprint speech recognition system for large vocabulary dictation on the device. DNNs are used as acoustic model, which provides a 27.5% relative WER improvement over the baseline GMM models. The use of DNNs also significantly reduces the memory usage. Various techniques are adopted to speed up the DNN inference at decoding time. In addition, a LOUDS based language model compression reduces the rescoring LM size by more than 60% relative. Overall, the size of the data files of the system is reduced from 46MB to 17MB. 664Table 1: Comparison of GMM and DNNs with different sizes. The input layer is denoted by number of filterbank energies × the context window size (left + current + right). The hidden layers are denoted by number of hidden layers × number of nodes per layer. The number of outputs is the number of HMM states in the model. Model WER (%) Input Layer Hidden Layers # Outputs # Parameters Size GMM 20.7 - - 1314 8.08M 14MB DNN 4×400 22.6 40×(8+1+4) 4×400 512 0.9M 1.5MB DNN 4×480 20.3 40×(10+1+5) 4×480 1000 1.5M 2.4MB DNN 6×512 15.1 40×(10+1+5) 6×512 2000 2.7M 3.7MB Server DNN 12.3 40×(20+1+5) 4×2560 7969 49.3M 50.8MB Table 4: Comparison of data file sizes (in MB) in baseline GMM system and DNN system with and without LOUDS LM compression. AM denotes acoustic model, CLG is the transducer for decoding, LM denotes the rescoring LM, and symbols denote the word symbol table. System AM CLG LM Symbols Total GMM 14 2.7 29 0.55 46 + LOUDS 14 2.7 10.7 0.13 27 DNN 3.7 2.8 29 0.55 36 + LOUDS 3.7 2.8 10.7 0.13 17 Future work includes speeding up rescoring using the LOUDS LM as well as further compression techniques. We also continue to investigate the accuracy performance with different sizes of LM for CLG and rescoring. 7. Acknowledgements The authors would like to thank our former colleague Patrick Nguyen for implementing the portable neural network runtime engine used in this study. Thanks also to Vincent Vanhoucke and Johan Schalkwyk for helpful discussions and support during this work. [9] N. Jaitly, P. Nguyen, A. W. Senior, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proc. Interspeech, 2012. [10] M. D. Zeiler et al., “On rectified linear units for speech processing,” in Proc. ICASSP, 2013. [11] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011. [12] V. Vanhoucke, M. Devin, and G. Heigold, “Multiframe deep neural networks for acoustic modeling,” in Proc. ICASSP, 2013. [13] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. 
Dean, “Large language models in machine translation,” in EMNLP, 2007, pp. 858–867. [14] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” Handbook of Speech Processing, pp. 559–582, 2008. [15] T. Hori and A. Nakamura, “Generalized fast on-the-fly composition algorithm for WFST-based speech recognition,” in Proc. Interspeech, 2005. [16] A. Stolcke, “Entropy-based pruning of backoff language models,” in DARPA Broadcast News Transcription and Understanding Workshop, 1998, pp. 8–11. [17] B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai, “The OpenGrm open-source finite-state grammar software libraries,” in Proceedings of the ACL 2012 System Demonstrations. 2012, ACL ’12, pp. 61–66, Association for Computational Linguistics. 8. References [1] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, “Google search by voice: A case study,” in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, pp. 61–90. Springer, 2010. [2] B. Ballinger, C. Allauzen, A. Gruenstein, and J. Schalkwyk, “Ondemand language model interpolation for mobile speech input,” in Proc. Interspeech, 2010. [3] J. Zheng et al., “Implementing SRI’s Pashto speech-to-speech translation system on a smart phone,” in SLT, 2010. [4] J. Xue, X. Cui, G. Daggett, E. Marcheret, and B. Zhou, “Towards high performance LVCSR in speech-to-speech translation system on smart phones,” in Proc. Interspeech, 2012. [5] R. Prasad et al., “BBN Transtalk: Robust multilingual two-way speech-to-speech translation for mobile platforms,” Computer Speech and Language, vol. 27, pp. 475–491, February 2013. [6] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. Speech and Audio Processing, vol. 7, pp. 272–281, 1999. [7] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in Proc. ICASSP, 2008. [8] E. Bocchieri, “Fixed-point arithmetic,” Automatic Speech Recognition on Mobile Devices and over Communication Networks, pp. 255–275, 2008. [18] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: A general and efficient weighted finite-state transducer library,” in Proceedings of the Ninth International Conference on Implementation and Application of Automata, (CIAA 2007). 2007, vol. 4783 of Lecture Notes in Computer Science, pp. 11– 23, Springer, http://www.openfst.org. [19] J. Sorensen and C. Allauzen, “Unary data structures for language models,” in Proc. Interspeech, 2011. [20] S. Yata, “Prefix/Patricia trie dictionary compression by nesting prefix/Patricia tries (japanese),” in Proceedings of 17th Annual Meeting of the Association for Natural Language, Toyohashi, Japan, 2011, NLP2011, https://code.google.com/p/marisa-trie/. 665 Backoff Inspired Features for Maximum Entropy Language Models Fadi Biadsy, Keith Hall, Pedro Moreno and Brian Roark Google, Inc. {biadsy,kbhall,pedro,roark}@google.com Abstract Maximum Entropy (MaxEnt) language models [1, 2] are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed ngram language models using Katz backoff [3] and similar techniques. Even though backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. 
These features are shown to improve model quality substantially, as shown in perplexity and word-error rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features. Index Terms: maximum entropy modeling, language modeling, n-gram models, linear models 1. Introduction A central problem in language modeling is how to combine information from various model components, e.g., mixing models trained with differing Markov orders for smoothing or on distinct corpora for adaptation. Smoothing (regularization) for n-gram language models is typically presented as a mechanism whereby higher-order models are combined with lower-order models so as to achieve both the specificity of the higher-order model and the more robust generality of the lower-order model. Most commonly, this combination is effected via an interpolation or backoff mechanism, in which each prefix (history) of an n-gram has a parameter which dictates how much cost is associated with making use of lower-order n-gram estimates, often called the “backoff cost”. This becomes a parameter estimation problem in its own right, either through discounting or mixing parameters; and these are often estimated via extensive parameter tying, heuristics based on count histograms, or both. Log linear models provide an alternative to n-gram backoff or interpolated models for combining evidence from multiple, overlapping sources of evidence, with very different regularization methods. Instead of defining a specific model structure with backoff costs and/or mixing parameters, these models combine features from many sources into a single linear feature vector, and score a word by taking the dot product of the feature vector with a learned parameter vector. Learning can be via locally normalized likelihood objective functions, as in Maximum Entropy (MaxEnt) models [1, 2, 4] or global “whole sentence” objectives [5, 6, 7]. For locally normalized MaxEnt models, which estimate a conditional distribution over a vocabulary given the prefix history (just as the backoff smoothed n-gram models do), the brute-force local normalization over the vocabulary obviates the need for complex backoff schemes to avoid zero probabilities. One can simply toss in n-gram features of all the orders, and learn their relative contribution. Recall, however, that the standard backoff n-gram models do not only contain parameters associated with n-grams; they also contain parameters associated with the backoff weights for each prefix history. For every proper prefix of an n-gram in the model, there will be an associated backoff weight, which penalizes to a greater or lesser extent words that have been previously unseen following that prefix history. For some histories we should have a relatively high expectation of seeing something new, either because the history itself is rare (hence we do not have enough observations yet to be strongly predictive) or it simply predicts relatively open classes of possible words, e.g., “the”, which can precede many possible words, including many that were presumably unobserved following “the” in the training corpus. Other prefixes may be highly predictive so that the expectation of seeing something previously unobserved is relatively low, e.g., “Barack”. Granted, MaxEnt language models (LMs) do not need this information about prefix histories to estimate regularized probabilities. 
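To make the preceding point about per-history backoff weights concrete before moving on: in a standard backoff model, an unseen n-gram pays the backoff cost of its history and recurses on a shorter history, so an "open" history like "the" carries a milder penalty than a highly predictive one like "Barack". The following toy lookup illustrates this; all probabilities and weights are invented for illustration only.

```python
# Toy Katz-style backoff scorer. Keys are (history..., word) tuples.
prob = {
    ("the", "cat"): 0.02, ("the", "end"): 0.03,
    ("Barack", "Obama"): 0.9,
    ("cat",): 0.001, ("end",): 0.002, ("Obama",): 0.0005, ("dog",): 0.001,
}
alpha = {("the",): 0.5, ("Barack",): 0.05}   # open vs. highly predictive history

def p(word, history):
    """P(word | history) with recursive backoff; history is a tuple of words."""
    if history + (word,) in prob:
        return prob[history + (word,)]
    if not history:
        return 1e-7                           # unknown-word floor (toy)
    return alpha.get(history, 1.0) * p(word, history[1:])

print(p("dog", ("the",)))      # 0.5  * P(dog) = 5e-4  (cheap to back off)
print(p("dog", ("Barack",)))   # 0.05 * P(dog) = 5e-5  (expensive to back off)
```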
Chen and Rosenfeld [4] survey various smoothing and regularization methods for MaxEnt language models, including reducing the number of features (as L1 regularization does), optimizing to match expected frequencies to discounted counts, or optimizing to modified objectives, such as L2 regularization. In none of these methods are there parameters in the model associated with the sort of “otherwise” semantics of conventional n-gram backoffs. Because such features are not required for smoothing, they are not part of the typical feature set used in log linear language modeling, yet our results demonstrate that they should be. The ultimate usefulness of such features likely depends on the amount of training data available, and we have thus applied highly optimized MaxEnt training to very large data sets. In large scale n-gram modeling, it has been shown that the specific details of the smoothing algorithm are typically less important than the scale. So-called “stupid backoff” [8] is an efficient, scalable estimation method that, despite lack of normalization guarantees, is shown to be extremely effective in very large data set scenarios. While this has been taken to demonstrate that the specifics of smoothing are unimportant as the data gets large, those parameters are still important components of the modeling approach, even if their usefulness is robust to variation in parameter value. We demonstrate that features patterned after backoff weights, and several related generalizations of these features, can in fact make a large difference to a MaxEnt language model, even if the amount of training data is very large. In the next section, we present background for language modeling and cover related work. We then present our MaxEnt training approach, and the new features. Finally, we present experimental results on a range of large scale speech tasks.

2. Background and Related Work

Let w_i be the word at position i in the string, let w_{i-k}^{i-1} = w_{i-k} \ldots w_{i-1} be the prefix history of the string prior to w_i, and let \bar{P} be a probability estimate assigned to seen n-grams by the specific smoothing method. Then the standard backoff language model formulation is as follows:

P(w_i \mid w_{i-k}^{i-1}) = \begin{cases} \bar{P}(w_i \mid w_{i-k}^{i-1}) & \text{if } c(w_{i-k}^{i}) > 0 \\ \alpha(w_{i-k}^{i-1}) \, P(w_i \mid w_{i-k+1}^{i-1}) & \text{otherwise} \end{cases}

This recursive smoothing formulation has two kinds of parameters: n-gram probabilities \bar{P}(w_i \mid w_{i-k}^{i-1}) and backoff weights \alpha(w_{i-k}^{i-1}), which are parameters associated with the prefix history w_{i-k}^{i-1}.

MaxEnt models are log linear models that score alternatives by taking the exponential of the dot product between a feature vector and a parameter vector and normalizing. Let \Phi(w_{i-k} \ldots w_i) be a d-dimensional feature vector, \theta a d-dimensional parameter vector, and V a vocabulary. Then

P(w_i \mid w_{i-k}^{i-1}) = \frac{\exp(\Phi(w_{i-k} \ldots w_i) \cdot \theta)}{Z(w_{i-k} \ldots w_{i-1}, \theta)}

where Z is a partition function (normalization constant):

Z(w_{i-k}, \ldots, w_{i-1}, \theta) = \sum_{v \in V} \exp(\Phi(w_{i-k}, \ldots, w_{i-1}, v) \cdot \theta)

Training with a likelihood objective function is a convex optimization problem, with well-studied efficient estimation techniques, such as stochastic gradient descent.
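Concretely, this local normalization is a softmax over the vocabulary of feature-vector dot products. A small sketch follows, with a tiny vocabulary and simple indicator features standing in for the sparse, high-dimensional feature sets used in practice; the feature templates and weights here are invented for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat", "</s>"]

def features(history, word):
    """Phi(history, word): unigram and bigram indicator features (toy)."""
    feats = {("1g", word): 1.0}
    if history:
        feats[("2g", history[-1], word)] = 1.0
    return feats

def dot(feats, theta):
    # Sparse dot product: only features present in the example contribute.
    return sum(v * theta.get(k, 0.0) for k, v in feats.items())

def maxent_prob(word, history, theta):
    """P(word | history) = exp(Phi . theta) / Z(history, theta)."""
    scores = np.array([dot(features(history, v), theta) for v in vocab])
    z = np.exp(scores).sum()                       # partition function Z
    return float(np.exp(dot(features(history, word), theta)) / z)

theta = {("1g", "the"): 0.7, ("2g", "the", "cat"): 1.2}   # toy weights
print(maxent_prob("cat", ("the",), theta))
print(sum(maxent_prob(v, ("the",), theta) for v in vocab))   # sums to 1.0
```

The expensive part in practice is exactly the loop that computes Z, since it touches every word in a several-hundred-thousand-word vocabulary for every training example.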
Regularization techniques are also well-studied, and include L1 and L2 regularization, or their combination, which are modifications of the likelihood objective to either keep parameter values as close to zero as possible (L2) or reduce the number of features with nonzero parameter weights by pushing many parameters to zero (L1). We employ a distributed approximation to L1, see Section 3.1. The most expensive part of this optimization is the calculation of the partition function, since it requires summing over the entire vocabulary, which can be very large. Efficient methods to enable training with very large corpora and large vocabularies have been investigated over the past decades, from methods to exploit structural overlap between features [9, 10] to methods for decomposing the multi-class language modeling problem into many binary language modeling problems (one versus the rest) and sampling less data to effectively learn the models [11]. For this paper, we employed many optimizations to enable training with very large vocabularies (several hundred thousand words) and very large training sets (>100B words). 3. Methods 3.1. Maximum Entropy training Many features have been used in MaxEnt language models, including standard n-grams and trigger words [1], topic-based features [12] and morphological and sub-word based features [13, 14]. Feature engineering is a major consideration in this sort of modeling, and in Section 3.2 we detail our newly designed feature templates. Before we do so, we present the training methods that allow us to scale up to a very large vocabulary and many training instances. In this work, we wish to scale up MaxEnt language model training to learn from the same amount of data used for standard backoff n-gram language models. We achieve this by exploiting recent work on gradient-based distributed optimization; specifically, distributed stochastic gradient descent (SGD) [15, 16, 17, 18, 19]. We differ slightly from previous work in multiple aspects: (1) we apply a final L1 regularization setp at the end of each reducer using statistics collected from the mappers; (2) We estimate the gradient using a mini-batch of 16 samples where the mini-batch is processed in parallel via multi-threading; (3) We do not perform any binarization or subsampling as in [20]; (4) Unlike [21], we do not peform any clustering of our vocabulary. Algorithm 1 presents our variant of the iterative parameter mixtures (IPM) algorithm based on sampling. This presents a merging of concepts from the original IPM algorithm described in [16] and the distributed sample-based algorithm in [18] as well as the lazy L1 SGD computation from [22]. Algorithm 1 Sample-based Iterative Parameter Mixtures Require: n is the number of samples per worker per epoch Require: Break S into K partitions 1: S ← {D 1 , . . . , Dj , . . . , DK} 2: t ← 0 3: Θt ← 0 4: repeat 5: t ← t + 1 6: {θ 1 1, . . . , θK L } ← IPMMAP(D 1 , . . . , DK, Θt−1, n) 7: Θ 0 t ← IPMREDUCE(θ 1 1, . . . , θj l , . . . , θK L ) 8: Θt ← APPLYL1(Θ0 t) 9: until converged 10: function IPMMAP(D, Θ, n) 11: . IPMMAP processes training data in parallel 12: Θ0 ← Θ 13: for i = 1 . . . n do . n examples from D 14: Sample di from D 15: Θ 0 i ← ApplyLazyL1(ActiveF eatures(di, Θi−1)) 16: Θi ← Θ 0 i − α∇Fdi (Θ0 i) 17: α ← U pdateAlpha(α, i) 18: end for 19: return Θn 20: end function 21: function IPMREDUCE(θ 1 l , . . . , θj l , . . . , θK l ) 22: . 
While this is a general paradigm for distributed optimization, we show the MapReduce [23] implementation in Algorithm 1. We begin the process by partitioning the training data S into multiple units D^j, processing each of these units with the IPMMAP function on separate processing nodes. On each of these nodes, IPMMAP samples a subset of D^j which we call d_i. This can be a single example or a mini-batch of examples. We perform the lazy L1 regularization update to the model, compute the gradient of the regularized loss associated with the mini-batch (which can also be done in parallel), update the local copy of the model parameters Θ, and update the learning rate α. Each node samples n examples from its data partition. Finally, IPMREDUCE collects the local model parameters from each IPMMAP and averages them in parallel. Parallelization here can be done over subsets of the parameter indices (each IPMREDUCE node averages a subset of the parameter space). We refer to each full MapReduce pass as an epoch of training. Starting with the second epoch, the IPMMAP nodes are initialized with the previous epoch's merged, regularized model.

In a general shared distributed framework, such as the one used at Google, some machines may be slower than others (due to hardware or overload), machines may fail, or jobs may be preempted. When using a large number of machines this is inevitable. To avoid restarting the training process in these cases, and to avoid making all other machines wait for the lagging ones, we enforce a timeout on our trainers: all mappers have to finish within a certain amount of time. The reducer therefore merges all models once they have either finished processing their samples or timed out.

3.2. Backoff inspired features

MaxEnt language models commonly have n-gram features, which we denote here as a function of the string, the position, and the order as follows:

NGram(w_1 . . . w_n, i, k) = <w_{i−k}, . . . , w_{i−1}, w_i>

We now introduce some features inspired by the backoff parameters α(w_{i−k}^{i−1}) presented in Section 2. We begin with the most directly related features, which we term suffix backoff features:

SuffixBackoff(w_1 . . . w_n, i, k) = <w_{i−k}, . . . , w_{i−1}, BO>

These fire if and only if the full n-gram NGram(w_1 . . . w_n, i, k) is not in the feature dictionary (see Section 4.1). This is directly analogous to the backoff weights in standard n-gram models, since it is a parameter associated with the prefix history that fires when the particular n-gram is unobserved. Inspired by the form of this feature, we can introduce other general backoff features. First, rather than replacing the suffix, we can replace the prefix:

PrefixBackoff(w_1 . . . w_n, i, k) = <BO, w_{i−k+1}, . . . , w_i>

Next, we can replace multiple words in the feature, to generalize across several such contexts:

PrefixBackoff_j(w_1 . . . w_n, i, k) = <BO_k, w_{i−j}, . . . , w_i>
SuffixBackoff_j(w_1 . . . w_n, i, k) = <w_{i−k}, . . . , w_{i−k+j}, BO_k>

These features indicate that an n-gram of length k + 1, ending with (PrefixBackoff_j) or beginning with (SuffixBackoff_j) the particular words in the feature, is not in the feature dictionary. Note that if j = k − 1, then PrefixBackoff_j is identical to the PrefixBackoff feature defined earlier, and SuffixBackoff_j is identical to SuffixBackoff. For example, suppose that we have the string S = "we will save the quail eggs" and that the 4-gram "will save the quail" does not exist in our feature dictionary.
Then we can fire the following features at word w_5 = "quail":

SuffixBackoff(S, 5, 3)   = <will, save, the, BO>
PrefixBackoff(S, 5, 3)   = <BO, save, the, quail>
SuffixBackoff_0(S, 5, 3) = <will, BO_3>
SuffixBackoff_1(S, 5, 3) = <will, save, BO_3>
PrefixBackoff_0(S, 5, 3) = <BO_3, quail>
PrefixBackoff_1(S, 5, 3) = <BO_3, the, quail>

As with n-gram feature templates, we include all such features up to some specified length; e.g., a trigram model includes n-grams up to length 3: unigrams, bigrams and trigrams. Similarly, for our prefix and suffix backoff features, we set a maximum length and include in our possible feature set all such features of that length or shorter.

4. Experimental results

We performed two experiments to evaluate the utility of these new backoff-inspired features in maximum entropy language models trained on very large corpora. First, we examine perplexity improvements when such features are included in the model alongside n-gram features. Next, we look at Word Error Rate (WER) performance when reranking the output of a baseline recognizer, again using different backoff feature templates. In all cases, we fixed the vocabulary and feature budget of the model so that improvements are not simply due to having more parameters in the model.

We set the vocabulary of our model to 200 thousand words, by selecting all words from the 2M words in the baseline recognizer vocabulary that had been emitted by the recognizer in the last 6 months of log files. All other words are mapped to an unknown-word token. We use the same vocabulary in all of our experiments. For our experiments, we focus on the voice search task. Our data sets are assembled and pooled from anonymized supervised and unsupervised spoken queries (such as search queries, questions, and voice actions) and typed queries to google.com, YouTube, and Google Maps, from desktop and mobile devices. Our overall training set is about 305 billion words (including end-of-sentence symbols). We divide this set into K subsets and assign subset D^k to trainer k (where 1 ≤ k ≤ K). Then we run our distributed training (Algorithm 1) using K machines. Since the amount of training data is very large, trainer k randomly samples data points from its subset D^k. Each epoch uses a different sampling seed, equal to the epoch number. As mentioned above, a trainer may terminate because it has completed its subsample or because it has hit the timeout. We fix the timeout threshold for each epoch across all our experiments; in our experiments, the timeout is 6 hours.

4.1. Feature Dictionary

A feature dictionary maps each feature key (e.g., the trigram "save the quail") to an index in the parameter vector Θ. As described in Algorithm 2, we build this dictionary by iterating over all strings in our training data and using the NGram function (defined above) to build the n-gram feature keys (for every k = 0 . . . 4). Also, for each string, we build the required backoff feature keys (depending on the experiment). Upon collecting all of these keys, we compute the total observed count for each feature key and then retain only the most frequent ones. We assign a different count cutoff to each feature template. We determine these cutoffs based on a classical cross-entropy pruned n-gram model trained on the same data. Afterwards, our dictionary maps each key to a unique consecutive index = 0 . . . Dim. In all our experiments, we allocated the same budget of 228 million parameters.
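A minimal Python sketch of this dictionary construction is given below, restricted to n-gram and SuffixBackoff keys for brevity. The count cutoffs here are illustrative placeholders rather than the cross-entropy-derived cutoffs used in the paper, and no sentence-boundary padding is shown.

    from collections import Counter

    def ngram_key(words, i, k):
        return tuple(words[i - k : i + 1])                # <w_{i-k}, ..., w_i>

    def suffix_backoff_key(words, i, k):
        return tuple(words[i - k : i]) + ("BO",)          # <w_{i-k}, ..., w_{i-1}, BO>

    def build_dictionary(corpus, max_k=4, cutoffs=None):
        cutoffs = cutoffs or {k: 2 for k in range(max_k + 1)}   # illustrative cutoffs only
        counts = {k: Counter() for k in range(max_k + 1)}
        for sentence in corpus:
            words = sentence.split()
            for i in range(len(words)):
                for k in range(min(i, max_k) + 1):
                    counts[k][ngram_key(words, i, k)] += 1
                    if k > 0:                             # suffix backoff needs a prefix
                        counts[k][suffix_backoff_key(words, i, k)] += 1
        dictionary = {}                                   # key -> consecutive parameter index
        for k in range(max_k + 1):
            for key, c in counts[k].items():
                if c >= cutoffs[k]:
                    dictionary[key] = len(dictionary)
        return dictionary

    print(build_dictionary(["we will save the quail eggs",
                            "we will save the quail farm"]))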
It is important to note that the number of features dedicated to backoff features may vary significantly across backoff-feature types. Note that, while the backoff inspired features detailed in Section 3.2 are defined to fire only when the corresponding n-gram does not appear in the feature dictionary, they themselves must appear in the feature dictionary in order to fire. If one of these features does not appear frequently enough, it will not appear in the feature dictionary, and neither the original n-gram nor the backoff feature will fire.

Algorithm 2 Dictionary Construction
for all w_1, w_2, . . . , w_n ∈ Data do
  for i ← 1 . . . n do            ▷ We use 5-gram features.
    for k ← 0 . . . 4 do
      key ← NGram(w_1, . . . , w_n, i, k)
      dict_k ← dict_k ∪ {key}
      count_k[key] ← count_k[key] + 1
      ▷ Call the backoff functions above.
      bo_key ← SuffixBackoff(w_1, . . . , w_n, i, k)
      dict_k ← dict_k ∪ {bo_key}
      count_k[bo_key] ← count_k[bo_key] + 1
    end for
  end for
end for
▷ Retain the most frequent features in dict_k and map each feature to a unique index, for each k = 0, . . . , 4.

4.2. Feature Sets

In these experiments, all MaxEnt language models include n-grams up to 5-grams. Our backoff inspired features are also based on substrings up to length 5, i.e., up to 4 words, either preceded (prefix) or followed (suffix) by the "BO" token in the case of PrefixBackoff and SuffixBackoff features, or by "BO_j" up to j = 4 preceding (prefix) or following (suffix) the word. We examine several feature set pools: (1) n-gram features alone (NG); (2) n-gram features plus PrefixBackoff (NG+P) or SuffixBackoff (NG+S); (3) n-gram features plus PrefixBackoff_j (NG+Pk) or SuffixBackoff_j (NG+Sk); and (4) n-gram features plus PrefixBackoff_j and SuffixBackoff (NG+Pk+S) or SuffixBackoff_j (NG+Pk+Sk). In each case, feature dictionaries are built so they may contain more or fewer n-grams as required to include the backoff features in the dictionary. For the current experiments, trials with PrefixBackoff_j or SuffixBackoff_j only include features with j = 0, i.e., a single word alongside the "BO_k" token. Note that the number of such features is relatively constrained compared to the n-gram features and other backoff features – at most k|V| possible features for a vocabulary V.

4.3. Perplexity

Perplexity was measured on a held-aside random sample of 5 million words from our pool of data. Figure 1 plots perplexity versus number of epochs (up to 11) for different possible feature sets.

Figure 1: Perplexity versus number of epochs of training for various feature sets under the same feature budget constraint. Feature sets include: (1) n-gram features (NG); (2) PrefixBackoff (P); (3) SuffixBackoff (S); (4) PrefixBackoff-k (Pk); and (5) SuffixBackoff-k (Sk). [Figure omitted; axes are perplexity versus epochs.]

Recall that data is randomly sampled from the overall training set, so this plot also shows behavior as the amount of training data is increased.
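For reference, perplexity on the held-aside sample is computed from per-word log probabilities as in the following sketch; the log_prob callable is a stand-in for whatever model is being evaluated (here a hypothetical uniform distribution over a 200k-word vocabulary), and counting an end-of-sentence symbol is one common convention, assumed rather than taken from the paper.

    import math

    def perplexity(heldout_sentences, log_prob):
        # Perplexity = exp( -(1/N) * sum over held-out words of ln P(w | history) ).
        total, n_words = 0.0, 0
        for sentence in heldout_sentences:
            words = sentence.split() + ["</s>"]           # count a sentence-end symbol
            history = []
            for w in words:
                total += log_prob(history, w)
                history.append(w)
                n_words += 1
        return math.exp(-total / n_words)

    # Sanity check with a hypothetical uniform model over a 200k-word vocabulary:
    print(perplexity(["we will save the quail eggs"],
                     lambda history, w: -math.log(200000.0)))   # ~ 200000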
Table 1 presents perplexities after epoch 11, along with the number of samples used during training and the number of active features with non-zero parameters. The number of samples varies because some trainers may run faster than others depending on the number and type of features used; since we enforce a timeout, an epoch may vary in the number of samples processed in time. Nonetheless, Figure 1 shows that most models have approached or reached convergence before completing all 11 epochs. A notable exception is the n-gram-only model, which seems to require a few more epochs before reaching convergence – though clearly its performance will not reach that of the other trials. This points to another benefit of the backoff features: they also seem to speed convergence for these models. Interestingly, they also seem to considerably reduce the number of active features. The results show a large perplexity improvement due to the use of backoff features, and in particular the generalized Prefix/SuffixBackoff-k features. One potential reason for the improved performance with these generalized backoff features is that there are relatively few of them and they fire more often, as discussed in the previous section.

Table 1: Perplexity (Pplx) after 11 epochs of training, with a fixed feature budget; number of samples (Samp) used for training each model, in billions; and active features (ActFt), in millions.

Feature Set   Description                                   Pplx    Samp    ActFt
NG            N-grams only                                  167.0   137B    197.8M
NG+P          N-grams + PrefixBackoff                       122.6   112B    189.5M
NG+S          N-grams + SuffixBackoff                       109.8   125B    188.9M
NG+Pk         N-grams + PrefixBackoffk                       88.0   100B    170.1M
NG+Pk+S       N-grams + PrefixBackoffk + SuffixBackoff       85.5   113B    172.6M
NG+Sk         N-grams + SuffixBackoffk                       82.7   126B    160.2M
NG+Pk+Sk      N-grams + PrefixBackoffk + SuffixBackoffk      80.2    96B    162.4M

4.4. Speech Recognition Rescoring Results

We evaluated our models by rescoring n-best outputs from a baseline recognizer, with n set to 500. The acoustic model of the baseline system is a deep-neural-network-based model with 85M parameters, consisting of eight hidden layers with 2560 Rectified Linear hidden units each and softmax outputs for the 14,000 context-dependent state posteriors. The network processes a context window of 26 frames of speech (20 past and 5 future), each represented with 40-dimensional log mel filterbank energies taken from 25ms windows every 10ms. The system is trained to a cross-entropy criterion on a US English data set of 3M anonymized utterances (1,700 hours, or about 600 million frames) collected from live voice search dictation traffic. The utterances are hand-transcribed and force-aligned with a previously trained DNN. See [24] for Google's Voice Search system design. The baseline LM is a Katz [3] smoothed 5-gram model pruned to 23M n-grams, trained on the same data using Bayesian interpolation to balance multiple sources. It has a vocabulary size of 2M and an OOV rate of 0.57% [25]. The score assigned to each hypothesis by our MaxEnt LM is linearly interpolated with the baseline recognizer's LM score (with an untuned mixture factor of 0.33). Table 2 presents WER results for multiple voice-search data sets collected from anonymized and manually transcribed live traffic from mobile devices. These data sets contain regular spoken search queries, questions, and YouTube queries.
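The 500-best rescoring and interpolation described above could be sketched as follows; the tuple layout of the n-best list, the toy scores, and the exact way the 0.33 mixture factor combines with the acoustic score are illustrative assumptions rather than the production setup.

    def rescore_nbest(nbest, maxent_lm_logprob, lam=0.33):
        # Rerank hypotheses by interpolating the MaxEnt LM score with the first-pass
        # LM score; each entry is (hypothesis, acoustic_logprob, baseline_lm_logprob).
        def combined(entry):
            hyp, am, lm = entry
            return am + lam * maxent_lm_logprob(hyp) + (1.0 - lam) * lm
        return max(nbest, key=combined)[0]

    # Hypothetical toy n-best list with made-up scores:
    nbest = [("we will save the quail eggs", -120.0, -18.0),
             ("we will save the kale eggs",  -119.5, -21.0)]
    print(rescore_nbest(nbest, lambda hyp: -15.0 if "quail" in hyp else -22.0))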
We achieve modest gains over the baseline system and over rescoring with just n-gram features on all of the test sets, achieving, in aggregate, half a point of improvement over the baseline system.

5. Conclusion

In this paper we introduced and explored features for maximum entropy language models inspired by the backoff mechanism of standardly smoothed language models. We found large perplexity improvements over using n-gram features alone, for the same feature budget, and a 0.5% absolute (3.4% relative) WER improvement over the baseline system for our best performing model. Future work will include exploring further variants of our general backoff feature templates and combining them with other features beyond n-grams.

Table 2: WER results on 7 sub-corpora and overall, for the baseline recognizer (no reranking) versus reranking models trained with different feature sets. Utterance and word counts are in thousands.

Test    Utts / Wds      Reranking feature set
Set     (x1000)        None    NG    NG+Pk   NG+Sk   NG+Pk+Sk
1       22.5 / 98.0    12.7   12.6   12.4    12.4    12.4
2       17.8 / 74.0    12.7   12.5   12.4    12.4    12.3
3       16.2 / 61.1    17.3   17.1   16.7    16.8    16.7
4       18.0 / 64.0    12.8   12.7   12.6    12.6    12.5
5        7.4 / 50.7    16.8   16.6   16.2    16.2    16.2
6        7.3 / 31.9    15.1   15.0   14.8    14.8    14.9
7       19.6 / 69.1    16.5   16.2   15.9    15.9    15.9
all    108.9 / 448.8   14.6   14.4   14.2    14.2    14.1

6. References

[1] R. Lau, R. Rosenfeld, and S. Roukos, "Trigger-based language models: a maximum entropy approach," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1993, pp. 45–48.
[2] R. Rosenfeld, "A maximum entropy approach to adaptive statistical language modeling," Computer Speech and Language, vol. 10, pp. 187–228, 1996.
[3] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recogniser," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[4] S. F. Chen and R. Rosenfeld, "A survey of smoothing techniques for ME models," IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 37–50, 2000.
[5] R. Rosenfeld, "A whole sentence maximum entropy language model," in Proceedings of the IEEE Workshop on Speech Recognition and Understanding, 1997, pp. 230–237.
[6] R. Rosenfeld, S. F. Chen, and X. Zhu, "Whole-sentence exponential language models: a vehicle for linguistic-statistical integration," Computer Speech and Language, vol. 15, no. 1, pp. 55–73, Jan. 2001.
[7] B. Roark, M. Saraclar, and M. Collins, "Discriminative n-gram language modeling," Computer Speech & Language, vol. 21, no. 2, pp. 373–392, 2007.
[8] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, "Large language models in machine translation," in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP) and Computational Natural Language Learning (CoNLL), 2007.
[9] J. Wu and S. Khudanpur, "Efficient training methods for maximum entropy language modeling," in INTERSPEECH, 2000, pp. 114–118.
[10] T. Alumäe and M. Kurimo, "Efficient estimation of maximum entropy language models with n-gram features: an SRILM extension," in INTERSPEECH, 2010, pp. 1820–1823.
[11] P. Xu, A. Gunawardana, and S. Khudanpur, "Efficient subsampling for training complex language models," in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 1128–1136.
[12] J. Wu and S.
Khudanpur, “Building a topic-dependent maximum entropy model for very large corpora,” in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1. IEEE, 2002, pp. I–777. [13] R. Sarikaya, M. Afify, Y. Deng, H. Erdogan, and Y. Gao, “Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal arabic,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 7, pp. 1330– 1339, 2008. [14] M. A. B. Shaik, A. E.-D. Mousa, R. Schluter, and H. Ney, ¨ “Feature-rich sub-lexical language models using a maximum entropy approach for german LVCSR,” in INTERSPEECH, 2013. [15] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31:9, 1986. [16] K. Hall, S. Gilpin, and G. Mann, “Mapreduce/bigtable for distributed optimization,” in Neural Information Processing Systems Workshop on Leaning on Cores, Clusters, and Clouds, 2010. [17] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies for the structured perceptron,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 456–464. [18] M. Zinkevich, M. Weimer, A. Smola, and L. Li, “Parallelized stochastic gradient descent,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 2595–2603. [19] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild: A ´ lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011. [20] P. Xu, A. Gunawardana, and S. Khudanpur, “Efficient subsampling for training complex language models.” in EMNLP. ACL, 2011, pp. 1128–1136. [Online]. Available: http://dblp.uni-trier.de/db/conf/emnlp/emnlp2011. html#XuGK11 [21] F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” in AISTATS05, 2005, pp. 246– 252. [22] Y. Tsuruoka, J. Tsujii, and S. Ananiadou, “Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, 2009, pp. 477–485. [23] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” in Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, ser. OSDI’04, 2004, pp. 10–10. [24] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, “your word is my command: Google search by voice: A case study,” in Advances in Speech Recognition. Springer, 2010, pp. 61–90. [25] C. Allauzen and M. Riley, “Bayesian language model interpolation for mobile speech input.” in INTERSPEECH, 2011, pp. 1429–1432. 2649 Unsupervised Testing Strategies for ASR Brian Strope, Doug Beeferman, Alexander Gruenstein, Xin Lei Google, Inc. bps, dougb, alexgru, xinlei @google.com Abstract This paper describes unsupervised strategies for estimating relative accuracy differences between acoustic models or language models used for automatic speech recognition. To test acoustic models, the approach extends ideas used for unsupervised discriminative training to include a more explicit validation on held out data. 
To test language models, we use a dual interpretation of the same process, this time allowing us to measure differences by exploiting expected ‘truth gradients’ between strong and weak acoustic models. The paper shows correlations between supervised and unsupervised measures across a range of acoustic model and language model variations. We also use unsupervised tests to assess the non-stationary nature of mobile speech input. Index Terms: speech recognition, unsupervised testing, nonstationary distributions 1. Introduction Current commercial speech recognition systems can use years of unsupervised data to train relatively large, discriminatively optimized, acoustic models (AM). Similarly, web-scale text corpora for estimating language models (LM) are often available online, and unsupervised recognition results themselves can provide an additional source of LM training data. Since there is no human transcription in any of these steps, the remaining use for manual human transcription is for generating test sets, as a final sanity check for validating system parameters and models. In this paper, we augment that strategy with unsupervised evaluations and begin the discussion of whether eventually we might be able to get rid of the need for any explicit human transcription. The motivation for human transcription for testing is obvious. Despite steady advances and relative commercial successes, it is generally accepted that humans are much more accurate transcribers than automatic speech recognition systems [1]. While there are a few notable exceptions where machines were more accurate than humans [2], human transcription accuracy is so much better, we use it unquiestioningly as our best approximate for absolute truth. But there are equally obvious disadvantages to relying on human transcription. While it may feel premature, accepting human performance as absolute truth imposes an upper bound on accuracy. The absolute truth is not absolute, and so we’ll eventually have to figure out how to beat it. In fact with our current processes and tasks, below, we show that human transcribers can be only comparable in accuracy to current ASR systems. Absolute truth is already a problem. In response, we are improving transcription processes, but also considering unsupervised ways to augment traditional testing. Another obvious disadvantage of human transcription is that the tests themselves have to be limited in size and type. Even in a commercially successful research lab, getting extensive tests across every combination of speaker and channel type, recognition context, language, and time period is prohibitive. But a detailed characterization of those types of variations could help prioritize efforts. Similarly when tests are unsupervised, it is easier to update development and evaluation sets to avoid problems related to stale, over-fit tests. This is mostly an empirical paper. The next section describes some of the experiments we ran trying to assess our existing human transcription accuracy. Then we describe the generalizations of unsupervised discriminative training that enable a new evaluation strategy. Next the paper includes evaluations that show correlations between supervised and unsupervised tests, and concludes with unsupervised tests that start to characterize the non-stationary distribution of spoken data coming through Google mobile applications. 2. Problems with human transcriptions Recent efforts have begun to consider human transcription accuracy in the context of increased efficiency. 
These studies have generally shown that depending on the amount of effort, and the task, individual word error rates can vary from 2-15% [3, 4]. Ef- ficiency pressures on human transcription can lead to transcription noise and bias. 2.1. Early experiments Over the last few years we have seen several simple experiments not work: we have added matched data to our language models and seen error rates get worse; we have added unsupervised acoustic modeling data matched to a new fielded acoustic condition, and seen the error rates on new matched tests go up, but surprisingly, error rates on an old test, with slightly mismatched conditions, go down. For each of these, after tediously examining errors, we found the problem was that we typically “seed” our transcription process with the recognition result from the field. Mostly as a matter of expedience; it is easier for the transcriber to hit return than to type “home depot in palo alto california” yet again, and it can improve reliability since retyping can be error prone. But the power of the suggested transcription is also enough to bias the transcribers into rubber-stamping some of the fielded recognition results. When the transcriber rubber-stamps an error we potentially get penalized twice. The baseline gets credit where it should not, and a new system that corrects that error is falsely penalized for adding an error. The surprising improvement noted on the older, slightly mis-matched test happened because the transcriptions for the older test were seeded with transcriptions from an older system, decorrelating some of the transcription bias with the current baseline. In this case, transcription bias toward the baseline model was a bigger effect than the change in acoustics. Copyright © 2011 ISCA 28-31 August 2011, Florence, Italy INTERSPEECH 2011 16852.2. Multiple attempts To measure the human transcription accuracy more directly we started sending the same data for multiple attempts at human transcription, and we intentionally reduced the quality of our starting seeds to move any bias away from our best systems. For one test we sent 200K Voice Search utterances to be transcribed twice. Ignoring trivial differences like spaces, apostrophes, function words, and others, half of the transcripts agreed, which implies a sentence transcription accuracy of 71%, assuming independence of the attempts. Similarly when we sent the remaining 100K utterances, where transcriptions did not agree, back for two more attempts, we were still left with about 10% of the original set with 4 distinct human transcriptions. Again assuming independence, 10% disagreement in 4 attempts is consistent with 68% accuracy for each attempt. But we believe our system has a sentence accuracy higher than 70%. Looking through the errors many of the problems are related to cultural references, popular names, and businesses that are not obvious to everyone. The cultural and geographic requirements of the voice search task may be unusually difficult. It combines short utterances and wide open semantic contexts to generate surprisingly unfamiliar sounding speech. Finding ways to bring the correct cultural context to the transcriber is another obvious path to pursue. 3. Generalizing unsupervised discriminative training While some published results considered unsupervised maximum likelihood estimation of model parameters [5], many systems use unsupervised discriminative optimization, directly using recognizer output as input [6]. 
Cynically we might ask what we are learning if we are using the recognition result as truth for discriminatively optimizing its parameters. It is hard to imagine that we can fix the errors it makes, when we use the model to generate truth. But when we look into the details of commonly used discriminative training techniques based on maximum mutual information, we see that the LM used to generate competing hypotheses is not the same LM used to generate truth. To improve the generalization of discriminative training, we use a unigram to describe the space of potential errors [7], but a trigram or higher to give us transcription truth with unsupervised training. One interpretation of unsupervised discriminative training for acoustic models is that we are using the difference between a weak unigram and a relatively stronger trigram to give us a known improvement in relative truth. We do not know that the strong-LM (trigram) result is absolutely correct, we only know that it is better than the result with the weak LM (unigram). When there is a difference, if we can move toward the results of the strong-LM system by changing acoustic model parameters, then we are building a more accurate AM, that also helps with the final system using a stronger LM. With this interpretation, the AM learns from the ‘truth gradient’ between the strong and weak LMs. 3.1. Unsupervised AM testing Extending unsupervised discriminative AM training to unsupervised AM testing involves retesting the criterion used during training in a new test context. More prescriptively, we sample a new set of live data from production logs, and take the recognition result from the fielded system using a strong AM and LM as assumed truth. Then we re-recognize the same data using multiple strong acoustic models and a weak LM. If one of the systems using a weak LM can better approximate the system using a strong LM, then at a minimum, we can say that it is doing a better job of generalizing our training criteria to new data. More directly, we have evidence that one of the strong acoustic models could be more accurate than the rest. For scoring we are assuming truth from the fielded system, not a human transcriber. Therefore, when reporting unsupervised testing results, we count traditional word error rates, but because there is no human transcription, we report it as a word difference rate (WDR), to highlight that, for example, in the case of unsupervised AM tests, it is the word differences between the systems with the strong and weak LM. 3.2. Unsupervised LM testing To use the same strategy for LM testing we reverse the roles of the AM and the LM. For better generalization of discriminative AM testing, we used a weak LM to generate more competing alternates. That establishes a truth gradient that generally changes around 1/3 of the words. The dual for LM testing is to use a weak AM instead. To get a truth gradient of a similar magnitude with our systems, we backed off to a context-dependent acoustic model that uses around 1/10th the number of parameters of our strong models, and only uses maximum likelihood training. Then as above, we test with multiple strong LMs and assume that the LM that can move the results of the system using the weak AM closest to the results of the production system (with the strong AM), is the most accurate LM. With unsupervised LM testing we again report WDR and not WER, where the magnitude of the difference is now from the difference between the strong AM and the weak AM. 3.3. 
Relative measures In this paper we are ignoring the harder problem of measuring absolute accuracy. Instead we focus on relative differences between different acoustic or language models. Others have predicted absolute error measures using statistics from the training set as represented in the final acoustic models [8], without looking at testing data. But here we are interested in estimating relative performance across production data that was unseen during training. Our goal is to assess whether new models or new approaches are helping on new data, and whether the data might be changing from the distributions used during training. 4. Correlating supervised and unsupervised measures First we show that the performance on unsupervised offline tests for the AM and for the LM correlate with more traditional supervised tests. Our production data started with primarily Voice Search queries intended for google.com, but over time has included increasing amounts of general Voice Input traffic which includes a large fraction of short person-to-person messages. To start the analyses, we consider these data streams separately. For Voice Search, our traditional supervised test is built from the 200K utterance set that we sent for multiple transcriptions. For this test we exclude the 10% of the utterances where we got 4 distinct human transcriptions and sample a test set randomly from the remaining 90%. Similarly for the supervised 1686Voice Input test, we sent utterances twice and selected from the utterances with at least 80% agreement between human transcriptions. On the utterances where not all the words agreed, we randomly chose one of the human transcriptions as truth. This led to a test that excluded about 28% of the utterances. Both of these supervised tests are biased in that they only include the utterances that we could reliably transcribe. The Voice Search test has 27K utterances and 87K words. The Voice Input test has 49K utterances and 320K words. For the first unsupervised tests here, we sampled production logs for a single day of traffic. We found the median recognizer confidence for each task and then randomly selected a few hundred thousand utterances that were above median confidence for each task. For all unsupervised experiments we used the recognition results from the field as truth. Our recognition configuration for both systems is fairly standard and described in the literature. Specifically we use a PLP front-end [11] together with LDA and STC [12], and optimize our acoustic models using BMMI [13] on mostly unsupervised data mixed from both tasks. Our language models are n-grams, with Katz interpolation and entropy pruning, and the fielded Voice Input system also includes dynamic interpolation [14]. The Voice Search system used trigrams and the Voice Input system included 4-grams. 4.1. AM experiments The AM experiments use a weak LM (in this case a unigram) for each task estimated from the few hundred thousand high confidence utterances sampled for that day’s test. All the utterances in the test were also used to train the LM, so there is no OOV. This step is consistent with the matched unigram we train for discriminative acoustic model training. For Voice Search, the resulting unigram had 17K words, and for Voice Input there were 18K unique words. The acoustic models we tested here were trained using 11M (mostly unsupervised) utterances from a mix of both tasks. The parameter we vary for these experiments is the size of the acoustic models. 
We use the same decision tree and context state definitions for all models, but we vary the number of Gaussians assigned to each state. Each model is trained with the same number of iterations through all the data. The final model sizes range from 100K to 1M Gaussians. Decoder parameters are set in production mode, which generally means we lose around 0.5% absolute from the best possible accuracy in order to have faster than real-time search.

Table 1: WER in % on supervised (Sup) and WDR in % on unsupervised (Unsup) AM tests for Voice Search (VS) and Voice Input (VI).

# Gauss   Sup VS   Unsup VS   Sup VI   Unsup VI
100K       16.0      36.0      14.5      24.8
200K       15.3      34.4      13.6      22.8
340K       14.6      33.9      13.4      22.7
500K       14.3      33.3      13.2      22.3
1M         13.9      33.0      12.9      21.8

4.2. LM experiments

For the LM experiments we vary the number of n-grams used for the Voice Input task from around 2M to 30M by varying our final entropy pruning threshold. Unlike the production system used to generate truth for the unsupervised tests, for these tests the LM is a static n-gram model. We show results with two different weak acoustic models (A/B). Condition A is a context-dependent model estimated using maximum likelihood criteria with 2 Gaussians per state, for a total of 16K Gaussians. Condition B uses a similar model with a variable number of Gaussians across model states, and a total of 40K Gaussians. On supervised tests, these weak acoustic models have around two to three times the error rates of the final strong production models.

Table 2: Comparing supervised (Sup) and unsupervised (Unsup) LM tests for Voice Input. WER/WDR are in %, PPL is perplexity. Unsup A and B are for different sized AMs.

n-grams   Sup PPL   Sup WER   Unsup A/B WDR
1.9M        109       15.2      38.1 / 25.9
3.8M         98       14.4      36.8 / 24.5
7.6M         92       14.1      36.0 / 23.8
15M          87       13.9      35.5 / 23.2
30M          85       13.7      35.1 / 22.8

The relative improvement in both the AM and LM experiments is consistently around 10% for a 10x increase in model size. Correlations between supervised and unsupervised tests range between 0.98 and 0.99.

5. Additional experiments

Varying model size is a controlled way to generate accuracy differences. Here we include additional unsupervised measurements that show expected differences in the context of other AM and LM modeling efforts.

5.1. CMLLR

To evaluate an implementation of constrained maximum likelihood linear regression [9] for adaptation, we started by testing with read speech corpora from several data collections [10] used to initialize acoustic models in a new context. With large and regular amounts of acoustic data per speaker, we see the typical improvements of 6-10% relative over a matched discriminative baseline. To estimate the accuracy impact of CMLLR on the production system (where the actual distribution of the amount of data per user is not imposed by the strict specifications of a data collection), we used unsupervised testing. Here we sampled all personalized users over a 30 day period, and measured the change in WDR with a weak LM and either the production AM or the production AM with CMLLR. Further, we break the differences in WDR down by the amount of data available for each speaker.

Table 3: WDR in % on adaptation tests. Input is binned by the number of utterances for a given user.

# Utts     No Adapt   Adapt
1-20         25.7      25.4
20-50        26.6      25.6
50-100       25.8      24.6
100-200      23.5      22.5
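Throughout these sections, WDR is counted exactly like WER except that the stronger system's output stands in for the human reference; a minimal sketch, with made-up hypotheses, is below.

    def word_difference_rate(strong_hyps, test_hyps):
        # WDR: computed like WER, but the reference is the stronger system's output
        # rather than a human transcription.
        def edit_distance(a, b):
            prev = list(range(len(b) + 1))                # word-level Levenshtein DP
            for i, x in enumerate(a, 1):
                cur = [i]
                for j, y in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
                prev = cur
            return prev[-1]
        errors = sum(edit_distance(r.split(), t.split())
                     for r, t in zip(strong_hyps, test_hyps))
        ref_words = sum(len(r.split()) for r in strong_hyps)
        return 100.0 * errors / ref_words

    # Made-up hypotheses from a strong and a weak system:
    print(word_difference_rate(["navigate to home depot"],
                               ["navigate to home people"]))    # 25.0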
From Table 3, it is clear that we are seeing a similar relative difference as we saw with the more traditional read speech tests, and we are further able to characterize the expected saturation of the relatively small number of parameters in CMLLR after around 20 voice input utterances.

5.2. LM update

At one point we updated our language model to include a rescoring pass more explicitly matched to recent Voice Search queries. By testing this update with recent unsupervised tests we are able to show the expected win on new voice search type utterances.

Table 4: WER in % on supervised (Sup) and WDR in % on unsupervised (Unsup) LM tests for Voice Search.

Model Config   Sup VS   Unsup VS
Original        14.6      30.0
Updated         14.6      28.6

One interpretation of these results is that we are updating the LM to better represent the recent query data, which itself is better matched to the recent unsupervised test. It also suggests that the distribution of our data might be moving.

5.3. Estimating non-stationary distributions

Finally, we ran two sweeps of AM tests to estimate how stationary the acoustics for our system have been over the last 14 months. The first system is trained using the Voice Search supervised data available at the beginning of the 14 months, and the second uses only unsupervised data sampled from the last 3 months. Therefore, one model represents our initial estimate of the distribution, and the other approximates the most recent distribution. Both systems use around 350K Gaussians. To evaluate the AM performance, we use a weak LM estimated from a year's worth of production data.

Figure 1: Change in WDR over time with two different AMs. [Figure omitted.]

Both lines show that the distribution of the data has shifted away from the original supervised data, and toward the recent unsupervised data. Additional unsupervised tests will illuminate the causes of this change in more detail. We currently suspect an increase in the fraction of voice input recognition, but it is already obvious that the distribution of the acoustics for this data is changing. The plot also suggests that with a single AM the change of WDR across conditions may also be informative.

Note that since we are generalizing from the same criteria we used for AM training, and we are getting rid of some of the necessity of human transcription, we are concerned about converging away from reality. The ground is a little firmer on the LM side, since our current LM processes are in fact not yet learning from AM truth gradients the way our unsupervised AM training learns from LM truth gradients. From the AM side, our current unsupervised tests are simply checking whether the training optimizations extend to unseen data. Pragmatically, because testing is unsupervised we also have the opportunity to test that generalization with a range of weak LMs and with a range of input data, and thereby to increase our confidence in the generalization. Moreover, reducing the accuracy improvement provided by a strong LM seems like a safe requirement to impose on AM training. But from an experimental perspective, we have to remember what gradient we are exploiting and not cheat. In other words, augmenting the AM with features directly related to the strong LM would not lead to improvements. We also monitor coarse signals related to application use (counts of user actions in response to recognition results) to give us additional complementary evidence of successful generalization.

6.
Conclusions This paper extends unsupervised discriminative training to an unsupervised testing strategy suitable for evaluating AM and LM changes. We show strong correlations with traditional testing strategies when we change AM or LM model size. We also show expected gains on unsupervised measures with other types of AM and LM changes, and use the unsupervised measures to begin to characterize the stationarity of the input data to Google mobile. Together with unsupervised training, unsupervised testing enables development paths that no longer impose human performance as the upper bound for accuracy. 7. References [1] R. Lippmann, “Speech recognition by machines and humans,” Speech Communication, July 1997. [2] T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath, “Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System,” Proc. ICSLP, 2006. [3] S. Novotney, C. Callison-Burch, “Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription,” Proc. NAACL, 2010. [4] A. Gruenstein, I. McGraw, A. Sutherland, “A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game,” Proc. SLaTE, 2009. [5] J. Ma, R. Schwartz, “Unsupervised versus supervised training of acoustic models,” Proc. ICSLP, 2008. [6] L. Wang, M. Gales, P. Woodland, “Unsupervised Training for Mandarin Broadcast News Conversation Transcription,” Proc ICASSP, 2007. [7] P.C. Woodland, D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Comp. Speech & Lang., Jan. 2002. [8] Y. Deng, M. Mahajan, A. Acero, “Estimating Speech Recognition Error Rate without Acoustic Test Data,” Proc. Eurospeech, 2003. [9] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Comp. Speech & Lang., Vol 12.2 1998. [10] T. Hughes, K. Nakajima, L. Ha, A. Vasu, P. Moreno, M. LeBeau, “Building transcribed speech corpora quickly and cheaply for many languages,” Proc ICSLP, 2010. [11] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” JASA, v87.4, 1990. [12] M. Gales, “Semi-Tied Covariance Matrices for Hidden Markov Models,” Proc. IEEE Trans. SAP, May 2000. [13] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” Proc. ICASSP, 2008. [14] B. Ballinger, C. Allauzen, A. Gruenstein, J. Schalkwyk, “OnDemand Language Model Interpolation for Mobile Speech Input,” Proc. ICSLP, 2010. 1688 Parallel Algorithms for Unsupervised Tagging Sujith Ravi Google Mountain View, CA 94043 sravi@google.com Sergei Vassilivitskii Google Mountain View, CA 94043 sergeiv@google.com Vibhor Rastogi∗ Twitter San Francisco, CA vibhor.rastogi@gmail.com Abstract We propose a new method for unsupervised tagging that finds minimal models which are then further improved by Expectation Maximization training. In contrast to previous approaches that rely on manually specified and multi-step heuristics for model minimization, our approach is a simple greedy approximation algorithm DMLC (DISTRIBUTEDMINIMUM-LABEL-COVER) that solves this objective in a single step. We extend the method and show how to ef- ficiently parallelize the algorithm on modern parallel computing platforms while preserving approximation guarantees. The new method easily scales to large data and grammar sizes, overcoming the memory bottleneck in previous approaches. 
We demonstrate the power of the new algorithm by evaluating on various sequence labeling tasks: Part-of-Speech tagging for multiple languages (including lowresource languages), with complete and incomplete dictionaries, and supertagging, a complex sequence labeling task, where the grammar size alone can grow to millions of entries. Our results show that for all of these settings, our method achieves state-of-the-art scalable performance that yields high quality tagging outputs. 1 Introduction Supervised sequence labeling with large labeled training datasets is considered a solved problem. For ∗The research described herein was conducted while the author was working at Google. instance, state of the art systems obtain tagging accuracies over 97% for part-of-speech (POS) tagging on the English Penn Treebank. However, learning accurate taggers without labeled data remains a challenge. The accuracies quickly drop when faced with data from a different domain, language, or when there is very little labeled information available for training (Banko and Moore, 2004). Recently, there has been an increasing amount of research tackling this problem using unsupervised methods. A popular approach is to learn from POS-tag dictionaries (Merialdo, 1994), where we are given a raw word sequence and a dictionary of legal tags for each word type. Learning from POStag dictionaries is still challenging. Complete wordtag dictionaries may not always be available for use and in every setting. When they are available, the dictionaries are often noisy, resulting in high tagging ambiguity. Furthermore, when applying taggers in new domains or different datasets, we may encounter new words that are missing from the dictionary. There have been some efforts to learn POS taggers from incomplete dictionaries by extending the dictionary to include these words using some heuristics (Toutanova and Johnson, 2008) or using other methods such as type-supervision (Garrette and Baldridge, 2012). In this work, we tackle the problem of unsupervised sequence labeling using tag dictionaries. The first reported work on this problem was on POS tagging from Merialdo (1994). The approach involved training a standard Hidden Markov Model (HMM) using the Expectation Maximization (EM) algorithm (Dempster et al., 1977), though EM does notperform well on this task (Johnson, 2007). More recent methods have yielded better performance than EM (see (Ravi and Knight, 2009) for an overview). One interesting line of research introduced by Ravi and Knight (2009) explores the idea of performing model minimization followed by EM training to learn taggers. Their idea is closely related to the classic Minimum Description Length principle for model selection (Barron et al., 1998). They (1) formulate an objective function to find the smallest model that explains the text (model minimization step), and then, (2) fit the minimized model to the data (EM step). For POS tagging, this method (Ravi and Knight, 2009) yields the best performance to date; 91.6% tagging accuracy on a standard test dataset from the English Penn Treebank. The original work from (Ravi and Knight, 2009) uses an integer linear programming (ILP) formulation to find minimal models, an approach which does not scale to large datasets. Ravi et al. (2010b) introduced a two-step greedy approximation to the original objective function (called the MIN-GREEDY algorithm) that runs much faster while maintaining the high tagging performance. 
Garrette and Baldridge (2012) showed how to use several heuristics to further improve this algorithm (for instance, better choice of tag bigrams when breaking ties) and stack other techniques on top, such as careful initialization of HMM emission models which results in further performance gains. Their method also works under incomplete dictionary scenarios and can be applied to certain low-resource scenarios (Garrette and Baldridge, 2013) by combining model minimization with supervised training. In this work, we propose a new scalable algorithm for performing model minimization for this task. By making an assumption on the structure of the solution, we prove that a variant of the greedy set cover algorithm always finds an approximately optimal label set. This is in contrast to previous methods that employ heuristic approaches with no guarantee on the quality of the solution. In addition, we do not have to rely on ad hoc tie-breaking procedures or careful initializations for unknown words. Finally, not only is the proposed method approximately optimal, it is also easy to distribute, allowing it to easily scale to very large datasets. We show empirically that our method, combined with an EM training step outperforms existing state of the art systems. 1.1 Our Contributions • We present a new method, DISTRIBUTED MINIMUM LABEL COVER, DMLC, for model minimization that uses a fast, greedy algorithm with formal approximation guarantees to the quality of the solution. • We show how to efficiently parallelize the algorithm while preserving approximation guarantees. In contrast, existing minimization approaches cannot match the new distributed algorithm when scaling from thousands to millions or even billions of tokens. • We show that our method easily scales to both large data and grammar sizes, and does not require the corpus or label set to fit into memory. This allows us to tackle complex tagging tasks, where the tagset consists of several thousand labels, which results in more than one million entires in the grammar. • We demonstrate the power of the new method by evaluating under several different scenarios—POS tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, as well as a complex sequence labeling task of supertagging. Our results show that for all these settings, our method achieves state-of-the-art performance yielding high quality taggings. 2 Related Work Recently, there has been an increasing amount of research tackling this problem from multiple directions. Some efforts have focused on inducing POS tag clusters without any tags (Christodoulopoulos et al., 2010; Reichart et al., 2010; Moon et al., 2010), but evaluating such systems proves dif- ficult since it is not straightforward to map the cluster labels onto gold standard tags. A more popular approach is to learn from POS-tag dictionaries (Merialdo, 1994; Ravi and Knight, 2009), incomplete dictionaries (Hasan and Ng, 2009; Garrette and Baldridge, 2012) and human-constructed dictionaries (Goldberg et al., 2008).Another direction that has been explored in the past includes bootstrapping taggers for a new language based on information acquired from other languages (Das and Petrov, 2011) or limited annotation resources (Garrette and Baldridge, 2013). Additional work focused on building supervised taggers for noisy domains such as Twitter (Gimpel et al., 2011). 
While most of the relevant work in this area centers on POS tagging, there has been some work on building taggers for more complex sequence labeling tasks such as supertagging (Ravi et al., 2010a). Other related work includes alternative methods for learning sparse models via priors in Bayesian inference (Goldwater and Griffiths, 2007) and posterior regularization (Ganchev et al., 2010). But these methods only encourage sparsity and do not explicitly seek to minimize the model size, which is the objective function used in this work. Moreover, taggers learned using model minimization have been shown to produce state-of-the-art results for the problems discussed here.

3 Model

Following Ravi and Knight (2009), we formulate the problem as that of label selection on the sentence graph. Formally, we are given a set of sequences, $S = \{S_1, S_2, \ldots, S_n\}$, where each $S_i$ is a sequence of words, $S_i = w_{i1}, w_{i2}, \ldots, w_{i,|S_i|}$. With each word $w_{ij}$ we associate a set of possible tags $T_{ij}$. We will denote by $m$ the total number of (possibly duplicate) words (tokens) in the corpus. Additionally, we define two special words $w_0$ and $w_\infty$ with special tags start and end, and consider the modified sequences $S'_i = w_0, S_i, w_\infty$. To simplify notation, we will refer to $w_\infty = w_{|S_i|+1}$. The sequence labeling problem asks us to select a valid tag $t_{ij} \in T_{ij}$ for each word $w_{ij}$ in the input to minimize a specific objective function. We will refer to a tag pair $(t_{i,j-1}, t_{ij})$ as a label. Our aim is to minimize the number of distinct labels used to cover the full input. Formally, given a sequence $S'_i$ and a tag $t_{ij}$ for each word $w_{ij}$ in $S'_i$, let the induced set of labels for sequence $S'_i$ be

$$L_i = \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}.$$

The total number of distinct labels used over all sequences is then

$$\phi = \Big|\bigcup_i L_i\Big| = \Big|\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}\Big|.$$

Note that the order of the tokens in the label makes a difference, as {(NN, VP)} and {(VP, NN)} are two distinct labels. Now we can define the problem formally, following (Ravi and Knight, 2009).

Problem 1 (Minimum Label Cover). Given a set S of sequences of words, where each word $w_{ij}$ has a set of valid tags $T_{ij}$, the problem is to find a valid tag assignment $t_{ij} \in T_{ij}$ for each word that minimizes the number of distinct labels or tag pairs over all sequences, $\phi = \big|\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}\big|$.

The problem is closely related to the classical Set Cover problem and is also NP-complete. To reduce Set Cover to the label selection problem, map each element $i$ of the Set Cover instance to a single-word sentence $S_i = w_{i1}$, and let the valid tags $T_{i1}$ contain the names of the sets that contain element $i$. Consider a solution to the label selection problem: every sentence $S_i$ is covered by two labels $(w_0, k_i)$ and $(k_i, w_\infty)$, for some $k_i \in T_{i1}$, which corresponds to an element $i$ being covered by set $k_i$ in the Set Cover instance. Thus any valid solution to the label selection problem leads to a feasible solution to the Set Cover problem ($\{k_1, k_2, \ldots\}$) of exactly half the size. Finally, we will use $\{\{\ldots\}\}$ notation to denote a multiset of elements, i.e., a set where an element may appear multiple times.

4 Algorithm

In this section, we describe the DISTRIBUTED-MINIMUM-LABEL-COVER (DMLC) algorithm for approximately solving the minimum label cover problem. We describe the algorithm in a centralized setting, and defer the distributed implementation to Section 5. Before describing the algorithm, we briefly explain the relationship of the minimum label cover problem to set cover.
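The objective $\phi$ is straightforward to compute for any candidate assignment; the following sketch does so for a hypothetical toy corpus, padding each sentence with the special start and end tags as in the definition of $S'_i$.

    def label_cover_size(sentences, tags):
        # phi: number of distinct tag-pair labels induced by a tag assignment,
        # padding each sentence with the special start and end tags.
        labels = set()
        for words, sent_tags in zip(sentences, tags):
            padded = ["start"] + list(sent_tags) + ["end"]
            for prev, cur in zip(padded, padded[1:]):
                labels.add((prev, cur))
        return len(labels)

    # Hypothetical toy assignment for two short sentences:
    sentences = [["the", "quail", "runs"], ["the", "eggs"]]
    tags = [["DT", "NN", "VB"], ["DT", "NN"]]
    print(label_cover_size(sentences, tags))   # 5 distinct labels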
4.1 Modification of Set Cover As we pointed out earlier, the minimum label cover problem is at least as hard as the Set Cover prob-1: Input: A set of sequences S with each words wij having possible tags Tij . 2: Output: A tag assignment tij ∈ Tij for each word wij approximately minimizing labels. 3: Let M be the multi set of all possible labels generated by choosing each possible tag t ∈ Tij . M = [ i   |S [i|+1 j=1 [ t 0∈Ti,j−1 t∈Tij {{(t 0 , t)}}   (1) 4: Let L = ∅ be the set of selected labels. 5: repeat 6: Select the most frequent label not yet selected: (t 0 , t) = arg max(s 0 ,s)∈L/ |M ∩ (s 0 , s)|. 7: For each bigram (wi,j−1, wij ) where t 0 ∈ Ti,j−1 and t ∈ Tij tentatively assign t 0 to wi,j−1 and t to wij . Add (t 0 , t) to L. 8: If a word gets two assignments, select one at random with equal probability. 9: If a bigram (wij , wi,j+1) is consistent with assignments in (t, t0 ), fix the tentative assignments, and set Ti,j−1 = {t 0} and Tij = t. Recompute M, the multiset of possible labels, with the updated Ti,j−1 and Tij . 10: until there are no unassigned words Algorithm 1: MLC Algorithm 1: Input: A set of sequences S with each words wij having possible tags Tij . 2: Output: A tag assignment tij ∈ Tij for each word wij approximately minimizing labels. 3: (Graph Creation) Initialize each vertex vij with the set of possible tags Tij and its neighbors vi,j+1 and vi,j−1. 4: repeat 5: (Message Passing) Each vertex vij sends its possibly tags Tij to its forward neighbor vij+1. 6: (Counter Update) Each vertex receives the the tags Ti,j−1 and adds all possible labels {(s, s0 )|s ∈ Ti,j−1, s0 ∈ Tij} to a global counter (M). 7: (MaxLabel Selection) Each vertex queries the global counter M to find the maximum label (t, t0 ). 8: (Tentative Assignment) Each vertex vij selects a tag tentatively as follows: If one of the tags t, t0 is in the feasible set Tij , it tentatively selects the tag. 9: (Random Assignment) If both are feasible it selects one at random. The vertex communicates its assignment to its neighbors. 10: (Confirmed Assignment) Each vertex receives the tentative assignment from its neighbors. If together with its neighbors it can match the selected label, the assignment is finalized. If the assigned tag is T, then the vertex vij sets the valid tag set Tij to {t}. 11: until no unassigned vertices exist. Algorithm 2: DMLC Implementation lem. An additional challenge comes from the fact that labels are tags for a pair of words, and hence are related. For example, if we label a word pair (wi,j−1, wij ) as (NN, VP), then the label for the next word pair (wij , wi,j+1) has to be of the form (VP, *), i.e., it has to start with VP. Previous work (Ravi et al., 2010a; Ravi et al., 2010b) recognized this challenge and employed two phase heuristic approaches. Eschewing heuristics, we will show that with one natural assumption, even with this extra set of constraints, the standard greedy algorithm for this problem results in a solution with a provable approximation ratio of O(log m). In practice, however, the algorithm performs far better than the worst case ratio, and similar to the work of (Gomes et al., 2006), we find that the greedy approach selects a cover approximately 11% worse than the optimum solution. 4.2 MLC Algorithm We present in Algorithm 1 our MINIMUM LABEL COVER algorithm to approximately solve the minimum label cover problem. The algorithm is simple, efficient, and easy to distribute. 
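Before the step-by-step walk-through that follows, here is a deliberately simplified, single-machine sketch of the greedy loop in Algorithm 1 (MLC). It collapses the tentative/random assignment machinery described below into a direct consistency check, so it illustrates the selection idea rather than reproducing the paper's algorithm exactly.

    from collections import Counter

    def greedy_mlc(cand):
        # cand: list of sentences; each sentence is a list of candidate-tag sets,
        # already padded with {'start'} and {'end'} at both ends.
        selected = set()
        while any(len(t) > 1 for sent in cand for t in sent):
            # Multiset M of labels still possible under the current candidate sets.
            m = Counter((p, c) for sent in cand
                        for left, right in zip(sent, sent[1:])
                        for p in left for c in right)
            remaining = [lbl for lbl, _ in m.most_common() if lbl not in selected]
            if not remaining:
                break
            p, c = remaining[0]               # most frequent label not yet selected
            selected.add((p, c))
            for sent in cand:                 # fix every adjacent pair consistent with it
                for j in range(1, len(sent)):
                    if p in sent[j - 1] and c in sent[j]:
                        sent[j - 1], sent[j] = {p}, {c}
        return [[sorted(t)[0] for t in sent] for sent in cand]

    # Hypothetical toy input: one sentence with two ambiguous words.
    cand = [[{"start"}, {"DT"}, {"NN", "VB"}, {"NN", "VB"}, {"end"}]]
    print(greedy_mlc(cand))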
The algorithm chooses labels one at a time, selecting in every iteration a label that covers as many words as possible. For this, it generates and maintains a multiset M of all possible labels (Step 3). The multiset contains an occurrence of each valid label; for example, if w_{i,j−1} has two possible valid tags, NN and VP, and w_{ij} has one possible valid tag, VP, then M will contain two labels, namely (NN, VP) and (VP, VP). Since M is a multiset it will contain duplicates, e.g., the label (NN, VP) will appear once for each adjacent pair of words that have NN and VP as valid tags, respectively.

In each iteration, the algorithm picks the label with the largest number of occurrences in M and adds it to the set of chosen labels (Step 6). Intuitively, this is a greedy step that selects the label covering the largest number of word pairs. Once the algorithm picks a label (t', t), it tries to assign as many words as possible to tags t or t' (Step 7). A word can be assigned t' if t' is a valid tag for it and t is a valid tag for the next word in the sequence. Similarly, a word can be assigned t if t is a valid tag for it and t' is a valid tag for the previous word. Some words can get both assignments, in which case we choose one tentatively at random (Step 8). If a word's tentative random tag, say t, is consistent with the choices of its adjacent words (say t' from the previous word), then the tentative choice is fixed as a permanent one. Whenever a tag is selected, the set of valid tags T_{ij} for the word is reduced to the singleton {t}. Once the set of valid tags T_{ij} changes, the multiset M of all possible labels also changes, as seen from Eq. (1). The multiset is then recomputed (Step 9) and the iterations are repeated until all words have been tagged.

We can show that under a natural assumption this simple algorithm is approximately optimal.

Assumption 1 (c-feasibility). Let c ≥ 1 be any number, and let k be the size of the optimal solution to the original problem. In each iteration, the MLC algorithm fixes the tags for some words. We say that the algorithm is c-feasible if, after each iteration, there exists some solution to the remaining problem, consistent with the chosen tags, of size at most ck.

The assumption encodes the fact that a single bad greedy choice is not going to destroy the overall structure of the solution, and that a nearly optimal solution remains. We note that this assumption of c-feasibility is not only sufficient, as we will formally show, but is also necessary. Indeed, without any assumptions, once the algorithm fixes the tag for some words, an optimal label may no longer be consistent with the chosen tags, and it is not hard to find contrived examples where the size of the optimal solution doubles after each iteration of MLC.

Since the underlying problem is NP-complete, it is computationally hard to give direct evidence verifying the assumption on natural language inputs. However, on small examples we are able to show that the greedy algorithm is within a small constant factor of the optimum; specifically, it is within 11% of the optimum model size for the POS tagging problem using the standard 24k dataset (Ravi and Knight, 2009). Combined with the fact that the final method outperforms state-of-the-art approaches, this leads us to conclude that the structural assumption is well justified.

Lemma 1. Under the assumption of c-feasibility, the MLC algorithm achieves an O(c log m) approximation to the minimum label cover problem, where m = Σ_i |S_i| is the total number of tokens.
Proof. To prove the lemma we define an objective function φ̄, counting the number of unlabeled word pairs as a function of the selected labels, and show that φ̄ decreases by a factor of (1 − O(1/(ck))) at every iteration. To define φ̄, we first define φ, the number of labeled word pairs. Consider a particular set of labels, L = {L_1, L_2, ..., L_k}, where each label is a pair (t_i, t_j). Call {t_{ij}} a valid assignment of tokens if for each w_{ij} we have t_{ij} ∈ T_{ij}. The score of L under an assignment t, which we denote by φ_t, is the number of word-pair positions whose bigram label appears in L,

φ_t(L) = |∪_{i,j} {{(t_{i,j−1}, t_{ij})}} ∩ L|.

Finally, we define φ(L) to be the score of the best such assignment, φ(L) = max_t φ_t(L), and φ̄(L) = m − φ(L) the number of uncovered word pairs.

Consider the label selected by the algorithm in each step. By the c-feasibility assumption, there exists some solution having ck labels. Thus, some label from that solution covers at least a 1/(ck) fraction of the remaining words. The selected label (t', t) maximizes the intersection with the remaining feasible labels. The conflict resolution step ensures that, in expectation, the realized benefit is at least half of the maximum, thereby reducing φ̄ by at least a 1/(2ck) fraction. Therefore, after O(ck log m) iterations all of the word pairs are covered.

4.3 Fitting the Model Using EM

Once the greedy algorithm terminates and returns a minimized grammar of tag bigrams, we follow the approach of Ravi and Knight (2009) and fit the minimized model to the data using the alternating EM strategy. In this step, we run an alternating optimization procedure iteratively in phases. In each phase, we initialize (and prune away) parameters within the two HMM components (transition or emission model) using the output from the previous phase. We initialize this procedure by restricting the transition parameters to only those tag bigrams selected in the model minimization step. We train this in conjunction with the original emission model using the EM algorithm, which prunes away some of the emission parameters. In the next phase, we alternate the initialization by choosing the pruned emission model along with the original transition model (with the full set of tag bigrams) and retrain using EM. The alternating EM iterations are terminated when the change in the size of the observed grammar (i.e., the number of unique bigrams in the tagging output) is ≤ 5%. (For more details on the alternating EM strategy and on how initialization with minimized models improves EM performance in alternating iterations, refer to Ravi and Knight (2009).) We refer to our entire approach of greedy minimization followed by EM training as DMLC + EM.
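The following is a schematic sketch of the alternating EM procedure just described, under stated assumptions: train_hmm_em and prune_unused are hypothetical helpers standing in for a standard HMM EM trainer and for the pruning of parameters EM drives to (near) zero; they are not part of the paper.

    # Schematic only; the helper functions are hypothetical placeholders.
    def alternating_em(raw_text, full_trans, full_emit, minimized_bigrams,
                       train_hmm_em, prune_unused, tol=0.05):
        # Phase 1: transitions restricted to the minimized tag-bigram grammar,
        # paired with the original emission model.
        trans = {bg: p for bg, p in full_trans.items() if bg in minimized_bigrams}
        emit = dict(full_emit)
        prev_size = None
        while True:
            trans, emit, tagging = train_hmm_em(raw_text, trans, emit)
            emit = prune_unused(emit)
            # Phase 2: pruned emissions paired with the full transition model.
            trans, emit, tagging = train_hmm_em(raw_text, dict(full_trans), emit)
            # Observed grammar = unique tag bigrams in the current tagging output.
            size = len({bg for sent in tagging for bg in zip(sent, sent[1:])})
            if prev_size is not None and abs(size - prev_size) / prev_size <= tol:
                return trans, emit, tagging   # grammar size change is within 5%
            prev_size = size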
5 Distributed Implementation

The DMLC algorithm is directly suited to parallelization across many machines. We turn to Pregel (Malewicz et al., 2010) and its open-source version Giraph (Apa, 2013). In these systems the computation proceeds in rounds. In every round, every machine does some local processing and then sends arbitrary messages to other machines. Semantically, we think of the communication graph as fixed; in each round, each vertex performs some local computation and then sends messages to its neighbors. This mode of parallel programming directs programmers to "think like a vertex." Systems like Pregel and Giraph build infrastructure that ensures that the overall system is fault tolerant, efficient, and fast. In addition, they provide implementations of commonly used distributed data structures such as global counters. The programmer's job is simply to specify the code that each vertex runs in every round.

We implemented the DMLC algorithm in Pregel. The implementation is straightforward and is given in Algorithm 2. The multiset M of Algorithm 1 is represented as a global counter in Algorithm 2. The Message Passing and Counter Update steps update this global counter and hence play the role of Step 3 of Algorithm 1. The MaxLabel Selection step selects the label with the largest count, which is equivalent to the greedy label-picking Step 6 of Algorithm 1. Finally, the Tentative, Random, and Confirmed Assignment steps update the tag assignment of each vertex, performing the roles of Steps 7, 8, and 9, respectively, of Algorithm 1.

5.1 Speeding up the Algorithm

The implementation described above directly mirrors the sequential algorithm. Here we describe additional steps we took to further improve the parallel running times.

Singleton sets: As the parallel algorithm proceeds, the set of feasible tags associated with a node slowly shrinks. At some point there may be only one tag that a node can take on; if this tag is rare, it can take a while for it to be selected by the greedy strategy. Nevertheless, if a node and one of its neighbors each have only a single tag left, then it is safe to assign the unique label directly. (We must judiciously initialize the global counter to take care of this assignment, but this is easily accomplished.)

Modifying the graph: As is often the case, the bottleneck in parallel computations is communication. To reduce the amount of communication we shrink the graph on the fly, removing nodes and edges once they no longer play a role in the computation. This simple modification decreases the communication time in later rounds as the total size of the problem shrinks.

6 Experiments and Results

In this section, we describe the experimental setup for the various tasks and settings, and compare the empirical performance of our method against several existing baselines. The performance of all systems (on all tasks) is measured in terms of tagging accuracy, i.e., the percentage of tokens from the test corpus that are labeled correctly by the system.

6.1 Part-of-Speech Tagging Task

6.1.1 Tagging Using a Complete Dictionary

Data: We use a standard test set (consisting of 24,115 word tokens from the Penn Treebank) for the POS tagging task. The tagset consists of 45 distinct tag labels, and the dictionary contains 57,388 word/tag pairs derived from the entire Penn Treebank. Per-token ambiguity for the test data is about 1.5 tags/token. In addition to the standard 24k dataset, we also train and test on larger data sets: 973k tokens from the Penn Treebank and 3M tokens from PTB+Europarl (Koehn, 2005) data.

Methods: We evaluate and compare performance for POS tagging using four different methods that employ the model minimization idea combined with EM training:

• EM: Training a bigram HMM model using the EM algorithm (Merialdo, 1994).
• ILP + EM: Minimizing grammar size using integer linear programming, followed by EM training (Ravi and Knight, 2009).
• MIN-GREEDY + EM: Minimizing grammar size using the two-step greedy method (Ravi et al., 2010b).
• DMLC + EM: This work.

Results: Table 1 shows the results for POS tagging on English Penn Treebank data.
On the smaller test datasets, all of the model minimization strategies (methods 2, 3, 4) tend to perform equally well, yielding state-of-the-art results and a large improvement over standard EM. When training (and testing) on larger corpora, DMLC yields the best reported performance on this task to date. A major advantage of the new method is that it easily scales to large corpora, and the distributed nature of the algorithm still permits fast, efficient optimization of the global objective function. So, unlike earlier methods (such as MIN-GREEDY), it is fast enough to run on several millions of tokens to yield additional performance gains (shown in the last column).

Method                                    te=24k     te=973k    te=973k
                                          tr=24k     tr=973k    tr=3.7M
1. EM                                      81.7       82.3         -
2. ILP + EM (Ravi and Knight, 2009)        91.6        -           -
3. MIN-GREEDY + EM (Ravi et al., 2010b)    91.6       87.1         -
4. DMLC + EM (this work)                   91.4       87.5        87.8

Table 1: Results for unsupervised part-of-speech tagging on the English Penn Treebank dataset. Tagging accuracies (%) for the different methods are shown on multiple datasets. te shows the size (number of tokens) of the test data; tr represents the size of the raw text used to perform model minimization.

Speedups: We also observe a significant speedup when using the parallelized version of the DMLC algorithm. Performing model minimization on the 24k-token dataset takes 55 seconds on a single machine, whereas parallelization makes model minimization feasible even on large datasets. Figure 1 shows the running time for DMLC on a cluster of 100 machines. We vary the input data size from 1M word tokens to about 8M word tokens while holding the resources constant. Both the algorithm and its distributed implementation in DMLC are linear-time operations, as is evident from the plot. For comparison, we also plot a straight line passing through the first two runtimes; this line corresponds to a linear speedup. DMLC clearly achieves better runtimes, showing an even better than linear speedup. The reason is that the distributed version has a constant initialization overhead, independent of the data size, while the running time of the rest of the implementation is linear in the data size. Thus, as the data size becomes larger, the constant overhead becomes less significant, and the distributed implementation appears to complete slightly faster as data size increases.

Figure 1: Runtime vs. data size (measured in number of word tokens) on 100 machines. For comparison, a straight line passing through the first two runtimes marks a linear speedup; DMLC achieves a better than linear speedup.

6.1.2 Tagging Using Incomplete Dictionaries

We also evaluate our approach for POS tagging under other resource-constrained scenarios. Obtaining a complete dictionary is often difficult, especially for new domains. To verify the utility of our method when the input dictionary is incomplete, we evaluate against standard datasets used in previous work (Garrette and Baldridge, 2012) and compare against the previous best reported performance for the same task.
In all the experiments (described here and in subsequent sections), we use the following terminology: raw data refers to the unlabeled text used by the different methods (for model minimization or other unsupervised training procedures such as EM), the dictionary consists of the word/tag entries that are legal, and test refers to the data over which tagging evaluation is performed.

English data: For English POS tagging with an incomplete dictionary, we evaluate on the Penn Treebank (Marcus et al., 1993) data. Following Garrette and Baldridge (2012), we extracted a word-tag dictionary from sections 00-15 (751,059 tokens) consisting of 39,087 word types and 45,331 word/tag entries, with a per-type ambiguity of 1.16, yielding a per-token ambiguity of 2.21 on the raw corpus (treating unknown words as having all 45 possible tags). As in their setup, we then use the first 47,996 tokens of section 16 as raw data and perform the final evaluation on sections 22-24. We use the raw corpus along with the unlabeled test data to perform model minimization and EM training. Unknown words are allowed to take all possible tags in both of these procedures.

Italian data: The minimization strategy presented here is a general-purpose method that does not require any specific tuning and works for other languages as well. To demonstrate this, we also perform evaluation on a different language (Italian) using the TUT corpus (Bosco et al., 2000). Following Garrette and Baldridge (2012), we use the same data splits as their setting: we take the first half of each of the five sections to build the word-tag dictionary, the next quarter as raw data, and the last quarter as test data. The dictionary was constructed from 41,000 tokens comprising 7,814 word types and 8,370 word/tag pairs, with a per-type ambiguity of 1.07 and a per-token ambiguity of 1.41 on the raw data. The raw data consisted of 18,574 tokens and the test data contained 18,763 tokens. We use the unlabeled corpus from the raw and test data to perform model minimization followed by unsupervised EM training.

Other languages: To test the effectiveness of our method in other non-English settings, we also report its performance on several other Indo-European languages using treebank data from the CoNLL-X and CoNLL-2007 shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007). The corpus statistics for the five languages (Danish, Greek, Italian, Portuguese, and Spanish) are listed below. For each language, we construct a dictionary from the raw training data. The unlabeled corpus from the raw training and test data is used to perform model minimization followed by unsupervised EM training. As before, unknown words are allowed to take all possible tags. We report the final tagging performance on the test data and compare it to the EM baseline.

Garrette and Baldridge (2012) treat unknown words (words that appear in the raw text but are missing from the dictionary) in a special manner and use several heuristics for better initialization of such words (for example, the probability that an unknown word is associated with a particular tag is conditioned on the openness of the tag). They also use an auto-supervision technique to smooth counts learnt from EM onto new words encountered during testing. In contrast, we do not apply any such techniques for unknown words and allow them to be mapped uniformly to all possible tags in the dictionary.
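As a concrete illustration of the dictionary statistics quoted above (per-type and per-token ambiguity), the following short Python sketch computes both quantities from a word/tag dictionary and a list of raw tokens; the tiny dictionary and tagset here are made up for the example.

    # Illustrative only; the toy dictionary, tokens, and 45-tag stand-in are invented.
    def ambiguity_stats(dictionary, raw_tokens, all_tags):
        """dictionary: word -> set of allowed tags; raw_tokens: list of tokens."""
        per_type = sum(len(tags) for tags in dictionary.values()) / len(dictionary)
        # Unknown words are treated as allowing every tag, as in the setup above.
        per_token = sum(len(dictionary.get(w, all_tags)) for w in raw_tokens) / len(raw_tokens)
        return per_type, per_token

    toy_dict = {"the": {"DT"}, "run": {"VB", "NN"}, "flies": {"VBZ", "NNS"}}
    toy_raw = ["the", "flies", "run", "quickly"]          # "quickly" is unknown
    print(ambiguity_stats(toy_dict, toy_raw, all_tags=set(range(45))))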
For this particular set of experiments, the only difference from the Garrette and Baldridge (2012) setup is that we add the unlabeled text from the test data (but without any dictionary tag labels or special heuristics) to our existing word tokens from the raw text when performing model minimization. This is standard practice in unsupervised training scenarios (for example, Bayesian inference methods), and in general for scalable techniques where the goal is to perform inference on the same data for which one wishes to produce some structured prediction.

Language       Train (tokens)   Dict (entries)   Test (tokens)
DANISH              94386            18797            5852
GREEK               65419            12894            4804
ITALIAN             71199            14934            5096
PORTUGUESE         206678            30053            5867
SPANISH             89334            17176            5694

Results: Table 2 (column 2) compares previously reported results against our approach for English. We observe that our method obtains a huge improvement over standard EM and achieves results comparable to the previous best reported scores for the same task from Garrette and Baldridge (2012). It is encouraging that the new system achieves this performance without using any of the carefully chosen heuristics employed by the previous method. We do note, however, that some of these techniques can easily be combined with our method to produce further improvements.

Table 2 (column 3) also shows results for Italian POS tagging. We observe that our method achieves significant improvements in tagging accuracy over all the baseline systems, including the previous best system (+2.9%). This demonstrates that the method generalizes well to other languages and produces consistent tagging improvements over existing methods for the same task.

Results for POS tagging on CoNLL data in five different languages are displayed in Figure 2.

Figure 2: Part-of-speech tagging accuracy for different languages on CoNLL data using incomplete dictionaries, comparing EM with DMLC+EM.

Note that the proportion of raw data in test versus train (from the standard CoNLL shared tasks) is much smaller compared to the earlier experimental settings. In general, we observe that adding more raw data for EM training improves tagging quality (the same trend observed earlier in Table 1: column 2 versus column 3). Despite this, DMLC + EM still achieves significant improvements over the baseline EM system on multiple languages (as shown in Figure 2). An additional advantage of the new method is that it easily scales to larger corpora and produces a much more compact grammar that can be efficiently incorporated for EM training.

6.1.3 Tagging for Low-Resource Languages

Learning part-of-speech taggers for severely low-resource languages (e.g., Malagasy) is very challenging. In addition to scarce (token-supervised) labeled resources, the tag dictionaries available for training taggers are tiny compared to those for other languages such as English. Garrette and Baldridge (2013) combine various supervised and semi-supervised learning algorithms into a common POS tagger training pipeline to address some of these challenges. They also report tagging accuracy improvements on low-resource languages when using the combined system over any single algorithm. Their system has four main parts, in order: (1) tag dictionary expansion using a label propagation algorithm, (2) weighted model minimization, (3) expectation maximization (EM) training of HMMs using auto-supervision, (4) MaxEnt Markov Model (MEMM) training.
The entire procedure results in a trained tagger model that can then be applied to tag any raw data (for more details, refer to Garrette and Baldridge, 2013). Step 2 in this procedure involves weighted model minimization.

"We consider a model of repeated online auctions in which an ad with an uncertain click-through rate faces a random distribution of competing bids in each auction and there is discounting of payoffs. We formulate the optimal solution to this explore/exploit problem as a dynamic programming problem and show that efficiency is maximized by making a bid for each advertiser equal to the advertiser's expected value for the advertising opportunity plus a term proportional to the variance in this value divided by the number of impressions the advertiser has received thus far. We then use this result to illustrate that the value of incorporating active exploration into a machine learning system in an auction environment is exceedingly small."

Accepted for publication in the Annals of Applied Statistics (in press), 09/2014

INFERRING CAUSAL IMPACT USING BAYESIAN STRUCTURAL TIME-SERIES MODELS

By Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott
Google, Inc. E-mail: kbrodersen@google.com

Abstract. An important problem in econometrics and marketing is to infer the causal impact that a designed market intervention has exerted on an outcome metric over time. This paper proposes to infer causal impact on the basis of a diffusion-regression state-space model that predicts the counterfactual market response in a synthetic control that would have occurred had no intervention taken place. In contrast to classical difference-in-differences schemes, state-space models make it possible to (i) infer the temporal evolution of attributable impact, (ii) incorporate empirical priors on the parameters in a fully Bayesian treatment, and (iii) flexibly accommodate multiple sources of variation, including local trends, seasonality, and the time-varying influence of contemporaneous covariates. Using a Markov chain Monte Carlo algorithm for posterior inference, we illustrate the statistical properties of our approach on simulated data. We then demonstrate its practical utility by estimating the causal effect of an online advertising campaign on search-related site visits. We discuss the strengths and limitations of state-space models in enabling causal attribution in those settings where a randomised experiment is unavailable. The CausalImpact R package provides an implementation of our approach.

Keywords and phrases: causal inference, counterfactual, synthetic control, observational, difference in differences, econometrics, advertising, market research

1. Introduction. This article proposes an approach to inferring the causal impact of a market intervention, such as a new product launch or the onset of an advertising campaign. Our method generalizes the widely used 'difference-in-differences' approach to the time-series setting by explicitly modelling the counterfactual of a time series observed both before and after the intervention. It improves on existing methods in two respects: it provides a fully Bayesian time-series estimate for the effect, and it uses model averaging to construct the most appropriate synthetic control for modelling the counterfactual. The CausalImpact R package provides an implementation of our approach (http://google.github.io/CausalImpact/).

Inferring the impact of market interventions is an important and timely problem.
Partly because of recent interest in 'big data', many firms have begun to understand that a competitive advantage can be had by systematically using impact measures to inform strategic decision making. An example is the use of 'A/B experiments' to identify the most effective market treatments for the purpose of allocating resources (Danaher and Rust, 1996; Seggie, Cavusgil and Phelan, 2007; Leeflang et al., 2009; Stewart, 2009).

Here, we focus on measuring the impact of a discrete marketing event, such as the release of a new product, the introduction of a new feature, or the beginning or end of an advertising campaign, with the aim of measuring the event's impact on a response metric of interest (e.g., sales). The causal impact of a treatment is the difference between the observed value of the response and the (unobserved) value that would have been obtained under the alternative treatment, i.e., the effect of treatment on the treated (Rubin, 1974; Hitchcock, 2004; Morgan and Winship, 2007; Rubin, 2007; Cox and Wermuth, 2001; Heckman and Vytlacil, 2007; Antonakis et al., 2010; Kleinberg and Hripcsak, 2011; Hoover, 2012; Claveau, 2012). In the present setting the response variable is a time series, so the causal effect of interest is the difference between the observed series and the series that would have been observed had the intervention not taken place.

A powerful approach to constructing the counterfactual is based on the idea of combining a set of candidate predictor variables into a single 'synthetic control' (Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010). Broadly speaking, there are three sources of information available for constructing an adequate synthetic control. The first is the time-series behaviour of the response itself, prior to the intervention. The second is the behaviour of other time series that were predictive of the target series prior to the intervention. Such control series can be based, for example, on the same product in a different region that did not receive the intervention, or on a metric that reflects activity in the industry as a whole. In practice, there are often many such series available, and the challenge is to pick the relevant subset to use as contemporaneous controls. This selection is done on the pre-treatment portion of potential controls, but their value for predicting the counterfactual lies in their post-treatment behaviour. As long as the control series received no intervention themselves, it is often reasonable to assume that the relationship between the treatment and the control series that existed prior to the intervention continues afterwards. Thus, a plausible estimate of the counterfactual time series can be computed up to the point in time where the relationship between treatment and controls can no longer be assumed to be stationary, e.g., because one of the controls received treatment itself. In a Bayesian framework, a third source of information for inferring the counterfactual is the available prior knowledge about the model parameters, as elicited, for example, by previous studies.

We combine the three preceding sources of information using a state-space time-series model in which one component of state is a linear regression on the contemporaneous predictors.
The framework of our model allows us to choose from among a large set of potential controls by placing a spike-and-slab prior on the set of regression coefficients and by allowing the model to average over the set of controls (George and McCulloch, 1997). We then compute the posterior distribution of the counterfactual time series given the value of the target series in the pre-intervention period, along with the values of the controls in the post-intervention period. Subtracting the predicted from the observed response during the post-intervention period gives a semiparametric Bayesian posterior distribution for the causal effect (Figure 1).

Related work. As in other domains, causal inference in marketing requires subtlety. Marketing data are often observational and rarely follow the ideal of a randomised design. They typically exhibit a low signal-to-noise ratio. They are subject to multiple seasonal variations, and they are often confounded by the effects of unobserved variables and their interactions (for recent examples, see Seggie, Cavusgil and Phelan, 2007; Stewart, 2009; Leeflang et al., 2009; Takada and Bass, 1998; Chan et al., 2010; Lewis and Reiley, 2011; Lewis, Rao and Reiley, 2011; Vaver and Koehler, 2011, 2012).

Rigorous causal inferences can be obtained through randomised experiments, which are often implemented in the form of geo experiments (Vaver and Koehler, 2011, 2012). Many market interventions, however, fail to satisfy the requirements of such approaches. For instance, advertising campaigns are frequently launched across multiple channels, online and offline, which precludes measurement of individual exposure. Campaigns are often targeted at an entire country, and one country only, which prohibits the use of geographic controls within that country. Likewise, a campaign might be launched in several countries but at different points in time. Thus, while a large control group may be available, the treatment group often consists of no more than one region, or a few regions with considerable heterogeneity among them.

A standard approach to causal inference in such settings is based on a linear model of the observed outcomes in the treatment and control groups before and after the intervention. One can then estimate the difference between (i) the pre-post difference in the treatment group and (ii) the pre-post difference in the control group. The assumption underlying such difference-in-differences (DD) designs is that the level of the control group provides an adequate proxy for the level that would have been observed in the treatment group in the absence of treatment (see Lester, 1946; Campbell, Stanley and Gage, 1963; Ashenfelter and Card, 1985; Card and Krueger, 1993; Angrist and Krueger, 1999; Athey and Imbens, 2002; Abadie, 2005; Meyer, 1995; Shadish, Cook and Campbell, 2002; Donald and Lang, 2007; Angrist and Pischke, 2008; Robinson, McNulty and Krasno, 2009; Antonakis et al., 2010).

Figure 1. Inferring causal impact through counterfactual predictions. (a) Simulated trajectory of a treated market (Y) with an intervention beginning in January 2014. Two other markets (X1, X2) were not subject to the intervention and allow us to construct a synthetic control (cf. Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010). Inverting the state-space model described in the main text yields a prediction of what would have happened in Y had the intervention not taken place (posterior predictive expectation of the counterfactual with pointwise 95% posterior probability intervals). (b) The difference between observed data and counterfactual predictions is the inferred causal impact of the intervention. Here, predictions accurately reflect the true (Gamma-shaped) impact. A key characteristic of the inferred impact series is the progressive widening of the posterior intervals (shaded area). This effect emerges naturally from the model structure and agrees with the intuition that predictions should become increasingly uncertain as we look further and further into the (retrospective) future. (c) Another way of visualizing posterior inferences is by means of a cumulative impact plot. It shows, for each day, the summed effect up to that day. Here, the 95% credible interval of the cumulative impact crosses the zero line about five months after the intervention, at which point we would no longer declare a significant overall effect.

DD designs have been limited in three ways. First, DD is traditionally based on a static regression model that assumes i.i.d. data despite the fact that the design has a temporal component. When fit to serially correlated data, static models yield overoptimistic inferences with too-narrow uncertainty intervals (see also Solon, 1984; Hansen, 2007a,b; Bertrand, Duflo and Mullainathan, 2002). Second, most DD analyses only consider two time points: before and after the intervention. In practice, the manner in which an effect evolves over time, especially its onset and decay structure, is often a key question. Third, when DD analyses are based on time series, previous studies have imposed restrictions on the way in which a synthetic control is constructed from a set of predictor variables, which is something we wish to avoid. For example, one strategy (Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010) has been to choose a convex combination (w_1, ..., w_J), w_j ≥ 0, Σ_j w_j = 1, of J predictor time series in such a way that a vector of pre-treatment variables (not time series) X_1 characterising the treated unit before the intervention is matched most closely by the combination of pre-treatment variables X_0 of the control units w.r.t. a vector of importance weights (v_1, ..., v_J). These weights are themselves determined in such a way that the combination of pre-treatment outcome time series of the control units most closely matches the pre-treatment outcome time series of the treated unit. Such a scheme relies on the availability of interpretable characteristics (e.g., growth predictors), and it precludes non-convex combinations of controls when constructing the weight vector W. We prefer to select a combination of control series without reference to external characteristics and purely in terms of how well they explain the pre-treatment outcome time series of the treated unit (while automatically balancing goodness of fit and model complexity through the use of regularizing priors). Another idea (Belloni et al., 2013) has been to use classical variable-selection methods (such as the Lasso) to find a sparse set of predictors.
This approach, however, ignores posterior uncertainty about both which predictors to use and their coefficients.

The limitations of DD schemes can be addressed by using state-space models, coupled with highly flexible regression components, to explain the temporal evolution of an observed outcome. State-space models distinguish between a state equation that describes the transition of a set of latent variables from one time point to the next and an observation equation that specifies how a given system state translates into measurements. This distinction makes them extremely flexible and powerful (see Leeflang et al., 2009, for a discussion in the context of marketing research).

The approach described in this paper inherits three main characteristics from the state-space paradigm. First, it allows us to flexibly accommodate different kinds of assumptions about the latent state and emission processes underlying the observed data, including local trends and seasonality. Second, we use a fully Bayesian approach to inferring the temporal evolution of counterfactual activity and incremental impact. One advantage of this is the flexibility with which posterior inferences can be summarised. Third, we use a regression component that precludes a rigid commitment to a particular set of controls by integrating out our posterior uncertainty about the influence of each predictor as well as our uncertainty about which predictors to include in the first place, which avoids overfitting.

The remainder of this paper is organised as follows. Section 2 describes the proposed model, its design variations, the choice of diffuse empirical priors on hyperparameters, and a stochastic algorithm for posterior inference based on Markov chain Monte Carlo (MCMC). Section 3 demonstrates important features of the model using simulated data, followed by an application in Section 4 to an advertising campaign run by one of Google's advertisers. Section 5 puts our approach into context and discusses its scope of application.

2. Bayesian structural time-series models. Structural time-series models are state-space models for time-series data. They can be defined in terms of a pair of equations

y_t = Z_t^T α_t + ε_t,           (2.1)
α_{t+1} = T_t α_t + R_t η_t,     (2.2)

where ε_t ∼ N(0, σ_t²) and η_t ∼ N(0, Q_t) are independent of all other unknowns. Equation (2.1) is the observation equation; it links the observed data y_t to a latent d-dimensional state vector α_t. Equation (2.2) is the state equation; it governs the evolution of the state vector α_t through time. In the present paper, y_t is a scalar observation, Z_t is a d-dimensional output vector, T_t is a d × d transition matrix, R_t is a d × q control matrix, ε_t is a scalar observation error with noise variance σ_t², and η_t is a q-dimensional system error with a q × q state-diffusion matrix Q_t, where q ≤ d. Writing the error structure of equation (2.2) as R_t η_t allows us to incorporate state components of less than full rank; a model for seasonality will be the most important example.

Structural time-series models are useful in practice because they are flexible and modular. They are flexible in the sense that a very large class of models, including all ARIMA models, can be written in the state-space form given by (2.1) and (2.2). They are modular in the sense that the latent state as well as the associated model matrices Z_t, T_t, R_t, and Q_t can be assembled from a library of component sub-models to capture important features of the data.
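As a rough numerical illustration of equations (2.1) and (2.2) (a sketch only, not code from the paper or from the CausalImpact package), the following NumPy snippet simulates a structural time-series model with time-invariant matrices, here instantiated as the local linear trend component defined below.

    import numpy as np

    # Sketch: simulate y_t = Z' a_t + eps_t, a_{t+1} = T a_t + R eta_t.
    def simulate_state_space(Z, T, R, Q, sigma_obs, alpha0, n, seed=0):
        rng = np.random.default_rng(seed)
        q = Q.shape[0]
        alpha, y = alpha0.astype(float), np.empty(n)
        for t in range(n):
            y[t] = Z @ alpha + rng.normal(0.0, sigma_obs)     # observation equation (2.1)
            eta = rng.multivariate_normal(np.zeros(q), Q)
            alpha = T @ alpha + R @ eta                       # state equation (2.2)
        return y

    # Example: local linear trend (level + slope), cf. equation (2.3) below.
    Z = np.array([1.0, 0.0])
    T = np.array([[1.0, 1.0], [0.0, 1.0]])
    R = np.eye(2)
    Q = np.diag([0.1**2, 0.01**2])
    y = simulate_state_space(Z, T, R, Q, sigma_obs=0.5, alpha0=np.zeros(2), n=100)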
There are several widely used state-component models for capturing the trend, seasonality, or effects of holidays. A common approach is to assume the errors of different state-component models to be independent (i.e., Q_t is block-diagonal). The vector α_t can then be formed by concatenating the individual state components, while T_t and R_t become block-diagonal matrices.

The most important state component for the applications considered in this paper is a regression component that allows us to obtain counterfactual predictions by constructing a synthetic control based on a combination of markets that were not treated. Observed responses from such markets are important because they allow us to explain variance components in the treated market that are not readily captured by more generic seasonal sub-models. This approach assumes that covariates are unaffected by the effects of treatment. For example, an advertising campaign run in the United States might spill over to Canada or the United Kingdom. When assuming the absence of spill-over effects, the use of such indirectly affected markets as controls would lead to pessimistic inferences; that is, the effect of the campaign would be underestimated (cf. Meyer, 1995).

2.1. Components of state.

Local linear trend. The first component of our model is a local linear trend, defined by the pair of equations

μ_{t+1} = μ_t + δ_t + η_{μ,t},
δ_{t+1} = δ_t + η_{δ,t},            (2.3)

where η_{μ,t} ∼ N(0, σ_μ²) and η_{δ,t} ∼ N(0, σ_δ²). The μ_t component is the value of the trend at time t. The δ_t component is the expected increase in μ between times t and t + 1, so it can be thought of as the slope at time t.

The local linear trend model is a popular choice for modelling trends because it quickly adapts to local variation, which is desirable when making short-term predictions. This degree of flexibility may not be desired when making longer-term predictions, as such predictions often come with implausibly wide uncertainty intervals.

There is a generalization of the local linear trend model where the slope exhibits stationarity instead of obeying a random walk. This model can be written as

μ_{t+1} = μ_t + δ_t + η_{μ,t},
δ_{t+1} = D + ρ(δ_t − D) + η_{δ,t},     (2.4)

where the two components of η are independent. In this model, the slope of the time trend exhibits AR(1) variation around a long-term slope of D. The parameter |ρ| < 1 represents the learning rate at which the local trend is updated. Thus, the model balances short-term information with information from the distant past.

Seasonality. There are several commonly used state-component models to capture seasonality. The most frequently used model in the time domain is

γ_{t+1} = − Σ_{s=0}^{S−2} γ_{t−s} + η_{γ,t},     (2.5)

where S represents the number of seasons and γ_t denotes their joint contribution to the observed response y_t. The state in this model consists of the S − 1 most recent seasonal effects, but the error term is a scalar, so the evolution equation for this state model is of less than full rank. The mean of γ_{t+1} is such that the total seasonal effect is zero when summed over S seasons. For example, if we set S = 4 to capture four seasons per year, the mean of the winter coefficient will be −1 × (spring + summer + autumn). The part of the transition matrix T_t representing the seasonal model is an (S−1) × (S−1) matrix with −1's along the top row, 1's along the subdiagonal, and 0's elsewhere.
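As a small illustration of the modular assembly described above (again a sketch, not the paper's code), the following builds the (S−1) × (S−1) seasonal transition block from equation (2.5) and stacks it with the local linear trend block into a block-diagonal transition matrix.

    import numpy as np
    from scipy.linalg import block_diag

    def seasonal_transition(S):
        """(S-1)x(S-1) block: -1's along the top row, 1's on the subdiagonal."""
        T = np.zeros((S - 1, S - 1))
        T[0, :] = -1.0
        T[1:, :-1] = np.eye(S - 2)
        return T

    trend_T = np.array([[1.0, 1.0],   # level mu_t receives the slope
                        [0.0, 1.0]])  # slope delta_t follows a random walk
    T_full = block_diag(trend_T, seasonal_transition(S=4))   # trend + quarterly seasonality
    print(T_full)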
The preceding seasonal model can be generalized to allow for multiple seasonal components with different periods. When modelling daily data, for example, we might wish to allow for an S = 7 day-of-week effect as well as an S = 52 weekly annual cycle. The latter can be handled by setting T_t = I_{S−1}, with zero variance on the error term, when t is not the start of a new week, and setting T_t to the usual seasonal transition matrix, with nonzero error variance, when t is the start of a new week.

Contemporaneous covariates with static coefficients. Control time series that received no treatment are critical to our method for obtaining accurate counterfactual predictions, since they account for variance components that are shared by the series, including, in particular, the effects of other unobserved causes otherwise unaccounted for by the model. A natural way of including control series in the model is through a linear regression. Its coefficients can be static or time-varying. A static regression can be written in state-space form by setting Z_t = β^T x_t and α_t = 1. One advantage of working in a fully Bayesian treatment is that we do not need to commit to a fixed set of covariates. The spike-and-slab prior described in Section 2.2 allows us to integrate out our posterior uncertainty about which covariates to include and how strongly they should influence our predictions, which avoids overfitting. All covariates are assumed to be contemporaneous; the present model does not infer a potential lag between treated and untreated time series. A known lag, however, can be easily incorporated by shifting the corresponding regressor in time.

Contemporaneous covariates with dynamic coefficients. An alternative to the above is a regression component with dynamic regression coefficients to account for time-varying relationships (e.g., Banerjee, Kauffman and Wang, 2007; West and Harrison, 1997). Given covariates j = 1, ..., J, this introduces the dynamic regression component

x_t^T β_t = Σ_{j=1}^{J} x_{j,t} β_{j,t},
β_{j,t+1} = β_{j,t} + η_{β,j,t},     (2.6)

where η_{β,j,t} ∼ N(0, σ_{β_j}²). Here, β_{j,t} is the coefficient for the j-th control series and σ_{β_j} is the standard deviation of its associated random walk. We can write the dynamic regression component in state-space form by setting Z_t = x_t and α_t = β_t, and by setting the corresponding part of the transition matrix to T_t = I_{J×J}, with Q_t = diag(σ_{β_j}²).

Assembling the state-space model. Structural time-series models allow us to examine the time series at hand and flexibly choose appropriate components for trend, seasonality, and either static or dynamic regression for the controls. The presence or absence of seasonality, for example, will usually be obvious by inspection. A more subtle question is whether to choose static or dynamic regression coefficients.

When the relationship between controls and the treated unit has been stable in the past, static coefficients are an attractive option. This is because a spike-and-slab prior can then be implemented efficiently within a forward-filtering, backward-sampling framework. This makes it possible to quickly identify a sparse set of covariates even from tens or hundreds of potential variables (Scott and Varian, 2013). Local variability in the treated time series is captured by the dynamic local level or dynamic linear trend component. Covariate stability is typically high when the available covariates are close in nature to the treated metric. The empirical analyses presented in this paper, for example, will be based on a static regression component (Section 4).
This choice provides a reasonable compromise between capturing local behaviour and accounting for regression effects.

An alternative would be to use dynamic regression coefficients, as we do, for instance, in our analyses of simulated data (Section 3). Dynamic coefficients are useful when the linear relationship between treated metrics and controls is believed to change over time. There are a number of ways of reducing the computational burden of dealing with a potentially large number of dynamic coefficients. One option is to resort to dynamic latent factors, where one uses x_t = B u_t + ν_t with dim(u_t) ≪ J and uses u_t instead of x_t as part of Z_t in (2.1), coupled with an AR-type model for u_t itself. Another option is latent thresholding regression, where one uses a dynamic version of the spike-and-slab prior, as in Nakajima and West (2013).

The state-component models are assembled independently, with each component providing an additive contribution to y_t. Figure 2 illustrates this process assuming a local linear trend paired with a static regression component.

2.2. Prior distributions and prior elicitation. Let θ generically denote the set of all model parameters and let α = (α_1, ..., α_m) denote the full state sequence. We adopt a Bayesian approach to inference by specifying a prior distribution p(θ) on the model parameters as well as a distribution p(α_0 | θ) on the initial state values. We may then sample from p(α, θ | y) using MCMC.

Most of the models in Section 2.1 depend solely on a small set of variance parameters that govern the diffusion of the individual state components. A typical prior distribution for such a variance is

1/σ² ∼ G(ν/2, s/2),     (2.7)

where G(a, b) is the Gamma distribution with expectation a/b. The prior parameters can be interpreted as a prior sum of squares s, so that s/ν is a prior estimate of σ², and ν is the weight, in units of prior sample size, assigned to the prior estimate.

Figure 2. Graphical model for the static-regression variant of the proposed state-space model. Observed market activity y_{1:n} = (y_1, ..., y_n) is modelled as the result of a latent state plus Gaussian observation noise with error standard deviation σ_y. The state α_t includes a local level μ_t, a local linear trend δ_t, and a set of contemporaneous covariates x_t, scaled by regression coefficients β_ϱ. State components are assumed to evolve according to independent Gaussian random walks with fixed standard deviations σ_μ and σ_δ (conditional-dependence arrows shown for the first time point only). The model includes empirical priors on these parameters and the initial states. In an alternative formulation, the regression coefficients β are themselves subject to random-walk diffusion (see main text). Of principal interest is the posterior predictive density over the unobserved counterfactual responses ỹ_{n+1}, ..., ỹ_m. Subtracting these from the actual observed data y_{n+1}, ..., y_m yields a probability density over the temporal evolution of causal impact.
We often have a weak default prior belief that the incremental errors in the state process are small, which we can formalize by choosing small values of ν (e.g., 1) and small values of s/ν. The notion of 'small' means different things in different models; for the seasonal and local linear trend models our default priors are 1/σ² ∼ G(10⁻², 10⁻² s_y²), where s_y² = Σ_t (y_t − ȳ)² / (n − 1) is the sample variance of the target series. Scaling by the sample variance is a minor violation of the Bayesian paradigm, but it is an effective means of choosing a reasonable scale for the prior. It is similar to the popular technique of scaling the data prior to analysis, but we prefer to do the scaling in the prior so we can model the data on its original scale.

When faced with many potential controls, we prefer letting the model choose an appropriate set. This can be achieved by placing a spike-and-slab prior over coefficients (George and McCulloch, 1993, 1997; Polson and Scott, 2011; Scott and Varian, 2013). A spike-and-slab prior combines point mass at zero (the 'spike'), for an unknown subset of zero coefficients, with a weakly informative distribution on the complementary set of non-zero coefficients (the 'slab'). Contrary to what its name might suggest, the 'slab' is usually not completely flat, but rather a Gaussian with a large variance. Let ϱ = (ϱ_1, ..., ϱ_J), where ϱ_j = 1 if β_j ≠ 0 and ϱ_j = 0 otherwise. Let β_ϱ denote the non-zero elements of the vector β and let Σ_ϱ⁻¹ denote the rows and columns of Σ⁻¹ corresponding to the non-zero entries in ϱ. We can then factorize the spike-and-slab prior as

p(ϱ, β, 1/σ_ε²) = p(ϱ) p(σ_ε² | ϱ) p(β_ϱ | ϱ, σ_ε²).     (2.8)

The spike portion of (2.8) can in principle be an arbitrary distribution over {0, 1}^J; the most common choice in practice is a product of independent Bernoulli distributions,

p(ϱ) = Π_{j=1}^{J} π_j^{ϱ_j} (1 − π_j)^{1−ϱ_j},     (2.9)

where π_j is the prior probability of regressor j being included in the model. Values for π_j can be elicited by asking about the expected model size M and then setting all π_j = M/J. An alternative is to use a more specific set of values π_j. In particular, one might choose to set certain π_j to either 1 or 0 to force the corresponding variables into or out of the model. Generally, framing the prior in terms of expected model size has the advantage that the model can adapt to growing numbers of predictor variables without having to switch to a hierarchical prior (Scott and Berger, 2010).

For the 'slab' portion of the prior we use a conjugate normal-inverse Gamma distribution,

β_ϱ | σ_ε² ∼ N(b_ϱ, σ_ε² (Σ_ϱ⁻¹)⁻¹),     (2.10)
1/σ_ε² ∼ G(ν_ε/2, s_ε/2).                (2.11)

The vector b in equation (2.10) encodes our prior expectation about the value of each element of β. In practice, we usually set b = 0. The prior parameters in equation (2.11) can be elicited by asking about the expected R² ∈ [0, 1] as well as the number of observations worth of weight ν_ε the prior estimate should be given. Then s_ε = ν_ε (1 − R²) s_y².
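The following NumPy sketch draws once from the spike-and-slab prior of equations (2.8) to (2.11); it is an illustration only, with made-up prior settings (J regressors, expected model size M), not the CausalImpact implementation.

    import numpy as np

    def sample_spike_and_slab_prior(J, M, Sigma_inv, b=None, nu_eps=3.0, s_eps=1.0, seed=0):
        rng = np.random.default_rng(seed)
        b = np.zeros(J) if b is None else b
        pi = np.full(J, M / J)                                 # eq. (2.9): pi_j = M/J
        rho = rng.random(J) < pi                               # spike: inclusion indicators
        inv_sigma2 = rng.gamma(shape=nu_eps / 2.0, scale=2.0 / s_eps)   # eq. (2.11)
        sigma2 = 1.0 / inv_sigma2
        beta = np.zeros(J)
        if rho.any():
            # eq. (2.10): Gaussian slab over the included coefficients only.
            Sigma_rho_inv = Sigma_inv[np.ix_(rho, rho)]
            cov = sigma2 * np.linalg.inv(Sigma_rho_inv)
            beta[rho] = rng.multivariate_normal(b[rho], cov)
        return rho, beta, sigma2

    J = 10
    rho, beta, sigma2 = sample_spike_and_slab_prior(J, M=3, Sigma_inv=np.eye(J))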
The final prior parameter in (2.10) is Σ⁻¹, which, up to a scaling factor, is the prior precision over β in the full model with all variables included. The total information in the covariates is X^T X, and so (1/n) X^T X is the average information in a single observation. Zellner's g-prior (Zellner, 1986; Chipman et al., 2001; Liang et al., 2008) sets Σ⁻¹ = (g/n) X^T X, so that g can be interpreted as g observations worth of information. Zellner's prior becomes improper when X^T X is not positive definite; we therefore ensure propriety by averaging X^T X with its diagonal,

Σ⁻¹ = (g/n) { w X^T X + (1 − w) diag(X^T X) },     (2.12)

with default values of g = 1 and w = 1/2. Overall, this prior specification provides a broadly useful default while providing considerable flexibility in those cases where more specific prior information is available.

2.3. Inference. Posterior inference in our model can be broken down into three pieces. First, we simulate draws of the model parameters θ and the state vector α given the observed data y_{1:n} in the training period. Second, we use the posterior simulations to simulate from the posterior predictive distribution p(ỹ_{n+1:m} | y_{1:n}) over the counterfactual time series ỹ_{n+1:m} given the observed pre-intervention activity y_{1:n}. Third, we use the posterior predictive samples to compute the posterior distribution of the pointwise impact y_t − ỹ_t for each t = 1, ..., m. We use the same samples to obtain the posterior distribution of cumulative impact.

Posterior simulation. We use a Gibbs sampler to simulate a sequence (θ, α)^(1), (θ, α)^(2), ... from a Markov chain whose stationary distribution is p(θ, α | y_{1:n}). The sampler alternates between a data-augmentation step that simulates from p(α | y_{1:n}, θ) and a parameter-simulation step that simulates from p(θ | y_{1:n}, α).

The data-augmentation step uses the posterior simulation algorithm from Durbin and Koopman (2002), providing an improvement over the earlier forward-filtering, backward-sampling algorithms by Carter and Kohn (1994), Frühwirth-Schnatter (1994), and de Jong and Shephard (1995). In brief, because p(y_{1:n}, α | θ) is jointly multivariate normal, the variance of p(α | y_{1:n}, θ) does not depend on y_{1:n}. We can therefore simulate (y*_{1:n}, α*) ∼ p(y_{1:n}, α | θ) and subtract E(α* | y*_{1:n}, θ) to obtain zero-mean noise with the correct variance. Adding E(α | y_{1:n}, θ) restores the correct mean, which completes the draw. The required expectations can be computed using the Kalman filter and a fast mean smoother described in detail by Durbin and Koopman (2002). The result is a direct simulation from p(α | y_{1:n}, θ) in an algorithm that is linear in the total (pre- and post-intervention) number of time points (m) and quadratic in the dimension of the state space (d).

Given the draw of the state, the parameter draw is straightforward for all state components other than the static regression coefficients β. All state components that exclusively depend on variance parameters can translate their draws back to error terms η_t and accumulate sums of squares of η; because of conjugacy with equation (2.7), the posterior distribution remains Gamma distributed.

The draw of the static regression coefficients β proceeds as follows. For each t = 1, ..., n in the pre-intervention period, let ẏ_t denote y_t with the contributions from the other state components subtracted away, and let ẏ_{1:n} = (ẏ_1, ..., ẏ_n). The challenge is to simulate from p(ϱ, β, σ_ε² | ẏ_{1:n}), which we can factor into p(ϱ | ẏ_{1:n}) p(1/σ_ε² | ϱ, ẏ_{1:n}) p(β | ϱ, σ_ε, ẏ_{1:n}). Because of conjugacy, we can integrate out β and 1/σ_ε² and be left with

ϱ | ẏ_{1:n} ∼ C(ẏ_{1:n}) (|Σ_ϱ⁻¹|^(1/2) / |V_ϱ⁻¹|^(1/2)) p(ϱ) / S_ϱ^(N/2 − 1),     (2.13)

where C(ẏ_{1:n}) is an unknown normalizing constant. The sufficient statistics in equation (2.13) are

V_ϱ⁻¹ = (X^T X)_ϱ + Σ_ϱ⁻¹,
β̃_ϱ = (V_ϱ⁻¹)⁻¹ (X_ϱ^T ẏ_{1:n} + Σ_ϱ⁻¹ b_ϱ),
N = ν_ε + n,
S_ϱ = s_ε + ẏ_{1:n}^T ẏ_{1:n} + b_ϱ^T Σ_ϱ⁻¹ b_ϱ − β̃_ϱ^T V_ϱ⁻¹ β̃_ϱ.
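The following short sketch (illustrative only; the toy data are random) computes the averaged g-prior precision of equation (2.12) and the sufficient statistics of equation (2.13) for a given inclusion vector ϱ.

    import numpy as np

    def prior_precision(X, g=1.0, w=0.5):
        XtX = X.T @ X
        n = X.shape[0]
        return (g / n) * (w * XtX + (1.0 - w) * np.diag(np.diag(XtX)))   # eq. (2.12)

    def sufficient_stats(X, y_dot, rho, Sigma_inv, b, nu_eps, s_eps):
        Xr, br = X[:, rho], b[rho]
        Sr_inv = Sigma_inv[np.ix_(rho, rho)]
        V_inv = Xr.T @ Xr + Sr_inv
        beta_tilde = np.linalg.solve(V_inv, Xr.T @ y_dot + Sr_inv @ br)
        N = nu_eps + len(y_dot)
        S = s_eps + y_dot @ y_dot + br @ Sr_inv @ br - beta_tilde @ V_inv @ beta_tilde
        return V_inv, beta_tilde, N, S

    rng = np.random.default_rng(1)
    X = rng.normal(size=(50, 4))
    y_dot = rng.normal(size=50)
    rho = np.array([True, False, True, False])
    stats = sufficient_stats(X, y_dot, rho, prior_precision(X), b=np.zeros(4),
                             nu_eps=3.0, s_eps=1.0)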
To sample from (2.13), we use a Gibbs sampler that draws each ϱ_j given all other ϱ_{−j}. Each full conditional is easy to evaluate because ϱ_j can only assume two possible values. It should be noted that the dimension of all matrices in (2.13) is Σ_j ϱ_j, which is small if the model is truly sparse. There are many matrices to manipulate, but because each is small the overall algorithm is fast. Once the draw of ϱ is complete, we sample directly from p(β, 1/σ_ε² | ϱ, ẏ_{1:n}) using standard conjugate formulae. For an alternative that may be even more computationally efficient, see Ghosh and Clyde (2011).

Posterior predictive simulation. While the posterior over model parameters and states p(θ, α | y_{1:n}) can be of interest in its own right, causal impact analyses are primarily concerned with the posterior incremental effect,

p(ỹ_{n+1:m} | y_{1:n}, x_{1:m}).     (2.14)

As shown by its indices, the density in equation (2.14) is defined precisely for that portion of the time series which is unobserved: the counterfactual market response ỹ_{n+1}, ..., ỹ_m that would have been observed in the treated market, after the intervention, in the absence of treatment. It is also worth emphasizing that the density is conditional on the observed data (as well as the priors) and only on these, i.e., on activity in the treatment market before the beginning of the intervention as well as activity in all control markets both before and during the intervention. The density is not conditioned on parameter estimates or on the inclusion or exclusion of covariates with static regression coefficients, all of which have been integrated out. Thus, through Bayesian model averaging, we neither commit to any particular set of covariates, which helps avoid an arbitrary selection, nor to point estimates of their coefficients, which prevents overfitting.

The posterior predictive density in (2.14) is defined as a coherent (joint) distribution over all counterfactual data points, rather than as a collection of pointwise univariate distributions. This ensures that we correctly propagate the serial structure determined on pre-intervention data to the trajectory of counterfactuals. This is crucial, in particular, when forming summary statistics, such as the cumulative effect of the intervention on the treatment market.

Posterior inference was implemented in C++ with an R interface. Given a typically-sized dataset with m = 500 time points, J = 10 covariates, and 10,000 iterations (see Section 4 for an example), this implementation takes less than 30 seconds to complete on a standard computer, enabling near-interactive analyses.

2.4. Evaluating impact. Samples from the posterior predictive distribution over counterfactual activity can readily be used to obtain samples from the posterior causal effect, i.e., the quantity we are typically interested in. For each draw τ and for each time point t = n + 1, ..., m, we set

φ_t^(τ) := y_t − ỹ_t^(τ),     (2.15)
In addition to its pointwise impact, we often wish to understand the cumulative effect of an intervention over time. One of the main advantages of a sampling approach to posterior inference is the flexibility and ease with which such derived inferences can be obtained. Reusing the impact samples obtained in (2.15), we compute, for each draw τ,

(2.16)    \sum_{t' = n+1}^{t} \phi_{t'}^{(\tau)} \quad \forall t = n+1, \ldots, m.

The preceding cumulative sum of causal increments is a useful quantity when y represents a flow quantity, measured over an interval of time (e.g., a day), such as the number of searches, sign-ups, sales, additional installs, or new users. It becomes uninterpretable when y represents a stock quantity, usefully defined only for a point in time, such as the total number of clients, users, or subscribers. In this case we might instead choose, for each τ, to draw a sample of the posterior running average effect following the intervention,

(2.17)    \frac{1}{t - n} \sum_{t' = n+1}^{t} \phi_{t'}^{(\tau)} \quad \forall t = n+1, \ldots, m.

Unlike the cumulative effect in (2.16), the running average is always interpretable, regardless of whether it refers to a flow or a stock. However, it depends more strongly on the length of the post-intervention period under consideration. In particular, under the assumption of a true impact that grows quickly at first and then declines to zero, the cumulative impact approaches its true total value (in expectation) as we increase the counterfactual forecasting period, whereas the average impact will eventually approach zero (while, in contrast, the probability intervals diverge in both cases, leading to more and more uncertain inferences as the forecasting period increases).
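Both derived quantities are simple transformations of the pointwise impact draws. A hedged R sketch, reusing the matrix phi from the previous sketch (rows = draws, columns = post-intervention time points):

    ## Cumulative effect draws, equation (2.16): row-wise cumulative sums
    cumulative_impact <- function(phi) {
      t(apply(phi, 1, cumsum))
    }

    ## Running average effect draws, equation (2.17): row-wise running means
    running_average_impact <- function(phi) {
      t(apply(phi, 1, function(x) cumsum(x) / seq_along(x)))
    }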
3. Application to simulated data. To study the characteristics of our approach, we analysed simulated (i.e., computer-generated) data across a series of independent simulations. Generated time series started on 1 January 2013 and ended on 30 June 2014, with a perturbation beginning on 1 January 2014. The data were simulated using a dynamic regression component with two covariates whose coefficients evolved according to independent random walks, β_t ∼ N(β_{t−1}, 0.01²), initialized at β_0 = 1. The covariates themselves were simple sinusoids with wavelengths of 90 days and 360 days, respectively. The latent state underlying the observed data was generated using a local level that evolved according to a random walk, µ_t ∼ N(µ_{t−1}, 0.1²), initialized at µ_0 = 0. Independent observation noise was sampled using ε_t ∼ N(0, 0.1²). In summary, observations y_t were generated using

y_t = \beta_{t,1} z_{t,1} + \beta_{t,2} z_{t,2} + \mu_t + \epsilon_t.

To simulate the effect of advertising, the post-intervention portion of the preceding series was multiplied by 1 + e, where e (not to be confused with ε) represented the true effect size specifying the (uniform) relative lift during the campaign period. An example is shown in Figure 3a.

Figure 3. Adequacy of posterior uncertainty. (a) Example of one of the 256 datasets created to assess estimation accuracy. Simulated observations (black) are based on two contemporaneous covariates, scaled by time-varying coefficients plus a time-varying local level (not shown). During the campaign period, where the data are lifted by an effect size of 10%, the plot shows the posterior expectation of counterfactual activity (blue), along with its pointwise central 95% credible intervals (blue shaded area), and, for comparison, the true counterfactual (green). (b) Power curve: proportion of intervals excluding zero as a function of true effect size (%). Following repeated application of the model to simulated data, the plot shows the empirical frequency of concluding that a causal effect was present, given a post-intervention period of 6 months. The curve represents sensitivity in those parts of the graph where the true effect size is positive, and 1 − specificity where the true effect size is zero. Error bars represent 95% credible intervals for the true sensitivity, using a uniform Beta(1, 1) prior. (c) Interval coverage: proportion of intervals containing the truth as a function of campaign duration (days). Using an effect size of 10%, the plot shows the proportion of simulations in which the pointwise central 95% credible interval contained the true impact. Intervals should contain ground truth in 95% of simulations, however much uncertainty the predictions may carry. Error bars represent 95% credible intervals.

Sensitivity and specificity. To study the properties of our model, we began by considering under what circumstances we successfully detected a causal effect, i.e., the statistical power or sensitivity of our approach. A related property is the probability of not detecting an absent impact, i.e., specificity. We repeatedly generated data, as described above, under different true effect sizes. We then computed the posterior predictive distribution over the counterfactuals and recorded whether or not we would have concluded a causal effect. For each of the effect sizes 0%, 0.1%, 1%, 10%, and 100%, a total of 2^8 = 256 simulations were run. This number was chosen simply on the grounds that it provided reasonably tight intervals around the reported summary statistics without requiring excessive amounts of computation. In each simulation, we concluded that a causal effect was present if and only if the central 95% posterior probability interval of the cumulative effect excluded zero.

The model used throughout this section comprised two structural blocks. The first one was a local level component. We placed an inverse-Gamma prior on its diffusion variance with a prior estimate of s/ν = 0.1 σ_y and a prior sample size ν = 32. The second structural block was a dynamic regression component. We placed a Gamma prior with prior expectation 0.1 σ_y on the diffusion variance of both regression coefficients. By construction, the outcome variable did not exhibit any local trends or seasonality other than the variation conveyed through the covariates. This obviated the need to include an explicit local linear trend or seasonality component in the model.

In a first analysis, we considered the empirical proportion of simulations in which a causal effect had been detected. When taking into account only those simulations where the true effect size was greater than zero, these empirical proportions provide estimates of the sensitivity of the model w.r.t. the process by which the data were generated. Conversely, those simulations where the campaign had had no effect yield an estimate of the model's specificity. In this way, we obtained the power curve shown in Figure 3b. The curve shows that, in data such as these, a market perturbation leading to a lift no larger than 1% is missed in about 90% of cases. By contrast, a perturbation that lifts market activity by 25% is correctly detected as such in most cases.
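The generative process used in these simulations can be sketched in a few lines of R. The phases of the sinusoids, the random seed, and the 10% effect size below are illustrative choices, not values taken from the paper.

    set.seed(1)
    dates <- seq(as.Date("2013-01-01"), as.Date("2014-06-30"), by = "day")
    m     <- length(dates)
    post  <- dates >= as.Date("2014-01-01")               # campaign period
    z1    <- sin(2 * pi * seq_len(m) / 90)                # 90-day sinusoid
    z2    <- sin(2 * pi * seq_len(m) / 360)               # 360-day sinusoid
    beta1 <- cumsum(c(1, rnorm(m - 1, 0, 0.01)))          # random-walk coefficients
    beta2 <- cumsum(c(1, rnorm(m - 1, 0, 0.01)))
    mu    <- cumsum(c(0, rnorm(m - 1, 0, 0.1)))           # local level random walk
    eps   <- rnorm(m, 0, 0.1)                             # observation noise
    y     <- beta1 * z1 + beta2 * z2 + mu + eps
    e     <- 0.10                                         # true relative lift
    y[post] <- y[post] * (1 + e)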
In a second analysis, we assessed the coverage properties of the posterior probability intervals obtained through our model. It is desirable to use a diffuse prior on the local level component such that central 95% intervals contain ground truth in about 95% of the simulations. This coverage frequency should hold regardless of the length of the campaign period. In other words, a longer campaign should lead to posterior intervals that are appropriately widened to retain the same coverage probability as the narrower intervals obtained for shorter campaigns. This was approximately the case throughout the simulated campaign (Figure 3c).

Estimation accuracy. To study the accuracy of the point estimates supported by our approach, we repeated the preceding simulations with a fixed effect size of 10% while varying the length of the campaign. When given a quadratic loss function, the loss-minimizing point estimate is the posterior expectation of the predictive density over counterfactuals. Thus, for each generated dataset i, we computed the expected causal effect for each time point,

(3.1)    \hat{\phi}_{i,t} := \langle \phi_t \mid y_1, \ldots, y_m, x_1, \ldots, x_m \rangle \quad \forall t = n+1, \ldots, m; \; i = 1, \ldots, 256.

To quantify the discrepancy between estimated and true impact, we calculated the absolute percentage estimation error,

(3.2)    a_{i,t} := \left| \frac{\hat{\phi}_{i,t} - \phi_t}{\phi_t} \right|.

This yielded an empirical distribution of absolute percentage estimation errors (Figure 4a; blue), showing that impact estimates become less and less accurate as the forecasting period increases. This is because, under the local linear trend model in (2.3), the true counterfactual activity becomes more and more likely to deviate from its expected trajectory.

It is worth emphasizing that all preceding results are based on the assumption that the model structure remains intact throughout the modelling period. In other words, even though the model is built around the idea of multiple (non-stationary) components (i.e., a time-varying local trend and, potentially, time-varying regression coefficients), this structure itself remains unchanged. If the model structure does change, estimation accuracy may suffer. We studied the impact of a changing model structure in a second simulation in which we repeated the procedure above in such a way that, 90 days after the beginning of the campaign, the standard deviation of the random walk governing the evolution of the regression coefficient was tripled (now 0.03 instead of 0.01). As a result, the observed data began to diverge much more quickly than before. Accordingly, estimations became considerably less reliable (Figure 4a, red). An example of the underlying data is shown in Figure 4b.

Figure 4. Estimation accuracy. (a) Time series of absolute percentage discrepancy between inferred effect and true effect. The plot shows the rate (mean ± 2 s.e.m.) at which predictions become less accurate as the length of the counterfactual forecasting period increases (blue). The well-behaved decrease in estimation accuracy breaks down when the data are subject to a sudden structural change (red), as simulated for 1 April 2014. (b) Illustration of a structural break. The plot shows one example of the time series underlying the red curve in (a). On 1 April 2014, the standard deviation of the generating random walk of the local level was tripled, causing the rapid decline in estimation accuracy seen in the red curve in (a).

The preceding simulations highlight the importance of a model that is sufficiently flexible to account for phenomena typically encountered in seasonal empirical data. This rules out entirely static models in particular (such as multiple linear regression).

4. Application to empirical data. To illustrate the practical utility of our approach, we analysed an advertising campaign run by one of Google's advertisers in the United States. In particular, we inferred the campaign's causal effect on the number of times a user was directed to the advertiser's website from the Google search results page. We provide a brief overview of the underlying data below (see Vaver and Koehler, 2011, for additional details).

The campaign analysed here was based on product-related ads to be displayed alongside Google's search results for specific keywords. Ads went live for a period of 6 consecutive weeks and were geo-targeted to a randomised set of 95 out of 190 designated market areas (DMAs). The most salient observable characteristic of DMAs is offline sales. To produce balance in this characteristic, DMAs were first rank-ordered by sales volume. Pairs of regions were then randomly assigned to treatment or control. DMAs provide units that can be easily supplied with distinct offerings, although this fine-grained split was not a requirement for the model. In fact, we carried out the analysis as if only one treatment region had been available (formed by summing all treated DMAs). This allowed us to evaluate whether our approach would yield the same results as more conventional treatment-control comparisons would have done.

The outcome variable analysed here was search-related visits to the advertiser's website, consisting of organic clicks (i.e., clicks on a search result) and paid clicks (i.e., clicks on an ad next to the search results, for which the advertiser was charged). Since paid clicks were zero before the campaign, one might wonder why we could not simply count the number of paid clicks after the campaign had started. The reason is that paid clicks tend to cannibalize some organic clicks. Since we were interested in the net effect, we worked with the total number of clicks.

The first building block of the model used for the analyses in this section was a local level component. For the inverse-Gamma prior on its diffusion variance we used a prior estimate of s/ν = 0.1 σ_y and a prior sample size ν = 32. The second structural block was a static regression component. We used a spike-and-slab prior with an expected model size of M = 3, an expected explained variance of R² = 0.8, and 50 prior df. We deliberately kept the model as simple as this. Since the covariates came from a randomised experiment, we expected them to already account for any additional local linear trends and seasonal variation in the response variable. If one suspects that a more complex model might be more appropriate, one could optimize model design through Bayesian model selection. Here, we focus instead on comparing different sets of covariates, which is critical in counterfactual analyses regardless of the particular model structure used. Model estimation was carried out using 10,000 MCMC samples.
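An analysis of this kind can be reproduced in outline with the CausalImpact R package mentioned in the Discussion. The snippet below is a hedged sketch on simulated stand-in data, since the campaign data are proprietary; all series, dates, and the 20% lift are invented for illustration.

    library(zoo)
    library(CausalImpact)
    set.seed(1)
    dates  <- seq(as.Date("2013-01-01"), by = "day", length.out = 180)
    ctrl1  <- 100 + arima.sim(list(ar = 0.8), 180)        # stand-in control series
    ctrl2  <- 110 + arima.sim(list(ar = 0.8), 180)
    clicks <- 0.5 * ctrl1 + 0.5 * ctrl2 + rnorm(180)      # stand-in treated series
    clicks[121:180] <- clicks[121:180] * 1.2              # pretend campaign lift
    data   <- zoo(cbind(clicks, ctrl1, ctrl2), dates)     # response first, controls next
    impact <- CausalImpact(data,
                           pre.period  = c(dates[1],   dates[120]),
                           post.period = c(dates[121], dates[180]),
                           model.args  = list(niter = 10000))
    summary(impact)
    plot(impact)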
Analysis 1: Effect on the treated, using a randomised control. We began by applying the above model to infer the causal effect of the campaign on the time series of clicks in the treated regions. Given that a set of unaffected regions was available in this analysis, the best possible set of controls was given by the untreated DMAs themselves (see below for a comparison with a purely observational alternative).

As shown in Figure 5a, the model provided an excellent fit on the pre-campaign trajectory of clicks (including a spike in 'week −2' and a dip at the end of 'week −1'). Following the onset of the campaign, observations quickly began to diverge from counterfactual predictions: the actual number of clicks was consistently higher than what would have been expected in the absence of the campaign. The curves did not reconvene until one week after the end of the campaign. Subtracting predicted from observed data, as we did in Figure 5b, resulted in a posterior estimate of the incremental lift caused by the campaign. It peaked after about three weeks into the campaign and faded away within about one week after the end of the campaign. Thus, as shown in Figure 5c, the campaign led to a sustained cumulative increase in total clicks (as opposed to a mere shift of future clicks into the present, or a pure cannibalization of organic clicks by paid clicks). Specifically, the overall effect amounted to 88,400 additional clicks in the targeted regions (posterior expectation; rounded to three significant digits), i.e., an increase of 22%, with a central 95% credible interval of [13%, 30%].

To validate this estimate, we returned to the original experimental data, on which a conventional treatment-control comparison had been carried out using a two-stage linear model (Vaver and Koehler, 2011). This analysis had led to an estimated lift of 84,700 clicks, with a 95% confidence interval for the relative expected lift of [19%, 22%]. Thus, with a deviation of less than 5%, the counterfactual approach had led to almost precisely the same estimate as the randomised evaluation, except for its wider intervals. The latter is expected, given that our intervals represent prediction intervals, not confidence intervals. Moreover, in addition to an interval for the sum over all time points, our approach yields a full time series of pointwise intervals, which allows analysts to examine the characteristics of the temporal evolution of attributable impact.

Figure 5. Causal effect of online advertising on clicks in treated regions. (a) Time series of search-related visits to the advertiser's website (including both organic and paid clicks), with model fit and counterfactual prediction across the pre-intervention, intervention, and post-intervention periods. (b) Pointwise (daily) incremental impact of the campaign on clicks. Shaded vertical bars indicate weekends. (c) Cumulative impact of the campaign on clicks.

The posterior predictive intervals in Figure 5b widen more slowly than in the illustrative example in Figure 1. This is because the large number of controls available in this data set offers a much higher pre-campaign predictive strength than in the simulated data in Figure 1. This is not unexpected, given that controls came from a randomised experiment, and we will see that this also holds for a subsequent analysis (see below) that is based on yet another data source for predictors. A consequence of this is that there is little variation left to be captured by the random-walk component of the model.
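Summaries such as the 22% lift quoted above can be read off the posterior samples directly. A sketch, reusing the cumulative-impact draws from the earlier sketch; defining the relative effect as the total effect divided by the counterfactual total is one reasonable convention, not necessarily the exact one used in the package.

    summarise_lift <- function(cum_draws, y_post) {
      total <- cum_draws[, ncol(cum_draws)]                # cumulative effect at the last time point
      rel   <- total / (sum(y_post) - total)               # relative to the counterfactual total
      list(absolute = c(mean = mean(total), quantile(total, c(0.025, 0.975))),
           relative = c(mean = mean(rel),   quantile(rel,   c(0.025, 0.975))))
    }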
A reassuring finding is that the estimated counterfactual time series in Figure 5a eventually almost exactly rejoins the observed series, only a few days after the end of the intervention.

Analysis 2: Effect on the treated, using observational controls. An important characteristic of counterfactual-forecasting approaches is that they do not require a setting in which a set of controls, selected at random, was exempt from the campaign. We therefore repeated the preceding analysis in the following way: we discarded the data from all control regions and, instead, used searches for keywords related to the advertiser's industry, grouped into a handful of verticals, as covariates. In the absence of a dedicated set of control regions, such industry-related time series can be very powerful controls, as they capture not only seasonal variations but also market-specific trends and events (though not necessarily advertiser-specific trends). A major strength of the controls chosen here is that time series on web searches are publicly available through Google Trends (http://www.google.com/trends/). This makes the approach applicable to virtually any kind of intervention. At the same time, the industry as a whole is unlikely to be moved by a single actor's activities. This precludes a positive bias in estimating the effect of the campaign that would arise if a covariate was negatively affected by the campaign.

As shown in Figure 6, we found a cumulative lift of 85,900 clicks (posterior expectation), or 21%, with a [12%, 30%] interval. In other words, the analysis replicated almost perfectly the original analysis that had access to a randomised set of controls. One feature in the response variable which this second analysis failed to account for was a spike in clicks in the second week before the campaign onset; this spike appeared in both treated and untreated regions and appears to be specific to this advertiser. In addition, the series of pointwise impact (Figure 6b) is slightly more volatile than in the original analysis (Figure 5). On the other hand, the overall point estimate of 85,900 was, in this case, even closer to the randomised-design baseline (84,700; deviation ca. 1%) than in our first analysis (88,400; deviation ca. 4%). In summary, the counterfactual approach effectively obviated the need for the original randomised experiment: using purely observational variables led to the same substantive conclusions.

Figure 6. Causal effect of online advertising on clicks, using only searches for keywords related to the advertiser's industry as controls and discarding the original control regions, as would be the case in studies where a randomised experiment was not carried out. (a) Time series of clicks to the advertiser's website. (b) Pointwise (daily) incremental impact of the campaign on clicks. (c) Cumulative impact of the campaign on clicks. The plots show that this analysis, which was based on observational covariates only, provided almost exactly the same inferences as the first analysis (Figure 5), which had been based on a randomised design.
Analysis 3: Absence of an effect on the controls. To go one step further still, we analysed clicks in those regions that had been exempt from the advertising campaign. If the effect of the campaign was truly specific to treated regions, there should be no effect in the controls. To test this, we inferred the causal effect of the campaign on unaffected regions, which should not lead to a significant finding. In analogy with our second analysis, we discarded clicks in the treated regions and used searches for keywords related to the advertiser's industry as controls. As summarized in Figure 7, no significant effect was found in unaffected regions, as expected. Specifically, we obtained an overall non-significant lift of 2% in clicks with a central 95% credible interval of [−6%, 10%].

Figure 7. Causal effect of online advertising on clicks in non-treated regions, which should not show an effect. Searches for keywords related to the advertiser's industry are used as controls. Plots show inferences in analogy with Figure 5. (a) Time series of clicks to the advertiser's website. (b) Pointwise (daily) incremental impact of the campaign on clicks. (c) Cumulative impact of the campaign on clicks.

In summary, the empirical data considered in this section showed: (i) a clear effect of advertising on treated regions when using randomised control regions to form the regression component, replicating previous treatment-control comparisons (Figure 5); (ii) notably, an equivalent finding when discarding control regions and instead using observational searches for keywords related to the advertiser's industry as covariates (Figure 6); (iii) reassuringly, the absence of an effect of advertising on regions that were not targeted (Figure 7).

5. Discussion. The increasing interest in evaluating the incremental impact of market interventions has been reflected by a growing literature on applied causal inference. With the present paper we are hoping to contribute to this literature by proposing a Bayesian state-space model for obtaining a counterfactual prediction of market activity. We discuss the main features of this model below.

In contrast to most previous schemes, the approach described here is fully Bayesian, with regularizing or empirical priors for all hyperparameters. Posterior inference gives rise to complete-data (smoothing) predictions that are only conditioned on past data in the treatment market and both past and present data in the control markets. Thus, our model embraces a dynamic evolution of states and, optionally, coefficients (departing from classical linear regression models with a fixed number of static regressors) and enables us to flexibly summarize posterior inferences.

Because closed-form posteriors for our model do not exist, we suggest a stochastic approximation to inference using MCMC. One convenient consequence of this is that we can reuse the samples from the posterior to obtain credible intervals for all summary statistics of interest. Such statistics include, for example, the average absolute and relative effect caused by the intervention as well as its cumulative effect.
Posterior inference was implemented in C++ and R and, for all empirical datasets presented in Section 4, took less than 30 seconds on a standard Linux machine. If the computational burden of sampling-based inference ever became prohibitive, one option would be to replace it by a variational Bayesian approximation (see Mathys et al., 2011; Brodersen et al., 2013, for examples). Another way of using the proposed model is for power analyses. In particular, given past time series of market activity, we can define a point in the past to represent a hypothetical intervention and apply the model in the usual fashion. As a result, we obtain a measure of uncertainty about the response in the treated market after the beginning of the hypothetical intervention. This provides an estimate of what incremental effect would have been required to be outside of the 95% central interval of what would have happened in the absence of treatment. The model presented here subsumes several simpler models which, in consequence, lack important characteristics, but which may serve as alternatives should the full model appear too complex for the data at hand. One example is classical multiple linear regression. In principle, classical regression models go beyond difference-in-differences schemes in that they account for the full counterfactual trajectory. However, they are not suited for predicting stochastic processes beyond a few steps. This is because ordinary leastsquares estimators disregard serial autocorrelation; the static model structure does not allow for temporal variation in the coefficients; and predictions ignore our posterior uncertainty about the parameters. Put differently: classical multiple linear regression is a special case of the state-space model described here in which (i) the Gaussian random walk of the local level has zero variance; (ii) there is no local linear trend; (iii) regression coefficients are static rather than time-varying; (iv) ordinary least squares estimators are used which disregard posterior uncertainty about the parameters and may easily overfit the data. Another special case of the counterfactual approach discussed in this paper is given by synthetic control estimators that are restricted to the class of convex combinations of predictor variables and do not include time-series effects such as trends and seasonality (Abadie, Diamond and Hainmueller, 2010; Abadie, 2005). Relaxing this restriction means we can utilize predictors regardless of their scale, even if they are negatively correlated with the outcome series of the treated unit. Other special cases include autoregressive (AR) and moving-average (MA) models. These models define autocorrelation among observations rather thanBAYESIAN CAUSAL IMPACT ANALYSIS 29 latent states, thus precluding the ability to distinguish between state noise and observation noise (Ataman, Mela and Van Heerde, 2008; Leeflang et al., 2009). In the scenarios we consider, advertising is a planned perturbation of the market. This generally makes it easier to obtain plausible causal inferences than in genuinely observational studies in which the experimenter had no control about treatment (see discussions in Berndt, 1991; Brady, 2002; Hitchcock, 2004; Robinson, McNulty and Krasno, 2009; Winship and Morgan, 1999; Camillo and d’Attoma, 2010; Antonakis et al., 2010; Lewis and Reiley, 2011; Lewis, Rao and Reiley, 2011; Kleinberg and Hripcsak, 2011; Vaver and Koehler, 2011). 
The principal problem in observational studies is endogeneity: the possibility that the observed outcome might not be the result of the treatment but of other omitted, endogenous variables. In principle, propensity scores can be used to correct for the selection bias that arises when the treatment effect is correlated with the likelihood of being treated (Rubin and Waterman, 2006; Chan et al., 2010). However, the propensityscore approach requires that exposure can be measured at the individual level, and it, too, does not guarantee valid inferences, for example in the presence of a specific type of selection bias recently termed ‘activity bias’ (Lewis, Rao and Reiley, 2011). Counterfactual modelling approaches avoid these issues when it can be assumed that the treatment market was chosen at random. Overall, we expect inferences on the causal impact of designed market interventions to play an increasingly prominent role in providing quantitative accounts of return on investment (Danaher and Rust, 1996; Seggie, Cavusgil and Phelan, 2007; Leeflang et al., 2009; Stewart, 2009). This is because marketing resources, specifically, can only be allocated to whichever campaign elements jointly provide the greatest return on ad spend (ROAS) if we understand the causal effects of spend on sales, product adoption, or user engagement. At the same time, our approach could be used for many other applications involving causal inference. Examples include problems found in economics, epidemiology, biology, or the political and social sciences. With the release of the CausalImpact R package we hope to provide a simple framework serving all of these areas. Structural time-series models are being used in an increasing number of applications at Google, and we anticipate that they will prove equally useful in many analysis efforts elsewhere. Acknowledgements. The authors wish to thank Jon Vaver for sharing the empirical data analysed in this paper.30 K.H. BRODERSEN ET AL. References. Abadie, A. (2005). Semiparametric Difference-in-Differences Estimators. The Review of Economic Studies 72 1–19. Abadie, A., Diamond, A. and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of Californias tobacco control program. Journal of the American Statistical Association 105. Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American economic review 113–132. Angrist, J. D. and Krueger, A. B. (1999). Empirical strategies in labor economics. Handbook of labor economics 3 1277–1366. Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press. Antonakis, J., Bendahan, S., Jacquart, P. and Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly 21 1086–1120. Ashenfelter, O. and Card, D. (1985). Using the longitudinal structure of earnings to estimate the effect of training programs. The Review of Economics and Statistics 648–660. Ataman, M. B., Mela, C. F. and Van Heerde, H. J. (2008). Building brands. Marketing Science 27 1036–1054. Athey, S. and Imbens, G. W. (2002). Identification and Inference in Nonlinear DifferenceIn-Differences Models Working Paper No. 280, National Bureau of Economic Research. Banerjee, S., Kauffman, R. J. and Wang, B. (2007). Modeling Internet firm survival using Bayesian dynamic models with time-varying coefficients. Electronic Commerce Research and Applications 6 332–342. 
Belloni, A., Chernozhukov, V., Fernandez-Val, I. and Hansen, C. (2013). Program evaluation with high-dimensional data CeMMAP working papers No. CWP77/13, Centre for Microdata Methods and Practice, Institute for Fiscal Studies. Berndt, E. R. (1991). The practice of econometrics: classic and contemporary. AddisonWesley Reading, MA. Bertrand, M., Duflo, E. and Mullainathan, S. (2002). How Much Should We Trust Differences-in-Differences Estimates? Working Paper No. 8841, National Bureau of Economic Research. Brady, H. E. (2002). Models of causal inference: Going beyond the Neyman-RubinHolland theory. In Annual Meetings of the Political Methodology Group. Brodersen, K. H., Daunizeau, J., Mathys, C., Chumbley, J. R., Buhmann, J. M. and Stephan, K. E. (2013). Variational Bayesian mixed-effects inference for classifi- cation studies. NeuroImage 76 345–361. Camillo, F. and d’Attoma, I. (2010). A new data mining approach to estimate causal effects of policy interventions. Expert Systems with Applications 37 171–181. Campbell, D. T., Stanley, J. C. and Gage, N. L. (1963). Experimental and quasiexperimental designs for research. Houghton Mifflin Boston. Card, D. and Krueger, A. B. (1993). Minimum wages and employment: A case study of the fast food industry in New Jersey and Pennsylvania Technical Report, National Bureau of Economic Research. Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika 81 541-553. Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale. In Proceedings of ACM SIGKDD 2010 7–15.BAYESIAN CAUSAL IMPACT ANALYSIS 31 Chipman, H., George, E. I., McCulloch, R. E., Clyde, M., Foster, D. P. and Stine, R. A. (2001). The practical implementation of Bayesian model selection. Lecture Notes-Monograph Series 65–134. Claveau, F. (2012). The RussoWilliamson Theses in the social sciences: Causal inference drawing on two types of evidence. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 0. Cox, D. and Wermuth, N. (2001). Causal Inference and Statistical Fallacies. In International Encyclopedia of the Social & Behavioral Sciences (E. in Chief:Neil J. Smelser and P. B. Baltes, eds.) 1554–1561. Pergamon, Oxford. Danaher, P. J. and Rust, R. T. (1996). Determining the optimal return on investment for an advertising campaign. European Journal of Operational Research 95 511–521. de Jong, P. and Shephard, N. (1995). The simulation smoother for time series models. Biometrika 82 339–350. Donald, S. G. and Lang, K. (2007). Inference with Difference-in-Differences and Other Panel Data. Review of Economics and Statistics 89 221–233. Durbin, J. and Koopman, S. J. (2002). A Simple and Efficient Simulation Smoother for State Space Time Series Analysis. Biometrika 89 603–616. Fruhwirth-Schnatter, S. ¨ (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis 15 183–202. George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88 881–889. George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica 7 339–374. Ghosh, J. and Clyde, M. A. (2011). Rao-Blackwellization for Bayesian Variable Selection and Model Averaging in Linear and Binary Regression: A Novel Data Augmentation Approach. Journal of the American Statistical Association 106 1041–1052. Hansen, C. B. (2007a). 
Asymptotic properties of a robust variance matrix estimator for panel data when T is large. Journal of Econometrics 141 597–620. Hansen, C. B. (2007b). Generalized least squares inference in panel and multilevel models with serial correlation and fixed effects. Journal of Econometrics 140 670–694. Heckman, J. J. and Vytlacil, E. J. (2007). Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation. In Handbook of Econometrics, (J. J. Heckman and E. E. Leamer, eds.) 6, Part B 4779–4874. Elsevier. Hitchcock, C. (2004). Do All and Only Causes Raise the Probabilities of Effects? In Causation and Counterfactuals MIT Press. Hoover, K. D. (2012). Economic Theory and Causal Inference. In Philosophy of Economics, (U. Mki, ed.) 13 89–113. Elsevier. Kleinberg, S. and Hripcsak, G. (2011). A review of causal inference for biomedical informatics. Journal of Biomedical Informatics 44 1102–1112. Leeflang, P. S., Bijmolt, T. H., van Doorn, J., Hanssens, D. M., van Heerde, H. J., Verhoef, P. C. and Wieringa, J. E. (2009). Creating lift versus building the base: Current trends in marketing dynamics. International Journal of Research in Marketing 26 13–20. Lester, R. A. (1946). Shortcomings of marginal analysis for wage-employment problems. The American Economic Review 36 63–82. Lewis, R. A., Rao, J. M. and Reiley, D. H. (2011). Here, there, and everywhere: correlated online behaviors can lead to overestimates of the effects of advertising. In Proceedings of the 20th international conference on World wide web. WWW ’11 157– 166. ACM, New York, NY, USA.32 K.H. BRODERSEN ET AL. Lewis, R. A. and Reiley, D. H. (2011). Does Retail Advertising Work? Technical Report. Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association 103 410-423. Mathys, C., Daunizeau, J., Friston, K. J. and Stephan, K. E. (2011). A Bayesian Foundation for Individual Learning Under Uncertainty. Frontiers in Human Neuroscience 5. Meyer, B. D. (1995). Natural and Quasi-Experiments in Economics. Journal of Business & Economic Statistics 13 151. Morgan, S. L. and Winship, C. (2007). Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press. Nakajima, J. and West, M. (2013). Bayesian analysis of latent threshold dynamic models. Journal of Business & Economic Statistics 31 151–164. Polson, N. G. and Scott, S. L. (2011). Data augmentation for support vector machines. Bayesian Analysis 6 1–23. Robinson, G., McNulty, J. E. and Krasno, J. S. (2009). Observing the Counterfactual? The Search for Political Experiments in Nature. Political Analysis 17 341–357. Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology; Journal of Educational Psychology 66 688. Rubin, D. B. (2007). Statistical Inference for Causal Effects, With Emphasis on Applications in Epidemiology and Medical Statistics. In Handbook of Statistics, (J. M. C. R. Rao and D. Rao, eds.) 27 28–63. Elsevier. Rubin, D. B. and Waterman, R. P. (2006). Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology. Statistical Science 21 206-222. Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics 38 2587-2619. Scott, S. L. and Varian, H. R. (2013). 
Predicting the Present with Bayesian Structural Time Series. International Journal of Mathematical Modeling and Optimization. (forthcoming). Seggie, S. H., Cavusgil, E. and Phelan, S. E. (2007). Measurement of return on marketing investment: A conceptual framework and the future of marketing metrics. Industrial Marketing Management 36 834–841. Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and quasiexperimental designs for generalized causal inference. Wadsworth Cengage learning. Solon, G. (1984). Estimating autocorrelations in fixed-effects models. National Bureau of Economic Research Cambridge, Mass., USA. Stewart, D. W. (2009). Marketing accountability: Linking marketing actions to financial results. Journal of Business Research 62 636–643. Takada, H. and Bass, F. M. (1998). Multiple Time Series Analysis of Competitive Marketing Behavior. Journal of Business Research 43 97–107. Vaver, J. and Koehler, J. (2011). Measuring Ad Effectiveness Using Geo Experiments Technical Report, Google Inc. Vaver, J. and Koehler, J. (2012). Periodic Measurement of Advertising Effectiveness Using Multiple-Test-Period Geo Experiments Technical Report, Google Inc. West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer. Winship, C. and Morgan, S. L. (1999). The estimation of causal effects from observational data. Annual review of sociology 659–706.BAYESIAN CAUSAL IMPACT ANALYSIS 33 Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.) 233–243. North-Holland/Elsevier. Google, Inc. 1600 Amphitheatre Parkway Mountain View CA 94043, U.S.A. Estimating reach curves from one data point Georg M. Goerg Google Inc. Last update: November 21, 2014 Abstract Reach curves arise in advertising and media analysis as they relate the number of content impressions to the number of people who have seen it. This is especially important for measuring the effectiveness of an ad on TV or websites (Nielsen, 2009; PricewaterhouseCoopers, 2010). For a mathematical and datadriven analysis, it would be very useful to know the entire reach curve; advertisers, however, often only know its last data point, i.e., the total number of impressions and the total reach. In this work I present a new method to estimate the entire curve using only this last data point. Furthermore, analytic derivations reveal a surprisingly simple, yet insightful relationship between marginal cost per reach, average cost per impression, and frequency. Thus, advertisers can estimate the cost of an additional reach point by just knowing their total number of impressions, reach, and cost. A comparison of the proposed one-data point method to two competing regression models on TV reach curve data, shows that the proposed methodology performs only slightly poorer than regression fits to a collection of several points along the curve. 1 Introduction Let k+ reach, rk, be the percentage of the population that is exposed to a campaign at least k times. As usual, we measure impressions in gross rating points (GRPs), which is calculated as number of impressions divided by total (target) population multiplied by 100 (measured in percent). Equipped with a functional form of the reach curve, a variety of quantities of interest can be computed, e.g., marginal cost per reach or maximum possible reach. 
Advertisers, however, often only have two points of the reach curve r_k(g):

(1)    r_k(0) = 0 \quad \text{and} \quad r_k(G) = R \in [0, 100],

where G ≥ 0 is the total GRPs and R is the total reach. With this information alone one is tempted to use a linear approximation r_k^{(1)}(g) = (R/G)\, g. However, reach curves are not linear; in particular, under the linear approximation the marginal reach per GRP would equal the average reach per GRP (= 1/frequency), so (1) alone is not helpful for getting a better estimate of marginal GRPs (and thus cost) per reach at g = G. While the behavior of r_k(g) around g = G is in general unknown, the tangent at g = 0 can be approximated quite well: starting with no exposure, adding an infinitesimally small unit of GRPs (say ε) reaches ε · ι % of the population, where ι = ι(k) is the reciprocal of the expected number of impressions needed for the first person to see k impressions. One can lower bound ι by 1/k. For k = 1 the bound is tight, ι = 1; getting an exact expression of ι for k > 1 is ongoing research (in practice we found that ι = (k + \log_2 k)^{-1} gives good fits for several k ≥ 1). That is, for small g the reach curve can be approximated by a line through (0, 0) with slope ι:

(2)    r_k(g) \approx g \cdot \iota \quad \text{for small } g.

Thus, approximately,

(3)    \lim_{G \to 0} \frac{\partial}{\partial g} r_k(g = G) = \iota.

Combining (1) with (3) allows us to estimate a two-parameter model. Section 2 reviews parametric models for reach curves. Section 3 derives the parameter estimates based on the total GRP and reach. Simulations and comparisons to full least squares estimates are presented in Section 4. Finally, Section 5 summarizes the main findings and discusses future work. Details on the TV reach curve data and analytical derivations can be found in the Appendix.

2 Reach curve models

Let X ≥ 0 be the number of content impressions, e.g., TV shows, websites, or commercials. For a probabilistic view of reach curves, it is useful to decompose k+ reach as

(4)    P(X \ge k, \text{reachable}) = P(X \ge k \mid \text{reachable}) \cdot P(\text{reachable})
(5)    \Leftrightarrow \quad r_k = p_k \cdot \rho,

where ρ is the maximum possible reach and p_k is the probability of being reached at least k times, given that an individual is indeed reachable. This distinction allows us to model ρ and p_k with separate probabilistic models. Since reach is usually denoted in percent, we also use percent for the maximum possible reach ρ ∈ [0, 100], while we use proportions for p_k ∈ [0, 1]. For further analytical derivations it is necessary to parametrize p_k(g). Below we review two functional forms which are parsimonious (2 + 1 parameters), have excellent empirical fits, and lend themselves to simple analytical derivations.

2.1 Gamma-Mixture

Jin et al. (2012) propose a Poisson distribution for the impressions g, with an exponential prior distribution with rate β on the Poisson rate λ. This yields a model of the form

(6)    p_k(g) = 1 - \frac{\beta}{g + \beta}.

The exponential prior can be generalized to a Γ(α, β) distribution, which yields

(7)    r_k(g) = \rho \left( 1 - \left( \frac{\beta}{\beta + g} \right)^{\alpha} \right).

By construction, (6) is nested in (7), which can be tested using a hypothesis test for H_0: α = 1.

2.1.1 Marginal reach

The derivative of p_k in (7) with respect to g equals (see Section B.1 for details)

(8)    \frac{\partial}{\partial g} p_k(g) = \frac{\alpha}{\beta} \left( \frac{\beta}{g + \beta} \right)^{\alpha + 1},

so that

(9)    \lim_{g \to 0} \frac{\partial}{\partial g} r_k(g) = \frac{\rho \alpha}{\beta}.

Eq. (9) has three degrees of freedom; since only two data points are available, one parameter has to be fixed. Given the nested structure of the exponential model, it is natural to set α ≡ 1.
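A short R sketch of the Gamma-mixture curve (7) and its marginal reach (8)-(9); the parameter values in the plotting call are made up for illustration only.

    ## Gamma-mixture reach curve (7); rho in percent, g in GRPs
    reach_gamma <- function(g, rho, alpha, beta) {
      rho * (1 - (beta / (beta + g))^alpha)
    }

    ## Marginal reach: derivative of the reach curve, rho * (8); slope at g = 0 is rho * alpha / beta, as in (9)
    reach_gamma_deriv <- function(g, rho, alpha, beta) {
      rho * (alpha / beta) * (beta / (g + beta))^(alpha + 1)
    }

    ## With alpha = 1 this reduces to the exponential-prior model (6)
    curve(reach_gamma(x, rho = 60, alpha = 1, beta = 150), from = 0, to = 1000,
          xlab = "GRPs", ylab = "1+ reach (%)")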
2.2 Conditional Logit

As an alternative we propose a logistic regression

(10)    \operatorname{logit}(p_k(g)) = \beta_0 + \beta_1 \cdot \log g,

where logit(p) = log(p/(1 − p)), and β_0 and β_1 are intercept and slope. (We deliberately do not use α and β to parametrize intercept and slope, as this is prone to confusion with the reversed roles of α and β in (8).) Using the logit inverse, expit(x) = e^x/(1 + e^x) = 1/(1 + e^{−x}), Eq. (10) can be rewritten as

(11)    p_k = \operatorname{expit}(\beta_0 + \beta_1 \log g) = \frac{e^{\beta_0 + \beta_1 \log g}}{1 + e^{\beta_0 + \beta_1 \log g}}
(12)        = 1 - \frac{1}{1 + e^{\beta_0} g^{\beta_1}}
(13)        = 1 - \frac{e^{-\beta_0}}{e^{-\beta_0} + g^{\beta_1}},

which shows the similarity to (7). In fact, identifying β ≡ e^{−β_0}, both models coincide if α = 1 and β_1 = 1, respectively. Again, this can be tested using a two-sided hypothesis test for H_0: β_1 = 1. The conditional Logit model can also be interpreted as the baseline Gamma-mixture model with α ≡ 1, but with transformed GRPs, \tilde{g} = g^{\beta_1}, in (7). Here β_1 can be interpreted as a parameter that measures the efficiency of GRPs: for β_1 > 1, GRPs are more efficient than baseline; for β_1 = 1, GRPs are spent according to the baseline model; and for β_1 < 1, GRPs are not spent as efficiently as expected. For empirical estimates see Section 4.

[…] Thus, for the logit model one has to assume β_1 = 1 in order to use the linear approximation of R(g) at g = 0 for 1+ reach. (For k > 1, the Logit model with β_1 > 1 might become useful, as the marginal k+ reach for the very first impression is 0. However, one then has to estimate three parameters again, which is not possible without further assumptions or more than one data point.)

3 Methodology

Equipped with the two-parameter model

(16)    r(g; \rho, \beta) = \rho \left( 1 - \frac{\beta}{\beta + g} \right) = \rho \, \frac{g}{\beta + g} \in [0, \rho],

we can use the tangent approximation in (3) together with the total GRP and reach to estimate ρ and β. Note that β ≥ 0 is a saturation parameter and controls how efficient GRPs are: for small β, reach grows quickly with GRPs; for large β, it grows slowly. The derivative of (16) equals

(17)    r'(g; \rho, \beta) = \frac{\rho \beta}{(\beta + g)^2},

which at g = 0 evaluates to r'(0) = ρ/β. This gives a system of two equations (maximum GRP and reach, and marginal reach at 0) with two unknowns, ρ ∈ [0, 100] and β > 0:

(18)    \frac{\rho}{\beta} = \iota \quad \Leftrightarrow \quad \rho = \beta \cdot \iota,
(19)    \rho \, \frac{G}{\beta + G} = R \quad \Leftrightarrow \quad \rho = \frac{R (G + \beta)}{G}.

First note that for 1+ reach, ρ ≡ β, since ι(k = 1) = 1. Moreover, ρ in (19) satisfies ρ ≥ 0 for all β, but it satisfies ρ ≤ 100 only for β ≤ G(100 − R)/R. Solving for β and plugging into ρ = ρ(β) gives

(20)    \hat{\rho} = \min\left\{ \frac{G \cdot R}{G - R/\iota}, \; 100 \right\}

and

(21)    \hat{\beta} = \begin{cases} \hat{\rho}/\iota = \dfrac{G \cdot R/\iota}{G - R/\iota}, & \text{if } \hat{\rho} < 100, \\ G \cdot \dfrac{100 - R}{R}, & \text{if } \hat{\rho} = 100. \end{cases}

The estimate is capped at \hat{\rho} = 100 exactly when G ≤ (100/\iota) \cdot R/(100 − R); that is, the cap binds when the GRPs are at most a constant times the odds of reach, R/(100 − R). Plugging these estimates back into (16) yields expressions for reach solely as a function of R and G (for details see Appendix B). According to (21) we consider the two scenarios separately.

[…]

\frac{\operatorname{var}(\hat{\theta}_I)}{\operatorname{var}(\hat{\theta}_S)} \ge 1 - F\theta/p_z > 1 - F.

The first inequality will be close to an equality when p_z, and hence θ, is small. For our applications 1 − Fθ/p_z is a reasonable approximation to the variance ratio. The second inequality reflects the fact that pooling the data cannot possibly be better than what we would get with an SSP of size n + N. From var(\hat{\theta}_I)/var(\hat{\theta}_S) ≈ 1 − Fθ/p_z we see that using the BRP is effectively like multiplying the SSP sample size n by 1/(1 − Fθ/p_z). Our greatest precision gains come when a high fraction of online reaches are incremental, that is, when θ/p_z is largest. In our application this proportion ranges from 20% to 50% when aggregated to the campaign level.
See Table 2.1 in Section 2.

3.2 Gain from the CIA alone

Here we evaluate the variance reduction that would follow from the CIA. In that case, we could take advantage of the Z-Y independence and estimate θ by \hat{\theta}_C = \bar{Z}_S (1 - \bar{Y}_S). It is shown in the Appendix that the delta-method variance of \hat{\theta}_C satisfies

(2)    \frac{\operatorname{var}(\hat{\theta}_C)}{\operatorname{var}(\hat{\theta}_S)} = 1 - \frac{p_y (1 - p_z)}{1 - \theta} > 1 - p_y,

when the CIA holds. This can represent a dramatic improvement when the online reach p_z and incremental reach θ are both small while the TV reach p_y is large. If the CIA holds, our application data suggest the variance reduction can be from 50% to 80%. The reverse setting, with tiny TV reach and large online reach, would not be favorable to \hat{\theta}_C, but our data are not of that type.

3.3 Gain from the CIA and IDA

Finally, suppose that both the CIA and IDA hold. If we apply both assumptions, we can get the estimator \hat{\theta}_{I,C} = (f\bar{Z}_S + F\bar{Z}_B)(1 - \bar{Y}_S). We already gain a lot from the CIA, so it is interesting to see how much more the IDA adds when the CIA holds. We show in the Appendix that under both assumptions,

\frac{\operatorname{var}(\hat{\theta}_{I,C})}{\operatorname{var}(\hat{\theta}_C)} = \frac{f(1 - p_y)(1 - p_z) + p_y p_z}{(1 - p_y)(1 - p_z) + p_y p_z}.

If both reaches are high then we gain little, but if both reaches are small then we reduce the variance by almost a factor of f when adding the IDA to the CIA. In our case we expect that the television reach is large but the online reach is small, fitting neither of these extremes. Consider a campaign with f = 1/3, p_y = 2/3 and p_z = 1/100, similar to the soap campaigns. For such a campaign,

\frac{\operatorname{var}(\hat{\theta}_{I,C})}{\operatorname{var}(\hat{\theta}_C)} = \frac{(1/9) \times 0.99 + (2/3) \times 0.01}{(1/3) \times 0.99 + (2/3) \times 0.01} \doteq 0.34,

so the combined assumptions then allow a nearly three-fold variance reduction compared to the CIA alone.

4 Example campaigns

Our data enrichment scheme is described in Section 5. Here we illustrate the results from that scheme on six marketing campaigns and discuss the differences among the algorithms. In addition to data enrichment, we also show results from tree-structured models. Those split the data into groups and recursively split the groups. More about tree fitting is in Section 5. One model fits a tree to the SSP data alone and another works with the pooled SSP and BRP data. For all three of those methods we have aggregated the predictions over the age variable, which takes six levels. In addition, we show the empirical results for age, which amount to recording the percentage of incremental reaches, that is, data with Z(1 − Y) = 1, at each unique level of age in the SSP. There is no corresponding empirical prediction fully disaggregated by age, gender, income and education, because of the many empty cells that would result.

We found the age-related patterns of incremental reach particularly interesting. Figure 4.1 shows estimated incremental reach for all three models and the empirical counts, on all six campaigns, averaged over age groups. The beer campaign is particularly telling. The empirical data show a decreasing trend of incremental reach with age. The tree fit to SSP-only data yields a fit that is constant in age. The tree model had to explore splitting the data on all four variables without a prior focus on age. There were only 23 incremental reach events for beer in the SSP data set. With such a small number of events and four predictors, there is considerable possibility of overfitting. Cross-validation led to a model that grouped the entire SSP into one set, that is, the tree had no splits.
Both pooling and data enrichment were able to borrow strength from the BRP as well as take advantage of the approximate independence of television and web exposure. They then recover the trend with age.

Fig. 4.1: Estimated incremental reach (%) by age level (six levels) for the six campaigns (Beer, Chrome, Salt, Soap 1, Soap 2, Soap 3) and three models: SSP, Pooling and DEIR as described in the text. Empirical counts are marked by Emp.

The Salt campaign had a similarly small number of incremental reaches and once again the SSP-only tree was constant. Fitting a tree to the SSP data always gave a flatter fit versus age than did DEIR, which in turn was flatter than what we would get simply pooling the data. Section 6 gives simulations in which DEIR has greater accuracy than using pooling or SSP only.

5 Data enrichment for incremental reach

For a given sample we would like to combine incremental reach estimates \hat{\theta}_S, \hat{\theta}_I, \hat{\theta}_C and \hat{\theta}_{I,C}, whose assumptions are: none, IDA, CIA and IDA+CIA, respectively. The latter three add some value if their corresponding assumptions are nearly true, but our information about how well those assumptions hold comes from the same data we are using to form the estimates.

The circumstances are similar to those in data enriched linear regression (Chen et al., 2013). In that problem there is a regression model Y_i = X_i^T β + ε_i which holds in the SSP, and a biased regression model Y_i = X_i^T(β + γ) + ε_i which holds in the BRP. The estimates are found by minimizing

(3)    S(\lambda) = \sum_{i \in S} (Y_i - X_i^T \beta)^2 + \sum_{i \in B} (Y_i - X_i^T (\beta + \gamma))^2 + \lambda \sum_{i \in S} (X_i^T \gamma)^2

over β and γ for a nonnegative penalty factor λ. The ε_i are independent with mean 0 and variance σ²_S in the SSP and σ²_B in the BRP. Taking λ = 0 amounts to fitting regressions separately in the two samples, yielding an estimate β̂ that does not use the BRP at all. The limit λ → ∞ corresponds to pooling the two data sets, which would be optimal if there were no bias, i.e., if γ = 0. The specific penalty in (3) discourages the estimated γ from making large changes to the SSP; it is one of several penalties considered in that paper. Varying λ from 0 to ∞ gives a family of estimators that weight the SSP to varying degrees.

The optimal λ is unknown. An oracle that knew γ and the error variance in the two data sets would be able to compute the optimal λ under a mean squared error loss. Chen et al. (2013) get a formula for the oracle's λ and then plug estimates of γ and the variances into that formula. They show, under conditions, that the resulting plugin estimate gives better estimates of β than using the SSP only would. The conditions are that the Y values are normally distributed, and that the model have at least 5 regression parameters and 10 error degrees of freedom. The normality assumption allows a technical lemma due to Stein (1981) to be used, and we believe that gains from using the BRP do not require normality.

In principle we might multiply the sum of squared errors in the BRP by τ = σ²_S/σ²_B if that ratio is known. If σ²_B > σ²_S then we should put less weight on the BRP sample relative to the SSP sample. However, the same effect is gained by increasing λ. Since the algorithm searches for the optimal λ over a wide range, it is less important to precisely specify τ. Chen et al. (2013) took τ = 1, simply summing all squared errors, and we will generalize that approach.
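A hedged R sketch of the criterion (3) for a given λ, using a generic numerical optimiser rather than the closed-form solution in Chen et al. (2013); XS, YS, XB, YB are illustrative names for the SSP and BRP design matrices and responses.

    enrich_fit <- function(XS, YS, XB, YB, lambda) {
      p   <- ncol(XS)
      obj <- function(par) {
        beta  <- par[1:p]
        gamma <- par[(p + 1):(2 * p)]
        sum((YS - XS %*% beta)^2) +                  # SSP fit
          sum((YB - XB %*% (beta + gamma))^2) +      # BRP fit with bias term gamma
          lambda * sum((XS %*% gamma)^2)             # penalty from (3)
      }
      fit <- optim(rep(0, 2 * p), obj, method = "BFGS")
      list(beta = fit$par[1:p], gamma = fit$par[(p + 1):(2 * p)])
    }

With lambda = 0 this reproduces the separate SSP-only fit, while letting lambda grow large approaches the pooled fit, as described above.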
For the present setting we must modify the method. First, our responses are binary, not Gaussian. Second, we have four estimators to combine, not two. Third, those estimators are dependent, being fit to overlapping data sets.

5.1 Modification for binary response

To address the binary response there are two reasonable choices. One is to employ logistic regression. The other is to use tree-structured regression and then pool the estimators at the leaves of the tree. Regarding prediction accuracy, there is no unique best algorithm. There will be data sets for which simple logistic regression outperforms tree-based classifiers and vice versa. For this paper we have adopted trees. Tree-structured models have two practical advantages. First, the resulting cells that they select correspond to empirically determined market segments, which are then interpretable. Second, within any of those cells, the model is intercept-only. Then both logistic regression and least squares reduce to a simple average.

Data set   Source   Imputed V                                        Assumptions
D0         SSP      Z_S (1 − Y_S)                                    none
D1         BRP      Z_B (1 − \hat{Y}_SSP(X_B, Z_B))                  IDA
D2         SSP      \hat{Z}_SSP(X_S) (1 − \hat{Y}_SSP(X_S))          CIA
D3         SSP      \hat{Z}_SSP+BRP(X_S) (1 − \hat{Y}_SSP(X_S))      CIA & IDA

Tab. 5.1: Four incremental reach data sets and their imputed incremental reaches. The hats denote model-imputed values. For example, \hat{Y}_SSP(X_B, Z_B) is a predictive model for Y based on values of X and Z, fit using data from the SSP and evaluated at X = X_B and Z = Z_B (from the BRP).

Each leaf of the regression tree defines a subset of the data that we call a cell. There are cells 1, . . . , C. The SSP has n_c observations in cell c and the BRP has N_c observations there. For each cell and each set of assumptions we use a linear regression model relating an incremental reach quantity like \tilde{V}_i to an intercept. When there are no assumptions, \tilde{V}_i is the observed incremental reach for i ∈ S. Otherwise we may take advantage of the assumptions to impute values \tilde{V}_i using more of the data. The incremental reach values for each set of assumptions are given in Table 5.1. The predictive models shown there are all fit using rpart.

For k = 0, 1, 2, 3 let \tilde{V}_k be the vector of imputed responses under the corresponding assumptions from Table 5.1 and \tilde{X}_k their corresponding predictors. The regression framework minimizes

(4)    \| \tilde{V}_0 - \tilde{X}_0 \beta \|^2 + \sum_{k=1}^{3} \| \tilde{V}_k - \tilde{X}_k (\beta + \gamma_k) \|^2 + \sum_{k=1}^{3} \lambda_k \| \tilde{X}_0 \gamma_k \|^2

over β and γ_k for penalties λ_k. In our setting each \tilde{X}_k is a column vector of ones of length m_k. For cell c, m_1 = N_c and m_0 = m_2 = m_3 = n_c.

5.2 Search for λ_k

It is very convenient to search for suitable weights in the simplex

\Delta^{(K)} = \left\{ (\omega_0, \omega_1, \ldots, \omega_K) \;\middle|\; \omega_k > 0, \; \sum_{k=0}^{K} \omega_k = 1 \right\}

because it is a bounded set, unlike the set [0, ∞]^K of usable vectors λ = (λ_1, . . . , λ_K). Chen et al. (2013) remark that, because of unequal sample sizes, it is more reasonable to use a common set of λ_k over all cells. The search we use combines the advantages of both approaches. Our search strategy for the simplex is to choose a grid of weight vectors ω_g = (ω_{g0}, ω_{g1}, . . . , ω_{gK}) ∈ \Delta^{(K)}, g = 1, . . . , G. For each vector ω_g we find a vector λ_g = (λ_1, . . . , λ_K) such that

\sum_{c=1}^{C} p_c \, \omega_{k,c} = \omega_{gk}, \qquad k = 0, 1, \ldots, K,

where p_c is the proportion of our target population in cell c.
That is, the population average of the cell-level weights $\omega_{k,c}$ matches $\omega_{gk}$. These weights give us the vector $\lambda_g = (\lambda_{g1}, \ldots, \lambda_{gK})$. Using $\lambda_g$ in the penalty criterion (4) specifies the weights we use within each cell.

Our algorithm chooses the tree and the vector $\omega$ jointly using cross-validation. It is computationally expensive to make high-dimensional searches. With $K$ factors there is a $K - 1$ dimensional space of weights to search. Adding in the tree size gives a $K$-th dimension. As a result, combining all of our estimators requires us to search a 4-dimensional grid of values. We have chosen to set one of the $\omega_k$ to 0 to reduce the search space from 4 dimensions to 3. We always retained the unbiased estimate $\hat\theta_S$ along with two others. In some computations reported in section A.4 of the Appendix we find only small differences among setting $\omega_1 = 0$, $\omega_2 = 0$ or $\omega_3 = 0$. The best outcome was setting $\omega_1 = 0$. That has the effect of removing the estimate based on IDA only. As we saw in section 3, the IDA-only model had the least potential to improve our estimate. As a bonus, all three of the retained submodels have the same sample sizes, and then a common $\lambda$ over cells coincides with a common $\omega$ over cells.

In the special case with $\omega_1 = 0$ we find after some calculus that the minimizer of (4) has

$$\hat\beta_c = \frac{\bar V_{0c} + \sum_{k\in\{2,3\}} \frac{\lambda_k}{1+\lambda_k}\,\bar V_{kc}}{1 + \sum_{k\in\{2,3\}} \frac{\lambda_k}{1+\lambda_k}} \equiv \sum_{k\in\{0,2,3\}} \omega_{kc}(\lambda)\,\bar V_{kc}, \qquad (5)$$

where $\bar V_{kc}$ is the simple average of $\tilde V_k$ over $i \in S$ for cell $c$. Our default grid takes all values of $\omega$ whose coefficients are integer multiples of 10%. Populations D0, D2 and D3 all have the sample size $n$, and of these only D0 is surely unbiased. An observation in D0 is worth at least as much as an observation in D2 or D3, and so we require $\omega_0 \ge \max\{\omega_2, \omega_3\}$. Figure 5.1 shows this region and the set of 24 weight combinations that we use.

Fig. 5.1: The left panel ("Weight region") shows the simplex of weights applied to data sets D0, D2 and D3, with the unbiased data set D0 in the lower left. The shaded region has the valid weights. The right panel ("Weight points") shows that region with points for the 24 weights we use in our algorithm.

5.3 Search for tree size

Here we give a brief review of regression trees in order to define our algorithm. For a full description see the monograph by Breiman et al. (1985). The version we use is the function rpart (Therneau and Atkinson, 1997) in the R programming language (R Core Team, 2012).

Regression trees are built from splits of the set of subjects. A split uses one of the features in X and creates two subsets based on the values of that feature. For example, it might split males from females, or it might split those with the two smallest education levels from the others. Such a split defines two subpopulations of our target population, and it equally defines two subsamples of our sample. A regression tree is a recursively defined set of splits. After the subjects are split into two groups based on one variable, each of those two groups may then be split again, using the same or different variables. Recursive splitting yields a tree structure with subsets of subjects in the leaf nodes. Given a tree, we predict for subjects by a rule based on the leaf to which they belong. That rule uses the average within the subject's leaf node. The tree is found by a greedy search that minimizes a measure of prediction error.
In our case the measure, $R(T)$, is the sum of squared prediction errors. By construction, any tree with more splits than $T$ has lower error, and this brings a risk of overfitting. To counter overfitting, rpart adds a penalty proportional to the number $|T|$ of leaves in tree $T$. The penalized criterion is $R(T) + \alpha|T|$, where the parameter $\alpha > 0$ is chosen by M-fold cross-validation. This reduces the potentially complicated problem of choosing a tree to the simpler problem of selecting a scalar penalty parameter $\alpha$.

The rpart function has one option that we have changed from the default. That parameter is cp, the complexity parameter, whose default is $10^{-2}$. The cp parameter stops tree growing early if a proposed split improves $R(T)$ by less than a factor of cp. We set cp $= 10^{-4}$. Our choice creates somewhat larger trees to get more choices to use in cross-validation.

5.4 The algorithm

Here is a summary of the entire algorithm. First we make the following preprocessing steps.

1) Fit a large tree $T$ by rpart relating observed incremental reaches $V_i$ to predictor variables $X_i$ in the SSP data. This tree returns a nested sequence of subtrees $T_0 \subset T_1 \subset \cdots \subset T_L \subset T$. Each $T_\ell$ corresponds to a critical value $\alpha_\ell$ of the penalty, and choosing $\alpha_\ell$ from this list selects the tree $T_\ell$. The value $L$ is data-dependent, and chosen by rpart.

2) Specify a grid of values $\omega_g$ for $g = 1, \ldots, G$. Here $\omega_g = (\omega_{g0}, \omega_{g1}, \ldots, \omega_{gK})$ with $\omega_{gk} \ge 0$ and $\sum_{k=0}^{K}\omega_{gk} = 1$.

3) Randomly partition the SSP data $(X_i, Y_i, Z_i)$ into M folds $S_m$ for $m = 1, \ldots, M$, each of roughly equal size $n/M$. For fold $m$ the SSP will contain $\cup_{m' \ne m} S_{m'}$; we call this $S_{-m}$. The BRP for fold $m$ is the entire BRP. We also considered using a bootstrap sample for the fold $m$ BRP, but that was more expensive and less accurate in our numerical investigation, as described in section A.4 of the Appendix.

After this precomputation, our algorithm proceeds to the cross-validation shown in Figure 5.2 to make a joint selection of the tree penalty parameter $\alpha_\ell$ and the simplex grid point $\omega_g$. Let the chosen values be $\alpha^*$ and $\omega^*$. We select the tree $T^*$ from step 1 above, corresponding to penalty parameter $\alpha^*$. We treat each leaf node of $T^*$ as a cell $c$. We translate $\omega^*$ into the corresponding $\lambda_c$ in every cell $c$ of tree $T^*$. Then we minimize (4) using this $\lambda_c$, and the resulting $\hat\beta_c$ is our estimate $\hat V_c$ of incremental reach in cell $c$. After choosing the tuning parameters $\omega_g$ and $\alpha_\ell$ by cross-validation, we use these parameters on the whole data set to make our final prediction.

6 Numerical investigation

In order to measure the effect of data enriched estimates on incremental reach, we conducted a simulation where we knew the ground truth. Our goal is to predict for ensembles, not for individuals, so we constructed two large populations in which ground truth was known to us, simulated our process of subsampling them, and scored predictions against the ground truth incremental reach probabilities. To make our large samples realistic, we built them from our real data. We created S- and B-populations by replicating our SSP (respectively BRP) records 100 times each. Then in each simulation, we form an SSP by drawing 6000 observations at random from the S-population, and a BRP by drawing 13,000 observations at random from the B-population.
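For illustration only, the resampling step of this simulation could be written as the following R sketch; `ssp_records` and `brp_records` are hypothetical data frames standing in for the real panel data.

```r
# Build pseudo-populations by replicating each panel record 100 times,
# then draw one simulated SSP and BRP of the sizes used in the text.
s_population <- ssp_records[rep(seq_len(nrow(ssp_records)), each = 100), ]
b_population <- brp_records[rep(seq_len(nrow(brp_records)), each = 100), ]
ssp_sim <- s_population[sample(nrow(s_population), 6000), ]
brp_sim <- b_population[sample(nrow(b_population), 13000), ]
```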
for ℓ = 1, . . . , L do                             // initialize error sums of squares
    for g = 1, . . . , G do
        SSE_{ℓ,g} ← 0
for m = 1, . . . , M do                             // folds
    construct Table 5.1 for fold m, using S_{−m} and B
    fit tree T_m for fold m by rpart
    prune tree T_m to T_{1,m}, . . . , T_{L,m}, where tree T_{ℓ,m} uses α_ℓ
    for ℓ = 1, . . . , L do                         // tree sizes
        define cells S_{−m,c} and B_c, c = 1, . . . , C, from the leaves of T_{ℓ,m}
        for g = 1, . . . , G do                     // simplex weights
            convert ω_g into λ_g
            for c = 1, . . . , C do                 // cells
                compute Ṽ_k for k = 0, 2, 3 in cell c
                get V̂_c = β̂_c from the weighted average (5)
                V̄_c ← (1/|S_{m,c}|) Σ_{i ∈ S_{m,c}} V_i        // held-out incremental reach
                p_c ← fraction of the true S population in cell c
                SSE_{ℓ,g} ← SSE_{ℓ,g} + p_c (V̂_c − V̄_c)²

Fig. 5.2: Data enrichment for incremental reach (deir) algorithm. After the precomputation described in section 5.4, we run this cross-validation algorithm to choose the complexity parameter α_ℓ and the weights ω_g as the joint minimizers ℓ* and g* of SSE_{ℓ,g}. The values p_c come from a census, or from the SSP if the census does not have the variables we need.

We use M = 10. For each campaign, we apply deir with this sample data to estimate the incremental reach $\hat V(x)$, using 10-fold cross-validation. The mean square estimation error (MSE) is $\sum_x p(x)(\hat V(x) - V(x))^2$, where the sum is taken over all $x$ values in the SSP. The simulation above was repeated 1000 times. The root mean square error was divided by the true incremental reach to get a relative RMSE.

We consider two comparison methods. The first is to use the SSP only. That method computes $\hat\theta_S$ within the leaves of a tree; the tree is found by rpart. The second comparison is a tree fit by rpart to the pooled SSP and BRP data, using both CIA and IDA. We do not compare to the empirical fractions because many of them are from empty cells.

Figure 6.1 compares the relative errors of the SSP-only method to data enrichment.
Fig. 6.1: Performance comparison, SSP only versus data enrichment: predictive relative mean square errors. There is one panel for each of the 6 campaigns (Beer, Chrome, Salt, Soap 1, Soap 2, Soap 3), with one point for each of 1000 replicates; the axes are DEIR RMSE (%) versus SSP RMSE (%). The reference line is the forty-five degree line.

Data enrichment is better than the SSP-only method in the great majority of replications, across all 6 campaigns we simulated. It is clear that the populations are similar enough that using the larger data set improves estimation of incremental reach.
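The error metric plotted in Figures 6.1 and 6.2 can be computed directly; the short R sketch below is illustrative only, with `p`, `v_hat` and `v_true` standing in for the cell proportions, the fitted values and the true incremental reach, and it assumes the "true incremental reach" used for scaling is the population-weighted mean of V.

```r
# Population-weighted mean square estimation error, then relative RMSE.
# All inputs are hypothetical vectors aligned by cell/x value.
mse <- sum(p * (v_hat - v_true)^2)
relative_rmse <- sqrt(mse) / sum(p * v_true)  # assumed: scale by overall true reach
```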
Under the IDA we can pool the SSP and BRP together, using rpart on the combined data to estimate Pr(Z = 1 | X). Under the CIA we can multiply this estimate by Pr(Y = 0 | X) fit by rpart to the SSP; see Table 5.1 under the assumption CIA & IDA. This method, an implementation of statistical matching, uses two separate applications of rpart, each with its own built-in cross-validation. Figure 6.2 compares the relative errors of statistical matching to data enrichment. Data enrichment is again better in the great majority of replications, across all 6 campaigns we simulated.

We also investigate, for each estimator, how much of the predictive error is contributed by bias. It is well known that predictive mean square error can be decomposed as the sum of variance and squared bias. These quantities are typically unknown in practice, but they can be evaluated in simulation studies.
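A minimal R sketch of this statistical-matching comparator is below; the data frames `ssp` and `brp` and the covariate names `x1`, `x2` are hypothetical stand-ins for the real panel variables.

```r
library(rpart)

# IDA: estimate Pr(Z = 1 | X) on the pooled SSP + BRP data.
pooled <- rbind(ssp[, c("z", "x1", "x2")], brp[, c("z", "x1", "x2")])
fit_z  <- rpart(z ~ x1 + x2, data = pooled, method = "anova")
# CIA: estimate Pr(Y = 1 | X) on the SSP alone.
fit_y  <- rpart(y ~ x1 + x2, data = ssp, method = "anova")
# Statistical-matching estimate of incremental reach: Pr(Z=1|X) * (1 - Pr(Y=1|X)).
match_pred <- predict(fit_z, newdata = ssp) * (1 - predict(fit_y, newdata = ssp))
```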
Fig. 6.2: Performance comparison, statistical matching (data pooling) versus data enrichment: predictive relative mean square errors. There is one panel for each of the 6 campaigns, with one point for each of 1000 replicates; the axes are DEIR RMSE (%) versus Pool RMSE (%). The reference line is the forty-five degree line.

Table 6.1 reports the fraction of squared bias in the predictive mean square error for each method in all six studies. We see there that the error for statistical matching (data pooling) is dominated by bias, while the error for SSP only is dominated by variance. These results are not surprising, because the SSP-only method has no sampling bias (only algorithmic bias) while the pooled data set has maximal sampling bias. The proportion of bias for DEIR is in between these extremes. Here we have less population bias than in a typical data fusion situation, because the TV and online-only panels were recruited in the same way. The bottom of Table 6.1 shows that DEIR trades off bias and variance more effectively than SSP only or data pooling: DEIR attains the smallest predictive mean squared error.
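The bias and variance shares reported in Table 6.1 can be recovered from simulation replicates as in the illustrative R sketch below; `preds` (a replicates-by-cells matrix of predictions) and `truth` (true cell-level incremental reach) are hypothetical, and equal cell weights are assumed for brevity, whereas the paper weights cells by population proportions.

```r
# Split predictive MSE into squared bias and variance across replicates.
mse_by_cell   <- colMeans(sweep(preds, 2, truth)^2)    # mean of (Vhat - V)^2 per cell
bias2_by_cell <- (colMeans(preds) - truth)^2
var_by_cell   <- mse_by_cell - bias2_by_cell
fraction_bias <- sum(bias2_by_cell) / sum(mse_by_cell) # analogue of the bias^2/mse rows
```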
bias²/mse   Beer   Chrome   Salt   Soap 1   Soap 2   Soap 3
SSP         0.35   0.42     0.26   0.12     0.28     0.12
Pool        0.88   0.82     0.88   0.88     0.88     0.93
DEIR        0.49   0.59     0.47   0.33     0.47     0.39

mse         Beer   Chrome   Salt   Soap 1   Soap 2   Soap 3
SSP         1.02   7.76     0.89   0.84     1.26     0.66
Pool        0.82   7.39     0.80   0.86     1.12     0.78
DEIR        0.61   5.42     0.48   0.52     0.68     0.42

Tab. 6.1: The upper rows show the fraction bias²/mse of the mean squared prediction error due to bias for 3 methods to estimate incremental reach in 6 campaigns. The lower rows show the total mse, that is bias² + var.

Conclusions

Predictions of incremental reach can be improved by making use of additional data. That improvement comes only if certain strong assumptions are true or at least approximately true. Our only guide to the accuracy of those assumptions may come from the data themselves. Our data enriched incremental reach estimate uses a shrinkage strategy to pool estimates using different assumptions. Cross-validating the level of pooling gave us an algorithm that worked better than either ignoring the additional data or treating it the same as the unbiased data.

Acknowledgment

This project was not part of Art Owen's Stanford responsibilities. His participation was done as a consultant at Google. The authors would like to thank Penny Chu, Tony Fagan, Yijia Feng, Jerome Friedman, Yuxue Jin, Daniel Meyer, Jeffrey Oldham and Hal Varian for support and constructive comments.

References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1985). Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL.

Chen, A., Owen, A. B., and Shi, M. (2013). Data enriched linear regression. Technical report, Google. http://arxiv.org/abs/1304.1837.

Collins, J. and Doe, P. (2009). Developing an integrated television, print and consumer behavior database from national media and purchasing currency data sources. In Worldwide Readership Symposium, Valencia.

Doe, P. and Kudon, D. (2010). Data integration in practice: connecting currency and proprietary data to understand media use. ARF Audience Measurement 5.0.

D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester, UK.

Gilula, Z., McCulloch, R. E., and Rossi, P. E. (2006). A direct approach to data fusion. Journal of Marketing Research, XLIII:73–83.

Jin, Y., Shobowale, S., Koehler, J., and Case, H. (2012). The incremental reach and cost efficiency of online video ads over TV ads. Technical report, Google.

Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer, New York, third edition.

Little, R. J. A. and Rubin, D. B. (2009). Statistical Analysis with Missing Data. John Wiley & Sons, Hoboken, NJ, 2nd edition.

R Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

Rässler, S. (2004). Data fusion: identification problems, validity, and multiple imputation. Austrian Journal of Statistics, 33(1&2):153–171.

Singh, A. C., Mantel, H., Kinack, M., and Rowe, G. (1993). Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19:59–79.

Stein, C. M. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.
The Nielsen Company (2011). The cross-platform report. Quarter 2, U.S.

Therneau, T. M. and Atkinson, E. J. (1997). An introduction to recursive partitioning using the RPART routines. Technical Report 61, Mayo Clinic.

A Appendix

A.1 Variance reduction by IDA

Recall that $f = n/(n + N)$ and $F = N/(n + N)$ are the sample size proportions of the two data sets. Under the IDA we may estimate incremental reach by
$$\hat\theta_I = (f\bar Z_S + F\bar Z_B)\,\frac{\bar V_S}{\bar Z_S} = \bar V_S\Big(f + F\,\frac{\bar Z_B}{\bar Z_S}\Big).$$
By the delta method (Lehmann and Romano, 2005), $\mathrm{var}(\hat\theta_I)$ is approximately
$$\mathrm{var}(\bar V_S)\Big(\frac{\partial\hat\theta_I}{\partial\bar V_S}\Big)^2 + \mathrm{var}(\bar Z_B)\Big(\frac{\partial\hat\theta_I}{\partial\bar Z_B}\Big)^2 + \mathrm{var}(\bar Z_S)\Big(\frac{\partial\hat\theta_I}{\partial\bar Z_S}\Big)^2 + 2\,\mathrm{cov}(\bar V_S,\bar Z_S)\,\frac{\partial\hat\theta_I}{\partial\bar V_S}\,\frac{\partial\hat\theta_I}{\partial\bar Z_S},$$
with the partial derivatives evaluated with the expectations $E(\bar V_S)$, $E(\bar Z_S)$ and $E(\bar Z_B)$ replacing the corresponding random quantities. The other two covariances are zero because the S and B samples are independent. From the binomial distribution we have $\mathrm{var}(\bar V_S) = \theta(1-\theta)/n$, $\mathrm{var}(\bar Z_B) = p_z(1-p_z)/N$ and $\mathrm{var}(\bar Z_S) = p_z(1-p_z)/n$. Also
$$\mathrm{cov}(\bar V_S, \bar Z_S) = \frac{1}{n}\big(E(V_iZ_i) - E(V_i)E(Z_i)\big) = \theta(1-p_z)/n.$$
After some calculus,
$$\mathrm{var}(\hat\theta_I) \approx \frac{\theta(1-\theta)}{n} + \frac{p_z(1-p_z)}{N}\,\frac{\theta^2F^2}{p_z^2} + \frac{p_z(1-p_z)}{n}\,\frac{\theta^2F^2}{p_z^2} - 2\,\frac{\theta(1-p_z)}{n}\,\frac{\theta F}{p_z}
= \mathrm{var}(\hat\theta_S) + \frac{\theta^2F(1-p_z)}{p_z}\Big(\frac{F}{N} + \frac{F}{n} - \frac{2}{n}\Big)
= \mathrm{var}(\hat\theta_S) - \frac{\theta^2F(1-p_z)}{p_z}\,\frac{1}{n}
= \mathrm{var}(\hat\theta_S)\Big(1 - F\,\frac{1-p_z}{p_z}\,\frac{\theta}{1-\theta}\Big).$$

A.2 Variance reduction by CIA

Applying the delta method to $\hat\theta_C = \bar Z_S(1 - \bar Y_S)$, we find that
$$\mathrm{var}(\hat\theta_C) \approx \mathrm{var}(\bar Z_S)\Big(\frac{\partial\hat\theta_C}{\partial\bar Z_S}\Big)^2 + \mathrm{var}(\bar Y_S)\Big(\frac{\partial\hat\theta_C}{\partial\bar Y_S}\Big)^2 + 2\,\mathrm{cov}(\bar Y_S,\bar Z_S)\,\frac{\partial\hat\theta_C}{\partial\bar Y_S}\,\frac{\partial\hat\theta_C}{\partial\bar Z_S} = \mathrm{var}(\bar Z_S)(1-p_y)^2 + \mathrm{var}(\bar Y_S)\,p_z^2 + 2\,\mathrm{cov}(\bar Y_S,\bar Z_S)(1-p_y)p_z.$$
Here $\mathrm{var}(\bar Z_S) = p_z(1-p_z)/n$, $\mathrm{var}(\bar Y_S) = p_y(1-p_y)/n$, and under conditional independence $\mathrm{cov}(\bar Y_S,\bar Z_S) = 0$. Thus
$$\mathrm{var}(\hat\theta_C) = \frac{1}{n}\big(p_z(1-p_z)(1-p_y)^2 + p_y(1-p_y)p_z^2\big) = \frac{p_z(1-p_y)}{n}\big((1-p_z)(1-p_y) + p_yp_z\big).$$
When the CIA holds, $\theta = p_z(1-p_y)$. Note that $\mathrm{var}(\hat\theta_S) = \theta(1-\theta)/n$. After some algebraic simplification we find that
$$\frac{\mathrm{var}(\hat\theta_C)}{\mathrm{var}(\hat\theta_S)} = 1 - \frac{p_y(1-p_z)}{1-\theta}.$$

A.3 Variance reduction by CIA and IDA

When both assumptions hold we can estimate $\theta$ by $\hat\theta_{I,C} = (f\bar Z_S + F\bar Z_B)(1 - \bar Y_S)$. Under these assumptions, $\bar Z_S$, $\bar Z_B$ and $\bar Y_S$ are all independent, and $\mathrm{var}(\hat\theta_{I,C})$ is approximately
$$\mathrm{var}(\bar Z_S)\Big(\frac{\partial\hat\theta_{I,C}}{\partial\bar Z_S}\Big)^2 + \mathrm{var}(\bar Z_B)\Big(\frac{\partial\hat\theta_{I,C}}{\partial\bar Z_B}\Big)^2 + \mathrm{var}(\bar Y_S)\Big(\frac{\partial\hat\theta_{I,C}}{\partial\bar Y_S}\Big)^2 = \frac{p_z(1-p_z)}{n}f^2(1-p_y)^2 + \frac{p_z(1-p_z)}{N}F^2(1-p_y)^2 + \frac{p_y(1-p_y)}{n}p_z^2 = \frac{p_z(1-p_y)}{n}\big(f(1-p_y)(1-p_z) + p_yp_z\big)$$
after some simplification. As a result,
$$\frac{\mathrm{var}(\hat\theta_{I,C})}{\mathrm{var}(\hat\theta_C)} = \frac{f(1-p_y)(1-p_z) + p_yp_z}{(1-p_y)(1-p_z) + p_yp_z}.$$

A.4 Alternative algorithms

We faced some design choices in our algorithm. First, we had to decide which estimators to include in our algorithm. We always include the unbiased choice $\hat\theta_S$ as well as two others. Second, we had to decide whether to use the entire BRP or to bootstrap sample it. We ran all six choices on simulations of all six data sets where we knew the correct answer. Table A.1 shows the mean squared errors for the six possible estimators on each of the six data sets. In every case we divided the mean squared error by that for the estimator combining $\hat\theta_S$, $\hat\theta_C$ and $\hat\theta_{I,C}$ without the bootstrap. We only see small differences, but the evidence favors choosing $\lambda_I = 0$ as well as not bootstrapping. Our default method is consistently the best in this table, although only by a small amount.
We saw that data enrichment is consistently better than either pooling the data or ignoring the large sample, and by much larger amounts than we see in Table A.1. As a result, any of the data enrichment methods in this table would make a big improvement over either pooling the samples or ignoring the BRP.

Estimators   θ̂_S, θ̂_I, θ̂_C    θ̂_S, θ̂_I, θ̂_I,C    θ̂_S, θ̂_C, θ̂_I,C
BRP          All     Boot      All     Boot        All     Boot
Beer         1.02    1.02      1.00    1.01        1       1.01
Chrome       1.04    1.04      1.01    1.01        1       1.00
Salt         1.04    1.04      1.01    1.01        1       1.01
Soap 1       1.04    1.05      1.01    1.02        1       1.00
Soap 2       1.05    1.05      1.01    1.03        1       1.01
Soap 3       1.02    1.02      1.01    1.00        1       1.00

Tab. A.1: Relative performance of our estimators on six problems. The relative errors are mean squared prediction errors normalized to the case that uses θ̂_S, θ̂_C, θ̂_I,C without bootstrapping. The relative error for that case is 1 by definition.

Coupled and k-Sided Placements: Generalizing Generalized Assignment

Madhukar Korupolu¹, Adam Meyerson¹, Rajmohan Rajaraman², and Brian Tagiku¹
¹ Google, 1600 Amphitheater Parkway, Mountain View, CA. Email: {mkar,awmeyerson,btagiku}@google.com
² Northeastern University, Boston, MA 02115. Email: rraj@ccs.neu.edu

Abstract. In modern data centers and cloud computing systems, jobs often require resources distributed across nodes providing a wide variety of services. Motivated by this, we study the Coupled Placement problem, in which we place jobs into computation and storage nodes with capacity constraints, so as to optimize some costs or profits associated with the placement. The coupled placement problem is a natural generalization of the widely-studied generalized assignment problem (GAP), which concerns the placement of jobs into single nodes providing one kind of service. We also study a further generalization, the k-Sided Placement problem, in which we place jobs into k-tuples of nodes, each node in a tuple offering one of k services. For both the coupled and k-sided placement problems, we consider minimization and maximization versions. In the minimization versions (MinCP and MinkSP), the goal is to achieve minimum placement cost, while incurring a minimum blowup in the capacity of the individual nodes. Our first main result is an algorithm for MinkSP that achieves optimal cost while increasing capacities by at most a factor of k + 1, also yielding the first constant-factor approximation for MinCP. In the maximization versions (MaxCP and MaxkSP), the goal is to maximize the total weight of the jobs that are placed under hard capacity constraints. MaxkSP can be expressed as a k-column sparse integer program, and can be approximated to within an O(k) factor using randomized rounding of a linear program relaxation. We consider alternative combinatorial algorithms that are much more efficient in practice. Our second main result is a local search based approximation algorithm that yields a 15-approximation and $O(k^3)$-approximation for MaxCP and MaxkSP respectively. Finally, we consider an online version of MaxkSP and present algorithms that achieve logarithmic competitive ratio under certain necessary technical assumptions.

1 Introduction

The data center has become one of the most important assets of a modern business. Whether it is a private data center for exclusive use or a shared public cloud data center, the size and scale of the data center continues to rise. As a company grows, so too must its data center to accommodate growing computational, storage and networking demand.
However, the new components purchased for this expansion need not be the same as the components already in place. Over time, the data center becomes quite heterogeneous [1]. This complicates the problem of placing jobs within the data center so as to maximize performance.

Jobs often require resources of more than one type: for example, compute and storage. Modern data centers typically separate computation from storage and interconnect the two using a network of switches. As such, when placing a job within a data center, we must decide which computation node and which storage node will serve the job. If we pick nodes that are far apart, then communication latency may become prohibitive. On the other hand, nodes are capacitated, so picking nodes close together may not always be possible. Most prior work in data center resource management is focussed on placing one type of resource at a time: e.g., placing storage requirements assuming job compute location is fixed [2, 3] or placing compute requirements assuming job storage location is fixed [4, 5]. One-sided placement methods cannot suitably take advantage of the proximities and heterogeneities that exist in modern data centers. For example, a database analytics application requiring high throughput between its compute and storage elements can benefit by being placed on a storage node that has a nearby available compute node.

In this paper, we study Coupled Placement (CP), which is the problem of placing jobs into computation and storage nodes with capacity constraints, so as to optimize costs or profits associated with the placement. Coupled placement was first addressed in [6] in a setting where we are required to place all jobs and we wish to minimize the communication latency over all jobs. They show that this problem, which we call MinCP, is NP-hard and investigate the performance of heuristic solutions. Another natural formulation is where the goal is to maximize the total number of jobs or revenue generated by the placement, subject to capacity constraints. We refer to this problem as MaxCP. We also study a generalization of Coupled Placement, the k-Sided Placement Problem (kSP), which considers $k \ge 2$ kinds of resources.

1.1 Problem definition

In the coupled placement problem, we are given a bipartite graph $G = (U, V, E)$ where $U$ is a set of compute nodes and $V$ is a set of storage nodes. We have capacity functions $C : U \to \mathbb{R}$ and $S : V \to \mathbb{R}$ for the compute and storage nodes, respectively. We are also given a set $T$ of jobs, each of which needs to be allocated to one compute node and one storage node. Each job may prefer some compute-storage node pairs more than others, and may also consume different resources at different nodes. To capture these heterogeneities, we have for each job $j$ a function $f_j : E \to \mathbb{R}$, a processing requirement $p_j : E \to \mathbb{R}$ and a storage requirement $s_j : E \to \mathbb{R}$. We note that, without loss of generality, we can assume that the capacities are unit, since we can scale the processing and storage requirements of individual nodes accordingly.

We consider two versions of the coupled placement problem. For the maximization version MaxCP, we view $f_j$ as a payment function. Our goal is to select a subset $A \subseteq T$ of jobs and an assignment $\sigma : A \to E$ such that all capacities are observed and our total profit $\sum_{j\in A} f_j(\sigma(j))$ is maximized. For the minimization version MinCP, we view $f_j$ as a cost function. Our goal is to find an assignment $\sigma : T \to E$ such that all capacities are observed and our total cost $\sum_{j\in T} f_j(\sigma(j))$ is minimized.
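As a toy illustration of these definitions (not part of the paper), an instance can be stored as a table of (t, u, v) triples with their f, p and s values, and a candidate assignment checked for feasibility and total value; all object and column names below are hypothetical.

```r
# inst: data frame with columns t, u, v, f, p, s (one row per job/node-pair).
# sigma: data frame with one chosen (t, u, v) row per assigned job.
# cap_u, cap_v: named vectors of compute and storage capacities, keyed by node id.
check_assignment <- function(inst, sigma, cap_u, cap_v) {
  chosen <- merge(sigma, inst, by = c("t", "u", "v"))
  load_u <- tapply(chosen$p, chosen$u, sum)   # processing load per compute node
  load_v <- tapply(chosen$s, chosen$v, sum)   # storage load per storage node
  feasible <- all(load_u <= cap_u[names(load_u)]) &&
              all(load_v <= cap_v[names(load_v)])
  list(value = sum(chosen$f), feasible = feasible)  # cost for MinCP, profit for MaxCP
}
```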
A generalization of the coupled placement problem is k-sided placement (kSP), in which we have $k$ different sets of nodes, $S_1, \ldots, S_k$, each set of nodes providing a distinct service. For each $i$, we have a capacity function $C_i : S_i \to \mathbb{R}$ that gives the capacity of a node in $S_i$ to provide the $i$th service. We are given a set $T$ of jobs, each of which needs each kind of service; the exact resource needs may depend on the particular k-tuple of nodes from $\prod_i S_i$ to which it is assigned. That is, for each job $j$, we have a demand function $d_j : \prod_i S_i \to \mathbb{R}^k$. We also have another function $f_j : \prod_i S_i \to \mathbb{R}$. As for coupled placement, we can assume that the capacities are unit, since we can scale the demands of individual nodes accordingly. Similar to coupled placement, we consider two versions of kSP, MinkSP and MaxkSP.

1.2 Our Results

All of the variants of CP and kSP are NP-hard, so our focus is on approximation algorithms. Our first set of results consists of the first non-trivial approximation algorithms for MinCP and MinkSP. Under hard capacity constraints, it is easy to see that it is NP-hard to achieve any bounded approximation ratio for cost minimization. So we consider approximation algorithms that incur a blowup in capacity. We say that an algorithm is α-approximate for the minimization version if its cost is at most that of an optimal solution, while incurring a blowup factor of at most α in the capacity of any node.
– We present a (k + 1)-approximation algorithm for MinkSP using iterative rounding, yielding a 3-approximation for MinCP.
We next consider the maximization version. MaxkSP can be expressed as a k-column sparse integer packing program (k-CSP). From this, it is immediate that MaxkSP can be approximated to within an O(k) approximation factor by applying randomized rounding to a linear programming relaxation [7]. An Ω(k/log k)-inapproximability result for k-set packing due to [16] implies the same hardness result for MaxkSP. Our second main result is a simpler approximation algorithm for MaxCP and MaxkSP based on local search.
– We present a local search based 15-approximation algorithm for MaxCP. We extend it to MaxkSP and obtain an $O(k^3)$-approximation.
The local search result applies directly to a version where we can assign tasks fractionally, but only to a single pair of machines (this is like assigning a task with lower priority and may have additional applications). We then describe a simple rounding scheme to obtain an integral version. The rounding technique involves establishing a one-to-one correspondence between fractional assignments and machines. This is much like the cycle-removing rounding for GAP; there is a crucial difference, however, since coupled and k-sided placements assign jobs to tuples of machines. Finally, we study the online version of MaxCP, in which tasks arrive online and must be irrevocably assigned or rejected immediately upon arrival.
– We extend the techniques of [8] to the case where the capacity requirement for a job is arbitrarily machine-dependent. This enables us to achieve a competitive ratio logarithmic in the ratio of best to worst value-per-capacity density, under necessary technical assumptions about the maximum job size.

1.3 Related Work

The coupled and k-sided placement problems are natural generalizations of the Generalized Assignment Problem (GAP), which can be viewed as a 1-sided placement problem. In GAP, which was first introduced by Shmoys and Tardos [9], the goal is to assign items of various sizes to bins of various capacities.
A subset of items is feasible for a bin if their total size is no more than the bin's capacity. If we are required to assign all items and minimize our cost (MinGAP), Shmoys and Tardos [9] give an algorithm for computing an assignment that achieves optimal cost while doubling the capacities of each bin. A previous result by Lenstra et al. [10] for scheduling on unrelated machines shows it is NP-hard to achieve optimal cost without incurring a capacity blowup of at least 3/2. On the other hand, if we wish to maximize our profit and are allowed to leave items unassigned (MaxGAP), Chekuri and Khanna [11] observe that the (1, 2)-approximation for MinGAP implies a 2-approximation for MaxGAP. This can be improved to an $(\frac{e}{e-1})$-approximation using LP-based techniques [12]. It is known that MaxGAP is APX-hard [11], though no specific constant of hardness is shown.

On the experimental side, most prior work in data center resource management focusses on placing one type of resource at a time: for example, placing storage requirements assuming job compute location is fixed (the file allocation problem [2], [13, 14, 3]) or placing compute requirements assuming job storage location is fixed [4, 5]. These are, in a sense, variants of GAP. The only prior work on Coupled Placement is [6], where they show that MinCP is NP-hard and experimentally evaluate heuristics: in particular, a fast approach based on stable marriage and knapsacks is shown to do well in practice, close to the LP optimal.

The MaxkSP problem is related to the recently studied hypermatching assignment problem (HAP) [15] and its special cases, including k-set packing and a uniform version of the problem. A (k + 1 + ε)-approximation is given for HAP in [15], where other variants of HAP are also studied. While the MaxkSP problem can be viewed as a variant of HAP, there are critical differences. For instance, in MaxkSP, each task is assigned at most one tuple, while in the hypermatching problem each client (or task) is assigned a subset of the hyperedges. Hence, the MaxkSP and HAP problems are not directly comparable. The k-set packing problem can be captured as a special case of MaxkSP, and hence the Ω(k/log k)-hardness due to [16] applies to MaxkSP as well.

2 The minimization version

Next, we consider the minimization version of the Coupled Placement problem, MinCP. We write the following integer linear program for MinCP, where $x_{tuv}$ is the indicator variable for the assignment of $t$ to the pair $(u, v)$, $u \in U$, $v \in V$.

Minimize: $\sum_{t,u,v} x_{tuv}\, f_t(u,v)$
Subject to: $\sum_{u,v} x_{tuv} \ge 1$ for all $t \in T$,
$\sum_{t,v} p_t(u,v)\, x_{tuv} \le c_u$ for all $u \in U$,
$\sum_{t,u} s_t(u,v)\, x_{tuv} \le d_v$ for all $v \in V$,
$x_{tuv} \in \{0, 1\}$ for all $t \in T$, $u \in U$, $v \in V$.

We refer to the first set of constraints as satisfaction constraints, and to the second and third sets as capacity constraints (processing and storage). We consider the linear relaxation of this program, which replaces the integrality constraints above with $0 \le x_{tuv} \le 1$ for all $t \in T$, $u \in U$, $v \in V$.

2.1 A 3-approximation algorithm for MinCP

We now present algorithm IterRound, based on iterative rounding [21], which achieves a 3-approximation for MinCP. We start with a basic algorithm that achieves a 5-approximation by identifying tight constraints with a small number of variables. The algorithm repeats the following steps until all variables have been rounded.

1. Extreme point: Compute an extreme point solution x to the current LP.
2. Eliminate variable or constraint: Execute one of these two steps. By Lemma 3, one of these steps can always be executed if the LP is nonempty.
2a. Remove from the LP all variables $x_{tuv}$ that take the value 0 or 1 in x. If $x_{tuv}$ is 1, then assign job $t$ to the pair $(u, v)$, remove the job $t$ and its associated variables from the LP, and reduce $c_u$ by $p_t(u, v)$ and $d_v$ by $s_t(u, v)$.
2b. Remove from the LP any tight capacity constraint with at most 4 variables.

Fix an iteration of the algorithm, and an extreme point x. Let $n_t$, $n_c$, and $n_s$ denote the number of tight task satisfaction constraints, computation constraints, and storage constraints, respectively, in x. Note that every task satisfaction constraint can be assumed to be tight, without loss of generality. Let $N$ denote the number of variables in the LP. Since x is an extreme point, if all variables in x take values in (0, 1), then we have $N = n_t + n_c + n_s$.

Lemma 1. If all variables in x take values in (0, 1), then $n_t \le N/2$.

Proof. Since a variable occurs only once over all satisfaction constraints, if $n_t > N/2$ there exists a satisfaction constraint that has exactly one variable. But then this variable needs to take value 1, a contradiction.

Lemma 2. If $n_t \le N/2$, then there exists a tight capacity constraint that has at most 4 variables.

Proof. If $n_t \le N/2$, then $n_s + n_c = N - n_t \ge N/2$. Since each variable occurs in at most one computation constraint and at most one storage constraint, the total number of variable occurrences over all tight storage and computation constraints is at most $2N$, which is at most $4(n_s + n_c)$. This implies that at least one of these tight capacity constraints has at most 4 variables.

Using Lemmas 1 and 2, we can argue that the above algorithm yields a 5-approximation. Step 2a does not cause any increase in cost or capacity. Step 2b removes a constraint, hence cannot increase cost; since the removed constraint has at most 4 variables, the total demand allocated on the relevant node is at most the demand of four tasks plus the capacity already used in earlier iterations. Since each task demand is at most the capacity of the node, we obtain a 5-approximation with respect to capacity. Studying the proof of Lemma 2 more closely, one can separate the case $n_t < N/2$ from the case $n_t = N/2$; in the former case, one can in fact show that there exists a tight capacity constraint with at most 3 variables. Together with a careful consideration of the $n_t = N/2$ case, one can improve the approximation factor to 4.

We now present an alternative selection of the tight capacity constraint that leads to a 3-approximation. One interesting aspect of this step is that the constraint being selected may not have a small number of variables. We replace step 2b by the following.

2b. Remove from the LP any tight capacity constraint in which the number of variables is at most two more than the sum of the values of the variables.

Lemma 3. If all variables in x take values in (0, 1), then there exists a tight capacity constraint in which the number of variables is at most two more than the sum of the values of the variables.

Proof. Since each variable occurs in at most two tight capacity constraints, the total number of occurrences of all variables across the tight capacity constraints is $2N - s$ for some nonnegative integer $s$. Since each satisfaction constraint is tight, each variable appears in 2 capacity constraints, and each variable takes on value less than 1, the sum of all the variables over the tight capacity constraints is at least $2n_t - s$. Therefore, the sum, over all tight capacity constraints, of the difference between the number of variables and their sum is at most $2(N - n_t)$.
Since there are N − n_t tight capacity constraints, for at least one of these constraints the difference between the number of variables and their sum is at most 2.

Lemma 4. Let u be a node with a tight capacity constraint in which the number of variables is at most 2 more than the sum of the variables. Then the sum of the capacity requirements of the tasks partially assigned to u is at most the current available capacity of u plus twice the capacity of u.

Proof. Let ℓ be the number of variables in the constraint for u, and let the associated tasks be numbered 1 through ℓ. Let the demand of task j for the capacity of node u be d_j. Then the capacity constraint for u is Σ_j d_j x_j = ĉ(u), where ĉ(u) is the available capacity of u in the current LP. We know that ℓ − Σ_j x_j ≤ 2. Since each d_j is at most C(u), the capacity of u, we obtain

  Σ_{j=1..ℓ} d_j = ĉ(u) + Σ_{j=1..ℓ} (1 − x_j) d_j ≤ ĉ(u) + (ℓ − Σ_{j=1..ℓ} x_j) C(u) ≤ ĉ(u) + 2C(u).

Theorem 1. IterRound is a polynomial-time 3-approximation algorithm for MinCP.

Proof. By Lemma 3, each iteration of the algorithm removes either a variable or a constraint from the LP. Hence the algorithm runs in polynomial time. The elimination of a variable that takes value 0 or 1 does not change the cost. The elimination of a constraint can only decrease cost, so the final solution has cost no more than the value achieved by the original LP. Finally, when a capacity constraint is eliminated, by Lemma 4 we incur a blowup of at most 3 in capacity.

2.2 A (k + 1)-approximation algorithm for MinkSP

It is straightforward to generalize the algorithm of the preceding section to obtain a (k + 1)-approximation for MinkSP. We first set up the integer LP for MinkSP. For a given element e ∈ Π_i S_i, we use e_i to denote the ith coordinate of e. Let x_{te} be the indicator variable that t is assigned to e ∈ Π_i S_i.

Minimize    Σ_{t,e} x_{te} f_t(e)
Subject to  Σ_e x_{te} ≥ 1                             for all t ∈ T,
            Σ_{t,e: e_i = u} (d_t(e))_i x_{te} ≤ C_i(u)   for all 1 ≤ i ≤ k and u ∈ S_i,
            x_{te} ∈ {0, 1}                            for all t ∈ T, e ∈ Π_i S_i.

The algorithm, which we call IterRound(k), is identical to IterRound of Section 2.1 except that step 2b is replaced by the following.

2b Remove from the LP any tight capacity constraint in which the number of variables is at most k more than the sum of the values of the variables.

The claims and proofs are almost identical to the k = 2 case and are deferred to Appendix A. A natural question is whether a linear approximation factor for MinkSP is unavoidable for polynomial-time algorithms. Unfortunately, we do not have any non-trivial results in this direction. We have been able to show that the MinkSP linear program has an integrality gap that grows as Ω(log k/log log k) (see Appendix A).

3 The maximization problems

We present approximation algorithms for the maximization versions of the coupled placement and k-sided placement problems. We first observe, in Section 3.1, that these problems reduce to column-sparse integer packing. We then present, in Section 3.2, an alternative combinatorial approach based on local search.

3.1 An LP-based approximation algorithm

One can write a positive integer linear program for MaxCP. Let x_{tuv} denote the indicator variable for the assignment of job t to the pair (u, v), u ∈ U, v ∈ V. The goal is then to

Maximize    Σ_{t,u,v} x_{tuv} f_t(u,v)
Subject to  Σ_{u,v} x_{tuv} ≤ 1                  for all t ∈ T,
            Σ_{t,v} p_t(u,v) x_{tuv} ≤ c_u       for all u ∈ U,
            Σ_{t,u} s_t(u,v) x_{tuv} ≤ d_v       for all v ∈ V,
            x_{tuv} ∈ {0, 1}                     for all t ∈ T, u ∈ U, v ∈ V.

Note that we can assume unit capacities on u and v by scaling the p_t(u, v) and s_t(u, v) values appropriately. The above LP extends easily to MaxkSP (see Appendix B).
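To make the formulation concrete, the sketch below (not from the paper; the helper name solve_maxcp_lp, the dense-array input format, and the use of SciPy are all assumptions) assembles the LP relaxation of the MaxCP program above and solves it with an off-the-shelf solver. The MinCP relaxation of Section 2 differs only in the direction of the objective and in the satisfaction constraints being ≥ 1.

```python
# A minimal sketch (assumed names, not the paper's code) of the MaxCP LP
# relaxation above. f[t,u,v], p[t,u,v], s[t,u,v] are dense numpy arrays of
# profits, processing demands and storage demands; c[u], d[v] are capacities.
import numpy as np
from scipy.optimize import linprog

def solve_maxcp_lp(f, p, s, c, d):
    T, U, V = f.shape
    n = T * U * V                          # one variable x_{tuv} per triple
    idx = lambda t, u, v: (t * U + u) * V + v

    obj = -f.reshape(n)                    # linprog minimizes, so negate profits

    A_ub, b_ub = [], []
    for t in range(T):                     # sum_{u,v} x_{tuv} <= 1
        row = np.zeros(n)
        row[idx(t, 0, 0): idx(t, U - 1, V - 1) + 1] = 1.0
        A_ub.append(row); b_ub.append(1.0)
    for u in range(U):                     # sum_{t,v} p_t(u,v) x_{tuv} <= c_u
        row = np.zeros(n)
        for t in range(T):
            for v in range(V):
                row[idx(t, u, v)] = p[t, u, v]
        A_ub.append(row); b_ub.append(c[u])
    for v in range(V):                     # sum_{t,u} s_t(u,v) x_{tuv} <= d_v
        row = np.zeros(n)
        for t in range(T):
            for u in range(U):
                row[idx(t, u, v)] = s[t, u, v]
        A_ub.append(row); b_ub.append(d[v])

    res = linprog(obj, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(0, 1))
    return res.x.reshape(T, U, V), -res.fun   # fractional solution and LP value
```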
The MaxCP and MaxkSP linear programs are 3-column-sparse and k-column-sparse packing programs, respectively, and can be approximated to within factors of 15.74 and ek + o(k), respectively, using a clever randomized rounding approach [7]. We next give a combinatorial approach based on local search, which is likely to be much more efficient in practice.

3.2 Approximation algorithms based on local search

Before giving the details, we start with a few helpful definitions. For any u ∈ U, let F_u = Σ_{t,v} x_{tuv} f_t(u,v). Similarly, for any v ∈ V, let F_v = Σ_{t,u} x_{tuv} f_t(u,v). We set µ = (ε/n) max_{t,u,v} f_t(u,v), for a parameter ε > 0. It follows that the optimum solution is at least nµ/ε and at most n²µ/ε. The local search algorithm maintains the following two invariants: (1) for each t, there is at most one pair (u, v) for which x_{tuv} > 0; (2) all the linear program inequalities hold. It is easy to set an initial state where the invariants hold (all x_{tuv} = 0). The local search algorithm proceeds in the following steps:

While ∃ t, u, v : f_t(u,v) > F_u·p_t(u,v)/c_u + F_v·s_t(u,v)/d_v + Σ_{u′,v′} x_{tu′v′} f_t(u′,v′) + µ:
1. Set x_{tuv} = 1 and set x_{tu′v′} = 0 for all (u′, v′) ≠ (u, v).
2. While Σ_{t,v} p_t(u,v) x_{tuv} > c_u, reduce x_{tuv} for the job with minimum c_u·f_t(u,v)/p_t(u,v) among those with x_{tuv} > 0.
3. While Σ_{t,u} s_t(u,v) x_{tuv} > d_v, reduce x_{tuv} for the job with minimum d_v·f_t(u,v)/s_t(u,v) among those with x_{tuv} > 0.

Theorem 2. The local search algorithm maintains the two stated invariants.

Proof. The first invariant is straightforward: the only time we increase an x_{tuv} value, we simultaneously set all other values for the same t to zero. The linear program inequalities can only be violated immediately after setting x_{tuv} = 1. However, the two steps immediately following this operation reduce the values of other jobs so as to satisfy the inequalities (and this is done without increasing any x_{tuv}, so no new constraint can be violated).

Theorem 3. The local search algorithm produces a (3 + ε)-approximate fractional solution satisfying the invariants.

Proof. When the algorithm terminates, we have for all t, u, v: f_t(u,v) ≤ F_u·p_t(u,v)/c_u + F_v·s_t(u,v)/d_v + Σ_{u′,v′} x_{tu′v′} f_t(u′,v′) + µ. Summing this over the triples (t, u, v) used by the optimum integer assignment, and noting that the µ terms add up to at most nµ = ε·max_{t,u,v} f_t(u,v) ≤ ε·OPT, we obtain OPT ≤ Σ_u F_u + Σ_v F_v + Σ_{t,u,v} x_{tuv} f_t(u,v) + ε·OPT. Each of the three summations equals the algorithm's objective value, giving the result.

Theorem 4. The local search algorithm runs in polynomial time.

Proof. Setting x_{tuv} = 1 and setting all other x_{tu′v′} = 0 adds f_t(u,v) − Σ_{u′,v′} x_{tu′v′} f_t(u′,v′) to the algorithm's objective. The next two steps of the algorithm (restoring the LP inequalities) reduce the objective by at most F_u·p_t(u,v)/c_u + F_v·s_t(u,v)/d_v. It follows that each iteration of the main loop increases the solution value by at least µ. By the definition of µ, this can happen at most n²/ε times. Each selection of (t, u, v) can be done in polynomial time (at worst, by simply trying all tuples).

Rounding Phase: When the local search algorithm terminates, we have a fractional solution with the additional guarantee given by the first invariant. Note that we can extend this to the k-sided version if we increase the approximation factor to k + 1 + ε. Below, we give two different rounding schemes. The first works for general values of k and loses an O(k²) factor, for an overall approximation factor of O(k³). The second is specific to the k = 2 case and obtains a better approximation.

1. We randomly make each assignment with probability p times the fractional value (so p·x_{tuv} for Coupled Placement), for some p to be defined later.
2.
For each assigned job t, if the other jobs t 0 6= t assigned to any one of its assigned machines violate the corresponding linear program constraint, we immediately drop job t. For Coupled Placement this means if P t 06=t,v pt 0 (u, v)xt 0uv > 1 for any t, u we set xtuv = 0.3. Note that we may still violate linear program constraints, but for any particular machine the constraint would be satisfied if we dropped any one of its assigned jobs. We divide the assigned jobs into k + 1 groups. These groups should guarantee that for any machine with at least two assigned jobs, not all its jobs are members of the same group. We then select the group with largest total objective value as our final solution. Theorem 5. For the k-sided version, the rounding scheme runs in poly-time and achieves an O(k 2 )-approximation over the fractional approximation factor (so an overall factor of O(k 3 ) using local search) for appropriate choice of p. Proof. The first two steps finish with a solution of value at least p(1 − p) k times the optimum in expectation. This is because for any job t, the probability of placing this job in step one is exactly p times its fractional value. Consider any machine m where the job is assigned; the expected total size of the other jobs t 0 6= t assigned to this machine is at most pcm and thus the probability that these other jobs exceed cm is at most p. The probability that none of the k machines where t is assigned exceed capacity from other jobs will be at most (1 − p) k . We may still violate constraints. Dividing into k + 1 groups and picking the best gives a result which is at least 1 k+1 p(1−p) k times optimum without violating constraints. Selecting p = 1 k gives the desired approximation factor. It remains to show that the division into groups can be performed in polytime. We start with all machines unmarked. For each group, we select a maximal set of jobs no two of which are assigned the same unmarked machine. We then mark all machines to which one of our current group of jobs is assigned. Note that immediately before we select group i, each remaining job is assigned to at most k−i+1 unmarked machines. For i = 1 this is obvious. Inductively, suppose that job j is assigned to more than k −i unmarked machines immediately before selecting group i + 1. Before selecting group i, job j was assigned to at most k −i+ 1 unmarked machines, and since we never “unmark” a machine it follows that job j was assigned to exactly k − i + 1 unmarked machines both before and after the selection of group i. But then none of the jobs selected in group i are assigned to any of the unmarked machines assigned to job j (else they would have become marked after selection of group i). So we can augment group i with job j without violating the constraint that no two jobs of group i are on the same unmarked machine. This contradicts the maximality of group i. We thus conclude that immediately before we select group k + 1, each remaining job is assigned only to marked machines. Thus group k + 1 selects all remaining jobs (maximality) and the jobs are divided into k+1 groups. Consider any machine m with at least two assigned jobs. Let group i be the first group to contain a job from m. Thus prior to selection of group i, we had not selected any job which was assigned to m and m was unmarked. So group i cannot include more than one job from machine m without violating the condition that no two jobs share an unmarked machine. 
It follows that there are at least two distinct groups which contain jobs from machine m (group i and also some later group).For MaxCP, we can improve the approximation factor. We refer the reader to Appendix B for details. Theorem 6. For MaxCP, there exists a polynomial-time algorithm based on local search that achieves a 15 +  approximation for MaxCP. 4 Online MaxCP and MaxkSP We now study the online version of MaxCP, in which jobs arrive in an online fashion. When a job arrives we must irrevocably assign it or reject it. Our goal is to maximize our total value at the end of the instance. We apply the techniques of [8] to obtain a logarithmic competitive online algorithm under certain assumptions. We first note that online MaxCP differs from the model considered in [8] in that a job’s computation/storage requirements need not be the same. As demonstrated in [8] certain assumptions have to be made to achieve competitive ratios of any interest. We extend these assumptions for the MaxCP model as follows: Assumption 1 There exists F such that for all t, u, v either ft(u, v) = 0 or 1 ≤ ft(u, v) ≤ F min( pt(u,v) cu , st(u,v) dv ). Assumption 2 For  = min( 1 2 , 1 ln 2F +1 ), for all t, u, v: pt(u, v) ≤ cu and st(u, v) ≤ dv. It is not hard to show that they (or some similar flavor of these assumptions) are in fact necessary to obtain any interesting competitive ratios (proof in Appendix C). Theorem 7. No deterministic online algorithm can be competitive over classes of instances where either one of the following is true: (i) job size is allowed to be arbitrarily large relative to capacities, or (ii) job values and resource requirements are completely uncorrelated. A small modification to the algorithm of [8] gives an O(log F)-competitive algorithm. Moreover, the lower bound of Ω(log F) shown in [8] applies to online MaxCP as well. (See Appendix D for proof.) Theorem 8. There exists a deterministic O(log F)-competitive algorithm for online MaxCP under Assumptions 1 and 2. For MaxkSP, this can be extended to a O(log kF)-competitive algorithm. Moreover, any online deterministic algorithm for online MaxCP has competitive ratio Ω(log F), and for online MaxkSP has competitive ratio Ω(log kF). Theorem 9. There exist a randomized O(log F)-competitive algorithm (in expectation) for online MaxCP under assumption 1 even if we weaken assumption 2 to require only that  = 1 2 . No deterministic online algorithm for the problem can accomplish such a result.Acknowledgments We would like to thank Aravind Srinivasan for helpful discussions, and for pointing us to the Ω(k/ log k)-hardness result for k-set packing, in particular. We thank anonymous referees for helpful comments on an earlier version of the paper, and are especially grateful to a referee who generously offered the key insights leading to improved results for MinCP and MinkSP. References 1. Patterson, D.A.: Technical perspective: the data center is the computer. Communications of the ACM 51 (January 2008) 105–105 2. Dowdy, L.W., Foster, D.V.: Comparative models of the file assignment problem. ACM Surveys 14 (1982) 3. Anderson, E., Kallahalla, M., Spence, S., Swaminathan, R., Wang, Q.: Quickly finding near-optimal storage designs. ACM Transactions on Computer Systems 23 (2005) 337–374 4. Appleby, K., Fakhouri, S., Fong, L., Goldszmidt, G., Kalantar, M., Krishnakumar, S., Pazel, D., Pershing, J., Rochwerger, B.: Oceano-SLA based management of a computing utility. In: Proceedings of the International Symposium on Integrated Network Management. 
(2001) 855–868 5. Chase, J.S., Anderson, D.C., Thakar, P.N., Vahdat, A.M., Doyle, R.P.: Managing energy and server resources in hosting centers. In: Proceedings of the Symposium on Operating Systems Principles. (2001) 103–116 6. Korupolu, M., Singh, A., Bamba, B.: Coupled placement in modern data centers. In: Proceedings of the International Parallel and Distributed Processing Symposium. (2009) 1–12 7. Bansal, N., Korula, N., Nagarajan, V., Srinivasan, A.: On k-column sparse packing programs. In: Proceedings of the Conference on Integer Programming and Combinatorial Optimization. (2010) 369–382 8. Awerbuch, B., Azar, Y., Plotkin, S.: Throughput-competitive on-line routing. In: Proceedings of the Symposium on Foundations of Computer Science. (1993) 32–40 9. Shmoys, D.B., Eva Tardos: An approximation algorithm for the generalized as- ´ signment problem. Mathematical Programming 62(3) (1993) 461–474 10. Lenstra, J.K., Shmoys, D.B., Eva Tardos: Approximation algorithms for scheduling ´ unrelated parallel machines. Mathematical Programming 46(3) (1990) 259–271 11. Chekuri, C., Khanna, S.: A PTAS for the multiple knapsack problem. In: Proceedings of the Symposium on Discrete Algorithms. (2000) 213–222 12. Fleischer, L., Goemans, M.X., Mirrokni, V.S., Sviridenko, M.: Tight approximation algorithms for maximum general assignment problems. In: SODA. (2006) 611–620 13. Alvarez, G.A., Borowsky, E., Go, S., Romer, T.H., Becker-Szendy, R., Golding, R., Merchant, A., Spasojevic, M., Veitch, A., Wilkes, J.: Minerva: An automated resource provisioning tool for large-scale storage systems. Transactions on Computer Systems 19 (November 2001) 483–518 14. Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., Veitch, A.: Hippodrome: Running circles around storage administration. In: Proceedings of the Conference on File and Storage Technologies. (2002) 175–188 15. Cygan, M., Grandoni, F., Mastrolilli, M.: How to sell hyperedges: The hypermatching assignment problem. In: SODA. (2013) 342–35116. Hazan, E., Safra, S., Schwartz, O.: On the complexity of approximating k-set packing. Computational Complexity 15(1) (2006) 20–39 17. Vazirani, V.V.: Approximation Algorithms. Springer-Verlag (2001) 18. Frieze, A.M., Clarke, M.: Approximation algorithms for the m-dimensional 0-1 knapsack problem: Worst-case and probabilistic analyses. European Journal of Operational Research 15(1) (1984) 100–109 19. Chekuri, C., Khanna, S.: On multi-dimensional packing problems. In: Proceedings of the Symposium on Discrete Algorithms. (1999) 185–194 20. Srinivasan, A.: Improved approximations of packing and covering problems. In: Proceedings of the Symposium on Theory of Computing. (1995) 268–276 21. Lau, L., Ravi, R., Singh, M.: Iterative Methods in Combinatorial Optimization. Cambridge Texts in Applied Mathematics. Cambridge University Press (2011) A Proofs for MinkSP Fix an iteration of the algorithm, and an extreme point x. Let nt denote the number of tight satisfaction constraints, and ni denote the number of tight capacity constraints on the ith side. Since x is an extreme point, if all variables in x take values in (0, 1), then we have N = nt + P i ni . Lemma 5. If all variables in x take values in (0, 1), then there exists a tight capacity constraint in which the number of variables is at most k more than the sum of the variables. Proof. 
Since each variable occurs in at most k tight capacity constraints, the total number of occurrences of all variables across the tight capacity constraints is kN − s for some nonnegative integer s. Since each satisfaction constraint is tight, each variable appears in k capacity constraints, and each variable takes on value at most 1, the sum of all the variables over the tight capacity constraints is at least knt − s. Therefore, the sum, over all tight capacity constraints, of the difference between the number of variables and their sum is at most k(N − nt). Since the number of tight capacity constraints is N −nt, for at least one of these constraints, the difference between the number of variables and their sum is at most k. Lemma 6. Let u be a side-i node with a tight capacity constraint, in which the number of variables is at most k more than the sum of the variables. Then, the sum of the capacity requirements of the tasks partially assigned to u is at most the available capacity of u plus kCi(u). Proof. Let ` be the number of variables in the constraint for u, and let the associated tasks be numbered 1 through `. Let the demand of task j for the capacity of node u be dj . Then, the capacity constraint for u is P j djxj = bc(u).We know that m − P i xi ≤ k. We also have di ≤ Ci(u). Letting bc(u) denote the current capacity of u, we now derive X i di = bc(u) +Xm j=1 (1 − xi)di ≤ bc(u) + (m − Xm j=1 xi)Ci(u) ≤ bc(u) + kCi(u). Theorem 10. IterRound(k) is a polynomial-time k + 1-approximation algorithm for MinkSP. Proof. By Lemma 5, each iteration of the algorithm removes either a variable or a constraint from the LP. Hence the algorithm is polynomial time. The elimination of a variable that takes value 0 or 1 neither changes cost nor incurs capacity blowup. The elimination of a constraint can only decrease cost, so the final solution has cost no more than the value achieved by the original LP. Finally, by Lemma 6, we incur a blowup of at most 1 + k in capacity. We now show that the MinkSP LP has an integrality gap of Ω(log k/ log log k). We recursively construct an integrality gap instance with ` t sides, for parameters ` and t, with two nodes per side one with infinite capacity and the other with unit capacity, such that any integral solution has at least t tasks on the unit-capacity node on some side, while there is a fractional solution with load of at most t/` on the unit-capacity node of each side. Setting t = ` and k = ` ` , we obtain an instance in which the capacity used by the fractional solution is 1, while any integral solution has load ` = Θ(log k/ log log k). Each task can be placed on one tuple from a subset of tuples; for a given tuple, the demand of the task on each side of the tuple is one. We start with the construction for t = 1. We introduce a task that has ` choices, the ith choice consisting of the unit-capacity node from side i and infinite capacity nodes on all other sides. Clearly, any integral solution uses up unit capacity of one unitcapacity node, while there is a fractional solution (1/` for each choice) that uses only 1/` fraction of each unit capacity node. Given a construction for ` t sides, we show how to extend to ` t+1 sides. We take ` identical copies of the instance with ` t sides and combine the tuples for each task in such a way that for any i, any integral placement places exactly the same task on side i of each copy. 
Now we add task t + 1 which can be placed in one of ` tuples: unit capacity node on all sides of copy i and infinite capacity node on all other sides, for each i. Clearly, any integral solution will have to add one more task to a unit-capacity node of a side that already has load t, yielding a load t + 1, while a fractional solution can assign load of at most 1/` to the unit-capacity nodes of each side.B Proofs for MaxkSP and MaxCP We first present the linear program for MaxkSP (recall the definition in Section 1.1). Let xte denote the indicator variable for the assignment of job t to the k-tuple e. Maximize: X t,e xteft(e) Subject to: X e xte ≤ 1, ∀t ∈ T, X t,e (dt(e))ixte ≤ Ci(ei), ∀i ∈ {1, . . . , k}, xte ∈ {0, 1}, ∀t ∈ T, e ∈ Q i Si . We now present the improved approximation algorithm for MaxCP. The idea is to obtain a one-to-one correspondance between fractional assignments and machines. Essentially we view the machines as nodes of a graph where the edges are the fractional assignments (this is similar to the rounding for generalized assignment). If we have a cycle, the idea is to shift the fractions around the cycle (i.e. increase one xtuv then decrease some xt 0vw and increase some xt 00wx and so forth). Applying this directly on a single cycle may violate some constraints; while we try to increase and decrease the fractions in such a way that constraints hold, since each job has different “size” on its two endpoints we may wind up violating the constraint P t,v xtuvpt(u, v) at a single node u. This prevents us from doing a simple cycle elimination as in generalized assignment. However, if we have two adjoining (or connected) cycles the process can be made to work. The remaining case is a single cycle, where we can assign each edge to one of its endpoints. Generalized assignment rounding would now proceed to integrally assign each job to its corresponding machine; we cannot do this because each job requires two machines, and each machine thus has multiple fractional assignments (all but one of which “correspond” to some other machine). Lemma 7. Given any fractional solution which satisfies the local search invariants, we can produce an alternative fractional solution (also satisfying the local search invariants and with equal or greater value). This new fractional solution labels each job t with 0 < xtuv < 1, with either u or v, guaranteeing that each u is labeled with at most one job. Proof. Consider a graph where the nodes are machines, and we have an edge (u, v) for any fractional assignment 0 < xtuv < 1. If any node has degree zero or one, we remove that node and its assigned edge (if any), labeling the removed edge with the node that removed it. We continue this process until all remaining nodes have degree at least two. If there is a node of degree three, then there must exist two (distinct but not necessarily edge-disjoint) cycles with a path between them (possibly a path of length zero); since the graph is bipartite all cycles are even in length. We can alternately increase and decrease the fractional assignments of edges along a cycle such that the total load P t,v pt(u, v)xtuv changesonly on a single node u where the path between cycles intersects this cycle. We can do the same along the other cycle. We can then do the same thing along the path, and equalize the changes (multiplicatively) such that there is no overall change in load, but at least one edge has its fractional value changing. If this process decreases the value, we can reverse it to increase the value. 
This allows us to modify the fractional solution in a way that increases the number of integral assignments without decreasing the value. After applying this repeatedly (and repeating the node/edge removal process above where necessary), we are left with a graph that consists only of node-disjoint cycles. Each of the remaining edges will be labeled with one of its two endpoints (one to each). The overall effect is that we have a one-to-one labeling correspondance between fractional assignments and machines (each fractional edge to one of its two assigned machines). Note however that since each job is assigned to two machines and labeled with only one of the two, this does not imply that each machine has only one fractional assignment. Once this is done, we consider three possible solutions. One consists of all the integral assignments. The second considers only those assignments which are fractional and labeled with nodes u. For each node v, we select a subset of its fractional assignments to make integrally, so as to maximize the value without violating capacity of v. We cannot violate capacity of u because we select at most one job for each such machine. The result has at least 1 2 the value of assignments labeled with nodes u. For the third solution, we do the same but with the roles of u, v reversed. We select the best of these three solutions; our choice obtains at least 1 5 of the overall value. Proof of Theorem 6: The algorithm sketch contains most of the proof. We need to establish that we can get at least 1 2 the fractional value on a single machine integrally. This can be done by selecting jobs in decreasing order of density (ft(u, v)/pt(u, v)) until we overflow the capacity. Including the job that overflows capacity, this must be better than the fractional solution. Thus we can select either everything but the job that overflows capacity, or that job by itself. We also need to establish the 1 5 value claim. If we were to select the integral assignments with probability 1 5 and each of the other two solutions with probability 2 5 , we would get an expected 1 5 of the fractional solution. Deterministially selecting the best of the three solutions can only be better than this. ut C Proof of Theorem 7 We first show that if resource requirements are large compared to capacities, payment functions ft are exactly equal to the total amount of resources and each job requires the same amount over all resources/dimensions (but different jobs can require different amounts), then no deterministic online algorithm can be competitive. Consider a graph G with a single compute node and a single data storage node. Each node has one-dimensional compute/storage capacity of L. A jobarrives requesting 1 unit of computing and storage and will pay 2. Clearly, any competitive deterministic algorithm must accept this job, in case this is the only job. However, a second job arrives requesting L units of computing and storage and will pay 2L. In this case, the algorithm is L-competitive, and L can be arbitrarily large. Next, we show that if resource requirements are small relative to capacities, payment functions ft are arbitrary and resource requirements are identical, then no deterministic online algorithm can be competitive. This instance satisfies Assumption 2 but not Assumption 1. Consider again a graph G with a single compute node and single data storage node each with one-dimensional, unit capacities. We will use up to k + 1 jobs, each requiring 1/k units of computing and storage. 
The i-th job, 0 ≤ i ≤ k, will pay M^i for some large value M. Now consider any deterministic algorithm. If it fails to accept some job j < k, then if job j is the last job, the algorithm is Ω(M)-competitive. If the algorithm accepts jobs 0 through k − 1, then it cannot accept job k and is again Ω(M)-competitive. In all cases it has competitive ratio at least Ω(M), and M and k can be arbitrarily large. Similarly, if resource requirements are small relative to capacities, payment functions f_t are exactly equal to the total amount of resources requested, and resource requirements are arbitrary, then no deterministic online algorithm can be competitive. Consider once more a graph G with a single compute node and a single data store node with one-dimensional compute/storage capacities. This time, however, the compute capacity is 1 and the storage capacity is some very large L. We use up to k + 1 jobs, each requiring 1/k units of computing. The i-th job, 0 ≤ i ≤ k, requires the appropriate amount of storage so that its value is M^i for very large M. Assuming L = O(kM^k), all these storage requirements are at most 1/k of L. Note that storage can accommodate all jobs, but computing can accommodate at most k jobs. Any deterministic algorithm will have competitive ratio Ω(M), and k, M and L can be suitably large. Thus, it follows that some flavor of Assumptions 1 and 2 is necessary to achieve any interesting competitive result.

D Proof of Theorem 8

We adapt the framework of [8] to solve the online MaxCP problem. This framework uses an exponential cost function to place a price on the remaining capacity of a node. If the value obtained from a task can cover the cost of the capacity it consumes, we admit the task. In the algorithm below, e is the base of the natural logarithm. We first show that our algorithm will not exceed capacities. Essentially, this occurs because the cost will always be sufficiently high.

Lemma 8. Capacity constraints are not violated at any time during this algorithm.

Algorithm 1 Online algorithm for MaxCP.
1: λ_u(1) ← 0, λ_v(1) ← 0 for all u ∈ U, v ∈ V
2: for each new task j do
3:   cost_u(j) ← (1/2)(e^{λ_u(j)·ln(2F+1)/(1−ε)} − 1)
4:   cost_v(j) ← (1/2)(e^{λ_v(j)·ln(2F+1)/(1−ε)} − 1)
5:   For all (u, v) let Z_{juv} = (p_j(u,v)/c_u)·cost_u(j) + (s_j(u,v)/d_v)·cost_v(j)
6:   Let (u, v) maximize f_j(u,v) subject to Z_{juv} < f_j(u,v)
7:   if such (u, v) exists with f_j(u,v) > 0 then
8:     Assign j to (u, v)
9:     λ_u(j+1) ← λ_u(j) + p_j(u,v)/c_u
10:    λ_v(j+1) ← λ_v(j) + s_j(u,v)/d_v
11:    For all other u′ ≠ u let λ_{u′}(j+1) ← λ_{u′}(j)
12:    For all other v′ ≠ v let λ_{v′}(j+1) ← λ_{v′}(j)
13:  else
14:    Reject task j
15:    For all u let λ_u(j+1) ← λ_u(j)
16:    For all v let λ_v(j+1) ← λ_v(j)
17:  end if
18: end for

Proof. Note that λ_u(n+1) equals (1/c_u)·Σ_{t,v} p_t(u,v) x_{tuv}, since any time we assign a job j to (u, v) we immediately increase λ_u(j+1) by the appropriate amount. Thus if we can prove λ_u(n+1) ≤ 1, we do not violate the capacity of u. Initially λ_u(1) = 0 < 1, so suppose that the first time we exceed capacity is after the placement of job j. Thus we have λ_u(j) ≤ 1 < λ_u(j+1). By Assumption 2, λ_u(j) > 1 − ε. From this it follows that cost_u(j) > (1/2)(e^{ln(2F+1)} − 1) = F, and since these costs are always non-negative we must have had Z_{juv} > (p_j(u,v)/c_u)·F ≥ f_j(u,v) by Assumption 1. But then we would have rejected job j and would have λ_u(j+1) = λ_u(j), a contradiction. Identical reasoning applies to v ∈ V.

Next, we bound the algorithm's revenue from below using the sum of the node costs.

Lemma 9.
Let A(j) be the total objective value Σt,u,vxtuvft(u, v) obtained by P the algorithm immediately before job j arrives. Then (3e ln(2F + 1))A(j) ≥ u∈U costu(j) + P v∈V costv(j). Proof. The proof will be by induction on j; the base case where j = 1 is immediate since no jobs have yet arrived or been scheduled and costu(1) = costv(1) = 0 for all u and v. Consider what happens when job j arrives. If this job is rejected, neither side of the inequality changes and the induction holds. Otherwise, suppose job j is assigned to uv. We have: A(j + 1) = A(j) + fj (u, v)We can bound the new value of the righthand side by observing that since costu has derivative increasing in the value of λu, the new value will be at most the new derivative times the increase in λu. It follows that: costu(j + 1) ≤ costu(j) + (λu(j + 1) − λu(j))1 2 ( ln(2F + 1) 1 −  )(e λu(j+1) ln(2F +1) 1− ) costu(j + 1) ≤ costu(j) + pj (u, v) cu ( ln(2F + 1) 1 −  )(1 2 e λu(j) ln(2F +1) 1− )(e  ln(2F +1) 1− ) costu(j + 1) ≤ costu(j) + pj (u, v) cu ln(2F + 1) 1 −  (costu(j) + 1 2 )(e  ln(2F +1)) Applying assumption 2 gives: costu(j + 1) ≤ costu(j) + (2e ln(2F + 1))(pj (u, v) cu costu(j) + 1 4 ) Identical reasoning can be applied to costv, allowing us to show that the increase in the righthand side is at most: (2e ln(2F + 1))(pj (u, v) cu costu(j) + sj (u, v) du costv(j) + 1 2 ) Since j was assigned to uv, we must have fj (u, v) > pj (u,v) cu costu(j)+sj (u,v) dv costv(j); from assumption 1 we also have fj (u, v) ≥ 1 so we can conclude that the increase in the righthand side is at most: (3e ln(2F + 1))fj (u, v) ≤ (3e ln(2F + 1))(A(j + 1) − A(j)) Now, we can bound the profit the optimum solution gets from tasks which we either fail to assign, or assign with a lower value of ft(u, v). The reason we did not assign these tasks was because the node costs were suitably high. Thus, we can bound the profit of tasks using the node costs. Lemma 10. Suppose the optimum solution assigned j to u, v, but the online algorithm either rejected j or assigned it to some u 0 , v0 with fj (u 0 , v0 ) < fj (u, v). Then pj (u,v) cu costu(n + 1) + sj (u,v) dv costv(n + 1) ≥ fj (u, v) Proof. When the algorithm considered j, it would find the u, v with maximum fj (u, v) satisfying Zjuv < fj (u, v). Since the algortihm either could not find such u, v or else selected u 0 , v0 with fj (u 0 , v0 ) < fj (u, v) it must be that Zjuv ≥ fj (u, v). The lemma then follows by inserting the definition of Zjuv and then observing that costu and costv only increase as the algorithm continues.Lemma 11. Let Q be the total value of tasks which the optimum offline algorithm assigns, but which Algorithm 1 either rejects or assigns to a uv with lower value of ft(u, v). Then Q ≤ Σu∈U costu(n + 1) + Σv∈V costv(n + 1). Proof. Consider any task q as described above. Suppose offline optimum assigns q to uq, vq. By applying lemma 10 we have: Q = Σqfq(uq, vq) ≤ Σq pq(uq, vq) cu costuq (n + 1) + sq(uq, vq) dv costq(n + 1) The lemma then follows from the fact that the offline algorithm must obey the capacity constraints. Finally, we can combine Lemmas 9 and 11 to bound our total profit. In particular, this shows that we are within a factor 3e ln(2F + 1) of the optimum offline solution, for an O(log F)-competitive algorithm. Theorem 11. Algorithm 1 never violates capacity constraints and is O(log F)- competitive. We can extend the result to k-sided placement, and can get a slight improvement in the required assumptions if we are willing to randomize. The results are given below: Theorem 12. 
For the k-sided placement problem, we can adapt algorithm 1 to be O(log kF)-competitive provided that assumption 2 is tightened to  = min( 1 2 , 1 ln(kF +1) ). Proof. We must modify the definition of cost to: costu(j) = 1 k (e λu(j) ln(kF +1) 1− − 1) The rest of the proof will then go through. The intuition for the increase in competitive ratio is that we need to assign the first task to arrive (otherwise after this task our competitive ratio would be unbounded). This task potentially uses up space on k machines while obtaining a value of only 1. So as the value of k increases, the ratio of “best” to “worst” task effectively increases as well. Theorem 13. If we select a random power of two z ∈ [1, F] and then reject all placements with ft(u, v) < z or ft(u, v) > 2z, then we can obtain a competitive ratio of O(log F log k) while weakening assumption 2 to  = min( 1 2 , 1 ln(2k+1) ). Note that in the specific case of two-sided placement this is O(log F)-competitive requiring only that no single job consumes more than a constant fraction of any machine. Proof. Once we make our random selection of z, we effectively have F = 2 and can apply the algorithm and analysis above. The selection of z causes us to lose (in expectation) all but 1 log F of the possible profit, so we have to multiply this into our competitive ratio. Collaboration in the Cloud at Google Yunting Sun, Diane Lambert, Makoto Uchida, Nicolas Remy Google Inc. January 8, 2014 Abstract Through a detailed analysis of logs of activity for all Google employees1 , this paper shows how the Google Docs suite (documents, spreadsheets and slides) enables and increases collaboration within Google. In particular, visualization and analysis of the evolution of Google’s collaboration network show that new employees2 , have started collaborating more quickly and with more people as usage of Docs has grown. Over the last two years, the percentage of new employees who collaborate on Docs per month has risen from 70% to 90% and the percentage who collaborate with more than two people has doubled from 35% to 70%. Moreover, the culture of collaboration has become more open, with public sharing within Google overtaking private sharing. 1 Introduction Google Docs is a cloud productivity suite and it is designed to make collaboration easy and natural, regardless of whether users are in the same or different locations, working at the same or different times, or working on desktops or mobile devices. Edits and comments on the document are displayed as they are made, even if many people are simultaneously writing and commenting on or viewing the document. Comments enable real-time discussion and feedback on the document, without changing the document itself. Authors are notified when a new comment is made or replied to, and authors can continue a conversation by replying to the comment, or end the discussion by resolving it, or re-start the discussion by re-opening a closed discussion stream. Because documents are stored in the cloud, users can access any document they own or that has been shared with them anywhere, any time and on any device. The question is whether this enriched model of collaboration matters? There have been a few previous qualitative analyses of the effects of Google Docs on collaboration. For example, the review of Google Docs in [1] suggested that its features should improve collaboration and productivity among college students. 
A technical report [2] from the University of Southern Queensland, Australia argued that Google Docs can overcome barriers to usability such as difficulty of installation and document version control and help resolve conflicts among co-authors of research papers. There has also been at least one rigorous study of the effect of Google Docs on collaboration. Blau and Caspi [3] ran a small experiment that was designed to compare collaboration on writing documents to merely sharing documents. In their experiment, 118 undergraduate students of the Open University of Israel were randomized to one of five groups in which they shared their written assignments and received feedback from other students to varying degrees, ranging from keeping texts 1Full-time Google employees, excluding interns, part-times, vendors, etc 2Full-time employees who have joined Google for less than 90 days 12 COLLABORATION VISUALIZATION private to allowing in-text suggestions or allowing in-text edits. None of the students had used Google Docs previously. The authors found that only students in the collaboration group perceived the quality of their final document to be higher after receiving feedback, and students in all groups thought that collaboration improves documents. This paper takes a different approach, and looks for the effects of collaboration on a large, diverse organization with thousands of users over a much longer period of time. The first part of the paper describes some of the contexts in which Google Docs is used for collaboration, and the second part analyzes how collaboration has evolved over the last two years. 2 Collaboration Visualization 2.1 The Data This section introduces a way to visualize the events during a collaboration and some simple statistics that summarize how widespread collaboration using Google Docs is at Google. The graphics and metrics are based on the view, edit and comment actions of all full-time employees on tens of thousands of documents created in April 2013. 2.2 A Simple Example To start, a document with three collaborators Adam (A), Bryant (B) and Catherine (C) is shown in Figure 1. The horizontal axis represents time during the collaboration. The vertical axis is broken into three regions representing viewing, editing and commenting. Each contributor is assigned a color. A box with the contributor’s color is drawn in any time interval in which the contributor was active, at a vertical position that indicates what the user was doing in that time interval. This allows us to see when contributors were active and how often they contributed to the document. Stacking the boxes allows us to show when contributors were acting at the same time. Only time intervals in which at least one contributor was active are shown, and gaps in time that are shorter than a threshold are ignored. Gray vertical bars of fixed width are used to represent periods of no activity that are longer than the threshold. In this paper, the threshold is set to be 12 hours in all examples. In Figure 1, an interval represents an hour. Adam and Bryant edited the document together during the hour of 10 AM May 4 and Bryant edited alone in the following hour. The collaboration paused for 8 days and resumed during the hour of 2 pm on May 12. Adam, Bryant and Catherine all viewed the document during that hour. Catherine commented on the document in the next hour. Altogether, the collaboration had two active sessions, with a pause of 8 days between them. 
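The hourly boxes and the 12-hour gap rule just described can be derived from a raw action log along the following lines. This is a purely illustrative sketch under assumed names and event formats, not the tooling behind the figures in this paper.

```python
# A minimal sketch of the bucketing behind the timeline visualization: each
# (user, action, hour) becomes one colored box, and idle gaps longer than a
# threshold are collapsed into a single gray marker.
from collections import defaultdict
from datetime import timedelta

GAP_THRESHOLD = timedelta(hours=12)   # the threshold used in this paper

def timeline_boxes(events):
    """events: iterable of (timestamp, user, action), with action in
    {'view', 'edit', 'comment'}. Returns (boxes, gaps)."""
    boxes = defaultdict(set)          # hour bucket -> {(user, action)}
    for ts, user, action in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        boxes[hour].add((user, action))

    hours = sorted(boxes)
    # Consecutive active hours separated by more than the threshold are
    # rendered as a fixed-width gray bar rather than to scale.
    gaps = [(a, b) for a, b in zip(hours, hours[1:]) if b - a > GAP_THRESHOLD]
    return boxes, gaps
```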
Figure 1: This figure shows an example of the collaboration visualization technique. Each colored block except the gray one represents an hour and the gray one represents a period of no activity. The Y axis is the number of users for each action type. This document has three contributors, each assigned a different color. Although we have used color to represent collaborators here, we could instead use color to represent the locations of the collaborators, their organizations, or other variables. Examples with different colorings are given in Sections 2.5 and 2.6. 2 Google Inc.2 COLLABORATION VISUALIZATION 2.3 Collaboration Metrics 2.3 Collaboration Metrics To estimate the percentage of users who concurrently edit a document and the percentage of documents which had concurrent editing, we discretize the timestamps of editing actions into 15 minute intervals and consider editing actions by different contributors in the same 15 minute interval to be concurrent. Two users who edit the same document but always more than 15 minutes apart would not be considered as concurrent, although they would still be considered collaborators. Edge cases in which two collaborators edit the same document within 15 minutes of each other but in two adjacent 15 minute intervals would not be counted as concurrent events. The choice of 15 minutes is arbitrary; however, metrics based on a 15 minute discretization and a 5 minute discretization are little different. The choice of 15 minute intervals makes computation faster. A more accurate approach would be to look for sequences of editing actions by different users with gaps below 15 minutes, but that requires considerably more computing. 2.4 Collaborative Editing Collaborative editing is common at Google. 53% of the documents that were created and shared in April 2013 were edited by more than one employee, and half of those had at least one concurrent editing session in the following six months. Looking at employees instead of documents, 80% of the employees who edited any document contributed content to a document owned by others and 65% participated in at least one 15 minute concurrent editing session in April 2013. Concurrent editing is sticky, in the sense that 76% of the employees who participate in a 15 minute concurrent editing session in April will do so again the following month. There are many use cases for collaborative editing, including weekly reports, design documents, and coding interviews. The following three plots show an example of each of these use cases. Figure 2: Collaboration activity on a design document. The X axis is time in hours and the Y axis is the number of users for each action type. The document was mainly edited by 3 employees, commented on by 18 and viewed by 50+. Google Inc. 32.5 Commenting 2 COLLABORATION VISUALIZATION Figure 2 shows the life of a design document created by engineers. The X axis is time in hours and the Y axis is the number of employees working on the document for each action type. The document was mainly edited by three employees, commented on by 18 employees and viewed by more than 50 employees from three major locations. This document was completed within two weeks and viewed many times in the subsequent month. Design documents are common at Google, and they typically have many contributors. Figure 3 shows the life of a weekly report document. Each bar represents a day and the Y axis is the number of employees who edited and viewed the document in a day. 
This document has the following submission rules: • Wednesday, AM: Reminder for submissions • Wednesday, PM: All teams submit updates • Thursday, AM: Document is locked The activities on the document exhibit a pronounced weekly pattern that mirrors the submission rules. Weekly reports and meeting notes that are updated regularly are often used by employees to keep everyone up-to-date as projects progress. Figure 3: Collaboration on a weekly report. The X axis is time in days and the Y axis is the number of users for each action type. The activities exhibit a pronounced weekly pattern and reflect the submission rules of the document. Finally, Figure 4 shows the life of a document used in an interview. The X axis represents time in minutes. The document was prepared by a recruiter and then viewed by an engineer. At the beginning of the interview, the engineer edited the document and the candidate then wrote code in the document. The engineer was able to watch the candidate typing. At the end of the interview, the candidate’s access to the document was revoked so no further change could be made, and the document was reviewed by the engineer. Collaborative editing allows the coding interview to take place remotely, and it is an integral part of interviews for software engineers at Google. Figure 4: The activity on a phone interview document. The X axis is time in minutes and the Y axis is the number of users for each action type. The engineer was able to watch the candidate typing on the document during a remote interview. 2.5 Commenting Commenting is common at Google. 30% of the documents created in April 2013 that are shared received comments within six months of creation. 57% of the employees who used Google Docs in April commented at least once in April, and 80% of the users who commented in April commented again in the following month. 4 Google Inc.2 COLLABORATION VISUALIZATION 2.6 Collaboration Across Sites Figure 5: Commenting and editing on a design document. The X axis is time in hours and the Y axis is the number of user actions for each user location. There are four user actions, each assigned a different color. Timestamps are in Pacific time. Figure 5 shows the life of a design document. Here color represents the type of user action (create a comment, reply to a comment, resolve a comment and edit the document), and the Y axis is split into two locations. The document was written by one engineering team and reviewed by another. The review team used commenting to raise many questions, which the engineering team resolved over the next few days. Collaborators were located in London, UK and Mountain View, California, with a nine hour time zone difference, so the two teams were almost ”taking turns” working on the document (timestamps are in Pacific time). There are many similar communication patterns between engineers via commenting to ask questions, have discussions and suggest modifications. 2.6 Collaboration Across Sites Employees use the Docs suite to collaborate with colleagues across the world, as Figure 6 shows. In that figure, employees working from nine locations in eight countries across the globe contributed to a document that was written within a week. The document was either viewed or edited with gaps of less than 12 hours (the threshold for suppressing gaps in the plot) in the first seven days as people worked in their local timezones. After final changes were made to the document, it was reviewed by people in Dublin, Mountain View, and New York. 
Figure 7 shows one month of global collaborations for full-time employees using Google Docs. The blue dots show the locations of the employees and a line connects two locations if a document is created in one location and viewed in the other. The warmer the color of the line, moving from green to red, the more documents shared between the two locations. Google Inc. 52.6 Collaboration Across Sites 2 COLLABORATION VISUALIZATION Figure 6: Activity on a document. Each user location is assigned a different color. The X axis is time in hours and the Y axis is the number of locations for each action type. Users from nine different locations contributed to the document. Figure 7: Global collaboration on Docs. The blue dots are locations and the dots are connected if there is collaboration on Google Docs between the two locations. 6 Google Inc.3 THE EVOLUTION OF COLLABORATION 2.7 Cross Device Work 2.7 Cross Device Work The advantage of cloud-based software and storage is that a document can be accessed from any device. Figure 8 shows one employee’s visits to a document from multiple devices and locations. When the employee was in Paris, a desktop or laptop was used during working hours and a mobile device during non-working hours. Apparently, the employee traveled to Aix-En-Provence on August 18. On August 18 and the first part of August 19, the employee continued working on the same document from a mobile device while on the move. Figure 8: Visits to a document by one user working on multiple devices and from multiple locations. Not surprisingly, the pattern of working on desktops or laptops during working hours and on mobile devices out of business hours holds generally at Google, as Figure 9 shows. The day of week is shown on the X axis and hour of day in local time on the Y axis. Each pixel is colored according to the average number of employees working in Google Docs in a day of week and time of day slot, with brighter colors representing higher numbers. Pixel values are normalized within each plot separately. Desktop and laptop usage of Google Docs peaks during conventional working hours (9:00 AM to 11:00 AM and 1:00 PM to 5:00 PM), while mobile device usage peaks during conventional commuting and other out-of-office hours (7:00 AM to 9:00 AM and 6:00 PM to 8:00 PM). Figure 9: The average number of active users working in Google Docs in each day of week and time of day slot. The X axis is day of the week and the Y axis is time of the day in local time. Desktop/Laptop usage peaks during working hours while mobile usage peaks at out-of-office working hours. 3 The Evolution of Collaboration 3.1 The Data This section explores changes in the usage of Google Docs over time. Section 2 defined collaborators as users who edited or commented on the same document and used logs of employee editing, viewing and commenting actions to describe collaboration within Google. This section defines collaborators differently using metadata on documents. Metadata is much less rich than the event history logs used in Section 2, but metadata is retained for a much longer period of time. Document metadata includes the document creation time and the last time that the document Google Inc. 73.2 Collaboration for New Employees 3 THE EVOLUTION OF COLLABORATION was accessed, but no other information about its revision history. 
However, the metadata does include the identification numbers for employees who have subscribed to the document, where a subscriber is anyone who has permission to view, edit or comment on a document and who has viewed the document at least once. Here we use metadata on documents, slides and spreadsheets. We call two employees collaborators (or subscription collaborators to be clear) if one is a subscriber to a document owned by the other and has viewed the document at least once and the document has fewer than 20 subscribers. The owner of the document is said to have shared the document with the subscriber. The number of subscribers is capped at 20 to avoid overcounting collaborators. The more subscribers the document has, the less likely it is that all the subscribers contributed to the document. There is no timestamp for when the employee subscribed to the document in the metadata, so the exact time of the collaboration is not known. Instead, the document creation time, which is known, is taken to be the time of the collaboration. An analysis (not shown here) of the event history data discussed in Section 2 showed that most collaborators join a collaboration soon after a document is created, so taking collaboration time to be document creation time is not unreasonable. To make this assumption even more tenable, we exclude documents for which the time of the last view, comment or edit is more than six months after the document was created. This section uses metadata on documents created between January 1, 2011 and March 31, 2013. We say that two employees had a subscription collaboration in July if they collaborated on a document that was created in July. 3.2 Collaboration for New Employees Here we define the new employees for a given month to be all the employees who joined Google no more than 90 days before the beginning of the month and started using Google Docs in the given month. For example, employees called new in the month of January 2011 must have joined Google no more than 90 days before January 1, 2011 and used Google Docs in January 2011. Each month can include different employees. New employees are said to share a document if they own a document that someone else subscribed to, whether or not the person subscribed to the document is a new employee. Similarly, a new employee is counted as a subscriber, regardless of the tenure of the document creator. Figure 10 shows that collaboration among new employees has increased since 2011. Over the last two years, subscribing has risen from 55% to 85%, sharing has risen from 30% to 50%, and the fraction of users who either share or subscribe has risen from 70% to 90%. In other words, new employees are collaborating earlier in their career, so there is a faster ramp-up and easier access to collective knowledge. Figure 10: This figure shows the percentage of new employees who share, subscribe to others’ documents and either share or subscribe in each one-month period over the last two years. Not only do new employees start collaborating more often (as measured by subscription and sharing), they also collaborate with more people. Figure 11 shows the percentage of new employees with at least a given number of collaborators by month. For example, the percentage of 8 Google Inc.3 THE EVOLUTION OF COLLABORATION 3.3 Collaboration in Sales and Marketing new employees with at least three subscription collaborators was 35% in January 2011 (the bottom red curve) and 70% in March 2013 (the top blue curve), a doubling over two years. 
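For illustration only, the sketch below shows one way the quantity plotted in Figure 11 could be computed from document metadata; the input schema and function name are assumptions, not the pipeline used for this paper. It follows the definitions above: a subscription collaboration links a document's owner with each subscriber, documents with 20 or more subscribers are excluded, and the collaboration is dated by the document's creation month.

```python
# A minimal sketch of the Figure 11 metric: for each month, the share of new
# employees having at least k subscription collaborators.
from collections import defaultdict

MAX_SUBSCRIBERS = 20   # documents with this many or more subscribers are ignored

def pct_with_at_least_k(docs, new_employees_by_month, k):
    """docs: iterable of (creation_month, owner_id, subscriber_ids).
    new_employees_by_month: dict mapping month -> set of new-employee ids."""
    peers = defaultdict(lambda: defaultdict(set))   # month -> employee -> collaborators
    for month, owner, subscribers in docs:
        if len(subscribers) >= MAX_SUBSCRIBERS:
            continue                                # cap to avoid overcounting
        for sub in subscribers:
            peers[month][owner].add(sub)
            peers[month][sub].add(owner)

    result = {}
    for month, new_emps in new_employees_by_month.items():
        if new_emps:
            hits = sum(1 for e in new_emps if len(peers[month][e]) >= k)
            result[month] = 100.0 * hits / len(new_emps)
    return result
```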
It is interesting that the curves hardly cross each other and that the curves for the farthest-back months lie below those for recent months, suggesting that there has been steady growth in the number of subscription collaborators per new employee over this period.

Figure 11: This figure shows the proportion of new employees who have at least a given number of collaborators in each one-month period. Each period is assigned a different color. The cooler the color of the curve, moving from red to blue, the more recent the month. The legend only shows the labels for a subset of curves. The percentage of new employees who have at least three collaborators has doubled from 35% to 70%.

To present the data in Figure 11 in another way, Table 1 shows percentiles of the distribution of the number of subscription collaborators per new employee using Google Docs in January 2011 and in January 2013. For example, the lowest 25% of new employees using Google Docs had no such collaborators in January 2011 and two such collaborators in January 2013.

              25%   50%   75%   90%   95%
January 2011    0     1     4     7    11
January 2013    2     5    10    17    22

Table 1: This table shows percentiles of the number of subscription collaborators a new employee had in January 2011 and in January 2013. The entire distribution shifts to the right.

3.3 Collaboration in Sales and Marketing

Section 3.2 compared new employees who joined Google in different months. This section follows current employees in Sales and Marketing who joined Google before January 1, 2011. That is, the previous section considered changes in new employee behavior over time, and this section considers changes in behavior for a fixed set of employees over time. We only analyze subscription collaborations among this fixed set of employees; collaborations with employees not in this set are excluded.

Figure 12: This figure shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators in each one-month period.

Figure 12 shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators at several times in the past. There we see that more employees are sharing and subscribing over time: the fraction of the group with at least one subscription collaborator has increased from 80% to 95%, and the fraction with at least three subscription collaborators has increased from 50% to 80%. This shows that many of the employees who used to have no or very few subscription collaborators have migrated to having multiple subscription collaborators. In other words, the distribution of the number of subscription collaborators for employees who have been in Sales and Marketing since January 1, 2011 has shifted right over time, which implies that collaboration in that group of employees has increased over time. Finally, the number of documents shared by the employees who have been in Sales and Marketing at Google since January 1, 2011 has nearly doubled over the last two years. Figure 13 shows the number of shared documents normalized by the number of shared documents in January 2011.

Figure 13: This figure shows the number of shared documents created by employees in Sales and Marketing each month, normalized by the number of shared documents in January 2011. The number has almost doubled over the last two years.

3.4 Collaboration Between Organizations

Collaboration between organizations has increased over time.
To show that, we consider hundreds of employees in nine teams within the Sales and Marketing group and the Engineering and Product Management group who joined Google before January 1, 2011, were still active on March 31, 2013 and used Google Docs in that period. Figure 14 represents the Engineering and Product Management employees as red dots and the Sales and Marketing employees as blue dots. The same dots are included in all three plots in Figure 14 because the employees included in this analysis do not change. A line connects two dots if the two employees had at least one subscription collaboration in the month shown. The denser the lines in the graph, the more collaboration, and the more lines connecting red and blue dots, the more collaboration between organizations. Clearly, subscription collaboration has increased both within and across organizations in the past two years. Moreover, the network shows more pronounced communities (groups of connected dots) over time. Although there are nine individual teams, there seem to be only three major communities in the network. Figure 14 indicates that teams can work closely with each other even though they belong to separate departments. We also sampled 187 teams within the Sales and Marketing group and the Engineering and Product Management group. Figure 15 represents teams in Engineering and Product Management as red dots and teams in Sales and Marketing as blue dots. Two dots are connected if the two teams had at least one subscription collaboration between their members in the month. Figure 15 shows that collaboration between those teams has increased and that the interaction between the two organizations has become stronger over the past two years.

Figure 14: An example of collaboration across organizations. Red dots represent employees in Engineering and Product Management and blue dots represent employees in Sales and Marketing.

Figure 15: An example of collaboration between teams. Red dots represent teams in Engineering and Product Management and blue dots represent teams in Sales and Marketing.

3.5 Cultural Changes in Collaboration

Google Docs allows users to specify the access level (visibility) of their documents. The default access level in Google Docs is private, which means that only the user who created the document or the current owner of the document can view it. Employees can change the access level on a document they own and allow more people to access it. For example, the document owner can specify particular employees who are allowed to access the document, or the owner can mark the document as public within Google, in which case any employee can access the document. Clearly, not all documents created in Google can be visible to everyone at Google, but the more documents are widely shared, the more open the environment is to collaboration.

Figure 16: This figure shows the percentage of shared documents created in each month that are "public within Google". Public sharing is overtaking private sharing at Google.

Figure 16 shows the percentage of shared documents in Google created each month between January 1, 2012 and March 31, 2013 that are public within Google. The red line, which is a curve fit to the data to smooth out variability, shows that the percentage has increased by about 12%, from 48% to 54%, in the last year alone.
In that sense, the culture of sharing is changing in Google from private sharing to public sharing. 4 Conclusions We have examined how Google employees collaborate with Docs and how that collaboration has evolved using logs of user activity and document metadata. To show the current usage of Docs in Google, we have developed a visualization technique for the revision history of a document and analyzed key features in Docs such as collaborative editing, commenting, access from anywhere and on any device. To show the evolution of collaboration in the cloud, we have analyzed new employees and a fixed group of employees in Sales and Marketing, and computed collaboration network statistics each month. We find that employees are engaged in using the Docs suite, and collaboration has grown rapidly over the last two years. It would also be interesting to conduct a similar analysis for other enterprises and see how long it would take them to reach the benchmark Google has set for collaboration on Docs. Not only has the collaboration on Docs changed at Google, the number of emails, comments on G+, calender meetings between people who work together has also had significant changes over the past few years. How those changes reinforce each other over time would also be an interesting topic to study. Acknowledgements We would like to thank Ariel Kern for her insights about collaboration on Google Docs, Penny Chu and Tony Fagan for their encouragement and support and many thanks to Jim Koehler for his constructive feedback. 12 Google Inc.REFERENCES REFERENCES References [1] Dan R. Herrick (2009). Google this!: using Google apps for collaboration and productivity. Proceeding of the ACM SIGUCCS fall conference (pp. 55-64). [2] Stijn Dekeyser, Richard Watson (2009). Extending Google Docs to Collaborate on Research Papers. Technical Report, The University of Southern Queensland, Australia. [3] Ina Blau, Avner Caspi (2009). What Type of Collaboration Helps? Psychological Ownership, Perceived Learning and Outcome Quality of Collaboration Using Google Docs. Learning in the technological era: Proceedings of the Chais conference on instructional technologies research (pp. 48-55). Google Inc. 13 Circulant Binary Embedding Felix X. Yu1 YUXINNAN@EE.COLUMBIA.EDU Sanjiv Kumar2 SANJIVK@GOOGLE.COM Yunchao Gong3 YUNCHAO@CS.UNC.EDU Shih-Fu Chang1 SFCHANG@EE.COLUMBIA.EDU 1Columbia University, New York, NY 10027 2Google Research, New York, NY 10011 3University of North Carolina at Chapel Hill, Chapel Hill, NC 27599 Abstract Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure enables the use of Fast Fourier Transformation to speed up the computation. Compared to methods that use unstructured matrices, the proposed method improves the time complexity from O(d 2 ) to O(d log d), and the space complexity from O(d 2 ) to O(d) where d is the input dimensionality. We also propose a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternatively minimizes the objective in original and Fourier domains. 
We show by extensive experiments that the proposed approach gives much better performance than the state-of-the-art approaches for fixed time, and provides much faster computation with no performance degradation for fixed number of bits. 1. Introduction Embedding input data in binary spaces is becoming popular for efficient retrieval and learning on massive data sets (Li et al., 2011; Gong et al., 2013a; Raginsky & Lazebnik, 2009; Gong et al., 2012; Liu et al., 2011). Moreover, Proceedings of the 31 st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s). in a large number of application domains such as computer vision, biology and finance, data is typically highdimensional. When representing such high dimensional data by binary codes, it has been shown that long codes are required in order to achieve good performance. In fact, the required number of bits is O(d), where d is the input dimensionality (Li et al., 2011; Gong et al., 2013a; Sanchez ´ & Perronnin, 2011). The goal of binary embedding is to well approximate the input distance as Hamming distance so that efficient learning and retrieval can happen directly in the binary space. It is important to note that another related area called hashing is a special case with slightly different goal: creating hash tables such that points that are similar fall in the same (or nearby) bucket with high probability. In fact, even in hashing, if high accuracy is desired, one typically needs to use hundreds of hash tables involving tens of thousands of bits. Most of the existing linear binary coding approaches generate the binary code by applying a projection matrix, followed by a binarization step. Formally, given a data point, x ∈ R d , the k-bit binary code, h(x) ∈ {+1, −1} k is generated simply as h(x) = sign(Rx), (1) where R ∈ R k×d , and sign(·) is a binary map which returns element-wise sign1 . Several techniques have been proposed to generate the projection matrix randomly without taking into account the input data (Charikar, 2002; Raginsky & Lazebnik, 2009). These methods are very popular due to their simplicity but often fail to give the best performance due to their inability to adapt the codes with respect to the input data. Thus, a number of data-dependent techniques have been proposed with different optimization criteria such as reconstruction error (Kulis & Darrell, 2009), data dissimilarity (Norouzi & Fleet, 2012; Weiss et al., 1A few methods transform the linear projection via a nonlinear map before taking the sign (Weiss et al., 2008; Raginsky & Lazebnik, 2009).Circulant Binary Embedding 2008), ranking loss (Norouzi et al., 2012), quantization error after PCA (Gong et al., 2013b), and pairwise misclassification (Wang et al., 2010). These methods are shown to be effective for learning compact codes for relatively lowdimensional data. However, the O(d 2 ) computational and space costs prohibit them from being applied to learning long codes for high-dimensional data. For instance, to generate O(d)-bit binary codes for data with d ∼1M, a huge projection matrix will be required needing TBs of memory, which is not practical2 . In order to overcome these computational challenges, Gong et al. (2013a) proposed a bilinear projection based coding method for high-dimensional data. It reshapes the input vector x into a matrix Z, and applies a bilinear projection to get the binary code: h(x) = sign(RT 1 ZR2). 
(2) When the shapes of Z, R1, R2 are chosen appropriately, the method has time and space complexity of O(d 1.5 ) and O(d), respectively. Bilinear codes make it feasible to work with datasets with very high dimensionality and have shown good results in a variety of tasks. In this work, we propose a novel Circulant Binary Embedding (CBE) technique which is even faster than the bilinear coding. It is achieved by imposing a circulant structure on the projection matrix R in (1). This special structure allows us to use Fast Fourier Transformation (FFT) based techniques, which have been extensively used in signal processing. The proposed method further reduces the time complexity to O(d log d), enabling efficient binary embedding for very high-dimensional data3 . Table 1 compares the time and space complexity for different methods. This work makes the following contributions: • We propose the circulant binary embedding method, which has space complexity O(d) and time complexity O(d log d) (Section 2, 3). • We propose to learn the data-dependent circulant projection matrix by a novel and efficient time-frequency alternating optimization, which alternatively optimizes the objective in the original and frequency domains (Section 4). • Extensive experiments show that, compared to the state-of-the-art, the proposed method improves the result dramatically for a fixed time cost, and provides much faster computation with no performance degradation for a fixed number of bits (Section 5). 2 In principle, one can generate the random entries of the matrix on-the-fly (with fixed seeds) without needing to store the matrix. But this will increase the computational time even further. 3One could in principal use other structured matrices like Hadamard matrix along with a sparse random Gaussian matrix to achieve fast projection as was done in fast Johnson-Lindenstrauss transform(Ailon & Chazelle, 2006; Dasgupta et al., 2011), but it is still slower than circulant projection and needs more space. Method Time Space Time (Learning) Full projection O(d 2 ) O(d 2 ) O(nd3 ) Bilinear proj. O(d 1.5 ) O(d) O(nd1.5 ) Circulant proj. O(d log d) O(d) O(nd log d) Table 1. Comparison of the proposed method (Circulant proj.) with other methods for generating long codes (code dimension k comparable to input dimension d). n is the number of instances used for learning data-dependent projection matrices. 2. Circulant Binary Embedding (CBE) A circulant matrix R ∈ R d×d is a matrix defined by a vector r = (r0, r2, · · · , rd−1) T (Gray, 2006)4 . R = circ(r) :=         r0 rd−1 . . . r2 r1 r1 r0 rd−1 r2 . . . r1 r0 . . . . . . rd−2 . . . . . . rd−1 rd−1 rd−2 . . . r1 r0         . (3) Let D be a diagonal matrix with each diagonal entry being a Bernoulli variable (±1 with probability 1/2). For x ∈ R d , its d-bit Circulant Binary Embedding (CBE) with r ∈ R d is defined as: h(x) = sign(RDx), (4) where R = circ(r). The k-bit (k < d) CBE is defined as the first k elements of h(x). The need for such a D is discussed in Section 3. Note that applying D to x is equivalent to applying random sign flipping to each dimension of x. Since sign flipping can be carried out as a preprocessing step for each input x, here onwards for simplicity we will drop explicit mention of D. Hence the binary code is given as h(x) = sign(Rx). The main advantage of circulant binary embedding is its ability to use Fast Fourier Transformation (FFT) to speed up the computation. Proposition 1. 
For d-dimensional data, CBE has space complexity O(d), and time complexity O(d log d). Since a circulant matrix is defined by a single column/row, clearly the storage needed is O(d). Given a data point x, the d-bit CBE can be efficiently computed as follows. Denote ~ as operator of circulant convolution. Based on the definition of circulant matrix, Rx = r ~ x. (5) The above can be computed based on Discrete Fourier Transformation (DFT), for which fast algorithm (FFT) is available. The DFT of a vector t ∈ C d is a d-dimensional vector with each element defined as 4The circulant matrix is sometimes equivalently defined by “circulating” the rows instead of the columns.Circulant Binary Embedding F(t)l = X d−1 m=0 tn · e −i2πlm/d, l = 0, · · · , d − 1. (6) The above can be expressed equivalently in a matrix form as F(t) = Fdt, (7) where Fd is the d-dimensional DFT matrix. Let F H d be the conjugate transpose of Fd. It is easy to show that F −1 d = (1/d)F H d . Similarly, for any t ∈ C d , the Inverse Discrete Fourier Transformation (IDFT) is defined as F −1 (t) = (1/d)F H d t. (8) Since the convolution of two signals in their original domain is equivalent to the hadamard product in their frequency domain (Oppenheim et al., 1999), F(Rx) = F(r) ◦ F(x). (9) Therefore, h(x) = sign F −1 (F(r) ◦ F(x)) . (10) For k-bit CBE, k < d, we only need to pick the first k bits of h(x). As DFT and IDFT can be efficiently computed in O(d log d) with FFT (Oppenheim et al., 1999), generating CBE has time complexity O(d log d). 3. Randomized Circulant Binary Embedding A simple way to obtain CBE is by generating the elements of r in (3) independently from the standard normal distribution N (0, 1). We call this method randomized CBE (CBE-rand). A desirable property of any embedding method is its ability to approximate input distances in the embedded space. Suppose Hk(x1, x2) is the normalized Hamming distance between k-bit codes of a pair of points x1, x2 ∈ R d : Hk(x1, x2)= 1 k k X−1 i=0 sign(Ri·x1)−sign(Ri·x2) /2, (11) and Ri· is the i-th row of R, R = circ(r). If r is sampled from N (0, 1), from (Charikar, 2002), Pr sign(r T x1) 6= sign(r T x2)  = θ/π, (12) where θ is the angle between x1 and x2. Since all the vectors that are circulant variants of r also follow the same distribution, it is easy to see that E(Hk(x1, x2)) = θ/π. (13) For the sake of discussion, if k projections, i.e., first k rows of R, were generated independently, it is easy to show that the variance of Hk(x1, x2) will be Var(Hk(x1, x2)) = θ(π − θ)/kπ2 . (14) 0 1 2 3 4 5 6 7 8 0 0.05 0.1 0.15 0.2 0.25 log k Variance Independent Circulant (a) θ = π/12 0 1 2 3 4 5 6 7 8 0 0.05 0.1 0.15 0.2 0.25 log k Variance Independent Circulant (b) θ = π/6 0 1 2 3 4 5 6 7 8 0 0.05 0.1 0.15 0.2 0.25 log k Variance Independent Circulant (c) θ = π/3 0 1 2 3 4 5 6 7 8 0 0.05 0.1 0.15 0.2 0.25 log k Variance Independent Circulant (d) θ = π/2 Figure 1. The analytical variance of normalized hamming distance of independent bits as in (14), and the sample variance of normalized hamming distance of circulant bits, as a function of angle between points (θ) and number of bits (k). The two curves overlap. Thus, with more bits (larger k), the normalized hamming distance will be close to the expected value, with lower variance. In other words, the normalized hamming distance approximately preserves the angle5 . Unfortunately in CBE, the projections are the rows of R = circ(r), which are not independent. This makes it hard to derive the variance analytically. 
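Equation (10) maps directly onto an FFT library. The NumPy sketch below implements randomized CBE (CBE-rand), including the sign-flipping matrix D, and uses (13) to estimate the angle between two points from the normalized Hamming distance of their codes. The dimensions and variable names are ours, not taken from any released implementation, and the random Gaussian test points stand in for the unit-norm features used in the paper (the angle estimate does not depend on the norms):

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 1024, 256
    r = rng.standard_normal(d)           # defines R = circ(r)
    signs = rng.choice([-1.0, 1.0], d)   # diagonal of D (random sign flips)
    fr = np.fft.fft(r)                   # F(r), computed once and reused

    def cbe_rand(x):
        """k-bit randomized circulant binary embedding: first k bits of sign(R D x)."""
        x = signs * x                                  # apply D
        h = np.real(np.fft.ifft(fr * np.fft.fft(x)))   # R x = r (*) x via FFT, O(d log d)
        return np.where(h[:k] >= 0, 1, -1)

    x1 = rng.standard_normal(d)
    x2 = rng.standard_normal(d)
    b1, b2 = cbe_rand(x1), cbe_rand(x2)
    hamming = np.mean(b1 != b2)     # normalized Hamming distance H_k
    theta_est = np.pi * hamming     # E[H_k] = theta / pi, so this estimates the angle
    theta_true = np.arccos(x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2)))
    print(theta_est, theta_true)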
To better understand CBE-rand, we run simulations to compare the analytical variance of normalized hamming distance of independent projections (14), and the sample variance of normalized hamming distance of circulant projections in Figure 1. For each θ and k, we randomly generate x1, x2 ∈ R d such that their angle is θ 6 . We then generate k-dimensional code with CBE-rand, and compute the hamming distance. The variance is estimated by applying CBE-rand 1,000 times. We repeat the whole process 1,000 times, and compute the averaged variance. Surprisingly, the curves of “Independent” and “Circulant” variances are almost indistinguishable. This means that bits generated by CBE-rand are generally as good as the independent bits for angle preservation. An intuitive explanation is that for the circulant matrix, though all the rows are dependent, circulant shifting one or multiple elements will in fact result in very different projections in most cases. We will later show in experiments on real-world data that CBE-rand and Locality Sensitive Hashing (LSH)7 has almost identical performance (yet CBE-rand is significantly faster) (Section 5). 5 In this paper, we consider the case that the data points are `2 normalized. Therefore the cosine distance, i.e., 1 - cos(θ), is equivalent to the l2 distance. 6This can be achieved by extending the 2D points (1, 0), (cos θ, sin θ) to d-dimension, and performing a random orthonormal rotation, which can be formed by the Gram-Schmidt process on random vectors. 7Here, by LSH we imply the binary embedding using R such that all the rows of R are sampled iid. With slight abuse of notation, we still call it “hashing” following (Charikar, 2002).Circulant Binary Embedding Note that the distortion in input distances after circulant binary embedding comes from two sources: circulant projection, and binarization. For the circulant projection step, recent works have shown that the Johnson-Lindenstrausstype lemma holds with a slightly worse bound on the number of projections needed to preserve the input distances with high probability (Hinrichs & Vyb´ıral, 2011; Zhang & Cheng, 2013; Vyb´ıral, 2011; Krahmer & Ward, 2011). These works also show that before applying the circulant projection, an additional step of randomly flipping the signs of input dimensions is necessary8 . To show why such a step is required, let us consider the special case when x is an allone vector, 1. The circulant projection with R = circ(r) will result in a vector with all elements to be r T 1. When r is independently drawn from N (0, 1), this will be close to 0, and the norm cannot be preserved. Unfortunately the Johnson-Lindenstrauss-type results do not generalize to the distortion caused by the binarization step. One problem with the randomized CBE method is that it does not utilize the underlying data distribution while generating the matrix R. In the next section, we propose to learn R in a data-dependent fashion, to minimize the distortions due to circulant projection and binarization. 4. Learning Circulant Binary Embedding We propose data-dependent CBE (CBE-opt), by optimizing the projection matrix with a novel time-frequency alternating optimization. We consider the following objective function in learning the d-bit CBE. The extension of learning k < d bits will be shown in Section 4.2. argmin B,r ||B − XRT ||2 F + λ||RRT − I||2 F (15) s.t. 
R = circ(r), where X ∈ R n×d , is the data matrix containing n training points: X = [x0, · · · , xn−1] T , and B ∈ {−1, 1} n×d is the corresponding binary code matrix.9 In the above optimization, the first term minimizes distortion due to binarization. The second term tries to make the projections (rows of R, and hence the corresponding bits) as uncorrelated as possible. In other words, this helps to reduce the redundancy in the learned code. If R were to be an orthogonal matrix, the second term will vanish and the optimization would find the best rotation such that the distortion due to binarization is minimized. However, when R is a circulant matrix, R, in general, will not be orthogonal. Similar objective has been used in previous works including (Gong et al., 2013b;a) and (Wang et al., 2010). 8 For each dimension, whether the sign needs to be flipped is predetermined by a (p = 0.5) Bernoulli variable. 9 If the data is `2 normalized, we can set B ∈ {−1/ √ d, 1/ √ d} n×d to make B and XRT more comparable. This does not empirically influence the performance. 4.1. The Time-Frequency Alternating Optimization The above is a combinatorial optimization problem, for which an optimal solution is hard to find. In this section we propose a novel approach to efficiently find a local solution. The idea is to alternatively optimize the objective by fixing r, and B, respectively. For a fixed r, optimizing B can be easily performed in the input domain (“time” as opposed to “frequency”). For a fixed B, the circulant structure of R makes it difficult to optimize the objective in the input domain. Hence we propose a novel method, by optimizing r in the frequency domain based on DFT. This leads to a very efficient procedure. For a fixed r. The objective is independent on each element of B. Denote Bij as the element of the i-th row and j-th column of B. It is easy to show that B can be updated as: Bij = ( 1 if Rj·xi ≥ 0 −1 if Rj·xi < 0 , (16) i = 0, · · · , n − 1. j = 0, · · · , d − 1. For a fixed B. Define ˜r as the DFT of the circulant vector ˜r := F(r). Instead of solving r directly, we propose to solve ˜r, from which r can be recovered by IDFT. Key to our derivation is the fact that DFT projects the signal to a set of orthogonal basis. Therefore the `2 norm can be preserved. Formally, according to Parseval’s theorem , for any t ∈ C d (Oppenheim et al., 1999), ||t||2 2 = (1/d)||F(t)||2 2 . Denote diag(·) as the diagonal matrix formed by a vector. Denote <(·) and =(·) as the real and imaginary parts, respectively. We use Bi· to denote the i-th row of B. With complex arithmetic, the first term in (15) can be expressed in the frequency domain as: ||B − XRT ||2 F = 1 d nX−1 i=0 ||F(B T i· − Rxi)||2 2 (17) = 1 d nX−1 i=0 ||F(B T i·)−˜r◦F(xi)||2 2 = 1 d nX−1 i=0 ||F(B T i·)−diag(F(xi))˜r||2 2 = 1 d nX−1 i=0  F(B T i·)−diag(F(xi))˜r  T  F(B T i·)−diag(F(xi))˜r  = 1 d h <(˜r) TM<(˜r)+=(˜r) TM=(˜r)+<(˜r) T h+=(˜r) T g i +||B||2 F , where, M=diag nX−1 i=0 <(F(xi))◦<(F(xi))+=(F(xi))◦=(F(xi)) h = −2 nX−1 i=0 <(F(xi))◦<(F(B T i·))+=(F(xi)) ◦ =(F(B T i·)) g = 2 nX−1 i=0 =(F(xi)) ◦ <(F(B T i·)) − <(F(xi)) ◦ =(F(B T i·)).Circulant Binary Embedding For the second term in (15), we note that the circulant matrix can be diagonalized by DFT matrix Fd and its conjugate transpose F H d . Formally, for R = circ(r), r ∈ R d , R = (1/d)F H d diag(F(r))Fd. (18) Let Tr(·) be the trace of a matrix. 
Therefore, ||RRT − I||2 F = || 1 d F H d (diag(˜r) Hdiag(˜r) − I)Fd||2 F = Tr 1 d F H d (diag(˜r) Hdiag(˜r)−I) H(diag(˜r) Hdiag(˜r)−I)Fd  = Tr (diag(˜r) Hdiag(˜r) − I) H(diag(˜r) Hdiag(˜r) − I) =||˜r H ◦ ˜r − 1||2 2 = ||<(˜r) 2 + =(˜r) 2 − 1||2 2 . (19) Furthermore, as r is real-valued, additional constraints on ˜r are needed. For any u ∈ C, denote u as the complex conjugate of u. We have the following result (Oppenheim et al., 1999): For any real-valued vector t ∈ C d , F(t)0 is real-valued, and F(t)d−i = F(t)i , i = 1, · · · , bd/2c. From (17) − (19), the problem of optimizing ˜r becomes argmin ˜r <(˜r) TM<(˜r) + =(˜r) TM=(˜r) + <(˜r) T h + =(˜r) T g + λd||<(˜r) 2 + =(˜r) 2 − 1||2 2 (20) s.t. =(˜r0) = 0 <(˜ri) = <(˜rd−i), i = 1, · · · , bd/2c =(˜ri) = −=(˜rd−i), i = 1, · · · , bd/2c. The above is non-convex. Fortunately, the objective function can be decomposed, such that we can solve two variables at a time. Denote the diagonal vector of the diagonal matrix M as m. The above optimization can then be decomposed to the following sets of optimizations. argmin r˜0 m0r˜ 2 0 + h0r˜0+ λd r˜ 2 0 − 1 2 , s.t. r˜0 = r˜0. (21) argmin r˜i (mi + md−i)(<(˜ri) 2 + =(˜ri) 2 ) (22) + 2λd <(˜ri) 2 + =(˜ri) 2 − 1 2 + (hi + hd−i)<(˜ri) + (gi − gd−i)=(˜ri), i = 1, · · · , bd/2c. In (21), we need to minimize a 4 th order polynomial with one variable, with the closed form solution readily available. In (22), we need to minimize a 4 th order polynomial with two variables. Though the closed form solution is hard (requiring solution of a cubic bivariate system), we can find local minima by gradient descent, which can be considered as having constant running time for such small-scale problems. The overall objective is guaranteed to be nonincreasing in each step. In practice, we can get a good solution with just 5-10 iterations. In summary, the proposed time-frequency alternating optimization procedure has running time O(nd log d). 4.2. Learning k < d Bits In the case of learning k < d bits, we need to solve the following optimization problem: argmin B,r ||BPk−XPT k RT ||2 F +λ||RPkP T k RT −I||2 F s.t. R = circ(r), (23) in which Pk =  Ik O O Od−k  , Ik is a k × k identity matrix, and Od−k is a (d − k) × (d − k) all-zero matrix. In fact, the right multiplication of Pk can be understood as a “temporal cut-off”, which is equivalent to a frequency domain convolution. This makes the optimization difficult, as the objective in frequency domain can no longer be decomposed. To address this issues, we propose a simple solution in which Bij = 0, i = 0, · · · , n − 1, j = k, · · · , d − 1 in (15). Thus, the optimization procedure remains the same, and the cost is also O(nd log d). We will show in experiments that this heuristic provides good performance in practice. 5. Experiments To compare the performance of the proposed circulant binary embedding technique, we conducted experiments on three real-world high-dimensional datasets used by the current state-of-the-art method for generating long binary codes (Gong et al., 2013a). The Flickr-25600 dataset contains 100K images sampled from a noisy Internet image collection. Each image is represented by a 25, 600 dimensional vector. The ImageNet-51200 contains 100k images sampled from 100 random classes from ImageNet (Deng et al., 2009), each represented by a 51, 200 dimensional vector. The third dataset (ImageNet-25600) is another random subset of ImageNet containing 100K images in 25, 600 dimensional space. All the vectors are normalized to be of unit norm. 
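The alternation of Section 4.1 can be prototyped compactly. In the sketch below the B-step is the exact sign update of (16), but the r-step simply hands the objective (the binarization term plus the decorrelation penalty, evaluated in the frequency domain via (19)) to a generic optimizer. This is only a stand-in for the paper's per-frequency closed-form and bivariate solves and is far less efficient; the data, dimensions, and regularization weight are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    n, d, lam = 200, 64, 1.0
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # the paper assumes l2-normalized data

    def circ_project(X, r):
        """Rows of X R^T, i.e. R x_i for every point, computed with FFTs."""
        return np.real(np.fft.ifft(np.fft.fft(X, axis=1) * np.fft.fft(r), axis=1))

    def objective(r, B):
        fit = np.sum((B - circ_project(X, r)) ** 2)             # binarization distortion
        decor = np.sum((np.abs(np.fft.fft(r)) ** 2 - 1) ** 2)   # ||RR^T - I||_F^2, per (19)
        return fit + lam * decor

    r = rng.standard_normal(d) / np.sqrt(d)
    for _ in range(5):
        B = np.sign(circ_project(X, r))        # exact B-step, as in (16)
        B[B == 0] = 1
        # r-step: a generic optimizer replaces the paper's per-frequency solves.
        r = minimize(objective, r, args=(B,), method="L-BFGS-B").x
    print(objective(r, B))

For k < d bits, the paper's heuristic simply fixes the last d - k columns of B to zero and reuses the same procedure, so the sketch above carries over unchanged apart from that masking step.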
We compared the performance of the randomized (CBErand) and learned (CBE-opt) versions of our circulant embeddings with the current state-of-the-art for highdimensional data, i.e., bilinear embeddings. We use both the randomized (bilinear-rand) and learned (bilinear-opt) versions. Bilinear embeddings have been shown to perform similar or better than another promising technique called Product Quantization (Jegou et al., 2011). Finally, we also compare against the binary codes produced by the baseline LSH method (Charikar, 2002), which is still applicable to 25,600 and 51,200 dimensional feature but with much longer running time and much more space. We also show an experiment with relatively low-dimensional data in 2048 dimensional space using Flickr data to compare against techniques that perform well for low-dimensional data but do not scale to high-dimensional scenario. Exam- C/C++ Thread Safety Analysis DeLesley Hutchins Google Inc. Email: delesley@google.com Aaron Ballman CERT/SEI Email: aballman@cert.org Dean Sutherland Email: dfsuther@cs.cmu.edu Abstract—Writing multithreaded programs is hard. Static analysis tools can help developers by allowing threading policies to be formally specified and mechanically checked. They essentially provide a static type system for threads, and can detect potential race conditions and deadlocks. This paper describes Clang Thread Safety Analysis, a tool which uses annotations to declare and enforce thread safety policies in C and C++ programs. Clang is a production-quality C++ compiler which is available on most platforms, and the analysis can be enabled for any build with a simple warning flag: −Wthread−safety. The analysis is deployed on a large scale at Google, where it has provided sufficient value in practice to drive widespread voluntary adoption. Contrary to popular belief, the need for annotations has not been a liability, and even confers some benefits with respect to software evolution and maintenance. I. INTRODUCTION Writing multithreaded programs is hard, because developers must consider the potential interactions between concurrently executing threads. Experience has shown that developers need help using concurrency correctly [1]. Many frameworks and libraries impose thread-related policies on their clients, but they often lack explicit documentation of those policies. Where such policies are clearly documented, that documentation frequently takes the form of explanatory prose rather than a checkable specification. Static analysis tools can help developers by allowing threading policies to be formally specified and mechanically checked. Examples of threading policies are: “the mutex mu should always be locked before reading or writing variable accountBalance” and “the draw() method should only be invoked from the GUI thread.” Formal specification of policies provides two main benefits. First, the compiler can issue warnings on policy violations. Finding potential bugs at compile time is much less expensive in terms of engineer time than debugging failed unit tests, or worse, having subtle threading bugs hit production. Second, specifications serve as a form of machine-checked documentation. Such documentation is especially important for software libraries and APIs, because engineers need to know the threading policy to correctly use them. Although documentation can be put in comments, our experience shows that comments quickly “rot” because they are not updated when variables are renamed or code is refactored. 
This paper describes thread safety analysis for Clang. The analysis was originally implemented in GCC [2], but the GCC version is no longer being maintained. Clang is a productionquality C++ compiler, which is available on most platforms, including MacOS, Linux, and Windows. The analysis is currently implemented as a compiler warning. It has been deployed on a large scale at Google; all C++ code at Google is now compiled with thread safety analysis enabled by default. II. OVERVIEW OF THE ANALYSIS Thread safety analysis works very much like a type system for multithreaded programs. It is based on theoretical work on race-free type systems [3]. In addition to declaring the type of data (int , float , etc.), the programmer may optionally declare how access to that data is controlled in a multithreaded environment. Clang thread safety analysis uses annotations to declare threading policies. The annotations can be written using either GNU-style attributes (e.g., attribute ((...))) or C++11- style attributes (e.g., [[...]] ). For portability, the attributes are typically hidden behind macros that are disabled when not compiling with Clang. Examples in this paper assume the use of macros; actual attribute names, along with a complete list of all attributes, can be found in the Clang documentation [4]. Figure 1 demonstrates the basic concepts behind the analysis, using the classic bank account example. The GUARDED BY attribute declares that a thread must lock mu before it can read or write to balance, thus ensuring that the increment and decrement operations are atomic. Similarly, REQUIRES declares that the calling thread must lock mu before calling withdrawImpl. Because the caller is assumed to have locked mu, it is safe to modify balance within the body of the method. In the example, the depositImpl() method lacks a REQUIRES clause, so the analysis issues a warning. Thread safety analysis is not interprocedural, so caller requirements must be explicitly declared. There is also a warning in transferFrom(), because it fails to lock b.mu even though it correctly locks this−>mu. The analysis understands that these are two separate mutexes in two different objects. Finally, there is a warning in the withdraw() method, because it fails to unlock mu. Every lock must have a corresponding unlock; the analysis detects both double locks and double unlocks. A function may acquire a lock without releasing it (or vice versa), but it must be annotated to specify this behavior. A. Running the Analysis To run the analysis, first download and install Clang [5]. Then, compile with the −Wthread−safety flag: clang −c −Wthread−s af et y example . cpp#include ” mutex . h ” class BankAcct { Mutex mu; i n t balance GUARDED BY(mu ) ; void d e p o s itIm p l ( i n t amount ) { / / WARNING! Must l o c k mu. balance += amount ; } void withd rawImpl ( i n t amount ) REQUIRES (mu) { / / OK. C a l l e r must have lo c ked mu. balance −= amount ; } public : void withd raw ( i n t amount ) { mu. l o c k ( ) ; / / OK. We ’ ve lo c ked mu. withd rawImpl ( amount ) ; / / WARNING! F a i l e d t o unlo c k mu. } void t r a n sf e rF r om ( BankAcct& b , i n t amount ) { mu. l o c k ( ) ; / / WARNING! Must l o c k b .mu. b . withd rawImpl ( amount ) ; / / OK. d e p o s itIm p l ( ) has no requi rement s . d e p o s itIm p l ( amount ) ; mu. unlo c k ( ) ; } } ; Fig. 1. Thread Safety Annotations Note that this example assumes the presence of a suitably annotated mutex.h [4] that declares which methods perform locking and unlocking. B. 
Thread Roles Thread safety analysis was originally designed to enforce locking policies such as the one previously described, but locks are not the only way to ensure safety. Another common pattern in many systems is to assign different roles to different threads, such as “worker thread” or “GUI thread” [6]. The same concepts used for mutexes and locking can also be used for thread roles, as shown in Figure 2. Here, a widget library has two threads, one to handle user input, like mouse clicks, and one to handle rendering. It also enforces a constraint: the draw() method should only be invoked only by the GUI thread. The analysis will warn if draw() is invoked directly from onClick(). The rest of this paper will focus discussion on mutexes in the interest of brevity, but there are analogous examples for thread roles. III. BASIC CONCEPTS Clang thread safety analysis is based on a calculus of capabilities [7] [8]. To read or write to a particular location in memory, a thread must have the capability, or permission, to do so. A capability can be thought of as an unforgeable key, #include ” ThreadRole . h ” ThreadRole Input Th read ; ThreadRole GUI Thread ; class Widget { public : v i r t u a l void o nC l i c k ( ) REQUIRES ( Input Th read ) ; v i r t u a l void draw ( ) REQUIRES ( GUI Thread ) ; } ; class Button : public Widget { public : void o nC l i c k ( ) o v e r r i d e { depressed = t rue ; draw ( ) ; / / WARNING! } } ; Fig. 2. Thread Roles or token, which the thread must present to perform the read or write. Capabilities can be either unique or shared. A unique capability cannot be copied, so only one thread can hold the capability at any one time. A shared capability may have multiple copies that are shared among multiple threads. Uniqueness is enforced by a linear type system [9]. The analysis enforces a single-writer/multiple-reader discipline. Writing to a guarded location requires a unique capability, and reading from a guarded location requires either a unique or a shared capability. In other words, many threads can read from a location at the same time because they can share the capability, but only one thread can write to it. Moreover, a thread cannot write to a memory location at the same time that another thread is reading from it, because a capability cannot be both shared and unique at the same time. This discipline ensures that programs are free of data races, where a data race is defined as a situation that occurs when multiple threads attempt to access the same location in memory at the same time, and at least one of the accesses is a write [10]. Because write operations require a unique capability, no other thread can access the memory location at that time. A. Background: Uniqueness and Linear Logic Linear logic is a formal theory for reasoning about resources; it can be used to express logical statements like: “You cannot have your cake and eat it too” [9]. A unique, or linear, variable must be used exactly once; it cannot be duplicated (used multiple times) or forgotten (not used). A unique object is produced at one point in the program, and then later consumed. Functions that use the object without consuming it must be written using a hand-off protocol. The caller hands the object to the function, thus relinquishing control of it; the function hands the object back to the caller when it returns. 
For example, if std :: stringstream were a linear type, stream programs would be written as follows:st d : : st r i n g st r e am ss ; / / produce ss auto& ss2 = ss << ” H e l l o ” ; / / consume ss auto& ss3 = ss2 << ” World . ” ; / / consume ss2 re tu rn ss3 . s t r ( ) ; / / consume ss3 Notice that each stream variable is used exactly once. A linear type system is unaware that ss and ss2 refer to the same stream; the calls to << conceptually consume one stream and produce another with a different name. Attempting to use ss a second time would be flagged as a use-after-consume error. Failure to call ss3. str () before returning would also be an error because ss3 would then be unused. B. Naming of Capabilities Passing unique capabilities explicitly, following the pattern described previously, would be needlessly tedious, because every read or write operation would introduce a new name. Instead, Clang thread safety analysis tracks capabilities as unnamed objects that are passed implicitly. The resulting type system is formally equivalent to linear logic but is easier to use in practical programming. Each capability is associated with a named C++ object, which identifies the capability and provides operations to produce and consume it. The C++ object itself is not unique. For example, if mu is a mutex, mu.lock() produces a unique, but unnamed, capability of type Cap (a dependent type). Similarly, mu.unlock() consumes an implicit parameter of type Cap. Operations that read or write to data that is guarded by mu follow a hand-off protocol: they consume an implicit parameter of type Cap and produce an implicit result of type Cap. C. Erasure Semantics Because capabilities are implicit and are used only for typechecking purposes, they have no run time effect. As a result, capabilities can be fully erased from an annotated program, yielding an unannoted program with identical behavior. In Clang, this erasure property is expressed in two ways. First, recommended practice is to hide the annotations behind macros, where they can be literally erased by redefining the macros to be empty. However, literal erasure is unnecessary. The analysis is entirely static and is implemented as a compile time warning; it cannot affect Clang code generation in any way. IV. THREAD SAFETY ANNOTATIONS This section provides a brief overview of the main annotations that are supported by the analysis. The full list can be found in the Clang documentation [4]. GUARDED BY(...) and PT GUARDED BY(...) GUARDED BY is an attribute on a data member; it declares that the data is protected by the given capability. Read operations on the data require at least a shared capability; write operations require a unique capability. PT GUARDED BY is similar but is intended for use on pointers and smart pointers. There is no constraint on the data member itself; rather, the data it points to is protected by the given capability. Mutex mu; i n t ∗p2 PT GUARDED BY(mu ) ; void t e s t ( ) { ∗p2 = 42; / / Warning ! p2 = new i n t ; / / OK ( no GUARDED BY ) . } REQUIRES(...) and REQUIRES SHARED(...) REQUIRES is an attribute on functions; it declares that the calling thread must have unique possession of the given capability. More than one capability may be specified, and a function may have multiple REQUIRES attributes. REQUIRES SHARED is similar, but the specified capabilities may be either shared or unique. 
Formally, the REQUIRES clause states that a function takes the given capability as an implicit argument and hands it back to the caller when it returns, as an implicit result. Thus, the caller must hold the capability on entry to the function and will still hold it on exit. Mutex mu; i n t a GUARDED BY(mu ) ; void foo ( ) REQUIRES (mu) { a = 0; / / OK. } void t e s t ( ) { foo ( ) ; / / Warning ! Requi res mu. } ACQUIRE(...) and RELEASE(...) The ACQUIRE attribute annotates a function that produces a unique capability (or capabilities), for example, by acquiring it from some other thread. The caller must not-hold the given capability on entry, and will hold the capability on exit. RELEASE annotates a function that consumes a unique capability, (e.g., by handing it off to some other thread). The caller must hold the given capability on entry, and will nothold it on exit. ACQUIRE SHARED and RELEASE SHARED are similar, but produce and consume shared capabilities. Formally, the ACQUIRE clause states that the function produces and returns a unique capability as an implicit result; RELEASE states that the function takes the capability as an implicit argument and consumes it. Attempts to acquire a capability that is already held or to release a capability that is not held are diagnosed with a compile time warning. CAPABILITY(...) The CAPABILITY attribute is placed on a struct, class or a typedef; it specifies that objects of that type can be used to identify a capability. For example, the threading libraries at Google define the Mutex class as follows:class CAPABILITY ( ” mutex ” ) Mutex { public : void l o c k ( ) ACQUIRE ( t hi s ) ; void reade rLock ( ) ACQUIRE SHARED( t hi s ) ; void unlo c k ( ) RELEASE( t hi s ) ; void reade rUnlock ( ) RELEASE SHARED( t hi s ) ; } ; Mutexes are ordinary C++ objects. However, each mutex object has a capability associated with it; the lock () and unlock() methods acquire and release that capability. Note that Clang thread safety analysis makes no attempt to verify the correctness of the underlying Mutex implementation. Rather, the annotations allow the interface of Mutex to be expressed in terms of capabilities. We assume that the underlying code implements that interface correctly, e.g., by ensuring that only one thread can hold the mutex at any one time. TRY ACQUIRE(b, ...) and TRY ACQUIRE SHARED(b, ...) These are attributes on a function or method that attempts to acquire the given capability and returns a boolean value indicating success or failure. The argument b must be true or false, to specify which return value indicates success. NO THREAD SAFETY ANALYSIS NO THREAD SAFETY ANALYSIS is an attribute on functions that turns off thread safety checking for the annotated function. It provides a means for programmers to opt out of the analysis for functions that either (a) are deliberately thread-unsafe, or (b) are thread-safe, but too complicated for the analysis to understand. Negative Requirements All of the previously described requirements discussed are positive requirements, where a function requires that certain capabilities be held on entry. However, the analysis can also track negative requirements, where a function requires that a capability be not-held on entry. Positive requirements are used to prevent race conditions. Negative requirements are used to prevent deadlock. Many mutex implementations are not reentrant, because making them reentrant entails a significant performance cost. 
Attempting to acquire a non-reentrant mutex that is already held will deadlock the program. To avoid deadlock, acquiring a capability requires a proof that the capability is not currently held. The analysis represents this proof as a negative capability, which is expressed using the ! negation operator: Mutex mu; i n t a GUARDED BY(mu ) ; void c l e a r ( ) REQUIRES ( !mu) { mu. l o c k ( ) ; a = 0; mu. unlo c k ( ) ; } void r e s et ( ) { mu. l o c k ( ) ; / / Warning ! C a l l e r cannot hold ’mu ’ . c l e a r ( ) ; mu. unlo c k ( ) ; } Negative capabilities are tracked in much the same way as positive capabilities, but there is a bit of extra subtlety. Positive requirements are typically confined within the class or the module in which they are declared. For example, if a thread-safe class declares a private mutex, and does all locking and unlocking of that mutex internally, then there is no reason clients of the class need to know that the mutex exists. Negative requirements lack this property. If a class declares a private mutex mu, and locks mu internally, then clients should theoretically have to provide proof that they have not locked mu before calling any methods of the class. Moreover, there is no way for a client function to prove that it does not hold mu, except by adding REQUIRES(!mu) to the function definition. As a result, negative requirements tend to propagate throughout the code base, which breaks encapsulation. To avoid such propagation, the analysis restricts the visibility of negative capabilities. The analyzer assumes that it holds a negative capability for any object that is not defined within the current lexical scope. The scope of a class member is assumed to be its enclosing class, while the scope of a global variable is the translation unit in which it is defined. Unfortunately, this visibility-based assumption is unsound. For example, a class with a private mutex may lock the mutex, and then call some external routine, which calls a method in the original class that attempts to lock the mutex a second time. The analysis will generate a false negative in this case. Based on our experience in deploying thread safety analysis at Google, we believe this to be a minor problem. It is relatively easy to avoid this situation by following good software design principles and maintaining proper separation of concerns. Moreover, when compiled in debug mode, the Google mutex implementation does a run time check to see if the mutex is already held, so this particular error can be caught by unit tests at run time. V. IMPLEMENTATION The Clang C++ compiler provides a sophisticated infrastructure for implementing warnings and static analysis. Clang initially parses a C++ input file to an abstract syntax tree (AST), which is an accurate representation of the original source code, down to the location of parentheses. In contrast, many compilers, including GCC, lower to an intermediate language during parsing. The accuracy of the AST makes it easier to emit quality diagnostics, but complicates the analysis in other respects. The Clang semantic analyzer (Sema) decorates the AST with semantic information. Name lookup, function overloading, operator overloading, template instantiation, and type checking are all performed by Sema when constructing the AST. Clang inserts special AST nodes for implicit C++ operations, such as automatic casts, LValue-to-RValue conversions,implicit destructor calls, and so on, so the AST provides an accurate model of C++ program semantics. 
Finally, the Clang analysis infrastructure constructs a control flow graph (CFG) for each function in the AST. This is not a lowering step; each statement in the CFG points back to the AST node that created it. The CFG is shared infrastructure; the thread safety analysis is only one of its many clients. A. Analysis Algorithm The thread safety analysis algorithm is flow-sensitive, but not path-sensitive. It starts by performing a topological sort of the CFG, and identifying back edges. It then walks the CFG in topological order, and computes the set of capabilities that are known to be held, or known not to be held, at every program point. When the analyzer encounters a call to a function that is annotated with ACQUIRE, it adds a capability to the set; when it encounters a call to a function that is annotated with RELEASE, it removes it from the set. Similarly, it looks for REQUIRES attributes on function calls, and GUARDED BY on loads or stores to variables. It checks that the appropriate capability is in the current set, and issues a warning if it is not. When the analyzer encounters a join point in the CFG, it checks to confirm that every predecessor basic block has the same set of capabilities on exit. Back edges are handled similarly: a loop must have the same set of capabilities on entry to and exit from the loop. Because the analysis is not path-sensitive, it cannot handle control-flow situations in which a mutex might or might not be held, depending on which branch was taken. For example: void foo ( ) { i f ( b ) mutex . l o c k ( ) ; / / Warning : b may o r may not be held here . doSomething ( ) ; i f ( b ) mutex . unlo c k ( ) ; } void l o c k A l l ( ) { / / Warning : c a p a b i l i t y s et s do not match / / at s t a r t and end of loop . fo r ( unsigned i =0; i < n ; ++ i ) mutexArray [ i ] . l o c k ( ) ; } Although this seems like a serious limitation, we have found that conditionally held locks are relatively unimportant in practical programming. Reading or writing to a guarded location in memory requires that the mutex be held unconditionally, so attempting to track locks that might be held has little benefit in practice, and usually indicates overly complex or poor-quality code. Requiring that capability sets be the same at join points also speeds up the algorithm considerably. The analyzer need not iterate to a fixpoint; thus it traverses every statement in the program exactly once. Consequently, the computational complexity of the analysis is O(n) with respect to code size. The compile time overhead of the warning is minimal. B. Intermediate Representation Each capability is associated with a C++ object. C++ objects are run time entities, that are identified by C++ expressions. The same object may be identified by different expressions in different scopes. For example: class Foo { Mutex mu; bool compare ( const Foo& ot h e r ) REQUIRES ( this−>mu, ot h e r .mu ) ; } void ba r ( ) { Foo a ; Foo ∗b ; . . . a .mu. l o c k ( ) ; b−>mu. l o c k ( ) ; / / REQUIRES (&a)−>mu, (∗ b ) .mu a . compare (∗ b ) ; . . . } Clang thread safety analysis is dependently typed: note that the REQUIRES clause depends on both this and other, which are parameters to compare. The analyzer performs variable substitution to obtain the appropriate expressions within bar(); it substitutes &a for this and ∗b for other. Recall, however, that the Clang AST does not lower C++ expressions to an intermediate language; rather, it stores them in a format that accurately represents the original source code. 
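The flow-sensitive bookkeeping of the analysis algorithm is easy to sketch independently of Clang. The toy Python model below tracks a set of held capabilities through basic blocks of (kind, capability) events and checks that predecessors agree at a join point; it reproduces the conditional-lock warning from the foo() example, but it is in no way the real analyzer, which operates on Clang's CFG and also handles shared and negative capabilities, SSA, and attribute lookup:

    from typing import FrozenSet, List, Tuple

    # A basic block is a list of (kind, capability) events, where kind is
    # "acquire", "release", or "requires". This is a toy stand-in for how the
    # analyzer reacts to ACQUIRE/RELEASE annotations on called functions and to
    # REQUIRES / GUARDED_BY checks; it is not Clang's implementation.
    Block = List[Tuple[str, str]]

    def run_block(held: FrozenSet[str], block: Block, diags: List[str]) -> FrozenSet[str]:
        caps = set(held)
        for kind, cap in block:
            if kind == "acquire":
                if cap in caps:
                    diags.append(f"double lock of {cap}")
                caps.add(cap)
            elif kind == "release":
                if cap not in caps:
                    diags.append(f"releasing {cap}, which is not held")
                caps.discard(cap)
            elif kind == "requires" and cap not in caps:
                diags.append(f"requires {cap}, but it is not held")
        return frozenset(caps)

    def join(exit_sets: List[FrozenSet[str]], diags: List[str]) -> FrozenSet[str]:
        # Every predecessor of a join point must agree on the capability set.
        if len(set(exit_sets)) > 1:
            diags.append("capability sets do not match at join point")
        return exit_sets[0]

    # The conditional-lock example from foo(): one branch locks mu, the other does not.
    diags: List[str] = []
    then_branch: Block = [("acquire", "mu")]
    else_branch: Block = []
    join([run_block(frozenset(), then_branch, diags),
          run_block(frozenset(), else_branch, diags)], diags)
    print(diags)   # ['capability sets do not match at join point']

Because the real analyzer keeps source-level AST expressions rather than a lowered IR, it must additionally decide when two different expressions name the same capability, which is what its intermediate representation is for.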
Consequently, (&a)−>mu and a.mu are different expressions. A dependent type system must be able to compare expressions for semantic (not syntactic) equality. The analyzer implements a simple compiler intermediate representation (IR), and lowers Clang expressions to the IR for comparison. It also converts the Clang CFG into single static assignment (SSA) form so that the analyzer will not be confused by local variables that take on different values in different places. C. Limitations Clang thread safety analysis has a number of limitations. The three major ones are: No attributes on types. Thread safety attributes are attached to declarations rather than types. For example, it is not possible to write vector, or ( int GUARDED BY(mu))[10]. If attributes could be attached to types, PT GUARDED BY would be unnecessary. Attaching attributes to types would result in a better and more accurate analysis. However, it was deemed infeasible for C++ because it would require invasive changes to the C++ type system that could potentially affect core C++ semantics in subtle ways, such as template instantiation and function overloading. No dependent type parameters. Race-free type systems as described in the literature often allow classes to be parameterized by the objects that are responsible for controlling access. [11] [3] For example, assume a Graph class has a list of nodes, and a single mutex that protects all of them. In this case, theNode class should technically be parameterized by the graph object that guards it (similar to inner classes in Java), but that relationship cannot be easily expressed with attributes. No alias analysis. C++ programs typically make heavy use of pointer aliasing; we currently lack an alias analysis. This can occasionally cause false positives, such as when a program locks a mutex using one alias, but the GUARDED BY attribute refers to the same mutex using a different alias. VI. EXPERIMENTAL RESULTS AND CONCLUSION Clang thread safety analysis is currently deployed on a wide scale at Google. The analysis is turned on by default, across the company, for every C++ build. Over 20,000 C++ files are currently annotated, with more than 140,000 annotations, and those numbers are increasing every day. The annotated code spans a wide range of projects, including many of Google’s core services. Use of the annotations at Google is entirely voluntary, so the high level of adoption suggests that engineering teams at Google have found the annotations to be useful. Because race conditions are insidious, Google uses both static analysis and dynamic analysis tools such as Thread Sanitizer [12]. We have found that these tools complement each other. Dynamic analysis operates without annotations and thus can be applied more widely. However, dynamic analysis can only detect race conditions in the subset of program executions that occur in test code. As a result, effective dynamic analysis requires good test coverage, and cannot report potential bugs until test time. Static analysis is less flexible, but covers all possible program executions; it also reports errors earlier, at compile time. Although the need for handwritten annotations may appear to be a disadvantage, we have found that the annotations confer significant benefits with respect to software evolution and maintenance. Thread safety annotations are widely used in Google’s core libraries and APIs. Annotating libraries has proven to be particularly important, because the annotations serve as a form of machine-checked documentation. 
The developers of a library and the clients of that library are usually different engineering teams. As a result, the client teams often do not fully understand the locking protocol employed by the library. Other documentation is usually out of date or nonexistent, so it is easy to make mistakes. By using annotations, the locking protocol becomes part of the published API, and the compiler will warn about incorrect usage. Annotations have also proven useful for enforcing internal design constraints as software evolves over time. For example, the initial design of a thread-safe class must establish certain constraints: locks are used in a particular way to protect private data. Over time, however, that class will be read and modified by many different engineers. Not only may the initial constraints be forgotten, they may change when code is refactored. When examining change logs, we found several cases in which an engineer added a new method to a class, forgot to acquire the appropriate locks, and consequently had to debug the resulting race condition by hand. When the constraints are explicitly specified with annotations, the compiler can prevent such bugs by mechanically checking new code for consistency with existing annotations. The use of annotations does entail costs beyond the effort required to write the annotations. In particular, we have found that about 50% of the warnings produced by the analysis are caused not by incorrect code but rather by incorrect or missing annotations, such as failure to put a REQUIRES attribute on getter and setter methods. Thread safety annotations are roughly analogous to the C++ const qualifier in this regard. Whether such warnings are false positives depends on your point of view. Google’s philosophy is that incorrect annotations are “bugs in the documentation.” Because APIs are read many times by many engineers, it is important that the public interfaces be accurately documented. Excluding cases in which the annotations were clearly wrong, the false positive rate is otherwise quite low: less than 5%. Most false positives are caused by either (a) pointer aliasing, (b) conditionally acquired mutexes, or (c) initialization code that does not need to acquire a mutex. Conclusion Type systems for thread safety have previously been implemented for other languages, most notably Java [3] [11]. Clang thread safety analysis brings the benefit of such systems to C++. The analysis has been implemented in a production C++ compiler, tested in a production environment, and adopted internally by one of the world’s largest software companies. REFERENCES [1] K. Asanovic et al., “A view of the parallel computing landscape,” Communications of the ACM, vol. 52, no. 10, 2009. [2] L.-C. Wu, “C/C++ thread safety annotations,” 2008. [Online]. Available: https://docs.google.com/a/google.com/document/d/1 d9MvYX3LpjTk 3nlubM5LE4dFmU91SDabVdWp9-VDxc [3] M. Abadi, C. Flanagan, and S. N. Freund, “Types for safe locking: Static race detection for java,” ACM Transactions on Programming Languages and Systems, vol. 28, no. 2, 2006. [4] “Clang thread safety analysis documentation.” [Online]. Available: http://clang.llvm.org/docs/ThreadSafetyAnalysis.html [5] “Clang: A c-language front-end for llvm.” [Online]. Available: http://clang.llvm.org [6] D. F. Sutherland and W. L. Scherlis, “Composable thread coloring,” PPoPP ’10: Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, 2010. [7] K. Crary, D. Walker, and G. 
Morrisett, “Typed memory management in a calculus of capabilities,” Proceedings of POPL, 1999. [8] J. Boyland, J. Noble, and W. Retert, “Capabilities for sharing,” Proceedings of ECOOP, 2001. [9] J.-Y. Girard, “Linear logic,” Theoretical computer science, vol. 50, no. 1, pp. 1–101, 1987. [10] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson, “Eraser: A dynamic data race detector for multithreaded programs.” ACM Transactions on Computer Systems (TOCS), vol. 15, no. 4, 1997. [11] C. Boyapati and M. Rinard, “A parameterized type system for race-free Java programs,” Proceedings of OOPSLA, 2001. [12] K. Serebryany and T. Iskhodzhanov, “Threadsanitizer: data race detection in practice,” Workshop on Binary Instrumentation and Applications, 2009. An Optimized Template Matching Approach to Intra Coding in Video/Image Compression Hui Su, Jingning Han, and Yaowu Xu Chrome Media, Google Inc., 1950 Charleston Road, Mountain View, CA 94043 ABSTRACT The template matching prediction is an established approach to intra-frame coding that makes use of previously coded pixels in the same frame for reference. It compares the previously reconstructed upper and left boundaries in searching from the reference area the best matched block for prediction, and hence eliminates the need of sending additional information to reproduce the same prediction at decoder. In viewing the image signal as an auto-regressive model, this work is premised on the fact that pixels closer to the known block boundary are better predicted than those far apart. It significantly extends the scope of the template matching approach, which is typically followed by a conventional discrete cosine transform (DCT) for the prediction residuals, by employing an asymmetric discrete sine transform (ADST), whose basis functions vanish at the prediction boundary and reach maximum magnitude at far end, to fully exploit statistics of the residual signals. It was experimentally shown that the proposed scheme provides substantial coding performance gains on top of the conventional template matching method over the baseline. Keywords: Template matching, Intra prediction, Transform coding, Asymmetric discrete sine transform 1. INTRODUCTION Intra-frame coding is a key component in video/image compression system. It predicts from previously reconstructed neighboring pixels to largely remove spatial redundancies. A codec typically allows various prediction directions1–3 , and the encoder selects the one that best describes the texture patterns (and hence rendering minimal rate-distortion cost) for block coding. Such boundary extrapolation based prediction is efficient when the image signals are well modeled by a first-order Markovian process. In practice, however, image signals might contain certain complicated patterns repeatedly appearing, which the boundary prediction approach can not effectively capture. This motivates the initial block matching prediction, that searches in the previously reconstructed frame area for reference, as an additional mode.4 A displacement vector per block is hence needed to inform decoder to reproduce the prediction, akin the motion vector for inter-frame motion compensation. 
To overcome such overhead cost that diminishes the performance gains, a template matching prediction (TMP) approach was developed5 that employs the available neighboring pixels of a block as a template, measures the template similarity between the block of interest and the candidate references, and chooses the most “similar” one as the prediction. Clearly the decoder is able to repeat the same process without recourse to further information, which further allows the TMP to operate in smaller block size for more precise prediction at no additional cost. A conventional 2D-DCT is then applied to the prediction residuals, followed by quantization and entropy coding, to encode the block. Certain coding performance gains were obtained by integrating the TMP in a regular intra coder. In viewing the image signals as an auto-regressive process, pixels close to the block boundaries are more correlated to the template pixels, and hence are better predicted by the matched reference, than those sitting at far end. Therefore, the residuals ought to exhibit smaller variance at the known boundaries and gradually increased energy to the opposite end, which makes the efficacy of the use of DCT questionable due to the fact that its basis functions get to the maximal magnitude at both ends and are agnostic to the statistical characteristics of the residuals. This work addresses this issue by incorporating the ADST,6, 7 whose basis functions possess the desired asymmetric properties, as an alternative to the TMP residuals for optimal coding performance. A complementary similarity measurement based on weighted template matching, in recognition of the statistical E-mails: {huisu, jingning, yaowu}@google.com. Visual Information Processing and Communication V, edited by Amir Said, Onur G. Guleryuz, Robert L. Stevenson, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 9029, 902904 © 2014 SPIE-IS&T · CCC code: 0277-786X/14/$18 · doi: 10.1117/12.2040890 SPIE-IS&T/ Vol. 9029 902904-1 Downloaded From: http://proceedings.spiedigitallibrary.org/ on 01/05/2015 Terms of Use: http://spiedl.org/termsReconstructed Pixels To-be-coded Pixels Target Block Prediction Block Target Template Candidate Template Figure 1. Template matching intra prediction. variations across the block, was also proposed to improve the search quality. The scheme was implemented in the VP9 framework, in conjunction with other boundary prediction based intra coding modes. Experiments demonstrated remarkable performance advantages over conventional TMP as well as the baseline codec. The rest of the paper is organized as follows: Sec. 2 presents a brief review on the template matching approach. In Sec. 3, we describe the proposed techniques in details. Experimenting results are presented in Sec. 4, and Sec. 5 concludes the paper. 2. REVISITING MATCHING PREDICTION We provide a brief review on the TMP approach5 in this section. As shown in Fig. 1, the TMP employs the pixels in the adjacent upper rows and left columns of a block as its template. Every template in the reconstructed area of the frame is considered as a reference template, and the template of the block to be encoded is the target template. The similarity between the target template and the reference templates is then evaluated in terms of sum of absolute/squared difference. 
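As a rough illustration of this search (and not the codec's actual implementation), the following sketch measures template similarity with a sum of absolute differences (SAD) and scans a simplified reference area; the frame layout, block size, template thickness, and search region are assumptions made for brevity.

#include <cstdint>
#include <cstdlib>
#include <limits>
#include <utility>
#include <vector>

// Reconstructed frame in raster order; only pixels above/left of the
// current block are assumed to be reconstructed and searchable.
struct Frame {
  int width, height;
  std::vector<uint8_t> pix;
  int at(int x, int y) const { return pix[y * width + x]; }
};

// SAD between the L-shaped templates (T rows above, T columns to the left,
// corner included) of the target block at (bx, by) and a candidate at (cx, cy).
int TemplateSAD(const Frame& f, int bx, int by, int cx, int cy,
                int block, int T) {
  int sad = 0;
  for (int dy = -T; dy < 0; ++dy)          // rows above the block
    for (int dx = -T; dx < block + T; ++dx)
      sad += std::abs(f.at(bx + dx, by + dy) - f.at(cx + dx, cy + dy));
  for (int dy = 0; dy < block; ++dy)       // columns to the left
    for (int dx = -T; dx < 0; ++dx)
      sad += std::abs(f.at(bx + dx, by + dy) - f.at(cx + dx, cy + dy));
  return sad;
}

// Return the position of the best-matching reference block. Both encoder
// and decoder can run this search on reconstructed pixels, so no side
// information needs to be transmitted. The search is restricted to rows
// strictly above the current block, and the block is assumed not to sit
// at the frame border; a real codec handles those cases explicitly.
std::pair<int, int> FindTemplatePrediction(const Frame& f, int bx, int by,
                                           int block = 4, int T = 2) {
  int best = std::numeric_limits<int>::max();
  std::pair<int, int> best_pos(bx, by);
  for (int cy = T; cy + block <= by; ++cy)
    for (int cx = T; cx + block + T <= f.width; ++cx) {
      int sad = TemplateSAD(f, bx, by, cx, cy, block, T);
      if (sad < best) { best = sad; best_pos = std::make_pair(cx, cy); }
    }
  return best_pos;  // the prediction is the block at best_pos, copied as-is
}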
The encoder selects amongst the reference templates the one that best resembles the target template as the candidate template, and the block corresponding to this candidate template is used as the prediction for the target block. Since it only involves comparing reconstructed pixels, the same operations can be repeated at the decoder side without any additional side information being sent, resulting in higher compression efficiency than the direct block matching approach.4 As a consequence, however, the decoding process becomes more computationally loaded. The TMP approach was shown to be particularly efficient in scenarios where certain complicated texture patterns, which cannot be captured by the conventional directional intra prediction modes, appear repeatedly in the image/frame. Recent research efforts have been devoted to further improving the TMP scheme, including combining multiple candidates with top similarity scores,8 using hybrid TMP and block matching (with the displacement vector sent explicitly),9 etc. This work focuses on optimizing the original TMP approach by observing and exploiting the statistical properties of the TMP residual signals. It is noteworthy that the proposed principles are generally applicable to other advanced variants as well.
3. PROPOSED TECHNIQUES
We view the image signals as an auto-regressive model, which implies that two nearby pixels are more correlated than those far apart. Since the template of a matched reference block closely resembles that of the block of interest, the pixels sitting close to the known boundaries of the two blocks are element-wise more correlated than those at the opposite end. Hence the pixels near the top/left boundaries are better predicted by the matched reference block, which translates into a key observation: the variance of the prediction residuals tends to vanish at the known boundaries and gradually increase towards the far end. This suggests that, unlike the discrete cosine transform (DCT), whose basis functions reach maximum magnitude at both ends, the (near) optimal spatial transform for the TMP residuals should possess such asymmetric properties. We hence propose to employ the asymmetric discrete sine transform (ADST)6,7 for transform coding of the TMP residuals. A complementary matching approach that expands the template to multiple boundary rows and columns, and uses a weighted sum-of-difference measurement, is first developed for more precise referencing. A statistical study of the TMP residuals, followed by a detailed discussion of the ADST, is provided next.
Figure 2. The proposed weighted template matching scheme. The pixels that are closer to the target block are assigned a larger weight.
3.1 Weighted Template Matching
In order to obtain reliable template matching, it is reasonable to define multiple layers of boundary pixels as the template of a block. In our study, we have observed that the prediction accuracy can be improved when the number of rows and columns in the template increases. However, it is not wise to adopt too many layers, as the gain in matching accuracy becomes saturated and the computational complexity explodes.
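Using the two-layer template and the 3:2 inner-to-outer weights specified in the next paragraph, the matching cost becomes a weighted SAD. A minimal sketch of one way to compute it (illustrative only; pixel access mirrors the earlier sketch):

#include <cstdint>
#include <cstdlib>
#include <vector>

// Weighted SAD over a 2-row/2-column template (Sec. 3.1): pixels in the
// row/column adjacent to the block get weight 3, the outer layer weight 2.
int WeightedTemplateSAD(const std::vector<uint8_t>& frame, int stride,
                        int bx, int by, int cx, int cy, int block) {
  auto at = [&](int x, int y) {
    return static_cast<int>(frame[y * stride + x]);
  };
  const int T = 2;                      // template thickness: two layers
  int cost = 0;
  for (int dy = -T; dy < 0; ++dy)       // rows above (corner included)
    for (int dx = -T; dx < block; ++dx) {
      int w = (dy == -1) ? 3 : 2;       // inner row weighted 3, outer row 2
      cost += w * std::abs(at(bx + dx, by + dy) - at(cx + dx, cy + dy));
    }
  for (int dy = 0; dy < block; ++dy)    // columns to the left of the block
    for (int dx = -T; dx < 0; ++dx) {
      int w = (dx == -1) ? 3 : 2;       // inner column weighted 3, outer 2
      cost += w * std::abs(at(bx + dx, by + dy) - at(cx + dx, cy + dy));
    }
  return cost;
}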
In our implementation, the template consists of the pixels in the 2 rows and 2 columns above and to the left of the given block, which gives a good tradeoff between accuracy and computational complexity. The similarity between the target template and the reference templates can be measured by the sum of absolute differences (SAD). Along the same line of recognizing the variations in statistics, the template pixels closer to the block are more highly correlated with the block content, and hence should be weighted more heavily in the SAD calculation than the distant ones. This idea is illustrated in Fig. 2. A weight ratio of 3:2 for the inner row/column versus the outer row/column is used in this work.
3.2 Spatial Transformation
In video/image compression, the prediction residuals are typically processed via transformation to further remove the remaining spatial redundancy, before the quantization and entropy coding modules. The Karhunen-Loeve transform (KLT) is considered the optimal spatial transform in terms of energy compaction. However, the KLT is rarely used in practical coding systems due to its high computational complexity. The DCT has long been a popular substitute due to its good tradeoff between energy compaction and complexity. The basis functions of the DCT are as follows:

[T_C]_{j,i} = \alpha \cos\left( \frac{\pi (j-1)(2i-1)}{2N} \right),   (1)

where N is the block size, i, j \in \{1, 2, \dots, N\} denote the space and frequency indexes, respectively, and \alpha = \sqrt{1/N} if j = 1, \alpha = \sqrt{2/N} otherwise.

It is easy to see that the basis functions of the DCT achieve their maximum energy at both ends (i.e., i = 1 or i = N). Assuming the template of a matched reference block closely approximates that of the block of interest, it is highly likely that pixels close to these known boundaries are also well predicted, while the distant pixels are less correlated, which results in a relatively higher residual variance. This postulation is verified by the following experimental study. We collected the absolute values of the TMP prediction residues element-wise over 8000 blocks (of dimension 4 x 4) from the foreman sequence, and the average of the residue signal at each pixel location was calculated, as shown below:

4.05  4.37  4.51  5.13
4.72  5.48  5.50  6.60
5.04  5.95  6.12  7.32
5.50  6.28  6.97  8.20

As can be seen from the matrix, the variance of the prediction residue signal indeed increases along both the horizontal and vertical directions. As mentioned above, the basis functions of the conventional DCT achieve their maximum energy at both ends and are therefore agnostic to the statistical patterns of the prediction residuals. As an alternative, the ADST6,7 has basis functions of the form:

[T_S]_{j,i} = \frac{2}{\sqrt{2N+1}} \sin\left( \frac{(2j-1) i \pi}{2N+1} \right),   (2)

where N is the block size and i, j \in \{1, 2, \dots, N\} again denote the space and frequency indexes. It is shown6,7 that the ADST is a better approximation of the optimal KLT than the DCT when partial boundary information is available. Clearly, the basis functions of the ADST vanish at the known prediction boundary (i = 1) and reach maximum magnitude at the far end (i = N), and therefore match well with the statistical patterns of the TMP residuals. We hence propose to employ the ADST as the spatial transform for the TMP residuals.
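To make Eqs. (1) and (2) concrete, the sketch below builds the N x N DCT and ADST basis matrices (using 0-based indices) and applies a 1-D transform to a residual vector. It is an illustration of the formulas only, not VP9's fixed-point transform implementation.

#include <cmath>
#include <vector>

const double kPi = 3.14159265358979323846;

// DCT basis of Eq. (1): row j is frequency, column i is space (0-based).
std::vector<std::vector<double>> DctBasis(int N) {
  std::vector<std::vector<double>> T(N, std::vector<double>(N));
  for (int j = 0; j < N; ++j) {
    double alpha = std::sqrt((j == 0 ? 1.0 : 2.0) / N);
    for (int i = 0; i < N; ++i)
      T[j][i] = alpha * std::cos(kPi * j * (2 * i + 1) / (2.0 * N));
  }
  return T;
}

// ADST basis of Eq. (2): small magnitude at the known boundary (i = 0),
// maximal magnitude at the far end (i = N - 1).
std::vector<std::vector<double>> AdstBasis(int N) {
  std::vector<std::vector<double>> T(N, std::vector<double>(N));
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      T[j][i] = 2.0 / std::sqrt(2.0 * N + 1.0) *
                std::sin(kPi * (2 * j + 1) * (i + 1) / (2.0 * N + 1.0));
  return T;
}

// 1-D forward transform y = T * x. For a 2-D residual block the transform
// is applied separably (to columns, then rows), using the ADST along each
// direction whose prediction boundary is known.
std::vector<double> Transform1D(const std::vector<std::vector<double>>& T,
                                const std::vector<double>& x) {
  const int N = static_cast<int>(x.size());
  std::vector<double> y(N, 0.0);
  for (int j = 0; j < N; ++j)
    for (int i = 0; i < N; ++i)
      y[j] += T[j][i] * x[i];
  return y;
}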
It is experimentally shown in the next section that the use of ADST provides substantial performance improvement over the TMP followed by the conventional DCT.
4. EXPERIMENT RESULTS
The proposed scheme was tested in the VP9 framework.1 We verified its efficacy in a relatively simplified setting, where the block size was fixed at 8 x 8. There are 10 intra prediction modes in VP9, including vertical prediction, horizontal prediction, 8 angular prediction modes, and a "true motion" mode that utilizes the left, above, and corner pixels simultaneously. The TMP scheme was implemented as an additional mode to the 10 existing ones. The selection among the 11 modes is based on rate-distortion optimization. In the TMP mode, the 8 x 8 block is further partitioned into four 4 x 4 blocks, each of which is predicted via template matching, followed by the 2D-ADST transform, quantization, and reconstruction, in raster scan order. The template consists of pixels from the 2 rows and 2 columns above and to the left of the given block. For the weighted template matching, we use a weight ratio of 3:2 for the inner row/column versus the outer row/column, as shown in Fig. 2.
Figure 3. Rate-distortion curves (PSNR versus bits per frame) of the Ice (upper) and Foreman (lower) test sequences, comparing the VP9 baseline, conventional template matching, and the proposed scheme.
Several test video clips were used to compare the coding efficiency, including the Ice, Foreman, and Carphone sequences. For every test sequence, the first 75 frames were coded as key frames (i.e., all blocks were coded in intra modes), at various bit-rates. The coding performance gains of the conventional TMP and the proposed method over the reference codec, measured by the Bjontegaard metric, are shown in Table 1. Clearly, the proposed approach that optimizes the transformation for the prediction residual significantly improves the performance of TMP, and both outperform the reference VP9 baseline. The rate-distortion curves of the Ice and Foreman sequences are also provided in Fig. 3. It can be seen from the figure that the proposed techniques boost the coding efficiency of the conventional TMP consistently.
Table 1. Coding performance gains over the VP9 baseline in terms of bit-rate reduction percentage.
Sequence     Conventional TMP    Proposed Method
Ice          2.89                3.78
Foreman      2.88                3.33
Carphone     1.05                1.35
5. CONCLUSIONS AND FUTURE WORK
This work proposed a novel approach that incorporates the ADST for TMP prediction residuals as an additional mode for intra-frame coding. A complementary template matching method, along the lines of recognizing the statistical variations across the block, was also provided for more precise reference search. The scheme, implemented in the VP9 framework, demonstrated substantial performance improvements over the conventional TMP as well as the reference codec. The TMP approach can also be applied to inter-frame prediction.10 The template of a block is defined as the pixels in the adjacent upper rows and left columns, in the same way as in the case of intra prediction.
The optimal template which is best matched to that of the block of interest is found in a previously encoded reference frame, and the block to be encoded is filled in by copying the block corresponding the optimal template. By the same principles in this work, the residue signal of the template matching inter prediction should also present asymmetric statistical property across the block. We thus expect the ADST to be more efficient than the conventional DCT for the transform coding of the template matching inter prediction, and are currently working along this direction. REFERENCES [1] VP9 Video Codec , http://www.webmproject.org/vp9/. [2] Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A., “Overview of the H.264/avc video coding standard,” IEEE Trans. Circuits and Systems for Video Technology 13, 560–576 (July 2003). [3] Sullivan, G., J. Ohm, W. H., and Wiegand, T., “Overview of the high effciency video coding (HEVC) standard,” IEEE Trans. Circuits and Systems for Video Technology 22, 1649–1668 (Dec. 2012). [4] Yu, S. and Chrysafis, C., “New intra prediction using intra-macroblock motion compensation,” Tech. Rep. JVT-C151 (2002). [5] Tan, T., Boon, C., and Suzuki, Y., “Intra prediction by template matching,” IEEE Proc. ICIP , 1693–1696 (2006). [6] Han, J., Saxena, A., and Rose, K., “Towards jointly optimal spatial prediction and adaptive transform in video/image coding,” IEEE Proc. ICASSP , 726–729 (2010). [7] Han, J., Saxena, A., Melkote, V., and Rose, K., “Jointly optimized spatial prediction and block transform for video and image coding,” IEEE Trans. on Image Processing 21, 1874–1884 (2012). [8] Tan, T., Boon, C., and Suzuki, Y., “Intra prediction by averaged template matching predictors,” IEEE Proc. CCNC (2007). [9] Cherigui, S., Thoreau, D., Guillotel, P., and Perez, P., “Hybrid template and block matching algorithm for image intra prediction,” IEEE Proc. ICASSP , 781–784 (2012). [10] Sugimoto, K., Kobayashi, M., Suzuki, Y., Kato, S., and Boon, C. S., “Inter frame coding with template matching spatio-temporal prediction,” IEEE Proc. ICIP (2004). SPIE-IS&T/ Vol. 9029 902904-6 Downloaded From: http://proceedings.spiedigitallibrary.org/ on 01/05/2015 Terms of Use: http://spiedl.org/terms How Many People Visit YouTube? Imputing Missing Events in Panels With Excess Zeros Georg M. Goerg, Yuxue Jin, Nicolas Remy, Jim Koehler1 1 Google, Inc.; United States E-mail for correspondence: gmg@google.com Abstract: Media-metering panels track TV and online usage of people to analyze viewing behavior. However, panel data is often incomplete due to nonregistered devices, non-compliant panelists, or work usage. We thus propose a probabilistic model to impute missing events in data with excess zeros using a negative-binomial hurdle model for the unobserved events and beta-binomial sub-sampling to account for missingness. We then use the presented models to estimate the number of people in Germany who visit YouTube. Keywords: imputation; missing data; zero inflation; panel data. 1 Introduction Media panels (GfK Consumer Panels, 2013) are used by advertisers to estimate reach and frequency of a campaign: reach is the fraction of the population that has seen an ad, frequency tells us how often they have seen it (on average). It is important to get good estimates from panel data, as they largely determine the cost of an ad spot on TV or a website. Na¨ıvely, one would use a sample fraction of the number of non-zero events (website visits, TV spots watched, etc.) 
per unit time to estimate reach; similarly, for frequency. This, however, suffers from underestimation as panels often only record a fraction of all events due to e.g., non-compliance or work usage. Correcting this bias and imputing missing events has been studied previously (Fader and Hardie, 2000; Yang et al., 2010). In this work we i) extend the beta-binomial negative-binomial (BBNB) model (Hofler and Scrogin, 2008) with a hurdle component to improve modeling excess zeros in panel data (§2); ii) present the maximum likelihood estimator (MLE) and also add prior information on missingness (§3); and iii) use the methodology to estimate – from online media panels and internal YouTube log files – how many people in Germany visit YouTube (§4). The proposed methodology can be applied to a great variety of situations where events have been counted – but some are known to be missing.2 How Many People Visit YouTube? 2 Hierarchical Event Imputation Let Ni ∈ {0, 1, 2, . . .} count the true (but unobserved) number of visits by panelist i. The population consists of people who do not visit YouTube at all (with probability q0 ∈ [0, 1]), and those who visit at least once. If she visits (overcoming the “hurdle” with probability 1 − q0), we assume that Ni is distributed according to a shifted Poisson distribution (starting at n = 1) with rate λi . For model heterogeneity among the population we use a Gamma  r, q1 1−q1  prior for λi , with r > 0 and q1 ∈ (0, 1). Overall, this yields a shifted negative binomial hurdle (NBH) distribution P (N = n; q0, q1, r) = ( q0, if n = 0, (1 − q0) · Γ(n+r−1) Γ(r)Γ(n) · (1 − q1) r q n−1 1 , if n ≥ 1. (1) We choose a hurdle, rather than a mixture, model for the excess zeros (Hu et al., 2011), since 1 − q0 can be directly interpreted as the true – but unobserved – 1+ reach: if an advertiser shows an ad on YouTube they can expect that a fraction of 1 − q0 of the population sees it at least once. Let pi be the probability a visit of user i is recorded in the panel. Assuming independence across visits the total number of recorded panel events, Ki ∈ {0, 1, 2, . . .}, thus follows a binomial distribution, Ki ∼ Bin(Ni , pi). To account for heterogeneity across the population we assume pi ∼ Beta(µ, φ), with mean µ and precision φ (Ferrari and Cribari-Neto, 2004). Here µ represents the expected non-missing rate and φ the (inverse) variation across the population. Integrating out pi gives a Beta-Binomial (BB) distribution, Ki | Ni ∼ BB(Ni ; µ, φ). (2) Combining (1) and (2) yields a hierarchical beta-binomial negative-binomial hurdle (BBNBH) imputation model with parameter vector θ = (µ, φ, q0, r, q1): Ni ∼ NBH(N; q0, r, q1) and Ki | Ni ∼ BB(K | Ni ; µ, φ). (3) 2.1 Joint Distribution The pdf of (2) can be written as g(k | n; µ, φ) =  n k  Γ(k + φµ)Γ(n − k + (1 − µ)φ) Γ(n + φ) Γ(φ) Γ(µφ)Γ(φ(1 − µ)) . For k = 0 this reduces to P (K = 0 | N, µ, φ) = Γ(n + (1 − µ)φ) Γ(n + φ) × Γ(φ) Γ(φ(1 − µ)) . (4)Goerg et al. 3 Due to the zero hurdle it is useful to treat N = 0 and N > 0 separately: P (N, K) = P (K | N) · P (N) = BB(k | n; µ, φ) · NBH(n; q0, q1, r) (5) For n = 0, (5) is non-zero only for k = 0, P (N = 0, K = 0) = q0, since P (K > N) = 0. For n > 0, P (N = n, K = k) =(1 − q0) 1 B(φµ, φ(1 − µ)) (1 − q1) r Γ(r) × Γ(k + φµ) Γ(k + 1) × Γ(n − k + φ(1 − µ)) Γ(n − k + 1) Γ(n + r − 1) Γ(n + φ) q n−1 1 × Γ(n + 1) Γ(n) . (6) 2.2 Conditional Predictive Distribution For Imputation The panel records ki events for panelist i, but we want to know how many events truly occurred. 
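Before turning to that question, the model components defined so far can be written down directly. The sketch below evaluates the hurdle pmf of Eq. (1), the beta-binomial pmf of Eq. (2), and their product as in Eq. (5), using log-gamma functions for numerical stability; the function names are illustrative and not part of any published package.

#include <cmath>

// Shifted negative-binomial hurdle pmf, Eq. (1): P(N = n; q0, q1, r).
double NbhPmf(int n, double q0, double q1, double r) {
  if (n == 0) return q0;
  double log_coef =
      std::lgamma(n + r - 1.0) - std::lgamma(r) - std::lgamma(n);
  double log_p = std::log1p(-q0) + log_coef + r * std::log1p(-q1) +
                 (n - 1) * std::log(q1);
  return std::exp(log_p);
}

// Beta-binomial pmf, Eq. (2): P(K = k | N = n; mu, phi), i.e. binomial
// subsampling with p ~ Beta(mu, phi), alpha = mu*phi, beta = (1 - mu)*phi.
double BetaBinomialPmf(int k, int n, double mu, double phi) {
  if (k < 0 || k > n) return 0.0;  // P(K > N) = 0
  double a = mu * phi, b = (1.0 - mu) * phi;
  double log_choose = std::lgamma(n + 1.0) - std::lgamma(k + 1.0) -
                      std::lgamma(n - k + 1.0);
  double log_p = log_choose + std::lgamma(k + a) + std::lgamma(n - k + b) -
                 std::lgamma(n + phi) + std::lgamma(phi) - std::lgamma(a) -
                 std::lgamma(b);
  return std::exp(log_p);
}

// Joint pmf of the hierarchical BBNBH model, Eq. (5): P(N = n, K = k).
double BbnbhJointPmf(int n, int k, double mu, double phi, double q0,
                     double r, double q1) {
  return BetaBinomialPmf(k, n, mu, phi) * NbhPmf(n, q0, q1, r);
}

Summing BbnbhJointPmf over n >= k (truncated at a suitably large bound) gives P(K = k), which is the quantity needed by the weighted log-likelihood used for estimation in Section 3.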
That is, we are interested in (dropping subscript i) P (N = n | K = k) = P (K = k | N = n) P (N = n) P (K = k) , (7) To obtain analytical expressions we consider k = 0 and k > 0 separately: k = 0: Either none truly happened (n = 0) or a panelist visited at least once (n > 0), but none were recorded. n = 0: P (N = 0 | K = 0) = q0 P (K = 0). (8) n > 0: P (N = n | K = 0) = 1 P (K = 0) × Γ(n + φ(1 − µ)) Γ(n + φ) Γ(φ) Γ(φ(1 − µ)) × (1 − q0) Γ(n + r − 1) Γ(n) (1 − q1) r Γ(r) q n−1 1 , where the second term comes from (4). k > 0: The zero “hurdle” for N has been surpassed for sure. n < k : By construction of Binomial subsampling P (N = n | K = k) = 0 for all n < k. (9) n ≥ k: Here P (N = n | K = k) = n · q n−1 1 Γ(n − k + (1 − µ)φ) Γ(n − k + 1)Γ(n + φ) Γ(n + r − 1)× X∞ m=0 (m + k) Γ(m + φ(1 − µ)) Γ(m + 1) Γ(m + k + r − 1) Γ(m + k + φ) q m+k−1 1 !−1 .4 How Many People Visit YouTube? Estimate Std. Err. t value P r(> |t|) µ 0.272 q0 0.641 0.016 38.858 0.000 q1 0.982 0.002 494.105 0.000 r 0.252 0.021 11.811 0.000 φ 2.320 0.594 3.907 0.000 TABLE 1: MLE for θ for panel data on YouTube visits in Germany. 3 Parameter Estimation Let k = {k1, . . . , kP } be the number of observed events for all P panelist. Each panelist also has socio-economic indicators such as gender, age, and income. These attributes determine their demographic weight ˜wi , which equals the number of people in the entire population that panelist i represents. Finally, let wi = ˜wi ·  P/PP i=1 w˜i  be re-scaled weight of panelist i such that PP i=1 wi equals sample size P. We estimate θ using maximum likelihood (MLE), θb = arg maxθ∈Θ `(θ; x), where the log-likelihood `(θ; x) = X {k|xk>0} xk · log P (K = k; θ), (10) and x = {xk | k = 0, 1, . . . , max (k)}, where xk = P {i|ki=k} wi is the total weight of all panelists with k visits. For deriving closed form expressions of P (K = k) = P∞ n=0 P (N = n, K = k) it is simpler to consider k = 0 and k > 0 separately: P (K = 0) = q0 + (1 − q0) × Γ(φ) Γ(φ(1 − µ)) (1 − q1) r Γ(r) × X∞ n=0 Γ(n + 1 + φ(1 − µ)) Γ(n + 1) Γ(n + r) Γ(n + 1 + φ) q n 1 , (11) and for k > 0, P (K = k) =(1 − q0)(1 − q1) r Γ(φ) Γ(µφ)Γ(φ(1 − µ)) 1 Γ(r) × Γ(k + µφ) Γ(k + 1) × X∞ m=0 (m + k) Γ(m + φ(1 − µ)) Γ(m + 1) Γ(m + k + r − 1) Γ(m + k + φ) q m+k−1 1 . (12)Goerg et al. 5 0 4 8 12 17 22 cdf 0.65 0.80 P(N <= n; r = 0.25, q1 = 0.98, q0 = 0.64) true counts (N) q0 = 64 % 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4 5 pdf Beta(p; µ = 0.27, φ = 2.3) non−missingness rate α = 0.63, β = 1.7 0 5 10 15 20 25 0.80 cdf 0.90 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● P(K <= k; θ) observed counts (K) ● empirical model Log−likelihood: −6466.69 0 2 4 6 8 10 K = 0 K = 2 pmf 0.0 0.4 0.8 P(N = n | K = k; θ) true counts (N) E(N|K=0) = 1.02 E(N|K=2) = 13.12 79.5% FIGURE 1: Model estimates for: (top left) true counts Ni ; (top right) nonmissing rate pi ; (bottom left) empirical count frequency and model fit; (bottom right) conditional predictive distributions and expectations. 3.1 Fix expected non-missing rate µ Usually, researchers must estimate all 5 parameters from panel data. For our application, though, we can estimate (and fix) the non-missing rate µ a-priori as we have access to internal YouTube log files. Let ¯kW˜ = PP i=1 w˜iki be the observed panel visits projected to the entire population. Analogously, let N¯W˜ = PP i=1 w˜iNi be the panel projections of the number of true YouTube visits. 
While any single Ni is unobservable, we can estimate N¯W˜ by simply counting all YouTube homepage views in Germany from our YouTube log files, yielding Nb¯W˜ . We herewith obtain a plug-in estimate of the non-missing rate, µbLogs = ¯kW˜ /Nb¯W˜ . The remaining 4 parameters, θ(−µ) = (φ, q0, r, q1), can be obtained by MLE, θb (−µ) = arg maxθ(−µ) `((µbLogs, θ(−µ)); x). The overall estimate is θb = (µbLogs, θb (−µ)). 4 Estimating YouTube Audience in Germany Here we use data from a German online panel (GfK Consumer Panels, 2013), which monitors web usage of P = 6, 545 individuals in October, 2013 (31 days). In particular, we are interested in the probability that an adult in Germany visited the YouTube homepage www.youtube.de. Empirically,Pb (K = 0) = 0.81, yielding 19% observed 1+ reach. However, we know by comparison to YouTube log files that the panel only recorded 27.2% of all impressions. We fix the expected non-missing rate at µb = 0.272 and obtain the remaining parameters via MLE (Table 1): Figure 1 shows the model fit for the true, observed, and predictive distribution. In particular, the true 1+ reach is 36% (qb0 = 0.64), not 19% as the na¨ıve estimate suggests. 5 Discussion We introduce a probabilistic framework to impute missing events in count data, including a hurdle component for more flexibility to model lots of zeros. Researchers can use our models to obtain accurate probabilistic predictions of the number of true, unobserved events. We apply our methodology to accurately estimate how many people in Germany visit YouTube. Acknowledgments: We want to thank Christoph Best, Penny Chu, Tony Fagan, Yijia Feng, Oli Gaymond, Simon Morris, Raimundo Mirisola, Andras Orban, Simon Rowe, Sheethal Shobowale, Yunting Sun, Wiesner Vos, Xiaojing Wang, and Fan Zhang for constructive discussions and feedback. References Fader, P. and Hardie, B. (2000). A note on modelling underreported Poisson counts. Journal of Applied Statistics, 27(8):953–964. Ferrari, S. and Cribari-Neto, F. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics, 31(7):799–815. GfK Consumer Panels (2013). Media Efficiency Panel. Hofler, R. A. and Scrogin, D. (2008). A count data frontier model. Technical report, University of Central Florida. Hu, M., Pavlicova, M., and Nunes, E. (2011). Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. Am J Drug Alcohol Abuse, 37(5):367–75. Rose, C., Martin, S., Wannemuehler, K., and Plikaytis, B. (2006). On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. J Biopharm Stat, 16(4):463–81. Schmittlein, D. C., Bemmaor, A. C., and Morrison, D. G. (1985). Why Does the NBD Model Work? Robustness in Representing Product Purchases, Brand Purchases and Imperfectly Recorded Purchases. Marketing Science, 4(3):255–266. Yang, S., Zhao, Y., and Dhar, R. (2010). Modeling the underreporting bias in panel survey data. Marketing Science, 29(3):525–539. 38 COMMUNICATIONS OF THE ACM | SEPTEMBER 2014 | VOL. 57 | NO. 9 practice DOI:10.1145/2643134 Article development led by queue.acm.org Preventing script injection vulnerabilities through software design. BY CHRISTOPH KERN SCRIPT INJECTION VULNERABILITIES are a bane of Web application development: deceptively simple in cause and remedy, they are nevertheless surprisingly difficult to prevent in large-scale Web development. 
Cross-site scripting (XSS)2,7,8 arises when insufficient data validation, sanitization, or escaping within a Web application allow an attacker to cause browser-side execution of malicious JavaScript in the application’s context. This injected code can then do whatever the attacker wants, using the privileges of the victim. Exploitation of XSS bugs results in complete (though not necessarily persistent) compromise of the victim’s session with the vulnerable application. This article provides an overview of how XSS vulnerabilities arise and why it is so difficult to avoid them in real-world Web application software development. Software design patterns developed at Google to address the problem are then described. A key goal of these design patterns Securing the Tangled WebSEPTEMBER 2014 | VOL. 57 | NO. 9 | COMMUNICATIONS OF THE ACM 39 IMAGE BY PHOTOBANK GALLERY is to confine the potential for XSS bugs to a small fraction of an application’s code base, significantly improving one’s ability to reason about the absence of this class of security bugs. In several software projects within Google, this approach has resulted in a substantial reduction in the incidence of XSS vulnerabilities. Most commonly, XSS vulnerabilities result from insufficiently validating, sanitizing, or escaping strings that are derived from an untrusted source and passed along to a sink that interprets them in a way that may result in script execution. Common sources of untrustworthy data include HTTP request parameters, as well as user-controlled data located in persistent data stores. Strings are often concatenated with or interpolated into larger strings before assignment to a sink. The most frequently encountered sinks relevant to XSS vulnerabilities are those that interpret the assigned value as HTML markup, which includes server-side HTTP responses of MIME-type text/html, and the Element.prototype.innerHTML Document Object Model (DOM)8 property in browser-side JavaScript code. Figure 1a shows a slice of vulnerable code from a hypothetical photosharing application. Like many modern Web applications, much of its user-interface logic is implemented in browser-side JavaScript code, but the observations made in this article transfer readily to applications whose UI is implemented via traditional serverside HTML rendering. In code snippet (1) in the figure, the application generates HTML markup for a notification to be shown to a user when another user invites the former to view a photo album. The generated markup is assigned to the innerHTML property of a DOM 40 COMMUNICATIONS OF THE ACM | SEPTEMBER 2014 | VOL. 57 | NO. 9 practice main page. If the login resulted from a session time-out, however, the app navigates back to the URL the user had visited before the time-out. Using a common technique for short-term state storage in Web applications, this URL is encoded in a parameter of the current URL. The page navigation is implemented via assignment to the window.location.href DOM property, which browsers interpret as instruction to navigate the current window to the provided URL. Unfortunately, navigating a browser to a URL of the form javascript:attackScript causes execution of the URL’s body as Java Script. In this scenario, the target URL is extracted from a parameter of the current URL, which is generally under attacker control (a malicious page visited by a victim can instruct the browser to navigate to an attacker-chosen URL). Thus, this code is also vulnerable to XSS. 
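The remedy, detailed in the next paragraph, is to accept only URLs that are scheme-relative or whose scheme is on a small allowlist. A minimal sketch of such a check (the function name and allowlist are illustrative; this is not the Closure Library's actual API, and it is not a full URL parser):

#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Returns true if 'url' looks safe to use as a navigation target: either it
// has no scheme (a relative or scheme-relative URL) or its scheme is benign.
bool IsSafeNavigationUrl(const std::string& url) {
  static const std::vector<std::string> kAllowed = {"http", "https", "mailto",
                                                    "ftp"};
  // A scheme, if present, is everything before the first ':' that occurs
  // before any '/', '?' or '#'.
  std::string::size_type colon = url.find(':');
  std::string::size_type delim = url.find_first_of("/?#");
  if (colon == std::string::npos ||
      (delim != std::string::npos && delim < colon))
    return true;  // no scheme, so the URL cannot be javascript: or data:
  std::string scheme = url.substr(0, colon);
  std::transform(scheme.begin(), scheme.end(), scheme.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  return std::find(kAllowed.begin(), kAllowed.end(), scheme) != kAllowed.end();
}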
To fix the bug, it is necessary to validate that the URL will not result in script execution when dereferenced, by ensuring that its scheme is benign— for example, https. Why Is XSS So Difficult to Avoid? Avoiding the introduction of XSS into nontrivial applications is a difficult problem in practice: XSS remains among the top vulnerabilities in Web applications, according to the Open Web Application Security Project (OWASP);4 within Google it is the most common class of Web application vulnerabilities among those reported under Google’s Vulnerability Reward Program (https://goo.gl/82zcPK). Traditionally, advice (including my own) on how to prevent XSS has largely focused on: ˲ Training developers how to treat (by sanitization, validation, and/or escaping) untrustworthy values interpolated into HTML markup.2,5 ˲ Security-reviewing and/or testing code for adherence to such guidance. In our experience at Google, this approach certainly helps reduce the incidence of XSS, but for even moderately complex Web applications, it does not prevent introduction of XSS to a reasonably high degree of confidence. We see a combination of factors leading to this situation. element (a node in the hierarchical object representation of UI elements in a browser window), resulting in its evaluation and rendering. The notification contains the album’s title, chosen by the second user. A malicious user can create an album titled: Since no escaping or validation is applied, this attacker-chosen HTML is interpolated as-is into the markup generated in code snippet (1). This markup is assigned to the innerHTML sink, and hence evaluated in the context of the victim’s session, executing the attacker-chosen JavaScript code. To fix this bug, the album’s title must be HTML-escaped before use in markup, ensuring that it is interpreted as plain text, not markup. HTMLescaping replaces HTML metacharacters such as <, >, ", ', and & with corresponding character entity references or numeric character references: <, >, ", ', and &. The result will then be parsed as a substring in a text node or attribute value and will not introduce element or attribute boundaries. As noted, most data flows with a potential for XSS are into sinks that interpret data as HTML markup. But other types of sinks can result in XSS bugs as well: Figure 1b shows another slice of the previously mentioned photo-sharing application, responsible for navigating the user interface after a login operation. After a fresh login, the app navigates to a preconfigured URL for the application’s The following code snippet intends to populate a DOM element with markup for a hyperlink (an HTML anchor element): var escapedCat = goog.string.htmlEscape(category); var jsEscapedCat = goog.string.escapeString(escapedCat); catElem.innerHTML = '' + escapedCat + ''; The anchor element’s click-event handler, which is invoked by the browser when a user clicks on this UI element, is set up to call a JavaScript function with the value of category as an argument. Before interpolation into the HTML markup, the value of category is HTML-escaped using an escaping function from the JavaScript Closure Library. Furthermore, it is JavaScript-string-literal-escaped (replacing ' with \' and so forth) before interpolation into the string literal within the onclick handler’s JavaScript expression. As intended, for a value of Flowers & Plants for variable category, the resulting HTML markup is: Flowers & Plants So where’s the bug? 
Consider a value for category of: ');attackScript();// Passing this value through htmlEscape results in: ');attackScript();// because htmlEscape escapes the single quote into an HTML character reference. After this, JavaScript-string-literal escaping is a no-op, since the single quote at the beginning of the page is already HTML-escaped. As such, the resulting markup becomes: ');attackScript();// When evaluating this markup, a browser will first HTML-unescape the value of the onclick attribute before evaluation as a JavaScript expression. Hence, the JavaScript expression that is evaluated results in execution of the attacker’s script: createCategoryList('');attackScript();//') Thus, the underlying bug is quite subtle: the programmer invoked the appropriate escaping functions, but in the wrong order. A Subtle XSS BugSEPTEMBER 2014 | VOL. 57 | NO. 9 | COMMUNICATIONS OF THE ACM 41 practice Subtle security considerations. As seen, the requirements for secure handling of an untrustworthy value depend on the context in which the value is used. The most commonly encountered context is string interpolation within the content of HTML markup elements; here, simple HTML-escaping suffices to prevent XSS bugs. Several special contexts, however, apply to various DOM elements and within certain kinds of markup, where embedded strings are interpreted as URLs, Cascading Style Sheets (CSS) expressions, or JavaScript code. To avoid XSS bugs, each of these contexts requires specific validation or escaping, or a combination of the two.2,5 The accompanying sidebar, “A Subtle XSS Bug,” shows this can be quite tricky to get right. Complex, difficult-to-reason-about data flows. Recall that XSS arises from flows of untrustworthy, unvalidated/escaped data into injection-prone sinks. To assert the absence of XSS bugs in an application, a security reviewer must first find all such data sinks, and then inspect the surrounding code for context-appropriate validation and escaping of data transferred to the sink. When encountering an assignment that lacks validation and escaping, the reviewer must backward-trace this data flow until one of the following situations can be determined: ˲ The value is entirely under application control and hence cannot result in attacker-controlled injection. ˲ The value is validated, escaped, or otherwise safely constructed somewhere along the way. ˲ The value is in fact not correctly validated and escaped, and an XSS vulnerability is likely present. Let’s inspect the data flow into the innerHTML sink in code snippet (1) in Figure 1a. For illustration purposes, code snippets and data flows that require investigation are shown in red. Since no escaping is applied to sharedAlbum.title, we trace its origin to the albums entity (4) in persistent storage, via Web front-end code (2). This is, however, not the data’s ultimate origin—the album name was previously entered by a different user (that is, originated in a different time context). Since no escaping was applied to this value anywhere along its flow from an ultimately untrusted source, an XSS vulnerability arises. Similar considerations apply to the data flows in Figure 1b: no validation occurs immediately prior to the assignment to window.location.href in (5), so back-tracing is necessary. 
In code snippet (6), the code exploration branches: in the true branch, the value originates in a configuration entity in the data store (3) via the Web front end (8); this value can be assumed application-controlled and trustworthy and is safe to use without further validation. It is noteworthy that the persistent storage contains both trustworthy and untrustworthy data in different entities of the same schema—no blanket assumptions can be made about the provenance of stored data. In the else-branch, the URL originates from a parameter of the current URL, obtained from window.location.href, which is an attacker-controlled source (7). Since there is no validation, this code path results in an XSS vulnerability. Many opportunities for mistakes. Figures 1a and 1b show only two small slices of a hypothetical Web application. In reality, a large, nontrivial Web application will have hundreds if not thousands of branching and merging data flows into injection-prone sinks. Each such flow can potentially result in an XSS bug if a developer makes a mistake related to validation or escaping. Exploring all these data flows and asserting absence of XSS is a monumental task for a security reviewer, especially considering an ever-changing code base of a project under active development. Automated tools that employ heuristics to statically analyze data flows in a code base can help. In our experience at Google, however, they do not substantially increase confidence in review-based assessments, since they are necessarily incomplete in their reasoning and subject to both false positives and false negatives. Furthermore, they have similar difficulties as human reviewers with reasoning about whole-system data flows across multiple system components, using a variety of programming languages, RPC (remote procedure call) mechanisms, and so forth, and involving flows traversing multiple time contexts across data stores. The primary goal of this approach is to limit code that could potentially give rise to XSS vulnerabilities to a very small fraction of an application’s code base.42 COMMUNICATIONS OF THE ACM | SEPTEMBER 2014 | VOL. 57 | NO. 9 practice user-profile field). Unfortunately, there is an XSS bug: the markup in profile.aboutHtml ultimately originates in a rich-text editor implemented in browser-side code, but there is no server-side enforcement preventing an attacker from injecting malicious markup using a tampered-with client. This bug could arise in practice from a misunderstanding between front-end and back-end developers regarding responsibilities for data validation and sanitization. Reliably Preventing the Introduction of XSS Bugs In our experience in Google’s security team, code inspection and testing do not ensure, to a reasonably high degree of confidence, the absence of XSS bugs in large Web applications. Of course, both inspection and testing provide tremendous value and will typically find some bugs in an application (perhaps even most of the bugs), but it is difficult to be sure whether or not they discovered all the bugs (or even almost all of them). The primary goal of this approach is to limit code that could potentially give rise to XSS vulnerabilities to a very small fraction of an application’s code base. A key goal of this approach is to drastically reduce the fraction of code that could potentially give rise to XSS bugs. In particular, with this approach, an application is structured such that most of its code cannot be responsible for XSS bugs. 
The potential for vulnerabilities is therefore confined to infrastructure code such as Web application frameworks and HTML templating engines, as well as small, self-contained applicationspecific utility modules. A second, equally important goal is to provide a developer experience that does not add an unacceptable degree of friction as compared with existing developer workflows. Key components of this approach are: ˲ Inherently safe APIs. Injection-prone Web-platform and HTML-rendering APIs are encapsulated in wrapper APIs designed to be inherently safe against XSS in the sense that no use of such APIs can result in XSS vulnerabilities. ˲ Security type contracts. Special types are defined with contracts stipuSimilar limitations apply to dynamic testing approaches: it is difficult to ascertain whether test suites provide adequate coverage for whole-system data flows. Templates to the rescue? In practice, HTML markup, and interpolation points therein, are often specified using HTML templates. Template systems expose domain-specific languages for rendering HTML markup. An HTML markup template induces a function from template variables into strings of HTML markup. Figure 1c illustrates the use of an HTML markup template (9): this example renders a user profile in the photo-sharing application, including the user’s name, a hyperlink to a personal blog site, as well as free-form text allowing the user to express any special interests. Some template engines support automatic escaping, where escaping operations are automatically inserted around each interpolation point into the template. Most template engines’ auto-escape facilities are noncontextual and indiscriminately apply HTML escaping operations, but do not account for special HTML contexts such as URLs, CSS, and JavaScript. Contextually auto-escaping template engines6 infer the necessary validation and escaping operations required for the context of each template substitution, and therefore account for such special contexts. Use of contextually auto-escaping template systems dramatically reduces the potential for XSS vulnerabilities: in (9), the substitution of untrustworthy values profile.name and profile. blogUrl into the resulting markup cannot result in XSS—the template system automatically infers the required HTML-escaping and URL-validation. XSS bugs can still arise, however, in code that does not make use of templates, as in Figure 1a (1), or that involves non-HTML sinks, as in Figure 1b (5). Furthermore, developers occasionally need to exempt certain substitutions from automatic escaping: in Figure 1c (9), escaping of profile.aboutHtml is explicitly suppressed because that field is assumed to contain a user-supplied message with simple, safe HTML markup (to support use of fonts, colors, and hyperlinks in the “about myself” lating that their values are safe to use in specific contexts without further escaping and validation. ˲ Coding guidelines. Coding guidelines restrict direct use of injectionprone APIs, and ensure security review of certain security-sensitive APIs. Adherence to these guidelines can be enforced through simple static checks. Inherently safe APIs. Our goal is to provide inherently safe wrapper APIs for injection-prone browser-side Web platform API sinks, as well as for server- and client-side HTML markup rendering. For some APIs, this is straightforward. 
For example, the vulnerable assignment in Figure 1b (5) can be replaced with the use of an inherently safe wrapper API, provided by the JavaScript Closure Library, as shown in Figure 2b (5'). The wrapper API validates at runtime that the supplied URL represents either a scheme-less URL or one with a known benign scheme. Using the safe wrapper API ensures this code will not result in an XSS vulnerability, regardless of the provenance of the assigned URL. Crucially, none of the code in (5') nor its fan-in in (6-8) needs to be inspected for XSS bugs. This benefit comes at the very small cost of a runtime validation that is technically unnecessary if (and only if) the first branch is taken—the URL obtained from the configuration store is validated even though it is actually a trustworthy value. In some special scenarios, the runtime validation imposed by an inherently safe API may be too strict. Such cases are accommodated via variants of inherently safe APIs that accept types with a security contract appropriate for the desired use context. Based on their contract, such values are exempt from runtime validation. This approach is discussed in more detail in the next section.
Strictly contextually auto-escaping template engines. Designing an inherently safe API for HTML rendering is more challenging. The goal is to devise APIs that guarantee that at each substitution point of data into a particular context within trusted HTML markup, data is appropriately validated, sanitized, and/or escaped, unless it can be demonstrated that a specific data item is safe to use in that context based on its provenance or prior validation, sanitization, or escaping.
Figure 1. XSS vulnerabilities in a hypothetical Web application. (a) Vulnerable code of a hypothetical photo-sharing application. (b) Another slice of the photo-sharing application. (c) Using an HTML markup template.
For example, passing a string through an HTML sanitizer to remove any markup that may result in script execution renders it safe to use in HTML context and thus produces a value that satisfies the SafeHtml type contract. To actually create values of these types, unchecked conversion factory methods are provided that consume an arbitrary string and return an instance of a given wrapper type (for example, SafeHtml or SafeUrl) without applying any runtime sanitization or escaping. Every use of such unchecked conversions must be carefully security reviewed to ensure that in all possible program states, strings passed to the conversion satisfy the resulting type's contract, based on context-specific processing or construction. As such, unchecked conversions should be used as rarely as possible, and only in scenarios where their use is readily reasoned about for security-review purposes. For example, in Figure 2c, the unchecked conversion is encapsulated in a library (12'') along with the HTML sanitizer implementation on whose correctness its use depends, permitting security review and testing in isolation.
Coding guidelines. For this approach to be effective, it must ensure developers never write application code that directly calls potentially injection-prone sinks, and that they instead use the corresponding safe wrapper API.
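To make the combination of wrapper APIs, type contracts, and reviewed unchecked conversions concrete, here is a sketch of what a SafeUrl-like wrapper type could look like. The class, method names, and the "about:invalid" placeholder are illustrative assumptions, not Google's or the Closure Library's actual APIs, and IsSafeNavigationUrl refers to the scheme check sketched earlier.

#include <string>
#include <utility>

bool IsSafeNavigationUrl(const std::string& url);  // see the earlier sketch

// Values of this type are assumed to satisfy the SafeUrl contract: they will
// not cause script execution when dereferenced as hyperlink URLs. Application
// code cannot construct one directly; it must use the factories below.
class SafeUrl {
 public:
  // Validating factory: an unknown or dangerous scheme is replaced by a
  // harmless placeholder, so the result always satisfies the contract.
  static SafeUrl Sanitize(const std::string& url) {
    return SafeUrl(IsSafeNavigationUrl(url) ? url
                                            : std::string("about:invalid"));
  }

  // Unchecked conversion: no runtime validation is applied. Every call site
  // must be security reviewed to show that, in all program states, the
  // argument already satisfies the contract.
  static SafeUrl FromStringKnownToSatisfyTypeContract(std::string url) {
    return SafeUrl(std::move(url));
  }

  const std::string& value() const { return url_; }

 private:
  explicit SafeUrl(std::string url) : url_(std::move(url)) {}
  std::string url_;
};

// An inherently safe wrapper around an injection-prone navigation sink: it
// accepts only SafeUrl, so strings of unknown provenance cannot reach it.
void NavigateTo(const SafeUrl& url);

A static check, such as the error-prone and Closure Compiler checks described below, would then ban direct application calls to the underlying sink and flag every new call to the unchecked conversion for security review.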
Furthermore, it must ensure uses of unchecked conversions are designed with reviewability in mind, and are in fact security reviewed. Both constraints represent coding guidelines with which all of an application’s code base must comply. In our experience, automated enforcement of coding guidelines is necessary even in moderate-size projects—otherwise, violations are bound to creep in over time. At Google we use the open source error-prone static checker1 (https:// goo.gl/SQXCvw), which is integrated into Google’s Java tool chain, and a feature of Google’s open source Closure Compiler (https://goo.gl/UyMVzp) to whitelist uses of specific methods and properties in JavaScript. Errors arising from use of a “banned” API include references to documentation for the corresponding safe API, advising developers on how to address its provenance or prior validation, sanitization, or escaping. These inherently safe APIs are created by strengthening the concept of contextually auto-escaping template engines6 into SCAETEs (strictly contextually auto-escaping template engines). Essentially, a SCAETE places two additional constraints on template code: ˲ Directives that disable or modify the automatically inferred contextual escaping and validation are not permitted. ˲ A template may use only sub-templates that recursively adhere to the same constraint. Security type contracts. In the form just described, SCAETEs do not account for scenarios where template parameters are intended to be used without validation or escaping, such as aboutHtml in Figure 1c—the SCAETE unconditionally validates and escapes all template parameters, and disallows directives to disable the auto-escaping mechanism. Such use cases are accommodated through types whose contracts stipulate their values are safe to use in corresponding HTML contexts, such as “inner HTML,” hyperlink URLs, executable resource URLs, and so forth. Type contracts are informal: a value satisfies a given type contract if it is known that it has been validated, sanitized, escaped, or constructed in a way that guarantees its use in the type’s target context will not result in attackercontrolled script execution. Whether or not this is indeed the case is established by expert reasoning about code that creates values of such types, based on expert knowledge of the relevant behaviors of the Web platform.8 As will be seen, such security-sensitive code is encapsulated in a small number of special-purpose libraries; application code uses those libraries but is itself not relied upon to correctly create instances of such types and hence does not need to be security-reviewed. The following are examples of types and type contracts in use: ˲ SafeHtml. A value of type SafeHtml, converted to string, will not result in attacker-controlled script execution when used as HTML markup. ˲ SafeUrl. Values of this type will not result in attacker-controlled script execution when dereferenced as hyperlink URLs. ˲ TrustedResourceUrl. Values of this type are safe to use as the URL of an executable or “control” resource, such as the src attribute of a