Automatically Discovering Talented Musicians with Acoustic Analysis of YouTube Videos

Eric Nichols, Department of Computer Science, Indiana University, Bloomington, Indiana, USA. Email: epnichols@gmail.com
Charles DuHadway, Hrishikesh Aradhye, and Richard F. Lyon, Google, Inc., Mountain View, California, USA. Email: {duhadway,hrishi,dicklyon}@google.com

Abstract—Online video presents a great opportunity for up-and-coming singers and artists to be visible to a worldwide audience. However, the sheer quantity of video makes it difficult to discover promising musicians. We present a novel algorithm to automatically identify talented musicians using machine learning and acoustic analysis on a large set of "home singing" videos. We describe how candidate musician videos are identified and ranked by singing quality. To this end, we present new audio features specifically designed to directly capture singing quality. We evaluate these vis-a-vis a large set of generic audio features and demonstrate that the proposed features have good predictive performance. We also show that this algorithm performs well when videos are normalized for production quality.

Keywords—talent discovery; singing; intonation; music; melody; video; YouTube

I. INTRODUCTION AND PRIOR WORK

Video sharing sites such as YouTube provide people everywhere a platform to showcase their talents. Occasionally, this leads to incredible successes. Perhaps the best known example is Justin Bieber, who is believed to have been discovered on YouTube and whose videos have since received over 2 billion views. However, many talented performers are never discovered. Part of the problem is the sheer volume of videos: sixty hours of video are uploaded to YouTube every minute (nearly ten years of content every day) [23]. This builds a "rich get richer" bias where only those with a large established viewer base continue to get most of the new visitors. Moreover, even "singing at home" videos have a large variation not only in choice of song but also in sophistication of audio capture equipment and the extent of postproduction. An algorithm that can analyze all of YouTube's daily uploads to automatically identify talented amateur singers and musicians would go a long way towards removing these biases.

We present in this paper a system that uses acoustic analysis and machine learning to (a) detect "singing at home" videos, and (b) quantify the quality of the musical performances therein. To the best of our knowledge, no prior work exists for this specific problem, especially given an unconstrained dataset such as videos on YouTube.

Figure 1. "Singing at home" videos.

While performance quality will always have a large subjective component, one relatively objective measure of quality is intonation—that is, how in-tune is a musical performance? In the case of unaccompanied audio, the method in [14] uses features derived both from intonation and vibrato analysis to automatically evaluate singing quality. These sorts of features have also been investigated by music educators attempting to quantify intonation quality given certain constraints. The InTune system [1], for example, processes an instrumentalist's recording to generate a graph of deviations from desired pitches, based on alignment with a known score followed by analysis of the strongest FFT bin near each expected pitch.
Other systems for intonation visualization are reviewed in [1]; these differ in whether or not the score is required and in the types of instruments recognized. The practical value of such systems on large-scale data such as YouTube is limited because (a) the original recording and/or score may not be known, and (b) most published approaches for intonation estimation assume a fixed reference pitch such as A = 440 Hz. Previous work in estimating the reference pitch has generally been based on FFT or filterbank analysis [8], [9], [10]. To ensure scalability to a corpus of millions of videos, we propose a computationally efficient means for estimating both the reference pitch and overall intonation. We then use it to construct an intonation-based feature for musical performance quality.

Another related subproblem relevant to performance quality is the analysis of melody in audio. There are many approaches to automatically extracting the melody line from a polyphonic audio signal (see the review in [15]), ranging from simple autocorrelation methods [3], [5] to FFT analysis and more complex systems [16], [18], [22]. Melody extraction has been a featured task in the MIREX competition in recent years; the best result so far for singing is the 78% accuracy obtained by [16] on a standard test set with synthetic (as opposed to natural) accompaniment. This system combined FFT analysis with heuristics that favor extracted melodies with typically musical contours. We present a new melody-based feature for musical performance quality.

In addition to these new features, the proposed approach uses a large set of previously published acoustic features including MFCC, SAI [12], intervalgram [21], volume, and spectrogram. When identifying candidate videos we also use video features including HOG [17], CONGAS [19], and Hue-Saturation color histograms [11].

II. APPROACH

A. Identifying Candidate Videos

We first identify "singing at home" videos. These videos are correlated with features such as ambient indoor lighting, a head-and-shoulders view of a person singing in front of a fixed camera, few instruments, and a single dominant voice. A full description of this stage is beyond this paper's scope. We use the approach in [2] to train a classifier to identify these videos. In brief, we collected a large set of videos that were organically included in YouTube playlists related to amateur performances. We then used this as weakly labeled ground truth against a large set of randomly picked negative samples to train a "singing at home" classifier. We use a combination of audio and visual features including HOG, CONGAS [19], MFCC, SAI [12], intervalgram [21], volume, and spectrograms. Our subsequent analyses for feature extraction and singing quality estimation are based on the high-precision range of this classifier. Figure 1 shows a sample of videos identified by this approach.

B. Feature Extraction

We developed two sets of features, each composed of 10 floating-point numbers: an intonation feature set, intonation, and a melody line feature set, melody.

1) Intonation-based Features:

Intonation Histogram: Considering that for an arbitrary YouTube video we know neither the tuning reference nor the desired pitches, we implemented a two-step algorithm to estimate the in-tuneness of an audio recording. The first step computes a tuning reference (see Figure 2).
To this end, we first detect STFT amplitude peaks in the audio (monophonic, 22.05 kHz; frame size 4096 samples = 186 ms; bin size 5.38 Hz).

Figure 2. Pitch histogram for an in-tune recording.

From these peaks we construct an amplitude-weighted histogram, and set the tuning reference to the maximum bin. The second step makes a histogram of distances from the nearest chromatic pitches using the previously computed tuning reference. Note that this computation is very simple and efficient compared with filterbank approaches as in [14], and because it allows for multiple peaks, it works with polyphonic audio recordings. In this process we first use the tuning reference to induce a grid of "correct" pitch frequencies based on an equal-tempered chromatic scale. Subsequently, we make an amplitude-weighted histogram of differences from the correct frequencies. Histogram heights are normalized to sum to 1. We used 7 bins to cover each 100-cent range (1 semitone), which worked out nicely because the middle bin collected pitches within ±7.1 cents of the correct pitch; the range ±7 cents was found to sound "in-tune" in experiments [7]. When possible, we match audio to known reference tracks using the method in [21] and use this matching to identify and remove frames that are primarily non-pitched, such as talking or rapping, when computing the tuning reference.

Feature Representation: We can now generate a summary vector consisting of the 7 heights of the histogram itself followed by three low-order weighted moments about zero. These statistics (standard deviation, skew, and kurtosis) describe the data's deviation from the reference tuning grid. See Table I. This set of 10 values, which we refer to collectively as intonation, gives a summary of the intonation of a recording by describing how consistent the peaks of each frame are with the tuning reference derived from the set of all these peaks.

Figure 3. Distance to tuning reference: (a) in-tune (good) audio; (b) out-of-tune (bad) audio.

Table I. Intonation feature vectors for Figures 3(a) and 3(b).
              bar1  bar2  bar3  bar4  bar5  bar6  bar7  stddev  skew  kurtosis
In-tune       .00   .05   .14   .70   .08   .03   .00   .006    .15   2.34
Out-of-tune   .14   .20   .13   .30   .10   .07   .06   .015   -.42   -.87

Figure 3(b) shows the histogram for an out-of-tune recording. For the high-quality recording in Figure 3(a), the central bar of the histogram is relatively high, indicating that most peaks were at in-tune frequencies. The histogram is also relatively symmetrical and has lower values for more out-of-tune frequencies. The high kurtosis and low skew and standard deviation of the data reflect this. The low-quality recording, on the other hand, does have a central peak, but it is much shorter relative to the other bars, and in general its distribution's moments do not correspond well with a normal distribution. Note that while we expect a symmetrical, peaked distribution in this histogram to be an indicator of "good singing", we do not build this expectation into our prediction system explicitly; rather, these histogram features are provided as input to a machine learning algorithm. Good performances across different genres of music might result in differing histogram shapes; the system should learn which shapes to expect based on the training data.
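The two-step computation lends itself to a compact implementation. The sketch below is illustrative only: it assumes a mono 22.05 kHz signal in a NumPy array, substitutes a crude local-maximum peak picker for whatever peak detection the system actually uses, and measures deviations in cents relative to A = 440 Hz; the bin count and weighted moments follow the description above, but the function name and thresholds are our own.

```python
import numpy as np

def intonation_features(samples, sr=22050, frame=4096, n_bins=7):
    """Sketch of the 10-dimensional 'intonation' feature.

    Step 1: amplitude-weighted histogram of STFT peak pitches (in cents
    within a semitone); its maximum bin gives the tuning reference.
    Step 2: histogram of deviations from the nearest pitch on the
    equal-tempered grid implied by that reference, plus three weighted
    moments about zero of those deviations.
    """
    peaks = []  # (frequency, amplitude) pairs of spectral peaks
    for start in range(0, len(samples) - frame, frame):
        spectrum = np.abs(np.fft.rfft(samples[start:start + frame]))
        freqs = np.fft.rfftfreq(frame, d=1.0 / sr)
        for k in range(2, len(spectrum) - 1):
            # Crude local-maximum peak picking above a relative threshold.
            if (spectrum[k] > spectrum[k - 1] and spectrum[k] > spectrum[k + 1]
                    and spectrum[k] > 0.1 * spectrum.max() and freqs[k] > 50.0):
                peaks.append((freqs[k], spectrum[k]))
    freqs = np.array([f for f, _ in peaks])
    amps = np.array([a for _, a in peaks])

    # Step 1: tuning reference = mode of the amplitude-weighted pitch histogram,
    # expressed as an offset (in cents) within a semitone, relative to A = 440 Hz.
    cents = 1200.0 * np.log2(freqs / 440.0)
    offsets = cents % 100.0
    hist, edges = np.histogram(offsets, bins=100, range=(0, 100), weights=amps)
    reference = edges[np.argmax(hist)]

    # Step 2: deviation from the nearest "correct" pitch on the induced grid.
    dev = (cents - reference + 50.0) % 100.0 - 50.0          # in (-50, +50] cents
    bars, _ = np.histogram(dev, bins=n_bins, range=(-50, 50), weights=amps)
    bars = bars / bars.sum()                                  # heights sum to 1

    # Amplitude-weighted moments about zero (std, skew, excess kurtosis).
    w = amps / amps.sum()
    m2 = np.sum(w * dev ** 2)
    std = np.sqrt(m2)
    skew = np.sum(w * dev ** 3) / m2 ** 1.5
    kurt = np.sum(w * dev ** 4) / m2 ** 2 - 3.0
    return np.concatenate([bars, [std, skew, kurt]])
```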
For example, consider the case of music where extensive pitch correction has been applied by a system such as AutoTune. We processed several such tracks with our analysis, resulting in histograms with a very tall central bar and very short other bars; almost all notes fell within 7 cents of the computed reference grid. If listeners rated these pitch-corrected recordings highly, this histogram shape might lead to predictions of high quality by our system; if listeners disliked this sound, it might have the inverse effect.

Similarly, consider vocal vibrato. If the extent of vibrato (the amplitude of the frequency modulation) is much more than 50 cents in each direction from the mean frequency of a note, then this approach will result in a flatter histogram, which might obscure the intonation quality we are trying to capture. Operatic singing often has vibrato with an extent of a whole semitone, giving a very flat distribution; early music performance, on the other hand, is characterized by very little vibrato. Popular music comprises the bulk of the music studied here. Although we did not analyze the average vibrato extent in this collection, an informal look at histograms produced with this approach suggests that performances that sound in-tune in our data tend to have histograms with a central peak. For musical styles with large vibrato extent, such as opera, we would need to refine our technique to explicitly model the vibrato in order to recover the mean fundamental frequency of each note, as in [14]. For styles with a moderate amount of vibrato, frequency energy is placed symmetrically about the central histogram bar, and in-tune singing yields the expected peaked distribution. (For example, if a perfectly sinusoidal vibrato ranges from 50 cents above to 50 cents below the mean frequency, then approximately 65% of each note's duration is spent within the middle three bars of the histogram; reducing the vibrato extent to 20 cents above and below causes all frequencies of an in-tune note to fall within the middle three bars.)

2) Melody-based Features:

Melody Line: As we are interested in the quality of the vocal line in particular, a primary goal in analyzing singing quality is to isolate the vocal signal. One method for doing so is to extract the melody line and to assume that, most of the time, the primary melody is the singing part we are interested in. This is a reasonable assumption for many of the videos we encounter where people have recorded themselves singing, especially when someone is singing over a background karaoke track. Our problem would be easier if we had access to a symbolic score (e.g., the sheet music) for the piece being sung, as in [1]. However, we have no information available other than the recording itself. Thus we use two ideas to extract a good candidate for a melody line: the Stabilized Auditory Image (SAI) [20] and the Viterbi algorithm.

Algorithm: We compute the SAI for each frame of audio, where we have set the frame rate to 50 frames per second. At 22,050 Hz, this results in a frame size of 441 samples. The SAI is a matrix with lag times on one axis and frequency on the other; we convert the lag dimension into a pitch-class representation for each frame using the method employed in [21], but without wrapping pitch to chroma. This is a vector giving strengths in each frequency bin. Our frequency bins span 8 octaves, and we tried various numbers of bins per octave, such as 12, 24, and 36; in our experiments, 12 bins per octave gave the best results. The resulting 96-element vector of bin strengths for each frame looks much like a spectrogram, although unlike a spectrogram, we cannot recover the original audio signal with an inverse transform.
However, the bins with high strengths should correspond to perceptually salient frequencies, and we assume that for most frames the singer's voice is one of the most salient frequencies. We next extract a melody using a best-path approach. We represent the successive SAI summary vectors as layers in a trellis graph, where nodes correspond to frequency bins for each frame and each adjacent pair of layers is fully connected. We then use the Viterbi algorithm to find the best path using the following transition score function:

S_t[i, j] = SAI_t[j] − α (p_m + p_l + |i − j|) / T,   (1)

where p_m = 1 if i ≠ j and 0 otherwise, p_l = 1 if the transition spans an octave or more and 0 otherwise, and T is the frame length in seconds. We used α = 0.15 in our experiments.
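A dynamic-programming sketch of this best-path extraction is given below. It assumes the per-frame bin-strength vectors are already computed as an array of shape (n_frames, 96), applies the transition score of (1) with α = 0.15 and a 12-bin octave, and is an illustration rather than the paper's implementation; the function name and array layout are our own.

```python
import numpy as np

def extract_melody(sai_frames, alpha=0.15, frame_len=1.0 / 50, octave_bins=12):
    """Best-path melody extraction over per-frame SAI bin-strength vectors.

    sai_frames: array of shape (n_frames, n_bins).
    Returns the chosen frequency-bin index for each frame (the melody path).
    """
    n_frames, n_bins = sai_frames.shape
    bins = np.arange(n_bins)
    dist = np.abs(bins[:, None] - bins[None, :])        # |i - j| between bins
    p_m = (dist != 0).astype(float)                      # note-change indicator
    p_l = (dist >= octave_bins).astype(float)            # leap of an octave or more
    penalty = alpha * (p_m + p_l + dist) / frame_len     # transition penalty of (1)

    # Viterbi recursion: score[j] = best cumulative score of a path ending in bin j.
    score = sai_frames[0].copy()
    backptr = np.zeros((n_frames, n_bins), dtype=int)
    for t in range(1, n_frames):
        cand = score[:, None] - penalty + sai_frames[t][None, :]
        backptr[t] = np.argmax(cand, axis=0)             # best previous bin for each j
        score = cand[backptr[t], bins]

    # Trace the best path backwards from the best final bin.
    path = np.zeros(n_frames, dtype=int)
    path[-1] = int(np.argmax(score))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```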
Figure 4(a) shows the SAI summary frames and the best path computed for a professional singer; Figure 4(b) shows the best path for the recording of a badly rated amateur singer.

Figure 4. Best-path melody extraction: (a) a better-quality recording; (b) a lower-quality recording. The best path is shown as a blue line superimposed on the plot; higher-amplitude frequency bins are shown in red. Upper and lower frequency bins were cropped for clarity.

We observed that the paths look qualitatively different in the two cases, although the difference is hard to describe precisely. In the professional singer's case, the path looks smoother and is characterized by longer horizontal bars (corresponding to single sustained notes) and fewer large vertical jumps. Note that this is just an example suggestive of some potentially useful features to be extracted below; the training set and learning algorithm will make use of these features only if they turn out to be useful in prediction.

Feature Representation: Remembering that our aim was to study not the quality of the underlying melody of the song but the quality of the performance, we realized we could use the shape of the extracted melody as an indicator of the strength and quality of singing. This idea may seem counterintuitive, but we study characteristics of the extracted melody—rather than the correlation between the performance and a desired melody—simply because we do not have access to the sheet music and "correct" notes of the melody. Obviously, this depends a great deal on the quality of the melody-extraction algorithm, but because we train a classifier on the extraction results, we expect that even with an imperfect extraction algorithm, useful trends should emerge that can help distinguish between low- and high-quality performances. Differences between songs also obviously affect the global melody contour, but we maintain that for any given song a better singer should produce a melody line that is more easily extracted and which locally conforms better to expected shapes.

To study the shape and quality of the extracted melody, we first define a "note" to be a contiguous horizontal segment of the note path, so that each note has a single frequency bin. We then compute 10 statistics at the note level to form the melody feature vector:

1) Mean and standard deviation of note length (µ_len, σ_len)
2) Difference between the standard deviation and the mean of note length
3) Mean and standard deviation of note frequency bin number (µ_bin, σ_bin)
4) Mean and standard deviation of note strength (sum of bin strengths divided by note length) (µ_str, σ_str)
5) Mean and standard deviation of the vertical leap distance between adjacent notes, in bins (µ_leap, σ_leap)
6) Total Viterbi best-path score divided by the total number of frames

The intuition behind this choice of statistics follows. In comparing Figures 4(a) and 4(b), we see that the path is more fragmented for the lower-quality performance: there are more, shorter notes than there should be. Note length is therefore an obvious statistic to compute. If we assume that note length is governed by a Poisson process, we would expect an exponential distribution of note lengths, and the mean and standard deviation would be about the same. However, we conjecture that a Poisson process is not the best model for the lengths of notes in musical compositions. If the best path chosen by the Viterbi algorithm is more in line with the correct melody, we would expect a non-exponential distribution. Thus, the difference between the standard deviation and mean of note length is computed as a useful signal about the distribution type. Note strength is also computed because we suspect that notes with larger amplitude values are more likely to correspond to instances of strong, clear singing. Note frequency bins are analyzed because vocal performances usually lie in a certain frequency range; deviations from that range would be a signal that something went wrong in the melody detection process and hence that the performance might not be so good. Leap distance between adjacent notes is a useful statistic because musical melody paths follow certain patterns, and problems in the path could show up as leaps that are not distributed as expected. Finally, the average path score per frame from the Viterbi algorithm is recorded, although it may prove to be a useless statistic, since it is notoriously hard to interpret path scores from different data files; more analysis is necessary to determine which of these features are most useful. Table II gives examples of these statistics for the paths in Figures 4(a) and 4(b), as well as for one other medium-quality melody.

Table II. Melody feature vectors for Figures 4(a) and 4(b), plus one medium-quality melody.
          µ_len   σ_len   σ_len − µ_len   µ_bin   σ_bin   µ_str   σ_str   µ_leap   σ_leap   path score
good      71.77   83.71   11.93           37.12   4.00    0.094   0.028   3.44     2.52     32.14
medium    43.64   41.61   -2.02           38.71   2.87    0.105   0.012   3.46     2.18     30.49
bad       45.46   46.08   0.62            38.16   3.84    0.101   0.032   3.84     2.60     32.64
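Given an extracted path, the ten note-level statistics can be assembled as in the sketch below, which builds on the hypothetical extract_melody() above; the exact definition of note strength and the per-frame path-score normalization are assumptions rather than the paper's implementation.

```python
import numpy as np

def melody_features(path, sai_frames, path_score):
    """Sketch of the 10-dimensional 'melody' feature from an extracted path.

    path: per-frame frequency-bin indices (e.g. from extract_melody()).
    sai_frames: the same (n_frames, n_bins) bin-strength array.
    path_score: total Viterbi best-path score.
    A "note" is a maximal run of frames that stay in the same bin.
    """
    notes = []          # (bin, length, mean strength per frame) for each note
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            strengths = sai_frames[np.arange(start, t), path[start]]
            notes.append((path[start], t - start, strengths.sum() / (t - start)))
            start = t
    bins = np.array([n[0] for n in notes], dtype=float)
    lengths = np.array([n[1] for n in notes], dtype=float)
    strengths = np.array([n[2] for n in notes], dtype=float)
    leaps = np.abs(np.diff(bins)) if len(bins) > 1 else np.zeros(1)

    return np.array([
        lengths.mean(), lengths.std(),         # note length
        lengths.std() - lengths.mean(),        # deviation from exponential-like behavior
        bins.mean(), bins.std(),               # note frequency bin
        strengths.mean(), strengths.std(),     # note strength
        leaps.mean(), leaps.std(),             # vertical leap between adjacent notes
        path_score / len(path),                # average path score per frame
    ])
```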
C. Performance Quality Estimation

Given a pool of candidate videos, our next step is to estimate the performance quality of each video. For sets on the order of a hundred videos, human ratings could be used directly for ranking. However, to consider thousands or more videos we require an automated solution. We train kernelized passive-aggressive (PA) [6] rankers to estimate the quality of each candidate video. We tried several kernels, including linear, intersection, and polynomial, and found that the intersection kernel worked the best overall. Unless noted otherwise, we used this kernel in all our experiments. The training data for these rankers is given as pairs of video feature sets where one video has been observed to be higher quality than the other. Given a new video, the ranker generates a single quality score estimate.

III. EXPERIMENTAL RESULTS

A. "Singing at Home" Video Dataset

We have described two feature sets, each a vector of 10 floating-point numbers, for describing properties of a performance. To test their utility, the features are used to predict human ratings on a set of pairs of music videos. This corpus is composed of over 5,000 pairs of videos, where for each pair human judges have selected which video of the pair is better. Carterette et al. [4] showed that preference judgements of this type can be more effective than absolute judgements. Each pair is evaluated by at least 3 different judges. In this experiment, we only consider the subset of video pairs where the winner was selected unanimously. Our training dataset is made of this subset, which comprises 1,573 unique videos.

B. Singing Quality Ranker Training

For each video, we computed the intonation and melody feature vectors described above, as well as a large feature vector, large, which is composed of other audio analysis features including MFCC, SAI [12], intervalgram [21], volume, and spectrograms. These features are used to train a ranker which outputs a floating-point score for each input example. To test the ranker, we simply generate the ranking score for each example in each pair, choose the higher-scoring example as the winner, and compare this winner to the one chosen by unanimous human consent. Thus, although we use a floating-point ranker as an intermediate step, the final ranker output is a simple binary choice, and baseline performance is 50%.
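The paper's rankers are kernelized passive-aggressive models with an intersection kernel; as a rough illustration of the pairwise training and winner-prediction evaluation described above, here is a linear passive-aggressive-style sketch. The function names, the margin of 1, and the PA-I step rule are our own choices, not the system's actual implementation.

```python
import numpy as np

def train_pairwise_ranker(pairs, dim, epochs=5, C=1.0):
    """Linear passive-aggressive pairwise ranker (sketch).

    pairs: list of (better_features, worse_features) tuples, each of length dim
    (e.g. the 10-dim intonation or melody vectors).
    Learns w so that the better video's score exceeds the worse one's by a margin.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for better, worse in pairs:
            diff = np.asarray(better) - np.asarray(worse)
            loss = max(0.0, 1.0 - w.dot(diff))                 # hinge on the score gap
            if loss > 0.0:
                tau = min(C, loss / (diff.dot(diff) + 1e-12))  # PA-I step size
                w += tau * diff
    return w

def pair_accuracy(w, pairs):
    """Fraction of pairs whose better video receives the higher score."""
    wins = sum(w.dot(np.asarray(b)) > w.dot(np.asarray(c)) for b, c in pairs)
    return wins / len(pairs)
```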
C. Prediction Results

Training consisted of 10-fold cross-validation. The percentages given below are the mean accuracies over the 10 cross-validation folds, where accuracy is computed as the number of correct predictions of the winner in a pair divided by the total number of pairs. Overall, large yields the best accuracy, 67.5%; melody follows with 61.2%, and intonation achieves just 51.9%. The results for our two new feature vectors, as well as for large, are given in Table III.

Table III. Prediction accuracy by feature set.
Feature                  Accuracy (%)   # dimensions   Accuracy gain / # dimensions   Rank
intonation               51.9           10             0.1900                         2
melody                   61.2           10             1.1200                         1
large                    67.5           14,352         0.0012                         9
all                      67.8           14,372         0.0012                         8
large-MFCC               61.4           2,000          0.0057                         5
large-SAI-boxes          66.7           7,168          0.0023                         6
large-SAI-intervalgram   58.6           4,096          0.0021                         7
large-spectrum           62.7           1,024          0.0124                         4
large-volume             59.9           64             0.1547                         3

Because large has so many dimensions, it is unsurprising that it performs better than our 10-dimensional features. To better understand the utility of each feature, we broke large down into subsets, also listed in Table III, calculated the percentage gain above baseline for each feature subset, computed the average gain per feature dimension, and ranked the features accordingly. The intonation and melody features offer the most accuracy per dimension. Our metric of gain per dimension is important because we are concerned with computational resources when analyzing large collections of videos. For the subsets of the large vector that required thousands of dimensions, it was interesting to see how useful each subset was compared with the amount of computation being done (assuming that the number of dimensions is a rough correlate of computation time). For example, it seems clear that melody is more useful than large-SAI-intervalgram, as it has better accuracy with fewer dimensions; melody is also probably more useful than large-MFCC when computation time is limited, as the two have similar accuracy but very different accuracy gain per dimension.

D. Effect of Production Quality

We did one further experiment to determine whether the above rankers were simply learning to distinguish videos with better production quality. To test this possibility we trained another ranker on pairs of videos with similar production quality. This dataset contained 999 pairs with ground truth established through the majority voting of 5 human operators. As before, we trained and tested rankers using 10-fold cross-validation. The average accuracy of the resulting rankers, using the large feature set, was 61.8%. This suggests that the rankers are indeed capturing more than simple production quality.

IV. DISCUSSION

The results in Table III show that the melody feature set performed quite well, with the best accuracy gain per dimension and also a good raw accuracy. The intonation feature set achieved second place according to the accuracy-gain metric, but its raw accuracy was not much better than baseline. However, kernel choice may have had a large impact: the large feature set performs better with the intersection kernel, while intonation alone does better (54.1%) with a polynomial kernel. Integrating the different types of features using multi-kernel methods might help.

Note that while we developed these features for vocal analysis, they could be applied to other music sources—the feature sets analyze the strongest or most perceptually salient frequency components of a signal, which might be any instrument in a recording. In our case, where we have "singing at home" videos, these analyzed components are often the sung melody that we are interested in, but even if not, the intonation and melody shape of other components of the recording are still likely indicators of overall video quality.

The output of our system is a set of high-quality video performances, but this system is not (yet) capable of identifying the very small set of performers with extraordinary talent and potential. This is not surprising given that pitch and consistently strong singing are only two of many factors that determine a musician's popularity. Our system has two properties that make it well suited for use as a filtering step for a competition driven by human ratings. First, it can evaluate very large sets of candidate videos which would overwhelm a crowd-based ranking system with limited users. Second, it can eliminate obviously low-quality videos which would otherwise reduce the entertainment in such a competition.

V. FUTURE WORK

Our ongoing work includes several improvements to these features. For instance, we have used the simple bin index in the FFT to estimate frequencies. Although it would increase computation time, we could use the instantaneous phase (with the derivative approximated by a one-frame difference) to more precisely estimate the frequency of a component present in a particular bin [13]. With this modification, step 1 of our algorithm would no longer use a histogram; instead, we would compute the tuning reference that minimizes the total error in step 2. Our present implementation avoided this fine-tuning by using a quite large frame size (at the expense of time resolution), so that our maximum error (half the bin size) is 2.7 Hz, or approximately 10 cents for a pitch near 440 Hz. The proposed intonation feature extraction algorithm can also easily be modified to run on small segments (e.g., 10 seconds) of audio at a time instead of over the whole song. This has the advantage of allowing the algorithm to discard extremely out-of-tune frames which are probably due to speech or other non-pitched events. Finally, we are also working on substantially improving the process of vocal line extraction from a polyphonic signal.
Once this is achieved, there are many details which could augment our current feature sets to provide a deeper analysis of singing quality; such features may include vibrato analysis of the melody line, strength of the vocal signal, dynamics (expression), and duration/strength of long notes.

REFERENCES

[1] K. Ae and C. Raphael: "InTune: A System to Support an Instrumentalist's Visualization of Intonation", Computer Music Journal, Vol. 34, No. 3, Fall 2010.
[2] H. Aradhye, G. Toderici, and J. Yagnik: "Video2Text: Learning to Annotate Video Content", ICDM Workshop on Internet Multimedia Mining, 2009.
[3] P. Boersma: "Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound", Proceedings of the Institute of Phonetic Sciences, University of Amsterdam, pp. 97–110, 1993.
[4] B. Carterette, P. Bennett, D. Chickering, and S. Dumais: "Here or There: Preference Judgments for Relevance", Advances in Information Retrieval, Vol. 4956/2008, pp. 16–27.
[5] A. de Cheveigné: "YIN, a fundamental frequency estimator for speech and music", J. Acoust. Soc. Am., Vol. 111, No. 4, April 2002.
[6] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer: "Online passive-aggressive algorithms", Journal of Machine Learning Research (JMLR), Vol. 7, 2006.
[7] D. Deutsch: The Psychology of Music, p. 205, 1982.
[8] S. Dixon: "A Dynamic Modelling Approach to Music Recognition", ICMC, 1996.
[9] E. Gómez: "Comparative Analysis of Music Recordings from Western and Non-Western Traditions by Automatic Tonal Feature Extraction", Empirical Musicology Review, Vol. 3, No. 3, pp. 140–156, March 2008.
[10] A. Lerch: "On the requirement of automatic tuning frequency estimation", ISMIR, 2006.
[11] T. Leung and J. Malik: "Representing and recognizing the visual appearance of materials using three-dimensional textons", IJCV, 2001.
[12] R. Lyon, M. Rehn, S. Bengio, T. Walters, and G. Chechik: "Sound Retrieval and Ranking Using Sparse Auditory Representations", Neural Computation, Vol. 22, pp. 2390–2416, 2010.
[13] D. McMahon and R. Barrett: "Generalization of the method for the estimation of the frequencies of tones in noise from the phases of discrete Fourier transforms", Signal Processing, Vol. 12, No. 4, pp. 371–383, 1987.
[14] T. Nakano, M. Goto, and Y. Hiraga: "An Automatic Singing Skill Evaluation Method for Unknown Melodies Using Pitch Interval Accuracy and Vibrato Features", ICSLP, pp. 1706–1709, 2006.
[15] G. Poliner: "Melody Transcription From Music Audio: Approaches and Evaluation", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 4, May 2007.
[16] J. Salamon and E. Gómez: "Melody Extraction from Polyphonic Music: MIREX 2011", Music Information Retrieval Evaluation eXchange (MIREX), extended abstract, 2011.
[17] J. Shotton, M. Johnson, and R. Cipolla: "Semantic texton forests for image categorization and segmentation", CVPR, 2008.
[18] L. Tan and A. Alwan: "Noise-robust F0 estimation using SNR-weighted summary correlograms from multi-band comb filters", ICASSP, pp. 4464–4467, 2011.
[19] E. Tola, V. Lepetit, and P. Fua: "A fast local descriptor for dense matching", CVPR, 2008.
[20] T. Walters: Auditory-Based Processing of Communication Sounds, Ph.D. thesis, University of Cambridge, 2011.
[21] T. Walters, D. Ross, and R. Lyon: "The Intervalgram: An Audio Feature for Large-scale Melody Recognition", accepted for CMMR, 2012.
[22] L. Yi and D. Wang: "Detecting pitch of singing voice in polyphonic audio", ICASSP, 2005.
[23] YouTube: "Statistics", http://www.youtube.com/t/press_statistics, April 11, 2012.

Estimation, Optimization, and Parallelism when Data is Sparse or Highly Varying

John C. Duchi, Michael I. Jordan, H. Brendan McMahan
November 10, 2013

Abstract

We study stochastic optimization problems when the data is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties—in terms of the increased sample complexity that sparse data necessitates—and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on several medium- to large-scale learning tasks.

1 Introduction and problem setting

In this paper, we investigate stochastic optimization problems in which the data is sparse. Formally, let {F(·; ξ), ξ ∈ Ξ} be a collection of real-valued convex functions, each of whose domains contains the convex set X ⊂ R^d. For a probability distribution P on Ξ, we consider the following optimization problem:

minimize_{x ∈ X} f(x) := E[F(x; ξ)] = ∫_Ξ F(x; ξ) dP(ξ).   (1)

By data sparsity, we mean that the sampled data ξ is sparse: samples ξ are assumed to lie in R^d, and if we define the support supp(x) of a vector x to be the set of indices of its non-zero components, we assume that

supp ∇F(x; ξ) ⊂ supp ξ.   (2)

The sparsity condition (2) means that F(x; ξ) does not "depend" on the values of x_j for indices j such that ξ_j = 0. (Formally, if we define π_ξ as the coordinate projection that zeros all indices j of its argument where ξ_j = 0, then F(π_ξ(x); ξ) = F(x; ξ) for all x, ξ; this is implied by first-order conditions for convexity [6, Chapter VI.2].) This type of data sparsity is prevalent in statistical optimization problems and machine learning applications, though in spite of its prevalence, study of such problems has been somewhat limited.

As a motivating example, consider a text classification problem: the data ξ ∈ R^d represents words appearing in a document, and we wish to minimize a logistic loss F(x; ξ) = log(1 + exp(⟨ξ, x⟩)) on the data (we encode the label implicitly with the sign of ξ). Such generalized linear models satisfy the sparsity condition (2), and while instances are of very high dimension, in any given instance very few entries of ξ are non-zero [8]. From a modelling perspective, it thus makes sense to allow a dense predictor x: any non-zero entry of ξ is potentially relevant and important. In a sense, this is dual to the standard approaches to high-dimensional problems; one usually assumes that the data ξ may be dense, but there are only a few relevant features, and thus a parsimonious model x is desirable [2]. So while such sparse data problems are prevalent—natural language processing, information retrieval, and other large data settings all have significant data sparsity—they do not appear to have attracted as much study as their high-dimensional "duals" of dense data and sparse predictors. In this paper, we investigate algorithms and their inherent limitations for solving problem (1) under natural conditions on the data-generating distribution.
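To make the condition concrete for this example, the short sketch below (our own illustration; every name in it is ours) computes the gradient of the logistic loss at one sparse sample and checks that its support is contained in supp(ξ).

```python
import numpy as np

def logistic_loss_grad(x, xi):
    """Gradient of F(x; xi) = log(1 + exp(<xi, x>)) for one sample xi.

    Every coordinate of the gradient is a scalar multiple of the corresponding
    coordinate of xi, so supp(grad F(x; xi)) is contained in supp(xi),
    which is exactly the sparsity condition (2).
    """
    s = 1.0 / (1.0 + np.exp(-xi.dot(x)))   # sigmoid(<xi, x>)
    return s * xi                           # zero wherever xi is zero

# Toy check with a sparse sample (the label is encoded in the sign of xi).
d = 10
x = np.random.randn(d)
xi = np.zeros(d)
xi[[2, 7]] = [1.0, -1.0]                    # only two active features
g = logistic_loss_grad(x, xi)
assert set(np.flatnonzero(g)) <= set(np.flatnonzero(xi))
```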
Recent work in the optimization and machine learning communities has shown that data sparsity can be leveraged to develop parallel optimization algorithms [12, 13, 14], but these authors do not study the statistical effects of data sparsity. In recent work, Duchi et al. [4] and McMahan and Streeter [9] develop "adaptive" stochastic gradient algorithms designed to address problems in sparse data regimes (2). These algorithms exhibit excellent practical performance and have theoretical guarantees on their convergence, but it is not clear whether they are optimal—in the sense that no algorithm can attain better statistical performance—or whether they can leverage parallel computing as in the papers [12, 14].

In this paper, we take a two-pronged approach. First, we investigate the fundamental limits of optimization and learning algorithms in sparse data regimes. In doing so, we derive lower bounds on the optimization error of any algorithm for problems of the form (1) with sparsity condition (2). These results have two main implications. They show that in some scenarios, learning with sparse data is quite difficult, as essentially each coordinate j ∈ [d] can be relevant and must be optimized for. In spite of this seemingly negative result, we are also able to show that the AdaGrad algorithms of [4, 9] are optimal, and we show examples in which their dependence on the dimension d can be made exponentially better than standard gradient methods. As the second facet of our two-pronged approach, we study how sparsity may be leveraged in parallel computing frameworks to give substantially faster algorithms that still achieve optimal sample complexity in terms of the number of samples ξ used. We develop two new algorithms, asynchronous dual averaging (AsyncDA) and asynchronous AdaGrad (AsyncAdaGrad), which allow asynchronous parallel solution of problem (1) for general convex f and X. Combining insights of Niu et al.'s Hogwild! [12] with a new analysis, we prove that our algorithms can achieve linear speedup in the number of processors while maintaining optimal statistical guarantees. We also give experiments on text-classification and web-advertising tasks to illustrate the benefits of the new algorithms.

Notation. For a convex function x ↦ f(x), we let ∂f(x) denote its subgradient set at x (if f has two arguments, we write ∂_x f(x, y) for the subgradient with respect to x). For a positive semi-definite matrix A, we let ‖·‖_A be the (semi)norm defined by ‖v‖_A² := ⟨v, Av⟩, where ⟨·, ·⟩ is the standard inner product. We let 1{·} be the indicator function, which is 1 when its argument is true and 0 otherwise.

2 Minimax rates for sparse optimization

We begin our study of sparse optimization problems by establishing their fundamental statistical and optimization-theoretic properties. To do this, we derive bounds on the minimax convergence rate of any algorithm for such problems. Formally, let x̂ denote any estimator for a minimizer of the objective (1). We define the optimality gap ε_N for the estimator x̂ based on N samples ξ¹, …, ξ^N from the distribution P as

ε_N(x̂, F, X, P) := f(x̂) − inf_{x ∈ X} f(x) = E_P[F(x̂; ξ)] − inf_{x ∈ X} E_P[F(x; ξ)].

This quantity is a random variable, since x̂ is a random variable (it is a function of ξ¹, …, ξ^N). To define the minimax error, we thus take expectations of the quantity ε_N, though we require a bit more than simply E[ε_N].
We let P denote a collection of probability distributions, and we consider a collection of loss functions specified by a collection F of convex losses F : X × Ξ → R. We can then define the minimax error for the family of losses F and distributions P as

ε*_N(X, P, F) := inf_{x̂} sup_{P ∈ P} sup_{F ∈ F} E_P[ε_N(x̂, F, X, P)],   (3)

where the infimum is taken over all possible estimators (optimization schemes) x̂.

2.1 Minimax lower bounds

Let us now give a more precise characterization of the (natural) set of sparse optimization problems we consider in order to provide the lower bound. For the next proposition, we let P consist of distributions supported on Ξ = {−1, 0, 1}^d, and we let p_j := P(ξ_j ≠ 0) be the marginal probability of appearance of feature j (j ∈ {1, …, d}). For our class of functions, we set F to consist of functions F satisfying the sparsity condition (2) and with the additional constraint that for g ∈ ∂_x F(x; ξ), the jth coordinate satisfies |g_j| ≤ M_j for a constant M_j < ∞. We obtain

Proposition 1. Let the conditions of the preceding paragraph hold, and let R be a constant such that X ⊃ [−R, R]^d. Then

ε*_N(X, P, F) ≥ (R/8) Σ_{j=1}^d M_j min{ p_j, √p_j / √(N log 3) }.

We provide the proof of Proposition 1 in Appendix A.1, giving a few remarks here. We begin with a corollary to Proposition 1 that follows when the data ξ obeys a type of power law: let p_0 ∈ [0, 1], and assume that P(ξ_j ≠ 0) = p_0 j^{−α}. We have

Corollary 2. Let α ≥ 0. Let the conditions of Proposition 1 hold with M_j ≡ M for all j, and assume the power-law condition P(ξ_j ≠ 0) = p_0 j^{−α} on the coordinate appearance probabilities. Then

(1) If d > (p_0 N)^{1/α},
ε*_N(X, P, F) ≥ (MR/8) [ (2/(2 − α)) √(p_0/N) ( (p_0 N)^{(2−α)/(2α)} − 1 ) + (p_0/(1 − α)) ( d^{1−α} − (p_0 N)^{(1−α)/α} ) ].

(2) If d ≤ (p_0 N)^{1/α},
ε*_N(X, P, F) ≥ (MR/8) √(p_0/N) ( d^{1−α/2} − 1 ) / (1 − α/2).

For simplicity, assume that the features are not too extraordinarily sparse, say α ∈ [0, 2], and that the number of samples is large enough that d ≤ (p_0 N)^{1/α}. Then we find ourselves in regime (2) of Corollary 2, so that the lower bound on optimization error is of order

MR √(p_0/N) d^{1−α/2} when α < 2,   MR √(p_0/N) log d when α → 2,   and MR √(p_0/N) when α > 2.   (4)

These results beg the question of tightness: are they improvable? As we see presently, they are not.

2.2 Algorithms for attaining the minimax rate

The lower bounds specified by Proposition 1 and the subsequent specializations are sharp, meaning that they are unimprovable by more than constant factors. To show this, we review a few stochastic gradient algorithms. We first recall stochastic gradient descent, after which we review the dual averaging methods and an extension of both.

We begin with stochastic gradient descent (SGD): for this algorithm, we repeatedly sample ξ ∼ P, compute g ∈ ∂_x F(x; ξ), then perform the update x ← Π_X(x − ηg), where η is a stepsize parameter and Π_X denotes Euclidean projection onto X. Standard analyses of stochastic gradient descent (e.g. [10]) show that after N samples ξ, in our setting the SGD estimator x̂(N) satisfies

E[f(x̂(N))] − inf_{x ∈ X} f(x) ≤ O(1) R_2 M √(Σ_{j=1}^d p_j) / √N,   (5)

where R_2 denotes the ℓ2-radius of X.
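As a concrete reference point, a minimal projected-SGD sketch is given below; the choice of an ℓ2-ball domain, the fixed stepsize, and the iterate averaging are our own illustrative choices, and the function names are hypothetical.

```python
import numpy as np

def sgd(sample, grad, d, radius, eta, n_steps):
    """Projected stochastic gradient descent (sketch).

    sample(): draws one data point xi ~ P.
    grad(x, xi): returns g in the subdifferential of F(x; xi),
    e.g. logistic_loss_grad from the earlier sketch.
    The domain X is taken to be an l2 ball, so projection is a rescaling.
    """
    x = np.zeros(d)
    avg = np.zeros(d)
    for t in range(n_steps):
        xi = sample()
        g = grad(x, xi)
        x = x - eta * g
        norm = np.linalg.norm(x)
        if norm > radius:                 # Euclidean projection onto X
            x *= radius / norm
        avg += (x - avg) / (t + 1)        # running average of the iterates
    return avg
```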
Dual averaging, due to Nesterov [11] and referred to as "follow the regularized leader" in the machine learning literature (see, e.g., the survey article by Hazan [5]), is somewhat more complex. In dual averaging, one again samples g ∈ ∂_x F(x; ξ), but instead of updating the parameter vector x one updates a dual vector z by z ← z + g, then computes

x ← argmin_{x ∈ X} { ⟨z, x⟩ + (1/η) ψ(x) },

where ψ(x) is a strongly convex function defined over X (often one takes ψ(x) = ½‖x‖₂²). The dual averaging algorithm, as we shall see, is somewhat more natural in asynchronous and parallel computing environments, and it enjoys the same type of convergence guarantees (5) as SGD.

The AdaGrad algorithm [4, 9] is a slightly more complicated extension of the preceding stochastic gradient methods. It maintains a diagonal matrix S, where upon receiving a new sample ξ, AdaGrad performs the following: it computes g ∈ ∂_x F(x; ξ), then updates S_j ← S_j + g_j² for j ∈ [d]. Depending on whether the dual averaging or stochastic gradient descent (SGD) variant is being used, AdaGrad performs one of two updates. In the dual averaging case, it maintains the dual vector z, which is updated by z ← z + g; in the SGD case, the parameter x is maintained. The updates for the two cases are then

x ← argmin_{x′ ∈ X} { ⟨g, x′⟩ + (1/2η) ⟨x′ − x, S^{1/2}(x′ − x)⟩ }

for stochastic gradient descent and

x ← argmin_{x′ ∈ X} { ⟨z, x′⟩ + (1/2η) ⟨x′, S^{1/2} x′⟩ }

for dual averaging, where η is a stepsize. An appropriate choice of η then shows that after N samples ξ, the averaged parameter x̂(N) that AdaGrad returns satisfies

E[f(x̂(N))] − inf_{x ∈ X} f(x) ≤ O(1) (R_∞ M / √N) Σ_{j=1}^d √p_j,   (6)

where R_∞ denotes the ℓ∞-radius of X (e.g. [4, Section 1.3 and Theorem 5], where one takes η ≈ R_∞). By inspection, the AdaGrad rate (6) matches the lower bound in Proposition 1 and is thus optimal. It is interesting to note, though, that in the power-law setting of Corollary 2 (recall the error order (4)), a calculation shows that the multiplier for the SGD guarantee (5) becomes R_∞ √d max{d^{(1−α)/2}, 1}, while AdaGrad attains rate at worst R_∞ max{d^{1−α/2}, log d} (by evaluation of Σ_j √p_j). Thus for α > 1 the AdaGrad rate is no worse, and for α ≥ 2 it is more than √d / log d better than SGD—an exponential improvement in the dimension.
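For concreteness, here is a minimal sketch of the dual-averaging form of diagonal AdaGrad in the unconstrained case X = R^d; the closed-form argmin used below follows from the displayed update, while the small constant delta and all names are our own additions.

```python
import numpy as np

def adagrad_dual_averaging(sample, grad, d, eta, n_steps, delta=1e-8):
    """Diagonal AdaGrad, dual-averaging form, for X = R^d (sketch).

    With no constraint set, argmin_x <z, x> + (1/(2*eta)) <x, S^(1/2) x>
    has the closed form x_j = -eta * z_j / sqrt(S_j), used below.
    `delta` keeps the division well defined before any gradient is seen.
    """
    z = np.zeros(d)          # running sum of gradients (dual vector)
    S = np.zeros(d)          # running sum of squared gradient coordinates
    avg = np.zeros(d)
    for t in range(n_steps):
        x = -eta * z / (np.sqrt(S) + delta)
        xi = sample()
        g = grad(x, xi)
        S += g * g           # only coordinates in supp(xi) change, by condition (2)
        z += g
        avg += (x - avg) / (t + 1)
    return avg
```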
3 Parallel and asynchronous optimization with sparsity

As we note in the introduction, recent works [12, 14] have suggested that sparsity can yield benefits in our ability to parallelize stochastic gradient-type algorithms. Given the optimality of AdaGrad-type algorithms, it is natural to focus on their parallelization in the hope that we can leverage their ability to "adapt" to sparsity in the data. To provide the setting for our further algorithms, we first revisit Niu et al.'s Hogwild!.

The Hogwild! algorithm of Niu et al. [12] is an asynchronous (parallelized) stochastic gradient algorithm that proceeds as follows. To apply Hogwild!, we must assume the domain X in problem (1) is a product space, meaning that it decomposes as X = X_1 × · · · × X_d, where X_j ⊂ R. Fix a stepsize η > 0. Then a pool of processors, each running independently, performs the following updates asynchronously to a centralized vector x:

1. Sample ξ ∼ P
2. Read x and compute g ∈ ∂_x F(x; ξ)
3. For each j such that g_j ≠ 0, update x_j ← Π_{X_j}(x_j − η g_j)

Here Π_{X_j} denotes projection onto the jth coordinate of the domain X. The key to Hogwild! is that in step 2, the parameter x at which g is calculated may be somewhat inconsistent—it may have received partial gradient updates from many processors—though for appropriate problems this inconsistency is negligible. Indeed, Niu et al. [12] show a linear speedup in optimization time as the number of independent processors grows; they show this empirically in many scenarios, providing a proof under the somewhat restrictive assumption that there is at most one non-zero entry in any gradient g.

3.1 Asynchronous dual averaging

One of the weaknesses of Hogwild! is that, as written, it appears to be applicable only to problems for which the domain X is a product space, and the known analysis assumes that ‖g‖₀ = 1 for all gradients g. In an effort to alleviate these difficulties, we now develop and present our asynchronous dual averaging algorithm, AsyncDA. In AsyncDA, instead of asynchronously updating a centralized parameter vector x, we maintain a centralized dual vector z. A pool of processors performs asynchronous additive updates to z, where each processor repeatedly performs the following updates:

1. Read z and compute x := argmin_{x ∈ X} { ⟨z, x⟩ + (1/η) ψ(x) }   // Implicitly increment the "time" counter t and let x(t) = x
2. Sample ξ ∼ P and let g ∈ ∂_x F(x; ξ)   // Let g(t) = g
3. For j ∈ [d] such that g_j ≠ 0, update z_j ← z_j + g_j

Because the actual computation of the vector x in AsyncDA is performed locally on each processor in step 1 of the algorithm, the algorithm can be executed with any proximal function ψ and domain X. The only communication point between any of the processors is the addition operation in step 3. As noted by Niu et al. [12], this operation can often be performed atomically on modern processors.
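A minimal sketch of the AsyncDA update pattern using Python threads is given below, assuming ψ(x) = ½‖x‖² and X = R^d so that step 1 has the closed form x = −ηz. It only illustrates the read/compute/add structure (Python's interpreter lock means it will not show real parallel speedups), and every name in it is ours, not the paper's implementation.

```python
import threading
import numpy as np

def async_dual_averaging(sample, grad, d, eta, steps_per_worker, n_workers=4):
    """AsyncDA sketch: workers share a dual vector z and update it additively."""
    z = np.zeros(d)                      # shared dual vector

    def worker():
        for _ in range(steps_per_worker):
            x = -eta * z                 # step 1: primal point from (possibly stale) z
            xi = sample()                # step 2: draw a sample and compute a subgradient
            g = grad(x, xi)
            for j in np.flatnonzero(g):  # step 3: additive, coordinate-wise updates to z
                z[j] += g[j]

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The analysis averages the iterates x(t); returning the last primal point
    # keeps this sketch short.
    return -eta * z
```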
In our analysis of AsyncDA, and in our subsequent analysis of the adaptive methods, we require a measurement of elapsed time. With that in mind, we let t denote a time index that exists (roughly) behind the scenes. We let x(t) denote the vector x ∈ X computed in the tth execution of step 1 of the AsyncDA algorithm, that is, whichever is the tth x actually computed by any of the processors. We note that this quantity exists and is recoverable from the algorithm, and that it is possible to track the running sum Σ_{τ=1}^t x(τ). Additionally, we require two assumptions encapsulating the conditions underlying our analysis.

Assumption A. There is an upper bound m on the delay of any processor. In addition, for each j ∈ [d] there is a constant p_j ∈ [0, 1] such that P(ξ_j ≠ 0) ≤ p_j.

We also require an assumption about the continuity (Lipschitzian) properties of the loss functions being minimized; the assumption amounts to a second-moment constraint on the subgradients of the instantaneous F along with a rough measure of the sparsity of the gradients.

Assumption B. There exist constants M and (M_j)_{j=1}^d such that the following bounds hold for all x ∈ X: E[‖∂_x F(x; ξ)‖₂²] ≤ M², and for each j ∈ [d] we have E[|∂_{x_j} F(x; ξ)|] ≤ p_j M_j.

With these definitions, we have the following theorem, which captures the convergence behavior of AsyncDA under the assumption that X is a Cartesian product, meaning that X = X_1 × · · · × X_d with X_j ⊂ R, and that ψ(x) = ½‖x‖₂². Note that the algorithm itself can still be efficiently parallelized for more general convex X, even if the theorem does not apply.

Theorem 3. Let Assumptions A and B and the conditions in the preceding paragraph hold. Then

E[ Σ_{t=1}^T ( F(x(t); ξ^t) − F(x*; ξ^t) ) ] ≤ (1/(2η)) ‖x*‖₂² + (η/2) T M² + η T m Σ_{j=1}^d p_j² M_j².

We provide the proof of Theorem 3 in Appendix B. As stated, the theorem is somewhat unwieldy, so we provide a corollary and a few remarks to explain and simplify the result. Under the more stringent condition that |∂_{x_j} F(x; ξ)| ≤ M_j, Assumption A implies E[‖∂_x F(x; ξ)‖₂²] ≤ Σ_{j=1}^d p_j M_j². Thus, without loss of generality, for the remainder of this section we take M² = Σ_{j=1}^d p_j M_j², which serves as an upper bound on the Lipschitz continuity constant of the objective function f. We then obtain the following corollary.

Corollary 4. Define x̂(T) = (1/T) Σ_{t=1}^T x(t), and set η = ‖x*‖₂ / (M √T). Then

E[f(x̂(T)) − f(x*)] ≤ M ‖x*‖₂ / √T + ( m ‖x*‖₂ / (2 M √T) ) Σ_{j=1}^d p_j² M_j².

Corollary 4 is almost immediate. To see the result, note that since ξ^t is independent of x(t), we have E[F(x(t); ξ^t) | x(t)] = f(x(t)); applying Jensen's inequality to f(x̂(T)) and performing an algebraic manipulation give the corollary. If the data is suitably "sparse," meaning that p_j ≤ 1/m (which may also occur if the data is of relatively high variance in Assumption B), the bound in Corollary 4 simplifies to

E[f(x̂(T)) − f(x*)] ≤ (3/2) M ‖x*‖₂ / √T = (3/2) √(Σ_{j=1}^d p_j M_j²) ‖x*‖₂ / √T,   (7)

which is the convergence rate of stochastic gradient descent (and dual averaging) even in non-asynchronous situations (5). In non-sparse cases, setting η ∝ ‖x*‖₂ / √(m M² T) in Theorem 3 recovers the bound

E[f(x̂(T)) − f(x*)] ≤ O(1) √m · M ‖x*‖₂ / √T.

The convergence guarantee (7) shows that after T timesteps we have error scaling as 1/√T; however, if we have k processors, then updates can occur roughly k times as quickly, as all updates are asynchronous. Thus in time scaling as n/k we can evaluate n gradient samples: a linear speedup.

3.2 Asynchronous AdaGrad

We now turn to extending AdaGrad to asynchronous settings, developing AsyncAdaGrad (asynchronous AdaGrad). As in the AsyncDA algorithm, AsyncAdaGrad maintains a shared dual vector z among the processors, which is the sum of the gradients observed; AsyncAdaGrad also maintains the matrix S, which is the diagonal sum of squares of gradient entries (recall Section 2.2). The matrix S is initialized as diag(δ²), where δ_j ≥ 0 is an initial value. Each processor asynchronously performs the following iterations:

1. Read S and z and set G = S^{1/2}. Compute x := argmin_{x ∈ X} { ⟨z, x⟩ + (1/2η) ⟨x, Gx⟩ }   // Implicitly increment the "time" counter t and let x(t) = x, S(t) = S
2. Sample ξ ∼ P and let g ∈ ∂F(x; ξ)
3. For j ∈ [d] such that g_j ≠ 0, update S_j ← S_j + g_j² and z_j ← z_j + g_j

As in the description of AsyncDA, x(t) is the vector x ∈ X computed in the tth "step" of the algorithm (step 1), and we similarly associate ξ^t with x(t). To analyze AsyncAdaGrad, we make a somewhat stronger assumption on the sparsity properties of the losses F than Assumption B.

Assumption C. There exist constants (M_j)_{j=1}^d such that for any x ∈ X and ξ ∈ Ξ, we have E[(∂_{x_j} F(x; ξ))² | ξ_j ≠ 0] ≤ M_j².

Indeed, taking M² = Σ_j p_j M_j² shows that Assumption C implies Assumption B with specific constants. We then have the following convergence result, whose proof we defer to Appendix C.

Theorem 5. In addition to the conditions of Theorem 3, let Assumption C hold. Assume that δ² ≥ M_j² m for all j and that X ⊂ [−R_∞, R_∞]^d. Then

Σ_{t=1}^T E[ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ Σ_{j=1}^d min{ (1/η) R_∞² E[ (δ² + Σ_{t=1}^T g_j(t)²)^{1/2} ] + η E[ (Σ_{t=1}^T g_j(t)²)^{1/2} ] (1 + p_j m),  M_j R_∞ p_j T }.

We can also relax the condition on the initial constant diagonal term δ slightly, which gives a qualitatively similar result (see Appendix C.3).

Corollary 6. Under the conditions of Theorem 5, assume that δ² ≥ M_j² min{ m, 6 max{log T, m p_j} } for all j.
Then

Σ_{t=1}^T E[ F(x(t); ξ^t) − F(x*; ξ^t) ] ≤ Σ_{j=1}^d min{ (1/η) R_∞² E[ (δ² + Σ_{t=1}^T g_j(t)²)^{1/2} ] + (3/2) η E[ (Σ_{t=1}^T g_j(t)²)^{1/2} ] (1 + p_j m),  M_j R_∞ p_j T }.

It is natural to ask in which situations the bound that Theorem 5 and Corollary 6 provide is optimal. We note that, as in the case of Theorem 3, we may take an expectation with respect to ξ^t and obtain a convergence rate for f(x̂(T)) − f(x*), where x̂(T) = (1/T) Σ_{t=1}^T x(t). By Jensen's inequality, we have for any δ that

E[ (δ² + Σ_{t=1}^T g_j(t)²)^{1/2} ] ≤ ( δ² + Σ_{t=1}^T E[g_j(t)²] )^{1/2} ≤ √( δ² + T p_j M_j² ).

For interpretation, let us now make a few assumptions on the probabilities p_j. If we assume that p_j ≤ c/m for a universal (numerical) constant c, then Theorem 5 guarantees that

E[f(x̂(T)) − f(x*)] ≤ O(1) ( (1/η) R_∞² + η ) Σ_{j=1}^d M_j min{ √( log(T)/T + p_j ) / √T,  p_j }.   (8)

1 Auditory Sparse Coding

Steven R. Ness, University of Victoria
Thomas Walters, Google Inc.
Richard F. Lyon, Google Inc.

CONTENTS
1.1 Summary
1.2 Introduction
1.2.1 The stabilized auditory image
1.3 Algorithm
1.3.1 Pole–Zero Filter Cascade
1.3.2 Image Stabilization
1.3.3 Box Cutting
1.3.4 Vector Quantization
1.3.5 Machine Learning
1.4 Experiments
1.4.1 Sound Ranking
1.4.2 MIREX 2010
1.5 Conclusions

1.1 Summary

The concept of sparsity has attracted considerable interest in the field of machine learning in the past few years. Sparse feature vectors contain mostly zeros and only one or a few non-zero values. Although these feature vectors can be classified by traditional machine learning algorithms such as SVM, there are various recently developed algorithms that explicitly take advantage of the sparse nature of the data, leading to massive speedups in time as well as improved performance. Some fields that have benefited from the use of sparse algorithms are finance, bioinformatics, text mining [1], and image classification [4]. Because of their speed, these algorithms perform well on very large collections of data [2]; large collections are becoming increasingly relevant given the huge amounts of data collected and warehoused by Internet businesses.

In this chapter, we discuss the application of sparse feature vectors in the field of audio analysis, and specifically their use in conjunction with preprocessing systems that model the human auditory system. We present early results that demonstrate the applicability of the combination of auditory-based processing and sparse coding to content-based audio analysis tasks. We present results from two different experiments: a search task in which ranked lists of sound effects are retrieved from text queries, and a music information retrieval (MIR) task dealing with the classification of music into genres.
1.2 Introduction

Traditional approaches to audio analysis problems typically employ a short-window fast Fourier transform (FFT) as the first stage of the processing pipeline. In such systems a short, perhaps 25 ms, segment of audio is taken from the input signal and windowed in some way, then the FFT of that segment is computed. The window is then shifted a little, by perhaps 10 ms, and the process is repeated. This technique yields a two-dimensional spectrogram of the original audio, with the frequency axis of the FFT as one dimension and time (quantized by the step size of the window) as the other dimension.

While the spectrogram is easy to compute and a standard engineering tool, it bears little resemblance to the early stages of the processing pipeline in the human auditory system. The mammalian cochlea can be viewed as a bank of tuned filters whose output is a set of band-pass filtered versions of the input signal that are continuous in time. Because of this property, fine-timing information is preserved in the output of the cochlea, whereas in the spectrogram described above there is no fine-timing information available below the 10 ms hop size of the windowing function. This fine-timing information from the cochlea can be used in later stages of processing to yield a three-dimensional representation of audio, the stabilized auditory image (SAI) [11], a movie-like representation of sound which has a 'time-interval' dimension in addition to the standard dimensions of time and frequency in the spectrogram. The periodicity of the waveform gives rise to a vertical banding structure in this time-interval dimension, which provides information about the sound that is complementary to that available in the frequency dimension. A single example frame of a stabilized auditory image is shown in Figure 1.1.

While we believe that such a representation should be useful for audio analysis tasks, it does come at a cost. The data rate of the SAI is many times that of the original input audio, and as such some form of dimensionality reduction is required in order to create features at a suitable data rate for use in a recognition system. One approach to this problem is to move from the dense representation of the SAI to a sparse representation, in which the overall dimensionality of the features is high but only a limited number of the dimensions are nonzero at any time. In recent years, machine learning algorithms that utilize the properties of sparsity have begun to attract more attention and have been shown to outperform approaches that use dense feature vectors. One such algorithm is the passive-aggressive model for image retrieval (PAMIR), a machine learning algorithm that learns a ranking function from the input data; that is, it takes an input set of documents and orders them based on their relevance to a query. PAMIR was originally developed as a machine vision method and has demonstrated excellent results in that field.

There is also growing evidence that in the human nervous system sensory inputs are coded in a sparse manner; that is, only small numbers of neurons are active at a given time [10]. Therefore, when modeling the human auditory system, it may be advantageous to investigate this property of sparseness in relation to the mappings that are being developed. The nervous systems of animals have evolved over millions of years to be highly efficient in terms of energy consumption and computation.
Looking into the way sound signals are handled by the auditory system could give us insights into how to make our algorithms more efficient and better model the human auditory system. One advantage of using sparse vectors is that such coding allows very fast computation of similarity, with a trainable similarity measure [4]. The efficiency results from storing, accessing, and doing arithmetic operations on only the non-zero elements of the vectors. In one study that examined the performance of sparse representations in the field of natural language processing, a 20- to 80-fold speedup over LIBSVM was found [7]. They comment that kernel-based methods, like SVM, scale quadratically with the number of training examples and discuss how sparsity can allow algorithms to scale linearly based on the number of training examples. In this chapter, we use the stabilized auditory image (SAI) as the basis of a sparse feature representation which is then tested in a sound ranking task and a music information retrieval task. In the sound raking task, we generate a two-dimensional SAI for each time slice, and then sparse-code those images as input to PAMIR. We use the ability of PAMIR to learn representations of sparse data in order to learn a model which maps text terms to audio features. This PAMIR model can then be used rank a list of unlabeled sound effects according to their relevance to some text query. We present results that show that in certain tasks our methods can outperform highly tuned FFT based approaches. We also use similar sparse-coded SAI features as input to a music genre classification system. This system uses an SVM classifier on the sparse features, and learns text terms associated with music. The system was entered into the annual music information retrieval evaluation exchange evaluation (MIREX 2010).6 Book title goes here Results from the sound-effects ranking task show that sparse auditorymodel-based features outperform standard MFCC features, reaching precision about 73% for the top-ranked sound, compared to about 60% for standard MFCC and 67% for the best MFCC variant. These experiments involved ranking sounds in response to text queries through a scalable online machine learning approach to ranking. 1.2.1 The stabilized auditory image In our system we have taken inspiration from the human auditory system in order to come up with a rich set of audio features that are intended to more closely model the audio features that we use to listen and process music. Such fine timing relations are discarded by traditional spectral techniques. A motivation for using auditory models is that the auditory system is very effective at identifying many sounds. This capability may be partially attributed to acoustic features that are extracted at the early stages of auditory processing. We feel that there is a need to develop a representation of sounds that captures the full range of auditory features that humans use to discriminate and identify different sounds, so that machines have a chance to do so as well. FIGURE 1.1 An example of a single SAI of a sound file of a spoken vowel sound. The vertical axis is frequency with lower frequencies at the bottom of the figure and higher frequencies on the top. The horizontal axis is the autocorrelation lag. From the positions of the vertical features, one can determine the pitch of the sound. This SAI representation generates a 2D image from each section of waveform from an audio file. 
We then reduce each image in several steps: first cutting the image into overlapping boxes converted to fixed resolution per box; second, finding row and column sums of these boxes and concatenating those into a vector; and finally vector quantizing the resulting medium-Auditory Sparse Coding 7 dimensionality vector, using a separate codebook for each box position. The VQ codeword index is a representation of a 1-of-N sparse code for each box, and the concatenation of all of those sparse vectors, for all the box positions, makes the sparse code for the SAI image. The resulting sparse code is accumulated across the audio file, and this histogram (count of number of occurrences of each codeword) is then used as input to an SVM [5] classifier[3]. This approach is similar to that of the “bag of words” concept, originally from natural language processing, but used heavily in computer vision applications as “bag of visual words”; here we have a “bag of auditory words”, each “word” being an abstract feature corresponding to a VQ codeword. The bag representation is a list of occurrence counts, usually sparse. 1.3 Algorithm In our experiments, we generate a stream of SAIs using a series of modules that process an incoming audio stream through the various stages of the auditory model. The first module filters the audio using the pole–zero filter cascade (PZFC) [9], then subsequent modules find strobe points in this audio, and generate a stream of SAIs at a rate of 50 per second. The SAIs are then cut into boxes and are transformed into a high dimensional dense feature vector [12] which is vector quantized to give a high dimensional sparse feature vector. This sparse vector is then used as input to a machine learning system which performs either ranking or classification. This whole process is shown in diagrammatic form in Figure 1.2 1.3.1 Pole–Zero Filter Cascade We first process the audio with the pole–zero filter cascade (PZFC) [9], a model inspired by the dynamics of the human cochlea. The PZFC is a cascade of a large number of simple filters with an output tap after each stage. The effect of this filter cascade is to transform an incoming audio signal into a set of band-pass filtered versions of the signal. In our case we used a cascade with 95 stages, leading to 95 output channels. Each output channel is halfwave rectified to simulate the output of the inner hair cells along the length of the cochlea. The PZFC also includes an automatic gain control (AGC) system that mimics the effect of the dynamic compression mechanisms seen in the cochlea. A smoothing network, fed from the output of each channel, dynamically modifies the characteristics of the individual filter stages. The AGC can respond to changes in the output on the timescale of milliseconds, leading to very fast-acting compression. One way of viewing this filter cascade is that its outputs are an approximation of the instantaneous neuronal firing rate as a function of cochlear place, modeling both the frequency filtering and8 Book title goes here FIGURE 1.2 A flowchart describing the flow of data in our system. First, either the pole– zero filter cascade (PZFC) or gammatone filterbank filters the input audio signal. Filtered signals then pass through a half-wave rectification module (HCL), and trigger points in the signal are then calculated by the local-max module. The output of this stage is the SAI, the image in which each signal is shifted to align the trigger time to the zero lag point in the image. 
The SAI is then cut into boxes with the box-cutting module, and the resulting boxes are then turned into a codebook with the vector-quantization module. the automatic gain control characteristics of the human cochlea [8]. The PZFC parameters used for the sound-effects ranking task are described in [9]. We did not do any further tuning of this system to the problems of genre, mood or song classification; this would be a fruitful area of further research. 1.3.2 Image Stabilization The output of the PZFC filterbank is then subjected to a process of strobe finding where large peaks in the PZFC signal are found. The temporal locations of these peaks are then used to initiate a process of temporal integration whereby the stabilized auditory image is generated. These strobe points “stabilize” the signal in a manner analogous to the trigger mechanism in an oscilloscope. When these strobe points are found, a modified form of autocorrelation, known as strobed temporal integration, which is like a sparse version of autocorrelation where only the strobe points are correlated against the sig-Auditory Sparse Coding 9 FIGURE 1.3 The cochlear model, a filter cascade with half-wave rectifiers at the output taps, and an automatic gain control (AGC) filter network that controls the tuning and gain parameters in response to the sound. nal. Strobed temporal integration has the advantage of being considerably less computationally expensive than full autocorrelation. 1.3.3 Box Cutting We then divide each image into a number of overlapping boxes using the same process described in [9]. We start with rectangles of size 16 lags by 32 frequency channels, and cover the SAI with these rectangles, with overlap. Each of these rectangles is added to the set of rectangles to be used for vector quantization. We then successively double the height of the rectangle up to the largest size that fits in an SAI frame, but always reducing the contents of each box back to 16 by 32 values. Each of these doublings is added to the set of rectangles. We then double the width of each rectangle up to the width of the SAI frame and add these rectangles to the SAI frame. The output of this step is a set of 44 overlapping rectangles. The process of box-cutting is shown in Figure 1.4. In order to reduce the dimensionality of these rectangles, we then take their row and column marginals and join them together into a single vector. 1.3.4 Vector Quantization The resulting dense vectors from all the boxes of a frame are then converted to a sparse representation by vector quantization. We first preprocessed a collection of 1000 music files from 10 genres using a PZFC filterbank followed by strobed temporal integration to yield a set of10 Book title goes here FIGURE 1.4 The boxes, or multi-scale regions, used to analyze the stabilized auditory images are generated in a variety of heights, widths, and positions. SAI frames for each file . We then take this set of SAI and apply the boxcutting technique described above. The followed by the calculation of row and column marginals. These vectors are then used to train dictionaries of 200 entries, representing abstract “auditory words”, for each box position, using a k-means algorithm. This process requires the processing of large amounts of data, just to train the VQ codebooks on a training corpus. 
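A minimal sketch of this codebook step, assuming scikit-learn's k-means and a list train_marginals whose b-th entry is an array of the dense row-and-column-marginal vectors collected for box position b (the names and the use of scikit-learn are illustrative assumptions, not the actual implementation):

    from sklearn.cluster import KMeans  # any k-means implementation would do

    def train_codebooks(train_marginals, n_words=200):
        # One 200-entry codebook of abstract "auditory words" per box position.
        return [KMeans(n_clusters=n_words, n_init=4).fit(vectors)
                for vectors in train_marginals]

    def sparse_code_frame(frame_marginals, codebooks):
        # Replace each box's dense marginal vector (a 1-D numpy array) by a
        # 1-of-N code: the index of its nearest codeword in that box's codebook.
        return [(box, int(cb.predict(v.reshape(1, -1))[0]))
                for box, (v, cb) in enumerate(zip(frame_marginals, codebooks))]

Accumulating the resulting (box position, codeword) pairs over all frames of a file gives the sparse bag-of-auditory-words histogram described above.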
The resulting dictionaries for all boxes are then used in the MIREX experiment to convert the dense features from the box cutting step on the test corpus songs into a set of sparse features where each box was represented by a vector of 200 elements with only one element being non-zero. The sparse vectors for each box were then concatenated, and these long spare vectors are histogrammed over the entire audio file to produce a sparse feature vector for each song or sound effect. This operation of constructing a sparse bag of auditory words was done for both the training and testing corpora.Auditory Sparse Coding 11 1.3.5 Machine Learning For this system, we used the support vector machine learning system from libSVM which is included in the Marsyas[13] framework. Standard Marsyas SVM parameters were used in order to classify the sparse bag of auditory words representation of each song. It should be noted that SVM is not the ideal algorithm for doing classification on such a sparse representation, and if time permitted, we would have instead used the PAMIR machine learning algorithm as described in [9]. This algorithm has been shown to outperform SVM on ranking tasks, both in terms of execution speed and quality of results. 1.4 Experiments 1.4.1 Sound Ranking We performed an experiment in which we examined a quantitative ranking task over a diverse set of audio files using tags associated with the audio files. For this experiment, we collected a dataset of 8638 sound effects, which came from multiple places. 3855 of the sound files were from commercially available sound effect libraries, of these 1455 were from the BBC sound effects library. The other 4783 audio files were collected from a variety of sources on the internet, including findsounds.com, partnersinrhyme.com, acoustica.com, ilovewaves.com, simplythebest.net, wav-sounds.com, wav-source.com and wavlist.com. We then manually annotated this dataset of sound effects with a small number of tags for each file. Some of the files were already assigned tags and for these, we combined our tags with this previously existing tag information. In addition, we added higher level tags to each file, for example, files with the tags “cat”, “dog” and “monkey” were also given the tags “mammal” and “animal”. We found that the addition of these higher level tags assist retrieval by inducing structure over the label space. All the terms in our database were stemmed, and we used the Porter stemmer for English, which left a total of 3268 unique tags for an average of 3.2 tags per sound file. In order to estimate the performance of the learned ranker, we used a standard three-fold cross-validation experimental setup. In this scheme, two thirds of the data is used for training and one third is used for testing; this process is then repeated for all three splits of the data and results of the three are averaged. We removed any queries that had fewer than 5 documents in either the training set or the test set, and if the corresponding documents had no other tags, these documents were removed as well. To determine the values of the hyperparameters for PAMIR we performed a second level of cross-validation where we iterated over values for the aggressiveness parameter C and the number of training iterations. We found that in12 Book title goes here general system performance was good for moderate values of C and that lower values of C required a longer training time. 
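That sweep over the aggressiveness parameter C and the number of training iterations can be sketched as a simple second-level cross-validation loop (a hypothetical sketch; train_fn and score_fn stand in for the real PAMIR trainer and validation scorer):

    def tune_hyperparameters(train_fn, score_fn,
                             C_values=(0.01, 0.1, 1.0),
                             iteration_counts=(1_000_000, 10_000_000)):
        # train_fn(C, iterations) returns a trained ranker;
        # score_fn(model) returns precision on a held-out validation split.
        best = None
        for C in C_values:
            for iterations in iteration_counts:
                score = score_fn(train_fn(C, iterations))
                if best is None or score > best[0]:
                    best = (score, C, iterations)
        return best  # (validation score, C, iterations)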
For the agressiveness parameter, we selected a value of C=0.1, a value which was also found to be optimal in other research [6]. For the number of iterations, we chose 10M, and found that in our experience, the system was not very sensitive to the exact value of these parameters. We evaluated our learned model by looking at the precision within the top k audio files from the test set as ranked by each query. Precision at top k is a commonly used measure in retrieval tasks such as these and measures the fraction of positive results within the top k results from a query. The stabilized auditory image generation process has a number of parameters which can be adjusted including the parameters of the PZFC filter and the size of rectangles that the SAI is cut into for subsequent vector quantization. We created a default set of parameters and then varied these parameters in our experiments. The default SAI box-cutting was performed with 16 lags and 32 channels, which gave a total of 49 rectangles. These rectangles were then reduced to their marginal values which gives a 48 dimension vector, and a codebook of size 256 was used for each box, giving a total of 49 x 256 = 12544 feature dimensions. Starting from these, we then made systematic variations to a number of different parameters and measured their effect on precision of retrieval. For the box-cutting step, we adjusted various parameters including the smallest sized rectangle, and the maximum number of rectangles used for segmentation. We also varied the codebook sizes that we used in the sparse coding step. In order to evaluate our method, we compared it with results obtained using a very common feature extraction method for audio analysis, MFCCs (mel-frequency cepstral coefficients). In order to compare this type of feature extraction with our own, we turned these MFCC coefficients into a sparse code. These MFCC coefficients were calculated with a Hamming window with initial parameters based on a setting optimized for speech. We then changed various parameters of the MFCC algorithm, including the number of cepstral coefficients (13 for speech), the length of each frame (25ms for speech), and the number of codebooks that were used to sparsify the dense MFCC features for each frame. We obtained the best performance with 40 cepstral coefficients, a window size of 40ms and codebooks of size 5000. We investigated the effect of various parameters of the SAI feature extraction process on test-set precision, these results are displayed graphically in Figure 1.5 where the precision of the top ranked sound file is plotted against the number of features used. As one can see from this graph, performance saturates when the number of features approaches 105 which results from the use of 4000 code words per codebook, with a total of 49 codebooks. This particular set of parameters led to a performance of 73%, significantly better than the best MFCC result which achieved a performance of 67%, which represents a smaller error of 18% (from 33 % to 27 % error). It is also notableAuditory Sparse Coding 13 FIGURE 1.5 Ranking at top-1 retrieved result for all the experimental runs described in this section. A few selected experiment names are plotted next to each point, and different experiments are shown by different icons. The convex hull that connects the best-performing experiments is plotted as a solid line. that SAI can achieve better precision-at-top-k consistently for all values of k, albeit with a smaller improvement in relative precision. 
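The metric and the relative-error arithmetic used above are straightforward; as an illustrative sketch (not the chapter's actual evaluation code):

    def precision_at_k(ranked_files, relevant_files, k):
        # Fraction of the top-k retrieved files that are relevant to the query.
        return sum(1 for f in ranked_files[:k] if f in relevant_files) / float(k)

    def relative_error_reduction(p_baseline, p_new):
        # Error is 1 - precision; report the reduction relative to the baseline.
        e_old, e_new = 1.0 - p_baseline, 1.0 - p_new
        return (e_old - e_new) / e_old

    # Top-1: MFCC precision 0.67 vs. SAI precision 0.73 -> error drops from
    # 0.33 to 0.27, i.e. roughly an 18% relative reduction.
    print(round(relative_error_reduction(0.67, 0.73), 2))  # 0.18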
Table 1.2 shows the results of three queries along with the top sound files returned by the best SAI-based and MFCC-based systems. From this table, one can see that the two systems behave in different ways, which is to be expected when one considers the basic audio features that the two systems extract. For example, for the query "gulp" the SAI system returns "pouring" and "water-dripping"; all three of these involve the movement of water or liquids. When we calculated performance, it was based on textual tags, which are often noisy and incomplete. Due to the nature of human language and perception, people often use different words to describe sounds that are very similar; for example, a Chopin Mazurka could be described with the words "piano", "soft", "classical", "Romantic", and "mazurka". To compound this difficulty, a song with a female vocalist could be labelled as "woman", "women", "female", "female vocal", or "vocal". This type of multi-label problem is common in the field of content-based retrieval. It can be alleviated by a number of techniques, including the stemming of words, but due to the varying nature of human language and perception it will continue to remain an issue.

In Figure 1.6 the performance of the SAI- and MFCC-based systems is compared with respect to their average precision. A few selected full tag names are placed on this diagram; for the rest, only a plus sign is shown, because otherwise the text would overlap to such a degree that it would be impossible to read. In this diagram we plot the average precision of the SAI-based system against that of the MFCC-based system, with the SAI precision shown along the vertical axis and the MFCC precision shown along the horizontal axis. If the performance of the two systems were identical, all points would lie on the line y=x. Because more points lie above the line than below it, the performance of the SAI-based system is better than that of the MFCC-based system.

FIGURE 1.6
A comparison of the average precision of the SAI and MFCC based systems. Each point represents a single query, with the horizontal position being the MFCC average precision and the vertical position being the SAI average precision. More of the points appear above the y=x line, which indicates that the SAI based system achieved a higher mean average precision.

TABLE 1.1
A comparison of the best SAI and MFCC configurations. This table shows the percent error at top-k, where error is defined as 1 - precision.

    top-k    SAI error (%)    MFCC error (%)    percent error reduction
    1        27               33                18 %
    2        39               44                12 %
    5        60               62                4 %
    10       72               74                3 %
    20       81               84                4 %

TABLE 1.2
Results of three example queries: the top sound files returned by the best SAI-based and MFCC-based systems, with their labels in parentheses.

    Query: tarzan
      SAI:  Tarzan-2 (tarzan, yell); tarzan2 (tarzan, yell); 203 (tarzan); wolf (mammal, wolves, wolf, ...); morse (morse, code)
      MFCC: TARZAN (tarzan, yell); 175orgs (steam, whistle); mosquito-2 (mosquito); evil-witch-laugh (witch, evil, laugh); Man-Screams (horror, scream, man)

    Query: applause
      SAI:  27-Applause-from-audience; audience; phase1 (trek, phaser, star); fanfare2 (fanfare, trumpet); 45-Crowd-Applause (crowd, applause)
      MFCC: 26-Applause-from-audience; 30-Applause-from-audience; golf50 (golf); firecracker; 53-ApplauseLargeAudienceSFX; golf50

    Query: gulp
      SAI:  tite-flamn (hit, drum, roll); water-dripping (water, drip); Monster-growling (horror, monster, growl); Pouring (pour, soda)
      MFCC: GULPS (gulp, drink); drink (gulp, drink); california-myotis-search (blip); jaguar-1 (bigcat, jaguar, mammal, ...)
TABLE 1.3
Classical composer train/test classification task.

    Algorithm        Classification accuracy
    SAI/VQ           0.4987
    Marsyas MFCC     0.4430
    Best             0.6526
    Average          0.455

TABLE 1.4
Music mood train/test classification task.

    Algorithm        Classification accuracy
    SAI/VQ           0.4861
    Marsyas MFCC     0.5750
    Best             0.6417
    Average          0.49

1.4.2 MIREX 2010

All of these algorithms were then ported from AIM-C to the Marsyas music information retrieval framework, and extensive tests were written as described above. These algorithms were submitted to the MIREX 2010 competition as C++ code, which was then run by the organizers on blind data. As of this date, only results for two of the four train/test tasks have been released: one for the task of classifying classical composers, and the other for classifying the mood of a piece of music. There were 40 groups participating in this evaluation, the most ever for MIREX, which gives some indication of how important this kind of classification task is becoming in the real world. Below we present the results for the best entry, the average of all entries, our entry, and the other entry based on the Marsyas system. It is instructive to compare our result to that of the standard Marsyas system because we would in large part like to compare the SAI audio features to the standard MFCC features, and since both of these systems use the SVM classifier, we partially negate the influence of the machine learning part of the problem.

For the classical composer task the results are shown in Table 1.3, and for the mood classification task the results are shown in Table 1.4. From these results we can see that in the classical composer task we outperformed the traditional Marsyas system, which has been tuned over the course of a number of years to perform well. This gives us an indication that the use of these SAI features has promise. However, we underperform the best algorithm, which means that there is work to be done in terms of testing different machine learning algorithms that would be better suited to this type of data. A more detailed analysis of the results, shown in Figure 1.7, makes it evident that each of the algorithms has a wide range of performance on different classes. This graph shows that the classes predicted best by our SAI/VQ classifier overlap significantly with those predicted best by the highest-scoring classification engines.

FIGURE 1.7
Per-class results for the classical composer task.

In the mood task, we underperform both Marsyas and the leading algorithm. This is interesting and might speak to the fact that we did not tune the parameters of this algorithm for the task of music classification, but instead used the parameters that worked best for the classification of sound effects. Music mood might be a feature with spectral aspects that evolve over longer time periods than other features. For this reason, it would be important to search for other parameters of the SAI algorithm that would perform well for other tasks in music information retrieval. For these results, due to time constraints, we only used the SVM classifier on the SAI histograms. This has been shown in [9] to be an inferior classifier for this type of sparse, high-dimensional data compared to the PAMIR algorithm.
In the future, we would like to add the PAMIR algorithm to Marsyas and to try these experiments using this new classifier. It was observed that the MIR community is increasingly becoming focused on advanced machine learning techniques, and it is clear that it will be critical to both try different machine learning algorithms on these audio features as well as to perform wider sweeps of parameters for these classifiers. Both of these will be important in increasing the performance of these novel audio features.18 Book title goes here 1.5 Conclusions The use of physiologically-plausible acoustic models combined with a sparsi- fication approach has shown promising results in both the sound effects ranking and MIREX 2010 experiments. These features are novel and hold great promise in the field of MIR for the classification of music as well as other tasks. Some of the results obtained were better than that of a highly tuned MIR system on blind data. In this task we were able to expose the MIR community to these new audio features. These new audio features have been shown to outperform MFCC features in a sound-effects ranking task, and by evaluating these features with machine learning algorithms more suited for these high dimensional, sparse features, we have great hope that we will obtain even better results in future MIREX evaluations.Bibliography [1] Suhrid Balakrishnan and David Madigan. Algorithms for sparse linear classifiers in the massive data setting. J. Mach. Learn. Res., 9:313–337, 2008. [2] L´eon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston. Large-Scale Kernel Machines (Neural Information Processing). The MIT Press, 2007. [3] O. Chappelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006. [4] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. J. Mach. Learn. Res., 11:1109–1135, 2010. [5] Yasser EL-Manzalawy and Vasant Honavar. WLSVM: Integrating LibSVM into Weka Environment, 2005. [6] David Grangier and Samy Bengio. A discriminative kernel-based approach to rank images from text queries. IEEE Trans. Pattern Anal. Mach. Intell., 30(8):1371–1384, 2008. [7] Patrick Haffner. Fast transpose methods for kernel learning on sparse data. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 385–392, New York, NY, USA, 2006. ACM. [8] R. F. Lyon. Automatic gain control in cochlear mechanics. In P Dallos et al., editor, The Mechanics and Biophysics of Hearing, pages 395–420. Springer-Verlag, 1990. [9] Richard F. Lyon, Martin Rehn, Samy Bengio, Thomas C. Walters, and Gal Chechik. Sound retrieval and ranking using auditory sparsecode representations. Neural Computation, 22, 2010. [10] Bruno A. Olshausen and David J. Field. Sparse coding of sensory inputs. Current opinion in neurobiology, 14(4):481–487, 2004. [11] R. D. Patterson. Auditory images: how complex sounds are represented in the auditory system. The Journal of the Acoustical Society of America, 21:183–190, 2000. 1920 Book title goes here [12] Martin Rehn, Richard F. Lyon, Samy Bengio, Thomas C. Walters, and Gal Chechik. Sound ranking using auditory sparse-code representations. ICML 2009 Workshop on Sparse Methods for Music Audio, 2009. [13] G. Tzanetakis. Marsyas-0.2: A case study in implementing music information retrieval systems, chapter 2, pages 31–49. Intelligent Music Information Systems: Tools and Methodologies. Information Science Reference, 2008. Shen, Shepherd, Cui, Liu (eds). 
Extracting Patterns from Location History Andrew Kirmse Google Inc Mountain View, California akirmse@google.com Tushar Udeshi Google Inc Boulder, Colorado tudeshi@google.com Jim Shuma Google Inc Mountain View, California jshuma@google.com Pablo Bellver Google Inc Mountain View, California pablob@google.com ABSTRACT In this paper, we describe how a user's location history (recorded by tracking the user's mobile device location with his permission) is used to extract the user's location patterns. We describe how we compute the user's commonly visited places (including home and work), and commute patterns. The analysis is displayed on the Google Latitude history dashboard [7] which is only accessible to the user. Categories and Subject Descriptors D.0 [General]: Location based services. General Terms Algorithms. Keywords Location history analysis, commute analysis. 1. INTRODUCTION Location-based services have been gaining in popularity. Most services[4,5] utilize a “check-in” model where a user takes some action on the phone to announce that he has reached a particular place. He can then advertise this to his friends and also to the business owner who might give him some loyalty points. Google Latitude [6] utilizes a more passive model. The mobile device periodically sends his location to a server which shares it with his registered friends. The user also has the option of opting into latitude location history. This allows Google to store the user's location history. This history is analyzed and displayed for the user on a dashboard [7]. A user's location history can be used to provide several useful services. We can cluster the points to determine where he frequents and how much time he spends at each place. We can determine the common routes the user drives on, for instance, his daily commute to work. This analysis can be used to provide useful services to the user. For instance, one can use real-time traffic services to alert the user when there is traffic on the route he is expected to take and suggest an alternate route. We expect many more useful services to arise from location history. It is important to note that a user's location history is stored only if he explicitly opts into this feature. However, once signed in, he can get several useful services without any additional work on his part (like checking in). 2. PREVIOUS WORK Much previous work assumes clean location data sampled at very high frequency. Ashbrook and Starner [2] cluster a user's significant locations from GPS traces by identifying locations where the GPS signal reappears after an absence of 10 minutes or longer. This approach is unable to identify important outdoor places and is also susceptible to spurious GPS signal loss (e.g. in urban canyons or when the recording device is off). In addition they use a Markov model to predict where the user is likely to go next from where he is. Liao, et al [11] attempt to segment a user's day into everyday activities such as “working”, “sleeping” etc. using a hierarchical activity model. Both these papers obtain one GPS reading per second. This is impractical with today's mobile devices due to battery usage. Kang et al [10] use time-based clustering of locations obtained using a “Place Lab Client” to infer the user's important locations. The “Place lab client” infers locations by listening to RF-emissions from known wi-fi access points. This requires less power than GPS. However, their clustering algorithm assumes a continuous trace of one sample per second. 
Real-world data is not so reliable and often has missing and noisy data as illustrated in Section 3.2. Ananthanarayanan et al [1] describe a technique to infer a user's driving route. They also match users having similar routes to suggest carpool partners. Liao et al [12] use a hierarchical Markov Model to infer a user's transportation patterns including different modes of transportation (e.g. bus, on foot, car etc.). Both these papers use clean regularly-sampled GPS traces as input. 3. LOCATION ANALYSIS 3.1 Input Data For every user, we have a list of timestamped points. Each point has a geolocation (latitude and longitude), an accuracy radius and an input source: 17% of our data points are from GPS and these have an accuracy in the 10 meter range. Points derived from wifi signatures have an accuracy in the 100 meter range and represent 57% of our data. The remaining 26% of our points are derived Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SIGSPATIAL GIS '11, November 1-4, 2011. Chicago, IL, USA Copyright © 2011 ACM ISBN 978-1-4503-1031-4/11/11...$10.00from cell tower triangulation and these have an accuracy in the 1000 meter range. We have a test database of location history of around a thousand users. We used this database to generate the data in this paper. 3.2 Location Filtering The raw geolocations reported from mobile devices may contain errors beyond the measurable uncertainties inherent in the collection method. Hardware or software bugs in the mobile device may induce spurious readings, or variations in signal strength or terrain may cause a phone to connect to a cell tower that is not the one physically closest to the device. Given a stream of input locations, we apply the following filters to account for these errors: 1. Reject any points that fall outside the boundaries of international time zones over land. While this discards some legitimate points over water (presumably collected via GPS), in practice it removes many more false readings. 2. Reject any points with timestamps before the known public launch of the collection software. 3. Identify cases of “jitter”, where the reported location jumps to a distant point and soon returns. As shown in Figure 1, this is surprisingly common. We look for a sequence of consecutive locations {P1, P2, …, Pn} where the following conditions hold: ◦ P1 and Pn are within a small distance threshold D of each other. ◦ P1 and Pn have timestamps within a few hours of each other. ◦ P1 and Pn have high reported accuracy. ◦ P2, …, Pn-1 have low reported accuracy. ◦ P2, …, Pn-1 are farther than D from P1. In such a case, we conclude that the points P2, …, Pn-1 are due to jitter, and discard them. 4. If a pair of consecutive points implies a non-physical velocity, reject the later one. Any points that are filtered are discarded, and are not used in the remaining algorithms described in this paper. 3.3 Computing Frequently Visited Places In this section, we describe the algorithms we use to compute places frequented by a user from his location history. We first filter out the points for which the user is stationary i.e. moving at a very low velocity. 
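As an illustrative sketch of the velocity-based checks just described (the non-physical-velocity filter of Section 3.2 and the stationary-point test), with placeholder thresholds rather than the production values:

    import math

    EARTH_RADIUS_M = 6_371_000

    def haversine_m(p, q):
        # Great-circle distance in meters between two points given as
        # (lat, lon, ...) tuples with degrees in the first two slots.
        lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

    def speed_mps(p, q):
        # p, q: (lat, lon, timestamp_seconds); speed between consecutive points.
        return haversine_m(p, q) / max(q[2] - p[2], 1)

    def drop_nonphysical(points, max_speed_mps=350.0):
        # Reject the later point of any consecutive pair that implies an
        # impossible velocity (the threshold here is an illustrative placeholder).
        kept = [points[0]]
        for q in points[1:]:
            if speed_mps(kept[-1], q) <= max_speed_mps:
                kept.append(q)
        return kept

    def stationary_points(points, max_speed_mps=0.5):
        # Keep points whose displacement from the previous point implies a
        # very low velocity (again, an illustrative threshold).
        return [q for p, q in zip(points, points[1:])
                if speed_mps(p, q) <= max_speed_mps]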
These stationary points need to be clustered to extract interesting locations.

3.3.1 Clustering Stationary Points
We use two different algorithms for clustering stationary points.

3.3.1.1 Leader-based Clustering
For every point, we determine whether it belongs to any of the already generated clusters by computing its distance to the cluster leader's location. If this distance is below a threshold radius, the point is added to the cluster. Pseudocode is shown in Figure 2. This algorithm is simple and efficient: it runs in O(NC), where N is the number of points and C is the number of clusters. However, the output clusters depend on the order of the input points. For example, consider 3 points P1, P2 and P3 which lie on a straight line with a distance of radius between them, as shown in Figure 3. If the input points are ordered {P1, P2, P3}, we would get 2 clusters: {P1} and {P2, P3}. But if they are ordered {P2, P1, P3}, we would get only 1 cluster containing all 3 points.

Figure 1. A set of reported locations exhibiting "jitter". One of the authors was actually stationary during the time interval represented by these points.

Figure 2. Leader-based Clustering Algorithm.
    points = input points
    clusters = []
    foreach p in points:
        foreach c in clusters:
            if distance(c.leader(), p) < radius:
                add p to c
                break
        else:
            create a new cluster c with p as leader
            clusters.add(c)

Figure 3. Three equidistant points (P1, P2, P3) on a line, separated by a distance of radius.

3.3.1.2 Mean Shift Clustering
Mean shift [3] is an iterative procedure that moves every point to the average of the data points in its neighborhood. The iterations are stopped when all points move less than a threshold. This algorithm is guaranteed to converge. We use a weighted average to compute the move position, where the weights are inversely proportional to the accuracy radius. This causes the points to gravitate towards high-accuracy points. Once the iterations converge, the moved locations of the points are chosen as cluster centers. All input points within a threshold radius of a cluster center are added to its cluster. We revert to leader-based clustering if the iterations do not converge or some of the input points remain unclustered. Pseudocode is shown in Figure 4.

This algorithm does not suffer from the input-order dependency of leader-based clustering. For the input point set of Figure 3, it will always return a cluster comprising all 3 points. It also generates a smaller number of better-located clusters compared to leader-based clustering. For example, consider 4 points on the vertices of a square with a diagonal of 2*radius, as shown in Figure 5. Leader-based clustering would generate 4 clusters, one per point. Mean-shift clustering would return only 1 cluster whose centroid is at the center of the square.

The iterative nature of this algorithm makes it expensive. We therefore limit the maximum number of iterations and revert to leader-based clustering if the algorithm does not converge quickly enough. When we ran mean-shift clustering on our test database, the algorithm converged in 2.4 iterations on average. 3% of the input points could not be clustered (i.e., we had to revert to leader-based clustering for them). However, this did not cause a significant reduction in the number of computed clusters (< 1%). We concluded that the marginal improvement in quality did not justify the increased computational cost.

3.3.1.3 Adaptive Radius Clustering
The two clustering algorithms described above return clusters of the input points.
One possibility would be to deem clusters larger than a threshold as interesting locaitons. However, this is not ideal since the input points have varying accuracy. For instance, if there are three stationary GPS points within close proximity of each other, we have high confidence that the user visited that place as opposed to three stationary cell tower points. We run the clustering algorithms multiple times, increasing the radius as well as the minimum cluster size after every iteration. When a cluster is generated, we check to see if it overlaps an already computed cluster (generated from a smaller radius). If that is the case, we merge it into the larger cluster. Note that adaptive radius clustering can be used in conjunction with any clustering algorithm. From our test database, we found that adaptively increasing the clustering radius from 20 meters to 500 meters and the minimum cluster size from 2 to 4, increased the number of computed visited places by 81% as compared to clustering with a fixed radius of 500 meters and a minimum cluster size of 4. We also surveyed users and found that the majority of the new visited places generated were correct and useful enough to display on the Latitude history dashboard [7]. 3.3.2 Computing Home and Work Locations We use a simple heuristic to determine the user's home and work locations. A user is likely to be at home at night. We filter out the user's points which occur at night and cluster them. The largest cluster is deemed the user's home location. Similarly, work location is derived by clustering points which occur on weekdays in the middle of the day and clustering them. Note that this heuristic will not work for users with non-standard schedules (e.g. work at night or work in multiple locations). Such users have the option of correcting their home and work location on the Latitude history dashboard [7]. These updated locations will be used for other analyses (e.g. commute analysis described in Section 3.4). 3.3.3 Computing Visited Places We do some additional filtering of the input points before clustering for visited places: 1. We remove points which are within a threshold distance of home and work locations. 2. We remove points which are on the user's commute between home and work. These points are determined using the algorithm described in Section 3.4.1. 3. We remove points near airports since these are reported as flights as described in Section 3.5. Without these filters, we get spurious visited places. Even when a user is stationary at home or work, the location reported can jump around, as described in Section 3.2. Without the first filter, we would get multiple visited places near home and work. If a user regularly stops at a long traffic signal on his commute to work, it has a good chance of being clustered to a visited place. This is why we need the second filter. 3.4 Commute Analysis We can deduce a user's driving commute patterns from his location history. The main challenge here is that points are reported infrequently and we have to derive the path the user has taken in between these points. Also, the accuracy of the points can be very low and so one needs to snap the points to the road he is likely to be on. The commutes are analyzed in three steps: (1) Extract sets of commute points from the input. (2) Fit the commute points to a road path. (3) Cluster paths together spatially and (optionally) temporally to generate the most common commutes taken by the user. 
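Before turning to the individual commute steps, the adaptive-radius scheme of Section 3.3.1.3 can be sketched as follows (an illustrative sketch: cluster_fn and distance_fn stand in for the clustering routine and geodesic distance; the 20 m/size-2 and 500 m/size-4 endpoints come from the text, while the intermediate step and the merge details are assumptions):

    def adaptive_radius_clustering(points, cluster_fn, distance_fn,
                                   schedule=((20, 2), (100, 3), (500, 4))):
        # cluster_fn(points, radius) -> list of clusters (lists of (lat, lon));
        # distance_fn(a, b) -> distance in meters between two (lat, lon) points.
        # The (100 m, 3) middle step is an illustrative guess.
        accepted = []  # {'center': (lat, lon), 'radius': m, 'members': [...]}
        for radius, min_size in schedule:
            for members in cluster_fn(points, radius):
                if len(members) < min_size:
                    continue
                center = (sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members))
                overlap = next((c for c in accepted
                                if distance_fn(center, c['center']) < radius), None)
                if overlap:
                    # Merge with the cluster already found at a smaller radius
                    # (merge direction is our reading of the text).
                    overlap['members'].extend(members)
                    overlap['radius'] = radius
                else:
                    accepted.append({'center': center, 'radius': radius,
                                     'members': list(members)})
        return accepted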
3.4.1 Extracting Commute Points
Given a source and destination location (e.g. home and work), we extract the points from the user's location history which likely occurred on the user's driving commute from source to destination. We first filter out all points with low accuracy from the input set. We then find pairs of source-destination points. All points between a pair are candidate points for a single commute. The input points are noisy, and therefore we do some sanity checks on the commute candidate points: (1) the commute distance should be reasonable; (2) the commute duration should be reasonable; (3) the commute should be at a reasonable driving velocity.

Figure 4. Mean-Shift Clustering Algorithm.
    points = input points
    clusters = []
    foreach p in points:
        compute Weight(p)
    cluster_centers = points
    repeat until every cluster center moves less than shift_threshold:
        foreach p in cluster_centers:
            find all points np which are within a threshold distance of p
            p = ( Σ_i Weight(np_i) × np_i ) / ( Σ_i Weight(np_i) )
    while not cluster_centers.empty():
        choose p with highest accuracy from cluster_centers
        find all points rp in points which are within radius of p
        create a new cluster c with rp
        clusters.add(c)
        points = points − rp
        cluster_centers = cluster_centers − Moved(rp)
    cluster remaining points with leader-based clustering

Figure 5. Four points on the vertices of a square with a diagonal of 2*radius.

3.4.2 Fitting a Road Path to Commute Points
We use the routing engine used to compute driving directions in Google Maps [8] to fit a path to the commute points. For the rest of this paper, we will refer to this routing engine as "Pathfinder". This is an iterative algorithm. We first query Pathfinder for the route between source and destination. If all the commute points are within the accuracy threshold distance (used in Section 3.4.1), we terminate and return this path. If not, we add the point which is furthest away from the path as a waypoint and query Pathfinder again. If Pathfinder fails, we assume that the waypoint is not valid (for example, it might be in water) and drop it. We continue iterating until all commute points are within the accuracy distance threshold. Pseudocode is shown in Figure 6. Two iterations of this algorithm are shown in Figure 7. The small blue markers are the points from the user's history. The large green markers are the Pathfinder waypoints used to generate the path. After the second iteration, all the user points are within the accuracy threshold.

3.4.3 Clustering Commutes
Given a bag of commute paths and the time intervals in which the commutes occurred, we cluster them to determine the most frequent commutes. Two commutes are deemed temporally close if they start and end within some threshold of each other on the same day of the week. Two commutes are deemed spatially close if their Hausdorff distance [9] is within a threshold. We use a variant of leader-based clustering (described in Section 3.3.1.1) to generate commute clusters. The largest commute cluster on a particular day of the week is the most common route taken by the user. This can be used for traffic alerts as described in Section 4.1.

4. ACKNOWLEDGMENTS
Our thanks to Matthieu Devin, Max Braun, Jesse Rosenstock, Will Robinson, Dale Hawkins, Jim Guggemos, and Baris Gultekin for contributing to this project.

5. REFERENCES
[1] Ananthanarayanan, G., Haridasan, M., Mohomed, I., Terry, D., and Thekkath, C. A. 2009. StarTrack: A Framework for Enabling Track-Based Applications. Proceedings of the 7th International Conference on Mobile Systems, Applications, and Services.
[2] Ashbrook, D., and Starner, T. 2003. Using GPS to Learn Significant Locations and Predict Movement across Multiple Users. Personal and Ubiquitous Computing, Vol. 7, Issue 5.
[3] Cheng, Y. C. 1995. Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 8.
[4] Facebook Places. http://www.facebook.com/places.
[5] Foursquare. http://www.foursquare.com.
[6] Google Latitude. http://www.google.com/latitude.
[7] Google Latitude History Dashboard. http://www.google.com/latitude/history/dashboard.
[8] Google Maps. http://maps.google.com.
[9] Huttenlocher, D. P., Klanderman, G. A., and Rucklidge, W. J. 1993. Comparing Images Using the Hausdorff Distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 9.
[10] Kang, J. H., Welbourne, W., Stewart, B., and Borriello, G. 2005. Extracting Places from Traces of Locations. SIGMOBILE Mob. Comput. Commun. Rev. 9, 3.
[11] Liao, L., Fox, D., and Kautz, H. 2007. Extracting Places and Activities from GPS Traces Using Hierarchical Conditional Random Fields. International Journal of Robotics Research.
[12] Liao, L., Patterson, D. J., Fox, D., and Kautz, H. 2007. Learning and Inferring Transportation Routines. Artificial Intelligence, Vol. 171, Issues 5-6.

Figure 6. Algorithm to fit a path to the commute points.
    points = input points
    waypoints = [source, destination]
    current_path = Pathfinder route for waypoints
    while points.size() > 2:
        let p be the point in points farthest away from current_path
        if distance(p, current_path) < threshold:
            record current_path as a commute
            break
        add p to waypoints in the correct position
        current_path = Pathfinder route for waypoints
        if Pathfinder fails:
            erase p from waypoints
        erase p from points

Figure 7. Two iterations (iteration 1 and iteration 2) of the path fitting algorithm. The small markers are the user's points. The large markers are the Pathfinder waypoints used to generate the path.

Performance Trade-offs Implementing Refactoring Support for Objective-C

Robert Bowdidge*
rbowdidge@mac.com

* This work was performed while the author was at Apple, and discusses the initial implementation of refactoring for Xcode 3.0. The author is currently at Google.

Abstract
When we started implementing a refactoring tool for real-world C programs, we recognized that preprocessing and parsing in straightforward and accurate ways would result in unacceptably slow analysis times and an overly complicated parsing system. Instead, we traded some accuracy so we could parse, analyze, and change large, real programs while still making the refactoring experience feel interactive and fast. Our tradeoffs fell into three categories: using different levels of accuracy in different parts of the analysis, recognizing that collected wisdom about C programs didn't hold for Objective-C programs, and finding ways to exploit delays in typical interaction with the tool.

Categories and Subject Descriptors: D.2.6 [Software Engineering]: Programming Environments
General Terms: Design, Language
Keywords: refactoring, case study, scalability, Objective-C

1. Introduction
Taking software engineering tools from research to development requires addressing the practical details of software development: huge amounts of source code, the nuances of real languages, and multiple build configurations. Making tools useful for real programmers requires either addressing all these sorts of issues, or accepting various trade-offs in order to ship a reasonable software tool. In our case, we wanted to add refactoring to Apple's Xcode IDE (integrated development environment). The refactoring feature would manipulate programs written in Objective-C. Objective-C is an object-oriented extension to C, and Apple's primary development language [1].
In past research [2], I’d found it acceptable to take multiple minutes to perform a transformation on a small Scheme program. The critical requirements for our commercial tool were quite different: • Support the most common and useful transformations. Renaming declarations, replacing a block of code with a call to a new function, and moving declarations up and down a class hierarchy were mandatory features. • Refactor 200,000 line programs. The feature had to work on real, medium-sized applications. The actual amount of code to parse was much larger than the program’s size. Most Mac OS X compilation units pull in headers for common system libraries, requiring at least another 60-120,000 lines of code that would need to be parsed for every compilation unit. Such large sets of headers are not unique to Mac OS X. C programs using large libraries like the Qt user interface library would encounter similar scalability issues. • Interactive behavior. Xcode’s refactoring feature would be part of the source code editor. Users will expect transformations to complete in seconds rather than minutes, and the whole experience would need to feel interactive [3]. Parsing and analyzing programs of this size in straightforward ways would result in an unacceptable user experience. In one of my first experiences with a similar product, renaming a declaration in a 4,200 line C program (with the previously-mentioned 60,000 lines of headers) took two minutes. • Don't force the user to change the program in order to refactor. The competing product previously mentioned could provide much more acceptable performance if the user specified a pre-compiled header—a single header included by all compilation units. However, converting a large existing project to use a pre-compiled header is not a trivial task, and the additional and hidden setup step discourages new users. • Be aware of use of C's preprocessor. The programs being manipulated would make common use of preprocessor macros and conditionally compiled code. If we did not fully address how the preprocessor affected refactoring, we would at least need to be aware of the potential issues. • Reuse existing parsing infrastructure. We realized there wasn’t sufficient time or resources to write a new parser from scratch. Analysis would need to be done by an existing Objective C parser used for indexing global declarations. Refactoring had to work best for our third-party developers—primarily developers writing GUI applications. It should also work well for developers within Apple, but not for those writing low-level operating system or device driver code. Performance and interactivity were key—we wanted to provide an excellent refactoring experience. In order to meet these performance and interactivity goals, we attacked three areas: using different levels of accuracy in different parts of the tool, recognizing differences between our target programmers and typical C programmers, and finding ways to exploit delays in the user’s interaction with the tool. 2. Different Levels of Accuracy In C, each source file is preprocessed and compiled independently as a “compilation unit”. 
Each can include different headers, or can include the same headers with different inclusion order or initial macro settings. As a result, each compilation unit may interpret the same headers in different ways, and may parse different declarations in those same headers. For correct parsing, the compiler needs to compile every source file independently, read in header files anew each time, and fully parse all headers. For small programs this may not matter, but with Mac OS X, each source file includes between 60,000 and 120,000 lines of code from header files. Precompiled headers and other optimizations could speed compile times, but not all developers use precompiled headers, nor could we demand that developers use such schemes in order to use refactoring. Naively parsing all source code was not acceptable; we saw parse times of around five seconds for a typical set of headers, so a five-second minimum per file per build configuration would be completely unacceptable.

We realized two facts about programs that made us question whether we needed compilation-unit-level accuracy. We realized that although programmers have the opportunity for header files to be interpreted differently in each compilation unit, most programmers intend for the headers to be processed the same way in all compilation units. (When header files are not processed uniformly, it can cause subtle, nasty bugs that can take days to track down.) We also realized that system header files are not really part of the project, and are not targets for refactoring. We needed to correctly parse system header files merely for their information on types and external function declarations. For most refactoring operations, we didn't care whether the my_integer_t type was 4 bytes long or 8; we just needed to know that the name referred to a type. We also knew that correct refactoring transformations shouldn't change the write-protected system header files.

We thus made two assumptions about the headers we parsed. First, we decided to parse each header file at most once, and to assume that the files were interpreted the same way in each compilation unit. This meant that we could shorten parsing times from at least five seconds per file to five seconds total (for all system header files), plus the additional time to parse only the source files and headers in the project. Second, we gathered less position information for system header files. We knew that changes in system header files were both incorrect (because we couldn't change the existing code in libraries) and uninteresting (because we couldn't change all other clients of the header file). We gathered less exact position information for such files, and would flag errors if a transformation would change code in a system header file.
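The parse-each-header-at-most-once decision amounts to memoizing header parses across compilation units; a toy sketch of the idea (hypothetical names, not Xcode's actual code):

    class HeaderCache:
        # Parse each header at most once and reuse the result for every
        # compilation unit, assuming all units interpret headers identically.
        # parse_fn is a stand-in for the real header parser.
        def __init__(self, parse_fn):
            self.parse_fn = parse_fn
            self.parsed = {}  # header path -> parse result (declarations, types)

        def declarations_for(self, header_path):
            if header_path not in self.parsed:
                self.parsed[header_path] = self.parse_fn(header_path)
            return self.parsed[header_path]

The correctness of such a cache rests entirely on the assumption stated above: headers are intended to mean the same thing in every compilation unit.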
We also realized that the user interface needed information about the source code to identify whether refactoring was possible for a given selection, which transformations were possible, and what the default parameters for the transformation would be. Because we wanted the user interface to make these suggestions immediately without waiting for parsing to complete, we used saved information from the Xcode’s declaration index when helping the user propose a refactoring transformation. We did have some issues where indexer information had inaccuracies (when its less accurate parser misparsed certain constructs), but in general we found the information good enough for our first release. 3. The Typical Programmer Dealing with conditional code and multiple build configurations is another major issue for refactoring and source code analysis of C programs. We realized that many of the assumptions about C code did not hold for Objective-C programs, and changed our expectations of what we would implement. C’s preprocessor supports conditional code—code only compiled if certain macros are set. Although some conventions exist for using conditional directives, the criteria triggering a particular block of code usually can be understood only by evaluating the values of the controlling macros at the point the preprocessor would have interpreted the directive. If source code with conditional code was refactored without considering all potential conditions, syntax errors or changed behavior could be introduced. Others have proposed various solutions for handling conditional code. Garrido and Johnson expanded the conditional code to cover entire declarations, and annotated the ASTs to mark the configurations including each declaration [4]. Vittek suggested parsing only the feasible sets of configuration macros, parsing each condition separately, and merging the resulting parse trees [5]. McCloskey and Brewer proposed a new preprocessor amenable to analysis and change, with tools to migrate existing programs to the new preprocessor [6]. We instead chose to parse for a single build configuration—a single set of macros, compiler flags, and include paths. Parsing a single build configuration appeared reasonable because ObjectiveC programs use the preprocessor less than typical C programs, because occurrences of conditional code were unlikely to be refactored, and because remaining uses of conditional code were insensitive to the refactoring changes. Ernst’s survey of preprocessor use found that UNIX utilities varied in their use of preprocessor directives. He found the percentage of preprocessor directives to total non-comment, nonblank (NCNB) lines ranged between 4% and 22% [7]. By contrast, only 3-8% of lines in typical Objective-C programs were preprocessor directives. (Measurements were made on sources for the Osirix medical visualization application, Adium multiprotocol chat client, and Xcode itself.) Within those Objective-C programs, preprocessor directives and conditional code also occurred much more frequently in the code unlikely to be refactored. Many were either in third-party utility code, or in cross-platform C++ code. The utility code was often public-domain source code intended for multiple operating systems. Such code is unlikely to be refactored for fear of complicating merges of newer versions. For applications designed for multiple operating systems, often a core C++ library would be the basis of all versions, and separate user interface code would be written for each operating system. 
Because our first release would not refactor or parse C++ code, such core code would be irrelevant to refactoring. For the Objective-C portions of the projects, only 2-4% of all lines were preprocessor directives. The preprocessor directives that do appear in Objective-C code are often irrelevant to refactoring. Of Ernst’s eleven categories of conditional code, many are either unlikely to affect the target audience, or are irrelevant to refactoring in general. Include guards are less frequently used in Objective-C because a separate directive (#import) ensures a file is included only once. Conditional directives that always disabled code (“#if (0)”) can be handled in the same way comments are processed. Operating system -specific conditional code is unlikely in Objective-C code because the language is used only on Mac OS X. However, there are three problematic conditional code directives that appear in Objective-C programs: code for debugging, architecture-specific code, and conditional code for enabling and disabling features in the project. Conditional code for debugging is unlikely to be troublesome. The rename transformation will make an incorrect change if a declaration is referenced in conditional code that is not parsed. If the condition is parsed, then the conditional code is not a concern. The most dangerous case occurs when code that needs to be manipulated exists in two conditionally compiled sections of code never parsed at the same time. Luckily, most conditional code controlled by debugging macros only adds code to the debug case, and does not add code to the non-debug case. As long as we parse the program with debug macros set (which should be the default during development), then we should parse all necessary code. Architecture-specific code is more common at Apple because we support two architectures (x86 and PowerPC), both in 32 and 64 bit versions. Most of the architecture-specific conditional code is found in low level system code and device drivers. The external developers we are targeting with refactoring would be working on application software, and would be unlikely to have architecturespecific code. Project-specific features controlled by conditional compilation directives represent a larger risk. Some of these may actually be in use (such as code shared between an iPhone and Mac application), and others may represent dead code. Code may exist on both sides of a condition. For the first release, we only changed code in the current build configuration, and relied on the user to be aware of and avoid changes in project-specific conditional code. 4. Exploiting Interaction Delays A final area for optimization was deciding when parsing and refactoring work would begin during actual use. Even with our previous decisions, parsing speed still wasn’t acceptable. Our rough numbers were that we could parse all the system header files in about 5 seconds, and then could parse an additional ten files a second on a typical machine. Caching the results of the header file parsing was an obvious solution, but we weren’t sure we had the time to implement such caching. A straightforward implementation would start parsing after the user specified the transformation to be performed, and only show results when the transformation was complete. We realized we could speed perceived performance by starting parsing early, and showing partial results before the transformation completed. 
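The rough numbers above (about five seconds to parse the system headers, then roughly ten additional project files per second) give a simple back-of-envelope model of how long a transformation must wait for parsing. The constants below are the estimates quoted in the text, not measurements of any current toolchain.

```python
SYSTEM_HEADER_COST_S = 5.0   # one-time cost to parse all system headers (estimate above)
PROJECT_FILES_PER_S = 10.0   # additional project files parsed per second (estimate above)

def parse_wait_seconds(files_to_parse, headers_cached=False):
    """Estimated parsing time before a transformation can complete."""
    fixed = 0.0 if headers_cached else SYSTEM_HEADER_COST_S
    return fixed + files_to_parse / PROJECT_FILES_PER_S

# A rename touching 123 files: about 5 + 12.3 = 17.3 seconds of parsing,
# which is why starting to parse while the user types the new name matters.
print(parse_wait_seconds(123))
```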
4.1 Optimistically Starting Parsing It usually takes a few seconds for a programmer to specify a refactoring transformation. Even for the simple rename, the user needs to indicate that he wants to rename a declaration, then needs to type in the new name. For “extract function”, the additional choices for parameter name and order requires additional time. To improve perceived performance, we began parsing the currently active file and header files as soon as the programmer had selected the “refactor” menu item. For refactoring transformations that only affected a single file, this often meant that as soon as the user specified the parameters for refactoring, the parsing had already been completed, and the transformation would be ready immediately. 4.2 Showing Partial Results When performing transformations changing multiple files, we similarly exploited how programmers would interact with the refactoring tool. We knew that most programmers beginning to use refactoring might want to examine the changes being made to double-check that the transformation was correct. If we assumed that most transformations would be successful (because the programmer was unlikely to try a transformation they thought would break their code), then we could begin showing partial results immediately rather than waiting for the entire transformation to be complete and validated to be safe. Most descriptions of refactoring break each transformation into two parts: the pre-conditions (which indicate the requirements that must be met before a transformation may be performed) and the changes to the source code (which are only performed after the change is believed safe. [8]) Because parse times are liable to be longer than a few seconds, the “check, then perform” approach would not have been interactive. The user would have to wait until all source code was parsed and all refactoring complete before examining any results. Similarly, parse trees for all functions would need to be generated before any refactoring work could begin. If the project being manipulated was particularly large, then the parse trees could consume huge amounts of memory. To make refactoring more palatable on large projects, we designed our transformations to work in several phases so that changes could be presented shown after only some of the code had been parsed and portions of the transformation performed. (See Figure 1). We also could dispose of some parse trees as soon as that code has been analyzed. The seven phases for our transformations are: • check user input: precondition checks that could be done with the initial inputs to the transformation only. • check first file: precondition checks to do after the file containing the selection is parsed. Generally, the analysis performed in this phase only performs initial sanity checks requiring parse trees. For the rename transformation, the phase checks that the declaration can be renamed, if the name is a valid C identifier, and if the declaration is not in a system header file. • perform first file: apply any changes that can be determined after the first file is parsed. Few transformations do work in this phase. • check per-file: precondition checks to do after parsing each compilation unit. • perform per-file: changes to apply after parsing each compilation unit. Most transformations do the bulk of their work in the per-file category. The check and perform parts both look at newly found uses of relevant declarations, and make appropriate changes. 
Each transformation specifies if the memory for parsed representations of function bodies can be freed before beginning the next file. • check final: precondition checks to do after parsing all files. The after-parsing checks tend to involve existence tests or nonexistence tests—whether any situations exist that indicate the transformation is unsafe, such as "did we ever see any declarations with this name already?" Some of these checks could be done incrementally as each file is parsed. • perform final: changes to apply after parsing all files. The perform final phase is typically used for edits that cannot be constructed until all sources have been parsed. For example, when converting references to a structure's field to call getter or setter functions, the transformation needs to determine where to place the new accessor functions. The accessors need to be placed in a source file (rather than a header), preferably near existing references to the field or the definition of the structure. Typically, the transformation can place the functions as soon as a likely location is found. If no appropriate location for the new code is found in any source file, the perform final phase chooses an arbitrary location.

[Figure 1: Order of processing of an interleaved refactoring transformation on three source files a.c, b.c, and c.c. Results of the transformation are incrementally updated after each perform phase is complete.]

By breaking up each transformation in this way, the user experience of refactoring becomes more interactive. The refactoring user interface can show the list of files which must be parsed for a transformation. As each file is parsed and changes are identified, the user interface indicates completion and notes the number of changes in that file. Selecting the filename shows a side-by-side view of the source before and after the change. As the transformation progresses, more files and edits are displayed. The user can examine proposed changes as soon as each file is processed. While examining the changes, the user can also choose not to include some changes, or can make additional edits to the changed source code. In this way, the user can both measure progress and can be working productively as the transformation progresses. The interleaved transformation approach has the risk of declaring a transformation unsafe after the user has already examined some changes. This turned out not to be a problem in actual use. Programmers weren't bothered by the delayed negative answer. We also found very few cases where we could outright refuse to do a transformation. We might warn that the result is incorrect, but we found programmers often wanted the chance to apply those incorrect changes and then fix remaining problems with straight edits. 5. Conclusions Overall, our progress on refactoring matched the effort described for similar projects. Our first prototype was completed in three months by one person, and our first release required two years and three people. We found the transformations tended to be easy to write. Most of our parsing effort focused on scalability: getting parsing performance and memory use low, and making sure it worked well inside the IDE.
We also found that implementing a polished user interface took the majority of the overall effort, with two of the engineers working full time on refactoring workflow and on making the file comparison view as polished as possible. With the trade-offs described here, we met our performance goals. Our goal at the beginning of the project was to permit refactoring on 200,000 line projects, and be able to rename a frequently-referenced declaration within 30 seconds. On a 2.2 GHz Dual Xeon PowerMac with 1 GB of memory, we renamed declarations in a 270,000 line Objective-C project. We found we could rename a class referenced in 382 places through 123 files in 28 seconds. We could rename a class used in 65 files in 15 seconds. Operations involving only a single file took around 8 seconds; this time was irrespective of the source file because parsing the headers dominated. Most transformations only require parsing a small subset of source files in a project. However, one of the transformations searches all code for iterators that can be converted to use a new language feature. Parsing the entire 270,000 line project for this transformation takes around 90 seconds. This is not acceptable for the interactive transformations, but is adequate for an infrequently run transformation that changes all source files. The refactoring feature as described shipped as part of Xcode 3.0 and Mac OS X 10.5. Building software development tools in industry requires making tradeoffs in both requirements and design. Some are driven by the expected needs of users such as the size of programs to be refactored, or response times expected. Some are driven by scalability issues such as whether to save pre-processed header files in the IDE between refactorings, or whether to re-parse headers from scratch each time. Other tradeoffs occur for business, timing, or staffing reasons, affecting whether a feature might even be implemented, or whether a new parser is written from scratch. As described in this paper, our requirements strongly affected what we could and did implement. The particular tradeoffs we made may not appear to be the "right" or "perfect" decision in all cases, but they are representative of the sorts of decisions that must be made during the process of commercial development. Our three themes of trade-offs—identifying where different levels of accuracy were acceptable, recognizing differences between "our typical user" and "a typical user", and exploiting delays in user interaction to improve responsiveness—suggest ways that other tools can meet their own goals. Acknowledgements Thanks to Michael Van De Vanter and Todd Fernandez for their feedback on a previous version of this paper. Dave Payne originally suggested applying the transformations file-by-file. Andrew Pontious and Yuji Akimoto implemented the refactoring user interface, and kept us focused on an interactive experience. Our approach for incrementally showing refactoring results is also described in U.S. Patent Application 20080052684, “ Stepwise source code refactoring”. References [1] Apple, "Apple Developer Documentation: Objective-C Programming Language," Cupertino, CA 2007. [2] R. W. Bowdidge and W. G. Griswold, "Supporting the Restructuring of Data Abstractions through Manipulation of a Program Visualization," ACM Transactions on Software Engineering and Methodology, vol. 7(2), 1998. [3] D. Bäumer, E. Gamma, and A. Kiezun, "Integrating refactoring support into a Java development tool," in OOPSLA 2001 Companion, 2001. [4] A. Garrido and R. 
Johnson, "Analyzing Multiple Configurations of a C Program," in 21st IEEE International Conference on Software Maintenance (ICSM), 2005. [5] M. Vittek, "Refactoring Browser with Preprocessor," in 7th European Conference on Software Maintenance and Reengineering, Benevento, Italy, 2003. [6] B. McCloskey and E. Brewer, "ASTEC: a new approach to refactoring C," in 13th ACM SIGSOFT international symposium on Foundations of Software Engineering ESEC/FSE-13, 2005. [7] M. D. Ernst, G. J. Badros, and D. Notkin, "An Empirical Analysis of C Preprocessor Use," IEEE Transactions on Software Engineering, vol. 28, pp. 1146-1170, December 2002. [8] W. F. Opdyke, "Refactoring: A Program Restructuring Aid in Designing Object-Oriented Application Frameworks," University of Illinois, Urbana-Champaign, 1991. Linear-Space Computation of the Edit-Distance between a String and a Finite Automaton Cyril Allauzen1 and Mehryar Mohri2,1 1 Google Research 76 Ninth Avenue, New York, NY 10011, US. 2 Courant Institute of Mathematical Sciences 251 Mercer Street, New York, NY 10012, US. Abstract. The problem of computing the edit-distance between a string and a finite automaton arises in a variety of applications in computational biology, text processing, and speech recognition. This paper presents linear-space algorithms for computing the edit-distance between a string and an arbitrary weighted automaton over the tropical semiring, or an unambiguous weighted automaton over an arbitrary semiring. It also gives an efficient linear-space algorithm for finding an optimal alignment of a string and such a weighted automaton. 1 Introduction The problem of computing the edit-distance between a string and a finite automaton arises in a variety of applications in computational biology, text processing, and speech recognition [8, 10, 18, 21, 14]. This may be to compute the edit-distance between a protein sequence and a family of protein sequences compactly represented by a finite automaton [8, 10, 21], or to compute the error rate of a word lattice output by a speech recognition with respect to a reference transcription [14]. A word lattice is a weighted automaton, thus this further motivates the need for computing the edit-distance between a string and a weighted automaton. In all these cases, an optimal alignment is also typically sought. In computational biology, this may be to infer the function and various properties of the original protein sequence from the one it is best aligned with. In speech recognition, this determines the best transcription hypothesis contained in the lattice. This paper presents linear-space algorithms for computing the edit-distance between a string and an arbitrary weighted automaton over the tropical semiring, or an unambiguous weighted automaton over an arbitrary semiring. It also gives an efficient linear-space algorithm for finding an optimal alignment of a string and such a weighted automaton. Our linear-space algorithms are obtained by using the same generic shortest-distance algorithm but by carefully defining different queue disciplines. More precisely, our meta-queue disciplines are derived in the same way from an underling queue discipline defined over states with the same level.The connection between the edit-distance and the shortest distance in a directed graph was made very early on (see [10, 4–6] for a survey of string algorithms). 
This paper revisits some of these algorithms and shows that they are all special instances of the same generic shortest-distance algorithm using different queue disciplines. We also show that the linear-space algorithms all correspond to using the same meta-queue discipline using different underlying queues. Our approach thus provides a better understanding of these classical algorithms and makes it possible to easily generalize them, in particular to weighted automata. The first algorithm to compute the edit-distance between a string x and a finite automaton A as well as their alignment was due to Wagner [25] (see also [26]). Its time complexity was in O(|x||A| 2 Q) and its space complexity in O(|A| 2 Q|Σ| + |x||A|Q), where Σ denotes the alphabet and |A|Q the number of states of A. Sankoff and Kruskal [23] pointed out that the time and space complexity O(|x||A|) can be achieved when the automaton A is acyclic. Myers and Miller [17] significantly improved on previous results. They showed that when A is acyclic or when it is a Thompson automaton, that is an automaton obtained from a regular expression using Thompson’s construction [24], the edit-distance between x and A can be computed in O(|x||A|) time and O(|x|+|A|) space. They also showed, using a technique due to Hirschberg [11], that the optimal alignment between x and A can be obtained in O(|x| + |A|) space, and in O(|x||A|) time if A is acyclic, and in O(|x||A| log |x|) time when A is a Thompson automaton. The remainder of the paper is organized as follows. Section 2 introduces the definition of semirings, and weighted automata and transducers. In Section 3, we give a formal definition of the edit-distance between a string and a finite automaton, or a weighted automaton. Section 4 presents our linear-space algorithms, including the proof of their space and time complexity and a discussion of an improvement of the time complexity for automata with some favorable graph structure property. 2 Preliminaries This section gives the standard definition and specifies the notation used for weighted transducers and automata which we use in our computation of the edit-distance. Finite-state transducers are finite automata [20] in which each transition is augmented with an output label in addition to the familiar input label [2, 9]. Output labels are concatenated along a path to form an output sequence and similarly input labels define an input sequence. Weighted transducers are finitestate transducers in which each transition carries some weight in addition to the input and output labels [22, 12]. Similarly, weighted automata are finite automata in which each transition carries some weight in addition to the input label. A path from an initial state to a final state is called an accepting path. A weighted transducer or weighted automaton is said to be unambiguous if it admits no two accepting paths with the same input sequence.The weights are elements of a semiring (K, ⊕, ⊗, 0, 1), that is a ring that may lack negation [12]. Some familiar semirings are the tropical semiring (R+ ∪ {∞}, min, +, ∞, 0) and the probability semiring (R+ ∪ {∞}, +, ×, 0, 1), where R+ denotes the set of non-negative real numbers. In the following, we will only consider weighted automata and transducers over the tropical semiring. However, all the results of section 4.2 hold for an unambiguous weighted automaton A over an arbitrary semiring. The following gives a formal definition of weighted transducers. Definition 1. 
A weighted finite-state transducer T over the tropical semiring (R+ ∪ {∞}, min, +, ∞, 0) is an 8-tuple T = (Σ, ∆, Q, I, F, E, λ, ρ) where Σ is the finite input alphabet of the transducer, ∆ its finite output alphabet, Q is a finite set of states, I ⊆ Q the set of initial states, F ⊆ Q the set of final states, E ⊆ Q × (Σ ∪ {ǫ}) × (∆ ∪ {ǫ}) × (R+ ∪ {∞}) × Q a finite set of transitions, λ : I → R+ ∪ {∞} the initial weight function, and ρ : F → R+ ∪ {∞} the final weight function mapping F to R+ ∪ {∞}. We define the size of T as |T| = |T|_Q + |T|_E, where |T|_Q = |Q| is the number of states and |T|_E = |E| the number of transitions of T. The weight of a path π in T is obtained by summing the weights of its constituent transitions and is denoted by w[π]. The weight of a pair of input and output strings (x, y) is obtained by taking the minimum of the weights of the paths labeled with (x, y) from an initial state to a final state. For a path π, we denote by p[π] its origin state and by n[π] its destination state. We also denote by P(I, x, y, F) the set of paths from the initial states I to the final states F labeled with input string x and output string y. The weight T(x, y) associated by T to a pair of strings (x, y) is defined by:

T(x, y) = min_{π ∈ P(I, x, y, F)} ( λ(p[π]) + w[π] + ρ[n[π]] ).   (1)

Figure 1(a) shows an example of a weighted transducer over the tropical semiring. Weighted automata can be defined as weighted transducers A with identical input and output labels, for any transition. Thus, only pairs of the form (x, x) can have a non-zero weight by A, which is why the weight associated by A to (x, x) is abusively denoted by A(x) and identified with the weight associated by A to x. Similarly, in the graph representation of weighted automata, the output (or input) label is omitted. Figure 1(b) shows an example.

[Fig. 1. (a) Example of a weighted transducer T. (b) Example of a weighted automaton A. T(aab, bba) = A(aab) = min(.1 + .2 + .6 + .8, .2 + .4 + .5 + .8). A bold circle indicates an initial state and a double-circle a final state. The final weight ρ[q] of a final state q is indicated after the slash symbol representing q.]

3 Edit-distance

We first give the definition of the edit-distance between a string and a finite automaton. Let Σ be a finite alphabet, and let Ω be defined by Ω = (Σ ∪ {ǫ}) × (Σ ∪ {ǫ}) − {(ǫ, ǫ)}. An element of Ω can be seen as a symbol edit operation: (a, ǫ) is a deletion, (ǫ, a) an insertion, and (a, b) with a ≠ b a substitution. We will denote by h the natural morphism between Ω* and Σ* × Σ* defined by h((a1, b1) ··· (an, bn)) = (a1 ··· an, b1 ··· bn). An alignment ω between two strings x and y is an element of Ω* such that h(ω) = (x, y). Let c : Ω → R+ be a function associating a non-negative cost to each edit operation. The cost of an alignment ω = ω1 ··· ωn is defined as c(ω) = Σ_{i=1}^{n} c(ωi).

Definition 2. The edit-distance d(x, y) of two strings x and y is the minimal cost of a sequence of symbol insertions, deletions or substitutions transforming one string into the other:

d(x, y) = min_{h(ω) = (x, y)} c(ω).   (2)

When c is the function defined by c(a, a) = 0 and c(a, ǫ) = c(ǫ, a) = c(a, b) = 1 for all a, b in Σ such that a ≠ b, the edit-distance is also known as the Levenshtein distance.
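With the unit-cost function c just described, Definition 2 reduces to the familiar Levenshtein recurrence between two plain strings. The sketch below covers only this special case, not the automaton-based algorithms the paper develops in Section 4.

```python
def levenshtein(x, y):
    """Edit-distance of Definition 2 with c(a, a) = 0 and all other edit costs 1."""
    prev = list(range(len(y) + 1))          # aligning the empty prefix of x costs j insertions
    for i, a in enumerate(x, start=1):
        cur = [i]                           # aligning against the empty prefix of y costs i deletions
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # deletion (a, eps)
                           cur[j - 1] + 1,              # insertion (eps, b)
                           prev[j - 1] + (a != b)))     # substitution, or free match
        prev = cur
    return prev[-1]

assert levenshtein("aba", "ab") == 1
assert levenshtein("kitten", "sitting") == 3
```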
The edit-distance d(x, A) between a string x and a finite automaton A can then be defined as

d(x, A) = min_{y ∈ L(A)} d(x, y),   (3)

where L(A) denotes the regular language accepted by A. The edit-distance d(x, A) between a string x and a weighted automaton A over the tropical semiring is defined as:

d(x, A) = min_{y ∈ Σ*} ( A(y) + d(x, y) ).   (4)

4 Algorithms

In this section, we present linear-space algorithms both for computing the edit-distance d(x, A) between an arbitrary string x and an automaton A, and an optimal alignment between x and A, that is an alignment ω such that c(ω) = d(x, A). We first briefly describe two general algorithms that we will use as subroutines.

4.1 General algorithms

Composition. The composition of two weighted transducers T1 and T2 over the tropical semiring with matching input and output alphabets Σ is a weighted transducer denoted by T1 ◦ T2 and defined by:

(T1 ◦ T2)(x, y) = min_{z ∈ Σ*} ( T1(x, z) + T2(z, y) ).   (5)

T1 ◦ T2 can be computed from T1 and T2 using the composition algorithm for weighted transducers [19, 15]. States in the composition T1 ◦ T2 are identified with pairs of a state of T1 and a state of T2. In the absence of transitions with ǫ inputs or outputs, the transitions of T1 ◦ T2 are obtained as a result of the following matching operation applied to the transitions of T1 and T2:

(q1, a, b, w1, q1′) and (q2, b, c, w2, q2′) → ((q1, q2), a, c, w1 + w2, (q1′, q2′)).   (6)

A state (q1, q2) of T1 ◦ T2 is initial (resp. final) iff q1 and q2 are initial (resp. final) and, when it is initial (resp. final), its initial (resp. final) weight is the sum of the initial (resp. final) weights of q1 and q2. In the worst case, all transitions of T1 leaving a state q1 match all those of T2 leaving state q2, thus the space and time complexity of composition is quadratic, that is O(|T1||T2|).

Shortest distance. Let A be a weighted automaton over the tropical semiring. The shortest distance from p to q is defined as

d[p, q] = min_{π ∈ P(p, q)} w[π].   (7)

It can be computed using the generic single-source shortest-distance algorithm of [13], a generalization of the classical shortest-distance algorithms. This generic shortest-distance algorithm works with an arbitrary queue discipline, that is, the order according to which elements are extracted from a queue. We shall make use of this key property in our algorithms. The pseudocode of a simplified version of the generic algorithm for the tropical semiring is given in Figure 2. The complexity of the algorithm depends on the queue discipline selected for S. Its general expression is

O( |Q| + C(A) max_{q ∈ Q} N(q) |E| + (C(I) + C(X)) Σ_{q ∈ Q} N(q) ),   (8)

where N(q) denotes the number of times state q is extracted from queue S, C(X) the cost of extracting a state from S, C(I) the cost of inserting a state in S, and C(A) the cost of an assignment. With a shortest-first queue discipline implemented using a heap, the algorithm coincides with Dijkstra's algorithm [7] and its complexity is O((|E| + |Q|) log |Q|). For an acyclic automaton and with the topological order queue discipline, the algorithm coincides with the standard linear-time (O(|Q| + |E|)) shortest-distance algorithm [3].

Shortest-Distance(A, s)
    for each p ∈ Q do
        d[p] ← ∞
    d[s] ← 0
    S ← {s}
    while S ≠ ∅ do
        q ← Head(S); Dequeue(S)
        for each e ∈ E[q] do
            if d[q] + w[e] < d[n[e]] then
                d[n[e]] ← d[q] + w[e]
                if n[e] ∉ S then
                    Enqueue(S, n[e])

Fig. 2. Pseudocode of the generic shortest-distance algorithm.
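A small Python sketch of the algorithm in Figure 2, with the queue discipline left pluggable exactly as the complexity discussion assumes. The graph encoding (a dict from state to a list of (next_state, weight) transitions) is our own choice for the example, not the paper's representation.

```python
import heapq
from collections import deque

def shortest_distance(edges, source, discipline="shortest-first"):
    """Generic single-source shortest distance over the tropical semiring.

    edges: dict mapping each state to a list of (next_state, weight) pairs.
    discipline: "shortest-first" extracts the state with the smallest tentative
    distance (as in Dijkstra's algorithm); "fifo" extracts in insertion order.
    The relaxation step is identical in both cases; only the order changes.
    """
    INF = float("inf")
    d = dict.fromkeys(edges, INF)
    d[source] = 0.0
    if discipline == "shortest-first":
        queue = [(0.0, source)]
        push = lambda q: heapq.heappush(queue, (d[q], q))
        pop = lambda: heapq.heappop(queue)[1]
    else:
        queue = deque([source])
        push = lambda q: queue.append(q)
        pop = lambda: queue.popleft()
    enqueued = {source}
    while queue:
        q = pop()
        enqueued.discard(q)
        for nxt, w in edges.get(q, []):
            if d[q] + w < d.get(nxt, INF):
                d[nxt] = d[q] + w
                if nxt not in enqueued:
                    push(nxt)
                    enqueued.add(nxt)
    return d

# 0 --0.2--> 1, 0 --0.5--> 1, 1 --0.3--> 2
print(shortest_distance({0: [(1, 0.2), (1, 0.5)], 1: [(2, 0.3)], 2: []}, 0))
```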
4.2 Edit-distance algorithms

The edit cost function c can be naturally represented by a one-state weighted transducer over the tropical semiring Tc = (Σ, Σ, {0}, {0}, {0}, Ec, 1, 1), or T in the absence of ambiguity, with each transition corresponding to an edit operation: Ec = {(0, a, b, c(a, b), 0) | (a, b) ∈ Ω}.

Lemma 1. Let A be a weighted automaton over the tropical semiring and let X be the finite automaton representing a string x. Then, the edit-distance between x and A is the shortest-distance from the initial state to a final state in the weighted transducer U = X ◦ T ◦ A.

Proof. Each transition e in T corresponds to an edit operation (i[e], o[e]) ∈ Ω, and each path π corresponds to an alignment ω between i[π] and o[π]. The cost of that alignment is, by definition of T, c(ω) = w[π]. Thus, T defines the function:

T(u, v) = min_{ω ∈ Ω*} { c(ω) : h(ω) = (u, v) } = d(u, v),   (9)

for any strings u, v in Σ*. Since A is an automaton and x is the only string accepted by X, it follows from the definition of composition that U(x, y) = T(x, y) + A(y) = d(x, y) + A(y). The shortest-distance from the initial state to a final state in U is then:

min_{π ∈ P_U(I, F)} w[π] = min_{y ∈ Σ*} min_{π ∈ P_U(I, x, y, F)} w[π] = min_{y ∈ Σ*} U(x, y)   (10)
                        = min_{y ∈ Σ*} ( d(x, y) + A(y) ) = d(x, A),   (11)

that is the edit-distance between x and A. ⊓⊔

[Fig. 3. (a) Finite automaton X representing the string x = aba. (b) Finite automaton A. (c) Edit transducer T over the alphabet {a, b} where the cost of any insertion, deletion and substitution is 1. (d) Weighted transducer U = X ◦ T ◦ A.]

Figure 3 shows an example illustrating Lemma 1. Using the lateral strategy of the 3-way composition algorithm of [1] or an ad hoc algorithm exploiting the structure of T, U = X ◦ T ◦ A can be computed in O(|x||A|) time. The shortest-distance algorithm presented in Section 4.1 can then be used to compute the shortest distance from an initial state of U to a final state and thus the edit distance of x and A. Let us point out that different queue disciplines in the computation of that shortest distance lead to different algorithms and complexities. In the next section, we shall give a queue discipline enabling us to achieve a linear-space complexity.

4.3 Edit-distance computation in linear space

Using the shortest-distance algorithm described in Section 4.1 leads to an algorithm with space complexity linear in the size of U, i.e. in O(|x||A|). However, taking advantage of the topology of U, it is possible to design a queue discipline that leads to a linear space complexity O(|x| + |A|). We assume that the finite automaton X representing the string x is topologically sorted. A state q in the composition U = X ◦ T ◦ A can be identified with a triplet (i, 0, j) where i is a state of X, 0 the unique state of T, and j a state of A. Since T has a unique state, we further simplify the notation by identifying each state q with a pair (i, j). For a state q = (i, j) of U, we will refer to i as the level of q. A key property of the levels is that there is a transition in U from q to q′ iff level(q′) = level(q) or level(q′) = level(q) + 1.
Indeed, a transition from (i, j) to (i ′ , j′ ) in U corresponds to taking a transition in X (in that case i ′ = i + 1 since X is topologically sorted) or staying at the same state in X and taking an input-ǫ transition in T (in that case i ′ = i). From any queue discipline ≺ on the states of U, we can derive a new queue discipline ≺l over U defined for all q, q′ in U as follows: q ≺l q ′ iff level(q) < level(q ′ ) or level(q) = level(q ′ ) and q ≺ q ′ . (12) Proposition 1. Let ≺ be a queue discipline that requires at most O(|V |) space to maintain a queue over any set of states V . Then, the edit-distance between x and A can be computed in linear space, O(|x| + |A|), using the queue discipline ≺l. Proof. The benefit of the queue discipline ≺l is that when computing the shortest distance to q = (i, j) in U, only the shortest distances to the states in U of level i and i − 1 need to be stored in memory. The shortest distances to the states of level strictly less than i − 1 can be safely discarded. Thus, the space required to store the shortest distances is in O(|A|Q). Similarly, there is no need to store in memory the full transducer U. Instead, we can keep in memory the last two levels active in the shortest-distance algorithm. This is possible because the computation of the outgoing transitions of a state with level i only requires knowledge about the states with level i and i + 1. Therefore, the space used to store the active part of U is in O(|A|E +|A|Q) = O(|A|). Thus, it follows that the space required to compute the edit-distance of x and A is linear, that is in O(|x| + |A|). ⊓⊔ The time complexity of the algorithm depends on the underlying queue discipline ≺. A natural choice is for ≺ is the shortest-first queue discipline, that is the queue discipline used in Dijkstra’s algorithm. This yields the following corollary. Corollary 1. The edit-distance between a string x and an automaton A can be computed in time O(|x||A| log |A|Q) and space O(|x| + |A|) using the queue discipline ≺l. Proof. A shortest-first queue is maintained for each level and contains at most |A|Q states. The cost for the global queue of an insertion, C(I), or an assignment, C(A), is in O(log |A|Q) since it corresponds to inserting in or updating one of the underlying level queues. Since N(q) = 1, the general expression of the complexity (8) leads to an overall time complexity of O(|x||A| log |A|Q) for the shortestdistance algorithm. ⊓⊔ When the automaton A is acyclic, the time complexity can be further improved by using for ≺ the topological order queue discipline. Corollary 2. If the automaton A is acyclic, the edit-distance between x and A can be computed in time O(|x||A|) and space O(|x| + |A|) using the queue discipline ≺l with the topological order queue discipline for ≺.Proof. Computing the topological order for U would require O(|U|) space. Instead, we use the topological order on A, which can be computed in O(|A|), to define the underlying queue discipline. The order inferred by (12) is then a topological order on U. ⊓⊔ Myers and Miller [17] showed that when A is a Thompson automaton, the time complexity can be reduced to O(|x||A|) even when A is not acyclic. This is possible because of the following observation: in a weighted automaton over the tropical semiring, there exists always a shortest path that is simple, that is with no cycle, since cycle weights cannot decrease path weight. In general, it is not clear how to take advantage of this observation. 
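To make the level-by-level computation of Proposition 1 and Corollary 2 concrete, here is a minimal sketch for the special case of an unweighted, acyclic automaton A whose states are already topologically sorted, with unit (Levenshtein) edit costs. The encoding of A as (source, label, target) transitions plus initial and final state sets, and all function names, are our own assumptions; only the current and previous levels of U are kept in memory, as in the proof above.

```python
def edit_distance_to_acyclic_automaton(x, transitions, initials, finals, num_states):
    """d(x, A) for an unweighted acyclic automaton A with states 0..num_states-1
    in topological order; unit insertion, deletion and substitution costs."""
    INF = float("inf")
    out = [[] for _ in range(num_states)]        # out[j] = transitions leaving state j
    for j, label, j2 in transitions:
        out[j].append((label, j2))

    def relax_insertions(row):
        # An insertion stays at the same level and follows one transition of A
        # at cost 1.  Because the states are topologically sorted, a single
        # left-to-right pass settles all insertion chains within the level.
        for j in range(num_states):
            for label, j2 in out[j]:
                if row[j] + 1 < row[j2]:
                    row[j2] = row[j] + 1
        return row

    prev = [INF] * num_states                    # level 0: nothing of x consumed yet
    for q0 in initials:
        prev[q0] = 0
    prev = relax_insertions(prev)
    for ch in x:                                 # advance one level per symbol of x
        cur = [INF] * num_states
        for j in range(num_states):
            if prev[j] == INF:
                continue
            cur[j] = min(cur[j], prev[j] + 1)    # deletion of ch: stay put in A
            for label, j2 in out[j]:             # substitution (cost 1) or match (cost 0)
                cur[j2] = min(cur[j2], prev[j] + (0 if label == ch else 1))
        prev = relax_insertions(cur)             # only two levels are ever in memory
    return min(prev[f] for f in finals)

# A accepts exactly {"ab"}: 0 --a--> 1 --b--> 2.  d("aba", A) = 1 (delete the final 'a').
print(edit_distance_to_acyclic_automaton("aba", [(0, "a", 1), (1, "b", 2)], {0}, {2}, 3))
```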
However, a Thompson automaton has additionally the following structural property: a loop-connectedness of one. The loop-connectedness of A is k if in any depth- first search of A, a simple path goes through at most k back edges. [17] showed that this property, combined with the observation made previously, can be used to improve the time complexity of the algorithm. The results of [17] can be generalized as follows. Corollary 3. If the loop-connectedness of A is k, then the edit-distance between x and A can be computed in O(|x||A|k) time and O(|x| + |A|) space. Proof. We first use a depth-first search of A, identify back edges, and mark them as such. We then compute the topological order for A, ignoring these back edges. Our underlying queue discipline ≺ is defined such that a state q = (i, j) is ordered first based on the number of times it has been enqueued and secondly based on the order of j in the topological order ignoring back edges. This underlying queue can be implemented in O(|A|Q) space with constant time costs for the insertion, extraction and updating operations. The order ≺l derived from ≺ is then not topological for a transition e iff e was obtained by matching a back edge in A and level(p[e]) = level(n[e]). When such a transition e is visited, n[e] is reinserted in the queue. When state q is dequeued for the lth time, the value of d[q] is the weight of the shortest path from the initial state to q that goes through at most l −1 back edges. Thus, the inequality N(q) ≤ k + 1 holds for all q and, since the costs for managing the queue, C(I), C(A), and C(X), are constant, the time complexity of the algorithm is in O(|x||A|k). ⊓⊔ 4.4 Optimal alignment computation in linear space The algorithm presented in the previous section can also be used to compute an optimal alignment by storing a back pointer at each state in U. However, this can increase the space complexity up to O(|x||A|Q). The use of back pointers to compute the best alignment can be avoided by using a technique due to Hirschberg [11], also used by [16, 17]. As pointed out in previous sections, an optimal alignment between x and A corresponds to a shortest path in U = X ◦ T ◦ A. We will say that a state q in U is a midpoint of an optimal alignment between x and A if q belongs to a shortest path in U and level(q) = ⌊|x|/2⌋.Lemma 2. Given a pair (x, A), a midpoint of the optimal alignment between x and A can be computed in O(|x|+|A|) space with a time complexity in O(|x||A|) if A is acyclic and in O(|x||A| log |A|Q) otherwise. Proof. Let us consider U = X ◦ T ◦ A. For a state q in U let d[q] denote the shortest distance from the initial state to q, and by d R[q] the shortest distance from q to a final state. For a given state q = (i, j) in U, d[(i, j)] +d R[(i, j)] is the cost of the shortest path going through (i, j). Thus, for any i, the edit-distance between x and A is d(x, A) = minj (d[(i, j)] + d R[(i, j)]). For a fixed i0, we can compute both d[(i0, j)] and d R[(i0, j)] for all j in O(|x||A| log |A|Q) time (or O(|x||A| time if A is acyclic) and in linear space O(|x| + |A|) using the algorithm from the previous section forward and backward and stopping at level i0 in each case. Running the algorithm backward (exchanging initial and final states and permuting the origin and destination of every transition) can be seen as computing the edit-distance between x R and AR, the mirror images of x and A. Let us now set i0 = ⌊|x|/2⌋ and j0 = argminj (d[(i0, j)] + d R[(i0, j)]). 
It then follows that (i0, j0) is a midpoint of the optimal alignment. Hence, for a pair (x, A), the running-time complexity of determining the midpoint of the alignment is in O(|x||A|) if A is acyclic and O(|x||A| log |A|_Q) otherwise. ⊓⊔ The algorithm proceeds recursively by first determining the midpoint of the optimal alignment. At step 0 of the recursion, we first find the midpoint (i0, j0) between x and A. Let x^1 and x^2 be such that x = x^1 x^2 and |x^1| = i0, and let A^1 and A^2 be the automata obtained from A by respectively changing the final state to j0 in A^1 and the initial state to j0 in A^2. We can now recursively find the alignment between x^1 and A^1 and between x^2 and A^2.

Theorem 1. An optimal alignment between a string x and an automaton A can be computed in linear space O(|x| + |A|) and in time O(|x||A|) if A is acyclic, O(|x||A| log |x| log |A|_Q) otherwise.

Proof. We can assume without loss of generality that the length of x is a power of 2. At step k of the recursion, we need to compute the midpoints for 2^k string-automaton pairs (x^i_k, A^i_k), 1 ≤ i ≤ 2^k. Thus, the complexity of step k is in O( Σ_{i=1}^{2^k} |x^i_k| |A^i_k| log |A^i_k|_Q ) = O( (|x| / 2^k) Σ_{i=1}^{2^k} |A^i_k| log |A^i_k|_Q ), since |x^i_k| = |x| / 2^k for all i. When A is acyclic, the log factor can be avoided and the equality Σ_{i=1}^{2^k} |A^i_k| = O(|A|) holds, thus the time complexity of step k is in O(|x||A| / 2^k). In the general case, each |A^i_k| can be in the order of |A|, thus the complexity of step k is in O(|x||A| log |A|_Q). Since there are at most log |x| steps in the recursion, this leads to an overall time complexity in O(|x||A|) if A is acyclic and O(|x||A| log |A|_Q log |x|) in general. ⊓⊔ When the loop-connectedness of A is k, the time complexity can be improved to O(k |x||A| log |x|) in the general case.

5 Conclusion

We presented general algorithms for computing in linear space both the edit-distance between a string and a finite automaton and their optimal alignment. Our algorithms are conceptually simple and make use of existing generic algorithms. Our results further provide a better understanding of previous algorithms for more restricted automata by relating them to shortest-distance algorithms and general queue disciplines.

References 1. C. Allauzen and M. Mohri. 3-way composition of weighted finite-state transducers. In O. Ibarra and B. Ravikumar, editors, Proceedings of CIAA 2008, volume 5148 of Lecture Notes in Computer Science, pages 262–273. Springer-Verlag Berlin Heidelberg, 2008. 2. J. Berstel. Transductions and Context-Free Languages. Teubner Studienbucher: Stuttgart, 1979. 3. T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press: Cambridge, MA, 1992. 4. M. Crochemore, C. Hancart, and T. Lecroq. Algorithms on Strings. Cambridge University Press, 2007. 5. M. Crochemore and W. Rytter. Text Algorithms. Oxford University Press, 1994. 6. M. Crochemore and W. Rytter. Jewels of Stringology. World Scientific, 2002. 7. E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959. 8. R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK, 1998. 9. S. Eilenberg. Automata, Languages and Machines, volume A–B. Academic Press, 1974–1976. 10. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, Cambridge, UK, 1997. 11. D. S. Hirschberg.
A linear space algorithm for computing maximal common subsequences. Communications of the ACM, 18(6):341–343, June 1975. 12. W. Kuich and A. Salomaa. Semirings, Automata, Languages. Number 5 in EATCS Monographs on Theoretical Computer Science. Springer-Verlag, 1986. 13. M. Mohri. Semiring frameworks and algorithms for shortest-distance problems. Journal of Automata, Languages and Combinatorics, 7(3):321–350, 2002. 14. M. Mohri. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science, 14(6):957–982, 2003. 15. M. Mohri, F. C. N. Pereira, and M. Riley. Weighted automata in text and speech processing. In Proceedings of the 12th biennial European Conference on Artificial Intelligence (ECAI-96), Workshop on Extended finite state models of language, Budapest, Hungary. John Wiley and Sons, Chichester, 1996. 16. E. W. Myers and W. Miller. Optimal alignments in linear space. CABIOS, 4(1):11– 17, 1988. 17. E. W. Myers and W. Miller. Approximate matching of regular expressions. Bulletin of Mathematical Biology, 51(1):5–37, 1989.18. G. Navarro and M. Raffinot. Flexible pattern matching. Cambridge University Press, 2002. 19. F. Pereira and M. Riley. Finite State Language Processing, chapter Speech Recognition by Composition of Weighted Finite Automata. The MIT Press, 1997. 20. D. Perrin. Finite automata. In J. V. Leuwen, editor, Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics, pages 1–57. Elsevier, Amsterdam, 1990. 21. P. A. Pevzner. Computational Molecular Biology: an Algorithmic Approach. MIT Press, 2000. 22. A. Salomaa and M. Soittola. Automata-Theoretic Aspects of Formal Power Series. Springer-Verlag, 1978. 23. D. Sankoff and J. B. Kruskal. Time Wraps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, Reading, MA, 1983. 24. K. Thompson. Regular expression search algorithm. Communications of the ACM, 11(6):365–375, 1968. 25. R. A. Wagner. Order-n correction for regular languages. Communications of the ACM, 17(5):265–268, May 1974. 26. R. A. Wagner and J. I. Seiferas. Correcting counter-automaton-recognizable languages. SIAM Journal on Computing, 7(3):357–375, August 1978. JMLR: Workshop and Conference Proceedings vol 23 (2012) 44.1–44.3 25th Annual Conference on Learning Theory Open Problem: Better Bounds for Online Logistic Regression H. Brendan McMahan MCMAHAN@GOOGLE.COM Google Inc., Seattle, WA Matthew Streeter MSTREETER@GOOGLE.COM Google Inc., Pittsburgh, PA Editor: Shie Mannor, Nathan Srebro, Robert C. Williamson Abstract Known algorithms applied to online logistic regression on a feasible set of L2 diameter D achieve regret bounds like O(e D log T) in one dimension, but we show a bound of O( √ D + log T) is possible in a binary 1-dimensional problem. Thus, we pose the following question: Is it possible to achieve a regret bound for online logistic regression that is O(poly(D) log(T))? Even if this is not possible in general, it would be interesting to have a bound that reduces to our bound in the one-dimensional case. Keywords: online convex optimization, online learning, regret bounds 1. Introduction and Problem Statement Online logistic regression is an important problem, with applications like click-through-rate prediction for web advertising and estimating the probability that an email message is spam. 
We formalize the problem as follows: on each round t the adversary selects an example (x_t, y_t) ∈ R^n × {−1, 1}, the algorithm chooses model coefficients w_t ∈ R^n, and then incurs loss

ℓ(w_t; x_t, y_t) = log(1 + exp(−y_t w_t · x_t)),   (1)

the negative log-likelihood of the example under a logistic model. For simplicity we assume ||x_t||_2 ≤ 1 so that any gradient ||∇ℓ(w_t)||_2 ≤ 1. While conceptually any w ∈ R^n could be used as model parameters, for regret bounds we consider competing with a feasible set W = {w | ||w||_2 ≤ D/2}, the L2 ball of diameter D centered at the origin. Existing algorithms for online convex optimization can immediately be applied. First-order algorithms like online gradient descent (Zinkevich, 2003) achieve bounds like O(D√T). On a bounded feasible set logistic loss (Eq. (1)) is exp-concave, and so we can use second-order algorithms like Follow-The-Approximate-Leader (FTAL), which has a general bound of O((1/α + GD) n log T) (Hazan et al., 2007) when the loss functions are α-exp-concave on the feasible set; we have α = e^{−D/2} for the logistic loss (see Appendix A), which leads to a bound of O((exp(D) + D) n log T) in the general case, or O(exp(D) log T) in the one-dimensional case. The exponential dependence on the diameter of the feasible set can make this bound worse than the O(D√T) bounds for practical problems where the post-hoc optimal probability can be close to zero or one. We suggest that better bounds may be possible. In the next section, we show that a simple Follow-The-Regularized-Leader (FTRL) algorithm can achieve a much better result, namely O(√D + log T), for one-dimensional problems where the adversary is further constrained¹ to pick x_t ∈ {−1, 0, +1}. A single mis-prediction can cost about D/2, and so the additive dependence on the diameter of the feasible set is less than the cost of one mistake. The open question is whether such a bound is achievable for problems of arbitrary finite dimension n. Even the general one-dimensional case, where x_t ∈ [−1, 1], is not obvious.

2. Analysis in One Dimension

We analyze an FTRL algorithm. We can ignore any rounds when x_t = 0, and then since only the sign of y_t x_t matters, we assume x_t = 1 and the adversary picks y_t ∈ {−1, 1}. The cumulative loss function on P positive examples and N negative examples is

c(w; N, P) = P log(1 + exp(−w)) + N log(1 + exp(w)).

Let N_t denote the number of negative examples seen through the t'th round, with P_t the corresponding number of positive examples. We play FTRL, with

w_{t+1} = argmin_w c(w; N_t + λ, P_t + λ),

for a constant λ > 0. This is just FTRL with a regularization function r(w) = c(w; λ, λ). Using the FTRL lemma (e.g., McMahan and Streeter (2010, Lemma 1)), we have

Regret ≤ r(w*) + Σ_{t=1}^{T} ( f_t(w_t) − f_t(w_{t+1}) ),

where f_t(w) = ℓ(w; x_t, y_t). It is easy to verify that r(w) ≤ λ(|w| + 2 log 2). It remains to bound f_t(w_t) − f_t(w_{t+1}). Fix a round t. For compactness, we write N = N_{t−1} and P = P_{t−1}. Suppose that y_t = −1, so N_t = N + 1 and P_t = P (the case y_t = +1 is analogous). Since f_t is convex, by definition f_t(w) ≥ f_t(w_t) + g_t (w − w_t) where g_t = ∇f_t(w_t). Taking w = w_{t+1} and re-arranging, we have

f_t(w_t) − f_t(w_{t+1}) ≤ g_t (w_t − w_{t+1}) ≤ |g_t| |w_t − w_{t+1}|.

It is easy to verify that |g_t| ≤ 1, and also that w_t = log((P + λ)/(N + λ)). Since y_t = −1, w_{t+1} < w_t, and so

|w_t − w_{t+1}| = log((P + λ)/(N + λ)) − log((P + λ)/(N + 1 + λ)) = log(N + 1 + λ) − log(N + λ) = log(1 + 1/(N + λ)) ≤ 1/(N + λ).

¹ Constraining the adversary in this way is reasonable in many applications. For example, re-scaling each x_t so ||x_t||_2 = 1 is a common pre-processing step, and many problems also are naturally featurized by x_{t,i} ∈ {0, 1}, where x_{t,i} = 1 indicates some property i is present on the t'th example.
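Before completing the regret bound, a small simulation sketch of the one-dimensional FTRL update just defined may be useful. The closed form w_{t+1} = log((P_t + λ)/(N_t + λ)) follows from setting the derivative of c(w; N_t + λ, P_t + λ) to zero; the synthetic data stream and the crude grid search standing in for the post-hoc optimum are purely illustrative assumptions.

```python
import math
import random

def logistic_loss(w, y):
    # Loss on an example with x_t = 1 and label y in {-1, +1}.
    return math.log1p(math.exp(-y * w))

def run_ftrl(labels, lam=1.0):
    """FTRL with regularizer r(w) = c(w; lam, lam) on a binary 1-D problem.

    The minimizer of c(w; N + lam, P + lam) is w = log((P + lam) / (N + lam)),
    so each round's play has a closed form.
    """
    n_pos = n_neg = 0
    total_loss = 0.0
    for y in labels:
        w_t = math.log((n_pos + lam) / (n_neg + lam))   # play before seeing y_t
        total_loss += logistic_loss(w_t, y)
        if y > 0:
            n_pos += 1
        else:
            n_neg += 1
    return total_loss

random.seed(0)
stream = [1 if random.random() < 0.9 else -1 for _ in range(1000)]
alg_loss = run_ftrl(stream, lam=1.0)
# Crude stand-in for the best fixed w with |w| <= 5 (i.e. D = 10): a grid search.
best_fixed = min(sum(logistic_loss(w, y) for y in stream)
                 for w in [i / 10 for i in range(-50, 51)])
print(f"FTRL loss {alg_loss:.1f}, best fixed (grid) {best_fixed:.1f}, regret ~ {alg_loss - best_fixed:.1f}")
```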
Thus, if we let T⁻ = {t | y_t = −1}, we have

Σ_{t ∈ T⁻} ( f_t(w_t) − f_t(w_{t+1}) ) ≤ Σ_{N=0}^{N_T} 1/(N + λ) ≤ 1/λ + Σ_{N=1}^{N_T} 1/N ≤ 1/λ + log(N_T) + 1.

Applying a similar argument to rounds with positive labels and summing over the rounds with positive and negative labels independently gives

Regret ≤ λ(|w*| + 2 log 2) + log(P_T) + log(N_T) + 2/λ + 2.

Note log(P_T) + log(N_T) ≤ 2 log T. We wish to compete with w* where |w*| ≤ D/2, so we can choose λ = 1/√(D/2), which gives Regret ≤ O(√D + log T).

References Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Mach. Learn., 69, December 2007. H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In COLT, 2010. Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.

Appendix A. The Exp-Concavity of the Logistic Loss

Theorem 1 The logistic loss function ℓ(w_t; x_t, y_t) = log(1 + exp(−y_t w_t · x_t)), from Eq. (1), is α-exp-concave with α = exp(−D/2) over the set W = {w | ||w||_2 ≤ D/2} when ||x_t||_2 ≤ 1 and y_t ∈ {−1, 1}.

Proof Recall that a function ℓ is α-exp-concave if ∇² exp(−αℓ(w)) ⪯ 0. When ℓ(w) = g(w · x) for x ∈ R^n, we have ∇² exp(−αℓ(w)) = f″(z) x x^⊤, where f(z) = exp(−αg(z)). For the logistic loss, we have g(z) = log(1 + exp(z)) (without loss of generality, we consider a negative example), and so f(z) = (1 + exp(z))^{−α}. Then, f″(z) = α e^z (1 + e^z)^{−α−2} (α e^z − 1). We need the largest α such that f″(z) ≤ 0, given a fixed z. We can see by inspection that α = 0 is a zero. Since e^z (1 + e^z)^{−α−2} > 0, from the term (α e^z − 1) we conclude α = e^{−z} is the largest value of α for which f″(z) ≤ 0. Note that z = w_t · x_t, and so |z| ≤ D/2 since ||x_t||_2 ≤ 1, and so taking the worst case over w_t ∈ W and x_t with ||x_t||_2 ≤ 1, we have α = exp(−D/2).

Online Microsurveys for User Experience Research

Abstract This case study presents a critical analysis of microsurveys as a method for conducting user experience research. We focus specifically on Google Consumer Surveys (GCS) and analyze a combination of log data and GCSs run by the authors to investigate how they are used, who the respondents are, and the quality of the data. We find that such microsurveys can be a great way to quickly and cheaply gather large amounts of survey data, but that there are pitfalls that user experience researchers should be aware of when using the method. Author Keywords Microsurveys; user experience research; user research methods ACM Classification Keywords H.5.2. User Interfaces: Theory and methods. Introduction To keep up with fast paced design and development teams, user researchers must develop a toolkit of methods to quickly and efficiently address research questions. One such method is the microsurvey, or a short survey of only one to three questions.
There are several commercial microsurveys—including Google Consumer Surveys (GCS), SlimSurveys, and Survata—that promise to provide people with large amounts of data quickly and at a relatively low cost. In this case study, we present a critical analysis of one type of microsurvey, Google Consumer Surveys, addressing questions about how they are being used, who their respondents are, and of what quality is the data they collect. We conclude with some current best practices for using this method in user research.

CHI 2014, April 26–May 1, 2014, Toronto, Ontario, Canada. ACM 978-1-4503-2474-8/14/04. http://dx.doi.org/10.1145/2559206.2559975

David Huffaker, Google Inc., Mountain View, CA 94043, huffaker@google.com
Gueorgi Kossinets, Google Inc., Mountain View, CA 94043, gkossinets@google.com
Kerwell Liao, Google Inc., Mountain View, CA 94043, kerwell@google.com
Paul McDonald, Google Inc., Mountain View, CA 94043, pmcdonald@google.com
Aaron Sedley, Google Inc., Mountain View, CA 94043, asedley@google.com
Victoria Schwanda Sosik, Cornell University, 301 College Ave., Ithaca, NY 14850, vsosik@cs.cornell.edu
Elie Bursztein, Google Inc., Mountain View, CA 94043, elieb@google.com
Sunny Consolvo, Google Inc., Mountain View, CA 94043, sconsolvo@google.com

Figure 1. An example of how a respondent encounters GCS. They are asked to answer a short survey question, or share the page they are reading via social media, in order to continue reading the publisher's content.

Figure 2. Part of the GCS results interface. To the left are controls to filter responses by demographics, and results to a multiple choice question are shown to the right.

One Example of a Microsurvey: GCS
Since we use GCS in this case study, we first provide a brief overview of how it works. Each GCS respondent is shown only one question, two if there is a screening question. If a survey has more than one question, then each respondent is randomly shown only one of the survey questions. The survey designer can choose one of twelve predefined question formats that include open ended, single answer, multiple answer, and rating scale responses. Certain question formats allow for images in the question or responses. Questions and responses must be short, with 125 character and 44 character limits respectively; multiple choice questions are limited to showing 5 response options to each respondent. Survey designers can request a representative population, or target respondents based on specific demographics (as inferred by IP addresses and DoubleClick cookies) or by using a screening question. Questions are then shown to people trying to access a publisher's premium content—primarily in the categories of News, Arts & Entertainment, and Reference—and people answer the question in order to continue reading the content (see Figure 1); in this way, these microsurveys act as a surveywall between the respondent and the content they want to access. After data is collected, survey designers can view the results in the GCS interface, which provides basic analysis tools including comparison of results by different demographics and automatic, editable clustering of open-ended text responses (see Figure 2).

Results: Analysis of GCS
We analyzed GCS log data and data from several surveys run by the authors. Some of the surveys were run specifically to gather data about GCS as a method, and others were run to answer user research questions for our product teams; however, we analyzed them from a methodological perspective for this case study.

GCS by the Numbers
GCS log data shows that the two most frequently used types of questions are multiple-choice questions (see Table 1). Together, single answer and multiple answer questions make up over 80% of all deployed GCS questions. However, the most common question type—multiple answer—has the lowest completion rate (see Table 1). On average, respondents spend 9.7 seconds responding to a GCS question, and the modal response time is 4 seconds (see Figure 3). GCSs also collect data very quickly: on average, surveys are approved to start collecting survey data between one and four hours after being created, and complete data collection in about two to four days. General population surveys finish data collection on the lower end of that range, whereas targeted surveys tend to take the four days.

Question Type            Usage     Completion Rate
Multiple answers         62.04%    20.56%
Single answer            21.71%    39.37%
Open ended                4.62%    27.03%
Rating                    3.81%    34.19%
Numeric open ended        1.60%    25.30%
Rating with text          1.50%    34.09%
Rating with image         1.30%    27.20%
Large image choice        0.99%    28.49%
Side-by-side images       0.92%    29.37%
Image with menu           0.82%    36.57%
Open ended with image     0.69%    27.79%
Two choices with image    --        --
Table 1. Rate of usage among survey designers and completion rate among respondents for the 12 different types of GCS questions.

Figure 3. Distribution of response times in seconds to GCS survey questions.

Who are GCS Respondents?
In November 2012, PEW Research ran a study to compare GCS demographics with those of their telephone panels. Their overall findings were that GCS respondents "conform closely to the demographic composition of the overall internet population," and that there is little evidence that GCS is biased towards heavy internet users [4].

Demographic    PEW    GCS
Men            32%    27%
Women          35%    27%
18–24          33%    18%
25–34          37%    30%
35–44          49%    32%
45–54          38%    28%
55–64          28%    26%
65+            18%    23%
Unknown Age    --     27%
Table 2. Inferred GCS demographics compared to PEW demographics.

We ran a series of GCSs to dig deeper into demographic and technology-use questions. We found that the rate of tablet ownership (PEW = 34%, GCS = 28%), cell phone ownership (91%, 67%), and use of cell phones (35%, 33%) or the internet for banking (61%, 48%) was lower among GCS than PEW respondents. In terms of demographics, GCS shows lower rates across age and gender (see Table 2). With respect to social networking site usage among older Americans, our findings using GCS were close to PEW (see Table 3).

"Do you ever use the internet to use a social networking site like MySpace, Facebook, or LinkedIn.com?" PEW: 42% [age 50+]; GCS: 46% [age 45+].
"What is the primary social networking site you use?" GCS: Facebook (85%), LinkedIn (6%), Twitter (4%), Google+ (3%), MySpace (1%).
Table 3. Social network usage among older Americans, using PEW and GCS survey samples.

We also compared GCS respondents to respondents from Survey Sampling International (SSI) and Knowledge Networks (KN) panels with respect to internet use and technology adoption. Results across the panels were similar, with SSI respondents tending to be the heavier internet users and technology adopters, and KN being the lowest (see Table 4). Overall, while we notice demographic differences between the survey samples—likely due to the number of unknowns in GCS—technology usage and adoption is similar across all four samples, with PEW and KN representing the high and low extremes, respectively.

Respondents' Attitudes Toward Surveywalls
We ran a GCS to explore respondents' attitudes toward surveywalls that stand between them and content they are trying to access. We asked them which of five options they would prefer when trying to access premium content. We found that the most popular response was taking a short microsurvey (47%), followed by having content sponsored by an advertiser (34%), making a small one-time payment (10%), purchasing a subscription (6%), and other (3%; which they then had to specify as open ended text).

Data Quality: Survey Attentiveness
As one measure of data quality, we ran a GCS that asked respondents one of several trap questions. For a summary of how respondents performed, see the "Trap Questions in GCS" sidebar. We find that our GCS respondents answered the "Very Often" trap question correctly less often (73%) than in an example of the same trap question being asked on a paper survey (97%) [3]. A trap survey run on Mechanical Turk found only 61% of respondents answering correctly when asked to read an email and answer two questions [2], but this task is arguably harder than the questions we asked.

Data Quality: Garbage Open Ended Responses
We also analyzed data quality by looking at the rate of garbage responses that we received across 25 GCS questions run for other projects. Examples of these questions include "which web browser(s) do you use?" and "what does clicking on this image allow you to do?" Counting responses such as "blah", "who cares", and "zzzzz" as garbage, we found that the percentage of garbage responses ranged from 1.8% to 23.4% (Mean = 7.8%).
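A tally like the one just described is straightforward to reproduce. The sketch below assumes a hypothetical list of (question, response) pairs and hand-maintained lists of junk markers; the paper's actual coding of garbage responses was presumably done by hand, so this is only an illustration of the bookkeeping, not the authors' procedure.

```python
# Illustrative sketch only: GCS does not export responses in this exact form,
# and the garbage coding in the case study was done by the researchers.
from collections import defaultdict

GARBAGE_MARKERS = {"blah", "who cares", "zzzzz", "asdf"}        # assumed examples
DONT_KNOW_MARKERS = {"i don't know", "dont know", "no idea"}    # assumed examples

def classify(response: str) -> str:
    """Very rough heuristic: exact-match against known junk strings."""
    text = response.strip().lower()
    if text in GARBAGE_MARKERS or len(text) < 2:
        return "garbage"
    if text in DONT_KNOW_MARKERS:
        return "dont_know"
    return "substantive"

def rates_by_question(rows):
    """rows: iterable of (question_id, response_text) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for qid, resp in rows:
        counts[qid][classify(resp)] += 1
    rates = {}
    for qid, c in counts.items():
        total = sum(c.values())
        rates[qid] = (c["garbage"] / total, c["dont_know"] / total)
    return rates

if __name__ == "__main__":
    sample = [("q1", "Chrome"), ("q1", "blah"), ("q1", "I don't know"),
              ("q2", "zzzzz"), ("q2", "It opens the image")]
    for qid, (garbage, dont_know) in sorted(rates_by_question(sample).items()):
        print(f"{qid}: garbage={garbage:.0%}, don't know={dont_know:.0%}")
```

Keeping both rates per question makes it easy to inspect how they move together, which is exactly the pattern discussed next.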
Our analysis revealed that the percentage of "I don't know" responses tended to correlate with the percentage of garbage responses, suggesting that people were more likely to provide garbage responses when they were not sure what the question was asking of them.

Conclusion: Best Practices for Microsurveys
We find that microsurveys such as Google Consumer Surveys can quickly provide large amounts of data with relatively low setup costs. We also see that the GCS population is fairly representative as compared to other large-scale survey panels. However, there are also pitfalls to keep in mind. Our findings from the trap question survey suggest that being concise is important to maximize data quality, which supports GCS's question length constraints. We also suggest that it is important to appropriately target surveys to a population in order to keep garbage open ended responses to a minimum. If respondents are being asked about something they are unfamiliar with, they are less likely to provide meaningful responses. Finally, multiple answer questions had the lowest completion rate—which is often used as a measure of data quality (e.g. [1])—so we suggest that people think critically about the types of questions they use, and consider using other question types if appropriate.

With respect to analyzing microsurveys, first it is important to remember that demographics are inferred, and there are many "unknowns". We also suggest using built-in text clustering tools to categorize open-ended responses and, if desired, following up with multiple choice questions to determine how frequent these categories are.

Trap Questions in GCS
• What is the color of a red ball? (90.3% correct)
• What is the shape of a red ball? (85.7%)
• The purpose of this question is to assess your attentiveness to question wording. For this question please mark the 'Very Often' response. (72.5%)
• The purpose of this question is to assess your attentiveness to question wording. Ignore the question below, and select "blue" from the answers. What color is a basketball? (57%)

Table 4. Technology use and adoption among 3 different survey panels (mean ratings; KN / GCS / SSI).
• For personal purposes, I normally use the Internet (5 = every hour or more, 1 = once per week or less): 3.2 / 3.5 / 3.8
• Other people often seek my ideas and advice regarding technology (5 = describes me very well, 1 = describes me very poorly): 2.7 / 3.1 / 3.2
• I am willing to pay more for the latest technology (same as above): 2.3 / 2.6 / 3.1
• Which of the following best describes when you buy or try out new technology? (5 = Among the first people, 1 = I am usually not interested): 2.5 / 2.6 / 3.1
• How frequently do you post on social networks? (5 = multiple times a day, 1 = once a month or less): 1.7 / 2.1 / 2.4

References
[1] Dillman, D. A. & Schaefer, D. R. (1998). Development of a standard e-mail methodology: results of an experiment. Public Opinion Quarterly, 62(3).
[2] Downs, J. S., Holbrook, M. B., Sheng, S., & Cranor, L. F. (2010). Are your participants gaming the system? Screening Mechanical Turk workers. In Proceedings of CHI '10.
[3] Hargittai, E. (2005). Survey measures of web-oriented digital literacy. Social Science Computer Review, 23(3), 371–379.
[4] Pew Research (November 2012). A Comparison of Results from Surveys by the Pew Research Center and Google Consumer Surveys.

Minimizing off-target signals in RNA fluorescent in situ hybridization
Aaron Arvey1,2, Anita Hermann3, Cheryl C.
Hsia3 , Eugene Ie2,4, Yoav Freund2 and William McGinnis3,* 1 Computational and Systems Biology Center, Memorial Sloan-Kettering Cancer Center, New York, NY, 10065, 2 Department of Computer Sciences and Engineering, 3 Division of Biological Sciences, University of California, San Diego, La Jolla, CA 92093 and 4 Google Inc., Mountain View, CA 94043, USA Received November 4, 2009; Revised December 11, 2009; Accepted January 17, 2010 ABSTRACT Fluorescent in situ hybridization (FISH) techniques are becoming extremely sensitive, to the point where individual RNA or DNA molecules can be detected with small probes. At this level of sensitivity, the elimination of ‘off-target’ hybridization is of crucial importance, but typical probes used for RNA and DNA FISH contain sequences repeated elsewhere in the genome. We find that very short (e.g. 20 nt) perfect repeated sequences within much longer probes (e.g. 350–1500 nt) can produce significant off-target signals. The extent of noise is surprising given the long length of the probes and the short length of non-specific regions. When we removed the small regions of repeated sequence from either short or long probes, we find that the signal-to-noise ratio is increased by orders of magnitude, putting us in a regime where fluorescent signals can be considered to be a quantitative measure of target transcript numbers. As the majority of genes in complex organisms contain repeated k-mers, we provide genome-wide annotations of k-mer-uniqueness at http://cbio.mskcc .org/aarvey/repeatmap. INTRODUCTION The gene expression profiles of individual cells can be drastically different from that of adjacent cells. This is particularly true in developing or heterogeneous tissues such as embryos (1), proliferative adult epithelia (2) and tumors (3). Visualization of RNA expression patterns in fields of cells is often accomplished with fluorescence in situ hybridization (FISH) using antisense probes. Analysis of cellular patterns of gene expression by FISH has provided insight into prognosis (3) and cell fate (4) of tissues. A challenge for the future is to use FISH in tissues to quantify RNA expression levels on a cell-by-cell basis, which requires high resolution, high sensitivity and high signal-to-noise ratios (1,5–8). A major hurdle in making RNA FISH methods quantitative has been increasing sensitivity and specificity to the point where genuine target RNA signals can be distinguished from background. One way to produce probes of high specificity has been to produce chemicallysynthesized oligonucleotides that are directly labeled with fluorophores, and tiled along regions of RNA sequence (6–8). Although directly-labeled oligo probes are elegant, they have not yet been widely applied, in part due to their expense, and in part due to their relatively low signal strength (6–8). One alternative method for single RNA molecule detection employs long haptenylated riboprobes that are enzymatically synthesized from cDNAs (1,5). Such probes are cheaply and easily produced, and when detected with primary and fluorescently-labeled secondary antibodies, they have higher signal intensities and equivalent resolution when compared to probes that are directly labeled with fluorophores (1,5). 
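The probe-design idea summarized in the Abstract, flagging short k-mers within a long probe that also occur elsewhere in the genome, can be sketched in a few lines. The snippet below assumes a precomputed set of repeated k-mers (for example, derived from a genome-wide k-mer count); the function names, the masking strategy, and the 20-nt window are illustrative and do not reproduce the authors' actual pipeline behind the published repeat maps.

```python
# Sketch under assumptions: `repeated_kmers` stands in for a genome-wide index of
# k-mers that occur more than once; the published analysis is not reproduced here.
K = 20  # window size in nucleotides, matching the ~20 nt repeats discussed above

def repeat_positions(probe: str, repeated_kmers: set, k: int = K) -> list:
    """Return start positions of k-mers in the probe that are repeated in the genome."""
    probe = probe.upper()
    return [i for i in range(len(probe) - k + 1)
            if probe[i:i + k] in repeated_kmers]

def mask_repeats(probe: str, repeated_kmers: set, k: int = K) -> str:
    """Replace every repeated k-mer window with 'N's so it can be trimmed out."""
    masked = list(probe.upper())
    for i in repeat_positions(probe, repeated_kmers, k):
        masked[i:i + k] = "N" * k
    return "".join(masked)

if __name__ == "__main__":
    # Toy example with k=4 so the behaviour is visible at a glance.
    toy_repeats = {"ACGT"}
    print(mask_repeats("TTACGTTTGGCC", toy_repeats, k=4))  # -> TTNNNNTTGGCC
```

In practice the masked regions would be removed from the probe template before synthesis, which is the step the paper shows to improve signal-to-noise by orders of magnitude.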
However, tiling probes have a natural advantage with respect to specificity: if a single probe 'tile' hybridizes to an off-target transcript, it is unlikely to generate sufficient signal to pass an intensity threshold that is characteristic of genuine RNA transcripts, which contain multiple tiled binding sites. In contrast, a single haptenylated probe, even if fragmented to sizes in the range of hundreds of nucleotides, may yield strong off-target signals due to the amplification conferred by primary and secondary antibodies. One traditional approach to determine background levels of fluorescence, and thus to provide a crude estimate of specificity, has been the use of sense
*To whom correspondence should be addressed. Tel: 858 822 0461; Fax: 858 822 3021; Email: wmcginnis@ucsd.edu. The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. Published online 17 February 2010. Nucleic Acids Research, 2010, Vol. 38, No. 10, e115, doi:10.1093/nar/gkq042.

Collaboration in the Cloud at Google
Yunting Sun, Diane Lambert, Makoto Uchida, Nicolas Remy
Google Inc.
January 8, 2014

Abstract
Through a detailed analysis of logs of activity for all Google employees1, this paper shows how the Google Docs suite (documents, spreadsheets and slides) enables and increases collaboration within Google. In particular, visualization and analysis of the evolution of Google's collaboration network show that new employees2 have started collaborating more quickly and with more people as usage of Docs has grown. Over the last two years, the percentage of new employees who collaborate on Docs per month has risen from 70% to 90% and the percentage who collaborate with more than two people has doubled from 35% to 70%. Moreover, the culture of collaboration has become more open, with public sharing within Google overtaking private sharing.
1 Full-time Google employees, excluding interns, part-time employees, vendors, etc.
2 Full-time employees who joined Google less than 90 days ago.

1 Introduction
Google Docs is a cloud productivity suite designed to make collaboration easy and natural, regardless of whether users are in the same or different locations, working at the same or different times, or working on desktops or mobile devices. Edits and comments on a document are displayed as they are made, even if many people are simultaneously writing and commenting on or viewing the document. Comments enable real-time discussion and feedback on the document, without changing the document itself. Authors are notified when a new comment is made or replied to, and they can continue a conversation by replying to the comment, end the discussion by resolving it, or re-start the discussion by re-opening a closed discussion stream. Because documents are stored in the cloud, users can access any document they own or that has been shared with them anywhere, any time and on any device. The question is whether this enriched model of collaboration matters. There have been a few previous qualitative analyses of the effects of Google Docs on collaboration.
For example, the review of Google Docs in [1] suggested that its features should improve collaboration and productivity among college students. A technical report [2] from the University of Southern Queensland, Australia argued that Google Docs can overcome barriers to usability such as difficulty of installation and document version control, and can help resolve conflicts among co-authors of research papers. There has also been at least one rigorous study of the effect of Google Docs on collaboration. Blau and Caspi [3] ran a small experiment that was designed to compare collaboration on writing documents to merely sharing documents. In their experiment, 118 undergraduate students of the Open University of Israel were randomized to one of five groups in which they shared their written assignments and received feedback from other students to varying degrees, ranging from keeping texts private to allowing in-text suggestions or allowing in-text edits. None of the students had used Google Docs previously. The authors found that only students in the collaboration group perceived the quality of their final document to be higher after receiving feedback, and students in all groups thought that collaboration improves documents. This paper takes a different approach, and looks for the effects of collaboration on a large, diverse organization with thousands of users over a much longer period of time. The first part of the paper describes some of the contexts in which Google Docs is used for collaboration, and the second part analyzes how collaboration has evolved over the last two years.

2 Collaboration Visualization
2.1 The Data
This section introduces a way to visualize the events during a collaboration and some simple statistics that summarize how widespread collaboration using Google Docs is at Google. The graphics and metrics are based on the view, edit and comment actions of all full-time employees on tens of thousands of documents created in April 2013.

2.2 A Simple Example
To start, a document with three collaborators Adam (A), Bryant (B) and Catherine (C) is shown in Figure 1. The horizontal axis represents time during the collaboration. The vertical axis is broken into three regions representing viewing, editing and commenting. Each contributor is assigned a color. A box with the contributor's color is drawn in any time interval in which the contributor was active, at a vertical position that indicates what the user was doing in that time interval. This allows us to see when contributors were active and how often they contributed to the document. Stacking the boxes allows us to show when contributors were acting at the same time. Only time intervals in which at least one contributor was active are shown, and gaps in time that are shorter than a threshold are ignored. Gray vertical bars of fixed width are used to represent periods of no activity that are longer than the threshold. In this paper, the threshold is set to be 12 hours in all examples. In Figure 1, an interval represents an hour. Adam and Bryant edited the document together during the hour of 10 AM May 4 and Bryant edited alone in the following hour. The collaboration paused for 8 days and resumed during the hour of 2 PM on May 12. Adam, Bryant and Catherine all viewed the document during that hour. Catherine commented on the document in the next hour.
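The bookkeeping behind this kind of timeline is simple to sketch. The fragment below assumes a list of (user, action, timestamp) events invented to mirror the example just described; it groups them into hourly buckets and then splits the buckets into active sessions wherever the gap exceeds the 12-hour threshold used in the figures. The event format is an assumption for illustration, not the paper's log schema.

```python
# Illustrative sketch: the event tuples are invented to mirror the Adam/Bryant/
# Catherine example; the actual log schema is not described in the paper.
from datetime import datetime, timedelta

GAP_THRESHOLD = timedelta(hours=12)   # matches the 12-hour threshold in the text

def hourly_buckets(events):
    """events: iterable of (user, action, datetime). Returns sorted unique active hours."""
    return sorted({ts.replace(minute=0, second=0, microsecond=0)
                   for _user, _action, ts in events})

def active_sessions(events, gap=GAP_THRESHOLD):
    """Split the active hours into sessions separated by gaps longer than `gap`."""
    sessions, current = [], []
    for hour in hourly_buckets(events):
        if current and hour - current[-1] > gap:
            sessions.append(current)
            current = []
        current.append(hour)
    if current:
        sessions.append(current)
    return sessions

if __name__ == "__main__":
    events = [
        ("Adam", "edit", datetime(2013, 5, 4, 10, 15)),
        ("Bryant", "edit", datetime(2013, 5, 4, 10, 40)),
        ("Bryant", "edit", datetime(2013, 5, 4, 11, 5)),
        ("Adam", "view", datetime(2013, 5, 12, 14, 20)),
        ("Bryant", "view", datetime(2013, 5, 12, 14, 30)),
        ("Catherine", "view", datetime(2013, 5, 12, 14, 45)),
        ("Catherine", "comment", datetime(2013, 5, 12, 15, 10)),
    ]
    print(len(active_sessions(events)))   # -> 2 sessions, separated by the 8-day pause
```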
Altogether, the collaboration had two active sessions, with a pause of 8 days between them. Figure 1: This figure shows an example of the collaboration visualization technique. Each colored block except the gray one represents an hour and the gray one represents a period of no activity. The Y axis is the number of users for each action type. This document has three contributors, each assigned a different color. Although we have used color to represent collaborators here, we could instead use color to represent the locations of the collaborators, their organizations, or other variables. Examples with different colorings are given in Sections 2.5 and 2.6. 2 Google Inc.2 COLLABORATION VISUALIZATION 2.3 Collaboration Metrics 2.3 Collaboration Metrics To estimate the percentage of users who concurrently edit a document and the percentage of documents which had concurrent editing, we discretize the timestamps of editing actions into 15 minute intervals and consider editing actions by different contributors in the same 15 minute interval to be concurrent. Two users who edit the same document but always more than 15 minutes apart would not be considered as concurrent, although they would still be considered collaborators. Edge cases in which two collaborators edit the same document within 15 minutes of each other but in two adjacent 15 minute intervals would not be counted as concurrent events. The choice of 15 minutes is arbitrary; however, metrics based on a 15 minute discretization and a 5 minute discretization are little different. The choice of 15 minute intervals makes computation faster. A more accurate approach would be to look for sequences of editing actions by different users with gaps below 15 minutes, but that requires considerably more computing. 2.4 Collaborative Editing Collaborative editing is common at Google. 53% of the documents that were created and shared in April 2013 were edited by more than one employee, and half of those had at least one concurrent editing session in the following six months. Looking at employees instead of documents, 80% of the employees who edited any document contributed content to a document owned by others and 65% participated in at least one 15 minute concurrent editing session in April 2013. Concurrent editing is sticky, in the sense that 76% of the employees who participate in a 15 minute concurrent editing session in April will do so again the following month. There are many use cases for collaborative editing, including weekly reports, design documents, and coding interviews. The following three plots show an example of each of these use cases. Figure 2: Collaboration activity on a design document. The X axis is time in hours and the Y axis is the number of users for each action type. The document was mainly edited by 3 employees, commented on by 18 and viewed by 50+. Google Inc. 32.5 Commenting 2 COLLABORATION VISUALIZATION Figure 2 shows the life of a design document created by engineers. The X axis is time in hours and the Y axis is the number of employees working on the document for each action type. The document was mainly edited by three employees, commented on by 18 employees and viewed by more than 50 employees from three major locations. This document was completed within two weeks and viewed many times in the subsequent month. Design documents are common at Google, and they typically have many contributors. Figure 3 shows the life of a weekly report document. 
Each bar represents a day and the Y axis is the number of employees who edited and viewed the document in a day. This document has the following submission rules: • Wednesday, AM: Reminder for submissions • Wednesday, PM: All teams submit updates • Thursday, AM: Document is locked The activities on the document exhibit a pronounced weekly pattern that mirrors the submission rules. Weekly reports and meeting notes that are updated regularly are often used by employees to keep everyone up-to-date as projects progress. Figure 3: Collaboration on a weekly report. The X axis is time in days and the Y axis is the number of users for each action type. The activities exhibit a pronounced weekly pattern and reflect the submission rules of the document. Finally, Figure 4 shows the life of a document used in an interview. The X axis represents time in minutes. The document was prepared by a recruiter and then viewed by an engineer. At the beginning of the interview, the engineer edited the document and the candidate then wrote code in the document. The engineer was able to watch the candidate typing. At the end of the interview, the candidate’s access to the document was revoked so no further change could be made, and the document was reviewed by the engineer. Collaborative editing allows the coding interview to take place remotely, and it is an integral part of interviews for software engineers at Google. Figure 4: The activity on a phone interview document. The X axis is time in minutes and the Y axis is the number of users for each action type. The engineer was able to watch the candidate typing on the document during a remote interview. 2.5 Commenting Commenting is common at Google. 30% of the documents created in April 2013 that are shared received comments within six months of creation. 57% of the employees who used Google Docs in April commented at least once in April, and 80% of the users who commented in April commented again in the following month. 4 Google Inc.2 COLLABORATION VISUALIZATION 2.6 Collaboration Across Sites Figure 5: Commenting and editing on a design document. The X axis is time in hours and the Y axis is the number of user actions for each user location. There are four user actions, each assigned a different color. Timestamps are in Pacific time. Figure 5 shows the life of a design document. Here color represents the type of user action (create a comment, reply to a comment, resolve a comment and edit the document), and the Y axis is split into two locations. The document was written by one engineering team and reviewed by another. The review team used commenting to raise many questions, which the engineering team resolved over the next few days. Collaborators were located in London, UK and Mountain View, California, with a nine hour time zone difference, so the two teams were almost ”taking turns” working on the document (timestamps are in Pacific time). There are many similar communication patterns between engineers via commenting to ask questions, have discussions and suggest modifications. 2.6 Collaboration Across Sites Employees use the Docs suite to collaborate with colleagues across the world, as Figure 6 shows. In that figure, employees working from nine locations in eight countries across the globe contributed to a document that was written within a week. The document was either viewed or edited with gaps of less than 12 hours (the threshold for suppressing gaps in the plot) in the first seven days as people worked in their local timezones. 
After final changes were made to the document, it was reviewed by people in Dublin, Mountain View, and New York. Figure 7 shows one month of global collaborations for full-time employees using Google Docs. The blue dots show the locations of the employees and a line connects two locations if a document is created in one location and viewed in the other. The warmer the color of the line, moving from green to red, the more documents shared between the two locations. Google Inc. 52.6 Collaboration Across Sites 2 COLLABORATION VISUALIZATION Figure 6: Activity on a document. Each user location is assigned a different color. The X axis is time in hours and the Y axis is the number of locations for each action type. Users from nine different locations contributed to the document. Figure 7: Global collaboration on Docs. The blue dots are locations and the dots are connected if there is collaboration on Google Docs between the two locations. 6 Google Inc.3 THE EVOLUTION OF COLLABORATION 2.7 Cross Device Work 2.7 Cross Device Work The advantage of cloud-based software and storage is that a document can be accessed from any device. Figure 8 shows one employee’s visits to a document from multiple devices and locations. When the employee was in Paris, a desktop or laptop was used during working hours and a mobile device during non-working hours. Apparently, the employee traveled to Aix-En-Provence on August 18. On August 18 and the first part of August 19, the employee continued working on the same document from a mobile device while on the move. Figure 8: Visits to a document by one user working on multiple devices and from multiple locations. Not surprisingly, the pattern of working on desktops or laptops during working hours and on mobile devices out of business hours holds generally at Google, as Figure 9 shows. The day of week is shown on the X axis and hour of day in local time on the Y axis. Each pixel is colored according to the average number of employees working in Google Docs in a day of week and time of day slot, with brighter colors representing higher numbers. Pixel values are normalized within each plot separately. Desktop and laptop usage of Google Docs peaks during conventional working hours (9:00 AM to 11:00 AM and 1:00 PM to 5:00 PM), while mobile device usage peaks during conventional commuting and other out-of-office hours (7:00 AM to 9:00 AM and 6:00 PM to 8:00 PM). Figure 9: The average number of active users working in Google Docs in each day of week and time of day slot. The X axis is day of the week and the Y axis is time of the day in local time. Desktop/Laptop usage peaks during working hours while mobile usage peaks at out-of-office working hours. 3 The Evolution of Collaboration 3.1 The Data This section explores changes in the usage of Google Docs over time. Section 2 defined collaborators as users who edited or commented on the same document and used logs of employee editing, viewing and commenting actions to describe collaboration within Google. This section defines collaborators differently using metadata on documents. Metadata is much less rich than the event history logs used in Section 2, but metadata is retained for a much longer period of time. Document metadata includes the document creation time and the last time that the document Google Inc. 73.2 Collaboration for New Employees 3 THE EVOLUTION OF COLLABORATION was accessed, but no other information about its revision history. 
However, the metadata does include the identification numbers for employees who have subscribed to the document, where a subscriber is anyone who has permission to view, edit or comment on a document and who has viewed the document at least once. Here we use metadata on documents, slides and spreadsheets. We call two employees collaborators (or subscription collaborators to be clear) if one is a subscriber to a document owned by the other and has viewed the document at least once and the document has fewer than 20 subscribers. The owner of the document is said to have shared the document with the subscriber. The number of subscribers is capped at 20 to avoid overcounting collaborators. The more subscribers the document has, the less likely it is that all the subscribers contributed to the document. There is no timestamp for when the employee subscribed to the document in the metadata, so the exact time of the collaboration is not known. Instead, the document creation time, which is known, is taken to be the time of the collaboration. An analysis (not shown here) of the event history data discussed in Section 2 showed that most collaborators join a collaboration soon after a document is created, so taking collaboration time to be document creation time is not unreasonable. To make this assumption even more tenable, we exclude documents for which the time of the last view, comment or edit is more than six months after the document was created. This section uses metadata on documents created between January 1, 2011 and March 31, 2013. We say that two employees had a subscription collaboration in July if they collaborated on a document that was created in July. 3.2 Collaboration for New Employees Here we define the new employees for a given month to be all the employees who joined Google no more than 90 days before the beginning of the month and started using Google Docs in the given month. For example, employees called new in the month of January 2011 must have joined Google no more than 90 days before January 1, 2011 and used Google Docs in January 2011. Each month can include different employees. New employees are said to share a document if they own a document that someone else subscribed to, whether or not the person subscribed to the document is a new employee. Similarly, a new employee is counted as a subscriber, regardless of the tenure of the document creator. Figure 10 shows that collaboration among new employees has increased since 2011. Over the last two years, subscribing has risen from 55% to 85%, sharing has risen from 30% to 50%, and the fraction of users who either share or subscribe has risen from 70% to 90%. In other words, new employees are collaborating earlier in their career, so there is a faster ramp-up and easier access to collective knowledge. Figure 10: This figure shows the percentage of new employees who share, subscribe to others’ documents and either share or subscribe in each one-month period over the last two years. Not only do new employees start collaborating more often (as measured by subscription and sharing), they also collaborate with more people. Figure 11 shows the percentage of new employees with at least a given number of collaborators by month. For example, the percentage of 8 Google Inc.3 THE EVOLUTION OF COLLABORATION 3.3 Collaboration in Sales and Marketing new employees with at least three subscription collaborators was 35% in January 2011 (the bottom red curve) and 70% in March 2013 (the top blue curve), a doubling over two years. 
It is interesting that the curves hardly cross each other and the curves for the farthest back months lie below those for recent months, suggesting that there has been steady growth in the number of subscription collaborators per new employee over this period.

Figure 11: This figure shows the proportion of new employees who have at least a given number of collaborators in each one-month period. Each period is assigned a different color. The cooler the color of the curve, moving from red to blue, the more recent the month. The legend only shows the labels for a subset of curves. The percentage of new employees who have at least three collaborators has doubled from 35% to 70%.

To present the data in Figure 11 in another way, Table 1 shows percentiles of the distribution of the number of subscription collaborators per new employee using Google Docs in January 2011 and in January 2013. For example, the lowest 25% of new employees using Google Docs had no such collaborators in January 2011 and two such collaborators in January 2013.

Percentile      25%   50%   75%   90%   95%
January 2011     0     1     4     7    11
January 2013     2     5    10    17    22
Table 1: Percentiles of the number of subscription collaborators per new employee in January 2011 and January 2013. The entire distribution shifts to the right.

3.3 Collaboration in Sales and Marketing
Section 3.2 compared new employees who joined Google in different months. This section follows current employees in Sales and Marketing who joined Google before January 1, 2011. That is, the previous section considered changes in new employee behavior over time and this section considers changes in behavior for a fixed set of employees over time. We only analyze subscription collaborations among this fixed set of employees; collaborations with employees not in this set are excluded.

Figure 12: This figure shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators in each one-month period.

Figure 12 shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators at several times in the past. There we see that more employees are sharing and subscribing over time because the fraction of the group with at least one subscription collaborator has increased from 80% to 95%, and the fraction with at least three subscription collaborators has increased from 50% to 80%. This shows that many of the employees who used to have no or very few subscription collaborators have migrated to having multiple subscription collaborators. In other words, the distribution of the number of subscription collaborators for employees who have been in Sales and Marketing since January 1, 2011 has shifted right over time, which implies that collaboration in that group of employees has increased over time. Finally, the number of documents shared by the employees who have been in Sales and Marketing at Google since January 1, 2011 has nearly doubled over the last two years. Figure 13 shows the number of shared documents normalized by the number of shared documents in January 2011.

Figure 13: This figure shows the number of shared documents created by employees in Sales and Marketing each month normalized by the number of shared documents in January 2011. The number has almost doubled over the last two years.

3.4 Collaboration Between Organizations
Collaboration between organizations has increased over time.
To show that, we consider hundreds of employees in nine teams within the Sales and Marketing group and the Engineering and Product Management group who joined Google before January 1, 2011, were still active on March 31, 2013 and used Google Docs in that period. Figure 14 represents the Engineering and Product Management employees as red dots and the Sales and Marketing employees as blue dots. The same dots are included in all three plots in Figure 14 because the employees included in this analysis do not change. A line connects two dots if the two employees had at least one subscription collaboration in the month shown. The denser the lines in the graph, the more collaboration, and the more lines connecting red and blue dots, the more collaboration between organizations. Clearly, subscription collaboration has increased both within and across organizations in the past two years. Moreover, the network shows more pronounced communities (groups of connected dots) over time. Although there are nine individual teams, there seem to be only three major communities in the network. Figure 14 indicates that teams can work closely with each other even though they belong to separate departments.

We also sampled 187 teams within the Sales and Marketing group and the Engineering and Product Management group. Figure 15 represents teams in Engineering and Product Management as red dots and teams in Sales and Marketing as blue dots. Two dots are connected if the two teams had at least one subscription collaboration between their members in the month. Figure 15 shows that the collaboration between those teams has increased and the interaction between the two organizations has become stronger over the past two years.

Figure 14: An example of collaboration across organizations. Red dots represent employees in Engineering and Product Management and blue dots represent employees in Sales and Marketing.

Figure 15: An example of collaboration between teams. Red dots represent teams in Engineering and Product Management and blue dots represent teams in Sales and Marketing.

3.5 Cultural Changes in Collaboration
Google Docs allows users to specify the access level (visibility) of their documents. The default access level in Google Docs is private, which means that only the user who created the document or the current owner of the document can view it. Employees can change the access level on a document they own and allow more people to access it. For example, the document owner can specify particular employees who are allowed to access the document, or the owner can mark the document as public within Google, in which case any employee can access the document. Clearly, not all documents created in Google can be visible to everyone at Google, but the more documents are widely shared, the more open the environment is to collaboration.

Figure 16: This figure shows the percentage of shared documents created in each month that are "public within Google". Public sharing is overtaking private sharing at Google.

Figure 16 shows the percentage of shared documents in Google created each month between January 1, 2012 and March 31, 2013 that are public within Google. The red line, which is a curve fit to the data to smooth out variability, shows that the percentage has increased by about 12%, from 48% to 54%, in the last year alone.
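The monthly series behind Figure 16 amounts to a simple aggregation plus smoothing. In the sketch below, the document records and the use of a rolling mean are assumptions made for illustration; the paper only says that a fitted curve was used to smooth the monthly percentages, without specifying the method.

```python
# Sketch under assumptions: `docs` is an invented list of (creation_month, visibility)
# records, and the rolling mean stands in for the unspecified "curve fit".
from collections import Counter

def public_share_by_month(docs):
    """docs: iterable of (month_str, visibility) with visibility in {'public', 'private'}.
    Returns {month: percentage of shared documents that are public within Google}."""
    totals, public = Counter(), Counter()
    for month, visibility in docs:
        totals[month] += 1
        if visibility == "public":
            public[month] += 1
    return {m: 100.0 * public[m] / totals[m] for m in sorted(totals)}

def rolling_mean(values, window=3):
    """Crude stand-in for the fitted curve: a centered moving average."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - window // 2), min(len(values), i + window // 2 + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

if __name__ == "__main__":
    docs = [("2012-01", "private"), ("2012-01", "public"),
            ("2012-02", "public"), ("2012-02", "public"), ("2012-02", "private")]
    monthly = public_share_by_month(docs)
    print(monthly)                            # {'2012-01': 50.0, '2012-02': 66.66...}
    print(rolling_mean(list(monthly.values())))
```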
In that sense, the culture of sharing is changing in Google from private sharing to public sharing.

4 Conclusions
We have examined how Google employees collaborate with Docs and how that collaboration has evolved, using logs of user activity and document metadata. To show the current usage of Docs in Google, we have developed a visualization technique for the revision history of a document and analyzed key features in Docs such as collaborative editing, commenting, and access from anywhere and on any device. To show the evolution of collaboration in the cloud, we have analyzed new employees and a fixed group of employees in Sales and Marketing, and computed collaboration network statistics each month. We find that employees are engaged in using the Docs suite, and collaboration has grown rapidly over the last two years. It would also be interesting to conduct a similar analysis for other enterprises and see how long it would take them to reach the benchmark Google has set for collaboration on Docs. Not only has collaboration on Docs changed at Google; the number of emails, comments on Google+, and calendar meetings between people who work together has also changed significantly over the past few years. How those changes reinforce each other over time would also be an interesting topic to study.

Acknowledgements
We would like to thank Ariel Kern for her insights about collaboration on Google Docs, Penny Chu and Tony Fagan for their encouragement and support, and Jim Koehler for his constructive feedback.

References
[1] Dan R. Herrick (2009). Google this!: Using Google Apps for collaboration and productivity. In Proceedings of the ACM SIGUCCS Fall Conference (pp. 55-64).
[2] Stijn Dekeyser, Richard Watson (2009). Extending Google Docs to Collaborate on Research Papers. Technical Report, The University of Southern Queensland, Australia.
[3] Ina Blau, Avner Caspi (2009). What Type of Collaboration Helps? Psychological Ownership, Perceived Learning and Outcome Quality of Collaboration Using Google Docs. In Learning in the Technological Era: Proceedings of the Chais Conference on Instructional Technologies Research (pp. 48-55).

[March, 2013] WORKING GROUP 4 Network Security Best Practices FINAL Report – BGP Security Best Practices

Table of Contents
1 RESULTS IN BRIEF...................................................................................................................................... 3
1.1 CHARTER ........................................................................................................................................................ 3
1.2 EXECUTIVE SUMMARY................................................................................................................................. 3
2 INTRODUCTION........................................................................................................................................... 4
2.1 CSRIC STRUCTURE......................................................................................................................................... 5
2.2 WORKING GROUP [#4] TEAM MEMBERS ..................................................................................................... 7
3 OBJECTIVE, SCOPE, AND METHODOLOGY ...................................................................................
8 3.1 OBJECTIVE ..................................................................................................................................................... 8 3.2 SCOPE .............................................................................................................................................................. 9 3.3 METHODOLOGY ............................................................................................................................................ 9 4 BACKGROUND.............................................................................................................................................. 9 4.1 DEPLOYMENT SCENARIOS .............................................................................................................................. 9 5 ANALYSIS, FINDINGS AND RECOMMENDATIONS .............................................................................10 5.1 BGP SESSION-LEVEL VULNERABILITY ........................................................................................................10 5.1.1 SESSION HIJACKING ...................................................................................................................................................10 5.1.2 DENIAL OF SERVICE (DOS) VULNERABILITY ........................................................................................................12 5.1.3 SOURCE-ADDRESS FILTERING ..................................................................................................................................17 5.2 BGP INJECTION AND PROPAGATION VULNERABILITY...............................................................................20 5.2.1 BGP INJECTION AND PROPAGATION COUNTERMEASURES.................................................................................22 5.2.2 BGP INJECTION AND PROPAGATION RECOMMENDATIONS ................................................................................25 5.3 OTHER ATTACKS AND VULNERABILITIES OF ROUTING INFRASTRUCTURE..............................................26 5.3.1 HACKING AND UNAUTHORIZED 3RD PARTY ACCESS TO ROUTING INFRASTRUCTURE.....................................26 5.3.2 ISP INSIDERS INSERTING FALSE ENTRIES INTO ROUTERS ...................................................................................28 5.3.3 DENIAL-OF-SERVICE ATTACKS AGAINST ISP INFRASTRUCTURE.......................................................................28 5.3.4 ATTACKS AGAINST ADMINISTRATIVE CONTROLS OF ROUTING IDENTIFIERS....................................................30 6 CONCLUSIONS............................................................................................................................................32 7 APPENDIX .....................................................................................................................................................33 7.1 BACKGROUND................................................................................................................................................33 7.1.1 SALIENT FEATURES OF BGP OPERATION..............................................................................................................33 7.1.2 REVIEW OF ROUTER OPERATIONS..........................................................................................................................34 7.2 BGP SECURITY INCIDENTS AND VULNERABILITIES ...................................................................................35 7.3 BGP RISKS 
MATRIX......................................................................................................................................38 7.4 BGP BCP DOCUMENT REFERENCES............................................................................................................40The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [3] of [40] 1 Results in Brief 1.1 Charter This Working Group was convened to examine and make recommendations to the Council regarding best practices to secure the Domain Name System (DNS) and routing system of the Internet during the period leading up to some significant deployment of protocol extensions such as the Domain Name System Security Extensions (DNSSEC), Secure BGP (Border Gateway Protocol) and the like. The focus of the group is limited to what is possible using currently available and deployed hardware and software. Development and refinement of protocol extensions for both systems is ongoing, as is the deployment of such extensions, and is the subject of other FCC working groups. The scope of Working Group 4 is to focus on currently deployed and available feature-sets and processes and not future or non-widely deployed protocol extensions. 1.2 Executive Summary Routing is what provides reachability between the various end-systems on the Internet be they servers hosting web or email applications; home user machines; VoIP (Voice over Internet Protocol) equipment; mobile devices; connected home monitoring or entertainment systems. Across the length and breadth of the global network it is inter-domain routing that allows for a given network to learn of the destinations available in a distant network. BGP (Border Gateway Protocol) has been used for inter-domain routing for over 20 years and has proven itself a dynamic, robust, and manageable solution to meet these goals. BGP is configured within a network and between networks to exchange information about which IP address ranges are reachable from that network. Among its many features, BGP allows for a flexible and granular expression of policy between a given network and other networks that it exchanges routes with. Implicit in this system is required trust in information learned from distant entities. That trust has been the source of problems from time to time causing reachability and stability problems. These episodes have typically been short-lived but underscored the need for expanding the use of Best Current Practices (BCPs) for improving the security of BGP and the inter-domain routing system. These mechanisms have been described in a variety of sources and this document does not seek to re-create the work done elsewhere but to provide an overview and gloss on the vulnerabilities and methods to address each. Additionally, the applicability of these BCPs can vary somewhat given different deployment scenarios such as the scale of a network’s BGP deployment and the number of inter-domain neighbors. By tailoring advice for these various scenarios, recommendations that may seem confusing or contradictory can be clarified. Further, an appendix includes a table that indexes the risks and countermeasures according to different deployment scenarios. 
Issues that the working group considered included: • Session hijacking • Denial of service (DoS) vulnerabilities • Source-address filtering • BGP injection and propagation vulnerabilitiesThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [4] of [40] • Hacking and unauthorized access to routing infrastructure • Attacks against administrative controls of routing identifiers Working Group 4 recommends that the FCC encourage adoption of numerous best practices for protecting ISPs’ routing infrastructures and addressing risks related to routing that are continuously faced by ISPs.. Inter-domain routing via BGP is a fundamental requirement for ISPs and their customers to connect and interoperate with the Internet. As such, it is a critical service that ISPs must ensure is resilient to operational challenges and protect from abuse by miscreants. SPECIAL NOTE: For brevity, and to address the remit of the CSRIC committee to make recommendations for ISPs, the term ISP is used throughout the paper. However, in most instances the reference or the recommendations are applicable to any BGP service components whether implemented by an ISP or by other organizations that peer to the Internet such as business enterprises, hosting providers, and cloud providers. 2 Introduction CSRIC was established as a federal advisory committee designed to provide recommendations to the Commission regarding Best Practices and actions the Commission may take to ensure optimal operability, security, reliability, and resiliency of communications systems, including telecommunications, media, and public safety communications systems. Due to the large scope of the CSRIC mandate, the committee then divided into a set of Working Groups, each of which was designed to address individual issue areas. In total, some 10 different Working Groups were created, including Working Group 4 on Network Security Best Practices. This Working Group will examine and make recommendations to the Council regarding best practices to secure the Domain Name System (DNS) and routing system of the Internet during the period leading up to a anticipated widespread implementation of protocol updates such as the Domain Name System Security Extensions (DNSSEC) and Secure Border Gateway Protocol (BGPsec) extensions. The Working Group presented its report on DNS Security issues in September 2012. This Final Report – BGP Best Practices documents the efforts undertaken by CSRIC Working Group 4 Network Security Best Practices with respect to securing the inter-domain routing infrastructure that is within the purview of ISPs, enterprises, and other BGP operators. Issues affecting the security of management systems that provide control and designation of routing and IP-space allocation records that BGP is based on were also considered. Routing and BGP related services are necessary and fundamental components of all ISP operations, and there are many established practices and guidelines available for operators to consult. Thus most ISPs have mature BGP/routing management and infrastructures in-place. Still, there remain many issues and exposures that introduce major risk elements to ISPs, since the system itself is largely insecure and unauthenticated, yet provides the fundamental traffic control system of the Internet. 
This report enumerates the issues the group identified as most critical and/or that may need more attention.

2.1 CSRIC Structure
CSRIC III is organized as a Steering Committee overseeing ten Working Groups, each led by a Chair or Co-Chairs: Working Group 1: Next Generation 911; Working Group 2: Next Generation Alerting; Working Group 3: E911 Location Accuracy; Working Group 4: Network Security Best Practices; Working Group 5: DNSSEC Implementation Practices for ISPs; Working Group 6: Secure BGP Deployment; Working Group 7: Botnet Remediation; Working Group 8: E911 Best Practices; Working Group 9: Legacy Broadcast Alerting Issues; Working Group 10: 911 Prioritization.

2.2 Working Group [#4] Team Members
Working Group [#4] consists of the members listed below for work on this report. Name Company Rodney Joffe – Co-Chair Neustar, Inc. Rod Rasmussen – Co-Chair Internet Identity Mark Adams ATIS (Works for Cox Communications) Steve Bellovin Columbia University Donna Bethea-Murphy Iridium Rodney Buie TeleCommunication Systems, Inc. Kevin Cox Cassidian Communications, an EADS NA Comp John Crain ICANN Michael Currie Intrado, Inc. Dale Drew Level 3 Communications Chris Garner CenturyLink Joseph Gersch Secure64 Software Corporation Jose A. Gonzalez Sprint Nextel Corporation Kevin Graves TeleCommunication Systems (TCS) Tom Haynes Verizon Chris Joul T-Mobile Mazen Khaddam Cox Kathryn Martin Access Partnership Ron Mathis Intrado, Inc. Danny McPherson Verisign Doug Montgomery NIST Chris Oberg ATIS (Works for Verizon Wireless) Victor Oppleman Packet Forensics Elman Reyes Internet Identity Ron Roman Applied Communication Sciences Heather Schiller Verizon Jason Schiller Google Marvin Simpson Southern Company Services, Inc.
Tony Tauber Comcast Paul Vixie Internet Systems Consortium Russ White Verisign Bob Wright AT&T Table 1 - List of Working Group Members 3 Objective, Scope, and Methodology 3.1 Objective This Working Group was convened to examine and make recommendations to the Council regarding best practices to secure the Domain Name System (DNS) and routing system of the Internet during the period leading up to what some anticipate might be widespread implementation of protocol updates such as the Domain Name System Security Extensions (DNSSEC) and Secure Border Gateway Protocol (BGPsec) extensions (though the latter outcome is not entirely uncontroversial). DNS is the directory system that associates a domain name with an IP (Internet Protocol) address. In order to achieve this translation, the DNS infrastructure makes hierarchical inquiries to servers that contain this global directory. As DNS inquiries are made, their IP packets rely on routing protocols to reach their correct destination. BGP is the protocol utilized to identify the best available paths for packets to take between points on the Internet at any given moment. This foundational system was built upon a distributed unauthenticated trust model that has been mostly sufficient for over two decades but has some room for improvement. These foundational systems are vulnerable to compromise through operator procedural mistakes as well as through malicious attacks that can suspend a domain name or IP address's availability, or compromise their information and integrity. While there are formal initiatives under way within the IETF (which has been chartered to develop Internet technical standards and protocols) that will improve this situation significantly, global adoption and implementation will take some time. The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [9] of [40] This Working Group will examine vulnerabilities within these areas and recommend best practices to better secure these critical functions of the Internet during the interval of time preceding deployment of more robust, secure protocol extensions. This report covers the BGP portion of these overall group objectives. 3.2 Scope Working Group 4’s charter clearly delineates its scope to focus on two subsets of overall network security, DNS and routing. It further narrows that scope to exclude consideration of the implementation of DNSSEC (tasked to Working Group 5) and secure extensions of BGP (tasked to Working Group 6). While those groups deal with protocol modifications requiring new software and/or hardware deployments; WG4 is geared toward items that either don't require these extensions or are risks which are outside the scope of currently contemplated extensions. For this report regarding BGP, the focus is on using known techniques within the Operator community. Some of these methods and the risks they seek to address are useful even in cases where protocol extensions are used in some future landscape. 3.3 Methodology With the dual nature of the work facing Working Group 4, the group was divided into two subgroups, one focused on issues in DNS security, another in routing security. Starting in December 2011, the entire Working Group met every two weeks via conference call(s) to review research and discuss issues, alternating between sub-groups. The group created a mailing list to correspond and launched a wiki to gather documents and to collectively collaborate on the issues. 
Additional subject matter experts were occasionally tapped to provide information to the working group via conference calls. The deliverables schedule called for a series of reports starting in June 2012 that would first identify issues for both routing and DNS security, then enumerate potential solutions, and finally present recommendations. The initial deliverables schedule was updated in March in order to concentrate efforts in each particular area for separate reports. This first report on DNS security issues was presented in September 2012, and this, the second report on routing issues, is being published in March 2013. Based on the discussions of the group, a list of BGP risks, potential solutions, and relevant BCP documents was created and refined over the course of the work. Subject matter experts in BGP then drove development of the initial documentation of issues and recommendations. These were then brought together into a full document for review and feedback. Text contributions, as completed, were reviewed, edited and approved by the full membership of Working Group 4. 4 Background 4.1 Deployment Scenarios BGP is deployed in many different kinds of networks of different size and profiles. Many different recommendations exist to improve the security and resilience of the inter-domain routing system. Some of the advice can even appear somewhat contradictory and often the key decision can come down to understanding what is most important or appropriate for a given network considering its size, the number of external connections, number of BGP routers, size The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [10] of [40] and expertise of the staff and so forth. We attempt to tailor the recommendations and highlight which are most significant for a given network operator’s situation. Further background and information on routing operations can be found in the Appendix (Section 7) of this document for readers unfamiliar with this area of practice. 5 Analysis, Findings and Recommendations The primary threats to routing include: • Risks to the routers and exchange of routing information • Routing information that is incorrect or propagates incorrectly • General problems with network operations 5.1 BGP Session-Level Vulnerability When two routers are connected and properly configured, they form a BGP peering session. Routing information is exchanged over this peering session, allowing the two peers to build a local routing table, which is then used to forward actual packets of information. The first BGP4 attack surface is the peering session between two individual routers, along with the routers themselves. Two classes of attacks are included here, session hijacking and denial of service. 5.1.1 Session Hijacking The BGP session between two routers is based on the Transport Control Protocol (TCP), a session protocol also used to transfer web pages, naming information, files, and many other types of data across the Internet. Like all these other connection types, BGP sessions can be hijacked or disrupted, as shown in Figure 1. Figure 1: Session Hijacking In this diagram, the attack host can either take over the existing session between Routers A and B, or build an unauthorized session with Router A. 
By injecting itself into the peering between Routers A and B, the attacker can inject new routing information, change the routing information being transmitted to Router A, or even act as a "man in the middle," modifying the routing information being exchanged between these two routers.

5.1.1.1 Session-Level Countermeasures

Current solutions to these types of attacks center on secure hash mechanisms, such as HMAC-MD5 (which has been deprecated) and HMAC-SHA. These mechanisms rely on the peering routers sharing a key (a shared key – essentially a password) that is used to calculate a cryptographic hash across each packet (or message) transmitted between the two routers; the hash is included in the packet itself. The receiving router can use the same key to calculate the hash. If the hash in the packet matches the locally calculated hash, the packet could only have been transmitted by another router that knows the shared key.

This type of solution is subject to a number of limitations. First, the key must actually be shared among the routers building peering sessions. In this case, the routers involved in the peering session are in different administrative domains. Coming to some uniform agreement about how keys are generated and communicated (e.g., by phone or email) with the often hundreds of partners and customers of an ISP is an impractical task. There is also the possibility that any key-sharing mechanism deployed to ease this administrative burden could itself come under attack (although such attacks have never been seen in the wild). Lastly, some concerns have been raised that the burden of cryptographic calculations could itself become a vector for a Denial-of-Service (DoS) attack via a directed stream of packets with invalid hash components. One way to deny service is to make the processor that is responsible for processing routing updates and maintaining liveness too busy to reliably process these updates in a timely manner. In many routers the processor responsible for calculating the cryptographic hash is also responsible for processing newly learned routing information, sending out new routing information, and even transmitting keep-alive messages to keep all existing sessions up. Since calculating the cryptographic hash is computationally expensive, a smaller flood of packets with invalid hashes can consume all the resources of the processor, making it easier to drive the processor to the point where it is too busy.

Other mechanisms, such as the Generalized TTL Security Mechanism (GTSM, described in RFC 5082), focus on reducing the scope of such attacks. This technique relies on a feature of the IP protocol that prevents an attacker from effectively reaching the BGP process on a router with forged packets sent from some remote point on the Internet. Since most BGP sessions are built across point-to-point links (on which only two devices can communicate), this approach would prevent most attackers from interfering in the BGP session. Sessions built over a shared LAN, as is the case in some Internet exchanges, will be protected from those outside the LAN, but will remain vulnerable to all parties connected to the LAN. This solution is more complicated to implement when BGP-speaking routers are not directly connected. It is possible to count the number of hops between routers and limit the TTL value to only that number of hops.
This will provide some protection, limiting the scope of possible attackers to be within that many hops. If this approach is used consider failure scenarios of devices between the pair of BGP speaking routers, what impact those failures will have on the hop count between the routers, and if you want to expand the TTL value to allow the session to remain up for a failure that increases the hop count. The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [12] of [40] 5.1.1.2 Session-Level Current Recommendations For a network with a small (e.g., single-digit) number of eBGP neighbors, it is reasonable to follow the lead of what is specified by the upstream ISPs who may have a blanket policy of how they configure their eBGP sessions. A network with larger numbers of eBGP neighbors may be satisfied that they can manage the number of keys involved either through data-store or rubric. Note that a rubric may not always be feasible as you cannot ensure that your neighbors will always permit you to choose the key. Managing the keys for a large number of routers involved in BGP sessions (a large organization may have hundreds or thousands of such routers) can be an administrative burden. Questions and issues can include: • In what system should the keys be stored and who should have access? • Should keys be unique per usage having only one key for internal usage and another key that is shared for all external BGP sessions? • Should keys be unique per some geographical or geo-political boundary say separate keys per continent or per country or per router? • Should keys be unique to each administrative domain, for example a separate key for each Autonomous System a network peers with? • There is no easy way to roll over keys, as such changing a key is quite painful, as it disrupts the transmission of routing information, and requires simultaneous involvement from parties in both administrative domains. This makes questions of how to deal with the departure of an employee who had access to the keys, or what keys to use when peering in a hostile country more critical. Another consideration is the operational cost of having a key. Some routing domains will depend on their peers to provide the key each time a new session is established, and not bother to make a record of the key. This avoids the problems of how to store the key, and ensure the key remains secure. However if a session needs to be recreated because configuration information is lost either due to accidental deletion of the configuration, or hardware replacement, then the key is no longer known. The session will remain down until the peer can be contacted and the key is re-shared. Often times this communication does not occur, and the peer may simply try to remove the key as a troubleshooting step, and note the session reestablishes. When this happens the peer will often prefer for the session to remain up, leaving the peering session unsecured until the peer can be contacted, and a maintenance window can be scheduled. For unresponsive peers, an unsecured peering session could persist, especially considering that the urgency to address the outage has now passed. Despite these vulnerabilities having been widely known for a decade or more, they have not been implicated in any notable number of incidents. As a result some network operators have not found the cost/benefit trade-offs to warrant the operational cost of deploying such mechanisms while others have. 
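To make the hash-based session protection described above concrete, the sketch below shows the general shape of shared-key message authentication. It is illustrative only and assumes a hypothetical key and message format; real deployments carry the digest in the TCP MD5 signature option or TCP-AO rather than in application code, and the key-management questions listed above are unchanged.

# Illustrative sketch only: it shows the general shape of the shared-key
# message authentication described above. Real BGP sessions carry the digest
# in the TCP MD5 signature option (or TCP-AO); the message framing and key
# handling here are simplified assumptions, not an actual implementation.
import hmac
import hashlib

SHARED_KEY = b"example-shared-key"  # hypothetical key agreed out of band

def sign_message(message: bytes, key: bytes = SHARED_KEY) -> bytes:
    # Compute a keyed hash over an outgoing routing message.
    return hmac.new(key, message, hashlib.sha256).digest()

def verify_message(message: bytes, digest: bytes, key: bytes = SHARED_KEY) -> bool:
    # Recompute the hash locally and compare in constant time. The receiver
    # must do this work before it knows whether the packet is genuine, which
    # is why a flood of packets with bogus digests can itself tax the processor.
    expected = hmac.new(key, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, digest)

# A forged update produced without the shared key fails verification.
update = b"UPDATE 192.0.2.0/24 via AS65000"
assert verify_message(update, sign_message(update))
assert not verify_message(update, hmac.new(b"wrong-key", update, hashlib.sha256).digest())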
Given these facts, the Working Group recommends that individual network operators continue to make their own determinations in using these countermeasures. 5.1.2 Denial of Service (DoS) Vulnerability Because routers are specialized hosts, they are subject to the same sorts of Denial of Service (DoS) attacks any other host or server is subject to. These attacks fall into three types:The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [13] of [40] 1. Attacks that seek to consume all available interface bandwidth making it difficult for enough legitimate traffic to get through such as UDP floods and reflective attacks 2. Attacks that seek to exhaust resources such as consume all available CPU cycles, memory, or ports so that the system is too busy to respond such as TCP SYN attacks 3. Attacks utilizing specially crafted packets in an attempt to cause the system to crash or operate in an unexpected way such as buffer overflow attacks, or malformed packet attacks that create an exception that is not properly dealt with Bandwidth exhaustion attacks attempt to use so much bandwidth that there is not enough available bandwidth for services to operate properly. This type of an attack can cause routers to fail to receive routing protocol updates or keep-alive messages resulting in out-of-date routing information, routing loops, or interruption of routing altogether, such as happens when a BGP session goes down and the associated routing information is lost. Resource exhaustion attacks target traffic to the router itself, and attempt to make the router exhaust its CPU cycles or memory. In the case of the former, the router’s CPU becomes too busy to properly process routing keepalives and updates causing the adjacencies to go down. In the case of the latter, the attacker sends so much routing protocol information that the router has no available memory to store all of the required routing information. Crafted packets attacks attempt to send a relatively small number of packets that the router does not deal with appropriately. When a router receives this type of packet it may fill up interface buffers and then not forward any traffic on that interface causing routing protocols to crash and restart, reboot, or hang. In some cases the router CPU may restart, reboot, or hang likely causing loss of all topological and routing state. One example was the “protocol 55” attack, where some router vendors simply did not code properly how to deal with this traffic type. Some routers are specialized to forward high rates of traffic. These routers often implement their forwarding capabilities in hardware that is optimized for high throughput, and implement the less demanding routing functions in software. As such bandwidth exhaustion attacks are targeted at the routers interfaces or the backplane between those interfaces, or the hardware responsible for making forwarding decisions. The other types of attack target the software responsible for making the routing decisions. Due to the separation between routing and forwarding, a fourth class of attacks are targeted at exhausting the bandwidth of the internal interconnection between the forwarding components and the routing components. The section below on “Denial-of-Service Attacks on ISP Infrastructure” contains a discussion of disruptive attacks besides those targeting the exchange of BGP routing information. 
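The GTSM mechanism referenced earlier, and revisited in the countermeasures that follow, exploits the fact that an IP packet's TTL can only be decremented in transit, never increased. A minimal sketch of the receive-side check, with illustrative values rather than any particular router implementation:

# Minimal sketch of the GTSM receive-side check (RFC 5082) discussed above and
# revisited in the countermeasures below. The values are illustrative; real
# routers perform this test in the packet-receive path, not in application code.
MAX_TTL = 255

def gtsm_accept(received_ttl: int, expected_hops: int = 1) -> bool:
    # The legitimate peer always transmits with TTL = 255. Every router hop
    # decrements the TTL by one, and no sender can make it larger than 255,
    # so a packet arriving with a high TTL must have originated nearby.
    return received_ttl >= MAX_TTL - (expected_hops - 1)

# Directly connected peer: the TTL arrives untouched at 255.
assert gtsm_accept(255, expected_hops=1)
# A remote attacker several hops away cannot present TTL 255 on arrival.
assert not gtsm_accept(247, expected_hops=1)
# A multihop session three hops away widens the window, as cautioned above,
# at the cost of accepting anything within that radius.
assert gtsm_accept(253, expected_hops=3)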
5.1.2.1 Denial of Service Countermeasures GTSM, described above, can be an effective counter to some forms of DoS attacks against routers, by preventing packets originating outside the direct connection between two BGP peers from being processed by the router under attack. GTSM cannot resolve simple buffer overflow problems, or DoS attacks that exploit weaknesses in packet processing prior to the TTL check, however.The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [14] of [40] Another mechanism currently used to prevent DoS attacks against routers is to simply make the interfaces on which the BGP session is running completely unreachable from outside the local network or the local segment. Using link-local addresses in IPv6, is one technique (with obviously limited applicability). Another approach is applying packet-filters on the relevant address ranges at the network edge. (This process is called infrastructure filtering). Other well-known and widely deployed DoS mitigation techniques can be used to protect routers from attack just as they can be used to protect other hosts. For instance, Control Plane Policing can prevent the routing process on a router from being overwhelmed with high levels of traffic by limiting the amount of traffic accepted by the router directed at the routing processor itself. 5.1.2.2 Denial-of-Service Current Recommendations Since routers are essentially specialized hosts, mechanisms that can be used to protect individual routers and peering sessions from attack are widely studied and well understood. What prevents these techniques from being deployed on a wide scale? Two things: the perception that the problem space is not large enough to focus on, and the administrative burden of actually deploying such defenses. For instance, when GTSM is used with infrastructure filtering, cryptographic measures may appear to be an administrative burden without much increased security. Smaller operators, and end customers, often believe the administrative burden too great to configure and manage any of these techniques. Despite these vulnerabilities having been widely known for a decade or more, they have not been implicated in any notable number of incidents. As a result some network operators have not found the cost/benefit trade-offs to warrant the operational cost of deploying such mechanisms while others have. Given these facts, the Working Group recommends that individual network operators continue to make their own determinations in using these countermeasures. In dealing with vulnerabilities due to “crafted packets”, the vendor should provide notification to customers as the issues are discovered as well as providing fixed software in a timely manner. Customers should make it a point to keep abreast of notifications from their vendors and from various security information clearing-houses. 5.1.2.2.1 Interface Exhaustion Attacks Recommendations include: 1. Understanding the actual forwarding capabilities of your equipment in your desired configuration 2. Examining your queuing configuration 3. Carefully considering which types of traffic share a queue with your routing protocols, and if that traffic can be blocked, rate-limited or forced to another queue 4. Understanding packet filtering capabilities of your equipment, and under what scenarios it is safe to deploy packet filters 5. 
When it is safe to do so, tactically deploy packet filters upstream from a router that is being attacked

The first thing to consider with regard to attacks that attempt to consume all available bandwidth is the actual throughput of the router. It is not safe to assume that a router with two or more 100 Gigabit Ethernet interfaces can receive 100G on one interface and transmit that same 100G out another interface. Some routers can forward at line rate, and some routers cannot. Performance may vary with packet size; for example, forwarding traffic in software becomes more taxing on the CPU as the number of packets increases.

The next thing to consider is outbound queuing on routers upstream from the router that is being attacked. Routers typically place routing protocol traffic in a separate Network Control (NC) queue. Determine the characteristics of this queue, such as the queue depth and the frequency with which it is serviced. These values may be tunable. Also consider what types of traffic are placed in this queue, and specifically what traffic an outside attacker can place in this queue. Consider preventing users of the network from being able to place traffic in this queue if they do not need to exchange routing information with your network. For direct customers running eBGP, limit traffic permitted into the NC queue to only the traffic required to support their routing protocols. Consider rate-limiting this traffic so that no one customer can fill up the queue. Note that rate limits will increase convergence time, so test a customer configuration that is advertising and receiving the largest set of routes, and measure how long it takes to re-learn the routing table after the BGP session is reset, both with and without the rate limits.

If the attack traffic has a particular profile, and all traffic matching that profile can be dropped without impacting legitimate traffic, then a packet filter can be deployed upstream from the router that is under attack. Ensure that deploying a packet filter will not impact the performance of your router by testing packet filters with various types of attacks and packet sizes on your equipment in a lab environment. Ensure that total throughput is not decreased, and that there is not a particular packet-per-second rate that causes the router to crash, become unresponsive, stop forwarding traffic reliably, cause routing protocols to time out, etc. If the upstream router belongs to a non-customer network, you will need to work with that network to mitigate the attack. Additional bandwidth on the interconnect may allow you to move the bottleneck deeper into your network where you can deal with it. Often the IP destination of these attacks is something downstream from the router, but it is possible that some or all of the attack traffic may be destined to the router itself. In that case some of the mitigation techniques in the next section may also be helpful.

5.1.2.2.2 Resource Exhaustion Attacks

Recommendations include:
1. Consider deploying GTSM
2. Consider making router interfaces reachable only from the directly connected network
3. Consider only permitting traffic sourced from configured neighbors
4. Consider deploying MD5
5. Deploy maximum-prefix limits

The first set of recommendations is to consider deploying mechanisms that restrict who can send routing protocol traffic to a router.
The second set of recommendations restricts how much routing protocol state a neighbor can cause a router to hold.

GTSM, described above, can be an effective counter to some forms of DoS attacks against routers by limiting, via a configured hop-count radius, who can send routing protocol traffic to the router. GTSM works by preventing packets originating outside the direct connection between two BGP peers from being processed by the router under attack. GTSM cannot resolve simple buffer overflow problems, or DoS attacks that exploit weaknesses in packet processing prior to the TTL check, however.

Another mechanism currently used to prevent DoS attacks against routers is to simply make the interfaces on which the BGP session is running completely unreachable from outside the local network or the local segment. Using link-local addresses in IPv6 is one technique (with obviously limited applicability). Another approach is applying packet filters on the relevant address ranges at the network edge (this process is called infrastructure filtering). Some routers can dynamically generate packet filters from other portions of the router configuration. This enables one to create an interface packet filter that only allows traffic on the BGP ports from source IP addresses that belong to a configured neighbor. This means attempts to send packets to the BGP port from IP addresses that are not a configured neighbor will be dropped right at the interface.

The same session-level protections discussed earlier, such as MD5, can also limit who can send routing information to only those routers or hosts that have the appropriate key. As such, this is also an effective mechanism to limit who can send routing protocol traffic. While these packets will be processed by the router, and could possibly tax the CPU, they cannot cause the router to create additional routing state, such as adding entries to the Routing Information Base (RIB) or Forwarding Information Base (FIB).

Lastly, each neighbor should be configured with a limit on the number of prefixes they can send, set to a reasonable value. A single neighbor accidentally or intentionally de-aggregating all of the address space they are permitted to send could consume a large amount of RIB and FIB memory, especially with the large IPv6 allocations.

5.1.2.2.3 Crafted Packet Attacks

Crafted packet attacks typically occur when a router receives some exception traffic that the vendor did not plan for. In some cases it may be possible to mitigate these attacks by filtering the attack traffic, if that traffic has a profile that can be matched on and all traffic matching that profile can be discarded. More often than not, this is not the case. In all cases the vendor should provide new code that deals with the exception.

Recommendations include keeping current with all SIRT advisories from your vendors. When a vulnerability is published, move quickly to upgrade vulnerable versions of the code. This may require an upgrade to a newer version of the code or a patch to an existing version. For larger organizations that have extensive and lengthy software certification programs, it is often more reasonable to ask the vendor to provide a patch for the specific version(s) of code that the organization is running.
If possible, the vendor should indicate the extent to which the code has been modified, to quantify how substantial the change is and to help the provider plan what should be included in the abbreviated software certification tests. For smaller organizations, or organizations that perform little or no software certification, the newer version of code with the fix in place should be deployed. Generally this is done cautiously at first, with a limited field trial to see whether issues arise, followed by a more widespread deployment.

5.1.2.2.4 Internal Bandwidth Exhaustion Attacks

Other well-known and widely deployed DoS mitigation techniques can be used to protect routers from attack just as they can be used to protect other hosts. For instance, Control Plane Policing can prevent the routing process on a router from being overwhelmed with high levels of traffic by limiting the amount of traffic accepted by the router that is directed at the routing processor itself. One should consider not only the impact on the router CPU, but also the impact on the bandwidth between the forwarding components and the router CPU. There may be some internal queuing in place on the interconnect between the forwarding components and the router CPU. It may be possible to influence which queue routing protocol traffic is placed in, which queue traffic generated by customers is placed in, and the depth and servicing of these queues, in order to separate routing traffic from non-routing traffic and minimize the ability of the latter to impact the former. If traffic generated by customers (for routing protocols or otherwise) can crowd out a network's internal routing protocol traffic, then operators may consider separately rate-limiting this customer traffic.

5.1.3 Source-address filtering

Many Internet security exploits hinge on the ability of an attacker to send packets with spoofed source IP addresses. Masquerading in this way can give the attacker entry to unauthorized access at the device or application level, and some BGP vulnerabilities are also in this category. The problem of source spoofing has long been recognized, and countermeasures are available for filtering at the interface level.

5.1.3.1 Source-address spoofing example

Consider the diagram below, which illustrates the legitimate bi-directional traffic flow between two hosts on the left-hand side. An attacker connected to another network can send IP packets with the source address field set to the address of one of the other machines, unless filtering is applied at the point where the attacker's host or network attaches to AS65002.

5.1.3.2 Source-address spoofing attacks

Though most IP transactions are bi-directional, attacks utilizing spoofed source IPs do not require bi-directional communication; instead they exploit particular protocol or programmatic semantic weaknesses. Exploits using this technique have covered many areas over the years, including the following types, each followed by some examples:
• Attacks against services which rely only on the source IP of the incoming packet for authorization
o rsh, rlogin, NFS, X Windows, etc.
• Attacks where the unreachability of the source can be exploited
o TCP SYN floods which exhaust resources on the server
• Attacks where the attacker masquerades as the "victim"
o Small DNS or SNMP requests resulting in a highly asymmetric data flow back toward the victim
o Abusive traffic which results in the legitimate user getting blocked from the server or network

5.1.3.3 Source-address filtering challenges

The barriers to implementing these countermeasures have ranged from lack of vendor support to lack of solid motivation to implement them.
• Lack of proper vendor support: In older implementations of network devices, filtering based on the source address of a packet was performed in software, rather than hardware, and thus had a major impact on the rate of forwarding through the device. Most modern network equipment can perform source filtering in the hardware switching path, eliminating this barrier to deployment.
• Lack of scalable deployment and configuration management: In older deployments, filters based on the source of traffic were configured and managed manually, adding a large expense for the entity running the network. This barrier has largely been resolved through remotely triggered black hole filtering, unicast Reverse Path Forwarding (uRPF), and loose uRPF options.
• Fear of interrupting legitimate traffic, for example in multi-homed situations: Vendors have created flexibility in uRPF filtering to reduce or eliminate this barrier. Possible future additions include "white lists," which would allow traffic to pass a uRPF check even though it did not meet the rules.
• Lack of business motivation: Unilateral application of these features does not benefit or protect a network or its customers directly; rather, it contributes to the overall security of the Internet. The objection to incurring this "cost" is increasingly being overcome by the realization among network operators that if everyone performs this type of filtering, then everyone benefits.

5.1.3.4 Source-address filtering recommendations

Filtering should be applied as close as possible to the entry point of traffic. Wherever a host, network, or subnet is attached, a feature such as packet filtering, uRPF, or source-address validation should be used. Ensure adequate support from equipment vendors for subscriber-management systems (e.g., for cable and DSL providers) and for data-center or campus routers and switches.

Stub networks should also filter traffic at their borders to ensure that IP ranges assigned to them do not appear in the source field of incoming packets, and that only those ranges appear in the source field of outgoing packets. Transit networks should likewise use features such as uRPF. Strict mode should be used at a border with a topological stub network and loose mode between transit networks. Transit networks that provide connectivity primarily to stub networks, such as consumer ISPs, should consider uRPF strict mode on interfaces facing their customers. If these providers supply a home router to their customers, they should consider making uRPF part of the default home router configuration. Transit networks that provide connectivity to a mix of stub networks and multi-homed networks must weigh the administrative burden of configuring uRPF strict mode only on stub customers, and uRPF loose mode (or no uRPF) on customers that are, or become, multi-homed.
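The strict and loose uRPF checks recommended above can be summarized in a short sketch. This is a simplified model with an illustrative forwarding table and interface names, not vendor configuration; on real routers the check is a per-interface feature of the forwarding path.

# Sketch of the strict and loose uRPF checks recommended above, using a toy
# forwarding table. Interface names and prefixes are illustrative assumptions;
# on real routers uRPF is a per-interface feature of the forwarding path.
import ipaddress

# Toy FIB: prefix -> interface the best route points out of.
FIB = {
    ipaddress.ip_network("192.0.2.0/24"): "cust-eth0",    # stub customer
    ipaddress.ip_network("198.51.100.0/24"): "peer-eth1",
    ipaddress.ip_network("0.0.0.0/0"): "transit-eth2",    # default route
}

def lookup(src: str):
    # Longest-prefix match on the packet's source address.
    addr = ipaddress.ip_address(src)
    matches = [net for net in FIB if addr in net]
    return max(matches, key=lambda net: net.prefixlen, default=None)

def urpf_strict(src: str, in_iface: str) -> bool:
    # Accept only if the best route back to the source uses the arrival interface.
    net = lookup(src)
    return net is not None and FIB[net] == in_iface

def urpf_loose(src: str, ignore_default: bool = True) -> bool:
    # Accept if the source is routable at all, optionally ignoring the default route.
    net = lookup(src)
    if net is None or (ignore_default and net.prefixlen == 0):
        return False
    return True

# A stub customer spoofing an address from another network fails strict mode.
assert urpf_strict("192.0.2.10", "cust-eth0")
assert not urpf_strict("198.51.100.5", "cust-eth0")
# Loose mode only rejects sources with no (non-default) route, e.g. dark space.
assert not urpf_loose("203.0.113.9")

As the sketch suggests, loose mode only establishes that some route to the source exists, which is why its value in the default-free zone is limited to catching unrouted (dark) address space, as discussed next.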
When using uRPF loose mode in the presence of a default route, one must take special care to consider the configuration options that include or exclude the default route. The value of loose-mode uRPF between networks in the default-free zone is debatable. It will only prevent traffic with a source address in RFC 1918 space or in dark IP space (IP addresses that are not routed on the Internet). Often these dark IP addresses are useful for backscatter techniques and for tracing the source(s) of a spoofed DoS attack. It is also important to consider whether RFC 1918 addresses are used internally in the transit provider's network; this practice may become more common if ISPs implement Carrier Grade NAT. It is also worth pointing out that some business customers depend on VPN software that is poorly implemented and only changes the destination IP address when re-encapsulating a packet. If these customers are using non-routed IP addresses in their internal network, then enabling uRPF will break those customers.

It is important to measure the impact on forwarding when enabling uRPF. Even when uRPF is implemented in hardware, the router must look up the destination as well as the source. A double lookup can cause forwarding throughput to be reduced by half. This may have no impact on the forwarding rate if the throughput of the forwarding hardware is more than twice the aggregate rate of all the interfaces it supports.

Further, more detailed advice and treatment of this subject can be found in:
• IETF BCP 38/RFC 2827, Network Ingress Filtering (http://tools.ietf.org/html/bcp38)
• BCP 84/RFC 3704, Ingress Filtering for Multihomed Networks (http://tools.ietf.org/html/bcp84)
• ICANN SAC004, Securing the Edge

5.2 BGP Injection and Propagation Vulnerability

A second form of attack against the routing information provided by BGP4 is the injection of misleading routing information, as shown in Figure 2.

Figure 2: A Prefix Hijacking Attack

In this network, AS65000 has authority to originate 192.0.2.0/24. Originating a route, in this context, means that computers having addresses within the advertised address space are actually reachable within your network; a computer with the address 192.0.2.1, for instance, is physically attached to your network. Assume AS65100 would like to attract traffic normally destined to a computer within the 192.0.2.0/24 address range. Why would AS65100 want to do this? There are a number of possible motivations, including:
• A server with an address in this range accepts logins from customers or users, such as a financial web site, or a site that hosts other sensitive information or information of value
• A server with an address in this range processes information crucial to the operation of a business the owner of AS65100 would like to damage in some way, such as a competitor, or a political entity under attack

AS65100, the attacker, can easily attract packets normally destined to 192.0.2.0/24 within AS65000 by simply advertising a competing route for 192.0.2.0/24 to AS65002. From within BGP itself, there is no way for the operators in AS65002 to know which of these two advertisements is correct (or whether both origins are valid – a configuration which does see occasional legitimate use).
The impact of the bogus information may be limited to the directly neighboring AS(es), depending on the routing policy of the nearby ASes. The likelihood of the incorrect route being chosen can be increased by several attributes of the route:

• A shorter AS path: A shorter AS path has the semantic value of indicating a topologically "closer" network. In the example above, the normal propagation of the route would show AS65100 as "closer" to AS65001 and thus, other factors being equal, more preferred than the legitimate path via AS65000.
• A longer prefix: Longer prefixes represent more-specific routing information, so a longer prefix is always preferred over a shorter one. For instance, in this case the attacker might advertise 192.0.2.0/25, rather than 192.0.2.0/24, to make the false route to the destination appear more desirable than the real one.
• A higher local-preference setting: Local-preference is the non-transitive BGP attribute that most network operators use to administratively influence their local routing. Typically, routes learned from a "customer" (i.e., paying) network are preferred over those learned where the neighboring network has a non-transit relationship or where the operator is paying for transit from the neighboring network. This attribute is more important in the BGP decision algorithm than AS-path length, so routes learned over such a session can draw traffic even without manipulation of the AS-path attribute.

This illustration can be used to help describe some related types of risks:

Figure 3: BGP Propagation Vulnerabilities

Route Leak: In this case, AS65100 is a transit customer of both AS65000 and AS65002. The operator of AS65100 accidentally leaks routing information advertised by AS65000 into its peering session with AS65002. This can draw traffic passing from AS65002 toward a destination reachable through AS65000 through AS65100, even though this path was not intended to provide transit between these two networks. As the name suggests, such leaks are most often the result of inadvertent misconfiguration and can result in overloading the links between AS65100 and the other ASes. Occasionally, however, the same mechanism is used for more malicious outcomes:
• Man in the Middle: In this case, all the autonomous systems shown have non-transit relationships. For policy reasons, AS65000 would prefer that traffic destined to 192.0.2.0/24 not pass through AS65100. To enforce this policy, AS65000 filters the route for 192.0.2.0/24 towards AS65100. In order to redirect traffic through itself (for instance, in order to snoop on the traffic stream), AS65100 generates a route advertisement that makes it appear as if AS65000 has actually advertised 192.0.2.0/24, advertises this route to AS65002, and thus draws traffic destined to 192.0.2.0/24 through itself.
• False Origination: This attack is similar to the man in the middle explained above; however, in this case there is no link between AS65000 and AS65100. Any traffic destined to 192.0.2.0/24 that is drawn into AS65100 is discarded rather than delivered.

Note that all three of these vulnerabilities are variations on a single theme: routing information that should not be propagated, based on compliance with some specific policy, nonetheless is.
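The manipulations described above succeed because of the order in which BGP prefers routes. The sketch below models just three of the relevant rules (longest matching prefix first, then higher local-preference, then shorter AS path), reusing the ASNs and prefixes from the example; it is a simplified model of the decision process, not a complete implementation.

# Simplified model of why the manipulations described above succeed. It applies
# three of the relevant preference rules in order: longest matching prefix,
# then higher local-preference, then shorter AS path. The ASNs and prefixes
# follow the example in the text; this is not a complete BGP decision process.
import ipaddress
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str        # e.g. "192.0.2.0/24"
    local_pref: int    # higher is preferred
    as_path: tuple     # e.g. (65001, 65000)

def best_for(dest: str, routes: list) -> Route:
    # Pick the route that would be used to forward traffic toward `dest`.
    addr = ipaddress.ip_address(dest)
    candidates = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    return max(candidates,
               key=lambda r: (ipaddress.ip_network(r.prefix).prefixlen,
                              r.local_pref,
                              -len(r.as_path)))

legit = Route("192.0.2.0/24", local_pref=100, as_path=(65001, 65000))

# A hijacker advertising a more-specific /25 wins regardless of AS-path length.
hijack_specific = Route("192.0.2.0/25", local_pref=100, as_path=(65001, 65002, 65100))
assert best_for("192.0.2.1", [legit, hijack_specific]) is hijack_specific

# With equal prefix lengths, a bogus route learned on a session carrying a
# higher local-preference also wins, even with a longer AS path.
hijack_pref = Route("192.0.2.0/24", local_pref=200, as_path=(65100, 65100, 65100))
assert best_for("192.0.2.1", [legit, hijack_pref]) is hijack_pref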
5.2.1 BGP Injection and Propagation Countermeasures 5.2.1.1 Prefix Filtering The key bulwark against entry and propagation of illegitimate routing announcements into the global routing system is prefix-level filtering; typically at the edge between the ISP and their customers. The usual method involves the customer communicating a list of prefixes and downstream ASes which they expect to be reachable through the connection to the ISP. The ISP will then craft a filter applied to the BGP session which explicitly enumerates this list of expected prefixes (a “prefix-list”), perhaps allowing for announcement of some more-specific The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [23] of [40] prefixes within the ranges such as might be needed by the customer to achieve some goals in adjusting the load of the customer’s inbound traffic across their various connections. This configuration forms a “white-list”, in security parlance, of possible downstream destinations but does not validate the overall semantic correctness of the resulting routing table. 5.2.1.1.1 Manual Prefix-Filter Limitations The validity of the information in this list is obviously important and if the customer is either malicious or simply mistaken in the prefixes they communicate, the prefix filter could obviously still leave open a vulnerability to bogus route injection. Thus typically the information communicated by the customer is checked against registration records such as offered in the “whois” information available from Regional Internet Registries (RIRs) and/or others in the address assignment hierarchy. However, there is no information in the RIR information explicitly indicating a mapping between the address-assignment and the origin AS. The reality is that the process of manually checking a filter that is a few thousand lines long, with hundreds of changes a week is tedious and time consuming. Many transit providers do not check at all. Others have a policy to always check, but support staff may grow complacent with updates from certain customers that have long filter lists or have filter lists that change weekly, especially if those customers have never had a questionable prefix update in the past. In these cases they may only spot-check, only check the first few changes and then give up, or grow fatigued and be less diligent the more lines they check. Moreover, the entity name fields in the “whois” information are free-form and often can’t be reliably matched to the entity name used by the ISP’s customer records. Typically it is a fuzzy match between the customer name on record and the company name listed in the “whois” record. A truly malicious actor could order service with a name which is intentionally similar to a company name whose IP addresses they intend to use. Another possibility is that the company name on record is legitimate and an exact match to the company name on the “whois” record, but the customer is a branch office, and the legitimate holder of the IP addresses is the corporate office which has not authorized the branch office to use their space. It is also possible that the customer is the legitimate holder of the address space, but the individual who called in to the provider support team is not authorized to change the routing of the IP block in question. This problem is further complicated when a transit provider’s customer has one or more downstream customers of its own. 
These relationships are typically hard or impossible to verify. If every transit provider accurately filtered all of the prefixes their customers advertised, and each network that a transit provider peers with could be trusted to also accurately filter all of the prefixes of their customers, then route origination and propagation problems could be virtually eliminated. However, managing filters requires thousands of operators examining, devising, and adjusting the filters on millions of devices throughout the Internet. While there are processes and tools within any given network, such highly inconsistent processes, particularly when handling large amounts of data (a tedious process in and of itself), tend to produce an undesirable rate of errors. Each time an individual operator misjudges a particular piece of information, or simply makes a mistake in building a filter, the result is a set of servers (or services) that are unreachable until the mistake is found and corrected.

5.2.1.2 Internet Routing Registry (IRR)

The second source of information a provider can use as a basis for filtering received routing information is a voluntary set of databases of routing policy and origination called Internet Routing Registries (IRRs). These IRRs allow providers to register information about their policies towards customers and other providers, and also allow network operators to register which address space they intend to originate. Some providers require their customers to register their address space in an IRR before accepting the customer's routes; oftentimes the provider will "proxy register" information on the customer's behalf, since most customers are not versed in IRR details.

5.2.1.2.1 IRR Limitations

Because IRRs are voluntary, there is some question about the accuracy and timeliness of the information they contain (see Research on Routing Consistency in the Internet Routing Registry by Nagahashi and Esaki for a mostly negative view, and How Complete and Accurate is the Internet Routing Registry by Khan for a more positive view). Anecdotally, RIPE's IRR is in widespread use today, and some large providers actually build their filtering off this database, so the accuracy level is at least operationally acceptable for some number of operators. Some IRR repositories use an authorization model as well as authentication, but none that primarily serve North America perform RPSL authorization using the scheme described in RFC 2725 – Routing Policy System Security (https://tools.ietf.org/html/rfc2725).

5.2.1.3 AS-Path filtering

Filters on the AS_PATH contents of incoming BGP announcements can also be part of a defensive strategy to guard against improper propagation of routing information. Some ISPs have used AS-path filters on customer-facing BGP sessions instead of prefix filters. This approach is generally inadequate to protect against even the most naïve misconfigurations, much less a deliberate manipulation. Often a leak has involved inadvertently redistributing BGP routes from one of a stub network's ISPs to the other; another past problem has involved redistributing BGP routes into an internal routing protocol and back into BGP. Where AS-path filters can be useful is in guarding against an egregious leak. For instance, an ISP would not expect ASNs belonging to known large ISPs to show up in the AS_PATH of updates from an enterprise-type customer network.
Applying an AS-path filter to such a BGP session can act as a second line of defense behind the specific prefix-list filter. Similarly, if there are networks with which the ISP has non-transit relationships, applying a similar AS-path filter to those sessions (which would not be candidates for prefix filters) can help guard against a leak resulting in an unintended transit path.

5.2.1.3.1 AS-Path filtering limitations

Maintaining such a list of "known" networks which are not expected to show up in transit adjacencies can be fairly manual, incomplete, and error-prone. Note also that a filter which merely validates that the neighbor AS appears in the path is useless, since that is the normal, expected state.

5.2.1.4 Maximum-prefix cut-off threshold

Many router feature sets include the ability to limit the number of prefixes that are accepted from a neighbor via BGP advertisements. When the overall limit is exceeded, the BGP session is torn down on the presumption that this situation is a dangerous error condition. Typically a threshold can also be set at which a warning notification (e.g., a log message) is issued to the operations staff. This way a gradual increase in the number of advertisements will trigger a sensible manual increase of the cut-off threshold without causing an outage. This tool can be used to guard against the most egregious leaks, which can, if the numbers are large enough, exhaust the routing table memory on the recipient's routers and/or otherwise cause widespread network instability. Typical deployments set the threshold based on the currently observed number of advertisements within different bands; for instance, 1-100, 100-1,000, 1,000-5,000, 5,000-10,000, 10,000-50,000, 50,000-100,000, 100,000-150,000, 150,000-200,000, 200,000-250,000, 250,000-300,000.

5.2.1.4.1 Maximum-prefix limitations

When the threshold is exceeded, the session is shut down and manual intervention is required to bring it back up. Where a network has multiple interconnection points to another network (and thus multiple BGP neighbors), all sessions will typically go down at the same time, assuming all are announcing the same number of prefixes. In this case, all connectivity between the two networks may be lost during this period. Obviously this measure is an attempt to balance two different undesirable outcomes and so must be weighed judiciously. Above 10,000 or perhaps 50,000 prefixes (e.g., a full Internet routing table from a transit provider), maximum-prefix thresholds provide limited protection. A small number of neighbors each advertising a unique set of 300,000 routes would fill the memory of the receiving router anyway. However, if these neighbors are all advertising a large portion of the Internet routes, with many routes overlapping, then the limit offers some protection.

5.2.1.5 Monitoring

Aside from a proactive filtering approach, a network operator can use various vantage points external to their own network (e.g., "route servers" or "looking glasses") to monitor the prefixes for which they have authority, watching for competing announcements which may have entered the BGP system. Some tools, such as BGPmon, have been devised to automate such monitoring.
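A minimal sketch of this kind of monitoring is shown below: announcements observed at external vantage points are compared against the prefixes and origin ASNs the operator expects to originate, and anything that competes with, or is more specific than, an expected announcement is flagged. The expected-origin table and the observed announcements are illustrative assumptions; in practice the observations would come from route servers, looking glasses, or a service such as BGPmon.

# Sketch of the monitoring approach described above: announcements observed at
# external vantage points are compared against the prefixes and origin ASNs the
# operator expects to originate. The expected-origin table and the observations
# are hand-written assumptions here; in practice they would come from route
# servers, looking glasses, or a service such as BGPmon.
import ipaddress

# Prefixes this operator expects to originate, and from which ASN.
EXPECTED = {
    ipaddress.ip_network("192.0.2.0/24"): 65000,
}

def check_announcement(prefix: str, origin_asn: int) -> list:
    # Return human-readable alerts for one observed (prefix, origin) pair.
    seen = ipaddress.ip_network(prefix)
    alerts = []
    for mine, my_asn in EXPECTED.items():
        if not (seen.subnet_of(mine) or mine.subnet_of(seen)):
            continue  # unrelated address space
        if origin_asn != my_asn:
            alerts.append(f"{prefix} originated by AS{origin_asn}, expected AS{my_asn}")
        elif seen.prefixlen > mine.prefixlen:
            alerts.append(f"unexpected more-specific {prefix} of {mine} observed")
    return alerts

# A competing origination and a more-specific hijack both raise alerts; the
# operator must then react out of band, as noted in the limitations below.
assert check_announcement("192.0.2.0/24", 65100)
assert check_announcement("192.0.2.0/25", 65000)
assert not check_announcement("198.51.100.0/24", 64500)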
5.2.1.5.1 Monitoring Limitations Obviously, this approach is reactive rather than proactive and steps would then need to be taken to contact the offending AS and/or intermediate AS(es) to stop the advertisement and/or propagation of the misinformation. Also, the number of such vantage points is limited so a locally impacting bogus route may or may not be detected with this method. 5.2.2 BGP Injection and Propagation Recommendations The most common router software implementations of BGP do not perform filtering of route advertisements, either inbound or outbound, by default. While this situation eases the burden of configuration on network operators (the customers of the router vendors), it has also caused the majority of unintentional inter-domain routing problems to date. Thus it is recommended that network operators of all sizes take extra care in configuration of BGP sessions to keep unintentional routes from being injected and propagated. Stub network operators should configure their outbound sessions to only explicitly allow the The Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [26] of [40] prefixes which they expect to be advertising over a particular session. ISPs should explicitly filter their inbound sessions at the boundary with their “customer edge”. The inter-provider connections between large ISPs are impractical locations for filtering given the requirement for significant dynamism in BGP routing and traffic-engineering across the global Internet. However, the cumulative gains accrued when each ISP filters at this “customer edge” are significant enough to lessen the residual risk of not filtering on these “non-customer” BGP sessions. ISPs (and even stub networks) should also consider using AS-path filters and maximum-prefix limits on sessions as a second line of defense to guard against leaks or other pathological conditions. 5.3 Other Attacks and Vulnerabilities of Routing Infrastructure There are many vulnerabilities and attack vectors that can be used to disrupt the routing infrastructure of an ISP outside of the BGP protocol and routing-specific operations. These are just as important to address as issues the working group has identified within the routing space itself. The largest attack surface for routing infrastructure likely lies within the standard operational security paradigm that applies to any critical networked asset. Therefore the working group looked at including BCPs relating to network and operational security as part of addressing these issues, and ISPs should be aware that they are likely to see attacks against their routing infrastructure based on these “traditional” methods of computer and network intrusion. 5.3.1 Hacking and unauthorized 3rd party access to routing infrastructure ISPs and all organizations with an Internet presence face the ever-present risk of hacking and other unauthorized access attempts on their infrastructure from various actors, both on and off network. This was already identified as a key risk for ISPs, and CSRIC 2A – Cyber Security Best Practices was published in March 2011 to provide advice to address these types of attacks and other risks for any ISP infrastructure elements, including routing infrastructure. 
The current CSRIC III has added a new Working Group 11 that will report out an update to prior CSRIC work in light of recent advancements in cybersecurity practices and a desire of several US government agencies to adopt consensus guidelines to protect government and critical infrastructure computers and networks. A recent SANS publication, Twenty Critical Security Controls for Effective Cyber Defense: Consensus Audit Guidelines (CAG)4 lays out these principals and maps them out versus prior work, including another relevant document, NIST SP-800-53 Recommended Security Controls for Federal Information Systems and Organizations. 5 The SANS publication appears to be a primary driver for Working Group 11’s work. The entire document is available for review, and we have included the 20 topic areas here for reference: 4 http://www.sans.org/critical-security-controls/ 5 http://csrc.nist.gov/publications/PubsSPs.htmlThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [27] of [40] Critical Control 1: Inventory of Authorized and Unauthorized Devices Critical Control 2: Inventory of Authorized and Unauthorized Software Critical Control 3: Secure Configurations for Hardware and Software on Laptops, Workstations, and Servers Critical Control 4: Continuous Vulnerability Assessment and Remediation Critical Control 5: Malware Defenses Critical Control 6: Application Software Security Critical Control 7: Wireless Device Control Critical Control 8: Data Recovery Capability Critical Control 9: Security Skills Assessment and Appropriate Training to Fill Gaps Critical Control 10: Secure Configurations for Network Devices such as Firewalls, Routers, and Switches Critical Control 11: Limitation and Control of Network Ports, Protocols, and Services Critical Control 12: Controlled Use of Administrative Privileges Critical Control 13: Boundary Defense Critical Control 14: Maintenance, Monitoring, and Analysis of Audit Logs Critical Control 15: Controlled Access Based on the Need to Know Critical Control 16: Account Monitoring and Control Critical Control 17: Data Loss Prevention Critical Control 18: Incident Response Capability Critical Control 19: Secure Network Engineering Critical Control 20: Penetration Tests and Red Team Exercises Because this work is being analyzed directly by Working Group 11 to address the generic risk to ISPs of various hacking and unauthorized access issues, Working Group 4 will not be commenting in-depth in this area, and refers readers to reports from Working Group 11 for comprehensive, and updated coverage of these risks when they issue their report. We will comment upon current BCPs for ISPs to look to adopt in the interim, and provide further background around risks unique to running BGP servers/routers in this area. An ISP’s routing infrastructure is an important asset to protect, as gaining control of it can lead to a wide variety of harms to ISP customers. Further, an ISP’s staff computers, servers, and networking infrastructure also rely upon their own routers to correctly direct traffic to its intended destinations. The ISP’s own sensitive data and processes could be compromised via hacked routers/servers. Thus routers should be included on the list of network assets that are assigned the highest level of priority for protection under any type of ISP security program. 
There are many industry standard publications pertaining to overall cybersecurity best practices available for adoption by ISPs or any organization at risk of attack, including prior CSRIC reports. It is incumbent upon ISPs to maintain their overall security posture and be up-to-date on the latest industry BCPs and adopt the practices applicable to their organization. Of particular note is the IETF’s RFC 4778 - Current Operational Security Practices in Internet Service Provider Environments6 which offers a comprehensive survey of ISP security practices. An older IETF publication, but still active BCP, that still applies to ISP environments can be found with BCP 46, aka RFC 3013 Recommended Internet Service Provider Security Services and Procedures7 . NIST also puts out highly applicable advice and BCPs for running 6 http://www.ietf.org/rfc/rfc4778.txt 7 http://www.apps.ietf.org/rfc/rfc3013.txtThe Communications Security, Reliability and Interoperability Council III Working Group [#4] Draft Report [MARCH, 2013] Page [28] of [40] government networks, with the most currently relevant special report, NIST SP-800-53. The ultimate goal of someone attempting unauthorized access to routing infrastructure would be to either deny customer use of those servers or, more likely, insert false entries within the router to misdirect the users of those routers. This is a functional equivalent to route injection and propagation attacks as already described in section 5.2. So the analysis and recommendations presented in section 5.2.1.5 with respect to monitoring for and reacting to route injection and propagation attacks apply in the scenario where an attacker has breached a router to add incorrect entries. 5.3.1.1 Recommendations 1) ISPs should refer to and implement the practices found in CSRIC 2A – Cyber Security Best Practices that apply to securing servers and ensure that routing infrastructure is protected. 2) ISPs should adopt applicable BCPs found in other relevant network security industry approved/adopted publications. Monitor for applicable documents and update. Three documents were identified that currently apply to protecting ISP networks: IETF RFC 4778 and BCP 46 (RFC 3013); NIST special publications series: NIST SP-800-53 3) ISPs should ensure that methods exist within the ISP’s operations to respond to detected or reported successful route injection and propagation attacks, so that such entries can be rapidly remediated. 4) ISPs should consider implementing routing-specific monitoring regimes to assess the integrity of data being reported by the ISP’s routers that meet the particular operational and infrastructure environments of the ISP. 5.3.2 ISP insiders inserting false entries into routers While insider threats can be considered a subset of the more general security threat of unauthorized access and hacking, they deserve special attention in the realm of routing security. ISP insiders have unparalleled access to any systems run by an ISP, and in the case of routers, the ability to modify entries is both trivially easy and potentially difficult to detect. Since routers don’t typically have company-sensitive information, are accessed by thousands of machines continuously, and are not usually hardened or monitored like other critical servers, it is relatively easy for an insider to alter a router’s configuration in a way that adversely affects routing. 
5.3.2 ISP insiders inserting false entries into routers
While insider threats can be considered a subset of the more general threat of unauthorized access and hacking, they deserve special attention in the realm of routing security. ISP insiders have unparalleled access to the systems an ISP runs, and in the case of routers, modifying entries is both trivially easy and potentially difficult to detect. Because routers do not typically hold company-sensitive information, are accessed continuously by thousands of machines, and are not usually hardened or monitored like other critical servers, it is relatively easy for an insider to alter a router's configuration in a way that adversely affects routing.

The analysis and recommendations for this particular threat do not differ significantly from those presented in section 5.3.1 of this report (hacking and unauthorized third-party access to routing infrastructure). However, this exposure merits special attention given the liabilities an ISP may face from such difficult-to-detect activities by its own employees.

5.3.2.1 Recommendations
1) Refer to section 5.3.1.1 for generic hacking threats.

5.3.3 Denial-of-Service Attacks against ISP Infrastructure
Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) attacks are among the oldest and most prolific attacks that ISPs have faced over the years and continue to defend against today. Typically, an external actor targeting some Internet presence or infrastructure to make it unusable is behind such attacks. DoS/DDoS attacks come in many flavors that can be broadly lumped into two primary categories: logic attacks and resource-exhaustion/flooding attacks (http://static.usenix.org/publications/library/proceedings/sec01/moore/moore.pdf). Logic attacks exploit vulnerabilities to cause a server or service to crash, or to degrade performance below usable thresholds. Resource-exhaustion or flooding attacks consume server or network resources to the point where the targeted service no longer responds, or service is reduced to an operationally unacceptable level. We examine the latter type, resource exhaustion, in this section. Logic attacks are largely aimed at breaking services/servers and can be largely addressed by the analysis and recommendations described above with respect to BGP-specific issues, and by those put forward in section 5.3.1 covering protection of networked assets from hacking and other attacks.

There is a large variety of flooding attacks that an ISP could face in daily operations. These can be targeted at networks or at any server, machine, router, or even user of an ISP's network. From the perspective of routing operations, it is helpful to differentiate between "generic" DoS attacks that could affect any server and those that exploit some characteristic of BGP to affect routers in particular, which have already been covered. Due to the long history, huge potential impact, and widespread use of various DoS and DDoS attacks, there is an abundance of materials, services, techniques, and BCPs available for dealing with them. ISPs will likely already have practices in place for handling attacks both originating from their networks and directed at their networks and services. The IETF's RFC 4732, Internet Denial-of-Service Considerations (http://tools.ietf.org/rfc/rfc4732.txt), provides a thorough overview of DoS/DDoS attacks and mitigation strategies and is a solid foundational document. The SANS Institute has published another useful reference of BCPs against DoS/DDoS attacks, A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider Environment (http://www.sans.org/reading_room/whitepapers/intrusion/summary-dos-ddos-preventionmonitoring-mitigation-techniques-service-provider-enviro_1212). As mentioned in section 5.3.1, several documents cover general ISP security concerns, and those typically include prescriptive advice for protecting a network against DoS/DDoS attacks. Such advice can be found in previously cited documents, including the prior CSRIC report CSRIC 2A – Cyber Security Best Practices (http://www.fcc.gov/pshs/docs/csric/WG2A-Cyber-Security-Best-Practices-Final-Report.pdf), the IETF's RFC 4778 – Current Operational Security Practices in Internet Service Provider Environments, BCP 46 / RFC 3013 – Recommended Internet Service Provider Security Services and Procedures, and NIST SP 800-53.
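The flooding category above is ultimately about resource exhaustion, which is why the countermeasures referenced in this report (and in the risks matrix of Appendix 7.3) emphasize rate-limiting such as control-plane policing. The sketch below is a generic token-bucket limiter included only to illustrate that mechanism; the rate and burst values are arbitrary assumptions, and real deployments would rely on vendor control-plane policing features rather than application code.

```python
# Minimal sketch (illustrative only): a token-bucket limiter of the kind that
# underlies control-plane policing. Traffic is allowed while tokens remain;
# excess traffic during a flood is dropped instead of exhausting the router CPU.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = burst         # maximum bucket size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

if __name__ == "__main__":
    # Arbitrary example values: 100 control-plane packets/sec with a burst of 200.
    policer = TokenBucket(rate_per_sec=100, burst=200)
    accepted = sum(policer.allow() for _ in range(1000))
    print(f"accepted {accepted} of 1000 back-to-back packets")
```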
For the most part, an ISP's interdomain routers must be publicly reachable in order for the networks they serve to be reachable across the Internet. Thus, access-restriction measures that can be applied to an ISP's internal infrastructure are not available for these connecting routers. This leaves an ISP with limited choices for DDoS protection, including the traditional approaches of overprovisioning equipment and bandwidth and various DoS/DDoS protection services and techniques.

5.3.3.1 Recommendations
1) ISPs should implement the BCPs and recommendations for securing an ISP's infrastructure against DoS/DDoS attacks enumerated in the IETF's RFC 4732, Internet Denial-of-Service Considerations, and consider implementing the BCPs enumerated in the SANS Institute reference document A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider Environment.
2) ISPs should refer to and implement the BCPs related to DoS/DDoS protection found in CSRIC 2A – Cyber Security Best Practices that apply to protecting servers from DoS/DDoS attacks.
3) ISPs should consider adopting BCPs found in other relevant, industry-approved network security publications that pertain to DoS/DDoS issues, and monitor for applicable documents and updates. Four that currently apply to protecting ISP networks from DoS/DDoS threats are IETF RFC 4778; BCP 46 (RFC 3013); NIST SP 800-53; and the ISOC publication Towards Improving DNS Security, Stability, and Resiliency.
4) ISPs should review and apply BCPs for protecting network assets against DoS/DDoS attacks carefully, to ensure they are appropriate for protecting routing infrastructure.

5.3.4 Attacks against administrative controls of routing identifiers
Blocks of IP space and Autonomous System Numbers (ASNs) are allocated by various registries around the world. Each of these Regional Internet Registries (RIRs) is provided IP space and ASN allocation blocks by IANA, to manage under its own rules and practices. In turn, several of these registries allow country- or region/use-specific registries to sub-allocate IP space based on their own rules, processes, and systems. Each RIR maintains a centralized "whois" database that designates the "owner" of the IP space or ASNs within its remit. Access to the databases that control these designations, and thus the "rights" to use a particular space or ASN, is provided and managed by the RIRs and sub-RIR registries, depending upon the region.
Processes for authentication and management of these identifier resources are not standardized and, until recently, were relatively unsophisticated and insecure. This presents an administrative attack vector: a miscreant can use a variety of account-attack methods, from hacking to password guessing to social engineering and more, to assume control over an ASN or IP space allocation. In other fields such attacks would be called "hijacking" or "account take-over" attacks, but the use of the word "hijacking" in the BGP space to cover various injection and origin announcements complicates the common taxonomy. For this section, we therefore refer to account "hijacking" as "account take-over".

The primary concern for most ASN and prefix-block owners, and for the ISPs that serve them, is the take-over of active space they are using. A miscreant could literally "take over" IP space being routed and used by the victim, much like an origin attack as described in section 5.3. In this case it would be equivalent to a full take-over, with the majority of the global routing system recognizing the miscreant's announcement as the new "legitimate" one, with all the inherent risks previously described. The real owner will have to prove their legitimacy and actual legal ownership/control of the resource that has been taken over. Depending upon the authentication scheme the registry uses, this can prove difficult, especially for legacy space and older ASN registrations. A corollary of this attack scenario is a miscreant taking over "dormant" IP space or an unused ASN and thus "squatting" in unused territory (for a full description of the taxonomy of hijacking, squatting, and spoofing attacks in routing space, see Internet Address Hijacking, Spoofing and Squatting Attacks, http://securityskeptic.typepad.com/the-security-skeptic/2011/06/internet-address-hijackingspoofing-and-squatting-attacks.html). While not impactful on existing Internet presence, squatting on IP space can lead to many forms of abuse, including the announcement of bogus peering arrangements if the squatted resource is an ASN.

In a take-over scenario, a miscreant typically impersonates or compromises the registrant of the ASN and/or IP space in order to gain access to the management account for that ASN or CIDR block. Until recently, nearly all RIRs and registries used an e-mail authentication scheme to manage registrant change requests. Thus, if the registrant's e-mail address uses a domain name that has become available, the miscreant can register that domain, recreate the administrative e-mail address, and authenticate himself with the registry. If the domain is not available, the criminal could still try to hijack the domain-name registration account to gain control of that same domain. If a registry or RIR requires more verification for registrant account management, the criminal can use various social-engineering tricks against registry staff to get into the management account. Once a criminal has control of the registration account, they can update the information there to move to a new peering ISP, create new announcements from their "new" space, or launch any of the BGP-type attacks listed above. Even more basically, the criminal can simply use their new control of the ASN/prefix to have their own abusive infrastructure announced on the Internet for whatever purpose they like. This includes direct abuse against the Internet in general (e.g., hosting malware controllers, phishing, online scams), but also the ability to impersonate the original holder of the space they have taken over. Of course, they can also intercept traffic originally destined for the legitimate holder of the space, as previously described for the various route-hijacking scenarios.
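Because the e-mail-based authentication described above is only as strong as the registrant's contact domain, one simple hygiene check a resource holder or its ISP can run is to verify that the domains behind registrant contact addresses still resolve and remain under their control. The sketch below is an illustrative assumption, not a procedure from this report; it only checks DNS resolvability of contact-address domains, which is a weak but cheap early-warning signal for a lapsed domain.

```python
# Minimal sketch (illustrative): warn when the domain of a registrant contact
# e-mail address no longer resolves, a possible sign it has lapsed and could be
# re-registered by a miscreant to pass e-mail-based registry authentication.
import socket
from typing import Iterable, List

def stale_contact_domains(contact_emails: Iterable[str]) -> List[str]:
    suspicious = []
    for email in contact_emails:
        domain = email.rsplit("@", 1)[-1]
        try:
            socket.getaddrinfo(domain, None)   # resolves A/AAAA records for the domain
        except socket.gaierror:
            suspicious.append(domain)
    return suspicious

if __name__ == "__main__":
    # Hypothetical contact addresses pulled from an RIR "whois" record.
    contacts = ["noc@example.net", "admin@lapsed-company-domain.example"]
    for domain in stale_contact_domains(contacts):
        print(f"contact domain does not resolve (verify registration!): {domain}")
```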
The end result of an administrative account take-over is likely to be similar to other injection attacks against routing infrastructure, as covered in section 5.2. ISPs will therefore want to consult the BCPs covering techniques for monitoring and reacting to those types of attacks; these BCPs cover the general effects of BGP origin attacks – dealing with service interruptions and the worldwide impacts.

ISPs typically do not have direct access to, or control of, the RIR or other registry account information that is compromised in most hijacking attacks. The ISP is dependent upon the affected registry to restore control of the ISP's management account or, in the case of a serious breach, the registrar/registry's own services. Once control is re-established, the original, correct information needs to be re-entered and published again. This will usually mean updating routing table entries/BGP announcements and fixing any account information that has been modified.

The industry has largely been slow to adopt the security measures for protecting account access to ASN and CIDR-block management that are found in other online services such as financial services, e-commerce, or even some ISP management systems. The industry also has many participants across a wide variety of geographic regions, with few standards and requirements for the security of registration systems and very limited oversight. This means it is often difficult to find support for typical online security tools such as multi-factor authentication, multi-channel authentication, and verification of high-value transactions. This has been changing in recent years at the RIR level, with ARIN, RIPE, and several other IP registries at various levels of authority implementing new controls and auditing account information. Despite this, gaps remain, especially in "legacy" data entered many years ago, before current management systems and authentication processes were implemented.

While there is scant guidance on this topic for ASN/IP block management, ICANN's Security and Stability Advisory Committee (SSAC) has released two documents addressing these issues in the domain-name space, which is quite analogous to the provisioning of IP space. These documents provide BCPs for avoiding and mitigating many of these issues. SAC 40, Measures to Protect Domain Registration Services Against Exploitation or Misuse (http://www.icann.org/en/groups/ssac/documents/sac-040-en.pdf), addresses issues faced by domain-name registrars and offers numerous BCPs and recommendations for securing a registrar against the techniques used by domain-name hijackers. Many of the BCPs presented there would be applicable to RIRs and other IP address provisioning authorities, including ISPs managing their own customers. SSAC 44, A Registrant's Guide to Protecting Domain Name Registration Accounts (http://www.icann.org/en/committees/security/sac044.pdf), provides advice that domain-name registrants can put in place to better protect their domains from hijacking.
Similar techniques could be used by operators to protect their own IP space allocations. Given the limited choices and practices followed by various IP-space allocators, ISPs need to carefully evaluate their security posture, and the practices of their RIRs or other IP-space allocators, with these BCPs in mind.

5.3.4.1 Recommendations
1) ISPs and their customers should refer to the BCPs and recommendations found in SSAC 44, A Registrant's Guide to Protecting Domain Name Registration Accounts, with respect to managing the ASNs and IP spaces they register and use to provide services.
2) ISPs should review the BCPs and recommendations found in SAC 40, Measures to Protect Domain Registration Services Against Exploitation or Misuse, to provide similar protections for IP space they allocate to their own customers.

6 Conclusions
Working Group 4 has recommended the adoption of numerous best practices for protecting the inter-domain BGP routing system. As a distributed infrastructure requiring several actors to both enable and protect it, network operators face challenges outside their direct control in tackling many of the issues identified. The more widely Best Current Practices are adopted, the more robust the whole system will be against both bad actors and simple mistakes. See Appendix 7.3 for a tabular display of risks indexed against the appropriate countermeasures discussed in the body of the document.

7 Appendix
7.1 Background
Note that, in order to remain consistent with other CSRIC III reports, considerable portions of "Salient Features of BGP Operation" have been taken verbatim from the Appendix of the CSRIC III Working Group 6 Interim Report published March 8, 2012. Other parts have been taken from the NIST (National Institute of Standards and Technology) report entitled Border Gateway Protocol Security.

7.1.1 Salient Features of BGP Operation
This section is intended for non-experts who need to understand the origins of BGP security problems. Although unknown to most users, the Border Gateway Protocol (BGP) is critical to keeping the Internet running. BGP is a routing protocol, which means that it is used to update routing information between major systems. BGP is in fact the primary inter-domain routing protocol and has been in use since the commercialization of the Internet. Because systems connected to the Internet change constantly, the most efficient paths between systems must be updated regularly; otherwise, communications would quickly slow down or stop. Without BGP, e-mail, Web page transmissions, and other Internet communications would not reach their intended destinations. Securing BGP against attacks by intruders is thus critical to keeping the Internet running smoothly. Many organizations do not need to operate BGP routers because they use Internet service providers (ISPs) that take care of these management functions, but larger organizations with large networks have routers that run BGP and other routing protocols. The collection of routers, computers, and other components within a single administrative domain is known as an autonomous system (AS). An ISP typically represents a single AS.
In some cases, corporate networks tied to the ISP may also be part of the ISP's AS, even though some aspects of their administration are not under the ISP's control. Participating in the global BGP routing infrastructure gives an organization some control over the path traffic traverses to and from its IP addresses (Internet destinations). To participate in the global BGP routing infrastructure, an organization needs:
• Assigned IP addresses, grouped into IP network addresses (aka prefixes) for routing.
• A unique integer identifier called an Autonomous System Number (ASN).
• A BGP router ready to connect to a neighbor BGP router on an Internet Service Provider's network (or another already-connected AS) that is willing to establish a BGP session and exchange routing information and packet traffic with the joining organization.

The basic operation of BGP is remarkably simple: each BGP-speaking router can relay messages to its neighbors about routes to network addresses (prefixes) that it already knows, either because it "owns" these prefixes or because it has already learned routes to them from another neighbor. As a route announcement travels from one border router to another, it incrementally collects information about the ASes the route "update" has traversed, in an attribute called AS_PATH. Every BGP route is therefore constructed hop-by-hop according to local routing policies in each AS. This property of BGP is a source of its flexibility in serving diverse business needs, and also a source of vulnerabilities. The operators of BGP routers can configure routing-policy rules that determine which received routes will be rejected, which will be accepted, and which will be propagated further – possibly with modified attributes – and can specify which prefixes will be advertised as allocated to, or reachable through, the router's AS.

In contrast to the simplicity of BGP's basic operation, the routing policy installed in a BGP router can be very complex. A BGP router can have very extensive capabilities for manipulating and transforming routes to implement the policy; such capabilities are not standardized but are largely dictated by AS interconnection and business relationships. A route received from a neighbor can be transformed before a decision is made to accept or reject it, and can be transformed again before it is relayed to other neighbors; or the route may not be disseminated at all. All this works quite well most of the time – largely because of certain historically motivated trust and established communication channels among the human operators of the global BGP routing system. This is the trust that a route received from a neighbor accurately describes a path to a prefix legitimately reachable through the neighbor ASes' networks, and that its attributes have not been tampered with. Notwithstanding the above, the "trust but verify" rule applies: Best Current Practices recommend filtering the routes received from neighbors. While this can be done correctly for well-known direct customers, there is currently no validated repository of the "ground truth" that would allow correct filtering of routes to all networks in the world.
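To make the hop-by-hop construction of AS_PATH concrete, the sketch below simulates how a route announcement accumulates AS numbers as it is relayed, and how a router rejects an update that already contains its own ASN (the loop-prevention rule mentioned later in this appendix). This is a toy illustration under simplifying assumptions (no policies, no best-path selection), not an implementation of BGP; the ASNs and prefix are invented.

```python
# Toy sketch (not real BGP): relay a route announcement neighbor-to-neighbor,
# prepending each AS to AS_PATH and rejecting updates that would form a loop.
from typing import Optional

def relay(announcement: dict, via_asn: int) -> Optional[dict]:
    """Return the announcement as propagated by `via_asn`, or None if rejected."""
    if via_asn in announcement["as_path"]:
        return None  # loop prevention: this ASN is already in AS_PATH
    return {"prefix": announcement["prefix"],
            "as_path": [via_asn] + announcement["as_path"]}

if __name__ == "__main__":
    # AS 64500 originates a prefix; the update is relayed through 64501 and 64502.
    update = {"prefix": "192.0.2.0/24", "as_path": [64500]}
    for asn in (64501, 64502):
        update = relay(update, asn)
        print(asn, "->", update)
    # Relaying back through 64501 is rejected because 64501 is already on the path.
    print("loop check:", relay(update, 64501))
```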
Now observe that the BGP protocol itself provides a perfect mechanism for spreading malformed or maliciously constructed routes, unless the BGP players are vigilant in filtering them out of further propagation. Adequate route filtering may not be in place, however, and from time to time a malicious or inadvertent router configuration change creates a BGP security incident: malformed or maliciously constructed routing messages propagate from one AS to another simply by exploiting legitimate route-propagation rules, and occasionally spread to virtually all BGP routers in the world. Because some BGP-speaking routers advertise all local BGP routes to all external BGP peers by default, another commonly occurring example involves a customer of two or more upstream ASes advertising routes learned from one upstream ISP to another ISP. Both the customer and the ISPs should put controls in place to scope the propagation of routes to those explicitly allocated to the customer AS, but this is difficult given the lack of "ground truth". The resulting routing distortions can cause very severe Internet service disruptions – in particular, effective disconnection of victim networks or third parties from parts or all of the Internet, or the forcing of traffic through networks that should not carry it, potentially opening higher-level Internet transactions to packet snooping or man-in-the-middle attacks.

7.1.2 Review of Router Operations
In a small local area network (LAN), data packets are sent across the wire, typically using Ethernet hardware, and all hosts on the network see the transmitted packets. Packets addressed to a host are received and processed, while all others are ignored. Once networks grow beyond a few hosts, though, communication must occur in a more organized manner. Routers perform the task of communicating packets among individual LANs or larger networks of hosts. To make internetworking possible, routers must accomplish these primary functions:
• Parsing address information in received packets
• Forwarding packets to other parts of the network, sometimes filtering out packets that should not be forwarded
• Maintaining tables of address information for routing packets.

BGP is used to update routing tables, which are essential to the correct operation of networks. BGP is a dynamic routing scheme – it updates routing information based on messages that are continually exchanged between BGP routers on the Internet. Routing information received from other BGP routers (often called "BGP speakers") is accumulated in a routing table. The routing process uses this routing information, plus local policy rules, to determine routes to various network destinations. These routes are then installed in the router's forwarding table. The forwarding table is what is actually used to determine how to forward packets, although the term "routing table" is often used to describe this function (particularly in documentation for home networking routers).
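To illustrate the routing-table/forwarding-table distinction above, the sketch below builds a tiny forwarding table and performs the longest-prefix-match lookup a router uses when deciding where to send a packet. It is a pedagogical sketch using Python's standard `ipaddress` module; real routers use specialized data structures (e.g., tries or TCAM), and the prefixes and next hops shown are invented for the example.

```python
# Pedagogical sketch: longest-prefix-match lookup over a tiny forwarding table.
import ipaddress

# Invented example entries: prefix -> next hop.
forwarding_table = {
    ipaddress.ip_network("0.0.0.0/0"): "upstream-A",        # default route
    ipaddress.ip_network("198.51.100.0/24"): "peer-B",
    ipaddress.ip_network("198.51.100.128/25"): "customer-C",
}

def next_hop(destination: str) -> str:
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in forwarding_table.items() if addr in net]
    # The longest (most specific) matching prefix wins.
    best = max(matches, key=lambda item: item[0].prefixlen)
    return best[1]

if __name__ == "__main__":
    for dst in ("198.51.100.200", "198.51.100.5", "203.0.113.9"):
        print(dst, "->", next_hop(dst))
```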
7.2 BGP Security Incidents and Vulnerabilities
In this section we classify the observed BGP security incidents, outline the known worst-case scenarios, and attempt to tie the incidents to features of proposed solutions that could prevent them. Many of the larger incidents are believed to have been the result of misconfigurations or mistakes rather than intentional malice or criminal intent. It has long been suspected that more frequent, less visible incidents have been happening with little attention or visibility. BGP security incidents usually originate in one particular BGP router, or a group of related BGP routers in an AS, through a change to the router's configuration that leads to announcements of a peculiar route or routes that introduce new paths towards a given destination, or that trigger bugs or other misbehaviors in neighboring routers in the course of propagation. There are no generally accepted criteria for labeling a routing incident as an "attack", and – as stressed in the recommendations – there is a lack of broadly accepted routing-security metrics that could automatically identify certain routing changes as "routing security violations". The BGP security incidents observed to date can be classified as follows:
• Route origin hijacking (unauthorized announcements of routes to IP space not assigned to the announcer). Such routing-integrity violations may happen under various scenarios: malicious activity, inadvertent misconfiguration ("fat fingers"), or errors in traffic engineering. There are further sub-categories of such suspected security violations:
  o Hijacking of unused IP space, such as repeated hijacks of routes to prefixes within large IP blocks assigned to an entity (such as the US government) but normally not routed on the public Internet. Temporarily using these "unused" addresses enables criminal or antisocial activities (spam, network attacks) while complicating efforts to detect and diagnose the perpetrators.
  o Surgically targeted hijacks of specific routes and de-aggregation attacks on specific IP addresses. These may be hard to identify unless anomaly detection is unambiguous or the victim is important enough to create a large commotion. Example: Pakistan hijacks YouTube (http://www.renesys.com/blog/2008/02/pakistan-hijacks-youtube-1.shtml), where the advertisement of a more-specific prefix was globally accepted and totally black-holed traffic to the victim (a simple detection check for this more-specific pattern is sketched after this list). There may be significantly more such attacks than publicly reported, as they may be difficult to distinguish from legitimate traffic engineering or network re-engineering activities.
  o Unambiguous massive hijacks of many routes, where many distinct legitimate origin ASes are replaced by a new unauthorized origin AS advertising the hijacked routes. Significant incidents include the 2010 "China's 18-Minute Mystery" (http://www.renesys.com/blog/2010/11/chinas-18-minute-mystery.shtml), the hijacking of a very large portion of the Internet for several hours by TTNet in 2004 (Alin C. Popescu, Brian J. Premore, and Todd Underwood, Anatomy of a Leak: AS9121, NANOG 34, May 16, 2005), and the 2006 ConEd incident (http://www.renesys.com/blog/2006/01/coned-steals-the-net.shtml). Without knowing the motivations of the implicated router administrators, it is difficult to determine whether these and similar incidents were due to malicious intent or to errors in implementing routing-policy changes.
• Manipulation of the AS_PATH attribute in transmitted BGP messages, executed by malicious, selfish, or erroneous policy configuration. The intention of such attacks is to exploit BGP routers' route-selection algorithms, which depend on AS_PATH properties such as the immediate rejection of a route containing the router's own ASN in the AS_PATH (mandated to prevent routing loops) or the AS_PATH length. Alternatively, such attacks may target software bugs in particular BGP implementations (quite a few of which were triggered in recent years with global impact).
  o For routing incidents triggered by long AS_PATHs, see House of Cards (http://www.renesys.com/blog/2010/08/house-of-cards.shtml), AfNOG Takes Byte Out of Internet (http://www.renesys.com/blog/2009/05/byte-me.shtml), and Longer is Not Always Better (http://www.renesys.com/blog/2009/02/longer-is-not-better.shtml) for actual examples.
  o Route leaks – the possibility of "man in the middle" (MITM) AS_PATH attacks detouring traffic via a chosen AS was publicly demonstrated at DEFCON in 2008 (A. Pilosov and T. Kapela, Stealing the Internet, DEFCON 16, August 10, 2008). Two other similar incidents were found in the 7-month period surrounding the DEFCON demo by mining a BGP update repository in 2009 (C. Hepner and E. Zmijewski, Defending against BGP Man-in-the-Middle Attacks, Black Hat DC, February 2009), but were not confirmed as malicious. This can occur by accident, as detailed above – sometimes referred to as route "leaks" – or may be intentional. Additionally, such attacks may or may not attempt to obscure the presence of additional ASes in the AS path, should they exist. These are particularly problematic to identify, as they require some knowledge of the intent of the resource holder and intermediate ASes.
  o AS_PATH poisoning – sometimes used by operators to prevent their traffic from reaching and/or transiting a selected AS, or to steer traffic away from certain paths. It is technically a violation of the BGP protocol and could be used harmfully as well.
• Exploitation of router packet-forwarding bugs, router performance degradation, and bugs in BGP update processing.
  o For an example of a transient global meltdown caused by a router bug tickled by de-aggregation, see J. Cowie, The Curious Incident of 7 November 2011, NANOG 54, February 7, 2012, and several other cases cited there.
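The following sketch illustrates the more-specific (de-aggregation) hijack pattern referenced in the list above: an announcement is suspicious if its prefix is a subnet of a monitored prefix but its origin AS differs from the registered origin. It uses Python's standard `ipaddress` module; the monitored prefixes and ASNs are invented for the example, and a real monitor would consume live BGP data rather than a static list.

```python
# Illustrative sketch: flag announcements of more-specific prefixes inside a
# monitored block when the origin AS is not the expected one (de-aggregation
# style hijack, cf. the YouTube/Pakistan incident pattern).
import ipaddress
from typing import List, Tuple

# Invented monitoring data: (monitored prefix, legitimate origin ASN).
monitored = [(ipaddress.ip_network("203.0.113.0/24"), 64500)]

def suspicious(announcements: List[Tuple[str, int]]) -> List[str]:
    alerts = []
    for prefix_str, origin in announcements:
        prefix = ipaddress.ip_network(prefix_str)
        for victim, legit_origin in monitored:
            more_specific = prefix != victim and prefix.subnet_of(victim)
            if more_specific and origin != legit_origin:
                alerts.append(f"{prefix} (AS{origin}) is a more-specific of "
                              f"{victim} registered to AS{legit_origin}")
    return alerts

if __name__ == "__main__":
    feed = [("203.0.113.0/25", 64666),   # suspicious more-specific announcement
            ("203.0.113.0/24", 64500)]   # legitimate announcement
    print("\n".join(suspicious(feed)))
```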
There are also BGP vulnerabilities that may not have been exploited in the wild so far but that could, in theory, do a great deal of damage. The BGP protocol does not have solid mathematical foundations, and certain bizarre behaviors – such as persistent route oscillations – are quite possible. Several RFCs and papers address BGP vulnerabilities in the context of protocol standard specification and threat modeling; see the following:
• RFC 4272, "BGP Security Vulnerabilities Analysis", S. Murphy, Jan 2006.
• RFC 4593, "Generic Threats to Routing Protocols", A. Barbir, S. Murphy and Y. Yang, Oct 2006.
• Internet draft draft-foo-sidr-simple-leak-attack-bgpsec-no-help-01, "Route Leak Attacks Against BGPSEC", D. McPherson and S. Amante, Nov 2011.
• Internet draft draft-ietf-sidr-bgpsec-threats-01, "Threat Model for BGP Path Security", S. Kent and A. Chi, Feb 2012.

7.3 BGP Risks Matrix
BGP Routing Security Risks Examined by WG 4. For each network operator role, each entry lists the risk, the report section where it is analyzed, and the recommended countermeasures.
Stub network (e.g., Enterprise, Data Center)
• Session-level threats (5.1.1): Consider MD5 or GTSM if a neighbor recommends it.
• DoS against routers and routing info (5.1.2): Control-plane policing (rate-limiting); keep router software up to date.
• Spoofed source IP addresses (5.1.3): Use uRPF (unicast Reverse Path Forwarding) in strict mode, or other similar features, at the access edge of the network (e.g., data center or campus); filter source IP addresses on packets at the network edge to ISPs.
• Incorrect route injection and propagation (5.2.1): Keep current information in "whois" and IRR (Internet Routing Registry) databases; apply outbound prefix filtering; use monitoring services to check for incorrect routing announcements and/or propagation.
• Other attacks, e.g., hacking, insider, social engineering (5.3): Consider the many recommendations about operational security processes.

Internet Service Provider Network
• Session-level threats (5.1.1): Consider a plan to use MD5 or GTSM, including flexibility to adjust to the specifics of different deployment scenarios.
• DoS against routers and routing info (5.1.2): Control-plane policing (rate-limiting); keep router software up to date.
• Spoofed source IP addresses (5.1.3): Use uRPF in strict or loose mode as appropriate (e.g., strict mode at network ingress such as the data-center or subscriber edge, loose mode at the inter-provider border).
• Incorrect route injection and propagation (5.2.1): Keep current information in "whois" and IRR databases; consult that information when provisioning or updating customer routing; implement inbound prefix filtering from customers; consider AS-path filters and maximum-prefix limits as a second line of defense; use monitoring services to check for incorrect routing announcements and/or propagation.
• Other attacks, e.g., hacking, insider, social engineering (5.3): Consider the many recommendations about operational security processes.

7.4 BGP BCP Document References
Network Protection Documents
• NIST Special Publication 800-54, Border Gateway Protocol (BGP) Security Recommendations
• WG2A – Cyber Security Best Practices
• SANS, Twenty Critical Security Controls for Effective Cyber Defense: Consensus Audit Guidelines (CAG)
• NIST Special Publication 800-53, Recommended Security Controls for Federal Information Systems and Organizations
• IETF RFC 4778, Current Operational Security Practices in Internet Service Provider Environments
• IETF RFC 3013, Recommended Internet Service Provider Security Services and Procedures
Source Address Verification/Filtering
• IETF BCP 38 / RFC 2827, Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing
• BCP 84 / RFC 3704, Ingress Filtering for Multihomed Networks
• BCP 140 / RFC 5358, Preventing Use of Recursive Nameservers in Reflector Attacks
• ICANN SAC004, Securing the Edge
DoS/DDoS Considerations
• IETF RFC 4732, Internet Denial-of-Service Considerations
• SANS, A Summary of DoS/DDoS Prevention, Monitoring and Mitigation Techniques in a Service Provider Environment

Scalable Dynamic Nonparametric Bayesian Models of Content and Users*
Amr Ahmed (Research @ Google, amra@google.com) and Eric Xing (Carnegie Mellon University, epxing@cs.cmu.edu)
*The dissertation on which this extended abstract is based was the recipient of the 2012 ACM SIGKDD best doctoral dissertation award [Ahmed, 2011].

Abstract
Online content has become an important medium for disseminating information and expressing opinions.
With its proliferation, users are faced with the problem of missing the big picture in a sea of irrelevant and/or diverse content. In this paper, we address the problem of information organization for online document collections and provide algorithms that create a structured representation of the otherwise unstructured content. We leverage the expressiveness of latent probabilistic models (e.g., topic models) and non-parametric Bayesian techniques (e.g., Dirichlet processes), and give online and distributed inference algorithms that scale to terabyte datasets and adapt the inferred representation as new documents arrive.

1 Introduction
Our online infosphere is evolving at an astonishing rate. It is reported that there are 50 million scientific journal articles published thus far [Jinha, 2010], 126 million blogs (http://www.blogpulse.com/), an average of one news story published per second, and around 500 million tweets per day. With the proliferation of such content, users are faced with the problem of missing the big picture in a sea of irrelevant and/or diverse content. Several unsupervised techniques have therefore been proposed to build structured representations of users and content.

Traditionally, clustering has been a popular unsupervised technique for exploring and visualizing a document collection. When applied to document modeling, it assumes that each document is generated from a single component (cluster or topic) and that each cluster is a unigram distribution over a given vocabulary. This assumption limits the expressive power of the model and does not allow documents to be modeled as mixtures of topics.

Recently, mixed-membership models [Erosheva et al., 2004], also known as admixture models, have been proposed to remedy this deficiency of mixture models. Statistically, an object w_d is said to be derived from an admixture if it consists of a bag of elements, say {w_{d1}, ..., w_{dN}}, each sampled independently or coupled in some way from a mixture model according to an admixing coefficient vector θ, which represents the (normalized) fraction of the contribution of each mixture component to the object being modeled. In a typical text-modeling setting, each document corresponds to an object, its words correspond to the elements constituting the object, and the document-specific admixing coefficient vector is known as a topic vector; the model is known as the latent Dirichlet allocation (LDA) model due to the choice of a Dirichlet distribution as the prior for the topic vector θ [Blei et al., 2003].

Notwithstanding these developments, existing models cannot faithfully model the dynamic nature of online content, represent multiple facets of the same topic, or scale to the size of the data on the Internet. In this paper, we highlight several techniques for building a structured representation of content and users. We first present a flexible dynamic non-parametric Bayesian process, the Recurrent Chinese Restaurant Process, for modeling longitudinal data, and then present several applications to modeling scientific publications and social media and to tracking user interests.
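For readers unfamiliar with the admixture (LDA) generative process described above, the short sketch below samples a toy corpus: each document draws a topic vector θ from a Dirichlet prior, and each word draws a topic from θ and then a word from that topic's distribution. The vocabulary, topic count, and hyperparameter values are arbitrary illustrations, not settings used in this work.

```python
# Toy sketch of the LDA generative process (illustrative hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["goal", "league", "election", "minister", "stock", "market"]
K, alpha, beta = 2, 0.5, 0.1                      # topics, Dirichlet priors

# Topic-word distributions phi_k ~ Dirichlet(beta).
phi = rng.dirichlet([beta] * len(vocab), size=K)

def generate_document(n_words: int):
    theta = rng.dirichlet([alpha] * K)            # document-specific topic vector
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)                # topic assignment for this word
        w = rng.choice(len(vocab), p=phi[z])      # word drawn from that topic
        words.append(vocab[w])
    return theta, words

if __name__ == "__main__":
    theta, doc = generate_document(8)
    print("theta:", np.round(theta, 2), "doc:", doc)
```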
2 Recurrent Chinese Restaurant Process
Standard clustering techniques assume that the number of clusters is known a priori or can be determined using cross-validation. Alternatively, one can use non-parametric techniques that adapt the number of clusters as new data arrives. The power of non-parametric techniques is not limited to model selection: they endow the designer with the tools needed to specify priors over sophisticated (possibly infinite) structures such as trees, and provide a principled way of learning these structures from data.

A key non-parametric distribution is the Dirichlet process (DP), a distribution over distributions [Ferguson, 1973]. A DP, denoted DP(G_0, α), is parameterized by a base measure G_0 and a concentration parameter α. We write G ∼ DP(G_0, α) for a draw of a distribution G from the Dirichlet process. G is itself a distribution over a given parameter space θ, so we can draw parameters θ_{1:N} from G. Integrating out G, the parameters θ follow a Polya urn distribution [Blackwell and MacQueen, 1973], also known as the Chinese restaurant process (CRP), in which previously drawn values of θ have strictly positive probability of being redrawn, thus making the underlying probability measure G discrete with probability one. More formally,

\theta_i \mid \theta_{1:i-1}, G_0, \alpha \;\sim\; \sum_{k} \frac{m_k}{i-1+\alpha}\,\delta(\phi_k) \;+\; \frac{\alpha}{i-1+\alpha}\, G_0, \quad (1)

where φ_{1:k} denotes the distinct values among the parameters θ, and m_k is the number of parameters θ having value φ_k. By using the DP at the top of a hierarchical model, one obtains the Dirichlet process mixture model, DPM [Antoniak, 1974]. The generative process proceeds as follows:

G \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad \theta_d \mid G \sim G, \qquad w_d \mid \theta_d \sim F(\cdot \mid \theta_d), \quad (2)

where F is a given likelihood function parameterized by θ.

[Figure 1: Left: the NIPS conference timeline as discovered by iDTM, with topics such as speech, neuroscience, neural networks, classification methods, control, probabilistic models, boosting, clustering, ICA, and kernels appearing between 1987 and 1999. Right: the evolution of the topic "Kernel Methods", with representative support-vector and kernel papers for each year from 1996 to 1999.]
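A quick way to internalize Eq. (1) is to simulate it: each new draw joins an existing value φ_k with probability proportional to its count m_k, or takes a fresh draw from G_0 with probability proportional to α. The sketch below does exactly that for a Gaussian base measure; the base measure and the value of α are arbitrary choices made only for illustration.

```python
# Illustrative simulation of the Chinese restaurant process / Polya urn of Eq. (1).
import numpy as np

def crp_draws(n: int, alpha: float, base_measure, rng) -> list:
    """Draw n parameters theta_1..theta_n from a DP via the Polya urn scheme."""
    values, counts = [], []                  # distinct phi_k and their counts m_k
    thetas = []
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()                 # (m_k or alpha) / (i - 1 + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(values):                 # new component: draw phi_new ~ G_0
            values.append(base_measure(rng))
            counts.append(0)
        counts[k] += 1
        thetas.append(values[k])
    return thetas

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    draws = crp_draws(20, alpha=1.0, base_measure=lambda r: r.normal(0.0, 5.0), rng=rng)
    print(f"{len(set(draws))} distinct components among 20 draws")
```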
Dirichlet process mixture (CRP) models provide a flexible Bayesian framework, but the full exchangeability assumption they employ makes them an unappealing choice for modeling longitudinal data such as text streams that arrive or accumulate in epochs: data points inside the same epoch can be assumed to be fully exchangeable, whereas across epochs both the structure (i.e., the number of mixture components) and the parameterization of the data distributions can evolve, and are therefore not exchangeable. In this section, we present the Recurrent Chinese Restaurant Process (RCRP) [Ahmed and Xing, 2008] as a framework for modeling such complex longitudinal data, in which the number of mixture components at each time point is unbounded, the components themselves can be retained, die out, or emerge over time, and the parameterization of each component can also evolve over time in a Markovian fashion.

In the RCRP, documents are assumed to be divided into epochs (e.g., one hour or one day); we assume exchangeability only within each epoch. For a new document at epoch t, a probability mass proportional to α is reserved for generating a new cluster. Each existing cluster may be selected with probability proportional to the sum m_{kt} + m'_{kt}, where m_{kt} is the number of documents at epoch t that belong to cluster k, and m'_{kt} is the prior weight for cluster k at time t. If we let c_{td} denote the cluster assignment of document d at time t, then we write

c_{td} \mid c_{1:t-1}, c_{t,1:d-1} \sim \mathrm{RCRP}(\alpha, \lambda, \Delta) \quad (3)

to indicate the distribution

P(c_{td} = k \mid c_{1:t-1}, c_{t,1:d-1}) \propto \begin{cases} m'_{kt} + m^{-td}_{kt} & \text{existing cluster } k \\ \alpha & \text{new cluster.} \end{cases} \quad (4)

As in the original CRP, the count m^{-td}_{kt} is the number of documents in cluster k at epoch t, not including d. The temporal aspect of the model is introduced via the prior m'_{kt}, defined as

m'_{kt} = \sum_{\delta=1}^{\Delta} e^{-\delta/\lambda}\, m_{k,t-\delta}. \quad (5)

This prior defines a time-decaying kernel parameterized by a width ∆ and a decay factor λ. When ∆ = 0 the RCRP degenerates to a set of independent Chinese restaurant processes, one per epoch; when ∆ = T and λ = ∞ we obtain a global CRP that ignores time. In between, the values of these two parameters control the expected life span of a given component, such that the lifespan of each storyline follows a power-law distribution [Ahmed and Xing, 2008]. In addition, the distribution φ_k of each component changes over time in a Markovian fashion, i.e., φ_{kt} | φ_{k,t−1} ∼ P(·|φ_{k,t−1}). In the following three sections we describe various models built on top of the RCRP and highlight how inference is performed and scaled to Internet-sized data.
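To make Eqs. (4)–(5) concrete, the sketch below computes the time-decayed prior weights m'_{kt} from past epoch counts and then forms the probabilities of assigning a new document to each existing cluster or to a new one. The epoch counts, cluster names, and hyperparameters are invented for illustration.

```python
# Illustrative computation of the RCRP prior (Eq. 5) and cluster-assignment
# probabilities (Eq. 4) for one new document at epoch t.
import math
from collections import defaultdict

def rcrp_prior(past_counts, width, decay):
    """m'_{kt} = sum_{delta=1..width} exp(-delta/decay) * m_{k, t-delta}.

    `past_counts` is a list of dicts, most recent epoch last: past_counts[-delta]
    maps cluster id -> number of documents in that cluster at epoch t - delta.
    """
    prior = defaultdict(float)
    for delta in range(1, min(width, len(past_counts)) + 1):
        weight = math.exp(-delta / decay)
        for k, m in past_counts[-delta].items():
            prior[k] += weight * m
    return dict(prior)

def assignment_probs(prior, current_counts, alpha):
    clusters = set(prior) | set(current_counts)
    scores = {k: prior.get(k, 0.0) + current_counts.get(k, 0) for k in clusters}
    scores["<new>"] = alpha                       # mass reserved for a new cluster
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}

if __name__ == "__main__":
    past = [{"world-cup": 40, "elections": 10}, {"world-cup": 5, "earthquake": 30}]
    prior = rcrp_prior(past, width=2, decay=1.0)
    print(prior)
    print(assignment_probs(prior, current_counts={"earthquake": 3}, alpha=1.0))
```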
3 Modeling Scientific Publications
With the large number of research publications available online, it is important to develop automated methods that can discover the salient topics (research areas), when each topic started, how each topic developed over time, and what the representative publications in each topic are in each year. Mixed-membership models (such as LDA) are static in nature, and while several dynamic extensions have been proposed [Blei and Lafferty, 2006], none of them can handle the evolution of all of the aforementioned aspects. The RCRP can model the temporal evolution of research topics, but it assumes that each document is generated from a single topic (cluster). To marry the two approaches, we first introduce Hierarchical Dirichlet Processes (HDP) [Teh et al., 2006] and then describe our proposed model.

Instead of modeling each document w_d as a single data point, we could model each document as a DP. In this setting, each word w_{dn} is a data point and is thus associated with a topic sampled from the random measure G_d, where G_d ∼ DP(α, G_0). The random measure G_d represents the document-specific mixing vector over a potentially infinite number of topics. To share the same set of topics across documents, we tie the document-specific random measures by modeling the base measure G_0 itself as a random measure sampled from DP(γ, H). The discreteness of the base measure G_0 ensures topic sharing among all the documents.

We now introduce our model, iDTM [Ahmed and Xing, 2010b], which allows an unbounded number of topics with variable durations. The documents in epoch t are modeled using an epoch-specific HDP with a high-level base measure denoted G_0^t. These epoch-specific base measures {G_0^t} are tied together using the RCRP of Section 2 to evolve the topics' popularity and their distributions over words as time proceeds. To enable the evolution of a topic's distribution over words, we model each topic as a logistic normal distribution and evolve its parameters using a Kalman filter. This choice introduces non-conjugacy between the base measure and the likelihood function, which we handle using the Laplace approximate inference technique proposed in [Ahmed and Xing, 2007]. We applied this model to the collection of papers published in the NIPS conference over 18 years. Figure 1 depicts the conference timeline and the evolution of the topic "Kernel Methods", along with popular papers in each year.

In addition to modeling the temporal evolution of topics, in [Ahmed et al., 2009] we developed a mixed-membership model for retrieving relevant research papers based on multiple modalities, for example figures or key entities in the paper such as gene or protein names (as in biomedical papers). Figures in biomedical papers pose various modeling challenges that we omit here for space limitations.
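The topic-sharing mechanism described above (each document draws from its own DP whose base measure is itself drawn from a DP) is often simulated via the "Chinese restaurant franchise": within a document, a word joins an existing table or opens a new one, and each new table is served a dish (topic) chosen from the franchise-wide menu in CRP fashion. The sketch below is a bare-bones generative simulation under assumed α and γ values; it samples only the table/dish structure, not actual words, and is not the inference procedure used for iDTM.

```python
# Bare-bones Chinese-restaurant-franchise simulation of topic sharing in an HDP:
# documents are restaurants; tables link word tokens to globally shared dishes.
import numpy as np

def simulate_hdp_structure(n_docs, words_per_doc, alpha, gamma, rng):
    global_dish_counts = []                 # how many tables serve each dish (topic)
    doc_dishes = []
    for _ in range(n_docs):
        table_counts, table_dish = [], []   # per-document table sizes and their dishes
        for _ in range(words_per_doc):
            p = np.array(table_counts + [alpha], dtype=float)
            t = rng.choice(len(p), p=p / p.sum())
            if t == len(table_counts):      # open a new table: pick its dish globally
                q = np.array(global_dish_counts + [gamma], dtype=float)
                d = rng.choice(len(q), p=q / q.sum())
                if d == len(global_dish_counts):
                    global_dish_counts.append(0)   # brand-new shared topic
                global_dish_counts[d] += 1
                table_counts.append(0)
                table_dish.append(d)
            table_counts[t] += 1
        doc_dishes.append(sorted(set(table_dish)))
    return doc_dishes, len(global_dish_counts)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    docs, n_topics = simulate_hdp_structure(5, 30, alpha=1.0, gamma=1.5, rng=rng)
    print("topics per document:", docs, "| total shared topics:", n_topics)
```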
4 Modeling Social Media
News portals and blogs/Twitter are the main means of disseminating news stories and expressing opinions. With the sheer volume of documents and blog entries generated every second, it is hard to stay informed. This section explores methods that create a structured representation of news and opinions. Storylines emerge from events in the real world, such as the tsunami in Japan, and have certain durations. Each story can be classified under multiple topics, such as disaster, rescue, and economics. In addition, each storyline focuses on certain words and named entities, such as the names of the cities or people involved in the event.

[Figure 2: Some example storylines and topics extracted by our system. For each storyline we list the top words in the left column and the top named entities on the right; the plot at the bottom shows the storyline strength over time. For topics we show the top words. Lines between storylines and topics indicate that at least 10% of the terms in a storyline are generated from the linked topic. Example topics include sports, politics, and unrest; example storylines include India–Pakistan tension, UEFA soccer, and US tax bills.]

In [Ahmed et al., 2011b,a] we used the RCRP to model storylines. In a nutshell, we emulate the process of generating news articles. A story is characterized by a mixture of topics and the names of the key entities involved in it. Any article discussing the story then draws its words from the topic mixture associated with the story, the associated named entities, and any story-specific words that are not well explained by the topic mixture. The latter modification allows us to improve our estimates for a given story once it becomes popular. In summary, we model news-story clustering by applying a topic model to the clusters while simultaneously allowing cluster generation using the RCRP. Such a model has a number of advantages. First, estimates in topic models improve with the amount of data available. Second, modeling a story by its mixture of topics ensures that we have a plausible cluster model right from the start, even after observing only one article for a new story. Third, the RCRP ensures a continuous flow of new stories over time. Finally, a distinct named-entity model ensures that we capture the characteristic terms rapidly.

To infer storylines from the text stream, we developed a Sequential Monte Carlo (SMC) algorithm that assigns news articles to storylines in real time. Applying our online model to a collection of news articles extracted from a popular news portal, we discovered the structure shown in Figure 2. This structure enables the user to browse the storylines by topic and to retrieve relevant storylines based on any combination of the storyline attributes. Note that named entities are extracted in a preprocessing step using standard extractors. Quantitatively, we compared the accuracy of our online clustering with a strong offline algorithm [Vadrevu et al., 2011], with favorable outcomes.
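The SMC (particle filter) inference mentioned above processes articles one at a time: each particle maintains its own hypothesis of story assignments, new articles are assigned by sampling from each particle's posterior over stories, particles are re-weighted by how well they explain the article, and resampling keeps the particle set focused. The sketch below shows only this control flow with a placeholder likelihood; it is a schematic of the general SMC recipe under invented scoring, not the actual algorithm or model used in the cited papers.

```python
# Schematic sequential Monte Carlo loop for online story assignment.
# The likelihood is a stand-in; the real model scores topics, entities, and words.
import numpy as np

rng = np.random.default_rng(3)

def story_likelihood(article_words, story_words):
    """Placeholder score: word overlap smoothed by 1 (not the papers' model)."""
    return 1.0 + len(set(article_words) & set(story_words))

def smc_assign(stream, n_particles=20, alpha=1.0):
    particles = [{"stories": [], "weight": 1.0 / n_particles} for _ in range(n_particles)]
    for words in stream:
        for p in particles:
            scores = [story_likelihood(words, s) * len(s) for s in p["stories"]] + [alpha]
            probs = np.array(scores) / sum(scores)
            k = rng.choice(len(probs), p=probs)
            if k == len(p["stories"]):
                p["stories"].append(list(words))      # start a new storyline
            else:
                p["stories"][k].extend(words)
            p["weight"] *= scores[k]                  # re-weight by the chosen score
        total = sum(p["weight"] for p in particles)
        for p in particles:                           # normalize weights, then resample
            p["weight"] /= total
        idx = rng.choice(n_particles, size=n_particles,
                         p=[p["weight"] for p in particles])
        particles = [{"stories": [list(s) for s in particles[i]["stories"]],
                      "weight": 1.0 / n_particles} for i in idx]
    return particles

if __name__ == "__main__":
    stream = [["earthquake", "tsunami"], ["tsunami", "rescue"], ["election", "vote"]]
    best = smc_assign(stream)[0]
    print(f"{len(best['stories'])} storylines in the first particle")
```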
Second, we address the problem of ideology-bias detection in user-generated content such as microblogs. We follow the notion of ideology defined by van Dijk [Dijk, 1998]: "a set of general abstract beliefs commonly shared by a group of people." In other words, an ideology is a set of ideas that directs one's goals, expectations, and actions. For instance, freedom of choice is a general aim that directs the actions of "liberals", whereas conservation of values is the parallel for "conservatives". In Ahmed and Xing [2010a] we developed a multi-view mixed-membership model that uses a factored representation of topics, where each topic is composed of two parts: an unbiased part (shared across ideologies) and a biased part (different for each ideology). Applying this model to a few ideologically labelled documents as seeds, together with many unlabeled documents, we were able to identify how each ideology stands with respect to mainstream topics. For instance, Figure 3 shows the result of applying the model to a set of articles written on the Middle East conflict by both Israeli and Palestinian writers. Given a new document, the model can (1) detect its ideological bias (if any), (2) point to where the bias appears (i.e., highlight biased words and/or sentences), and (3) retrieve documents written on the same topic from the opposing ideology. Our model achieves state-of-the-art results on tasks 1 and 3, while being unique in solving task 2.

[Figure 3: Ideology detection on Israeli and Palestinian writings about the Middle East conflict. The middle column shows the unbiased portion of each topic (e.g., the "Roadmap process"), while the two sides give the Israeli and Palestinian perspectives, together with per-ideology background topics.]
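The factored representation above can be pictured as a per-word switch: a word in a document of ideology v is drawn either from the topic's shared (unbiased) distribution or from its ideology-specific (biased) distribution. The sketch below simulates just that switch for a single topic; the vocabularies, mixing probability, and ideology labels are invented for illustration and are not the model's actual estimates.

```python
# Toy simulation of a factored topic: shared (unbiased) vs. ideology-specific words.
import numpy as np

rng = np.random.default_rng(4)

shared_words = ["peace", "process", "security", "government"]      # unbiased part
biased_words = {"ideology_A": ["occupation", "resistance"],        # invented labels
                "ideology_B": ["terrorism", "settlement"]}

def sample_topic_words(ideology: str, n_words: int, p_biased: float = 0.3):
    words = []
    for _ in range(n_words):
        if rng.random() < p_biased:            # switch: biased view vs. shared view
            words.append(str(rng.choice(biased_words[ideology])))
        else:
            words.append(str(rng.choice(shared_words)))
    return words

if __name__ == "__main__":
    for side in biased_words:
        print(side, "->", sample_topic_words(side, 6))
```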
[Figure 5: Dynamic interests of two users. The plots show the proportion of each interest over roughly 40 days (baseball, dating, celebrity, and health for one user; baseball, finance, jobs, and dating for the other), together with top words for interests such as dating, baseball, celebrity, health, jobs, and finance.]

5 Modeling User Interests
Historical user activity is key to building user profiles that predict user behaviour and affinities in many web applications, such as targeting of online advertising, content personalization, and social recommendations. User profiles are temporal, and changes in a user's activity patterns are particularly useful for improved prediction and recommendation. For instance, an increased interest in car-related web pages suggests that the user might be shopping for a new vehicle. In Ahmed et al. [2011c] we present a comprehensive statistical framework for user profiling based on the RCRP model, which is able to capture such effects in a fully unsupervised fashion. Our method models the topical interests of a user dynamically, where both the user's association with the topics and the topics themselves are allowed to vary over time, thus ensuring that the profiles remain current. If we represented each user as a bag of the words in their search history, we could use the iDTM model described in Section 3. However, unlike research papers, which exist in a single epoch, users persist across multiple epochs (where each epoch here might denote a day). To solve this problem we extend iDTM by modeling each user as an RCRP that evolves over time, as shown in Figure 4.

[Figure 4: A fully evolving non-parametric process. The top-level process evolves the global topics via an RCRP. Each row represents a user process, itself an RCRP whose topics depend both on the global topics at each epoch and on the user's state in previous epochs. The user process is sparser than the global process, since users need not appear in every epoch and can appear (and leave) at any epoch.]

To deal with the size of data on the Internet, we developed a streaming, distributed inference algorithm that distributes users over multiple machines and synchronizes the model parameters using an asynchronous consensus protocol, described in more detail in [Ahmed et al., 2012; Smola and Narayanamurthy, 2010]. Figure 5 qualitatively shows the output of our model for two users. Quantitatively, when the discovered interests are used as features in an advertising task, they yield a significant improvement over a strong deployed system.
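The streaming, distributed setup above shards users across workers while the topic statistics they share are kept loosely in sync. One common way to realize this, used here only as an illustrative stand-in for the cited architecture, is delta synchronization: each worker periodically sends the change in its local counts since the last sync to a central copy and pulls back the merged state. The sketch below shows that bookkeeping for simple count tables; worker and topic names are invented.

```python
# Illustrative delta-synchronization of shared topic counts between workers and a
# central copy (a stand-in for the asynchronous consensus protocol cited above).
from collections import Counter

class Worker:
    def __init__(self):
        self.local = Counter()      # worker's current view of the global topic counts
        self.synced = Counter()     # snapshot taken at the last synchronization

    def observe(self, topic: str, count: int = 1):
        self.local[topic] += count  # local updates from processing this shard's users

    def sync(self, global_counts: Counter):
        delta = self.local - self.synced        # what changed locally since last sync
        global_counts.update(delta)             # merge local changes into the center
        self.local = Counter(global_counts)     # pull back everyone's merged state
        self.synced = Counter(global_counts)

if __name__ == "__main__":
    center = Counter()
    w1, w2 = Worker(), Worker()
    w1.observe("baseball", 3); w2.observe("finance", 2); w2.observe("baseball", 1)
    for w in (w1, w2):
        w.sync(center)
    print("center:", dict(center), "| worker1 view:", dict(w1.local))
```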
6 Conclusions

Our infosphere is diverse and dynamic. Automated methods that create a structured representation of users and content are key to helping users stay informed. We presented a flexible nonparametric Bayesian model called the Recurrent Chinese Restaurant Process and showed how this formalism (in addition to mixed-membership models) can solve this task. We validated our approach on many domains and showed how to scale inference to the size of the data on the internet and how to perform inference in online settings.

References

A. Ahmed and E. P. Xing. On tight approximate inference of the logistic normal topic admixture model. In AISTATS, 2007.
A. Ahmed and E. P. Xing. Dynamic non-parametric mixture models and the recurrent Chinese restaurant process: with applications to evolutionary clustering. In SDM, pages 219–230. SIAM, 2008.
A. Ahmed and E. P. Xing. Staying informed: Supervised and semi-supervised multi-view topical analysis of ideological perspective. In EMNLP, 2010.
A. Ahmed and E. P. Xing. Timeline: A dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. In UAI, 2010.
A. Ahmed, E. P. Xing, W. W. Cohen, and R. F. Murphy. Structured correspondence topic models for mining captioned figures in biological literature. In KDD, pages 39–48. ACM, 2009.
A. Ahmed, Q. Ho, J. Eisenstein, E. P. Xing, A. J. Smola, and C. H. Teo. Unified analysis of streaming news. In WWW, 2011.
A. Ahmed, Q. Ho, C. Hui, J. Eisenstein, A. J. Smola, and E. P. Xing. Online inference for the infinite topic-cluster model: Storylines from text stream. In AISTATS, 2011.
A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, pages 114–122, 2011.
A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Web Science and Data Mining (WSDM), 2012.
A. Ahmed. Modeling Users and Content: Structured Probabilistic Representation, and Scalable Online Inference Algorithms. PhD thesis, School of Computer Science, Carnegie Mellon University, 2011.
C. E. Antoniak. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174, 1974.
D. Blackwell and J. MacQueen. Ferguson distributions via Polya urn schemes. The Annals of Statistics, 1(2):353–355, 1973.
D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, volume 148, pages 113–120. ACM, 2006.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
T. A. Van Dijk. Ideology: A Multidisciplinary Approach. 1998.
E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership models of scientific publications. PNAS, 101(1), 2004.
T. S. Ferguson. A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2):209–230, 1973.
A. E. Jinha. Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing, 23(3):258–263, 2010.
A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010.
Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(576):1566–1581, 2006.
S. Vadrevu, C. H. Teo, S. Rajan, K. Punera, B. Dom, A. J. Smola, Y. Chang, and Z. Zheng. Scalable clustering of news search results. In WSDM, 2011.

Towards A Unified Modeling and Verification of Network and System Security Configurations
Mohammed Noraden Alsaleh, Ehab Al-Shaer
University of North Carolina at Charlotte, Charlotte, NC, USA
Email: {malsaleh, ealshaer}@uncc.edu
Adel El-Atawy
Google Inc, Mountain View, CA, USA
Email: aelatawy@google.com

Abstract—System and network access control configurations are usually analyzed independently, although they are logically combined to define the end-to-end security property. While system and application security policies define access control based on user identity or group, request type, and the requested resource, network security policies use flow information such as host and service addresses for source and destination to define access control. Therefore, both network and system access control have to be configured consistently in order to enforce end-to-end security policies. Much previous research attempts to verify either side separately, but it does not provide a unified approach to automatically validate the logical consistency between the two. Thus, using existing techniques requires error-prone, manual, and ad-hoc analysis to validate this link. In this paper, we introduce a cross-layer modeling and verification system that can analyze the configurations and policies across both application and network components as a single unit. It combines policies from different devices such as firewalls, NAT, routers, and IPSec gateways, as well as basic RBAC-based policies of higher service layers. This allows analyzing, for example, firewall policies in the context of application access control and vice versa. Thus, by incorporating policies across the network and over multiple layers, we provide a true end-to-end configuration verification tool. Our model represents the system as a state machine where packet header, service request, and location determine the state, and transitions that conform with the configuration, device operations, and packet values are established. We encode the model as Boolean functions using binary decision diagrams (BDDs). We use an extended version of computation tree logic (CTL) to provide more useful operators, and then use it with symbolic model checking to prove properties or find counterexamples. The tool is implemented, and we gave special consideration to efficiency and scalability.
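Before turning to the details, the toy Python sketch below illustrates the flavour of this symbolic analysis: a state is a (location, source, destination, service) tuple, the transition function encodes a hypothetical forwarding and firewall policy, and reachability is computed as a least fixed point, in the spirit of CTL's EF operator. Plain state sets stand in for the BDD-encoded Boolean functions used by the actual tool, and all device names and policies here are invented for illustration.

# A minimal sketch of symbolic reachability checking over a state machine.
LOCATIONS = ["host_a", "firewall", "server"]
SERVICES  = ["http", "ssh"]

def transitions(state):
    """Hypothetical device behaviour: host forwarding plus a firewall policy."""
    loc, src, dst, svc = state
    if loc == "host_a":                        # host forwards toward the firewall
        yield ("firewall", src, dst, svc)
    elif loc == "firewall" and svc == "http":  # firewall permits only http
        yield ("server", src, dst, svc)
    # dropped packets yield no successor state

def reachable(initial_states):
    """Least fixed point: all states reachable from the initial set."""
    frontier, seen = set(initial_states), set(initial_states)
    while frontier:
        nxt = {t for s in frontier for t in transitions(s)} - seen
        seen |= nxt
        frontier = nxt
    return seen

# Property check in the spirit of EF: does any ssh request from host_a
# ever reach the server?  Under the policy above it should not.
init = {("host_a", "host_a", "server", svc) for svc in SERVICES}
bad  = [s for s in reachable(init) if s[0] == "server" and s[3] == "ssh"]
print("ssh reaches server:", bool(bad))   # expected: False

In the actual system the state sets and transition relation are Boolean functions over the bits of the packet header and location, so this fixed point is computed with BDD operations rather than by enumerating states explicitly.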
I. INTRODUCTION

Users inadvertently trigger a long sequence of operations in many locations and devices with just a simple request. Application requests are encapsulated inside network packets, which in turn are routed through the network devices and subjected to different types of routing, access control, and transformation policies. Misconfigurations at the different layers in any of the network devices can affect the end-to-end connection between the hosts themselves and the communicating services running on top of them. Moreover, applications may need to transform a request into one or more other requests with different characteristics. This means that the network layer should guarantee more than one packet flow at the same time in order for the application request to be successful. Although it is already very hard to verify that only legitimate packets can pass through the network successfully, the consistency between network-layer and application-layer access configurations adds another challenge. The different natures of the policies, from network-layer devices to the logic of application access control, make the problem more complex. In this paper we extend ConfigChecker [3] to include application-layer access control. ConfigChecker is a model checker for network configurations. It models many network devices, including routers, firewalls, IPSec gateways, hosts, and NAT/PAT nodes. ConfigChecker models transitions based on packet forwarding at the network layer, where the packet header fields along with the location represent the variables of the model checker. We define application-layer requests following a loose RBAC model: a 4-tuple of