Real data sets are plagued by the disruptive presence of artifacts. Spike-like responses can result from any number of electronic, physical, or social phenomena. Digital gain can change dramatically with temperature or electron tunneling. Physical effects such as unpredictable radiation bursts (solar activity), flash weather, and geometric specular points and glint (peak surface reflection) all contribute to edge data. Social spikes occur when a novel message spreads virally from a few sources to saturation levels in a short period of time.
Edge Points Dominate Statistical Representations
Much research effort is spent improving models and reducing the impact of anomalous measurements. If special care isn't taken to minimize the impact of spikes, iterative estimates based on models will diverge from the true state. System models break down completely in the presence of severe artifacts.
Note that if you're interested in identifying spikes, ameliorating their impact on distributions will only help them stand out. Most estimation techniques have predictable degradation with the number and size of artifacts*. Take, for example, the mean of 4 numbers. Given the points [4, 4, 4, 100], it's easy for us to see that the point 100 stands out as an anomaly. The mean without that value is 4, but with it the mean is 28. The variance is 0 without the outlier, and quite large with it**.
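The arithmetic above is easy to verify with Python's standard `statistics` module:

```python
import statistics

data = [4, 4, 4, 100]   # one obvious outlier
clean = [4, 4, 4]

print(statistics.mean(clean))       # 4
print(statistics.mean(data))        # 28 -- the outlier drags the mean up
print(statistics.pvariance(clean))  # 0
print(statistics.pvariance(data))   # 1728 (population variance, divide by N)
print(statistics.variance(data))    # 2304 (sample variance, divide by N-1)
```

A single point moved the mean from 4 to 28 and the variance from 0 to four digits; that's the predictable degradation the text describes.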
A Different Way of Looking at Data
Another statistical representation is available that is less susceptible to outliers, namely robust statistics. Instead of characterizing data sets with means and variances^, the data is represented by a median and the median of the absolute deviations from that median. These measures are insensitive to outliers that can dominate standard sample estimates. For the example above, [4, 4, 4, 100], the median is 4 (it's also the mode). The absolute deviations from the median are 0, 0, 0, 96, and the median of those deviations is 0. So if the point 100 were truly an anomaly, applying robust statistics allows us to characterize the rest of the data fairly well.
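The median absolute deviation isn't in Python's standard library, but it's a two-liner on top of `statistics.median`:

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median(abs(x - med) for x in values)

data = [4, 4, 4, 100]
print(statistics.median(data))  # 4.0
print(mad(data))                # 0.0 -- the outlier vanishes
```

Compare with the mean/variance example above: the same outlier that moved the variance to 1728 leaves the median and MAD completely untouched.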
What about Anomalies that are a Signal of Systematic Change?
While the occasional spike can ruin a good estimate (or a tire), a significant presence of outliers is something most geeks want to know about, understand, and characterize.
One of the examples I mentioned earlier was social spikes. Studying networks and messages for these types of spikes can be highly lucrative when optimizing broadcast efficiency.
Examples of Hypothesized Conditional Social Sharing Patterns:
- A great many hackers make git repo commits late at night
- People may be more likely to share an interesting article at certain times of the day (pre-work or lunchtime)
- Video and image sharing may dominate late night activity based on folks winding down
- Stimulating articles may have greater attention saturation in the early morning
- An intriguing app released at SXSW gives the early adopting conference goers something new to talk about
Specific topics have a stronger viral coefficient based on their content, location, day, time, and the early audience of their release.
The onset of network spikes is of critical importance. Conditionally weighting the factors that contribute to communication avalanches helps us model and understand them. In this case we want to pay particular attention to the onset of message storms. Estimation techniques that are insensitive to early spikes conceal emerging trends behind their steady performance until they catastrophically break down. The normalized measurement of traffic would look like a delta function when viewed through the eyes of robust statistics. Understanding attention and influence will require careful analysis of both content and environment (location, day, time, early points of contact).
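One way to make onsets "look like a delta function" is the modified z-score, which swaps the mean and standard deviation for the median and MAD (the 0.6745 factor scales the MAD to match the standard deviation for Gaussian data, and 3.5 is a commonly used cutoff). A minimal sketch on a hypothetical hourly message count series:

```python
import statistics

def modified_z_scores(values):
    """Robust z-scores: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        # Degenerate case: most points sit exactly at the median,
        # so any deviation at all is flagged as extreme.
        return [0.0 if x == med else float("inf") for x in values]
    return [0.6745 * (x - med) / mad for x in values]

# Hypothetical hourly message counts with a storm beginning at hour 7
traffic = [12, 9, 11, 10, 13, 11, 10, 95]
scores = modified_z_scores(traffic)
spikes = [i for i, z in enumerate(scores) if abs(z) > 3.5]
print(spikes)  # [7]
```

Because the median and MAD come from the quiet baseline, the very first hour of the storm already scores far beyond the threshold, which is exactly the early-onset sensitivity a mean-based detector lacks.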
Multiple Estimation Models
A series of models may aid in understanding emerging trends. A first-order application of robust statistics will clearly pick up early outliers but not characterize their distribution. A secondary (or tertiary, etc.) estimation model can begin characterizing outliers and separate them out to avoid catastrophic failure of the first model. As the number of system models increases, the degrees of freedom extend. A balancing pressure that penalizes the presence of secondary models is required to remove redundant models when they are no longer necessary.
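A minimal sketch of this idea, assuming the primary model is a robust location estimate and the secondary model simply collects the points the first rejects (the function name, threshold, and model structure are all illustrative, not a prescribed architecture):

```python
import statistics

def split_models(values, threshold=3.5):
    """Assign each point to a primary (inlier) or secondary (outlier)
    model via the modified z-score; the secondary model only exists
    when it has members -- a crude form of the balancing pressure."""
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        inliers = [x for x in values if x == med]
        outliers = [x for x in values if x != med]
    else:
        inliers, outliers = [], []
        for x in values:
            (inliers if abs(0.6745 * (x - med) / mad) <= threshold
             else outliers).append(x)
    models = {"primary": inliers}
    if outliers:
        models["secondary"] = outliers  # instantiated only when needed
    return models

print(split_models([4, 4, 4, 100]))
# {'primary': [4, 4, 4], 'secondary': [100]}
```

A fuller implementation would fit an actual distribution to each subset and use an information criterion (penalizing parameter count) as the balancing pressure; dropping the empty secondary model here stands in for that.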
*= Least squares is an estimation technique that minimizes the sum of squared errors. An alternative is L1-norm estimation, which works with absolute values much like robust statistics: minimizing the L1 norm yields the median, just as minimizing the sum of squares yields the mean.
^= Means and variances (overall, or by cluster for multimodal data) are ideal for Gaussian distributions, which are common in many signal processing applications.
**= The variance of [4, 4, 4, 100] is 1728 or 2304, depending on whether you normalize by N or N-1: 1/4 or 1/3 of (24·24 + 24·24 + 24·24 + 72·72). The sample variance is defined with N-1, but for most large data sets the difference is negligible.