We have selected five highly variable data sets to test our approach.
The first data set (indicated as ``Trace 1'') is a trace from the 1998 World
Soccer Cup Web site (footnote 4.2).
It contains the sizes of the files requested by clients from this Web site in
the course of an entire day.
The other four traces are synthetically generated from analytic models that
closely approximate Web server traffic [3].
Traces 2 and 3 are generated from Lognormal distributions with shape parameters
1.85 and 1.5, respectively, and the same scale parameter 7.0.
Traces 4 and 5 are generated from Weibull distributions with shape parameters
0.25 and 0.35, respectively, and the same scale parameter 9.2.
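As an illustration of how such synthetic traces might be produced, the following minimal sketch draws 25000 samples per trace using only the Python standard library. It is not the generator used for the experiments. It assumes that the Lognormal "shape" and "scale" parameters denote the standard deviation and mean of the logarithm of the data, and that the Weibull parameters follow the usual shape/scale convention. The function names and the seed are hypothetical. Under this reading, the theoretical means are roughly 6070 and 3378 for the Lognormal traces and 220.8 and 46.3 for the Weibull traces.

```python
import math
import random


def lognormal_trace(n, mu, sigma, rng):
    """Draw n Lognormal samples; mu and sigma describe log(X) (assumed convention)."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


def weibull_trace(n, scale, shape, rng):
    """Draw n Weibull samples; random.weibullvariate takes (scale, shape)."""
    return [rng.weibullvariate(scale, shape) for _ in range(n)]


def lognormal_moments(mu, sigma):
    """Theoretical mean and coefficient of variation of a Lognormal distribution."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    cv = math.sqrt(math.exp(sigma ** 2) - 1.0)
    return mean, cv


def weibull_moments(scale, shape):
    """Theoretical mean and coefficient of variation of a Weibull distribution."""
    g1 = math.gamma(1.0 + 1.0 / shape)
    g2 = math.gamma(1.0 + 2.0 / shape)
    return scale * g1, math.sqrt(g2 / g1 ** 2 - 1.0)


if __name__ == "__main__":
    rng = random.Random(1)  # hypothetical seed, chosen only for repeatability
    n = 25000
    traces = {
        2: lognormal_trace(n, 7.0, 1.85, rng),
        3: lognormal_trace(n, 7.0, 1.5, rng),
        4: weibull_trace(n, 9.2, 0.25, rng),
        5: weibull_trace(n, 9.2, 0.35, rng),
    }
    for trace_id, data in traces.items():
        print(trace_id, len(data), sum(data) / len(data))
```

Because these distributions are heavy-tailed, the sample mean and CV of a 25000-entry trace can deviate noticeably from the theoretical values, so an independently generated trace should not be expected to reproduce the reported statistics exactly.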
The statistical characteristics of these data sets are shown in
Table 4.1.
Table 4.1: Statistical characteristics of the data sets.

Trace   Entries    Unique   Mean      CV
1       16045065   12122    4407.81   7.28
2       25000      25000    6358.23   5.87
3       25000      25000    3459.86   3.13
4       25000      22969    227.27    7.36
5       25000      24298    47.50     3.86
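The columns of Table 4.1 can be computed from a trace along the following lines. This is a sketch only: it assumes that CV denotes the coefficient of variation (standard deviation divided by mean) and that the population standard deviation is used, which the text does not specify. The function name `summarize` is hypothetical.

```python
import statistics


def summarize(trace):
    """Return the number of entries, number of unique entries, mean, and CV of a trace."""
    mean = statistics.fmean(trace)
    # Assumption: CV = population standard deviation / mean.
    cv = statistics.pstdev(trace) / mean
    return {
        "entries": len(trace),
        "unique": len(set(trace)),
        "mean": mean,
        "cv": cv,
    }


if __name__ == "__main__":
    # Tiny made-up trace of file sizes, used only to show the call.
    print(summarize([1, 1, 2, 4]))
```

A CV well above 1, as in all five data sets, indicates variability far higher than that of an exponential distribution, whose CV equals 1.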
The number of entries and the number of unique entries in each data set
both affect the performance of D&C EM, since the running
time of the EM algorithm depends on these two quantities [72].
Observe that the real trace has fewer unique entries than the
synthetically generated data sets (12122 versus roughly 23000 to 25000),
even though it contains far more entries overall.