next up previous
Next: 4.2.2 D&C EM queueing Up: 4.2.1 D&C EM experimental Previous: 4.2.1.1 Traces


4.2.1.2 D&C EM statistical analysis

The size of the data sets precludes using goodness-of-fit tests such as the Kolmogorov-Smirnov and $\chi^2$ tests [48]. Therefore, we evaluate the accuracy of D&C EM by checking the matching of statistical properties such as the moments and the median, and by plotting PDFs, CDFs, and CCDFs (Complimentary Cumulative Distribution or Survival function).

Table 4.2 illustrates the means and the CVs of the original data sets, plus various hyperexponential fittings using the EM, FW, and D&C EM algorithms. In Table 4.2, ``$x$ ph'' means that the EM or FW algorithms fit the entire data set into a hyperexponential with $x$ phases. Observe that the D&C EM models match the mean of the traces with maximal error of 4%. The coefficient of variation is more difficult to match, since it is obtained using both the first and the second moments. Nevertheless, D&C EM models match it with a maximal error of 20% (Trace 2). The EM algorithm alone could not generate results for Traces 4 and 5 within reasonable amount of computation time (less than a week) in a Pentium III 800MHz processor with 1GB of memory. Since Traces 4 and 5 are synthetically generated from a Weibull distributions, we fit the same distribution functions into hyperexponential distributions using the FW algorithm and compare them with the D&C EM fits. The results of Table 4.2 show that D&C EM technique matches better the statistical properties of the data sets, when compared to the EM and FW algorithms.

Table 4.2: Statistical evaluation of the fittings.
  Data EM 8 ph FW 15 ph D&C EM
Trace 1
Mean 4407.81 4402.35 n/a 4393.56
CV 7.28 3.44 n/a 7.86
Median 938.00 961.51 n/a 950.59
Trace 2
Mean 6358.23 6196.27 n/a 6164.50
CV 5.87 4.13 n/a 5.13
Median 1082.91 1063.12 n/a 1061.25
Trace 3
Mean 3459.86 3425.72 n/a 3391.06
CV 3.13 2.69 n/a 2.82
Median 1085.32 1084.13 n/a 1086.59
Trace 4
Mean 227.27 n/a 221.34 220.21
CV 7.36 n/a 8.12 7.14
Median 2.20 n/a 2.06 2.27
Trace 5
Mean 47.50 n/a 46.40 46.61
CV 3.86 n/a 3.95 3.87
Median 3.32 n/a 3.14 3.35


D&C EM fits match better even the higher moments of the data sets. We present in Table 4.3 the relative errors of fitted third moments from the actual third moments of all five data sets (we omit the absolute values because they are too large and not easy to read).

Table 4.3: Relative error of the third moment.
  Trace 1 Trace 2 Trace 3 Trace 4 Trace 5
EM 8 ph 75% 91% 70% n/a n/a
FW 15 ph n/a n/a n/a 106% 27%
D&C EM 60% 60% 66% 25% 11%


Figure 4.5: PDF, CDF, and CCDF of the real data and the fitted models.
\begin{figure}\centering {\psfig{figure=III/FIGS/CDFs.eps, width=0.8\linewidth}}
\end{figure}

Figure 4.5 plots the PDH, CDF, and CCDF for each data set and the D&C EM, EM, and FW models. The PDF plots for all five data sets are shown in the first column of graphs in Figure 4.5 (note the logscale of the x-axis). The PDF of Trace 1 is heavily jagged, characteristic of real trace data, which makes matching the PDF more challenging. D&C EM offers accurate fits for all traces. The fits for Traces 2 and 3, both D&C EM and EM ones, do not match well the body of the data PDFs. This happens because Traces 2 and 3 do not have monotone PDFs, while the hyperexponential distribution has a completely monotone PDF [30].

The CDF plots for all traces (middle column of graphs in Figures 4.5) illustrate that the D&C EM provides a good match for the body of the distribution for all traces. In order to investigate the accuracy of the fittings for the tail of the distribution, we present the CCDF plots in log-log scale (third column of graphs in Figure 4.5). Note that even for the tail of the distribution, which reflects the observed variability of the data sets, D&C EM generates models that closely match the data set characteristics.


next up previous
Next: 4.2.2 D&C EM queueing Up: 4.2.1 D&C EM experimental Previous: 4.2.1.1 Traces
Alma Riska 2003-01-13