next up previous
Next: 4.4 MAP fitting algorithm Up: 4. Data Fitting Algorithms Previous: 4.2.4.1 Results with the


4.3 D&C MM

We extend the idea of D&C EM by dividing the data set based on the expected value of each partition rather than the coefficient of variation. In this approach, we fit the data of each partition into PH distributions using methods of moment matching rather than the EM algorithm. This technique applies moment matching techniques in a divide and conquer fashion, hence we call it D&C MM.

Again, we base our technique on the analysis of the data set CDH. Contrary to D&C EM, D&C MM requires determining the number of partitions $N$ before splitting the data into subsets. Partition boundaries are determined such that the expected value of each partition is $M_d/N$, where $M_d$ is the expected value of the entire data set. Once the data set is partitioned, we compute the CV of each subset of data. If the CV of a subset of data is less than 1.0, i.e., the CV of the exponential distribution, we fit that subset of data into a hypoexponential distribution. If the CV is greater than 1.0, we fit that subset of data into a hyperexponential distribution. We fit each subset of data into a PH distribution using the Newton-Raphson method of moment matching, which is described in details in Appendix C and [98]. The resulting PH distribution is a mixture of hypoexponential and hyperexponential distributions. We formally present the D&C MM algorithm in Figure 4.11.

Figure 4.11: D&C MM fitting algorithm.
\framebox{\begin{minipage}[t]{5.5in}
\begin{tabbing}
9\=999\=9999999999999999999...
...\\
\> \> ~~~~Combine results of all PH fittings\\
\end{tabbing}\end{minipage}}

To illustrate the fitting methodology, we selected a data set containing the sizes of the requested files, i.e., the service process, measured during one entire day at the World Cup'98 Web site. We split the data set into four partitions and present in Figure 4.12 the PH distribution resulting form the merging of the four individual fittings; of those, the first one is a two-stage hyperexponential, while the other three are hypoexponential, with the last one very close to an Erlang (the numbers written inside each stage are the rates of the corresponding exponential distributions, while those on the arcs describe probabilistic splittings). To assess the quality of the overall PH fitting, we plot the PDF and CDF of the data and of our fit in Figure 4.134.4. We also evaluate the accuracy of the fitting from the queueing system perspective, and present the results in Table 4.5 (we assume a M/PH/1 server with Poisson arrivals and the fitted PH distribution for service process). We conclude that D&C MM is a fast and accurate approach to fit data sets into PH distributions.

Figure 4.12: The resulting overall phase-type distribution.
\begin{figure}\centerline{\psfig{figure=figs-web/fitting.eps,width=0.95 \linewidth}}\vspace*{-4mm}\end{figure}

Figure 4.13: Comparing the empirical and fitted data: pdf (left) and CDF (right).
\begin{figure}\centerline{\psfig{figure=figs-web/pdf-cdf.eps,width=0.85 \linewidth}}\end{figure}


Table 4.5: Comparing the empirical and fitted data: performance results.
  Real data Fitted data
Arrival rate (trace-driven simulation) (analytical solution)
  E[queue length] E[slowdown] E[queue length] E[slowdown]
0.0000250 0.84 6.51 0.83 6.41
0.0000375 1.92 9.96 1.92 9.81
0.0000500 3.60 14.01 3.60 13.84
0.0000625 6.05 18.82 6.06 18.60
0.0000750 9.51 24.65 9.52 24.38
0.0000875 14.37 31.92 14.40 31.60
0.0001000 21.22 41.24 21.26 40.82
0.0001125 30.99 53.54 31.10 53.09
0.0001250 45.34 70.49 45.44 69.81
0.0001375 67.42 95.29 67.41 94.14
0.0001500 104.22 135.02 103.95 133.07



next up previous
Next: 4.4 MAP fitting algorithm Up: 4. Data Fitting Algorithms Previous: 4.2.4.1 Results with the
Alma Riska 2003-01-13