For CMMI high maturity, building and using PPMs (Process Performance Models) is a mandatory practice. This follows because the goal for OPP (Organizational Process Performance) includes the statement "provide process performance data, baselines, and models to quantitatively manage the organization's projects", and goals are required components in the CMMI model. PPMs, along with PPBs (Process Performance Baselines), are at the heart of high maturity practices and pose their fair share of challenges for the organization. This is also summarized in
CMMI L5 Nemesis - PPBs and PPM.
Linear regression-based PPMs seem to be in vogue, probably because most implementers and consultants find regression easy to learn. This may in turn be because many implementers and consultants have minimal or no theoretical or applied background in statistics.
Coming back to PPM building, before deciding on the modeling approach (regression is just one of many), it is a good idea to probe the following fundamental questions:
- How much time and energy does one have for building a model?
- How much prediction power is one aiming for?
- How much does one already know about statistical modeling?
- How much does one understand the underlying physical process?
- How reliable is the data to be used for building a statistical model?
The last question is the most critical, as it directly impacts the goodness of the PPM regardless of which statistical method one chooses. It is obvious that good PPMs need good data. With respect to data, the following points are important to note:
- The only data available are historical - collecting experimental data or using DoE (Design of Experiments) is not feasible, due to cost or the natural constraints imposed by how often the data are generated
- Data are often unreliable - historical data generally carry a high noise factor and cannot be taken at face value; a basic screening pass, as sketched below, helps flag such problems before modeling
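As a simple illustration, here is a minimal sketch (in Python/pandas) of screening a historical data set for missing values and IQR-based outliers before it is fed into any modeling technique. The column names and figures are hypothetical placeholders, not prescribed metrics.

```python
# Minimal sketch: screen noisy historical data before modeling by
# flagging missing values and IQR-based outliers per metric.
# Column names are hypothetical; real PPM data would use the
# organization's own baselined metrics.
import numpy as np
import pandas as pd

def screen(df: pd.DataFrame) -> pd.DataFrame:
    """Report missing counts and IQR-outlier counts for each numeric column."""
    report = []
    for col in df.select_dtypes(include=[np.number]).columns:
        s = df[col]
        q1, q3 = s.quantile([0.25, 0.75])
        iqr = q3 - q1
        outliers = ((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum()
        report.append({"metric": col,
                       "missing": int(s.isna().sum()),
                       "iqr_outliers": int(outliers)})
    return pd.DataFrame(report)

# Hypothetical historical project data.
df = pd.DataFrame({
    "review_effort_hrs": [12, 15, 9, 14, 95, 11, None, 13],
    "defect_density":    [0.8, 1.1, 0.9, 7.5, 1.0, None, 0.7, 1.2],
})
print(screen(df))
```

A report like this does not clean the data by itself, but it makes the reliability question concrete before one invests effort in any particular modeling approach.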
Linear regression is a good starting point while building PPMs. However, it may fail to deliver PPMs that work well when deployed, despite showing great statistical promise (p-values, adjusted R-squared, VIF, etc. all looking favourable). This can happen for several reasons, such as the following (a short sketch illustrating two of them follows the list):
- Y and/or one or more of the Xs are not normally distributed - this can be handled statistically through transformations but, the Central Limit Theorem and Chebyshev's Inequality notwithstanding, it requires careful evaluation of the underlying statistical characteristics.
- One or more Y-X relationships are non-linear - fitting the equation of a line to a curve is like putting a cylinder into a square hole: if the diameter of the cylinder is equal to or less than the side of the square it will fit, but a good amount of gap will be left around the cylinder.
- Interaction effects exist between two or more Xs - this requires modeling the Xs in combinations such as X1/X2 or X1*X2 so that the interaction effects are damped or amplified as appropriate.
- The regression equation shows an unrealistic sign relation between Y and an X - a relation might be statistically valid yet physically invalid, and understanding of the process should prevail over the sign relation obtained from statistical analysis (it is worth remembering that the earth's movement around the sun was taken to be a circle until Kepler's sharp insight revealed that it is actually an ellipse).
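For illustration, here is a minimal sketch (in Python, using statsmodels) of how a log transform of a skewed Y and an explicit interaction term can be folded into an ordinary least squares PPM. The column names (effort, size, complexity) and the synthetic data are purely hypothetical stand-ins for an organization's actual baselined metrics.

```python
# Minimal sketch: handling a non-normal Y (log transform) and an
# interaction effect (size * complexity) in an OLS-based PPM.
# All names and numbers are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 120

# Synthetic "historical" project data with a multiplicative (skewed) error
# structure and an interaction between size and complexity.
size = rng.uniform(10, 200, n)        # e.g. size in function points
complexity = rng.uniform(1, 5, n)     # e.g. complexity rating
effort = np.exp(1.0 + 0.01 * size + 0.2 * complexity
                + 0.002 * size * complexity + rng.normal(0, 0.2, n))

df = pd.DataFrame({"effort": effort, "size": size, "complexity": complexity})

# 'size * complexity' expands to size + complexity + size:complexity,
# so the interaction term is estimated alongside the main effects.
# np.log(effort) pulls a right-skewed Y toward normality.
model = smf.ols("np.log(effort) ~ size * complexity", data=df).fit()
print(model.summary())

# Prediction for a new project, back-transformed to the original scale
# (this estimates the conditional median of effort, not the mean).
new_project = pd.DataFrame({"size": [120.0], "complexity": [3.5]})
print(np.exp(model.predict(new_project)).values)
```

The point of the sketch is not the specific transform or term, but that the statistical form of the model should be chosen after examining the data and the process, not assumed up front.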
So what does one do beyond regression-based PPMs? There are many other, and often better, options available. Some examples include General Regression, Bayesian Belief Networks (BBNs), simulation-based PPMs built on the underlying empirical distribution, queuing models, etc.
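As an illustration of the simulation option, the following is a minimal sketch of a Monte Carlo PPM that resamples directly from the empirical distribution of historical phase-wise review yields to predict a distribution of escaped defects. The phase names, yield figures, and defect counts are hypothetical assumptions, not a prescribed model.

```python
# Minimal sketch of a simulation-based PPM that samples directly from the
# empirical distribution of historical data (a simple Monte Carlo /
# bootstrap approach). All figures below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(7)

# Historical review yields (fraction of defects caught) per phase, taken
# as-is from past projects instead of assuming a theoretical distribution.
historical_yield = {
    "requirements": [0.55, 0.61, 0.48, 0.66, 0.58, 0.52],
    "design":       [0.60, 0.70, 0.64, 0.57, 0.68, 0.62],
    "code":         [0.72, 0.65, 0.78, 0.70, 0.74, 0.69],
}

injected_defects = 200     # assumed defects injected before the review chain
n_runs = 10_000            # Monte Carlo replications

escaped = np.empty(n_runs)
for i in range(n_runs):
    remaining = injected_defects
    for phase, yields in historical_yield.items():
        # Resample a yield value from the phase's own history (bootstrap).
        remaining *= (1.0 - rng.choice(yields))
    escaped[i] = remaining

# The PPM output is a predicted distribution, not a single point estimate.
p10, p50, p90 = np.percentile(escaped, [10, 50, 90])
print(f"Escaped defects: P10={p10:.1f}, P50={p50:.1f}, P90={p90:.1f}")
```

A model of this kind makes no linearity or normality assumptions and naturally yields a prediction interval, which is often what quantitative project management actually needs.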
No matter what statistical modeling technique is used, what matters in the end is that the PPM works well when deployed. Constant refinement and calibration therefore become important, so that changes and improvements in the physical process, if any, are reflected in the PPM with minimal time lag.
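One simple way to operationalize this is a periodic check of PPM predictions against actuals from recently closed projects, with recalibration triggered when the error drifts. The sketch below assumes a MAPE-based trigger; the threshold and the figures are illustrative, not prescriptive.

```python
# Minimal sketch of a periodic calibration check: compare PPM predictions
# against actuals and flag the model for recalibration when the relative
# error drifts beyond a chosen threshold. Threshold and data are assumptions.
import numpy as np

def needs_recalibration(predicted, actual, mape_threshold=0.25):
    """Return (flag, mape): flag is True when MAPE exceeds the threshold."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    mape = np.mean(np.abs(predicted - actual) / actual)
    return mape > mape_threshold, mape

# Hypothetical figures for the last few closed projects.
predicted_effort = [420, 310, 505, 280]
actual_effort    = [465, 290, 610, 330]

flag, mape = needs_recalibration(predicted_effort, actual_effort)
print(f"MAPE={mape:.2%}, recalibrate={flag}")
```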