Follow-up Commentary on the
Preliminary Water Quality Screening Results

Lake Merced Pilot Stormwater Enhancement Project


I am disappointed that this exchange has been reduced to a level of charge and counter-charge, rather than a cooperative effort seeking a better program to maintain Lake Merced.  Unfortunately, that seems to be the way that the San Francisco Public Utilities Commission prefers to operate, and that approach has carried over to the present study as well.  I am therefore preparing a second set of comments, referencing both the final report on the Vista Grande diversion project and the response to my earlier commentaries prepared by the researchers.1  

“The primary goal was to determine whether the diversion of limited volumes of treated stormwater (about 0.1 to 3.6 million liters per storm event) increased concentrations of bacterial indicators of fecal contamination in South Lake Merced.  Such increases would indicate the potential for increased human health risk (i.e., contracting gastrointestinal disease) during recreation in the lake.”  (1, pg ES-i)

We agree that the goal is to assure that no significant increase in risk to human health results from additions of Vista Grande Canal stormwater to Lake Merced.  We will compare what happens after a storm event with and without such additions; our main concern is to assure that no detrimental impact has occurred.  From an analytic perspective, we take a cautionary approach, placing the burden of proof on the experimenters, requiring them to clearly demonstrate that no such threat exists.  This puts the shoe on the other foot, so to speak, when compared with traditional statistical analysis in which we assume that no impact has occurred, the so-called null hypothesis, until it is clearly demonstrated that the treatment being evaluated has had an effect.

Throughout their report the researchers ignore this distinction, leaving both good science and objective analysis behind.  They want to proceed with their demonstration project, and will defend their opportunity to do so until it is demonstrated beyond any reasonable doubt that the treatment, in this case the introduction of contaminated water into Lake Merced, has had a detrimental effect.  This approach is antithetical to careful protection of public health and safety.

And so the researchers have collected data for three storm events with no diversion and six storm events with diversion.  Based on sample sizes of three and six respectively the researchers conclude that the data is log-normally distributed, that the variances of the two populations are equal, and that sufficient data has been collected to conduct statistical analyses with tools such as the Student t-test.  Other analysts would argue that this is a mighty fine thread upon which to hang the protection of public health and safety.

“Geometric mean E. coli concentrations at most lake sample stations were higher following diversion events than background storm events, but the differences were not statistically significant.”  (1, pg ES-i)

Let’s take a closer look.2  The average value of E-coli bacteria measured at six sample sites near the shore was 98.6 MPN/100mL during diversions, and just 49.2 MPN/100mL during storms with no diversion.  The average amount of E-coli bacteria doubled when stormwater was diverted into the lake, while the maximum amount was more than five times as great.  Comparing these averages using the t-test provides a value of t of 1.55.  The decision criterion is that a finding would be considered significant only if the confidence level exceeded 95%.  The so-called critical value of t at this level for this sample size is 1.67; since the observed value of 1.55 falls short of it, the above claim would be supported.

But is it really?  Who said 95% was an appropriate level?  Suppose that the criterion selected had been 90% confidence, another commonly accepted level?  The critical value of t is then just 1.28; the hypothesis that there was no impact would then have been rejected.  Or suppose that the samples had been larger, say that six storms without diversion had been monitored instead of just three, but with the same observed means and variances.  The value of t would then have been 2.16, well above the threshold level of 1.67.  Are we happy that the likelihood that no detrimental impact was observed was just 1 chance in 10, and not 1 chance in 20?  Are we willing to hang issues of public health and safety on the difference between monitoring three background storms and six?  I am not.
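The sensitivity of the verdict to the chosen confidence level can be sketched in a few lines.  This is only an illustration: it uses a standard normal approximation for the one-sided critical values (an assumption; the exact t critical values for the sample sizes involved are slightly larger), and the function names are introduced here for the example.

```python
from statistics import NormalDist

T_OBSERVED = 1.55  # value of t reported in the commentary above


def critical_value(confidence: float) -> float:
    """One-sided critical value under a normal approximation (assumption)."""
    return NormalDist().inv_cdf(confidence)


def rejects_no_impact(t_value: float, confidence: float) -> bool:
    """True if the 'no impact' hypothesis would be rejected at this level."""
    return t_value > critical_value(confidence)


for confidence in (0.90, 0.95):
    print(
        f"{confidence:.0%}: critical value {critical_value(confidence):.2f}, "
        f"reject 'no impact': {rejects_no_impact(T_OBSERVED, confidence)}"
    )
```

Under this approximation the same observed value of 1.55 leads to rejection at the 90% level and to no rejection at the 95% level, which is the point made above.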

Assuming a log-normal distribution, using logarithms rather than raw numbers to conduct this analysis, does not materially affect the conclusion.  The observed value of t is reduced from 1.55 to 1.42, the confidence level reduced from 94% to 92%, differences that are not significant.  Using logs, then, adds artificial sophistication to the analysis, but does not materially affect the results.
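As a rough check on the confidence levels quoted above, an observed value of t can be converted to a one-sided confidence level.  The sketch below assumes a standard normal approximation, which appears consistent with the 94% and 92% figures; the helper name is hypothetical.

```python
from statistics import NormalDist


def one_sided_confidence(t_value: float) -> float:
    """Confidence level implied by t under a normal approximation (assumption)."""
    return NormalDist().cdf(t_value)


raw_scale = one_sided_confidence(1.55)  # t computed on raw values
log_scale = one_sided_confidence(1.42)  # t computed on logarithms

print(f"raw values: {raw_scale:.1%}")
print(f"logarithms: {log_scale:.1%}")
```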

But all of this is irrelevant.  The t-test is designed to determine whether or not two samples came from the same population.  For example, if we are buying fertilizer for our garden we want to be pretty sure that there will be a beneficial effect on our flowers and vegetables.  The burden of proof then rests on the sellers of this fertilizer to assure that the difference between plots treated with their product and plots that are not so treated is sufficiently large to warrant a purchase.  However, as described above, we are facing the opposite problem.  Users of the lake want the burden of proof to lie with those adding contaminated water to Lake Merced to assure that they are not creating a risk to public health and safety, in short that they are not making a difference.

This is the distinction between Type I and Type II errors.  Type I error occurs when we conclude that a real difference exists when there is none.  Type II error occurs when we conclude that no difference exists when in fact there is a difference.  The researchers have acknowledged a fuzzy familiarity with the concept of the null hypothesis, the assumption that no difference exists that will be discarded only if strong evidence rejects that conclusion.  That is not our issue.  We want to accept the conclusion that there is no threat to public health only if strong evidence supports that conclusion; our goal is protection of public health and safety.
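One way to make the reversed burden of proof concrete is a non-inferiority style check, a technique named here for illustration and not one used in the report.  "No harm" is accepted only if the upper confidence bound on the increase stays below a tolerable margin.  In the sketch below the margin is a hypothetical placeholder, the standard error is the one implied by the reported means and t value, and a normal approximation is assumed.

```python
from statistics import NormalDist


def no_harm_demonstrated(diff: float, std_error: float, margin: float,
                         confidence: float = 0.95) -> bool:
    """True only if the upper one-sided confidence bound on the increase
    lies below the largest increase deemed tolerable (the margin)."""
    z = NormalDist().inv_cdf(confidence)
    upper_bound = diff + z * std_error
    return upper_bound < margin


observed_diff = 98.6 - 49.2           # MPN/100mL, means quoted above
implied_se = observed_diff / 1.55     # standard error implied by t = 1.55
HYPOTHETICAL_MARGIN = 50.0            # placeholder, not taken from any standard

print(no_harm_demonstrated(observed_diff, implied_se, HYPOTHETICAL_MARGIN))
```

With these inputs the upper bound is roughly twice the observed difference, so the check fails for any margin near the observed increase; the choice of margin is a policy judgment that the sketch does not settle.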

We have methods for evaluating Type I error; the t-test is a good example.  We can evaluate Type II error only approximately.  However, to do so a number of conditions must be met which are not met here.  Miller and Freund3 observe “(The probability of) Type II errors . . . can be determined with the use of (Operating Characteristic Curves) so long as we are sampling from normal populations with known standard deviations or both samples are large.”  Again, ignoring these caveats, and assuming independence in the observations, a quick calculation indicates that the probability of a Type II error in the present instance, concluding that there is no difference when in fact there is, approaches 80%.  That represents a high level of risk that is simply not accounted for by the researchers.
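The dependence of Type II error on the size of the true difference can be sketched as follows.  This is a one-sided normal approximation, subject to the same caveats quoted above; the true difference, expressed in standard errors, is left as a free parameter because the assumption behind the figure cited above is not stated, and the sketch does not attempt to reproduce it.

```python
from statistics import NormalDist


def type_ii_error(shift_in_se: float, confidence: float = 0.95) -> float:
    """Probability of failing to reject 'no difference' when the true
    difference equals shift_in_se standard errors (normal approximation)."""
    z_crit = NormalDist().inv_cdf(confidence)
    return NormalDist().cdf(z_crit - shift_in_se)


# Hypothetical true differences, in standard-error units.
for shift in (0.5, 1.0, 2.0, 3.0):
    print(f"true difference of {shift} SE: beta = {type_ii_error(shift):.2f}")
```

The smaller the true difference relative to the standard error, the more likely a real effect goes undetected, which is why small samples weigh so heavily here.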

The conclusion drawn by the researchers, that the differences between the test and control groups are not statistically significant, is then weak at best, supported by far too limited data, and in fact irrelevant if true.

“Although the applicability of these water quality criteria to this study is highly questionable, the criteria are conservative in that full body water contact recreation is prohibited at Lake Merced (SFPUC Resolution No. 10,435) and was not observed during this study, except for fishing.” (1, pg ES-i)

Resolution No. 10,435 says, “(N)o swimming shall be permitted;” other forms of contact recreation have not been excluded.  Further, that Resolution also states, “The Park and Recreation Commission agrees that the primary purpose of the Lake Merced Tract is to supply potable water.”  This purpose was confirmed in 1995 (PUC Resolution #95-0082), observing that “the reservoir’s primary purpose (is) supplying potable water to consumers in San Francisco.”  Clearly, applicability of the highest standards of water quality to Lake Merced is in no way “questionable!”

While a footnote has been added as a response to my earlier comments indicating that full body contact does occur infrequently, if inadvertently, to continue with the claim that such contact is prohibited by a PUC resolution is akin to claiming that the PUC has resolved that boats should not capsize, at least when people are aboard.  Continued attempts to avoid responsibility for protecting water quality in Lake Merced are tiresome at best.

“Based on a “weight-of-evidence” approach, the study results suggested that the pilot diversions probably did not increase potential human health risk associated with fecal contamination during recreation in South Lake Merced.”  (1, pg ES-i)

The so-called “weight-of-evidence” approach requires that there be a number of indicators all supporting the same conclusion, none of which may be individually strong enough to support the conclusion drawn.  The researchers have not found that there are several indicators suggesting that the increase in levels of E-coli bacteria is insignificant.  Instead, they have suggested that such a finding with regard to two or three contaminants, ignoring many others, constitutes multiple evidence that the lake is not impaired.  This reflects the bias of the researchers, that the conclusion, no risk is created by diverting contaminated stormwater into Lake Merced, will be supported unless it can be definitively rejected.  That is not good research methodology, nor does it provide adequate protection for public health and safety.

“CDS effluent concentrations of bacterial indicators and metals were generally several orders of magnitude greater than the concentrations found in South Lake Merced.  This suggests that treatment by the riparian buffer effectively reduced bacterial concentrations.” (1, pg ES-i)

No it doesn’t.  Two additional factors, dilution of water after it has mixed with lake water and coliform die-off, clearly impact these observations, and may in fact be dominant. 

Response to earlier comments stated, “Some level of treatment by the riparian buffer is likely (see the attached document prepared by Michael J. Casteel, Ph.D., SFPUC Research Microbiologist). Additional engineering analysis would be needed to address this issue further. Such analysis was beyond the scope of the pilot study.”  (Emphasis added.)  (2, pg 1)

Casteel argues that since a result was observed in Louisiana it is reasonable to expect that a similar result would occur here.  It is certainly important to review other experience.  However, applying conclusions drawn from that experience to the current demonstration when, as Casteel acknowledges, no supporting data has been collected to evaluate these factors, is not appropriate.  Perhaps the most noteworthy conclusion to be drawn from Casteel’s addendum is the fact that other research teams have found it prudent to collect the data necessary to conduct the analyses that this team has simply ignored.  It is necessary that we do so as well.

As with the remainder of the report, the researchers ignore the destiny of the metals.  “The results of chemical analyses of surface soil samples collected from the riparian buffer suggest that metals present in Vista Grande stormwater runoff did not accumulate in the riparian buffer soils.”  (1, pg ES-ii)  It may also suggest that the Vista Grande stormwater did not percolate into the riparian buffer soils, and that the riparian buffer soils therefore had no effect.  But neither were metals found in the lake.  So where did they go? 

A separate commentary from the researchers states “A number of physical, biological and chemical processes potentially govern the fate of metals in the stormwater runoff diverted to the riparian buffer/lake. Such potential processes include accumulation in the riparian buffer soils (with any changes in soil concentrations potentially masked by natural variability), removal by biological uptake in the buffer or the lake, and adsorption to particles in the lake system. Transformations among species of individual metals are also likely. Characterization of the fate of the metals would require additional monitoring data and engineering analysis. Such monitoring and analysis were beyond the scope of the pilot study.” (Emphasis added.)  (2, pp 1-2)

Finding the metal contaminants is “beyond the scope of this study?”  Yes, there are a number of alternative answers, including the possibility that the sampling technique, collecting surface water grab samples, was inadequate for finding the metals.  Listing possibilities without exploring their impact neither constitutes good research nor provides adequate protection for the environment.

“Meeting overall Pilot Stormwater Enhancement Project goals with respect to raising water levels in Lake Merced would necessitate increasing the volume of Vista Grande stormwater runoff diverted to the lake.  If the diversion volume is increased, Water Board staff would likely request additional water quality monitoring to continue testing for water quality impacts in the lake.”  (1, pg ES-ii)

I sure hope so.  The research report minimizes the threat of significant contamination resulting from these diversions.  However, in Figure 3 (1, pg 17) it is evident that the level of E-coli bacteria in the test area came very close to acceptable limits during diversion periods, and never came nearly so close during what is called “background storms.”  Note that this data is presented on a logarithmic scale, and differences would appear much greater on a linear chart.  In fact, the maximum level of E-coli observed during diversion storms was more than five times as great as the maximum during background storms.  Having observed this graph, however, the researchers reach the remarkable conclusion that the distribution of the black dots is not significantly different from the distribution of the white dots.


Furthermore, we don’t know if the contamination was introduced during the end of the test period, resulting from saturation of the riparian buffer during the earlier periods.  Were that the case, increasing the diversion amount by a relatively small amount might significantly increase the amount of coliform bacteria introduced into the lake.  The researchers respond, “Additional engineering analysis would be needed to address this issue.  Such analysis was beyond the scope of the pilot study.”  (Emphasis added.)  (2, repeated 5 times)

In fact, evaluation of all mechanisms by which the riparian buffer might have an impact, and the extent to which each contributes, is considered to be “outside the scope of the pilot study.”  We have then a classic black-box evaluation, with a look at the input, another look at the output, and no assessment of the processes in between.  With such ‘black-box’ studies it is frequently the case that early results cannot be replicated, and it is certainly the case that the process cannot be controlled.  Extrapolation outside the range of the observed results cannot be supported, and we can draw no conclusions regarding the likely outcome of increased volumes of Vista Grande water being added to the lake.  Additional monitoring, in terms of increased frequency, will not be adequate.  It will be necessary to design the study so as to address process issues that have been deferred as “beyond the scope of the pilot study.”

The scope of this study seems much too modest, and inadequate to the goal of protecting public health and safety.  Until a more thorough research protocol has been defined that explores the mechanism of the process being evaluated as well as the results, and until a research team has been assembled with a greater awareness of risk assessment tools, no additional testing should go forward.

John Plummer
October 31, 2005

1) Direct quotations from these sources are italicized. Quotations will be identified with the following notation: (n, page no.) where n=1 indicates the final report issued by the research team, n=2 indicates a response to my letter to John West dated 9/26/05, and n=3 indicates a response to my initial notes provided Patrick Sweetland dated 9/15/05.

2) This analysis was conducted assuming that the six measurements for each event were independent values, and not replicates of a single value.  This produced sample sizes of 36 and 18, respectively, not the 6 and 3 described.  In addition, data from the control point in the middle of the lake was excluded.  A significant storm-to-storm difference in observed performance during diversion events indicates that this assumption is not valid.  However, without this assumption, retaining the sample sizes of 6 and 3, no such analysis is feasible, as a meaningful variance cannot be determined with three sample points.  Correcting for this correlation is beyond the scope of my quick assessment.

3) Miller, Irwin and John E. Freund, Probability and Statistics for Engineers, Third Edition, Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985, pg. 220




 

Attachment: A few notes on methodology.

This is not intended to be a comprehensive review of the analysis presented.  Instead, a few examples are provided to illustrate the type of analytic problems encountered in this report.

1) In an earlier response to my comments the researchers state, “There is only one geometric mean of a data set, and we did not report a ‘logarithmic mean’ or a ‘geometric mean of logarithms.’”  (3, pg 2) Yet the text of the report states, “The results of lake water sample and CDS effluent bacteriological assays are reported as MPN of total coliform, E. coli and enterococci per 100 mL. These data were also log-transformed and average bacteriological values are reported as the geometric mean (log10 MPN/100 mL) presented with their corresponding 95% confidence limits.” (1, pg 5)

Standard notation would indicate that the geometric mean of the logs was presented.  Since the geometric mean and the exponentiated mean of the logs are exactly the same values, the “geometric mean (log10 MPN/100 mL)” is the same as the mean of the logs of the logs of the values, again after appropriate exponentiation.  There may be precedent for this approach with which I am not familiar.  My earlier observation, that this “sounds like a double smoothing,” has not been satisfactorily answered.
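The identity relied on above, and the ambiguity it creates, can be checked with a small sketch.  The counts below are hypothetical illustration values, not data from the report.

```python
import math
from statistics import geometric_mean, mean

# Hypothetical bacterial counts (MPN/100 mL), for illustration only.
counts = [10.0, 100.0, 1000.0]
logs = [math.log10(x) for x in counts]

# The geometric mean equals the exponentiated arithmetic mean of the logs.
gm_of_counts = geometric_mean(counts)
exp_mean_of_logs = 10 ** mean(logs)

# A geometric mean taken over the logs themselves is a different quantity.
gm_of_logs = geometric_mean(logs)

print(gm_of_counts, exp_mean_of_logs, gm_of_logs)
```

The first two quantities agree, while the geometric mean of the logged values does not equal the log of the geometric mean, so a label such as “geometric mean (log10 MPN/100 mL)” leaves open which of the two was computed.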

2) I had earlier observed that differences in rainfall between background and diversion events made comparisons between these samples at best conjecture.  The researchers responded, “The fact that average rainfall during diversion events was greater than during background events makes the analysis conservative. This is because a greater volume of local stormwater runoff from the surrounding watershed and associated bacteria entered the lake during diversion storm events than background storms.” (2, pp 2-3)  I’ve had difficulty finding a definition of “conservative” in my Scientific Method book; in fact there is none.  While the researchers’ rationale makes superficial sense, our lack of any understanding of the mechanics of this process makes this statement indefensible.  Suppose, for example, that greater volumes of rainfall increased the rate of flow of the water cascading down the buffer zone, driving it deeper into the lake.  In that case surface grab samples might seriously under-represent the actual lake impact.  That may also not be the case; we simply don’t know.

3)  I think that it is well established that the t-test has been inappropriately applied in this report.  However, even were the application of this test appropriate, the way it has been applied is not.  I earlier made the point that it is not a statistically valid technique to simply cherry-pick a long list of t-test statistics, picking the ones that you like, or that satisfy some significance criterion.

The researchers responded, “The above comment appears to refer to statistical procedures such as the Bonferroni correction, which adjust the significance level to compensate for the increased probability of error when multiple comparisons are made. While many statistical comparisons were made in the draft report, the comparisons were individual rather than multiple.” (2, pg 4)

No, I was not referring to the Bonferroni correction.  And I am well aware that “many statistical comparisons were made,” each of them “individual;” that was exactly my point.  The researchers did not understand my comment; given their apparent lack of understanding of the field of statistics, that might be expected.  Perhaps in the future someone better versed in the field of statistical risk analysis will be added to the research team.

John Plummer
November 4, 2005