Follow-up Commentary on the
Preliminary Water Quality Screening Results
Lake Merced Pilot Stormwater Enhancement Project
I am disappointed that this
exchange has been reduced to a level of charge and counter-charge,
rather than
a
cooperative effort seeking a better program to maintain Lake Merced. Unfortunately, that seems to be the way that
the San Francisco Public Utilities Commission prefers to operate, and
that
approach has carried over to the present study as well.
I am therefore preparing a second set of
comments, referencing both the final report on the Vista Grande
diversion
project and the response to my earlier commentaries prepared by the
researchers.1
“The primary goal was to
determine whether the diversion of limited volumes of treated
stormwater (about
0.1 to 3.6 million liters per storm event) increased concentrations of
bacterial indicators of fecal contamination in South Lake Merced. Such increases would indicate the potential
for increased human health risk (i.e., contracting gastrointestinal
disease)
during recreation in the lake.” (1, pg ES-i)
We agree that the goal is to
assure that no significant increase in risk to human health results
from additions of
Vista
Grande Canal stormwater to Lake Merced.
We will compare what happens after a storm event with and
without such
additions; our main concern is to assure that no detrimental impact has
occurred. From an analytic perspective, we
take a
cautionary approach, placing the burden of proof on the experimenters,
requiring them to clearly demonstrate that no such threat exists. This puts the shoe on the other foot, so to
speak, when compared with traditional statistical analysis in which we
assume
that no impact has occurred, the so-called null hypothesis, until it is
clearly
demonstrated that the treatment being evaluated has had an effect.
Throughout their report the
researchers ignore this distinction, leaving both good science and
objective
analysis behind. They want to proceed
with their demonstration project, and will defend their opportunity to
do so
until it is demonstrated beyond any reasonable doubt that the
treatment, in
this case the introduction of contaminated water into Lake Merced, has
had a
detrimental effect. This approach is
antithetical to careful protection of public health and safety.
And so the researchers have
collected data for three storm events with no diversion and six storm
events
with diversion. Based on sample sizes
of three and six respectively the researchers conclude that the data is
log-normally distributed, that the variances of the two populations are
equal,
and that sufficient data has been collected to conduct statistical
analyses
with tools such as the Student t-test.
Other analysts would argue that this is a mighty fine thread
upon which
to hang the protection of public health and safety.
“Geometric mean E. coli
concentrations at most lake sample stations were higher following
diversion
events than background storm events, but the differences were not
statistically
significant.”
(1, pg ES-i)
Let’s take a closer
look.2 The average value
of E-coli bacteria
measured at six sample sites near the shore during diversions was 98.6
MPN/100mL; during storms with no diversion it was just 49.2 MPN/100mL. The average amount of E-coli bacteria
doubled when stormwater was diverted into the lake, while the maximum
amount
was more than five times as great.
Comparing these averages using the t-test provides a value of t
of
1.55. Given the decision criterion that
a finding would be considered significant only if the confidence
level
exceeded 95%, and given that the so-called critical value of t at this level for
this
sample size is 1.67, a value greater than 1.55, the above claim would
be
supported.
But is it really? Who
said 95% was an appropriate level? Suppose
that the criterion selected had been
90% confidence, another commonly accepted level? The
critical value of t is then just 1.28; the hypothesis that
there was no impact would then have been rejected.
Or suppose that the samples had been larger, say that six storms
without diversion had been monitored instead of just three, but with
the same
observed means and variances. The value
of t would then have been 2.16, well above the threshold level of 1.67. Are we happy that the likelihood that no
detrimental impact was observed was just 1 chance in 10, and not 1
chance in
20? Are we willing to hang issues of
public health and safety on the difference between monitoring three
background
storms and six? I am not.
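The sensitivity to sample size can be illustrated with a minimal sketch of the pooled two-sample t statistic. The means (98.6 and 49.2 MPN/100mL) and the sample sizes (36 and 18) are the ones used in this commentary; the two standard deviations are hypothetical placeholders, not values from the report, so the printed t values are illustrative only.

```python
import math


def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sample t statistic using a pooled (equal-variance) estimate."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err


# Means and sample sizes as stated in the commentary.
DIVERSION_MEAN, BACKGROUND_MEAN = 98.6, 49.2
# HYPOTHETICAL standard deviations, assumed for illustration only.
DIVERSION_SD, BACKGROUND_SD = 130.0, 40.0

# Three background storms (18 measurements) versus six (36 measurements),
# holding the means and standard deviations fixed.
t_three_storms = pooled_t(DIVERSION_MEAN, DIVERSION_SD, 36,
                          BACKGROUND_MEAN, BACKGROUND_SD, 18)
t_six_storms = pooled_t(DIVERSION_MEAN, DIVERSION_SD, 36,
                        BACKGROUND_MEAN, BACKGROUND_SD, 36)

print(f"t with 18 background measurements: {t_three_storms:.2f}")
print(f"t with 36 background measurements: {t_six_storms:.2f}")
```

With the means and spreads held fixed, the statistic grows only because the background sample is larger, which is the point made above.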
Assuming a log-normal
distribution, using logarithms rather than raw numbers to conduct this
analysis, does not materially affect the conclusion.
The observed value of t is reduced from 1.55 to 1.42, the
confidence level reduced from 94% to 92%, differences that are not
significant. Using logs, then, adds
artificial sophistication to the analysis, but does not materially
affect the
results.
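The confidence levels quoted above can be checked with a short sketch that integrates the Student t density numerically. It assumes a one-sided test and 52 degrees of freedom (36 + 18 - 2, from the sample sizes described in note 2); the function names are my own, and the integration is a simple Simpson's rule rather than a library routine.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def one_sided_confidence(t, df, steps=2000):
    """P(T <= t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


DF = 36 + 18 - 2  # assumed degrees of freedom

for label, t_value in [("raw values", 1.55), ("log-transformed", 1.42)]:
    conf = one_sided_confidence(t_value, DF)
    print(f"{label}: t = {t_value}, one-sided confidence = {conf:.1%}")
```

Under these assumptions both values fall a little short of the 95% criterion and well above 90%, so the choice of threshold, not the transformation, decides the outcome.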
But all of this is
irrelevant. The t-test is designed to
determine whether or not two samples came from the same population. For example, if we are buying fertilizer for
our garden we want to be pretty sure that there will be a beneficial
effect on
our flowers and vegetables. The burden
of proof then rests on the sellers of this fertilizer to assure that
the
difference between plots treated with their product and plots that are
not so
treated is sufficiently large to warrant a purchase.
However, as described above, we are facing the opposite
problem. Users of the lake want the
burden of proof to lie with those adding contaminated water to Lake
Merced to
assure that they are not creating a risk to public health and safety,
in short
that they are not making a difference.
This is the distinction
between Type I and Type II errors. Type
I error occurs when we conclude that a real difference exists when
there is
none. Type II error occurs when we
conclude that no difference exists when in fact there is a difference. The researchers have acknowledged a fuzzy
familiarity with the concept of the null hypothesis, the assumption
that no
difference exists that will be discarded only if strong evidence
rejects that
conclusion. That is not our issue. We want to accept the conclusion that there
is no threat to public health only if strong evidence supports that
conclusion;
our goal is protection of public health and safety.
We have methods for evaluating Type I error;
the t-test is a good example. We can evaluate Type II error only
approximately, and even then only when a number of conditions are
met, which is not the case here. Miller and Freund3 observe “(The probability of) Type II errors . . .
can be determined with the use of (Operating Characteristic Curves) so
long as we are sampling from normal populations with known standard
deviations or both samples are large.” Again, ignoring these
caveats, and assuming independence in the observations, a quick
calculation indicates that the probability of a Type II error in the
present instance, concluding that there is no difference when in fact
there is, approaches 80%. That represents a high level of risk
that is simply not accounted for by the researchers.
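How a Type II error probability of this kind is obtained can be sketched under the idealized conditions Miller and Freund describe: a normal population, a known standard deviation, and a one-sided test at the 5% level. The function and the list of assumed true shifts are illustrative only; the sketch does not use the study data and does not attempt to reproduce the figure quoted above, whose inputs are not restated here.

```python
from statistics import NormalDist


def type_ii_error(true_shift_in_se, alpha=0.05):
    """Chance of missing a real increase in a one-sided z-test.

    true_shift_in_se is the assumed true difference in means, expressed
    in units of the standard error of the difference (an assumption).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(z_crit - true_shift_in_se)


# Assumed true shifts, in standard-error units, chosen for illustration.
for shift in (0.5, 1.0, 2.0, 3.0):
    print(f"true shift = {shift:.1f} SE -> Type II error = {type_ii_error(shift):.0%}")
```

The smaller the real effect relative to the noise in the data, the more likely a test of this kind is to report "no significant difference" when a difference exists.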
The conclusion drawn by the
researchers, that the differences between the test and control groups
are not
statistically significant, is then weak at best, supported by far too
limited
data, and in fact irrelevant if true.
“Although the
applicability of these water quality criteria to this study is highly
questionable, the criteria are conservative in that full body water
contact
recreation is prohibited at Lake Merced (SFPUC Resolution No. 10,435)
and was
not observed during this study, except for fishing.” (1, pg ES-i)
Resolution No. 10,435 says,
“(N)o swimming shall be permitted;” other forms of contact recreation
have not
been excluded. Further, that Resolution
also states, “The Park and Recreation Commission agrees that the
primary
purpose of the Lake Merced Tract is to supply potable water.” This purpose was confirmed in 1995 (PUC
Resolution #95-0082), observing that “the reservoir’s primary purpose
(is)
supplying potable water to consumers in San Francisco.”
Clearly, applicability of the highest standards
of water quality to Lake Merced is in no way “questionable!”
While a footnote has been
added as a response to my earlier comments indicating that full body
contact
does occur infrequently, if inadvertently, to continue with the claim
that such
contact is prohibited by a PUC resolution is akin to claiming that the
PUC has
resolved that boats should not capsize, at least when people are aboard. Continued attempts to avoid responsibility
for protecting water quality in Lake Merced are tiresome at best.
“Based on a
“weight-of-evidence” approach, the study results suggested that the
pilot
diversions probably did not increase potential human health risk
associated
with fecal contamination during recreation in South Lake Merced.” (1,
pg ES-i)
The so-called
“weight-of-evidence” approach requires that there be a number of
indicators all
supporting the same conclusion, none of which may be individually
strong enough
to support the conclusion drawn. The
researchers have not found that there are several indicators suggesting
that
the increase in levels of E-coli bacteria is insignificant. Instead, they have suggested that such a
finding with regard to two or three contaminants, ignoring many others,
constitutes multiple evidence that the lake is not impaired. This reflects the bias of the researchers,
that the conclusion, no risk is created by diverting contaminated
stormwater
into Lake Merced, will be supported unless it can be definitively
rejected. That is not good research
methodology, nor
does it provide adequate protection for public health and safety.
“CDS effluent
concentrations of bacterial indicators and metals were generally
several orders
of magnitude greater than the concentrations found in South Lake Merced. This suggests that treatment by the riparian
buffer effectively reduced bacterial concentrations.” (1, pg ES-i)
No, it doesn’t. Two
additional factors, dilution of water
after it has mixed with lake water and coliform die-off, clearly impact
these
observations, and may in fact be dominant.
Response to earlier comments
stated, “Some level of treatment by the riparian buffer is likely
(see the
attached document prepared by Michael J. Casteel, Ph.D., SFPUC Research
Microbiologist). Additional engineering analysis would be needed to
address
this issue further. Such analysis was beyond the scope of the pilot
study.” (Emphasis added.) (2, pg 1)
Casteel argues that since a
result was observed in Louisiana it is reasonable to expect that a
similar
result would occur here. It is
certainly important to review other experience. However,
applying conclusions drawn from that experience to the
current demonstration when, as Casteel acknowledges, no supporting data
has
been collected to evaluate these factors, is not appropriate. Perhaps the most noteworthy conclusion to be
drawn from Casteel’s addendum is the fact that other research teams
have found
it prudent to collect the data necessary to conduct the analyses that
this team
has simply ignored. It is necessary
that we do so as well.
As with the remainder of the
report, the researchers ignore the fate of the metals.
“The results of chemical analyses of
surface soil samples collected from the riparian buffer suggest that
metals
present in Vista Grande stormwater runoff did not accumulate in the
riparian
buffer soils.” (1, pg ES-ii) It may also suggest that the Vista Grande
stormwater did not percolate into the riparian buffer soils, and that
the riparian buffer soils therefore had no effect.
But neither were metals found in the
lake. So where did they go?
A separate commentary from
the researchers states “A number of physical, biological and
chemical
processes potentially govern the fate of metals in the stormwater
runoff
diverted to the riparian buffer/lake. Such potential processes include
accumulation in the riparian buffer soils (with any changes in soil
concentrations potentially masked by natural variability), removal by
biological uptake in the buffer or the lake, and adsorption to
particles in the
lake system. Transformations among species of individual metals are
also
likely. Characterization of the fate of the metals would require
additional
monitoring data and engineering analysis. Such monitoring and analysis
were beyond
the scope of the pilot study.” (Emphasis added.)
(2, pp 1-2)
Finding the metal
contaminants is “beyond the scope of this study?” Yes,
there are a number of alternative answers, including the
possibility that the sampling technique, collecting surface water grab
samples,
was inadequate for finding the metals.
Listing possibilities without exploring their impact neither
constitutes
good research nor provides adequate protection for the environment.
“Meeting overall Pilot
Stormwater Enhancement Project goals with respect to raising water
levels in
Lake Merced would necessitate increasing the volume of Vista Grande
stormwater
runoff diverted to the lake. If the
diversion volume is increased, Water Board staff would likely request
additional water quality monitoring to continue testing for water
quality
impacts in the lake.” (1, pg ES-ii)
I sure hope so. The
research report minimizes the threat of
significant contamination resulting from these diversions.
However, in Figure 3 (1, pg 17) it is
evident that the level of E-coli bacteria in the test area came very
close to
acceptable limits during diversion periods, and never came nearly so
close
during what is called “background storms.”
Note that this data is presented on a logarithmic scale, and
differences
would appear much greater on a linear chart.
In fact, the maximum level of E-Coli observed during diversion
storms
was more than five times as great as the maximum during background
storms. Having observed this graph,
however, the
researchers reach the remarkable conclusion that the distribution of
the black
dots is not significantly different than the distribution of the white
dots.

Furthermore, we don’t know if
the contamination was introduced during the end of the test period,
resulting
from saturation of the riparian buffer during the earlier periods. Were that the case, increasing the diversion
volume by a relatively small amount might significantly increase the
amount of
coliform bacteria introduced into the lake.
The researchers respond, “Additional engineering analysis
would be
needed to address this issue. Such
analysis was beyond the scope of the pilot study” (Emphasis added.) (2,
repeated 5 times)
In fact, evaluation of all
mechanisms by which the riparian buffer might have an impact, and the
extent to
which each contributes, is considered to be “outside the scope of the
pilot
study.” We have then a classic
black-box evaluation, with a look at the input, another look at the
output, and
no assessment of the processes in between.
With such ‘black-box’ studies it is frequently the case that
early
results cannot be replicated, and it is certainly the case that the
process
cannot be controlled. Extrapolation
outside the range of the observed results cannot be supported, and we
can draw
no conclusions regarding the likely outcome of increased volumes of
Vista Grande
water being added to the lake.
Additional monitoring, in terms of increased frequency, will not
be
adequate. It will be necessary to
design the study so as to address process issues that have been
deferred as
“beyond the scope of the pilot study.”
The scope of this study seems
much too modest, and inadequate to the goal of protecting public health
and
safety. Until a more thorough research
protocol has been defined that explores the mechanism of the process
being evaluated
as well as the results, and until a research team has been assembled
with a
greater awareness of risk assessment tools, no additional testing
should go
forward.
John Plummer
October 31, 2005
1) Direct quotations from these sources are
italicized. Quotations will be identified with the following
notation: (n, page no.), where n=1 indicates the final report issued by
the research team, n=2 indicates a response to my letter to John West dated
9/26/05, and n=3 indicates a response to my initial notes provided to
Patrick Sweetland dated 9/15/05.
2) This analysis
was conducted assuming that the six measurements for each event were
independent values, and not replicates of a single value. This
produced sample sizes of 36 and 18, respectively, not the 6 and 3
described. In addition, data from the control point in the middle
of the lake was excluded. A significant storm-to-storm difference
in observed performance during diversion events indicates that this
assumption is not valid. However, without this assumption,
retaining the sample sizes of 6 and 3, no such analysis is feasible, as
a meaningful variance cannot be determined with three sample
points. Correcting for this correlation is beyond the scope of my
quick assessment.
3) Miller, Irwin and John E.
Freund, Probability and Statistics for Engineers, Third Edition,
Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985, pg. 220
Attachment: A few notes on methodology.
This is not intended to be a
comprehensive review of the analysis presented. Instead,
a few examples are provided to illustrate the type of
analytic problems encountered in this report.
1) In an earlier response to my
comments the researchers state, “There is only one geometric mean
of a data
set, and we did not report a “logarithmic mean” or a “geometric mean of
logarithms.” (3, pg 2) Yet the text
of the report states, “The results of lake water sample and CDS
effluent
bacteriological assays are reported as MPN of total coliform, E. coli
and
enterococci per 100 mL. These data were also log-transformed and
average
bacteriological values are reported as the geometric mean (log10
MPN/100 mL)
presented with their corresponding 95% confidence limits.” (1, pg 5)
Standard notation would
indicate that the geometric mean of the logs was presented. Since the geometric mean and the
exponentiated mean of the logs are exactly the same values, the
“geometric mean
(log10 MPN/100mL)” is the same as the mean of the logs of the logs of
the
values, again after appropriate exponentiation. There
may be precedent for this approach with which I am not
familiar. My earlier observation, that
this “sounds like a double smoothing,” has not been satisfactorily
answered.
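The distinction at issue can be made concrete with a small sketch. The counts below are hypothetical, not study data. The sketch shows that the geometric mean of the raw values equals the back-transformed mean of their logs, and that a geometric mean taken of the log values themselves is a different quantity.

```python
import math
from statistics import geometric_mean, mean

# HYPOTHETICAL bacterial counts in MPN/100 mL, for illustration only.
counts = [12.0, 35.0, 80.0, 210.0]
logs = [math.log10(c) for c in counts]

gm_of_counts = geometric_mean(counts)   # geometric mean of the raw values
mean_of_logs = mean(logs)               # arithmetic mean of the log10 values
back_transformed = 10 ** mean_of_logs   # returned to MPN/100 mL
gm_of_logs = geometric_mean(logs)       # geometric mean applied to the logs

print(f"geometric mean of counts:     {gm_of_counts:.2f} MPN/100 mL")
print(f"10 ** (mean of logs):         {back_transformed:.2f} MPN/100 mL")
print(f"mean of logs:                 {mean_of_logs:.4f} log10 units")
print(f"geometric mean of the logs:   {gm_of_logs:.4f} log10 units")
```

A value labeled "geometric mean (log10 MPN/100 mL)" could be read as either of the last two quantities, which is why the notation matters.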
2) I had earlier observed
that differences in rainfall between background and diversion events
made
comparisons between these samples at best conjecture.
The researchers responded, “The fact that average rainfall
during diversion events was greater than during background events makes
the
analysis conservative. This is because a greater volume of local
stormwater
runoff from the surrounding watershed and associated bacteria entered
the lake
during diversion storm events than background storms.” (2, pp 2-3) I’ve had difficulty finding a definition of
“conservative” in my Scientific Method book; in fact there is none. While the researchers’ rationale makes
superficial sense, our lack of any understanding of the mechanics of
this
process makes this statement indefensible.
Suppose, for example, that greater volumes of rainfall increased
the
rate of flow of the water cascading down the buffer zone, driving it
deeper
into the lake. In that case surface
grab samples might seriously under-represent the actual lake impact. That may also not be the case; we simply
don’t know.
3) I think that it is well
established that the t-test has been inappropriately applied in this
report. However, even were the
application of this test appropriate, the way it has been applied is
not. I earlier made the point that it is
not a
statistically valid technique to simply cherry-pick a long list of
t-test
statistics, picking the ones that you like, or that satisfy some
significance
criterion.
The researchers responded, “The
above comment appears to refer to statistical procedures such as the
Bonferroni
correction, which adjust the significance level to compensate for the
increased
probability of error when multiple comparisons are made. While many
statistical
comparisons were made in the draft report, the comparisons were
individual
rather than multiple.” (2, pg 4)
No, I was not referring to
the Bonferroni correction. And I am
well aware that “many statistical comparisons were made,” each of them
“individual;” that was exactly my point.
The researchers did not understand my comment; given their
apparent lack
of understanding of the field of statistics, that might be expected. Perhaps in the future someone better versed
in the field of statistical risk analysis will be added to the
research team.
John Plummer
November 4, 2005