## Effect Metrics Derived with Hypothesis Tests

Developed initially to cope with dose/concentration-response data for which an acceptable model could not be developed, hypothesis test-based methods are now applied heavily in tests of chronic or subtle effects. As will be shown, the intent is to estimate a threshold concentration or dose above which an observable effect might be expected. Most, but not all, of the relevant statistical methods are conventional hypothesis tests.

The general approach (Figure 4) is similar to that shown in Figure 1, but the variances within and among treatments are assessed instead of developing a dose/concentration-effect model. A series of dose, dilution, or concentration treatments is established with replication within each treatment. After a specific duration, the level of effect manifested within each treatment is scored and compared statistically with that in the reference treatment. As shown in Figure 4, each treatment for which the effect is statistically significantly different from that of the reference treatment is identified (denoted with an asterisk in the figure).

Figure 4 (a) The experimental design of hypothesis testing-based methods and (b) the determination of NOEC, LOEC, and MATC (data from Table 1). The treatment (concentration) axis spans 0%, 10%, 18%, 25%, 32%, and 100%. The organisms that died in response to the treatment are denoted in black and those still living are denoted in white. The treatments for which the effect is statistically different (α = 0.05) from that of the reference treatment are denoted with an asterisk. In this example, the effect was death after the exposure, although other responses can be, and often are, used in these kinds of experiments.

The lowest treatment concentration with a response statistically different from that of the reference (e.g., 0%) is called the 'lowest observed effect concentration' (LOEC). The highest treatment concentration with a response that is not significantly different from the reference response is called the 'no observed effect concentration' (NOEC). Although formally a dubious inference from hypothesis testing, the NOEC and LOEC are pragmatically treated in ecotoxicology as the lower and upper bounds for the 'maximum acceptable toxicant concentration' (MATC), that is, a threshold concentration presumed to be 'safe'. Extending this pragmatic approach, the geometric mean of the NOEC and LOEC is sometimes used as the best estimate of the MATC. Considerable debate continues about the acceptability of such interpretations of these hypothesis test-derived metrics.
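The NOEC, LOEC, and MATC definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `noec_loec_matc` is hypothetical, the concentrations are those on the Figure 4 axis, and the significance flags are assumed for illustration rather than taken from Table 1.

```python
import math

def noec_loec_matc(concentrations, significant):
    """Return (NOEC, LOEC, MATC) from per-treatment significance flags.

    concentrations: treatment concentrations, excluding the reference (0%).
    significant: parallel booleans, True where the treatment response
        differed significantly from the reference response.
    """
    pairs = list(zip(concentrations, significant))
    sig = [c for c, s in pairs if s]
    nonsig = [c for c, s in pairs if not s]
    loec = min(sig) if sig else None        # lowest significant treatment
    noec = max(nonsig) if nonsig else None  # highest non-significant treatment
    matc = None
    # The geometric mean is only meaningful when NOEC lies below LOEC.
    if noec is not None and loec is not None and noec < loec:
        matc = math.sqrt(noec * loec)
    return noec, loec, matc

# Concentrations (% effluent) from the Figure 4 axis; flags are ASSUMED.
conc = [10, 18, 25, 32, 100]
flags = [False, True, True, True, True]
noec, loec, matc = noec_loec_matc(conc, flags)
print(noec, loec, round(matc, 1))  # 10 18 13.4
```

The function returns no MATC when the significance pattern is not monotonic (a non-significant treatment above a significant one), since the bracketing interpretation no longer applies in that case.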

A range of hypothesis tests is commonly applied to NOEC and LOEC estimation, including parametric and nonparametric tests (Figure 5). These tests differ in their underlying assumptions and in their consequent ability to detect a significant difference if one exists, that is, their statistical power. The tests carrying the most assumptions are generally the most powerful. However, the differences in power can be trivial or critical depending on the specific tests being compared and the qualities of the data. As important examples in Figure 5, the parametric tests are generally more powerful than the nonparametric tests, and tests assuming a monotonic trend with treatment concentration are more powerful than those that do not assume a monotonic trend.

With the hypothesis testing approach, the data (concentration, dilution, or dose vs. effect level for each treatment replicate) might be used directly or, as is commonly done for proportions responding, transformed in order to meet the assumptions of the subsequent hypothesis tests. Formally, the parametric methods can be applied if the data show no evidence of non-normality or heterogeneity of variances among treatments. A powerful parametric trend test (Williams's) can be used if an additional assumption of a monotonic trend (increase or decrease in response) with increasing concentration or dose is justifiable. In some cases, such as in the presence of hormesis, a monotonic trend would not be expected. If the assumptions allowing use of the parametric tests are not met, the less powerful nonparametric methods can still be used. If a trend is assumed, then the Jonckheere-Terpstra test can be applied. If not, the less powerful Wilcoxon rank sum test with a Bonferroni adjustment of the experiment-wise error rate or Steel's many-to-one rank test can be used. These last two tests tend to be the least powerful of the hypothesis tests described to this point because they carry the fewest assumptions.
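The selection logic described above can be written as a small decision function. This is a rough sketch of the text's reasoning, not a reproduction of Figure 5; the function name `candidate_tests` and its boolean arguments are hypothetical.

```python
def candidate_tests(normal, equal_variances, monotonic_trend):
    """Suggest hypothesis tests given which assumptions appear tenable.

    normal: no evidence of non-normality.
    equal_variances: no evidence of heterogeneity of variances.
    monotonic_trend: a monotonic concentration-response trend is justifiable
        (not the case, for example, when hormesis is expected).
    """
    if normal and equal_variances:
        # Parametric branch: more assumptions, generally more power.
        if monotonic_trend:
            return ["Williams's test"]
        return ["Dunnett's test", "pairwise test with Bonferroni adjustment"]
    # Nonparametric branch: fewer assumptions, generally less power.
    if monotonic_trend:
        return ["Jonckheere-Terpstra test"]
    return ["Steel's many-to-one rank test",
            "Wilcoxon rank sum test with Bonferroni adjustment"]

print(candidate_tests(normal=True, equal_variances=True, monotonic_trend=True))
print(candidate_tests(normal=False, equal_variances=True, monotonic_trend=False))
```

In practice a data transformation would usually be tried before abandoning the parametric branch, a step this sketch leaves out.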

The formal assumptions of and hypotheses tested by these methods differ in important ways. The most important assumption to be met for all is that individuals be randomly assigned to treatments. The results of the hypothesis tests are questionable if this fundamental assumption is not met. The parametric tests further require that the data be normally distributed, although most are robust to moderate violations of this assumption. Normality is often tested with a statistic such as the Shapiro-Wilk statistic (W). A small value of W (or a p value less than a predetermined α level such as 0.05) leads to the rejection of the null hypothesis of normality. Because the statistical power of the test increases with sample size, a higher α level may be applied in tests of normality when sample sizes are small. These methods also require that the treatments have the same variances, that is, homogeneity of variances, although again the methods are robust to moderate deviations from this assumption. This assumption can be formally tested with Bartlett's, Levene's, or one of several similar tests. Caution should be taken with the commonly applied Bartlett's test because it can be inaccurate even if the data deviate only slightly from being normally distributed.
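The assumption checks described above might be run as follows with SciPy. This is a minimal sketch under stated assumptions: the replicate data are hypothetical, and pooling the within-treatment residuals before the Shapiro-Wilk test is one common convention rather than something prescribed by the text.

```python
import numpy as np
from scipy import stats

# HYPOTHETICAL replicate responses (proportion dead) for three treatments.
groups = {
    "0%": [0.05, 0.10, 0.00, 0.05],
    "10%": [0.10, 0.05, 0.15, 0.10],
    "18%": [0.30, 0.40, 0.35, 0.25],
}
samples = [np.asarray(v, dtype=float) for v in groups.values()]

# Normality: Shapiro-Wilk on residuals pooled across treatments (assumed convention).
residuals = np.concatenate([s - s.mean() for s in samples])
w_stat, p_normal = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test (median-centered) and Bartlett's test.
lev_stat, p_levene = stats.levene(*samples, center="median")
bart_stat, p_bartlett = stats.bartlett(*samples)

alpha = 0.05
parametric_ok = bool(p_normal >= alpha and p_levene >= alpha)
print(f"W={w_stat:.3f} p={p_normal:.3f}; Levene p={p_levene:.3f}; "
      f"Bartlett p={p_bartlett:.3f}; parametric_ok={parametric_ok}")
```

Levene's result is used for the decision here because, as noted above, Bartlett's test is sensitive to departures from normality; Bartlett's p value is printed only for comparison.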

The common parametric tests differ slightly in the exact hypothesis they test. The null hypothesis assessed by the t-test with Bonferroni adjustment of experiment-wise error rates and by Dunnett's test is simply that the mean response of each treatment does not differ from the mean of the reference (control) treatment. Williams's test carries an additional assumption of a monotonic trend (consistently increasing or decreasing effect) with dose/concentration. It tests the null hypothesis that there is no monotonic trend.
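The Bonferroni adjustment mentioned above divides the experiment-wise error rate among the treatment-versus-control comparisons. A minimal sketch follows; the function name `bonferroni_flags` is hypothetical and the p values are invented for illustration.

```python
def bonferroni_flags(p_values, experimentwise_alpha=0.05):
    """Flag comparisons that remain significant after Bonferroni adjustment.

    p_values: one p value per treatment-versus-control comparison.
    Returns the per-comparison alpha and a list of booleans.
    """
    k = len(p_values)
    per_comparison_alpha = experimentwise_alpha / k
    return per_comparison_alpha, [p < per_comparison_alpha for p in p_values]

# HYPOTHETICAL p values for four treatments compared with the control.
p_vals = [0.20, 0.011, 0.004, 0.0005]
alpha_pc, flags = bonferroni_flags(p_vals)
print(alpha_pc, flags)
```

With four comparisons the per-comparison α is 0.05/4 = 0.0125, so a p value of 0.011 is still flagged while one of 0.02 would not be.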

The nonparametric methods do not require data normality or homogeneity of variances. With the Wilcoxon rank sum test with Bonferroni adjustment of experiment-wise error rates or Steel's many-to-one rank sum test, the null hypothesis is that observations in the treatments come from the same population. The Jonckheere-Terpstra test is the nonparametric equivalent of Williams's test in that it has an alternative hypothesis of a monotonic trend. Formally, the null hypothesis for this test is that there is no difference in the distribution of responses among the treatments.
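To make the Jonckheere-Terpstra idea concrete, the statistic can be computed by counting, over all pairs of treatments ordered by concentration, how often an observation in the higher treatment exceeds one in the lower treatment. The sketch below uses the large-sample normal approximation without a tie correction in the variance, which is a simplifying assumption; the function name and the data are hypothetical.

```python
import math

def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra statistic for an increasing trend.

    groups: list of samples ordered by increasing concentration.
    Returns (J, z, one-sided p) using the normal approximation
    with no tie correction in the variance (an assumption).
    """
    j = 0.0
    for a in range(len(groups) - 1):
        for b in range(a + 1, len(groups)):
            for x in groups[a]:
                for y in groups[b]:
                    if x < y:
                        j += 1.0   # pair consistent with an increasing trend
                    elif x == y:
                        j += 0.5   # ties count as half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(m * m for m in sizes)) / 4.0
    var = (n * n * (2 * n + 3) - sum(m * m * (2 * m + 3) for m in sizes)) / 72.0
    z = (j - mean) / math.sqrt(var)
    p_increasing = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail p value
    return j, z, p_increasing

# HYPOTHETICAL, perfectly ordered responses in three treatments.
j_stat, z_stat, p_val = jonckheere_terpstra([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(j_stat, round(z_stat, 2), round(p_val, 4))
```

For samples as small as those typical of toxicity tests, an exact or permutation-based p value would generally be preferable to the normal approximation used here.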

The mysid shrimp data can be used again to illustrate the hypothesis testing method (Figure 4), although normally more replicates would be recommended. After testing for normality and homogeneity of variance, the data without transformation are tested with Dunnett's one-tailed t-test, with the null hypothesis being that the mean response of each treatment is not significantly higher than the control mean (experiment-wise α = 0.05). The results show that 10% effluent is the highest concentration whose response is not significantly higher than the control, and 18% effluent is the lowest concentration whose response is significantly higher than the control; the NOEC and LOEC were therefore determined to be 10% and 18%, respectively. Accordingly, the MATC could be estimated as the geometric mean of the NOEC and LOEC, or 13.4%. If the log normal (probit) model generated previously had been used to estimate the proportion dead at the NOEC level (10% effluent), the prediction would be 8.0% mortality at the NOEC. The results generally agree between the regression and hypothesis testing approaches, although this is not always the case.

One shortcoming of applying these various methods to produce NOEC and LOEC values has already been discussed in the preceding paragraphs: the NOEC and LOEC values can vary for the same data set as a function of the chosen hypothesis test. Also, the NOEC and LOEC values depend heavily on the experimental design and on statistical aspects of the calculations that influence statistical power. The power of any test will depend on the number of observations per treatment, the number of treatments, and the variability in the background response. Literature surveys have demonstrated that the designs normally applied in effects testing have sufficient power to detect an approximately 5-10% effect difference in mammalian toxicology studies and a 10-34% effect difference in ecotoxicology studies. But effects smaller than these levels can have unacceptable consequences. A final shortcoming is that these hypothesis testing methods were not initially designed to infer a biological threshold concentration or dose. A threshold estimated from a test of statistically significant difference is not necessarily a good estimate of a significant biological effect threshold.
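The dependence of power on replication and variability noted above can be illustrated with a deliberately simplified calculation. The sketch assumes a single one-sided comparison of a treatment mean with the control mean, a known common standard deviation, and a normal approximation; it ignores the multiplicity adjustments of Dunnett's or Bonferroni-type procedures. The function name and the numbers are hypothetical.

```python
from statistics import NormalDist

def approx_power(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a one-sided two-sample z-test.

    delta: true difference between treatment and control means.
    sigma: common standard deviation of replicate responses (assumed known).
    n_per_group: number of replicates in each of the two groups.
    """
    z = NormalDist()
    se = sigma * (2.0 / n_per_group) ** 0.5   # standard error of the difference
    return z.cdf(delta / se - z.inv_cdf(1.0 - alpha))

# HYPOTHETICAL scenario: a 10-percentage-point effect with sigma = 0.10.
for n in (3, 5, 10, 20):
    print(n, round(approx_power(delta=0.10, sigma=0.10, n_per_group=n), 2))
```

Under these assumptions, power rises with the number of replicates and falls as the background variability increases, which is why the smallest detectable effect, and hence the NOEC, is partly a property of the design rather than of the toxicant.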
