## Modelling and sampling design

The work has focused on assessment of statistical power when using the statistical test procedures, sets of genetic markers, and sample sizes applied within the HERGEN project. To provide generality, however, we have also examined power for a wider set of tests, genetic markers, numbers of loci, sample sizes, and levels of genetic divergence.

Our analysis has followed three major lines, i.e. i) assessing the magnitude of actual power in some empirical and basic hypothetical settings, ii) comparing the efficiency of a few statistical approaches for testing for heterogeneity, and iii) evaluating in greater detail the phenomenon of a reduced power when combining multiple exact tests by means of Fisher's method, particularly in relation to chi-square.

The major observations may be summarized as follows. First, and as expected, power generally increases with the level of differentiation (FST), sample size, number of samples, and the number of loci and alleles, and even allele frequency distributions yield higher power than skewed ones. Second, regardless of the statistical method employed, the power for detecting divergence may be substantial for frequently used sample sizes (say 50-100 individuals) and sets of genetic markers (say 5-20 allozyme or microsatellite loci), even at quite low values of FST (e.g. FST<0.01). For the herring microsatellites, for example, the probability of obtaining a significant result when sampling 50 fish from each of five populations approaches 100% for an FST of about 0.003, and the corresponding probability is around 50% for an FST as small as 0.001. Similarly, power may also be high for the less polymorphic allozymes when a reasonably large number of loci is scored (Figs. 1 and 2). Indeed, in many studies using large numbers of highly polymorphic markers, the differences that are likely to be detected are so small that it may be questionable whether they should be considered biologically meaningful for the issue at hand.

Third, we show that in some situations the choice of statistical method may be critical for detecting an existing genetic structure. Specifically, with a restricted number of alleles per locus, summation of chi-square may outperform Fisher's exact and permutation tests when the single-locus P-values from these tests are combined by Fisher's method. This distinction between methods is most pronounced at skewed allele frequencies and few samples, and sometimes the already poor performance of the latter methods may decrease further with an increasing number of loci. Single or combined P-values from Fisher's exact test never display elevated α errors, though, and with large numbers of infrequent alleles the power of this approach may be higher than for chi-square.
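
To make the combination step concrete, the following minimal sketch implements Fisher's method using only the Python standard library. The function name and the P-values are illustrative assumptions, not results from the study. It shows one plausible way in which adding loci can lower the combined significance: each locus whose exact test returns P = 1 (which can happen with sparse tables) adds two degrees of freedom without adding anything to the test statistic.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) is referred to a chi-square
    distribution with 2k degrees of freedom (k = number of P-values).
    For even degrees of freedom the upper tail has a closed form:
    P(X > x) = exp(-x/2) * sum_{j<k} (x/2)^j / j!
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # X / 2; requires every p > 0
    tail_terms = (half_x ** j / math.factorial(j) for j in range(k))
    return math.exp(-half_x) * sum(tail_terms)

# Hypothetical P-values: one locus alone, then the same locus
# combined with four loci whose exact tests each returned P = 1.
one_locus = fisher_combined_p([0.04])
five_loci = fisher_combined_p([0.04, 1.0, 1.0, 1.0, 1.0])
print(f"one locus: {one_locus:.3f}, five loci: {five_loci:.3f}")
```

With a single P-value the method returns that P-value unchanged, whereas the five-locus combination is far from significant even though one locus is significant on its own.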

Finally, an interesting observation concerns the good performance and apparent robustness of the traditional chi-square test in situations where it is generally considered to be unreliable. At the other end of the spectrum, the G-test (Sokal & Rohlf 1981) frequently tends to yield an unduly high proportion of false significances at loci segregating for low-frequency alleles.
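
As a rough illustration of how the two statistics can diverge on a sparse table, the sketch below computes Pearson's chi-square and the G statistic for the same set of allele counts. The counts are hypothetical and the function name is an assumption made for this example. A single table cannot demonstrate error rates, so this only shows the arithmetic behind the comparison.

```python
import math

def pearson_and_g(table):
    """Return (Pearson chi-square, G) for a table of counts.

    Rows are samples and columns are alleles. Assumes every row and
    column total is positive, so all expected counts are non-zero.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = 0.0
    g = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            x2 += (obs - expected) ** 2 / expected
            if obs > 0:  # empty cells contribute nothing to G
                g += 2.0 * obs * math.log(obs / expected)
    return x2, g

# Hypothetical counts: two samples of 100 genes, one common allele
# and two rare alleles that each occur in only one sample.
x2, g = pearson_and_g([[95, 5, 0], [97, 0, 3]])
print(f"Pearson chi-square = {x2:.2f}, G = {g:.2f}")
```

For this table G is noticeably larger than Pearson's chi-square, although both would be referred to the same chi-square distribution with two degrees of freedom.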

On the basis of the present observations we recommend the following:

1. Assess power and α before launching a study. This can be done either through simulation, or more crudely through comparisons with estimates from similar investigations. For example, investigators should seriously consider whether the sample sizes and numbers of loci and alleles commonly used are really necessary for the particular question at hand.

2. When only considering a single locus (e.g. mtDNA) or contingency table, Fisher's exact test should be applied (rather than an approximation such as chi-square).

3. The G-test of Sokal and Rohlf (1981) should be avoided because of its tendency to produce excessive rates of false significances.

4. Be aware that combining exact P-values from multiple contingency tables (loci) by means of Fisher's method may under some circumstances result in very low power. The risk of this phenomenon is most pronounced in pair-wise sample comparisons when dealing with di-allelic loci with skewed allele frequencies.

5. When exact tests combined with Fisher's method tend to yield unduly low power, traditional chi-square may constitute a good alternative. Chi-square seems to be more robust than commonly appreciated. The risk of an inflated α error should still be considered, though, particularly when comparing small samples (say n<20-30) with larger ones at loci with skewed allele frequencies.

6. In cases with several multi-allelic loci (e.g. 10 or more alleles), using Fisher's method to combine P-values obtained by Fisher's exact test seems to constitute the method of choice, because of a generally high power and an apparent tendency not to exceed the intended α level.
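
The first recommendation, assessing power and α by simulation, could be approached along the lines of the sketch below. It is a deliberately simplified stand-in and not the simulation procedure used in this work: it assumes one di-allelic locus, two samples of diploid individuals, fixed population allele frequencies chosen by hand (rather than derived from a target FST), and an uncorrected Pearson chi-square test at the 5% level. All names and parameter values are illustrative.

```python
import random

CHI2_CRIT_1DF = 3.841  # 5% critical value of chi-square with 1 df

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if min(r1, r2, c1, c2) == 0:
        return 0.0  # degenerate table, e.g. an allele absent from both samples
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

def rejection_rate(p1, p2, n_ind, n_reps=2000, seed=1):
    """Proportion of simulated sample pairs giving a significant chi-square.

    p1, p2: frequency of one allele in populations 1 and 2 (assumed values).
    n_ind:  diploid individuals sampled per population (2 * n_ind genes).
    """
    rng = random.Random(seed)
    n_genes = 2 * n_ind
    hits = 0
    for _ in range(n_reps):
        a = sum(rng.random() < p1 for _ in range(n_genes))
        c = sum(rng.random() < p2 for _ in range(n_genes))
        if chi2_2x2(a, n_genes - a, c, n_genes - c) > CHI2_CRIT_1DF:
            hits += 1
    return hits / n_reps

# Equal frequencies estimate the realized alpha error; unequal ones estimate power.
alpha_hat = rejection_rate(0.10, 0.10, n_ind=50)
power_hat = rejection_rate(0.10, 0.20, n_ind=50)
print(f"estimated alpha = {alpha_hat:.3f}, estimated power = {power_hat:.3f}")
```

Running the same function with identical allele frequencies in both populations gives an estimate of the realized α error, so the two quantities in recommendation 1 can be checked with one piece of code. A realistic assessment would extend this to several loci, more alleles, and the test and combination method actually planned for the study.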
