In general, a Bonferroni-corrected threshold is conservative in the sense that it leads to an actual significance level smaller than the nominal significance level of 5%. Note that the Bonferroni inequality does not use any information regarding the covariances between the different quantities Zs and Zt. It is based solely on the marginal distribution of the individual variables. This is a major advantage, since it leads to a very simple formula. However, this simplicity is a double-edged sword because the covariances between the variables may play an important role in the determination of the probability, and ignoring them may lead in some cases to a poor approximation. Indeed, the denser the collection of markers is, the larger the covariance between the markers. And, as we can see numerically, these larger covariances lead to a more conservative inequality.

Let us turn next to the construction of better, albeit more complex, approximations that do take into account the correlation between markers. We will write the formulas for these approximations in the context of a stationary Gaussian process with standardized marginal distributions. A process is called stationary if its distribution does not vary in time. In particular, a gaussian process is stationary if the mean function is constant, and if the correlation between two components is only a function of the time lag between them. Note, in particular, that the Ornstein-Uhlenbeck is a stationary process.

The approximation is determined by several process-specific parameters. One parameter is the total genetic length of that part of the genome spanned by markers, denoted here by L. In the context of the mouse genome its value is about 1600 cM, but it may be smaller if there are parts of the genome devoid of markers, for example, near the ends of chromosomes. The second parameter is the number of independent copies of the basic process involved. This number is denoted here by C. In our case this number relates to the number of chromosomes, which we take to be 20. (Since the sex chromosomes behave differently from the autosomes, it would also make sense to take C = 19. The practical impact of this difference would be negligible.) The effect of the correlation is introduced with the parameter 3, which is the negative of the partial derivative of the correlation function at the origin:

In particular, for the Ornstein-Uhlenbeck process the covariance is given by cov(0, s) = exp{—¡s}, for s > 0. Consequently, the parameter 3 defined here coincides with the rate 3 in the exponent of the covariance function. In the backcross design this parameter takes the value 0.02. The next parameter is the distance between consecutive components of the process. The letter A will be used for this parameter. In genome scans this A corresponds to the spacing between markers. When a marker is placed each 10 cM, the parameter takes the value 10. For a two-sided threshold of z and a genome consisting of C chromosomes of total genetic length L, the basic approximation takes the form:

Pr(max IZ^| > z) « 1 — exp { — 2C[1 — $(z)] — 23Lzj>(z)v(z{23A}12)}. 1 (5.3)

Three functions appear in (5.3). The first two functions, $(•) and are the cumulative distribution and the density functions, respectively, of the standard normal distribution. The third function v(■) appears quite frequently in renewal theory - the probability theory that provides many of the tools for the investigation of the probability that a random process exceeds a high threshold. The function v(-) can be expressed, in terms of the standard normal cumulative distribution function, as the infinite sum:

For numerical purposes, it is sufficient to use the approximation:

( (2/y)($(y/2) — 0.5) (y) ~ (y/2)$(y/2) + ¿(y/2) .

The function Nu produces the given approximation of the function v(■):

The function "OU.approx" applies approximation (5.3) for the probability that the absolute value of the Ornstein-Uhlenbeck process exceeds a threshold z.

+ length=1600,chr=20,center=0,test="two-sided") + {

+ d <- switch(test,"two-sided"=2,"one-sided"=1) + p <- 1-exp(-d*chr*(1-pnorm(z))

+ -d*beta*length*z*dnorm(z)*Nu(z*sqrt(2*beta*Delta)))

Note the argument "test". This parameter determines the type of tail computation. The default is set to a two-sided test, which is the type we will be using in the context of experimental genetics. An alternative is to use a one-sided test, which involves only extreme positive values of the process. The function "switch" is used in order to set the value of "d" to be one or two depending on the value of test. Like in the Bonferroni case, a threshold can be set by solving (5.3) with the target genome-wide significance level at the left-hand side of the equation. A numerical solution can be found by trial and error, or more systematically by application of a root-solver algorithm.

The function "uniroot" searches within an interval for a root of a given function with respect to its first argument. If you inquire the help file of the function, you will note that its last argument takes the form "... ". This format allows for the introduction of arguments for functions called by the given function. Hence, in particular, we can set values for the other arguments of the function "OU.approx" as it is being called by the root-finder function:

+ beta=0.02,Delta=80,length=0,center=0.05)$root

+ beta=0.02,Delta=40,length=20*(60-20),center=0.05)$root

+ beta=0.02,Delta=25,length=20*(65-15),center=0.05)$root

+ beta=0.02,Delta=20,center=0.05)$root

+ beta=0.02,Delta=10,center=0.05)$root

+ beta=0.02,Delta=5,center=0.05)$root

+ beta=0.02,Delta=1,center=0.05)$root

> app cM80 cM40 cM25 cM20 cM10 cM5 cM1 3.01559 3.22481 3.32749 3.45765 3.56272 3.64895 3.78015

Observe that for a single marker per chromosome the mathematical approximation and the Bonferroni approximation are simple functions of each other. In the other two examples where markers are not located at the extreme ends of the chromosomes (A = 40 and A = 25), the effective length of each chromosome is less than 80 cM, and hence the effective length of the genome is less than 1600 cM. In fact, the effective length of each chromosome is the distance between the first and the last markers. These chromosome-specific effective lengths are summed in order to get an effective length for the entire genome, which is the value of L used in the approximation.

The actual significance levels for the approximate thresholds can be obtained from the simulation results which are stored in the matrix "Z.max":

> for(i in 1:length(app)) <+ c(,mean(Z.max[,i] >= app[i]))

cM80 cM40 cM25 cM20 cM10 cM5 cM1 p.bon 0.0461 0.0468 0.0468 0.0419 0.0373 0.0286 0.0104 0.0478 0.0472 0.0495 0.0464 0.0476 0.0483 0.0470

Observe that the actual probabilities for the approximation are not too far from the target significant level of 0.05. Since the correlation between tests is handled properly, we no longer obtain an overly conservative threshold for small values of A.

A simpler approximation to the significance level is obtained when one considers the continuous Ornstein-Uhlenbeck process. In this case A = 0, thus v = 1, and the formula becomes:

Pr( max Zt\ ~ 1 " exp{ - 2 C[1 - $(z)j - 2/3Lz0(z)} . (5.4)

Similarly to the Bonferroni inequality, this computation also provides an upper bound on the true significance level. However, unlike the Bonferroni upper bound, which is tighter when markers are sparser, this upper bound is tighter when the markers are denser.

5.4 Other Methods

There are other methods for approximating the genome-wide significance level. Whereas the simulations discussed above involved an approximating Ornstein-Uhlenbeck process, the most direct method is to simulate the regression statistics themselves. This requires that we simulate the phenotypes according to some hypothetical distribution and the marker genotypes, which under the hypothesis of no linkage are independent of the phenotypes. In principle this should provide a more accurate answer than the simulation of the Ornstein-Uhlenbeck process, since the Ornstein-Uhlenbeck process depends for its justification on the central limit theorem, which may or may not provide an adequate approximation in a particular case. This more direct simulation has two minor disadvantages: (i) it requires more programming and more computing time, and (ii) it requires that we choose a particular distribution for simulating the y's, and to some degree the results will depend on the distribution chosen. (Of course, since the Ornstein-Uhlenbeck process does not involve the specific distribution of the y's, its justification is in effect a mathematical proof that at least in large samples dependence on the distribution of the y's is negligible.)

A conceptually different approach is to regard the phenotypes as given and study the conditional behavior of the process Zt as a function of the process of marker genotypes. In this case we need make no assumptions about the distribution of the phenotypes. To illustrate this approach we re-define Zt slightly by

0 0

Post a comment