Candidate Variables

STUDY ENDPOINT

We assume that a prognostic factor study is in design and, just as for any other study, the key endpoint has to be established. For example, in many circumstances this will be the time either to the resolution of the disease (cure) or, as would be the case in prognostic studies in patients with advanced cancer, the survival time of the patient. In this latter case, the survival time might be calculated from the date of diagnosis to the date of death. The outcome for a group of such patients with survival times is summarised using the Kaplan-Meier estimate of the corresponding survival curve. One such example has been given in Figure 9.3 which shows the survival curves of patients with hepatocellular carcinoma treated with three doses of tamoxifen as reported by Chow, Tai, Tan et al. (2002).
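As an illustration of how such a summary is obtained, the following is a minimal sketch of the Kaplan-Meier product-limit calculation. The function name `kaplan_meier` and the survival times are hypothetical and chosen only for illustration; in practice a standard statistical package would be used.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time at which a death occurs.

    times  : follow-up time for each patient
    events : 1 if the death was observed, 0 if the time is censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        # Deaths observed exactly at time t.
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Hypothetical survival times in months; an event value of 0 marks a censored time.
times = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 1, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t = {t:>2}  S(t) = {s:.3f}")
```

Censored patients contribute to the numbers at risk up to their censoring time but never to the count of deaths, which is why the estimated curve only steps down at observed death times.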

Although we will use an illustrative example involving survival time, for which the Cox proportional hazards model is appropriate, other outcome measures require different models. Thus binary outcomes would be modelled via logistic regression and continuous outcomes via multiple (least squares) regression, all of which are available in standard statistical computer packages.

The univariate Cox proportional hazards regression model for a single potential prognostic variable, x, is

$$h = \exp(bx), \tag{11.1}$$

where h represents the risk and b is the corresponding regression coefficient, to be estimated from the survival times of the patients, each with an associated value of x. In the simplest case, x is a binary variable, for example, taking the value 0 for males and 1 for females. In this case, if the estimate of b in equation (11.1) turns out to be zero, then h = exp(0) = 1. This implies that, whatever the value of x, h = 1 and the risk for males and females is the same. Thus gender is not prognostic for outcome. On the other hand, if the estimate is 2, say, then when x = 0, $h_{Male} = \exp(0) = 1$ but when x = 1, $h_{Female} = \exp(2)$, implying a greater risk for the females. In general, the associated hazard ratio is $HR = h_{Female}/h_{Male} = \exp(b \times 1)/\exp(b \times 0) = \exp(b)$. Since log HR = b, the regression coefficient itself is often termed the log hazard ratio.

For v potential prognostic variables, $x_1, x_2, x_3, \ldots, x_v$, the Cox model takes the multivariable form

$$h = \exp(b_1 x_1 + b_2 x_2 + b_3 x_3 + \cdots + b_v x_v), \tag{11.2}$$

where b1, b2, b3, ..., bv are the corresponding regression coefficients to be estimated in the modelling process.
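The arithmetic of equations (11.1) and (11.2) can be sketched in a few lines. The function names `relative_hazard` and `hazard_ratio` are hypothetical, and the coefficient value of 2 simply repeats the illustrative value used for gender above; it is not an estimate from any data set.

```python
import math


def relative_hazard(coefficients, values):
    """h = exp(b1*x1 + ... + bv*xv); with one variable this is equation (11.1)."""
    return math.exp(sum(b * x for b, x in zip(coefficients, values)))


def hazard_ratio(coefficients, values_a, values_b):
    """Ratio of the risk for patient A to that for patient B."""
    return relative_hazard(coefficients, values_a) / relative_hazard(coefficients, values_b)


# Binary variable x: 0 = male, 1 = female, with illustrative coefficient b = 2.
b = [2.0]
hr = hazard_ratio(b, [1], [0])
print(f"HR = {hr:.3f}, log HR = {math.log(hr):.3f}")
```

Taking the logarithm of the hazard ratio for a binary variable recovers the regression coefficient, which is why the coefficient is termed the log hazard ratio.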

The basic structure of a prognostic factor study is to record, for each of the N patients recruited, their basic characteristics at the time of diagnosis of their disease and their ultimate survival time, t. As we noted in Chapter 3, for survival time studies these times may be censored for some subjects, in which case t+ is recorded. In very simple terms, once the regression model is fitted to these data, those variables (at most v in number) for which the null hypothesis that the corresponding regression coefficient b is zero is rejected are retained in the model and are termed prognostic. The remaining variables, whose regression coefficients are 'not statistically different' from zero, are removed from the model and are considered as not prognostic for outcome.

IDENTIFYING THE SUBJECTS

Before establishing which individual patients are to be included in the prognostic factor investigation the basic patient population of interest has to be defined. So, just as one would do for any clinical study, clear eligibility criteria have to be established. Such eligibility requirements will usually include the particular diagnoses of interest as well as precise details of how these diagnoses are established. Further, it may be necessary to restrict the patients so selected to those that will receive a particular form of therapy. In some situations, this restriction may be made to ensure a relatively homogeneous set of patients so that the potential prognostic indicators are not obscured by varying degrees of efficacy of the (possibly) uncontrolled choice of therapies that may have been given to such patients.

Clearly if the patients used for a prognostic study are those recruited to a randomised trial, then differences between patients (due to treatment received) can be accommodated in the prognostic modelling process in a systematic way. The choice of subjects may also (and should) be influenced by the quality of data that can be collected for the purposes of the study. Once again if these data come from a randomised trial then one may be reassured more easily that the data are well documented than for a study that involves extracting data from patient case notes which are designed principally for other purposes. Without good quality data, the conclusions drawn from studies of prognosis must be regarded as uncertain.

USE OF THE INDEX

One should also give some thought as to how the prognostic factors once established are to be used. If the purpose is purely scientific, then the variables considered can be very esoteric in nature (perhaps determined by very complex assays). On the other hand, if the index is to be used to guide advice that will be given to patients in the clinic then the variables for use are best established easily with minimal sophisticated (laboratory-type) measures involved.

CHOOSING THE VARIABLES

Apart from the endpoint measure itself, it is also important to determine which variables are to be the candidate variables for the prognostic factor investigation.

Help with the choice of variables to study (or not) should be obtained by reviewing the literature for variables that have been investigated in previous studies. It is clear that those that have been shown to have major prognostic influence should also be included in the planned study. We will call these 'Level-In' variables. A decision then has to be made as to which of the other variables so examined may be still unproven, 'Level-Query', and those that have already been found conclusively not to be useful, 'Level-Out'.

Of course, the purpose of the current study may be to investigate entirely new variables, 'Level-New'. For this latter category it will be advisable to consider aspects of the Bradford-Hill criteria of Figure 1.1 with respect to ultimate causality. At this stage one also has to determine whether the objective of investigating 'Level-New' variables is to replace the 'Level-In' variables or rather to ask if their added inclusion 'enhances' the ability to distinguish more clearly prognostic groups. The approach to modelling is different in these two situations.

Considerable thought at the planning stage needs to be focused on the selection of the variables and the associated Levels. There is a tendency to assign very few to 'Level-Out' for fear of 'missing something important'.

Case study - choice of candidate variables - inoperable hepatocellular carcinoma

It is well known that AFP is indeed predictive in this disease and so it was regarded as a 'Level-In' variable for the modelling. The remaining variables were all 'Level-Query' as, although most had been investigated before and some not found to be very predictive, they had usually been considered together with variables that were strongly predictive but which were not candidate variables for this study.

Example - choice of candidate variables - node-positive breast cancer

Although not using our categorisation, Sauerbrei, Royston, Bojar et al. (1999) imply for non-metastatic breast cancer that nodal status is the only Level-In variable associated with prognosis. In contrast Level-Query is attached to tumour size, tumour grade, histological type, oestrogen (ER) and progesterone receptor (PR), menopausal status and age despite many investigations of their respective roles. They also point out that more than 100 Level-New factors have been proposed at various times.

SCREENING THE VARIABLES

However, before going to the stage of fitting the chosen model of the form (11.2), it is often desirable to do some (often informal) preliminary screening of the variables. One such screen, for those candidate variables that are continuous in nature, is to calculate the correlation matrix of all the corresponding pairwise correlation coefficients. Should this matrix contain some large correlation coefficients then this may indicate that only one of the corresponding pair of variables contributing to any high value needs to be included in the modelling process.
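A minimal sketch of such a correlation screen follows. The variable names, the data values and the threshold of 0.8 are all assumptions made for illustration; the text does not specify what counts as a 'large' correlation, and the choice should depend on circumstance.

```python
import math
from itertools import combinations


def pearson(u, w):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_w = sum(u) / n, sum(w) / n
    s_uw = sum((a - mean_u) * (b - mean_w) for a, b in zip(u, w))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_ww = sum((b - mean_w) ** 2 for b in w)
    return s_uw / math.sqrt(s_uu * s_ww)


def highly_correlated_pairs(data, threshold=0.8):
    """List the pairs of variables whose |r| reaches the (assumed) threshold."""
    flagged = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged


# Hypothetical continuous candidate variables measured on five patients.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [50, 58, 71, 79, 92],
    "age": [60, 35, 50, 70, 40],
}
flagged = highly_correlated_pairs(data)
for a, b, r in flagged:
    print(f"{a} and {b}: r = {r:.3f}")
```

A flagged pair is only a prompt for a decision; which member of the pair to carry forward is a separate question.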

The choice of which to take forward to the modelling can be made in several ways depending on circumstance. These may include the easiest or cheapest of the two measures to obtain from the patients or the variable most often used by previous studies. A common option is to begin by first modelling these variables individually by use of the univariate model equation (11.1). Suppose the variables concerned are $x_1$ and $x_2$, then the models to fit are $h_1 = \exp(b_1 x_1)$ and $h_2 = \exp(b_2 x_2)$. It is then determined which of the estimated regression coefficients ($b_1$ or $b_2$) is the 'most statistically significant' and the associated x then may be the variable that is chosen. Before the final choice, a check is made using the two-variable version of equation (11.2), that is $h_{12} = \exp(b_1 x_1 + b_2 x_2)$, to see whether, if both variables are included, worthwhile extra information is obtained over the single variable chosen.

TRANSFORMATIONS

In the modelling process, it is often easier if a variable has a linear influence on the outcome of concern. If a variable, say x, is continuous then the direct use of equation (11.1) implies that the effect on the risk of death is log-linear, that is, the log HR increases or decreases linearly as the value of the factor increases. This may or may not be the case. This may be checked by fitting the model $h_{quadratic} = \exp(b_1 x + b_2 x^2)$, which is algebraically quadratic in x, with $b_1$ and $b_2$ estimated from the data. If a formal test of the null hypothesis, $b_2 = 0$, is not rejected then x is assumed to act linearly, since this implies $b_2 x^2 = 0$ whatever the value of x. Otherwise a more detailed examination of the relationship implied by changing values of x will need to be instituted.

If linearity is not the case, then creation of categories to reflect the shape of the relationship is recommended in preference to attempting to describe the precise detail of the non-linear relationship, although in certain circumstances a transformation of the basic variable may achieve the desired linearity. Common transformations of the basic variable, x, are $\log x$ and $\sqrt{x}$. Complex transformations are best avoided.

Example - investigating linearity - colorectal cancer

Chung, Eu, Machin et al. (1998) investigated whether young age was an adverse prognostic factor for survival in patients with colorectal cancer. In general colorectal death rates increase with age (as is the case for many cancers) and so if young age (≤ 39 years) is indeed indicative of a worse prognosis, then the age-specific death rates would be U-shaped when plotted against decade. This was indeed the case, with those aged 40-59 of lowest risk whilst those of the ≤ 39 age-group were at a similar risk to those 60-79 or some 30 years older.

As a consequence, for age an unordered categorical variable was created of the age decades for the modelling process - this despite the fact that the underlying variable was continuous and hence the successive categories had a natural order. Had ordered categories been used in the modelling process, then this is equivalent to coding them as 0, 1, 2, etc. that is then treated as numerical data. If the model is then fitted with this variable, then it takes a linear form which, in this situation, is not appropriate. Using the unordered categories allows the shape of the underlying relationship to be examined without imposing an algebraic form such as the quadratic referred to previously.
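The difference between the two codings can be sketched as follows. The age bands, the choice of reference category and the function names are hypothetical; they are not the categories used by Chung et al.

```python
# Hypothetical decade bands: (upper age limit, label).
BANDS = [(39, "<=39"), (49, "40-49"), (59, "50-59"), (69, "60-69"), (79, "70-79")]
LABELS = [label for _, label in BANDS] + ["80+"]


def age_band(age):
    """Assign an age in years to its category label."""
    for upper, label in BANDS:
        if age <= upper:
            return label
    return "80+"


def ordered_code(label):
    """Ordered coding 0, 1, 2, ...: one coefficient, forcing a linear trend."""
    return LABELS.index(label)


def indicator_code(label, reference="40-49"):
    """Unordered coding: one 0/1 indicator per non-reference category."""
    return [1 if label == other else 0 for other in LABELS if other != reference]


for age in (35, 45, 72):
    band = age_band(age)
    print(age, band, ordered_code(band), indicator_code(band))
```

With the indicator coding each category receives its own regression coefficient relative to the reference category, so a U-shaped pattern of risk can emerge from the fit rather than being ruled out by the coding.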

One difficulty with categorising continuous variables is the fact that, once created, there is an implicit assumption that there is a step change in risk at a boundary between adjacent categories. This is unlikely to be the case. Sometimes boundaries are convenient choices, for example, decade groups for age. In other circumstances, the choice may be made by investigating a range of options and then choosing that which 'magnifies' the difference between adjacent categories. Such devices can lead to an over-optimistic view of the prognostic variable in question. If a dichotomy is to be chosen, then one method is to take the cut at the median value but there is no guarantee that the risk will divide along these lines. Since the purpose of categorising the variable is to better investigate the 'shape' of the associated risk, a minimum number of three categories is required for this, and a maximum of five would seem reasonable (although if data are plentiful more could be taken).

Case study - transformation of data - inoperable hepatocellular carcinoma

The only laboratory-based variable included was AFP, as it is widely used even in clinical settings that are in other respects less developed. This was categorised into five groups, effectively on a logarithmic scale, as shown in Table 11.1.

MISSING VALUES

It is important that the proportion of data items that are missing or unknown in the data set is minimal. Experience suggests that as the number of variables requested of the clinical teams increases the proportion of 'missing' data also increases. Missing data cause considerable 'biases' to arise in the modelling process and should be avoided if at all possible. Although there are no formal rules attached to an acceptable level of missing data, if more than 20% are missing for a particular variable, then serious consideration should be given to excluding it from the modelling process. If the missing data comprise less than 5%, then the bias introduced may be regarded as minimal. These are only pragmatic suggestions, however, and may have to be varied with circumstance. No useful model can result if a vital piece of information cannot be easily collected.

For those variables for which data are missing, it is useful to create a category of their own. If treated like this in the modelling process, then, if the data are missing at random, this category should behave in a central manner since it will comprise a (random) mixture of the other category levels. Were it to correspond to (say) the highest risk category, then this may indicate that 'missing' is a sign of poor prognosis. Perhaps it is then 'missing' as the patient was too ill for the measure to be recorded. For example, when a patient is an emergency admission, time for less routine assessments may not be available and so they may go unrecorded. In that case the absence of these values may be indicative of a worse outcome and hence the fact of them being 'missing' is prognostic for outcome.
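The pragmatic 5% and 20% guidelines and the device of a separate 'Missing' category can be sketched as follows. The function names, the wording of the returned messages and the data values are hypothetical, and the thresholds are the suggestions given above rather than formal rules.

```python
def missing_proportion(values):
    """Proportion of entries recorded as None (missing or unknown)."""
    return sum(1 for v in values if v is None) / len(values)


def assess_missingness(values, low=0.05, high=0.20):
    """Apply the pragmatic guidelines: under 5% minimal bias, over 20% consider excluding."""
    p = missing_proportion(values)
    if p > high:
        return "consider excluding"
    if p < low:
        return "bias likely minimal"
    return "judge case by case"


def with_missing_category(values, label="Missing"):
    """Recode missing entries as a category of their own."""
    return [label if v is None else v for v in values]


# Hypothetical performance scores for ten patients, two of them unrecorded.
scores = [0, 1, None, 2, 1, 0, None, 1, 2, 0]
print(missing_proportion(scores), assess_missingness(scores))
print(with_missing_category(scores))
```

The recoded variable would then enter the model as an unordered categorical variable, so that the behaviour of the 'Missing' category can be compared with that of the observed levels.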

Case study - missing data - inoperable hepatocellular carcinoma

In Table 11.1 the Zubrod score has a large proportion (19.6%) of missing values. In screening the variable, an unordered categorical variable was first created with four levels (0, 1, 2, Missing) and used in a univariate model of Zubrod score. The sizes of the regression coefficients for categories '2' and 'Missing' were similar. As a consequence, these two categories were merged for the multivariable modelling.

This device is no substitute for the 'real' data values, however, and serious concern must be raised about a variable for which there is a large proportion of missing data.

NUMBER OF VARIABLES

In equation (11.2) the number of variables, v, that can be included is clearly without end, but for every variable added there is at least one further regression coefficient to be estimated. It is easy to imagine that there can be more candidate variables than patients. So a simple rule is to never allow into the model more variables than subjects. If there are more variables than subjects, then the screening process to determine Level-In, Level-Query, and Level-Out must ensure the number is reduced accordingly. It has to be realised that if a g-group categorical variable is included, then this adds g − 1 regression coefficients to the model. Thus it is really the number of regression coefficients, k (≥ v), to be estimated that should, at the very least, be less than N. In fact, for survival-type studies, it is the number of events observed, O, that is critical rather than N itself. Thus a very large study with few events may have the same limit to k as a small study with a proportionately larger number of events. We return to this topic when discussing an appropriate study size.
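The counting of regression coefficients can be sketched as below. The variable names and numbers of categories are hypothetical. The text does not give a precise limit for k in terms of O at this point, so the check k < O used here is only an assumed, minimal reading of the requirement.

```python
def count_coefficients(n_continuous, categorical_groups):
    """k = one coefficient per continuous variable plus g - 1 per g-group variable."""
    return n_continuous + sum(g - 1 for g in categorical_groups.values())


def within_limits(k, n_patients, n_events):
    """Minimal check (an assumption): k below both N and the number of events O."""
    return k < n_patients and k < n_events


# Hypothetical candidates: one continuous variable and three categorical ones.
groups = {"performance_score": 3, "ascites": 3, "marker_group": 5}
k = count_coefficients(1, groups)
print(f"v = {1 + len(groups)} variables give k = {k} coefficients")
print(within_limits(k, n_patients=100, n_events=8))   # many patients, few events
print(within_limits(k, n_patients=100, n_events=60))
```

Four variables here lead to nine coefficients, which illustrates why k rather than v is the quantity to compare against the number of events.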

Although we have talked in general terms about the candidate variables, as we have indicated in Chapter 2, reliable measurement of these is clearly critical to prognostic factor studies also. Thus, for example, Simon and Altman (1994) stress that any laboratory assays should be performed blinded to clinical data and outcome, and that intra- and inter-laboratory reproducibility of assays should be documented.

FITTING THE MODEL

Univariate

As there are usually several, sometimes many, variables as potential candidates for inclusion even after screening, the next step is often to reduce their number by fitting a univariate model for each in turn, and then to take forward only those candidate variables (all of which must be either Level-In or Level-Query) for which the corresponding null hypothesis of b = 0 has been rejected. The variables so retained are then studied in a multivariable regression model.
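The univariate screening step can be sketched as follows for a single variable at a time. This is an illustration under stated assumptions, not the procedure used in the case study: the coefficient of model (11.1) is estimated by Newton-Raphson maximisation of the Cox partial likelihood, the null hypothesis b = 0 is examined with a Wald test, and the critical value of 1.96 (a two-sided 5% level) is our choice since the text does not specify one. All names and data are hypothetical, and a standard statistical package would normally be used.

```python
import math


def score_and_information(b, times, events, x):
    """First derivative and negative second derivative of the log partial likelihood."""
    score, information = 0.0, 0.0
    for t_i, e_i, x_i in zip(times, events, x):
        if e_i != 1:
            continue  # censored times contribute only through the risk sets
        risk = [x_j for t_j, x_j in zip(times, x) if t_j >= t_i]
        w = [math.exp(b * x_j) for x_j in risk]
        s0 = sum(w)
        mean = sum(w_j * x_j for w_j, x_j in zip(w, risk)) / s0
        mean_sq = sum(w_j * x_j ** 2 for w_j, x_j in zip(w, risk)) / s0
        score += x_i - mean
        information += mean_sq - mean ** 2
    return score, information


def fit_univariate_cox(times, events, x, tol=1e-10, max_iter=50):
    """Return (estimate of b, standard error, Wald z). Assumes x varies within risk sets."""
    b = 0.0
    for _ in range(max_iter):
        score, information = score_and_information(b, times, events, x)
        step = score / information
        b += step
        if abs(step) < tol:
            break
    _, information = score_and_information(b, times, events, x)
    se = 1.0 / math.sqrt(information)
    return b, se, b / se


def screen(times, events, candidates, critical_z=1.96):
    """Names of candidates for which the Wald test rejects b = 0."""
    kept = []
    for name, x in candidates.items():
        _, _, z = fit_univariate_cox(times, events, x)
        if abs(z) > critical_z:
            kept.append(name)
    return kept


# Hypothetical data: ten patients, all deaths observed, two binary candidates.
times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
events = [1] * 10
candidates = {
    "factor_a": [1, 1, 0, 1, 0, 1, 0, 0, 1, 0],
    "factor_b": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}
for name, x in candidates.items():
    b, se, z = fit_univariate_cox(times, events, x)
    print(f"{name}: b = {b:.3f}, SE = {se:.3f}, z = {z:.2f}")
kept = screen(times, events, candidates)
```

Tied death times are handled here in the simplest way, with each death compared against everyone still at risk at that time; packages offer more refined treatments.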

Case study - univariate models - inoperable hepatocellular carcinoma

Individual (univariate) Cox regression analysis of all the clinical parameters indicated in Table 11.1 and serum AFP level showed that the major variables influencing survival are Zubrod performance score, presence of ascites and AFP. Thus, the univariate analysis screen had reduced the k from 30 to eight regression coefficients to be estimated: two for Zubrod score, two for ascites and three for log AFP. The corresponding regression coefficients are given in Table 11.2(a). Little prognostic information was provided by age, gender, ethnicity or significant alcohol history. 