Randomised trials: a special case?

Summary

This page debates the question of whether randomised trials should be analysed differently to other comparative trials. Several authors have argued that a randomised trial is a special case of a comparative trial, and should be analysed differently (i.e. by the Fisher-Irwin test). There are counter-arguments to this, and the recommendation here is that no distinction should be made.

Published views

Several authors have concluded that a randomised trial is a special case of a comparative trial, and that there are special reasons why a 2 x 2 table that has been generated in a randomised trial should be analysed by a Fisher-Irwin test.

For example, E. Pearson (1947) discussed the results from two treatments after these had been randomly assigned to a group of N (= m + n) individuals, the first treatment having been applied to m and the second to n of the N individuals: 'In this case the random process has been applied within the group of N individuals and its repetition would simply involve other random reassignments of the two treatments among the N'. Pearson put this forward as an example of a table where both sets of marginal totals are fixed, and he advocated that it should be analysed by the Fisher-Irwin test, unlike other comparative trials, for which he recommended the 'N - 1' chi squared test.

Barnard (1979) considered that, where subjects are randomly allocated to one of two treatments, A or B, and the hypothesis to be tested is that treatment A and treatment B are equivalent, those who are going to be cured by treatment A would be the same as those who would be cured by having treatment B: 'The reference set involved in the test of significance would be that generated by the randomisation procedure in which the total ... persons cured are redistributed, as a result of the randomisation, amongst those having the two kinds of treatment. It is easy to see that in such a case the totals receiving the two treatments cured and not cured should both be regarded as fixed and the exact test should be applied.'

Similarly, Yates (1984) discussed a trial of a new inoculation technique in its ability to reduce the risk of contracting some infectious disease. A group of N individuals are studied and are randomly assigned to a group of m who are inoculated, and n who are not, so the values of m and n are fixed by the experiment. Yates argued that if none of the N individuals were inoculated, a given number (unknown to the experimenter) would be fated to contract the disease: 'If inoculation has no effect, this will not be changed by the experiment. Therefore, the [r and s marginal totals are] also determined. The statistical problem ... is the evaluation of the probability that the observed or a greater apparent effect can be attributed to chance causes resulting from the random assignment of the inoculation treatment.'

Competing paradigms

These authors are putting forward what will be termed here paradigm 1. According to this, using the example of a randomised clinical trial,
• the row totals are fixed as equal to the sample sizes specified for the two treatments that are being compared,
• the null hypothesis is of no treatment difference for each patient in the study (i.e. each patient in the study would have the same response regardless of the treatment assigned to that patient by randomisation), so the patients with favourable response would be the same for all possible randomisations, and so the column totals are also fixed,
• the population being sampled is the set of patients recruited to the study, and the samples taken are random samples from these,
• the relevant distribution is the hypergeometric, and the appropriate statistical test is the Fisher-Irwin.
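
As a minimal sketch of what paradigm 1 implies in practice, the following Python code computes a two-sided Fisher-Irwin P value from the hypergeometric distribution for a table [[a, b], [c, d]], where rows are treatments and columns are responders and non-responders. The function names are illustrative, and the two-sided rule used here (summing the probabilities of all tables no more likely than the one observed) is an assumption, being only one of several conventions in use.

```python
from math import comb


def hypergeom_pmf(k, m, n, r):
    """P(k responders in group 1) when the group sizes m, n and the
    total number of responders r are all regarded as fixed."""
    return comb(m, k) * comb(n, r - k) / comb(m + n, r)


def fisher_irwin_two_sided(a, b, c, d):
    """Two-sided Fisher-Irwin P value for the 2 x 2 table [[a, b], [c, d]].

    Assumption: 'two-sided' means summing the probabilities of every
    table that is no more likely than the observed table.
    """
    m, n, r = a + b, c + d, a + c
    p_obs = hypergeom_pmf(a, m, n, r)
    lo, hi = max(0, r - n), min(m, r)  # feasible range for the top-left cell
    probs = [hypergeom_pmf(k, m, n, r) for k in range(lo, hi + 1)]
    # Small tolerance so that tables tied with the observed one are included.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Hypothetical data: 8/10 respond on one treatment, 3/10 on the other.
    print(fisher_irwin_two_sided(8, 2, 3, 7))
```

Note that the only randomness in this calculation comes from reallocating the same recruited patients between the two treatments, which is exactly the reference set described by the authors quoted above.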

This is in contrast to paradigm 2 where:
• the row totals are fixed as equal to the sample sizes specified for the two treatments that are being compared (as paradigm 1),
• the null hypothesis is of no overall treatment difference,
• the population being sampled is the set of patients with a condition similar to the patients recruited, who may present at other institutions and other times,
• the relevant distribution is the binomial, and the appropriate statistical test is the 'N - 1' chi squared.
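
For comparison, the 'N - 1' chi squared statistic can be sketched as follows. This assumes the usual form of the statistic, (ad - bc)^2 (N - 1) / (m n r s), where m and n are the row totals and r and s the column totals, referred to the chi squared distribution with one degree of freedom. The function name and the handling of tables with an empty margin are illustrative choices, not part of the original discussion.

```python
from math import erfc, sqrt


def n_minus_1_chi_squared(a, b, c, d):
    """Return (statistic, two-sided P value) for the table [[a, b], [c, d]]."""
    m, n = a + b, c + d      # row totals (the two treatment groups)
    r, s = a + c, b + d      # column totals (responders, non-responders)
    N = m + n
    if 0 in (m, n, r, s):
        # Assumption: a table with an empty margin carries no evidence.
        return 0.0, 1.0
    stat = (a * d - b * c) ** 2 * (N - 1) / (m * n * r * s)
    # Upper tail of chi squared with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical data: 8/10 respond on one treatment, 3/10 on the other.
    print(n_minus_1_chi_squared(8, 2, 3, 7))
```

The statistic is Pearson's chi squared multiplied by (N - 1)/N, so the two agree closely for large N.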

Rather than being a philosophical distinction, I would argue that the distinction is a practical one, that each paradigm is appropriate in certain circumstances, and that the circumstances of randomised trials dictate the second paradigm involving the binomial distribution rather than the first involving the hypergeometric. The points will be argued using randomised clinical trials (RCTs) as an example.

Firstly, there is the question of the null hypothesis, and secondly there is the question of what is the population being sampled. Confidence intervals and power are also relevant to the debate.

Whether the null hypothesis specifies that each patient in the study would have the same response regardless of the treatment assigned to that patient by randomisation

I believe that null hypotheses in RCTs are rarely of this form, but more often are of a second form: that there is no difference in outcome at the population level rather than at the individual level. For example, in a trial of chemotherapy, we might anticipate more deaths from disease in one group, and more deaths from treatment toxicity in a second group, but the null hypothesis is of no overall difference in survival (i.e. the probability of survival is the same whichever treatment is given). A second example is that in a comparison of two different treatments, we would not necessarily expect the same patients to respond to the two different treatments. Looking at some issues of the British Medical Journal, I came across six randomised trials or discussions of such trials, and in each case, the null hypothesis being tested was not explicitly stated, but (I would say) was of the second form. For example, a randomised study of tennis elbow included a comparison of physiotherapy and steroid injection; we cannot assume that the patients who respond to steroid injection would have responded to physiotherapy if that was what they had been allocated, but instead, we are interested in what is the better treatment overall.

Whether the recruited patients are to be regarded as a random sample or as the population

In a randomised clinical trial, the recruited patients might be described as a convenience rather than a truly random sample. However, in a well-run trial, the recruitment process will ensure that the recruited patients can be regarded as a random sample from the whole group of patients of interest (who might present in other places and at other times). This will be achieved if there is no systematic bias in the patients recruited, such as if a consecutive series of cases from a typical institution is entered into the trial. These cases are not a random sample in the sense that all cases in all institutions presenting at all times are equally likely to be entered, but they are effectively random in that there is no systematic difference between the cases recruited and the larger group.

There can be difficulties with this process. For example, the patients recruited may be just a subset of those eligible (e.g. just those who are the youngest and fittest); then either the trial will need to be abandoned, or the conclusions from the recruited patients cannot be directly extrapolated to the whole group of patients of interest, but only to a subgroup similar to those studied. Secondly, the institution where the trial is conducted may be a specialist centre and not typical of the whole of the health service, and then the conclusions of the trial may be relevant merely to similar specialist centres, and a second trial may need to be carried out of patients seen in general practice. But there must always be a larger group of patients of whom the patients recruited are representative. Rather than this being an assumption (which may or may not hold), I would say that this is a fundamental principle of a clinical trial. We carry out a trial in a sample of patients so that we can extrapolate the results in those patients to predict outcome and inform treatment decisions in the larger group. If we cannot identify a population that can be regarded as the population from whom the patients recruited formed a random sample, then there is no general interest in the outcome in those patients recruited, and the study would not be termed a clinical trial.

So I would say that in an RCT, the population of interest is the whole group of patients, and the recruited patients form effectively a random sample of them, from which two or more smaller random samples of the whole population can be obtained by the randomisation process.

If we regard the recruited patients as the population of interest, and those in each treatment group as samples from this population, then conclusions from a comparison of the outcomes in the samples can be extrapolated only to the population of the recruited patients, and not to the wider group of similar patients. There do not seem to be many examples where a trial might be carried out on this basis, but it is possible that a trial in a chronic condition might be carried out in a remote area, or on all the cases worldwide of a rare condition, where we do not intend to draw conclusions on patients outside those recruited.

So I believe that paradigm 1 is a valid paradigm, but with rare application, and that RCTs fall instead into paradigm 2, where the samples studied are regarded as random samples from a large population. These considerations also apply to observational studies: if the group of individuals being studied cannot be regarded as representative of a larger group of individuals, to whom conclusions can be extrapolated, then they are not of general interest.

Confidence intervals

Confidence intervals for the proportion responding to treatment in each group are generally calculated according to paradigm 2, and so are confidence intervals of the difference in response rate. This means that confidence intervals are consistent with statistical testing by paradigm 2, i.e. when the difference found is statistically significant by a statistical test at some level, then the corresponding confidence interval for the difference between groups will exclude zero. If we were to adopt paradigm 1 for RCTs, then we would have conflict between the results according to statistical testing and the results according to the calculation of confidence intervals, and it would not be clear how the results of the trial should be interpreted.
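
To illustrate what a confidence interval under paradigm 2 looks like, here is a minimal sketch of a simple Wald interval for the difference in response rates, which treats the responses in each group as binomial. The Wald interval is chosen here only for brevity and is an assumption: it is one of several binomial-based intervals, and because it uses an unpooled variance it will not agree with a chi squared test in every borderline case.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci_difference(a, b, c, d, level=0.95):
    """Return (difference, lower, upper) for p1 - p2 in the table [[a, b], [c, d]].

    Each row is treated as an independent binomial sample (paradigm 2).
    """
    m, n = a + b, c + d
    p1, p2 = a / m, c / n
    diff = p1 - p2
    # Unpooled binomial standard error of the difference.
    se = sqrt(p1 * (1 - p1) / m + p2 * (1 - p2) / n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical data: 8/10 respond on one treatment, 3/10 on the other.
    print(wald_ci_difference(8, 2, 3, 7))
```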

Statistical power

When clinical trials are being designed, sample sizes are chosen on the basis of power calculations, which involve the concept of repeating the trial many times. Such repeats are generally modelled according to paradigm 2, i.e. on the basis that binary responses will follow the binomial distribution. If we reject paradigm 2 for the analysis of RCTs, would we not also have to reject it for the calculation of sample size, and develop a new method of doing sample size calculation?
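
The idea of repeating the trial under paradigm 2 can be made concrete with a small sketch that computes power exactly, by enumerating every possible pair of binomial outcomes and adding up the probability of those tables that would be declared significant. The function names, the choice of the 'N - 1' chi squared as the analysis, and the example response rates are all illustrative assumptions.

```python
from math import comb, erfc, sqrt


def binom_pmf(k, n, p):
    """Binomial probability of k responders out of n with response rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def n_minus_1_p_value(a, b, c, d):
    """Two-sided P value of the 'N - 1' chi squared test for [[a, b], [c, d]]."""
    m, n, r, s = a + b, c + d, a + c, b + d
    if 0 in (m, n, r, s):
        return 1.0  # assumption: an empty margin is never significant
    stat = (a * d - b * c) ** 2 * (m + n - 1) / (m * n * r * s)
    return erfc(sqrt(stat / 2))


def exact_power(m, n, p1, p2, alpha=0.05):
    """Probability of P < alpha when group sizes are m, n and the true
    response rates are p1, p2, with independent binomial responses."""
    power = 0.0
    for a in range(m + 1):
        for c in range(n + 1):
            if n_minus_1_p_value(a, m - a, c, n - c) < alpha:
                power += binom_pmf(a, m, p1) * binom_pmf(c, n, p2)
    return power


if __name__ == "__main__":
    # Hypothetical design: 20 patients per group, response rates 0.8 and 0.3.
    print(exact_power(20, 20, 0.8, 0.3))
```

In this calculation the total number of responders varies from one hypothetical repeat to the next, so the column totals are not fixed, in contrast to the reference set of paradigm 1.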

Conclusion

In the great majority of randomised trials, paradigm 2 is the appropriate one, and analysis should be via the 'N - 1' chi squared test, provided the minimum expected number is at least 1.
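
The condition on the minimum expected number can be checked with a short sketch such as the following. It assumes the usual definition of an expected cell count (row total times column total divided by N), so that the smallest expected count is the product of the smaller row total and the smaller column total, divided by N. The function names are illustrative.

```python
def minimum_expected(a, b, c, d):
    """Smallest expected cell count in the 2 x 2 table [[a, b], [c, d]]."""
    m, n = a + b, c + d      # row totals
    r, s = a + c, b + d      # column totals
    return min(m, n) * min(r, s) / (m + n)


def n_minus_1_applicable(a, b, c, d):
    """True when the minimum expected number is at least 1."""
    return minimum_expected(a, b, c, d) >= 1


if __name__ == "__main__":
    # Hypothetical data: 8/10 respond on one treatment, 3/10 on the other.
    print(minimum_expected(8, 2, 3, 7), n_minus_1_applicable(8, 2, 3, 7))
```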
