Study design depends greatly on the nature of the research question. In other words, knowing what kind of information the study should collect is a first step in determining how the study will be carried out (also known as the methodology).
Let’s say we want to investigate the relationship between daily walking and cholesterol levels in the body. One of the first things we’d have to determine is the type of study that will tell us the most about that relationship. Do we want to compare cholesterol levels among different populations of walkers and non-walkers at the same point in time? Or, do we want to measure cholesterol levels in a single population of daily walkers over an extended period of time?
The first approach is typical of a cross-sectional study. The second requires a longitudinal study. To make our choice, we need to know more about the benefits and purpose of each study type.
Both the cross-sectional and the longitudinal studies are observational studies. This means that researchers record information about their subjects without manipulating the study environment. In our study, we would simply measure the cholesterol levels of daily walkers and non-walkers along with any other characteristics that might be of interest to us. We would not influence non-walkers to take up that activity, or advise daily walkers to modify their behaviour. In short, we’d try not to interfere.
The defining feature of a cross-sectional study is that it can compare different population groups at a single point in time. Think of it in terms of taking a snapshot. Findings are drawn from whatever fits into the frame.
To return to our example, we might choose to measure cholesterol levels in daily walkers across two age groups, over 40 and under 40, and compare these to cholesterol levels among non-walkers in the same age groups. We might even create subgroups for gender. However, we would not consider past or future cholesterol levels, for these would fall outside the frame. We would look only at cholesterol levels at one point in time.
The benefit of a cross-sectional study design is that it allows researchers to compare many different variables at the same time. We could, for example, look at age, gender, income and educational level in relation to walking and cholesterol levels, with little or no additional cost.
However, cross-sectional studies may not provide definite information about cause-and-effect relationships. This is because such studies offer a snapshot of a single moment in time; they do not consider what happens before or after the snapshot is taken. Therefore, we can’t know for sure if our daily walkers had low cholesterol levels before taking up their exercise regimes, or if the behaviour of daily walking helped to reduce cholesterol levels that previously were high.
A longitudinal study, like a cross-sectional one, is observational. So, once again, researchers do not interfere with their subjects. However, in a longitudinal study, researchers conduct several observations of the same subjects over a period of time, sometimes lasting many years.
The benefit of a longitudinal study is that researchers are able to detect developments or changes in the characteristics of the target population at both the group and the individual level. The key here is that longitudinal studies extend beyond a single moment in time. As a result, they can establish sequences of events.
To return to our example, we might choose to look at the change in cholesterol levels among women over 40 who walk daily for a period of 20 years. The longitudinal study design would account for cholesterol levels at the onset of a walking regime and as the walking behaviour continued over time. Therefore, a longitudinal study is more likely to suggest cause-and-effect relationships than a cross-sectional study by virtue of its scope.
In general, the research should drive the design. But sometimes, the progression of the research helps determine which design is most appropriate. Cross-sectional studies can be done more quickly than longitudinal studies. That’s why researchers might start with a cross-sectional study to first establish whether there are links or associations between certain variables. Then they would set up a longitudinal study to study cause and effect.
Source: At Work, Issue 81, Summer 2015: Institute for Work & Health, Toronto
This column updates a previous column describing the same term, originally published in 2009.
Power and sample size estimation constitutes an important component of designing and planning modern scientific studies. It provides information for assessing the feasibility of a study to detect treatment effects and for estimating the resources needed to conduct the project. This tutorial discusses the basic concepts of power analysis and the major differences between hypothesis testing and power analyses. We also discuss the advantages of longitudinal studies compared to cross-sectional studies and the statistical issues involved when designing such studies. These points are illustrated with a series of examples.
2. Hypothesis testing, sampling distributions and power
In most studies we do not have access to the entire population of interest because of the prohibitively high cost of identifying and assessing every subject in the population. To overcome this limitation we make inferences about features of interest in our population, such as average income or prevalence of alcohol abuse, based on a relatively small group of subjects, or a sample, from the study population. Such a feature of interest is called a parameter, which is often unobserved unless every subject in the population is assessed. However, we can observe an estimate of the parameter in the study sample; this quantity is called a statistic. Since the value of the statistic is based on a particular sample, it is generally different from the value of the parameter in the population as a whole. Statistical analysis uses information from the statistic to make inferences about the parameter.
For example, suppose we are interested in the prevalence of major depression in a city with one million people. The parameter π is the prevalence of major depression. By taking a random sample of the population, we can compute the statistic p, the proportion of subjects with major depression in the sample. The sample size, n, is usually quite small relative to the population size. The statistic p will most likely not be equal to the parameter π because p is based on the sample and thus will vary from sample to sample. The spread by which p deviates from π with repeated sampling is called sampling error. As long as n is less than 1,000,000, there will always be some sampling error. Although we do not know exactly how large this error is for a particular sample, we can characterize the sampling errors of repeated samples through the sampling distribution of the statistic. In the major depression prevalence example above, the behavior of the estimate p can be characterized by the binomial distribution. The distribution becomes more sharply peaked around the true value of the parameter as the sample size n gets larger; that is, the larger the sample size n, the smaller the sampling error.
If we want to have more accurate estimates of a parameter, we need to have an n large enough so that sampling error will be reasonably small. If n is too small, the estimate will tend to be too imprecise to be of much use. On the other hand, there is also a point of diminishing returns, beyond which increasing n provides little added precision.
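The diminishing returns described above can be illustrated with a minimal Python sketch. It assumes simple random sampling from a population large enough that the binomial standard error sqrt(π(1−π)/n) applies; the prevalence of 6% and the sample sizes are hypothetical values chosen only for illustration, and the function name is ours.

```python
import math


def standard_error(pi: float, n: int) -> float:
    """Standard error of a sample proportion p for true prevalence pi."""
    return math.sqrt(pi * (1.0 - pi) / n)


# Hypothetical prevalence; each step quadruples the sample size.
PI = 0.06
for n in (100, 400, 1600, 6400):
    print(f"n={n:5d}  SE={standard_error(PI, n):.4f}")
```

Under these assumptions, quadrupling n only halves the standard error, so each additional gain in precision costs progressively more subjects.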
Power analysis helps to find the sample size that achieves the desired level of precision. Although research questions vary, data and power analyses all center on testing statistical hypotheses. A statistical hypothesis expresses our belief about the parameter of interest in a form that can be examined through statistical analysis. For example, in the major depression example, if we believe that the prevalence of major depression in this particular population exceeds the national average of 6%, we can express this belief in the form of a null hypothesis (H0) and an alternative hypothesis (Ha):
H0: π = 0.06 vs. Ha: π > 0.06
Statistical analysis estimates how likely it is that we would observe the data obtained from the sample if the null hypothesis H0 were true. If it is very unlikely that we would observe these data if H0 were true, then we reject the H0; otherwise, we do not reject it. In either case, H0 may in reality be true or false.
Thus, there are four possible decision outcomes of statistical hypothesis testing as summarized in the table below.
Decision outcomes of hypothesis testing
| | Do not reject H0 | Reject H0 |
|---|---|---|
| H0 true | Correct decision | Type I error α |
| H0 false | Type II error β | Correct decision |
There are two types of errors associated with the decision to reject or not reject the null hypothesis H0. A type I error, with probability α, is committed if we reject the H0 when the H0 is true; a type II error, with probability β, occurs when we fail to reject the H0 when the H0 is false. In general, α (the risk of committing a type I error) is set at 0.05. The statistical power for detecting a certain departure from the H0 (computed as 1–β) is typically set at 0.80 or higher; thus β (the risk of committing a type II error) is set at 0.20 or less.
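To make α and β concrete, the following Python sketch computes both for the one-sided depression-prevalence test, using exact binomial tail probabilities. The null value of 6% comes from the text; the sample sizes and the alternative prevalence of 10% are hypothetical, and the function names are ours. It is a sketch under these assumptions, not the procedure of any particular package.

```python
from math import comb


def upper_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def critical_count(n: int, p0: float, alpha: float) -> int:
    """Smallest count c with P(X >= c | p0) <= alpha (reject H0 when X >= c)."""
    for c in range(n + 2):  # c = n + 1 gives an empty tail, so a value is always found
        if upper_tail(n, c, p0) <= alpha:
            return c


def power(n: int, p0: float, p1: float, alpha: float = 0.05) -> float:
    """Probability of rejecting H0: pi = p0 when the true prevalence is p1 > p0."""
    return upper_tail(n, critical_count(n, p0, alpha), p1)


# Hypothetical design: 200 subjects, true prevalence 10% under Ha.
n, p0, p1 = 200, 0.06, 0.10
c = critical_count(n, p0, 0.05)
print("reject H0 when at least", c, "of", n, "subjects are depressed")
print("actual type I error:", round(upper_tail(n, c, p0), 4))
print("power (1 - beta):", round(power(n, p0, p1), 4))
```

Because counts are discrete, the achieved type I error is usually somewhat below the nominal α rather than exactly equal to it.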
3. Difference between hypothesis testing and power analysis
3.1. Hypothesis testing
In most hypothesis testing, we are interested in ascertaining whether there is evidence against the H0 based on the level of statistical significance. Consider a study comparing two groups with respect to some outcome of interest y. If μ1 and μ2 denote the averages of y for groups 1 and 2 in the population, one could make the following hypotheses:
H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2
In the above, the difference between the two means under the alternative hypothesis Ha is not specified, since in hypothesis testing we are trying to determine whether there is evidence to reject the H0. Inference about H0 is based on the distribution of the statistic ȳ1 − ȳ2, where ȳ1 and ȳ2 are the averages of the outcome y observed in groups 1 and 2 of the study sample. The level of statistical significance is indicated by the p-value, which is the probability of observing our data, or something more extreme, if the H0 were true. In practice, the threshold for rejecting the null is typically α=0.05 (or α=0.01 for large studies), and the null hypothesis is rejected if the p-value is <α.
Note that no direction of effect is specified in the two-sided alternative Ha above; that is, we do not specify whether the average for group 1 is greater or smaller than the average for group 2. If we hypothesize the direction of effect, a one-sided Ha may be used. For example:
H0: μ1 = μ2 vs. Ha: μ1 > μ2
3.2. Power analysis
Unlike hypothesis testing, both the null H0 and alternative Ha hypotheses must be fully considered when performing power analysis. The usual purposes of conducting power analyses are (a) to estimate the minimum sample size needed in a proposed study to detect an effect of a certain magnitude at a given level of statistical power, or (b) to determine the level of statistical power in a completed study for detecting an effect of a certain magnitude given the sample size in the study. In the example above, to estimate the minimum sample size needed or to compute the statistical power, we must specify a value for δ=μ1-μ2, the difference between the two group averages, that we wish to detect under the Ha.
In power analysis, effects are often specified in terms of effect sizes rather than the absolute magnitude of the hypothesized effect, because the absolute magnitude depends on how the outcome is defined (i.e., what type of measure is employed) and does not account for the variability of the outcome in the study population. For example, if the outcome y is body weight, it could be measured in pounds or in kilograms, so the same difference between two group averages could be reported as either 11 pounds or 5 kilograms. To remove the dependence on the type of measure employed and to account for the variability of the outcomes in the study population, the effect size, a standardized measure of the difference between groups, is often used to quantify the hypothesized effect. It expresses the difference δ=μ1-μ2 relative to the variability of the outcomes, where σ1² and σ2² denote the variances of the outcomes in the two groups.
Unlike the difference δ=μ1-μ2, the effect size is an invariant quantity, that is, it remains the same regardless of the scale used.
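One widely used standardized difference of this kind is Cohen's d with the two group variances averaged. It is written out here as an illustration only; the exact scaling of the denominator is a convention, and assuming this particular one is our choice.

$$
d = \frac{\mu_1 - \mu_2}{\sqrt{\left(\sigma_1^2 + \sigma_2^2\right)/2}}
$$

Changing the unit of measurement multiplies μ1−μ2 and both standard deviations by the same factor, so the factor cancels in the ratio. In the body-weight example, a difference of 11 pounds with the standard deviations expressed in pounds gives the same d as a difference of 5 kilograms with the standard deviations expressed in kilograms.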
Note that effect sizes are different for different analytical models. For example, in regression analysis the effect size is commonly based on the change in R², a measure of the amount of variability in the response (dependent) variable that is explained by the explanatory (independent) variables. Regardless of such differences, the effect size is a unitless quantity.
4. Examples of power analysis
4.1. Example 1
Consider again the hypothesis to test the difference in average outcomes between two groups:
H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2
or equivalently, when specified in terms of effect size:
H0: effect size = 0 vs. Ha: effect size ≠ 0
Power is computed based on the sampling distribution of the difference statistic ȳ1 − ȳ2, the difference between the two sample averages.
To calculate power, we may specify n1, n2, μ1, μ2, σ1 and σ2. For example, if n1=n2=50, μ1=0.2, μ2=1.1 and σ1=σ2=1.6, then power=80%. Alternatively, we can specify the difference in terms of the corresponding effect size to obtain the same power of 80%.
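The 80% figure in this example can be checked approximately with a short Python sketch. It assumes a two-sided test at α=0.05 and uses the normal approximation to the sampling distribution of the difference in sample averages; dedicated software would typically use the t distribution instead, which gives a slightly different value. The function name is ours.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(n1, n2, mu1, mu2, sigma1, sigma2, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test."""
    std_normal = NormalDist()
    # Standard error of the difference between the two sample averages.
    se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
    # Hypothesized difference expressed in standard-error units.
    shift = abs(mu1 - mu2) / se
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region under Ha.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


print(round(two_sample_power(50, 50, 0.2, 1.1, 1.6, 1.6), 3))
```

With the values from the example, this approximation returns a power of about 0.80.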
4.2. Example 2
Consider a linear regression model for a response (outcome) variable that is continuous with m explanatory (independent) variables in the model. The most common hypothesis is whether the explanatory variables jointly explain the variability in the response variable. Power is based on the sampling F-distribution of a statistic measuring the strength of the linear relationship between the response and explanatory variables and is a function of m, R² (effect size) and sample size n.
If m=5, we need a sample size of n=100 to detect an increase of 0.12 in R² with 80% power and α=0.05. Note that R is also called the multiple correlation coefficient, and R² the coefficient of multiple determination.
4.3. Example 3
Consider a logistic regression model for assessing risk factors of suicide. First, consider the case with only one risk factor such as major depression (predictor). The sample size is a function of the overall suicide rate π in the study population, odds ratio for the risk factor, and level of statistical power. The table below shows sample size estimates as a function of these parameters, with α=0.05 and power=80%. As shown in the table, if π=0.5, a sample size of n=272 is needed to detect an odds ratio of 2.0 for the risk variable (major depression) in the logistic model.
Sample sizes needed to have 80% power to detect different odds ratios at two different prevalence levels (π) of the target variable of interest
In many studies, we consider multiple risk factors or one risk factor controlling for other covariates. In this case, we first calculate the sample size needed for the risk variable of interest and then adjust it to account for the presence of other risk variables (covariates).
In the single-risk-factor case of major depression as a risk factor for suicide, if we additionally control for other covariates such as age and gender in the logistic regression model, the sample size needed is obtained by dividing the sample size obtained from the single-risk-factor model by 1-R², where R² is from the regression model with the risk factor of interest as the dependent variable and the other covariates as the explanatory variables. In the case where π=0.5, if R²=0.3 for the logistic regression model with major depression as the dependent variable and age and gender as the independent variables, then n = 272/(1-0.3) ≈ 389 is the sample size needed to detect an odds ratio of 2.0 for major depression in the prediction of suicide while adjusting for age and gender. In summary, a larger sample size is needed when controlling for other covariates in the model, and the increase in the needed sample size is greater when the correlation between the risk variable of interest and the other covariates is higher.
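The covariate adjustment described above is a single division, sketched below in Python. The function name is ours, and rounding up to the next whole subject is an assumed convention.

```python
import math


def adjust_for_covariates(n_single: float, r_squared: float) -> int:
    """Inflate a single-risk-factor sample size by 1 / (1 - R^2).

    r_squared is the R^2 from regressing the risk factor of interest
    on the other covariates in the model.
    """
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return math.ceil(n_single / (1.0 - r_squared))


# Values from the example: n=272 for a single risk factor, R^2=0.3.
print(adjust_for_covariates(272, 0.3))  # 389
```

When the risk factor is uncorrelated with the other covariates (R²=0), the function returns the single-risk-factor sample size unchanged.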
4.4. Example 4
Consider a drug-abuse study comparing parental conflict and parenting behavior of parents from families with a drug-abusing father (DA) to that of families with an alcohol-abusing father (AA). Each study participant is assessed at three time points. For such longitudinal studies, power is a function of within-subject correlation ρ, that is, the correlation between the repeated measurements within a participant. There are many data structures that can be used to assess this within-subject correlation; the details for doing this can be found in the paper by Jennrich and Schluchter.
Required sample sizes for complete data (and 15% missing data) to detect differences in an outcome of interest between two groups (α=0.05; β=0.20) when the outcome is assessed repeatedly and there are different levels of within-subject correlation
| Number of post-baseline assessments | Within-subject correlation ρ | | | | |
|---|---|---|---|---|---|
| two | 52 (61) | 68 (80) | 84 (98) | 102 (120) | 118 (138) |
| four | 36 (42) | 56 (65) | 76 (89) | 96 (112) | 116 (136) |
As seen in the above table, the sample sizes required to detect the desired effect size increase as ρ approaches 1 and decrease as ρ approaches 0. Sample size also depends on the number of post-baseline assessments, with smaller sample sizes needed when there are more assessments. In the extreme cases when ρ=0 (there is no relationship between the repeated assessments within a participant) or ρ=1 (repeated assessments within a participant yield identical data), the repeated outcomes become, respectively, completely independent (as if they were collected from different individuals) or redundant (providing no additional information).
When ρ=1, all repeated assessments within a participant are identical to each other, and thus the additional assessments do not yield any new information. In comparison, when ρ≠1, longitudinal studies always provide more statistical power than their cross-sectional counterparts. Furthermore, the sample size required is smaller when ρ approaches 0, because repeated measurements are less similar to each other and provide additional information on the participants. To ensure reasonably small within-subject correlations, researchers should avoid scheduling post-baseline assessments too close to each other in time.
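The role of ρ can be illustrated with a standard result for equally correlated measurements: the variance of the average of k assessments, relative to the variance of a single assessment, is (1+(k−1)ρ)/k. The Python sketch below assumes this compound-symmetry structure (the same variance at every assessment and the same correlation ρ between every pair). It is offered only to show the direction of the effect and is not the calculation used to produce the table above; the function name is ours.

```python
def variance_factor(k: int, rho: float) -> float:
    """Variance of the mean of k equally correlated assessments,
    relative to the variance of a single assessment."""
    return (1.0 + (k - 1) * rho) / k


# Hypothetical correlation levels chosen for illustration.
for rho in (0.0, 0.5, 1.0):
    print(f"rho={rho:.1f}  k=2: {variance_factor(2, rho):.3f}  "
          f"k=4: {variance_factor(4, rho):.3f}")
```

At ρ=0 the factor is 1/k, as if the k assessments came from different individuals; at ρ=1 it equals 1, so the extra assessments add nothing.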
In practice, missing data are inevitable. Since most commercial statistical packages do not consider missing data, we need to perform adjustments to account for their effect on power. One way of doing this (shown in the table) is to inflate the estimated sample size. For example, if it is expected that 15% of the data will be missing at each follow-up visit and n is the estimated sample size needed under the assumption of complete data, we inflate the sample size to n′ = n/(1 − 0.15). As seen in the table, missing data can have a sizable effect on the estimated sample sizes needed, so it is important to have good estimates of the expected rate of missing data when estimating the required sample size for a proposed study. It is equally important to try to reduce the amount of missing data during the course of the study to improve the statistical power of the results.
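The inflation for missing data can be sketched in Python as follows. The function name is ours, and the result is left unrounded because the rounding convention (up, down, or to the nearest subject) is a choice the text does not state.

```python
def inflate_for_missing(n_complete: float, missing_rate: float) -> float:
    """Sample size needed when a fraction missing_rate of the data is expected to be lost."""
    if not 0.0 <= missing_rate < 1.0:
        raise ValueError("missing_rate must be in [0, 1)")
    return n_complete / (1.0 - missing_rate)


# Complete-data sample sizes for two post-baseline assessments, 15% missing.
for n in (52, 68, 84, 102, 118):
    print(n, round(inflate_for_missing(n, 0.15), 1))
```

Rounding up is the conservative choice when planning recruitment, since it never leaves the study short of the complete-data requirement.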
5. Software packages
Different statistical software packages can be used for power analysis. Although popular data analysis packages such as R and SAS may be used for power analysis, they are somewhat limited in their application, so it is often necessary to use more specialized software packages for power analysis. We used PASS 11 for all the examples in this paper. As noted earlier, most packages do not accommodate missing data for longitudinal study designs, so ad-hoc adjustments are necessary to account for missing data.
We discussed power analysis for a range of statistical models. Although different statistical models require different methods and input parameters for power analysis, the goals of the analysis are the same: either (a) to determine the power to detect a certain effect size (and reject the null hypothesis) for a given sample size, or (b) to estimate the sample size needed to detect a certain effect size (and reject the null hypothesis) at a specified power. Power analysis for longitudinal studies is complex because within-subject correlation, number of repeated assessments, and level of missing data can all affect the estimations of the required sample sizes.
When conducting power analysis one needs to specify the desired effect size, that is, the minimum magnitude of the standardized difference between groups that would be considered relevant or important. There are two common approaches for determining the effect sizes used when conducting power analyses: (a) use a ‘clinically significant’ difference; or (b) use information from published studies or pilot data about the magnitude of the difference that is common or considered important. When using the second approach, one must be mindful of the sample sizes in prior studies because reported averages, standard deviations, and effect sizes can be quite variable, particularly for small studies. In addition, previous reports may focus on different population cohorts or use different study designs than those intended for the study of interest, so it may not be appropriate to use the prior estimates in the proposed study. Further, given that studies with larger effect sizes are more likely to achieve statistical significance and, hence, more likely to be published, estimates from published studies may overestimate the true effect size.
This research is supported in part by the Clinical and Translational Science Collaborative of Cleveland, UL1TR000439, and of the University of Rochester, 5-27607, from the National Institutes of Health.
Conflict of Interest: The authors report no conflict of interest related to this manuscript.
1. Jennrich RI, Schluchter MD. Unbalanced repeated-measures models with structured covariance matrices. Biometrics. 1986;42:805–820.
2. R Core Team. R: A Language and Environment for Statistical Computing. Vienna (Austria): R Foundation for Statistical Computing; 2012. ISBN 3-900051-07-0, URL http://www.R-project.org/
3. Castelloe JM. Sample Size Computations and Power Analysis with the SAS System. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference; April 9-12, 2000; Indianapolis, Indiana, USA. Cary, NC: SAS Institute Inc.; Paper 265-25.
4. Hintze J. PASS 11. Kaysville, Utah, USA: NCSS, LLC; 2011.