Correlation is used to assess whether two continuous variables are associated. The most frequently used correlation coefficient is the Pearson correlation coefficient, usually denoted 'r', which measures the degree of straight-line association between two variables x and y. r can take any value between –1 and +1: its sign gives the direction of the association and its magnitude indicates how close the points are to a straight line. A value of zero means that there is no linear relationship between the variables studied, although a non-linear relationship may still exist. The figure shows the association between body weight and diastolic blood pressure in a group of healthy people. In this case r = 0.36, indicating that diastolic blood pressure tends to increase with body weight. r² is the proportion of the variability of y that can be explained by the variation in x, so in this study 13% (0.36² = 0.13) of the variability in diastolic blood pressure was explained by the variation in body weight. When the sample is small or the relationship is not linear, the Pearson correlation coefficient is not suitable and a rank correlation coefficient (Spearman's rho or Kendall's tau) should be used instead.
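
As a rough illustration of the calculation described above, the following Python sketch computes the Pearson coefficient from its definition and obtains Spearman's rho by applying the same formula to the ranks of the data. The function names and the body weight and blood pressure values are hypothetical and chosen only for illustration; they are not the data behind the figure.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson coefficient of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical measurements, one observation per person
weight_kg = [60, 65, 70, 75, 80, 85, 90]
diastolic_mmhg = [78, 72, 80, 75, 84, 79, 82]

r = pearson_r(weight_kg, diastolic_mmhg)
print(f"Pearson r    = {r:.2f}")
print(f"r squared    = {r ** 2:.2f}")
print(f"Spearman rho = {spearman_rho(weight_kg, diastolic_mmhg):.2f}")
```

For a perfectly monotonic but curved relationship (for example y = x³), Spearman's rho equals 1 while the Pearson coefficient stays below 1, which is one reason rank correlation is preferred when the relationship is not a straight line.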

There are many pitfalls in correlation analysis. For example, one should avoid using more than one observation from the same individual, because such observations are not independent of each other. Such 'misuse' tends to result in falsely high correlation coefficients. Once an association is found, its interpretation is often problematic. An observed association between x and y may mean that x influences (or causes) y, that y influences x, or that both x and y are influenced by one or more other variables.
Whereas correlation summarises the strength of an association between two continuous variables in a single number, regression predicts the value of one variable from the known value(s) of one or more others. The type of regression analysis to be used depends on the type of outcome variable (also called the response or dependent variable). If the outcome is continuous (e.g. percentage body fat) we use linear regression; for binary (yes/no) outcomes (e.g. presence or absence of depression) we use logistic regression; and for 'time-to-event' data (e.g. time to death or technique failure) we use survival analysis (e.g. Cox proportional hazards regression). The variables that are included in a regression model to predict the outcome are called predictor variables (also called explanatory or independent variables). In all cases, predictor variables can be either continuous or categorical. In future issues of the newsletter we will discuss the different types of regression analysis in more detail.
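
To make the idea of prediction concrete, the sketch below fits the simplest case, a linear regression with one continuous predictor, by ordinary least squares. The choice of body mass index as the predictor, the function names and all data values are assumptions made for illustration only.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("the predictor must vary to estimate a slope")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, new_x):
    """Predicted outcome for a new value of the predictor."""
    return intercept + slope * new_x


# Hypothetical data: body mass index (predictor) and percentage body fat (outcome)
bmi = [19.0, 21.5, 23.0, 25.5, 28.0, 31.0]
body_fat_pct = [15.0, 18.5, 21.0, 24.0, 29.5, 33.0]

intercept, slope = fit_line(bmi, body_fat_pct)
print(f"body fat % = {intercept:.1f} + {slope:.2f} * BMI")
print(f"predicted body fat % at BMI 26: {predict(intercept, slope, 26.0):.1f}")
```

The slope is the expected change in the outcome per unit increase in the predictor. Binary and time-to-event outcomes need the other models named above and are not covered by this sketch.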
For further reading
Rothman K. Epidemiology: an introduction. Oxford University Press, 2002.
Altman DG. Practical Statistics for Medical Research. Chapman and Hall, 1991.
Kitty Jager
Managing Director of the ERA-EDTA Registry