
Selection criteria

Following the identification of literature pertaining to instruments, it is important that users apply the necessary criteria to select the most suitable instrument(s). There are eight criteria that should be considered in the selection of Patient Reported Outcome Measures (PROMs):

Appropriateness: is the instrument content appropriate to the questions which the application seeks to address?
Acceptability: is the instrument acceptable to patients?
Feasibility: is the instrument easy to administer and process?
Interpretability: how interpretable are the scores of the instrument?
Precision: how precise are the scores of the instrument?
Reliability: does the instrument produce results that are reproducible and internally consistent?
Validity: does the instrument measure what it claims to measure?
Responsiveness: does the instrument detect changes over time that matter to patients?

These criteria are not precisely or uniformly described in the literature, nor can they be prioritised in terms of importance; rather, they should be considered in relation to the proposed application. Further information relating to these criteria can be found in the following report:

Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technology Assessment 1998;2(14). Available free from http://www.hta.ac.uk/fullmono/mon214.pdf


Appropriateness is the extent to which instrument content is appropriate to the particular application.

Careful consideration should be given to the aims of the application, with reference to the areas of health concern (i.e. which dimensions of health are important), the nature of the patient group, and the content of possible instruments [1-3]. It is difficult to give general recommendations as to what makes an instrument appropriate for a given application, because this ultimately depends on the users' specific questions and the content of instruments.

Instrument selection is often dominated by psychometric considerations of reliability and validity, with insufficient attention given to the content of instruments. The names of instruments and constituent scales or dimensions should not be taken at face value [4]. Users should consider the content of individual items within instruments.

Consideration must also be given to the measurement objective, which is closely related to the proposed application. PROMs can have three broad measurement objectives: discrimination, evaluation and prediction [5]. Discrimination is concerned with the measurement of differences between patients when there is no external criterion available to validate the instrument. For example, measures of psychological well-being have been developed to identify individuals suffering from anxiety and depression. Evaluation is concerned with the measurement of changes over time. For example, PROMs administered before and after treatment are used as outcome measures in clinical trials. Prediction is concerned with classifying patients when a criterion is available to determine whether the classification is correct. For example, PROMs may be used in diagnosis and screening as a means of identifying individuals for suitable forms of treatment.

The three measurement objectives are not necessarily mutually exclusive. However, before they can be considered appropriate, instruments must undergo testing that is tailored to the measurement objectives that are relevant to the proposed application. Individual items and scales should be examined to determine whether they concord with the measurement objective. Discrimination and evaluation may be complementary if both are concerned with the measurement of differences that are clinically important, be they cross-sectional or longitudinal. However, an item that asks about family history of a particular disease may be useful for determining which patients have the disease but will be inappropriate for evaluation.

It is also important to consider how broad a measure of health is required. Specific instruments can have a very restricted focus on symptoms and signs of disease, but may also take account of the impact of disease on quality of life. Generic instruments measure broader aspects of health and quality of life that are of general importance. Where feasible, it is recommended that both specific and generic instruments be used to measure health outcomes [6,7]. In this way the most immediate effects of treatment on disease should be captured, as well as possible consequences that are harder to anticipate.



Acceptability is the extent to which an instrument is acceptable to patients. Indicators of acceptability include administration time, response rates, and levels of missing data [1]. There are a number of factors that can influence acceptability including the mode of administration, questionnaire design, and the health status of respondents. The format of patient-reported instruments can also influence acceptability. For example, the task faced by respondents completing individualised instruments is usually more difficult than that for instruments based on summated rating scales [8]. General features of layout, appearance, and legibility are thought to be important influences on acceptability.

To be acceptable, the instrument must be presented in a language that is familiar to respondents. Guidelines are available that are designed to ensure a high standard of translation [9,10]. These guidelines recommend the comparison of several independent translations, back translation, and the testing of acceptability of new translations.

Issues of acceptability should be considered at the design stage of instrument and questionnaire development. Patients' views about a new instrument should be obtained at the pre-testing phase, prior to formal tests of instrument measurement properties including reliability [11]. Patients can be asked by means of additional questions or semi-structured interview whether they found any questions difficult or distressing.


Feasibility concerns the ease of administration and processing of an instrument. These are important considerations for staff and researchers who collect and process the information produced by patient-reported instruments [12,13]. Instruments that are difficult to administer and process may jeopardise the conduct of research and disrupt clinical care. An obvious example is the additional resources required for interviewer administration over self-administration. The complexity and length of an instrument will have implications for the form of administration. Staff training needs must be considered before undertaking interviewer administration. Staff may also have to be available within the clinic to help patients who have difficulty with self-administration. Finally, staff attitudes and acceptance of patient-reported instruments can make a substantial difference to respondent acceptability.



Interpretability concerns the meaningfulness of scores produced by an instrument. To some extent, the lack of familiarity in the use of instruments may be a hindrance to interpretation. Three approaches to interpretation have been proposed. First, changes in instrument scores have been compared to previously documented change scores produced by the same instrument for major life events such as loss of a job [14]. Secondly, attempts have been made to identify the minimal clinically important difference (MCID), which is equal to the smallest change in instrument scores that is perceived as beneficial by patients [15,16]. External judgments including summary items such as health transition questions are used to determine the MCID. Thirdly, normative data from the general population can be used to interpret scores from generic instruments [17,18]. The standardisation of instrument scores is an extension of this form of interpretation that allows score changes to be expressed in terms of the score distribution for the general population [18].
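The third approach, norm-based standardisation, can be sketched in a few lines of Python. The population mean and standard deviation used here are invented for illustration, not published norms; real applications would substitute normative data for the relevant general population [17,18].

```python
# Norm-based interpretation of instrument scores (a sketch; the
# population norms below are hypothetical, not real normative data).
POP_MEAN, POP_SD = 72.0, 18.0  # assumed general-population mean and SD

def z_score(score, mean=POP_MEAN, sd=POP_SD):
    """Express a score in SD units relative to the population norm."""
    return (score - mean) / sd

def standardised_change(baseline, follow_up, sd=POP_SD):
    """Express a change score in population-SD units [18]."""
    return (follow_up - baseline) / sd

print(z_score(54.0))                    # → -1.0 (one SD below the norm)
print(standardised_change(54.0, 63.0))  # → 0.5 (gain of half a population SD)
```

Expressing scores and score changes in population-SD units gives them a common metric across generic instruments, which is what makes this form of interpretation possible.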



Precision concerns the number and accuracy of distinctions made by an instrument. There are a number of aspects to the issue of precision, which relate to methods of scaling and scoring items, and the distribution of items over the range of the construct being measured.

The scaling of items within instruments has important implications for precision. The binary 'yes' or 'no' response is the simplest form of response category, but it does not allow respondents to report degrees of difficulty or severity. The majority of instruments use adjectival or Likert-type scales such as: strongly agree, agree, uncertain, disagree, strongly disagree. Visual analogue scales appear to offer greater precision, but there is insufficient evidence to support this and they may be less acceptable to respondents.

There are a number of instruments that incorporate weighting systems, the most widely used being preferences or values derived from the general public for utility measures such as the EuroQol EQ-5D [19] and the Health Utilities Index [20]. Weighting schemes have also been applied to instruments based on summated rating scales including the Nottingham Health Profile [21] and the Sickness Impact Profile [22]. Such weighting schemes may seem deceptively precise and should be examined for evidence of reliability and validity.

The items and scores of different instruments may vary in how well they capture the full range of the underlying construct being measured. End effects occur when a large proportion of respondents score at the floor or ceiling of the score distribution. If a large proportion of items have end effects then instrument scores will be similarly affected. End effects are evidence that an instrument may be measuring a restricted range of a construct and may limit both discriminatory power and responsiveness [23,24].
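Checking for end effects is a simple calculation: the proportion of respondents scoring at each extreme of the scale range. The sketch below uses invented scores on an assumed 0-100 scale.

```python
def end_effects(scores, floor, ceiling):
    """Percentage of respondents at the floor and at the ceiling
    of an instrument's score range."""
    n = len(scores)
    pct_floor = 100.0 * sum(s == floor for s in scores) / n
    pct_ceiling = 100.0 * sum(s == ceiling for s in scores) / n
    return pct_floor, pct_ceiling

# hypothetical 0-100 scale scores for ten respondents
scores = [0, 0, 0, 10, 35, 50, 80, 100, 100, 100]
print(end_effects(scores, floor=0, ceiling=100))  # → (30.0, 30.0)
```

With 30% of respondents at each extreme, this hypothetical instrument could neither discriminate among the healthiest and sickest respondents nor register their improvement or deterioration [23,24].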

The application of Item Response Theory (IRT) can further help determine the precision of an instrument. IRT assumes that a measurement construct such as physical disability, can be represented by a hierarchy that ranges from the minimum to maximum level of disability [25]. IRT has shown that a number of instruments have items concentrated around the middle of the hierarchy with relatively fewer items positioned at the ends [25-27]. The scores produced by such instruments are not only a function of the health status of patients but also the imprecision of measurement.
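The simplest IRT model, the one-parameter (Rasch) model, makes the point about item placement concrete. In the sketch below the item difficulties are invented: because they all cluster near the middle of the logit scale, a respondent at either extreme endorses almost all or almost none of the items, so the instrument measures imprecisely there.

```python
import math

def rasch_probability(theta, difficulty):
    """One-parameter (Rasch) model: probability that a person at
    level `theta` of the construct affirms an item of the given
    difficulty, both expressed on the same logit scale [25]."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

# hypothetical item difficulties clustered mid-hierarchy (in logits)
difficulties = [-0.5, -0.2, 0.0, 0.3, 0.6]

for theta in (-3.0, 0.0, 3.0):
    probs = [round(rasch_probability(theta, b), 2) for b in difficulties]
    print(theta, probs)  # near 0 or 1 at the extremes, mixed in the middle
```

Items spread across the full hierarchy, rather than clustered at its centre, would keep the response probabilities informative over the whole range of the construct.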



Reliability concerns whether an instrument is internally consistent or reproducible, and it assesses the extent to which an instrument is free from measurement error. It may be regarded as the proportion of a score that is signal rather than noise. As the measurement error of an instrument increases, so does the sample size required to obtain precise estimates of the effects of an intervention [1].

Internal consistency is measured with a single administration of an instrument and assesses how well items within a scale measure a single underlying dimension. Internal consistency is usually assessed using Cronbach's alpha, which measures the overall correlation between items within a scale [28]. Caution should be exercised in the interpretation of alpha because its size is dependent on the number of items as well as the level of correlation between items [29]. Furthermore, very high levels of correlation between items may indicate redundancy and the possibility that items are measuring a very narrow aspect of a construct.
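Cronbach's alpha can be computed directly from its definition: the number of items and the ratio of the summed item variances to the variance of the total score [28]. The item scores below are invented for illustration.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a multi-item scale. `items` is a list of
    k lists, one per item, each holding one score per respondent."""
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]  # respondents' total scores
    sum_item_var = sum(pvariance(col) for col in items)
    return (k / (k - 1)) * (1 - sum_item_var / pvariance(totals))

# hypothetical 3-item scale scored 1-5, five respondents
items = [
    [2, 4, 3, 5, 1],
    [3, 4, 3, 5, 2],
    [2, 5, 4, 4, 1],
]
print(round(cronbach_alpha(items), 2))  # → 0.94
```

A value this high from only three items illustrates the caution in the text: strongly correlated items inflate alpha and may simply be redundant restatements of a narrow aspect of the construct [29].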

Reproducibility assesses whether an instrument produces the same results on repeated administrations when respondents have not changed. This is assessed by test-retest reliability. There is no exact agreement about the length of time between administrations but in practice it tends to be between 2 and 14 days [29]. The reliability coefficient is normally calculated by correlating instrument scores for the two administrations. It is recommended that the intra-class correlation coefficient be used in preference to Pearson's correlation coefficient, which fails to take sufficient account of systematic error.
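The difference between the two coefficients is easy to demonstrate with invented test-retest data containing a systematic shift. The ICC form below is the two-way, absolute-agreement coefficient (Shrout and Fleiss's ICC(2,1)), which is one common choice; the data are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson's correlation coefficient between two administrations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def icc_agreement(x, y):
    """Two-way, absolute-agreement intra-class correlation (ICC(2,1))
    for two administrations, via the standard ANOVA decomposition."""
    n, k = len(x), 2
    grand = mean(x + y)
    subj_means = [mean(pair) for pair in zip(x, y)]
    occ_means = [mean(x), mean(y)]
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_rows = k * sum((m - grand) ** 2 for m in subj_means)
    ss_cols = n * sum((m - grand) ** 2 for m in occ_means)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)          # between-subjects mean square
    msc = ss_cols / (k - 1)          # between-occasions mean square
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# hypothetical scores: every retest score is 10 points higher
test = [20, 30, 40, 50, 60]
retest = [30, 40, 50, 60, 70]
print(round(pearson(test, retest), 2))        # → 1.0 (blind to the shift)
print(round(icc_agreement(test, retest), 2))  # → 0.83 (penalises the shift)
```

Pearson's coefficient reports perfect reliability despite every respondent scoring 10 points higher at retest; the intra-class coefficient counts that systematic error against the instrument, which is why it is preferred.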

Reliability estimates of 0.7 and 0.9 are recommended for instruments that are to be used in groups and individual patients respectively [1]. Reliability is not a fixed property and must be assessed in relation to the specific population and context [29].



Validity is the extent to which an instrument measures what is intended. Validity can be assessed qualitatively through an examination of instrument content, and quantitatively through factor analysis and comparisons with related variables. As with reliability, validity should not be seen as a fixed property and must be assessed in relation to the specific population and measurement objectives.

Content and face validity assess whether items adequately address the domain of interest [1]. They are qualitative matters of judging whether an instrument is suitable for its proposed application. Face validity is concerned with whether an instrument appears to be measuring the domain of interest. Content validity is a judgement about whether instrument content adequately covers the domain of interest. There is increasing evidence that items within instruments tend to be concentrated around the middle of the scale hierarchy with relatively fewer items at the extremes representing lower and higher levels of health. Instrument content should be examined for relevance to the application and for adequate coverage of the domain of interest.

Further evidence can be obtained from considering how the instrument was developed. This includes the extent of involvement in instrument development of experts with relevant clinical or health status measurement experience [30]. More importantly, consideration should be given to the extent of involvement of patients in the generation and confirmation of instrument content [31].

Validity testing should also involve some quantitative assessment. Criterion validity is assessed when an instrument correlates with another instrument or measure that is regarded as a more accurate or criterion variable. Within the field of patient-reported health measurement it is rarely the case that a criterion or 'gold standard' measure exists that can be used to test the validity of an instrument. There are two exceptions. The first is when an instrument is reduced in length with the longer version used as the 'gold standard' to develop the short version [32]. Scores for short and long versions of the instrument are compared, the objective being a very high level of correlation. Secondly, instruments that have the measurement objective of prediction have a gold standard available either concurrently or in the future. For example, the criterion validity of an instrument designed to predict the presence of a particular disease can be assessed through a comparison with the results of diagnosis.

In the absence of a criterion variable, validity testing takes the form of construct validation. PROMs are developed to measure some underlying construct such as physical functioning or pain. On the basis of current understanding, such constructs can be expected to have a set of quantitative relationships with other constructs. For example, patients experiencing more severe pain may be expected to take more analgesics. Construct validity is assessed by comparing the scores produced by an instrument with sets of related variables. To facilitate the interpretation of results, expected levels of correlation should be specified at the outset of studies [33].

Many instruments are multidimensional and measure several constructs, including physical functioning, mental health, and social functioning. These constructs should be considered when assessing construct validity as should the expected relationships with sets of variables. Furthermore, the internal structure of such instruments can be assessed by methods of construct validation. Factor analysis and principal component analysis provide empirical support for the dimensionality or internal construct validity of an instrument [34]. These statistical techniques are used to identify separate health domains within an instrument [35].
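A minimal sketch of the idea behind these techniques: the leading principal component of an item correlation matrix can be found by power iteration, and items that load together on it suggest a shared domain. The correlation matrix below is invented (two "physical" items correlating strongly with each other, one "mental" item correlating weakly with both); real analyses would use a statistical package and examine several components.

```python
def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def normalise(v):
    """Scale a vector to unit length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def first_component(corr, iters=200):
    """Leading eigenvector (component loadings, up to sign) of a
    symmetric correlation matrix, found by power iteration."""
    v = normalise([1.0] * len(corr))
    for _ in range(iters):
        v = normalise(mat_vec(corr, v))
    return v

# hypothetical item correlations: items 1-2 (physical) vs item 3 (mental)
corr = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.2],
    [0.2, 0.2, 1.0],
]
loadings = first_component(corr)
print([round(x, 2) for x in loadings])  # items 1 and 2 load most heavily
```

The first two items dominate the leading component while the third loads weakly, mirroring how these methods separate an instrument's health domains [34,35].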



Responsiveness is concerned with the measurement of important changes in health and is therefore relevant when instruments are to be used in an evaluative context for the measurement of health outcomes. Just as with reliability and validity, estimates of responsiveness are related to applications within specific populations and are not an inherent or fixed property of an instrument.

Responsiveness is usually assessed by examining changes in instrument scores for groups of patients whose health is known to have changed. This may follow an intervention of known efficacy. Alternatively, patients may be asked how their current health compares to some previous point in time by means of a health transition question. There is no single agreed method of assessing responsiveness and a number of statistical techniques are used for quantifying responsiveness.

The effect size statistic is equal to the mean change in instrument scores divided by the baseline standard deviation [36]. The standardised response mean is equal to the mean change in scores divided by the standard deviation of the change in scores [37]. The modified standardised response mean, sometimes referred to as the index of responsiveness, is equal to the mean change in scores divided by the standard deviation of change scores in stable subjects [38]. The denominator for the latter can be derived from the test-retest method of reliability testing.
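The three indices differ only in their denominators, as a short Python sketch makes explicit. The before-and-after scores and the stable-subject change scores below are invented for illustration.

```python
from statistics import mean, stdev

def effect_size(baseline, follow_up):
    """Mean change in scores divided by the baseline SD [36]."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(baseline)

def standardised_response_mean(baseline, follow_up):
    """Mean change in scores divided by the SD of the change scores [37]."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(change)

def modified_srm(baseline, follow_up, stable_changes):
    """Mean change divided by the SD of change scores in stable
    subjects [38], e.g. taken from test-retest reliability data."""
    change = [f - b for b, f in zip(baseline, follow_up)]
    return mean(change) / stdev(stable_changes)

# hypothetical scores for five treated patients, plus change scores
# observed in stable (untreated) subjects over the same interval
baseline = [40, 45, 50, 55, 60]
follow_up = [45, 58, 52, 70, 66]
stable_changes = [-2, 1, 0, 3, -1]

print(round(effect_size(baseline, follow_up), 2))
print(round(standardised_response_mean(baseline, follow_up), 2))
print(round(modified_srm(baseline, follow_up, stable_changes), 2))
```

Because each index standardises the same mean change against a different measure of variability, they can rank instruments differently; this is one reason no single method of assessing responsiveness is agreed upon.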



  1. Fitzpatrick R, Davey C, Buxton MJ, and Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technology Assessment 1998;2(14).
  2. Liang MH, Cullen KE, Larson M. In search of the perfect mousetrap (health status or quality of life measurement). Journal of Rheumatology 1982;9:775-9.
  3. Guyatt GH, Feeny DH, Patrick DL. Issues in quality-of-life measurement in clinical trials. Controlled Clinical Trials 1991;12:81S-90S.
  4. Ware JE. Standards for validating health measures: definition and content. Journal of Chronic Diseases 1987;40:473-80.
  5. Kirshner B, Guyatt G. A methodological framework for assessing health indices. Journal of Chronic Diseases 1985;38:27-36.
  6. Cox DR, Fitzpatrick R, Fletcher AE, Gore SM, Spiegelhalter DJ, Jones DR. Quality of life measurement: can we keep it simple? Journal of the Royal Statistical Society 1992;155:353-93.
  7. Garratt AM, Ruta DA, Abdalla MI, Russell IT. Responsiveness of the SF-36 and a condition-specific measure of health for patients with varicose veins. Quality of Life Research 1996;5:223-34.
  8. Ruta DA, Garratt AM, Russell IT. Patient centred assessment of quality of life for patients with four common conditions. Quality in Health Care 1999;8:22-29.
  9. Bullinger M. Ensuring international equivalence of quality of life measures. In: Orley J, Kuyken W, eds. Quality of Life Assessment: International Perspectives. Berlin: Springer, 1994:33-40.
  10. Leplege A, Verdier A. The adaptation of health status measures: methodological aspects of the translation procedure. In Shumaker S, Berzon R, eds. The international assessment of health-related quality of life: theory, translation, measurement and analysis. Oxford: Rapid Communications of Oxford, 1995:93-101.
  11. Sprangers MA, Cull A, Bjordal K, Groenvold M, Aaronson NK. The European Organisation for Research and Treatment of Cancer approach to quality of life assessment: guidelines for developing questionnaire modules. EORTC Study Group on Quality of Life. Quality of Life Research 1993;2:287-95.
  12. Aaronson NK. Assessing the quality of life of patients in cancer clinical trials: common problems and common sense solutions. European Journal of Cancer 1992;28A:1304-7.
  13. Erickson P, Taeuber RC, Scott J. Operational aspects of quality-of-life assessment: choosing the right instrument: review article. Pharmacoeconomics 1995;7:39-48.
  14. Testa MA, Simonson DC. Assessment of quality-of-life outcomes. New England Journal of Medicine 1996;334:835-40.
  15. Jaeschke R, Singer J, Guyatt GH. Measurement of health status. Ascertaining the minimal clinically important difference. Controlled Clinical Trials 1989;10:407-15.
  16. Juniper EF, Guyatt GH, Willan A, Griffith LE. Determining a minimal important change in a disease-specific quality of life questionnaire. Journal of Clinical Epidemiology 1994;47:81-87.
  17. Stewart AL, Greenfield S, Hays RD, Wells K, Rogers WH, Berry SD, McGlynn EA, Ware JE. Functional status and well-being of patients with chronic conditions: results from the medical outcomes study. Journal of the American Medical Association 1989;262:907-13.
  18. Garratt AM, Ruta DA, Abdalla MI, Russell IT. The SF-36 health survey questionnaire: ii responsiveness to changes in health status in four common clinical conditions. Quality in Health Care 1994;3:186-92.
  19. EuroQol Group. EuroQol - a new facility for the measurement of health related quality of life. Health Policy 1990;16:199-208.
  20. Feeny DH, Furlong W, Boyle M, Torrance GW. Multi-attribute health status classification systems: Health Utilities Index. Pharmacoeconomics 1995;7:490-502.
  21. Hunt SM, McEwen J, McKenna SP. Measuring health status: a new tool for clinicians and epidemiologists. Journal of the Royal College of General Practitioners 1985;35:185-8.
  22. Bergner M, Bobbitt RA, Carter WB, Gilson BS. The Sickness Impact Profile: development and final revision of a health status measure. Medical Care 1981;19:787-805.
  23. Bindman AB, Keane D, Lurie N. Measuring health changes among severely ill patients: the floor phenomenon. Medical Care 1990;28:1142-52.
  24. Gardiner PV, Sykes HR, Hassey GA, Walker DJ. An evaluation of the Health Assessment Questionnaire in long-term follow-up of disability in rheumatoid arthritis. British Journal of Rheumatology 1993;32:724-8.
  25. Stucki G, Daltroy L, Katz JN, Johannesson M, Liang MH. Interpretation of change scores in ordinal clinical scales and health status measures: the whole may not equal the sum of the parts. Journal of Clinical Epidemiology 1996;49:711-7.
  26. Tennant A, Hillman M, Fear J, Pickering A, Chamberlain MA. Are we making the most of the Stanford Health Assessment Questionnaire? British Journal of Rheumatology 1996;35:574-8.
  27. Garratt AM in collaboration with UKBEAM. Rasch analysis of the Roland Disability Questionnaire. Spine 2003;28:79-84.
  28. Cronbach LJ. Coefficient alpha and the internal structure of tests. Psychometrika 1951; 16:297-334.
  29. Streiner DL, Norman GR. Health measurement scales: a practical guide to their development and use. 2nd edn. Oxford: Oxford University Press, 1995.
  30. Guyatt GH, Cook DJ. Health status, quality of life and the individual. Journal of the American Medical Association 1994;272:630-1.
  31. Lomas J, Pickard L, Mohide A. Patient versus clinician item generation for quality-of-life measures. The case of language-disabled adults. Medical Care 1987;25:764-9.
  32. Ware JE, Kosinski M, Keller SD. A 12-item short-form health survey. Construction of scales and preliminary tests of validity and reliability. Medical Care 1995;34:220-33.
  33. McDowell I, Jenkinson C. Development standards for health measures. Journal of Health Services Research and Policy 1996;1:238-46.
  34. Jolliffe IT, Morgan BJT. Principal component analysis and exploratory factor analysis. Statistical Methods in Medical Research, 1992;1:69-95.
  35. Garratt AM, Hutchinson A, Russell IT. The UK version of the Seattle Angina Questionnaire (SAQ-UK): reliability, validity and responsiveness. Journal of Clinical Epidemiology 2001;54:907-915.
  36. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Medical Care 1989;27:MS178-89.
  37. Liang MH, Fossel AH, Larson MG. Comparisons of five health status instruments for orthopedic evaluation. Medical Care 1990;28:632-42.
  38. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. Journal of Chronic Diseases 1987;40:171-8.


Dept. of Public Health

University of Oxford
Old Road Campus
Oxford OX3 7LF
