Following the identification of literature pertaining to instruments, it is important that users apply the necessary criteria to select the most suitable instrument(s). Eight criteria should be considered in the selection of Patient Reported Outcome Measures (PROMs): appropriateness, acceptability, feasibility, interpretability, precision, reliability, validity, and responsiveness. The first of these asks: is the instrument content appropriate to the questions that the application seeks to address?
These criteria are not precisely or uniformly described in the literature, nor can they be prioritised in terms of importance; rather, they should be considered in relation to the proposed application. Further information relating to these criteria can be found in the following report:
Fitzpatrick R, Davey C, Buxton MJ, Jones DR. Evaluating patient-based outcome measures for use in clinical trials. Health Technology Assessment 1998;2(14). Available free from http://www.hta.ac.uk/fullmono/mon214.pdf
Appropriateness is the extent to which instrument content is appropriate to the particular application.
Careful consideration should be given to the aims of the application, with reference to the areas of health concern (i.e. which dimensions of health are important), the nature of the patient group, and the content of possible instruments [1-3]. It is difficult to give general recommendations as to what makes an instrument appropriate for a given application, because this ultimately depends on the users' specific questions and the content of instruments.
Instrument selection is often dominated by psychometric considerations of reliability and validity, with insufficient attention given to the content of instruments. The names of instruments and their constituent scales or dimensions should not be taken at face value. Users should consider the content of individual items within instruments.
Consideration must also be given to the measurement objective, which is closely related to the proposed application. PROMs can have three broad measurement objectives: discrimination, evaluation, and prediction. Discrimination is concerned with the measurement of differences between patients when there is no external criterion available to validate the instrument. For example, measures of psychological well-being have been developed to identify individuals suffering from anxiety and depression. Evaluation is concerned with the measurement of changes over time. For example, PROMs administered before and after treatment are used as outcome measures in clinical trials. Prediction is concerned with classifying patients when a criterion is available to determine whether the classification is correct. For example, PROMs may be used in diagnosis and screening as a means of identifying individuals for suitable forms of treatment.
The three measurement objectives are not necessarily mutually exclusive. However, before they can be considered appropriate, instruments must undergo testing that is tailored to the measurement objectives that are relevant to the proposed application. Individual items and scales should be examined to determine whether they concord with the measurement objective. Discrimination and evaluation may be complementary if both are concerned with the measurement of differences that are clinically important, be they cross-sectional or longitudinal. However, an item that asks about family history of a particular disease may be useful for determining which patients have the disease but will be inappropriate for evaluation.
It is also important to consider how broad a measure of health is required. Specific instruments can have a very restricted focus on symptoms and signs of disease, but may also take account of the impact of disease on quality of life. Generic instruments measure broader aspects of health and quality of life that are of general importance. Where feasible, it is recommended that both specific and generic instruments be used to measure health outcomes [6,7]. In this way the most immediate effects of treatment on disease should be captured, as well as possible consequences that are harder to anticipate.
Acceptability is the extent to which an instrument is acceptable to patients. Indicators of acceptability include administration time, response rates, and levels of missing data. There are a number of factors that can influence acceptability, including the mode of administration, questionnaire design, and the health status of respondents. The format of patient-reported instruments can also influence acceptability. For example, the task faced by respondents completing individualised instruments is usually more difficult than that for instruments based on summated rating scales. General features of layout, appearance, and legibility are thought to be important influences on acceptability.
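Two of the acceptability indicators named above, response rates and levels of missing data, are straightforward to quantify. The sketch below, using invented questionnaire data, shows one minimal way of doing so; the function name and data layout are illustrative assumptions, not part of any standard.

```python
# Sketch: two acceptability indicators -- response rate and level of
# missing data -- computed for a hypothetical questionnaire study.
# None marks an item the respondent skipped.

def acceptability_summary(responses, n_invited):
    """responses: completed questionnaires, each a list of item answers
    with None for a missing item; n_invited: questionnaires distributed."""
    response_rate = len(responses) / n_invited
    n_items = sum(len(r) for r in responses)
    n_missing = sum(1 for r in responses for item in r if item is None)
    missing_rate = n_missing / n_items
    return response_rate, missing_rate

# Hypothetical example: 4 of 5 invited patients responded; one item skipped.
completed = [[3, 2, 4], [1, None, 2], [5, 5, 5], [2, 3, 1]]
rr, mr = acceptability_summary(completed, n_invited=5)
print(rr)  # 0.8
print(mr)  # 1 of 12 items missing, about 0.083
```

High missing-data rates on particular items can then prompt the kind of item-level review of acceptability described above.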
To be acceptable, the instrument must be presented in a language that is familiar to respondents. Guidelines are available that are designed to ensure a high standard of translation [9,10]. These guidelines recommend the comparison of several independent translations, back translation, and the testing of acceptability of new translations.
Issues of acceptability should be considered at the design stage of instrument and questionnaire development. Patients' views about a new instrument should be obtained at the pre-testing phase, prior to formal tests of instrument measurement properties, including reliability. Patients can be asked, by means of additional questions or semi-structured interview, whether they found any questions difficult or distressing.
Feasibility concerns the ease of administration and processing of an instrument. These are important considerations for staff and researchers who collect and process the information produced by patient-reported instruments [12,13]. Instruments that are difficult to administer and process may jeopardise the conduct of research and disrupt clinical care. An obvious example is the additional resources required for interviewer administration over self-administration. The complexity and length of an instrument will have implications for the form of administration. Staff training needs must be considered before undertaking interviewer administration. Staff may also have to be available within the clinic to help patients who have difficulty with self-administration. Finally, staff attitudes and acceptance of patient-reported instruments can make a substantial difference to respondent acceptability.
Interpretability concerns the meaningfulness of scores produced by an instrument. To some extent, lack of familiarity with the use of instruments may be a hindrance to interpretation. Three approaches to interpretation have been proposed. First, changes in instrument scores have been compared with previously documented change scores produced by the same instrument for major life events, such as loss of a job. Secondly, attempts have been made to identify the minimal clinically important difference (MCID), which is equal to the smallest change in instrument scores that is perceived as beneficial by patients [15,16]. External judgements, including summary items such as health transition questions, are used to determine the MCID. Thirdly, normative data from the general population can be used to interpret scores from generic instruments [17,18]. The standardisation of instrument scores is an extension of this form of interpretation that allows score changes to be expressed in terms of the score distribution for the general population.
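The third approach above, standardising scores against normative data, amounts to expressing a score change as a z-score relative to the general-population distribution. A minimal sketch, in which the population mean and standard deviation are hypothetical rather than taken from any published instrument:

```python
# Sketch: standardising an observed change score against a hypothetical
# general-population score distribution (a simple z-score).

def standardised_score(score, pop_mean, pop_sd):
    """Express a score in units of the population standard deviation."""
    return (score - pop_mean) / pop_sd

change = 10.0                  # observed change in instrument score
pop_mean, pop_sd = 0.0, 8.0    # hypothetical population change distribution
z = standardised_score(change, pop_mean, pop_sd)
print(z)  # 1.25 -- the change is 1.25 population SDs above the norm
```

Expressed this way, score changes on different generic instruments can be compared on a common scale.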
Precision concerns the number and accuracy of distinctions made by an instrument. There are a number of aspects to the issue of precision, which relate to methods of scaling and scoring items, and the distribution of items over the range of the construct being measured.
The scaling of items within instruments has important implications for precision. The binary 'yes'/'no' response is the simplest form of response category, but it does not allow respondents to report degrees of difficulty or severity. The majority of instruments use adjectival or Likert-type scales, such as: strongly agree, agree, uncertain, disagree, strongly disagree. Visual analogue scales appear to offer greater precision, but there is insufficient evidence to support this, and they may be less acceptable to respondents.
There are a number of instruments that incorporate weighting systems, the most widely used being preferences or values derived from the general public for utility measures such as the EuroQol EQ-5D and the Health Utilities Index. Weighting schemes have also been applied to instruments based on summated rating scales, including the Nottingham Health Profile and the Sickness Impact Profile. Such weighting schemes may seem deceptively precise and should be examined for evidence of reliability and validity.
The items and scores of different instruments may vary in how well they capture the full range of the underlying construct being measured. End effects occur when a large proportion of respondents score at the floor or ceiling of the score distribution. If a large proportion of items have end effects then instrument scores will be similarly affected. End effects are evidence that an instrument may be measuring a restricted range of a construct and may limit both discriminatory power and responsiveness [23,24].
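End effects of the kind described above are usually reported as the proportion of respondents scoring at the floor or ceiling of the scale. A minimal sketch with invented scores on a hypothetical 0-100 scale:

```python
# Sketch: detecting floor and ceiling (end) effects as the proportion of
# respondents at the extremes of the score range. Data are hypothetical.

def end_effects(scores, floor, ceiling):
    """Return (proportion at floor, proportion at ceiling)."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s == floor) / n
    at_ceiling = sum(1 for s in scores if s == ceiling) / n
    return at_floor, at_ceiling

# Hypothetical sample showing a marked ceiling effect.
scores = [100, 100, 100, 95, 80, 100, 60, 100, 100, 40]
f, c = end_effects(scores, floor=0, ceiling=100)
print(f, c)  # 0.0 0.6 -- 60% at the ceiling suggests a restricted range
```

A large proportion at either extreme, as here, is the warning sign that the instrument may lack discriminatory power and responsiveness in that region of the construct.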
The application of Item Response Theory (IRT) can further help determine the precision of an instrument. IRT assumes that a measurement construct, such as physical disability, can be represented by a hierarchy that ranges from the minimum to the maximum level of disability. IRT has shown that a number of instruments have items concentrated around the middle of the hierarchy, with relatively fewer items positioned at the ends [25-27]. The scores produced by such instruments are not only a function of the health status of patients but also of the imprecision of measurement.
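The core IRT idea above can be illustrated with the simplest one-parameter logistic (Rasch-type) model, in which the probability of endorsing an item depends only on the gap between the respondent's position on the construct (theta) and the item's position on the hierarchy (its difficulty, b). The values below are hypothetical, and real IRT analyses estimate these parameters from data rather than assuming them:

```python
import math

# Sketch: a one-parameter logistic (Rasch-type) item response function.
# theta: respondent's level on the construct; b: item position (difficulty).

def p_endorse(theta, b):
    """Probability of endorsing the item under a 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A respondent in the middle of the hierarchy (theta = 0) against items
# positioned low, middle, and high on the same hierarchy:
for b in (-2.0, 0.0, 2.0):
    print(round(p_endorse(0.0, b), 3))  # 0.881, 0.5, 0.119
```

An instrument whose items all sit near b = 0 measures mid-range respondents precisely but, as the section notes, leaves the ends of the hierarchy poorly covered.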
Reliability concerns whether an instrument is internally consistent or reproducible, and it assesses the extent to which an instrument is free from measurement error. It may be regarded as the proportion of a score that is signal rather than noise. As the measurement error of an instrument increases, so does the sample size required to obtain precise estimates of the effects of an intervention.
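The link between reliability and sample size can be sketched numerically. Under the classical-test-theory view above, measurement error attenuates the observed effect size by a factor of roughly the square root of the reliability, so the sample size needed for a given power inflates by roughly 1/reliability. This is an approximation, and the figures below are purely illustrative:

```python
# Sketch: approximate sample-size inflation caused by unreliability.
# Observed effect sizes shrink by sqrt(reliability), and since required n
# scales with 1/effect_size**2, n inflates by about 1/reliability.

def inflated_n(n_perfect, reliability):
    """Approximate n needed with an imperfect measure, relative to
    n_perfect for a hypothetical error-free measure."""
    return n_perfect / reliability

print(inflated_n(100, 1.0))  # 100.0 -- no inflation
print(inflated_n(100, 0.7))  # about 143 -- 43% more patients needed
print(inflated_n(100, 0.5))  # 200.0 -- twice as many
```

The practical point is the one made above: noisier instruments demand larger studies for the same precision.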
Internal consistency is measured with a single administration of an instrument and assesses how well items within a scale measure a single underlying dimension. Internal consistency is usually assessed using Cronbach's alpha, which measures the overall correlation between items within a scale. Caution should be exercised in the interpretation of alpha because its size is dependent on the number of items as well as the level of correlation between items. Furthermore, very high levels of correlation between items may indicate redundancy and the possibility that items are measuring a very narrow aspect of a construct.
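Cronbach's alpha has a standard closed form: alpha = k/(k-1) × (1 − sum of item variances / variance of the total score), for k items. A minimal sketch with invented item scores (here using population variances; sample variances are also common in practice):

```python
from statistics import pvariance

# Sketch: Cronbach's alpha for a scale, computed from the standard formula
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).

def cronbach_alpha(items):
    """items: one list of scores per item, all over the same respondents."""
    k = len(items)
    item_vars = sum(pvariance(item) for item in items)
    totals = [sum(vals) for vals in zip(*items)]  # each respondent's total
    return (k / (k - 1)) * (1 - item_vars / pvariance(totals))

# Three strongly correlated hypothetical items over five respondents:
items = [[1, 2, 3, 4, 5],
         [2, 2, 3, 4, 5],
         [1, 3, 3, 4, 4]]
print(round(cronbach_alpha(items), 3))  # 0.955
```

Note how the formula depends on k as well as on the inter-item correlations, which is exactly the reason for the caution urged above: lengthening a scale raises alpha even without improving item quality.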
Reproducibility assesses whether an instrument produces the same results on repeated administrations when respondents have not changed. This is assessed by test-retest reliability. There is no exact agreement about the length of time between administrations but in practice it tends to be between 2 and 14 days. The reliability coefficient is normally calculated by correlating instrument scores for the two administrations. It is recommended that the intra-class correlation coefficient be used in preference to Pearson's correlation coefficient, which fails to take sufficient account of systematic error.
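The advantage of the intra-class correlation over Pearson's r can be shown numerically. The sketch below implements a simple one-way random-effects ICC from the usual ANOVA mean squares; the data are hypothetical, with the retest shifted systematically upward by 5 points:

```python
# Sketch: one-way random-effects intra-class correlation for test-retest
# data. Unlike Pearson's r, a systematic shift between the two
# administrations lowers the ICC.

def icc_oneway(test, retest):
    """ICC(1,1) from between- and within-subject mean squares, k = 2."""
    n, k = len(test), 2
    pairs = list(zip(test, retest))
    grand = sum(test + retest) / (n * k)
    subj_means = [(a + b) / 2 for a, b in pairs]
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((a - m) ** 2 + (b - m) ** 2
              for (a, b), m in zip(pairs, subj_means)) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

test   = [10, 20, 30, 40, 50]
retest = [15, 25, 35, 45, 55]   # perfectly correlated but shifted by 5
print(round(icc_oneway(test, retest), 3))  # 0.951 -- the shift is
                                           # penalised; Pearson's r is 1.0
```

Here Pearson's r would report perfect reproducibility despite every retest score being 5 points higher, which is the systematic error the section warns about.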
Reliability estimates of 0.7 and 0.9 are recommended for instruments that are to be used in groups and individual patients respectively. Reliability is not a fixed property and must be assessed in relation to the specific population and context.
Validity is the extent to which an instrument measures what is intended. Validity can be assessed qualitatively through an examination of instrument content, and quantitatively through factor analysis and comparisons with related variables. As with reliability, validity should not be seen as a fixed property and must be assessed in relation to the specific population and measurement objectives.
Content and face validity assess whether items adequately address the domain of interest. They are qualitative matters of judging whether an instrument is suitable for its proposed application. Face validity is concerned with whether an instrument appears to be measuring the domain of interest. Content validity is a judgement about whether instrument content adequately covers the domain of interest. There is increasing evidence that items within instruments tend to be concentrated around the middle of the scale hierarchy with relatively fewer items at the extremes representing lower and higher levels of health. Instrument content should be examined for relevance to the application and for adequate coverage of the domain of interest.
Further evidence can be obtained from considering how the instrument was developed. This includes the extent of involvement in instrument development of experts with relevant clinical or health status measurement experience. More importantly, consideration should be given to the extent of involvement of patients in the generation and confirmation of instrument content.
Validity testing should also involve some quantitative assessment. Criterion validity is assessed when an instrument is correlated with another instrument or measure that is regarded as a more accurate, or criterion, variable. Within the field of patient-reported health measurement it is rarely the case that a criterion or 'gold standard' measure exists that can be used to test the validity of an instrument. There are two exceptions. The first is when an instrument is reduced in length, with the longer version used as the 'gold standard' to develop the short version. Scores for the short and long versions of the instrument are compared, the objective being a very high level of correlation. Secondly, instruments that have the measurement objective of prediction have a gold standard available, either concurrently or in the future. For example, the criterion validity of an instrument designed to predict the presence of a particular disease can be assessed through a comparison with the results of diagnosis.
In the absence of a criterion variable, validity testing takes the form of construct validation. PROMs are developed to measure some underlying construct such as physical functioning or pain. On the basis of current understanding, such constructs can be expected to have a set of quantitative relationships with other constructs. For example, patients experiencing more severe pain may be expected to take more analgesics. Construct validity is assessed by comparing the scores produced by an instrument with sets of variables. To facilitate the interpretation of results, expected levels of correlation should be specified at the outset of studies.
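The pain/analgesics example above can be sketched numerically: correlate the instrument's scores with the related variable and check the result against a correlation hypothesis specified in advance. All numbers below are invented for illustration:

```python
from statistics import mean

# Sketch: construct validation by correlating hypothetical pain scores
# with analgesic use, against a prespecified minimum correlation.

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

pain       = [2, 4, 5, 7, 8, 9]   # hypothetical instrument pain scores
analgesics = [0, 1, 2, 2, 3, 4]   # hypothetical tablets taken per day
r = pearson_r(pain, analgesics)

expected_min = 0.4                # hypothesis stated before the study
print(round(r, 2), r >= expected_min)  # 0.96 True
```

Stating `expected_min` before seeing the data is the point of the recommendation above: the observed correlation then confirms or refutes a prediction rather than being interpreted after the fact.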
Many instruments are multidimensional and measure several constructs, including physical functioning, mental health, and social functioning. These constructs should be considered when assessing construct validity, as should the expected relationships with sets of variables. Furthermore, the internal structure of such instruments can be assessed by methods of construct validation. Factor analysis and principal component analysis provide empirical support for the dimensionality or internal construct validity of an instrument. These statistical techniques are used to identify separate health domains within an instrument.
Responsiveness is concerned with the measurement of important changes in health and is therefore relevant when instruments are to be used in an evaluative context for the measurement of health outcomes. Just as with reliability and validity, estimates of responsiveness are related to applications within specific populations and are not an inherent or fixed property of an instrument.
Responsiveness is usually assessed by examining changes in instrument scores for groups of patients whose health is known to have changed. This may follow an intervention of known efficacy. Alternatively, patients may be asked how their current health compares to some previous point in time by means of a health transition question. There is no single agreed method of assessing responsiveness and a number of statistical techniques are used for quantifying responsiveness.
The effect size statistic is equal to the mean change in instrument scores divided by the baseline standard deviation. The standardised response mean is equal to the mean change in scores divided by the standard deviation of the change in scores. The modified standardised response mean, sometimes referred to as the index of responsiveness, is equal to the mean change in scores divided by the standard deviation of change scores in stable subjects. The denominator for the latter can be derived from the test-retest method of reliability testing.
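The three responsiveness statistics defined above differ only in their denominator, which the following sketch makes explicit. The before/after scores and the stable-subject change scores are hypothetical, and the sample standard deviation is used throughout:

```python
from statistics import mean, stdev

# Sketch: the three responsiveness statistics defined above, for
# hypothetical pre/post treatment scores. Each shares the numerator
# (mean change) and differs only in the denominator.

def changes(baseline, follow_up):
    return [f - b for b, f in zip(baseline, follow_up)]

def effect_size(baseline, follow_up):
    """Mean change / SD of baseline scores."""
    return mean(changes(baseline, follow_up)) / stdev(baseline)

def standardised_response_mean(baseline, follow_up):
    """Mean change / SD of the change scores."""
    ch = changes(baseline, follow_up)
    return mean(ch) / stdev(ch)

def modified_srm(baseline, follow_up, stable_changes):
    """Mean change / SD of change scores in stable subjects,
    e.g. taken from the test-retest reliability study."""
    return mean(changes(baseline, follow_up)) / stdev(stable_changes)

before = [40, 50, 45, 55, 60]
after  = [50, 65, 50, 70, 70]
stable = [1, -2, 0, 2, -1]   # retest change scores in stable patients
print(round(effect_size(before, after), 2))               # 1.39
print(round(standardised_response_mean(before, after), 2))  # 2.63
print(round(modified_srm(before, after, stable), 2))        # 6.96
```

As the example shows, the choice of denominator can change the apparent responsiveness severalfold for the same data, which is one reason no single statistic has been agreed upon.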