Objective: To evaluate the interrater reproducibility of scientific abstract review.
Design: Retrospective analysis.
Setting: Review for the 1991 Society of General Internal Medicine (SGIM) annual meeting.
Subjects: 426 abstracts in seven topic categories evaluated by 55 reviewers.
Measurements: Reviewers rated abstracts from 1 (poor) to 5 (excellent), globally and on three specific dimensions: interest to the SGIM
audience, quality of methods, and quality of presentation. Each abstract was reviewed by five to seven reviewers. Each reviewer’s
ratings of the three dimensions were added to compute that reviewer’s summary score for a given abstract. The mean of all reviewers’ summary scores for an abstract, the final score, was used by SGIM to select abstracts for the meeting.
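The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example ratings are hypothetical and are not taken from the study data.

```python
from statistics import mean


def summary_score(interest, methods, presentation):
    """Sum of one reviewer's three dimension ratings (each from 1 to 5)."""
    for rating in (interest, methods, presentation):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must lie between 1 and 5")
    return interest + methods + presentation


def final_score(reviews):
    """Mean of all reviewers' summary scores for one abstract."""
    return mean(summary_score(*review) for review in reviews)


# Hypothetical ratings from five reviewers: (interest, methods, presentation)
example_reviews = [(4, 3, 4), (3, 3, 3), (5, 4, 4), (2, 3, 3), (4, 4, 3)]
example_final = final_score(example_reviews)
```

Because each dimension is rated from 1 to 5, a summary score (and therefore a final score) can only fall between 3 and 15.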
Results: Final scores ranged from 4.6 to 13.6 (mean=9.9). Although 222 abstracts (52%) were accepted for publication, the 95% confidence
interval around the final score of 300 (70.4%) of the 426 abstracts overlapped with the threshold for acceptance of an abstract.
Thus, these abstracts were potentially misclassified. Only 36% of the variance in summary scores was associated with an abstract’s
identity, 12% with the reviewer’s identity, and the remainder with idiosyncratic reviews of abstracts. Global ratings were
more reproducible than summary scores.
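The potential-misclassification check reported above amounts to asking whether an abstract's 95% confidence interval contains the acceptance threshold. The abstract does not state how the intervals were computed, so the sketch below assumes a t-based interval around the mean of the reviewers' summary scores. The function name, the example scores, and the threshold value are all hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% t critical values for 4, 5, and 6 degrees of freedom,
# matching five to seven reviewers per abstract.
T_CRIT = {4: 2.776, 5: 2.571, 6: 2.447}


def ci_overlaps_threshold(summary_scores, threshold):
    """True if the assumed t-based 95% CI of the mean contains the threshold."""
    n = len(summary_scores)
    half_width = T_CRIT[n - 1] * stdev(summary_scores) / sqrt(n)
    center = mean(summary_scores)
    return center - half_width <= threshold <= center + half_width


example_threshold = 10.0  # hypothetical cutoff, not reported in the abstract

# Reviewers disagree: the wide interval straddles the threshold.
uncertain = ci_overlaps_threshold([11, 9, 13, 8, 11], example_threshold)

# Reviewers agree on a high score: the interval lies above the threshold.
clear_accept = ci_overlaps_threshold([14, 15, 14, 15, 14], example_threshold)
```

With only five to seven reviewers per abstract, even moderate disagreement produces an interval several points wide on a 3-to-15 scale, which is consistent with most abstracts overlapping the threshold.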
Conclusion: Reviewers disagreed substantially when evaluating the same abstracts. Future meeting organizers may wish to rank abstracts
using global ratings, and to experiment with structured review criteria and other ways to improve raters’ agreement.
Key words: peer review; abstracts; interrater reliability; judgment; agreement; psychometrics; analysis of variance; general internal medicine; research
Drs. Steinberg and Rubin co-chaired the Abstract Selection Committee for the SGIM’s 14th annual meeting.