Total agreement between independent evaluators on forensic mental health evaluations was less than twenty percent. This is the bottom line of a recently published article in Psychological Assessment. Below is a summary of the research and findings as well as a translation of this research into practice.

Diagnostic Field Reliability in Forensic Mental Health Evaluations


W. Neil Gowensmith, University of Denver
Stephanie N. Sessarego, University of Denver
Meghan K. McKee, University of Denver
Samantha Horkott, University of Denver
Nina MacLean, University of Denver
Katherine E. McCallum, University of Denver


How likely are multiple forensic evaluators to agree on defendants’ diagnoses in routine forensic mental health evaluations? A total of 720 evaluation reports were examined from 240 cases in which 3 evaluators, working independently, provided diagnoses for the same defendant. Results revealed perfect agreement across 6 independent diagnostic categories in 18.3% of cases. Agreement for individual diagnostic categories was higher, with all 3 evaluators agreeing on the separate presence of psychotic, mood, or substance disorders in more than 64.7% of cases and agreeing on the presence of cognitive or developmental disorders in more than 89.7% of cases. However, evaluators agreed about the combination of psychotic and substance-related diagnoses in only 46.5% of cases. Agreement was enhanced by diagnoses with low base rates, and it was suppressed in evaluations conducted in jails. Psychiatrists and contracted evaluators were more likely to provide dissenting diagnostic categories than psychologists and state-employed evaluators. These results are among the first to document diagnostic agreement among nonpartisan practitioners in forensic evaluations conducted in the field, and they allow for practice and policy recommendations for evaluators in routine forensic practice to be made.


Summary of the Research

“Many studies have examined the diagnostic reliability of individual disorders, as well as the criterion validity of each DSM edition, but fewer have examined reliability in the field. A significant body of literature suggests that practitioners do not always demonstrate the same levels of reliability and validity in their daily practice as is found in laboratory conditions. This highlights the importance of diagnostic field reliability, or the reliability of diagnoses made by professionals on cases in the course of their regular practice.” (p. 692-693).

“Researchers in Australia, assessing agreement of psychiatric diagnosis in forensic cases, found good interrater reliability on diagnoses of brain injuries, substance-induced disorders, and intellectual disabilities, but they also reported poor to moderate agreement on diagnosis of depressive, personality, and anxiety disorders. Overall, diagnostic reliability for mental health disorders has been described as “substandard.”” (p. 693).

“Inherent in the psycholegal constructs of competency and sanity is the diagnostic category of the person under evaluation… An evaluator who misdiagnoses a defendant could run the risk of inaccurately opining that defendant CST or not guilty by reason of insanity (or vice versa). Accurate diagnostic categorization is key in forensic mental health evaluation.” (p. 693).

“One state’s forensic system is arranged in a manner that is ideal for a naturalistic study of evaluator agreement. Hawaii Revised Statute 704–404 states that in felony cases requiring FMHAs, “the court shall appoint three qualified examiners” to evaluate the defendant. These examiners are neutral (not retained by defense or prosecution) and certified to conduct forensic evaluations by the state’s department of health (DOH). In each case involving CST and legal insanity, the court requires that the evaluator provide a “diagnosis of the physical or mental condition” of the person being evaluated (Hawaii Revised Statutes, 2013). […] The present study investigates diagnostic reliability in forensic mental health evaluations conducted in the field.” (p. 694).

Measures used: “A total of 257 evaluation reports were submitted by psychiatrists (35.7%) compared with 463 by psychologists (64.3%). A total of 7 psychologists working for the state of Hawaii’s DOH submitted 240 reports (33.3%), compared with 33 contracted evaluators in independent practice who collectively submitted 480 reports (66.67%) […] Diagnoses were coded into the following categories: psychotic, cognitive, mood, personality, substance-related disorders, and developmental disorders/mental retardation (DD/MR).” (p. 694).

Results: “In 44 (18.3%) of the 240 cases, all three evaluators agreed on every diagnostic category, even when the defendant was diagnosed with multiple disorders. Results showed an agreement rate of 71.8% for psychotic disorders (n = 173), 64.7% for substance abuse (n = 156), and 65.2% for mood disorders (n = 157). Agreement was highest for diagnoses of DD/MR (n = 231; 95.9%) and cognitive disorders (n = 216). Diagnostic reliability was lowest for personality disorders; the evaluators agreed on 149 (61.8%) of the 240 cases.” (p. 695).
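The agreement rates reported above can be illustrated with a minimal sketch. It assumes each report is reduced to a present/absent coding for each of the six categories, which matches the coding scheme described in the study but is not the authors' actual analysis code. The function names, the category keys, and the toy cases are hypothetical and are not data from the article.

```python
from typing import Dict, List

# The six diagnostic categories named in the study (key names are illustrative).
CATEGORIES = ["psychotic", "cognitive", "mood", "personality", "substance", "dd_mr"]

# A report maps each category to True (coded present) or False (coded absent).
Report = Dict[str, bool]
# A case is the list of reports written independently for one defendant.
Case = List[Report]


def make_report(*present: str) -> Report:
    """Build a report in which only the named categories are coded present."""
    return {category: category in present for category in CATEGORIES}


def unanimous(case: Case, category: str) -> bool:
    """True if every evaluator gave the same present/absent coding."""
    return len({report[category] for report in case}) == 1


def category_agreement_rate(cases: List[Case], category: str) -> float:
    """Share of cases with unanimous coding for one category."""
    return sum(unanimous(case, category) for case in cases) / len(cases)


def perfect_agreement_rate(cases: List[Case]) -> float:
    """Share of cases with unanimous coding on all six categories at once."""
    agreed = sum(
        all(unanimous(case, category) for category in CATEGORIES)
        for case in cases
    )
    return agreed / len(cases)


# Toy data, invented purely for illustration (NOT from the study).
toy_cases = [
    [make_report("psychotic")] * 3,
    [make_report("psychotic", "substance"), make_report("psychotic"), make_report("substance")],
    [make_report("mood"), make_report("mood"), make_report("mood", "personality")],
    [make_report()] * 3,
]

print(category_agreement_rate(toy_cases, "psychotic"))
print(perfect_agreement_rate(toy_cases))
```

Note that agreement here means agreement on presence or absence, so three evaluators who all code a category as absent count as agreeing. Perfect agreement is a much stricter test than agreement on any single category, because one dissent on any of the six categories is enough to break it.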

Researchers “then calculated unanimous agreement rates on various combinations of psychotic, substance-related, and DD/MR diagnostic categories. Evaluators agreed unanimously on the presence or absence of psychotic and substance disorders in 46.5% (n = 109, ICC = .61) of cases, on psychotic and DD/MR disorder in 68.8% (n = 165, ICC = .52) of cases, and on substance-related and DD/MR in 62.0% (n = 150, ICC = .41) of cases. Evaluators agreed unanimously on all three diagnostic categories in 43.8% (n = 105, ICC = .26) of cases.” Furthermore, “of the 44 evaluations showing perfect categorical diagnostic agreement, 13 (29.5%) were conducted in an inpatient hospital, 12 (27.3%) were conducted in jail, and 19 (43.2%) were conducted in an outpatient office … [and] perfect categorical diagnostic agreement was found significantly more often in inpatient hospitals (29.5% vs. 18.3%) and less often in correctional facilities (27.3% vs. 40.7%).” (p. 695).

In addition, “psychiatrists were disproportionately and significantly more likely to offer the lone dissenting diagnostic category.” (p. 695).

Translating Research into Practice

“Evaluators showed perfect agreement as to the presence or absence of all six diagnostic categories in only 18.3% of all cases. This percentage is much higher than would be expected by chance. However, the cases in which evaluators agreed across all diagnostic categories were relatively “simple” cases; evaluators coded for the presence of approximately 1.3 disorders in those cases.

Furthermore, although the perfect agreement rate exceeded chance, the 18.3% rate was still low. In routine practice, evaluators agreed on a defendant’s entire diagnostic picture in fewer than one of five cases.” (p. 696).
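To give a rough sense of what "expected by chance" could mean here, the sketch below computes a chance-level rate of perfect agreement under a deliberately simple model: every evaluator codes each category as present independently, with a probability equal to that category's base rate, and the categories are independent of one another. The article does not describe how the authors derived their chance comparison, so this model is an assumption, and the base rates below are hypothetical placeholders rather than figures from the study.

```python
from math import prod
from typing import Dict


def chance_unanimous(base_rate: float, n_raters: int = 3) -> float:
    """Probability that all raters give the same coding for one category,
    if each independently codes 'present' with probability base_rate."""
    return base_rate ** n_raters + (1 - base_rate) ** n_raters


def chance_perfect_agreement(base_rates: Dict[str, float], n_raters: int = 3) -> float:
    """Probability of unanimous coding on every category at once,
    assuming the categories are independent of one another."""
    return prod(chance_unanimous(p, n_raters) for p in base_rates.values())


# Hypothetical base rates, invented for illustration (NOT from the study).
hypothetical_base_rates = {
    "psychotic": 0.50,
    "cognitive": 0.05,
    "mood": 0.25,
    "personality": 0.25,
    "substance": 0.50,
    "dd_mr": 0.05,
}

for name, rate in hypothetical_base_rates.items():
    print(f"{name}: {chance_unanimous(rate):.3f}")
print(f"all six categories: {chance_perfect_agreement(hypothetical_base_rates):.4f}")
```

Under this toy model, rare categories produce high chance agreement simply because evaluators mostly agree that the disorder is absent, which fits the abstract's observation that agreement was enhanced by diagnoses with low base rates. The product across six categories also shrinks quickly, which is one way to see how a rate as low as 18.3% can still exceed chance.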

“[E]valuators agreed unanimously on the presence or absence of the combination of psychotic and substance-related disorders in only 46.5% of cases. This represents a substantial drop-off from the diagnostic agreement rates for psychotic and substance-related diagnostic categories on their own (71.8% and 64.7%, respectively). It appears that one of the biggest hurdles regarding diagnostic agreement in forensic evaluations is that evaluators show lower rates of agreement in cases involving the potential of psychosis and substance abuse; however, these two diagnostic categories are critically important in most pretrial forensic evaluations.” (p. 697).

“In terms of field reliability, this means that evaluators reach a consensus on the most pertinent diagnostic categories for pretrial evaluations in fewer than half of all pretrial cases. This low level of agreement is likely to have serious implications for the psycholegal opinions made by the evaluators, and, in turn, the ultimate judicial dispositions made by the court. Evaluators that disagree on the presence of psychosis or substance-related disorders are more likely to disagree on the psycholegal opinions of competency and sanity.” (p. 697).

“[C]onducting evaluations within the first 2 weeks is more likely to lead to disagreement on diagnoses and the psycholegal issues under consideration and that disagreement dissipates beyond the 15-day time frame. […] Defendants evaluated later may present with more reliable, stable symptoms.” (p. 697).

Other Interesting Tidbits for Researchers and Clinicians

“Overall, field reliability of most forensic evaluations is poor to mediocre (Fuger, Acklin, Nguyen, Ignacio, & Gowensmith, 2014; Robinson & Acklin, 2010). These authors have suggested that low reliability may partially stem from diagnostic uncertainty; however, this hypothesis has yet to be empirically tested.

Furthermore, some recent studies have hinted that diagnostic categories can affect the overall reliability of evaluator opinions in forensic cases. In cases involving the evaluation of legal sanity, evaluators were more likely to disagree on the defendant’s sanity in cases in which defendants were diagnosed with substance-related disorders (Gowensmith, Murrie, & Boccaccini, 2013b).

Conversely, in the same study, evaluators were more likely to agree on defendants’ legal sanity when defendants were diagnosed with psychotic disorders.” (p. 693-694).

“[D]espite the near certainty of some diagnostic disagreement, at some point the amount of disagreement becomes unacceptable. That level is not defined in forensic mental health and is probably fodder for setting standards of practice in the field; where that threshold will ultimately be set is unknown.” (p. 698).

“[U]nanimous agreement on all diagnostic categories was found in fewer than one case out of five. Work should be done to improve this low level of diagnostic field reliability.” (p. 698).


