Standardization Phase
The information gleaned from the pilot phase informed the actions taken in the standardization phase, such as revising or removing items to improve test performance. As in the pilot phase, items comprising the standardization version of the revision were digitally administered remotely via an email link for the Parent and Teacher forms, while youth completed the Self-Report form digitally and in person with a trained administrator. Parents, teachers, and youth who were 18 years of age provided informed consent to participate in this data collection effort. For youth under 18 years of age, informed consent was obtained from the parent/guardian and assent was provided by the youths themselves. All participants received monetary compensation for their participation.
A total of 3,257 parent ratings, 2,881 teacher ratings, and 1,589 youth self-reported ratings were collected during the standardization phase and are included in the Total Samples. Data were collected from both the U.S. and Canada. Table 6.5 displays the demographic characteristics of the rated youth in the Total Sample for Parent, Teacher, and Self-Report forms, and Table 6.6 displays the demographic characteristics of the raters. The Total Sample includes a General Population group of youth who were not diagnosed with any mental health disorders (N = 2,518 for Parent; N = 2,379 for Teacher; N = 1,247 for Self-Report) and a Clinical group of youth who have a confirmed mental health condition. The Clinical group consists in part of youth with ADHD (N = 560 for Parent, N = 321 for Teacher, N = 229 for Self-Report); of these youth, many had co-occurring mental health conditions (46.8% for Parent, 51.1% for Teacher, and 43.2% for Self-Report). The Clinical group also included youth diagnosed with disorders other than ADHD (e.g., Generalized Anxiety Disorder, Specific Learning Disorders, Major Depressive Disorder, Autism Spectrum Disorder, and Oppositional Defiant Disorder, among other less frequently reported disorders; N = 179 for Parent, N = 181 for Teacher, N = 113 for Self-Report). Similar to the pilot samples, youth in the Clinical group varied with respect to whether or not they were currently taking medication to treat their conditions, and youth were not asked to stop their medication (for detailed information, refer to the Medication Status of the ADHD Reference Samples section in chapter 7, Standardization). Clinical cases were recruited directly through clinicians, who provided details of the diagnoses to confirm their status; for the Parent sample, clinical cases were also recruited through social media channels. Parents recruited through social media channels were administered the Structured Clinical Interview for DSM Disorders (SCID-5; First, Williams, Karg, & Spitzer, 2016) to confirm their child’s reported diagnostic status.
In addition to the individual characteristics described in the tables and used for stratification in the creation of reference samples, descriptive demographics such as school type, urbanicity, and language(s) spoken were collected in order to profile the samples. Youth attended various types of schools (approximately 87.4% to 92.9% across rater forms were enrolled in public schools, and the remaining students were enrolled in private schools or indicated “other school type,” such as homeschooling). A portion of the youth resided in an urbanized area with a population of 50,000 people or more (48.4% Parent, 43.5% Teacher, 80% Self-Report), while the remaining youth lived in urban clusters (i.e., smaller cities or suburbs) or rural areas. The majority of the youth spoke only English (84.6% Parent, 81.3% Teacher, 86.0% Self-Report), while the remaining students were bilingual.
Carefully selected subsets of the Total Sample were used to create the Normative Samples and ADHD Reference Samples (see chapter 7, Standardization, for detailed sample descriptions and to see the close match of the Normative Samples to the U.S. and Canadian census figures). These subsets were used in the majority of analyses described later in this manual, whereas the Total Sample (all General Population and Clinical cases combined) was used for factor analyses and IRT modeling during the standardization phase and was used in the analysis of the fairness of the Conners 4 scores (see chapter 10), as well as in the development of the Conners 4–Short (see chapter 11). These samples, as a whole and as subsets, were used to determine the final set of items for the Conners 4.
Table 6.5. Demographic Characteristics of Rated Youth: Conners 4 Total Samples
| Demographic | Parent | Teacher | Self-Report | ||||
| N | % | N | % | N | % | ||
| Gender | Male | 1,710 | 52.5 | 1,474 | 51.2 | 788 | 49.6 | 
| Female | 1,541 | 47.3 | 1,406 | 48.8 | 797 | 50.2 | |
| Other | 6 | 0.2 | 1 | < 0.1 | 4 | 0.3 | |
| U.S. Race/Ethnicity | Hispanic | 520 | 16.0 | 411 | 14.3 | 286 | 18.0 | 
| Asian | 135 | 4.1 | 98 | 3.4 | 55 | 3.5 | |
| Black | 313 | 9.6 | 376 | 13.1 | 171 | 10.8 | |
| White | 1,731 | 53.1 | 1,546 | 53.7 | 791 | 49.8 | |
| Other | 172 | 5.3 | 146 | 5.1 | 86 | 5.4 | |
| Canadian Race/Ethnicity | Not visible minority | 275 | 8.4 | 213 | 7.4 | 137 | 8.6 | 
| Visible minority | 111 | 3.4 | 90 | 3.1 | 63 | 4.0 | |
| U.S. Region | Northeast | 499 | 15.3 | 595 | 20.7 | 245 | 15.4 | 
| Midwest | 687 | 21.1 | 571 | 19.8 | 343 | 21.6 | |
| South | 1,132 | 34.8 | 1,013 | 35.2 | 530 | 33.4 | |
| West | 553 | 17.0 | 398 | 13.8 | 271 | 17.1 | |
| Canadian Region | East | 33 | 1.0 | 27 | 0.9 | 14 | 0.9 | 
| Central | 242 | 7.4 | 184 | 6.4 | 130 | 8.2 | |
| West | 111 | 3.4 | 93 | 3.2 | 56 | 3.5 | |
| Parental Education Level | No high school diploma | 255 | 7.9 | — | — | 160 | 10.1 | 
| High school diploma/GED | 602 | 18.5 | — | — | 371 | 23.3 | |
| Some college or associate’s degree | 957 | 29.5 | — | — | 426 | 26.8 | |
| Bachelor’s degree | 861 | 26.5 | — | — | 354 | 22.3 | |
| Graduate or professional degree | 572 | 17.6 | — | — | 278 | 17.5 | |
| Diagnosis | General Population (no diagnosis) | 2,518 | 77.3 | 2,379 | 82.6 | 1,247 | 78.5 | 
| ADHD | 560 | 17.2 | 321 | 11.1 | 229 | 14.4 | |
| Other Diagnoses | 179 | 5.5 | 181 | 6.3 | 113 | 7.1 | |
| Age in years M (SD) | 11.7 (3.5) | 11.8 (3.6) | 12.8 (3.0) | ||||
| Total | 3,257 | 100.0 | 2,881 | 100.0 | 1,589 | 100.0 | |
Table 6.6. Demographic Characteristics of Raters: Conners 4 Total Samples
| Rater Demographic | Parent | Teacher | |||
| N | % | N | % | ||
| Gender | Male | 914 | 28.1 | 690 | 24.0 | 
| Female | 2,342 | 71.9 | 2,190 | 76.0 | |
| Other | 1 | < 0.1 | 1 | < 0.1 | |
| U.S. Race/Ethnicity | Hispanic | 466 | 14.3 | 174 | 6.0 | 
| Asian | 142 | 4.4 | 67 | 2.3 | |
| Black | 327 | 10.0 | 152 | 5.3 | |
| White | 1,858 | 57.0 | 2,118 | 73.5 | |
| Other | 78 | 2.4 | 67 | 2.3 | |
| Canadian Race/Ethnicity | Not visible minority | 280 | 8.6 | 258 | 9.0 | 
| Visible minority | 106 | 3.3 | 45 | 1.6 | |
| Relation to Youth Being Rated | Biological mother | 2,181 | 67.0 | — | — | 
| Biological father | 844 | 25.9 | — | — | |
| Non-biological mother | 101 | 3.1 | — | — | |
| Non-biological father | 42 | 1.3 | — | — | |
| Other relative | 89 | 2.7 | — | — | |
| Length of Relationship | 1–5 months | — | — | 0 | 0.0 | 
| 6–11 months | — | — | 532 | 18.5 | |
| 1–3 years | — | — | 1,129 | 39.2 | |
| More than 3 years | — | — | 1,220 | 42.3 | |
| How well does the teacher know the student being rated? | Moderately well | — | — | 1,319 | 45.8 | 
| Very well | — | — | 1,562 | 54.2 | |
| How often does the teacher interact with the student being rated? | Monthly | — | — | 22 | 0.8 | 
| Weekly | — | — | 429 | 14.9 | |
| Daily | — | — | 2,430 | 84.3 | |
| Age in years M (SD) | 42.2 (8.4) | 42.3 (11.5) | |||
| Total | 3,257 | 100.0 | 2,881 | 100.0 | |
During the standardization phase, the validity scale explorations evolved into a larger goal of offering a Response Style Analysis to provide a more thorough examination of an individual’s response style. Used together, the metrics were designed to help identify atypical responding on the Conners 4. The analyses conducted, and the additional samples collected, to examine the Positive and Negative Impression Indices and the Inconsistency Index are described in this section, along with the introduction of a response time metric (Pace) that completes the comprehensive set of indicators used in understanding a rater’s response style (see chapter 4, Interpretation, for more detail on how these metrics are used).
To properly investigate the quality of the items that were written to capture overly positive responding, a study was conducted with samples that were instructed to present a falsely positive impression. Participants in these samples (N = 81 for Parent, N = 89 for Teacher, and N = 48 for individuals aged 13 to 18 completing the Self-Report form) were instructed to make the rated youth (child for Parent, student for Teacher, or themselves for Self-Report) look as good as possible. For example, for the Self-Report, adolescents were prompted with instructions such as “Pretend that your responses to these statements will be read out loud in class. You want to make sure that you are making a good impression on your teacher and classmates.” Individuals in these samples were from the general population and did not report any mental health diagnoses. All participants were compensated for their participation and were debriefed regarding the purpose of the study upon completion to avoid ill effects from soliciting exaggerated or false responses.
The frequency of responses to the Positive Impression item set was examined within the sample who received the special instructions (termed “Fake Good”), as well as in the General Population and ADHD Reference Samples. Items were considered as candidates if there was a considerable difference in response frequency between those instructed to respond honestly (i.e., General Population and ADHD samples) and those instructed to Fake Good. Specifically, candidate items were selected if there was more than a 15-percentage point difference in endorsement of the most extreme response option, which is an item score of 3, “Completely true (Very often/Always),” between the honest and faking conditions. Four items met this criterion for Parent and Self-Report, while only three items met this criterion for Teacher. These subsets of items were then summed to create a raw score, and the frequency distribution of raw scores was compared among the Fake Good, General Population, and ADHD Reference samples. A cut-off that would effectively distinguish between individuals instructed to fake good and those instructed to respond honestly could not be determined, as the set of successful items in this study did not permit sufficient variability of scores to distinguish the groups. Given this finding, and given that some items on the current Conners 3 Positive Impression Scale present concerns for its users (see Conceptualization and Initial Planning earlier in this chapter), it was determined that the inclusion of a Positive Impression Index would not be in the best interest of the revision, and the index was thus excluded from the Conners 4.
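The screening rule described above can be sketched in Python. This is a minimal illustration, not the authors' procedure: the function names, the example response lists, and the choice to compute the difference as faking minus honest are assumptions introduced here.

```python
from typing import Sequence

EXTREME_OPTION = 3  # item score for "Completely true (Very often/Always)"


def endorsement_rate(responses: Sequence[int], option: int = EXTREME_OPTION) -> float:
    """Percentage of respondents who chose the given response option."""
    if not responses:
        raise ValueError("responses must not be empty")
    return 100.0 * sum(1 for r in responses if r == option) / len(responses)


def is_candidate_item(honest: Sequence[int], faking: Sequence[int],
                      min_difference: float = 15.0) -> bool:
    """True when the faking group endorses the extreme option by more than
    min_difference percentage points over the honest group.
    The direction of the comparison is an assumption of this sketch."""
    return endorsement_rate(faking) - endorsement_rate(honest) > min_difference


# Hypothetical responses to a single item (scores 0-3); not data from the study.
honest_responses = [3, 3, 0, 1, 2, 0, 1, 0, 2, 1]
faking_responses = [3, 3, 3, 3, 3, 1, 2, 1, 2, 1]
print(is_candidate_item(honest_responses, faking_responses))
```

In practice the honest group would pool the General Population and ADHD samples, and the check would be repeated for every item in the Positive Impression item set.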
Although it was initially intended that two indices would be created, one to flag overly negative responding (negative response bias) and another to flag invalid symptom reporting (i.e., ADHD Symptom Validity Test), these indices were collapsed into a single index because high endorsement of either item type results in the need to consider the validity of the responses when interpreting test scores. While it is helpful to know what type of invalid reporting the rater was engaging in, the combination of items resulted in better overall detection of an invalid response style. The items for this index were evaluated in the General Population and ADHD Reference Samples, as well as in independent simulation samples collected during the standardization phase. To this end, separate “Fake Bad” (N = 77 for Parent, N = 106 for Teacher, and N = 35 for adolescents aged 13 to 18 years completing the Self-Report form) and “Fake ADHD” (N = 77 for Parent, N = 94 for Teacher, and N = 40 for adolescents aged 13 to 18 years completing the Self-Report form) samples were collected to simulate different types of invalid response styles.
Individuals in these two simulation samples received special instructions to encourage exaggerated or invalid symptom reporting. For example, parents/guardians in the Fake Bad sample (i.e., “Fake Bad Responders”) were instructed “…to pretend that your child is experiencing emotional and behavior problems and that you really want to get them the help they need. Pretend that the only way to get them help is to make it sound like they have very serious problems.” Parents/guardians in the Fake ADHD sample (i.e., “Fake ADHD Responders”) received instructions to “…respond to the statements in a way that will make people think your child has Attention-Deficit/Hyperactivity Disorder […] and you are hoping that if they are diagnosed with the disorder, they will get special accommodations in school.” Individuals in the Fake ADHD sample were given a brief list of common symptoms that are typically experienced by youth with ADHD, based on the type of information that would be readily accessible by conducting a cursory internet search. Rated youth (and the adolescents themselves for the Self-Report) in these samples were from the general population and did not report any mental health diagnoses. All participants received an explanation of the purpose of the study upon completion to avoid ill effects from soliciting exaggerated or false responses.
Extreme responses to the Negative Impression Index items were compared between the General Population, ADHD, and the Fake Bad and Fake ADHD simulation samples. Items were identified as candidates for the Negative Impression Index if the General Population and ADHD Reference Samples demonstrated large differences in their response patterns compared to the simulated samples. In particular, items in which the endorsement of 2, “Pretty much true (Often/Quite a bit),” or 3, “Completely true (Very Often/Always),” was at least 15 percentage points more frequent for individuals pretending to have ADHD or faking bad than for individuals with a genuine ADHD diagnosis were retained as candidate items. The items were also required to demonstrate low endorsement from individuals in the General Population (i.e., fewer than 5% of the sample selecting 2 or 3). The item scores for selected candidate items were summed to create a raw score, and the distribution of raw scores within each sample was compared. Note that endorsement of the response option of 1, “Just a little true (Rarely),” did not contribute to the raw score; only item scores of 2 and 3 were counted toward the raw score, as these responses were observed to be rare and extreme and therefore more likely to be indicative of impression management.
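The scoring rule in the preceding paragraph can be illustrated with a short Python sketch. It assumes that item scores range from 0 to 3 and that responses of 2 and 3 contribute their face value to the raw score (the text says only that they are "counted"). The function names and the example responses are hypothetical, and the cut-off is passed in as a parameter rather than fixed.

```python
from typing import Sequence


def negative_impression_raw_score(item_scores: Sequence[int]) -> int:
    """Sum item scores, counting only responses of 2 or 3.

    Scores of 0 or 1 contribute nothing (assumed 0-3 response scale).
    """
    for score in item_scores:
        if score not in (0, 1, 2, 3):
            raise ValueError(f"item score out of range: {score}")
    return sum(score for score in item_scores if score >= 2)


def is_flagged(raw_score: int, cutoff: int) -> bool:
    """A raw score at or above the cut-off is flagged."""
    return raw_score >= cutoff


# Hypothetical responses to an eight-item set; not data from the study.
example_responses = [0, 1, 3, 2, 1, 3, 0, 1]
raw = negative_impression_raw_score(example_responses)
print(raw, is_flagged(raw, cutoff=8))
```

Here the three responses of 1 are ignored, so only the scores of 3, 2, and 3 add to the total.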
Various combinations of items identified using this method were tested to explore the optimal set of items and raw score that could capture the largest proportion of the Fake Bad and Fake ADHD samples, while also capturing very few “Honest Responders” (i.e., individuals in the General Population and ADHD Reference Samples who were instructed to respond to all items honestly). To evaluate how well the set of items could distinguish groups, several key statistics that summarize classification accuracy were calculated, using the approach outlined by Kessel and Zimmerman (1993). To clarify, the following accuracy statistics were calculated: overall correct classification rate, sensitivity, specificity, positive predictive value, negative predictive value, and kappa. (Note that the term “test” can refer to a scale, index, or measure of any kind, as these statistics are used throughout this manual for multiple purposes.)
A final set of Negative Impression Index items for Parent (8 items), Teacher (7 items), and Self-Report (8 items) was created. The set of items was selected by examining classification accuracy statistics for combinations of items to identify the best performing set. The resultant set of items reflects a balance of items intended to capture a negative response bias and invalid symptom reporting, thus achieving the goal defined during conceptualization (see appendix C for item content).
Classification accuracy statistics are presented in Table 6.7 (a to c), showing the ability of the Negative Impression Index to accurately identify individuals as belonging either to the simulation samples or to the Honest Responder samples. The optimal cut-off based on these item sets for identifying potential invalid response patterns was greater than or equal to 8 raw score points for Parent, 10 raw score points for Teacher, and 9 raw score points for Self-Report. This cut-off was set at a conservative level in order to avoid false positives (i.e., raters who are identified as having problematic response styles but are actually responding honestly); therefore, the goal was to have a cut-off with maximum levels of specificity (i.e., 90% or higher).
Using the selected raw score cut-offs, the correct classification rate for distinguishing between Fake ADHD and Honest Responders was 94.4% for Parent, 94.1% for Teacher, and 96.3% for Self-Report. Additionally, the correct classification rate for distinguishing between Fake Bad Responders and Honest Responders was 94.8% for Parent, 93.8% for Teacher, and 96.6% for Self-Report. As can be seen in Table 6.7, the goal of maximizing specificity was met, with specificity values ranging from 90.9% to 97.2%. These results indicate that the Negative Impression Index can be an effective index for identifying individuals who are feigning ADHD and those who are generally making the rated youth (or themselves) appear worse than they really are. These classification accuracy statistics for predicting feigning depend on the prevalence of feigning in the population. The prevalence can vary widely depending on the purpose of the evaluation and the setting. Therefore, for a more nuanced examination of the classification accuracy, the Positive Predictive and Negative Predictive Values based on varying base rates (whereas the initial table reported a 50% base rate) are provided in Table 6.8.
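As an illustration of how the statistics discussed above relate to one another, the following Python sketch derives them from a 2 × 2 table of counts and then recomputes the predictive values for an arbitrary base rate using Bayes' rule. The counts in the example are illustrative values chosen to be roughly in line with the Parent Fake ADHD vs. Honest Responders percentages; they are not reported in this manual. The manual also does not state the formula used for Table 6.8, so the base-rate adjustment shown is the standard textbook form and should be read as an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # simulators who were flagged
    fn: int  # simulators who were not flagged
    fp: int  # honest responders who were flagged
    tn: int  # honest responders who were not flagged


def accuracy_statistics(c: ConfusionCounts) -> Dict[str, float]:
    """Standard classification accuracy statistics as proportions (0-1)."""
    n = c.tp + c.fn + c.fp + c.tn
    overall = (c.tp + c.tn) / n
    # Chance agreement for Cohen's kappa, from the row and column totals.
    expected = ((c.tp + c.fp) * (c.tp + c.fn) + (c.tn + c.fn) * (c.tn + c.fp)) / n ** 2
    return {
        "overall": overall,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "kappa": (overall - expected) / (1 - expected),
    }


def predictive_values_at_base_rate(sensitivity: float, specificity: float,
                                   base_rate: float) -> Tuple[float, float]:
    """PPV and NPV for a given prevalence of feigning, via Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    true_neg = specificity * (1 - base_rate)
    false_neg = (1 - sensitivity) * base_rate
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Illustrative counts only (hypothetical, not taken from the manual).
counts = ConfusionCounts(tp=42, fn=35, fp=153, tn=3102)
stats = accuracy_statistics(counts)
ppv_50, npv_50 = predictive_values_at_base_rate(
    stats["sensitivity"], stats["specificity"], base_rate=0.50)
print({name: round(value, 3) for name, value in stats.items()})
print(round(ppv_50, 3), round(npv_50, 3))
```

The sketch shows why a low positive predictive value in a sample where simulators are rare can coexist with high specificity: when the base rate rises, the same sensitivity and specificity yield a much higher positive predictive value and a lower negative predictive value.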
Table 6.7a. Classification Accuracy Statistics: Conners 4 Parent Negative Impression Index
| Classification Accuracy Statistic | Fake ADHD vs. Honest Responders | Fake ADHD vs. Genuine ADHD | Fake Bad vs. Honest Responders | 
| Overall Correct Classification (%) | 94.4 | 90.7 | 94.8 | 
| Sensitivity (%) | 54.5 | 54.5 | 72.7 | 
| Specificity (%) | 95.3 | 95.7 | 95.3 | 
| Positive Predictive Value (%) | 21.5 | 63.6 | 26.8 | 
| Negative Predictive Value (%) | 98.9 | 93.9 | 99.3 | 
| Kappa | .29 | .54 | .37 | 
Note. Honest Responders = General Population, ADHD, and other clinical samples (N = 3,255); Genuine ADHD = ADHD Reference Sample (N = 560); Fake ADHD = individuals from the Fake ADHD simulation study (N = 77); Fake Bad = individuals from the Fake Bad simulation study (N = 77).
Table 6.7b. Classification Accuracy Statistics: Conners 4 Teacher Negative Impression Index
| Classification Accuracy Statistic | Fake ADHD vs. Honest Responders | Fake ADHD vs. Genuine ADHD | Fake Bad vs. Honest Responders | 
| Overall Correct Classification (%) | 94.1 | 82.1 | 93.8 | 
| Sensitivity (%) | 52.1 | 52.1 | 48.1 | 
| Specificity (%) | 95.5 | 90.9 | 95.5 | 
| Positive Predictive Value (%) | 27.4 | 62.8 | 28.2 | 
| Negative Predictive Value (%) | 98.4 | 86.6 | 98.0 | 
| Kappa | .33 | .46 | .33 | 
Note. Honest Responders = General Population, ADHD, and other clinical samples (N = 2,879); Genuine ADHD = ADHD Reference Sample (N = 320); Fake ADHD = individuals from the Fake ADHD simulation study (N = 94); Fake Bad = individuals from the Fake Bad simulation study (N = 106).
Table 6.7c. Classification Accuracy Statistics: Conners 4 Self-Report Negative Impression Index
| Classification Accuracy Statistic | Fake ADHD vs. Honest Responders | Fake ADHD vs. Genuine ADHD | Fake Bad vs. Honest Responders | 
| Overall Correct Classification (%) | 96.3 | 90.2 | 96.6 | 
| Sensitivity (%) | 62.5 | 62.5 | 80.0 | 
| Specificity (%) | 97.2 | 95.1 | 97.0 | 
| Positive Predictive Value (%) | 35.7 | 69.4 | 36.8 | 
| Negative Predictive Value (%) | 99.0 | 93.4 | 99.5 | 
| Kappa | .44 | .60 | .49 | 
Note. Honest Responders = General Population, ADHD, and other clinical samples (N = 1,592 for adolescents aged 13 to 18 years); Genuine ADHD = ADHD Reference Sample (N = 224 for adolescents aged 13 to 18 years); Fake ADHD = individuals from the Fake ADHD simulation study (N = 40 for adolescents aged 13 to 18 years); Fake Bad = individuals from the Fake Bad simulation study (N = 35 for adolescents aged 13 to 18 years).
Table 6.8. Classification Accuracy Adjusted for Base Rate: Conners 4 Negative Impression Index
| Study | Rater | 10% Base Rate | 60% Base Rate | 70% Base Rate | 80% Base Rate | ||||
| PPV (%) | NPV (%) | PPV (%) | NPV (%) | PPV (%) | NPV (%) | PPV (%) | NPV (%) | ||
| Fake ADHD vs. Honest Responders | Parent | 69.9 | 91.3 | 93.3 | 63.6 | 94.2 | 60.0 | 94.9 | 56.7 | 
| Teacher | 69.8 | 90.1 | 93.3 | 62.4 | 94.2 | 58.8 | 95.5 | 55.5 | |
| Self-Report | 81.5 | 92.8 | 96.4 | 68.4 | 96.9 | 64.9 | 97.2 | 61.8 | |
| Fake ADHD vs. Genuine ADHD | Parent | 71.8 | 91.3 | 93.9 | 63.7 | 94.7 | 60.1 | 95.7 | 95.3 | 
| Teacher | 53.5 | 90.5 | 87.4 | 61.3 | 89.0 | 57.6 | 90.2 | 54.3 | |
| Self-Report | 71.8 | 92.7 | 93.9 | 67.9 | 94.7 | 64.4 | 95.3 | 61.3 | |
| Fake Bad vs. Honest Responders | Parent | 75.6 | 94.6 | 94.9 | 74.4 | 95.6 | 71.4 | 96.1 | 68.6 | 
| Teacher | 68.1 | 90.2 | 92.8 | 60.5 | 93.7 | 56.8 | 94.5 | 53.5 | |
| Self-Report | 84.1 | 96.0 | 97.0 | 80.2 | 97.4 | 77.6 | 97.7 | 75.2 | |
Note. PPV = Positive Predictive Value. NPV = Negative Predictive Value.
The cut-off scores for the Parent and Teacher Negative Impression Index were then cross-validated in two separate samples: a “coached” Fake ADHD sample (N = 47 for Parent and N = 47 for Teacher) and a naive or “uncoached” Fake ADHD sample (N = 49 for Parent and N = 49 for Teacher). Both samples were administered the entire Conners 4 inventory, which included the Negative Impression Index items. In the coached sample, participants were instructed to respond as if the rated youth had ADHD symptoms and were provided with a list of symptoms that would be available if one were to conduct an informal search for symptoms of ADHD on the internet. The uncoached sample was given the same instructions regarding faking ADHD but was not provided with a list of symptoms on which to base false responses. Raw scores from the selected set of Negative Impression Index items for Parent and Teacher were calculated, and the distribution of raw scores was examined. Results revealed that the uncoached and coached samples responded so similarly to one another that the two groups could reasonably be combined into a single Fake ADHD group for each rater form, regardless of instructions received.
Classification accuracy statistics were calculated using the selected cut-offs to compare the Fake ADHD cross-validation sample with (a) the previously employed samples of Honest Responders and (b) individuals with a genuine ADHD diagnosis, separately; the results are presented in Table 6.9. Overall, results validated a raw score cut-off of greater than or equal to 8 for Parent, with a correct classification rate of 94.1% for distinguishing between Fake ADHD Responders and Honest Responders and 87.8% for distinguishing between Fake ADHD Responders and individuals with a genuine ADHD diagnosis. For Teacher, a raw score cut-off of greater than or equal to 10 was confirmed, with corresponding correct classification rates of 94.3% and 82.0%. As described earlier, the Negative Impression Index is designed to avoid false positive classifications, as seen in the lower sensitivity values as compared to specificity values (see Table 6.9). Therefore, the absence of a flagged Negative Impression Index score does not rule out a negative response bias or invalid symptom reporting, but the presence of a flagged score is a very strong indicator that the responses are atypical and potentially invalid.
Table 6.9. Classification Accuracy Statistics: Conners 4 Parent and Teacher Negative Impression Index Validation Study
| Classification Accuracy Statistic | Parent | Teacher | ||
| Fake ADHD vs. Genuine ADHD | Fake ADHD vs. Honest Responders | Fake ADHD vs. Genuine ADHD | Fake ADHD vs. Honest Responders | |
| Overall Correct Classification (%) | 87.8 | 94.1 | 82.0 | 94.3 | 
| Sensitivity (%) | 52.1 | 52.1 | 58.3 | 58.3 | 
| Specificity (%) | 93.9 | 95.3 | 89.1 | 95.5 | 
| Positive Predictive Value (%) | 59.5 | 24.6 | 61.5 | 30.1 | 
| Negative Predictive Value (%) | 92.0 | 98.5 | 87.7 | 98.6 | 
| Kappa | .49 | .31 | .48 | .37 | 
Note. Honest Responders = General Population, ADHD, and other clinical samples (N = 3,255 for Parent; N = 2,879 for Teacher). Genuine ADHD = ADHD Reference Sample (N = 560 for Parent; N = 320 for Teacher). Fake ADHD = combined coached and uncoached Fake ADHD cross-validation samples (N = 96 for Parent; N = 96 for Teacher).
The Inconsistency Index was created to capture random or inconsistent responding. To identify such a pattern and create the Inconsistency Index, correlations between items in the Conners 4 item pool within the Clinical sample were inspected. Items that were highly correlated (ideally above r = .70 but extending to lower correlations as needed if this threshold could not be met) within the Clinical sample and that shared meaningfully related content were selected as item-pairs. Correlations for these item-pairs were also inspected within the General Population sample (see Standardization Phase earlier in this chapter for sample descriptions) to ensure the relationship was still present in this sample. Correlations greater than r = .55 were deemed high enough to indicate that the relationship was still present, as item interrelationships in this sample were expected to be lower due to the restricted range. Through this review and item inspection process, seven item-pairs that were both statistically and conceptually closely related to each other were identified for both the Parent and Teacher forms, and eight item-pairs were selected for the Self-Report form. In rare instances, items were included in the Inconsistency Index that are not found elsewhere in the Conners 4; these items were part of the initial item pool but ultimately did not meet criteria for inclusion on the intended Content Scales, often due to a high degree of overlapping content. This content redundancy is not ideal when measuring a particular construct but is very useful when identifying consistency in response style, and as such, these items were included in item-pairs for the Inconsistency Index. The item stems (see appendix C for full item text) and the correlations between item-pairs within the Clinical samples are presented in Table 6.10 for Parent, Table 6.11 for Teacher, and Table 6.12 for Self-Report.
The items selected for the Inconsistency Index represent a range of content areas measured on the Conners 4.
Table 6.10. Inconsistency Index Item Pairs: Conners 4 Parent
| Item Stem 1 | Item Stem 2 | r | 
| 34. Making impulsive decisions | 109. Being impulsive | .87 | 
| 14. Needing to move around | 86. Having trouble sitting still | .83 | 
| 28. Creating stress for the family | 88. Creating chaos for the family | .83 | 
| 16. Refusing to follow the rules | 83. Refusing to do what they are told | .77 | 
| 40. Forgetting to turn in work | 64. Handing things in late | .76 | 
| 55. Interrupting others | 97. Talking out of turn | .76 | 
| 73. People don’t want to be friends with them | 100. Having trouble making or keeping friends | .75 | 
Note. All correlation coefficients (r) are significant, p < .001.
Table 6.11. Inconsistency Index Item Pairs: Conners 4 Teacher
| Item Stem 1 | Item Stem 2 | r | 
| 31. Making impulsive decisions | 100. Being impulsive | .87 | 
| 4. Losing their temper | 60. Trouble controlling their temper | .86 | 
| 17. Having trouble finishing tasks | 32. Failing to finish things they start | .85 | 
| 77. Talking before others are finished | 90. Talking out of turn | .79 | 
| 80. Having trouble sitting still | 99. Fidgeting | .81 | 
| 37. Forgetting to turn in work | 22. Not knowing where or what their homework is | .77 | 
| 34. Feeling angry and resentful | 104. Having trouble controlling their anger | .77 | 
Note. All correlation coefficients (r) are significant, p < .001.
Table 6.12. Inconsistency Index Item Pairs: Conners 4 Self-Report
| Item Stem 1 | Item Stem 2 | r | 
| 46. Worrying so much they get tired | 75. Worrying too much | .73 | 
| 8. Having too much energy to sit still | 88. Having trouble sitting still | .72 | 
| 12. Feeling sad, gloomy, or irritable | 111. Feeling helpless | .70 | 
| 68. Getting really angry | 114. Having trouble controlling their anger | .64 | 
| 48. Having trouble concentrating | 58. Having trouble completing work | .64 | 
| 7. Having trouble sleeping because of worry | 26. Having trouble controlling their worries | .62 | 
| 31. Creating stress for the family | 89. Causing problems for family | .59 | 
| 46. Worrying so much they get tired | 75. Worrying too much | .56 | 
Note. All correlation coefficients (r) are significant, p < .001.
Next, the difference in item ratings for each of these item-pairs was calculated, and the absolute values of differences greater than 1 point (e.g., if Item A was rated 3 and Item B was rated 1, the difference is 2 points) were summed to create the raw score for the Inconsistency Index. Differences equal to or less than 1 point were not considered indicative of inconsistent responding. To determine a cut-off for the raw score at which one could confidently identify inconsistent responding, Conners 4 response data were simulated using R Statistical Software to represent uniform random responding. The simulated dataset is intended to mimic the behavior of an individual who is randomly responding.
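The raw score calculation and the random-responding simulation can be sketched as follows. The manual reports that the simulation was run in R; the Python version here is only a rough analogue. It assumes item scores from 0 to 3 drawn independently and uniformly, and the function names, number of simulated responders, random seed, and the cut-off used in the example call are all hypothetical choices.

```python
import random
from typing import Sequence, Tuple


def inconsistency_raw_score(pairs: Sequence[Tuple[int, int]]) -> int:
    """Sum absolute within-pair differences, ignoring differences of 0 or 1."""
    total = 0
    for first, second in pairs:
        difference = abs(first - second)
        if difference > 1:
            total += difference
    return total


def simulate_random_flag_rate(n_pairs: int, cutoff: int,
                              n_responders: int = 20_000, seed: int = 0) -> float:
    """Proportion of uniformly random responders whose raw score meets the cut-off."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    flagged = 0
    for _ in range(n_responders):
        pairs = [(rng.randint(0, 3), rng.randint(0, 3)) for _ in range(n_pairs)]
        if inconsistency_raw_score(pairs) >= cutoff:
            flagged += 1
    return flagged / n_responders


# Worked example: only the (3, 1) and (0, 3) pairs differ by more than 1 point.
print(inconsistency_raw_score([(3, 1), (2, 2), (0, 1), (0, 3)]))

# Share of simulated random responders flagged for seven pairs at a hypothetical cut-off.
print(simulate_random_flag_rate(n_pairs=7, cutoff=4))
```

Comparing this simulated flag rate with the share of genuine responders at or above the same raw score is what allows a cut-off to be weighed for false positives against detection of random responding.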
The distribution of raw scores within the random datasets was then compared against the distribution of raw scores observed in the “Best Effort Responders” (i.e., General Population and ADHD Reference Samples), as these samples were both considered to be providing their best effort to attend appropriately to item content when making their ratings. A cut-off was selected to minimize the number of individuals providing their best effort who would be flagged, while maximizing the number of simulated fully random cases that would exceed the cut-off and be flagged on the Inconsistency Index. Priority was given to ensuring false positives were rare; that is, individuals giving their best effort should rarely be flagged for inconsistent responding. For Parent, raw scores greater than or equal to 4 captured only 4.2% of Best Effort Responders (i.e., General Population and ADHD Reference samples combined), while capturing 81.3% of the simulated random data. For Teacher, a cut-off of greater than or equal to 3 captured 2.6% of the Best Effort Responders and 85.4% of the simulated random data. For Self-Report, a cut-off of greater than or equal to 5 captured 3.6% of the Best Effort Responders and 67.3% of the simulated random data.
The classification accuracy of the Inconsistency Index was then explored. As seen in Table 6.13, the overall correct classification rate was 88.6% for Parent, 88.8% for Teacher, and 88.7% for Self-Report when distinguishing between the Best Effort Responders and randomly generated data (N = 2,221 for Parent; N = 2,379 for Teacher; and N = 1,247 for Self-Report, with simulated random responses of equal sample size). Similarly, when looking specifically at the ability of the Inconsistency Index to distinguish individuals with ADHD from the random responses, the overall correct classification rate was 83.9% for Parent, 86.9% for Teacher, and 80.6% for Self-Report (N = 393 for Parent, N = 321 for Teacher, and N = 230 for Self-Report). The Inconsistency Index is designed to be more specific than sensitive to reduce the risk of false positives, and these results support the valid use of the Inconsistency Index in the identification of an inconsistent response style.
Table 6.13. Classification Accuracy Statistics: Conners 4 Inconsistency Index
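| Classification Accuracy Statistic | Parent: Best Effort Responders vs. Random Responses | Parent: ADHD vs. Random Responses | Teacher: Best Effort Responders vs. Random Responses | Teacher: ADHD vs. Random Responses | Self-Report: Best Effort Responders vs. Random Responses | Self-Report: ADHD vs. Random Responses |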
| Overall Correct Classification (%) | 88.6 | 83.7 | 88.8 | 82.6 | 88.7 | 87.9 | 
| Sensitivity (%) | 81.6 | 81.6 | 80.8 | 80.8 | 88.4 | 88.4 | 
| Specificity (%) | 95.8 | 98.2 | 97.4 | 99.1 | 89.1 | 84.8 | 
| Positive Predictive Value (%) | 95.2 | 99.7 | 97.1 | 99.9 | 89.7 | 97.6 | 
| Negative Predictive Value (%) | 83.3 | 43.6 | 82.6 | 36.5 | 87.7 | 51.3 | 
| Kappa | .77 | .52 | .78 | .45 | .77 | .57 | 
Note. Best Effort Responders = General Population and ADHD samples. ADHD = ADHD Reference Sample.
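For readers less familiar with the statistics reported in Table 6.13, the following minimal sketch shows how each one can be computed from a 2 × 2 classification table. It assumes that a "positive" is a simulated random case and that a flag on the Inconsistency Index is a positive prediction; the function name and the counts are hypothetical and do not reproduce the values in the table.

```python
from typing import Dict


def classification_stats(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Classification accuracy statistics from a 2 x 2 table.

    tp: random cases that were flagged      fn: random cases not flagged
    tn: best-effort cases not flagged       fp: best-effort cases flagged
    """
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)

    # Cohen's kappa: agreement beyond what is expected by chance.
    chance_flagged = ((tp + fp) / n) * ((tp + fn) / n)
    chance_not_flagged = ((tn + fn) / n) * ((tn + fp) / n)
    expected = chance_flagged + chance_not_flagged
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "overall_correct_pct": 100 * accuracy,
        "sensitivity_pct": 100 * sensitivity,
        "specificity_pct": 100 * specificity,
        "ppv_pct": 100 * ppv,
        "npv_pct": 100 * npv,
        "kappa": kappa,
    }


# Hypothetical counts: 100 random cases and 100 best-effort cases.
stats = classification_stats(tp=80, fn=20, tn=95, fp=5)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

With equal group sizes, as in this toy example, the overall correct classification rate is simply the average of sensitivity and specificity.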
To provide an additional indication of response style, the rate of response was examined to understand what is typical or potentially unusual. Response time per item was captured and can be summed to create the total test duration. To calculate pace, the total number of items on a given Conners 4 form is divided by the time taken to complete the form (in minutes), yielding the rate of responding as the number of items responded to per minute. The distribution of this indicator of pace was explored in the Normative Samples, as well as the ADHD Reference Samples. The two samples had very similar ranges for pace across Parent, Teacher, and Self-Report; therefore, the Normative Sample was selected for deriving the cut-off for interpretation. To aid interpretation by identifying unusually slow or fast rates of responding, the pace at 3 SD above and 2.5 SD below the Normative Sample’s mean was calculated (this range was selected because 2.5 SD below the mean approaches 0 items per minute, which is a natural boundary for this indicator, and 3 SD above the mean provides a conservative estimate without running the risk of flagging individuals too frequently). Paces that are more rapid than the upper boundary of typical responding can be understood as unusually fast, while paces that are slower than the lower boundary of typical responding can be understood as unusually slow.
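The pace calculation and its boundaries can be sketched as follows. This is an illustration under stated assumptions: the function names, the 100-item form length, and the list of paces are hypothetical, and the use of the sample standard deviation is an assumption rather than a documented detail of the Conners 4 procedure.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def pace_items_per_minute(n_items: int, duration_minutes: float) -> float:
    """Pace = number of items on the form / minutes taken to complete it."""
    return n_items / duration_minutes


def pace_bounds(
    paces: Sequence[float], sd_below: float = 2.5, sd_above: float = 3.0
) -> Tuple[float, float]:
    """Lower and upper boundaries of typical pace around the sample mean.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    m = mean(paces)
    s = stdev(paces)
    return m - sd_below * s, m + sd_above * s


def classify_pace(pace: float, lower: float, upper: float) -> str:
    """Label a pace relative to the boundaries of typical responding."""
    if pace > upper:
        return "unusually fast"
    if pace < lower:
        return "unusually slow"
    return "typical"


# Hypothetical normative paces (items per minute), for illustration only.
normative_paces = [6.0, 8.0, 10.0, 12.0, 14.0]
lower, upper = pace_bounds(normative_paces)

# A hypothetical rater who completes a 100-item form in 4 minutes.
pace = pace_items_per_minute(n_items=100, duration_minutes=4.0)
print(round(lower, 2), round(upper, 2), pace, classify_pace(pace, lower, upper))
```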
Within the Conners 4 Parent Combined Gender Normative Sample, 0.6% of individuals were identified as unusually fast (i.e., approximately 3 SDs above the mean, or greater than 17 items per minute), and 1.2% of the Normative Sample was flagged as unusually slow. For Teacher, the cut-off for a fast Pace is 20 items per minute (approximately 3 SDs above the mean), which identified 0.1% of the Combined Gender Normative Sample as unusually fast, and 2% of the sample was identified as unusually slow. For Self-Report, cut-offs were examined by age within the Normative Sample, and differences were observed between children and adolescents, likely due to developmental gains in literacy and verbal comprehension. Accordingly, the cut-off for fast Pace for youth aged 8- to 11-years-old is 16 items per minute, while the cut-off for youth aged 12- to 18-years-old is 18 items per minute. No one in the Self-Report Combined Gender Normative Sample was flagged as unusually fast, and only 0.2% of the sample was flagged as slow. Note that raters completing the Conners 4 could pause and return to complete the test, so duration (i.e., the overall length from start to finish) and pace (i.e., typical rate of response for each item) ought to be considered in tandem to understand the entire response style as it relates to time. Both of these measures are provided on the Conners 4 reports. Details about the exact guidelines and applied use of this Response Style Indicator can be found in Step 1: Examine the Response Style Analysis in chapter 4, Interpretation.
There are two sets of Critical Items: (a) the Self-Harm Critical Items that are meant to screen for the risk of suicidal thoughts and behaviors, as well as self-injurious behaviors, and (b) the Severe Conduct Critical Items that are meant to screen for destructive behavior and the risk of violence or harm towards others. The Sleep Problems Indicator screens for general sleep problems that the youth may be experiencing.
For the Critical Items, any endorsement, which is any response other than an item score of 0, “Not true at all (Never/Rarely),” warrants attention; norm-referenced information or scores do not influence the need for follow-up. During the standardization phase, there were three Self-Harm items across rater forms. For Parent and Teacher, one item captured non-suicidal self-injury, and two items captured suicidal behaviors (i.e., “Has planned or tried to commit suicide” and “Has talked about committing suicide”). Post-standardization, these suicidal behavior items were modified into a single item to read: “Has talked about, planned, or attempted suicide.” Results presented in this manual reflect the standardization versions of the items (e.g., see Clinical Group Differences in chapter 9, Validity).
The Sleep Problems Indicator items were analyzed by comparing response frequency between the General Population sample and the ADHD Reference Sample to ensure variability in response and greater endorsement for youth with ADHD. Entering the standardization phase, four items were included for Parent and Self-Report and two items were included for Teacher. Analyses revealed potential item redundancy; therefore, only two items were retained on the Parent and Self-Report forms and one item was retained on the Teacher form. Next, to determine what item rating would result in an elevated response, the response frequencies for the retained items (for the list of Sleep Problems Indicator items, refer to appendix C) were examined within each normative age group. For Parent and Teacher, results revealed that item responses that are greater than or equal to 1, “Just a little true (Occasionally),” should be considered elevated (i.e., relatively infrequent within the General Population sample, corresponding to approximately the upper quartile of the distribution). For the Self-Report, results for the item, “I have trouble falling or staying asleep,” revealed that a response greater than or equal to 1, “Just a little true (Occasionally),” is considered elevated, whereas for the item, “I am tired,” a response greater than or equal to 2, “Pretty much true (Often/Quite a bit),” is considered elevated.
Using the data collected during the standardization phase, analyses were conducted to evaluate item and scale functioning for the Conners 4 Content Scales and the Impairment & Functional Outcome Scales. Analyses relied on both CTT and IRT frameworks, and items were flagged for review if they met any of the following criteria:
Poor discrimination between intended groups (e.g., an item about ADHD-related symptoms should demonstrate a notable mean difference between individuals with and without an ADHD diagnosis) as measured by Cliff’s delta (Cliff, 1993; Romano et al., 2006).
Significant associations with demographic characteristics (e.g., items should not have a significant correlation with an individual’s parental education level).
Weak relation to the intended factor (e.g., item-total correlations and factor loadings ≤ .40).
Relatively low item information, as determined by IRT models.
To identify these features, descriptive statistics such as item-level means and the frequency of response distributions were examined, along with inter-item correlations, correlations and mean group differences between items and demographic groups (as well as raw scale scores and demographic groups), fit to IRT models, and tests for measurement bias (i.e., differential item functioning [DIF] using IRT). Items with statistically significant DIF were reviewed for meaningful effect sizes and inspected graphically for the nature of the DIF effect. In all instances, the items presented minor concerns; a very small proportion (i.e., less than 10%) of each scale contained DIF items, and the DIF effects were quite small in size. These items were retained, given that item-level DIF did not result in a meaningful lack of measurement invariance at the test level (see chapter 10, Fairness), while the items in question remained flagged for further exploration for the purposes of developing the Conners 4–Short and Conners 4–ADHD Index (see chapters 11 and 12, respectively, for more details).
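As an illustration of the group-discrimination criterion listed above, Cliff’s delta can be computed directly from its definition: the proportion of cross-group pairs in which a score from one group exceeds a score from the other, minus the proportion in which it is lower. The sketch below uses hypothetical item responses and a hypothetical function name; it is not the analysis code used in development.

```python
from typing import Sequence


def cliffs_delta(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs.

    Ranges from -1 to +1; 0 indicates complete overlap between groups.
    """
    greater = sum(1 for a in group_a for b in group_b if a > b)
    less = sum(1 for a in group_a for b in group_b if a < b)
    return (greater - less) / (len(group_a) * len(group_b))


# Hypothetical responses (0-3) to a single item, for illustration only.
adhd_responses = [2, 3, 3, 1]
general_population_responses = [0, 1, 1, 2]

delta = cliffs_delta(adhd_responses, general_population_responses)
print(delta)  # 0.6875
```

Because it relies only on the ordering of responses, this effect size suits the ordinal 0 to 3 item response format.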
For the Content Scales, confirmatory factor analyses (CFA) were conducted at the item level to cross-validate the latent structure of the item pool, as initially explored during the pilot phase, and its alignment with the theoretical framework (for more details on CFAs conducted, see Internal Structure in chapter 9, Validity). These models provided information about inter-item relationships and each item’s importance to the factors. The identified factors replicated the results of EFAs from the pilot phase. Competing models were tested, and the best-fitting model for all rater forms was a six-factor solution: Inattention/Executive Dysfunction, Hyperactivity, Impulsivity, Emotional Dysregulation, Depressed Mood, and Anxious Thoughts. Items that were flagged for not meeting the specified criteria or for general poor or inconsistent performance were reviewed by the development team for construct relevance and clinical significance. Following this review, the Content Scales comprise 59 items for Parent, 59 items for Teacher, and 60 items for Self-Report.
For the Impairment & Functional Outcome Scales, items that met the item-level criteria were tested using CFA, and the 3-factor model (for Parent and Self-Report; 2-factor model for Teacher) emerged as the best fit (see Internal Structure in chapter 9, Validity, for details about all models tested). The three factors consist of Schoolwork, Peer Interactions, and Family Life (for Parent and Self-Report forms only). Upon review of all psychometric results and careful clinical consideration, the item pool for these scales yielded 19 items for Parent, 12 items for Teacher, and 19 items for Self-Report.
The selected items were collected into scales in alignment with the final corresponding CFA model. Raw scores were then calculated for each Content Scale and each Impairment & Functional Outcome Scale, which were then converted to T-scores and percentiles. For more details on converting raw scores into standardized scores, please see Standardization Procedures and Continuous Norming in chapter 7, Standardization.
In addition to providing standardized scores at the scale level, analyses were conducted to provide information about individual items for all Content Scale and Impairment & Functional Outcome Scale items. Specifically, analyses were conducted with data from the Normative Samples to determine when an item score should be considered to be “Elevated” (see chapter 4, Interpretation, for more information on elevated items). After careful examination of response frequencies, it was determined that all items endorsed at levels that were deemed infrequent in the Conners 4 Normative Samples (that is, endorsed at approximately the 85th percentile or higher) are flagged as “Elevated” in the report.
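One way to read the 85th-percentile rule is sketched below: for a given item, the lowest response option that at least 85% of the normative sample scored below is treated as the point at which the item becomes “Elevated.” This operationalization, the function name, and the toy responses are assumptions made for illustration; the manual describes the threshold only as approximately the 85th percentile.

```python
from collections import Counter
from typing import Optional, Sequence


def elevated_threshold(
    responses: Sequence[int], percentile: float = 85.0, max_score: int = 3
) -> Optional[int]:
    """Lowest non-zero response that `percentile` % of the sample scored below.

    Returns None if no response option is that infrequent.
    """
    counts = Counter(responses)
    n = len(responses)
    below = 0  # number of cases scoring below the current response option
    for value in range(max_score + 1):
        pct_below = 100.0 * below / n
        if value > 0 and pct_below >= percentile:
            return value
        below += counts.get(value, 0)
    return None


# Hypothetical normative responses to one item: 12 zeros, 5 ones, 2 twos, 1 three.
item_responses = [0] * 12 + [1] * 5 + [2] * 2 + [3] * 1
threshold = elevated_threshold(item_responses)
print(threshold)  # 2
```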
In addition to meeting all the requirements of a successful item as described in the Content and Impairment & Functional Outcome Scales: Item Selection & Scoring section, items for the DSM Symptom Scales must also capture the DSM Symptom A criteria. In some instances, multiple items were written to capture a criterion. DSM ADHD Symptom Scale items also appear on the Content Scales and were therefore reviewed at that time, whereas the DSM Oppositional Defiant Disorder Symptoms and DSM Conduct Disorder Symptoms scales contain unique items that were subjected to a parallel but separate review process. If all items for a given criterion met item performance standards in that review, decisions to retain items were made based on relative empirical strengths (e.g., more pronounced clinical group differences) combined with clinical judgment regarding DSM content alignment, in an attempt to reduce the number of items per scale. The final DSM ADHD Inattentive Symptoms Scale is composed of a subset of items from the Conners 4 Inattention/Executive Dysfunction Content Scale, and the final DSM ADHD Hyperactive/Impulsive Symptoms Scale is composed of a subset of items from both the Conners 4 Hyperactivity and Conners 4 Impulsivity Content Scales. The DSM ADHD Inattentive Symptoms Scale and DSM ADHD Hyperactive/Impulsive Symptoms Scale were combined to create the DSM Total ADHD Symptoms Scale. Item responses for each scale were summed to create raw scores for the DSM Symptom Scales, which were converted to T-scores and percentiles. For more details on converting raw scores into standardized scores, please see Standardization Procedures and Continuous Norming in chapter 7, Standardization.
For the DSM Symptom Scales, a raw count of symptoms endorsed was calculated to reflect the symptom criteria of the DSM, and the item-level cut-offs were aligned with the requirements from the DSM. For example, the DSM ADHD criteria specify that certain symptoms must be present often, and thus an equivalent response for the Conners 4 DSM ADHD items reflecting each criterion must be endorsed to count toward the Symptom Count. These responses were either a 2, “Pretty much true (Often/Quite a bit),” or 3, “Completely true (Very often/Always).” It should be noted that while most DSM criteria are represented by a single Conners 4 item, there are some DSM symptom criteria that are represented on the Conners 4 by a combination of items. For full details about the scoring criteria for the Conners 4 DSM Symptom Scales, see appendix D.
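The Symptom Count logic described above can be sketched as follows. The criterion labels, item identifiers, and responses are hypothetical, and two simplifications are assumed: a criterion represented by several items is counted when any one of them meets the threshold, and a missing response is treated as not endorsed. The actual scoring rules, including how item combinations are handled, are given in appendix D.

```python
from typing import Dict, List

# Responses of 2 ("Pretty much true") or 3 ("Completely true") count as endorsed.
SYMPTOM_THRESHOLD = 2


def symptom_count(
    responses: Dict[str, int], criteria: Dict[str, List[str]]
) -> int:
    """Number of DSM criteria with at least one item at or above the threshold.

    Assumptions (illustration only): an "any item" rule for criteria mapped to
    several items, and missing responses treated as 0.
    """
    count = 0
    for item_ids in criteria.values():
        if any(responses.get(item, 0) >= SYMPTOM_THRESHOLD for item in item_ids):
            count += 1
    return count


# Hypothetical mapping of criteria to items, and hypothetical item responses.
example_criteria = {
    "criterion_a": ["item_01"],
    "criterion_b": ["item_02", "item_03"],  # criterion represented by two items
    "criterion_c": ["item_04"],
}
example_responses = {"item_01": 3, "item_02": 1, "item_03": 2, "item_04": 1}

print(symptom_count(example_responses, example_criteria))  # 2
```

Note that each criterion contributes at most one symptom to the count, regardless of how many of its items are endorsed.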
As with the Content Scales and Impairment & Functional Outcome Scales, all items in the DSM Symptom Scales that are endorsed at approximately the 85th percentile or higher (i.e., deemed infrequent in the Conners 4 Normative Samples) are flagged as “Elevated” in the report.
Upon selection of the final items included in the Conners 4 forms, development of the ADHD Index began. Items were re-analyzed to select the items that most efficiently distinguish between youth with and without ADHD. Twelve items were selected for use in this index; for more information about its creation, see chapter 12, Conners 4–ADHD Index.
1 The number of omitted items is also provided as part of the Response Style Analysis. Because its development did not involve data analysis, it is not discussed here. Please see chapter 4, Interpretation, for a discussion of this metric.