# Test Bias {#bias}
## Overview of Bias {#overview-bias}
There are multiple definitions of the term "bias" depending on the context.
In general, bias is a [systematic error](#systematicError) [@Reynolds2012].\index{bias}\index{bias!types of}
[Mean error](#meanError) is an example of [systematic error](#systematicError), and is sometimes called bias.\index{bias}\index{measurement error!systematic error}\index{bias!types of}
Cognitive biases are systematic errors in thinking, including confirmation bias and hindsight bias.\index{bias!cognitive}\index{bias!confirmatory}\index{bias!types of}
[Method biases](#methodBias) are a form of [systematic error](#systematicError) that involve the influence of measurement on a person's score that is not due to the person's level on the construct.\index{method bias}\index{bias}\index{measurement error!systematic error}\index{bias!types of}
[Method biases](#methodBias) include response biases or response styles, including acquiescence and social desirability bias.\index{method bias}\index{bias}\index{measurement error!systematic error}\index{response style}\index{bias!social desirability}\index{bias!types of}
Attentional bias refers to the tendency to process some types of stimuli more than others.\index{bias!attentional}\index{bias!types of}
Sometimes bias is used to refer in particular to [systematic error](#systematicError) (in measurement, prediction, etc.) as a function of group membership, where test bias refers to the same score having different meaning for different groups.\index{bias}\index{bias!types of}
Under this meaning, a test is unbiased if a given test score has the same meaning regardless of group membership.\index{bias}\index{bias!types of}
For example, a test is biased if there is differential [validity](#validity) of test scores for groups (e.g., age, education, culture, race, sex).\index{culture}\index{bias}\index{validity}\index{bias!types of}
Test bias would exist, for instance, if a test is a less [valid](#validity) predictor for racial minorities or linguistic minorities.\index{bias}\index{validity}\index{bias!types of}
Test bias would also exist if scores on the Scholastic Aptitude Test (SAT) under-estimate women's grades in college, for instance.\index{bias}\index{measurement error!systematic error}\index{bias!types of}
There are some known instances of test bias, as described in Section \@ref(biasExamples).\index{bias}
Research has not produced much empirical evidence of test bias [@Brown1999; @Hall1999; @Jensen1980; @Kuncel2010a; @Reynolds2012; @Reynolds2021; @Sackett1994; @Sackett2008], though some item-level bias is not uncommon.\index{bias}
Moreover, where test bias has been observed, it is often small, unclear, and does not always generalize [@Cole1981].\index{bias}
However, just because there is not much empirical evidence of test bias does not mean that test bias does not exist.\index{bias}
Moreover, just because a test does not show bias does not mean that it should be used.\index{bias}
Furthermore, just because a test does not show bias does not mean that there are not race-, social class-, and gender-related [biases in clinical judgment](#biasClinicalJudgment) during the assessment process.\index{clinical judgment!bias}\index{bias!clinical judgment}
It is also worth pointing out that group differences in scores do not necessarily indicate bias.\index{bias}
Group differences in scores could reflect true group differences in the construct.\index{bias}
For instance, women have better verbal abilities, on average, compared to men.\index{bias}
So, if women's scores on a verbal ability test are higher on average than men's scores, this would not be sufficient evidence for bias.\index{bias}
There are two broad categories of test bias:\index{bias!types of}
1. [predictive bias](#predictiveBias)\index{bias}\index{bias!types of}\index{bias!predictive}
1. [test structure bias](#testStructureBias)\index{bias}\index{bias!types of}\index{bias!test structure}
[Predictive bias](#predictiveBias) refers to differences between groups in the relation between the test and criterion.\index{bias}\index{bias!types of}\index{bias!predictive}\index{validity!criterion}
As with all [criterion-related validity](#criterionValidity) tests, the findings depend on the strength and quality of the criterion.\index{validity!criterion}
[Test structure bias](#testStructureBias) refers to differences in the internal test characteristics across groups.\index{bias}\index{bias!types of}\index{bias!test structure}
## Ways to Investigate/Detect Test Bias {#detectBias}
### Predictive Bias {#predictiveBias}
Predictive bias exists when differences emerge between groups in terms of [predictive validity](#predictiveValidity) to a criterion.\index{bias!predictive}
It is assessed using a regression line that examines the association between the test score and a criterion such as job performance.\index{bias!predictive}
For instance, consider a [2x2 confusion matrix](#confusionMatrix) used for the standard prediction problem.\index{confusion matrix}
A [confusion matrix](#confusionMatrix) for whom to select for a job is depicted in Figure \@ref(fig:jobSelection).\index{confusion matrix}
```{r jobSelection, out.width = "100%", fig.align = "center", fig.cap = "2x2 Confusion Matrix for Job Selection. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "2x2 Confusion Matrix for Job Selection.", echo = FALSE}
knitr::include_graphics("./Images/JobSelection.png")
```
We can also visualize the [confusion matrix](#confusionMatrix) in terms of a scatterplot of the test scores (i.e., predicted job performance) and the "truth" scores (i.e., actual job performance), as depicted in Figure \@ref(fig:jobSelection2).\index{confusion matrix}\index{bias!predictive}
The predictor (test score) is on the x-axis.\index{bias!predictive}
The criterion (job performance) is on the y-axis.\index{bias!predictive}
The quadrants reflect the cutoffs (i.e., thresholds) imposed from the 2x2 confusion matrix.\index{confusion matrix}\index{bias!predictive}
The vertical line reflects the cutoff for selecting someone for a job.\index{bias!predictive}
The horizontal line reflects the cutoff for good job performance (i.e., people who should have been selected for the job).\index{bias!predictive}
```{r jobSelection2, fig.height = 8, fig.align = "center", fig.cap = "2x2 Confusion Matrix for Job Selection in the Form of a Graph With Predicted Performance on the x-Axis and Actual Job Performance on the y-Axis. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "2x2 Confusion Matrix for Job Selection in the Form of a Graph.", echo = FALSE}
plot.new()
plot.window(xlim = c(-5,5), ylim = c(-5,5))
axis(side = 1, labels = c("Bad", "Good"), at = c(-5, 5), pos = 0)
axis(side = 2, labels = c("Bad", "Good"), at = c(-5, 5), pos = 0, las = 1)
text(x = 2.5, y = -0.4, "Test Score", cex = 1.5)
text(x = -0.3, y = 2.5, "Job Performance", cex = 1.5, srt = 90)
text(x = 5, y = 5, "TP")
text(x = -5, y = -5, "TN")
text(x = 5, y = -5, "FP")
text(x = -5, y = 5, "FN")
```
The data points in the top right quadrant are [true positives](#truePositive): people who the test predicted would do a good job and who did a good job.\index{bias!predictive}\index{true positive}
The data points in the bottom left quadrant are [true negatives](#trueNegative): people who the test predicted would do a poor job and who would have done a poor job.\index{bias!predictive}\index{true negative}
The data points in the bottom right quadrant are [false positives](#falsePositive): people who the test predicted would do a good job and who did a poor job.\index{bias!predictive}\index{false positive}
The data points in the top left quadrant are [false negatives](#falseNegative): people who the test predicted would do a poor job and who would have done a good job.\index{bias!predictive}\index{false negative}
```{r, echo = FALSE}
set.seed(52242)
predictor <- runif(100, min = -4, max = 4)
outcomeGoodPredictor <- 0 + 1*predictor + rnorm(100)
outcomeBadPredictor <- 0 + 0.1*predictor + rnorm(100)
```
Figure \@ref(fig:jobSelection3) depicts a strong predictor.\index{bias!predictive}
The best-fit regression line has a steep slope where there are lots of data points that are [true positives](#truePositive) and [true negatives](#trueNegative), with relatively few [false positives](#falsePositive) and [false negatives](#falseNegative).\index{bias!predictive}\index{true positive}\index{true negative}\index{false positive}\index{false negative}
```{r jobSelection3, fig.height = 8, fig.align = "center", fig.cap = "Example of a Strong Predictor. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "Example of a Strong Predictor.", echo = FALSE}
plot.new()
plot.window(xlim = c(-5,5), ylim = c(-5,5))
points(x = predictor, y = outcomeGoodPredictor)
abline(lm(outcomeGoodPredictor ~ predictor))
axis(side = 1, labels = c("Bad", "Good"), at = c(-5, 5), pos = 0)
axis(side = 2, labels = c("Bad", "Good"), at = c(-5, 5), pos = 0, las = 1)
text(x = 2.5, y = -0.4, "Test Score", cex = 1.5)
text(x = -0.3, y = 2.5, "Job Performance", cex = 1.5, srt = 90)
text(x = 5, y = 5, "TP")
text(x = -5, y = -5, "TN")
text(x = 5, y = -5, "FP")
text(x = -5, y = 5, "FN")
```
Figure \@ref(fig:jobSelection4) depicts a poor predictor.\index{bias!predictive}
The best-fit regression line has a shallow slope where there are just as many data points that are in the false cells ([false positives](#falsePositive) and [false negatives](#falseNegative)) as there are in the true cells ([true positives](#truePositive) and [true negatives](#trueNegative)).\index{bias!predictive}\index{true positive}\index{true negative}\index{false positive}\index{false negative}
In general, the steeper the slope, the better the predictor.\index{bias!predictive}
```{r jobSelection4, fig.height = 8, fig.align = "center", fig.cap = "Example of a Poor Predictor. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "Example of a Poor Predictor.", echo = FALSE}
plot.new()
plot.window(xlim = c(-5,5), ylim = c(-5,5))
points(x = predictor, y = outcomeBadPredictor)
abline(lm(outcomeBadPredictor ~ predictor))
axis(side = 1, labels = c("Bad", "Good"), at = c(-5, 5), pos = 0)
axis(side = 2, labels = c("Bad", "Good"), at = c(-5, 5), pos = 0, las = 1)
text(x = 2.5, y = -0.4, "Test Score", cex = 1.5)
text(x = -0.3, y = 2.5, "Job Performance", cex = 1.5, srt = 90)
text(x = 5, y = 5, "TP")
text(x = -5, y = -5, "TN")
text(x = 5, y = -5, "FP")
text(x = -5, y = 5, "FN")
```
We can evaluate predictive bias using a best-fit regression line between the predictor and criterion for each group.\index{bias!predictive}
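One common way to do this empirically is moderated multiple regression, in which group membership and a test score $\times$ group interaction are added as predictors of the criterion: a significant interaction suggests different slopes, whereas a significant group effect suggests different intercepts. Below is a minimal sketch using simulated data; the variable names (`testScore`, `jobPerformance`, `group`) are hypothetical.\index{bias!predictive}
```{r}
# Simulated data (hypothetical variable names)
set.seed(52242)
n <- 200
group <- factor(rep(c("majority", "minority"), each = n/2))
testScore <- rnorm(n)
jobPerformance <- 0.5 * testScore + rnorm(n)

# Moderated regression: the group term tests intercept differences;
# the testScore:group interaction tests slope differences
unbiasedModel <- lm(jobPerformance ~ testScore)
biasModel <- lm(jobPerformance ~ testScore * group)

summary(biasModel)
anova(unbiasedModel, biasModel) # does allowing group differences improve fit?
```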
#### Types of Predictive Bias {#typesPredictiveBias}
There are three types of predictive bias:\index{bias!predictive}
1. [Different slopes](#differentSlopes)\index{bias!predictive!different slopes}
1. [Different intercepts](#differentIntercepts)\index{bias!predictive!different intercepts}
1. [Different intercepts and slopes](#differentInterceptsAndSlopes)\index{bias!predictive!different intercepts and slopes}
The slope of the regression line is the steepness of the line.\index{bias!predictive}
The intercept of the regression line is the y-value of the point where the line crosses the y-axis (i.e., when $x = 0$).\index{bias!predictive}
If a measure shows predictive bias, the regression lines for the groups differ in slopes and/or intercepts.\index{bias!predictive}
##### Different Slopes {#differentSlopes}
Predictive bias in terms of different slopes exists when there are differences in the slope of the regression line between minority and majority groups.\index{bias!predictive!different slopes}
The slope describes the *direction* and *steepness* of the regression line.\index{bias!predictive!different slopes}
The slope of a regression line is the amount of change in $y$ for every unit change in $x$ (i.e., rise over run).\index{bias!predictive!different slopes}
Differing slopes indicate differential [predictive validity](#predictiveValidity), in which the test is a more effective predictor of performance in one group over the other.\index{bias!predictive!different slopes}\index{validity!predictive}
Different slopes predictive bias is depicted in Figure \@ref(fig:testBias1).\index{bias!predictive!different slopes}
In the figure, the predictor performs well in the majority group.\index{bias!predictive!different slopes}
However, the slope is close to zero in the minority group, indicating that there is no association between the predictor and the criterion for the minority group.\index{bias!predictive!different slopes}
```{r testBias1, out.width = "100%", fig.align = "center", fig.cap = "Test Bias: Different Slopes. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "Test Bias: Different Slopes.", echo = FALSE}
knitr::include_graphics("./Images/testBias-01.png")
```
Different slopes can especially occur if we develop our measure and criterion based on the normative majority group.\index{bias!predictive!different slopes}
Research has not found much empirical evidence of different slopes across groups.\index{bias!predictive!different slopes}
However, samples often do not have the power to detect differing slopes [@Aguinis2010a].\index{bias!predictive!different slopes}
Theoretically, to fix biases related to different slopes, you should find another measure that is more predictive for the minority group.\index{bias!predictive!different slopes}
If the predictor is a strong predictor in both groups but shows slight differences in the slope, [within-group norming](#withinGroupNorming) could be used.\index{bias!predictive!different slopes}
##### Different Intercepts {#differentIntercepts}
Predictive bias in terms of different intercepts exists when there are differences in the intercept of the regression line between minority and majority groups.\index{bias!predictive!different intercepts}
The $y$-intercept is the point at which the regression line intersects the $y$-axis (i.e., when $x = 0$).\index{bias!predictive!different intercepts}
When the groups' regression lines have similar slopes, intercept differences suggest that the measure systematically under- or over-estimates group performance relative to the person's ability.\index{bias!predictive!different intercepts}
The same test score leads to systematically different predictions for the majority and minority groups.\index{bias!predictive!different intercepts}
In other words, minority group members get different test scores than majority group members with the same ability.\index{bias!predictive!different intercepts}
Different intercepts predictive bias is depicted in Figure \@ref(fig:testBias2).\index{bias!predictive!different intercepts}
```{r testBias2, out.width = "100%", fig.align = "center", fig.cap = "Test Bias: Different Intercepts. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "Test Bias: Different Intercepts.", echo = FALSE}
knitr::include_graphics("./Images/testBias-02.png")
```
A higher intercept (relative to zero) indicates that the measure *under*-estimates a person's ability (at that test score)—i.e., the person's job performance is better than what the test score would suggest.\index{bias!predictive!different intercepts}
A lower intercept (relative to zero) indicates that the measure *over*-estimates a person's ability (at that test score)—i.e., the person's job performance is worse than what the test score would suggest.\index{bias!predictive!different intercepts}
Figure \@ref(fig:testBias2) indicates that the measure systematically under-estimates the job performance of the minority group.\index{bias!predictive!different intercepts}
Performance among members of a minority group could be under- or over-estimated.\index{bias!predictive!different intercepts}
For example, historically, women's grades in math and engineering classes tended to be under-estimated by the Scholastic Aptitude Test [SAT; @Clark1984].\index{bias!predictive!different intercepts}
However, where intercept differences have been observed, measures often show small *over*-estimation of school and job performance among minority groups [@Reynolds2012].\index{bias!predictive!different intercepts}
For example, women's physical strength and endurance is over-estimated based on physical ability tests [@Sackett1994].\index{bias!predictive!different intercepts}
In addition, over-estimation of African Americans' and Hispanics' school and job performance has been observed based on cognitive ability tests [@Cole1981; @Reynolds2012; @Sackett1994; @Sackett2008].\index{bias!predictive!different intercepts}
At the same time, the Black–White difference in job performance is less than the Black–White difference in test performance.\index{bias!predictive!different intercepts}
The over-prediction of lower-scoring groups is likely mostly an artifact of [measurement error](#measurementError) [@Gottfredson1994].\index{bias!predictive!different intercepts}
The over-estimation of African Americans' and Hispanics' school and job performance may be due to [measurement error](#measurementError) in the tests.\index{bias!predictive!different intercepts}
Moreover, test scores explain only a portion of the variation in job performance.\index{bias!predictive!different intercepts}
Black people are far less disadvantaged on the noncognitive determinants of job performance than on the cognitive ones.\index{bias!predictive!different intercepts}
Nevertheless, the over-estimation that has been often observed is *on average*—the performance is not over-estimated for all individuals of the groups even if there is an average over-estimation effect.\index{bias!predictive!different intercepts}
In addition, simulation findings indicate that lower intercepts (i.e., over-estimation) among minority groups compared to majority groups could be observed if there are different slopes but not different intercepts in the population, because different slopes are likely to go undetected due to low power [@Aguinis2010a].\index{bias!predictive!different intercepts}
That is, if a test shows weaker [validity](#validity) for a minority group than the majority group, it could appear as different intercepts that favor the minority group when, in fact, it reflects shallower slopes of the minority group that go undetected.\index{bias!predictive!different intercepts}\index{bias!predictive!different slopes}\index{validity}
Predictive biases in intercepts could especially occur if we develop tests that are based on the majority group, and the items assess constructs other than the construct of interest which are systematically biased in favor of the majority group or against the minority group.\index{bias!predictive!different intercepts}
Arguments about reduced power to detect differences are less relevant for intercepts and means than for slopes.\index{bias!predictive!different intercepts}
To correct for a bias in intercepts, we could add [bonus points](#bonusPoints) to the scores for the minority group to correct for the amount of the [systematic error](#systematicError), and to result in the same regression line.\index{bias!predictive!different intercepts}\index{measurement error!systematic error}
But if the minority group is over-predicted (as has often been the case where intercept differences have been observed), we would not want to use [score adjustment](#scoreAdjustment) to lower the minority group's scores.\index{bias!predictive!different intercepts}
##### Different Intercepts and Slopes {#differentInterceptsAndSlopes}
Predictive bias in terms of different intercepts and slopes exists when there are differences in the intercept and slope of the regression line between minority and majority groups.\index{bias!predictive!different intercepts and slopes}
In cases of different intercepts and slopes, there is both differential [validity](#validity) (because the regression lines have different slopes), as well as varying under- and over-estimation of groups' performance at particular scores.\index{bias!predictive!different intercepts and slopes}\index{validity}
Different intercepts and slopes predictive bias is depicted in Figure \@ref(fig:testBias3).\index{bias!predictive!different intercepts and slopes}
```{r testBias3, out.width = "100%", fig.align = "center", fig.cap = "Test Bias: Different Intercepts and Slopes. TP = true positive; TN = true negative; FP = false positive; FN = false negative.", fig.scap = "Test Bias: Different Intercepts and Slopes.", echo = FALSE}
knitr::include_graphics("./Images/testBias-03.png")
```
In instances of different intercepts and slopes predictive bias, a measure can simultaneously over-estimate and under-estimate a person's ability at different test scores.\index{bias!predictive!different intercepts and slopes}
For instance, a measure can under-estimate a person's ability at higher test scores and can over-estimate a person's ability at lower test scores.\index{bias!predictive!different intercepts and slopes}
Different intercepts and slopes across groups is possibly more realistic than just different intercepts or just different slopes.\index{bias!predictive!different intercepts and slopes}
However, different intercepts and slopes predictive bias is more complicated to study, represent, and resolve.\index{bias!predictive!different intercepts and slopes}
It is difficult to examine because of complexity, and it is not easy to fix.\index{bias!predictive!different intercepts and slopes}
Currently, there are no established methods for addressing predictive bias involving different intercepts and slopes.\index{bias!predictive!different intercepts and slopes}
We would need to use a different measure or measures for each group.\index{bias!predictive!different intercepts and slopes}
### Test Structure Bias {#testStructureBias}
In addition to predictive bias, another type of test bias is test structure bias.\index{bias!test structure}
Test structure bias involves differences in internal test characteristics across groups.\index{bias!test structure}
Examining test structure bias is different from examining the total score, as is used when examining predictive bias.\index{bias!test structure}
Test structure bias can be identified empirically or based on theory/judgment.\index{bias!test structure}\index{empiricism}\index{theory}\index{judgment}\index{clinical judgment}
Empirically, test structure bias can be examined in multiple ways.\index{bias!test structure}\index{empiricism}
#### Empirical Approaches to Identification {#testStructureBiasEmpirical}
##### Item $\times$ Group tests (ANOVA) {#itemXgroupTests-bias}
Item $\times$ Group tests in analysis of variance (ANOVA) examine whether the difference between groups on the overall score matches the group differences on smaller item sets.\index{bias!test structure}
Item $\times$ Group tests are used to rule out that items are operating in different ways in different groups.\index{bias!test structure}
If the items operate in different ways in different groups, they do not have the same meaning across groups.\index{bias!test structure}
For example, if we are going to use a measure for multiple groups, we would expect its items to operate similarly across groups.\index{bias!test structure}
So, if women show higher scores on a depression measure compared to men, we would also expect them to show similar elevations on each item (e.g., sleep loss).\index{bias!test structure}
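As a minimal sketch (with hypothetical variable names), an Item $\times$ Group test can be conducted as a repeated-measures ANOVA in which item is a within-person factor and group is a between-person factor; a significant Group $\times$ Item interaction suggests that items operate differently across groups.\index{bias!test structure}
```{r}
# Hypothetical long-format data: one row per person x item
set.seed(52242)
longData <- expand.grid(id = 1:100, item = paste0("item", 1:10))
longData$group <- ifelse(longData$id <= 50, "men", "women")
longData$response <- rnorm(nrow(longData))
longData[c("id", "item", "group")] <- lapply(longData[c("id", "item", "group")], factor)

# Repeated-measures ANOVA with item as a within-person factor and group as a between-person factor
itemByGroupModel <- aov(response ~ group * item + Error(id/item), data = longData)
summary(itemByGroupModel) # examine the group:item interaction term
```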
##### Item Response Theory {#irt-bias}
Using [item response theory](#irt), we can examine [differential item functioning](#dif) (DIF).\index{bias!test structure}\index{item response theory}\index{item response theory!differential item functioning}
Evidence of [DIF](#dif) indicates that there are differences between groups in terms of [discrimination](#itemDiscrimination) and/or [difficulty/severity](#itemDifficulty) of items.\index{bias!test structure}\index{item response theory!differential item functioning}\index{item response theory!item discrimination}\index{item response theory!item difficulty}
Differences between groups in terms of the [item characteristic curve](#icc) (which combines the item's [discrimination](#itemDiscrimination) and [severity](#itemDifficulty)) would be evidence against [construct validity](#constructValidity) invariance between the groups and would provide evidence of bias.\index{bias!test structure}\index{item response theory!differential item functioning}\index{item response theory!item discrimination}\index{item response theory!item difficulty}\index{item response theory!item characteristic curve}\index{validity!construct}
[DIF](#dif) examines the stretching and compression of item functioning across different groups.\index{bias!test structure}\index{item response theory!differential item functioning}
As an example, consider the item "bites others" in relation to externalizing problems.\index{bias!test structure}\index{item response theory!differential item functioning}
The item would be expected to show a weaker [discrimination](#itemDiscrimination) and higher [severity](#itemDifficulty) in adults compared to children.\index{bias!test structure}\index{item response theory!differential item functioning}\index{item response theory!item discrimination}\index{item response theory!item difficulty}
[DIF](#dif) is discussed in Section \@ref(dif).\index{bias!test structure}\index{item response theory!differential item functioning}
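As a minimal sketch of a DIF analysis, assuming the `mirt` package is installed, one could fit a multiple-group item response model and test each item for differences in discrimination and difficulty/severity; the objects `itemData` (a data frame of item responses) and `group` (a grouping vector) are hypothetical.\index{item response theory!differential item functioning}
```{r, eval = FALSE}
library(mirt)

# Fit a multiple-group 2PL model with item parameters freely estimated in each group
mgModel <- multipleGroup(itemData, model = 1, group = group, itemtype = "2PL")

# Test each item for DIF in discrimination (a1) and difficulty/severity (d)
DIF(mgModel, which.par = c("a1", "d"))
```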
##### Confirmatory Factor Analysis {#cfa-bias}
[Confirmatory factor analysis](#cfa-sem) allows tests of [measurement invariance](#measurementInvariance) (also called factorial invariance).\index{bias!test structure}\index{structural equation modeling!measurement invariance}\index{factor analysis!confirmatory}
[Measurement invariance](#measurementInvariance) examines whether the factor structure of the underlying latent variables in the test is consistent across groups.\index{bias!test structure}\index{structural equation modeling!measurement invariance}\index{latent variable}
It also examines whether the manifestation of the construct differs between groups.\index{bias!test structure}\index{structural equation modeling!measurement invariance}
[Measurement invariance](#measurementInvariance) is discussed in Section \@ref(measurementInvariance).\index{bias!test structure}\index{structural equation modeling!measurement invariance}
Even if you find the same slope and intercepts across groups in a prediction model, the measure would still be assessing different constructs across groups if the measure has a different factor structure between the groups.\index{bias!test structure}\index{structural equation modeling!measurement invariance}
A different factor structure across groups is depicted in Figure \@ref(fig:testBias4).\index{bias!test structure}\index{structural equation modeling!measurement invariance}
```{r testBias4, out.width = "100%", fig.align = "center", fig.cap = "Different Factor Structure Across Groups.", echo = FALSE}
knitr::include_graphics("./Images/testBias-04.png")
```
An example of a different factor structure across groups is the differentiation of executive functions from two factors to three factors (inhibition, working memory, cognitive flexibility) across childhood [@Lee2013].\index{bias!test structure}\index{structural equation modeling!measurement invariance}
There are different degrees of measurement invariance [for a review, see @Putnick2016]:\index{bias!test structure}\index{structural equation modeling!measurement invariance}
- Configural invariance: same number of factors in each group, and which indicators load on which factors are the same in each group (i.e., the same pattern of significant loadings in each group).\index{bias!test structure}\index{structural equation modeling!measurement invariance}
- Metric ("weak factorial") invariance: items have the same factor loadings ([discrimination](#itemDiscrimination)) in each group.\index{bias!test structure}\index{structural equation modeling!measurement invariance}\index{structural equation modeling!factor loading}\index{item response theory!item discrimination}
- Scalar ("strong factorial") invariance: items have the same intercepts ([difficulty/severity](#itemDifficulty)) in each group.\index{bias!test structure}\index{structural equation modeling!measurement invariance}\index{structural equation modeling!intercept}\index{item response theory!item difficulty}
- Residual ("strict factorial") invariance: items have the same residual/unique variances in each group.\index{bias!test structure}\index{structural equation modeling!measurement invariance}
##### Structural Equation Modeling {#sem-bias}
[Structural equation modeling](#sem) extends a [confirmatory factor analysis](#cfa) (CFA) model by incorporating prediction.\index{factor analysis!confirmatory}\index{structural equation modeling}
Structural equation modeling allows examining differences in the underlying structure along with differences in prediction in the same model.\index{factor analysis!confirmatory}\index{structural equation modeling}\index{bias!test structure}\index{bias!predictive}
##### Signal Detection Theory {#sdt-bias}
[Signal detection theory](#sdt) is a dynamic measure of bias.\index{bias}\index{signal detection theory}
It allows examining the overall bias in selection systems, including both accuracy and errors at various cutoffs ([sensitivity](#sensitivity), [specificity](#specificity), [positive predictive value](#ppv), and [negative predictive value](#npv)), as well as accuracy across all possible cutoffs (the [area under the receiver operating characteristic curve](#auc)).\index{bias}\index{signal detection theory}\index{sensitivity}\index{specificity}\index{positive predictive value}\index{negative predictive value}\index{receiver operating characteristic curve!area under the curve}
While there may be similar [predictive validity](#predictiveValidity) between groups, the type of errors we are making across groups might differ.\index{bias!predictive}\index{validity!predictive}
It is important to decide which types of errors to emphasize depending on the fairness goals, and to examine [sensitivity](#sensitivity) and [specificity](#specificity) when adjusting cutoffs.\index{bias!predictive}\index{signal detection theory}\index{sensitivity}\index{specificity}
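As a minimal sketch, assuming the `pROC` package is installed, one could compare accuracy and error trade-offs across groups; the objects `testScore`, `performedWell` (a 0/1 outcome), and `group` are hypothetical.\index{signal detection theory}\index{receiver operating characteristic curve!area under the curve}
```{r, eval = FALSE}
library(pROC)

rocGroupA <- roc(performedWell[group == "A"], testScore[group == "A"])
rocGroupB <- roc(performedWell[group == "B"], testScore[group == "B"])

auc(rocGroupA) # accuracy across all possible cutoffs for Group A
auc(rocGroupB) # accuracy across all possible cutoffs for Group B

# Sensitivity and specificity in each group at a common cutoff
coords(rocGroupA, x = 0.5, input = "threshold", ret = c("sensitivity", "specificity"))
coords(rocGroupB, x = 0.5, input = "threshold", ret = c("sensitivity", "specificity"))
```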
##### Empirical Evidence of Test Structure Bias {#evidenceTestStructureBias}
It is not uncommon to find items that show differences across groups in [severity](#itemDifficulty) (intercepts) and/or [discrimination](#itemDiscrimination) (factor loadings).\index{bias!test structure}\index{structural equation modeling!measurement invariance}\index{structural equation modeling!intercept}\index{structural equation modeling!factor loading}\index{item response theory!differential item functioning}\index{item response theory!item discrimination}\index{item response theory!item difficulty}
However, cross-group differences in item functioning tend to be small and not consistent across studies, suggesting that some of the differences may reflect Type I errors that result from sampling error and multiple testing.\index{bias!test structure}\index{structural equation modeling!measurement invariance}\index{item response theory!differential item functioning}
That said, some instances of cross-group differences in item parameters could reflect test structure bias that is real and important to address.\index{bias!test structure}
#### Theoretical/Judgmental Approaches to Identification {#testStructureBiasTheoretical}
##### Facial Validity Bias {#facialValidityBias}
[Facial validity](#faceValidity) bias considers the extent to which an average person thinks that an item is biased—i.e., the item has differing [validity](#validity) between minority and majority groups.\index{bias!test structure}\index{validity!face}
If so, the item should be reconsidered.\index{bias!test structure}
Does an item disfavor certain groups?\index{bias!test structure}
Is the language specific to a particular group?\index{bias!test structure}
Is it offensive to some people?\index{bias!test structure}
This type of judgment moves into the realm of whether or not an item should be used.\index{bias!test structure}
##### Content Validity Bias {#contentValidityBias}
[Content validity](#contentValidity) bias is determined by judgments of construct experts who look for items that do not do an adequate job assessing the construct between groups.\index{bias!test structure}\index{validity!content}
A construct may include some content facets in one group, but may include different content facets in another group, as depicted in Figure \@ref(fig:testBias5).\index{bias!test structure}\index{validity!content}
```{r testBias5, out.width = "100%", fig.align = "center", fig.cap = "Different Content Facets in a Given Construct for Two Groups.", echo = FALSE}
knitr::include_graphics("./Images/testBias-05.png")
```
Examples include information questions and vocabulary questions on the Wechsler Adult Intelligence Scale.\index{bias!test structure}\index{validity!content}
If an item is linguistically complicated, grammatically complex or convoluted, or a double negative, it may be less [valid](#validity) or predictive for rural populations and those with less education.\index{bias!test structure}\index{validity!content}\index{validity}
Also, stereotype threat may contribute to [content validity](#contentValidity) bias.\index{bias!test structure}\index{validity!content}\index{stereotype threat}
Stereotype threat occurs when people are or feel at risk of conforming themselves to stereotypes about their social group, thus leading them to show poorer performance in ways that are consistent with the stereotype.\index{bias!test structure}\index{validity!content}\index{stereotype threat}
Stereotype threat may partially explain why some women may perform more poorly on some math items than some men.\index{bias!test structure}\index{validity!content}\index{stereotype threat}
Another example of [content validity](#contentValidity) bias is when the same measure is used to assess a construct across ages even though the construct shows heterotypic continuity.\index{bias!test structure}\index{validity!content}\index{heterotypic continuity}
Heterotypic continuity occurs when a construct changes in its behavioral manifestation with development [@Petersen2020].\index{bias!test structure}\index{validity!content}\index{heterotypic continuity}
That is, the same construct may look different at different points in development.\index{bias!test structure}\index{validity!content}\index{heterotypic continuity}
An example of a construct that shows heterotypic continuity is externalizing problems.\index{bias!test structure}\index{validity!content}\index{heterotypic continuity}
In early childhood, externalizing problems often manifest in overt forms, including physical aggression (e.g., biting) and temper tantrums.\index{bias!test structure}\index{validity!content}\index{heterotypic continuity}
By contrast, in adolescence and adulthood, externalizing problems more often manifest in covert ways, including relational aggression and substance use.\index{bias!test structure}\index{validity!content}\index{heterotypic continuity}
[Content validity](#contentValidity) and [facial validity](#faceValidity) bias judgments are often related, but not always.\index{bias!test structure}\index{validity!content}\index{validity!face}\index{heterotypic continuity}
## Examples of Bias {#biasExamples}
As described in the overview in Section \@ref(overview-bias), there is not much empirical evidence of test bias [@Brown1999; @Hall1999; @Jensen1980; @Kuncel2010a; @Reynolds2012; @Reynolds2021; @Sackett1994; @Sackett2008].\index{bias!examples}
That said, some item-level bias is not uncommon.\index{bias!examples}
One instance of test bias is that, historically, women's grades in math and engineering classes tended to be under-estimated by the Scholastic Aptitude Test [SAT; @Clark1984].\index{bias!examples}
@Fernandez2018 review the evidence on other instances of test and item bias.\index{bias!examples}
For instance, test bias can occur if a subgroup is less familiar with the language, the stimulus material, or the response procedures, or if they have different [response styles](#methodBias-types).\index{bias!examples}\index{response style}
In addition to test bias, there are known patterns of [bias in clinical judgment](#biasClinicalJudgment), as described in Section \@ref(biasClinicalJudgment).
## Test Fairness {#fairness}
There is interest in examining more than just the accuracy of measures.\index{fairness}
It is also important to examine the *errors* being made and differentiate the weight or value of different kinds of errors (and correct decisions).\index{fairness}\index{prediction!prediction error}
Consider an example of an unbiased test, as depicted in Figure \@ref(fig:testBias6), adapted from @Gottfredson1994.\index{fairness}
Although the example is of a White group and a Black group, we could substitute any two groups into the example (e.g., males versus females).\index{fairness}
(ref:testUnfairness) Potential Unfairness in Testing. The ovals represent the distributions of individuals' performance both on a test and a job performance criterion. TP = true positive; TN = true negative; FP = false positive; FN = false negative. (Adapted from @Gottfredson1994, Figure 1, p. 958. Gottfredson, L. S. (1994). The science and politics of race-norming. *American Psychologist*, *49*(11), 955–963. https://doi.org/10.1037/0003-066X.49.11.955)
```{r testBias6, out.width = "100%", fig.align = "center", fig.cap = "(ref:testUnfairness)", fig.scap = "Potential Unfairness in Testing.", echo = FALSE}
knitr::include_graphics("./Images/testBias-06.png")
```
The example is of an unbiased test between White and Black job applicants.\index{fairness}
There are no differences between the two groups in terms of slope.\index{fairness}
If we drew a regression line, the line would go through the centroid of both ovals.\index{fairness}
Thus, the measure is equally predictive in both groups even though the Black group failed the test at a higher rate than the White group.\index{fairness}
Moreover, there is no difference between the groups in terms of intercept.\index{fairness}
Thus, the performance of one group is not over-estimated relative to the performance of the other group.\index{fairness}
To demonstrate what a different intercept would look like, the figure also depicts Group X, which shows a different intercept.\index{fairness}\index{bias!predictive!different intercepts}
In sum, there is no [predictive validity](#predictiveValidity) bias between the two groups.\index{fairness}\index{bias!predictive}
But just because the test predicts just as well in both groups does not mean that the selection procedures are *fair*.\index{fairness}
Although the test is unbiased, there are differences in the *quality* of prediction: there are more [false negatives](#falseNegative) in the Black group compared to the White group.\index{fairness}\index{false negative}
This gives the White group an advantage and the Black group additional disadvantages.\index{fairness}
If the measure showed the same quality of prediction, we would say the test is fair.\index{fairness}
The point of the example is that just because a test is unbiased does not mean that the test is fair.\index{fairness}
There are two kinds of errors: [false negatives](#falseNegative) and [false positives](#falsePositive).\index{fairness}\index{false negative}\index{false positive}
Each error type has very different implications.\index{fairness}\index{false negative}\index{false positive}
[False negatives](#falseNegative) would be when the test predicts that an applicant would perform poorly and we do not give them the job even though they would have performed well.\index{fairness}\index{false negative}
[False negatives](#falseNegative) have a negative effect on the applicant.\index{fairness}\index{false negative}
And, in this example, there are more [false negatives](#falseNegative) in the Black group.\index{fairness}\index{false negative}
By contrast, [false positives](#falsePositive) would be when we predict that an applicant would do well, and we give them the job but they perform poorly.\index{fairness}\index{false positive}
[False positives](#falsePositive) are a benefit to the applicant but have a negative effect on the employer.\index{fairness}\index{false positive}
In this example, there are more [false positives](#falsePositive) in the White group, which is an undeserved benefit based on the [selection ratio](#selectionRatio); therefore, the White group benefits.\index{fairness}\index{false positive}\index{selection ratio}
In sum, equal accuracy of prediction (i.e., equal total number of errors) does not necessarily mean the test is fair; we must examine the types of errors.\index{fairness}\index{bias!predictive}
Merely ensuring accuracy does not ensure fairness!\index{fairness}\index{bias!predictive}
### Adverse Impact {#adverseImpact}
Adverse impact is defined as rejecting members of one group at a higher rate than another group.\index{adverse impact}
Adverse impact is different from test [validity](#validity).\index{adverse impact}\index{validity}
According to federal guidelines, adverse impact is present if the [selection rate](#selectionRatio) of one group is less than four-fifths (80%) the [selection rate](#selectionRatio) of the group with the highest [selection rate](#selectionRatio).\index{adverse impact}\index{selection ratio}
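As a minimal sketch with hypothetical selection counts, the four-fifths rule can be checked by comparing each group's selection rate to the highest group's selection rate.\index{adverse impact}\index{selection ratio}
```{r}
# Hypothetical numbers of applicants and selections for two groups
selected   <- c(groupA = 60, groupB = 30)
applicants <- c(groupA = 100, groupB = 100)

selectionRates <- selected / applicants
adverseImpactRatio <- min(selectionRates) / max(selectionRates)

adverseImpactRatio        # 0.50 in this hypothetical example
adverseImpactRatio < 0.80 # TRUE indicates adverse impact under the four-fifths rule
```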
There is much more evidence of adverse impact than test bias.\index{adverse impact}\index{bias}
Indeed, disparate impact of tests on personnel selection across groups is the norm rather than the exception, even when using [valid](#validity) tests that are unbiased, which in part reflect group-related differences in job-related skills [@Gottfredson1994].\index{adverse impact}\index{validity}
Examples of adverse impact include:\index{adverse impact}
- physical ability tests, which produce substantial adverse impact against women (despite over-estimation of women's performance),\index{adverse impact}
- cognitive ability tests, which produce substantial adverse impact against some ethnic minority groups, especially Black and Hispanic people (despite over-estimation of Black and Hispanic people's performance), even though cognitive ability tests tend to be among the strongest predictors of job performance [@Sackett2008; @Schmidt1981], and\index{adverse impact}
- personality tests, which produce higher estimates of dominance among men than women; it is unclear whether this has [predictive bias](#predictiveBias).\index{adverse impact}\index{bias!predictive}
### Bias Versus Fairness {#biasVsFairness}
Whether a measure is accurate or shows test bias is a scientific question.
By contrast, whether a test is fair and thus should be used for a given purpose is not just a scientific question; it is also an [ethical](#ethics) question.\index{bias}\index{fairness}\index{ethics}
It involves the consideration of the potential consequences of testing in terms of social values and [consequential validity](#consequentialValidity).\index{fairness}
### Operationalizing Fairness {#operationalizingFairness}
There are many perspectives to what should be considered when evaluating test fairness [@AERA2014; @Camilli2013; @Dorans2017; @GeneralAptitudeTestBattery1989; @Fletcher2021; @Gipps2009; @Helms2006; @Jonson2022; @Melikyan2019; @Sackett2008; @Thorndike1971; @Zieky2006; @Zieky2013].\index{fairness!operationalizing}
As described in @Fletcher2021, there are three primary ways of operationalizing fairness:\index{fairness!operationalizing}
1. Equal outcomes: the [selection rate](#selectionRatio) is the same across groups.\index{fairness!operationalizing}\index{selection ratio}
1. Equal opportunity: the [sensitivity](#sensitivity) ([true positive rate](#sensitivity); 1 $-$ [false negative rate](#falseNegativeRate)) is the same across groups.\index{fairness!operationalizing}\index{sensitivity}\index{false negative!rate}
1. Equal odds: the [sensitivity](#sensitivity) is the same across groups and the [specificity](#specificity) ([true negative rate](#specificity); 1 $-$ [false positive rate](#falsePositiveRate)) is the same across groups.\index{fairness!operationalizing}\index{sensitivity}\index{specificity}\index{false positive!rate}
For example, the job selection procedure shows *equal outcomes* if the proportion of men selected is equal to the proportion of women selected.\index{fairness!operationalizing}
The job selection procedure shows *equal opportunity* if, among those who show strong job performance, the proportion of classification errors ([false negatives](#falseNegative)) is the same for men and women.\index{fairness!operationalizing}
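As a minimal sketch with hypothetical confusion-matrix counts, the three fairness criteria can be computed for each group from its selection rate, [sensitivity](#sensitivity), and [specificity](#specificity).\index{fairness!operationalizing}\index{sensitivity}\index{specificity}
```{r}
# Hypothetical confusion-matrix counts for two groups
# TP = selected & performed well; FP = selected & performed poorly
# FN = not selected & would have performed well; TN = not selected & would have performed poorly
fairnessCounts <- data.frame(
  group = c("A", "B"),
  TP = c(40, 20), FP = c(10, 5), FN = c(10, 20), TN = c(40, 55)
)

fairnessCounts$selectionRate <- with(fairnessCounts, (TP + FP) / (TP + FP + FN + TN)) # equal outcomes
fairnessCounts$sensitivity   <- with(fairnessCounts, TP / (TP + FN))                  # equal opportunity
fairnessCounts$specificity   <- with(fairnessCounts, TN / (TN + FP))                  # equal odds (with sensitivity)

fairnessCounts
```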
[Receiver operating characteristic (ROC) curves](#roc) are depicted for two groups in Figure \@ref(fig:testBias-ROC).\index{fairness!operationalizing}\index{receiver operating characteristic curve}
A cutoff that represents equal opportunity is depicted with a horizontal line (i.e., the same [sensitivity](#sensitivity)) in Figure \@ref(fig:testBias-ROC).
The job selection procedure shows *equal odds* if (a), among those who show strong job performance, the proportion of classification errors ([false negatives](#falseNegative)) is the same for men and women, and (b), among those who show poor job performance, the proportion of classification errors ([false positives](#falsePositive)) is the same for men and women.\index{fairness!operationalizing}\index{sensitivity}\index{false negative!rate}\index{false positive!rate}
A cutoff that represents equal odds is depicted where the [ROC curve](#roc) for Group A intersects with the [ROC curve](#roc) from Group B in Figure \@ref(fig:testBias-ROC).\index{fairness!operationalizing}
The equal odds approach to fairness is consistent with a National Academy of Sciences committee on fairness [@GeneralAptitudeTestBattery1989; @Gottfredson1994].\index{fairness!operationalizing}
Approaches to operationalizing fairness in the context of prediction models are described by @Paulus2020.\index{fairness!operationalizing}
(ref:testBias-ROCCaption) Receiver Operating Characteristic (ROC) Curves for Two Groups. (Figure reprinted from @Fletcher2021, Figure 2, p. 3. Fletcher, R. R., Nakeshimana, A., & Olubeko, O. (2021). Addressing fairness, bias, and appropriate use of artificial intelligence and machine learning in global health. *Frontiers in Artificial Intelligence*, *3*(116). [https://doi.org/10.3389/frai.2020.561802](https://doi.org/10.3389/frai.2020.561802))
```{r testBias-ROC, out.width = "100%", fig.align = "center", fig.cap = "(ref:testBias-ROCCaption)", fig.scap = "Receiver Operating Characteristic Curves for Two Groups.", echo = FALSE}
knitr::include_graphics("./Images/testBias-ROC.png")
```
It is not possible to meet all three types of fairness simultaneously (i.e., equal [selection rates](#selectionRatio), [sensitivity](#sensitivity), and [specificity](#specificity) across groups) unless the [base rates](#baseRate) are the same across groups or the selection is perfectly accurate [@Fletcher2021].\index{fairness!operationalizing}\index{selection ratio}\index{sensitivity}\index{specificity}\index{base rate}
In the medical context, equal odds is the most common approach to fairness.\index{fairness!operationalizing}
However, using the cutoff associated with equal odds typically reduces overall classification accuracy.\index{fairness!operationalizing}
And, changing the cutoff for specific groups can lead to negative consequences.\index{fairness!operationalizing}
In the case that equal odds results in a classification accuracy that is too low, it may be worth considering using separate assessment procedures/tests for each group.\index{fairness!operationalizing}
In general, it is best to follow one of these approaches to fairness.\index{fairness!operationalizing}
Fairness is difficult to get right, so try to minimize negative impact.\index{fairness!operationalizing}
Many fairness supporters argue for simpler rules.\index{fairness!operationalizing}
In the 1991 Civil Rights Act, [score adjustments](#scoreAdjustment) based on race, gender, and ethnicity (e.g., within-race [norming](#norm) or race-conscious score adjustments) were made illegal in personnel selection [@Gottfredson1994].\index{fairness!operationalizing}
Another perspective on fairness is that selection procedures should predict job performance, and if they are correlated with any group membership (e.g., race, socioeconomic status, or gender), the test should not be used [@Helms2006].\index{fairness!operationalizing}
That is, according to Helms, we should not use any test that assesses anything other than the construct of interest (job performance).\index{fairness!operationalizing}
Unfortunately, however, no measures like this exist.\index{fairness!operationalizing}
Every measure assesses multiple things, and factors such as poverty can have long-lasting impacts across many domains.\index{fairness!operationalizing}
Another perspective on fairness is to make the selection rates equal the rates of success within each group [@Thorndike1971].\index{fairness!operationalizing}
According to this perspective, if you want to do selection, you should hire all people, then look at job performance.\index{fairness!operationalizing}
If among successful employees, 60% are White and 40% are Black, then set this selection rate for each group (i.e., hiring 80% White individuals and 20% Black individuals is not okay).\index{fairness!operationalizing}
According to this perspective, a selection system is only fair if the majority–minority differences on the selection device used are equal in magnitude to majority–minority differences in job performance.\index{fairness!operationalizing}
Selection criteria should be made based on prior distributions of success rates.\index{fairness!operationalizing}
However, you likely will not ever really know the true [base rate](#baseRate) in these situations.\index{fairness!operationalizing}\index{base rate}
No one uses this approach because you would need a period during which you accept everyone in order to determine the success rate for each group.\index{fairness!operationalizing}
Also, this would only work in a narrow window of time because the selection pool changes over time.\index{fairness!operationalizing}
There are lots of groups and subgroups.\index{fairness!operationalizing}
Ensuring fairness is very complex, and there is no way to accomplish the goal of being equally fair to all people.\index{fairness!operationalizing}
Therefore, do the best you can and try to minimize negative impact.\index{fairness!operationalizing}
## Correcting For Bias {#correctForBias}
### What to Do When Detecting Bias {#whenDetectingBias}
When examining item bias (using [differential item functioning](#dif)/DIF or [measurement non-invariance](#measurementInvariance)) with many items (or measures) across many groups, there can be many tests, which will make it likely that [DIF](#dif)/[non-invariance](#measurementInvariance) will be detected, especially with a large sample.\index{item response theory!differential item functioning}\index{structural equation modeling!measurement invariance}
Some detected [DIF](#dif) may be artificial or trivial, but other [DIF](#dif) may be real and important to address.\index{item response theory!differential item functioning}\index{structural equation modeling!measurement invariance}
It is important to consider how you will proceed when detecting [DIF](#dif)/[non-invariance](#measurementInvariance).\index{item response theory!differential item functioning}\index{structural equation modeling!measurement invariance}
Considerations of effect size and theory can be important for evaluating the [DIF](#dif)/[non-invariance](#measurementInvariance) and whether it is negligible or important to address.\index{item response theory!differential item functioning}\index{structural equation modeling!measurement invariance}
When detecting bias, there are several steps to take.\index{bias!correcting for}
First, consider what the bias indicates.\index{bias!correcting for}
Does the bias present [adverse impact](#adverseImpact) for a minority group?\index{bias!correcting for}\index{adverse impact}
For what reasons might the bias exist?\index{bias!correcting for}
Second, examine the effect size of the bias.\index{bias!correcting for}
If the effects are small, if the bias does not present [adverse impact](#adverseImpact) for a minority group, and if there is no compelling theoretical reason for the bias, the bias might not be sufficient to scrap the instrument for the population.\index{bias!correcting for}
Some detected bias may be artificial, but other bias may be real.\index{bias!correcting for}
Gender and cultural differences have shown a number of statistically significant effects for a number of different assessment purposes, but many of the observed effects are quite small and likely trivial, and they do not present compelling reasons to change the assessment [@Youngstrom2016].\index{bias!correcting for}
However, if you find bias, correct for it!\index{bias!correcting for}
There are a number of [score adjustment](#scoreAdjustment) and [non-score adjustment](#otherBiasCorrections) approaches to correct for bias, as described in Sections \@ref(scoreAdjustment) and \@ref(otherBiasCorrections).\index{bias!correcting for}
If the bias occurs at the item level (e.g., [test structure bias](#testStructureBias)), it is generally recommended to [remove](#removeBiasedItems) or [resolve](#resolveBiasedItems) items that show non-negligible bias.\index{bias!correcting for}\index{bias!test structure}
There are three primary options: (1) drop the item for both groups, (2) drop the item for one group but keep it for the other group, or (3) freely estimate the parameters for the item across groups.\index{bias!correcting for}
Addressing items that show larger bias can also reduce artificial bias in other items [@Hagquist2017].\index{bias!correcting for}
Thus, researchers are encouraged to handle item bias sequentially from high to low in magnitude.\index{bias!correcting for}
If the bias occurs at the test score level (e.g., [predictive bias](#predictiveBias)), [score adjustments](#scoreAdjustment) may be considered.\index{bias!correcting for}\index{bias!test structure}\index{bias!correcting for!score adjustment}
If you do not correct for bias, consider the impact of the test, procedure, and selection procedure when interpreting scores.\index{bias!correcting for}
Interpret scores with caution and provide necessary caveats in resulting papers or reports regarding the interpretations in question.\index{bias!correcting for}
In sum, it is important to examine the possibility of bias—it is important to consider how much "erroneous junk" you are introducing into your research.\index{bias!correcting for}
### Score Adjustment to Correct for Bias {#scoreAdjustment}
Score adjustment involves adjusting scores for a particular group or groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
#### Why Adjust Scores? {#whyAdjustScores}
There may be several reasons to adjust scores for various groups in a given situation.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
First, there may be social goals to adjust scores.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
For example, we may want our selection device to yield personnel that better represent the nation or region, including diversity of genders, races, majors, social classes, etc.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Score adjustments are typically discussed with respect to racial minority differences due to historical and systemic inequities.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Our society aims to provide equal opportunity, including the opportunity to gain a fair share (i.e., proportional representation) of jobs.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
A diversity of perspectives in a job is a strength; a diversity of perspectives can lead to greater creativity and improved problem-solving.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
A second potential reason that we may want to apply score adjustment is to correct for bias.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
A third potential reason that we may want to apply score adjustment is to improve the [fairness](#fairness) of a test.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{fairness}
#### Types of Score Adjustment {#typesOfScoreAdjustment}
There are a number of potential techniques that have been used in attempts to correct for bias, i.e., to reduce negative impact of the test on an under-represented group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
What is considered an under-represented group may depend on the context.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
For instance, men are under-represented compared to women as nurses, preschool teachers, and college students.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
However, men may not face the same systemic challenges compared to women, so even though men may show under-representation in some domains, it is arguable whether scores should be adjusted to increase their representation.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Techniques for score adjustment include:\index{bias!correcting for}\index{bias!correcting for!score adjustment}
- [Bonus points](#bonusPoints)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
- [Within-group norming](#withinGroupNorming)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
- [Separate cutoffs](#separateCutoffs)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate cutoffs}
- [Top-down selection from different lists](#topDownSelection)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!top-down selection from different lists}
- [Banding](#banding)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
- [Banding with bonus points](#bandingBonusPoints)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding with bonus points}
- [Sliding band](#slidingBand)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
- [Separate tests](#separateTests)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}
- [Item elimination based on group differences](#itemElimination)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!item elimination based on group differences}
##### Bonus Points {#bonusPoints}
Providing bonus points involves adding a constant number of points to the scores of all individuals who are members of a particular group with the goal of eliminating or reducing group differences.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
Bonus points are used to correct for [predictive bias](#predictiveBias) that reflects [differences in intercepts](#differentIntercepts) between groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!predictive!different intercepts}
An example of bonus points is military veterans in placement for civil service jobs—points are added to the initial score for all veterans (e.g., add 5 points to test scores of all veterans).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
An example of using bonus points as a score adjustment is depicted in Figure \@ref(fig:bonusPoints).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
```{r bonusPoints, out.width = "100%", fig.align = "center", fig.cap = "Using Bonus Points as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_bonusPoints.png")
```
There are several pros of bonus points.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
If the distribution of each group is the same, this will effectively reduce group differences.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
Moreover, it is a simple way of influencing selection decisions without changing the test itself, which is a great advantage.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
There are several cons of bonus points.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
If there are differences in group standard deviations, adding bonus points may not actually correct for bias.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
The use of bonus points also obscures what is actually being done to scores, so other methods like using [separate cutoffs](#separateCutoffs) may be more explicit.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!correcting for!separate cutoffs}
In addition, the simplicity of bonus points is also a disadvantage: because the adjustment is easily understood, it is often not viewed as "fair" that some people receive extra points that others do not.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{fairness}
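As a minimal illustration of the mechanics (not a recommendation), the sketch below adds a constant to the scores of one group; the data and the 5-point bonus are hypothetical.

```{r}
# Hypothetical example scores (for illustration only)
exampleScores <- data.frame(
  group = c("majority", "majority", "minority", "minority"),
  score = c(120, 110, 108, 102))

bonus <- 5 # assumed constant added to the minority group's scores

# Add the bonus to the scores of the minority group only
exampleScores$adjustedScore <- exampleScores$score +
  ifelse(exampleScores$group == "minority", bonus, 0)

exampleScores
```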
##### Within-Group Norming {#withinGroupNorming}
A [norm](#norm) is the standard of performance that a person's performance can be compared to.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{norm}
Within-group norming treats the person's group in the sample as the [norm](#norm).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Within-group norming converts an individual's score to [standardized scores](#standardizedScores) (e.g., T scores) or percentiles within one's own group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{data!standardized}
Then, people are selected based on the highest standardized scores across groups.
Within-group norming is used to correct for [predictive bias](#predictiveBias) that reflects [differences in slopes](#differentSlopes) between groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{bias!predictive!different slopes}
An example of using within-group norming as a score adjustment is depicted in Figure \@ref(fig:withinGroupNorming).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
```{r withinGroupNorming, out.width = "100%", fig.align = "center", fig.cap = "Using Within-Group Norming as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_withinGroupNorming.png")
```
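As a minimal sketch of the mechanics of within-group norming (using hypothetical scores), each person's score is converted to a standardized (*z*) score and a percentile rank within their own group, and candidates are then compared on the within-group scores.

```{r}
# Hypothetical example scores (for illustration only)
exampleScores <- data.frame(
  group = rep(c("majority", "minority"), each = 5),
  score = c(118, 112, 125, 109, 130, 104, 99, 111, 96, 115))

# Standardized (z) scores within each group
exampleScores$zWithinGroup <- ave(
  exampleScores$score,
  exampleScores$group,
  FUN = function(x) (x - mean(x)) / sd(x))

# Percentile ranks within each group
exampleScores$percentileWithinGroup <- ave(
  exampleScores$score,
  exampleScores$group,
  FUN = function(x) 100 * rank(x) / length(x))

# Candidates are then compared on the within-group scores (highest first)
exampleScores[order(exampleScores$zWithinGroup, decreasing = TRUE), ]
```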
There are several pros of within-group norming.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
First, it accounts for differences in group standard deviations and means, so it does not have the same problem as [bonus points](#bonusPoints) and is generally more effective at eliminating [adverse impact](#adverseImpact) compared to [bonus points](#bonusPoints).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{bias!correcting for!bonus points}\index{adverse impact}
Second, some general (non-group-specific) [norms](#norm) are clearly irrelevant for characterizing a person's functioning.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Group-specific [norms](#norm) aim to describe a person's performance relative to people with a similar background, thus potentially reducing cultural [bias](#bias).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Third, group-specific [norms](#norm) may better reflect cultural, educational, socioeconomic, and other factors that may influence a person's score [@Burlew2019].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Fourth, group-specific [norms](#norm) may increase [specificity](#specificity) and reduce over-pathologizing by avoiding diagnosing people who do not actually have a condition [@Manly2007].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{specificity}
There are several cons of within-group norming.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
First, group differences could be maintained if one decides to norm based on a reference sample or, when scores are skewed, a local sample, especially when using standardized scores.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
However, percentile scores will consistently eliminate [adverse impact](#adverseImpact).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{data!percentile rank}\index{adverse impact}
Second, using group-specific [norms](#norm) may obscure background variables that explain underlying reasons for group-related differences in test performance [@Manly2005; @Manly2007].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Third, group-specific [norms](#norm) do not address the problem if the measure shows test [bias](#bias) [@Burlew2019].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Fourth, group-specific [norms](#norm) may reduce [sensitivity](#sensitivity) to detect conditions [@Manly2007].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{sensitivity}
For instance, they may prevent people who would benefit from treatment from receiving it.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
It is worth noting that within-group norming on the basis of sex, gender, or ethnicity is illegal for purposes of personnel selection under the Civil Rights Act of 1991.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
As an example of within-group norming, the National Football League used to use race-norming for identification of concussions.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{norm-referenced!race}
The effect of race-norming, however, was that it lowered Black players' concussion risk scores, which prevented many Black players from being identified as having sustained a concussion and from receiving needed treatment.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{norm-referenced!race}
Race-norming compared Black football players' cognitive test scores to group-specific [norms](#norm): the cognitive test scores of Black people in the general population (not to common [norms](#norm)).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{norm-referenced!race}
Using Black-specific [norms](#norm) assumed that Black football players showed lower cognitive ability than other groups, so a low cognitive ability score for a Black player was less likely to be flagged as concerning.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{norm-referenced!race}
Thus, the race-specific [norms](#norm) led to lower identified rates of concussions among Black football players compared to White football players.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{norm-referenced!race}
Due to the [adverse impact](#adverseImpact), Black players sued the National Football League, and the league stopped the controversial practice of race-norming for identification of concussion\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}\index{adverse impact} (https://www.washingtonpost.com/sports/2021/06/03/nfl-concussion-settlement-race-norming/; archived at https://perma.cc/KN3L-5Z7R).\index{norm-referenced!race}
A common question is whether to use group-specific [norms](#norm) or common [norms](#norm).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
Group-specific [norms](#norm) are a controversial practice, and the answer depends.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
If you are interested in a person's absolute functioning (e.g., for determining whether someone is concussed or whether they are suitable to drive), recommendations are to use common [norms](#norm), not group-specific [norms](#norm) [@Barrash2010; @Silverberg2009].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
If, by contrast, you are interested in a person's *relative* functioning compared to a specific group, within-group norming could make sense if there is an appropriate reference group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
The question of which [norms](#norm) to use is complex; psychologists should evaluate the costs and benefits of each [norm](#norm) and use the [norm](#norm) with the greatest benefit and the least cost for the client [@Manly2007].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
##### Separate Cutoffs {#separateCutoffs}
Using separate cutoffs involves using a separate cutoff score per group and selecting the top number from each group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate cutoffs}
That is, using separate cutoffs involves using different criteria for each group.
Using separate cutoffs functions the same as adding [bonus points](#bonusPoints), but it has greater transparency—i.e., you are lowering the standard for one group compared to another group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate cutoffs}\index{bias!correcting for!bonus points}
An example of using separate cutoffs as a score adjustment is depicted in Figure \@ref(fig:separateCutoffs).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate cutoffs}
```{r separateCutoffs, out.width = "100%", fig.align = "center", fig.cap = "Using Separate Cutoffs as a Scoring Adjustment. In this example, the cutoff for the majority group is 128; the cutoff for the minority group is 123.", fig.scap = "Using Separate Cutoffs as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_separateCutoffs.png")
```
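A minimal sketch of applying separate cutoffs is shown below, using hypothetical scores and the cutoffs from the figure (128 for the majority group; 123 for the minority group).

```{r}
# Hypothetical example scores and group-specific cutoffs (for illustration only)
exampleScores <- data.frame(
  group = c("majority", "majority", "minority", "minority"),
  score = c(130, 125, 124, 120))

cutoffs <- c(majority = 128, minority = 123) # assumed cutoffs, as in the figure

# A candidate is selected if their score meets their group's cutoff
exampleScores$selected <- exampleScores$score >= cutoffs[exampleScores$group]

exampleScores
```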
##### Top-Down Selection from Different Lists {#topDownSelection}
Top-down selection from different lists involves taking the best from two different lists according to a preset rule as to how many to select from each group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!top-down selection from different lists}
Top-down selection from different lists functions the same as [within-group norming](#withinGroupNorming).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!top-down selection from different lists}\index{bias!correcting for!within-group norming}
An example of using top-down selection from different lists as a score adjustment is depicted in Figure \@ref(fig:topDownSelectionDifferentLists).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!top-down selection from different lists}
```{r topDownSelectionDifferentLists, out.width = "100%", fig.align = "center", fig.cap = "Using Top-Down Selection From Different Lists as a Scoring Adjustment. In this example, the top three candidates are selected from each group.", fig.scap = "Using Top-Down Selection From Different Lists as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_separateLists.png")
```
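A minimal sketch of top-down selection from separate lists is shown below, using hypothetical scores and a preset rule of selecting the top three candidates from each group (as in the figure).

```{r}
# Hypothetical example scores (for illustration only)
exampleScores <- data.frame(
  group = rep(c("majority", "minority"), each = 5),
  score = c(131, 128, 126, 121, 119, 117, 116, 112, 108, 105))

numberToSelectPerGroup <- 3 # preset rule: take the top three from each list

# Rank candidates within each group and take the top candidates from each list
selected <- do.call(rbind, lapply(
  split(exampleScores, exampleScores$group),
  function(groupData) {
    head(groupData[order(groupData$score, decreasing = TRUE), ],
         numberToSelectPerGroup)
  }))

selected
```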
##### Banding {#banding}
Banding uses a tier system that is based on the assumption that individuals within a specific score range are regarded as having equivalent scores.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
So that we do not over-estimate small score differences, scores within the same band are seen as equivalent—and the order of selection within the band can be modified depending on selection goals.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
The [standard error of measurement (SEM)](#standardErrorOfMeasurement) is used to estimate the precision ([reliability](#reliability)) of the test scores, and it is used as the width of the band.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}\index{reliability!standard error of measurement}\index{reliability!precision}
Consider an example: if a person received a score with confidence interval of 18–22, then scores between 18 to 22 are not necessarily different due to random fluctuation ([measurement error](#measurementError)).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}\index{measurement error}
Therefore, scores in that range are considered the same, and we take a band of scores.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
However, banding by itself may not result in increased selection of lower scoring groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
The band provides a subsample of applicants so that we can use other criteria (other than the test) to select a candidate.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
Giving "minority preference" involves selecting members of minority group in a given band before selecting members of the majority group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
An example of using banding as a score adjustment is depicted in Figure \@ref(fig:banding).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
```{r banding, out.width = "100%", fig.align = "center", fig.cap = "Using Banding as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_banding.png")
```
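A minimal sketch of forming a band is shown below, using hypothetical values that match the example in the text: the standard error of measurement (assumed here to be 4 points) is used as the band width, and scores within one band width of the top score are treated as equivalent.

```{r}
# Hypothetical values (for illustration only), matching the example in the text
scores <- c(22, 21, 20, 19, 18, 17, 16, 15, 14)

# SEM = SD * sqrt(1 - reliability); e.g., 15 * sqrt(1 - .93) is approximately 4
standardErrorOfMeasurement <- 4 # assumed SEM, used as the band width

# Scores within one band width of the top score are treated as equivalent
topScore <- max(scores)
bandLowerBound <- topScore - standardErrorOfMeasurement
topBand <- scores[scores >= bandLowerBound]

topBand # scores from 18 to 22 fall within the top band
```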
The problem with banding is that bands are set by the [standard error of measurement](#standardErrorOfMeasurement): you can select the first group from the first band, but then whom do you select after the first band?\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}\index{reliability!standard error of measurement}
There is no principled place to "stop" a band because scores at the edge of one band are indistinguishable from scores just beyond it in the next band.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
That is, 17 is indistinguishable from 18 (in terms of its confidence interval), 16 is indistinguishable from 17, and so on.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
Therefore, banding works okay for the top scores, but if you are going to hire a lot of candidates, it is a problem.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}
A solution to this problem with banding is to use a [sliding band](#slidingBand), as described later.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding}\index{bias!correcting for!sliding band}
##### Banding with Bonus Points {#bandingBonusPoints}
[Banding](#banding) is often used with [bonus points](#bonusPoints) to reduce the negative impact for minority groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding with bonus points}\index{bias!correcting for!banding}\index{bias!correcting for!bonus points}
An example of using [banding](#banding) with [bonus points](#bonusPoints) as a score adjustment is depicted in Figure \@ref(fig:bandingBonusPoints).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!banding with bonus points}\index{bias!correcting for!banding}\index{bias!correcting for!bonus points}
```{r bandingBonusPoints, out.width = "100%", fig.align = "center", fig.cap = "Using Banding With Bonus Points as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_bandingBonusPoints.png")
```
##### Sliding Band {#slidingBand}
Using a sliding band is a solution to the problem of which bands to use when using [banding](#banding).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}\index{bias!correcting for!banding}
Using a sliding band can help increase the number of minorities selected.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
Using the top band, you select all members of a minority group in the top band, then select members of the majority group with the top score of the band, then slide the band down (based on [SEM](#standardErrorOfMeasurement)), and repeat.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
You work your way down through bands of scores that are indistinguishable based on the [SEM](#standardErrorOfMeasurement) until you have selected as many candidates as needed.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}\index{reliability!standard error of measurement}
For instance, if the top score is 22 and the [SEM](#standardErrorOfMeasurement) is 4 points, the first band would be: [18, 22].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}\index{reliability!standard error of measurement}
Here is how you would proceed:\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. Select the minority group members who have a score between 18 to 22.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. Select the majority group members who have a score of 22.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. Slide the band down based on the SEM to the next highest score: [17, 21].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. Select the minority group members who have a score between 17 to 21.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. Select the majority group members who have a score of 21.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. Slide the band down based on the SEM to the next highest score: [16, 20].\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. ...\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
1. And so on\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
An example of using a sliding band as a score adjustment is depicted in Figure \@ref(fig:slidingBand).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
```{r slidingBand, out.width = "100%", fig.align = "center", fig.cap = "Using a Sliding Band as a Scoring Adjustment.", echo = FALSE}
knitr::include_graphics("./Images/adjustment_slidingBand.png")
```
In sum, using a sliding band, scores that are not significantly lower than the highest remaining score should not be treated as different.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
Using a sliding band has the same effects on decisions as [bonus points](#bonusPoints) that are the width of the band.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}\index{bias!correcting for!bonus points}
For example, if the [SEM](#standardErrorOfMeasurement) is 3, a sliding band yields the same decisions as adding 3 [bonus points](#bonusPoints); therefore, any scores within 3 points of the highest score are considered equal.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}\index{reliability!standard error of measurement}
A sliding band is popular because of its scientific and statistical rationale.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
Also, it is more confusing and is therefore preferred by some organizations because they may be less likely to be sued over it.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
However, a sliding band may not always eliminate [adverse impact](#adverseImpact).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}\index{adverse impact}
A sliding band has never been overturned in court (or at least, not yet).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!sliding band}
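A minimal sketch of the sliding-band logic with minority preference is shown below, using a hypothetical candidate pool, an assumed SEM of 4 points, and an assumed number of openings; it is intended only to illustrate the procedure described above.

```{r}
# Hypothetical candidate pool (for illustration only)
candidates <- data.frame(
  id = 1:10,
  group = c("minority", "majority", "majority", "minority", "majority",
            "majority", "minority", "majority", "minority", "majority"),
  score = c(21, 22, 21, 19, 20, 18, 17, 19, 16, 15))

standardErrorOfMeasurement <- 4 # assumed SEM (band width)
numberToSelect <- 6             # assumed number of openings
selectedIds <- integer(0)

# Slide the band down from the highest remaining score until enough are selected
while (length(selectedIds) < numberToSelect) {
  remaining <- candidates[!(candidates$id %in% selectedIds), ]
  bandTop <- max(remaining$score)
  bandBottom <- bandTop - standardErrorOfMeasurement
  inBand <- remaining[remaining$score >= bandBottom, ]
  
  # Within the band, select minority group members first,
  # then majority group members with the top score of the band
  preferred <- inBand$id[inBand$group == "minority"]
  topMajority <- inBand$id[inBand$group == "majority" & inBand$score == bandTop]
  newlySelected <- c(preferred, topMajority)
  
  selectedIds <- c(selectedIds, head(newlySelected,
                                     numberToSelect - length(selectedIds)))
}

candidates[candidates$id %in% selectedIds, ]
```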
##### Separate Tests {#separateTests}
Using separate tests for each group is another option to reduce bias.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}
For instance, you might use one test for the majority group and a different test for the minority group, making sure that each test is [valid](#validity) for the relevant group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}\index{validity}
Using separate tests is an extreme version of [top-down selection](#topDownSelection) and [within-group norming](#withinGroupNorming).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}\index{bias!correcting for!top-down selection from different lists}\index{bias!correcting for!separate tests}\index{bias!correcting for!within-group norming}
Using separate tests would be an option if a measure shows [predictive bias](#predictiveBias) in the form of [different slopes](#differentSlopes) across groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}\index{bias!predictive!different slopes}
One way of developing separate tests is to use empirical keying by group: different items for each group are selected based on each item's association with the criterion in each group.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}
Empirical keying is an example of [dustbowl empiricism](#theoryEmpiricism) (i.e., relying on empiricism rather than theory).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}\index{radical operationalism}\index{empiricism}
However, theory can also inform the item selection.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!separate tests}\index{theory}
##### Item Elimination based on Group Differences {#itemElimination}
Items that show large group differences in scores can be eliminated from the test.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!item elimination based on group differences}
If you remove enough items that show group differences, the groups' scores become more similar, which can yield equal selection rates across groups.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!item elimination based on group differences}
A problem with item elimination based on group differences is that, if you remove predictive items, you cannot meet both goals of equal selection and predictive power.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!item elimination based on group differences}
If you use this method, you often have to accept a decrease in the measure's predictive power.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!item elimination based on group differences}
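A minimal sketch of flagging items with large group differences is shown below, using simulated item scores; the effect size threshold (Cohen's *d* greater than 0.5) is arbitrary and chosen only for illustration.

```{r}
# Hypothetical item-level data (for illustration only)
set.seed(1)
exampleItems <- data.frame(
  group = rep(c("majority", "minority"), each = 100),
  item1 = c(rnorm(100, mean = 0.6), rnorm(100, mean = 0.0)), # large difference
  item2 = c(rnorm(100, mean = 0.1), rnorm(100, mean = 0.0)), # small difference
  item3 = c(rnorm(100, mean = 0.0), rnorm(100, mean = 0.0))) # no difference

# Standardized mean difference (Cohen's d) between groups for each item
cohensD <- sapply(c("item1", "item2", "item3"), function(item) {
  majority <- exampleItems[exampleItems$group == "majority", item]
  minority <- exampleItems[exampleItems$group == "minority", item]
  pooledSD <- sqrt((var(majority) + var(minority)) / 2)
  (mean(majority) - mean(minority)) / pooledSD
})

cohensD

# Items whose group difference exceeds an (arbitrary) threshold could be dropped
itemsToDrop <- names(cohensD)[abs(cohensD) > 0.5]
itemsToDrop
```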
#### Use of Score Adjustment {#useOfScoreAdjustment}
Score adjustment can be used in a number of different domains, including tests of aptitude and intelligence.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Score adjustment also comes up in other areas.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
For example, the number of drinks it takes to be considered binge drinking differs between men (five) and women (four).\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Although the list of score adjustment options is long, they all essentially reduce to two approaches:\index{bias!correcting for}\index{bias!correcting for!score adjustment}
1. [Bonus points](#bonusPoints)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}
1. [Within-group norming](#withinGroupNorming)\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
[Bonus points](#bonusPoints) and [within-group norming](#withinGroupNorming) are the techniques that are most often used in the real world.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!correcting for!within-group norming}
These techniques differ in their degree of obscurity—i.e., confusion that is caused not for scientific reasons, but for social, political, and dissemination and implementation reasons.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!correcting for!within-group norming}
Often, procedures that are hard to understand are preferred because they are harder to argue against, critique, or game.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Basically, you have two options for score adjustment.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!correcting for!within-group norming}
One option is to raise the scores of one group (e.g., [bonus points](#bonusPoints)) or to lower the cutoff for one group (e.g., [separate cutoffs](#separateCutoffs)).\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!correcting for!separate cutoffs}
The second primary option is to renorm or change the scores.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!within-group norming}
In sum, you can change the scores, or you can change the decisions you make based on the scores.\index{bias!correcting for}\index{bias!correcting for!score adjustment}\index{bias!correcting for!bonus points}\index{bias!correcting for!within-group norming}
### Other Ways to Correct for Bias {#otherBiasCorrections}
Because score adjustment is controversial, it is also important to consider other potential ways to correct for bias that do not involve score adjustment.\index{bias!correcting for}\index{bias!correcting for!score adjustment}
Strategies other than score adjustment to correct for bias are described by @Sackett2001.\index{bias!correcting for}
#### Use Multiple Predictors {#useMultiplePredictors}
In general, high-stakes decisions should not be made based on the results from one test.\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}
So, for instance, do not make hiring decisions based just on aptitude assessments.\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}
For example, college admissions decisions are not made just based on SAT scores, but also one's grades, personal statement, extracurricular activities, letters of recommendation, etc.\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}
Using multiple predictors works best when the predictors are not correlated with the assessment that has [adverse impact](#adverseImpact), which is difficult to achieve.\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}\index{adverse impact}
There are larger majority–minority subgroup differences in verbal and cognitive ability tests than in noncognitive skills (e.g., motivation, personality, and interpersonal skills).\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}
So, it is important to include assessment of relevant noncognitive skills.\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}
Include as many relevant aspects of the construct as possible for [content validity](#contentValidity).\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}\index{validity!content}
For a job, consider as many factors as possible that are relevant for success, e.g., cognitive and noncognitive abilities.\index{bias!correcting for}\index{bias!correcting for!use multiple predictors}
#### Change the Criterion {#changeCriterion}
Another option is to change the criterion so that the [predictive validity](#predictiveValidity) of tests is less skewed.\index{bias!correcting for}\index{bias!correcting for!change the criterion}\index{validity!predictive}
It may be that the selection instrument is not biased but the way in which we are thinking about selection procedures is biased.\index{bias!correcting for}\index{bias!correcting for!change the criterion}
For example, for judging the quality of universities, there are many different criteria we could use.\index{bias!correcting for}\index{bias!correcting for!change the criterion}
It could be valuable to examine the various criteria, and you might find what is driving adverse effects.\index{bias!correcting for}\index{bias!correcting for!change the criterion}
#### Remove Biased Items {#removeBiasedItems}
Using [item response theory](#irt) or [confirmatory factor analysis](#cfa), you can identify items that function differently across groups (i.e., [differential item functioning](#dif)/[DIF](#dif) or [measurement non-invariance](#measurementInvariance)).\index{bias!correcting for}\index{bias!correcting for!remove biased items}\index{item response theory}\index{item response theory!differential item functioning}\index{structural equation modeling!measurement invariance}
For instance, you can identify items that show different [discrimination](#itemDiscrimination)/factor loadings or [difficulty](#itemDifficulty)/intercepts by group.\index{bias!correcting for}\index{bias!correcting for!remove biased items}\index{item response theory!differential item functioning}\index{item response theory!item difficulty}\index{item response theory!item discrimination}\index{structural equation modeling!measurement invariance}\index{structural equation modeling!factor loading}\index{structural equation modeling!intercept}
You do not just want to remove items that show mean-level differences in scores (or different rates of endorsement) for one group than another, because there may be true group differences in their level on particular items.\index{bias!correcting for}\index{bias!correcting for!remove biased items}\index{item response theory!differential item functioning}\index{item response theory!item difficulty}
If an item is clearly [invalid](#validity) in one group but [valid](#validity) in another group, another option is to keep the item in one group, and to remove it in another group.\index{item response theory!differential item functioning}\index{bias!correcting for!remove biased items}\index{validity}
Be careful when removing items because removing items can lead to poorer [content validity](#contentValidity)—i.e., items may no longer be a representative set of the content of the construct.\index{bias!correcting for}\index{bias!correcting for!remove biased items}\index{validity!content}
Removing items also reduces a measure's [reliability](#reliability) and ability to detect individual differences [@Hagquist2017; @Hagquist2019].\index{bias!correcting for}\index{bias!correcting for!remove biased items}\index{reliability}
[DIF](#dif) effects tend to be small and inconsistent; removing items showing [DIF](#dif) may not have a big impact.\index{bias!correcting for}\index{bias!correcting for!remove biased items}\index{item response theory!differential item functioning}
#### Resolve Biased Items {#resolveBiasedItems}
Another option, for items identified that show [differential item functioning](#dif) using [IRT](#irt) or [measurement non-invariance](#measurementInvariance) using [CFA](#cfa), is to resolve instead of [remove](#removeBiasedItems) items.\index{bias!correcting for}\index{bias!correcting for!resolve biased items}\index{item response theory}\index{item response theory!differential item functioning}\index{structural equation modeling!measurement invariance}
Resolving items involves allowing an item to have a different [discrimination](#itemDiscrimination)/factor loading and/or [difficulty](#itemDifficulty)/intercept parameter for each group.\index{bias!correcting for}\index{bias!correcting for!resolve biased items}\index{item response theory!differential item functioning}\index{item response theory!item difficulty}\index{item response theory!item discrimination}\index{structural equation modeling!measurement invariance}\index{structural equation modeling!factor loading}\index{structural equation modeling!intercept}
Allowing item parameters to differ across groups has a very small effect on [reliability](#reliability) and person separation, so it can be preferable to [removing items](#removeBiasedItems) [@Hagquist2017; @Hagquist2019].\index{bias!correcting for}\index{bias!correcting for!resolve biased items}\index{reliability}
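As a minimal sketch of resolving (rather than removing) a non-invariant item in a [CFA](#cfa) framework, the code below compares a fully constrained (scalar invariance) model to a partial invariance model in which one item's intercept is freely estimated in each group, using `lavaan`'s built-in `HolzingerSwineford1939` data; the freed item (`x3`) is chosen arbitrarily for illustration.

```{r}
library("lavaan")

# Three-factor model for the Holzinger & Swineford (1939) data
HS.model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Scalar invariance: loadings and intercepts constrained equal across groups
fitScalar <- cfa(
  HS.model,
  data = HolzingerSwineford1939,
  group = "school",
  group.equal = c("loadings", "intercepts"))

# Partial invariance: "resolve" a biased item by freeing its intercept in each
# group (the freed intercept, x3, is chosen here purely for illustration)
fitPartial <- cfa(
  HS.model,
  data = HolzingerSwineford1939,
  group = "school",
  group.equal = c("loadings", "intercepts"),
  group.partial = c("x3 ~ 1"))

# Compare the fully constrained and partially invariant models
anova(fitScalar, fitPartial)
```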
#### Use Alternative Modes of Testing {#alternativeTestingModes}
Another option is to use alternative modes of testing.\index{bias!correcting for}\index{bias!correcting for!alternative modes of testing}
For example, you could use audio or video to present test items, rather than requiring a person to read the items, or write answers.\index{bias!correcting for}\index{bias!correcting for!alternative modes of testing}
Typical testing formats and computerized exams are oriented toward the upper-middle class, which is a problem with the testing procedure.\index{bias!correcting for}\index{bias!correcting for!alternative modes of testing}
McClelland's [-@McClelland1973] argument is that we need more real-life testing.\index{bias!correcting for}\index{bias!correcting for!alternative modes of testing}
Real-life testing could help address stereotype threat and the effects of learning disabilities.\index{bias!correcting for}\index{bias!correcting for!alternative modes of testing}
However, testing in different modalities could change the construct(s) being assessed.\index{bias!correcting for}\index{bias!correcting for!alternative modes of testing}
#### Use Work Records {#workRecords}
Using work records is based on McClelland's [-@McClelland1973] argument to use more realistic and authentic assessments of job-relevant abilities.\index{bias!correcting for}\index{bias!correcting for!work records}
Evidence on the value of work records for personnel selection is mixed.\index{bias!correcting for}\index{bias!correcting for!work records}
In some cases, use of work records can actually increase [adverse impact](#adverseImpact) on under-represented groups because the primary group typically already has an idea of how to get into the relevant job or is already in the relevant job; therefore, they have a leg up.\index{bias!correcting for}\index{bias!correcting for!work records}\index{adverse impact}
It would be acceptable to use work records if you trained people first and then tested, but no one spends the time to do this.\index{bias!correcting for}\index{bias!correcting for!work records}
#### Increase Time Limit {#increaseTimeLimit}
Another option is to allot people more testing time, as long as doing so does not change the construct.\index{bias!correcting for}\index{bias!correcting for!increase time limit}
Time limits often lead to greater [measurement error](#measurementError) because scores conflate pace and quality of work.\index{bias!correcting for}\index{bias!correcting for!increase time limit}
Increasing time limits requires convincing stakeholders that job performance is typically not "how fast you do things" but "how well you do them"—i.e., that time does not correlate with outcome of interest.\index{bias!correcting for}\index{bias!correcting for!increase time limit}
The utility of increasing time limits depends on the domain.\index{bias!correcting for}\index{bias!correcting for!increase time limit}
In some domains, efficiency is crucial (e.g., medicine, piloting).\index{bias!correcting for}\index{bias!correcting for!increase time limit}
Increasing time limits is not that effective in reducing group differences, and it may actually increase group differences.\index{bias!correcting for}\index{bias!correcting for!increase time limit}
#### Use Motivation Sets {#motivationSets}
Using motivation sets involves finding ways to increase testing motivation for minority groups.\index{bias!correcting for}\index{bias!correcting for!motivation sets}
It is probably an error to think that a test assesses just aptitude; therefore, we should also consider an individual's motivation to test.\index{bias!correcting for}\index{bias!correcting for!motivation sets}
Thus, part of the score has to do with ability and some of the score has to do with motivation.\index{bias!correcting for}\index{bias!correcting for!motivation sets}
You should try to maximize each examinee's motivation, so that the person's score on the measure better captures their true ability score.\index{bias!correcting for}\index{bias!correcting for!motivation sets}
Motivation sets could include, for example, using more realistic test stimuli that are clearly applicable to the school or job requirements (i.e., that have [face validity](#faceValidity)) to motivate all test takers.\index{bias!correcting for}\index{bias!correcting for!motivation sets}\index{validity!face}
#### Use Instructional Sets {#instructionalSets}
Using instructional sets involves coaching and training.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
For instance, you could inform examinees about the test content, provide study materials, and recommend test-taking strategies.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
This could narrow the gap between groups because the implicit assumption is that the primary group already has informal ("light") training.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
Using instructional sets aims to reduce error variance due to test anxiety, unfamiliar test format, and poor test-taking skills.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
Giving minority groups better access to test preparation is based on the assumption that group differences emerge because of different access to test preparation materials.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
This could theoretically help to systematically reduce test score differences across groups.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
Standardized tests such as the SAT, GRE, LSAT, GMAT, and MCAT embrace coaching and training.
For instance, the testing organization ETS provides training materials for free.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
After training, scores on standardized tests show some but minimal improvement.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
In general, training yields some improvement on quantitative subscales but minimal change on verbal subscales.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
However, the improvements tend to apply across groups, and they do not seem to lessen group differences in scores.\index{bias!correcting for}\index{bias!correcting for!instructional sets}
## Getting Started {#gettingStarted-bias}
### Load Libraries {#loadLibraries-bias}
```{r}
library("petersenlab") #to install: install.packages("remotes"); remotes::install_github("DevPsyLab/petersenlab")
library("lavaan")
library("semTools")
library("semPlot")
library("mirt")
library("dmacs") #to install: install.packages("remotes"); remotes::install_github("ddueber/dmacs")
library("strucchange")
library("MOTE")
library("tidyverse")
library("here")
library("tinytex")
```
### Prepare Data {#prepareData-bias}
#### Load Data {#loadData-bias}
`cnlsy` is a subset of a data set from the Children of the National Longitudinal Survey of Youth (CNLSY).
The CNLSY is a publicly available longitudinal data set provided by the Bureau of Labor Statistics (https://perma.cc/EH38-HDRN).
The CNLSY data file for these examples is located on the book's page of the Open Science Framework (https://osf.io/3pwza).
```{r}
cnlsy <- read_csv(here("Data", "cnlsy.csv"))
```
#### Simulate Data {#simulateData-bias}
For reproducibility, I set the seed below.\index{simulate data}
Using the same seed will yield the same answer every time.
There is nothing special about this particular seed.
```{r}
sampleSize <- 4000
set.seed(52242)
mydataBias <- data.frame(
ID = 1:sampleSize,
group = factor(c("male","female"),
levels = c("male","female")),
unbiasedPredictor1 = NA,
unbiasedPredictor2 = NA,
unbiasedPredictor3 = NA,
unbiasedCriterion1 = NA,
unbiasedCriterion2 = NA,
unbiasedCriterion3 = NA,
predictor = rnorm(sampleSize, mean = 100, sd = 15),
criterion1 = NA,
criterion2 = NA,
criterion3 = NA,
criterion4 = NA,
criterion5 = NA)
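# Unbiased predictors: unbiasedPredictor1 has the same distribution in both
# groups; unbiasedPredictor2 and unbiasedPredictor3 have group mean differences
# but (below) predict their criteria identically in both groups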
mydataBias$unbiasedPredictor1 <- rnorm(sampleSize, mean = 100, sd = 15)
mydataBias$unbiasedPredictor2[which(mydataBias$group == "male")] <-
rnorm(length(which(mydataBias$group == "male")), mean = 70, sd = 15)
mydataBias$unbiasedPredictor2[which(mydataBias$group == "female")] <-
rnorm(length(which(mydataBias$group == "female")), mean = 130, sd = 15)
mydataBias$unbiasedPredictor3[which(mydataBias$group == "male")] <-
rnorm(length(which(mydataBias$group == "male")), mean = 130, sd = 15)
mydataBias$unbiasedPredictor3[which(mydataBias$group == "female")] <-
rnorm(length(which(mydataBias$group == "female")), mean = 70, sd = 15)
mydataBias$unbiasedCriterion1 <- 1 * mydataBias$unbiasedPredictor1 +
rnorm(sampleSize, mean = 0, sd = 15)
mydataBias$unbiasedCriterion2 <- 1 * mydataBias$unbiasedPredictor2 +
rnorm(sampleSize, mean = 0, sd = 15)
mydataBias$unbiasedCriterion3 <- 1 * mydataBias$unbiasedPredictor3 +
rnorm(sampleSize, mean = 0, sd = 15)
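# Criteria for the common predictor, simulated to illustrate forms of
# predictive bias:
# criterion1: same slope and intercept across groups (no predictive bias)
# criterion2: different intercepts across groups
# criterion3: different slopes across groups
# criterion4: different slopes and intercepts across groups
# criterion5: different residual variances across groups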
mydataBias$criterion1[which(mydataBias$group == "male")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "male")] +
rnorm(length(which(mydataBias$group == "male")), mean = 0, sd = 5)
mydataBias$criterion1[which(mydataBias$group == "female")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "female")] +
rnorm(length(which(mydataBias$group == "female")), mean = 0, sd = 5)
mydataBias$criterion2[which(mydataBias$group == "male")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "male")] +
rnorm(length(which(mydataBias$group == "male")), mean = 10, sd = 5)
mydataBias$criterion2[which(mydataBias$group == "female")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "female")] +
rnorm(length(which(mydataBias$group == "female")), mean = 0, sd = 5)
mydataBias$criterion3[which(mydataBias$group == "male")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "male")] +
rnorm(length(which(mydataBias$group == "male")), mean = 0, sd = 5)
mydataBias$criterion3[which(mydataBias$group == "female")] <-
.3 * mydataBias$predictor[which(mydataBias$group == "female")] +
rnorm(length(which(mydataBias$group == "female")), mean = 0, sd = 5)
mydataBias$criterion4[which(mydataBias$group == "male")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "male")] +
rnorm(length(which(mydataBias$group == "male")), mean = 0, sd = 5)
mydataBias$criterion4[which(mydataBias$group == "female")] <-
.3 * mydataBias$predictor[which(mydataBias$group == "female")] +
rnorm(length(which(mydataBias$group == "female")), mean = 30, sd = 5)
mydataBias$criterion5[which(mydataBias$group == "male")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "male")] +
rnorm(length(which(mydataBias$group == "male")), mean = 0, sd = 30)
mydataBias$criterion5[which(mydataBias$group == "female")] <-
.7 * mydataBias$predictor[which(mydataBias$group == "female")] +
rnorm(length(which(mydataBias$group == "female")), mean = 0, sd = 5)
```
#### Add Missing Data {#addMissingData-bias}
Adding missing data to data frames makes the examples more realistic (like real-life data) and helps you get in the habit of writing code that accounts for missing data.
`HolzingerSwineford1939` is a data set from the `lavaan` package [@R-lavaan] that contains mental ability test scores (`x1`–`x9`) for seventh- and eighth-grade children.
```{r}
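# Randomly set 1% of the simulated values (excluding ID and group) to missing;
# then do the same for variables x1-x9 in HolzingerSwineford1939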
varNames <- names(mydataBias)
dimensionsDf <- dim(mydataBias[,-c(1,2)])
unlistedDf <- unlist(mydataBias[,-c(1,2)])
unlistedDf[sample(
1:length(unlistedDf),
size = .01 * length(unlistedDf))] <- NA
mydataBias <- cbind(
mydataBias[,c("ID","group")],
as.data.frame(
matrix(
unlistedDf,
ncol = dimensionsDf[2])))
names(mydataBias) <- varNames
data("HolzingerSwineford1939")
varNames <- names(HolzingerSwineford1939)
dimensionsDf <- dim(HolzingerSwineford1939[,paste("x", 1:9, sep = "")])
unlistedDf <- unlist(HolzingerSwineford1939[,paste("x", 1:9, sep = "")])
unlistedDf[sample(
1:length(unlistedDf),
size = .01 * length(unlistedDf))] <- NA
HolzingerSwineford1939 <- cbind(
HolzingerSwineford1939[,1:6],
as.data.frame(matrix(
unlistedDf,
ncol = dimensionsDf[2])))
names(HolzingerSwineford1939) <- varNames
```
## Examples of Unbiased Tests (in Terms of Predictive Bias) {#examplesUnbiasedTests}
### Unbiased test where males and females have equal means on predictor and criterion {#unbiasedTestsEqualMeans}
Figure \@ref(fig:unbiasedEqualMeans) depicts an example of an unbiased test where males and females have equal means on the predictor and criterion.\index{bias!predictive!unbiased}
The test is unbiased because there are no significant differences in the regression lines (of `predictor` predicting `criterion`) between males and females.\index{bias!predictive!unbiased}
```{r}
summary(lm(
unbiasedCriterion1 ~ unbiasedPredictor1 + group + unbiasedPredictor1:group,
data = mydataBias))
```
```{r unbiasedEqualMeans, out.width = "100%", fig.align = "center", class.source = "fold-hide", fig.cap = "Unbiased Test Where Males and Females Have Equal Means on Predictor and Criterion."}
plot(
unbiasedCriterion1 ~ unbiasedPredictor1,
data = mydataBias,
xlim = c(
0,
max(c(
mydataBias$unbiasedCriterion1,
mydataBias$unbiasedPredictor1),
na.rm = TRUE)),
ylim = c(
0,
max(c(
mydataBias$unbiasedCriterion1,
mydataBias$unbiasedPredictor1),
na.rm = TRUE)),
type = "n",
xlab = "predictor",
ylab = "criterion")
points(
mydataBias$unbiasedPredictor1[which(mydataBias$group == "male")],
mydataBias$unbiasedCriterion1[which(mydataBias$group == "male")],
pch = 20,
col = "blue")
points(mydataBias$unbiasedPredictor1[which(mydataBias$group == "female")],
mydataBias$unbiasedCriterion1[which(mydataBias$group == "female")],
pch = 1,
col = "red")
abline(lm(
unbiasedCriterion1 ~ unbiasedPredictor1,
data = mydataBias[which(mydataBias$group == "male"),]),
lty = 1,
col = "blue")
abline(lm(
unbiasedCriterion1 ~ unbiasedPredictor1,
data = mydataBias[which(mydataBias$group == "female"),]),
lty = 2,
col = "red")
legend(
"bottomright",
c("Male","Female"),
lty = c(1,2),
pch = c(20,1),
col = c("blue","red"))
```
### Unbiased test where females have higher means than males on predictor and criterion {#unbiasedTestsFemalesHigherMeans}
Figure \@ref(fig:unbiasedFemaleHigher) depicts an example of an unbiased test where females have higher means than males on the predictor and criterion.\index{bias!predictive!unbiased}
The test is unbiased because there are no differences in the regression lines (of `predictor` predicting `criterion`) between males and females.\index{bias!predictive!unbiased}
```{r}
summary(lm(
unbiasedCriterion2 ~ unbiasedPredictor2 + group + unbiasedPredictor2:group,
data = mydataBias))
```
```{r unbiasedFemaleHigher, out.width = "100%", fig.align = "center", class.source = "fold-hide", fig.cap = "Unbiased Test Where Females Have Higher Means Than Males on Predictor and Criterion."}
plot(
unbiasedCriterion2 ~ unbiasedPredictor2,
data = mydataBias,
xlim = c(
0,
max(c(
mydataBias$unbiasedCriterion2,
mydataBias$unbiasedPredictor2),
na.rm = TRUE)),
ylim = c(
0,
max(c(