Explanations
gender roles and inequality
Negative Logits
Token        Logit
Interface    0.38
tár          0.38
過程         0.38
量は         0.38
",[],"       0.38
மற           0.37
فرد          0.37
வீன          0.37
hoeveel      0.36
reimb        0.36
Positive Logits
Token          Logit
specificity    0.94
affiliation    0.88
specificity    0.87
differences    0.86
pecific        0.76
bias           0.72
preference     0.71
Specificity    0.68
specific       0.67
affiliations   0.67
Activations Density 0.091%