INDEX
Explanations
nationalities or demographics of different groups of people
mentions of organizations or groups, particularly in a formal context
Negative Logits
Token          Logit
restoration    -0.73
reversible     -0.71
fixes          -0.70
onement        -0.68
irreversible   -0.68
iversary       -0.68
Cancel         -0.67
ettel          -0.66
postp          -0.66
restoring      -0.65
Positive Logits
Token               Logit
average             0.88
averages            0.81
populous            0.80
average             0.79
diversity           0.77
Average             0.76
dwar                0.76
diverse             0.76
disproportionately  0.76
unaff               0.75
Activations Density: 0.927%