Explanations
references to historical narratives and societal perceptions related to race and privilege
Negative Logits
rungsseite    -1.11
Monfieur      -1.02
étoient       -1.02
myſelf        -0.98
مشين          -0.98
propOrder     -0.97
avoient       -0.95
wikipagina    -0.94
ainfi         -0.94
bezeichneter  -0.92
Positive Logits
↵        0.76
(blank)  0.74
(blank)  0.72
↵↵       0.67
,        0.67
<eos>    0.66
.        0.65
'        0.64
O        0.64
a        0.63
Activations Density 0.439%