Explanations
sentences expressing societal criticism related to race and inequality
Negative Logits
Token       Logit
Efq         -1.15
ſelves      -1.14
myſelf      -1.14
ſelf        -1.09
purpoſe     -1.09
Monfieur    -1.07
Theſe       -0.99
Jefus       -0.98
itſelf      -0.98
faſt        -0.96
Positive Logits
Token           Logit
still           0.65
still           0.59
Still           0.58
?               0.54
Still           0.53
!               0.51
ainda           0.49
...             0.48
STILL           0.48
nevertheless    0.47
Activations Density 0.138%