INDEX
Explanations
references to authority figures in discussions about governance or policy
Negative Logits
corners       -0.15
amburger      -0.15
amat          -0.15
olumn         -0.15
ullen         -0.15
SSF           -0.15
Lev           -0.14
Corner        -0.14
ableObject    -0.14
omin          -0.14
Positive Logits
regret        0.18
illance       0.17
further       0.16
å¹¹           0.16
èį            0.16
imers         0.16
\views        0.15
ingers        0.15
hoped         0.15
fur           0.15
Activations Density 0.059%