INDEX
Explanations
phrases related to general situations and actions
themes related to the concepts of violence and societal issues
Negative Logits
Particularly    -0.78
particularly    -0.76
sidx            -0.75
Specifically    -0.75
Especially      -0.71
arin            -0.71
significant     -0.70
ourage          -0.68
ierre           -0.67
especially      -0.66
Positive Logits
unaffected      1.09
unchanged       1.08
irrelevant      1.06
harmless        1.03
ignored         1.02
impunity        0.99
alright         0.99
unrem           0.97
shrugged        0.96
indifferent     0.96
Activations Density: 0.576%