INDEX
Explanations
references to penal elements or attributes in a context related to societal issues
Negative Logits
imore       -0.80
ilts        -0.75
CoC         -0.70
Orn         -0.66
office      -0.62
]}          -0.62
housing     -0.62
shotguns    -0.61
Skydragon   -0.61
Household   -0.60
Positive Logits
itionally   0.74
龍          0.74
uted        0.71
�           0.66
cribed      0.64
llo         0.63
antly       0.63
nell        0.62
ante        0.62
acio        0.62
Activations Density 0.050%