Explanations
mentions related to violent actions or harm being inflicted
actions or events associated with causing harm or violence
Negative Logits
gers     -0.88
cius     -0.86
ger      -0.79
nings    -0.79
bard     -0.79
cean     -0.79
ese      -0.78
sell     -0.77
gered    -0.75
acea     -0.75
Positive Logits
cipline  0.71
ngth     0.70
Lauder   0.70
Vict     0.70
verages  0.68
Seym     0.67
lehem    0.67
awei     0.66
enance   0.66
INGTON   0.62
Activations Density 0.074%