INDEX
Explanations
references to self-harm or violent acts directed at oneself
references to suicide and related acts
Negative Logits
Token      Logit
Correct    -0.72
parts      -0.71
rium       -0.71
heny       -0.70
Provided   -0.70
afort      -0.69
uv         -0.68
Phar       -0.68
artisan    -0.68
aunder     -0.67
Positive Logits
Token      Logit
suicide    1.31
zai        1.10
bomber     1.06
bombers    1.00
icide      0.98
icides     0.93
suicides   0.91
itating    0.88
itated     0.85
suicidal   0.83
Activations Density: 0.015%