Explanation: themes of suicide and self-harm
Negative Logits
exus        -0.18
urch        -0.15
agina       -0.15
Sponsored   -0.14
.learn      -0.14
iert        -0.13
sponsored   -0.13
spons       -0.13
trap        -0.13
acle        -0.13
Positive Logits
suicide     0.64
Suicide     0.56
su          0.56
suicides    0.54
Su          0.54
-su         0.54
Su          0.52
commit      0.52
_su         0.50
suic        0.49
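Dashboards of this kind commonly build the two logit lists by projecting the feature's decoder direction onto the model's unembedding vectors and ranking tokens by the resulting score. This page does not say how these particular values were computed, so what follows is a minimal sketch under that assumption. `logit_effects` and `top_and_bottom` are hypothetical helpers, and the toy vectors are invented for illustration. They are not taken from the model.

```python
from typing import Dict, List, Tuple

# Assumed method: a token's score is the dot product of the feature's
# decoder direction with that token's unembedding vector.
def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}

def top_and_bottom(effects: Dict[str, float], k: int
                   ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative (token, score) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return positive, negative

# Toy 2-dimensional example (hypothetical numbers).
decoder_dir = [1.0, 0.0]
unembed = {
    "suicide": [0.6, 0.1],
    "commit": [0.5, 0.0],
    "trap": [-0.1, 0.3],
    "exus": [-0.2, 0.9],
}
positive, negative = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
print(positive)  # [('suicide', 0.6), ('commit', 0.5)]
print(negative)  # [('exus', -0.2), ('trap', -0.1)]
```

Real dashboards may also normalize the decoder direction or apply a final layer norm before projecting. The sketch leaves both out.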
Activation Density: 0.175%
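Activation density is usually read as the share of tokens on which the feature fires. The page does not state its sample or threshold, so this minimal sketch assumes "fires" means an activation above zero. `activation_density` is a hypothetical helper, and the synthetic input simply reproduces the 0.175% figure: 175 active tokens out of 100,000.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold` (assumed 0)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Synthetic data: 175 active tokens out of 100,000.
acts = [0.0] * (100_000 - 175) + [1.0] * 175
print(f"{activation_density(acts):.3f}%")  # 0.175%
```

A density this low means the feature is silent on almost all text and fires only in a narrow context, which matches the topic-specific explanation given for it.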