Explanations
references to suicide and self-harm
Negative Logits
Token       Logit
original    -0.55
Den         -0.55
(           -0.53
den         -0.53
hi          -0.51
den         -0.50
<eos>       -0.49
miss        -0.48
↵↵          -0.48
,           -0.48
Positive Logits
Token       Logit
suicide     1.87
suicides    1.67
suicide     1.66
Suicide     1.57
Suicide     1.55
suicidio    1.47
suic        1.30
suicidal    1.24
自殺        1.15
myſelf      1.04
Activations Density: 0.155%