INDEX
Explanations
terms related to suicide and self-harm
Negative Logits
anzi        -0.19
ycz         -0.16
Dud         -0.14
éĢł         -0.14
æ°¸ä¹ħ      -0.14
Criminal    -0.13
ValuePair   -0.13
اØŃØ©      -0.13
ye          -0.13
ุà¸ķ        -0.13
Positive Logits
/self       0.20
alex        0.15
apas        0.15
dokon       0.15
Ordered     0.14
uars        0.14
Liver       0.14
æ½®         0.14
eros        0.14
ãĥ¶         0.14
Activations Density 0.013%