INDEX
Explanations
expressions related to extreme negative outcomes or experiences
NEGATIVE LOGITS
Token    Logit
δά       -0.19
441      -0.17
etty     -0.16
acock    -0.16
ØŃÙĩ     -0.16
'gc      -0.15
etur     -0.14
iele     -0.14
elsen    -0.14
agma     -0.14
POSITIVE LOGITS
Token       Logit
worse       0.20
Worse       0.17
-case       0.15
worst       0.15
Wolfe       0.14
漫          0.14
auga        0.14
ãĥ¼ãĤ¹ãĥĪ   0.13
JOB         0.13
ê           0.13
Activations Density: 0.030%