Explanations
phrases indicating abandonment or neglect
Negative Logits
lav         -0.17
στό         -0.16
wang        -0.16
avra        -0.15
disguise    -0.15
kel         -0.14
akh         -0.14
,$_         -0.14
GU          -0.13
Keeping     -0.13
Positive Logits
behind      0.40
alone       0.37
Behind      0.33
alone       0.32
Behind      0.30
beh         0.28
Alone       0.28
-alone      0.25
aside       0.24
ไว          0.22
Activations Density 0.052%