INDEX
Explanations
phrases indicating indifference or a lack of specific preference
Negative Logits
ses              -0.20
sb               -0.17
sil              -0.17
sid              -0.16
sj               -0.15
ryn              -0.15
.scalablytyped   -0.15
нен              -0.15
sst              -0.15
mund             -0.15
Positive Logits
else             0.21
theless          0.21
ly               0.18
anged            0.17
anging           0.16
thing            0.16
337              0.16
ity              0.16
ơ                0.16
rr               0.16
Activations Density 0.016%