INDEX
Explanations
phrases indicating direction or movement
Negative Logits
Token     Logit
ulton     -0.18
Sist      -0.15
Dias      -0.15
eft       -0.15
dn        -0.15
imest     -0.15
ths       -0.14
engu      -0.14
dias      -0.14
illon     -0.13
Positive Logits
Token     Logit
icens     0.15
isay      0.15
anka      0.15
terr      0.14
INUX      0.14
778       0.14
eÄį       0.14
923       0.14
inous     0.13
508       0.13
Activations Density: 0.051%