INDEX
Explanations
words that indicate actions or directions
Negative Logits
ãģ¤ãģij    -0.17
è¦ĭ       -0.17
á»ħ       -0.15
midt      -0.15
ened      -0.15
stead     -0.14
é©        -0.14
cq        -0.14
ED        -0.14
erd       -0.14
Positive Logits
/from     0.34
gether    0.32
plevel    0.21
asts      0.20
ledo      0.19
be        0.19
wner      0.18
ogle      0.18
asting    0.18
xic       0.18
Activations Density 0.805%