Explanations
phrases indicating actions or intentions, particularly those starting with "to."
Negative Logits
ovah     -0.16
uchs     -0.16
raya     -0.16
otate    -0.16
loys     -0.15
oland    -0.15
asons    -0.15
ü        -0.15
ouden    -0.15
ctica    -0.14
Positive Logits
kol      0.14
Äħż      0.14
ustr     0.14
Brewer   0.14
arsi     0.13
rot      0.13
Flo      0.13
Crow     0.13
angu     0.13
453      0.13
Activations Density 0.039%