Explanations
phrases indicating personal actions or experiences
Negative Logits
Token        Logit
yte          -0.16
turnstile    -0.15
itud         -0.15
bote         -0.15
ween         -0.15
uida         -0.15
ABEL         -0.15
rose         -0.15
úsqueda      -0.15
ακ           -0.14
Positive Logits
Token        Logit
ahan         0.17
Thrones      0.14
amil         0.14
ög           0.14
anda         0.14
UMB          0.14
æĺ           0.13
utron        0.13
jet          0.13
bak          0.13
Activations Density: 0.164%
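
The density figure is not defined on this page. A common reading, assumed here, is the fraction of token positions on which the feature fires at all. The sketch below shows that calculation; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature is
    active; the dashboard itself does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 16 of them with a nonzero activation.
sample = [0.0] * 9984 + [1.5] * 16
density = activation_density(sample)
print(f"Activations Density: {density:.3%}")
```

Under this reading, a density of 0.164% would mean the feature is active on roughly 16 of every 10,000 tokens, which is typical of a sparse feature.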