INDEX
Explanations
phrases that indicate capability or potential actions
Negative Logits
ahren      -0.15
ometr      -0.15
cia        -0.15
.Flag      -0.14
ách        -0.14
usement    -0.14
avana      -0.14
맞         -0.14
只要       -0.14
ulings     -0.14
Positive Logits
lant       0.18
berger     0.17
伯         0.16
769        0.14
outright   0.14
iza        0.14
anytime    0.13
Surely     0.13
ld         0.13
ORS        0.13
Activations Density 0.151%