INDEX
Explanations
phrases and terms indicating simplicity or ease of understanding
Negative Logits
auty       -0.17
лек        -0.14
vla        -0.14
quette     -0.14
623        -0.14
culture    -0.14
SCO        -0.14
quete      -0.13
ender      -0.13
/dat       -0.13
Positive Logits
ly         0.21
mente      0.18
ness       0.16
arks       0.15
basit      0.15
simples    0.15
ums        0.15
-таки      0.15
LY         0.15
iless      0.14
Activations Density: 0.007%