INDEX
Explanations
affirmations or strong agreements in statements
Negative Logits
idy       -0.15
оÑĩек     -0.15
_trait    -0.15
vla       -0.14
gaard     -0.14
queda     -0.13
.ef       -0.13
trou      -0.13
ảo        -0.13
ocrates   -0.13
Positive Logits
um        0.16
rost      0.15
ief       0.15
ected     0.14
fest      0.14
oup       0.14
ely       0.14
å¯Į       0.14
ose       0.14
atest     0.14
Activations Density 0.025%