Explanations
words that indicate approval or acceptance of actions and situations
Negative Logits
we        -0.19
we        -0.17
mtx       -0.15
deren     -0.15
其        -0.15
ewe       -0.14
/maps     -0.14
myself    -0.14
&         -0.14
We        -0.14
Positive Logits
our       0.40
our       0.36
我们的    0.33
OUR       0.32
Our       0.28
这样的    0.28
nosso     0.28
OUR       0.28
наших     0.28
such      0.28
Activations Density 0.006%
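
For readers unfamiliar with the density figure, here is a minimal sketch of one common way to read it: the fraction of token positions on which the feature fires. The function name `activation_density`, the zero threshold, and the sample data are all assumptions for illustration. The dashboard does not state how its number was computed, and the synthetic sample is chosen only to show what a value of this magnitude looks like.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Hypothetical helper: the threshold of 0.0 is an assumption, not
    something the dashboard specifies.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Synthetic data (not taken from the dashboard): the feature fires on
# 6 of 100,000 token positions.
sample = [0.0] * 99_994 + [1.5] * 6
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.006% means the feature is active on roughly 6 of every 100,000 tokens, so it is a very sparse feature.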