Explanations
expressions of affection and admiration
Negative Logits
Token       Logit
zelf        -0.16
agua        -0.15
indo        -0.14
obia        -0.14
الا         -0.13
earer       -0.13
antes       -0.13
boro        -0.13
lette       -0.13
arguably    -0.13
Positive Logits
Token       Logit
how         0.32
how         0.23
كيف         0.20
cómo        0.20
怎么        0.19
如何        0.19
hearing     0.19
eeee        0.19
rằng        0.18
everything  0.18
Activations Density 0.038%