INDEX
Explanations
phrases expressing concerns or fears about potential negative consequences
Negative Logits
etler      -0.16
aman       -0.15
iaux       -0.15
kc         -0.14
yon        -0.14
lar        -0.14
ama        -0.14
hopefully  -0.14
humble     -0.13
433        -0.13
Positive Logits
too        0.29
somehow    0.28
TOO        0.27
too        0.26
Too        0.20
Too        0.20
-too       0.20
might      0.19
podría     0.18
dil        0.18
Activations Density 0.172%