INDEX
Explanations
words or phrases indicating significant negative impacts or challenges
Negative Logits
iaux       -0.19
595        -0.16
ollo       -0.16
eselect    -0.16
sổ         -0.15
ekte       -0.15
goog       -0.15
DM         -0.15
095        -0.15
areth      -0.15
Positive Logits
mour       0.18
Wilhelm    0.16
otron      0.14
볨         0.14
Freed      0.14
PILE       0.14
rael       0.13
peak       0.13
ku         0.13
ogen       0.13
Activations Density 0.001%