INDEX
Explanations
phrases related to negative consequences
instances of the word "the."
Negative Logits
Token          Logit
pez            -0.76
å§«            -0.73
razil          -0.72
advertising    -0.71
cember         -0.68
LIN            -0.66
Notes          -0.66
çīĪ            -0.65
LAB            -0.65
é¾įåĸļ士       -0.65
Positive Logits
Token          Logit
slightest      1.31
brightest      1.06
usual          0.96
smartest       0.94
easiest        0.94
same           0.91
prett          0.90
strongest      0.86
stereotypical  0.84
norm           0.83
Activations Density: 0.068%