INDEX
Explanations
expressions of negation or denial
Negative Logits
ward       -0.19
ulle       -0.17
ged        -0.16
stu        -0.16
ried       -0.14
_UNUSED    -0.14
named      -0.14
cessive    -0.14
airo       -0.14
?q         -0.14
Positive Logits
oint           0.18
longer         0.17
matter         0.17
doubt          0.15
differently    0.15
xious          0.15
obs            0.15
sooner         0.14
theless        0.14
ScreenWidth    0.14
Activations Density: 0.036%