INDEX
Explanations
instances of the word "not."
Negative Logits
lical      -0.15
ilerden    -0.15
781        -0.15
olated     -0.15
halt       -0.15
ymous      -0.14
μά         -0.14
navr       -0.14
hetto      -0.14
ussy       -0.13
Positive Logits
everyone        0.28
everybody       0.25
knowing         0.23
everyone        0.23
ori             0.23
ches            0.22
everything      0.20
only            0.20
surprisingly    0.20
tingham         0.19
Activations Density 0.037%