INDEX
Explanations
assertions about the existence or presence of something
Negative Logits
incy        -0.16
stav        -0.15
lover       -0.15
quip        -0.15
apos        -0.15
uib         -0.14
adelphia    -0.14
也有        -0.14
例          -0.14
ddit        -0.13
Positive Logits
nothing     0.21
NOTHING     0.18
.neo        0.17
Nothing     0.17
nowhere     0.17
nobody      0.16
never       0.16
egin        0.15
no          0.15
nothing     0.15
Activations Density 0.085%