INDEX
Explanations
phrases that describe existence or states of being
Negative Logits
rts       -0.16
contres   -0.15
xis       -0.15
ovit      -0.14
urus      -0.14
wiąz      -0.13
Weaver    -0.13
Göz       -0.13
patch     -0.13
iams      -0.13
Positive Logits
chner     0.15
let       0.15
zcze      0.15
EATURE    0.14
يز        0.14
Upper     0.14
eger      0.13
842       0.13
enty      0.13
whose     0.13
Activations Density: 0.199%