INDEX
Explanations
words indicating presence or absence
Negative Logits
ryn       -0.16
enor      -0.15
OI        -0.15
cia       -0.14
hasher    -0.14
irsch     -0.14
poultry   -0.13
ollar     -0.13
PFN       -0.13
νε        -0.13
Positive Logits
ones      0.30
ones      0.20
them      0.18
Ones      0.18
hers      0.18
cka       0.17
176       0.17
mine      0.17
ck        0.16
Mine      0.16
Activations Density 0.004%