INDEX
Explanations
questions or references to people
Negative Logits
robat      -0.20
bian       -0.17
aises      -0.17
ted        -0.16
bol        -0.16
wr         -0.16
алеж       -0.15
nt         -0.15
Darling    -0.15
pure       -0.15

Positive Logits
ever       0.33
ops        0.33
else       0.29
opi        0.29
oping      0.28
osh        0.28
op         0.25
a          0.24
am         0.24
amongst    0.23
Activations Density: 0.023%