INDEX
Explanations
references to quality or superiority in comparison to others
Negative Logits
away        -0.17
ych         -0.16
ned         -0.15
nel         -0.15
ernaut      -0.15
TO          -0.14
erson       -0.14
ep          -0.14
儿          -0.14
urning      -0.14
Positive Logits
-quality    0.25
iors        0.21
ior         0.21
iets        0.19
quality     0.17
quality     0.17
всего       0.17
owl         0.17
-most       0.16
haps        0.16
Activations Density 0.010%