INDEX
Explanations
mentions of specific people's names
Negative Logits
Prelude          -0.74
uate             -0.69
headache         -0.69
enclosed         -0.65
succeeding       -0.63
duplication      -0.63
gratification    -0.62
psy              -0.62
headaches        -0.62
bottleneck       -0.61
Positive Logits
ITNESS           1.35
OOD              1.24
ALK              1.22
ITCH             1.20
arsh             1.17
idespread        1.17
orthy            1.16
rote             1.15
OW               1.13
edge             1.12
Activations Density: 0.027%