Explanations
mentions of specific names or topics
the word "mentioned" and its variations in various contexts
Negative Logits
uilt        -0.75
eware       -0.70
usterity    -0.70
earned      -0.69
otypes      -0.69
orneys      -0.69
iership     -0.68
inals       -0.67
heres       -0.67
onew        -0.66
Positive Logits
mentioning     1.03
mentions       0.97
mentioned      0.84
lihood         0.80
aloud          0.80
mention        0.76
Prelude        0.71
above          0.71
REDACTED       0.68
prominently    0.68
Activations Density: 0.010%