INDEX
Explanations
instances of the word "compelling."
references to engaging or persuasive content
Negative Logits
hops       -0.89
pez        -0.84
hop        -0.80
sterdam    -0.77
atel       -0.71
alde       -0.65
ource      -0.64
pec        -0.64
Sloan      -0.63
abad       -0.60
Positive Logits
ly         1.05
ingly      1.00
NESS       0.84
LY         0.83
ively      0.82
enough     0.79
enough     0.79
ments      0.76
ibly       0.76
reason     0.76
Activations Density: 0.028%