Explanations
proper names followed by colons
statements and responses in a conversational or question-and-answer format
Negative Logits
downstream    -0.69
doub          -0.67
wrath         -0.64
forgotten     -0.63
unchecked     -0.63
abad          -0.63
rule          -0.62
foreseen      -0.62
trespass      -0.61
attention     -0.61
Positive Logits
Exactly       1.15
Yeah          1.07
Absolutely    0.98
Absolutely    0.95
Originally    0.94
Yes           0.91
Yeah          0.90
Provided      0.88
Firstly       0.86
Hmm           0.86
Activations Density 0.030%