INDEX
Explanations
the word "surprise" or similar variations
phrases indicating a lack of surprise or expected outcomes
Negative Logits
eatures     -0.84
minster     -0.79
chnology    -0.77
İĭ          -0.76
rote        -0.74
adle        -0.72
folios      -0.71
nai         -0.71
rogram      -0.70
ettel       -0.70
Positive Logits
whatsoever  0.87
anymore     0.82
nor         0.78
surprises   0.67
surprise    0.66
imaru       0.65
prompts     0.64
why         0.64
enough      0.63
REDACTED    0.63
Activations Density 0.021%