INDEX
Explanations
phrases indicating a contrast or contradiction to some stated beliefs or expectations
phrases that challenge popular beliefs or narratives
Negative Logits
eport      -0.83
ahead      -0.81
estones    -0.80
iless      -0.74
enary      -0.74
erning     -0.71
gins       -0.71
hene       -0.71
pan        -0.71
between    -0.70
Positive Logits
expectations    1.08
belief          0.98
expectation     0.94
stereotype      0.92
stereotypes     0.88
popular         0.87
suggestion      0.84
intuition       0.82
appearances     0.82
assertions      0.81
Activations Density: 0.059%