INDEX
Explanations
phrases indicating contrast or exceptions
instances of the word "although."
Negative Logits
Eye       -0.76
ais       -0.70
ized      -0.69
hal       -0.69
ledged    -0.68
edu       -0.67
Ing       -0.67
lean      -0.67
elle      -0.66
tnc       -0.66
Positive Logits
soever          0.87
yip             0.86
thood           0.79
acknowledging   0.78
terness         0.76
netflix         0.73
conced          0.72
agreeing        0.71
userc           0.70
REDACTED        0.70
Activations Density: 0.013%