INDEX
Explanations
questions posed rhetorically for confirmation
rhetorical questions
Negative Logits
Token         Logit
apan          -0.72
rament        -0.68
shaw          -0.66
foreground    -0.64
binge         -0.62
apers         -0.61
lobster       -0.60
thro          -0.60
slam          -0.60
background    -0.59
Positive Logits
Token         Logit
Nope          0.96
Wouldn        0.91
Yeah          0.87
Anyway        0.86
Why           0.84
Especially    0.84
Isn           0.84
Surely        0.83
Alright       0.81
Maybe         0.80
Activations Density: 0.065%