INDEX
Explanations
words conveying extreme levels of disbelief or ridicule
descriptors of absurdity and ridiculousness
Negative Logits
yer        -0.88
Reviewed   -0.82
enfranch   -0.78
rien       -0.74
avers      -0.74
ribution   -0.71
maid       -0.71
builders   -0.71
rounder    -0.70
oyal       -0.70
Positive Logits
ness       0.89
amounts    0.89
nesses     0.89
nonsense   0.85
absurdity  0.85
ly         0.84
LY         0.83
lengths    0.82
amount     0.79
NESS       0.79
Activations Density 0.042%