INDEX
Explanations
frankly, honestly, seriously
Negative Logits
!          0.50
'!         0.43
)!         0.43
screech    0.42
fio        0.42
"!         0.41
~!         0.41
!          0.41
!?         0.40
monkeys    0.40
Positive Logits
frankly      0.63
listen       0.62
Frankly      0.60
Honestly     0.58
Listen       0.56
Honestly     0.56
Seriously    0.56
Anyway       0.54
honestly     0.52
Anyway       0.52
Activations Density: 0.004%