Explanations
phrases expressing surprise or disbelief
expressions of surprise or astonishment
Negative Logits
opath    -0.77
aker     -0.75
aldi     -0.73
atche    -0.73
odor     -0.72
oran     -0.71
reb      -0.70
ather    -0.69
ript     -0.69
ethic    -0.68
Positive Logits
?!                                1.07
?:                                0.99
?,                                0.99
Huh                               0.92
.?                                0.90
??                                0.89
U+FE0F (variation selector-16)    0.87
!?                                0.83
@#                                0.82
!!                                0.82
Activations Density 0.008%
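
To put the density figure in perspective, here is a minimal sketch that assumes density means the fraction of sampled tokens on which the feature's activation is above zero. The function name `activation_density`, the zero threshold, and the sample data are all hypothetical and not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is counted per token, with a zero threshold.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 100,000 tokens, 8 of which activate the feature.
acts = [0.0] * 100_000
for i in range(8):
    acts[i * 1000] = 1.5  # arbitrary nonzero activation value

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.008%
```

Under that assumption, a density of 0.008% corresponds to the feature firing on roughly 8 out of every 100,000 tokens.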