Explanations
adjectives describing emotions or judgments
discussions about controversial or unsettling topics
Negative Logits
umbn         -0.82
rongh        -0.78
foreseen     -0.75
execute      -0.74
ocument      -0.71
cele         -0.71
sylvania     -0.71
mediately    -0.70
obook        -0.69
ufact        -0.68
Positive Logits
huh          1.49
eh           1.41
tho          1.04
?!           1.01
!            1.00
!!           0.95
!?           0.95
ya           0.92
ðŁĺ          0.85
kidding      0.84
Activations Density 0.525%
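
As a rough illustration of how a density figure like the one above can be read, here is a minimal sketch. It assumes that activation density means the share of token positions on which the feature's activation is above zero; the page itself does not define the metric. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of positions where the feature fires.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical toy data: the feature fires on 21 of 4000 token positions.
toy = [0.0] * 3979 + [1.3] * 21
density = activation_density(toy)
print(f"{density:.3%}")
```

On this reading, a density of 0.525% corresponds to the feature firing on roughly 1 in every 190 tokens.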