INDEX
Explanations
expressions of strong emotions and opinions regarding justice and fairness
Negative Logits
.*↵↵      -0.18
.*↵       -0.18
.↵↵       -0.15
.*,       -0.15
).*       -0.15
.)↵↵      -0.15
!*        -0.15
.*/↵      -0.14
."""↵↵    -0.14
).↵↵      -0.14
Positive Logits
its       0.28
cant      0.23
hope      0.22
ive       0.22
,,        0.22
dont      0.21
iam       0.21
.look     0.21
.im       0.20
.i        0.20
Activations Density: 1.024%