INDEX
Explanations
emphasizing words that enhance or reinforce personal feelings or opinions
Negative Logits
Token        Logit
's           -0.36
’s           -0.30
'S           -0.27
be           -0.27
´s           -0.25
himself      -0.23
’S           -0.22
themselves   -0.22
herself      -0.21
`s           -0.21
Positive Logits
Token        Logit
can          0.34
cannot       0.34
need         0.32
aren         0.30
are          0.26
don          0.26
want         0.25
need         0.24
must         0.24
haven        0.24
Activations Density: 0.105%