INDEX
Explanations
phrases prompting audience engagement and personal evaluation
Negative Logits
stí     -0.16
aper    -0.15
reek    -0.15
igne    -0.15
erno    -0.15
whim    -0.15
úc      -0.14
Quest   -0.14
ost     -0.14
asa     -0.14
Positive Logits
judge        0.23
yourself     0.22
for          0.21
yourselves   0.20
for          0.19
_for         0.19
themselves   0.18
собі         0.18
Yourself     0.18
judge        0.18
Activations Density 0.036%