INDEX
Explanations
statements indicating user engagement or communication
Negative Logits
Token      Logit
↵          -0.27
...        -0.19
[          -0.18
...↵       -0.17
Âł         -0.16
'          -0.15
*          -0.15
,,         -0.15
&          -0.15
Continue   -0.15
Positive Logits
Token      Logit
°}         0.21
aeper      0.17
ilon       0.17
.")        0.17
alsy       0.17
azar       0.17
@}         0.16
ehir       0.15
.']        0.15
";}        0.15
Activations Density: 0.080%