INDEX
Explanations
phrases demanding accountability or improvement regarding moral or ethical standards
Negative Logits
elage    -0.17
oose     -0.16
alama    -0.15
ÃŃÅ¡     -0.15
oka      -0.15
iske     -0.14
że       -0.14
sock     -0.14
uge      -0.14
lace     -0.14
Positive Logits
should   0.22
Should   0.20
Should   0.20
shouldn  0.20
etr      0.20
should   0.18
ought    0.18
instead  0.18
.should  0.17
134      0.16
Activations Density: 0.244%