INDEX
Explanations
words and phrases related to rejection or prohibition
Negative Logits
Token        Logit
.Reporting   -0.18
ÃŃc          -0.14
Into         -0.14
Ù쨵ÙĦ        -0.14
semb         -0.14
loub         -0.14
ازÙħ         -0.14
vil          -0.13
ya           -0.13
ei           -0.13
Positive Logits
Token        Logit
altogether   0.33
æİī          0.33
/null        0.28
entirely     0.24
alto         0.20
completely   0.19
outright     0.19
issippi      0.18
ive          0.18
/block       0.18
Activations Density 0.163%
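
A density figure like this one can be read as the share of sampled tokens on which the feature fires. The sketch below assumes that definition (nonzero activation counts as firing), which this page does not state; `activation_density` is a hypothetical helper and the sample values are made up for illustration.

```python
def activation_density(activations: list[float]) -> float:
    """Fraction of tokens with a strictly positive feature activation.

    Assumption: "density" means nonzero-activation frequency over a
    sample of tokens. Returns 0.0 for an empty sample.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > 0.0)
    return fired / len(activations)


if __name__ == "__main__":
    # Hypothetical sample: the feature fires on 1 of 4 tokens.
    sample = [0.0, 0.0, 1.5, 0.0]
    print(f"{activation_density(sample):.3%}")
```

On this reading, a density of 0.163% would mean the feature is active on roughly 163 of every 100,000 sampled tokens.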