INDEX
Explanations
phrases that indicate the capability or permission to perform actions
Negative Logits
uf        -0.14
ige       -0.14
pomoc     -0.14
erdale    -0.13
ione      -0.13
帮        -0.13
RIPT      -0.13
idence    -0.13
essler    -0.13
.FAIL     -0.13
Positive Logits
us            0.35
them          0.22
him           0.21
you           0.20
greater       0.18
for           0.17
raž           0.16
us            0.16
flexibility   0.15
747           0.15
Activations Density: 0.065%