INDEX
Explanations
phrases related to specific actions or instructions
phrases indicating negation or refusal
Negative Logits
Palest     -0.75
anwhile    -0.70
mathemat   -0.70
RAD        -0.66
Fatal      -0.64
Morg       -0.63
çīĪ        -0.63
Leilan     -0.62
Hir        -0.62
Blaz       -0.60
Positive Logits
Ķ     1.23
¬     1.21
ª     1.20
ĸ     1.19
£     1.19
©     1.14
¿     1.14
¼     1.14
ij    1.13
Ļ     1.12
Activations Density: 0.170%