INDEX
Explanations
instances of the word "ignore" and other similar terms
Negative Logits
icens     -0.16
otel      -0.15
ulis      -0.15
anzi      -0.14
elled     -0.14
meni      -0.14
يلي       -0.14
умент     -0.14
ines      -0.14
pects     -0.14
Positive Logits
therefore   0.24
Therefore   0.19
لذا         0.17
Therefore   0.17
thus        0.16
zilla       0.16
apiro       0.16
uth         0.15
onya        0.15
donc        0.15
Activations Density 0.004%