Explanations
phrases indicating negation or absence
Negative Logits
Token        Logit
fometimes    -0.80
Anſ          -0.79
ſind         -0.74
ſeveral      -0.74
myſelf       -0.74
chofe        -0.72
Catto        -0.72
fhort        -0.71
itſelf       -0.71
iſt          -0.69
Positive Logits
Token        Logit
non          0.88
Non          0.83
Non          0.81
without      0.77
ilman        0.77
非           0.75
Without      0.73
ohne         0.72
AndEndTag    0.72
Без          0.72
Activations Density: 0.539%