Explanations
negations and phrases expressing absence or exceptions
Negative Logits
gan          -0.17
oog          -0.17
ook          -0.15
xaf          -0.15
utow         -0.15
agnostics    -0.14
ersions      -0.14
ghi          -0.14
roud         -0.14
inson        -0.14
Positive Logits
Zot          0.15
Mention      0.15
either       0.15
vest         0.15
_SR          0.14
ys           0.14
ÑĨин         0.14
altogether   0.14
stint        0.14
chwitz       0.14
Activations Density 0.347%