Explanations
phrases expressing disagreement or contradiction
Negative Logits
unt	-0.15
ãĥ©ãĤ¤ãĥĪ	-0.15
ãĤ¯ãĥĪ	-0.14
dept	-0.14
either	-0.14
ãĤ¯ãĤ·ãĥ§ãĥ³	-0.14
Ä«	-0.13
inst	-0.13
tron	-0.13
ult	-0.13
Positive Logits
heimer	0.20
OLON	0.17
olon	0.17
inati	0.15
ãĥ³ãĤ¬	0.14
fault	0.14
htar	0.14
CONTRIBUTORS	0.14
HEMA	0.14
sst	0.13
Activations Density 0.061%
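
Several token strings in the logit lists (for example ãĥ©ãĤ¤ãĥĪ and Ä«) look like raw vocabulary entries from a byte-level BPE tokenizer, where each byte of the UTF-8 text is shown as a stand-in printable character. The source does not name the tokenizer, so the following is only a sketch under the assumption that it uses the GPT-2-style byte-to-character table; `decode_token` is a hypothetical helper name, not part of any dashboard API.

```python
def bytes_to_unicode():
    """Byte -> printable character table (assumed: GPT-2-style byte-level BPE)."""
    # Bytes that are already printable map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at 256 and above.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a raw vocabulary string back into readable text."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in token)
    # A token may hold only part of a multi-byte character, so do not raise on errors.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĥ©ãĤ¤ãĥĪ", "Ä«", "heimer"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

If the assumption holds, the non-ASCII entries decode to katakana fragments and a Latin letter with a macron, while plain ASCII tokens such as heimer pass through unchanged.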