INDEX
Explanations
instances of logical or causal reasoning
Negative Logits
Token        Logit
è¸           -0.17
477          -0.16
388          -0.16
ç´ł          -0.15
Å¡tÄĽ        -0.15
ander        -0.15
atoon        -0.15
enville      -0.14
quen         -0.14
jumbotron    -0.14
Positive Logits
Token        Logit
actable      0.16
urgeon       0.16
ãĥķãĤ        0.15
мом          0.15
pper         0.15
.fi          0.14
unsafe       0.14
byt          0.14
ipop         0.14
ÑĤов         0.14
Activations Density 0.335%