Explanations
phrases that indicate causal relationships or origins
Negative Logits
Token       Logit
imson       -0.17
.infinity   -0.16
opis        -0.16
alama       -0.15
ersh        -0.15
ainen       -0.15
ulumi       -0.14
eton        -0.14
¹           -0.14
iniz        -0.14
Positive Logits
Token         Logit
fact          0.20
:↵            0.16
áž            0.16
fact          0.15
:↵↵           0.15
:             0.15
ager          0.14
mir           0.14
Descriptors   0.14
having        0.14
Activations Density 0.127%