Explanations
The concept of "reason" as it appears in explanations or justifications
NEGATIVE LOGITS
gow         -0.20
IRST        -0.16
anzeigen    -0.15
pery        -0.15
/read       -0.14
erson       -0.14
nez         -0.14
/lists      -0.14
/run        -0.14
æijĩ        -0.14
POSITIVE LOGITS
why         0.23
why         0.20
nal         0.18
lessly      0.17
naires      0.16
hift        0.16
APPER       0.16
üstü        0.16
WHY         0.15
ably        0.15
Activations Density 0.039%