INDEX
Explanations
terms that indicate causation or consequence
Negative Logits
arus    -0.17
lar     -0.16
sg      -0.14
ril     -0.14
PF      -0.13
lover   -0.13
ül      -0.13
inou    -0.13
472     -0.13
raries  -0.13
Positive Logits
forth     0.21
aze       0.15
confront  0.15
وار       0.15
unt       0.14
588       0.14
eme       0.14
acer      0.14
ays       0.14
detr      0.14
Activations Density 0.006%