Explanations
statements focusing on the consequences of actions and social behaviors
Negative Logits
antan     -0.15
elman     -0.15
rico      -0.15
kers      -0.15
ajaran    -0.14
aiser     -0.14
ıt        -0.14
iju       -0.14
uxt       -0.14
loys      -0.14
Positive Logits
/null     0.19
itself    0.18
inta      0.17
ulla      0.16
dess      0.15
lings     0.15
thereof   0.14
собою     0.14
ska       0.14
INLINE    0.14
Activations Density: 0.304%