Explanations
phrases indicating uncertainty or questioning established norms and expectations
Negative Logits
not      -0.23
oint     -0.16
ارج      -0.15
แล       -0.15
no       -0.15
not      -0.15
ikke     -0.15
ไม       -0.15
не       -0.15
combe    -0.14
Positive Logits
anymore      0.53
necessarily  0.34
any          0.30
nor          0.29
anywhere     0.27
anything     0.26
slightest    0.25
任何         0.25
yet          0.24
nor          0.23
Activations Density 0.543%