INDEX
Explanations
logical arguments or reasoning in discussions
Negative Logits
Spoljašnje  -1.05
Paglinawan  -1.04
Roskov      -1.02
Datuak      -1.00
Portail     -0.98
зулта       -0.95
Италијани   -0.94
tanleria    -0.94
gainera     -0.91
^(@)        -0.89
Positive Logits
↵   0.52
P   0.50
    0.49
T   0.48
I   0.48
    0.47
x   0.47
?   0.47
!   0.46
d   0.45
Activations Density 0.863%