INDEX
Explanations
affirmative expressions or words indicating agreement
Negative Logits
')")    -0.63
\}\\    -0.53
ówno    -0.52
"");    -0.51
/*",    -0.50
〕      -0.50
achy    -0.49
۵       -0.49
</s>    -0.48
>");    -0.48
Positive Logits
Y       2.21
y       1.80
Yel     1.45
YE      1.43
YC      1.42
Ys      1.40
YP      1.39
YR      1.38
YM      1.37
YS      1.36
Activations Density: 0.123%