Explanations
statements involving opinions or judgments on race and morality
Negative Logits
bels                -0.44
initial             -0.39
</i>                -0.38
(token not shown)   -0.38
initially           -0.37
록                  -0.37
плане               -0.36
بگیرید              -0.36
flu                 -0.35
Flu                 -0.35
Positive Logits
Hozzáférés          0.88
oprot               0.85
HasFactory          0.85
SizeMode            0.82
WebVitals           0.82
autorytatywna       0.80
مشين                0.79
виправивши          0.79
kaarangay           0.74
:✨                  0.74
Activations Density: 0.189%