INDEX
Explanations
phrases indicating superiority or dominance
Negative Logits
elman       -0.16
chemas      -0.16
gis         -0.15
xea         -0.14
berger      -0.14
amik        -0.14
amination   -0.14
pio         -0.14
itemprop    -0.14
ngen        -0.14
Positive Logits
sensitivity   0.20
sensitive     0.18
jud           0.17
diamonds      0.17
anchor        0.16
y             0.15
gem           0.15
sensit        0.15
intolerance   0.15
FM            0.15
Activations Density 0.004%