INDEX
Explanations
words indicating health-related conditions and actions
Negative Logits
Token       Logit
orem        -0.28
oret        -0.20
iming       -0.18
cest        -0.17
grily       -0.17
oretical    -0.16
quoi        -0.15
visa        -0.15
rone        -0.14
ÑģÑİ        -0.14
Positive Logits
Token       Logit
/of         0.17
MRI         0.16
bidden      0.15
DNA         0.15
plevel      0.15
pired       0.14
cribes      0.14
Gesch       0.14
cribed      0.14
Tube        0.14
Activations Density: 0.531%