Explanations
phrases indicating protection or safety from various dangers or negative influences
Negative Logits
sparing    -0.14
acific     -0.14
<<<        -0.14
èĥ¶        -0.14
ahl        -0.14
ĶĶ         -0.14
/fwlink    -0.13
luent      -0.13
ÙħØ´Ú©     -0.13
oyer       -0.13
Positive Logits
scrutiny   0.25
attack     0.24
being      0.20
criticism  0.19
becoming   0.19
harm       0.18
harms      0.18
attacks    0.18
æĿ¥èĩª     0.18
further    0.17
Activations Density 0.160%