Explanations
terms related to emotional injury or pain
Negative Logits
keh        -0.16
ootball    -0.16
cio        -0.16
ritte      -0.15
ë£Į        -0.15
CARD       -0.15
ãĥ³ãĤº     -0.15
éĮ²        -0.14
iei        -0.14
hood       -0.14
Positive Logits
lessly     0.22
害         0.18
fully      0.18
ful        0.17
Hurt       0.17
ibal       0.17
ening      0.15
ten        0.14
urious     0.14
less       0.14
Activations Density: 0.023%