INDEX
Explanations
words related to causing physical harm or damage
action words indicating processes or activities
Negative Logits
Token     Logit
ULTS      -0.68
cius      -0.64
ACTION    -0.63
aimon     -0.61
Pwr       -0.61
Nare      -0.59
sidx      -0.59
Harlem    -0.56
brim      -0.56
Ear       -0.55
Positive Logits
Token     Logit
ing       2.70
ership    1.29
ING       1.28
ments     1.25
ging      1.25
ingham    1.24
eering    1.24
edIn      1.21
ingly     1.20
ning      1.13
Activations Density: 0.237%