INDEX
Explanations
phrases with pronouns indicating specific individuals
pronouns, particularly those referring to male individuals
Negative Logits
Fail          -0.65
fail          -0.59
Killing       -0.58
Underworld    -0.58
Anarchy       -0.57
Row           -0.55
Description   -0.55
Chem          -0.55
metal         -0.55
Destruction   -0.55
Positive Logits
expects       1.39
understands   1.29
believes      1.28
intends       1.25
regretted     1.21
thinks        1.20
disagrees     1.20
hoped         1.19
hopes         1.18
regrets       1.18
Activations Density: 0.125%