INDEX
Explanations
violent actions and physical harm
references to acts of violence or severe harm
Negative Logits
entric      -0.71
uci         -0.69
Collabor    -0.65
issions     -0.63
ty          -0.61
TP          -0.60
extension   -0.60
impl        -0.60
Mutual      -0.59
AE          -0.59
Positive Logits
beaten      3.79
beat        1.93
beating     1.85
beat        1.59
battered    1.55
slain       1.44
defeated    1.42
Beat        1.39
beats       1.34
bruised     1.33
Activations Density 0.018%