INDEX
Explanations
phrases related to criminal activities, particularly assault and harassment
references to acts of violence or assault
Negative Logits
spection     -0.79
gio          -0.72
emption      -0.72
HCR          -0.71
join         -0.71
venants      -0.70
ahime        -0.69
ItemImage    -0.69
bet          -0.69
ventures     -0.69
Positive Logits
unsuspecting  1.18
entire        1.03
offending     0.98
hardest       0.91
innocent      0.89
weakest       0.88
targets       0.88
opponent      0.87
nearest       0.87
target        0.87
Activations Density: 0.322%
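
As a rough illustration of what an activation density figure like the one above usually measures, the sketch below computes the fraction of token positions on which a feature fires. It assumes density means "share of tokens with activation above a threshold"; the function name `activation_density`, the `threshold` parameter, and the sample activations are all hypothetical and are not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" is the share of tokens on which the feature is
    active. The dashboard's exact definition may differ.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: the feature fires on 3 out of 1000 token positions.
acts = [0.0] * 997 + [1.2, 0.4, 2.7]
density = activation_density(acts)
print(f"{density:.3%}")
```

A low density of this kind would be consistent with a feature that activates only on a narrow set of contexts, such as the violence-related phrases described in the explanations.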