INDEX
Explanations
Instances of profane or derogatory language
Negative Logits
HCR                                 -1.18
fman                                -1.11
gary                                -0.97
================================    -0.96
Expend                              -0.96
ocamp                               -0.96
NetMessage                          -0.96
ervation                            -0.92
CVE                                 -0.90
AUT                                 -0.90
Positive Logits
bags         1.27
posts        1.20
storm        1.15
loads        1.14
detector     1.14
heads        1.13
detectors    1.10
faced        1.08
lords        1.07
lord         1.06
Activations Density 0.714%