Explanations
This neuron detects profanity-laden calls to ignore or break the rules (e.g., “bullshit”/“fuckin’” style exhortations to flout the policy).
Negative Logits
rowning       -0.07
glance        -0.06
IRS           -0.06
things        -0.06
-tracking     -0.06
Utility       -0.06
surfaces      -0.06
pools         -0.06
worsening     -0.06
turmoil       -0.06
Positive Logits
dbName         0.08
ortho          0.07
agra           0.07
.office        0.07
Waiting        0.07
cycl           0.06
/Common        0.06
ือถ            0.06
.wikipedia     0.06
EntityManager  0.06
Activations Density 0.002%