Explanations
explicitly censored profanity
explicit language and strong profanity
Negative Logits
mosqu          -0.69
conduc         -0.68
elig           -0.66
isolation      -0.66
Buyable        -0.63
waivers        -0.61
Agric          -0.61
VB             -0.60
Annotations    -0.58
unsupported    -0.58

Positive Logits
cking          1.19
king           1.15
kers           1.13
ked            1.13
tty            1.06
gger           1.04
shit           1.03
tch            1.02
ker            0.98
k              0.97
Activations Density: 0.047%