Explanations
the word "you"
Negative Logits
Token             Logit
ItemBackground    -0.69
XNUMX             -0.68
PCP               -0.59
Platon            -0.57
snippetHide       -0.55
)))),             -0.54
Sermons           -0.54
Stare             -0.54
imgur             -0.54
considérons       -0.54
Positive Logits
Token             Logit
the               0.70
us                0.64
me                0.59
him               0.56
a                 0.54
our               0.52
your              0.51
away              0.49
his               0.48
[toxicity=0]      0.47
Activations Density: 0.292%