Explanations
instances of confrontation or arguments
Negative Logits
cx        -0.15
infeld    -0.14
“         -0.14
ensch     -0.14
"$        -0.14
“[        -0.14
isay      -0.13
ertz      -0.13
Fol       -0.13
stere     -0.13
Positive Logits
978           0.18
alian         0.16
your          0.16
.""           0.15
uh            0.14
yourselves    0.14
MOTE          0.14
your          0.14
yourself      0.14
obsess        0.13
Activations Density 1.791%