INDEX
Explanations
references to manipulative or deceptive behavior in social contexts
Negative Logits
ót        -0.15
vit       -0.15
ví        -0.15
smarty    -0.14
bias      -0.14
KHTML     -0.14
ду        -0.14
reeNode   -0.13
indow     -0.13
probe     -0.13
Positive Logits
Pickup    0.29
pickup    0.28
incel     0.23
pickup    0.23
pickups   0.22
PU        0.21
PU        0.20
kino      0.20
pick      0.19
pick      0.19
Activations Density 0.067%