INDEX
Explanations
phrases related to luring or baiting
language related to deception or entrapment
Negative Logits
urity     -0.82
blance    -0.73
oret      -0.69
yrus      -0.69
eas       -0.69
olitan    -0.68
ppard     -0.67
eely      -0.66
ias       -0.65
iator     -0.65
Positive Logits
lure        1.08
bait        1.03
ument       0.96
mong        0.90
glers       0.84
EStream     0.82
Wag         0.78
GGGGGGGG    0.76
crow        0.71
ãĥĦ         0.70
Activations Density: 0.036%