INDEX
Explanations
high-activation words or characters often associated with online videos or links
Negative Logits
Token    Logit
opi      -0.16
CTL      -0.16
ống      -0.15
chez     -0.15
IGHL     -0.15
inth     -0.15
sinc     -0.15
efd      -0.15
AXB      -0.15
ADV      -0.15
Positive Logits
Token    Logit
ew       0.18
-sw      0.17
sw       0.17
gg       0.17
Bo       0.17
oc       0.16
-w       0.16
lw       0.16
-as      0.16
oc       0.15
Activations Density 0.005%