INDEX
    Explanations

    high-activation words or characters often associated with online videos or links

    Negative Logits
    Token      Logit
    "opi"      -0.16
    "CTL"      -0.16
    "ống"      -0.15
    "chez"     -0.15
    "IGHL"     -0.15
    "inth"     -0.15
    " sinc"    -0.15
    "efd"      -0.15
    "AXB"      -0.15
    "ADV"      -0.15
    Positive Logits
    Token      Logit
    "ew"       0.18
    "-sw"      0.17
    " sw"      0.17
    "gg"       0.17
    " Bo"      0.17
    " oc"      0.16
    "-w"       0.16
    "lw"       0.16
    "-as"      0.16
    "oc"       0.15
    Act Density 0.005%

    No Known Activations