INDEX
    Explanations

    humor that references specific cultural knowledge or events

    Negative Logits
    undef        -0.15
    mdir         -0.14
    ollo         -0.14
    zent         -0.14
    aris         -0.14
    ARI          -0.14
    代           -0.14
    ист          -0.14
    radient      -0.13
     代          -0.13
    Positive Logits
     detail       0.16
     spotted      0.16
     subtle       0.16
    kke           0.15
    iland         0.15
    hidden        0.15
    /reference    0.15
     fle          0.15
    蛛            0.15
     synchron     0.14
    Act Density 0.034%

    No Known Activations