INDEX
    Explanations

    references to specific media or news sources

    Negative Logits
    "olik"    -0.16
    "agged"   -0.15
    "stown"   -0.15
    "etal"    -0.14
    "ÄįÃŃ"    -0.14
    "aub"     -0.14
    "sta"     -0.14
    "uels"    -0.14
    "blr"     -0.14
    "oji"     -0.14
    Positive Logits
    " Mirror"   0.15
    " Kidd"     0.15
    "ERO"       0.15
    "GY"        0.14
    "ained"     0.14
    " Morrow"   0.14
    " mirror"   0.14
    "ÄĽl"       0.14
    "/gtest"    0.14
    ".chain"    0.14
    Activation Density: 0.004%

    No Known Activations