INDEX
    Explanations

    Movie reviews

    Negative Logits
     pha        -0.06
     хорошо     -0.06
     lesbians   -0.06
    properties  -0.06
     spoke      -0.06
     различ     -0.06
    …”          -0.06
    .ms         -0.06
     variety    -0.06
    -publish    -0.06
    Positive Logits
    。↵↵↵↵       0.08
    back        0.06
    ioxid       0.06
    ckpt        0.06
     yere       0.06
     تون        0.06
                0.06
    agent       0.06
    ění         0.06
     tvor       0.06
    Activation Density: 0.082%

    No Known Activations