INDEX
    Negative Logits
    d            -0.70
     POPULAR     -0.64
     popular     -0.61
    m            -0.61
    s            -0.58
    atrix        -0.58
    enko         -0.57
     populares   -0.57
    t            -0.56
    n            -0.54
    Positive Logits
     pleaſure    0.95
     uſe         0.89
     houſe       0.87
     Diſ         0.87
     purpoſe     0.87
     Anſ         0.86
     raiſ        0.85
     Reſ         0.85
     poffe       0.82
     Majefty     0.81
    Activation Density: 0.130%

    No Known Activations