INDEX
    Explanations
    Negative Logits
     Mans
    -0.64
     funt
    -0.63
    пла
    -0.57
     test
    -0.55
     collo
    -0.55
    rison
    -0.55
     JUN
    -0.54
    SUND
    -0.53
    ation
    -0.52
     fri
    -0.52
    Positive Logits
    }>
    1.45
    >\
    1.39
    >↵
    1.36
    ]>
    1.26
    )>
    1.21
    \">
    1.21
    >"
    1.20
    }}>
    1.19
     $>$
    1.18
    >$
    1.17
    Activation Density 0.367%

    No Known Activations