INDEX
    Explanations
    Negative Logits
    " review"       -1.47
    "review"        -1.33
    " Review"       -1.30
    " REVIEW"       -1.28
    "Review"        -1.27
    " reviewed"     -1.17
    "REVIEW"        -1.17
    " reviewing"    -1.16
    " للاسماء"      -1.11
    " myſelf"       -1.07
    Positive Logits
    "-"             0.75
    (not rendered)  0.72
    "<eos>"         0.71
    ","             0.69
    " ("            0.68
    " -"            0.66
    " for"          0.66
    ":"             0.65
    (not rendered)  0.63
    "."             0.63
    Act Density 1.124%

    No Known Activations