INDEX
    NEGATIVE LOGITS
    Token             Logit
    "038"             -0.07
    "41"              -0.07
    "37"              -0.07
    "46"              -0.07
    "93"              -0.07
    " College"        -0.07
    " Korea"          -0.07
    " Gerr"           -0.07
    "38"              -0.07
    " girl"           -0.07
    POSITIVE LOGITS
    Token             Logit
    " advantages"     0.16
    " advantage"      0.15
    " Advantage"      0.12
    " advantageous"   0.11
    "antages"         0.10
    "вай"             0.08
    " advant"         0.08
    " disadvantage"   0.08
    " disadvantages"  0.08
    "optimized"       0.07
    Activation Density: 0.014%

    No Known Activations