INDEX
    Explanations
    Negative Logits
     Trou         -0.07
     discover     -0.07
    'O            -0.06
     htt          -0.06
     Newton       -0.06
     intuition    -0.06
     Joy          -0.06
    typings       -0.06
     Sevent       -0.06
    Trou          -0.06
    Positive Logits
     based        0.21
     Based        0.19
    based         0.16
    Based         0.16
    -based        0.15
    -Based        0.13
     biased       0.08
    _based        0.08
    ựa            0.08
     기반         0.08
    Act Density 0.066%

    No Known Activations