INDEX
    Explanations
    Negative Logits
    Token        Logit
    " nobody"    -0.07
    " Toggle"    -0.06
    " Você"      -0.06
    " going"     -0.06
    "-girl"      -0.06
    " don"       -0.06
    "742"        -0.06
    " Bou"       -0.06
    " 대학"      -0.06
    " boyc"      -0.06
    Positive Logits
    Token        Logit
    " as"        0.24
    " As"        0.19
    "As"         0.17
    " AS"        0.15
    "as"         0.15
    "—as"        0.14
    "AS"         0.13
    "-as"        0.13
    ".As"        0.13
    "\tas"       0.13
    Activation Density: 0.334%

    No Known Activations