INDEX
    Explanations

    numbers in data/code

    Negative Logits
     joke       -0.06
    UK          -0.06
    Criterion   -0.06
    Conversion  -0.06
    -%          -0.06
     Vine       -0.06
    Keyword     -0.06
    FIT         -0.05
     bullshit   -0.05
    .uid        -0.05

    Positive Logits
     {});↵↵     0.07
     Jackets    0.07
     складі     0.07
     thác       0.07
    undy        0.06
     capacidad  0.06
    zza         0.06
    ***/↵↵      0.06
     Bylo       0.06
     неї        0.06

    Act Density 0.090%

    No Known Activations