    Explanations

    phrases that reference reading or written text

    Negative Logits
    ÙĦب
    -0.16
    åĸ
    -0.15
    αÏģά
    -0.14
     struggles
    -0.14
    erap
    -0.13
    ct
    -0.13
     Ov
    -0.13
    abolic
    -0.13
     Bread
    -0.13
    åij
    -0.13
    Positive Logits
    커ìĬ¤
    0.17
    çŃij
    0.16
    Chr
    0.16
    Ral
    0.16
     Teach
    0.16
    needle
    0.15
    .ta
    0.14
    Ỽ
    0.14
    澤
    0.14
    agem
    0.14
    Activation Density: 0.046%

    No Known Activations