INDEX
    Explanations

    phrases expressing concerns or fears about potential negative consequences

    Negative Logits
    etler
    -0.16
    aman
    -0.15
    iaux
    -0.15
    kc
    -0.14
    yon
    -0.14
    lar
    -0.14
    ama
    -0.14
     hopefully
    -0.14
     humble
    -0.13
    433
    -0.13
    Positive Logits
     too
    0.29
     somehow
    0.28
     TOO
    0.27
    too
    0.26
    Too
    0.20
     Too
    0.20
    -too
    0.20
     might
    0.19
     podría
    0.18
     dil
    0.18
    Activation Density: 0.172%

    No Known Activations