INDEX
    Explanations
    NEGATIVE LOGITS
     shield      -0.07
    (Int         -0.07
    ster         -0.07
    touch        -0.07
     intim       -0.07
    пер          -0.07
     staffers    -0.07
     "...        -0.07
    是一个       -0.07
    framework    -0.07
    POSITIVE LOGITS
     below       0.19
     Below       0.16
    below        0.14
    Below        0.11
     BELOW       0.11
    elow         0.09
    _below       0.08
    Near         0.08
     above       0.08
     아래        0.07
    Activation Density: 0.025%

    No Known Activations