    Explanations

    phrases indicating a contrast or comparison between two situations

    Negative Logits
    Token           Logit
    "jack"          -0.16
    "enton"         -0.15
    "uilt"          -0.15
    "èħ°"           -0.15
    "LETE"          -0.14
    "éŀ"            -0.14
    "proved"        -0.14
    "ãģ£ãģı"        -0.14
    "ailability"    -0.14
    "uit"           -0.14
    Positive Logits
    Token           Logit
    " flip"         0.24
    "flip"          0.22
    " other"        0.20
    " Flip"         0.19
    " flips"        0.19
    " upside"       0.18
    ".flip"         0.18
    "Flip"          0.17
    "761"           0.17
    " flipping"     0.16
    Act Density 0.029%

    No Known Activations
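
    Some tokens in the logit lists (for example èħ°, éŀ and ãģ£ãģı) look like
    byte-level BPE display strings rather than ordinary text. The page does not
    name the tokenizer, so the sketch below assumes the GPT-2 style byte-to-character
    table; the function names are illustrative and not taken from any tool. It shows
    one way to turn such display strings back into readable text.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table used by GPT-2 style
    byte-level BPE (assumed here; the source page does not name a tokenizer)."""
    # Bytes that are displayed as themselves.
    printable = (
        list(range(0x21, 0x7F))      # '!' .. '~'
        + list(range(0xA1, 0xAD))    # U+00A1 .. U+00AC
        + list(range(0xAE, 0x100))   # U+00AE .. U+00FF
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    # Every remaining byte is shown as a character starting at U+0100.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: display character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(display):
    """Map a token's display string back to bytes and decode it as UTF-8."""
    raw = bytearray()
    for ch in display:
        if ch in UNICODE_TO_BYTE:
            raw.append(UNICODE_TO_BYTE[ch])
        else:
            # Character outside the table (e.g. a plain space that the page
            # already decoded): keep it unchanged.
            raw.extend(ch.encode("utf-8"))
    # Tokens that end in the middle of a multi-byte character cannot be fully
    # decoded; the undecodable part becomes U+FFFD.
    return bytes(raw).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for token in ["èħ°", "ãģ£ãģı", "éŀ", "jack"]:
        print(repr(token), "->", repr(decode_token(token)))
```

    Under that assumption, èħ° decodes to 腰 and ãģ£ãģı decodes to っく, while éŀ
    covers only part of a multi-byte character and cannot be decoded on its own.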