INDEX
    Explanations

    phrases indicating comparisons or similarities

    Negative Logits
    (tokens quoted to show leading spaces)
    "acco"     -0.19
    "orsi"     -0.18
    "gars"     -0.16
    "IGHL"     -0.15
    " LIKE"    -0.15
    "ει"       -0.14
    "igin"     -0.14
    " пу"      -0.14
    "them"     -0.14
    "iego"     -0.14
    Positive Logits
    (tokens quoted to show leading spaces)
    " they"         0.29
    " it"           0.24
    " there"        0.23
    " we"           0.23
    "able"          0.20
    " something"    0.19
    " maybe"        0.18
    "WISE"          0.17
    " a"            0.17
    " she"          0.17
    Activation Density: 0.026%

    No Known Activations