    Explanations

    phrases that express preference or contrast

    Negative Logits
    Token       Logit
    "urator"    -0.17
    "urious"    -0.16
    "rav"       -0.16
    "ury"       -0.16
    "elly"      -0.15
    "sw"        -0.15
    "$$$$"      -0.14
    "erdale"    -0.14
    "Č"         -0.14
    "uring"     -0.14
    Positive Logits
    Token       Logit
    "instead"   0.19
    "-than"     0.19
    " than"     0.17
    "-known"    0.15
    "ÙĦ"        0.15
    "erro"      0.15
    "445"       0.15
    "ÙĨÚ¯ÛĮ"    0.15
    ".consume"  0.15
    "apy"       0.14
    Activation Density: 0.018%

    No Known Activations