INDEX
    Explanations

    phrases indicating advice or recommendations

    seeking or offering tips

    NEGATIVE LOGITS
    Token        Logit
    "<bos>"      -0.59
    " a"         -0.44
    "A"          -0.39
    "↵↵"         -0.38
    " A"         -0.37
    "objects"    -0.35
    "und"        -0.35
    "ander"      -0.35
    "rent"       -0.35
    " an"        -0.35
    POSITIVE LOGITS
    Token        Logit
    " tips"      2.06
    " Tips"      1.89
    "Tips"       1.72
    "tips"       1.70
    " TIPS"      1.59
    " Tipps"     1.55
    "TIPS"       1.45
    "Tipps"      1.24
    "tipps"      1.15
    " dicas"     1.11
    Activation Density: 0.003%

    No Known Activations