INDEX
    Explanations

    "be" followed by an adjective

    Tokens in the assistant's long explanatory or instructional responses (i.e., helpful, informative sentences).

    NEGATIVE LOGITS
    ל
    0.33
     séparation
    0.32
    0.31
     plufieurs
    0.31
    ЕЛЬ
    0.30
     nyní
    0.30
    ת
    0.30
    aldb
    0.30
     اسے
    0.30
    ل
    0.29
    POSITIVE LOGITS
     
    0.47
     an
    0.44
     able
    0.44
     be
    0.42
     in
    0.38
     a
    0.37
     t
    0.35
     of
    0.35
     e
    0.34
    friend
    0.34
    Activation Density: 0.375%

    No Known Activations