    Explanations

    phrases indicating justification or reasoning

    Negative Logits
    Token         Logit
    "oux"         -0.20
    "å¹ķ"         -0.18
    " Pok"        -0.17
    "earer"       -0.14
    "):-"         -0.14
    "ammen"       -0.14
    "aku"         -0.14
    " hol"        -0.14
    "ushman"      -0.14
    "931"         -0.13

    Positive Logits
    Token         Logit
    " reason"     0.37
    "reason"      0.27
    " reasons"    0.26
    " Reason"     0.24
    "Reason"      0.23
    " purpose"    0.23
    "_reason"     0.21
    ".reason"     0.20
    "_REASON"     0.20
    "purpose"     0.19
    Activation Density: 0.041%

    No Known Activations