    Explanations

    negative or descriptive emotional states and attributes

    Negative Logits
     ſind         -0.98
    󠀠             -0.96
     للاسماء      -0.95
     Menſchen     -0.94
    <unused41>    -0.94
    <unused28>    -0.94
    <unused47>    -0.94
    <unused14>    -0.94
    <unused79>    -0.94
    [@BOS@]       -0.93
    Positive Logits
     —            0.35
                  0.27
     news         0.27
     ...          0.26
    ↵↵            0.24
     law          0.24
    ↵↵↵           0.23
    <eos>         0.23
     pri          0.23
    ${            0.23
    Act Density 0.041%

    No Known Activations