INDEX
    Explanations

    questions and expressions of curiosity

    NEGATIVE LOGITS
     autorytatywna  -0.92
    <unused41>      -0.86
    <unused79>      -0.86
    <unused16>      -0.86
    <unused23>      -0.86
    <unused28>      -0.86
    <unused43>      -0.86
    <unused8>       -0.86
    [@BOS@]         -0.86
    <unused3>       -0.86
    POSITIVE LOGITS
     WHAT   0.46
    what    0.44
     waste  0.43
     what   0.41
     perd   0.40
    WHAT    0.39
     What   0.37
     WTF    0.36
     wrong  0.36
    What    0.36
    Activation Density 0.028%

    No Known Activations