    Explanations

    statements related to subjective opinions and perspectives

    Negative Logits
    niſſe        -1.05
     müſſen      -0.98
    <unused43>   -0.98
    <pad>        -0.97
    <unused41>   -0.97
     geweſen     -0.97
    <unused14>   -0.96
    <unused3>    -0.96
    [@BOS@]      -0.96
    <unused1>    -0.96
    Positive Logits
    ,            0.61
                 0.54
    ...          0.50
     …           0.45
     ...         0.39
    ?            0.38
    ……           0.36
    !            0.36
     really      0.35
    *            0.34
    Activation Density 0.409%

    No Known Activations