    Explanations

    instances and examples used in explanations or arguments

    Negative Logits
    Token          Logit
    "<unused43>"   -0.96
    "<unused74>"   -0.95
    "<unused41>"   -0.95
    "<pad>"        -0.95
    "[@BOS@]"      -0.95
    "<unused42>"   -0.94
    "<unused51>"   -0.94
    "<unused28>"   -0.94
    "<unused8>"    -0.94
    "<unused17>"   -0.94
    Positive Logits
    Token            Logit
    ","              0.84
    "первых"         0.44
    " obstante"      0.43
    "However"        0.37
    "Therefore"      0.35
    " However"       0.35
    " however"       0.34
    "Nevertheless"   0.33
    "Moreover"       0.31
    "Furthermore"    0.31
    Activation Density: 0.575%

    No Known Activations