INDEX
    Explanations
    Negative Logits
    Token          Logit
    17             -0.11
    (not shown)    -0.09
    197            -0.09
    13             -0.09
    103            -0.08
    087            -0.08
    16             -0.08
    167            -0.08
    996            -0.08
    (not shown)    -0.08
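    Positive Logits
    Token          Logit
    (not shown)    0.18
    (not shown)    0.18
    (not shown)    0.16
    (not shown)    0.16
    (unreadable)   0.16
    (not shown)    0.15
    (not shown)    0.15
    (not shown)    0.15
    (unreadable)   0.14
    (unreadable)   0.14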
    Act Density 0.004%

    No Known Activations