INDEX
    Explanations
    Negative Logits
     |          -0.18
    kin         -0.16
                -0.15
    ania        -0.15
    ames        -0.15
    reich       -0.14
    ):          -0.14
    et          -0.14
     equals     -0.14
    |           -0.14
    Positive Logits
    ="          0.27
    ="">↵       0.21
    ></         0.20
    =”          0.20
    >↵          0.19
    />↵         0.18
    ÃĹ</        0.18
    ='          0.17
    ()>↵        0.17
    =\"         0.17
    Activation Density: 0.006%

    No Known Activations