INDEX
    Explanations

    phrases that refer to second-person pronouns or direct address

    Negative Logits
    resse
    -0.16
    arten
    -0.15
    asz
    -0.14
    ↵↵
    -0.14
    lest
    -0.14
    erness
    -0.13
     peně
    -0.13
    -cart
    -0.13
    .pb
    -0.13
    tics
    -0.13
    Positive Logits
     know
    0.56
     Know
    0.46
    know
    0.42
    Know
    0.41
     knows
    0.41
     KNOW
    0.34
    知道
    0.31
    -know
    0.28
     зна
    0.28
    知
    0.27
    Activation Density 0.079%

    No Known Activations