INDEX
    Explanations

    phrases addressing the user

    Negative Logits
    Token           Logit
    " ["            0.61
    " ("            0.60
    "↵↵"            0.59
    "Со"            0.59
    (not shown)     0.58
    "</"            0.58
    (not shown)     0.57
    "Ver"           0.56
    " )"            0.56
    "Пол"           0.56
    Positive Logits
    Token           Logit
    " yourselves"   1.84
    " yourself"     1.83
    " Yourself"     1.83
    "yourself"      1.73
    " me"           1.72
    " نفسك"         1.48
    " your"         1.47
    "してください"    1.45
    " jezelf"       1.44
    "해주세요"       1.38
    Act Density 0.451%

    No Known Activations