INDEX
    Explanations

    polite second-person questions asking the assistant to perform an action.

    Negative Logits
    ل
    0.54
    ডি
    0.50
    0.50
    م
    0.50
    ر
    0.49
    ないと
    0.47
    有两个
    0.47
     двох
    0.46
    لی
    0.46
    0.46
    Positive Logits
    ?),
    0.52
    ...?
    0.50
    …?
    0.49
    ker
    0.46
     एखा
    0.44
     подума
    0.43
    ?"
    0.42
    uuuu
    0.42
     gladly
    0.42
     unwittingly
    0.41
    Act Density 0.075%

    No Known Activations