INDEX
    Explanations

    Tokens that are part of user instructions or explicit task/request prompts (i.e., directive phrases asking the model to do something).

    Negative Logits
     to        0.53
     of        0.52
     auf       0.48
    ă          0.48
     från      0.48
    ão         0.46
     een       0.46
     على       0.45
     with      0.43
     của       0.43
    Positive Logits
    6          0.44
    ის         0.43
     Dave      0.43
    7          0.41
    ございます  0.41
    5          0.40
               0.39
    ке         0.39
    정에       0.38
    ג          0.38
    Activation Density: 12.205%

    No Known Activations