INDEX
    Negative Logits
    Logit  Token
    0.49   ватися
    0.48    څرنګوالی
    0.48    personalidade
    0.47    égaux
    0.47
    0.47   […]
    0.47    personalità
    0.47   ۟
    0.46    facto
    0.46    thinkers
    Positive Logits
    Logit  Token
    2.17   ↵↵↵
    2.13   ↵↵↵↵
    1.91   ↵↵↵↵↵
    1.86   ↵↵↵↵↵↵
    1.81   ↵↵↵↵↵↵↵
    1.57   ↵↵↵↵↵↵↵↵↵
    1.45   ↵↵↵↵↵↵↵↵
    1.39   ↵↵↵↵↵↵↵↵↵↵
    1.34   ↵↵↵↵↵↵↵↵↵↵↵↵↵
    1.33   ↵↵↵↵↵↵↵↵↵↵↵
    Act Density 0.450%

    No Known Activations