INDEX
    Explanations

    instructions and guides

    NEGATIVE LOGITS
     culturais       0.45
     ಸದಸ್ಯ            0.42
                     0.41
     societal        0.40
    感染              0.38
     society         0.38
     culturale       0.38
     সমাজে           0.38
    社會              0.38
     ಕ್ಷೇತ್ರ           0.37
    POSITIVE LOGITS
     instructions    1.11
     Instructions    1.06
     instrucciones   0.99
     инструкции      0.96
     tutorials       0.94
    instructions     0.91
     tutorial        0.91
     Anleitung       0.89
     Tutorials       0.89
    Instructions     0.89
    Act Density 0.618%

    No Known Activations