INDEX
    Negative Logits
     intact
    -0.08
     chữ
    -0.07
     pagka
    -0.07
    ****/↵
    -0.07
    vorm
    -0.07
     schwer
    -0.07
    -spec
    -0.07
    	ds
    -0.07
    -0.07
     형태
    -0.07
    Positive Logits
     AMA
    0.08
     gladly
    0.08
     eindelijk
    0.08
    😊
    0.08
     pretending
    0.08
     verzoek
    0.08
     necesit
    0.08
     assistant
    0.08
     teşekkür
    0.07
     😊
    0.07
    Activation Density 0.007%

    No Known Activations