    Explanations
    No Explanations Found
    Negative Logits
     horrors            0.63
     defamatory         0.63
     Verwendung         0.62   (German: "use")
     hazards            0.61
    écies               0.61
     inaccuracies       0.61
     ornate             0.60
     harms              0.58
     objectionable      0.57
     harmful            0.57
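    Positive Logits
    努力                1.05   (Chinese/Japanese: "effort")
    💪                  1.02
     effort             0.99
    頑張                0.96   (Japanese: "persevere, do one's best")
     노력               0.94   (Korean: "effort")
     ಪ್ರಯತ್ನ             0.94   (Kannada: "effort")
     प्रयत्न             0.93   (Hindi/Marathi: "effort, attempt")
    全力                0.90   (Chinese/Japanese: "full effort")
     চেষ্টা              0.89   (Bengali: "effort, attempt")
     diligently         0.88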
    Activation Density: 0.003%

    No Known Activations