INDEX
    Explanations

    societal and social norms

    Negative Logits
    "ك"                     0.41
    "يا"                    0.36
    " ("                    0.33
    "ö"                     0.31
    (token not rendered)    0.31
    "ص"                     0.31
    " is"                   0.30
    (token not rendered)    0.30
    (token not rendered)    0.29
    ".*"                    0.28
    Positive Logits
    " общество"      0.36
    "社会"            0.36
    " masyarakat"    0.34
    " society"       0.34
    " sociali"       0.33
    " 사회"           0.32
    "社会的"          0.32
    " الاجتماعي"     0.32
    " thiểu"         0.32
    " sociaux"       0.31
    Act Density 0.129%

    No Known Activations