    Explanations

    reduces or normalizes harm

    Negative Logits
    Token          Value
    "各类"         0.43
    "任何"         0.43
    " Various"     0.43
    " çeşitli"     0.43
    "various"      0.42
    " plupart"     0.40
    "aires"        0.40
    " কোনও"        0.40
    " various"     0.39
    " ใด"          0.39
    Positive Logits
    Token          Value
    " столь"       1.01
    " arguably"    0.84
    " something"   0.80
    "的重要"       0.80
    " niezwy"      0.74
    "something"    0.73
    " важней"      0.71
    "重要的"       0.70
    " важ"         0.70
    "ដ៏"           0.70
    Activation Density: 0.033%

    No Known Activations