INDEX
    Explanations

    negative or restricted actions

    Negative Logits
    " incar"          0.45
    " adjustments"    0.44
    " inks"           0.42
    " reverses"       0.42
    " absorber"       0.40
                      0.40
    "ِل"              0.40
    " adjuster"       0.39
    " ajustable"      0.39
    "ိုး"             0.38
    Positive Logits
    "onder"           0.43
    "eh"              0.39
    " wise"           0.39
    "ന്മാ"            0.39
    " ראש"            0.38
    "ראש"             0.38
    " করাই"           0.38
    " Craig"          0.37
    "鸿"              0.37
    " maestros"       0.36
    Activation Density: 0.001%

    No Known Activations