    Explanations

    references to action and consequences in high-stakes scenarios

    Negative Logits
    " velkommen"     -0.36
    " Dominant"      -0.34
    "ніципалі"       -0.32
    " Patches"       -0.32
    " Processing"    -0.32
    "Patches"        -0.31
    "httphttps"      -0.31
    " sẻ"            -0.31
    "centa"          -0.31
    "icorn"          -0.30
    Positive Logits
    "DockStyle"      0.61
    " EMERGENCY"     0.55
    " emergency"     0.54
    "emergency"      0.54
    "asztok"         0.52
    "Emergency"      0.52
    "trigger"        0.51
    " triggered"     0.50
    " trigger"       0.48
    " invoke"        0.47
    Activation Density: 1.078%

    No Known Activations