INDEX
    Explanations

    references to the concepts of protection and prevention related to people and their actions

    Head Attr Weights
    0:0.02
    1:0.02
    2:0.05
    3:0.13
    4:0.37
    5:0.04
    6:0.04
    7:0.12
    8:0.03
    9:0.03
    10:0.06
    11:0.05
    Negative Logits
    " Rohing"    -1.82
    " Airl"      -1.51
    " unden"     -1.47
    " Wheels"    -1.45
    " advoc"     -1.41
    " lil"       -1.41
    " Ashton"    -1.40
    "forums"     -1.37
    ":{"         -1.37
    -1.35
    Positive Logits
    " harm"              1.73
    " undue"             1.70
    " adversely"         1.62
    " harmful"           1.62
    " avoidance"         1.61
    " future"            1.60
    "ividual"            1.60
    " altogether"        1.48
    " discriminating"    1.47
    " falsely"           1.47
    Act Density 0.067%

    No Known Activations