INDEX
    Explanations

    words related to causation or consequences

    Negative Logits
    Token              Logit
    " rehabilit"       -0.69
    " hitter"          -0.68
    "abies"            -0.60
    " veterinarian"    -0.58
    " nurs"            -0.57
    " batter"          -0.57
    " neighb"          -0.56
    " territ"          -0.55
    " battered"        -0.53
    "igger"            -0.53
    Positive Logits
    Token              Logit
    "forth"            2.24
    "forward"          1.44
    " why"             1.01
    "why"              0.89
    "noon"             0.80
    "entimes"          0.80
    "ably"             0.80
    "videos"           0.77
    "fter"             0.76
    "xual"             0.73
    Activation Density: 0.009%

    No Known Activations