    Explanations

    references to things or actions being incorrect, inappropriate, or harmful

    Negative Logits
    casters    -0.73
    anned      -0.72
    ranging    -0.68
    ivism      -0.68
    lishes     -0.67
    ury        -0.67
    thood      -0.66
    rs         -0.66
    cit        -0.64
    zeb        -0.63
    Positive Logits
     amount       0.90
     side         0.87
     way          0.86
     kind         0.84
     thing        0.83
     piece        0.75
     direction    0.75
     solution     0.73
     person       0.72
     number       0.72
    Act Density 6.764%

    No Known Activations