    Explanations

    phrases related to inappropriate behavior and consequences

    Negative Logits
    Token          Logit
    "OLOGY"        -0.73
    " Trojan"      -0.68
    " Waterloo"    -0.67
    " Roose"       -0.66
    "enegger"      -0.66
    "eways"        -0.63
    "WM"           -0.63
    "esan"         -0.62
    " Printing"    -0.61
    " Kev"         -0.61

    Positive Logits
    Token          Logit
    "inent"         0.92
    " abst"         0.90
    "ain"           0.90
    "ainer"         0.87
    "inance"        0.86
    "ention"        0.85
    "rences"        0.85
    "atory"         0.83
    "ained"         0.82
    "aining"        0.81

    Activation Density: 0.027%

    No Known Activations