INDEX
    Explanations

    phrases with legal or adversarial connotations

    Negative Logits
     prosec         -0.81
     princ          -0.77
     citiz          -0.76
     commissions    -0.75
    ¥ŀ              -0.74
     skelet         -0.74
     censored       -0.73
     obser          -0.73
     lifes          -0.72
     newcom         -0.70
    Positive Logits
    "[              2.00
    "(              1.91
    "               1.90
    He              1.69
    "'              1.66
    "...            1.57
    Asked           1.49
    Instead         1.48
    She             1.46
    His             1.44
    Act Density 0.344%

    No Known Activations