INDEX
    Explanations

    instructions or suggestions for taking specific actions

    phrases emphasizing the necessity of verifying or consulting information

    Negative Logits
    Token          Logit
    "gery"         -0.73
    "hum"          -0.71
    "alf"          -0.70
    "rock"         -0.69
    "bled"         -0.68
    "folk"         -0.67
    "bern"         -0.67
    "MpServer"     -0.66
    "pher"         -0.66
    "OT"           -0.66
    Positive Logits
    Token          Logit
    "icio"         0.78
    " Thrones"     0.70
    " patience"    0.67
    " compr"       0.67
    " beforehand"  0.65
    " Titus"       0.64
    "ilus"         0.62
    "ppo"          0.62
    " clicking"    0.62
    " Siren"       0.61
    Activation Density: 0.037%

    No Known Activations