INDEX
    Explanations

    prepositional phrases indicating a specific type of action or behavior

    phrases that express various types of attention or critique

    Negative Logits
     Pigs       -0.75
     Crush      -0.72
    gow         -0.70
    pots        -0.66
     Sands      -0.65
     Rocks      -0.64
    gor         -0.64
    APS         -0.64
    mates       -0.62
    ours        -0.61
    Positive Logits
     thing      1.02
     scenario   0.81
     behavior   0.75
     stuff      0.73
     nonsense   0.72
     activity   0.71
     attrition  0.70
     kindred    0.69
    tnc         0.69
     fate       0.68
    Activation Density: 0.033%

    No Known Activations