    Explanations

    phrases indicating a comparison or evaluation based on a certain criterion

    Negative Logits
    Token       Logit
    "rouse"     -0.80
    "enh"       -0.74
    "Correct"   -0.74
    "orem"      -0.71
    "anasia"    -0.70
    "okes"      -0.69
    "oked"      -0.69
    "arez"      -0.69
    "terior"    -0.68
    "ernal"     -0.68
    Positive Logits
    Token              Logit
    " how"             0.98
    " recent"          0.89
    " previous"        0.70
    "recent"           0.69
    " current"         0.68
    " similarities"    0.67
    " rumors"          0.66
    " its"             0.66
    " what"            0.65
    " there"           0.65
    Act Density 0.099%

    No Known Activations