INDEX
    Explanations

    phrases related to lists of items or categories

    phrases indicating quantities or amounts, often emphasizing the word "more."

    Negative Logits
    Token          Logit
    "lance"        -0.87
    "Joy"          -0.73
    "Unity"        -0.70
    "ivism"        -0.69
    "stadt"        -0.68
    "heed"         -0.68
    "POST"         -0.67
    "BIL"          -0.66
    "rix"          -0.66
    "gow"          -0.66
    Positive Logits
    Token (leading space is part of the token)    Logit
    " dozen"                                      1.13
    " hundred"                                    0.95
    " paragraphs"                                 0.93
    " consecutive"                                0.93
    " sectors"                                    0.91
    " layers"                                     0.91
    " thousand"                                   0.90
    " episodes"                                   0.89
    " segments"                                   0.89
    " exceptions"                                 0.88
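    The logit lists rank vocabulary tokens by how strongly this feature pushes their next-token logits down (negative) or up (positive). A minimal sketch of how such a ranking could be produced, assuming the effect on each token is the dot product of the feature's decoder direction with the unembedding matrix and ignoring any final layer norm. The names `top_logit_effects`, `w_dec_row`, `w_u` and the toy values are hypothetical, chosen only for illustration.

```python
import numpy as np


def top_logit_effects(w_dec_row, w_u, vocab, k=10):
    """Return the k most negative and k most positive logit effects.

    Assumption: effect on each vocabulary logit = decoder direction @ unembedding,
    with no layer norm applied.
    """
    effects = w_dec_row @ w_u          # shape: (vocab_size,)
    order = np.argsort(effects)        # ascending: most negative first
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy example (made-up numbers): d_model = 2, vocabulary of 4 tokens.
w_dec_row = np.array([1.0, 0.0])
w_u = np.array([[0.5, -0.2, 1.1, 0.0],
                [0.0, 0.0, 0.0, 0.0]])
vocab = ["a", "b", "c", "d"]
neg, pos = top_logit_effects(w_dec_row, w_u, vocab, k=2)
print(neg)  # most negative effects first
print(pos)  # most positive effects first
```

    Sorting once with `argsort` and reading both ends keeps the two lists consistent with each other; a real dashboard may also apply layer-norm scaling before ranking.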
    Activation Density: 0.047%
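    A density of 0.047% means the feature fires on roughly 47 of every 100,000 token positions. A minimal sketch of the calculation, assuming a token counts as "active" when the feature's activation is strictly greater than a threshold of 0; the function name `activation_density` and the threshold are illustrative assumptions, not taken from the dashboard.

```python
import numpy as np


def activation_density(acts, threshold=0.0):
    """Fraction of token positions where the activation exceeds `threshold`."""
    acts = np.asarray(acts)
    return float(np.mean(acts > threshold))


# Toy example: 47 active positions out of 100,000 tokens.
acts = np.zeros(100_000)
acts[:47] = 1.0
density = activation_density(acts)
print(f"{density:.3%}")  # formatted as a percentage with three decimals
```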

    No Known Activations