INDEX
    Explanations

    phrases indicating similarity or comparison

    Negative Logits
    oust      -0.89
    alt       -0.89
    utical    -0.88
    rax       -0.86
    otype     -0.84
    inion     -0.84
    itles     -0.83
    otypes    -0.79
    rouse     -0.78
    oscope    -0.77
    Positive Logits
     somebody   0.88
    lier        0.86
     someone    0.86
     something  0.84
     fireworks  0.83
     everybody  0.79
     everyone   0.79
     goodbye    0.78
     they       0.77
     fun        0.77
    Activation Density 0.042%

    No Known Activations