INDEX
    Explanations

    phrases that indicate preference or favor towards something

    Negative Logits
    awar      -0.83
    block     -0.75
    /-        -0.69
    }}        -0.69
    UGH       -0.65
    break     -0.65
    stan      -0.65
    wow       -0.64
    nor       -0.62
    eno       -0.62
    Positive Logits
     simpler          0.87
     simplified       0.82
     embracing        0.77
     softer           0.76
     streamlined      0.74
     something        0.71
     sleek            0.70
     trendy           0.70
     concentrating    0.69
     embrace          0.67
    Act Density 0.075%
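
    The page does not define activation density. A common reading is the fraction of token positions on which the feature fires, so 0.075% would be roughly 75 active tokens per 100,000. The sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means active positions divided by total positions.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 3 active positions out of 4,000 tokens.
sample = [0.0] * 3997 + [1.2, 0.4, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

    Under this reading, a density this low suggests a sparse feature that is active on very few tokens, which fits the absence of listed activations.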

    No Known Activations