INDEX
    Explanations

    phrases indicating requests or demands from authority figures

    Negative Logits
    Token               Logit
    " contr"            -0.19
    "677"               -0.15
    "893"               -0.15
    " semiclassical"    -0.14
    "805"               -0.14
    "rella"             -0.14
    "alia"              -0.14
    " CONTR"            -0.13
    "829"               -0.13
    "veys"              -0.13
    Positive Logits
    Token               Logit
    "alink"             0.16
    "anlı"              0.15
    "å³"                0.15
    " PureComponent"    0.14
    " ev"               0.14
    "anoi"              0.14
    "üc"                0.14
    "oho"               0.13
    "ighet"             0.13
    "chie"              0.13
    Activation Density: 0.059%

    No Known Activations