INDEX
    Explanations

    phrases indicating judgment or evaluation based on standards

    Negative Logits
        "ensa"          -0.16
        "owed"          -0.15
        "kud"           -0.15
        "759"           -0.14
        "çĤ"            -0.14
        "ried"          -0.14
        "-prom"         -0.14
        "Capability"    -0.13
        "idden"         -0.13
        "rouw"          -0.13
    Positive Logits
        "昭"             0.16
        "atori"          0.15
        "zier"           0.15
        "алів"           0.14
        "PosX"           0.14
        "iram"           0.14
        " Scalar"        0.14
        "heimer"         0.14
        "Scalar"         0.14
        "ंज"             0.14
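    Dashboards of this kind commonly derive the two logit lists by projecting the feature's decoder direction through the model's unembedding matrix and reading off the most negative and most positive entries. The sketch below assumes exactly that, ignores any layer-norm scaling, and uses toy numbers; `logit_effects` and `top_and_bottom` are hypothetical helper names, not part of any dashboard's API.

```python
# Hypothetical sketch: rank vocabulary tokens by how strongly a feature's
# decoder direction pushes their logits, i.e. the product decoder_dir @ W_U.
# Assumes no layer-norm or bias terms; names and numbers are illustrative only.

def logit_effects(decoder_dir, unembed):
    """Return one score per vocab token: sum_i decoder_dir[i] * unembed[i][t].

    decoder_dir: list of d_model floats (the feature's decoder row).
    unembed: d_model x vocab matrix given as a list of rows.
    """
    vocab = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab)
    ]


def top_and_bottom(scores, tokens, k):
    """Return the k most negative and the k most positive (token, score) pairs."""
    order = sorted(range(len(scores)), key=lambda t: scores[t])  # ascending
    negative = [(tokens[t], scores[t]) for t in order[:k]]
    positive = [(tokens[t], scores[t]) for t in reversed(order[-k:])]
    return negative, positive


if __name__ == "__main__":
    tokens = ["a", "b", "c", "d"]
    decoder_dir = [1.0, -0.5]
    unembed = [
        [0.2, -0.1, 0.0, 0.3],
        [0.4, 0.2, -0.2, 0.0],
    ]
    scores = logit_effects(decoder_dir, unembed)
    negative, positive = top_and_bottom(scores, tokens, 2)
    print("negative:", [t for t, _ in negative])  # most suppressed tokens first
    print("positive:", [t for t, _ in positive])  # most promoted tokens first
```

    Sorting once and slicing both ends keeps the two lists consistent with each other; with a real vocabulary a partial sort (for example `heapq.nsmallest` / `heapq.nlargest`) would avoid sorting every token.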
    Activation Density: 0.012%

    No Known Activations
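    Activation density is typically the fraction of sampled tokens on which the feature's activation is above zero. A minimal sketch under that assumption follows (the helper names and the zero threshold are hypothetical), together with the arithmetic that turns the 0.012% figure into an expected count: 0.012 / 100 × 1,000,000 = 120 firing tokens per million tokens.

```python
# Hypothetical sketch: activation density as the share of tokens on which a
# feature fires. Assumes "fires" means activation > threshold (0.0 by default).

def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


def expected_firings(density_percent, n_tokens):
    """Expected number of firing tokens for a density given in percent."""
    return density_percent / 100 * n_tokens


if __name__ == "__main__":
    acts = [0.0, 0.0, 0.5, 0.0, 1.2]
    print(activation_density(acts))                   # 2 of 5 tokens fire
    print(round(expected_firings(0.012, 1_000_000)))  # about 120 per million tokens
```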