INDEX
    Explanations

    expressions of preference or decision-making

    Negative Logits
    ito
    -0.17
    rac
    -0.17
    anes
    -0.16
    .lst
    -0.15
    oha
    -0.14
    esen
    -0.14
    อง
    -0.14
    ña
    -0.14
    ras
    -0.14
     خل
    -0.14
    Positive Logits
     rather
    0.37
    rather
    0.33
     Rather
    0.32
    Rather
    0.31
     than
    0.28
    than
    0.27
     plutôt
    0.26
     предпоч
    0.23
    宁
    0.22
     prefer
    0.22
    Act Density 0.184%

    No Known Activations