INDEX
    Explanations

    occurrences of evaluation or judgment

    Negative Logits
    Token       Logit
    "rompt"     -0.16
    "azer"      -0.15
    "asing"     -0.14
    "agt"       -0.14
    "addy"      -0.14
    "ÑĥÑģÑĤа"    -0.13
    "nik"       -0.13
    "zeros"     -0.13
    "ilia"      -0.13
    "akis"      -0.13
    Positive Logits
    Token       Logit
    " etc"      0.14
    "δή"        0.14
    "aucoup"    0.14
    " atol"     0.14
    "ewed"      0.14
    "(strict"   0.13
    "olu"       0.13
    "dden"      0.13
    " Dit"      0.13
    "ableView"  0.13
    Activation Density: 0.089%

    No Known Activations