INDEX
    Explanations

    specifies actions and their consequences

    Negative Logits
    Token       Logit
    " set"      -1.44
    "mis"       -1.39
    "Re"        -1.34
    " had"      -1.30
    " took"     -1.27
    " made"     -1.27
    "le"        -1.26
    "Her"       -1.25
    " h"        -1.25
    "si"        -1.24
    Positive Logits
    Token         Logit
    " всички"     1.83
    " tunik"      1.72
    " bluz"       1.70
    " OGSÅ"       1.70
    " superbes"   1.70
    "mainly"      1.62
    " karier"     1.61
    " cewek"      1.60
    " incrí"      1.59
    "Mainly"      1.56
    Activation Density: 0.009%

    No Known Activations