INDEX
    Explanations

    "So" followed by adjectives

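    Negative Logits
    "othed"        0.70
    "↵↵"           0.56
    " речь"        0.55  (Russian: "speech")
    "ones"         0.54
    " آموز"        0.54  (Persian stem: "learn/teach")
    " пре"         0.52
    "具有"         0.52  (Chinese: "to have, possess")
    "_|"           0.51
    "UPD"          0.51
    "rotated"      0.51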
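    Positive Logits
    " này"         0.93  (Vietnamese: "this")
    " embarrassing" 0.92
    "นี้"          0.90  (Thai: "this")
    " glad"        0.89
    " alot"        0.89
    " questo"      0.89  (Italian: "this")
    " annoying"    0.88
    "नं"           0.85
    " intéressant" 0.85  (French: "interesting")
    " grateful"    0.85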
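    The page does not say how the scores in the two logit lists above are computed. A minimal sketch, assuming the common approach of projecting the feature's decoder direction onto each token's unembedding vector and ranking the results; `decoder_dir`, `unembed`, and the toy numbers are hypothetical, not this feature's real weights.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def logit_effects(decoder_dir: Vector, unembed: Dict[str, Vector]) -> Dict[str, float]:
    """Dot product of a (hypothetical) feature decoder direction with each token's unembedding row."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, row)) for tok, row in unembed.items()}


def top_and_bottom(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most-promoted (largest) and k most-suppressed (smallest) tokens."""
    positive = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)[:k]
    negative = sorted(effects.items(), key=lambda kv: kv[1])[:k]
    return positive, negative


if __name__ == "__main__":
    # Toy 2-dimensional example; real models use thousands of dimensions.
    toy_unembed = {
        " glad": [0.8, -0.2],
        " grateful": [0.7, 0.0],
        "othed": [-0.6, 0.4],
        "UPD": [-0.3, 0.2],
    }
    effects = logit_effects([1.0, -0.5], toy_unembed)
    pos, neg = top_and_bottom(effects, k=2)
    # positive order: " glad", " grateful"; negative order: "othed", "UPD"
    print("positive:", pos)
    print("negative:", neg)
```

    In this sketch the negative list keeps its sign (scores below zero); whether the dashboard shows signed values or magnitudes is not stated on the page.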
    Activation Density: 0.002%
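    A density of 0.002% means the feature is active on roughly 2 of every 100,000 token positions. A minimal sketch of that calculation, assuming (the page does not say) that "active" means an activation strictly above a threshold of zero; `activation_density` and `threshold` are hypothetical names.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold` (assumed to be 0)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


if __name__ == "__main__":
    # Two active positions out of 100,000 tokens gives a density of 0.002%.
    acts = [0.0] * 99_998 + [1.3, 0.4]
    print(f"{activation_density(acts):.3f}%")  # 0.002%
```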

    No Known Activations