INDEX
    Explanations

    names and sentence endings

    Negative Logits
     expressions    0.40
     type           0.39
     Expressions    0.39
     dissipation    0.38
     camo           0.37
     dominates      0.37
    ပါတယ်            0.37
    ങ്ങളാണ്           0.36
    ্যার             0.36
     loss           0.36
    Positive Logits
    ۔                       0.63
    (token not rendered)    0.59
    (token not rendered)    0.58
    ."                      0.51
    .).                     0.49
    ).                      0.47
    (token not rendered)    0.47
    .”                      0.47
     quando                 0.46
     referring              0.45
    Act Density 0.016%

    No Known Activations