INDEX
    Explanations

    phrases that indicate involvement or participation in various contexts

    Negative Logits
    Token         Logit
    "heads"       -1.91
    "head"        -1.76
    "ogg"         -1.66
    "stown"       -1.66
    "weights"     -1.61
    "setminus"    -1.56
    " again"      -1.51
    "ctin"        -1.49
    "isters"      -1.49
    "matter"      -1.45
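    Positive Logits
    Token                Logit
    "»¿"                 2.64
    "ĥ½"                 2.31
    "¿"                  2.26
    (whitespace only)    2.23
    (whitespace only)    2.23
    "↵↵  "               2.23
    (whitespace only)    2.23
    (whitespace only)    2.23
    (whitespace only)    2.23
    (whitespace only)    2.23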
    Act Density 0.542%

    No Known Activations