INDEX
    NEGATIVE LOGITS
    Token          Logit
    "ют"           -0.07
    " praising"    -0.07
    " bunun"       -0.07
    "ynet"         -0.06
    "_but"         -0.06
    " erect"       -0.06
    "reas"         -0.06
    "indows"       -0.06
    "こう"         -0.06
    "loat"         -0.06
    POSITIVE LOGITS
    Token            Logit
    "ethical"        0.07
    " refrigerator"  0.07
    " missing"       0.06
    "\tline"         0.06
    " ↵ ↵"           0.06
    " easier"        0.06
    " basketball"    0.06
    ".metric"        0.06
    "\tusername"     0.06
    " skilled"       0.06
    Activation Density: 0.002%

    No Known Activations