    Negative Logits
     well           -0.08
     everything     -0.07
     huh            -0.07
     '',↵           -0.07
    -↵              -0.07
     Nothing        -0.07
    }↵              -0.06
    ="";↵           -0.06
     nothing        -0.06
     всі            -0.06
    Positive Logits
    :               0.40
     :              0.23
    _:              0.20
    ):              0.18
                    0.17
    .:              0.17
    ा:              0.16
    }:              0.16
    >:              0.15
    !:              0.15
    Activation Density: 0.627%

    No Known Activations