INDEX
    Explanations

    phrases indicating alignment or similarity

    Negative Logits
    Token          Logit
    "ı"            -0.06
    "normally"     -0.06
    " Lon"         -0.06
    " Usually"     -0.05
    "Blocking"     -0.05
    "quez"         -0.05
    "¢"            -0.05
    "oard"         -0.05
    "specified"    -0.05
    "ment"         -0.05
    Positive Logits
    Token          Logit
    " match"       0.08
    " corre"       0.08
    " exactly"     0.08
    " matched"     0.08
    " same"        0.07
    " identical"   0.07
    "same"         0.07
    "-match"       0.07
    "match"        0.07
    ".Match"       0.07
    Activation Density: 0.031%

    No Known Activations