INDEX
    Explanations

    repeated phrases or concepts that indicate similarity

    Negative Logits
    "related"    -0.16
    " itself"    -0.15
    "#ac"        -0.15
    "loff"       -0.14
    "the"        -0.14
    "rac"        -0.14
    "requ"       -0.13
    " better"    -0.13
    " any"       -0.13
    "rious"      -0.13
    Positive Logits
    " exact"     0.43
    " thing"     0.42
    "-sex"       0.41
    " kind"      0.38
    " kinds"     0.35
    " sort"      0.35
    " amount"    0.33
    " type"      0.33
    "exact"      0.32
    "-old"       0.31
    Activation Density: 0.083%

    No Known Activations