INDEX
    Explanations

    question words

    Negative Logits
    Logit  Token
    -0.71  ↵↵
    -0.59  (token missing in source)
    -0.57   "
    -0.53  (token missing in source)
    -0.52  <bos>
    -0.49  (whitespace)
    -0.49   (
    -0.48   “
    -0.46   of
    -0.46  "
    Positive Logits
    Logit  Token
    0.95    للمعارف
    0.90    Efq
    0.84    виправивши
    0.82    itſelf
    0.82    مرئيه
    0.79    $_(
    0.78    فريبيس
    0.78   ſelf
    0.77    ***!
    0.77    myſelf
    Activation Density: 0.020%

    No Known Activations