INDEX
    Explanations

    guiding and helping actions

    Negative Logits
    " is"    0.45
    "ي"    0.39
    "ی"    0.38
    "5"    0.35
    " æ"    0.34
    "ка"    0.33
    "को"    0.32
    " á"    0.32
    " g"    0.32
    "نه"    0.31
    Positive Logits
    "lying"    0.35
    "に使用"    0.28
    " ಅವರಿಗೆ"    0.28
                 0.27
    "us"    0.26
    "give"    0.26
    "द्भ"    0.26
    "घव"    0.26
    " 빠르게"    0.26
    "J"    0.26
    Activation Density: 0.718%

    No Known Activations