INDEX
    Explanations

    phrases with specific leading words

    Negative Logits
     disert        0.52
     discursive    0.51
     אטאטורק       0.50
     whakar        0.49
                   0.49
    legraph        0.49
                   0.49
    Alessandro     0.48
     औपचारिक       0.48
     korero        0.48
    Positive Logits
     over          0.44
    <0xE2>         0.44
                   0.44
     über          0.43
     Step          0.41
     Debug         0.40
     looping       0.39
     Bonus         0.38
     Protection    0.38
     loop          0.38
    Act Density 0.011%

    No Known Activations