INDEX
    Explanations

    generate human-like text and code

    Negative Logits
    どうしても  0.49
    езде  0.48
     everytime  0.47
     नेहमी  0.47
    每次  0.46
    ALWAYS  0.46
     всегда  0.46
     최대한  0.46
     завжди  0.45
     vždy  0.45
    Positive Logits
     convincingly  0.86
     proficient  0.76
     reasonably  0.75
     reliably  0.74
     successfully  0.73
     almost  0.68
     confidently  0.64
     accurately  0.63
     succesfully  0.63
     effectively  0.61
    Act Density 0.030%

    No Known Activations