Explanations
generate human-like text and code
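Negative Logits
どうしても 0.49 (Japanese: "no matter what")
езде 0.48
everytime 0.47
नेहमी 0.47 (Marathi: "always")
每次 0.46 (Chinese: "every time")
ALWAYS 0.46
всегда 0.46 (Russian: "always")
최대한 0.46 (Korean: "as much as possible")
завжди 0.45 (Ukrainian: "always")
vždy 0.45 (Czech/Slovak: "always")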
Positive Logits
convincingly 0.86
proficient 0.76
reasonably 0.75
reliably 0.74
successfully 0.73
almost 0.68
confidently 0.64
accurately 0.63
succesfully 0.63
effectively 0.61
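Dashboards of this kind commonly derive the positive and negative logit lists by projecting the feature's decoder direction onto the model's unembedding matrix and ranking the per-token scores. The sketch below assumes that procedure. The names `decoder_dir`, `unembed`, `logit_effects`, and `top_and_bottom` are hypothetical, and the toy vectors are illustrative, not taken from this feature.

```python
from typing import Dict, List, Tuple

# Hypothetical sketch: assumes a token's logit effect is the dot product of
# the feature's decoder direction with that token's unembedding vector.
def logit_effects(decoder_dir: List[float],
                  unembed: Dict[str, List[float]]) -> Dict[str, float]:
    """Score every vocabulary token against the feature's decoder direction."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, col))
            for tok, col in unembed.items()}


def top_and_bottom(effects: Dict[str, float],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most promoted (positive) and k most suppressed (negative) tokens."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    positive = ranked[:k]
    negative = ranked[::-1][:k]  # most negative score first
    return positive, negative


# Toy 2-dimensional example (illustrative values only).
decoder_dir = [1.0, 0.0]
unembed = {
    "reliably": [0.8, 0.1],
    "always": [-0.5, 0.3],
    "the": [0.0, 1.0],
}
effects = logit_effects(decoder_dir, unembed)
positive, negative = top_and_bottom(effects, k=1)
print("positive:", positive)  # [('reliably', 0.8)]
print("negative:", negative)  # [('always', -0.5)]
```

Because every vocabulary entry is scored, a full-size ranking can surface subword pieces and misspellings (such as `езде` or `succesfully` above) alongside whole words.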
Activation Density 0.030%
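Activation density is usually reported as the percentage of token positions on which the feature fires. At 0.030%, that works out to roughly 3 of every 10,000 tokens. The following minimal sketch assumes a feature counts as active when its activation is strictly above zero, as with a ReLU encoder. The function name `activation_density` and the sample data are hypothetical.

```python
from typing import Sequence

# Hypothetical sketch: "active" is assumed to mean activation > threshold (default 0).
def activation_density(acts: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`."""
    if not acts:
        raise ValueError("need at least one activation")
    active = sum(1 for a in acts if a > threshold)
    return 100 * active / len(acts)


# Toy sample: 3 active positions out of 10,000 tokens.
sample = [0.0] * 9_997 + [1.2, 0.5, 3.0]
print(f"{activation_density(sample):.3f}%")  # 0.030%
```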