INDEX
Explanations
instances where something can be improved or done better
phrases indicating improvement or positive performance
Negative Logits
Tru            -0.75
jected         -0.75
Personality    -0.69
Mastery        -0.69
went           -0.67
ipel           -0.66
ixed           -0.64
Cutter         -0.63
shaw           -0.61
ウス           -0.61
Positive Logits
grunt          0.78
job            0.75
injustice      0.74
ingen          0.71
benefit        0.68
offline        0.68
homework       0.67
deed           0.67
eret           0.64
deserve        0.64
Activations Density 0.077%