INDEX
Explanations
leading or instructive phrases
Negative Logits
Khan        0.49
João        0.47
THERE       0.46
Referenced  0.45
nhớ         0.44
Wrong       0.44
worst       0.44
Wikipedia   0.43
angi        0.43
Worst       0.43
Positive Logits
spit          0.45
spits         0.42
lac           0.42
pi            0.41
illustrative  0.40
Bol           0.40
boulevard     0.39
instructive   0.39
ramps         0.38
pitching      0.38
Activations Density 0.001%