INDEX
Explanations
adverbs of certainty or manner
Negative Logits
ruining    0.51
ruins      0.50
hates      0.49
screwed    0.49
ruined     0.48
hurting    0.47
messed     0.47
messing    0.47
suka       0.46
usan       0.46
Positive Logits
undeniably       0.54
inevitably       0.54
arguably         0.54
invariably       0.51
argu             0.47
undoubtedly      0.46
ultimately       0.46
inadvertently    0.45
inescap          0.45
subtly           0.43
Activations Density 0.265%