INDEX
Explanations
sentences indicating insights, unraveling mysteries, or achieving deep understanding
Negative Logits
amon        -0.69
sequently   -0.68
)",         -0.68
tones       -0.67
igree       -0.65
Others      -0.65
wick        -0.64
occasion    -0.63
Scroll      -0.62
)"          -0.62
Positive Logits
goddamn     0.88
enegger     0.75
/(          0.73
fucking     0.73
overest     0.72
BILITY      0.72
damn        0.69
willfully   0.68
reinvent    0.68
fucked      0.67
Activations Density: 0.726%