INDEX
Explanations
text structuring and formatting cues
colons and their associated textual contexts
Negative Logits
adversaries     -0.75
æ©              -0.73
pestic          -0.71
¥ŀ              -0.71
userc           -0.70
senal           -0.69
breakthrough    -0.67
phabet          -0.67
ingred          -0.66
hemor           -0.66
Positive Logits
âĨij            0.90
Interesting     0.88
Originally      0.85
Show            0.85
Originally      0.73
Wow             0.72
Hmm             0.72
Assuming        0.72
Nice            0.68
Surely          0.68
Activations Density: 0.073%