INDEX
Explanations
colons followed by sentences
punctuation or formatting indicators in the text
Negative Logits
reconc       -0.74
tremend      -0.73
æ©           -0.72
ingred       -0.70
territ       -0.68
bean         -0.66
¥ŀ           -0.65
adversaries  -0.65
diseng       -0.65
manif        -0.64
Positive Logits
âĨij         1.35
Originally   1.10
Originally   0.91
Show         0.88
Interesting  0.88
Regarding    0.85
Yeah         0.84
Assuming     0.84
Whilst       0.82
>>           0.81
Activations Density: 0.059%