INDEX
Explanations
references to specific dates, events, and locations in news articles
Negative Logits
imperson    -0.85
pretended   -0.81
lately      -0.77
pooled      -0.77
misplaced   -0.76
mistaken    -0.75
¬¼          -0.73
melted      -0.72
pired       -0.72
disguise    -0.71
Positive Logits
Tickets         1.16
Meanwhile       1.10
Tickets         1.09
Until           1.08
Dates           1.06
Depending       1.03
<|endoftext|>   1.02
Assuming        1.00
Expect          0.98
Sources         0.98
Activations Density: 0.354%