Explanations
- Updates or clarifications in news articles
- References to specific dates and events
Negative logits
Token            Logit
often            -0.82
Often            -0.76
nurture          -0.73
Imagine          -0.72
²¾               -0.72
ĸļ               -0.71
stereotypical    -0.71
pires            -0.71
often            -0.70
Years            -0.67
Positive logits
Token            Logit
UPDATE           1.21
UPDATE           1.20
Update           1.09
confirms         1.08
confirming       1.07
PDATED           1.06
updated          1.03
!]               1.01
corrected        1.00
confirmed        0.99
Activation density: 0.598%