Explanations
references to fairness, facts, and a mix of political and economic content
Negative Logits
Niet       -0.66
Borders    -0.65
Azerb      -0.61
Grail      -0.60
Seym       -0.57
Clarkson   -0.56
Nare       -0.56
Berk       -0.55
Frie       -0.55
prest      -0.55
Positive Logits
][         0.88
]          0.77
_          0.72
];         0.71
)          0.69
);         0.69
::         0.68
+=         0.67
].         0.67
):         0.66
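Token lists like these are commonly produced by a direct logit projection. The feature's decoder vector is multiplied by the model's unembedding matrix, and the lowest and highest entries become the negative and positive logits. This page does not state how its lists were computed, so the sketch below assumes that procedure. The function name top_logit_effects and its arguments are hypothetical, and the example inputs are toy arrays, not real model weights.

```python
import numpy as np


def top_logit_effects(w_dec_row, w_unembed, vocab, k=10):
    """Project one feature's decoder direction onto the vocabulary.

    Assumed inputs (hypothetical names):
      w_dec_row: shape (d_model,), the feature's decoder vector.
      w_unembed: shape (d_model, vocab_size), the model's unembedding matrix.
      vocab: list of token strings, indexed like the columns of w_unembed.
    Returns (top_negative, top_positive), each a list of (token, value) pairs.
    """
    logits = w_dec_row @ w_unembed          # one value per vocabulary token
    order = np.argsort(logits)              # indices sorted ascending by value
    top_negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    top_positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return top_negative, top_positive


# Toy usage: a 2-dimensional residual stream and a 4-token vocabulary.
vocab = ["a", "b", "c", "d"]
w_dec_row = np.array([1.0, 0.0])
w_unembed = np.array([[0.5, -0.2, 0.9, -0.7],
                      [0.0, 0.0, 0.0, 0.0]])
neg, pos = top_logit_effects(w_dec_row, w_unembed, vocab, k=2)
print(neg)
print(pos)
```

Because the values on this page are rounded to two decimals, tokens that share a value (for example ")" and ");" at 0.69) may be in an arbitrary order relative to each other.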
Activation Density: 1.198%
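Activation density is usually reported as the share of sampled tokens on which the feature's activation is above zero. The page does not say how its 1.198% was measured, so this minimal sketch assumes a flat array of per-token activations and a threshold of zero. The function name activation_density is hypothetical.

```python
import numpy as np


def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumes `activations` holds one feature activation per sampled token.
    """
    acts = np.asarray(activations, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size


# Toy usage: the feature fires on 1 of 4 tokens.
print(activation_density([0.0, 0.3, 0.0, 0.0]))
```

Under this definition, a density of 1.198% means the feature is active on roughly 12 of every 1,000 tokens in the sample.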