Explanations
quotes or strong statements regarding controversial claims and denials
Negative Logits
Token         Logit
itud          -0.15
lag           -0.15
zee           -0.14
cynical       -0.14
escal         -0.14
_extraction   -0.13
air           -0.13
Bair          -0.13
aily          -0.13
oose          -0.13
Positive Logits
Token         Logit
icros         0.19
reject        0.18
reject        0.17
waste         0.16
Reject        0.15
.reject       0.15
rejects       0.15
ibold         0.15
éϤ            0.15
<src          0.15
Activations Density: 0.251%