INDEX
Explanations
statements where a citation is needed
references to citations needed in texts
Negative Logits
milo        -0.74
wives       -0.65
mare        -0.64
thrott      -0.63
wife        -0.59
profits     -0.58
condos      -0.58
venge       -0.54
leep        -0.54
bragging    -0.54
Positive Logits
]           1.24
]           1.20
][          1.16
])          1.15
][/         1.13
]"          1.11
]:          1.11
]).         1.11
].          1.09
].          1.08
Activations Density 0.014%