INDEX
Explanations
references to evidence or proof in a text
references to evidence or proof in discussions
Negative Logits
occ      -0.81
zinski   -0.69
roup     -0.67
":["     -0.65
Osw      -0.63
atche    -0.62
ade      -0.61
rik      -0.61
ades     -0.60
imar     -0.60
Positive Logits
proof          4.06
Proof          3.03
Proof          2.83
proofs         2.73
proof          2.69
evidence       1.77
evidence       1.64
proving        1.60
evid           1.41
verification   1.40
Activations Density 0.005%