Explanations
logical reasoning and explanations
references to logical reasoning and arguments
Negative Logits
lain    -0.78
affer   -0.74
paid    -0.74
enf     -0.70
rael    -0.69
orks    -0.68
ammy    -0.68
chuk    -0.67
sung    -0.67
aina    -0.66
Positive Logits
posit          0.96
inference      0.92
necessity      0.87
logical        0.85
progression    0.84
deductions     0.83
inconsistency  0.82
istically      0.82
istical        0.81
deduction      0.80
Activations Density 0.007%