Explanations
instances of words related to strong reasoning or arguments
expressions of strong interest or persuasive arguments
Negative Logits
Token      Logit
hops       -0.92
sterdam    -0.77
keepers    -0.69
paces      -0.69
chie       -0.67
hop        -0.66
isites     -0.63
clair      -0.63
usterity   -0.62
keeper     -0.62
Positive Logits
Token      Logit
ibly       1.09
proofs     0.95
evidence   0.93
arguments  0.91
ingly      0.91
argument   0.90
ly         0.88
ible       0.87
proof      0.86
evidence   0.86
Activations Density: 0.104%