INDEX
Explanations
the phrase 'common sense'
references to the concept of common sense
Negative Logits
atern   -0.77
ETA     -0.76
etsk    -0.75
chrom   -0.74
Stars   -0.72
raph    -0.71
\/\/    -0.69
soon    -0.67
bye     -0.67
href    -0.66
Positive Logits
ACTIONS      0.91
smanship     0.90
pants        0.76
ensical      0.73
Cola         0.70
dictates     0.69
constraints  0.66
iness        0.64
imitation    0.63
Dynamics     0.62
Activations Density 0.041%