Explanations
instances of knowledge and awareness in various contexts
Negative Logits
Token      Logit
Efforts    -0.66
dhist      -0.64
mengel     -0.64
Gill       -0.63
Tos        -0.62
Barbour    -0.60
efforts    -0.58
popd       -0.58
TEntity    -0.58
Larsson    -0.56
Positive Logits
Token      Logit
knows      1.70
know       1.70
know       1.69
Know       1.68
Know       1.64
knows      1.61
KNOW       1.59
KNOW       1.58
Knows      1.58
knew       1.50
Activations Density 0.125%