INDEX
Explanations
expressions of strong emotions or beliefs
phrases that express different forms of opinion or perspective
Negative Logits
bid          -0.80
cas          -0.70
bats         -0.68
head         -0.66
fn           -0.66
idity        -0.65
notations    -0.64
uity         -0.63
heads        -0.63
orgetown     -0.62
Positive Logits
sorts        1.25
pure         0.82
theirs       0.79
icial        0.71
ours         0.70
genuine      0.70
necessity    0.67
Lear         0.66
honour       0.65
aeper        0.65
Activations Density 0.218%