INDEX
Explanations
phrases related to unity or bringing things together
phrases related to reasons or justifications
Negative Logits
Else          -0.66
oros          -0.65
upon          -0.63
ican          -0.63
ior           -0.62
iod           -0.61
Afterwards    -0.61
paren         -0.61
iov           -0.60
ington        -0.60
Positive Logits
ones           1.10
overarching    1.04
none           1.00
simplest       0.98
nutshell       0.98
one            0.98
favorites      0.95
particular     0.87
suffice        0.87
predominant    0.85
Activations Density: 0.399%