INDEX
Explanations
phrases related to collective identity or shared experiences
Negative Logits
Kush            -0.67
gratification   -0.65
firsthand       -0.64
Cliff           -0.63
conflicts       -0.62
citation        -0.59
contradictions  -0.58
Authority       -0.57
stressing       -0.57
sectarian       -0.56
Positive Logits
athered  1.33
bsite    1.28
lder     1.25
arers    1.20
aving    1.19
asel     1.19
eding    1.19
avers    1.17
eps      1.16
eping    1.15
Activations Density: 0.080%