INDEX
Explanations
words related to advocating for or supporting a particular issue or belief
references to the concept of "cause."
Negative Logits
PDATE     -0.79
Ku        -0.72
Seym      -0.68
illet     -0.67
egu       -0.66
Seasons   -0.65
aeper     -0.65
awatts    -0.64
raph      -0.63
Leopard   -0.63
Positive Logits
cele      1.37
cause     0.86
way       0.79
ality     0.79
celeb     0.76
vier      0.72
facts     0.71
fare      0.70
forge     0.70
wagon     0.69
Activations Density 0.028%