Explanations
phrases related to authority figures making public statements
references to political positions or statements made publicly
Negative Logits
unpop            -0.70
asionally        -0.57
predec           -0.56
choes            -0.54
eteenth          -0.53
ommod            -0.53
Peb              -0.52
heterogeneity    -0.50
longstanding     -0.49
zens             -0.49
Positive Logits
,,,,             0.94
to               0.87
"""              0.74
unto             0.74
towards          0.71
[/               0.69
[/               0.69
thats            0.69
!!!!             0.67
""               0.67
Activations Density: 0.780%