Explanations
phrases that refer to specific policies or principles
recurring phrases that indicate policy-related concepts
Negative Logits
Token      Logit
ILCS       -0.73
Extras     -0.69
*/(        -0.68
handle     -0.68
ゼウス     -0.66
Effects    -0.66
acious     -0.64
worthy     -0.63
bucks      -0.63
PI         -0.62
Positive Logits
Token         Logit
ours          0.84
theirs        0.75
impunity      0.75
neutrality    0.72
attrition     0.71
promoting     0.71
abstinence    0.69
interstitial  0.68
friendship    0.68
nationalism   0.68
Activations Density 0.176%