INDEX
Explanations
phrases related to political discourse and actions, particularly in the context of governance and regulations
preceding "assessing" and similar words
when assessing
Negative Logits
.      -1.01
®.     -0.93
].     -0.93
。     -0.92
).     -0.90
".     -0.87
}.     -0.86
.\\    -0.82
.      -0.82
}$.    -0.82
Positive Logits
리는            0.85
")==            0.79
betweenstory    0.76
들은            0.75
이는            0.72
noqa            0.72
것은            0.71
"]=             0.70
지는            0.69
서는            0.68
Activations Density 2.279%