INDEX
Explanations
words related to challenging or being challenged
contexts related to challenging authority or established norms
Negative Logits
opter        -0.68
ãĥ¼ãĥĨãĤ£    -0.68
abet         -0.65
istg         -0.65
psons        -0.64
ppa          -0.64
··           -0.62
anuts        -0.61
]}           -0.61
around       -0.61
Positive Logits
assumptions      1.04
incumb           0.93
incumbent        0.92
precon           0.90
stereotypes      0.89
perceptions      0.87
orthodoxy        0.86
misconceptions   0.84
boundaries       0.78
assertions       0.78
Activations Density 0.066%