INDEX
Explanations
references to authority figures or governance-related topics
Negative Logits
ippers    -0.15
prayer    -0.14
uir       -0.13
,         -0.13
(         -0.13
oader     -0.13
artz      -0.13
acades    -0.13
and       -0.12
opes      -0.12
Positive Logits
]            0.25
:]           0.17
}            0.17
](           0.15
ãĤ±ãĥĥãĥĪ    0.14
)            0.14
:]           0.14
UTO          0.13
]**          0.13
&)           0.13
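The positive-logit token ãĤ±ãĥĥãĥĪ looks like mojibake, but it is most likely the printable stand-in that a byte-level BPE vocabulary uses for raw UTF-8 bytes. The sketch below rebuilds the GPT-2-style byte-to-character table and inverts it to recover the underlying text. It assumes the feature comes from a model with a GPT-2-style byte-level tokenizer, which this page does not state; the helper names `bytes_to_unicode` and `decode_token` are illustrative and are not taken from the dashboard.

```python
def bytes_to_unicode() -> dict[int, str]:
    """Rebuild the GPT-2-style byte -> printable-character table (assumed tokenizer)."""
    # Bytes that are already printable keep their own code point.
    keep = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    mapping = {b: chr(b) for b in keep}
    # Every remaining byte is shifted to 256, 257, ... in ascending byte order.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


# Inverse table: printable stand-in character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Turn a vocabulary string back into the text it encodes."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    # A single token can hold a partial UTF-8 sequence, so replace rather than raise.
    return raw.decode("utf-8", errors="replace")


print(decode_token("ãĤ±ãĥĥãĥĪ"))  # prints: ケット
```

Under that assumption the token decodes to the katakana string ケット, and ASCII-only tokens such as `]**` pass through unchanged. A character outside the 256-entry table raises `KeyError`, which would indicate that the string was not a byte-level vocabulary entry in the first place.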
Activations Density 0.026%