Explanations
phrases indicating skepticism or criticism towards institutional practices and beliefs
Negative Logits
(        -0.98
(        -0.67
الحره    -0.67
.(       -0.63
</h4>    -0.61
.*")]    -0.57
.        -0.57
(        -0.54
。(      -0.54
Hentet   -0.52
Positive Logits
?),      1.65
?).      1.63
!),      1.60
!).      1.56
),”      1.44
).</     1.41
!)       1.39
?)       1.38
)”.      1.38
)".      1.33
Activations Density 1.044%