Explanations
words related to rules, regulations, and guidelines
concepts related to conformity and alignment with certain ideologies or perspectives
Negative Logits
pard      -0.70
aniel     -0.64
town      -0.63
mare      -0.62
istan     -0.62
patrick   -0.62
reluct    -0.61
antics    -0.60
ppa       -0.59
asking    -0.59
Positive Logits
neither        0.83
directly       0.75
measurable     0.74
specific       0.72
specifically   0.72
overlap        0.72
GMOs           0.70
antit          0.70
solely         0.69
nothing        0.69
Activations Density: 0.372%