INDEX
Explanations
elements related to authority figures and their interactions
Negative Logits
“       -0.47
“[      -0.40
(“      -0.39
(       -0.34
“       -0.32
”       -0.31
“…      -0.31
”       -0.29
„       -0.27
”.↵     -0.27
Positive Logits
-"      0.31
..."↵   0.29
—"      0.29
..."    0.27
..."↵↵  0.26
…"↵↵    0.26
-",     0.25
-'      0.25
…"      0.24
your    0.23
Activations Density 1.775%