Explanations
elements related to deception or betrayal in interpersonal relationships
Negative Logits
Token   Logit
”),     -0.25
"),     -0.25
"},     -0.25
"],     -0.24
/>,     -0.22
'),     -0.22
'},     -0.20
_),     -0.20
'],     -0.20
()},    -0.20
Positive Logits
Token   Logit
.)↵     0.29
.)↵↵    0.28
.)      0.25
.)↵↵↵↵  0.23
?)↵     0.22
,)↵     0.21
/)↵     0.20
!)↵↵    0.20
?)↵↵    0.20
!)↵     0.20
Activations Density 0.111%