INDEX
Explanations
discussions about societal values and personal beliefs surrounding power dynamics and autonomy
Negative Logits
--        -1.45
---       -1.27
‘’        -1.26
......”   -1.18
----      -1.17
......    -1.12
-----     -1.08
.....     -1.04
―         -1.00
------    -0.98
Positive Logits
–         2.63
)–        1.66
.–        1.49
,–        1.30
–)        1.21
・        1.18
–¿        1.06
−         1.05
_         1.04
––        1.02
Activations Density 1.041%