INDEX
Explanations
phrases related to actions or beliefs concerning society and politics
concepts related to power dynamics and responsibility
Negative Logits
Token     Logit
agre      -0.61
".        -0.56
'.        -0.55
+.        -0.55
!.        -0.53
!".       -0.51
ende      -0.51
zik       -0.51
cms       -0.51
.).       -0.51
Positive Logits
Token     Logit
pires     0.72
pired     0.51
nutshell  0.45
depends   0.45
)]        0.43
resides   0.42
vanished  0.41
hangs     0.41
resided   0.41
Middle    0.41
Activations Density: 2.645%