Explanations
references to struggles with power dynamics and societal pressures
Negative Logits
Token       Logit
532         -0.17
oze         -0.17
utting      -0.16
cracks      -0.14
MOTE        -0.14
handshake   -0.14
alette      -0.14
indr        -0.14
cling       -0.13
clinging    -0.13
Positive Logits
Token       Logit
stuck       0.42
trapped     0.38
caught      0.35
forced      0.31
caught      0.30
faced       0.28
forced      0.25
sadd        0.24
thrust      0.23
locked      0.23
Activations Density: 0.530%