INDEX
Explanations
instances of violence and authority interactions in narratives
Negative Logits
411       -0.16
ailles    -0.15
emos      -0.15
umi       -0.15
ü         -0.14
avin      -0.14
izik      -0.14
ami       -0.14
bo        -0.14
testing   -0.13
Positive Logits
zo          0.16
claim       0.16
awareness   0.15
Try         0.15
à¸ŀย         0.15
cken        0.15
getaway     0.15
try         0.14
backups     0.14
ult         0.14
Activations Density 0.015%