Explanations
references to various social and moral concepts
Negative Logits
touches     -0.15
ifa         -0.14
顺          -0.14
felt        -0.13
振          -0.13
pga         -0.13
feit        -0.13
Facing      -0.13
astes       -0.13
.datas      -0.13
Positive Logits
dictate     0.32
dict        0.31
dict        0.31
dictates    0.30
Dict        0.30
intervened  0.27
dictated    0.27
Dict        0.26
interven    0.26
cons        0.26
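The page does not say how these logit lists are computed. A common approach, assumed here rather than confirmed by the source, is to project the feature's decoder direction onto each vocabulary token's column of the model's unembedding matrix and rank tokens by the resulting score. The minimal sketch below uses plain Python lists; the function name `logit_effects` and the parameters `direction` and `unembed` are hypothetical.

```python
def logit_effects(direction, unembed, vocab, k=10):
    """Rank tokens by how strongly a feature direction pushes their logits.

    Assumption: the feature's effect on each token's logit is the dot product
    of its (hypothetical) decoder direction with that token's unembedding column.

    direction: feature decoder vector, length d_model.
    unembed:   one length-d_model column of the unembedding matrix per token.
    vocab:     token strings, aligned index-by-index with `unembed`.
    Returns (positive, negative) lists of (token, score) pairs.
    """
    # Dot product of the feature direction with every token's unembedding column.
    scores = [sum(d * u for d, u in zip(direction, col)) for col in unembed]
    # Sort ascending so the most negative scores come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return positive, negative
```

Because ranking is per vocabulary entry, two entries that render identically (for example, a token with and without a leading space) would appear as separate rows, which may explain the repeated dict and Dict rows in the list above.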
Activation Density: 0.212%
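Activation density is presumably the fraction of sampled token positions at which the feature is active. The minimal sketch below assumes "active" means an activation strictly above zero; both that threshold and the function name `activation_density` are illustrative assumptions, not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the feature fires.

    Assumption: a position counts as active when its activation is strictly
    greater than `threshold` (default 0.0).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    # Count positions whose activation exceeds the threshold.
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)
```

Formatting the result with `f"{density:.3%}"` yields a percentage string; for example, 212 active positions out of 100,000 formats as 0.212%.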