INDEX
NEGATIVE LOGITS
мери     0.49
야       0.49
लाज      0.46
bete     0.45
ෙ        0.44
יין      0.44
૧        0.44
뷰       0.43
风险     0.43
assurer  0.43
POSITIVE LOGITS
artificially   0.64
experiment     0.62
stimuli        0.61
changed        0.61
injected       0.59
experimental   0.59
manipulated    0.59
stimulus       0.57
increased      0.57
interventions  0.56
Activations Density: 0.256%