Explanations
negative actions and outcomes
Negative Logits
Token      Logit
牴         0.45
unfair     0.43
справед    0.42
बचाव       0.42
lerimiz    0.41
rosion     0.41
injustice  0.41
vered      0.40
Opp        0.39
વિશે       0.39
Positive Logits
Token       Logit
idiot       0.51
idiots      0.50
crude       0.50
broadest    0.46
Idiot       0.45
pia         0.42
fringe      0.42
theatrical  0.41
stupid      0.41
rant        0.41
Activations Density 0.106%
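
The page does not define the density figure. A common reading, assumed here, is that it is the fraction of sampled token positions on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires. The source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 1 of 1000 token positions.
toy = [0.0] * 999 + [2.5]
print(f"{activation_density(toy):.3%}")  # prints 0.100%
```

Under this reading, a density of 0.106% would mean the feature fires on roughly one token in every thousand sampled.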