Explanations
refusal of harmful requests
Negative Logits
harbor        0.47
harbors       0.44
flavor        0.44
flavors       0.41
multicolored  0.40
Gravel        0.40
favorably     0.39
Sliver        0.39
Harbor        0.39
nessy         0.38
Positive Logits
nggak            0.45
TikTok           0.45
चीज़              0.45
TikTok           0.44
netizens         0.44
ख़                0.43
एक्सरसा           0.43
Pics             0.43
personalisation  0.43
آ                0.42
Activations Density: 0.009%