INDEX
Explanations
contributes to harm normalization
Negative Logits
關鍵    0.44
incorporated    0.39
を切    0.39
യിലെ    0.37
spiega    0.36
一定要    0.36
截至    0.36
veloce    0.35
규칙    0.35
Curr    0.35
Positive Logits
perpet    1.03
perpetuate    0.99
harm    0.89
contributes    0.89
contribute    0.87
undermines    0.82
harm    0.81
potencialmente    0.80
demoral    0.80
dehuman    0.79
Activations Density 0.110%