Explanations
be a safe and helpful AI assistant
Negative Logits
Token         Value
word          0.34
mindful       0.34
commonly      0.34
wooded        0.33
reasonable    0.33
uminescent    0.33
unsaturated   0.32
featur        0.32
coniferous    0.32
urgent        0.32
Positive Logits
Token          Value
একজন           0.57
一名            0.41
finishes       0.38
seorang        0.38
Reports        0.36
Rates          0.36
molestias      0.36
an             0.36
Compliance     0.36
déplacements   0.36
Activations Density: 0.011%
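
The page does not define how the density figure is computed. A minimal sketch, assuming it means the fraction of sampled tokens on which the feature's activation is above zero, could look like the following. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of sampled tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if len(activations) == 0:
        raise ValueError("need at least one sampled activation")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Synthetic data, not taken from the page: 11 firing tokens out of 100,000.
sample = [0.0] * 99_989 + [1.0] * 11
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.011%
```

Under this reading, a density of 0.011% would correspond to the feature firing on roughly 11 of every 100,000 sampled tokens.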