Explanations
You followed by role/capability
Negative Logits
Patch       0.71
Patch       0.68
Indigo      0.65
patched     0.64
ाये         0.62
patching    0.60
IP          0.59
Seven       0.59
laugh       0.59
seventy     0.59
Positive Logits
baran       0.69
ungee       0.68
rit         0.68
san         0.65
Tn          0.63
میٹر        0.61
OnInit      0.61
rita        0.60
ρίς         0.60
âns         0.60
Activations Density: 0.175%