INDEX
Explanations
phrases or commands prompting the user to input information or perform an action
commands or prompts for user input
Negative Logits
coun          -0.71
laundering    -0.67
acca          -0.67
jud           -0.65
ector         -0.62
overr         -0.61
judgement     -0.60
fairness      -0.58
fml           -0.58
conviction    -0.58
Positive Logits
prise         1.29
prises        1.26
tain          1.07
tainment      1.03
taining       1.03
Enter         1.03
prising       0.99
Enter         0.97
tis           0.87
igible        0.82
Activations Density: 0.006%