Explanations
words related to support or endorsement
Negative Logits
Token       Logit
unbeliev    -0.67
olars       -0.65
itals       -0.65
ancies      -0.62
ouls        -0.60
budgets     -0.59
careers     -0.58
eni         -0.57
anders      -0.57
uku         -0.57
Positive Logits
Token       Logit
of          1.02
thereof     0.97
lier        0.89
OF          0.78
hesis       0.78
hetical     0.73
inhibitor   0.72
Of          0.71
Reviewer    0.69
OF          0.69
Activations Density 0.255%