Explanations
phrases indicating assistance or contribution
instances of the word "helped."
Negative Logits
Token       Logit
Policy      -0.64
gran        -0.61
parts       -0.61
separation  -0.60
uns         -0.60
owl         -0.60
clusions    -0.59
War         -0.59
contrace    -0.58
itar        -0.58
Positive Logits
Token       Logit
helped      0.83
ĸļ          0.78
propel      0.74
helping     0.73
waukee      0.72
Assist      0.71
usher       0.71
ridor       0.70
buoy        0.68
urated      0.67
Activations Density: 0.013%