Explanations
phrases indicating judgment or value attribution
instances of the word "that."
Negative Logits
Pets         -0.78
Cheong       -0.74
hips         -0.72
Directions   -0.68
ãĥ¥          -0.68
Gallery      -0.65
Cong         -0.64
Cards        -0.64
Planning     -0.63
raq          -0.62
Positive Logits
satisfies    0.82
violates     0.82
consumes     0.81
preceded     0.79
resembles    0.74
ĨĴ           0.74
produces     0.73
¥µ           0.72
cedes        0.72
involves     0.71
Activations Density: 0.240%