Explanations
concepts related to ethical decision-making and moral reasoning
Negative Logits
Token       Logit
akis        -0.15
oods        -0.15
lug         -0.15
Pot         -0.15
reap        -0.14
ieri        -0.14
ugas        -0.14
alama       -0.14
stdClass    -0.13
McMahon     -0.13
Positive Logits
Token       Logit
course      0.50
Course      0.41
course      0.40
Course      0.39
-course     0.36
route       0.34
courses     0.34
choice      0.32
_course     0.32
move        0.31
Activations Density: 0.102%