Challenge

web/scorescope by BrownieInMotion 55 solves / 156 points

I’m really struggling in this class. Care to give me a hand?

scorescope.mc.ax

Visiting the link reveals an automated grading system, with the ability to upload any Python file.

The assignment description reads:

This is a Python programming homework assignment. The template below contains a number of unimplemented functions; your task is to implement each one according to the docstring provided. No local tests are provided, but you can run the autograder to check your work. Unlimited submissions are permitted and the most recent score will be used to determine your grade.

And we’re given a template.py (simplified here):

def add(a, b):
	'''Return the sum of a and b.'''
	raise NotImplementedError

def longest(words):
	'''Return the longest word in a list of words.'''
	raise NotImplementedError

def common(a, b):
	'''Return the longest common subsequence of two strings.'''
	raise NotImplementedError
    
def favorite():
	'''Return your favorite number. Must be the same as my favorite number.'''
	raise NotImplementedError

def factor(n):
	'''Given an integer, find two integers whose product is n.'''
	raise NotImplementedError

def preimage(hash):
	'''Given a sha256 hash, find a preimage (bytes).'''
	raise NotImplementedError

def magic():
	'''Guess the random number I am thinking of.'''
	raise NotImplementedError

Solution

Getting Our Bearings

Or in other words, where flag?

Looking through the template, it became clear that “properly” implementing the functions wasn’t feasible (guess a random number? break SHA-256’s preimage resistance?).

So, I started by assuming that passing all test cases was the first step to getting the flag.

The First Tests

For the first three functions, implementing them properly is totally feasible.
I’m lazy so I skipped implementing common(a, b), but a quick search for “find longest common subseqence” returns plenty of ready-to-copy snippets.

def add(a, b):
	return a + b

def longest(words):
	if not len(words):  # unit tests expect longest([]) == None
		return None
	longest_word = None
	highest_len = 0
	for word in words:
		if len(word) > highest_len:
			longest_word, highest_len = word, len(word)
	return longest_word

def common(a, b): raise NotImplementedError
def favorite(): raise NotImplementedError
def factor(n): raise NotImplementedError
def preimage(hash): raise NotImplementedError
def magic(): raise NotImplementedError

Scorescope seemed happy with my first submimssion:

test results page with 7 of 22 cases passed

From this we learn two things:

  • My ability to write Python hasn’t magically disappeared.
  • At a basic level, the autograder works as I expected.
  • The autograder gives us the exact value of any exception thrown (vs. just telling us it failed).

These exception messages give us a way to receive output from our code, even though stdout and stderr aren’t visible to us.

Custom Classes

At this point, I already had some ideas about how I could bypass these checks.
One idea was to use custom classes to control the results of comparisons via Python’s magic methods.

If we assume the unit test for favorite does something like this:

	result = favorite()
	assert isinstance(result, int)
	assert result == 4  # chosen by fair dice roll

then we can write a subclass of int which passes the == check:

class AlwaysEqualInt(int):
	def __eq__(self, other):
		return True

and return an instance (the value doesn’t matter):

def favorite():
	return AlwaysEqualInt(0)

Because we’re inheriting from int, any checks using isinstance will pass as well.

Time for submission two:

def add(a, b):
	return a + b

def longest(words):
	if not len(words):  # unit tests expect longest([]) == None
		return None
	longest_word = None
	highest_len = 0
	for word in words:
		if len(word) > highest_len:
			longest_word, highest_len = word, len(word)
	return longest_word

class AlwaysEqualInt(int):
	def __eq__(self, other):
		return True

def favorite():
	return AlwaysEqualInt(0)

def common(a, b): raise NotImplementedError
def factor(n): raise NotImplementedError
def preimage(hash): raise NotImplementedError
def magic(): raise NotImplementedError

Scorescope found no issue with my code and rewards my deceit with a score of 8/22.
(favorite has just one test case)

Onto the next function then: common(a, b).

Just for fun, I tried using the same trick here (despite the mismatched types):

def common(a, b):
	return AlwaysEqualInt(0)

And obviously, it faile— I’m sorry, what? It worked. All test cases for common passed, 13/22 total.

partial results page with common test cases, all successful

Turns out, these unit tests don’t care about types at all. They only check for equality.

With that in mind, I tried the same trick on factor(n).
Based on template.py, we’re supposed to return a tuple of integers, which I definitely noticed the first time and totally didn’t mess up.

def factor(n):
	return (AlwaysEqualInt(0), AlwaysEqualInt(0))

And… scorescope finally reveals that it is, in fact, capable of using operators besides ==:

AssertionError: 0 not greater than 1

Let’s just bump those numbers up and try again:

def factor(n):
	return (AlwaysEqualInt(2), AlwaysEqualInt(2))

scorescope?

partial results with factor test cases, all failed. output shows AssertionErrors of the form 4 != number.

4? Where did that come fr— oh.

Multiplying my results and checking for equality to n would be a simple way to implement these test cases.

New assumption: the test cases probably look like this:

	a, b = factor(15)
	assert a * b == 15

In order to bypass these checks, I had to go deeper.
We can override the result of the multiplication operator too, and return one of our AlwaysEqualInt objects from there:

class FactorInt(AlwaysEqualInt):
	def __mul__(self, other):
		return AlwaysEqualInt(0)

def factor(n):
	return FactorInt(2), FactorInt(2)

And submission six is a success: all test cases for factor pass, now up to a total of 16/22.

Unfortunately, this trick starts to break down with the next test cases.
test_preimage_b returns the following error, presumably because it’s doing more detailed checks which require bytes instead of an int:

TypeError: object supporting the buffer API required

Trying to Dig Deeper

While it’s fun to try to blindly outsmart the autograder, it would be very helpful to read its source code to figure out exactly what it checks for, and how.
That way, I can avoid wasting time and stop saying “assume” throughout this writeup.

Another motivation was the mysterious test_hidden test case: the template doesn’t have a corresponding hidden function, and it keeps erroring, but doesn’t show any output - not even an exception.

Attempt 1 - Just Read The File

First, I listed the loaded modules:

def magic():
	import sys
	raise Exception(list(sys.modules.keys()))

By raising an exception containing the data we’re interested in, it’ll be displayed in the test output:

Exception: ['sys', 'builtins', '_frozen_importlib', '_imp', '_thread', '_warnings', '_weakref', '_io', 'marshal', 'posix', '_frozen_importlib_external', 'time', 'zipimport', '_codecs', 'codecs', 'encodings.aliases', 'encodings', 'encodings.utf_8', '_signal', '_abc', 'abc', 'io', '__main__', '_stat', 'stat', '_collections_abc', 'genericpath', 'posixpath', 'os.path', 'os', '_sitebuiltins', 'pwd', '_distutils_hack', 'site', 'types', '_operator', 'operator', 'itertools', 'keyword', 'reprlib', '_collections', 'collections', '_functools', 'functools', 'enum', '_sre', 're._constants', 're._parser', 're._casefix', 're._compiler', 'copyreg', 're', '_json', 'json.scanner', 'json.decoder', 'json.encoder', 'json', 'collections.abc', 'token', 'tokenize', 'linecache', 'textwrap', 'contextlib', 'traceback', 'unittest.util', 'unittest.result', '_heapq', 'heapq', 'difflib', '_weakrefset', 'weakref', 'copy', '_ast', 'ast', '_opcode', 'opcode', 'dis', 'importlib._bootstrap', 'importlib._bootstrap_external', 'warnings', 'importlib', 'importlib.machinery', 'inspect', 'dataclasses', 'pprint', 'unittest.case', 'unittest.suite', 'fnmatch', 'unittest.loader', 'gettext', 'argparse', 'signal', 'unittest.signals', 'unittest.runner', 'unittest.main', 'unittest', 'importlib._abc', 'ntpath', 'errno', 'urllib', 'urllib.parse', 'pathlib', 'zlib', '_compression', '_bz2', 'bz2', '_lzma', 'lzma', 'shutil', 'math', '_bisect', 'bisect', '_random', '_sha512', 'random', 'tempfile', '_typing', 'typing.io', 'typing.re', 'typing', 'importlib.resources.abc', 'importlib.resources._adapters', 'importlib.resources._common', 'importlib.resources._legacy', 'importlib.resources', 'importlib.abc', 'importlib.util', 'cython_runtime', 'seccomp', 'util', 'test_1_add', 'test_2_longest', 'test_3_common', 'test_4_favorite', 'test_5_factor', '_hashlib', '_blake2', 'hashlib', 'test_6_preimage', 'test_7_magic', 'test_8_hidden', 'submission']

Here are the interesting modules:

Name Notes
submission This is our uploaded code.
test_1_add, …, test_8_hidden These contain the unit tests for their respective functions.
unittest This indicates the unit tests are implemented with Python’s own unittest, instead of a 3rd-party framework.
seccomp This is a 3rd-party library which can be used to restrict access to the system.

Next, I tried reading the source code of test_8_hidden:

def magic():
	import sys
	raise Exception(sys.modules["test_8_hidden"])
Exception: <module 'test_8_hidden' from '/app/tests/test_8_hidden.py'>

Perfect! Let’s just read that file:

def magic():
	raise Exception(open("/app/tests/test_8_hidden.py").read())

Autograder Results

There was a problem grading your submission. Make sure your submission does not open files, import extra modules, run shell commands, or do anything too fancy.

Ah, crap. Turns out seccomp isn’t in there for show, it’s being used to restrict filesystem access.
Whenever my script attempts to access the filesystem, Python dies and scorescope returns this message.

Attempt 2 - Clever Dynamic Runtime Introspection Hacks

From my previous experience committing code crimes metaprogramming in Python, I knew about Python’s inspect module, which “provides several useful functions to […] retrieve the source code of a method”, among other things.

inspect.getsource(object) seemed like a perfect way to dump the unit tests’ source.

Unfortunately, these methods still tend to access the filesystem internally.
After the competition, I learned that other players managed to dump compiled code from memory, but I found another method instead.

Attempt 3 - What’s Code Integrity?

These test cases are running in the same interpreter we are, with zero isolation.
We should be able to simply replace the test cases with our own to bypass the remaining tests.

After poking around sys.modules some more, I determined the following layout for test_hidden:

# test_8_hidden.py
from util import TestCase

class TestHidden(TestCase):
	def test_hidden():
		...

I tested my idea out with the following submission:

import sys

def fake_test(*args, **kwargs):
	return True

sys.modules["test_8_hidden"].TestHidden.test_hidden = fake_test

Sure enough, the test case passed!

Note: this only worked so easily because the entire test suite runs once, in the same interpreter.
If each case or function were tested separately, our code would be imported + executed too late: the first test case would already have started, and thus we wouldn’t be able to just replace it.

Final Solve

The script I used to Get The Flag looked like this:

import sys

class AlwaysEqInt:
	def __eq__(self, other):
		return True

def add(a, b): return AlwaysEqInt(0)

def longest(words: list[str]) -> str | None:
	longest_word = None
	longest_len = 0
	for word in words:
		if len(word) > longest_len:
			longest_word = word
			longest_len = len(word)
	return longest_word

def common(a, b): return AlwaysEqInt(0)

def favorite(): return AlwaysEqInt(0)

def factor(n):
	class FactorResult(AlwaysEqInt):
		def __mul__(self, other):
			return AlwaysEqInt(10)

	return FactorResult(2), FactorResult(2)


def fake_test(*args, **kwargs):
	return True

sys.modules['test_6_preimage'].TestPreimage.test_preimage_a = fake_test
sys.modules['test_6_preimage'].TestPreimage.test_preimage_b = fake_test
sys.modules['test_7_magic'].TestMagic.test_magic_a = fake_test
sys.modules['test_7_magic'].TestMagic.test_magic_b = fake_test
sys.modules['test_7_magic'].TestMagic.test_magic_c = fake_test
sys.modules['test_8_hidden'].TestHidden.test_hidden = fake_test

However, we can do better.

Optimized script, for some definition of “optimized”:

import sys

def fake_test(*args, **kwargs): return

# enumerate TestCase subclasses
for cls in sys.modules["util"].TestCase.__subclasses__():
	# iterate over class attributes
	for name in cls.__dict__.keys():
		# look for the test case functions
		if name.startswith("test_"):
			# replace test case with our own
			setattr(cls, name, fake_test)

# pass initial test case
def add(a, b): return a + b

Or, if you prefer golf:

for c in __import__("util").TestCase.__subclasses__():
	for n in dir(c):
		if "st_" in n:setattr(c,n,list)
add=int.__add__

And finally, the result:

Autograder Results: &ldquo;All test cases passed! Flag: dice{still_more_secure_than_gradescope}&rdquo;

Closing Thoughts

I liked this challenge. I’d call it an easy pyjail, one that (in my opinion) is more fun because it gives the player freedom to explore and experiment.

It captured the spirit of my experience poking at automated grading tools in school.
(I don’t think anyone can argue this challenge isn’t realistic!)