Challenge
web/scorescope
by BrownieInMotion 55 solves / 156 pointsI’m really struggling in this class. Care to give me a hand?
scorescope.mc.ax
Visiting the link reveals an automated grading system, with the ability to upload any Python file.
The assignment description reads:
This is a Python programming homework assignment. The template below contains a number of unimplemented functions; your task is to implement each one according to the docstring provided. No local tests are provided, but you can run the autograder to check your work. Unlimited submissions are permitted and the most recent score will be used to determine your grade.
And we’re given a template.py
(simplified here):
def add(a, b):
'''Return the sum of a and b.'''
raise NotImplementedError
def longest(words):
'''Return the longest word in a list of words.'''
raise NotImplementedError
def common(a, b):
'''Return the longest common subsequence of two strings.'''
raise NotImplementedError
def favorite():
'''Return your favorite number. Must be the same as my favorite number.'''
raise NotImplementedError
def factor(n):
'''Given an integer, find two integers whose product is n.'''
raise NotImplementedError
def preimage(hash):
'''Given a sha256 hash, find a preimage (bytes).'''
raise NotImplementedError
def magic():
'''Guess the random number I am thinking of.'''
raise NotImplementedError
Solution
Getting Our Bearings
Or in other words, where flag?
Looking through the template, it became clear that “properly” implementing the functions wasn’t feasible (guess a random number? break SHA-256’s preimage resistance?).
So, I started by assuming that passing all test cases was the first step to getting the flag.
The First Tests
For the first three functions, implementing them properly is totally feasible.
I’m lazy so I skipped implementing common(a, b)
, but a quick search for “find longest common subseqence” returns plenty of ready-to-copy snippets.
def add(a, b):
return a + b
def longest(words):
if not len(words): # unit tests expect longest([]) == None
return None
longest_word = None
highest_len = 0
for word in words:
if len(word) > highest_len:
longest_word, highest_len = word, len(word)
return longest_word
def common(a, b): raise NotImplementedError
def favorite(): raise NotImplementedError
def factor(n): raise NotImplementedError
def preimage(hash): raise NotImplementedError
def magic(): raise NotImplementedError
Scorescope seemed happy with my first submimssion:
From this we learn two things:
My ability to write Python hasn’t magically disappeared.- At a basic level, the autograder works as I expected.
- The autograder gives us the exact value of any exception thrown (vs. just telling us it failed).
These exception messages give us a way to receive output from our code, even though stdout and stderr aren’t visible to us.
Custom Classes
At this point, I already had some ideas about how I could bypass these checks.
One idea was to use custom classes to control the results of comparisons via Python’s magic methods.
If we assume the unit test for favorite
does something like this:
result = favorite()
assert isinstance(result, int)
assert result == 4 # chosen by fair dice roll
then we can write a subclass of int
which passes the ==
check:
class AlwaysEqualInt(int):
def __eq__(self, other):
return True
and return an instance (the value doesn’t matter):
def favorite():
return AlwaysEqualInt(0)
Because we’re inheriting from int
, any checks using isinstance
will pass as well.
Time for submission two:
def add(a, b):
return a + b
def longest(words):
if not len(words): # unit tests expect longest([]) == None
return None
longest_word = None
highest_len = 0
for word in words:
if len(word) > highest_len:
longest_word, highest_len = word, len(word)
return longest_word
class AlwaysEqualInt(int):
def __eq__(self, other):
return True
def favorite():
return AlwaysEqualInt(0)
def common(a, b): raise NotImplementedError
def factor(n): raise NotImplementedError
def preimage(hash): raise NotImplementedError
def magic(): raise NotImplementedError
Scorescope found no issue with my code and rewards my deceit with a score of 8/22.
(favorite
has just one test case)
Onto the next function then: common(a, b)
.
Just for fun, I tried using the same trick here (despite the mismatched types):
def common(a, b):
return AlwaysEqualInt(0)
And obviously, it faile— I’m sorry, what? It worked. All test cases for common
passed, 13/22 total.
Turns out, these unit tests don’t care about types at all. They only check for equality.
With that in mind, I tried the same trick on factor(n)
.
Based on template.py
, we’re supposed to return a tuple of integers, which I definitely noticed the first time and totally didn’t mess up.
def factor(n):
return (AlwaysEqualInt(0), AlwaysEqualInt(0))
And… scorescope finally reveals that it is, in fact, capable of using operators besides ==
:
AssertionError: 0 not greater than 1
Let’s just bump those numbers up and try again:
def factor(n):
return (AlwaysEqualInt(2), AlwaysEqualInt(2))
scorescope?
4? Where did that come fr— oh.
Multiplying my results and checking for equality to n
would be a simple way to implement these test cases.
New assumption: the test cases probably look like this:
a, b = factor(15)
assert a * b == 15
In order to bypass these checks, I had to go deeper.
We can override the result of the multiplication operator too, and return one of our AlwaysEqualInt
objects from there:
class FactorInt(AlwaysEqualInt):
def __mul__(self, other):
return AlwaysEqualInt(0)
def factor(n):
return FactorInt(2), FactorInt(2)
And submission six is a success: all test cases for factor
pass, now up to a total of 16/22.
Unfortunately, this trick starts to break down with the next test cases.
test_preimage_b
returns the following error, presumably because it’s doing more detailed checks which require bytes instead of an int:
TypeError: object supporting the buffer API required
Trying to Dig Deeper
While it’s fun to try to blindly outsmart the autograder, it would be very helpful to read its source code to figure out exactly what it checks for, and how.
That way, I can avoid wasting time and stop saying “assume” throughout this writeup.
Another motivation was the mysterious test_hidden
test case: the template doesn’t have a corresponding hidden
function, and it keeps erroring, but doesn’t show any output - not even an exception.
Attempt 1 - Just Read The File
First, I listed the loaded modules:
def magic():
import sys
raise Exception(list(sys.modules.keys()))
By raising an exception containing the data we’re interested in, it’ll be displayed in the test output:
Exception: ['sys', 'builtins', '_frozen_importlib', '_imp', '_thread', '_warnings', '_weakref', '_io', 'marshal', 'posix', '_frozen_importlib_external', 'time', 'zipimport', '_codecs', 'codecs', 'encodings.aliases', 'encodings', 'encodings.utf_8', '_signal', '_abc', 'abc', 'io', '__main__', '_stat', 'stat', '_collections_abc', 'genericpath', 'posixpath', 'os.path', 'os', '_sitebuiltins', 'pwd', '_distutils_hack', 'site', 'types', '_operator', 'operator', 'itertools', 'keyword', 'reprlib', '_collections', 'collections', '_functools', 'functools', 'enum', '_sre', 're._constants', 're._parser', 're._casefix', 're._compiler', 'copyreg', 're', '_json', 'json.scanner', 'json.decoder', 'json.encoder', 'json', 'collections.abc', 'token', 'tokenize', 'linecache', 'textwrap', 'contextlib', 'traceback', 'unittest.util', 'unittest.result', '_heapq', 'heapq', 'difflib', '_weakrefset', 'weakref', 'copy', '_ast', 'ast', '_opcode', 'opcode', 'dis', 'importlib._bootstrap', 'importlib._bootstrap_external', 'warnings', 'importlib', 'importlib.machinery', 'inspect', 'dataclasses', 'pprint', 'unittest.case', 'unittest.suite', 'fnmatch', 'unittest.loader', 'gettext', 'argparse', 'signal', 'unittest.signals', 'unittest.runner', 'unittest.main', 'unittest', 'importlib._abc', 'ntpath', 'errno', 'urllib', 'urllib.parse', 'pathlib', 'zlib', '_compression', '_bz2', 'bz2', '_lzma', 'lzma', 'shutil', 'math', '_bisect', 'bisect', '_random', '_sha512', 'random', 'tempfile', '_typing', 'typing.io', 'typing.re', 'typing', 'importlib.resources.abc', 'importlib.resources._adapters', 'importlib.resources._common', 'importlib.resources._legacy', 'importlib.resources', 'importlib.abc', 'importlib.util', 'cython_runtime', 'seccomp', 'util', 'test_1_add', 'test_2_longest', 'test_3_common', 'test_4_favorite', 'test_5_factor', '_hashlib', '_blake2', 'hashlib', 'test_6_preimage', 'test_7_magic', 'test_8_hidden', 'submission']
Here are the interesting modules:
Name | Notes |
---|---|
submission |
This is our uploaded code. |
test_1_add , …, test_8_hidden |
These contain the unit tests for their respective functions. |
unittest |
This indicates the unit tests are implemented with Python’s own unittest , instead of a 3rd-party framework. |
seccomp |
This is a 3rd-party library which can be used to restrict access to the system. |
Next, I tried reading the source code of test_8_hidden
:
def magic():
import sys
raise Exception(sys.modules["test_8_hidden"])
Exception: <module 'test_8_hidden' from '/app/tests/test_8_hidden.py'>
Perfect! Let’s just read that file:
def magic():
raise Exception(open("/app/tests/test_8_hidden.py").read())
Autograder Results
There was a problem grading your submission. Make sure your submission does not open files, import extra modules, run shell commands, or do anything too fancy.
Ah, crap. Turns out seccomp
isn’t in there for show, it’s being used to restrict filesystem access.
Whenever my script attempts to access the filesystem, Python dies and scorescope returns this message.
Attempt 2 - Clever Dynamic Runtime Introspection Hacks
From my previous experience committing code crimes metaprogramming in Python, I knew about Python’s inspect
module, which “provides several useful functions to […] retrieve the source code of a method”, among other things.
inspect.getsource(object)
seemed like a perfect way to dump the unit tests’ source.
Unfortunately, these methods still tend to access the filesystem internally.
After the competition, I learned that other players managed to dump compiled code from memory, but I found another method instead.
Attempt 3 - What’s Code Integrity?
These test cases are running in the same interpreter we are, with zero isolation.
We should be able to simply replace the test cases with our own to bypass the remaining tests.
After poking around sys.modules
some more, I determined the following layout for test_hidden
:
# test_8_hidden.py
from util import TestCase
class TestHidden(TestCase):
def test_hidden():
...
I tested my idea out with the following submission:
import sys
def fake_test(*args, **kwargs):
return True
sys.modules["test_8_hidden"].TestHidden.test_hidden = fake_test
Sure enough, the test case passed!
Note: this only worked so easily because the entire test suite runs once, in the same interpreter.
If each case or function were tested separately, our code would be imported + executed too late: the first test case would already have started, and thus we wouldn’t be able to just replace it.
Final Solve
The script I used to Get The Flag looked like this:
import sys
class AlwaysEqInt:
def __eq__(self, other):
return True
def add(a, b): return AlwaysEqInt(0)
def longest(words: list[str]) -> str | None:
longest_word = None
longest_len = 0
for word in words:
if len(word) > longest_len:
longest_word = word
longest_len = len(word)
return longest_word
def common(a, b): return AlwaysEqInt(0)
def favorite(): return AlwaysEqInt(0)
def factor(n):
class FactorResult(AlwaysEqInt):
def __mul__(self, other):
return AlwaysEqInt(10)
return FactorResult(2), FactorResult(2)
def fake_test(*args, **kwargs):
return True
sys.modules['test_6_preimage'].TestPreimage.test_preimage_a = fake_test
sys.modules['test_6_preimage'].TestPreimage.test_preimage_b = fake_test
sys.modules['test_7_magic'].TestMagic.test_magic_a = fake_test
sys.modules['test_7_magic'].TestMagic.test_magic_b = fake_test
sys.modules['test_7_magic'].TestMagic.test_magic_c = fake_test
sys.modules['test_8_hidden'].TestHidden.test_hidden = fake_test
However, we can do better.
Optimized script, for some definition of “optimized”:
import sys
def fake_test(*args, **kwargs): return
# enumerate TestCase subclasses
for cls in sys.modules["util"].TestCase.__subclasses__():
# iterate over class attributes
for name in cls.__dict__.keys():
# look for the test case functions
if name.startswith("test_"):
# replace test case with our own
setattr(cls, name, fake_test)
# pass initial test case
def add(a, b): return a + b
Or, if you prefer golf:
for c in __import__("util").TestCase.__subclasses__():
for n in dir(c):
if "st_" in n:setattr(c,n,list)
add=int.__add__
And finally, the result:
Closing Thoughts
I liked this challenge. I’d call it an easy pyjail, one that (in my opinion) is more fun because it gives the player freedom to explore and experiment.
It captured the spirit of my experience poking at automated grading tools in school.
(I don’t think anyone can argue this challenge isn’t realistic!)