Abstract Syntax Tree for Patching Code and Assessing Code Quality
Why should you care?
How do we patch hundreds of thousands of lines of source code easily and at scale? Read about how we used a simple yet powerful data structure, the Abstract Syntax Tree (AST), to create a system that, from a single central point, maps source code dependencies and in turn patches all of them.
A software system is usually built with assumptions about how its dependencies, such as the underlying language system, frameworks and libraries, are written. Changes in these dependencies may have a ripple effect on the software system itself. For example, the popular Python package pandas recently released its 1.0.0 version, which deprecated and changed several features that existed in the previous 0.25.x version. An organization may have many systems using the 0.25.x version of pandas. Hence, upgrading to 1.0.0 will require the developers of every system to go through the pandas change documentation and patch their code accordingly.
Since we developers love to automate tedious tasks, it is natural for us to think of writing a patch script that updates the source code of all the systems according to the changes in the new pandas version. Such a patch script could read the source code and do some kind of find+replace, but it would likely be unreliable and incomplete. For example, say the patch script needs to rename a function from get to create wherever it is called in the code base. A simple find+replace will end up replacing the text “get” even where it is not a function call. Likewise, find+replace cannot handle statements that spill over multiple lines. We need the patch script to parse the source code while understanding the language constructs. In this article, we propose the use of Abstract Syntax Trees (AST) to write such patch scripts. Later, we present how ASTs can be used to assess code quality.
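To make the find+replace problem concrete, here is a minimal sketch. The variable name budget and the argument "Q1" are made up for illustration; only the get to create rename comes from the text above.

```python
# A naive textual rename of get -> create.
source = 'budget = get("Q1")\n'

# str.replace has no notion of identifiers or calls: it rewrites
# every occurrence of the three letters "get", including the one
# inside the unrelated variable name "budget".
naive = source.replace("get", "create")

print(naive)  # budcreate = create("Q1")
```

The call was renamed as intended, but the variable name was corrupted along the way, which is exactly the kind of damage a syntax-aware patch avoids.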
Abstract Syntax Tree (AST)
Almost every language has a way to generate an AST from its code. We use Python to build several critical parts of our systems. Hence, this article uses Python for its examples, but the lessons apply to any other language.
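As a minimal sketch, Python's built-in ast module can parse a two-line program and dump the resulting tree. The two statements are the ones discussed next; the exact text printed by ast.dump varies between Python versions, so it is not reproduced here.

```python
import ast

# The tiny program whose tree we want to inspect.
source = "var = 1\nprint(var)\n"

# ast.parse turns source text into a tree rooted at a Module node.
tree = ast.parse(source)

# ast.dump renders the tree as a nested, readable string.
print(ast.dump(tree))
```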
Looking at the ast.dump output, we can see that the head object, which is of type Module, has an attribute body whose value is a list of 2 nodes: one representing var = 1 and the other representing print(var). The first node, representing var = 1, has a targets attribute (a list) holding the LHS var, and a value attribute representing the RHS 1. Let’s see if we can print the RHS.
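Printing the RHS, then modifying the tree and turning it back into code, might look like the following sketch. It assumes Python 3.9 or later, where ast.unparse is available; on older versions a third-party package would be needed to regenerate source.

```python
import ast

tree = ast.parse("var = 1\nprint(var)\n")

# The first statement is an Assign node.
assign = tree.body[0]
print(assign.targets[0].id)  # LHS name: var
print(assign.value.value)    # RHS constant: 1

# Modify the tree: replace the RHS with the constant 2.
assign.value = ast.Constant(value=2)

# New nodes carry no line/column info; fill it in before reuse.
ast.fix_missing_locations(tree)

# Re-create source code from the modified tree (Python 3.9+).
new_source = ast.unparse(tree)
print(new_source)
```

Note that ast.unparse regenerates code from the tree alone, so comments and the original formatting are not preserved.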
Now that we understand ASTs and how to generate them, inspect them, modify them and re-create code from them, let’s go back to the problem of writing patch scripts to modify the code of a system to use pandas 1.0.0 instead of pandas 0.25.x. We call these AST-based patch scripts “IntelliPatch”.
All the backward incompatibilities in pandas 1.0.0 are listed in its release notes. Let’s take the first backward incompatibility on the list and write an IntelliPatch for it.
Avoid using names from MultiIndex.levels
Code using pandas 0.25.x, for example: mi.levels[0].name = "new name"
Equivalent code using pandas 1.0.0: mi = mi.set_names("new name", level=0)
The IntelliPatch needs to do the following:
- Create AST of the given code and traverse it.
- Identify whether any node represents code of the form <var>.levels[<idx>].name = <val>.
- Replace the identified node with one that represents code of the form <var> = <var>.set_names(<val>, level=<idx>).
Below is the IntelliPatch script that does that.
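The three steps above can be sketched with ast.NodeTransformer as follows. This is an illustrative reconstruction, not necessarily the script originally published: the names MultiIndexNamePatch and intelli_patch are hypothetical, and the code assumes Python 3.9 or later (for ast.unparse and for the layout of Subscript.slice).

```python
import ast
import copy


class MultiIndexNamePatch(ast.NodeTransformer):
    """Rewrite <var>.levels[<idx>].name = <val> into
    <var> = <var>.set_names(<val>, level=<idx>)."""

    def visit_Assign(self, node):
        self.generic_visit(node)
        if len(node.targets) != 1:
            return node
        target = node.targets[0]
        # Match the shape <something>.levels[<idx>].name
        if not (
            isinstance(target, ast.Attribute)
            and target.attr == "name"
            and isinstance(target.value, ast.Subscript)
            and isinstance(target.value.value, ast.Attribute)
            and target.value.value.attr == "levels"
        ):
            return node
        var = target.value.value.value  # <var>
        idx = target.value.slice        # <idx> (Python 3.9+ layout)
        if not isinstance(var, (ast.Name, ast.Attribute)):
            return node  # not something we can assign back to
        # The new LHS is a copy of <var> in "store" context.
        new_target = copy.deepcopy(var)
        new_target.ctx = ast.Store()
        call = ast.Call(
            func=ast.Attribute(value=var, attr="set_names", ctx=ast.Load()),
            args=[node.value],
            keywords=[ast.keyword(arg="level", value=idx)],
        )
        new_node = ast.Assign(targets=[new_target], value=call)
        return ast.copy_location(new_node, node)


def intelli_patch(source):
    """Parse source, apply the patch, and return regenerated code."""
    tree = MultiIndexNamePatch().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


# A single-line statement is rewritten.
print(intelli_patch('mi.levels[0].name = "new name"\n'))
# A statement spilling over several lines is handled the same way.
print(intelli_patch('idx.levels[1].name = (\n    "region"\n)\n'))
# Unrelated code is left semantically unchanged.
print(intelli_patch("x = get(1)\n"))
```

Because the match is made on tree nodes rather than on text, line breaks inside a statement do not matter. One caveat of this sketch: ast.unparse drops comments and normalizes formatting (for instance, quote style), so a production tool may prefer to splice only the changed statements back into the original text.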
Usage Example 1:
Usage Example 2:
One can extend the patch script to take care of all backward incompatibilities in pandas 1.0.0, and then write an outer function that goes through every Python file of a system, reads its code, patches it and writes it back to disk.
It is important to note that a developer should review the changes made by the IntelliPatch before committing them. For example, if the code is hosted on git, the developer should run git diff and review its output.
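Such an outer function might look like the sketch below. The name patch_directory and its parameters are hypothetical; the patch itself is passed in as any function that maps source text to patched source text.

```python
from pathlib import Path
from typing import Callable, List


def patch_directory(root: str, patch_source: Callable[[str], str]) -> List[Path]:
    """Apply patch_source to every .py file under root.

    Returns the list of files whose contents actually changed,
    so that the changes can be reviewed afterwards.
    """
    changed = []
    for path in sorted(Path(root).rglob("*.py")):
        original = path.read_text(encoding="utf-8")
        patched = patch_source(original)
        # Only rewrite files that the patch really modified.
        if patched != original:
            path.write_text(patched, encoding="utf-8")
            changed.append(path)
    return changed
```

Writing a file only when its contents differ keeps the resulting diff limited to the files the patch actually touched.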
At Soroco, we have written 5 IntelliPatch scripts so far, which were run on 10 systems. Each script successfully parsed and patched about 150,000 lines of code across the 10 systems. In terms of productivity, this effort took one of our engineers three full days to complete. This engineer learnt about ASTs before implementing these solutions.
Of the five scripts, one was unique: a code scrubber rather than a traditional patch. The need arose when an external party sought to review the outline of the code, and we did not want to share its actual logic and specifics. Hence, we wrote a scrubber that removes the logic and other key elements of the code while retaining only the imports, class and function definitions, docstrings, type annotations and some very specific information required for the review. The AST thus proved to be a valuable tool for building a code scrubber as well.
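The scrubbing idea can be illustrated with a much simplified sketch. This is not the scrubber described above: the names Scrubber and scrub are hypothetical, only function bodies are scrubbed, and Python 3.9 or later is assumed for ast.unparse.

```python
import ast


class Scrubber(ast.NodeTransformer):
    """Replace every function body with its docstring (if any) and '...'."""

    def _scrub(self, node):
        doc = ast.get_docstring(node, clean=False)
        body = []
        if doc is not None:
            # Keep the docstring as the first statement.
            body.append(ast.Expr(value=ast.Constant(value=doc)))
        # Stand-in for the removed logic.
        body.append(ast.Expr(value=ast.Constant(value=Ellipsis)))
        node.body = body
        return node

    # Signatures, decorators and annotations live outside the body,
    # so they survive untouched.
    visit_FunctionDef = _scrub
    visit_AsyncFunctionDef = _scrub


def scrub(source):
    """Return source with all function logic removed."""
    tree = Scrubber().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


example = '''import os

class Account:
    def balance(self, rate: float) -> float:
        """Return the balance after interest."""
        secret = self._total * (1 + rate)
        return secret
'''
print(scrub(example))
```

Imports, the class definition, the method signature with its type annotations, and the docstring remain, while the statements that carried the logic are gone.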
Code Quality Assessment
Now that we understand how useful ASTs are for writing intelligent patch scripts, in this section we explain how they can be used to assess code quality.
Example 1: Non-self-explanatory variable names
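One possible check, sketched under a stated assumption: a name is treated as not self-explanatory when it is shorter than a minimum length. The function name short_names, the threshold of 3 characters and the allow-list are all hypothetical choices, not rules from the original article.

```python
import ast


def short_names(source, min_len=3, allowed=frozenset({"_"})):
    """Return (line, name) pairs for assigned names shorter than min_len."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        # A Name in Store context is a name being assigned to,
        # including loop variables.
        if (
            isinstance(node, ast.Name)
            and isinstance(node.ctx, ast.Store)
            and len(node.id) < min_len
            and node.id not in allowed
        ):
            findings.append((node.lineno, node.id))
    return sorted(findings)


sample = "x = 1\ntotal = x + 2\nfor i in range(3):\n    _ = i\n"
print(short_names(sample))  # [(1, 'x'), (3, 'i')]
```

Only assignments are inspected, so each questionable name is reported where it is introduced rather than at every use.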
Example 2: Unlogged except blocks
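A check of this kind could be sketched as follows. The heuristic is an assumption made for illustration: an except block counts as logged if it contains a method call named like one of the standard logging methods (debug, info, warning, error, exception, critical). The function name unlogged_excepts is hypothetical.

```python
import ast

LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}


def unlogged_excepts(source):
    """Return line numbers of except blocks that contain no logging call."""
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # Look for any call such as logger.error(...) inside the handler.
        logged = any(
            isinstance(sub, ast.Call)
            and isinstance(sub.func, ast.Attribute)
            and sub.func.attr in LOG_METHODS
            for stmt in node.body
            for sub in ast.walk(stmt)
        )
        if not logged:
            lines.append(node.lineno)
    return sorted(lines)


sample = '''try:
    risky()
except ValueError:
    pass
try:
    risky()
except KeyError as exc:
    logger.exception("failed: %s", exc)
'''
print(unlogged_excepts(sample))  # [3]
```

Matching on method names alone can produce false positives and negatives, so the findings are best treated as prompts for review.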
The usefulness of ASTs extends far beyond the discussion in this article. For example, the ASTs of the files in a given system can be used to create a call graph. A call graph created at run time may not cover all the code paths. A call graph built statically from ASTs considers every call written in the source and is therefore far more comprehensive, although calls that are resolved only at run time (for example, through getattr) can still be missed. The call graph can then be used to generate human-readable documentation of the system. We have built such a functionality at Soroco, which we call “LiveDoc”, but that is a topic for another day in another article 🙂
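To give a flavour of the static call graph idea, here is a deliberately small sketch. It is not LiveDoc: the function name call_graph is hypothetical, calls are recorded by name only, and no attempt is made to resolve which object a method belongs to.

```python
import ast


def call_graph(source):
    """Map each function name to the sorted names it calls."""
    graph = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        callees = set()
        for sub in ast.walk(func):
            if isinstance(sub, ast.Call):
                if isinstance(sub.func, ast.Name):
                    callees.add(sub.func.id)    # plain call: f()
                elif isinstance(sub.func, ast.Attribute):
                    callees.add(sub.func.attr)  # method call: obj.f()
        graph[func.name] = sorted(callees)
    return graph


sample = '''def load(path):
    return parse(read(path))

def read(path):
    return open(path).read()

def parse(text):
    return text.split()
'''
print(call_graph(sample))
```

Every call in the source is visited whether or not it would be executed in a particular run, which is what distinguishes this from a graph recorded at run time.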