String operations are another common set of operations in Python data analysis and processing. Again, it is important to use the methods built into the Pandas framework instead of standard Python calls; the difference in performance can be orders of magnitude. Another great reference is the Python Data Science Handbook chapter on working with strings. As the author has written:
Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas str methods that mirror Python string methods:
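A few of these mirrored methods are shown in the minimal sketch below, which compares an element-by-element loop with the equivalent calls on the Pandas `.str` accessor. The sample names are made up for illustration, and the snippet is meant to show the shape of the API rather than to serve as a benchmark.

```python
import pandas as pd

# Hypothetical sample data; None stands in for a missing value.
names = pd.Series(["  alice SMITH", "Bob jones ", None, "carol  o'neil"])

# Plain-Python approach: call the str methods one element at a time
# and guard against missing values by hand.
looped = pd.Series(
    [n.strip().title() if isinstance(n, str) else n for n in names]
)

# Pandas approach: the .str accessor mirrors str.strip and str.title,
# applies them to the whole column, and skips missing values itself.
vectorized = names.str.strip().str.title()

# Boolean string methods are mirrored as well; na=False marks missing
# entries as False instead of propagating them.
starts_with_a = vectorized.str.startswith("A", na=False)

print(vectorized)
print(starts_with_a)
```

Both approaches produce the same cleaned strings, but the `.str` version stays a single chained expression and needs no explicit check for missing values.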
The libraries you should always check for fast implementations of algorithms or functions are Pandas, NumPy, scikit-learn, Spark MLlib, and SciPy. Soroco’s use of tensor-based libraries such as TensorFlow and PyTorch, including when and where we use them, is outside the scope of this blog post.
There are many things to consider when picking a predominant programming language. Throughout this blog post we have shared the various dimensions that matter when building large-scale products with Python, everything from development to performance. Building large systems with Python is very doable today. Though there have been challenges in the past, this blog post has shown ways to adopt optional parts of the language (e.g., PEP 484) to ensure better development. The language itself and the tooling around it continue to improve rapidly. Finally, though Soroco builds its systems predominantly in Python, portions of its product are built in Golang and C++ as well. Ultimately, do what is best for the product, but always keep development and maintenance in mind. Make it easy to develop, deploy, and maintain.
If you enjoyed reading this article and want to work on similar problems, apply here and come work with us!