Scientific computing with Python

by Conor Lawless (email: conor.lawless@ncl.ac.uk)

Workshop Overview

These notes are for a half-day workshop introducing Python to Newcastle University Medical School postgraduate research students. The workshop is next scheduled to run on Tuesday 26th April 2016, but if you can't make it, you are free to work through the notes yourself and get in touch with me if you have any questions.

Introduction & Motivation

Modern biological research involves a lot of data, usually stored on a computer. As scientists, we want to extract the maximum amount of information possible from our data. Writing and executing computer code is an extremely flexible and powerful way to automate this, making good use of the impressive computing power and fast network connections currently at our disposal.

Unfortunately, programming has not been part of many biological researchers' training. All research scientists should have some computational tools at their disposal to help with the capture, handling, processing and analysis of data. Relying solely on Microsoft Excel (as an example) for analysis is highly restrictive and expensive. Spreadsheet software like Excel is poorly suited to advanced analysis and does little to help with time-consuming tasks such as file formatting, image manipulation or text manipulation, which are often important parts of the research workflow. Learning to program can remove some fairly heavy constraints on the way we think about research.

Fortunately, programming is something you can become good at without formal training. Modern programming languages like Python make issuing instructions to computers easier than ever before. Alongside programming books, excellent online courses and Q&A forums make learning quite straightforward and enjoyable. Interestingly, in a 2016 survey of expert and professional software developers, the most common self-reported classification for "Education" was "Self-Taught"!

Many biological research scientists spend hours or days on repetitive, computer-based tasks which could be automated away if they had a few basic programming skills. Worse still, some tasks that could easily be automated are not even attempted because of the perceived amount of manual computer work involved. Getting to grips with a little bit of programming will help you to work more efficiently by giving you an extra set of skills which you can use to design experiments and analyse results.
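As a small illustration of the kind of repetitive task that can be automated, the sketch below converts tab-delimited text into comma-separated text, tidying stray spaces along the way. It assumes Python 3; the function name tab_to_csv and the sample data are made up for this example and are not part of the workshop materials.

```python
import csv
import io


def tab_to_csv(text):
    """Convert tab-delimited text to comma-separated text (hypothetical helper)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Strip stray whitespace from every cell before writing it back out
        writer.writerow([cell.strip() for cell in row])
    return out.getvalue()


# Made-up example data: a header row and two measurements
example = "sample\tod600\nwt \t0.52\nmutant\t0.31\n"
converted = tab_to_csv(example)
print(converted)
```

Doing this by hand for one file is quick; doing it for hundreds of files is where a short function like this starts to save real time.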

The Python programming language

Python is a friendly, powerful, flexible open-source programming language with many freely available add-ons which allow it to easily handle an incredibly diverse range of data types in a consistent manner. Code written in Python is probably the most readable of any popular programming language.
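To give a flavour of that readability, here is a minimal sketch (assuming Python 3, with made-up measurement values) that computes the mean of some numbers and picks out the values above it. Even without any programming experience, it should be possible to guess what each line does.

```python
# Made-up measurements, for illustration only
measurements = [0.52, 0.31, 0.47, 0.40]

# Average of all measurements
mean = sum(measurements) / len(measurements)

# Keep only the measurements that are larger than the average
above_mean = [m for m in measurements if m > mean]

print("Mean:", mean)
print("Above the mean:", above_mean)
```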

Python's functionality overlaps with that of many other tools, including Mathematica, MATLAB, R, C++ and Java, but Python has some advantages over all of these. Unlike Mathematica and MATLAB, Python is a true, general-purpose programming language which is capable of doing more than just mathematical analysis. R is similarly open-source and freely available, but, although powerful and useful, it was designed for statistical analysis rather than general programming. R syntax (or language structure) is not as clean, consistent, simple and easy to read as that of Python. Programs written in C++ and Java typically run faster and more efficiently, but it is much more difficult to write code and identify errors in these languages, which makes them relatively hard going, particularly for beginners. Python code is clean and simple, Python programs are quite powerful, and the language's readability makes it relatively difficult to make mistakes and relatively easy to spot them.

Both Python and R are distributed under open-source licenses, which is important for the sharing of scientific results. Open-source means that anyone, anywhere with an internet connection can access and install the tools necessary to use, test or develop published code. As computer code is an increasingly important part of biological research, universal, free access greatly increases the reproducibility of research. Reproducibility is a fundamental component of the scientific method. Universal access, enabled by open-source software, is also convenient for code developers (you and me), allowing us to reuse code on our personal machines, or on colleagues' machines, at a whim, without the need for expensive licenses or specific permissions.

R can be a good alternative to Python

R is a programming environment designed for statistical analysis. It shares several of Python's best features; in particular, it is an open-source programming tool. It handles spreadsheet-like numerical data easily, contains powerful tools for statistical analysis, and in many ways is preferable to Python for pure data analysis. However, Python syntax is cleaner, simpler and better structured, making it easier to learn. Python is also more flexible, adaptable and powerful (and therefore much more fun). For these two important reasons, learning Python is a much better way to start out learning to program than attempting to learn R.

Having said that, if you already have a little programming experience, or if, after this course, you have got to grips with some programming concepts and are interested in learning about another amazing tool, I thoroughly recommend the excellent R courses run by the School of Maths & Stats here at Newcastle: http://www.ncl.ac.uk/maths/rcourse/

Objectives

After completing this course on Scientific Computing with Python, you should be able to:

All of these steps will be motivated by practical (and hopefully useful) example code which is included in these notes. By the end of the workshop you will see that it is easy to write simple code and that writing code is a powerful and flexible way to make efficient use of computers.

Some further tools and resources are highlighted in the Other Resources section. In particular, this page includes links to more advanced Python tutorials for continued learning.




Last updated: April 2016