"Oh, I got it!" a blog by Tim Rand

San Francisco
Data Scientist
Cell Biologist/Physician

Python Modules

Almost all python code starts with the ubiquitous import keyword, which pulls in external code.

import pandas

There are also statements like...

from matplotlib import pyplot as plt

It is worth learning how these external code modules are made so that you can start taking advantage of the module system when writing your own code. It helps you stay organized and makes your code easier to share with others.

So I poked around and made some discoveries and now feel like I understand what is going on. Here are my notes.

In a terminal you can open the python interpreter and import a module like pandas. (If you don't have pandas, run pip install pandas, or use any other module on your system; pip list shows what is installed.)

$ python
>>> import pandas
>>> pandas.__file__
'/Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/__init__.py'

That __file__ attribute is a time saver, as it returns the path to the pandas module. Without this path we might have to look through every directory in $PYTHONPATH to find the module. We can use this path to open the source code of the pandas library. The other interesting thing to note is that the file stored in the __file__ attribute is an __init__.py file. Let's investigate what that file is and how it relates to a module.
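Here is a small sketch of the same lookup done from code rather than by hand. It assumes a standard CPython install and uses the standard-library json package as a stand-in, so it runs even without pandas. sys.path is the list of places python actually searches, and importlib.util.find_spec reports where a module would be loaded from.

```python
import importlib.util
import sys

# Each entry in sys.path is a location python searches, in order.
for entry in sys.path:
    print(repr(entry))

# find_spec locates a module for us; json is only a stand-in for pandas here.
spec = importlib.util.find_spec("json")
print(spec.name)
print(spec.origin)  # on a typical install, the path to json/__init__.py
```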

$ sublime /Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/__init__.py

Looking at the contents of __init__.py, about 1/3rd is pydoc documentation (the signature of a good citizen). You can identify this section (located at the end) because it is wrapped in triple quotation marks: """pydoc info""". Another 1/3rd is warnings about deprecated features and possibly missing dependencies, and the last 1/3rd is a bunch of import statements (even while investigating the import statement we run into more import statements!)...

from pandas.core.api import *
from pandas.core.sparse.api import *
from pandas.tseries.api import *
from pandas.core.computation.api import *
from pandas.core.reshape.api import *

Two things to note here are the periods (pandas.core.api etc.) and the *. In bash, * is a glob, a wildcard that stands for all the possibilities available. It means much the same thing here: import every public name the module defines. Below you will see the directory and file structure of the pandas module (first level only). There are about two dozen files and sub-directories; note core and tseries, which we just saw above. The dot notation in a pandas import statement follows this directory structure.
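To check what the * actually pulls in, here is a minimal sketch. The module name star_demo and its contents are made up for this illustration. Without an __all__ list, a star import brings in every name that does not start with an underscore; if the module defines __all__, that list is used instead.

```python
import os
import sys
import tempfile

# Write a throwaway module (hypothetical name: star_demo) into a temp directory.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "star_demo.py"), "w") as fh:
    fh.write("public_name = 1\n_private_name = 2\n")

# Make the temp directory importable.
sys.path.insert(0, tmp_dir)

# Run the star import into a fresh namespace so we can inspect what arrived.
namespace = {}
exec("from star_demo import *", namespace)

print("public_name" in namespace)    # True
print("_private_name" in namespace)  # False: leading underscore is skipped
```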

$ tree -L 1 /Users/loaner/anaconda3/lib/python3.6/site-packages/pandas
├── __init__.py
├── __pycache__
├── _libs
├── _version.py
├── api
├── compat
├── computation
├── conftest.py
├── core
├── errors
├── formats
├── io
├── json.py
├── lib.py
├── parser.py
├── plotting
├── testing.py
├── tests
├── tools
├── tseries
├── tslib.py
├── types
└── util

Let's return to the list of import statements found in the __init__.py file of the pandas module.

# looking at two randomly selected import statements from above
from pandas.core.sparse.api import *
from pandas.core.computation.api import *

So for instance we should be able to find a sparse directory or file within the core directory. Let's check...

Indeed, there is a /Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/core/sparse/api.py and a /Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/core/computation/api.py. So the '.' is serving the purpose of a unix '/'. Note that pandas is the directory, core is a sub-directory, computation is a sub-sub-directory, and api represents api.py, a file. Then comes the asterisk, which must mean the contents of the api.py file, right? Using a period to reach a lower-level attribute is how python normally does things (for instance, accessing attributes of an object). Here, in a similar way, it shows the relationship between downstream items. So modules are composed of directories, files and internal code. All of those levels can be referenced in import statements, and the periods let you specify the relationships within that structure just as '/' does in unix.
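The dot-to-slash mapping can be checked from code too. This sketch assumes a standard CPython install and uses the standard-library email package instead of pandas, so it runs anywhere: the dotted name email.mime.text should resolve to a file whose path ends in email/mime/text.py.

```python
import importlib.util
import os

# Ask the import system where the dotted name lives on disk.
spec = importlib.util.find_spec("email.mime.text")
print(spec.origin)

# Dots in the module name correspond to path separators on disk.
expected_tail = os.path.join("email", "mime", "text.py")
print(spec.origin.endswith(expected_tail))
```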

So what we have discovered so far is that modules are just made up of a root directory (the module name) containing an __init__.py file. (Python's own documentation calls a directory-based module like this a package.) But is that __init__.py file critical or not? We should experiment.

To keep from getting confused, pay attention to where you launch the python interpreter from in this experiment. We are going to make a pretend module, but it won't be installed within the $PYTHONPATH, where python looks for modules. We can take advantage of the fact that the interactive interpreter always looks in the current working directory ($PWD) for modules. So even without installing the module in a proper place within the $PYTHONPATH, python will still find it so long as we are working in the directory where the test module lives.
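Here is a rough sketch of that search behavior. The module name path_demo is invented for this example. Python cannot find the module until the directory holding it appears in sys.path, which is the list that $PYTHONPATH entries and the working directory end up in.

```python
import importlib
import importlib.util
import os
import sys
import tempfile

# Create a throwaway module (hypothetical name: path_demo) in a temp directory.
tmp_dir = tempfile.mkdtemp()
with open(os.path.join(tmp_dir, "path_demo.py"), "w") as fh:
    fh.write("greeting = 'hello'\n")

# The directory is not on sys.path yet, so the module cannot be found.
before = importlib.util.find_spec("path_demo")
print(before)  # None

# Add the directory to the search path and try again.
sys.path.insert(0, tmp_dir)
importlib.invalidate_caches()

path_demo = importlib.import_module("path_demo")
print(path_demo.greeting)  # hello
```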

$ pwd
/Users/loaner/ds/playground
$ mkdir temp temp2
$ touch temp/__init__.py

So we have an empty temp2 directory and a temp/__init__.py in our working directory. Let's test them as modules.

$ python
>>> import temp
>>> temp.__file__
'/Users/loaner/ds/playground/temp/__init__.py'
>>> temp
<module 'temp' from '/Users/loaner/ds/playground/temp/__init__.py'>
>>> 
>>> import temp2
>>> temp2.__file__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'temp2' has no attribute '__file__'
>>> temp
<module 'temp' from '/Users/loaner/ds/playground/temp/__init__.py'>
>>> temp2
<module 'temp2' (namespace)>

This experiment has some interesting results.

  1. You can import an empty directory as a module and you will get a namespace with the same name as the directory, but it doesn't have a __file__ attribute, because of the missing __init__.py file.
  2. The __init__.py file within the module's root directory is enough for the __file__ attribute to be assigned its path, even if there is no information in the __init__.py file.
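The same experiment can be scripted. This is a sketch with made-up directory names (with_init and without_init) created in a temp directory. One caveat I am fairly sure of: on python 3.7 and later, the namespace module does have a __file__ attribute, but it is set to None, so the sketch uses getattr with a default to cover both behaviors.

```python
import importlib
import os
import sys
import tempfile

# Two sibling directories: one with an empty __init__.py, one without.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "with_init"))
os.makedirs(os.path.join(root, "without_init"))
open(os.path.join(root, "with_init", "__init__.py"), "w").close()

sys.path.insert(0, root)
importlib.invalidate_caches()

with_init = importlib.import_module("with_init")
without_init = importlib.import_module("without_init")

# Older pythons omit __file__ on namespace modules; newer ones set it to None.
print(getattr(with_init, "__file__", None))     # path ending in __init__.py
print(getattr(without_init, "__file__", None))  # None
print(list(without_init.__path__))              # the directory it was found in
```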

So the __init__.py file is meaningful, even if empty. It also offers a place to put extra code pertaining to the module as a whole, such as collapsing a namespace for convenience's sake. In fact, I am going to have the __init__.py file of our module load all of the attributes defined in tools.py. That way I won't have to type PlayTime.tools.a_df or tools.a_df; I will be able to access the variable a_df directly.

__init__.py

from .tools import *
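To convince myself that this one line really collapses the namespace, here is a minimal sketch. The package name collapse_demo is hypothetical and is built in a temp directory. After the relative star import runs inside __init__.py, names from tools.py are reachable directly on the package.

```python
import importlib
import os
import sys
import tempfile

# Build a tiny package: collapse_demo/__init__.py and collapse_demo/tools.py
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "collapse_demo")
os.makedirs(pkg_dir)

with open(os.path.join(pkg_dir, "tools.py"), "w") as fh:
    fh.write("a_list = ['orange', 'apple']\n")
with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
    fh.write("from .tools import *\n")

sys.path.insert(0, root)
importlib.invalidate_caches()
collapse_demo = importlib.import_module("collapse_demo")

print(collapse_demo.a_list)        # reachable directly on the package
print(collapse_demo.tools.a_list)  # the longer path still works
```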

Maybe we know enough to create a module! Let's try. The goal of this module is to provide a suite of pre-defined, basic python data types that can easily be imported into a python interpreter for experimenting with the objects' methods etc., without having to do the setup.

So we need a root directory called PlayTime, and within that an __init__.py file and another .py file containing the code. One of the objects I am interested in having available is a pandas.DataFrame, so I will also include a .csv file specifically to supply that data.

$ mkdir PlayTime
$ touch PlayTime/__init__.py PlayTime/tools.py
# found a typical .csv file on the web, it will serve as data
$ curl https://people.sc.fsu.edu/~jburkardt/data/csv/oscar_age_male.csv > PlayTime/oscar_male.csv
$ tree PlayTime/
PlayTime/             # <== Our module name root directory
├── __init__.py       # <== __init__.py helps set __file__ attr. of the module
├── oscar_male.csv    # <== only used because we need some data
└── tools.py          # <== .py files hold the code base

Let's add the code to the tools.py file to take care of populating the dataframe and storing the result in a variable to use in the python interpreter.

$ sublime PlayTime    # open the whole module tree in an editor

#tools.py

import pandas as pd
import os

try:
    # Build the path to the .csv relative to this file, so it won't
    # matter where on the machine this module is moved.
    fn = os.path.join(os.path.dirname(__file__), 'oscar_male.csv')
    a_df = pd.read_csv(fn)
except Exception as err:
    # a_df stays undefined if the load fails, so report why
    print('failed case from local file:', err)

a_str = "Although Python’s extensive standard library covers many programming needs, there often comes a time when you need to add some new functionality to your Python installation in the form of third-party modules. This might be necessary to support your own programming, or to support an application that you want to use and that happens to be written in Python."

a_list = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']

def a_function():
    print('I am a_function living in tools.py')

Note the os.path.join(os.path.dirname(__file__), 'oscar_male.csv') call. Alternative approaches to locating the .csv file have some downsides. For instance, hardcoding the full path will work great until we move the module somewhere else; then it will break. If you only provide the file name oscar_male.csv, it will work when python is executed in the module's directory but not from outside it, so it is not robust. Adding the path to $PYTHONPATH doesn't work either, and even if it did it would break if we moved the module. So we need to determine the full path through this dynamic lookup.
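The same dynamic lookup can be written with pathlib, which some people find easier to read. This is only a sketch: the helper name sibling_path is my own invention, and I point it at the standard-library json package just to have a real file to resolve against. Inside tools.py you would pass __file__ and 'oscar_male.csv' instead.

```python
import json
from pathlib import Path


def sibling_path(module_file, filename):
    """Return the path of `filename` in the same directory as `module_file`."""
    return Path(module_file).resolve().parent / filename


# Inside tools.py this would be: sibling_path(__file__, "oscar_male.csv")
# Here json.__file__ is only a stand-in so the example runs anywhere.
decoder = sibling_path(json.__file__, "decoder.py")
print(decoder)
print(decoder.name)  # decoder.py
```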

Okay, last step. We need to install this module. To do that we are going to use pip, the python package installer available on the command line.

$ tree PlayTime
PlayTime
    ├── README.txt
    ├── __init__.py
    ├── oscar_male.csv
    └── tools.py

Currently the module looks like this. We want to place it in an enclosing parent directory of the same name like this:

$ tree PlayTime
PlayTime
├── PlayTime
│   ├── __init__.py
│   ├── oscar_male.csv
│   └── tools.py
└── setup.py

You will see why this organization is useful in a second. cd into the parent PlayTime. From that location, ls should show the child PlayTime directory, which is our module. We also need to add a setup.py file there that helps pip know what is going on.

#setup.py

from setuptools import setup

setup(name='PlayTime',
      version='0.1.2',
      description='Simple data types for playing in python REPL, including DataFrames, list, etc.',
      url='not_on_web_yet',
      author='Your Name',
      author_email='username@gmail.com',
      license='MIT',
      packages=['PlayTime'],
      install_requires=[
          'pandas',
      ],
      zip_safe=False)

PlayTime  #<== from this directory ($PWD), run the pip command below
├── PlayTime
│   ├── __init__.py
│   ├── oscar_male.csv
│   └── tools.py
└── setup.py

$ pip install -e . --user

FROM: https://andrewhoos.com/blog/some-tricks-with-pip-install

"Installing Package symlinks* By adding the -e flag to pip a symlink to the source source is installed instead of the byte-code compiled source. It is not really a symlink, but it is a close enough analogy to understand what is happening."

pip creates some new files for the installation and reports on the results:

Obtaining file:///Users/loaner/ds/playground/PlayTime
Requirement already satisfied: pandas in /Users/loaner/anaconda3/lib/python3.6/site-packages (from PlayTime==0.1.2) (0.23.0)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/loaner/anaconda3/lib/python3.6/site-packages (from pandas->PlayTime==0.1.2) (2.7.3)
Requirement already satisfied: pytz>=2011k in /Users/loaner/anaconda3/lib/python3.6/site-packages (from pandas->PlayTime==0.1.2) (2018.4)
Requirement already satisfied: numpy>=1.9.0 in /Users/loaner/anaconda3/lib/python3.6/site-packages (from pandas->PlayTime==0.1.2) (1.14.3)
Requirement already satisfied: six>=1.5 in /Users/loaner/anaconda3/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->PlayTime==0.1.2) (1.11.0)
Installing collected packages: PlayTime
  Running setup.py develop for PlayTime
Successfully installed PlayTime

PlayTime
├── PlayTime
│   ├── __init__.py
│   ├── oscar_male.csv
│   └── tools.py
├── PlayTime.egg-info #<==note this egg-info directory is from pip
│   ├── PKG-INFO
│   ├── SOURCES.txt
│   ├── dependency_links.txt
│   ├── not-zip-safe
│   ├── requires.txt
│   └── top_level.txt
└── setup.py

pip reported success at installing. So let's try it out in a python interpreter.

$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from PlayTime import *
>>> a_df.head()
   Index   Year   Age               Name               Movie
0      1   1928    44      Emil Jannings   The Last Command
1      2   1929    41      Warner Baxter      In Old Arizona
2      3   1930    62      George Arliss            Disraeli
3      4   1931    53   Lionel Barrymore         A Free Soul
4      5   1932    47      Wallace Beery           The Champ
>>> a_list
['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']

Notice we called from PlayTime import * to collapse PlayTime, which ran the __init__.py file, which in turn called from .tools import * to collapse the tools namespace. Therefore, we can use the module's objects directly, without any preceding module names. This is not a good strategy for modules that will be used in large, complex code bases, but for this purpose it provides easy-to-use names in the python interpreter, which was our goal. One last thing...
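To see why star imports are risky in bigger code bases, here is a small sketch using two standard-library modules that both define sqrt. The second star import silently replaces the name brought in by the first. The exec calls into a dictionary are just a way to inspect the result; at the interpreter prompt you would simply type the two import lines.

```python
import cmath
import math

namespace = {}

# First star import: sqrt comes from math.
exec("from math import *", namespace)
print(namespace["sqrt"] is math.sqrt)   # True

# Second star import: sqrt is silently replaced by the cmath version.
exec("from cmath import *", namespace)
print(namespace["sqrt"] is cmath.sqrt)  # True
print(namespace["sqrt"](4))             # (2+0j)
```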

$ pip list | grep PlayTime
PlayTime                           0.1.2

We can verify which version is installed with pip, as above. It shows 0.1.2, the version listed in our setup.py file. If you improve the code, you will want to update this version number in setup.py before installing with pip. Pip will uninstall the prior version and install the new one.
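The version can also be read from inside python. This sketch assumes python 3.8 or newer, where importlib.metadata is part of the standard library (the 3.6 interpreter shown above does not have it). The helper name installed_version is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the version from setup.py if PlayTime is installed, otherwise None.
print(installed_version("PlayTime"))
```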