"Oh, I got it!"
a blog by Tim Rand, San Francisco. Data Scientist, Cell Biologist/Physician
Almost all python code starts with the ubiquitous python import keyword, used for pulling in external code.
import pandas
There are also statements like...
from matplotlib import pyplot as plt
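Both forms do the same basic job and differ only in which name ends up in your namespace. Here is a minimal sketch, using the standard library json module in place of pandas and matplotlib so that it runs without installing anything (the alias dec is just an illustrative name):

```python
import sys

import json                       # binds the name "json"
from json import decoder as dec   # binds only the submodule, under the alias "dec"

# Both names refer to objects held in the same module cache (sys.modules).
print(dec is json.decoder)
print("json" in sys.modules, "json.decoder" in sys.modules)
```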
It is worth learning how these external code modules are made and starting to take advantage of the module system when writing your own code. It helps you stay organized and makes your code easier to share with others.
So I poked around and made some discoveries and now feel like I understand what is going on. Here are my notes.
In a terminal you can type python to open the python interpreter and import a module like pandas (if you do not have pandas, run pip install pandas. You can use any module on your system; run pip list to see what is installed).
$ python
>>> import pandas
>>> pandas.__file__
'/Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/__init__.py'
That __file__ attribute is a time saver, as it returns the path to the pandas module. Without this path we might have to look through every directory on python's module search path (sys.path, which includes anything listed in $PYTHONPATH) to find the module. We can use this path to open the source code of the pandas library. The other interesting thing to note is that the file stored in the __file__ attribute is an __init__.py file. Let's investigate what that file is and how it relates to a module.
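As an aside, the same lookup can be done from a script without importing the module first. This is a small sketch that assumes the standard library json module as a stand-in for pandas; locate is a hypothetical helper name of my own:

```python
import importlib.util
import sys

def locate(module_name):
    """Return the file python would load for module_name, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin

path = locate("json")
print(path)
print(len(sys.path), "entries on sys.path are searched for imports")
```

Like pandas, json is a directory with an __init__.py inside, so the path printed should end in __init__.py.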
$ sublime /Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/__init__.py
Looking at the contents of __init__.py, it seems to be about 1/3rd pydoc documentation (the signature of a good citizen). You can identify this section (located at the end) because it is wrapped in three quotation marks: """pydoc info""". About 1/3rd is warnings about deprecated features and possibly missing dependencies, and about 1/3rd is a bunch of import statements (even while investigating import statements we run into more import statements!)...
from pandas.core.api import *
from pandas.core.sparse.api import *
from pandas.tseries.api import *
from pandas.core.computation.api import *
from pandas.core.reshape.api import *
One thing to note here are the periods (pandas.core.api etc.) and the *. In bash an * is a glob, a wild-card that represents all the possibilities available. It means roughly the same thing here: import every public name the target module defines. Below you will see the directory and file structure for the pandas module (only the 1st level). There are about two dozen entries, a mix of files and sub-directories; note core and tseries, which we just saw above. The dot notation in a pandas import statement appears to follow this directory structure.
$ tree -L 1 /Users/loaner/anaconda3/lib/python3.6/site-packages/pandas
├── __init__.py
├── __pycache__
├── _libs
├── _version.py
├── api
├── compat
├── computation
├── conftest.py
├── core
├── errors
├── formats
├── io
├── json.py
├── lib.py
├── parser.py
├── plotting
├── testing.py
├── tests
├── tools
├── tseries
├── tslib.py
├── types
└── util
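A similar first-level listing can be produced from inside python. This is a rough sketch using pkgutil from the standard library, again with json standing in for pandas; first_level is a hypothetical helper name. Note that it only reports importable modules and sub-packages, so plain data files and __pycache__ will not show up the way they do in tree.

```python
import json
import pkgutil

def first_level(package):
    """Return sorted (name, is_subpackage) pairs for the first level of a package."""
    return sorted(
        (info.name, info.ispkg) for info in pkgutil.iter_modules(package.__path__)
    )

entries = first_level(json)
for name, is_pkg in entries:
    print(name, "(sub-package)" if is_pkg else "(module)")
```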
Returning to the list of import statements found in the __init__.py file of the pandas module.
#looking at randomly selected import statements from above
from pandas.core.sparse.api import *
from pandas.core.computation.api import *
So for instance we should be able to find a sparse directory or file within the core directory. Let's check...
Indeed, there is a
/Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/core/sparse/api.py
and a
/Users/loaner/anaconda3/lib/python3.6/site-packages/pandas/core/computation/api.py
So the '.' is serving the purpose of a unix '/'. Note that pandas is the directory, core is a sub-directory, computation is a sub-sub-directory, and api represents api.py, a file. Then comes the asterisk (which must mean the contents of the api.py file, right?). Using a period to get a lower level attribute is how python normally does things (for instance, accessing attributes of an object). Here, in a similar way, it shows the relationship of downstream items. So modules are composed of directories, files and internal code. All of those levels can be referenced in import statements, and the periods allow one to specify the relationships of that structure just as the '/' does in unix.
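To convince myself of the dot-to-slash correspondence, here is a small sketch that converts a dotted name into a relative path and compares it with what python actually resolves. It assumes json.decoder from the standard library as the example, and dotted_to_relative_path is a hypothetical helper:

```python
import importlib.util
import os

def dotted_to_relative_path(dotted_name):
    """Turn 'pkg.sub.mod' into 'pkg/sub/mod.py' using the local path separator."""
    return os.path.join(*dotted_name.split(".")) + ".py"

spec = importlib.util.find_spec("json.decoder")
expected_tail = dotted_to_relative_path("json.decoder")

print(spec.origin)
print(expected_tail)
print(spec.origin.endswith(expected_tail))
```

This only holds for a plain .py file at the end of the chain; when the last name is itself a directory, the file loaded is its __init__.py instead.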
So what we have discovered so far is that modules are just made up of a root directory (the module name) containing an __init__.py file. But is that __init__.py file critical or not? We should experiment.
To keep from getting confused, you need to pay attention to where you launch the python interpreter from in this experiment. We are going to make a pretend module, but it won't be installed correctly within the $PYTHONPATH, where python looks for modules. But you can take advantage of the fact that the interactive interpreter always looks in the current working directory ($PWD) for modules. So even without installing the module into its proper place within the $PYTHONPATH, python will still find our module so long as we are working in the same directory where the test module lives.
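The search behaviour described above can be sketched in a script. The module name scratch_demo_mod and the temporary directory are made up for illustration; the point is only that a module becomes importable once its directory is on sys.path, which is roughly what launching the interpreter from that directory does for you.

```python
import importlib
import importlib.util
import os
import sys
import tempfile

# Write a throwaway module into a fresh directory that is not on sys.path.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "scratch_demo_mod.py"), "w") as fh:
    fh.write("greeting = 'hello'\n")

found_before = importlib.util.find_spec("scratch_demo_mod") is not None

# Add the directory to the search path, then look again.
sys.path.insert(0, workdir)
importlib.invalidate_caches()
found_after = importlib.util.find_spec("scratch_demo_mod") is not None

print(found_before, found_after)
```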
$ pwd
/Users/loaner/ds/playground
$ mkdir temp temp2
$ touch temp/__init__.py
So we have an empty temp2 directory and a temp/__init__.py in our working directory. Let's test them as modules.
$ python
>>> import temp
>>> temp.__file__
'/Users/loaner/ds/playground/temp/__init__.py'
>>> temp
<module 'temp' from '/Users/loaner/ds/playground/temp/__init__.py'>
>>>
>>> import temp2
>>> temp2.__file__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'temp2' has no attribute '__file__'
>>> temp
<module 'temp' from '/Users/loaner/ds/playground/temp/__init__.py'>
>>> temp2
<module 'temp2' (namespace)>
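The same experiment can also be scripted, which makes it easy to repeat. This sketch uses hypothetical directory names (demo_with_init and demo_without_init) inside a temporary folder rather than temp and temp2. One hedge: depending on the python version, a directory without __init__.py either has no __file__ attribute at all (as in the session above) or has it set to None, so the sketch uses getattr with a default to cover both cases.

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "demo_with_init"))
os.makedirs(os.path.join(root, "demo_without_init"))
# Create an empty __init__.py in only one of the two directories.
open(os.path.join(root, "demo_with_init", "__init__.py"), "w").close()

sys.path.insert(0, root)
importlib.invalidate_caches()

with_init = importlib.import_module("demo_with_init")
without_init = importlib.import_module("demo_without_init")

print(getattr(with_init, "__file__", None))
print(getattr(without_init, "__file__", None))
print(list(without_init.__path__))
```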
This experiment has some interesting results.
Both directories can be imported, but temp2 (which python reports as a namespace module) has no __file__ attribute, because of the missing __init__.py file.
An __init__.py file within the module's root directory is enough to make the __file__ attribute be assigned its path, even if there is no information in the __init__.py file.
So the __init__.py file is meaningful, even if empty. It offers a place to put extra code pertaining to the module as a whole, like collapsing a namespace for convenience's sake. In fact, I am going to have the __init__.py file of our module load all of the attributes defined in tools.py. That way I won't have to type PlayTime.tools.a_df or tools.a_df; I will be able to access the variable a_df directly.
#__init__.py
from .tools import *
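Here is a self-contained sketch of that namespace collapsing, built in a temporary directory so it does not touch the real module. The package name collapse_demo and the variable _hidden are made up for illustration. It also shows a detail worth knowing: names that start with an underscore are skipped by import * (unless the module defines __all__).

```python
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "collapse_demo")
os.makedirs(pkg_dir)

# tools.py holds the actual objects.
with open(os.path.join(pkg_dir, "tools.py"), "w") as fh:
    fh.write("a_list = ['orange', 'apple']\n_hidden = 'not exported'\n")
# __init__.py pulls the public names of tools up to the package level.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
    fh.write("from .tools import *\n")

sys.path.insert(0, root)
importlib.invalidate_caches()
collapse_demo = importlib.import_module("collapse_demo")

print(collapse_demo.a_list)                # reachable directly on the package
print(collapse_demo.tools.a_list)          # the long form still works
print(hasattr(collapse_demo, "_hidden"))   # underscore names are not pulled up
```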
Maybe we know enough to create a module! Let's try. The goal of this module will be to make a suite of pre-defined, basic python data types that can easily be imported into a python interpreter for experimenting with the objects' methods etc. without having to do the setup.
So we need a root directory called PlayTime, and within that an __init__.py file and also another .py file containing the code. One of the objects I am interested in having available is a pandas.DataFrame, so I will also include a .csv file specifically to supply that data.
$ mkdir PlayTime
$ touch PlayTime/__init__.py PlayTime/tools.py
#found a typical .csv file on the web, it will serve as data
$ curl https://people.sc.fsu.edu/~jburkardt/data/csv/oscar_age_male.csv > PlayTime/oscar_male.csv
$ tree PlayTime/
PlayTime/ # <== Our module name root directory
├── __init__.py # <== __init__.py helps set __file__ attr. of the module
├── oscar_male.csv # <== only used because we need some data
└── tools.py # <== .py files hold the code base
Let's add the code to the tools.py file to take care of populating the dataframe and storing the result in a variable to use in the python interpreter.
$ sublime PlayTime # open the whole module tree in an editor
#tools.py
import os

import pandas as pd

try:
    # The line below finds the path to this directory dynamically
    # (it won't matter where on the machine this module is moved).
    fn = os.path.join(os.path.dirname(__file__), 'oscar_male.csv')
    a_df = pd.read_csv(fn)
except Exception:
    print('failed case from local file')

a_str = "Although Python’s extensive standard library covers many programming needs, there often comes a time when you need to add some new functionality to your Python installation in the form of third-party modules. This might be necessary to support your own programming, or to support an application that you want to use and that happens to be written in Python."
a_list = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']

def a_function():
    print('I am a_function living in tools.py')
Note the os.path.join(os.path.dirname(__file__), 'oscar_male.csv'). Alternative approaches to providing the .csv file have some downsides. For instance, hardcoding the full path will work great until we move the module somewhere else, then it will break. If you only provide the name of the file, oscar_male.csv, it will work when python is executed in the module's directory, but not from outside this directory, so it is not robust. Adding the path to $PYTHONPATH doesn't work either, and even if it did, it would break if we moved the module. So we need to determine the full path through this dynamic lookup.
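To check that the dynamic lookup really is independent of the working directory, here is a sketch that builds a tiny throwaway package, changes to an unrelated directory, and then imports it. Everything named here is an assumption for the demo: the package datademo_pkg, the file sample.csv with a single made-up row, and the standard library csv module swapped in for pandas so that nothing needs installing.

```python
import importlib
import os
import sys
import tempfile

pkg_root = tempfile.mkdtemp()
pkg_dir = os.path.join(pkg_root, "datademo_pkg")
os.makedirs(pkg_dir)

# A one-row data file that lives next to the package code.
with open(os.path.join(pkg_dir, "sample.csv"), "w", newline="") as fh:
    fh.write("Year,Age\n1928,44\n")

# The package locates its data relative to its own __file__.
with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
    fh.write(
        "import csv, os\n"
        "fn = os.path.join(os.path.dirname(__file__), 'sample.csv')\n"
        "with open(fn, newline='') as fh:\n"
        "    rows = list(csv.DictReader(fh))\n"
    )

sys.path.insert(0, pkg_root)
importlib.invalidate_caches()

# Move to an unrelated directory before importing.
os.chdir(tempfile.mkdtemp())
datademo_pkg = importlib.import_module("datademo_pkg")
print(datademo_pkg.rows)
```

If the package had opened 'sample.csv' by bare name instead, the import would have failed after the chdir.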
Okay last step. We need to install this module. To do that we are going to use the python package installer tool available on the command line, called pip.
$ tree PlayTime
PlayTime
├── README.txt
├── __init__.py
├── oscar_male.csv
└── tools.py
Currently the module looks like this. We want to place it in an enclosing parent directory of the same name like this:
$ tree PlayTime
PlayTime
├── PlayTime
│ ├── __init__.py
│ ├── oscar_male.csv
│ └── tools.py
└── setup.py
You will see why this organization is useful in a second. cd into the parent PlayTime. From that location ls should show the child PlayTime directory, which is our module. We will need to add a setup.py file there that helps pip know what is going on.
#setup.py
from setuptools import setup

setup(name='PlayTime',
      version='0.1.2',
      description='Simple data types for playing in python REPL, including DataFrames, list, etc.',
      url='not_on_web_yet',
      author='Your Name',
      author_email='username@gmail.com',
      license='MIT',
      packages=['PlayTime'],
      install_requires=[
          'pandas',
      ],
      zip_safe=False)
PlayTime #<==from $pwd here run the pip command below
├── PlayTime
│ ├── __init__.py
│ ├── oscar_male.csv
│ └── tools.py
└── setup.py
$ pip install -e . --user
FROM: https://andrewhoos.com/blog/some-tricks-with-pip-install
"Installing Package symlinks*
By adding the -e
flag to pip a symlink to the source source is installed instead of the byte-code compiled source. It is not really a symlink, but it is a close enough analogy to understand what is happening."
pip creates some new files for installation and reports on the results of the installation.
Obtaining file:///Users/loaner/ds/playground/PlayTime
Requirement already satisfied: pandas in /Users/loaner/anaconda3/lib/python3.6/site-packages (from PlayTime==0.1.2) (0.23.0)
Requirement already satisfied: python-dateutil>=2.5.0 in /Users/loaner/anaconda3/lib/python3.6/site-packages (from pandas->PlayTime==0.1.2) (2.7.3)
Requirement already satisfied: pytz>=2011k in /Users/loaner/anaconda3/lib/python3.6/site-packages (from pandas->PlayTime==0.1.2) (2018.4)
Requirement already satisfied: numpy>=1.9.0 in /Users/loaner/anaconda3/lib/python3.6/site-packages (from pandas->PlayTime==0.1.2) (1.14.3)
Requirement already satisfied: six>=1.5 in /Users/loaner/anaconda3/lib/python3.6/site-packages (from python-dateutil>=2.5.0->pandas->PlayTime==0.1.2) (1.11.0)
Installing collected packages: PlayTime
Running setup.py develop for PlayTime
Successfully installed PlayTime
PlayTime
├── PlayTime
│ ├── __init__.py
│ ├── oscar_male.csv
│ └── tools.py
├── PlayTime.egg-info #<==note this egg-info directory is from pip
│ ├── PKG-INFO
│ ├── SOURCES.txt
│ ├── dependency_links.txt
│ ├── not-zip-safe
│ ├── requires.txt
│ └── top_level.txt
└── setup.py
pip
reported success at installing. So let's try it out in a python interpreter.
$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 26 2018, 08:42:37)
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from PlayTime import *
>>> a_df.head()
Index Year Age Name Movie
0 1 1928 44 Emil Jannings The Last Command
1 2 1929 41 Warner Baxter In Old Arizona
2 3 1930 62 George Arliss Disraeli
3 4 1931 53 Lionel Barrymore A Free Soul
4 5 1932 47 Wallace Beery The Champ
>>> a_list
['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
Notice we called from PlayTime import * to collapse PlayTime, and it imported the __init__.py file, which called
from .tools import *
to collapse the tools namespace. Therefore, we can directly call the module's objects without any preceding module names. This is not a good strategy for modules that will be used in large complex code bases, but for this purpose it provides easy to use names in the python interpreter, which was our goal. One last thing...
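To illustrate why collapsing namespaces gets risky in bigger code bases, here is a sketch with two made-up modules (fruit_demo and veg_demo) that both define a_list. Star-importing both silently keeps only the last one, with no warning.

```python
import importlib
import os
import sys
import tempfile

mod_dir = tempfile.mkdtemp()
for name, value in (("fruit_demo", "['apple']"), ("veg_demo", "['kale']")):
    with open(os.path.join(mod_dir, name + ".py"), "w") as fh:
        fh.write("a_list = " + value + "\n")

sys.path.insert(0, mod_dir)
importlib.invalidate_caches()

# Run both star imports in a scratch namespace so we can inspect the result.
namespace = {}
exec("from fruit_demo import *\nfrom veg_demo import *", namespace)

print(namespace["a_list"])   # whichever module was imported last wins
```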
$ pip list | grep PlayTime
PlayTime 0.1.2
We can verify which version is installed with pip as above. It shows 0.1.2, which was the version listed in our setup.py file. If you improve the code, you will want to update this version number in the setup.py file before installing with pip. Pip will uninstall the prior version and install the new one.
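The installed version can also be read from inside python. This is a sketch that assumes Python 3.8 or newer, where importlib.metadata is part of the standard library (it is not available on the 3.6 interpreter used in this post); installed_version is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

print(installed_version("PlayTime"))
```

On a machine where the module was installed as above, this should report the version from setup.py; otherwise it prints None.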