Posted By

kamthornp on 04/25/18


Tagged

Kindergarden


Versions (?)

Attribute Properties


 / Published in: Python
 

Objective: read a csv file to print its column name, data type, %missing value, %outlier, number of record, and attribute detail of each column

  1. import pandas as pd
  2. import numpy as np
  3.  
  4. counter = 0;
  5. df = pd.read_csv("somewhere/data.csv")
  6.  
  7. header = '{:<3} {:<30} {:>10} {:>9} {:>9} {:>8} {:<50}'.format("#","Attribute", "DataType", "%NA", "%Outlier", "#Record", "Detail")
  8. print(header)
  9. for name, series in df.iteritems():
  10. counter = counter+1
  11. if np.object == df[name].dtype:
  12. if len(df[name].unique())<30:
  13. info = ' '.join(["size =",str(len(df[name].unique())),str(df[name].unique())])
  14. else:
  15. info = ' '.join(["size =",str(len(df[name].unique()))])
  16. pMissing = str(round(100*float(df[name].isnull().sum())/float(len(df[name])),3))
  17. pOutlier = "0.0"
  18. line = '{:<3} {:<30} {:>10} {:>8}% {:>8}% {:>8} {:<50}'.format(counter, name, str(df[name].dtype), pMissing, pOutlier, len(df[name]), info)
  19. else:
  20. info = ' '.join(["max =",str(df[name].max()),
  21. "min =",str(df[name].min()),
  22. "mean =",str(round(df[name].mean(),4)),
  23. "median =",str(round(df[name].median(),4)),
  24. "STD =",str(round(df[name].std(),4))
  25. ])
  26. Q1=df[name].quantile(0.25)
  27. Q3=df[name].quantile(0.75)
  28. IQR = Q3-Q1
  29. N_outlier = float(len(df.loc[(df[name]<Q1 - 1.5 * IQR) | (df[name]>Q3 + 1.5 * IQR)]))
  30. N_missing = float(df[name].isnull().sum())
  31. N = float(len(df[name]))
  32. pMissing = str(round(100*N_missing/N,3))
  33. pOutlier = str(round(100*N_outlier/N,3))
  34. line = '{:<3} {:<30} {:>10} {:>8}% {:>8}% {:>8} {:<50}'.format(counter, name, str(df[name].dtype), pMissing, pOutlier, len(df[name]), info)
  35. print(line)

Report this snippet  

You need to login to post a comment.