Posted By

webonomic on 12/02/12


Tagged

document matrix term correlation textminng


Co Word Analysis with SAS


Published in: SAS

URL: https://communities.sas.com/thread/6327?start=0&tstart=0

Text Miner stores the term-by-document frequency matrix in a compressed form. You will find this OUT data set in the project data directory of your Text Miner run; its label includes the string "OUT". Because a 30,000-document collection can have 500,000 to a million distinct terms, be sure to restrict the terms of interest with a start list. The following code builds a small example of the compressed data set (one row per term, doc, and count), expands it to an uncompressed document-by-term table, and then computes the co-occurrence matrix with PROC CORR and the SSCP option.

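  /* Compressed term-by-doc matrix: one row per (term, doc) pair */
  data myOUT;
    input term doc count;
    datalines;
  1 1 1
  1 3 1
  1 4 1
  2 2 1
  2 3 2
  3 1 2
  3 3 2
  3 4 1
  4 2 2
  4 4 1
  5 3 2
  ;
  run;

  /* Sort by document so that each document's terms are contiguous */
  proc sort data=myOUT;
    by doc term;
  run;

  /* Expand to one row per document with one column per term (t1-t5) */
  data docbyterm;
    set myOUT;
    by doc;
    array t{5} t1-t5;
    retain t1-t5;
    /* Reset all term counts at the start of each document */
    if first.doc then do;
      do i = 1 to 5;
        t{i} = 0;
      end;
    end;
    /* Store this term's count in its own column */
    t{term} = count;
    /* Write one row once the document's last term has been read */
    if last.doc then do;
      output;
    end;
    keep doc t1-t5;
  run;

  /* SSCP adds the sums of squares and cross-products to the OUTP= data set */
  proc corr data=docbyterm cov outp=cooccur sscp;
    var t1-t5;
  run;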
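
As a rough check on what the SSCP option computes for t1-t5, the worked example below uses the sample data above. It assumes X denotes the document-by-term table (rows are documents 1 to 4, columns are t1 to t5); X is notation introduced here, not a name used in the code. The sums of squares and cross-products for the terms are then the entries of X'X, where entry (i, j) is the sum over documents of count(i) times count(j).

```latex
X = \begin{pmatrix}
1 & 0 & 2 & 0 & 0 \\
0 & 1 & 0 & 2 & 0 \\
1 & 2 & 2 & 0 & 2 \\
1 & 0 & 1 & 1 & 0
\end{pmatrix},
\qquad
X^{\top} X = \begin{pmatrix}
3 & 2 & 5 & 1 & 2 \\
2 & 5 & 4 & 2 & 4 \\
5 & 4 & 9 & 1 & 4 \\
1 & 2 & 1 & 5 & 0 \\
2 & 4 & 4 & 0 & 4
\end{pmatrix},
\qquad
(X^{\top} X)_{ij} = \sum_{d=1}^{4} x_{di}\, x_{dj}
```

Note that these entries are weighted by the term counts. For example, terms 1 and 3 appear together in three documents, but their entry is 5 because term 3 occurs twice in documents 1 and 3. If you want the number of documents in which two terms co-occur, one option is to replace each nonzero count with 1 before running PROC CORR.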
