Wednesday, 3 July 2013

WebDriver (Selenium 2): Extract text from a PDF file using Java



Verifying PDF content is also part of testing, but WebDriver (Selenium 2) has no direct method for it.

To extract the content of a PDF, we can use the Apache PDFBox API.

Download the JAR files and add them to your Eclipse classpath. Then you are ready to extract text from a PDF file. :)

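Here is a sample script that extracts the text from the PDF file below. It uses the PDFBox 1.x package layout (org.apache.pdfbox.util.PDFTextStripper); in PDFBox 2.x the class moved to org.apache.pdfbox.text.
http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf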
  import java.io.BufferedInputStream;
  import java.io.IOException;
  import java.net.URL;
  import org.apache.pdfbox.pdfparser.PDFParser;
  import org.apache.pdfbox.pdmodel.PDDocument;
  import org.apache.pdfbox.util.PDFTextStripper;
  import org.openqa.selenium.WebDriver;
  import org.openqa.selenium.firefox.FirefoxDriver;
  import org.testng.Reporter;
  import org.testng.annotations.AfterTest;
  import org.testng.annotations.BeforeTest;
  import org.testng.annotations.Test;

  public class ReadPdfFile {

      WebDriver driver;

      @BeforeTest
      public void setUpDriver() {
          driver = new FirefoxDriver();
          Reporter.log("Driver started");
      }

      @Test
      public void start() throws IOException {
          // Open the PDF named in the text above, then re-read it from the browser's current URL.
          driver.get("http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf");
          URL url = new URL(driver.getCurrentUrl());
          BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
          try {
              // parse() reads the stream and populates the COSDocument object,
              // which is the in-memory representation of the PDF document.
              PDFParser parser = new PDFParser(fileToParse);
              parser.parse();

              // getPDDocument() returns the parsed document. It must be closed
              // when you are done with it, to release resources.
              PDDocument document = parser.getPDDocument();
              try {
                  // PDFTextStripper extracts all of the text and ignores the formatting.
                  String output = new PDFTextStripper().getText(document);
                  System.out.println(output);
              } finally {
                  document.close();
              }
          } finally {
              fileToParse.close();
          }
      }

      @AfterTest
      public void tearDown() {
          // Close the browser so the Firefox instance does not leak between runs.
          driver.quit();
      }
  }
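
One thing to keep in mind with url.openStream(): it opens a fresh HTTP connection that does not share the browser's session, so the server may answer with an HTML page (a login or error page, for example) instead of the PDF. A minimal sketch of a sanity check follows, assuming the stream supports mark/reset as BufferedInputStream does. The class name PdfHeaderCheck and the method looksLikePdf are made up for illustration and are not part of PDFBox or Selenium. The check relies on the fact that a PDF file normally begins with the bytes "%PDF-".

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox or Selenium.
final class PdfHeaderCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfHeaderCheck() {
    }

    // Returns true if the stream starts with "%PDF-".
    // The stream is rewound before returning, so a parser can still
    // read it from the first byte afterwards.
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(PDF_MAGIC.length);
        byte[] head = new byte[PDF_MAGIC.length];
        int read = 0;
        // read() may return fewer bytes than requested, so loop until
        // the header is complete or the stream ends.
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        in.reset();
        // A short read leaves trailing zero bytes, which never match the signature.
        return Arrays.equals(head, PDF_MAGIC);
    }
}
```

In the ReadPdfFile test, such a call would sit right after fileToParse is created, failing the test with a clear message when looksLikePdf(fileToParse) returns false rather than letting the parser throw on unexpected input.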
Here is the output of the ReadPdfFile test above:
  EarthBox a Day Giveaway
  Objectives
  EarthBox wanted to engage their Facebook
  audience with an Earth Day promotion that would
  also increase their Facebook likes. They needed a
  simple solution that would allow them to create a
  sweepstakes application themselves.


  Solution
  EarthBox utilized the Votigo
  platform to create a like-
  gated sweepstakes. Utilizing a
  theme and uploading a custom graphic they
  were able to create a branded promotion.


  Details
  • 1 prize awarded each day for the entire Month of April
  • A grand prize given away on Earth Day
  • Daily winner announcements on Facebook
  • Promoted through email newsletter blast

  Results (4 weeks)
  • 6,550 entries

  Facebook
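
Printing the text is only half of a verification. Because the extracted text is wrapped at arbitrary points, a direct String.contains() on a phrase that spans two lines would fail. The following is a minimal sketch of one way to check for expected phrases, assuming it is acceptable to collapse all whitespace before comparing. The class PdfTextAssertions and its methods are hypothetical names introduced here, not part of PDFBox, Selenium or TestNG.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox, Selenium or TestNG.
final class PdfTextAssertions {

    private PdfTextAssertions() {
    }

    // Collapses every run of whitespace (including line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns the expected phrases that do NOT occur in the extracted text.
    static List<String> missingPhrases(String extracted, List<String> expected) {
        String haystack = normalize(extracted);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // A few lines taken from the extracted output, with their line breaks.
        String extracted = "EarthBox wanted to engage their Facebook \n"
                + "audience with an Earth Day promotion that would \n"
                + "also increase their Facebook likes.";
        List<String> expected = Arrays.asList(
                "engage their Facebook audience", // spans a line break
                "Earth Day promotion",
                "Twitter followers");             // deliberately absent
        System.out.println(missingPhrases(extracted, expected)); // prints [Twitter followers]
    }
}
```

In a TestNG test, the returned list could be passed to an assertion that it is empty, so that a failure message names the missing phrases. Note that words hyphenated at a line break, such as "like-" followed by "gated" in the output, still contain a space after normalization, so the expected phrase would need to be written the same way.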
