Selenium Home: WebDriver(Selenium2) : Extract text from PDF file using java

Wednesday, 3 July 2013

Verifying PDF content is also part of testing, but WebDriver (Selenium 2) has no direct method for it.

To extract PDF content, we can use the Apache PDFBox API.

Download the JAR files and add them to your Eclipse classpath. Then you are ready to extract text from a PDF file.

Here is a sample script that extracts text from the PDF file below.
http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf

import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URL;
// These imports match PDFBox 1.x; in PDFBox 2.x PDFTextStripper lives in org.apache.pdfbox.text.
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.Reporter;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class ReadPdfFile {

  WebDriver driver;

  @BeforeTest
  public void setUpDriver() {
    driver = new FirefoxDriver();
    Reporter.log("Firefox driver started");
  }

  @Test
  public void start() throws IOException {
    // Open the PDF in the browser, then read the same URL as a byte stream.
    driver.get("http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf");
    URL url = new URL(driver.getCurrentUrl());
    BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
    try {
      // parse() reads the stream and populates the COSDocument object,
      // which is the in-memory representation of the PDF document.
      PDFParser parser = new PDFParser(fileToParse);
      parser.parse();

      // getPDDocument() returns the parsed document. Call close() on it
      // when you are done to release resources.
      PDDocument document = parser.getPDDocument();
      try {
        // PDFTextStripper strips out all of the text and ignores the formatting.
        String output = new PDFTextStripper().getText(document);
        System.out.println(output);
      } finally {
        document.close();
      }
    } finally {
      fileToParse.close();
    }
  }

  @AfterTest
  public void tearDown() {
    // Close the browser once the test has finished.
    driver.quit();
  }

}
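
The script hands whatever the URL returns straight to the parser. If the server answers with an HTML error page instead of a PDF, the parse step fails with a less obvious error. A minimal sketch of a guard, assuming a hypothetical helper class `PdfHeaderCheck` (not part of Selenium or PDFBox), checks that the stream begins with the `%PDF-` marker and then rewinds it so the parser still sees the whole file. This is a strict check at byte 0; it does not handle files with leading junk before the header.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Selenium or PDFBox.
final class PdfHeaderCheck {

  private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

  // Returns true if the stream starts with "%PDF-".
  // The stream must support mark/reset (BufferedInputStream does);
  // it is rewound to its starting position before this method returns.
  static boolean looksLikePdf(InputStream in) throws IOException {
    if (!in.markSupported()) {
      throw new IllegalArgumentException("stream must support mark/reset");
    }
    in.mark(PDF_MAGIC.length);
    try {
      byte[] head = new byte[PDF_MAGIC.length];
      int read = 0;
      while (read < head.length) {
        int n = in.read(head, read, head.length - read);
        if (n < 0) {
          break; // stream ended before the full marker could be read
        }
        read += n;
      }
      return read == head.length && Arrays.equals(head, PDF_MAGIC);
    } finally {
      in.reset();
    }
  }

  public static void main(String[] args) throws IOException {
    // In-memory stand-ins for a real download, so the demo needs no network.
    InputStream pdf = new BufferedInputStream(
        new ByteArrayInputStream("%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII)));
    InputStream html = new BufferedInputStream(
        new ByteArrayInputStream("<html>Not Found</html>".getBytes(StandardCharsets.US_ASCII)));
    System.out.println("pdf stream:  " + looksLikePdf(pdf));
    System.out.println("html stream: " + looksLikePdf(html));
  }
}
```

In the test, this could be called on `fileToParse` right after the stream is opened, failing fast when it returns false.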

Here is the output of the above program:

EarthBox a Day Giveaway   
Objectives   
EarthBox wanted to engage their Facebook   
audience with an Earth Day promotion that would   
also increase their Facebook likes. They needed a   
simple solution that would allow them to create a   
sweepstakes application themselves.   
   
   
Solution   
EarthBox utilized the Votigo   
platform to create a like-  
gated sweepstakes. Utilizing a   
theme and uploading a custom graphic they   
were able to create a branded promotion.   
   
   
Details   
• 1 prize awarded each day for the entire Month of April    
• A grand prize given away on Earth Day    
• Daily winner announcements on Facebook   
• Promoted through email newsletter blast    
   
Results (4 weeks)   
• 6,550 entries   
   
Facebook    
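
The extracted text keeps the PDF's line wrapping, so a sentence such as "engage their Facebook audience" is split across two lines in the output above. To verify content rather than just print it, one option is to collapse whitespace before comparing. The following is a minimal sketch, assuming a hypothetical helper class `PdfTextCheck` (not part of PDFBox or TestNG); words hyphenated across lines, such as "like-" / "gated", are not rejoined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox or TestNG.
final class PdfTextCheck {

  // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
  static String normalizeWhitespace(String text) {
    return text.replaceAll("\\s+", " ").trim();
  }

  // Returns the expected phrases that do NOT occur in the extracted text,
  // comparing after whitespace normalization on both sides.
  static List<String> missingPhrases(String extractedText, List<String> expectedPhrases) {
    String haystack = normalizeWhitespace(extractedText);
    List<String> missing = new ArrayList<String>();
    for (String phrase : expectedPhrases) {
      if (!haystack.contains(normalizeWhitespace(phrase))) {
        missing.add(phrase);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // Two lines copied from the output above, including the trailing spaces.
    String extracted = "EarthBox wanted to engage their Facebook   \n"
        + "audience with an Earth Day promotion that would   \n";
    List<String> missing = missingPhrases(
        extracted, Arrays.asList("engage their Facebook audience", "Earth Day promotion"));
    System.out.println("Missing phrases: " + missing);
  }
}
```

In a TestNG test, the string returned by `getText(...)` could be passed as `extractedText`, with an assertion that the returned list is empty.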
