My Thoughts: WebDriver (Selenium 2): Extract text from a PDF file using Java

Thursday, May 31

WebDriver (Selenium 2): Extract text from a PDF file using Java


Verifying PDF content is often part of testing, but WebDriver (Selenium 2) has no direct methods for it.

To extract PDF content, we can use the Apache PDFBox API.

Download the JAR files and add them to your Eclipse class path. Then you are ready to extract text from a PDF file. :)
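One quick way to confirm that the JAR files really ended up on the class path is to try loading a PDFBox class by name. The sketch below is only an illustration: `ClasspathCheck` and `isOnClasspath` are hypothetical names, not part of PDFBox or Selenium, and `org.apache.pdfbox.pdmodel.PDDocument` is simply used as the probe class.

```java
public class ClasspathCheck {

    // Hypothetical helper for illustration; not part of PDFBox or Selenium.
    // Returns true if the named class can be located on the current class path.
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Probe class chosen as an assumption; any PDFBox class would do.
        String probe = "org.apache.pdfbox.pdmodel.PDDocument";
        System.out.println(probe + " on class path: " + isOnClasspath(probe));
    }
}
```

If this prints `false`, the build path in Eclipse is probably still missing the PDFBox JAR.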

Here is a sample script that extracts text from the PDF file below.
http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf
import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URL;
// Package names below follow PDFBox 1.x (PDFTextStripper lives in org.apache.pdfbox.util).
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.Reporter;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

public class ReadPdfFile {

  WebDriver driver;

  @BeforeTest
  public void setUpDriver() {
    driver = new FirefoxDriver();
    Reporter.log("Driver started");
  }

  @Test
  public void start() throws IOException {
    driver.get("http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf");
    URL url = new URL(driver.getCurrentUrl());

    // Fetch the PDF over a separate connection; the browser session is not reused here.
    try (BufferedInputStream fileToParse = new BufferedInputStream(url.openStream())) {

      // parse() reads the stream and populates the COSDocument object,
      // which is the in-memory representation of the PDF document.
      PDFParser parser = new PDFParser(fileToParse);
      parser.parse();

      // getPDDocument() returns the parsed document. It must be closed
      // when you are done with it to release resources.
      PDDocument document = parser.getPDDocument();
      try {
        // PDFTextStripper takes a PDF document, strips out all of the text
        // and ignores the formatting.
        String output = new PDFTextStripper().getText(document);
        System.out.println(output);
      } finally {
        document.close();
      }
    }
  }

  @AfterTest
  public void tearDown() {
    // Close the browser so no Firefox process is left behind.
    driver.quit();
  }

}
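Printing the text is only half of a test; to verify it, the extracted string still has to be compared with what is expected. Here is a minimal sketch of one way to do that, assuming the extracted text keeps the PDF's own line breaks, so whitespace is collapsed before comparing. `PdfTextCheck`, `normalize` and `missingPhrases` are hypothetical names introduced for this example, and the sample string only stands in for the value returned by `getText`.

```java
import java.util.ArrayList;
import java.util.List;

public class PdfTextCheck {

    // Collapse every run of whitespace (including line breaks) into one space.
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Return the expected phrases that do NOT occur in the extracted text.
    // The comparison is case-sensitive.
    public static List<String> missingPhrases(String extracted, String... expected) {
        String haystack = normalize(extracted);
        List<String> missing = new ArrayList<>();
        for (String phrase : expected) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(...).
        String extracted =
            "EarthBox wanted to engage their Facebook \naudience with an Earth Day promotion";
        List<String> missing = missingPhrases(extracted,
            "engage their Facebook audience", "Earth Day promotion");
        System.out.println("Missing phrases: " + missing); // prints "Missing phrases: []"
    }
}
```

Inside the TestNG test, the same check could back an assertion such as `Assert.assertTrue(PdfTextCheck.missingPhrases(output, "Earth Day promotion").isEmpty())`.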
Here is the output of the ReadPdfFile program:
EarthBox a Day Giveaway 
Objectives 
EarthBox wanted to engage their Facebook 
audience with an Earth Day promotion that would 
also increase their Facebook likes. They needed a 
simple solution that would allow them to create a 
sweepstakes application themselves. 
 
 
Solution 
EarthBox utilized the Votigo 
platform to create a like-
gated sweepstakes. Utilizing a 
theme and uploading a custom graphic they 
were able to create a branded promotion. 
 
 
Details 
• 1 prize awarded each day for the entire Month of April  
• A grand prize given away on Earth Day  
• Daily winner announcements on Facebook 
• Promoted through email newsletter blast  
 
Results (4 weeks) 
• 6,550 entries 
 
Facebook