My Thoughts: WebDriver (Selenium 2): Extract text from a PDF file using Java

Thursday, May 31

WebDriver (Selenium 2): Extract text from a PDF file using Java


Verifying PDF content is also part of testing, but WebDriver (Selenium 2) has no direct method for it.

To extract PDF content, we can use the Apache PDFBox API.

Download the PDFBox jar files (including the FontBox jar that PDFBox depends on) and add them to your Eclipse classpath. Then you are ready to extract text from a PDF file. :)
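If a jar is missing from the classpath, the problem usually only shows up at run time as a NoClassDefFoundError. A minimal sketch of a pre-flight check follows; the class name ClasspathCheck and its methods are hypothetical helpers made up for illustration, while the two class names being looked up are the PDFBox parser used in this post and the FontBox class named in the error a reader reports in the comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox or Selenium): reports which
// required classes cannot be found on the current classpath.
class ClasspathCheck {

  // True if the named class can be located; the class is not initialized.
  static boolean isOnClasspath(String className) {
    try {
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException e) {
      return false;
    } catch (LinkageError e) {
      return false;
    }
  }

  // Returns the subset of the given class names that could not be found.
  static List<String> missing(String... classNames) {
    List<String> notFound = new ArrayList<String>();
    for (String name : classNames) {
      if (!isOnClasspath(name)) {
        notFound.add(name);
      }
    }
    return notFound;
  }

  public static void main(String[] args) {
    List<String> notFound = missing(
        "org.apache.pdfbox.pdfparser.PDFParser", // from the PDFBox jar
        "org.apache.fontbox.afm.AFMParser");     // from the FontBox jar
    if (notFound.isEmpty()) {
      System.out.println("All required classes found");
    } else {
      System.out.println("Missing from classpath: " + notFound);
    }
  }
}
```

Running this from the same Eclipse project as the test should list any class whose jar still needs to be added.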

Here is a sample script that extracts text from the PDF file below.
http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf
import java.io.BufferedInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.concurrent.TimeUnit;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.Reporter;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

// Written against the PDFBox 1.8.x API (PDFTextStripper lives in org.apache.pdfbox.util).
public class ReadPdfFile {

  WebDriver driver;

  @BeforeTest
  public void setUpDriver() {
    driver = new FirefoxDriver();
    // Set the implicit wait once, before any page is loaded.
    driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    Reporter.log("Firefox driver started");
  }

  @Test
  public void start() throws IOException {
    driver.get("http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf");
    // Read the PDF bytes directly from the URL the browser ended up on.
    URL url = new URL(driver.getCurrentUrl());
    BufferedInputStream fileToParse = new BufferedInputStream(url.openStream());
    PDDocument document = null;
    try {
      // parse() -- parses the stream and populates the COSDocument object,
      // which is the in-memory representation of the PDF document.
      PDFParser parser = new PDFParser(fileToParse);
      parser.parse();

      // getPDDocument() -- returns the PD document that was parsed. When you are
      // done with this document you must call close() on it to release resources.
      document = parser.getPDDocument();

      // PDFTextStripper -- takes a PDF document, strips out all of the text
      // and ignores the formatting and such.
      String output = new PDFTextStripper().getText(document);
      System.out.println(output);
    } finally {
      if (document != null) {
        document.close();
      }
      fileToParse.close();
    }
  }

  @AfterTest
  public void tearDown() {
    // Close the browser once the test has finished.
    driver.quit();
  }
}
Here is the output of the above program:
EarthBox a Day Giveaway 
Objectives 
EarthBox wanted to engage their Facebook 
audience with an Earth Day promotion that would 
also increase their Facebook likes. They needed a 
simple solution that would allow them to create a 
sweepstakes application themselves. 
 
 
Solution 
EarthBox utilized the Votigo 
platform to create a like-
gated sweepstakes. Utilizing a 
theme and uploading a custom graphic they 
were able to create a branded promotion. 
 
 
Details 
• 1 prize awarded each day for the entire Month of April  
• A grand prize given away on Earth Day  
• Daily winner announcements on Facebook 
• Promoted through email newsletter blast  
 
Results (4 weeks) 
• 6,550 entries 
 
Facebook  
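Since the goal is to verify PDF content, note that the extracted text keeps the PDF's own line breaks, as the output above shows, so an expected phrase can be split across two lines. A minimal sketch of one way to check for phrases regardless of line wrapping follows; the class PdfTextCheck and its methods are hypothetical helpers written for this illustration, not part of PDFBox or TestNG.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks extracted PDF text for expected phrases,
// ignoring where the lines happen to wrap.
class PdfTextCheck {

  // Collapse every run of whitespace (including line breaks) into one space.
  static String normalize(String extracted) {
    return extracted.replaceAll("\\s+", " ").trim();
  }

  // Return the expected phrases that do not occur in the normalized text.
  static List<String> missingPhrases(String extracted, String... expected) {
    String text = normalize(extracted);
    List<String> missing = new ArrayList<String>();
    for (String phrase : expected) {
      if (!text.contains(normalize(phrase))) {
        missing.add(phrase);
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    // A few lines copied from the output above, including the line breaks.
    String sample = "EarthBox wanted to engage their Facebook \n"
        + "audience with an Earth Day promotion that would \n"
        + "also increase their Facebook likes.";
    // The last phrase is deliberately absent from the sample.
    System.out.println(missingPhrases(sample,
        "engage their Facebook audience",
        "Earth Day promotion",
        "Twitter followers"));
  }
}
```

In the test method, the string returned by getText() could be passed to missingPhrases() and the result asserted to be empty. Words hyphenated across a line break (such as "like-" / "gated" in the output above) are not rejoined by this sketch and would need separate handling.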

15 comments:

  1. Shubham Agrawal, Friday, April 26, 2013

    Hi Vamshi,

    This is really good code, but in my case it shows this error:
    ***********************************************************

    FAILED: start

    java.lang.NoClassDefFoundError: org/apache/fontbox/afm/AFMParser

    ************************************************************

    I might be doing something wrong, but can you please suggest a solution?

    Thanks,
    Shubham

  2. Hi Shubham,

    Sorry I couldn't be of much help. I tried, but I could not find what the actual issue was. (It might be the jars or the PDF you are trying to read.)

  3. it works... thanks

  4. I ran the same code and it is working fine for me.

    Just want to know, did you make any changes before you ran the script?

  5. Hi, thanks a lot for this wonderful code.

    When I run it I get the error below.
    Can you please help me with this?

    Sep 27, 2012 9:29:38 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
    WARNING: Specified stream length 98877 is wrong. Fall back to reading stream until 'endstream'.
    FAILED: start
    org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 98877 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
    at com.test.PDF.PDF_Reader.start(PDF_Reader.java:38)

  6. Hi,

    Your code works fine, but your download link is not working, so please use the link below so that the program will run (pdfbox-app-1.8.2.jar).

    http://www.bizdirusa.com/mirrors/apache/pdfbox/1.8.2/pdfbox-1.8.2-src.zip

  7. Madiraju Chaitanya, Thursday, November 28, 2013

    Hi Vamsi Kurra Ji,
    Wonderful post. Thanks for sharing this with us. Please keep posting articles regularly and share your knowledge and experience with us on Selenium WebDriver and other topics. Thank you.

  8. Hi Vamshi,

    The application that I am automating has a set of pages in which the user provides certain information. All of this information is shown on the next page as a PDF embedded within a container / frame. Each piece of text in the PDF is captured as a separate element in Firebug, but I couldn't identify the container itself. I tried CSS, FirePath, etc., but with no luck. More interestingly, after accepting the PDF on this page (ticking a checkbox and clicking Continue), the same PDF opens again on the next page with an option to e-sign: the user clicks to e-sign the document (within the PDF), and the user name is displayed in the signature area of the PDF. We have a test e-sign verification created for us.

    Any suggestions?

    Thanks,

    Kannan V

  9. Kannan,

    It sounds like it is not exactly a PDF. It seems to be an iframe (like "read sample" at http://www.flipkart.com/my-journey-transforming-dreams-into-actions/p/itmdmzw9yszr94r5?pid=9788129124913&ref=1c9b59ba-12ea-470b-8bb2-aa7f6c1d15ea ).
    If it is a PDF, are you able to see the exact location of the PDF in the HTML source? If yes, your problem is solved.

  10. Thanks for your response. I spoke to the dev team. It is not an iframe. They are just creating an object, taking the data submitted in the earlier pages, and showing it in that format. It is saved as a PDF when you click Save in the container. I can share a screenshot of that page if you can share your email.

  11. You can reach me at vamshikurra@gmail.com

  12. http://www.java2s.com/Code/Jar/f/Downloadfontbox182jar.htm

    I faced the same issue. I downloaded FontBox from this website and now it is working fine. Thanks!

  13. Hello Vamsi,

    I just checked your code on an application that I am automating and guess what? It works perfectly. I don't have much experience working with TestNG, but it looks like a very useful tool. Thanks for sharing, and regards from Mexico!

  14. Thank you Vamshi..

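On the "Could not push back 98877 bytes" error reported in comment 5: the message itself suggests raising the system property org.apache.pdfbox.baseParser.pushBackSize. A minimal sketch of doing that from Java follows. The class PushBackSetting is a hypothetical helper, the value 200000 is an arbitrary illustration (anything comfortably above the 98877 bytes in the message), and it is an assumption that the property has to be set before the PDFParser is created.

```java
// Hypothetical helper: sets the push-back buffer size named in the
// PDFBox error message from comment 5.
class PushBackSetting {

  // Property name copied from the error message.
  static final String PUSH_BACK_PROPERTY = "org.apache.pdfbox.baseParser.pushBackSize";

  // Store the requested buffer size (in bytes) as a JVM system property.
  static void setPushBackSize(int bytes) {
    if (bytes <= 0) {
      throw new IllegalArgumentException("Buffer size must be positive: " + bytes);
    }
    System.setProperty(PUSH_BACK_PROPERTY, Integer.toString(bytes));
  }

  public static void main(String[] args) {
    // Illustrative value only; assumed to be needed before the parser is constructed.
    setPushBackSize(200000);
    System.out.println(PUSH_BACK_PROPERTY + "=" + System.getProperty(PUSH_BACK_PROPERTY));
  }
}
```

The same property can also be passed on the command line as a -D JVM argument instead of being set in code.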