𝐕𝐚𝐦𝐬𝐡𝐢 𝐊𝐮𝐫𝐫𝐚

Thursday, May 31

WebDriver (Selenium 2): Extract text from a PDF file using Java

Verifying PDF content is also part of testing, but WebDriver (Selenium 2) has no direct methods for it.

To extract PDF content, we can use the Apache PDFBox API.

Download the PDFBox jar files, including the FontBox jar that PDFBox 1.x depends on, and add them to your Eclipse class path. Then you are ready to extract text from a PDF file :)
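A note on the class path: if one of the required jars is left out, the test typically fails at run time with a NoClassDefFoundError naming the missing class (for example, a class under org.apache.fontbox). As a rough sketch, the check below reports whether the expected classes can be loaded before you run the test. ClassPathCheck and isOnClassPath are hypothetical names used for illustration; they are not part of PDFBox or Selenium.

```java
/** Hypothetical helper (not part of PDFBox): reports whether a class can be loaded. */
final class ClassPathCheck {

    /** Returns true if the named class is loadable from the current class path. */
    static boolean isOnClassPath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClassPathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.apache.pdfbox.pdfparser.PDFParser", // shipped in the pdfbox jar
            "org.apache.fontbox.afm.AFMParser"       // shipped in the fontbox jar
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClassPath(name) ? "found" : "MISSING"));
        }
    }
}
```

Running it from the same Eclipse project as the test should show at a glance which jar, if any, still needs to be added.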

Here is a sample script that extracts the text of the following PDF file:
http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.Reporter;
import org.testng.annotations.AfterTest;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

// Written against PDFBox 1.8.x. In PDFBox 2.x, PDFTextStripper lives in
// org.apache.pdfbox.text and PDDocument.load(InputStream) is the usual way to load a document.
public class ReadPdfFile {

  WebDriver driver;

  @BeforeTest
  public void setUpDriver() {
    driver = new FirefoxDriver();
    Reporter.log("Firefox driver started");
  }

  @Test
  public void start() throws IOException {
    driver.get("http://www.votigo.com/pdf/corp/CASE_STUDY_EarthBox.pdf");

    // Open the URL the browser ended up on and read the PDF bytes directly.
    URL url = new URL(driver.getCurrentUrl());
    InputStream fileToParse = new BufferedInputStream(url.openStream());
    PDDocument document = null;
    try {
      // parse() reads the stream and populates the COSDocument object,
      // which is the in-memory representation of the PDF document.
      PDFParser parser = new PDFParser(fileToParse);
      parser.parse();

      // getPDDocument() returns the PD document that was parsed.
      // It must be closed when you are done with it to release resources.
      document = parser.getPDDocument();

      // PDFTextStripper takes a PDF document and strips out all of the text,
      // ignoring the formatting.
      String output = new PDFTextStripper().getText(document);
      System.out.println(output);
    } finally {
      if (document != null) {
        document.close();
      }
      fileToParse.close();
    }
  }

  @AfterTest
  public void tearDown() {
    // Close the browser once the test has finished.
    driver.quit();
  }

}

Here is the output of the above program:

EarthBox a Day Giveaway 
Objectives 
EarthBox wanted to engage their Facebook 
audience with an Earth Day promotion that would 
also increase their Facebook likes. They needed a 
simple solution that would allow them to create a 
sweepstakes application themselves. 
 
 
Solution 
EarthBox utilized the Votigo 
platform to create a like-
gated sweepstakes. Utilizing a 
theme and uploading a custom graphic they 
were able to create a branded promotion. 
 
 
Details 
• 1 prize awarded each day for the entire Month of April  
• A grand prize given away on Earth Day  
• Daily winner announcements on Facebook 
• Promoted through email newsletter blast  
 
Results (4 weeks) 
• 6,550 entries 
 
Facebook
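
The script prints the extracted text but does not check it. To turn the extraction into an actual verification, one option is to normalize whitespace first, because PDFTextStripper breaks lines wherever the PDF layout does, as the output above shows. The sketch below uses only the JDK. PdfTextCheck, normalize, and containsText are hypothetical names chosen for illustration, and the sample string is copied from the output above.

```java
/** Hypothetical helper for checking extracted PDF text; not part of PDFBox or TestNG. */
final class PdfTextCheck {

    /** Collapses every run of whitespace (including line breaks) into a single space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns true if the expected phrase occurs in the extracted text, ignoring line breaks. */
    static boolean containsText(String extracted, String expected) {
        return normalize(extracted).contains(normalize(expected));
    }

    public static void main(String[] args) {
        String extracted = "EarthBox wanted to engage their Facebook \n"
                + "audience with an Earth Day promotion that would \n"
                + "also increase their Facebook likes.";
        // The phrase spans a line break in the raw output but still matches.
        System.out.println(containsText(extracted, "engage their Facebook audience"));
    }
}
```

Inside the TestNG test, the result could be passed to Assert.assertTrue so that a missing phrase fails the test. Words that the PDF hyphenates across lines (such as "like-" / "gated" in the output above) are not rejoined by this sketch and would need separate handling.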

15 comments:

Shubham Agrawal, Friday, April 26, 2013
Hi Vamshi,

This is really good code, but in my case it fails with the following error:
***********************************************************

FAILED: start

java.lang.NoClassDefFoundError: org/apache/fontbox/afm/AFMParser

************************************************************

I might be doing something wrong, but can you please suggest a solution?

Thanks,
Shubham
Vamshi Kurra, Friday, April 26, 2013
Hi Shubham,

Sorry I couldn't be of much help. I tried, but I could not find what the actual issue was. (It might be the jars or the PDF you are trying to read.)
Azhar Firdaus, Friday, April 26, 2013
It works... thanks
Vamshi Kurra, Friday, April 26, 2013
I ran the same code and it works fine for me.

Just want to know: did you make any changes before running the script?
Sai, Friday, April 26, 2013
Hi, thanks a lot for this wonderful code.

When I run it I get the error below.
Can you please help me with this?

Sep 27, 2012 9:29:38 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 98877 is wrong. Fall back to reading stream until 'endstream'.
FAILED: start
org.apache.pdfbox.exceptions.WrappedIOException: Could not push back 98877 bytes in order to reparse stream. Try increasing push back buffer using system property org.apache.pdfbox.baseParser.pushBackSize
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:546)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:566)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
at com.test.PDF.PDF_Reader.start(PDF_Reader.java:38)
pavan ahmedabad, Wednesday, October 09, 2013
Hi,

Your code works fine, but your download link is not working, so please use the link below so that the program will run (pdfbox-app-1.8.2.jar).

http://www.bizdirusa.com/mirrors/apache/pdfbox/1.8.2/pdfbox-1.8.2-src.zip
Madiraju Chaitanya, Thursday, November 28, 2013
Hi Vamsi Kurra Ji,
Wonderful post. Thanks for sharing this with us. Please keep posting articles regularly and share your knowledge and experience with us on Selenium WebDriver and other topics. Thank you.
Vamshi Kurra, Thursday, November 28, 2013
thanks :)
Kannan, Friday, December 06, 2013
Hi Vamshi,

The application that I am automating has a set of pages where the user provides certain information. All of this information is shown on the next page as a PDF embedded within a container / frame. Each piece of text in the PDF is captured as a separate element using Firebug, but I couldn't identify the container itself. I tried CSS, FirePath, etc., but with no luck. More interesting still: after accepting the PDF on this page (checking a checkbox and clicking Continue), the same PDF opens again on the next page with an option to e-sign, where the user clicks to e-sign the document (within the PDF) and the user name is displayed in the signature area of the PDF. We have a test e-sign verification created for us.

Any suggestions?

Thanks,

Kannan V
Vamshi Kurra, Friday, December 06, 2013
Kannan,

It sounds like it is not exactly a PDF. It seems to be an iframe (like "read sample" at http://www.flipkart.com/my-journey-transforming-dreams-into-actions/p/itmdmzw9yszr94r5?pid=9788129124913&ref=1c9b59ba-12ea-470b-8bb2-aa7f6c1d15ea).
If it is a PDF, are you able to see the exact location of the PDF in the HTML source? If yes, your problem is solved.
Kannan, Friday, December 06, 2013
Thanks for your response. I spoke to the dev team. It is not an iframe. They are just creating an object, taking the data submitted in the earlier pages, and showing it in that format. It is saved as a PDF when you click Save in the container. I can share a screenshot of that page if you can share your email.
Vamshi Kurra, Friday, December 06, 2013
You can reach me at vamshikurra@gmail.com
mani, Sunday, January 26, 2014
http://www.java2s.com/Code/Jar/f/Downloadfontbox182jar.htm

I faced the same issue. I downloaded FontBox from this website and now it is working fine. Thanks!
Victor, Wednesday, May 14, 2014
Hello Vamsi,

I just checked your code on an application that I am automating and guess what? It works perfectly. I don't have much experience working with TestNG, but it looks like a very useful tool. Thanks for sharing, and regards from Mexico!
Ankit Reddy, Tuesday, May 27, 2014
Thank you Vamshi..