Think Good

MSHTML in ASP.NET / VB.NET

My basic need is to read a remote web page and parse selected tags from it. This is the technique ad agencies mainly use to harvest email IDs, but I am not interested in email IDs.

I already knew how to do it with a Java Servlet, PHP, VB, C++, or even a VB.NET Windows application.
I can also do it with AJAX, but XMLHttpRequest expects well-formed HTML (say, XHTML).
I want to call MSHTML from ASP.NET with VB.NET as the backend.
Problems:

  • MSHTML expects a browser host such as IBrowser, AxBrowser, or URL monikers to load the complete remote HTML into an mshtml.HTMLDocument object (see an mshtml example in C#). I don't want that, because my interface is not Windows Forms.
  • HTMLDocument.open(...) will not work without a browser interface (see what MSDN says). So you need to wait and then read the fully loaded HTML data into the HTMLDocument.
  • Time. I mean my free time to excavate. :-(


If you are interested, here is a solution in VB.NET which you can use with ASP.NET (.aspx.vb).

Make sure you add a reference to Microsoft.mshtml from the .NET components list and add "Imports System.Runtime.InteropServices".

'We will use HTMLDocument to open and load the remote web page into IHTMLDocument2.
'We can't reuse the same HTMLDocument because it is needed for persistence (IPersistStream).
'We also can't use the IHTMLDocument2 object because it does not have the DOM interface features enabled, so we will use IHTMLDocument3.
Dim url As String = "http://java.sun.com"
Dim objMSHTML As New mshtml.HTMLDocument
Dim objMSHTML2 As mshtml.IHTMLDocument2
Dim objMSHTML3 As mshtml.IHTMLDocument3
Dim x As Integer = 10 'a dummy variable

Dim objIPS As IPersistStreamInit 'here is the whole trick
objIPS = DirectCast(objMSHTML, IPersistStreamInit)
objIPS.InitNew() 'you have to do it, if not you will always have readyState as "loading"
objMSHTML2 = objMSHTML.createDocumentFromUrl(url, vbNullString)
Do Until objMSHTML2.readyState = "complete"
  x = x + 1
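  System.Windows.Forms.Application.DoEvents() 'Suggested by John. Fully qualified because inside an ASP.NET page "Application" refers to HttpApplicationState; needs a reference to System.Windows.Forms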
Loop
objMSHTML3 = DirectCast(objMSHTML2, mshtml.IHTMLDocument3)
Now you can start using DOM methods such as getElementById(), getElementsByTagName(..), etc.
For example, objMSHTML3.getElementsByTagName("table") returns an IHTMLElementCollection of the table elements in the page.
Lots of examples are available on the internet showing how to traverse a table object using IHTMLElementCollection.
If you still need a sample, let me know.
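
For readers who want a starting point anyway, here is a minimal sketch of such a traversal. It assumes the document has already been loaded as shown above; PrintTables is a hypothetical helper name chosen for illustration, and the output goes to Debug.WriteLine only to keep the example short. It has not been verified under ASP.NET, where some MSHTML casts are reported to fail.

```vbnet
' Sketch only. PrintTables is a hypothetical helper, not part of MSHTML.
' Assumes doc is the fully loaded IHTMLDocument3 (objMSHTML3 in the code above).
Sub PrintTables(ByVal doc As mshtml.IHTMLDocument3)
    ' One entry per <table> element in the page
    Dim tables As mshtml.IHTMLElementCollection = doc.getElementsByTagName("table")
    For Each tbl As mshtml.IHTMLTable In tables
        ' rows is itself an IHTMLElementCollection of IHTMLTableRow
        For Each row As mshtml.IHTMLTableRow In tbl.rows
            ' cells contains both the <td> and the <th> elements of the row
            For Each cell As mshtml.IHTMLElement In row.cells
                System.Diagnostics.Debug.WriteLine(cell.tagName & ": " & cell.innerText)
            Next
        Next
    Next
End Sub
```

Calling PrintTables(objMSHTML3) after the Do loop has finished would list every cell of every table, header cells included.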

That's all? No! If you compile this code, you will get an error saying IPersistStreamInit is not defined. To fix this, just add the interface definition below to your code on the same page.


    Public Enum HRESULT
        S_OK = 0
        S_FALSE = 1
        E_NOTIMPL = &H80004001
        E_INVALIDARG = &H80070057
        E_NOINTERFACE = &H80004002
        E_FAIL = &H80004005
        E_UNEXPECTED = &H8000FFFF
    End Enum

    <ComVisible(True), ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), _
        InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
    Public Interface IPersistStreamInit : Inherits IPersist
        Shadows Sub GetClassID(ByRef pClassID As Guid)
        <PreserveSig()> Function IsDirty() As Integer
        <PreserveSig()> Function Load(ByVal pstm As UCOMIStream) As HRESULT
        <PreserveSig()> Function Save(ByVal pstm As UCOMIStream, _
            <MarshalAs(UnmanagedType.Bool)> ByVal fClearDirty As Boolean) As HRESULT
        <PreserveSig()> Function GetSizeMax(<InAttribute(), Out(), _
        MarshalAs(UnmanagedType.U8)> ByRef pcbSize As Long) As HRESULT
        <PreserveSig()> Function InitNew() As HRESULT
    End Interface

    <ComVisible(True), ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), _
        InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
    Public Interface IPersist
        Sub GetClassID(ByRef pClassID As Guid)
    End Interface

    Declare Function CreateStreamOnHGlobal Lib "ole32" (ByVal hGlobal As IntPtr, ByVal fDeleteOnRelease As Boolean, _
        ByRef ppstm As UCOMIStream) As Integer 'the HRESULT return value is 32-bit, so Integer rather than Long
' Please note that I copied the above IPersistStream definition from sp!ke. I owe him a drink ;).

That's it. All done. Compile it and start using objMSHTML3 to parse the HTML document.

If you happen to know any other way, please let me know.


I am trying to use the code in my VB.NET 2005 program. readyState stays at "loading" and never reaches "complete". This is exactly what I have been looking for. If you have any ideas, please let me know. Here is my code:

    Imports System.Data.SqlClient
    Imports System.Data
    Imports System
    Imports System.Runtime.InteropServices

    Public Class Class1
        Public Enum HRESULT
            S_OK = 0
            S_FALSE = 1
            E_NOTIMPL = &H80004001
            E_INVALIDARG = &H80070057
            E_NOINTERFACE = &H80004002
            E_FAIL = &H80004005
            E_UNEXPECTED = &H8000FFFF
        End Enum

        <ComVisible(True), ComImport(), Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"), _
            InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
        Public Interface IPersistStreamInit : Inherits IPersist
            Shadows Sub GetClassID(ByRef pClassID As Guid)
            <PreserveSig()> Function IsDirty() As Integer
            <PreserveSig()> Function Load(ByVal pstm As System.Runtime.InteropServices.ComTypes.IStream) As HRESULT
            <PreserveSig()> Function Save(ByVal pstm As System.Runtime.InteropServices.ComTypes.IStream, _
                <MarshalAs(UnmanagedType.Bool)> ByVal fClearDirty As Boolean) As HRESULT
            <PreserveSig()> Function GetSizeMax(<InAttribute(), Out(), _
                MarshalAs(UnmanagedType.U8)> ByRef pcbSize As Long) As HRESULT
            <PreserveSig()> Function InitNew() As HRESULT
        End Interface

        <ComVisible(True), ComImport(), Guid("0000010c-0000-0000-C000-000000000046"), _
            InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)> _
        Public Interface IPersist
            Sub GetClassID(ByRef pClassID As Guid)
        End Interface

        Declare Function CreateStreamOnHGlobal Lib "ole32" (ByVal hGlobal As IntPtr, ByVal fDeleteOnRelease As Boolean, _
            ByRef ppstm As System.Runtime.InteropServices.ComTypes.IStream) As Long
        ' Please note that i copied above IPersistStream definition from sp!ke. I owe him a drink ;).

        Private Function FindXMLPageName(ByVal strHtml As String) As String
            Try
                'We will use HTMLDocument to open and load remote webpage in to IHTMLDocument2
                'we can't use the same HTMLDocument as it is needed for persistance(IPersistStream)
                'we also can't use IHTMLDocument2 object as it will not have DOM interface faetures enabled. we will use IHTMLDocument3.
                Dim url As String = "http://java.sun.com"
                Dim objMSHTML As New mshtml.HTMLDocument
                Dim objMSHTML2 As mshtml.IHTMLDocument2
                Dim objMSHTML3 As mshtml.IHTMLDocument3
                Dim x As Integer = 10 'a dummy variable
                Dim objIPS As IPersistStreamInit 'here is the whole trick
                objIPS = DirectCast(objMSHTML, IPersistStreamInit)
                objIPS.InitNew() 'you have to do it, if not you will always have readyState as "loading"
                objMSHTML2 = objMSHTML.createDocumentFromUrl(url, vbNullString)
                Do Until objMSHTML2.readyState = "complete"
                    x = x + 1
                Loop
                objMSHTML3 = DirectCast(objMSHTML2, mshtml.IHTMLDocument3)
            Catch ex As Exception
            End Try
        End Function
    End Class

Hi Marilyn, I am able to parse the remote site. I am using Visual Studio .NET 2003. You can have a look at the screenshot (http://radio.javaranch.com/balajidl/images/loadView_MSHTML.GIF), which loads and displays the innerHTML of the body tag of http://java.sun.com. If you want, I can also send the .aspx and .aspx.vb code via email. Regards, Balaji
Hi Balajidl, I was wondering if you knew why many members of HTMLDocument3, like frames, throw a System.InvalidCastException?
Thanks Balaji! I only had to add an Application.DoEvents call (VB.NET 2003) to the Do loop in order for the load to complete.
Thanks John, I added your suggestion as well.
Balaji
Thanks, after 3 hours of "googling" I found exactly what I needed.
Hi Balajidl,
Why do I sometimes get the error below?
And even when I don't get the error, why can't I get the remote page?
Me.lbl_Msg.Text = objMSHTML3.body.outerHTML
JIT Debugging failed with the following error: Access is denied
JIT Debugging was initiated by the user account 'BCname\ASPNET'.
Check the documentation index for 'just-in-time debugging, errors' for more information
Hi Dhamen,
I could not fully understand your bug. Could you please provide some more details on what you are trying to do and where it fails?
Are you able to use the above sample code without any error? To see the sample screenshot, click here.
Hi Balajidl, thank you for your response

This is the error I am getting when I run the code above.

JIT Debugging failed with the following error: Access is denied
JIT Debugging was initiated by the user account 'BCname\ASPNET'.
Check the documentation index for 'just-in-time debugging, errors' for more information

Also when I click OK on the above error message dialog, it asks me to refresh the page, when I refresh the page, then it loads the remote web page successfully.

I am running this code on my local IIS server, and using VS 2003.

Thank you.
Hi Dhamen
This bug occurs only on the very first call. I believe it is caused by the .NET Framework and system settings, not by the code itself.
Sorry, Balajidl
This is what I am trying to do.

I was trying to get a remote web page into my website and manipulate it. Also I want you to know that I am able to do same thing in windows forms application successfully.
Hi Balaji, I have resolved the error message: I had to add the ASPNET user to the debugging users group. Now I am not getting the error message.

But I am also not getting the remote page. Maybe that is because I am behind a proxy server and firewall?

I think this code will work if I move it to my website hosting company, because they are not behind a proxy server and firewall.

I also tried it on a local web page on my computer and it works fine. Thank you for this great code.

regards,
Dhamen
KSA
Hi Balaji,

BTW, one more thing: the following line does not work for me. I have to comment it out.

Application.DoEvents() 'Suggested by John
The error is: 'DoEvents' is not a member of 'System.Web.HttpApplicationState'
Hi Dhamen, I am glad you got it working. Thanks for sharing your experience.
Regards
Balaji
Did you make more progress on this? I'm successfully using MSHTML, but I see a weird behavior. If you start the ASP.NET (1.1) application, use MSHTML to parse HTML, and then do a DLL update (like recompiling the code), the existing aspnet_wp.exe crashes, I get the popup (JIT Debugging failed...), and a new instance of aspnet_wp.exe is created. Refreshing the page works this time. It looks like the old aspnet_wp.exe is holding some resource that causes a side-by-side run to fail. Then it crashes and all is good again.
Hi Marcelo, Thanks for passing by. Sorry I haven't got it solved yet.
Hello, I have a problem too. I'm using mshtml under ASP.NET and it works fine. The problem comes when I try to get the location:
Dim sLocation As String = objHtml2.location.toString()
objHtml2 is an IHTMLDocument2. At this point I get the error "System.InvalidCastException: No such interface supported", but it works under Windows Forms. Please help.
Hi Shal, document.location is not a string. It is an object. The only reason that you can use it as a string in JavaScript is because the object has a default property (href). What you need to do is to use document.location.href instead. BTW, I decided to abandon the MSHTML since I could not get rid of the crash on ASP.NET reload. It was unacceptable, so I built my own HTML parser. Cheers, Marcelo http://bravenewword.typepad.com http://sampa.com
Hi,
Dim iLoc As mshtml.IHTMLLocation
--- this line generates an error: System.InvalidCastException: No such interface supported
iLoc = oHtml2.location
---
sLocation = oLoc.href
This happens only under ASP.NET; under a Windows Forms application it works fine.
Isn't using COM interop a performance hit? AFAIK, mshtml is not very thread friendly!
I have it working, but when I execute the URL, it doesn't pop up in a new window. How am I supposed to use this script? Please provide an example. Seems like it would be useful, if I can open a browser window (IE) with the url, and populate some fields.
Hi Nick
I could not fully understand your scenario. Could you please explain again what exactly you want to achieve?
Note:
The code runs in the page load event of the .aspx page, so it won't open a new popup window.
Regards
Balaji
OK, thanks for clearing it up, page load event makes sense.
Thanks for your quick response. I am trying to run this code in a VB.NET application. The intent is to create an application that opens a new web page window and fills out the login form programmatically. Any help would be appreciated.
Hello Balaji, I'm trying to do the same thing using C#, but I can't connect MSHTML.dll to my web page. Can you write step-by-step instructions on how to connect MSHTML to .aspx web pages?
Sorry, I'm so lame... I forgot to type using mshtml;
Umm, I still cannot get the remote page. I receive an RPC_E_SERVERFAULT error when using
IHTMLDocument4 objMSHTML = new HTMLDocumentClass();

IHTMLDocument2 objMSHTML2;
objMSHTML2 = objMSHTML.createDocumentFromUrl(url, options);
Hello everybody, I would be very grateful for your support. The problem is this: I am writing a class that instantiates an mshtml.HTMLDocument object. Once the page is loaded, I set a value in one of its text fields, and then I want to click the submit button and read the response page. So far I have the page captured in objMSHTML3:
objMSHTML3 = DirectCast(objMSHTML2, mshtml.IHTMLDocument3)
Then I walk through it and put the value I want into the text field. Now, how can I click the submit button and retrieve the response page? Please help me, and sorry for my bad English; I am from Peru.
Hi Balaji, I read in your blog and this discussion . Can you send the .aspx and .aspx.vb code via email to me. Thank you very much. My email : milk_flower@hotmail.com. Best Regards. Please help me and sorry for my bad English, I am from Vietnam.
Hello, I don't know what I am doing wrong, but it says that IPersistStreamInit is not defined. Can anyone help, please?
Having a bit of a nightmare converting this to C# - don't suppose you have code that works?
Hello Balaji, this is nice, thanks. But I do not fully understand why you need to use IHTMLDocument2 and 3. Do those objects change in any way when you assign them? And for everyone, here is a C# version:
    [ComVisible(true), ComImport(),
     Guid("7FD52380-4E07-101B-AE2D-08002B2EC713"),
     InterfaceTypeAttribute(ComInterfaceType.InterfaceIsIUnknown)]
    public interface IPersistStreamInit
    {
        void GetClassID([In, Out] ref Guid pClassID);
        [return: MarshalAs(UnmanagedType.I4)][PreserveSig]
        int IsDirty();
        [return: MarshalAs(UnmanagedType.I4)][PreserveSig]
        int Load([In] UCOMIStream pstm);
        [return: MarshalAs(UnmanagedType.I4)][PreserveSig]
        int Save([In] UCOMIStream pstm,
            [In, MarshalAs(UnmanagedType.Bool)] bool fClearDirty);
        // out parameter: the native method writes the size through a pointer
        void GetSizeMax(out long pcbSize);
        [return: MarshalAs(UnmanagedType.I4)][PreserveSig]
        int InitNew();
    }

    private void Button1_Click(object sender, System.EventArgs e)
    {
        mshtml.HTMLDocumentClass htmldoc;
        htmldoc = new mshtml.HTMLDocumentClass();
        mshtml.IHTMLDocument2 htmldoc2;
        mshtml.IHTMLDocument3 htmldoc3;

        IPersistStreamInit ips = (IPersistStreamInit)htmldoc;
        ips.InitNew();
        htmldoc2 = (mshtml.IHTMLDocument2)htmldoc.createDocumentFromUrl("http://www.google.com", null);
        // busy-wait until the document has finished loading
        while (htmldoc2.readyState != "complete") ;
        htmldoc3 = (mshtml.IHTMLDocument3)htmldoc2;

        Literal ltr = new Literal();
        ltr.Text = Server.HtmlEncode(htmldoc3.documentElement.innerHTML);
        Panel1.Controls.Add(ltr);
    }
Hi Martin
Thanks for the C# version of the problem.
BTW, I used IHTMLDocument2 and 3 because of the persistence and "method not found" issues. I have noted them in the comments of the sample code.
Regards Balaji
Thanks for sharing. The VB.NET code works fine, but I cannot make the converted C# work. :-( DoEvents = ? (in C#)
Hello Hongtao,
In fact there is no DoEvents in the C# version. I tried this version myself and regularly had problems with it. I am open to any solution, but I don't think this is a language issue (C#). Sometimes it worked well, but after an IIS restart it unfortunately stopped working. Sometimes parsing was also very slow. I finally decided to use a third-party HTML parser (MILHTMLParser).
Congratulations!!! Very nice code!
Balaji, Your blog is excellent. Your solution to loading into a mshtml object was exactly what I needed. Now I have another problem and I hope that you might know the solution. I have a small bit of HTML saved in a string variable that I want to load into IHTMLDocument2. Is there a way to do this?
I am using this code in a VB.NET Windows application to parse a web page. It works fine the first time, but on subsequent calls it doesn't reload the page. It reads the same data regardless of whether the web page has changed. Any ideas?
I figured it out. Add line objMSHTML.Clear() after you are done with the object.
Hi Balaji...very nice code...however I have a question.. Is it possible to reduce the round trips to the HTML document by both populating the objMSHTML2 (as you have done in the example) and a WebBrowser. I tried populating 'your' result into a WebBrowser and of course the other way (dropping the IPS code) which I know doesn't seem to work (hence your code). In summary, I am trying to display the WebPage but ALSO use the DOM to extract the information from the web page...ANy help appreciated...
Hi Balaji, can you please post sample code showing how to traverse a table object using IHTMLElementCollection? My table contains <TH> as well. My mail ID is sathish_digitally@yahoo.co.in
Hi Balaji, I'm programming with mshtml, but I have a problem with messages that are thrown by some websites. These messages are security alerts, login boxes, etc. When one of these is thrown, the application stops. Do you know how I can control this?
Hi Juan, you can make use of "User32.dll" in your code to find the IE-related pop-ups and close them. You can find the code if you google it.
Hi all, thanks for the code. The mshtml object was exactly what I needed. But I'm facing a problem when I try to use the object. I get the error "Access is denied. (Exception from HRESULT: 0x80070005 (E_ACCESSDENIED))" when I try to do something like "element.setAttribute("xpath", path)". Thanks
Hi Balaji, I know that this is not the correct place to ask this question. From a WinForms (C#) application I am launching IE using Process.Start and opening a web URL in that IE instance, but that URL raises a popup alert and I don't know how to get hold of that alert. If you know the answer, please let me know. Thanks, Srinivasa Rao S
Hi, I'm new to the world of ASP.NET and C#. I'm working on MSHTML in C#. The following is my code:

    private void btnCheck_Click(object sender, EventArgs e)
    {
        int intI, intJ;
        //string strImagePath = "";
        ArrayList objArrayList = new ArrayList();
        try
        {
            if (objWebBrowser.ReadyState != WebBrowserReadyState.Complete)
            {
                Console.WriteLine(objWebBrowser.ReadyState);
                Application.DoEvents();
            }
            HtmlDocument objHtmlDocument = objWebBrowser.Document;
            Console.WriteLine("******************************The Domain is:" + objHtmlDocument.Domain);
            MSHTML.IHTMLDocument2 objIHTMLDocument = (MSHTML.IHTMLDocument2)objWebBrowser.Document.DomDocument;
            if (objIHTMLDocument.frames.length > 0)
            {
                MSHTML.FramesCollection objFramesCollection = objIHTMLDocument.frames;
                for (intI = 0; intI < objFramesCollection.length; intI++)
                {
                    object idx = (object)intI;
                    MSHTML.HTMLWindow2 objHTMLWindow2 = (MSHTML.HTMLWindow2)objFramesCollection.item(ref idx);
                    MSHTML.HTMLDocument objHTMLDocument = (MSHTML.HTMLDocument)objHTMLWindow2.document;
                    objArrayList.Add(objHTMLWindow2);
                    Console.WriteLine("-------------------");
                    Console.WriteLine(objHTMLWindow2.name);
                    Console.WriteLine("-------------------");
                    HtmlElementCollection objImageCollections = objHtmlDocument.GetElementsByTagName("object");
                    for (intJ = 0; intJ < objImageCollections.Count; intJ++)
                    {
                        strImagePath = objImageCollections[intJ].GetAttribute("src");
                        Console.WriteLine(strImagePath);
                    }
                }
                /* for (intI = 0; intI < objArrayList.Count; intI++)
                {
                    //objComponent = (Component)objArrayList[intI];
                    Console.WriteLine(objArrayList[intI].GetType());
                }*/
            }

I'm getting an exception at this line:

    MSHTML.HTMLDocument objHTMLDocument = (MSHTML.HTMLDocument)objHTMLWindow2.document;

The exception is:

    A first chance exception of type 'System.UnauthorizedAccessException' occurred in mscorlib.dll
       at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
       at MSHTML.DispHTMLWindow2.get_document()
       at WindowsFormsApplication1.Form1.btnCheck_Click(Object sender, EventArgs e) in E:\cSharp_workspace\Parser\Parser\Form1.cs:line 85

Can anyone help me out with this exception? My task is to collect images and Flash files from the various frames on any site, but this exception is holding me back. How should I go further?

