Fixing HTTP 403 Errors in Web Scraping

맨날치킨 2022. 12. 29. 20:05

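I am collecting problems that are frequently searched for and posted on Stack Overflow, along with problems I looked up during development and expect to look up again.

For each problem, I summarize two answers: the highest-scoring solution, which is usually the first one people check on Stack Overflow, and the most recently updated solution that has reached a minimum score, which may be more useful today.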

Problem HTTP error 403 in Python 3 Web Scraping

Problem description

I was trying to scrape a website for practice, but I kept getting HTTP Error 403 (does it think I'm a bot?).

Here is my code:

#import requests
import urllib.request
from bs4 import BeautifulSoup
#from urllib import urlopen
import re

webpage = urllib.request.urlopen('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1').read
findrows = re.compile('<tr class="- banding(?:On|Off)>(.*?)</tr>')
findlink = re.compile('<a href =">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.finall(findlink, webpate)

print(len(row_array))

iterator = []

The error I get is:

  File "C:\Python33\lib\urllib\request.py", line 160, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\lib\urllib\request.py", line 479, in open
    response = meth(req, response)
  File "C:\Python33\lib\urllib\request.py", line 591, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python33\lib\urllib\request.py", line 517, in error
    return self._call_chain(*args)
  File "C:\Python33\lib\urllib\request.py", line 451, in _call_chain
    result = func(*args)
  File "C:\Python33\lib\urllib\request.py", line 599, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Highest-scoring solution

This is probably because of mod_security or a similar server security feature that blocks known spider/bot user agents (urllib sends something like Python-urllib/3.3, which is easily detected). Try setting a known browser user agent with:

from urllib.request import Request, urlopen

req = Request(
    url='http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', 
    headers={'User-Agent': 'Mozilla/5.0'}
)
webpage = urlopen(req).read()

This works for me.
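
To see why the default client is easy to spot, the short sketch below inspects the headers without making any network request. It assumes only the standard library; the URL is a placeholder and is never contacted.

```python
from urllib.request import Request, build_opener

# The default opener carries a User-Agent of the form "Python-urllib/<major>.<minor>".
default_headers = dict(build_opener().addheaders)
print(default_headers["User-agent"])

# Headers set on a Request take precedence over the opener's defaults.
# Request stores header names capitalized, hence "User-agent" in the lookup.
req = Request("http://www.example.com/", headers={"User-Agent": "Mozilla/5.0"})
print(req.get_header("User-agent"))
```

The first line printed is what a server-side filter would see if no header were set; the second is the value sent when the Request carries its own User-Agent.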

By the way, your code is missing the () after .read in the urlopen line, but I think that's a typo.
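
The difference between .read and .read() can be shown offline. In this sketch an io.BytesIO object stands in for the response returned by urlopen, and the table-row markup is made up for illustration. It also shows a related point: read() returns bytes, so the body is decoded before a str regular expression is applied to it.

```python
import io
import re

# Stand-in for the object returned by urlopen(); the markup is hypothetical.
response = io.BytesIO(b'<tr class="banding"><td>row 1</td></tr>')

not_called = response.read   # a bound method, not the page content
body = response.read()       # the actual content, as bytes

# A str pattern needs str input, so decode the bytes first.
html = body.decode("utf-8")
rows = re.findall(r"<tr[^>]*>(.*?)</tr>", html)
print(callable(not_called), type(body).__name__, rows)
```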

Tip: since this is an exercise, choose a different, non-restrictive site. Maybe they are blocking urllib for some reason.
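
If a script should keep running when a site refuses it, the 403 can be caught explicitly. The following is a minimal sketch, not part of the original answer: the function name fetch and its opener parameter are hypothetical, with opener added only so the function can be tried without network access.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

BROWSER_HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch(url, opener=urlopen):
    """Return the page body as bytes, or None if the server answers 403.

    `opener` is a hypothetical injection point; it defaults to urlopen.
    """
    req = Request(url, headers=BROWSER_HEADERS)
    try:
        with opener(req) as response:
            return response.read()
    except HTTPError as err:
        if err.code == 403:
            print(f"403 Forbidden for {url}; the server rejected this client")
            return None
        raise  # any other HTTP error is left to the caller
```

Calling fetch(url) with a real URL performs the request. A None result means the server still refuses the client, in which case trying another site, as the tip suggests, may be the simplest option.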

Most recent solution

Adding a cookie to the request headers worked for me:

from urllib.request import Request, urlopen


def get_page_content(url, head):
    """Open the URL with the given request headers and return the response."""
    req = Request(url, headers=head)
    return urlopen(req)


url = 'https://example.com'
head = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    # 'identity' asks for an uncompressed body (the original used 'none');
    # urllib does not decompress responses on its own.
    'Accept-Encoding': 'identity',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
    'Referer': 'https://example.com',  # the original misspelled this as 'refere'
    'Cookie': """your cookie value ( you can get that from your web page) """
}

data = get_page_content(url, head).read()
print(data)
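
As an alternative to copying a cookie by hand, the standard library can store and resend cookies automatically. This is a different technique from the answer above, shown only as a sketch: it helps when the server sets its cookies through ordinary HTTP responses, and it may not help when cookies are set by JavaScript. The URL in the comments is a placeholder.

```python
from http.cookiejar import CookieJar
from urllib.request import HTTPCookieProcessor, build_opener

# Cookies set by earlier responses are stored in the jar and sent back
# automatically on later requests made through the same opener.
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# Usage (requires network access; the URL is a placeholder):
# opener.open("https://example.com")                 # first visit may set cookies
# data = opener.open("https://example.com").read()   # later visits send them back
```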

Source: https://stackoverflow.com/questions/16627227/problem-http-error-403-in-python-3-web-scraping
