## The Company
Scrapfly is a company providing solutions around web scraping. It aims to centralize all scraping in one place and offer a simple, robust service to our clients, who acquire and collect data on the internet. We are currently a team of 6, which is a delightful scale for owning projects and collaborating.
## Role
You will lead an internally developed product that lets our customers plug their proxy provider into our Data Saver Proxy (DSP) to save bandwidth (distributed cache, internal blocklist of trackers and telemetry, stubbing of media resources) and gain reliability (internal retries, backoff), all in a simple plug-and-play product with a dashboard exposing all metrics. In short: Client —→ Scrapfly MITM Proxy —→ Backconnect (client's proxy provider)
This is a purely technical product. It supports HTTP, HTTPS, and SOCKS5 as proxy protocols (HTTP/2 and QUIC support is planned), and it is a MITM proxy (handling HTTP(S) 1/2 and WebSocket) with the particularity of replicating the TLS/HTTP2/header fingerprint, or fixing it based on options. Today, every other MITM proxy mangles the fingerprint, and most websites will block you because you look like a bot. The product is configured and authenticated via username/password.
We also maintain some internal testing tools, such as fingerprinters: (link removed), (link removed)
What we expect:
– A deep understanding of the TLS protocol; we maintain our own fork of crypto/tls, based on:
– https://github.com/golang/go/tree/master/src/crypto/tls
– https://github.com/cloudflare/go/tree/cf/src/crypto
– https://github.com/cloudflare/circl
– https://github.com/refraction-networking/utls
– A deep understanding of the HTTP and HTTP/2 protocols. For fingerprinting, we operate at the frame level. We also maintain our own fork of the standard Go net/http package.
– DNS protocols and DoH (DNS over HTTPS)
– TCP/UDP protocols
– A deep understanding of and experience with streams/connections and distributed systems
– Being brilliant at caching games and spotting areas to improve speed and reliability (DNS cache, TLS resumption, connection cache, and so on!)
– Being curious about protocol updates and RFCs; the TLS protocol evolves a lot, and Chrome is always moving forward, so you must follow its recent updates and support new extensions to match its fingerprint
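To make the fingerprinting requirement concrete: schemes like JA3 reduce a ClientHello to a string of its TLS version, cipher suites, extensions, elliptic curves, and point formats, then hash it; sites block hashes that don't match real browsers. The sketch below computes a JA3-style digest from hard-coded field values — the values only loosely resemble a Chrome ClientHello and are not taken from a live handshake.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"strings"
)

// ja3 builds a JA3-style fingerprint: the five ClientHello fields are
// joined with commas (values within a field dash-joined), then md5-hashed.
func ja3(version int, ciphers, exts, curves, pointFmts []int) string {
	join := func(xs []int) string {
		parts := make([]string, len(xs))
		for i, x := range xs {
			parts[i] = fmt.Sprint(x)
		}
		return strings.Join(parts, "-")
	}
	s := fmt.Sprintf("%d,%s,%s,%s,%s",
		version, join(ciphers), join(exts), join(curves), join(pointFmts))
	sum := md5.Sum([]byte(s))
	return hex.EncodeToString(sum[:])
}

func main() {
	// Hypothetical field values, loosely Chrome-like.
	fp := ja3(771, // TLS 1.2 on the wire (0x0303)
		[]int{4865, 4866, 4867, 49195}, // cipher suites
		[]int{0, 23, 65281, 10, 11},    // extensions
		[]int{29, 23, 24},              // supported curves
		[]int{0})                       // point formats
	fmt.Println(fp) // 32-hex-char digest; change one extension and it differs
}
```

A MITM proxy that terminates TLS and re-dials with Go's default crypto/tls emits a very different field set than the browser did, which is why forks like utls exist: they let you replay the original ClientHello shape.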
You will work directly with our CTO, who is also part of this product and coordinates the dashboard and proxy parts. He has strong knowledge of all the previous points (TLS, HTTP, fingerprinting), so you will not be left alone when you have a question or need help.
Other technologies involved: ClickHouse (metrics), MariaDB (customer auth, proxy config), MongoDB (cache rules/blocklist rulesets, exception rules, etc.), and Redis. You do not need to be a beast with every database mentioned; basic knowledge and the ability to learn when required are enough, since we simply use them. The project runs on Kubernetes; the local environment is a k3d cluster that reproduces production, with DX for hot reload on change and file sharing with your IDE, and we have internal docs. The project runs only on Linux x86, so a remote workspace is possible.
Your missions are:
– Keep our proxy up to date and maintain it / fix bugs.
– Create a distributed proxy federator to enable the connection-pooling feature and route a given source IP to the same proxy.
– Improve the proxy; spot and suggest areas of improvement.
– Help the CTO schedule, organize, and prioritize future tasks.
We are fully async: no "calls". We work on Slack and organize everything through Notion. All code is hosted in our GitHub org.
Please put the word "spider" at the top of your message, along with a link to your GitHub profile.
Posted On: February 01, 2024 06:12 UTC
Category: Back-End Development
Skills: Golang, Web Proxy, Network Administration, HTTP, TCP/IP, TCP, DNS, TLS 1.2
Country: United States
